SDSS Storage Server Technical Note
Dan Yocum
Fermi National Lab
Sloan Digital Sky Survey
November 9, 2001
updated May 11, 2002
updated Oct 10, 2002
INTRODUCTION
The Sloan Digital Sky Survey is an ambitious effort to map 1/4 of the visible sky at optical and very-near infrared wavelengths, and to take spectra of 1 million extra-galactic objects. The estimated amount of data that will be taken over the 5 year lifespan of the project is 15TB; however, the total amount of storage space required for object informational databases, corrected frames, and reduced spectra will be several times higher than this. The goal is to have all the data online and available to the collaborators at all times, and to provide all this within strict budget constraints. As of late 2001, a reasonably achievable goal is to purchase a sufficiently powered machine with over 1TB of storage space for less than $0.01/MB, or about $10,000USD.
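The price target can be checked with a few lines of arithmetic. The following Python sketch assumes decimal units (1TB = 1,000,000MB), which is the convention under which $0.01/MB works out to about $10,000 per terabyte; the function names are illustrative only.

```python
# Budget arithmetic for the storage target described above.
# Assumption: decimal units, 1 TB = 1,000,000 MB.

MB_PER_TB = 1_000_000


def max_budget_usd(capacity_tb, usd_per_mb):
    """Most a server of the given capacity may cost and still meet the target."""
    return capacity_tb * MB_PER_TB * usd_per_mb


def cost_per_mb(price_usd, capacity_tb):
    """Achieved cost in dollars per megabyte for a given purchase price."""
    return price_usd / (capacity_tb * MB_PER_TB)


# A 1 TB server at one cent per megabyte.
print(max_budget_usd(1.0, 0.01))
# A hypothetical $10,000 server holding 1.12 TB comes in under the target.
print(cost_per_mb(10_000, 1.12))
```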
HARDWARE DECISIONS
Our goal of purchasing storage servers for under a penny per megabyte limits us to cost-effective COTS equipment, and removes hardware such as SCSI and companies like Clarion and EMC from the search process. We do require high quality, fully tested, mainstream equipment, primarily because it is the most prevalent and, therefore, the easiest to procure. Keeping these limits in mind, we've made the following specifications:
Round #1
Intel STL2 Motherboard with the ServerWorks chipset
Intel Pentium III (Coppermine) CPUs
Mirrored EIDE system disks
Maxtor 81.9GB EIDE data disks
Hot swap disk trays with built in fans
1 GB 133MHz SDRAM
3ware 7810 RAID controller card
Gigabit ethernet card based on the Acenic chipset
An appropriate chassis with enough cooling
Round #2
Supermicro P4DC6+ Motherboard with Intel i860 chipset
Intel Xeon 2.0GHz (Foster) CPUs
Mirrored EIDE system disks
Maxtor 122.9GB EIDE data disks
Hot swap disk trays with built in fans
1 GB 400MHz RDRAM
3ware 7850 RAID controller cards
Gigabit ethernet card
An appropriate chassis with enough cooling
Round #3
Supermicro P4DP6 Motherboard with Intel E7500 chipset
Intel Xeon 2.4GHz (Foster) CPUs
Maxtor 163.9GB EIDE data disks
1 GB 400MHz RDRAM
3ware 7850 RAID controller cards
Gigabit ethernet card
An appropriate 4U chassis with enough cooling
SOFTWARE DECISIONS
The reasons we are using Linux instead of a commercial operating system are numerous and obvious, so they will not be addressed here. Suffice it to say that, since we're building a system with a >1TB filesystem, we require a journaling filesystem. We chose XFS for several reasons: it has been in existence since 1994, it's now open source, and we have developed personal contacts with the XFS developers.
We installed the systems using the SGI boot CD which we created from the ISO image that is available at http://oss.sgi.com/projects/xfs. The official Red Hat CDs are also required to complete the installation. These can be obtained commercially or via ftp from various Red Hat mirrors. We then upgraded the kernel and XFS utilities via CVS. We are currently frozen on kernel version 2.4.18 with XFS v1.1 patches applied. These patches may be obtained from ftp://oss.sgi.com/projects/xfs/download/Release-1.1/kernel_patches/
This kernel is much more stable than its predecessors. Kernels earlier than 2.4.18 had many VM problems which invariably led to system freezes when memory couldn't be allocated for processes.
This kernel is also modified with the following changes:
1) linux/drivers/char/console.c at line 2679 has been changed to look like this:
static void blank_screen(unsigned long dummy)
{
        /* timer_do_blank_screen(0, 1); */
        /* Do nothing. Don't blank. */
}
This prevents the console from blanking, allowing us to see kernel messages printed to the screen in the event of a system freeze.
2) The latest patched driver from 3ware (v1.02.00.008) which can be found at http://www.3ware.com/support/3warednload_7000_driver.asp.
3) The latest patched driver for the ns83820 gigabit ethernet driver, taken from kernel 2.4.19-pre8.
We created a >1TB filesystem as follows: create 2 hardware RAID 5 arrays, one on each controller, then create a software RAID 0 (striped) array across the 2 hardware RAID 5 arrays using Ingo Molnar's raidtools v0.90 (see http://home.fnal.gov/~yocum/raidtab for our /etc/raidtab file). After formatting, the resulting volume size is 1.12TB (caveat: see the inode corruption notes below).
4) The NFS client patches from Trond Myklebust which can be found at http://www.fys.uio.no/~trondmy/src/. These patches solve several issues we've experienced with IRIX NFS servers and 'glob' which are referred to in this email message on comp.sys.sgi.hardware, as well as stale NFS file handles.
For kernel 2.4.18 we applied the linux-2.4.18-NFS_ALL.dif patch and then backed out the linux-2.4.18-rpc_tweaks.dif patch (patch -R ...), since that patch does not yet work correctly and limits read and write speeds to 10% of the expected values.
5) A patch from Ingo Molnar rectifying an IRQ problem associated with Foster CPUs. The problem description and the patch are in this email.
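The capacity of the striped volume described above (software RAID 0 across two hardware RAID 5 arrays) can be estimated as follows. This is a rough Python sketch, not a description of the actual configuration: it assumes eight drives per controller, which this note does not state, and uses the 81.9GB drives from Round #1.

```python
# Usable capacity of RAID 0 striped across several RAID 5 arrays.
# Assumption (not stated in the note): 8 drives on each of the 2 controllers.


def raid5_usable_gb(n_disks, disk_gb):
    """RAID 5 gives up one disk's worth of space to parity."""
    return (n_disks - 1) * disk_gb


def raid0_usable_gb(member_sizes_gb):
    """RAID 0 across equal-sized members adds their capacities."""
    return sum(member_sizes_gb)


arrays = [raid5_usable_gb(8, 81.9) for _ in range(2)]
total_gb = raid0_usable_gb(arrays)
print(total_gb, "GB")
print(round(total_gb / 1024, 2), "TB when dividing by 1024")
```

Under these assumptions the total is about 1146.6GB, which is 1.12TB when divided by 1024. That agrees with the volume size reported above, though it does not confirm the assumed drive count.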
TESTING AND ENGINEERING
As with any cutting edge technology, there have been bugs that need to be worked out of the system. Here's a list of items we have had to solve before arriving at our current state, which we would subjectively describe as 97% stable:
1 - Data corruption issues with 3ware 6800 cards in RAID 5.
With issue #1, we were able to reliably create data corruption by simply turning the power off to a drive in an array during a write() operation. To verify this, we copied a file from a known good filesystem that was not on the RAID array to the filesystem on the RAID array. During the copy we removed power from the drive by turning off the key on the hot-swap disk tray. After the copy completed, we performed a bitwise comparison (cmp) of the copied file to the original. To make sure that we were continually flushing the memory buffer, we used large files (>500MB) and performed 4 copies of the same file to different directories on the RAID volume, in parallel.
We created the large file with this bit of code: http://home.fnal.gov/~yocum/mkfile.c and we used this script to perform the copy and comparison: http://home.fnal.gov/~yocum/cpCmp.sh
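For readers without access to those files, the procedure can be approximated in Python. This is a minimal sketch of the same idea (create a large file, copy it to several directories in parallel, compare each copy byte for byte); it is not a port of mkfile.c or cpCmp.sh, and all function names are hypothetical.

```python
import filecmp
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def make_test_file(path, size_bytes, block_size=1 << 20):
    """Write size_bytes of random data to path, one block at a time."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(block_size, remaining)
            f.write(os.urandom(n))
            remaining -= n


def copy_and_compare(source, dest_dir):
    """Copy source into dest_dir, then compare the copy with the original."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(source))
    shutil.copyfile(source, target)
    # shallow=False forces a full content comparison, similar to cmp(1).
    return filecmp.cmp(source, target, shallow=False)


def parallel_copy_test(source, dest_dirs):
    """Run one copy-and-compare per destination directory, in parallel."""
    with ThreadPoolExecutor(max_workers=len(dest_dirs)) as pool:
        results = list(pool.map(lambda d: copy_and_compare(source, d), dest_dirs))
    return dict(zip(dest_dirs, results))
```

To mimic the test described above, one would create a file larger than 500MB on a known good filesystem, pass four directories on the RAID volume as dest_dirs, and pull the drive while the copies run. Any False in the result marks a corrupted copy. Note that the comparison may be served from the page cache unless the files are large relative to RAM.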
After we upgraded to the model 7810 card and the later drivers (>1.02.00.006b), data corruption due to these simulated failures ceased. An added bonus of the 7810 card is a 3-10x write speed increase over the 6800, which is addressed in the benchmark section below.
The data corruption problem seems to be related to the fact that the 6800 card was never designed to be a RAID 5 capable card: it has an underpowered CPU, no cache, and uses FPGAs which, our contact at 3ware implied, have been giving them problems as well. The 7810 has a higher clocked CPU, and uses ASICs instead of FPGAs. It still has no write cache, but that may be remedied in later models. Tom Tran at 3ware was invaluable in helping us diagnose the problem and getting us 7810 cards to test in our system.
Issue #2 manifested itself when we attempted to write data to our 1.12TB filesystem, which resulted in corrupted inodes, disappearing directories, and misnamed files. The root of the problem seems to be related to how the Linux VFS handles inode numbers which are >32 bits in length, which XFS supports (it's a 64bit filesystem). In short, the VFS doesn't handle these large inode numbers very well. Currently, the solution is to create the filesystem passing the option '-i size=512' to mkfs.xfs. This sets the on-disk inode size to 512 bytes, rather than the default of 256 bytes, and drops the significant bits for inode numbers to 32 bits. By doubling the inode size we have lost approximately 20MB of usable storage space, but considering the consequences, this is more than acceptable.
One note about Linux and filesystems in general: Linux, at the current time, has an artificial limit on filesystem size of 2TB. Caveat emptor to those who try to create filesystems larger than this limit. Many thanks to Martin K. Petersen and Eric Sandeen for their help in suggesting this answer, and all their great work on porting XFS to Linux.
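Why a larger inode size shortens inode numbers can be illustrated with a simplified model. In XFS, an inode number is built from an allocation group (AG) number, a block number within the AG, and the inode's slot within that block, so halving the number of inodes per block removes one bit. The Python sketch below follows that model; the geometry used (287 AGs of 4GB each, 4096-byte blocks) is a hypothetical example chosen to resemble a 1.12TB volume and is not taken from this system.

```python
# Simplified model of XFS inode number width.
# Assumption: width = AG-number bits + block-in-AG bits + inode-in-block bits.
# The example geometry below is hypothetical.


def ceil_log2(n):
    """Smallest number of bits b such that 2**b >= n (for n >= 1)."""
    return (n - 1).bit_length()


def inode_number_bits(ag_count, ag_blocks, block_size, inode_size):
    """Bits needed for the largest inode number under the model above."""
    inodes_per_block = block_size // inode_size
    return ceil_log2(ag_count) + ceil_log2(ag_blocks) + ceil_log2(inodes_per_block)


AG_COUNT = 287        # hypothetical: about 1.12 TB / 4 GB per AG
AG_BLOCKS = 1 << 20   # hypothetical: 4 GB AG with 4096-byte blocks

print(inode_number_bits(AG_COUNT, AG_BLOCKS, 4096, 256))
print(inode_number_bits(AG_COUNT, AG_BLOCKS, 4096, 512))
```

For this example geometry the model gives 33 bits with 256-byte inodes and 32 bits with 512-byte inodes, which is consistent with the behavior described for Issue #2.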
Issue #3 was, again, a problem related to the 3ware cards, but has supposedly been solved in the latest driver version, v1.02.00.008, which was incorporated into the 2.4.9-ac5, and later, kernels. We have not fully tested this solution, but we expect the problem to be solved with this iteration of patches. Many thanks to Adam Radford at 3ware for his driver development work.
Issue #4 was a possible contributor to the inode corruption and it is strongly recommended that the kgcc (aka gcc 2.91.66, aka egcs 1.1.2.14) be used to compile the kernel. This compiler is found in the compat-egcs package which, in turn, requires the compat-glibc package, both of which can be found in the Red Hat 7.1 rpm packages. Originally we compiled the kernel using the default compiler that is shipped with Red Hat, gcc-2.96-85, but after much discussion with the XFS developers we re-compiled with the earlier compiler. This change must be made in the kernel Makefile and is well documented therein.
BENCHMARKS AND TUNING
The moment of truth. Please note that these benchmarks were performed on a few 2.4.8preX kernels with 3ware 7810 RAID cards. [As noted above, something happened in the kernel between 2.4.8 and 2.4.9 that reduced performance drastically. Downgrading to the 2.4.5 kernel with all patches described here brought performance to within 95% of these values.] When time permits, these tests will be reproduced and this document will be modified accordingly.
For the benchmarks we used bonnie++ with the following command line options:
bonnie++ -d /export/data/ -s 2000:64k -m dp864k -r 512 -x 20 -u 0
and these are a sampling of the resulting numbers:
Version 1.01c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
dp864k 2000M:64k 12104 99 33209 47 14017 12 10959 95 128553 59 209.8 8
dp864k 2000M:64k 12080 99 37975 35 13550 13 10867 94 129185 55 233.2 7
dp864k 2000M:64k 12057 99 36591 30 13870 12 10860 94 128894 57 235.4 7
dp864k 2000M:64k 12072 99 36811 36 13821 12 10876 94 129162 55 232.8 8
dp864k 2000M:64k 12057 99 37857 28 13697 12 10836 94 130170 55 235.3 8
The chart is mostly self explanatory, but for the uninitiated, the following are what we're really interested in: column #5, the block write speed, and column #11, the block read speed.
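Pulling those two figures out of a bonnie++ result row is easy to script. The Python sketch below splits a row on whitespace and labels the twelve numeric fields; the field names are our own labels for illustration, not bonnie++ terminology.

```python
# Parse one whitespace-separated bonnie++ result row as shown in this note.
# The field names are illustrative labels, not part of bonnie++ itself.

FIELDS = (
    "putc_kps", "putc_cpu",        # sequential output, per character
    "write_kps", "write_cpu",      # sequential output, block (5th field of the row)
    "rewrite_kps", "rewrite_cpu",  # sequential output, rewrite
    "getc_kps", "getc_cpu",        # sequential input, per character
    "read_kps", "read_cpu",        # sequential input, block (11th field of the row)
    "seeks_ps", "seeks_cpu",       # random seeks
)


def parse_row(line):
    """Return a dict with machine, size, and the twelve numeric fields."""
    parts = line.split()
    row = {"machine": parts[0], "size": parts[1]}
    row.update(zip(FIELDS, (float(p) for p in parts[2:14])))
    return row


def column_mean(rows, key):
    """Average one numeric field over several parsed rows."""
    return sum(r[key] for r in rows) / len(rows)


rows = [
    parse_row("dp864k 2000M:64k 12104 99 33209 47 14017 12 10959 95 128553 59 209.8 8"),
    parse_row("dp864k 2000M:64k 12080 99 37975 35 13550 13 10867 94 129185 55 233.2 7"),
]
print(column_mean(rows, "write_kps"), column_mean(rows, "read_kps"))
```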
The STL2 motherboard is based on the ServerWorks ServerSet III LE Chipset. It has 6 PCI slots on 2 buses; one bus has 2 slots and is 64bit@66MHz (528MB/s) and the other has 4 slots and runs at 32bit@33MHz (132MB/s). The 7810 card is a 64bit@33MHz card. Since the gigabit ethernet card is a fast, wide card, it occupies one of the fast, wide slots. The other fast, wide slot is occupied by a 7810 card. The second 7810 card is on the slow, narrow PCI bus, but a single card can only write at about 17MB/s, which is far below the PCI bus speed so this poses no real problem and doesn't hinder the overall speed.
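The bus figures quoted above follow from bus width times clock rate. A small Python sketch of that arithmetic, using the nominal 33MHz and 66MHz clocks as this note does:

```python
# Peak PCI bandwidth = (bus width in bytes) x (clock in MHz), in MB/s.
# Uses nominal clock values (33 and 66 MHz), matching the figures in the text.


def pci_peak_mb_per_s(width_bits, clock_mhz):
    """Theoretical peak transfer rate of a PCI bus in MB/s."""
    return width_bits / 8 * clock_mhz


def bus_utilization(device_mb_per_s, width_bits, clock_mhz):
    """Fraction of the bus's peak rate consumed by one device."""
    return device_mb_per_s / pci_peak_mb_per_s(width_bits, clock_mhz)


print(pci_peak_mb_per_s(64, 66))  # fast, wide bus
print(pci_peak_mb_per_s(32, 33))  # slow, narrow bus
print(pci_peak_mb_per_s(64, 33))  # peak for a 64bit@33MHz card such as the 7810
print(bus_utilization(17, 32, 33))  # one card writing at 17MB/s on the slow bus
```

A single card writing at about 17MB/s uses roughly 13% of the slow bus's peak rate, which is why placing the second 7810 there costs nothing for writes.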
As you can see, our block write speed is approximately 2 times the advertised 17MB/s. This is because the filesystem is on a striped volume spanning the 2 RAID 5 cards, which effectively doubles the speed. You should also note the block read speed of ~129MB/s when you compare the above results to these results:
Version 1.01c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
dp9512k 2000M:64k 12022 99 34272 31 23061 20 10736 97 185982 86 239.9 7
dp9512k 2000M:64k 11991 99 35922 37 22323 19 10752 97 190026 86 239.6 8
dp9512k 2000M:64k 11971 99 37555 33 21498 18 10799 98 183898 85 238.3 8
dp9512k 2000M:64k 11985 99 34122 26 23061 20 10699 97 188867 89 238.1 8
dp9512k 2000M:64k 11983 99 34603 28 22976 20 10761 97 190599 87 242.6 8
The difference is due to how the software RAID0 array was created. In the raidtab file shown previously you will notice that one can set the chunksize of the software RAID 0 array. This sets how much data is written to a single device at a time. The hardware RAID 5 array is hardcoded with a 64KB chunksize.
In the former bonnie++ test, we configured the software RAID with a 64KB chunk size, so after each 64KB of data is written to one device, another 64KB of data is written to the 2nd device, and so forth. In the second instance, the software RAID 0 array was configured with a chunk size of 512KB, thus reducing the overhead of writing many small chunks to each device individually. There is some speculation that a chunksize of 448KB may be even better for write performance, but at ~190MB/s we're approaching the theoretical maximum bandwidth of the slow, narrow PCI bus.
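The effect of the chunk size on how a request is split can be seen with a simplified model of RAID 0 addressing, in which consecutive chunks alternate between the member devices. The Python sketch below is an illustration of that model only; it is not how raidtools or the kernel md driver is implemented.

```python
# Simplified RAID 0 addressing: chunk i lives on device (i mod n_devices).
# Illustrative model only, not the md driver's actual code.


def chunk_layout(offset, length, chunk_kb, n_devices=2):
    """Split a request into (device_index, byte_count) pieces, in order."""
    chunk = chunk_kb * 1024
    pieces = []
    pos = offset
    end = offset + length
    while pos < end:
        chunk_index = pos // chunk
        device = chunk_index % n_devices
        chunk_end = (chunk_index + 1) * chunk
        n = min(end, chunk_end) - pos
        pieces.append((device, n))
        pos += n
    return pieces


one_mb = 1 << 20
print(len(chunk_layout(0, one_mb, 64)))   # pieces for a 1MB write, 64KB chunks
print(len(chunk_layout(0, one_mb, 512)))  # pieces for a 1MB write, 512KB chunks
```

In this model a 1MB sequential write is split into 16 pieces with 64KB chunks but only 2 pieces with 512KB chunks, so each device sees fewer, larger requests.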
The benchmarks for the Xeon systems with 3ware 7850 cards are below:
Version 1.01b ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
dp16 2G:64k 18453 88 107891 35 62639 27 21818 93 191840 38 300.8 5
dp16 2G:64k 17953 88 109829 34 60713 29 22202 95 192694 37 309.2 4
dp16 2G:64k 18522 88 113094 33 64315 26 22001 94 190383 40 306.9 5
dp16 2G:64k 18061 89 110736 33 61757 24 22018 94 194608 41 307.0 5
dp16 2G:64k 18826 89 116793 32 50297 20 23061 99 194914 41 308.5 4
As you can see, performance has increased dramatically with this 3ware card. The highest write speed that we've seen is ~124MB/s and the highest read speed is ~215MB/s. The 7850 cards have been placed in the 2 64bit slots available on the P4DC6+ motherboard, which are on the same PCI channel. One might assume that higher transfer rates might be achieved by placing them on 2 separate 64bit channels that are available on other motherboards, but we have not been able to verify this.
One problem has been discovered with the i860 chipset which essentially limits PCI bandwidth to ~200MB/s. Here's an email from one of our local experts (Don Holmgren), sent to me, on the i860 PCI bandwidth problem.
With round #3 we've seen the performance increase dramatically, proving that there is indeed a big problem with the i860 chipset. Remember, theoretical max is 528MB/s and we're not there yet, but it's getting closer. Here are the Bonnie++ numbers for the E7500 based machines:
------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
dp20 2G:64k 23969 87 137079 33 76044 28 27736 98 345211 72 295.0 4
dp20 2G:64k 24500 90 134878 34 63826 24 27945 99 338225 68 301.3 4
dp20 2G:64k 23988 88 140606 36 66681 25 27893 99 346833 73 304.1 4
dp20 2G:64k 23790 88 131784 32 59929 23 27961 99 349615 72 305.2 4
dp20 2G:64k 23797 88 131319 33 60297 24 28149 99 343438 70 302.1 3
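To put "getting closer" in numbers, the best block read rates from the i860 and E7500 tables can be expressed as a fraction of the 528MB/s theoretical maximum. The Python sketch below follows this note's convention of treating 1000 K/sec as 1MB/s, which is an approximation.

```python
# Best block read rates versus the 528 MB/s theoretical PCI maximum.
# Assumption: K/sec is divided by 1000 to get MB/s, as done loosely in the text.

PCI_64BIT_66MHZ_MB_S = 528.0

BEST_BLOCK_READ_KPS = {
    "round 2 (i860, dp16)": 194914,
    "round 3 (E7500, dp20)": 349615,
}


def fraction_of_bus(kps, bus_mb_s=PCI_64BIT_66MHZ_MB_S):
    """Measured rate as a fraction of the bus's theoretical peak."""
    return (kps / 1000.0) / bus_mb_s


for label, kps in BEST_BLOCK_READ_KPS.items():
    print(label, round(100 * fraction_of_bus(kps)), "% of theoretical")
```

By this measure the i860 machines reach a bit over a third of the theoretical maximum and the E7500 machines about two thirds.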
Some testing of bonnie++ has been done using NFSv3 over gigabit ethernet. The system was tuned as follows:
On the server and client sides, this is appended to /etc/rc.local:
echo 262144 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/wmem_max
On the server side, these values are set as follows (in /etc/init.d/nfs) before starting nfsd, and reset to the defaults (65535) after the server has been started:
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/wmem_default
On the client side, these are the options passed to mount the NFS volume:
mount -o nfsvers=3,rsize=32768,wsize=32768 <exported volume> <mount point>
These tuning options resulted in these performance marks:
Version 1.01c ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
dp9nfsdp8 2000M:64k 10022 90 27694 40 10252 16 9574 92 51843 52 133.9 7
dp9nfsdp8 2000M:64k 11078 98 29912 28 8306 14 10701 99 41515 31 103.6 6
dp9nfsdp8 2000M:64k 7936 74 22719 16 8527 15 10748 99 42917 36 98.2 5
dp9nfsdp8 2000M:64k 11015 98 27402 25 8481 14 10775 99 41699 32 101.5 5
dp9nfsdp8 2000M:64k 10918 96 25679 23 8436 14 10731 99 42655 34 100.8 5
dp9nfsdp8 2000M:64k 10992 98 25247 23 9225 15 10749 99 48310 44 121.7 6
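For context, the NFS block read figures above can be summarized and compared with the raw gigabit ethernet line rate of 125MB/s. The Python sketch below does this; it treats 1000 K/sec as 1MB/s, as elsewhere in this note, and ignores protocol overhead, so the percentage is only a rough indication.

```python
from statistics import mean

# NFS block read rates (K/sec) from the table above.
NFS_BLOCK_READ_KPS = [51843, 41515, 42917, 41699, 42655, 48310]

# Raw gigabit line rate: 1000 Mbit/s / 8 = 125 MB/s, before protocol overhead.
GIGABIT_MB_S = 1000 / 8


def summarize(kps_values, line_rate_mb_s=GIGABIT_MB_S):
    """Min, mean, and max in MB/s, plus the mean as a fraction of line rate."""
    mb_s = [v / 1000.0 for v in kps_values]
    avg = mean(mb_s)
    return {
        "min": min(mb_s),
        "mean": avg,
        "max": max(mb_s),
        "fraction_of_line_rate": avg / line_rate_mb_s,
    }


print(summarize(NFS_BLOCK_READ_KPS))
```

The mean block read rate over NFS is roughly 45MB/s, a little over a third of the raw line rate.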
CONCLUSION
Using COTS PC hardware, EIDE hard drives, cost-effective EIDE RAID controller cards, and the Linux operating system with XFS support, we have achieved our goal of creating a moderately powered, moderately sized, failure tolerant storage server for just under 1 penny per megabyte.
The CDF experiment has evaluated several similar servers from several companies. Their findings can be found here.
(Please note that all prices listed herein are estimates only, and may not be available to non-government affiliated institutions or commercial corporations.)