SDSS Storage Server Technical Note

Dan Yocum

Fermi National Lab

Sloan Digital Sky Survey

November 9, 2001

updated May 11, 2002

updated Oct 10, 2002


INTRODUCTION

The Sloan Digital Sky Survey is an ambitious effort to map 1/4 of the visible sky at optical and very-near infrared wavelengths, and take spectra of 1 million extragalactic objects. The estimated amount of data that will be taken over the 5-year lifespan of the project is 15TB; however, the total amount of storage space required for object databases, corrected frames, and reduced spectra will be several times higher than this. The goal is to have all the data online and available to the collaborators at all times, within strict budget constraints. As of late 2001, a reasonably achievable goal is to purchase a sufficiently powered machine with over 1TB of storage space for less than $0.01/MB, or about $10,000USD.
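The budget figure follows directly from the cost target. As a quick sanity check, the sketch below converts a cost-per-megabyte target into a purchase budget and back. It assumes decimal units (1TB = 1,000,000MB); the function names are ours, chosen for illustration only.

```python
# Budget check for the stated target: $0.01 per megabyte on a 1TB server.
# Assumption: decimal units, 1TB = 1,000,000MB.

MB_PER_TB = 1_000_000


def budget_usd(capacity_tb: float, usd_per_mb: float) -> float:
    """Purchase budget implied by a cost-per-megabyte target."""
    return capacity_tb * MB_PER_TB * usd_per_mb


def cost_per_mb(price_usd: float, capacity_tb: float) -> float:
    """Cost per megabyte actually achieved for a given price and capacity."""
    return price_usd / (capacity_tb * MB_PER_TB)


print("budget for 1TB at $0.01/MB: $%.0f" % budget_usd(1.0, 0.01))
print("cost/MB for a $10,000, 1TB server: $%.4f" % cost_per_mb(10_000, 1.0))
```

Any capacity above 1TB at the same price pushes the cost per megabyte below the one-penny target.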


HARDWARE DECISIONS

Our goal of purchasing storage servers for under a penny per megabyte limits us to cost-effective COTS equipment, and removes hardware such as SCSI and companies like Clarion and EMC from the search process. We do require high-quality, fully tested, mainstream equipment, primarily because it is the most prevalent and, therefore, the easiest to procure. Keeping these limits in mind, we've made the following specifications:

Round #1

Intel STL2 Motherboard with the ServerWorks chipset
Intel Pentium III (Coppermine) CPUs
Mirrored EIDE system disks
Maxtor 81.9GB EIDE data disks
Hot swap disk trays with built in fans
1 GB 133MHz SDRAM
3ware 7810 RAID controller card
Gigabit ethernet card based on the Acenic chipset
An appropriate chassis with enough cooling


Round #2

Supermicro P4DC6+ Motherboard with Intel i860 chipset
Intel Xeon 2.0GHz (Foster) CPUs
Mirrored EIDE system disks
Maxtor 122.9GB EIDE data disks
Hot swap disk trays with built in fans
1 GB 400MHz RDRAM
3ware 7850 RAID controller cards
Gigabit ethernet card
An appropriate chassis with enough cooling


Round #3

Supermicro P4DP6 Motherboard with Intel E7500 chipset
Intel Xeon 2.4GHz (Foster) CPUs
Maxtor 163.9GB EIDE data disks
1 GB 400MHz RDRAM
3ware 7850 RAID controller cards
Gigabit ethernet card
An appropriate 4U chassis with enough cooling


SOFTWARE DECISIONS

The reasons we are using Linux instead of a commercial operating system are numerous and obvious, so they will not be addressed here. Because we're building a system with a filesystem larger than 1TB, we require a journaling filesystem. We chose XFS for several reasons: it has been in existence since 1994, it's now open source, and we have developed personal contacts with the XFS developers.

We installed the systems using the SGI boot CD which we created from the ISO image that is available at http://oss.sgi.com/projects/xfs. The official Red Hat CDs are also required to complete the installation. These can be obtained commercially or via ftp from various Red Hat mirrors. We then upgraded the kernel and XFS utilities via CVS. We are currently frozen on kernel version 2.4.18 with XFS v1.1 patches applied. These patches may be obtained from ftp://oss.sgi.com/projects/xfs/download/Release-1.1/kernel_patches/

This kernel is much more stable than its predecessors. Kernels earlier than 2.4.18 had many VM problems which invariably led to system freezes when memory couldn't be allocated for processes.

This kernel is also modified with the following changes:

1) linux/drivers/char/console.c at line 2679 has been changed to look like this:

static void blank_screen(unsigned long dummy)
{
        /* timer_do_blank_screen(0, 1); */
        /* Do nothing.  Don't blank. */
}

This prevents the console from blanking, allowing us to see kernel messages printed to the screen in the event of a system freeze.

2) The latest patched driver from 3ware (v1.02.00.008) which can be found at http://www.3ware.com/support/3warednload_7000_driver.asp.

3) The latest patched ns83820 gigabit ethernet driver, taken from kernel 2.4.19-pre8.

We created a >1TB filesystem as follows: create 2 hardware RAID 5 arrays, one on each controller, then create a software RAID 0 (striped) array across the 2 hardware RAID 5 arrays using Ingo Molnar's raidtools v0.90. Our /etc/raidtab file (http://home.fnal.gov/~yocum/raidtab) shows the configuration. After formatting, the resulting volume size is 1.12TB (caveat: see the inode corruption notes below).

4) The NFS client patches from Trond Myklebust which can be found at http://www.fys.uio.no/~trondmy/src/. These patches solve several issues we've experienced with IRIX NFS servers and 'glob' which are referred to in this email message on comp.sys.sgi.hardware, as well as stale NFS file handles.

For kernel 2.4.18 we applied the linux-2.4.18-NFS_ALL.dif patch and then backed out the linux-2.4.18-rpc_tweaks.dif patch (patch -R ...), since this patch does not yet work correctly and limits read and write speeds to 10% of the expected values.

5) A patch from Ingo Molnar rectifying an IRQ problem associated with Foster CPUs. The problem description and the patch are in this email.


TESTING AND ENGINEERING

As with any cutting edge technology, there have been bugs that need to be worked out of the system. Here's a list of items we have had to solve before arriving at our current state, which we would subjectively describe as 97% stable:

1 - Data corruption issues with 3ware 6800 cards in RAID 5.
2 - Inode corruption resulting in lost data with XFS.
3 - SCSI resets causing the NMI Watchdog to freeze the system.
4 - Possible problems from compiling the kernel with gcc-2.96-85

With issue #1, we were able to reliably create data corruption by simply turning the power off to a drive in an array during a write() operation. To verify this we copied a file from a known good filesystem, not on the RAID array, to the filesystem on the RAID array. During the copy we removed power to the drive by turning the key on the hot-swap disk tray. After the copy completed, we performed a byte-by-byte comparison (cmp) of the copied file against the original. To make sure that we were continually flushing the memory buffer, we used large files (>500MB) and performed 4 copies of the same file to different directories on the RAID volume, in parallel.

We created the large file with this bit of code: http://home.fnal.gov/~yocum/mkfile.c and we used this script to perform the copy and comparison: http://home.fnal.gov/~yocum/cpCmp.sh
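For readers without access to those files, the procedure can be sketched as follows. This is not the author's mkfile.c or cpCmp.sh, whose contents are not reproduced here; it is a minimal Python illustration of the same idea (write a test file, copy it to several directories in parallel, then compare each copy against the original). Filling the file with random bytes and all function names are our assumptions.

```python
import filecmp
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_test_file(path, size_bytes, block=1 << 20):
    """Write size_bytes of random data (assumed content; mkfile.c may differ)."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(block, remaining)
            f.write(os.urandom(n))
            remaining -= n


def copy_and_compare(source, dest_dir):
    """Copy source into dest_dir, then compare the copy byte for byte."""
    os.makedirs(dest_dir, exist_ok=True)
    copy = os.path.join(dest_dir, os.path.basename(source))
    shutil.copyfile(source, copy)
    # shallow=False forces a content comparison rather than a stat() check.
    return filecmp.cmp(source, copy, shallow=False)


def parallel_copy_test(source, raid_root, copies=4):
    """Run several copies at once into separate directories on the array."""
    dirs = [os.path.join(raid_root, "copy%d" % i) for i in range(copies)]
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(lambda d: copy_and_compare(source, d), dirs))


if __name__ == "__main__":
    # Small demo in temporary directories. A real run would use a >500MB file
    # and a directory on the RAID volume, with a drive powered off mid-copy.
    with tempfile.TemporaryDirectory() as good_fs, \
            tempfile.TemporaryDirectory() as raid_fs:
        src = os.path.join(good_fs, "bigfile")
        make_test_file(src, 4 * (1 << 20))
        print(parallel_copy_test(src, raid_fs))
```

A result of False for any copy indicates corruption. As in the original procedure, the files need to be large relative to system memory, otherwise the comparison may be satisfied from the buffer cache rather than from the disks.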

After we upgraded to the model 7810 card and the later drivers (>1.02.00.006b), data corruption due to these simulated failures ceased. An added bonus of the 7810 card is a 3-10x write speed increase over the 6800, but that will be addressed in the Benchmarks and Tuning section below.

The data corruption problem seems to be related to the fact that the 6800 card was never designed to be a RAID 5 capable card: it has an underpowered CPU, no cache, and uses FPGAs, which our contact at 3ware implied have been giving them problems as well. The 7810 has a higher clocked CPU, and uses ASICs instead of FPGAs. It still has no write cache, but that may be remedied in later models. Tom Tran at 3ware was invaluable in helping us diagnose the problem and getting us 7810 cards to test in our system.

Issue #2 manifested itself when we attempted to write data to our 1.12TB filesystem, which resulted in corrupted inodes, disappearing directories, and misnamed files. The root of the problem seems to be related to how the Linux VFS handles inode numbers which are >32 bits in length, which XFS supports (it's a 64-bit filesystem). In short, the VFS doesn't handle these large inode numbers very well. Currently, the solution is to create the filesystem passing the option '-i size=512' to mkfs.xfs. This sets the inode size to 512 bytes, rather than the default of 256 bytes, and drops the significant bits for inode numbers to 32 bits. By doubling the inode size we have lost approximately 20MB of usable storage space, but considering the consequences, this is more than acceptable.
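Why a larger inode shortens inode numbers can be seen with a rough estimate. The sketch below uses a simplified model in which an inode number is a block address followed by the index of the inode within that block; real XFS builds the number from the allocation group number, the block within the group, and the inode's offset in the block, rounding each field up separately, so actual widths can be larger. The 4096-byte block size and the reading of 1.12TB as decimal bytes are assumptions.

```python
def bits_needed(count: int) -> int:
    """Number of bits needed to index `count` distinct items."""
    return max(count - 1, 0).bit_length()


def inode_number_bits(fs_bytes: int, block_size: int, inode_size: int) -> int:
    """Estimate the width of the largest inode number (simplified model)."""
    blocks = fs_bytes // block_size
    inodes_per_block = block_size // inode_size
    return bits_needed(blocks) + bits_needed(inodes_per_block)


FS_BYTES = 1_120_000_000_000  # the 1.12TB volume, taken as decimal bytes
BLOCK_SIZE = 4096             # assumed filesystem block size

for inode_size in (256, 512):
    bits = inode_number_bits(FS_BYTES, BLOCK_SIZE, inode_size)
    print("inode size %3d bytes: about %d bits" % (inode_size, bits))
```

Under these assumptions, halving the number of inodes per block removes one bit from the inode number, which is enough to bring a volume of this size back within 32 bits.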

One note about Linux and filesystems in general: Linux, at the current time, has an artificial limit on filesystem size of 2TB. Caveat emptor to those who try to create filesystems larger than this limit. Many thanks to Martin K. Petersen and Eric Sandeen for their help in suggesting this answer, and all their great work on porting XFS to Linux.

Issue #3 was, again, a problem related to the 3ware cards, but has supposedly been solved in the latest driver version, v1.02.00.008, which was incorporated into the 2.4.9-ac5 and later kernels. We have not fully tested this solution, but we expect the problem to be solved with this iteration of patches. Many thanks to Adam Radford at 3ware for his driver development work.

Issue #4 was a possible contributor to the inode corruption and it is strongly recommended that the kgcc (aka gcc 2.91.66, aka egcs 1.1.2.14) be used to compile the kernel. This compiler is found in the compat-egcs package which, in turn, requires the compat-glibc package, both of which can be found in the Red Hat 7.1 rpm packages. Originally we compiled the kernel using the default compiler that is shipped with Red Hat, gcc-2.96-85, but after much discussion with the XFS developers we re-compiled with the earlier compiler. This change must be made in the kernel Makefile and is well documented therein.

BENCHMARKS AND TUNING

The moment of truth. Please note that these benchmarks were performed on a few 2.4.8preX kernels with 3ware 7810 RAID cards. [Something happened in the kernel between 2.4.8 and 2.4.9 that reduced performance drastically. Downgrading to the 2.4.5 kernel with all patches described here brought performance to within 95% of these values.] When time permits, these tests will be reproduced and this document will be modified accordingly.

For the benchmarks we used bonnie++ with the following command line options:

bonnie++ -d /export/data/ -s 2000:64k -m dp864k -r 512 -x 20 -u 0 

and these are a sampling of the resulting numbers:

Version  1.01c      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dp864k    2000M:64k 12104  99 33209  47 14017  12 10959  95 128553  59 209.8   8
dp864k    2000M:64k 12080  99 37975  35 13550  13 10867  94 129185  55 233.2   7
dp864k    2000M:64k 12057  99 36591  30 13870  12 10860  94 128894  57 235.4   7
dp864k    2000M:64k 12072  99 36811  36 13821  12 10876  94 129162  55 232.8   8
dp864k    2000M:64k 12057  99 37857  28 13697  12 10836  94 130170  55 235.3   8

The chart is mostly self-explanatory, but for the uninitiated, the following are what we're really interested in: column #5, the block write speed, and column #11, the block read speed (both in K/sec).
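To make those two figures easy to pull out, here is a small sketch that reads the fifth and eleventh whitespace-separated fields of each result line and averages them over the five runs. The rows are copied from the dp864k table above; the helper names are ours, and the conversion treats 1MB/s as 1000 K/sec, which is an assumption about units.

```python
from statistics import mean

# Rows copied from the dp864k results above.
RESULTS = """\
dp864k    2000M:64k 12104  99 33209  47 14017  12 10959  95 128553  59 209.8   8
dp864k    2000M:64k 12080  99 37975  35 13550  13 10867  94 129185  55 233.2   7
dp864k    2000M:64k 12057  99 36591  30 13870  12 10860  94 128894  57 235.4   7
dp864k    2000M:64k 12072  99 36811  36 13821  12 10876  94 129162  55 232.8   8
dp864k    2000M:64k 12057  99 37857  28 13697  12 10836  94 130170  55 235.3   8
"""

BLOCK_WRITE_FIELD = 5   # 1-based field numbers
BLOCK_READ_FIELD = 11


def field_values(text, field):
    """Return one numeric field from every non-empty result line."""
    return [float(line.split()[field - 1])
            for line in text.splitlines() if line.strip()]


def mean_mb_per_s(text, field):
    """Average a K/sec field and convert to MB/s (assuming 1MB = 1000K)."""
    return mean(field_values(text, field)) / 1000.0


print("block write: %.1f MB/s" % mean_mb_per_s(RESULTS, BLOCK_WRITE_FIELD))
print("block read:  %.1f MB/s" % mean_mb_per_s(RESULTS, BLOCK_READ_FIELD))
```

The same helpers can be pointed at any of the other bonnie++ tables in this note, since they all share the same field layout.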

The STL2 motherboard is based on the ServerWorks ServerSet III LE Chipset. It has 6 PCI slots on 2 buses; one bus has 2 slots and is 64bit@66MHz (528MB/s) and the other has 4 slots and runs at 32bit@33MHz (132MB/s). The 7810 card is a 64bit@33MHz card. Since the gigabit ethernet card is a fast, wide card, it occupies one of the fast, wide slots. The other fast, wide slot is occupied by a 7810 card. The second 7810 card is on the slow, narrow PCI bus, but a single card can only write at about 17MB/s, which is far below the PCI bus speed so this poses no real problem and doesn't hinder the overall speed.
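The bus figures quoted here are peak rates, obtained by multiplying the bus width in bytes by the clock rate. A minimal sketch of that arithmetic, using the nominal 33 and 66MHz clock values from the text (the function name is ours):

```python
def pci_bandwidth_mb_s(width_bits: int, clock_mhz: int) -> float:
    """Peak PCI transfer rate: one bus-width transfer per clock cycle."""
    return width_bits / 8 * clock_mhz


for label, width, clock in [
    ("fast, wide bus", 64, 66),
    ("slow, narrow bus", 32, 33),
    ("7810 card", 64, 33),
]:
    rate = pci_bandwidth_mb_s(width, clock)
    print("%-17s %2d-bit @ %2dMHz = %3.0f MB/s" % (label, width, clock, rate))
```

These are theoretical ceilings; sustained transfers on a shared bus will come in below them.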

As you can see, our block write speed is approximately 2 times faster than the advertised 17MB/s. This is because the filesystem is on a striped volume spanning 2 RAID 5 cards, which effectively doubles the speed. Also note the block read speed of ~129MB/s, and compare it with these results:


Version  1.01c      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dp9512k   2000M:64k 12022  99 34272  31 23061  20 10736  97 185982  86 239.9   7
dp9512k   2000M:64k 11991  99 35922  37 22323  19 10752  97 190026  86 239.6   8
dp9512k   2000M:64k 11971  99 37555  33 21498  18 10799  98 183898  85 238.3   8
dp9512k   2000M:64k 11985  99 34122  26 23061  20 10699  97 188867  89 238.1   8
dp9512k   2000M:64k 11983  99 34603  28 22976  20 10761  97 190599  87 242.6   8

The difference is due to how the software RAID0 array was created. In the raidtab file shown previously you will notice that one can set the chunksize of the software RAID 0 array. This sets how much data is written to a single device at a time. The hardware RAID 5 array is hardcoded with a 64KB chunksize.

In the former bonnie++ test, we configured the software RAID with a 64KB chunk size, so after each 64KB of data is written to one device, another 64KB of data is written to the 2nd device, and so forth. In the second instance, the software RAID 0 array was configured with a chunk size of 512KB, thus reducing the overhead of writing many small chunks to each device individually. There is some speculation that a chunksize of 448KB may be even better for write performance but at ~190MB/s, or roughly 95MB/s per card, we're approaching the 132MB/s theoretical maximum bandwidth of the slow, narrow PCI bus.
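The effect of the chunk size can be illustrated with the usual RAID 0 address mapping. The sketch below is a generic model of striping across 2 devices, not the Linux md driver's actual code; it shows which device a byte offset lands on and how many separate pieces a contiguous request is split into for each chunk size. The 1MB request size is an arbitrary example.

```python
def locate(offset: int, chunk_size: int, n_devices: int):
    """Map a byte offset on a striped volume to (device, offset on device)."""
    chunk = offset // chunk_size
    device = chunk % n_devices
    stripe = chunk // n_devices
    return device, stripe * chunk_size + offset % chunk_size


def chunks_touched(offset: int, length: int, chunk_size: int) -> int:
    """Number of separate chunks a contiguous request is split into."""
    if length <= 0:
        return 0
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return last - first + 1


KB = 1024
for chunk_kb in (64, 512):
    pieces = chunks_touched(0, 1024 * KB, chunk_kb * KB)
    print("%3dKB chunks: a 1MB request is split into %d pieces"
          % (chunk_kb, pieces))
```

With 64KB chunks the request alternates between the two arrays many times, while with 512KB chunks each array receives a single large piece, which is consistent with the lower overhead described above.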

The benchmarks for the Xeon systems with 3ware 7850 cards are below:

Version  1.01b      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dp16         2G:64k 18453  88 107891  35 62639  27 21818  93 191840  38 300.8   5
dp16         2G:64k 17953  88 109829  34 60713  29 22202  95 192694  37 309.2   4
dp16         2G:64k 18522  88 113094  33 64315  26 22001  94 190383  40 306.9   5
dp16         2G:64k 18061  89 110736  33 61757  24 22018  94 194608  41 307.0   5
dp16         2G:64k 18826  89 116793  32 50297  20 23061  99 194914  41 308.5   4

As you can see, performance has increased dramatically with this 3ware card. The highest write speed that we've seen is ~124MB/s and the highest read speed is ~215MB/s. The 7850 cards have been placed in the 2 64bit slots available on the P4DC6+ motherboard, which are on the same PCI channel. One might assume that higher transfer rates might be achieved by placing them on 2 separate 64bit channels that are available on other motherboards, but we have not been able to verify this.

One problem has been discovered with the i860 chipset which essentially limits PCI bandwidth to ~200MB/s. Here's an email from one of our local experts (Don Holmgren), sent to me, on the i860 PCI bandwidth problem.

With round #3 we've seen the performance increase dramatically, proving that there is indeed a big problem with the i860 chipset. Remember, theoretical max is 528MB/s and we're not there yet, but it's getting closer. Here are the Bonnie++ numbers for the E7500 based machines:

                    ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dp20         2G:64k 23969  87 137079  33 76044  28 27736  98 345211  72 295.0   4
dp20         2G:64k 24500  90 134878  34 63826  24 27945  99 338225  68 301.3   4
dp20         2G:64k 23988  88 140606  36 66681  25 27893  99 346833  73 304.1   4
dp20         2G:64k 23790  88 131784  32 59929  23 27961  99 349615  72 305.2   4
dp20         2G:64k 23797  88 131319  33 60297  24 28149  99 343438  70 302.1   3

Some testing has been done of bonnie++ using NFSv3 over gigabit ethernet. The system was tuned thusly:

On the server and client sides, this is appended to /etc/rc.local:

echo 262144 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/wmem_max

On the server side, these values are set as follows before starting nfsd (in /etc/init.d/nfs), and reset to the defaults (65535) after the server has been started:

echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/wmem_default

On the client side, these are the options passed to mount the NFS volume:

mount -o nfsvers=3,rsize=32768,wsize=32768 <exported volume> <mount point>

These tuning options resulted in these performance marks:

Version  1.01c      ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
dp9nfsdp8 2000M:64k 10022  90 27694  40 10252  16  9574  92 51843  52 133.9   7
dp9nfsdp8 2000M:64k 11078  98 29912  28  8306  14 10701  99 41515  31 103.6   6
dp9nfsdp8 2000M:64k  7936  74 22719  16  8527  15 10748  99 42917  36  98.2   5
dp9nfsdp8 2000M:64k 11015  98 27402  25  8481  14 10775  99 41699  32 101.5   5
dp9nfsdp8 2000M:64k 10918  96 25679  23  8436  14 10731  99 42655  34 100.8   5
dp9nfsdp8 2000M:64k 10992  98 25247  23  9225  15 10749  99 48310  44 121.7   6

CONCLUSION

Using COTS PC hardware, EIDE hard drives, cost-effective EIDE RAID controller cards, and the Linux operating system with XFS support, we have achieved our goal of creating a moderately powered, moderately sized, failure-tolerant storage server for just under 1 penny per megabyte.

The CDF experiment has evaluated several similar servers from several companies. Their findings can be found here.

(Please note that all prices listed herein are estimates only, and may not be available to non-government affiliated institutions or commercial corporations.)