Saturday, August 27, 2011

Hitachi 7K3000 vs WD RE4 vs Seagate Constellation ES

These days, the Hitachi 7K3000 seems like the best bang for your bucks. You can get 2TB disks for around US$100. The 7K3000 isn't an "enterprise disk", so many people wouldn't buy it for their servers.
It's not clear what disks sold with the Enterprise™©® label really do to justify the big price difference. Often it seems like the hardware is exactly the same, but the firmware behaves differently, notably to report errors faster. In desktop environments, you want the disk to try hard to read bad sectors, but in RAID arrays it's better to give up quickly and let the RAID controller know, otherwise the disks might timeout from the controller's point of view, and the whole disk might be incorrectly considered dead and trigger a spurious rebuild.
So I recently benchmarked the Hitachi 7K3000 against two other "enterprise" disks, the Western Digital RE4 and the Seagate Constellation ES.

The line up

All disks are 3.5" 2TB SATA 7200rpm with 64MB of cache, all but the WD are 6Gb/s SATA. The WD is 3Gb/s – not that this really matters, as I have yet to see a spinning disk of this grade exceed 2Gb/s.
Both enterprise disks cost about $190, so about 90% more (almost double the price) than the Hitachi. Are they worth the extra money?

The test

I ended up using SysBench to compare the drives. I had all 3 drives connected to the motherboard of the same machine, a dual L5630 with 96GB of RAM, running Linux 2.6.32. Drives and OS were using their default config, except the "deadline" IO scheduler was in effect (whereas vanilla Linux uses CFQ by default since 2.6.18). SysBench used O_DIRECT for all its accesses. Each disk was formatted with ext4 – no partition table, the whole disk was used directly. Default formatting and mount options were used. SysBench was told to use 64 files, for a total of 100GB of data. Every single test was repeated 4 times and then averages were plotted. Running all the tests takes over 20h.
SysBench produces some kind of a free-form output which isn't very easy to use. So I wrote a Python script to parse the results and a bit of JavaScript to visualize them. The code is available on GitHub: tsuna/sysbench-tools.


A picture is worth a thousand words, so take a look at the graphs. Overall the WD RE4 is a clear winner for me, as it outperforms its 2 buddies on all tests involving random accesses. The Seagate doesn't seem worth the money. Although it's the best at sequential reads, the Hitachi is pretty much on par with it while being almost twice cheaper.
So I'll buy the Hitachi 7K3000 for everything, and pay the extra premium for the WD RE4 for MySQL servers, because MySQL isn't a cheap bastard and needs every drop of performance it can get out of the IO subsystem. No, I don't want to buy ridiculously expensive and power-hungry 15k RPM SAS drives, thank you.
The raw outputs of SysBench are available here:

Friday, August 19, 2011

Formatting XFS for optimal performance on RAID10

XFS has terribly bad performance out of the box, especially on large RAID arrays. Unlike ext4, the filesystem needs to be formatted with the right parameters to perform well. If you don't get the parameters right, you need to reformat the filesystem as they can't be changed later.

The 3 main parameters are:
  • agcount: Number of allocation groups
  • sunit: Stripe size (as configured on your RAID controller)
  • swidth: Stripe width (number of data disks, excluding parity / spare disks)
Let's take an example: you have 12 disks configured in a RAID 10 (so 6 pairs of disks in RAID 1, and RAID 0 across the 6 pairs). Let's assume the RAID controller was instructed to use a stripe size of 256k. Then we have:
  • sunit = 256k / 512 = 512, because sunit is in multiple of 512 byte sectors
  • swidth = 6 * 512 = 3072, because in a RAID10 with 12 disks we have 6 data disks excluding parity disks (and no hot spares in this case)
Now XFS internally split the filesystem into "allocation groups" (AG). Essentially an AG is like a filesystem on its own. XFS splits the filesystem into multiple AGs in order to help increase parallelism, because each AG has its own set of locks. My rule of thumb is to create as many AGs as you have hardware threads. So if you have a dual-CPU configuration, with 4 cores with HyperThreading, then you have 2 x 4 x 2 = 16 hardware threads, so you should create 16 AGs.
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agcount=16 /dev/sdb
Warning: AG size is a multiple of stripe width. This can cause performance
problems by aligning all AGs on the same disk. To avoid this, run mkfs with
an AG size that is one stripe unit smaller, for example 182845376.
meta-data=/dev/sdb isize=256 agcount=16, agsize=182845440 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=2925527040, imaxpct=5
= sunit=64 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Now from the output above, we can see 2 problems:
  1. There's this warning message we better pay attention to.
  2. The values of sunit and swidth printed don't correspond to what we asked for.
The reason the values printed don't match what we wanted is because they're in multiples of "block size". We can see that bsize=4096, so sure enough the numbers match up: 4096 x 64 = 512 x 512 = our stripe size of 256k.

Now let's look at this warning message. It suggests us to use agsize=182845376 instead of agsize=182845440. When we specified the number of AGs we wanted, XFS automatically figured the size of each AG, but then it's complaining that this size is suboptimal. Yay. Now agsize is specified in blocks (so multiples of 4096), but the command line tool expects the value in bytes. At this point you're probably thinking like me: "you must be kidding me, right? Some options are in bytes, some in sectors, some in blocks?!" Yes.

So to make it all work:
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agsize=$((182845376*4096)) /dev/sdb
meta-data=/dev/sdb isize=256 agcount=16, agsize=182845376 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=2925526016, imaxpct=5
= sunit=64 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
It's critical that you get this right before you start using the filesystem. There's no way to change them later. You might be tempted to try using mount -o remount,sunit=X,swidth=Y, and the command will succeed but do nothing. The only XFS parameter you can change at runtime is nobarrier (see the source code of XFS's remount support in the Linux kernel), which you should use if you have a battery-backup unit (BBU) on your RAID card, although the performance boost seems pretty small on DB-type workloads, even with 512MB of RAM on the controller.

Next post: how much of a performance difference is there when you give XFS the right sunit/swidth parameters, and does this allow XFS to beat ext4's performance.

Monday, August 15, 2011

e1000e scales a lot better than bnx2

At StumbleUpon we've had a never ending string of problems with Broadcom's cards that use the bnx2 driver. The machine cannot handle more than 100kpps (packets/s), the driver has bugs that will lock up the NIC until it gets reset manually when you use jumbo frames and/or TSO (TCP Segmentation Offloading).

So we switched everything to Intel NICs. Not only they don't have these nasty bugs, but also they scale better. They can do up to 170kpps each way before they start discarding packets. Graphs courtesy of OpenTSDB:
Packets/s vs. packets dropped/s
Packets/s vs. interrupts/s

We can also see how the NIC is doing interrupt coalescing at high packet rates. Yay.
Kernel tested: 2.6.32-31-server x86_64 from Lucid, running on 2 L5630 with 48GB of RAM.