Friday, August 19, 2011

Formatting XFS for optimal performance on RAID10

XFS has terribly bad performance out of the box, especially on large RAID arrays. Unlike ext4, the filesystem needs to be formatted with the right parameters to perform well. If you don't get the parameters right, you need to reformat the filesystem as they can't be changed later.

The 3 main parameters are:
  • agcount: Number of allocation groups
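  • sunit: Stripe unit, i.e. the stripe size configured on your RAID controller, expressed in 512-byte sectors
  • swidth: Stripe width, i.e. sunit multiplied by the number of data disks (excluding parity / spare disks), also in 512-byte sectors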
Let's take an example: you have 12 disks configured in a RAID 10 (so 6 pairs of disks in RAID 1, and RAID 0 across the 6 pairs). Let's assume the RAID controller was instructed to use a stripe size of 256k. Then we have:
  • sunit = 256k / 512 = 512, because sunit is expressed in multiples of 512-byte sectors
  • swidth = 6 * 512 = 3072, because in a RAID10 with 12 disks we have 6 data disks; the other 6 are mirrors (RAID 10 has no parity disks, and there are no hot spares in this case)
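The arithmetic above can be sketched in shell. The helper names sunit_sectors and swidth_sectors are made up for this illustration (they are not part of xfsprogs), and the script assumes the stripe size is given in KiB:

```shell
#!/bin/sh
# Hypothetical helpers, not part of xfsprogs: derive mkfs.xfs geometry values.

# Stripe unit in 512-byte sectors, given the RAID stripe size in KiB.
sunit_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Stripe width in 512-byte sectors: stripe unit times the number of data disks.
swidth_sectors() {
    echo $(( $1 * $2 ))
}

stripe_kib=256   # stripe size configured on the RAID controller
data_disks=6     # 12 disks in RAID 10 = 6 mirrored pairs

sunit=$(sunit_sectors "$stripe_kib")
swidth=$(swidth_sectors "$sunit" "$data_disks")
echo "sunit=$sunit swidth=$swidth"   # prints: sunit=512 swidth=3072
```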
Now XFS internally splits the filesystem into "allocation groups" (AGs). Essentially an AG is like a filesystem on its own. XFS splits the filesystem into multiple AGs to increase parallelism, because each AG has its own set of locks. My rule of thumb is to create as many AGs as you have hardware threads. So if you have two CPUs, each with 4 cores and HyperThreading, then you have 2 x 4 x 2 = 16 hardware threads, so you should create 16 AGs.
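As a rough sketch of that rule of thumb, the thread count is just a product. The helper name hw_threads is hypothetical, and the numbers are the ones from the example machine:

```shell
#!/bin/sh
# Hypothetical helper: hardware threads = sockets x cores per socket x threads per core.
hw_threads() {
    echo $(( $1 * $2 * $3 ))
}

# 2 CPUs, 4 cores each, 2 threads per core (HyperThreading).
agcount=$(hw_threads 2 4 2)
echo "agcount=$agcount"   # prints: agcount=16

# On a running Linux system, `nproc` (GNU coreutils) reports the number of
# processing units available to the current process, which is usually the
# same figure if you format on the machine that will use the filesystem.
```

With 16 hardware threads, this gives the agcount=16 used in the command that follows.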
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agcount=16 /dev/sdb
Warning: AG size is a multiple of stripe width. This can cause performance
problems by aligning all AGs on the same disk. To avoid this, run mkfs with
an AG size that is one stripe unit smaller, for example 182845376.
meta-data=/dev/sdb               isize=256    agcount=16, agsize=182845440 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2925527040, imaxpct=5
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Now from the output above, we can see 2 problems:
  1. There's a warning message we'd better pay attention to.
  2. The values of sunit and swidth printed don't correspond to what we asked for.
The values printed don't match what we asked for because they're expressed in multiples of the block size. We can see that bsize=4096, so sure enough the numbers match up: 4096 x 64 = 512 x 512 = our stripe size of 256k. Likewise for the width: 4096 x 384 = 512 x 3072.
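A quick way to double-check this is to convert the block-based values in the mkfs.xfs output back to 512-byte sectors. The helper name blocks_to_sectors is made up for this sketch:

```shell
#!/bin/sh
# Hypothetical helper: convert a count of filesystem blocks to 512-byte sectors.
# $1 = value in blocks, $2 = block size in bytes (bsize in the mkfs.xfs output)
blocks_to_sectors() {
    echo $(( $1 * $2 / 512 ))
}

blocks_to_sectors 64 4096    # sunit reported by mkfs.xfs  -> prints 512
blocks_to_sectors 384 4096   # swidth reported by mkfs.xfs -> prints 3072
```

Both results match the sunit=512 and swidth=3072 passed on the command line.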

Now let's look at the warning message. It suggests using agsize=182845376 instead of agsize=182845440, i.e. one stripe unit (64 blocks) less. When we specified the number of AGs we wanted, XFS automatically figured out the size of each AG, but now it's complaining that this size is suboptimal. Yay. The agsize in the output is given in blocks (so multiples of 4096), but the command line tool expects the value in bytes. At this point you're probably thinking like me: "you must be kidding me, right? Some options are in bytes, some in sectors, some in blocks?!" Yes.
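The unit juggling can be sketched as follows. Both helper names are hypothetical, and the arithmetic assumes a shell with 64-bit integers, since the byte count does not fit in 32 bits:

```shell
#!/bin/sh
# Hypothetical helpers for the AG size arithmetic.

# AG size shrunk by one stripe unit; both arguments are in filesystem blocks.
shrink_ag_by_one_sunit() {
    echo $(( $1 - $2 ))
}

# Convert an AG size from filesystem blocks to bytes.
# $1 = AG size in blocks, $2 = block size in bytes
agsize_bytes() {
    echo $(( $1 * $2 ))
}

new_agsize=$(shrink_ag_by_one_sunit 182845440 64)
echo "agsize in blocks: $new_agsize"                      # prints 182845376
echo "agsize in bytes:  $(agsize_bytes "$new_agsize" 4096)"   # prints 748934660096
```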

So to make it all work:
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agsize=$((182845376*4096)) /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=16, agsize=182845376 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2925526016, imaxpct=5
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
It's critical that you get this right before you start using the filesystem, because there's no way to change these parameters later. You might be tempted to try mount -o remount,sunit=X,swidth=Y; the command will succeed but do nothing. The only XFS parameter you can change at runtime is nobarrier (see the source code of XFS's remount support in the Linux kernel), which you should use if you have a battery-backup unit (BBU) on your RAID card, although the performance boost seems pretty small on DB-type workloads, even with 512MB of RAM on the controller.
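Since the geometry can't be fixed afterwards, it may be worth checking it before putting data on the filesystem. The sketch below pulls sunit and swidth (in blocks) out of output in the format printed by mkfs.xfs above, which xfs_info also prints for a mounted filesystem. The helper name data_geometry and the mount point /mnt/data are assumptions made for this example:

```shell
#!/bin/sh
# Hypothetical helper: read mkfs.xfs / xfs_info style output on stdin and print
# the data section's "sunit swidth" pair, both in filesystem blocks.
# Only the line that carries both sunit= and swidth= matches; the log
# section's sunit line is ignored.
data_geometry() {
    sed -n 's/.*sunit=\([0-9]*\)  *swidth=\([0-9]*\) blks.*/\1 \2/p'
}

# Demonstration on a sample line copied from the output above.
printf '         =                       sunit=64     swidth=384 blks\n' | data_geometry
# prints: 64 384

# On a real system (the mount point is an assumption):
#   xfs_info /mnt/data | data_geometry
```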

Next post: how much of a performance difference is there when you give XFS the right sunit/swidth parameters, and does this allow XFS to beat ext4's performance?

8 comments:

Tim said...

I think you've got your stripe width wrong - it shouldn't be multiplied by 512. This could be impacting the performance tests you ran subsequently.

Benoit Sigoure said...

"swidth" is specified in number of 512B sectors, maybe you're confusing it with "sw"?

Steve Bergman said...

"XFS has terribly bad performance out of the box"
False. If XFS can determine the underlying geometry, it tunes itself automatically. Linux MD, and most RAID controllers, provide the proper information. And the XFS devs, including Eric Sandeen, are explicit that they recommend using all the XFS defaults and not trying to second-guess mkfs.xfs. (Unless, of course, there is reason to think the RAID controller is lying). Ignore all the "Tuning XFS" blog posts out there. They are outdated and/or outright wrong according to the devs. Ext4 also requires stripewidth, etc. to be correct for best performance. I *think* mkfs.ext4 also autotunes. But I'm not certain.

Benoit Sigoure said...

Steve, this post is a bit dated now. At the time XFS wasn't able to infer the proper parameters, unfortunately. This may have changed.

Jasonmicron said...

Friday, August 19, 2011

Yea, this post is dated. But I learned something from it, so thank you blogger for the post. Looking at reformatting my home RAID-5 MD software raid from ext4 to xfs. Glad to hear that XFS supposedly does these options automatically now, though a reference via a link to that claim would be useful.

Off to the googles.

RHEL7 uses XFS out of the box, so I would expect more traffic to pages like this in the future due to increased exposure to the filesystem.

edit: corrected RHEL version, and it was a double-post.

Galaxy said...

I checked the source and found that su and sw work in mount.