Saturday, October 8, 2011

Hardware Growler for Mac OS X Lion

Just in case this could be of any use to someone else, I compiled Growl 1.2.2 for Lion with the fix for the HardwareGrowler crash that happens on Lion when disconnecting from a wireless network or waking up the Mac. You can download it here. The binary should work on Snow Leopard too. It's only compiled for x86_64 CPUs.

Tuesday, September 13, 2011

ext4 2x faster than XFS?

For a lot of people, the conventional wisdom is that XFS outperforms ext4. I'm not sure whether this is just because XFS used to be a lot faster than ext2 or ext3, or what. I don't have anything against XFS, and I would actually like to see it outperform ext4; unfortunately, my benchmarks show otherwise. I'm wondering whether I'm doing something wrong.

In the benchmark below, the same machine and same HDDs were tested with 2 different RAID controllers. In most tests, ext4 has better results than XFS. In some tests, the difference is as much as 2x. Here are the details of the config:

Both RAID controllers are equipped with 512MB of RAM and are in their respective default factory config, except that WriteBack mode was enabled on the LSI because it's disabled by default (!). One other notable difference between the default configurations is that the Adaptec uses a strip size of 256k whereas the LSI uses 64k – this was left unchanged. Both arrays were created as RAID10 (6 pairs of 2 disks, so no spares). One controller was tested at a time, in the same machine and with the same disks. The OS (Linux 2.6.32) was on a separate RAID1 of 2 drives. The IO scheduler in use was "deadline". SysBench was using O_DIRECT on 64 files, for a total of 100GB of data.
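For reference, here's roughly what the SysBench invocation looked like (a sketch for SysBench 0.4; the exact flags may differ slightly between versions, and the thread count shown is just one of the concurrency levels tested):
$ sysbench --test=fileio --file-num=64 --file-total-size=100G prepare
$ sysbench --test=fileio --file-num=64 --file-total-size=100G \
    --file-test-mode=rndrw --file-extra-flags=direct --num-threads=16 run
The --file-extra-flags=direct option is what makes SysBench open the files with O_DIRECT, and --num-threads is what gets varied to test the different concurrency levels.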

Some observations:

  • Formatting XFS with the optimal values for sunit and swidth doesn't lead to much better performance. The gain is about 2%, except for sequential writes, where it actually makes things worse. Yes, there was no partition table; the whole array was formatted directly as one single big filesystem.
  • Creating more allocation groups in XFS than physical threads doesn't lead to better performance.
  • XFS has much better random write throughput at low concurrency levels, but quickly degrades to the same performance level as ext4 with more than 8 threads.
  • In the mixed random read/write test, ext4 has consistently better throughput and latency, even at high concurrency levels.
  • Similarly, for random reads ext4 also has much better throughput and latency.
  • By default XFS creates too few allocation groups, which artificially limits its performance at high concurrency levels. It's important to create as many AGs as there are hardware threads (you can inspect an existing filesystem's AG count as shown below). ext4, on the other hand, doesn't really need any tuning, as it performs well out of the box.
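For reference, you can inspect the AG count and stripe geometry of an existing XFS filesystem with xfs_info, which takes the mount point of the filesystem (path hypothetical):
$ xfs_info /mnt/array
It prints the same meta-data summary that mkfs.xfs does, including agcount, sunit and swidth.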

See the benchmark results in full screen or look at the raw outputs of SysBench.

Saturday, August 27, 2011

Hitachi 7K3000 vs WD RE4 vs Seagate Constellation ES

These days, the Hitachi 7K3000 seems like the best bang for your buck. You can get 2TB disks for around US$100. The 7K3000 isn't an "enterprise disk", so many people wouldn't buy it for their servers.
It's not clear what disks sold with the Enterprise™©® label really do to justify the big price difference. Often it seems like the hardware is exactly the same but the firmware behaves differently, notably to report errors faster. In desktop environments, you want the disk to try hard to read bad sectors, but in RAID arrays it's better to give up quickly and let the RAID controller know. Otherwise the disks might time out from the controller's point of view, and the whole disk might be incorrectly considered dead and trigger a spurious rebuild.
So I recently benchmarked the Hitachi 7K3000 against two other "enterprise" disks, the Western Digital RE4 and the Seagate Constellation ES.

The line-up

All disks are 3.5" 2TB SATA 7200rpm with 64MB of cache, all but the WD are 6Gb/s SATA. The WD is 3Gb/s – not that this really matters, as I have yet to see a spinning disk of this grade exceed 2Gb/s.
Both enterprise disks cost about $190, so about 90% more (almost double the price) than the Hitachi. Are they worth the extra money?

The test

I ended up using SysBench to compare the drives. I had all 3 drives connected to the motherboard of the same machine, a dual L5630 with 96GB of RAM, running Linux 2.6.32. Drives and OS were using their default config, except the "deadline" IO scheduler was in effect (whereas vanilla Linux uses CFQ by default since 2.6.18). SysBench used O_DIRECT for all its accesses. Each disk was formatted with ext4 – no partition table, the whole disk was used directly. Default formatting and mount options were used. SysBench was told to use 64 files, for a total of 100GB of data. Every single test was repeated 4 times and then averages were plotted. Running all the tests takes over 20h.
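In case you want to reproduce this: the IO scheduler can be switched at runtime through sysfs, e.g. for a (hypothetical) drive sdb:
$ cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]
$ echo deadline | sudo tee /sys/block/sdb/queue/scheduler
The first command lists the available schedulers with the active one in brackets (the exact list depends on your kernel version).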
SysBench produces free-form output that isn't very easy to use, so I wrote a Python script to parse the results and a bit of JavaScript to visualize them. The code is available on GitHub: tsuna/sysbench-tools.

Results

A picture is worth a thousand words, so take a look at the graphs. Overall the WD RE4 is a clear winner for me, as it outperforms its 2 buddies on all tests involving random accesses. The Seagate doesn't seem worth the money. Although it's the best at sequential reads, the Hitachi is pretty much on par with it while costing almost half as much.
So I'll buy the Hitachi 7K3000 for everything, and pay the extra premium for the WD RE4 for MySQL servers, because MySQL isn't a cheap bastard and needs every drop of performance it can get out of the IO subsystem. No, I don't want to buy ridiculously expensive and power-hungry 15k RPM SAS drives, thank you.
The raw outputs of SysBench are available here: http://tsunanet.net/~tsuna/benchmarks/7K3000-RE4-ConstellationES

Friday, August 19, 2011

Formatting XFS for optimal performance on RAID10

XFS has terribly bad performance out of the box, especially on large RAID arrays. Unlike ext4, the filesystem needs to be formatted with the right parameters to perform well. If you don't get the parameters right, you'll need to reformat the filesystem, as they can't be changed later.

The 3 main parameters are:
  • agcount: Number of allocation groups
  • sunit: Stripe unit, i.e. the per-disk chunk size (the "stripe size" or "strip size" configured on your RAID controller)
  • swidth: Stripe width (sunit times the number of data disks, excluding parity / spare disks)
Let's take an example: you have 12 disks configured in a RAID 10 (so 6 pairs of disks in RAID 1, and RAID 0 across the 6 pairs). Let's assume the RAID controller was instructed to use a stripe size of 256k. Then we have:
  • sunit = 256k / 512 = 512, because sunit is in multiples of 512-byte sectors
  • swidth = 6 * 512 = 3072, because in a RAID10 with 12 disks we have 6 data disks, each one mirrored (and no hot spares in this case)
Now XFS internally splits the filesystem into "allocation groups" (AGs). Essentially, an AG is like a filesystem on its own. XFS splits the filesystem into multiple AGs in order to increase parallelism, because each AG has its own set of locks. My rule of thumb is to create as many AGs as you have hardware threads. So if you have a dual-CPU configuration, with 4 cores per CPU and HyperThreading, then you have 2 x 4 x 2 = 16 hardware threads, and you should create 16 AGs.
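A quick way to count your hardware threads is to count the processor entries in /proc/cpuinfo; on the dual-CPU box from the example you'd get:
$ grep -c ^processor /proc/cpuinfo
16
So, putting it all together for our 12-disk RAID10 with a 256k stripe: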
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agcount=16 /dev/sdb
Warning: AG size is a multiple of stripe width. This can cause performance
problems by aligning all AGs on the same disk. To avoid this, run mkfs with
an AG size that is one stripe unit smaller, for example 182845376.
meta-data=/dev/sdb               isize=256    agcount=16, agsize=182845440 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2925527040, imaxpct=5
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Now from the output above, we can see 2 problems:
  1. There's this warning message we better pay attention to.
  2. The values of sunit and swidth printed don't correspond to what we asked for.
The reason the values printed don't match what we asked for is that they're in multiples of the block size. We can see that bsize=4096, so sure enough the numbers match up: 4096 x 64 = 512 x 512 = our stripe size of 256k.
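You can double-check the arithmetic from the shell:
$ echo $((4096 * 64)) $((512 * 512))
262144 262144
Both expressions come out to 262144 bytes, i.e. 256k.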

Now let's look at this warning message. It suggests using agsize=182845376 instead of agsize=182845440. When we specified the number of AGs we wanted, XFS automatically figured out the size of each AG, but now it's complaining that this size is suboptimal. Yay. The agsize reported in the warning is in blocks (so multiples of 4096 here), but the command-line tool expects the value in bytes, hence the multiplication below. At this point you're probably thinking like me: "you must be kidding me, right? Some options are in bytes, some in sectors, some in blocks?!" Yes.

So to make it all work:
$ sudo mkfs.xfs -f -d sunit=512,swidth=$((512*6)),agsize=$((182845376*4096)) /dev/sdb
meta-data=/dev/sdb               isize=256    agcount=16, agsize=182845376 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=2925526016, imaxpct=5
         =                       sunit=64     swidth=384 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
It's critical that you get these parameters right before you start using the filesystem, as there's no way to change them later. You might be tempted to try mount -o remount,sunit=X,swidth=Y; the command will succeed but do nothing. The only XFS parameter you can change at runtime is nobarrier (see the source code of XFS's remount support in the Linux kernel), which you should use if you have a battery-backup unit (BBU) on your RAID card, although the performance boost seems pretty small on DB-type workloads, even with 512MB of RAM on the controller.
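For example, here's a sketch of how you'd disable barriers at runtime on a BBU-equipped controller (mount point hypothetical):
$ sudo mount -o remount,nobarrier /data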

Next post: how much of a performance difference there is when you give XFS the right sunit/swidth parameters, and whether this allows XFS to beat ext4's performance.

Monday, August 15, 2011

e1000e scales a lot better than bnx2

At StumbleUpon we've had a never-ending string of problems with Broadcom's cards that use the bnx2 driver. The machines cannot handle more than 100kpps (packets/s), and when you use jumbo frames and/or TSO (TCP Segmentation Offloading), the driver has bugs that will lock up the NIC until it gets reset manually.

So we switched everything to Intel NICs. Not only do they not have these nasty bugs, they also scale better: they can do up to 170kpps each way before they start discarding packets. Graphs courtesy of OpenTSDB:
[Graph: Packets/s vs. packets dropped/s]
[Graph: Packets/s vs. interrupts/s]


We can also see how the NIC is doing interrupt coalescing at high packet rates. Yay.
Kernel tested: 2.6.32-31-server x86_64 from Lucid, running on two L5630s with 48GB of RAM.
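If you're curious about your own NIC's interrupt coalescing behavior, ethtool exposes the knobs (interface name hypothetical, and not all drivers support all fields):
$ ethtool -c eth0
$ sudo ethtool -C eth0 rx-usecs 100
The first command shows the current coalescing parameters; the second is an example of asking the driver to wait up to 100µs before raising an RX interrupt.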

Thursday, July 28, 2011

VM warning: GC locker is held; pre-dump GC was skipped

If you ever run into this message while using the Sun JVM / OpenJDK:
Java HotSpot(TM) 64-Bit Server VM warning: GC locker is held; pre-dump GC was skipped
then I wouldn't worry too much about it, as it seems to be printed when you run jmap -histo:live while the GC is already running or a certain lock is held in the JVM.

Friday, June 3, 2011

Clarifications on Linux's NUMA stats

After reading the excellent post on The MySQL "swap insanity" problem and the effects of the NUMA architecture, I remembered the existence of /sys/devices/system/node/node*/numastat and decided to add these numbers to a collector for OpenTSDB. But whenever I add a collector that reads metrics from /proc or /sys, I always need to go read the Linux kernel's source code, because most metrics tend to be misleading and under-documented (when they're documented at all).

In this case, if you RTFM, you'll get this:
Numa policy hit/miss statistics

/sys/devices/system/node/node*/numastat

All units are pages. Hugepages have separate counters.

numa_hit      A process wanted to allocate memory from this node, and succeeded.
numa_miss     A process wanted to allocate memory from another node, but ended up with memory from this node.
numa_foreign  A process wanted to allocate on this node, but ended up with memory from another one.
local_node    A process ran on this node and got memory from it.
other_node    A process ran on this node and got memory from another node.
interleave_hit  Interleaving wanted to allocate from this node and succeeded.
I was very confused about the last one, about the exact difference between the second and the third one, and about the difference between the first 3 metrics and the next 2.

After RTFSC, the relevant part of the code appeared to be in mm/vmstat.c:
void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{       
        if (z->zone_pgdat == preferred_zone->zone_pgdat) {
                __inc_zone_state(z, NUMA_HIT);
        } else {
                __inc_zone_state(z, NUMA_MISS);
                __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
        if (z->node == ((flags & __GFP_OTHER_NODE) ?
                        preferred_zone->node : numa_node_id()))
                __inc_zone_state(z, NUMA_LOCAL);
        else
                __inc_zone_state(z, NUMA_OTHER);
}

So here's what it all really means:
  • numa_hit: Number of pages allocated from the node the process wanted.
  • numa_miss: Number of pages allocated from this node, but the process preferred another node.
  • numa_foreign: Number of pages allocated from another node, but the process preferred this node.
  • local_node: Number of pages allocated from this node while the process was running locally.
  • other_node: Number of pages allocated from this node while the process was running remotely (on another node).
  • interleave_hit: Number of pages allocated successfully with the interleave strategy.

I was originally confused about numa_foreign, but this metric can actually be useful to see what happens when a node runs out of free pages. If a process attempts to get a page from its local node, but this node is out of free pages, then the numa_foreign of that node will be incremented (indicating that the node couldn't satisfy an allocation that preferred it), and another node will accommodate the process's request, which increments that other node's numa_miss. So in order to know which nodes are "lending memory" to the out-of-memory node, you need to look at numa_miss. Having a high value of numa_miss for a particular node indicates that this node's memory is under-utilized, so the node is frequently accommodating memory allocation requests that failed on other nodes; conversely, a high numa_foreign indicates that the node is frequently running out of free pages.
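To eyeball these counters on a live system, you can just read the files directly (the number of nodes depends on your hardware):
$ for f in /sys/devices/system/node/node*/numastat; do echo "== $f"; cat "$f"; done
The numastat tool from the numactl package prints the same counters, one column per node.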

Saturday, May 7, 2011

JVM u24 segfault in clearerr on Jaunty

At StumbleUpon we've been tracking down a weird problem with one of our application servers written in Java. We run Sun's jdk1.6.0_24 on Ubuntu Jaunty (9.04 – yes, these servers are old and due for an upgrade) and this application seems to do something that causes the JVM to segfault:
[6972247.491417] hbase_regionser[32760]: segfault at 8 ip 00007f26cabd608b sp 00007fffb0798270 error 4 in libc-2.9.so[7f26cab66000+168000]
[6972799.682147] hbase_regionser[30904]: segfault at 8 ip 00007f8878fb608b sp 00007fff09b69900 error 4 in libc-2.9.so[7f8878f46000+168000]
What's odd is that the problem always happens on different hosts, and almost always around 6:30 - 6:40 am. Go figure.

Understanding segfault messages from the Linux kernel

Let's try to make sense of the messages shown above, logged by the Linux kernel. Back in Linux v2.6.28, it was logged by do_page_fault, but since then this big function has been refactored into multiple smaller functions, so look for show_signal_msg now.
printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
       task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
       tsk->comm, task_pid_nr(tsk), address,
       (void *) regs->ip, (void *) regs->sp, error_code);
print_vma_addr(" in ", regs->ip);
From the above, we see that segfault at 8 means that the code attempted to access the address 8, which is what caused the segfault (because normally no page is ever mapped near address 0). ip stands for instruction pointer, so the code that triggered the segfault was mapped at the address 0x00007f8878fb608b. sp is the stack pointer and isn't very relevant here. error 4 means that this was a read access from user mode (4 = PF_USER, which used to be a #define but is now part of enum x86_pf_error_code). The rest of the message tells us that the address of the instruction pointer falls inside the memory region mapped for the code of the libc, and it tells us in square brackets that the libc is mapped at the base address 0x7f8878f46000 and that there are 0x168000 bytes of code mapped (the kernel prints both values in hex). So that means we were at 0x00007f8878fb608b - 0x7f8878f46000 = 0x7008b into the libc when the segfault occurred.
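The shell can do the hex arithmetic for us:
$ printf '%#x\n' $((0x00007f8878fb608b - 0x7f8878f46000))
0x7008b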

So where did the segfault occur exactly?

Now that we know at what offset into the libc we were when the segfault happened, we can fire up gdb and see what's up with that code:
$ gdb -q /lib/libc.so.6
(no debugging symbols found)
(gdb) x/i 0x7008b
0x7008b <clearerr+27>: cmp %r8,0x8(%r10)
Interesting... So the JVM is segfaulting in clearerr. We're 27 bytes into this function when the segfault happens. Let's see what the function does up to here:
(gdb) disas clearerr
Dump of assembler code for function clearerr:
0x0000000000070070 <clearerr+0>: push %rbx
0x0000000000070071 <clearerr+1>: mov (%rdi),%eax
0x0000000000070073 <clearerr+3>: mov %rdi,%rbx
0x0000000000070076 <clearerr+6>: test %ax,%ax
0x0000000000070079 <clearerr+9>: js 0x700c7 <clearerr+87>
0x000000000007007b <clearerr+11>: mov 0x88(%rdi),%r10
0x0000000000070082 <clearerr+18>: mov %fs:0x10,%r8
0x000000000007008b <clearerr+27>: cmp %r8,0x8(%r10)
0x000000000007008f <clearerr+31>: je 0x700c0 <clearerr+80>
0x0000000000070091 <clearerr+33>: xor %edx,%edx
0x0000000000070093 <clearerr+35>: mov $0x1,%esi
0x0000000000070098 <clearerr+40>: mov %edx,%eax
0x000000000007009a <clearerr+42>: cmpl $0x0,0x300fa7(%rip) # 0x371048
0x00000000000700a1 <clearerr+49>: je 0x700ac <clearerr+60>
0x00000000000700a3 <clearerr+51>: lock cmpxchg %esi,(%r10)
[...]
Reminder: the prototype of the function is void clearerr(FILE *stream); so there's one pointer argument and no return value. The code above starts by saving rbx (because it's the callee's responsibility to preserve this register), then dereferences the first (and only) argument (passed in rdi) and loads the dereferenced value into eax. It also copies the pointer passed in argument into rbx. It then tests whether the low 16 bits of eax are negative, and jumps over some code if they are, because they contain the _flags field of the FILE* passed in argument. At this point it helps to know what a FILE looks like. This structure is opaque, so it depends on the libc implementation. In this case, it's the glibc's:
struct _IO_FILE {
  int _flags;           /* High-order word is _IO_MAGIC; rest is flags. */
[...]
  _IO_lock_t *_lock;
#ifdef _IO_USE_OLD_IO_FILE
};
Then it looks 0x88 = 136 bytes into the FILE* passed in argument and stores this in r10. If you look at the definition of FILE* and add up the offsets, 136 bytes into the FILE* you'll find the _IO_lock_t *_lock; member of the struct, the mutex that protects this FILE*. Then we're loading the value at offset 0x10 from the FS segment into r8. On Linux x86_64, the FS segment is used for thread-local data; in this case it's loading a pointer to a structure that corresponds to the current thread. Finally, we're comparing r8 to the value 8 bytes into what r10 points to, and kaboom, we get a segfault. This suggests that r10 is a NULL pointer, meaning that the _lock of the FILE* given in argument is NULL. Now that's weird. I'm not sure how this happened. So the assembly code above is essentially doing:
void clearerr(FILE *stream) {
  if ((short) stream->_flags >= 0) {  // test %ax,%ax / js: skip all this if the low word is negative
    struct pthread *self = /* mov %fs:0x10,%r8 -- can't express this in C, but you can use arch_prctl */;
    struct lock_t *lock = stream->_lock;
    if (lock->owner != self) {  // We segfault here: lock is NULL, so lock->owner reads address 8
      mutex_lock(&lock->lock);
      lock->owner = self;
    }
    // ...
  }
  // ...
}
What's odd is that the exit status of the JVM is 143 (= 128+SIGTERM) and not 139 (= 128+SIGSEGV). Maybe it's because the JVM always catches and handles SIGSEGV (it does this to allow the JIT to optimize away some NULL-pointer checks and turn them into NullPointerExceptions, among other things). But even then, normally the JVM writes a file where it complains about the segfault, asks you to file a bug, and dumps all the registers and whatnot... We should see that file somewhere, yet it's nowhere to be found in the JVM's current working directory or anywhere else I looked.

So this segfault remains a mystery so far. Next step: run the application server with ulimit -c unlimited and analyze a core dump.

Monday, March 14, 2011

The "Out of socket memory" error

I recently did some work on some of our frontend machines (on which we run Varnish) at StumbleUpon and decided to track down some of the errors the Linux kernel was regularly throwing in kern.log such as:
Feb 25 08:23:42 foo kernel: [3077014.450011] Out of socket memory
Before we get started, let me tell you that you should NOT listen to any blog or forum post without doing your homework, especially when the post recommends that you tune up virtually every TCP-related knob in the kernel. These people don't know what they're doing and most probably don't understand much about TCP/IP. Most importantly, their voodoo won't help you fix your problem and might actually make it worse.

Dive into the Linux kernel


In order to understand what's going on, the best thing is to go read the kernel's code. Unfortunately, the kernel's error messages and counters are often imprecise, confusing, or even misleading. But they're important. And reading the kernel's code isn't nearly as hard as people say.

The "Out of socket memory" error


The only match for "Out of socket memory" in the kernel's code (as of v2.6.38) is in net/ipv4/tcp_timer.c:
static int tcp_out_of_resources(struct sock *sk, int do_reset)
{
    struct tcp_sock *tp = tcp_sk(sk);
    int shift = 0;

    /* If peer does not open window for long time, or did not transmit
     * anything for long time, penalize it. */
    if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
        shift++;

    /* If some dubious ICMP arrived, penalize even more. */
    if (sk->sk_err_soft)
        shift++;

    if (tcp_too_many_orphans(sk, shift)) {
        if (net_ratelimit())
            printk(KERN_INFO "Out of socket memory\n");
So the question is: when does tcp_too_many_orphans return true? Let's take a look in include/net/tcp.h:
static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
{
    struct percpu_counter *ocp = sk->sk_prot->orphan_count;
    int orphans = percpu_counter_read_positive(ocp);

    if (orphans << shift > sysctl_tcp_max_orphans) {
        orphans = percpu_counter_sum_positive(ocp);
        if (orphans << shift > sysctl_tcp_max_orphans)
            return true;
    }

    if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
        atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
        return true;
    return false;
}
So there are two conditions that can trigger this "Out of socket memory" error:
  1. There are "too many" orphan sockets (most common).
  2. The socket already has more than the minimum amount of send-buffer memory queued, and TCP as a whole is using more memory than its hard limit allows.
In order to remedy your problem, you need to figure out which case you fall into. The vast majority of people (especially those dealing with frontend servers like Varnish) fall into case 1.

Are you running out of TCP memory?


Ruling out case 2 is easy. All you need is to see how much memory your kernel is configured to give to TCP versus how much is actually being used. If you're close to the limit (uncommon), then you're in case 2. Otherwise (most common) you're in case 1. The kernel keeps track of the memory allocated to TCP in multiples of pages, not in bytes. This is a first bit of confusion that a lot of people run into, because some settings are in bytes and others are in pages (and most of the time 1 page = 4096 bytes).

Rule out case 2: find how much memory the kernel is willing to give to TCP:
$ cat /proc/sys/net/ipv4/tcp_mem
3093984 4125312 6187968
The values are in number of pages. They get automatically sized at boot time (values above are for a machine with 32GB of RAM). They mean:
  1. When TCP uses less than 3093984 pages (11.8GB), the kernel will consider it below the "low threshold" and won't bother TCP about its memory consumption.
  2. When TCP uses more than 4125312 pages (15.7GB), enter the "memory pressure" mode.
  3. The maximum number of pages the kernel is willing to give to TCP is 6187968 (23.6GB). When we go above this, we'll start seeing the "Out of socket memory" error and Bad Things will happen.
Now let's find how much of that memory TCP actually uses.
$ cat /proc/net/sockstat
sockets: used 14565
TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894
UDP: inuse 11 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
The last value on the second line (mem 1894) is the number of pages allocated to TCP. In this case we can see that 1894 is way below 6187968, so there's no way we can possibly be running out of TCP memory. So in this case, the "Out of socket memory" error was caused by the number of orphan sockets.
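Converting those pages to bytes confirms how far below the limit we are:
$ echo $((1894 * 4096))
7757824
That's less than 8MB of TCP memory, nowhere near the 23.6GB limit.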

Do you have "too many" orphan sockets?


First of all: what's an orphan socket? It's simply a socket that isn't associated with a file descriptor. For instance, after you close() a socket, you no longer hold a file descriptor to reference it, but it still exists because the kernel has to keep it around for a bit longer until TCP is done with it. Because orphan sockets aren't very useful to applications (since applications can't interact with them), the kernel tries to limit the amount of memory consumed by orphans, and it does so by limiting the number of orphans that stick around. If you're running a frontend web server (or an HTTP load balancer), then you'll most likely have a sizeable number of orphans, and that's perfectly normal.

In order to find the limit on the number of orphan sockets, simply do:
$ cat /proc/sys/net/ipv4/tcp_max_orphans
65536
Here we see the default value, which is 64k. In order to find the number of orphan sockets in the system, look again in sockstat:
$ cat /proc/net/sockstat
sockets: used 14565
TCP: inuse 35938 orphan 21564 tw 70529 alloc 35942 mem 1894
[...]
So in this case we have 21564 orphans. That doesn't seem very close to 65536... Yet, if you look once more at the code above that prints the warning, you'll see that there is this shift variable that has a value between 0 and 2, and that the check is testing if (orphans << shift > sysctl_tcp_max_orphans). What this means is that, in certain cases, the kernel decides to penalize some sockets more, and it does so by multiplying the number of orphans by 2x or 4x to artificially increase the "score" of the "bad socket" to penalize. The problem is that, due to the way this is implemented, you can see a worrisome "Out of socket memory" error when in fact you're still 4x below the limit and you just had a couple of "bad sockets" (which happens frequently when you have an Internet-facing service). So unfortunately that means you need to tune up the maximum number of orphan sockets even if you're 2x or 4x away from the threshold. What value is reasonable for you depends on your situation. Observe how the count of orphans in /proc/net/sockstat changes when your server is at peak traffic, multiply that value by 4, round it up a bit to have a nice value, and set it. You can set it by echoing the new value into /proc/sys/net/ipv4/tcp_max_orphans, and don't forget to update the value of net.ipv4.tcp_max_orphans in /etc/sysctl.conf so that your change persists across reboots, as shown below.
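For example, if your orphan count peaks around 40k, something like this would do (the value below is purely illustrative; pick yours based on your own peak):
$ echo 262144 | sudo tee /proc/sys/net/ipv4/tcp_max_orphans
$ echo 'net.ipv4.tcp_max_orphans = 262144' | sudo tee -a /etc/sysctl.conf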

That's all you need to get rid of these "Out of socket memory" errors, most of which are "false alarms" due to the shift variable in the implementation.