Friday, December 10, 2010

Java IO: slowest readLine ever

I have a fairly simple problem: I want to count the number of lines in a file, then seek back to after the first line, and then read the file line by line. Easy heh? Not in Java. Enter the utterly retarded world of the JDK.

So if you're n00b, you'll start with a FileInputStream, but quickly you'll realize that seeking around with it isn't really possible... Indeed, the only way to go back to a previous position in the file is to call reset(), which will take you back to the previous location you marked with mark(int). The argument to mark is "the maximum limit of bytes that can be read before the mark position becomes invalid". OK WTF.

If you dig around some more, you'll see that you should really be using a RandomAccessFile – so much for good OO design. The other seemingly cool thing about RandomAccessFile is that it's got a readLine() method. Unfortunately, this method was implemented by a 1st year CS student who probably dropped out before understanding the basics of systems programming.

Believe it or not, but readLine() reads the file one byte at a time. It does one system call to read per byte. As such, it's 2 orders of magnitude slower than it could be... In fact, you can't really implement a readline function that's much slower than that. facepalm.

PS: This is with Sun's JRE version 1.6.0_22-b04. JDK7/OpenJDK has the same implementation. Apache's Harmony implementation is the same, so Android has the same retarded implementation.

Thursday, December 9, 2010

OpenTSDB at Strata'11

I will be speaking about OpenTSDB at the Strata conference, Wednesday, February 02, 2011, in Santa Clara, CA. You can sign up with this promo code and get a 25% discount: str11fsd.
Strata is a new conference about large scale systems put together by O'Reilly.
Strata 2011

Sunday, November 14, 2010

How long does it take to make a context switch?

That's a interesting question I'm willing to spend some of my time on. Someone at StumbleUpon emitted the hypothesis that with all the improvements in the Nehalem architecture (marketed as Intel i7), context switching would be much faster. How would you devise a test to empirically find an answer to this question? How expensive are context switches anyway? (tl;dr answer: very expensive)

The lineup

April 21, 2011 update: I added an "extreme" Nehalem and a low-voltage Westmere.
April 1, 2013 update: Added an Intel Sandy Bridge E5-2620.
I've put 4 different generations of CPUs to test:
  • A dual Intel 5150 (Woodcrest, based on the old "Core" architecture, 2.67GHz). The 5150 is a dual-core, and so in total the machine has 4 cores available. Kernel: 2.6.28-19-server x86_64.
  • A dual Intel E5440 (Harpertown, based on the Penrynn architecture, 2.83GHz). The E5440 is a quad-core so the machine has a total of 8 cores. Kernel: 2.6.24-26-server x86_64.
  • A dual Intel E5520 (Gainestown, based on the Nehalem architecture, aka i7, 2.27GHz). The E5520 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Kernel: 2.6.28-18-generic x86_64.
  • A dual Intel X5550 (Gainestown, based on the Nehalem architecture, aka i7, 2.67GHz). The X5550 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the X5550 is in the "server" product line. This CPU is 3x more expensive than the previous one. Kernel: 2.6.28-15-server x86_64.
  • A dual Intel L5630 (Gulftown, based on the Westmere architecture, aka i7, 2.13GHz). The L5630 is a quad-core, and has HyperThreading enabled, so the machine has a total of 8 cores or 16 "hardware threads". Note: the L5630 is a "low-voltage" CPU. At equal price, this CPU is in theory 16% less powerful than a non-low-voltage CPU. Kernel: 2.6.32-29-server x86_64.
  • A dual Intel E5-2620 (Sandy Bridge-EP, based on the Sandy Bridge architecture, aka E5, 2Ghz). The E5-2620 is a hexa-core, has HyperThreading, so the machine has a total of 12 cores, or 24 "hardware threads". Kernel: 3.4.24 x86_64.
As far as I can say, all CPUs are set to a constant clock rate (no Turbo Boost or anything fancy). All the Linux kernels are those built and distributed by Ubuntu.

First idea: with syscalls (fail)

My first idea was to make a cheap system call many times in a row, time how long it took, and compute the average time spent per syscall. The cheapest system call on Linux these days seems to be gettid. Turns out, this was a naive approach since system calls don't actually cause a full context switch anymore nowadays, the kernel can get away with a "mode switch" (go from user mode to kernel mode, then back to user mode). That's why when I ran my first test program, vmstat wouldn't show a noticeable increase in number of context switches. But this test is interesting too, although it's not what I wanted originally.

Source code: timesyscall.c Results:
  • Intel 5150: 105ns/syscall
  • Intel E5440: 87ns/syscall
  • Intel E5520: 58ns/syscall
  • Intel X5550: 52ns/syscall
  • Intel L5630: 58ns/syscall
  • Intel E5-2620: 67ns/syscall
Now that's nice, more expensive CPUs perform noticeably better (note however the slight increase in cost on Sandy Bridge). But that's not really what we wanted to know. So to test the cost of a context switch, we need to force the kernel to de-schedule the current process and schedule another one instead. And to benchmark the CPU, we need to get the kernel to do nothing but this in a tight loop. How would you do this?

Second idea: with futex

The way I did it was to abuse futex (RTFM). futex is the low level Linux-specific primitive used by most threading libraries to implement blocking operations such as waiting on a contended mutexes, semaphores that run out of permits, condition variables and friends. If you would like to know more, go read Futexes Are Tricky by Ulrich Drepper. Anyways, with a futex, it's easy to suspend and resume processes. What my test does is that it forks off a child process, and the parent and the child take turn waiting on the futex. When the parent waits, the child wakes it up and goes on to wait on the futex, until the parent wakes it and goes on to wait again. Some kind of a ping-pong "I wake you up, you wake me up...".

Source code: timectxsw.c Results:
  • Intel 5150: ~4300ns/context switch
  • Intel E5440: ~3600ns/context switch
  • Intel E5520: ~4500ns/context switch
  • Intel X5550: ~3000ns/context switch
  • Intel L5630: ~3000ns/context switch
  • Intel E5-2620: ~3000ns/context switch
Note: those results include the overhead of the futex system calls.

Now you must take those results with a grain of salt. The micro-benchmark does nothing but context switching. In practice context switching is expensive because it screws up the CPU caches (L1, L2, L3 if you have one, and the TLB – don't forget the TLB!).

CPU affinity

Things are harder to predict in an SMP environment, because the performance can vary wildly depending on whether a task is migrated from one core to another (especially if the migration is across physical CPUs). I ran the benchmarks again but this time I pinned the processes/threads on a single core (or "hardware thread"). The performance speedup is dramatic.

Source code: Results:
  • Intel 5150: ~1900ns/process context switch, ~1700ns/thread context switch
  • Intel E5440: ~1300ns/process context switch, ~1100ns/thread context switch
  • Intel E5520: ~1400ns/process context switch, ~1300ns/thread context switch
  • Intel X5550: ~1300ns/process context switch, ~1100ns/thread context switch
  • Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
  • Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context siwtch
Performance boost: 5150: 66%, E5440: 65-70%, E5520: 50-54%, X5550: 55%, L5630: 45%, E5-2620: 45%.

The performance gap between thread switches and process switches seems to increase with newer CPU generations (5150: 7-8%, E5440: 5-15%, E5520: 11-20%, X5550: 15%, L5630: 13%, E5-2620: 19%). Overall the penalty of switching from one task to another remains very high. Bear in mind that those artificial tests do absolutely zero computation, so they probably have 100% cache hit in L1d and L1i. In the real world, switching between two tasks (threads or processes) typically incurs significantly higher penalties due to cache pollution. But we'll get back to this later.

Threads vs. processes

After producing the numbers above, I quickly criticized Java applications, because it's fairly common to create shitloads of threads in Java, and the cost of context switching becomes high in such applications. Someone retorted that, yes, Java uses lots of threads but threads have become significantly faster and cheaper with the NPTL in Linux 2.6. They said that normally there's no need to do a TLB flush when switching between two threads of the same process. That's true, you can go check the source code of the Linux kernel (switch_mm in mmu_context.h):
static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
       unsigned cpu = smp_processor_id();

       if (likely(prev != next)) {
       } else {
               [don't typically reload cr3]
In this code, the kernel expects to be switching between tasks that have different memory structures, in which cases it updates CR3, the register that holds a pointer to the page table. Writing to CR3 automatically causes a TLB flush on x86.

In practice though, with the default kernel scheduler and a busy server-type workload, it's fairly infrequent to go through the code path that skips the call to load_cr3. Plus, different threads tend to have different working sets, so even if you skip this step, you still end up polluting the L1/L2/L3/TLB caches. I re-ran the benchmark above with 2 threads instead of 2 processes (source: timetctxsw.c) but the results aren't significantly different (this varies a lot depending on scheduling and luck, but on average on many runs it's typically only 100ns faster to switch between threads if you don't set a custom CPU affinity).

Indirect costs in context switches: cache pollution

The results above are in line with a paper published a bunch of guys from University of Rochester: Quantifying The Cost of Context Switch. On an unspecified Intel Xeon (the paper was written in 2007, so the CPU was probably not too old), they end up with an average time of 3800ns. They use another method I thought of, which involves writing / reading 1 byte to / from a pipe to block / unblock a couple of processes. I thought that (ab)using futex would be better since futex is essentially exposing some scheduling interface to userland.

The paper goes on to explain the indirect costs involved in context switching, which are due to cache interference. Beyond a certain working set size (about half the size of the L2 cache in their benchmarks), the cost of context switching increases dramatically (by 2 orders of magnitude).

I think this is a more realistic expectation. Not sharing data between threads leads to optimal performance, but it also means that every thread has its own working set and that when a thread is migrated from one core to another (or worse, across physical CPUs), the cache pollution is going to be costly. Unfortunately, when an application has many more active threads than hardware threads, this is happening all the time. That's why not creating more active threads than there are hardware threads available is so important, because in this case it's easier for the Linux scheduler to keep re-scheduling the same threads on the core they last used ("weak affinity").

Having said that, these days, our CPUs have much larger caches, and can even have an L3 cache.
  • 5150: L1i & L1d = 32K each, L2 = 4M
  • E5440: L1i & L1d = 32K each, L2 = 6M
  • E5520: L1i & L1d = 32K each, L2 = 256K/core, L3 = 8M (same for the X5550)
  • L5630: L1i & L1d = 32K each, L2 = 256K/core, L3 = 12M
  • E5-2620: L1i & L1d = 64K each, L2 = 256K/core, L3 = 15M
Note that in the case of the E5520/X5550/L5630 (the ones marketed as "i7") as well as the Sandy Bridge E5-2520, the L2 cache is tiny but there's one L2 cache per core (with HT enabled, this gives us 128K per hardware thread). The L3 cache is shared for all cores that are on each physical CPU.

Having more cores is great, but it also increases the chance that your task be rescheduled onto a different core. The cores have to "migrate" cache lines around, which is expensive. I recommend reading What Every Programmer Should Know About Main Memory by Ulrich Drepper (yes, him again!) to understand more about how this works and the performance penalties involved.

So how does the cost of context switching increase with the size of the working set? This time we'll use another micro-benchmark, timectxswws.c that takes in argument the number of pages to use as a working set. This benchmark is exactly the same as the one used earlier to test the cost of context switching between two processes except that now each process does a memset on the working set, which is shared across both processes. Before starting, the benchmark times how long it takes to write over all the pages in the working set size requested. This time is then discounted from the total time taken by the test. This attempts to estimate the overhead of overwriting pages across context switches.

Here are the results for the 5150:As we can see, the time needed to write a 4K page more than doubles once our working set is bigger than what we can fit in the L1d (32K). The time per context switch keeps going up and up as the working set size increases, but beyond a certain point the benchmark becomes dominated by memory accesses and is no longer actually testing the overhead of a context switch, it's simply testing the performance of the memory subsystem.

Same test, but this time with CPU affinity (both processes pinned on the same core):Oh wow, watch this! It's an order of magnitude faster when pinning both processes on the same core! Because the working set is shared, the working set fits entirely in the 4M L2 cache and cache lines simply need to be transfered from L2 to L1d, instead of being transfered from core to core (potentially across 2 physical CPUs, which is far more expensive than within the same CPU).

Now the results for the i7 processor:Note that this time I covered larger working set sizes, hence the log scale on the X axis.

So yes, context switching on i7 is faster, but only for so long. Real applications (especially Java applications) tend to have large working sets so typically pay the highest price when undergoing a context switch. Other observations about the Nehalem architecture used in the i7:
  • Going from L1 to L2 is almost unnoticeable. It takes about 130ns to write a page with a working set that fits in L1d (32K) and only 180ns when it fits in L2 (256K). In this respect, the L2 on Nehalem is more of a "L1.5", since its latency is simply not comparable to that of the L2 of previous CPU generations.
  • As soon as the working set increases beyond 1024K, the time needed to write a page jumps to 750ns. My theory here is that 1024K = 256 pages = half of the TLB of the core, which is shared by the two HyperThreads. Because now both HyperThreads are fighting for TLB entries, the CPU core is constantly doing page table lookups.
Speaking of TLB, the Nehalem has an interesting architecture. Each core has a 64 entry "L1d TLB" (there's no "L1i TLB") and a unified 512 entry "L2TLB". Both are dynamically allocated between both HyperThreads.


I was wondering how much overhead there is when using virtualization. I repeated the benchmarks for the dual E5440, once in a normal Linux install, once while running the same install inside VMware ESX Server. The result is that, on average, it's 2.5x to 3x more expensive to do a context switch when using virtualization. My guess is that this is due to the fact that the guest OS can't update the page table itself, so when it attempts to change it, the hypervisor intervenes, which causes an extra 2 context switches (one to get inside the hypervisor, one to get out, back to the guest OS).

This probably explains why Intel added the EPT (Extended Page Table) on the Nehalem, since it enables the guest OS to modify its own page table without help of the hypervisor, and the CPU is able to do the end-to-end memory address translation on its own, entirely in hardware (virtual address to "guest-physical" address to physical address).

Parting words

Context switching is expensive. My rule of thumb is that it'll cost you about 30µs of CPU overhead. This seems to be a good worst-case approximation. Applications that create too many threads that are constantly fighting for CPU time (such as Apache's HTTPd or many Java applications) can waste considerable amounts of CPU cycles just to switch back and forth between different threads. I think the sweet spot for optimal CPU use is to have the same number of worker threads as there are hardware threads, and write code in an asynchronous / non-blocking fashion. Asynchronous code tends to be CPU bound, because anything that would block is simply deferred to later, until the blocking operation completes. This means that threads in asynchronous / non-blocking applications are much more likely to use their full time quantum before the kernel scheduler preempts them. And if there's the same number of runnable threads as there are hardware threads, the kernel is very likely to reschedule threads on the same core, which significantly helps performance.

Another hidden cost that severely impacts server-type workloads is that after being switched out, even if your process becomes runnable, it'll have to wait in the kernel's run queue until a CPU core is available for it. Linux kernels are often compiled with HZ=100, which entails that processes are given time slices of 10ms. If your thread has been switched out but becomes runnable almost immediately, and there are 2 other threads before it in the run queue waiting for CPU time, your thread may have to wait up to 20ms in the worst scenario to get CPU time. So depending on the average length of the run queue (which is reflected in load average), and how long your threads typically run before getting switched out again, this can considerably impact performance.

It is illusory to imagine that NPTL or the Nehalem architecture made context switching cheaper in real-world server-type workloads. Default Linux kernels don't do a good job at keeping CPU affinity, even on idle machines. You must explore alternative schedulers or use taskset or cpuset to control affinity yourself. If you're running multiple different CPU-intensive applications on the same server, manually partitioning cores across applications can help you achieve very significant performance gains.

Saturday, October 16, 2010

thread.error: can't start new thread

Today I was puzzled for a short while by a machine on which Python seemed to be acting up:
$ python
Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
>>> import thread
>>> thread.start_new(lambda: None, ())
Traceback (most recent call last):
File "", line 1, in
thread.error: can't start new thread
The machine had 8G of RAM free and was running only about 200 processes (~500 threads total). So, WTF? While running the command above, I straced Python and here's what I saw:
mmap(NULL, 17592186048512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x40, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 17592186048512, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Traceback (most recent call last"..., 35) = 35
The second argument to mmap is the amount of memory you want to allocate, and yes that's 16GB. I checked /etc/security/limits.conf and I saw that the problem came from there. Someone put this line:
*       -       stack   17179869184  # 16GB
Now when Python creates a thread, it first allocates* memory for the stack with mmap before calling clone to start the thread. If you're also puzzled by this can't start new thread error, make sure it's not a ulimit problem!

* Of course mmap doesn't actually allocate the memory, it just "reserves" it. Actual memory isn't allocated until you use it, but in this case the OS was refusing the "reservation request" because it was more than the maximum amount of RAM that a process is allowed to have according to other ulimits

Monday, October 4, 2010

IOException: well-known file is not secure

I implemented a command-line tool to discover and read JMX attributes on Java processes to make it easier to write monitoring scripts for things that, unfortunately, use JMX. The tool uses Sun's Attach API, which is a Sun specific API to attach to a running, local VM, which allows you to load a JMX agent in an existing JVM without requiring that JVM to be started with all the -Djmx flags crap.

It worked like a charm, except when run as root. As root, it would instead fail most of the time (but not always!) with this perfectly intelligible exception: well-known file is not secure
at Method)
I Googled that to find it was bug #6649594, but the bug report was completely unhelpful. It mentions a race condition but doesn't explain what the race condition is or how to work around it. Awesome.

So, Google Code Search to the rescue! This exception is thrown by Java_sun_tools_attach_LinuxVirtualMachine_checkPermissions, which is checking that the permissions and mode of a so-called "well-known file" match that of the running JVM. I couldn't trace the code to the location that produces the path, as it's the usual Java-style code with a bazillion layers of abstractions and indirections, but I'm pretty certain this is due to the /tmp/hsperfdata_$USER/$PID files. Sure enough, one of my files was owned by the group root instead of the group of the user (this is because before forking the JVM, I was calling setuid to drop the privileges, but I forgot to call setgid). Fixing the permissions on the file solved the error.

So if you too are scratching your head over this mysterious "well-known file", make sure the permissions on all the /tmp/hsperfdata_$USER/$PID files are consistent.

The jLOL of the day: it's impossible, in Java, to portably find the PID of the current JVM. So when iterating on all the VMs running on the localhost, here's a workaround to detect whether a VirtualMachine instance corresponds to the current JVM. Insert something in the system properties (System.setProperty(..)) and then check whether the VirtualMachine you have at hand has this property. There's a bug filed in — I kid you not — 1999 about this: Bug #4244896 with a whopping 109 votes for it. The bug is in state "Fix Understood", I guess I'm too dumb to understand the fix.

Friday, August 27, 2010

JPMP (Java's Poor Man's Profiler)

I recently came across PMP and found it surprisingly useful to troubleshoot and analyze live production systems, despite it being so ridiculously simple and naive. Unfortunately for me, I've had to deal with quite a bit of Java lately. So here we go, JPMP:
$ cat jpmp

for x in $(seq 1 $nsamples)
jstack $pid
sleep $sleeptime
done | \
awk 'BEGIN { s = "" }
/^"/ { if (s) print s; s = "" }
/^ at / { sub(/\([^)]*\)?$/, "", $2); sub(/^java/, "j", $2);
if (s) s = s "," $2; else s = $2 }
END { if(s) print s }' | \
sort | uniq -c | sort -rnk1
$ jpmp 28881 5 0
120 sun.misc.Unsafe.park,j.util.concurrent.locks.LockSupport.park,j.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await,j.util.concurrent.LinkedBlockingQueue.take,j.util.concurrent.ThreadPoolExecutor.getTask,j.util.concurrent.ThreadPoolExecutor$,
5 sun.misc.Unsafe.park,j.util.concurrent.locks.LockSupport.park,j.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await,j.util.concurrent.LinkedBlockingQueue.take,org.apache.zookeeper.ClientCnxn$
5 sun.misc.Unsafe.park,j.util.concurrent.locks.LockSupport.park,j.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await,j.util.concurrent.DelayQueue.take,,,
5 j.lang.Thread.sleep,,
5 j.lang.Object.wait,j.lang.ref.ReferenceQueue.remove,j.lang.ref.ReferenceQueue.remove,j.lang.ref.Finalizer$
5 j.lang.Object.wait,j.lang.Object.wait,j.lang.ref.Reference$

Java NIO and overengineering

In an earlier post about some of the most horrible APIs in the JDK (the date related APIs), I mentioned how the Java's IO APIs were equally (if not more) retarded. Today I had the joy to dig in the code of Java NIO try to understand why NIO doesn't have "built-in" proxy support like OIO through -D socksProxyHost and -DsocksProxyPort. I wanted to see if I could find a way to work around this limitation, possibly by doing evil things with reflection.

The problems boils down to the fact that with NIO, you use a SocketChannel, and under the hood, some kind of a Socket must be used. Here the fun starts, because as the javadoc says, "The actual work of the socket is performed by an instance of the SocketImpl class", which is itself an abstract class (it's an abstract Impl, yeah, right...).

So let's see how the code goes from a SocketChannel to a Socket or a SocketImpl:
  1. People typically create a SocketChannel like so:
  2. does return SelectorProvider.provider().openSocketChannel();
  3. SelectorProvider.provider() checks to see if the system property java.nio.channels.spi.SelectorProvider is set and a few other things. Unless you specifically do something manually, none of that stuff will end up finding a provider, so the code will end up returning;.
  4. Now here things get a little bit blurry. I think we end up in here (on Linux) doing return new;
  5. So in step 2., when openSocketChannel() is called on the provider, we end up in here, doing return new SocketChannelImpl(this);
  6. SocketChannelImpl then does all the work for NIO sockets, mostly using which is chiefly made of native methods.
  7. If you want to view a SocketChannelImpl as a Socket by calling socket(), the SocketChannelImpl wraps itself in a SocketAdaptor, but the adaptor simply transforms all the calls to calls on the underlying SocketChannelImpl which uses the native functions in, so there's nothing really you can do to make it use a SOCKS proxy.
So the bottom line of the story is that there's no way to use a SOCKS proxy with NIO unless you implement the SOCKS proxy protocol yourself (and if you do, make sure you use enough abstract factories implementations and other layers of indirections, so that your code will fit well in the Java IO philosophy).

Also, just for the fun of it, I was curious to see how many objects were involved under the hood of a single SocketChannel. The NIO code is so bloated that I was expecting quite a bit of overhead. I used the following not-very-scientific method (which is kinda similar to PMP) to try to get an answer:
$ cat
import java.nio.channels.SocketChannel;

final class t {
  public static void main(String[]a) throws Exception {
    SocketChannel socket =;
    InetSocketAddress addr = new InetSocketAddress("", 80);
then I run this code and use jmap -histo:live right after seeing the "ready" message and right after seeing the "connected" message. In between both of those messages, 2933 objects occupying 371744 bytes of RAM (on my MBP with the JDK 1.6.0_20-b02-279-10M3065 and HotSpot JVM 16.3-b01-279, both shipped by Apple with OS X 10.6.4). I don't know if I'm supposed to laugh or cry.
Class name                         Diff:       # instances Memory used (bytes)
<methodKlass>                                         +615 +74264
<constMethodKlass>                                    +615 +72416
<constantPoolKlass>                                    +65 +47288
<instanceKlassKlass>                                   +65 +46264
<symbolKlass>                                         +705 +33312
<constantPoolCacheKlass>                               +67 +32024
java.lang.Class                                        +69 +12696
[C                                                     +85 +10664
[[I                                                   +121  +8760
[I                                                     +73  +5928
[S                                                     +96  +5232
[B                                                     +58  +4600
java.lang.String                                       +94  +3760
java.util.LinkedList$Entry                             +65  +2600
<objArrayKlassKlass>                                    +4  +2336
<methodDataKlass>                                       +4  +1800                                           +15  +1560
java.util.HashMap$Entry                                +29  +1392
[Ljava.util.HashMap$Entry;                              +8  +1344
java.util.LinkedHashMap$Entry                          +15   +960
java.util.LinkedList                                   +23   +920
sun.misc.URLClassPath$JarLoader                         +6   +432
java.util.HashMap                                       +6   +384$Entry                            +11   +352
java.util.jar.JarFile$JarFileEntry                      +3   +336
java.lang.reflect.Constructor                           +2   +240
[Ljava.lang.Object;                                     +2   +208                                   +5   +200                            +1   +176
java.util.LinkedHashMap                                 +2   +160
java.lang.Object                                        +8   +128
java.lang.ThreadLocal                                   +5   +120
java.lang.ClassLoader$NativeLibrary                     +2    +96
[;                                 +2    +88
java.util.ArrayList                                     +2    +80
sun.misc.JarIndex                                       +2    +80
[Ljava.lang.String;                                     +2    +64$CacheEntry                         +2    +64$Cache                              +2    +64$Cache$Type                         +2    +64
sun.misc.URLClassPath                                   +1    +56
java.lang.ref.SoftReference                             +1    +56
[Ljava.lang.ThreadLocal;                                +1    +48
java.util.Stack                                         +1    +40
sun.reflect.NativeConstructorAccessorImpl               +1    +40                              +1    +40
[$Cache$Type;                      +1    +40
[Ljava.lang.reflect.Constructor;                        +1    +32                               +1    +32
sun.reflect.DelegatingConstructorAccessorImpl           +1    +24
java.util.HashMap$KeySet                                +1    +24
java.nio.channels.spi.AbstractInterruptibleChannel$1    +1    +24
[Ljava.lang.Class;                                      +1    +24$1                                  +1    +16                        +1    +16                       +1    +16                             +1    +16                                  -1    -24
java.util.Vector                                        -1    -40                                 -2    -64
java.util.jar.JarFile                                   -1    -80                                -4    -96                                  -4   -192$ZipFileInputStream               -12   -672
java.lang.ref.Finalizer                                -17  -1088
Er.. OK I guess that wasn't entirely fair, since 2286 objects (taking 328928 bytes of RAM) have been created by the VM for classes that have been loaded and compiled. So if we discount those, we're still left with 647 objects taking 42816 bytes of RAM... Ahem... Just for a little SocketChannel.

PS: I'm half-ashamed but to produce the output above, I used what could possibly be the most horrible "one-liner" I ever wrote in Python. I put the output of jmap in /tmp/a for the first run and /tmp/b for the second run:
python -c 'load = lambda path: dict((klass, (int(instances), int(bytes))) for (num, instances, bytes, klass) in (line.split() for line in open(path) if line[4] == ":")); a = load("/tmp/a"); b = load("/tmp/b"); print "\n".join("%s\033[55G%+4d\t%+6d" % (klass, diffinst, diffbytes) for (klass, diffinst, diffbytes) in sorted(((klass, b[klass][0] - a.get(klass, (0, 0))[0], b[klass][1] - a.get(klass, (0, 0))[1]) for klass in b), key=lambda i: i[2], reverse=True) if diffbytes != 0).replace(">", "&lt;").replace(">", "&gt;")'

Tuesday, June 22, 2010

Linux's dentry_cache takes all available memory

So for some reason one of my servers had 12G of memory allocated to the dentry_cache that wasn't being reclaimed. Probably a bug in the old version of the Linux kernel that this server is using. The solution that worked for me was:
# sync && echo 2 >/proc/sys/vm/drop_caches
If you're curious as to what this does, RTFM.

SimpleDateFormat and "AssertionError: cache control: inconsictency"

Oh boy, the JDK was really poorly """designed""". Today's fail is:
java.lang.AssertionError: cache control: inconsictency, cachedFixedDate=733763, computed=733763, date=2009-12-22T00:00:00.000Z
at java.util.GregorianCalendar.computeFields(
at java.util.GregorianCalendar.computeTime(
at java.util.Calendar.updateTime(
at java.util.Calendar.getTimeInMillis(
at java.util.Calendar.getTime(
at java.text.SimpleDateFormat.parse(
at java.text.DateFormat.parse(
(Courtesy of -enablesystemassertions). Because, yeah, SimpleDateFormat is not thread-safe.
Date formats are not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
Like many things in Java, this is just retarded. If SimpleDateFormat was written in a more functional way instead of storing the result of the last call to parse in itself, this problem wouldn't exist.
So yeah, I'm probably a JDK n00b and did not RTFM properly, but this just adds up to the already overflowing list of crap in the JDK.

In the next episode we'll review the OMGWTFBBQ code you have to write to use Java's crappy IO library – and I'm not even talking about the fact that the library exclusively allows 1975-style programming... which is generally the case for everything written in Java anyway.

Also, it's spelled "inconsistency" not "inconsictency", dumbass.

The moral of the story is that if grep 'static .* SimpleDateFormat' on your code returns anything, you have a bug!

Edit 2010-09-27: This can also manifest itself if parsing a date fails randomly with weird error messages such as: ParseException: For input string: "" (even though the string passed in argument was definitely not empty). With such concurrency bugs, a number of weird things can happen...

Sunday, June 13, 2010

Go fix the bug yourself, KTHXBYE

Have you ever wondered why, often, small open-source projects led by a handful of bearded hackers become more successful than large, entreprise-class software? They don't necessarily become more successful in terms of adoption (actually, this rarely happens), but technically speaking, and for those who adopt them, they're much, much, much better.

I'm sure a number of people have already observed how open source communities work, how people interact together, how projects move forward, and so on and so forth, to try to understand what makes the successful ones. But so far I've yet to see a "definitive guide" on "How To Make The Engineers In Your Company As Efficient As In An Awesome Open Source Community".

One of the differences that struck me today is the following. Someone wrote a piece of code that someone else uses. The user complains to the original author "hey, your code is broken, go fix it!".

In an open-source setting, frequently the user will be told "Oh, I see, it's probably something in file hmm, I'm pretty busy right now, how about you fix it and send us the patch? [KTHXBYE]".

And that's fine. Hardly anyone will be shocked by that. Hey, it's an open-source project after all. The author is probably not getting paid for this and doing it on his free time. Now if the user really cares about this bug, and there's no easy workaround, it's actually rather likely that he'll bite the bullet, dive into the code, and fix the bug. And when he does so, it's even more likely that he'll contribute the fix back. Great.

But in a corporate setting, some parameters are different. If the author refuses to fix the bug found, often the user will complain that they're being blocked by the author. It's like, "Hey, it's gonna take you just 10 minutes to fix it, whereas if I fix it myself I'll have to learn X/Y/Z first and it's gonna take me hours. Plus, it's your code.". To the defense of the user, this argument is probably almost true, except it's probably gonna take 30 minutes to the guy to fix the bug and just 1-2 hours for the reporter to learn whatever is required and fix the bug.

Depending on the likelihood that the user to re-use this piece of code where the bug was, getting the user to fix the bug themselves is actually a lot more beneficial, I think. If he does it, he'll know at least a little bit about the project and its implementation. Should another bug crop up, he'll be able to get the fix in place a lot faster since he'll be familiar with the code and won't need to ask anyone else for help (asking for help is the equivalent of a huge cache miss + TLB miss + page fault and incurs a significant performance penalty).

There are other nice side effects of getting the user to fix the code themselves. Maybe, by going through the code, they'll stumble upon a feature they didn't know was there, and they'll re-use it instead of re-implementing something similar. Maybe they'll find that the feature isn't exactly what they need, but it's half-way there, so they'll improve the code, which in turns will benefit everyone else already using that piece of code, or looking for a similar-ish feature, so they won't have to make a third implementation.

Now what I've seen so far is that in practice this frequently doesn't happen in companies. Each individual owns a number of projects, libraries, systems and such. And they're the only one knowing everything inside out about what they own. And when there's a problem, they're asked to fix it because, hey, they're the ones who know how the damn thing works (or doesn't). And they don't want to block their co-workers, do they? Right? Right?! So there's no incentive for others to learn and start sharing the ownership of those systems. And when the person leaves, gets fired, takes some extended time off, no one's left to deal with the problems. So what happens is that, after a while, once everyone is fed up with this goddamn thing that doesn't-work-and-no-one-knows-how-to-fix-it, someone gets assigned to find a durable solution. "Hey, this has been causing a number of problems, others are blocked by it or it's affecting their productivity, you gotta do something about it.", says the manager / project lead. And then what happens is that either the victim starts taking ownership of the system, and we're back with the original problem with exclusive ownership, or the victim rewrites the entire thing and .. still has exclusive ownership over it.

I think this is one of the many many factors that explain why, in a corporate environment, it's so frequent to have lots of "projects" exclusively owned by a single individual and why the projects aren't moving as fast as they could. There's a significant amount of time spent re-inventing the wheel, ditching and rewriting existing stuff, going back and forth with another engineer/team about how the problem wasn't fixed properly, etc. In the open-source world those inefficiencies don't exist so much because engineers maintain different relationships.

It's probably not a coincidence that this is something that happens a lot at Google. I mean, people are encouraged to fix the bugs they find. Every engineer has access to virtually all the code (crazy heh? yes this model works, and it's marvelous) so there are no barriers to prevent you from fixing something you see is wrong. It's so easy, you check out the code, learn what you need to learn about it (hopefully it's slightly better documented than your average corporate code as code reviews are mandatory at Google, and hopefully people complain when they're asked to review undocumented code), you fix the problem, and send a code review to the owner of the code. With a little bit of luck, they'll quickly get back to you with an LGTM (Looks Good To Me), et voilà, your fix/improvement has been submitted and is available to ~10k other engineers. Google is effectively the large corporation that, internally, looks the most like a gigantic open-source community. And the fact that they are where they are today is probably not a coincidence when you mix this with an engineering- & data-driven culture.

Wednesday, June 9, 2010

Beamer's semiverbatim, spaces and newlines

If like me, you're puzzled because the semiverbatim environment isn't respecting your spaces and newlines like verbatim does, make sure you haven't forgotten to declare the frame [fragile]!

It took me a little while to realize this.

Sunday, February 21, 2010

How to disable Bonjour on OSX 10.6 (Snow Leopard)

Bonjour is Apple's zeroconf protocol and iTunes (among others) uses it to broadcast the existence of its shared library, which I find annoying because it's quite chatty on the network. In the past shutting down mDNSResponder was enough to keep it quiet but this doesn't work as intended anymore on Snow Leopard as all DNS activity goes through mDNSResponder now.

Apple's KB explains how to properly disable those broadcast advertisements:
  1. sudo vim /System/Library/LaunchDaemons/
  2. Add <string>-NoMulticastAdvertisements</string> at the end of the ProgramArguments array.
  3. Save the file
  4. Restart mDNSResponder:
    sudo launchctl unload /System/Library/LaunchDaemons/
    sudo launchctl load /System/Library/LaunchDaemons/
There's no need to reboot unlike what the KB says, as you long as you restart mDNSResponder.

Friday, February 5, 2010

Linux pipe count

I recently needed to keep track of the number of pipes opened by a server and I was using lsof -nw | fgrep FIFO but this was rather slow.

Here's a better way of counting the number of pipes opened:
# find /proc/[0-9]*/fd -lname 'pipe:*' 2>/dev/null -printf "%l\n" | wc -l
# find /proc/[0-9]*/fd -lname 'pipe:*' 2>/dev/null -printf "%l\n" | sort -u | wc -l
So in this case there are 16 different pipes opened on this system and 1294 FDs associated with one or the other end of each pipe. Let's see how many FDs are associated with each pipe:
# find /proc/[0-9]*/fd -lname 'pipe:*' 2>/dev/null -printf "%l\n" | sort | uniq -c
1 pipe:[10420]
4 pipe:[1752247694]
4 pipe:[1752247702]
2 pipe:[1752247883]
2 pipe:[234248737]
1 pipe:[2597396375]
504 pipe:[4194986548]
255 pipe:[4194986549]
252 pipe:[4194986551]
252 pipe:[4194986552]
2 pipe:[4213391983]
1 pipe:[4214019609]
2 pipe:[480771901]
8 pipe:[480771902]
2 pipe:[6087]
2 pipe:[6288]
So certain pipes are much more popular than others.

Bonus script to find out which pipe is used by which process:
# find /proc/[0-9]*/fd -lname 'pipe:*' -printf "%p/%l\n" 2>/dev/null | python -c 'import os
import sys
pid2cmd = {}
def cmdname(pid):
cmd = os.path.basename(os.readlink("/proc/%s/exe" % pid))
pid2cmd[pid] = cmd
return cmd
pipes = {}
for line in sys.stdin:
line = line.rstrip().split("/")
pid, pipe = line[2], line[5]
cmd = pid2cmd.get(pid) or cmdname(pid)
pipes.setdefault(pipe, {}).setdefault(cmd, 0)
pipes[pipe][cmd] += 1
n = 0
for pipe, cmds in sorted(pipes.iteritems()):
print pipe,
for cmd, cnt in cmds.iteritems():
n += int(cnt)
print "%s=%s" % (cmd, cnt),
print len(pipes), "pipes using", n, "fds"'
pipe:[10420] rpc.statd=1
pipe:[1752247694] dsm_sa_datamgr32d.
pipe:[1752247702] dsm_sa_datamgr32d.
pipe:[1752247883] dsm_sa_datamgr32d.
pipe:[234248737] famd=2
pipe:[2597396375] ntpd=1
pipe:[4194986548] apache2=504
pipe:[4194986549] dash=1 rotatelogs=1 apache2=253
pipe:[4194986551] apache2=252
pipe:[4194986552] apache2=252
pipe:[4214440803] sshd=2
pipe:[4214877266] find=1
pipe:[480771901] master=2
pipe:[480771902] qmgr=2 pickup=2 master=2 tlsmgr=2
pipe:[6087] init=2
pipe:[6288] udevd=2
16 pipes using 1294 fds
So apache2 is the heavy pipe user here. Yay for mpm_prefork</irony>