Saturday, October 16, 2010

thread.error: can't start new thread

Today I was puzzled for a short while by a machine on which Python seemed to be acting up:
$ python
Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
>>> import thread
>>> thread.start_new(lambda: None, ())
Traceback (most recent call last):
File "", line 1, in
thread.error: can't start new thread
The machine had 8G of RAM free and was running only about 200 processes (~500 threads total). So, WTF? While running the command above, I straced Python and here's what I saw:
mmap(NULL, 17592186048512, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x40, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 17592186048512, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
write(2, "Traceback (most recent call last"..., 35) = 35
The second argument to mmap is the amount of memory you want to allocate, and yes that's 16GB. I checked /etc/security/limits.conf and I saw that the problem came from there. Someone put this line:
*       -       stack   17179869184  # 16GB
Now when Python creates a thread, it first allocates* memory for the stack with mmap before calling clone to start the thread. If you're also puzzled by this can't start new thread error, make sure it's not a ulimit problem!

* Of course mmap doesn't actually allocate the memory, it just "reserves" it. Actual memory isn't allocated until you use it, but in this case the OS was refusing the "reservation request" because it was more than the maximum amount of RAM that a process is allowed to have according to other ulimits

Monday, October 4, 2010

IOException: well-known file is not secure

I implemented a command-line tool to discover and read JMX attributes on Java processes to make it easier to write monitoring scripts for things that, unfortunately, use JMX. The tool uses Sun's Attach API, which is a Sun specific API to attach to a running, local VM, which allows you to load a JMX agent in an existing JVM without requiring that JVM to be started with all the -Djmx flags crap.

It worked like a charm, except when run as root. As root, it would instead fail most of the time (but not always!) with this perfectly intelligible exception:
java.io.IOException: well-known file is not secure
at sun.tools.attach.LinuxVirtualMachine.checkPermissions(Native Method)
at sun.tools.attach.LinuxVirtualMachine.(LinuxVirtualMachine.java:93)
at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:46)
at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:195)
I Googled that to find it was bug #6649594, but the bug report was completely unhelpful. It mentions a race condition but doesn't explain what the race condition is or how to work around it. Awesome.

So, Google Code Search to the rescue! This exception is thrown by Java_sun_tools_attach_LinuxVirtualMachine_checkPermissions, which is checking that the permissions and mode of a so-called "well-known file" match that of the running JVM. I couldn't trace the code to the location that produces the path, as it's the usual Java-style code with a bazillion layers of abstractions and indirections, but I'm pretty certain this is due to the /tmp/hsperfdata_$USER/$PID files. Sure enough, one of my files was owned by the group root instead of the group of the user (this is because before forking the JVM, I was calling setuid to drop the privileges, but I forgot to call setgid). Fixing the permissions on the file solved the error.

So if you too are scratching your head over this mysterious "well-known file", make sure the permissions on all the /tmp/hsperfdata_$USER/$PID files are consistent.

The jLOL of the day: it's impossible, in Java, to portably find the PID of the current JVM. So when iterating on all the VMs running on the localhost, here's a workaround to detect whether a VirtualMachine instance corresponds to the current JVM. Insert something in the system properties (System.setProperty(..)) and then check whether the VirtualMachine you have at hand has this property. There's a bug filed in — I kid you not — 1999 about this: Bug #4244896 with a whopping 109 votes for it. The bug is in state "Fix Understood", I guess I'm too dumb to understand the fix.