Tuesday, June 22, 2010

Linux's dentry_cache takes all available memory

So for some reason one of my servers had 12G of memory allocated to the dentry_cache that wasn't being reclaimed. Probably a bug in the old version of the Linux kernel that this server is using. The solution that worked for me was:
# sync && echo 2 >/proc/sys/vm/drop_caches
If you're curious as to what this does, RTFM.

SimpleDateFormat and "AssertionError: cache control: inconsictency"

Oh boy, the JDK was really poorly """designed""". Today's fail is:
java.lang.AssertionError: cache control: inconsictency, cachedFixedDate=733763, computed=733763, date=2009-12-22T00:00:00.000Z
at java.util.GregorianCalendar.computeFields(GregorianCalendar.java:2070)
at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2472)
at java.util.Calendar.updateTime(Calendar.java:2468)
at java.util.Calendar.getTimeInMillis(Calendar.java:1087)
at java.util.Calendar.getTime(Calendar.java:1060)
at java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1368)
at java.text.DateFormat.parse(DateFormat.java:335)
(Courtesy of -enablesystemassertions). Because, yeah, SimpleDateFormat is not thread-safe.
Date formats are not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
Like many things in Java, this is just retarded. If SimpleDateFormat was written in a more functional way instead of storing the result of the last call to parse in itself, this problem wouldn't exist.
So yeah, I'm probably a JDK n00b and did not RTFM properly, but this just adds up to the already overflowing list of crap in the JDK.

In the next episode we'll review the OMGWTFBBQ code you have to write to use Java's crappy IO library – and I'm not even talking about the fact that the library exclusively allows 1975-style programming... which is generally the case for everything written in Java anyway.

Also, it's spelled "inconsistency" not "inconsictency", dumbass.

The moral of the story is that if grep 'static .* SimpleDateFormat' on your code returns anything, you have a bug!

Edit 2010-09-27: This can also manifest itself if parsing a date fails randomly with weird error messages such as: ParseException: For input string: "" (even though the string passed in argument was definitely not empty). With such concurrency bugs, a number of weird things can happen...

Sunday, June 13, 2010

Go fix the bug yourself, KTHXBYE

Have you ever wondered why, often, small open-source projects led by a handful of bearded hackers become more successful than large, entreprise-class software? They don't necessarily become more successful in terms of adoption (actually, this rarely happens), but technically speaking, and for those who adopt them, they're much, much, much better.

I'm sure a number of people have already observed how open source communities work, how people interact together, how projects move forward, and so on and so forth, to try to understand what makes the successful ones. But so far I've yet to see a "definitive guide" on "How To Make The Engineers In Your Company As Efficient As In An Awesome Open Source Community".

One of the differences that struck me today is the following. Someone wrote a piece of code that someone else uses. The user complains to the original author "hey, your code is broken, go fix it!".

In an open-source setting, frequently the user will be told "Oh, I see, it's probably something in file foo.bar... hmm, I'm pretty busy right now, how about you fix it and send us the patch? [KTHXBYE]".

And that's fine. Hardly anyone will be shocked by that. Hey, it's an open-source project after all. The author is probably not getting paid for this and doing it on his free time. Now if the user really cares about this bug, and there's no easy workaround, it's actually rather likely that he'll bite the bullet, dive into the code, and fix the bug. And when he does so, it's even more likely that he'll contribute the fix back. Great.

But in a corporate setting, some parameters are different. If the author refuses to fix the bug found, often the user will complain that they're being blocked by the author. It's like, "Hey, it's gonna take you just 10 minutes to fix it, whereas if I fix it myself I'll have to learn X/Y/Z first and it's gonna take me hours. Plus, it's your code.". To the defense of the user, this argument is probably almost true, except it's probably gonna take 30 minutes to the guy to fix the bug and just 1-2 hours for the reporter to learn whatever is required and fix the bug.

Depending on the likelihood that the user to re-use this piece of code where the bug was, getting the user to fix the bug themselves is actually a lot more beneficial, I think. If he does it, he'll know at least a little bit about the project and its implementation. Should another bug crop up, he'll be able to get the fix in place a lot faster since he'll be familiar with the code and won't need to ask anyone else for help (asking for help is the equivalent of a huge cache miss + TLB miss + page fault and incurs a significant performance penalty).

There are other nice side effects of getting the user to fix the code themselves. Maybe, by going through the code, they'll stumble upon a feature they didn't know was there, and they'll re-use it instead of re-implementing something similar. Maybe they'll find that the feature isn't exactly what they need, but it's half-way there, so they'll improve the code, which in turns will benefit everyone else already using that piece of code, or looking for a similar-ish feature, so they won't have to make a third implementation.

Now what I've seen so far is that in practice this frequently doesn't happen in companies. Each individual owns a number of projects, libraries, systems and such. And they're the only one knowing everything inside out about what they own. And when there's a problem, they're asked to fix it because, hey, they're the ones who know how the damn thing works (or doesn't). And they don't want to block their co-workers, do they? Right? Right?! So there's no incentive for others to learn and start sharing the ownership of those systems. And when the person leaves, gets fired, takes some extended time off, no one's left to deal with the problems. So what happens is that, after a while, once everyone is fed up with this goddamn thing that doesn't-work-and-no-one-knows-how-to-fix-it, someone gets assigned to find a durable solution. "Hey, this has been causing a number of problems, others are blocked by it or it's affecting their productivity, you gotta do something about it.", says the manager / project lead. And then what happens is that either the victim starts taking ownership of the system, and we're back with the original problem with exclusive ownership, or the victim rewrites the entire thing and .. still has exclusive ownership over it.

I think this is one of the many many factors that explain why, in a corporate environment, it's so frequent to have lots of "projects" exclusively owned by a single individual and why the projects aren't moving as fast as they could. There's a significant amount of time spent re-inventing the wheel, ditching and rewriting existing stuff, going back and forth with another engineer/team about how the problem wasn't fixed properly, etc. In the open-source world those inefficiencies don't exist so much because engineers maintain different relationships.

It's probably not a coincidence that this is something that happens a lot at Google. I mean, people are encouraged to fix the bugs they find. Every engineer has access to virtually all the code (crazy heh? yes this model works, and it's marvelous) so there are no barriers to prevent you from fixing something you see is wrong. It's so easy, you check out the code, learn what you need to learn about it (hopefully it's slightly better documented than your average corporate code as code reviews are mandatory at Google, and hopefully people complain when they're asked to review undocumented code), you fix the problem, and send a code review to the owner of the code. With a little bit of luck, they'll quickly get back to you with an LGTM (Looks Good To Me), et voilĂ , your fix/improvement has been submitted and is available to ~10k other engineers. Google is effectively the large corporation that, internally, looks the most like a gigantic open-source community. And the fact that they are where they are today is probably not a coincidence when you mix this with an engineering- & data-driven culture.

Wednesday, June 9, 2010

Beamer's semiverbatim, spaces and newlines

If like me, you're puzzled because the semiverbatim environment isn't respecting your spaces and newlines like verbatim does, make sure you haven't forgotten to declare the frame [fragile]!

It took me a little while to realize this.