Saturday, April 14, 2012

How Apache Hadoop is molesting IOException all day

Today I'd like to rant about one thing that's been bugging me for the last couple of years with Apache Hadoop (and all its derived projects). It's a big issue that concerns us all. We have to admit it: each time we write code for the Apache Hadoop stack, we feel bad about it, but we try hard to ignore what's happening right before our eyes. I'm talking, of course, about the constant abuse and molestation of IOException.
I'm not even going to debate how checked exceptions are like communism (good idea in theory, totally fails in practice). Even if people don't get that, I wish they'd at least stop the madness with this poor little IOException.
Let's review again what IOException is for:
"Signals that an I/O exception of some sort has occurred. This class is the general class of exceptions produced by failed or interrupted I/O operations."
In Hadoop everything is an IOException. Everything. Some assertion fails, IOException. A number exceeds the maximum allowed by the config, IOException. Some protocol versions don't match, IOException. Hadoop needs to fart, IOException.
How are you supposed to handle these exceptions? You can't. Everything is declared as throws IOException, and everything is catching, wrapping, re-throwing, logging, eating, and ignoring IOExceptions. No matter what goes wrong, you're left clueless. And it's not like there is a nice exception hierarchy to help you handle them. No, virtually everything is just a bare IOException.
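To make it concrete, here's a minimal sketch of what every caller of a Hadoop-style API ends up writing. (submitJob() here is a made-up stand-in, not a real Hadoop method; the point is the shape of the catch block.)

```java
import java.io.IOException;

public class Caller {
  void runJob() {
    try {
      // Any Hadoop call: it could fail on disk, network, config,
      // protocol version mismatch, or an internal assertion -- and
      // all of them surface as the same bare IOException.
      submitJob();  // hypothetical Hadoop-style API call
    } catch (IOException e) {
      // Retry? Abort? Fix the config? The type tells you nothing.
      // A missing file might be worth retrying after a fix; a
      // protocol mismatch never is. Both arrive here identically.
      throw new RuntimeException("Job failed, cause unknown", e);
    }
  }

  void submitJob() throws IOException {
    // stand-in for an actual Hadoop API call
  }
}
```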
Because of this, it's not uncommon to see code that inspects the message of the exception (a bare String) to try to figure out what's wrong and what to do about it. A friend of mine was recently explaining to me how Apache Kafka was "stringly typed" (a new cutting-edge paradigm whereby you show the middle finger to the type system and stuff everything in Strings). Well, Hadoop has invented something even better than checked exceptions: stringed exceptions. Unfortunately, half of the time you can't even leverage this awesome new idiom because the message of the exception itself is useless. For example, when a MapReduce job chokes on a corrupted file, it will just throw an IOException without telling you the path of the problematic file. This way it's more fun: once you nail it down (with a binary search, of course), you feel like you accomplished something. Or you'll get messages like "IOException: Split metadata size exceeded 10000000.". Figuring out what the actual value was is left as an exercise for the reader.
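The "stringed exceptions" idiom then looks something like this in caller code. This is a sketch: the second message fragment and the remediation helpers are invented for illustration, and only the split-metadata message is quoted from an actual error above.

```java
import java.io.IOException;

public class StringlyTyped {
  void handle(IOException e) {
    String msg = e.getMessage();
    if (msg != null && msg.contains("Split metadata size exceeded")) {
      // We "know" what happened only because we grepped the message.
      // The actual size isn't in the message, so we can't log it or
      // adjust the config programmatically.
      bumpSplitMetadataLimit();
    } else if (msg != null && msg.contains("version mismatch")) {
      // Hypothetical example: dispatching on another message fragment.
      abort();
    } else {
      // Shrug. Hope someone reads the logs.
      throw new RuntimeException(e);
    }
  }

  void bumpSplitMetadataLimit() { /* hypothetical remediation */ }
  void abort() { /* hypothetical remediation */ }
}
```

One refactored error message somewhere in Hadoop and this whole house of cards silently falls apart.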
So, seriously Apache folks...
Stop Abusing IOException!
Leave this poor little IOException alone!
Hadoop (0.20.2) currently has a whopping 1300+ lines of code creating bare IOExceptions. HBase (0.92.1) has over 400. Apache committers should consider every single one of these lines a code smell that's begging to be fixed. Please introduce a new base exception type and create a sound exception hierarchy.
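For what it's worth, here's a sketch of what such a hierarchy could look like. Every name in it is hypothetical; the point is that the failure data (the limit, the actual value, the protocol versions) lives in typed fields instead of a String:

```java
import java.io.IOException;

/** Hypothetical base type for all Hadoop-specific failures. */
public class HadoopException extends IOException {
  public HadoopException(String message) { super(message); }
}

/** Thrown when a configured limit is exceeded; carries the numbers. */
class LimitExceededException extends HadoopException {
  private final long actual;
  private final long limit;

  LimitExceededException(String what, long actual, long limit) {
    super(what + ": " + actual + " exceeded the limit of " + limit);
    this.actual = actual;
    this.limit = limit;
  }

  public long getActual() { return actual; }
  public long getLimit() { return limit; }
}

/** Thrown when client and server protocol versions don't match. */
class ProtocolMismatchException extends HadoopException {
  ProtocolMismatchException(long client, long server) {
    super("Protocol version mismatch: client=" + client + ", server=" + server);
  }
}
```

Callers could then catch LimitExceededException and read the actual value from a getter instead of parsing a message. And since the base type extends IOException, existing throws clauses stay source-compatible, so the hierarchy could be introduced incrementally.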