Friday, December 10, 2010

Java IO: slowest readLine ever

I have a fairly simple problem: I want to count the number of lines in a file, then seek back to after the first line, and then read the file line by line. Easy heh? Not in Java. Enter the utterly retarded world of the JDK.

So if you're n00b, you'll start with a FileInputStream, but quickly you'll realize that seeking around with it isn't really possible... Indeed, the only way to go back to a previous position in the file is to call reset(), which will take you back to the previous location you marked with mark(int). The argument to mark is "the maximum limit of bytes that can be read before the mark position becomes invalid". OK WTF.

If you dig around some more, you'll see that you should really be using a RandomAccessFile – so much for good OO design. The other seemingly cool thing about RandomAccessFile is that it's got a readLine() method. Unfortunately, this method was implemented by a 1st year CS student who probably dropped out before understanding the basics of systems programming.

Believe it or not, but readLine() reads the file one byte at a time. It does one system call to read per byte. As such, it's 2 orders of magnitude slower than it could be... In fact, you can't really implement a readline function that's much slower than that. facepalm.

PS: This is with Sun's JRE version 1.6.0_22-b04. JDK7/OpenJDK has the same implementation. Apache's Harmony implementation is the same, so Android has the same retarded implementation.

11 comments:

Flagthis said...

Wow, thats very interesting (and sort of dumb). How expensive is to reopen the file if its huge file, or cache the contents of small files to avoid the "re-seeking start of file" issue.

Unknown said...

Nice history !

How did you dive that deep into readLine() implementation ? Strace ?

What about readFully() ? Did it also fill the byte array... byte after byte ?

tsuna said...

@royans: with a RandomAccessFile you can seek around freely, no need to reopen the file.

@Adrien: I always read the code of library functions before calling them from my code. I know not to trust the JDK especially, because there are a lot of dumb things in it. I spend a lot of time reading the code of the JDK on Google Code Search, but in this case I also double-checked that the JRE I was using had the problem with strace indeed. A number of packages in the JDK are very poorly implemented. Some packages are pretty awesome however, such as everything that Doug Lea wrote (no surprise there) for java.util.concurrent.

Anonymous said...

I'm still new to programming, but wouldn't it be better to use

String s = "";
ArrayList a = new ArrayList();
while ((s=in.readline())!=null) a.add(s);


to get the entire file which you intend to read line-by-line anyway. Then you'd have a.size() as the number of lines in the file, and could use a for loop to access them quickly for whatever processing needs done. I don't know much about speed or memory usage yet, so please don't yell at me if I said something stupid. T.T

tsuna said...

@Anonymous: you could do this if you know the file isn't too large and can fit in memory – but still, you'd have to implement readLine yourself because the implementation in the JDK is unacceptable. In my case though, there's another reason why I require two passes through the file, the first pass pre-computes some data that the second pass needs.

Anonymous said...

And what about using java BufferReader, or read(byte[] b, int off, int len) if you need to use a RandomAccessFile ?
If I'm not wrong, javadocs tell us that it'll be read byte per byte, and might be inefficient...

tsuna said...

@Anonymous: the javadoc of RandomAccessFile doesn't mention anything about reading byte by byte, it just says "This method successively reads bytes from the file". Either way, there's no reason whatsoever to have such an utterly stupid implementation of readLine. BufferedReader doesn't have a seek method. So you can either seek around or have a non-stupid readLine, but not both. Yay.

Anonymous said...

you posted this on my birthday!


YO BEN! you in sf now? come visit!

danmux said...

For any speedy file reading on Android you need to build some class that uses an InputStream like DataInputStream then use chunked buffers and parse.

Buffering the whole file obv gives you random access - into the buffer.

If you have a size problem (most of us do! ;) ) then
Performant random access and seeking makes things more complicated.

Stanimir Simeonoff said...

If you can help it, always use mapped files with DirectBuffers. RandomAccessFile is retarded in reading all primitives, incl. readInt(), readLong(). Full JNI switch for each bytes, magic!!

Of course mapping comes w/ a price of being unable ( to unmap (baring hacks)

Anonymous said...

In case anyone was looking for a good solution using just the standard JDK...

BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
String line;
while ((line = bufferedReader.readLine()) != null)
{
// process line
}
bufferedReader.close();