Looking at Java NIO Buffer performance
2 Comments Published by vladimir October 5th, 2011 in Java, Performance
While starting to convert much of our IO to use nio byte buffers, with the eventual goal of pushing them further up into the application, I decided to investigate performance in more detail. I’d seen some blog posts claiming that performance wasn’t great, in particular a very old post from 2004. That post included a simple benchmark, which I grabbed, converted to use Int buffers, dropped the count to 10,000,000 int values, and ran. The source is available as niotest.java. The results weren’t encouraging:
Java Version        1.6       1.7
array[] put         26 ms     31 ms
absolute put        129 ms    130 ms
relative put        130 ms    132 ms
array[] get         20 ms     19 ms
absolute get        116 ms    119 ms
relative get        130 ms    137 ms
Not only was there no really visible perf change between Java 1.6 and 1.7, using the nio buffers was 4x-6x slower than regular Java arrays! I wrote a quick equivalent benchmark in JavaScript, using Typed Arrays, and originally saw numbers in the 11ms range. (Note: the original benchmark numbers above were in the 300ms range for nio arrays, before a laptop suspend/resume. I incredulously tweeted the 30x difference, and then went about cleaning up the benchmarks. I can’t reproduce either result now; the Java version got faster, and the JavaScript version got slower.) The JS benchmark (source code buftest.html) gives about 65ms for writing and 40ms for reading. That still seemed faster, and I set about writing this blog post.
As part of that, I decided to clean up the benchmark code and put everything together in a nice package. The source for the new benchmark is ArrayBenchmark.java. Like the original, it works on arrays/buffers of 10,000,000 integers, first writing each element (with just its index) and then reading each element in the get operation. The additional “copy into” benchmarks time how long it takes to copy all the ints into an existing int[] array. Here are the results:
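For readers without the source at hand, the operations being timed can be sketched roughly as follows. This is not ArrayBenchmark.java itself: the class name, method names, and the reduced COUNT are assumptions made for illustration, and a single timed pass like the one in main says little until the JIT has compiled the loops.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Illustrative sketch only; names and sizes are hypothetical, not from the post's source.
class BufferTimingSketch {
    // The post uses 10,000,000 ints; a smaller count keeps this sketch quick.
    static final int COUNT = 1000000;

    // Heap (non-direct) buffer, backed by an int[] on the Java heap.
    static IntBuffer heapBuffer() {
        return IntBuffer.allocate(COUNT);
    }

    // Direct buffer: memory outside the Java heap, viewed as ints in native byte order.
    static IntBuffer directBuffer() {
        return ByteBuffer.allocateDirect(COUNT * 4)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
    }

    // Absolute put: every call names the index it writes to.
    static void absolutePut(IntBuffer buf) {
        for (int i = 0; i < COUNT; i++) {
            buf.put(i, i);
        }
    }

    // Relative put: the buffer advances its own position on each call.
    static void relativePut(IntBuffer buf) {
        buf.clear(); // position = 0, limit = capacity
        for (int i = 0; i < COUNT; i++) {
            buf.put(i);
        }
    }

    // Absolute get; the sum is returned so the reads cannot be optimized away.
    static long absoluteGet(IntBuffer buf) {
        long sum = 0;
        for (int i = 0; i < COUNT; i++) {
            sum += buf.get(i);
        }
        return sum;
    }

    // Relative get, reading from position 0 to the end.
    static long relativeGet(IntBuffer buf) {
        buf.clear();
        long sum = 0;
        for (int i = 0; i < COUNT; i++) {
            sum += buf.get();
        }
        return sum;
    }

    // "Copy into": bulk-copy the whole buffer into an existing int[].
    static int[] copyInto(IntBuffer buf, int[] dest) {
        buf.clear();
        buf.get(dest, 0, COUNT);
        return dest;
    }

    public static void main(String[] args) {
        IntBuffer buf = directBuffer();
        long t0 = System.nanoTime();
        absolutePut(buf);
        long t1 = System.nanoTime();
        long sum = absoluteGet(buf);
        long t2 = System.nanoTime();
        System.out.printf("put: %f ms%n", (t1 - t0) / 1e6);
        System.out.printf("get: %f ms (checksum %d)%n", (t2 - t1) / 1e6, sum);
    }
}
```

Swapping directBuffer() for heapBuffer() in main exercises the heap-buffer case with the same loops.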
                          Java 1.6         Java 1.7
===== native java int[] array
put:                      27.971961 ms     42.464894 ms
get:                      32.949032 ms     14.826696 ms
copy into:                20.069191 ms     15.853778 ms
===== nio heap buffers
put:                      839.730766 ms    57.876372 ms
put (relative):           844.618171 ms    80.951102 ms
get:                      742.287840 ms    80.578592 ms
get (relative):           759.317101 ms    79.563458 ms
copy into:                769.494235 ms    91.685437 ms
===== nio direct buffers
put:                      161.480338 ms    31.951206 ms
put (relative):           170.194344 ms    47.541457 ms
get:                      179.621322 ms    18.913808 ms
get (relative):           164.425689 ms    29.387186 ms
copy into:                21.940450 ms     16.936357 ms
===== custom buffers
put:                      151.095845 ms    48.125012 ms
put unchecked:            148.538301 ms    51.241096 ms
get:                      146.243837 ms    36.636723 ms
get unchecked:            138.765277 ms    31.641897 ms
copy into:                41.643050 ms     20.206091 ms
copy into (copyMemory):   N/A              16.845686 ms
These numbers show a significant improvement in Java 1.7! Direct buffers are roughly as fast as regular arrays, which is what I had hoped to see originally. The “custom buffers” section is a hand-rolled integer buffer class that uses Unsafe.getInt/putInt without much of the additional nio buffer machinery or abstractions, to see how much that machinery was contributing to overhead. The difference is noticeable in Java 1.6, but in Java 1.7 the original nio buffers win handily, even against “unchecked” versions of get/put that don’t do any bounds checking. I also added heap (non-direct) buffers, to see if there was any truth to a claim I read that mixing direct and non-direct buffers causes an overall slowdown, because then there would be two implementations of the abstract parent class and the VM couldn’t optimize the virtual calls. That doesn’t seem to be the case any more; the JIT doesn’t care.
But I am now very confused about why the original benchmark code and the new code give such different results. The normal int[] put is at 42ms, slower than the 31ms in the original benchmark, and slower still than the 27ms that the same benchmark gets in Java 1.6. The other numbers are all much better, though. Compare, for example, direct buffer absolute “get” performance: 119ms in the first benchmark, 19ms in the second. This is a 6x speed difference. The same compiler and JVM are used for both. I even added a ‘mixed’ set to the new benchmark that does the operations in the exact same order as the first one (interleaving operations on arrays and int buffers), and it didn’t matter.
The new benchmark numbers are really encouraging, and mean that we’re probably going to push the nio buffers into many places, simplifying our interaction with IO, OpenGL, algorithms implemented in JNI, etc., as well as letting us move the bulk of our large data out of the Java heap. However, I’d like to understand why the two benchmarks give such vastly different performance results. I’ve stared at the source for a while, and I’m virtually certain that they’re doing the same operations, on identically-sized arrays. Can someone explain the overall slowness of the first benchmark? Why did the numbers hardly change at all between Java 1.6 and 1.7? Why are the 1.6 numbers in the second benchmark slower than the 1.6 numbers in the first?
The original benchmark runs way too fast, so I guess it’s just too short to let the JIT do its job. So I only spent real time with the newer bench. First off, did you test with HotSpot Server or Client? That is always essential information in any Java benchmarking report :) as are any other relevant VM switches (32-bit vs. 64-bit too). I suppose it’s Client, 32-bit, no other switches. Maybe you could test with JDK 7u2-b07 and 6u29-b08; these have updated builds of HotSpot.
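To make both points concrete, here is a minimal sketch of a harness that runs the measured loop several times before timing it, and that prints which VM produced the numbers. The class name, method names, and the round count are assumptions for illustration, not part of either benchmark; the system properties used are the standard ones.

```java
// Illustrative sketch only; names and constants are hypothetical.
class WarmupSketch {
    static final int COUNT = 1000000;     // assumed size, smaller than the post's 10,000,000
    static final int WARMUP_ROUNDS = 20;  // assumed; enough passes for HotSpot to compile the loop

    // The measured work: write each index, then read everything back.
    static long fillAndSum(int[] data) {
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Describes the VM, so a report can say which one produced the numbers.
    static String vmDescription() {
        return System.getProperty("java.version")
                + " / " + System.getProperty("java.vm.name")
                + " (" + System.getProperty("java.vm.version") + "), "
                + System.getProperty("os.arch");
    }

    // Runs untimed warm-up rounds first, then times a single pass.
    static double timedRunMillis(int[] data) {
        long sink = 0;
        for (int r = 0; r < WARMUP_ROUNDS; r++) {
            sink += fillAndSum(data);
        }
        long start = System.nanoTime();
        sink += fillAndSum(data);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.println(sink); // keeps the results observable to the JIT
        }
        return elapsed / 1e6;
    }

    public static void main(String[] args) {
        System.out.println(vmDescription());
        System.out.printf("fill + sum after warm-up: %f ms%n", timedRunMillis(new int[COUNT]));
    }
}
```

Printing the VM description next to every result table answers the Client/Server and 32-bit/64-bit questions up front.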
The most important factor for this benchmark is intrinsic optimization of DirectBuffer methods such as get*() and put*(). HotSpot Server fully optimizes this, meaning that it generates inline code for these operations instead of making JNI invocations to native libraries (bad not only because of JNI overhead, but because the invoked code cannot be inlined into the caller). HotSpot Client traditionally sucked on java.nio because it lacked intrinsification of these methods, but I think they are gradually fixing this. With these latest EA builds, you will find that JDK 6 and 7 have basically the same performance. Comparing Client to Server, the Server VM still has a significant lead in several tests, but the gap is closing.
Actually it is Server, 64-bit; I thought I mentioned that, but it got lost in an edit :-) Specifically:
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
The (even odder) odd thing is that even if I change the second, currently longer, benchmark to do nothing but the exact same timings as the first, it’s still much faster. (Just completely deleting the other tests in the source file.) I’ve been looking at the disassembly and nothing is jumping out at me… the only thing I can think of is that the codegen and memory alloc just happen to get lucky and generate better-aligned code/ops, but the perf difference is too big to be explained by that alone. I can force the interpreter and both benchmarks get significantly slower, as expected; hotspot is definitely kicking in. (I can also see that if I trace method compilation etc.)
I’ll try the early access version of 7u2, though. Were you able to reproduce the speed difference between the two benchmarks with the same VM?