
Speed-Testing DirectByteBuffers and the Unsafe class against plain old arrays

I’ve long been a fan of Martin Thompson’s talks on YouTube and of his posts on the Mechanical Sympathy blog. If you dig into the world of people who care about high performance in Java, you will inevitably come across the technique of optimizing memory access using direct ByteBuffers and the sun.misc.Unsafe class. But how do they stack up? Is it true that using them gives you better performance in general?

Well, I tried to have a look: I took the Java Microbenchmark Harness (JMH) and my Bloom filter implementation, and modified the filter to use a direct ByteBuffer and the Unsafe class so I could compare both variants against the original implementation, which uses a long[].
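The three variants differ only in where the filter’s bit array lives. As a rough sketch of the idea (the class and method names here are illustrative, not the actual implementation), setting and testing a single bit against each backing store looks like this:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

import sun.misc.Unsafe;

// Illustrative sketch only: how a single bit is set/tested against each
// of the three backing stores. Names are hypothetical, not the real code.
final class BitAccessSketch {

 // 1) plain long[]: index the 64-bit word, then mask the bit within it
 static void set(long[] words, long bit) {
  words[(int) (bit >>> 6)] |= 1L << (bit & 63);
 }

 static boolean get(long[] words, long bit) {
  return (words[(int) (bit >>> 6)] & (1L << (bit & 63))) != 0;
 }

 // 2) direct ByteBuffer: absolute get/put of the byte containing the bit
 static void set(ByteBuffer buf, long bit) {
  int idx = (int) (bit >>> 3);
  buf.put(idx, (byte) (buf.get(idx) | (1 << (bit & 7))));
 }

 static boolean get(ByteBuffer buf, long bit) {
  return (buf.get((int) (bit >>> 3)) & (1 << (bit & 7))) != 0;
 }

 // 3) Unsafe: raw byte reads/writes against an address from allocateMemory()
 static void set(Unsafe unsafe, long base, long bit) {
  long addr = base + (bit >>> 3);
  unsafe.putByte(addr, (byte) (unsafe.getByte(addr) | (1 << (bit & 7))));
 }

 static boolean get(Unsafe unsafe, long base, long bit) {
  return (unsafe.getByte(base + (bit >>> 3)) & (1 << (bit & 7))) != 0;
 }

 // Unsafe is not part of the public API; grab the singleton via reflection
 static Unsafe theUnsafe() throws Exception {
  Field f = Unsafe.class.getDeclaredField("theUnsafe");
  f.setAccessible(true);
  return (Unsafe) f.get(null);
 }
}
```

Note that the long[] variant gets bounds checks and GC-managed memory for free, while the Unsafe variant must allocate with allocateMemory() and release with freeMemory() itself.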

The benchmark setup

Here’s the source code for the benchmark using JMH.
With almost no experience using JMH, I’m not entirely sure that the settings make perfect sense (especially the scope), so if you know better than I do, please leave a comment.

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class BloomFilterBenchmark {

 @State(Scope.Thread)
 public static class ThreadState {

  ScalableBloomFilter scalableBloomFilter;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long writes;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long reads;

  @Setup(Level.Invocation)
  public void doSetup() {
   scalableBloomFilter = new ScalableBloomFilter(0.01);
  }

  @TearDown(Level.Invocation)
  public void doTearDown() {
   scalableBloomFilter.close();
  }
 }

 @State(Scope.Thread)
 public static class ThreadStateDirect {
  ScalableBloomFilterDirectByteBuffer scalableBloomFilterDirect;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long writes;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long reads;

  @Setup(Level.Invocation)
  public void doSetup() {
   scalableBloomFilterDirect = new ScalableBloomFilterDirectByteBuffer(0.01);
  }

  @TearDown(Level.Invocation)
  public void doTearDown() {
   scalableBloomFilterDirect.close();
  }
 }

 @State(Scope.Thread)
 public static class ThreadStateUnsafe {

  ScalableBloomFilterUnsafe scalableBloomFilterUnsafe;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long writes;

  @Param({ "0", "10", "100", "1000", "10000", "100000", "1000000", "10000000", "100000000" })
  long reads;

  @Setup(Level.Invocation)
  public void doSetup() {
   scalableBloomFilterUnsafe = new ScalableBloomFilterUnsafe(0.01);
  }

  @TearDown(Level.Invocation)
  public void doTearDown() {
   scalableBloomFilterUnsafe.close();
  }
 }

 @Benchmark
 public void baseline() {
  // intentionally left blank
 }

 @Benchmark
 public Object readWrite(final ThreadState threadState, final Blackhole bh) {
  for (long t = 0; t < threadState.writes; ++t) {
   threadState.scalableBloomFilter.add(String.valueOf(t));
  }
  for (long t = 0; t < threadState.reads; ++t) {
   bh.consume(threadState.scalableBloomFilter.contains(String.valueOf(t)));
  }
  return threadState.scalableBloomFilter;
 }

 @Benchmark
 public Object readWriteDirectByteBuffer(final ThreadStateDirect threadState, final Blackhole bh) {
  for (long t = 0; t < threadState.writes; ++t) {
   threadState.scalableBloomFilterDirect.add(String.valueOf(t));
  }
  for (long t = 0; t < threadState.reads; ++t) {
   bh.consume(threadState.scalableBloomFilterDirect.contains(String.valueOf(t)));
  }
  return threadState.scalableBloomFilterDirect;
 }

 @Benchmark
 public Object readWriteUnsafe(final ThreadStateUnsafe threadState, final Blackhole bh) {
  for (long t = 0; t < threadState.writes; ++t) {
   threadState.scalableBloomFilterUnsafe.add(String.valueOf(t));
  }
  for (long t = 0; t < threadState.reads; ++t) {
   bh.consume(threadState.scalableBloomFilterUnsafe.contains(String.valueOf(t)));
  }
  return threadState.scalableBloomFilterUnsafe;
 }

 public static void main(final String[] args) throws RunnerException {
  final Options opt = new OptionsBuilder().include(".*" +  BloomFilterBenchmark.class.getSimpleName() + ".*")//
   .jvmArgs("-Xms3g", "-Xmx3g")//
   .threads(Runtime.getRuntime().availableProcessors())//
   .forks(1)//
   .build();
  new Runner(opt).run();
 }
}

 
JVisualVM’s sampler suggests I’m not merely measuring the String.valueOf() calls, so it looks like the benchmark measures the right thing:

[Screenshot: BloomFilterBench_sampler]

Note that I’m not programming against an interface either, because I want to avoid pitting a class that benefits from monomorphic call-site optimization (where the JIT inlines a virtual call for which the class loader knows no other implementation) against a class that suffers from megamorphic calls: once a third implementation of a method is loaded, the JIT has to roll back the optimization (the inlining) and turn the call into a true virtual method call. For short methods, an inlined, monomorphic call will always be faster than a megamorphic one. So instead of testing an interface with several implementations, each benchmark runs against a standalone class with no interface and no inheritance from another implementation.
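To make the distinction concrete, here is a hypothetical setup (not the benchmark code) showing the interface-based design I deliberately avoided. A shared call site that only ever sees one implementation can be inlined; once a third implementation reaches it, the JIT deoptimizes it into a true virtual dispatch:

```java
// Hypothetical illustration of a call site going megamorphic.
// While only ArrayBits is ever loaded, the JIT can inline bits.get(i)
// in countSet() (monomorphic). A second implementation still allows a
// cheap bimorphic guard; the third forces a true virtual dispatch.
interface Bits {
 boolean get(long i);
}

class ArrayBits implements Bits {
 private final long[] words = new long[16];
 public boolean get(long i) {
  return (words[(int) (i >>> 6)] & (1L << (i & 63))) != 0;
 }
}

class BufferBits implements Bits {  // second impl: call site goes bimorphic
 public boolean get(long i) { return false; }
}

class UnsafeBits implements Bits {  // third impl: call site goes megamorphic
 public boolean get(long i) { return false; }
}

final class CallSiteDemo {
 // one shared call site exercised with all three implementations
 static int countSet(Bits bits, long n) {
  int c = 0;
  for (long i = 0; i < n; i++) {
   if (bits.get(i)) { // this virtual call is what the JIT profiles
    c++;
   }
  }
  return c;
 }
}
```

By giving each filter its own standalone class with its own (non-virtual) call sites, all three benchmarks get the same inlining treatment.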

Results

Benchmarking with JDK 8 on my old 2.8 GHz Intel Core i7-860 with 4 GB of DDR3-1333 CL9 RAM took 8 days and 04:43:39 hours, with a minimum of other programs running and without touching the machine except to check the progress. So I was pretty eager to see the results. The outcome was a little underwhelming, though. I left out the extreme values for 100000000 reads/writes in this graph, as they don’t make a difference other than stretching it.

Without further ado, here is a plot of the JMH-result:

[Plots: BloomFilter_Timing and BloomFilter_Timing2]
Graphs made using https://plot.ly/

One can see that with a large number of writes the filter slows down considerably, which is to be expected when a lot of memory must be allocated and examined. However, the scalable Bloom filter behaves pretty linearly, which is a good sign.

The interesting part, though, is that the direct ByteBuffer and Unsafe variants are pretty much equally fast, while the original long[] implementation is a good bit faster than the other two – and the difference grows larger the more memory is used. This is almost the opposite of what I was expecting.
