## Towards a scalable Bloom filter

Recently I have given the tool „jdeps” a test ride. It’s a tool that is included with Java 8 and extracts packet dependencies of usual java code from jar files, classes, etc. It turned out that the Trove library that we ship with our application was used in very few places and considering the number of usages and its size after signing, did not seem worth the download for the customer.

So I replaced it where possible and came across our ID-manager class, that would hold a couple of long values in a TlongHashSet, and replacing gnu.trove.TLongHashSet with java.util.HashSet meant wrapping all those longs in Objects. That would waste quite some space. Being cautious I tested this change with the outrageous number of about 5 million IDs, just to see how memory consumption compares in the extreme. The result was interesting: TlongHashSet used about 116 megabytes and java.util.HashSet used about 339 megabytes of RAM to store these 5 million IDs. This number of IDs is not something we’d expect to happen in the real world and more realistic numbers suggest that while java.util.HashSet may be a viable solution, it’s still considerably more wasteful than TlongHashSet and memory should be used for better things. Unfortunately the ID manager is also restricted to long values; relaxing that constraint didn’t look like bad idea in the long run. The ID-manager’s job is to generate new IDs and it should pick the shortest one it can find, so the numbers don’t grow too large to keep file sizes low. It just needs to know if an ID is definitely not in use, but it is okay to mistakenly assume an ID is in use, while it’s not. It would be nice if IDs could be alphanumeric. This is a classic case for a Bloom filter. I thought about it for a while and decided that, due to the rather small amount of IDs that we use in our system, it wasn’t worth adding the code to our project and got on with more important things. But sometimes work haunts me. Back at home I could not get this Bloom filter idea out of my head, so I set out to explore this data structure.

## How Bloom filters work

Bloom filters in its basic form, are one big array of bits and a couple of hash functions. If you add a value to a bloom filter, it will get hashed by a couple of hash functions and each hash serves as an index to the bit-array. The filter will set the bits at these indexes when adding an entry. To check if the filter contains a value, the bits at these indexes are checked. Since there is a possibility of hash collisions, two values may have the same hash. As a result a bloom filter will always return a correct answer, if it says: “I don’t know this value”, but it may mistakenly say: “Yes, I know that one”. Bloom filters use multiple hashes to get fewer collisions.

The optimal number of bits (m) in a bloom filter can be found by estimating the number of values inserted (n) and choosing a desired probability of false positives (p), using the function:

`m=-n*(ln p)/(ln 2)²`

.

The optimal number of hashes (k) relates to the number of bits (m), the number of values inserted (n) and is estimated using the function: `k = (m/n) ln(2)`

.

## Reducing the number of hash functions

Using the above functions I could estimate that for 5 million entries and 1% probability of false positives, I’d need 7 distinct hashes. Higher number of entries suggested up to 10 different hash functions. These are not easy to come by. Fortunately this issue has been investigated by Adam Kirsch and Michael Mitzenmacher at Harvard in 2006: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

According to this paper the number of hash functions (h) can be reduced to just two, if for a given entry x, hashes g0..gi are calculated as: `gi(x) = h1(x) + ih2(x) + i2 mod m`

, where m is the number of bits used by the filter.

I chose stick with Java’s default hash function for h1, so for h2 I looked for one that doesn’t need to be cryptographically strong, but properly spreads its values over the given spectrum. 32 bits would be enough for the moment, as this is multiplied with another hash anyway. I found a Java implementation of XXhash32 under the Apache Licence 2.0 by some guy named “tamtam180” that would probably be fine and modified it to be used with strings rather than byte[], as I wanted to avoid the costly encoding part and array-copying in Java’s string class:

/* * Copyright (C) 2012 tamtam180 * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. * * This class has been modified from the original to work with String instead of byte[] * to avoid having to deal with the slow String#getBytes()" method. * A minor performance tweak to use the prefix instead of postfix increment operator has been added. * A method function has been added to create a hash function instance with a fixed seed. */ package bloomfilter; import java.util.function.ToIntFunction; public final class XXHash { private static final int PRIME1 = (int) 2654435761L; private static final int PRIME2 = (int) 2246822519L; private static final int PRIME3 = (int) 3266489917L; private static final int PRIME4 = 668265263; private static final int PRIME5 = 0x165667b1; public static ToIntFunction<String> toIntFunction(final int seed) { return new ToIntFunction<String>() { @Override public int applyAsInt(String value) { return hash(value, seed); } }; } public static int hash(final String string, final int seed) { final int len = string.length(); if (len < 16) { return digestSmall(string, seed); } final int bEnd = len; final int limit = bEnd - 16; int v1 = seed + PRIME1; int v2 = v1 * PRIME2 + len; int v3 = v2 * PRIME3; int v4 = v3 * PRIME4; int i = 0; int crc = 0; while (i < limit) { v1 = Integer.rotateLeft(v1, 13) + string.charAt(i); i += 4; v2 = Integer.rotateLeft(v2, 11) + string.charAt(i); i += 4; v3 = Integer.rotateLeft(v3, 17) + string.charAt(i); i += 4; v4 = Integer.rotateLeft(v4, 19) + string.charAt(i); i += 4; } i = bEnd - 16; v1 += Integer.rotateLeft(v1, 17); v2 += Integer.rotateLeft(v2, 19); v3 += Integer.rotateLeft(v3, 13); v4 += Integer.rotateLeft(v4, 11); v1 *= PRIME1; v2 *= PRIME1; v3 *= PRIME1; v4 *= PRIME1; v1 += string.charAt(i); i += 4; v2 += string.charAt(i); i += 4; v3 += string.charAt(i); i += 4; v4 += string.charAt(i); v1 *= PRIME2; v2 *= PRIME2; v3 *= PRIME2; v4 *= PRIME2; v1 += Integer.rotateLeft(v1, 11); v2 += Integer.rotateLeft(v2, 17); v3 += Integer.rotateLeft(v3, 19); v4 += Integer.rotateLeft(v4, 13); v1 *= PRIME3; v2 *= PRIME3; v3 *= PRIME3; v4 *= PRIME3; crc = v1 + Integer.rotateLeft(v2, 3) + Integer.rotateLeft(v3, 6) + Integer.rotateLeft(v4, 9); crc ^= crc >>> 11; crc += (PRIME4 + len) * PRIME1; crc ^= crc >>> 15; crc *= PRIME2; crc ^= crc >>> 13; return crc; } private static int digestSmall(final String string, final int seed) { final int len = string.length(); final int bEnd = len; final int limit = bEnd - 4; int idx = seed + PRIME1 - 1; int crc = PRIME5; int i = 0; while (i < limit) { crc += string.charAt(i) + (++idx); crc += Integer.rotateLeft(crc, 17) * PRIME4; crc *= PRIME1; i += 4; } while (i < bEnd) { crc += (string.charAt(i) & 0xFFFF) + (++idx); crc *= PRIME1; ++i; } crc += len; crc ^= crc >>> 15; crc *= PRIME2; crc ^= crc >>> 13; crc *= PRIME3; crc ^= crc >>> 16; return crc; } }

Using these building blocks, adding values to a simple Bloom filter would be easy:

private void add(final String string) { // simulating optimum number of hashFunctions (i) using 2 real hash functions: // gi(x) = h1(x) + ih2(x) + i2 mod m, // see: https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf assert hashes > 0; final int h1 = string.hashCode(); final int h2 = hashFunction.hash(string); for (int i = 0; i < hashes; ++i) { final long g = restrict(h1 + i * h2 + i * i); setBit(g); } } private int restrict(final long hash) { return (hash & 0x7f_ffff_ffffL) % maxBits; }

Note that I’m restricting the range of the computed hash to 64 *Integer.MAX_VALUE, since we’re working on a long-array that captures 64 bits per index and we can have Integer.MAX_VALUE array positions.

## From a fixed Bloom filter to a scalable Bloom filter

So a simple, fixed bloom filter is quite easy to write: Using the given formulas, I estimated the number of bits I need for a certain number of inserts and allocated a long[] that contains at least so many bits:

public class BloomFilter { private static final int BITS_PER_ENTRY = Long.SIZE; private final long[] entries; private final long bits; private final long maxBits; private final int hashes; public BloomFilter(final int expectedNumberOfEntries, final double falsePositiveProbability) { bits = optimalNumOfBits(expectedNumberOfEntries, falsePositiveProbability); final long arrayLength = min(bits/ BITS_PER_ENTRY, Integer.MAX_VALUE -1L); //rounded to floor hashes = getOptimalNumberOfHashes(expectedNumberOfEntries, bits); assert hashes > 0; final int numEntries = (int) arrayLength + 1; // round up to cover all bits + some extra entries = new long[numEntries]; maxBits = (long)numEntries * (long)BITS_PER_ENTRY; } …

STOP! Yes, this will work, but unfortunately with our ID-manager it would be stupid to permanently, eagerly allocate about 5.7 megabytes for some wildly guessed “limit that we’d never cross” of 5 million entries and live with the fact that most entries would never be used.

A Bloom filter would need to start small and grow if it gets filled. Unfortunately, since a Bloom filter doesn’t remember the actual values, but just their hashes, I can’t just double the size of the bit array and “rehash” or move all entries. Even if I could, that would be a tedious time consuming operation that is to be avoided. One solution is to create a sequence of Bloom filters, with increasing capacity and tighter false positive probabilities. This was proposed in a paper by by Paulo Sérgio Almeida and Carlos Baquero at Minho University, Nuno Preguiça at Lisboa University and David Hutchison at Lancerster University:

http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf

This paper states that one can retain a stable false positive rate, if the number of bits in each subsequent bloom filter is doubled or quadrupled, while the false positive probability is tightened by a factor of 0.8 to 0.9 for each subsequent filter.

My implementation is straightforward: It starts with a rather small bloom filter for a maximum of 4096 entries with any given false positive rate. If that filter is saturated (the number of entries made grows larger than number of entries it was created for), it creates a sub-filter for double the amount of entries and a reduced probability for false positives. To find a value in a filter, it looks if it is present in either this or the sub filter. A value gets added to this filter until it is saturated, if it is saturated the value is added to the sub-filter. Since the sub-filter is of the same type it’s turtles all the way down: The sub filter will, if needed, create a bigger sub filter for itself, which will act the same.

## Optimization

Looking at the code I found that the ScalableBloomFilter would call the XXHash class unnecessarily often, because for 5 million Strings it was pushing each string of max 22 chars (like “id.9223372036854775807”) through a maximum of 11 stages, where in each stage the XXHash would be computed. The hashCode implementation of java.lang.String looked fine though, because it caches its result. So I took the strategy to compute the hashes once and pass them down to the methods that set and test the corresponding bits. As a result multiple stages would not compute the same hash over and over again. Interestingly this made no noticeable difference at all, so I reverted to the simpler code. Profiling the code with JVisualVM showed that most of the time was indeed spent in the “restrict” method that just does a modulo-operation and a bitwise “and” on the computed hash value to make the hash a valid index to the underlying array. This method is called for every bit that is set or queried. I guess what you can learn again is: What you think about performance is probably wrong. It turned out that the false positive rate was a way better knob to turn in order to optimize speed and memory consumption.

## Implementation

First a small functional interface, so one can use their own hash function:

package bloomfilter; @FunctionalInterface public interface StringHashFunction { int hash(String string, int seed); }

Following now is the complete source code for the scalable Bloom filter:

package bloomfilter; import static java.lang.Math.log; import static java.lang.Math.max; import static java.lang.Math.min; import static java.lang.Math.pow; import static java.lang.Math.round; import java.util.Arrays; /* * Copyright 2016 Wanja Gayk * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ /** * Implementation of a scalable Bloom filter. * * "A Bloom filter is a space-efficient probabilistic data * structure, conceived by Burton Howard Bloom in 1970, that is used * to test whether an element is a * member of a set. False positive matches are possible, but false * negatives are not, thus a Bloom filter has a 100% recall rate. In * other words, a query * returns either "possibly in set" or "definitely not in set". * Elements can be added to the set, but not removed (though this * can be addressed with a * "counting" filter). The more elements that are added to the set, * the larger the probability of false positives. * " [<a href = " https://en.wikipedia.org/wiki/Bloom_filter">https: * //en. wikipedia. org/wiki/Bloom_filter</>] * * Uses {@link XXHash} and {@link String#hashCode()} as default hash * functions. * * @author Wanja Gayk */ public class ScalableBloomFilter{ protected static final int DEFAULT_START_SIZE=4096; protected static final float MAX_SATURATION=0.38f; private static final int HASH_SEED=123; /** see: http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf */ private static final int GROWTH_RATE=2; private static final double FALSE_POSITIVE_TIGHTENING_RATIO=0.8; private static final int BITS_PER_ENTRY=Long.SIZE; private final long[] entries; private final long maxBits; private final int hashes; private final int maxNumberOfInsertions; private final double falsePositiveProbability; private ScalableBloomFilter subFilter; private int insertions; private final StringHashFunction hashFunction; /** * Creates a new Bloom Filter that tries to yield the given * false-positive probability. * * @param falsePositiveProbability * a value ]0..1[. Realistically a value > 0.3 will not * yield the desired result. */ public ScalableBloomFilter(final double falsePositiveProbability){ this(DEFAULT_START_SIZE,falsePositiveProbability,XXHash::hash); } /** * Creates a new Bloom Filter that tries to yield the given * false-positive probability. * * @param falsePositiveProbability * a value ]0..1[. Realistically a value > 0.3 will not * yield the desired result. * @param hashFunction * an alternative hashFunction, should be fast and different * from {@link String#hashCode()}. */ public ScalableBloomFilter(final double falsePositiveProbability, final StringHashFunction hashFunction){ this(DEFAULT_START_SIZE,falsePositiveProbability,hashFunction); } /** * @param maxNumberOfInsertions * the threshold that triggers the creation of a sub filter * @param falsePositiveProbability * a value ]0..1[. Realistically a value > 0.3 will not * yield the desired result. * @param hashFunction * an alternative hashFunction, should be fast and different * from {@link String#hashCode()}. */ private ScalableBloomFilter(final int maxNumberOfInsertions,final double falsePositiveProbability, final StringHashFunction hashFunction){ this.hashFunction=hashFunction; this.maxNumberOfInsertions=maxNumberOfInsertions; this.falsePositiveProbability=falsePositiveProbability; final long bits=optimalNumOfBits(maxNumberOfInsertions,falsePositiveProbability); final long arrayLength = min(bits/BITS_PER_ENTRY, Integer.MAX_VALUE - 1L); // rounded to floor hashes=getOptimalNumberOfHashes(maxNumberOfInsertions,bits); assert hashes > 0; final int numEntries=(int)arrayLength + 1; //round up to all bits + some extra entries=new long[numEntries]; maxBits=(long)numEntries * (long)BITS_PER_ENTRY; } /** * Adds a string to this filter. * * @param string * a string to add to the filter */ public void add(final String string){ if(isSaturated() && subFilter == null){ subFilter=new ScalableBloomFilter(maxNumberOfInsertions * GROWTH_RATE, falsePositiveProbability * FALSE_POSITIVE_TIGHTENING_RATIO,hashFunction); } if(subFilter != null){ subFilter.add(string); }else{ addLocal(string); } } /** * <b>Attention: may yield false positives!</b> * * @param string * a string to query * @return <tt>false</tt> if the string is not contained in this * filter, <tt>true</tt> if there is a high probability * that the string is contained. */ public boolean contains(final String string){ // simulating optimum number of hashFunctions (i) using 2 real // hash functions: // gi(x) = h1(x) + ih2(x) + i2 mod m, // see: // https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf assert hashes > 0; final int h1=string.hashCode(); final int h2=hashFunction.hash(string,HASH_SEED); boolean isRegistered=true; for(int i=0;isRegistered && i < hashes;++i){ final long g=restrict(h1 + i * h2 + (i * i)); isRegistered&=getBit(g); } if(subFilter != null && !isRegistered){ isRegistered|=subFilter.contains(string); } return isRegistered; } /** * Resets the filter, discarding all information about known * strings. THis will also trim the internal storage to its default * capacity, */ public void clear(){ Arrays.fill(entries,0l); subFilter=null; insertions=0; } private void addLocal(final String string){ // simulating optimum number of hashFunctions (i) using 2 real // hash functions: // gi(x) = h1(x) + ih2(x) + i2 mod m, // see: // https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf boolean previouslyAdded=true; assert hashes > 0; final int h1=string.hashCode(); final int h2=hashFunction.hash(string,HASH_SEED); for(int i=0;i < hashes;++i){ final long g=restrict(h1 + i * h2 + i * i); previouslyAdded&=setBit(g); } if(!previouslyAdded){ ++insertions; } } private boolean isSaturated(){ return insertions > maxNumberOfInsertions; } private long restrict(final long hash){ return (hash & 0x7f_ffff_ffffL) % maxBits; } /** * Sets the nth bit and returns its previous status; * * @param n * the index * @return <tt>true</tt> if it had previously been set */ private boolean setBit(final long n){ checkBounds(n); final int pos=getArrayIndex(n); final long mask=getMask(n); // tested: JIT properly inlines and removes multiple // /bounds checks and pos/mark calculations of getBits(n) final boolean old=getBit(n); entries[pos]=entries[pos] | mask; return old; } private boolean getBit(final long n){ checkBounds(n); final int pos=getArrayIndex(n); final long mask=getMask(n); return (entries[pos] & mask) != 0; } private static int getArrayIndex(final long n){ return (int)(n / BITS_PER_ENTRY); } private static long getMask(final long n){ return 1 << (n % BITS_PER_ENTRY); } private void checkBounds(final long n){ final long max=min(maxBits,(long)Integer.MAX_VALUE * BITS_PER_ENTRY); if(n < 0 || n >= max){ throw new IndexOutOfBoundsException("n < 0 || n >= " + max + "for n=" + n); } } /** * Computes the optimal number of hashes per element inserted in * Bloom filter), given the expected insertions and total number of * bits in the Bloom filter. * * See https://en.wikipedia.org/wiki/Bloom_filter# * Optimal_number_of_hash_functions for the formula. * * @param expectedInserts * a number > 0 * @param bits * number of bits in Bloom filter > 0 */ static int getOptimalNumberOfHashes(final long expectedInserts,final long bits){ // (m / n) * log(2), but avoid truncation due to division return max(1,(int)round((double)bits / expectedInserts * log(2))); } /** * Computes the optimal number of bits in the Bloom filter, which * is expected to achieve, for the specified expected insertions, * the required false positive * probability. * * See https://en.wikipedia.org/wiki/Bloom_filter# * Probability_of_false_positives for the formula. * * @param expectedInserts * a number > 0 * @param falsePositiveProbability * a number > 0 and < 1 */ static long optimalNumOfBits(final long expectedInserts,final double falsePositiveProbability){ return (long)(-expectedInserts * log(max(Double.MIN_VALUE,falsePositiveProbability)) / pow(log(2),2)); } }

## Results

I created a unit test that adds about 5 million random IDs, adds them to a filter and tries to find one of the lowest free IDs (which would pick the first value it tries, so this is almost a write only test). Using java.util.HashSet the test needed about 339 megabytes for the data, TLongHashSet used about 116 megabytes and the ScalableBloomFilter with a false positive probability of 0.01 needed just 14 megs. The test needed about 4 seconds with java.util.HashSet, 2 seconds with ScalableBloomFilter and gnu.trove.TLongHashSet topped the time sheet with about 0.9 seconds. The TLongHashSets performance looks impressive, but can be explained by the fact that hashing a simple long is a pretty easy job compared to hashing strings multiple times and setting several bits and testing the bits in several filter levels.

A second test contained a continuous region of 100,000 IDs (“id.0” to “id.99999”), followed by about 2.6 million numbers > 100_010. Looking for one of the lowest free IDs (trying all numbers sequentially, starting with 0) took about 1.1 seconds using the ScalableBloomFilter, 0.6 seconds using the java.util.HashSet and 0.5 seconds using gnu.trove.TLongHashSet. In this test I could push the Bloom filter’s false positive probability to 0.3 and get a result one could live with in about 0.8 seconds, using just 3.2 Mb of memory compared to 6.8 Mb with a rate of 0.01.

Test number three generates about 7.5 million IDs, where the first 5 million build a contiguous space. It showed about equal performance for the ScalableBloomFilters and java.util.HashSet with gnu.troveTLongHashSet way ahead in terms of speed. However, both could not match the memory efficiency of the scalable Bloom filter that kept its bit array at 14 megabytes. Pushing the false positive rate to 0.3 resulted in using just 6.8 Mb of memory, and faster operation than using a false positive rate of 0.01. Tuning the Bloom filter seems to be an art of its own. I took the strategy to shape the tests for our ID management class in a way that defined the quality of the results first, also exploring the edge case of bigger than expected numbers and then gradually relaxed the false positive rate of the bloom filter until I found the sweet spot of speed, memory consumption and result quality.

## tl;dr

A scalable Bloom filter is a relatively easy to create data structure that offers a performance comparable or faster than java.util.HashSet while using considerably less memory. As a drawback it allows for false positives when asking if a value has already been seen by the filter and it does not allow listing the added values, as it just stores their fingerprints.