Archive for the ‘Algorithms and Data Structures’ Category

Towards a scalable Bloom filter

August 7, 2016 2 comments

Recently I have given the tool „jdeps” a test ride. It’s a tool that is included with Java 8 and extracts packet dependencies of usual java code from jar files, classes, etc. It turned out that the Trove library that we ship with our application was used in very few places and considering the number of usages and its size after signing, did not seem worth the download for the customer.
So I replaced it where possible and came across our ID-manager class, that would hold a couple of long values in a TlongHashSet, and replacing gnu.trove.TLongHashSet with java.util.HashSet meant wrapping all those longs in Objects. That would waste quite some space. Being cautious I tested this change with the outrageous number of about 5 million IDs, just to see how memory consumption compares in the extreme. The result was interesting: TlongHashSet used about 116 megabytes and java.util.HashSet used about 339 megabytes of RAM to store these 5 million IDs. This number of IDs is not something we’d expect to happen in the real world and more realistic numbers suggest that while java.util.HashSet may be a viable solution, it’s still considerably more wasteful than TlongHashSet and memory should be used for better things. Unfortunately the ID manager is also restricted to long values; relaxing that constraint didn’t look like bad idea in the long run. The ID-manager’s job is to generate new IDs and it should pick the shortest one it can find, so the numbers don’t grow too large to keep file sizes low. It just needs to know if an ID is definitely not in use, but it is okay to mistakenly assume an ID is in use, while it’s not. It would be nice if IDs could be alphanumeric. This is a classic case for a Bloom filter. I thought about it for a while and decided that, due to the rather small amount of IDs that we use in our system, it wasn’t worth adding the code to our project and got on with more important things. But sometimes work haunts me. Back at home I could not get this Bloom filter idea out of my head, so I set out to explore this data structure. Read more…