Bloom Filters, or How I Learned PHP

I’ve completed almost all my degree requirements at SJSU and have started my two semester thesis. I’m still trying to narrow my thesis scope, but early on I picked out a good thesis advisor. This professor is in high demand among the graduate students, and no wonder: He’s extremely bright, an excellent instructor, and a prolific developer himself. He’s also extremely busy, and could take only two more graduate students. Over the Christmas break he held a coding competition with the two top submissions getting the spots.

The challenge was the following: Given a URL, implement (in PHP) a hashing function that produces a fixed length, approximately unique key that can still identify some of the “tokens” that created it.

To understand the problem, consider the following:

www.my-yahoo.com contains the tokens {www, my, yahoo, com}

So $key = hash(“www.my-yahoo.com”);
contains($key, “yahoo”); should respond TRUE but
contains($key, “google”); should respond FALSE

It might make sense to use the URL as its own hash, but it is pretty easy to find URLs that don’t work well with this approach (either too long or with lots of small tokens).

My solution uses a 16-byte key composed of a 4-byte preamble and a 12-byte Bloom filter. The 4-byte preamble is the first 4 bytes of the md4 digest of the entire url. I used md4 because it comes standard in PHP and is fast compared with the other standard hashes. The cryptographic aspect is irrelevant except that a good cryptographic hash will also show a very good uniform distribution over the keyspace. For my application, this means I can take a substring of the digest and still have a good uniform distribution. It would make sense to use a faster, simpler hash like FNV but as I was just starting with PHP and nervous about my poor grasp of its weak bit manipulation abilities I decided to stick with something reliable.

The Bloom filter is more interesting. Once again, Wikipedia is your friend, but the short version is that a Bloom filter is a bitfield, populated with the outputs of a number of different hash functions on each token inserted in to the Bloom. To add a token, one computes a hash or set of independent hashes for the token and then adds them in to the bitfield. As more tokens are added and more hash functions are used the bitfield “fills up” with ones, resulting in higher false positive rates because all the bits that would be set for a given token are already set by a combination of other tokens. That becomes a design tradeoff influencing the bitfield length, the number of hash functions used, and the number of tokens stored.

In the end, my keys are 16 bytes long, unique over the million URLs I tested, and have a pretty low false positive rate. I note in my writeup that although the general false positive rate is low, the worst Bloom filters have high error rates so in practice it makes sense to reduce the number of tokens added. I also tried using multiple different Blooms to increase the number of tokens contained (three 4-byte filters, two tokens each). As you can imagine, each filter is less “full”, but over the whole key the false positive rate goes up. I eventually discarded it.

It was interesting writing in PHP, something I had never done before. It was also a challenge to write a program that would really benefit from low level bit manipulation and strong typing in a language that doesn’t really make that sort of programming easy. PHP is sweet for string manipulation, though, and I ended up with a compromise I liked well enough. This compromise actually use a character to store each nibble (for conversion back and forth from strings to bytes), so it’s underlying representation is at least 32 bytes long. It should be straightfoward to pack these nibbles smaller, obviously, but I wanted to concentrate on getting the filter to work.

I’ve included my hashing class, my test apparatus, and my writeup. I’ll get around to adding some relevant links below as well. Be advised: the test code takes a long time to run and consumes a ton of memory, as I made my life simpler by keeping everything in memory, rather than sorting and merging on disk.

php_bloom

Leave a Comment


NOTE - You can use these HTML tags and attributes:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>