Lecture 7: Bloom Filters

Attribution

Much of the content from these notes is taken directly or adapted from the notes of the same course taught by Dr. Andrew Forney available at forns.lmu.build.

Introduction

Spellchecking, as seen in the previous lecture, is only required after we’ve determined that a word has been misspelled. As we will explore in this lecture, it is a separate problem entirely to determine which words have been misspelled, and to do so in a way that is quick and efficient.

This problem is a special case of a class of problems with a similar flavor:

How can we determine if a giant genetic sequence is a known strand or not?
Is the sender of an incoming email in your known contacts list, and can we tell without having to request that list from the server?
Has a URL shortener already allocated a given short URL to someone else before giving one to you?

All of these cases suggest the need for a data structure that can quickly determine if a value belongs to a set without needing to store the individual items.

Question

What kind of data structure or implementation might be useful for these problems?

If you guessed a HashSet, you’re on the right track. However, unlike a HashSet, which stores the objects to test for membership, we don’t necessarily need to store the individual items if we can develop a filter to determine whether or not values belong to a hypothetical set.

Having such a filter is useful in the same way that a lawyer having a receptionist is useful: consulting the lawyer is expensive, so knowing whether or not they can help you before paying their fees would be a big savings!

By the same intuition, it is sometimes helpful to be able to test for properties of a data structure without actually interacting with the data structure itself.

This is usually the case when:

The dataset is located in some remote location like a server, and roundtrip communication is expensive.
The dataset is large, and it can be memory-prohibitive to load all at once for purposes of querying.

Intuitions

The primary objective at hand is to find a space- and time-efficient data structure and algorithm for testing set membership without needing to consult the actual set.

Let’s start by refreshing our memory on what a hash function is and how it is used in the context of a hash table.

Question

What was a hash function’s purpose in the context of a hash table?

A hash function provided a means of converting the fields of a given object into an integer index corresponding to a bucket in which to store or find that key.

For example, consider a hash table storing strings with a hash function $f(s) = s.length() \bmod b$ for $b = 8$.
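As a minimal sketch, the length-based hash function above might look like the following in Python. The function name `length_hash` and the sample words are assumptions made for illustration only.

```python
def length_hash(s: str, b: int = 8) -> int:
    """Toy hash from the text: f(s) = length of s, modulo the bucket count b."""
    return len(s) % b

# Each word is assigned to one of b = 8 buckets based only on its length.
for word in ["cat", "dog", "bloom", "hashtable"]:
    print(word, "->", length_hash(word))
# cat -> 3
# dog -> 3
# bloom -> 5
# hashtable -> 1
```

Note that "cat" and "dog" land in the same bucket: any two strings of equal length collide under this function, which is why it is only suitable as an illustration.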

A hash function $f$ , as we saw in the context of hash tables, provides a semi-unique numerical fingerprint for a particular key based on its data:

$f (key) = index$

If we were to hash a key using multiple hash functions $f_{1}, f_{2}$, then the resulting tuple of hash indexes is a more distinctive fingerprint, and two different keys are less likely to collide on the whole tuple than under any one hash function alone:

$(f_{1} (key), f_{2} (key)) = (index_{1}, index_{2})$
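To make the tuple idea concrete, here is a small sketch with two toy hash functions. Both `f1` (string length) and `f2` (sum of character codes) are hypothetical choices for illustration, not functions prescribed by these notes.

```python
def f1(key: str, b: int = 8) -> int:
    """First toy hash: length of the key, modulo b."""
    return len(key) % b

def f2(key: str, b: int = 8) -> int:
    """Second toy hash: sum of the character code points, modulo b."""
    return sum(ord(c) for c in key) % b

def fingerprint(key: str) -> tuple:
    """Pair of indexes (f1(key), f2(key)) for the key."""
    return (f1(key), f2(key))

print(fingerprint("cat"))  # (3, 0)
print(fingerprint("dog"))  # (3, 2)
```

The two keys collide under `f1` alone, but their tuples differ, so the pair of indexes tells them apart.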

Rather than have buckets that store the keys themselves (which can take up a lot of space, as in a set), what if we merely stored whether or not a contained item had been hashed to that bucket (requiring only 1 bit)?

Bloom Filters

To achieve the goal of testing set membership in large sets without needing to consult the actual set, we will explore a new programming paradigm: probabilistic programming.

In contrast to most of the algorithms we have studied so far, probabilistic programming does not guarantee a correct answer 100% of the time.

Bloom filters are space-efficient, probabilistic data structures used to test set membership, and they are used in a variety of applications with large sets that are read-heavy.

Components

Borrowing from the intuition of the hash table, we will still maintain state in our Bloom filters using an array of buckets. However, instead of storing our keys directly in those buckets, we will only need to store an array of bits.

Component 1: an array of $m$ bits, each initialized to 0, indicating that the filter is empty.
Component 2: a set of $k$ hash functions, each of which maps a stored key to one of the $m$ bits in the array.

Some notes on the above:

Choices of both $m$ and $k$ will depend on some other properties we’ll define later; for now, let’s just take for granted that we have an array of some length $m$ and some number of hash functions $k$ .
Each index of the array stores a single bit. By comparison, a single character of a string stored in a HashSet or Trie takes at least 8 bits, so the storage needed for arbitrary strings can grow arbitrarily large.
Because each “bucket” is so small, typically $m$ is much, much larger than $k$ .
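The two components can be sketched as a small Python class. This is only one possible layout: the class name `BloomFilterState`, the use of a `bytearray` (one byte per slot, for readability, rather than truly packed bits), and the salted SHA-256 scheme for deriving $k$ hash functions are all assumptions made for this example.

```python
import hashlib

class BloomFilterState:
    """Holds the two components: an m-slot bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        # Component 1: m slots, all 0, meaning the filter is empty.
        # (One byte per slot here; a real implementation would pack 8 bits per byte.)
        self.bits = bytearray(m)

    def indexes(self, key: str) -> list:
        """Component 2: k hash functions, each mapping the key into [0, m - 1]."""
        result = []
        for i in range(self.k):
            # Salting the input with i makes each round act as a different hash function.
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            result.append(int.from_bytes(digest[:8], "big") % self.m)
        return result

state = BloomFilterState(m=64, k=3)
print(sum(state.bits))         # 0: no bits set yet
print(state.indexes("hello"))  # three indexes, each between 0 and 63
```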

Assuming we have these two pieces, the operations are straightforward.

Operations

Inserting a key into a Bloom filter is a simple, 2-step process:

For each of the $k$ hash functions, obtain an index for the key in the range $[0, m - 1]$: $(f_{1}(key), f_{2}(key), \dots, f_{k}(key))$
Set the bit at each of those found indexes to 1.

Intuitively, setting these bits to 1 is like leaving a “breadcrumb” that we’ve hashed a key into this position in the past.
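The two insertion steps might be written as follows. This is a sketch under assumptions: the helper `hash_indexes` and its salted SHA-256 scheme are hypothetical stand-ins for whatever $k$ hash functions a real filter would use, and the bit array is a plain Python list for readability.

```python
import hashlib

def hash_indexes(key: str, m: int, k: int) -> list:
    """Return k indexes in [0, m - 1] for the key (hypothetical salted SHA-256 scheme)."""
    indexes = []
    for i in range(k):
        # Salting with i makes each round behave like a different hash function.
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        indexes.append(int.from_bytes(digest[:8], "big") % m)
    return indexes

def insert(bits: list, key: str, k: int) -> None:
    """Step 1: hash the key k times. Step 2: set each resulting bit to 1."""
    for index in hash_indexes(key, len(bits), k):
        bits[index] = 1

bits = [0] * 8           # m = 8, all zeros: an empty filter
insert(bits, "A", k=2)
insert(bits, "B", k=2)
print(bits)              # at most 4 bits are set; fewer if any indexes coincide
```

Notice that the key itself is never stored; only its breadcrumbs remain in the bit array.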

Consider hashing two strings, $A$ and $B$ , into a Bloom filter with $m = 8$ bits (very small, but good for illustration) and with $k = 2$ hash functions $f_{1}, f_{2}$ that produce the indexes shown above.

Question

What do you notice that’s troubling with the example above? Where do you foresee problems down the line?

The two strings had a collision with one of their hash functions at index 4, which will make it difficult to disentangle which keys were responsible for setting which bits to 1.

To illustrate this problem, let’s consider the other primary operation: querying a Bloom filter to determine whether or not a key is contained within.

This is a similar 2-step process to insertion:

For each of the $k$ hash functions, obtain an index for the key in the range $[0, m - 1]$: $(f_{1}(key), f_{2}(key), \dots, f_{k}(key))$
Examine the bits at each hashed index in the bit array:
- If any bit is 0: the key is certainly not contained within (or else it would’ve been set to 1 during insertion).
- If all bits are 1: the key is likely contained within (though not positive due to possible collisions).
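The query steps above can be sketched in the same style. As before, `hash_indexes` is a hypothetical salted SHA-256 helper and the function names are assumptions for illustration.

```python
import hashlib

def hash_indexes(key: str, m: int, k: int) -> list:
    """Return k indexes in [0, m - 1] for the key (hypothetical salted SHA-256 scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]

def insert(bits: list, key: str, k: int) -> None:
    """Set the bit at each of the key's k hashed indexes."""
    for index in hash_indexes(key, len(bits), k):
        bits[index] = 1

def might_contain(bits: list, key: str, k: int) -> bool:
    """False means certainly absent; True means only probably present."""
    return all(bits[index] == 1 for index in hash_indexes(key, len(bits), k))

bits = [0] * 64
print(might_contain(bits, "apple", k=3))  # False: an empty filter rejects every key
insert(bits, "apple", k=3)
print(might_contain(bits, "apple", k=3))  # True: an inserted key is never reported absent
```

The asymmetry is visible in the return value: a `False` is a guarantee, while a `True` is only evidence.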

Herein lies the cost that Bloom filters pay for their spatial parsimony: they can exhibit false positives for certain keys that happen to hash to the set bits of other keys, whether or not they were inserted themselves.

Observe that if we were to query any one of strings $A$ , $B$ , $C$ , $D$ on the Bloom filter, we would end up with the following answers and their truth values:

Query	Result
$A$	True (True Positive)
$B$	True (True Positive)
$C$	False (True Negative)
$D$	True (False Positive)
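The pattern in this table can be reproduced with a toy filter. The index assignments below are hypothetical, chosen only so that $A$ and $B$ collide at index 4 and so that $D$ happens to hash onto bits set by other keys; they are not the actual hash values from the lecture's example.

```python
# Hypothetical (f1, f2) index pairs for each string, with m = 8 bits.
INDEXES = {"A": (1, 4), "B": (4, 6), "C": (2, 6), "D": (1, 6)}

bits = [0] * 8
inserted = {"A", "B"}            # only A and B are actually stored
for key in inserted:
    for index in INDEXES[key]:
        bits[index] = 1          # bits 1, 4, and 6 end up set

def might_contain(key: str) -> bool:
    """True only if every bit the key hashes to is set."""
    return all(bits[i] == 1 for i in INDEXES[key])

results = {key: might_contain(key) for key in "ABCD"}
for key in "ABCD":
    answer = results[key]
    correct = (answer == (key in inserted))
    label = ("true" if correct else "false") + (" positive" if answer else " negative")
    print(key, answer, label)
# A True true positive
# B True true positive
# C False true negative
# D True false positive
```

$D$ was never inserted, yet both of its bits (1 and 6) were set by $A$ and $B$ respectively, so the filter cannot distinguish it from a stored key.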

False Positives

Warning

False positives are the primary risk run by using a Bloom filter, so let’s take a deeper look at these.

Question

What will increase the likelihood of a false positive query? What will decrease it?

The more keys we store, the more bits will be flipped to 1, thus increasing our likelihood of false positives. However, assuming we have good hash functions (evenly distributing keys), a larger number of bits (i.e., $m$ ) will decrease that likelihood.

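This means that we can express the likelihood $p$ of a false positive in terms of $m$ (the number of bits), $n$ (the number of stored keys), and $k$ (the number of hash functions). Assuming each hash function spreads keys uniformly and independently over the bits, a given bit is still 0 after $n$ insertions with probability $(1 - \frac{1}{m})^{k \cdot n}$, and a false positive requires all $k$ queried bits to be 1:

$p = (1 - (1 - \frac{1}{m})^{k \cdot n})^{k}$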

Question

What is the false positive likelihood of a Bloom filter with 8 bits, 2 hash functions, and 2 stored keys?

$p = (1 - (1 - \frac{1}{8})^{2 \cdot 2})^{2} \approx 0.17$

A 17% false-positive rate is quite poor for most uses. Increasing $m$ substantially reduces that likelihood.

For example, doubling our number of available bits yields a better result:

$p = (1 - (1 - \frac{1}{16})^{2 \cdot 2})^{2} \approx 0.05$
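The two calculations above can be checked with a short helper. The function name `false_positive_rate` is an assumption; the formula is the one both worked examples instantiate, with $m$ bits, $n$ stored keys, and $k$ hash functions.

```python
def false_positive_rate(m: int, n: int, k: int) -> float:
    """p = (1 - (1 - 1/m)^(k*n))^k, assuming uniform, independent hash functions."""
    prob_bit_still_zero = (1 - 1 / m) ** (k * n)
    return (1 - prob_bit_still_zero) ** k

print(round(false_positive_rate(m=8, n=2, k=2), 2))   # 0.17
print(round(false_positive_rate(m=16, n=2, k=2), 2))  # 0.05
```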

Using the above equation, you can solve for the optimal values of $m$ and $k$ for a desired false-positive likelihood $p$ and an expected number of keys $n$!
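As a rough sketch of what solving for these values looks like, the commonly cited closed forms are $m = -\frac{n \ln p}{(\ln 2)^{2}}$ and $k = \frac{m}{n} \ln 2$. These are not derived in these notes; they rely on the approximation $(1 - \frac{1}{m})^{k n} \approx e^{-k n / m}$, and the function name and rounding choices below are assumptions.

```python
import math

def optimal_parameters(n: int, p: float) -> tuple:
    """Approximate bit count m and hash count k for n keys and target rate p."""
    # m = -n * ln(p) / (ln 2)^2, rounded up so the target rate is not exceeded.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest whole number of hash functions.
    k = max(1, round((m / n) * math.log(2)))
    return m, k

m, k = optimal_parameters(n=1000, p=0.01)
print(m, k)  # 9586 7
```

In this example, about 9.6 bits per key with 7 hash functions are enough to hold a 1% false-positive rate for 1,000 keys.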

Theoretical Guarantees and Miscellany

To conclude, here are some interesting properties of Bloom filters:

Time Complexity: $O(k)$ per insertion or query, assuming each of the $k$ hash functions runs in $O(1)$ time.
Space Complexity: $O(m)$ for the $m$ bits required to form the bit array.
Beyond this compact footprint, a Bloom filter of fixed size can accept an unbounded number of keys, though the chance of false positives grows with each insertion.
There are tons of variants of Bloom filters used in a variety of different contexts, but most use the above definition as a starting point.
One of the earliest Bloom filters applied to phone spell checking used only 32KB to store the entire dictionary!
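The property noted above, that a fixed-size filter keeps accepting keys while false positives become more likely, can be observed with a small experiment. This is a sketch under assumptions: the salted SHA-256 helper, the key names, and the sizes $m = 64$, $k = 2$ are all hypothetical choices.

```python
import hashlib

def hash_indexes(key: str, m: int, k: int) -> list:
    """Return k indexes in [0, m - 1] for the key (hypothetical salted SHA-256 scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]

m, k = 64, 2
bits = [0] * m
fill_history = []                      # fraction of bits set after each insertion

for i in range(200):
    for index in hash_indexes(f"key-{i}", m, k):
        bits[index] = 1
    fill_history.append(sum(bits) / m)

for n in (10, 50, 200):
    fill = fill_history[n - 1]
    # A fresh key is a false positive when all k of its bits are already set,
    # which happens with probability of roughly fill^k.
    print(f"n={n}: fill={fill:.2f}, estimated false-positive rate={fill ** k:.2f}")
```

The filter never runs out of room, but as the fill fraction approaches 1 it answers "probably present" for nearly every query and stops being useful.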
