
This post describes an implementation of BK-trees in Haskell. The code can be found on GitHub.

A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.

So what’s a metric space? A metric space is made up of two things:

1. A set $$E$$ of elements, and
2. a function $$m$$ that, given any two elements $$x,y \in E$$, returns some distance $$d \in \mathbb{R}$$ between them.

The distance function $$m$$ is called the “metric”. There are some restrictions on how it must behave, but any function that meets them can be combined with $$E$$ to form a metric space. The restrictions are as follows:

$$\forall x,y \in E$$, the metric $$m(x, y)$$ must

• Produce a non-negative value.
• Produce zero if and only if $$x = y$$.
• Be equal to $$m(y, x)$$ (in other words, $$m$$ must be symmetric).
• Satisfy the triangle inequality.
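For concreteness, these axioms can be written down as Haskell predicates. The sketch below uses the absolute-difference metric on integers; the names (`metric`, `nonNegative`, and so on) are chosen here for illustration:

```haskell
-- A candidate metric on integers: absolute difference.
metric :: Int -> Int -> Int
metric x y = abs (x - y)

-- The first three axioms, phrased as testable predicates.
nonNegative, identity, symmetric :: Int -> Int -> Bool
nonNegative x y = metric x y >= 0
identity    x y = (metric x y == 0) == (x == y)
symmetric   x y = metric x y == metric y x

-- Triangle inequality: a detour through z can never be shorter.
triangle :: Int -> Int -> Int -> Bool
triangle x y z = metric x y <= metric x z + metric z y
```

A property-testing library such as QuickCheck could check these predicates over random inputs.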

An example of a metric space is the set of integers with the metric $$m(x,y) = \vert x - y \vert$$. The set of integers with the metric $$m'(x,y) = \sqrt{\vert x - y \vert}$$ (still non-negative, symmetric, and triangle-respecting) forms a different metric space.

You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.
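For reference, here is one standard dynamic-programming implementation of Levenshtein distance in Haskell (a sketch, not the implementation from the repository linked above):

```haskell
-- Classic dynamic-programming Levenshtein distance, one row at a time.
levenshtein :: String -> String -> Int
levenshtein a b = last (foldl step [0 .. length a] b)
  where
    -- Given the previous row and the next character of b, build the next row.
    step prev c = scanl compute (head prev + 1) (zip3 a prev (tail prev))
      where
        compute left (ca, diag, up)
          | ca == c   = diag                          -- characters match: no cost
          | otherwise = 1 + minimum [left, diag, up]  -- insert / substitute / delete
```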

BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem. (There are, however, better ways of tackling the spelling correction problem.)

Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have a child of “steal” (also distance 1).

To insert a node $$ins$$ into a BK-tree, start with the root. Compute $$dist = m(root, ins)$$. If the root already has a child $$c$$ that is $$dist$$ away, recursively insert $$ins$$ as a child of $$c$$. Otherwise, $$ins$$ becomes the child of the root with distance $$dist$$.
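One natural Haskell encoding, sketched below (the repository's actual definitions may differ), keys each node's children by their distance from that node, so the at-most-one-child-per-distance invariant falls directly out of the map. Insertion then follows the recipe above:

```haskell
import qualified Data.Map.Strict as Map

-- A node holds its element and a map from distance to the unique
-- child subtree at that distance.
data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

-- Insert an element, given the metric for the space.
insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _    x Empty = Node x Map.empty
insert dist x t@(Node y children)
  | d == 0    = t  -- distance zero means the element is already present
  | otherwise = Node y (Map.alter step d children)
  where
    d = dist x y
    step Nothing      = Just (Node x Map.empty)     -- no child at distance d yet
    step (Just child) = Just (insert dist x child)  -- recurse into existing child
```

Because $$m(x,y) = 0$$ only when $$x = y$$, inserting a duplicate leaves the tree unchanged.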

A tree constructed this way lets you prune entire branches when searching for all elements within some maximum distance $$d_{max}$$ of a query element $$q$$. At each node $$n$$, compute $$d = m(q, n)$$: the node itself matches if $$d \le d_{max}$$, and, by the triangle inequality, only children whose distance from $$n$$ falls within the range $$d \pm d_{max}$$ can contain matches, so every other subtree can be skipped. Applying this cutoff at every node as you recurse prunes large portions of the tree.
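In the map-based encoding, a range query reports the current node when it is close enough and recurses only into children whose distance labels survive the cutoff. A sketch (the `BKTree` type is repeated so the snippet stands alone):

```haskell
import qualified Data.Map.Strict as Map

data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

-- All elements within dmax of x, given the metric for the space.
query :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
query _    _    _ Empty = []
query dist dmax x (Node y children) = here ++ concatMap recur candidates
  where
    d     = dist x y
    here  = [y | d <= dmax]   -- y itself matches when it is close enough
    recur = query dist dmax x
    -- By the triangle inequality, only children whose distance label k
    -- satisfies |k - d| <= dmax can contain matches.
    candidates  = Map.elems (Map.filterWithKey inRange children)
    inRange k _ = abs (k - d) <= dmax
```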

For a more detailed walkthrough of the insertion and search algorithms, see this blog post.

As mentioned at the start of this post, my implementation of BK-trees is up on GitHub. The exported parts of the module are:
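The exact export list isn't reproduced here, but a module of this shape would plausibly expose the tree type plus build and query operations, along these lines (the names below are assumptions for illustration, not verbatim from the repository):

```haskell
module BKTree
  ( BKTree    -- the tree type, with constructors hidden
  , empty     -- an empty tree
  , insert    -- add an element, given the metric
  , fromList  -- build a tree from a list of elements
  , query     -- all elements within a given distance of a query
  ) where
```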

Using the BKTree module in the String/Levenshtein context is straightforward.

By building a BKTree from the Levenshtein distance metric and a list of known dictionary words, we can write a simple function to suggest words one edit away from any given string:
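Putting the pieces together, a self-contained sketch of that function might look like the following. The `fromList`, `suggest`, and `levenshtein` names are illustrative, not necessarily the repository's actual API:

```haskell
import Data.List (foldl', sort)
import qualified Data.Map.Strict as Map

data BKTree a = Empty | Node a (Map.Map Int (BKTree a))

insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _    x Empty = Node x Map.empty
insert dist x t@(Node y cs)
  | d == 0    = t  -- already present
  | otherwise = Node y (Map.alter (Just . maybe (Node x Map.empty) (insert dist x)) d cs)
  where d = dist x y

query :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
query _    _    _ Empty = []
query dist dmax x (Node y cs) =
    [y | d <= dmax]
      ++ concatMap (query dist dmax x)
                   (Map.elems (Map.filterWithKey (\k _ -> abs (k - d) <= dmax) cs))
  where d = dist x y

-- Standard dynamic-programming edit distance.
levenshtein :: String -> String -> Int
levenshtein a b = last (foldl step [0 .. length a] b)
  where
    step prev c = scanl compute (head prev + 1) (zip3 a prev (tail prev))
      where compute left (ca, diag, up)
              | ca == c   = diag
              | otherwise = 1 + minimum [left, diag, up]

-- Hypothetical helpers: build a tree from a word list, then suggest
-- every dictionary word within one edit of the input.
fromList :: (a -> a -> Int) -> [a] -> BKTree a
fromList dist = foldl' (flip (insert dist)) Empty

suggest :: BKTree String -> String -> [String]
suggest dict word = query levenshtein 1 word dict
```

Building the tree once and reusing it across lookups is the point of the structure: each query only touches the branches the triangle inequality cannot rule out.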