# Rob Dickerson

This post describes an implementation of BK-trees in Haskell. Code can be found on GitHub.

A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.

So what’s a metric space? A metric space is made up of two things:

1. A set $E$ of elements, and
2. a function $m$ that, given any two elements $x,y \in E$, returns some distance $d \in \mathbb{R}$ between them.

The distance function $m$ is called the “metric”. There are some restrictions on how the metric must behave, but any metric that meets the qualifications can be combined with $E$ to form a metric space. The restrictions are as follows:

$\forall x,y \in E$, the metric $m(x, y)$ must

• Produce a non-negative value.
• Produce zero if and only if $x = y$.
• Be equal to $m(y, x)$ (in other words, $m$ must be symmetric).
• Satisfy the triangle inequality.

An example of a metric space is the set of integers with the metric $m(x,y) = \vert x - y \vert$. The set of integers with the metric $m'(x,y) = \sqrt{\vert x - y \vert}$ (which also satisfies all four properties) forms a different metric space.

You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.
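A Levenshtein distance is short to write down in Haskell. The naive recursion below is a sketch of my own (not the repository's implementation); it runs in exponential time, which is fine for short words, though memoized versions exist:

```haskell
-- Naive Levenshtein ("edit") distance: the minimum number of single-character
-- insertions, deletions, and substitutions needed to turn one string into the
-- other. Exponential time, but fine for illustrating the metric.
lev :: String -> String -> Int
lev [] ys = length ys
lev xs [] = length xs
lev (x:xs) (y:ys)
  | x == y    = lev xs ys
  | otherwise = 1 + minimum [ lev xs (y:ys)   -- delete x
                            , lev (x:xs) ys   -- insert y
                            , lev xs ys       -- substitute x with y
                            ]
```

It is easy to check that this function is non-negative, zero exactly on equal strings, and symmetric, so it is a plausible metric for the examples that follow.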

BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem. (There are, however, better ways of tackling the spelling correction problem.)

Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child at any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have “steal” as a child (also distance 1).

To insert a node $ins$ into a BK-tree, start with the root. Compute $dist = m(root, ins)$. If the root already has a child $c$ that is $dist$ away, recursively insert $ins$ as a child of $c$. Otherwise, $ins$ becomes the child of the root with distance $dist$.
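A direct way to encode this is a node paired with a map from distance to subtree, which makes the one-child-per-distance rule structural. This is a sketch of my own, not necessarily the representation used in the repository:

```haskell
import qualified Data.Map as Map

-- Each node holds one element and a map from distance to the (at most one)
-- child subtree at that distance.
data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))
              deriving (Show, Eq)

empty :: BKTree a
empty = Empty

-- Insert an element using the supplied metric. If a child already exists at
-- the computed distance, recurse into it; otherwise attach a new leaf there.
insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _ x Empty = Node x Map.empty
insert m x t@(Node y children)
  | d == 0    = t  -- distance zero means the element is already present
  | otherwise = Node y (Map.alter step d children)
  where
    d = m x y
    step Nothing    = Just (Node x Map.empty)
    step (Just sub) = Just (insert m x sub)
```

`Map.alter` handles both cases of the algorithm in one place: a missing key means the new element becomes the child at that distance, and an occupied key means we recurse into the existing subtree.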

A tree constructed this way lets you prune entire branches as you search for elements within some distance $d_{max}$ of a query $q$. At each node $n$, compute $d = m(n, q)$ and report $n$ if $d \le d_{max}$. Then recurse only into children whose edge distance $k$ satisfies $\vert k - d \vert \le d_{max}$, i.e. $k \in [d - d_{max}, d + d_{max}]$; the triangle inequality guarantees that no match can live in any other subtree, so everything outside that range can be skipped.
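The search can be sketched as follows (the `BKTree` type repeats the definition from the insertion sketch so this snippet stands alone; again, names are mine rather than the repository's):

```haskell
import qualified Data.Map as Map

data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

-- All elements within dmax of the query q. At each node we report the node's
-- element if it is close enough, then recurse only into children whose edge
-- distance k satisfies |k - d| <= dmax, as justified by the triangle inequality.
query :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
query _ _    _ Empty             = []
query m dmax q (Node y children) = here ++ concatMap (query m dmax q) kept
  where
    d    = m q y
    here = [y | d <= dmax]
    kept = Map.elems (Map.filterWithKey (\k _ -> abs (k - d) <= dmax) children)
```

The pruning happens entirely in `Map.filterWithKey`: subtrees whose edge distance falls outside the window around $d$ are never visited.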

For a more detailed walkthrough of the insertion and search algorithms, see this blog post.

As mentioned at the start of this post, my implementation of BK-trees is up on GitHub. The exported parts of the module are:
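I won't reproduce the repository listing here, but an interface for such a module typically looks something like this (names are illustrative; check the repository for the real export list):

```haskell
module BKTree
  ( BKTree    -- the tree type, kept abstract
  , empty     -- an empty tree
  , insert    -- insert one element, given a metric
  , fromList  -- build a tree from a metric and a list of elements
  , query     -- all elements within a given distance of a query element
  ) where
```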

Using the BKTree module in the String/Levenshtein context is straightforward. By building a BKTree from the Levenshtein distance metric and a list of known dictionary words, we can write a simple function that suggests words within one edit of any given string:
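Since the repository code isn't inlined in this post, the following self-contained sketch puts the pieces together; `fromList`, `query`, and `suggest` are my names and may differ from the module's actual exports:

```haskell
import qualified Data.Map as Map

data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _ x Empty = Node x Map.empty
insert m x t@(Node y children)
  | d == 0    = t
  | otherwise = Node y (Map.alter step d children)
  where
    d = m x y
    step Nothing    = Just (Node x Map.empty)
    step (Just sub) = Just (insert m x sub)

-- Build a tree from a metric and a list of elements.
fromList :: (a -> a -> Int) -> [a] -> BKTree a
fromList m = foldr (insert m) Empty

-- All elements within dmax of the query, pruning by the triangle inequality.
query :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
query _ _    _ Empty             = []
query m dmax q (Node y children) = here ++ concatMap (query m dmax q) kept
  where
    d    = m q y
    here = [y | d <= dmax]
    kept = Map.elems (Map.filterWithKey (\k _ -> abs (k - d) <= dmax) children)

-- Naive Levenshtein distance (exponential, but fine for short words).
lev :: String -> String -> Int
lev [] ys = length ys
lev xs [] = length xs
lev (x:xs) (y:ys)
  | x == y    = lev xs ys
  | otherwise = 1 + minimum [lev xs (y:ys), lev (x:xs) ys, lev xs ys]

-- Suggest dictionary words at most one edit away from the given string.
suggest :: BKTree String -> String -> [String]
suggest dict w = query lev 1 w dict
```

For example, with a dictionary built from `["steam", "stream", "steal", "steel"]`, querying `suggest` with the misspelling `"stean"` yields `"steal"` and `"steam"`, while `"stream"` and `"steel"` (both two edits away) are filtered out.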