# BK-Trees in Haskell

This post describes an implementation of BK-trees in Haskell. The code can be found here.

A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.

So what’s a metric space? A metric space is made up of two things:

- A set \( E \) of elements, and
- a function \( m \) that, given any two elements \( x,y \in E \), returns some distance \( d \in \mathbb{R} \) between them.

The distance function \( m \) is called the “metric”. There are some restrictions on how the metric must behave, but any metric that meets the qualifications can be combined with \( E \) to form a metric space. The restrictions are as follows:

\( \forall x,y \in E \), the metric \( m(x, y) \) must

- Produce a non-negative value.
- Produce zero if and only if \(x = y \).
- Be equal to \( m(y, x) \) (in other words, \( m \) must be symmetric).
- Satisfy the triangle inequality.

An example of a metric space is the set of integers with the metric \( m (x,y) = | x - y | \). The set of integers with the metric \( m'(x,y) = \sqrt{| x - y |} \) forms a different metric space.

You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.

BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem. (There are, however, better ways of tackling the spelling correction problem.)

Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child at any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have “steal” (also distance 1) as a child.

To insert a node \( ins \) into a BK-tree, start with the root. Compute \( dist = m(root, ins) \). If the root already has a child \( c \) at distance \( dist \), recursively insert \( ins \) into the subtree rooted at \( c \). Otherwise, \( ins \) becomes the child of the root at distance \( dist \).
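As a sketch, the insertion procedure might look like this in Haskell. The representation and names here are my own guesses for illustration, not necessarily those of the linked code:

```haskell
import qualified Data.Map.Strict as Map

-- A node holds an element plus its children, keyed by each child's
-- distance from that element. (Representation is illustrative only.)
data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

-- Insert an element given a metric. If a child already exists at the
-- computed distance, recurse into it; otherwise attach a new leaf.
insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _      x Empty = Node x Map.empty
insert metric x t@(Node y children)
  | d == 0    = t  -- x is already in the tree
  | otherwise = Node y (Map.insertWith recurse d leaf children)
  where
    d    = metric x y
    leaf = Node x Map.empty
    recurse _newLeaf existing = insert metric x existing
```

Note that `Map.insertWith` does exactly the case split the algorithm calls for: if no child exists at distance `d`, the new leaf is attached there; if one does, we recursively insert into it instead.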

A tree constructed this way lets you prune entire branches as you search for all elements within some maximum distance \( d_{max} \) of a query \( q \). At any node \( n \), compute \( d = m(q, n) \); the node \( n \) itself is a match if \( d \le d_{max} \). By the triangle inequality, any match in the subtree under a child of \( n \) must hang off an edge whose distance label lies in the range \( d \pm d_{max} \), so every child outside that range can be skipped along with its entire subtree. The same test is applied again at each node as you recurse down the tree.
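A range query following this pruning rule might be sketched as below (again with illustrative names, assuming a node holds an element and its distance-keyed children):

```haskell
import qualified Data.Map.Strict as Map

data BKTree a = Empty
              | Node a (Map.Map Int (BKTree a))

-- Collect every element within dmax of the query q, descending only
-- into children whose edge distance lies in [d - dmax, d + dmax].
elemsWithin :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
elemsWithin _      _    _ Empty = []
elemsWithin metric dmax q (Node y children) =
    here ++ concatMap (elemsWithin metric dmax q) survivors
  where
    d    = metric q y
    here = [y | d <= dmax]                      -- this node matches?
    survivors = [ c | (dc, c) <- Map.toList children
                    , dc >= d - dmax            -- triangle-inequality
                    , dc <= d + dmax ]          -- pruning window
```

In the worst case this still visits every node, but for small `dmax` over a well-spread metric the pruning window discards most branches early.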

For a more detailed walkthrough of the insertion and search algorithms, see this blog post.

As mentioned at the start of this post, my implementation of BK-trees is up on github. The exported parts of the module are:
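A module like this would plausibly export the tree type plus a small set of operations along these lines (hypothetical signatures only; see the repo for the real interface):

```haskell
-- Illustrative only; the actual exports may differ.
data BKTree a          -- the tree type, kept abstract

fromList    :: (a -> a -> Int) -> [a] -> BKTree a
insert      :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
elemsWithin :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
```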

Using the BKTree module in the String/Levenshtein context looks like this:
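The first ingredient is the metric itself. A standard dynamic-programming implementation of Levenshtein distance in Haskell (my own sketch; the repo may use a different one):

```haskell
-- Levenshtein (edit) distance via the classic row-by-row
-- dynamic-programming recurrence: each row corresponds to one more
-- character of t, and each cell to a prefix of s.
levenshtein :: String -> String -> Int
levenshtein s t = last (foldl nextRow firstRow t)
  where
    firstRow = [0 .. length s]
    nextRow prevRow tc =
      scanl step (head prevRow + 1) (zip3 s prevRow (tail prevRow))
      where
        step left (sc, diag, up) =
          minimum [ left + 1                            -- insertion
                  , up + 1                              -- deletion
                  , diag + if sc == tc then 0 else 1 ]  -- substitution
```

This function has exactly the shape a BK-tree metric needs: `String -> String -> Int`, so it can be passed wherever the tree expects a distance function.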

By building a BKTree from the Levenshtein distance metric and a list of known dictionary words, we can write an easy function to suggest words one edit away from any given string:
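Putting the pieces together, here is one self-contained way such a suggestion function could look. All names here are my own sketch; the module on github may differ:

```haskell
import qualified Data.Map.Strict as Map

-- Minimal BK-tree: children keyed by distance from the parent.
data BKTree a = Empty | Node a (Map.Map Int (BKTree a))

insert :: (a -> a -> Int) -> a -> BKTree a -> BKTree a
insert _      x Empty = Node x Map.empty
insert metric x t@(Node y children)
  | d == 0    = t
  | otherwise = Node y (Map.insertWith (\_ c -> insert metric x c)
                                       d (Node x Map.empty) children)
  where d = metric x y

fromList :: (a -> a -> Int) -> [a] -> BKTree a
fromList metric = foldl (flip (insert metric)) Empty

-- All elements within dmax of q, pruning via the triangle inequality.
elemsWithin :: (a -> a -> Int) -> Int -> a -> BKTree a -> [a]
elemsWithin _      _    _ Empty = []
elemsWithin metric dmax q (Node y children) =
    [y | d <= dmax] ++ concatMap (elemsWithin metric dmax q) kept
  where
    d    = metric q y
    kept = [c | (dc, c) <- Map.toList children, abs (dc - d) <= dmax]

levenshtein :: String -> String -> Int
levenshtein s t = last (foldl nextRow [0 .. length s] t)
  where
    nextRow prev tc = scanl step (head prev + 1) (zip3 s prev (tail prev))
      where step left (sc, diag, up) =
              minimum [left + 1, up + 1, diag + if sc == tc then 0 else 1]

-- Suggest dictionary words at most one edit from the input.
suggest :: BKTree String -> String -> [String]
suggest tree w = elemsWithin levenshtein 1 w tree
```

For example, with a tree built from `["steam", "stream", "steal", "stem", "tram"]`, `suggest tree "steem"` finds `"steam"` and `"stem"`, the two dictionary words a single edit away.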

Know any fun metric spaces to play with?