Rob Dickerson

This post describes an implementation of BK-trees in Haskell. Code can be found on Github.

A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.

So what’s a metric space? A metric space is made up of two things:

A set \(E\) of elements, and
a function \(m\) that, given any two elements \(x,y \in E\), returns some distance \(d \in \mathbb{R}\) between them.

The distance function \(m\) is called the “metric”. There are some restrictions on how the metric must behave, but any metric that meets the qualifications can be combined with \(E\) to form a metric space. The restrictions are as follows:

\(\forall x,y \in E\), the metric \(m(x, y)\) must

Produce a non-negative value.
Produce zero if and only if \(x = y\).
Be equal to \(m(y, x)\) (in other words, \(m\) must be symmetric).
Satisfy the triangle inequality.

An example of a metric space is the set of integers with the metric \(m(x,y) = \vert x - y \vert\). The set of integers with metric \(m'(x,y) = \sqrt{x^2 - y^2}\) forms a different metric space.

You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.

BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem. (There are, however, better ways of tackling the spelling correction problem.)

Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have a child of “steal” (also distance 1).

To insert a node \(ins\) into a BK-tree, start with the root. Compute \(dist = m(root, ins)\). If the root already has a child \(c\) that is \(dist\) away, recursively insert \(ins\) as a child of \(c\). Otherwise, \(ins\) becomes the child of the root with distance \(dist\).

A tree constructed this way lets you prune entire branches as you search for elements within a distance. If the maximum distance in your search is \(d\_{max}\), you can ignore subtrees of the root greater than \(d\_{max}\) away. For children of the root with distance \(d\), you can ignore any of their children not within the range \(d \pm d\_{max}\); this follows from the metric’s adherence to the triangle inequality. The pruning cutoff range increases by the distance from node to node as you recurse down the tree.

For a more detailed walkthrough of the insertion and search algorithms, see this blog post.

As mentioned at the start of this post, my implementation of BK-trees is up on github. The exported parts of the module are:

type (Metric a) = a -> a -> Int
mkBkTree :: (Metric a) -> (BKTree a)
bkInsert :: (BKTree a) -> a -> (BKTree a)
bkLookup :: (BKTree a) -> Int -> a -> [a]

Using the BKTree module in the String/Levenshtein context looks like this:

import BKTree
-- From http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Haskell
levDistance :: Metric String
levDistance sa sb = last $ foldl transform [0..length sa] sb
  where
    transform xs@(x:xs') c = scanl compute (x+1) (zip3 sa xs xs')
    where
      compute z (c', x, y) = minimum [y+1, z+1, x + fromEnum (c' /= c)]

buildFromDictFile :: String -> IO (BKTree String)
buildFromDictFile fileName = do
  dictContents <- readFile fileName
  return $ foldl bkInsert (mkBkTree levDistance) (lines dictContents)

By building a BKTree from the Levenshtein distance metric and list of know dictionary words, we can write an easy function to suggest words one edit away from any given string:

*SpellCorrect> spellTree <- buildFromDictFile "dictionary.txt"
*SpellCorrect> let suggest = bkLookup spellTree 1
*SpellCorrect> suggest "strea"
["strep","strew","streak","stream","stria"]