This post describes implementation of BK-trees in Haskell. Code can be found here.

A BK-tree is a type of metric tree, which is to say it’s a tree structure ideal for storing and retrieving elements of metric spaces.

So what’s a metric space? A metric space is made up of two things:

  1. A set \( E \) of elements, and
  2. a function \( m \) that, given any two elements \( x,y \in E \), returns some distance \( d \in \mathbb{R} \) between them.

The distance function \( m \) is called the “metric”. There are some restrictions on how the metric must behave, but any metric that meets the qualifications can be combined with \( E \) to form a metric space. The restrictions are as follows:

\( \forall x,y \in E \), the metric \( m(x, y) \) must

  • Produce a non-negative value.
  • Produce zero if and only if \(x = y \).
  • Be equal to \( m(y, x) \) (in other words, \( m \) must be symmetric).
  • Satisfy the triangle inequality.
An example of a metric space is the set of integers with the metric \( m (x,y) = x - y \). The set of integers with metric \( m’(x,y) = \sqrt{x^2 - y^2} \) forms a different metric space.

You can also form metric spaces over non-numerical sets. The set of strings with a Levenshtein distance (or “edit distance”) function forms a metric space, for example.

BK-trees exploit properties of a metric function to efficiently retrieve elements based on distance. For example, a spelling corrector might want to find all words within a Levenshtein distance of one from some misspelled word. BK-trees are designed for this type of problem. (There are, however, better ways of tackling the spelling correction problem.)

Nodes in a BK-tree can have an arbitrary number of children, but the trick is that each node can have at most one child any given distance from it. So, in our String/Levenshtein space, the word “steam” could have a child “stream” (distance 1); but if it did, it could not also have a child of “steal” (also distance 1).

To insert a node \( ins \) into a BK-tree, start with the root. Compute \( dist = m(root, ins) \). If the root already has a child \( c \) that is \( dist \) away, recursively insert \( ins \) as a child of \( c \). Otherwise, \( ins \) becomes the child of the root with distance \( dist \).

A tree constructed this way lets you prune entire branches as you search for elements within a distance. If the maximum distance in your search is \( d_{max} \), you can ignore subtrees of the root greater than \( d_{max} \) away. For children of the root with distance \( d \), you can ignore any of their children not within the range \( d \pm d_{max} \); this follows from the metric’s adherence to the triangle inequality. The pruning cutoff range increases by the distance from node to node as you recurse down the tree.

For a more detailed walkthrough of the insertion and search algorithms, see this blog post.

As mentioned at the start of this post, my implementation of BK-trees is up on github. The exported parts of the module are:

type (Metric a) = a -> a -> Int
mkBkTree :: (Metric a) -> (BKTree a)
bkInsert :: (BKTree a) -> a -> (BKTree a)
bkLookup :: (BKTree a) -> Int -> a -> [a]

Using the BKTree module in the String/Levenshtein context looks like this:

import BKTree
-- From http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance#Haskell
levDistance :: Metric String
levDistance sa sb = last $ foldl transform [0..length sa] sb
  where
    transform xs@(x:xs\') c = scanl compute (x+1) (zip3 sa xs xs\')
    where
      compute z (c\', x, y) = minimum [y+1, z+1, x + fromEnum (c\' /= c)]

buildFromDictFile :: String -> IO (BKTree String)
buildFromDictFile fileName = do
  dictContents <- readFile fileName
  return $ foldl bkInsert (mkBkTree levDistance) (lines dictContents)

By building a BKTree from the Levenshtein distance metric and list of know dictionary words, we can write an easy function to suggest words one edit away from any given string:

*SpellCorrect> spellTree <- buildFromDictFile "dictionary.txt"
*SpellCorrect> let suggest = bkLookup spellTree 1
*SpellCorrect> suggest "strea"
["strep","strew","streak","stream","stria"]

Know any fun metric spaces to play with?