A common data problem companies face is dealing with inexact matches: humans are inconsistent when entering data, and computers are not naturally good at recognizing things that are almost, but not quite, the same. Even the algorithms that did exist for identifying similar items didn't scale efficiently to comparisons across large lists. I needed to quickly process datasets ranging from tens of thousands to millions of items, the lists of known/standard values were often in the thousands, and the causes of non-standardization varied (typos, abbreviations, omitted unimportant words, altered word order, etc.), so no single algorithm produced acceptable results at acceptable speed.
My solution was a synthesized, two-stage approach: first, pre-computed indexes (O(n)) built from character n-grams and phonetic keys find candidate matches at lightning speed; then slower but more accurate algorithms (O(n^2)) such as Damerau-Levenshtein edit distance, Jaccard word-set overlap, and custom difference-probability comparisons rank the candidates and pick the best match(es).
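To make the two stages concrete, here is a minimal Python sketch of the idea: a character-trigram inverted index narrows each query to a small candidate set, and a blended edit-distance/word-overlap score ranks those candidates. It is illustrative only, not the production code: the function names (build_ngram_index, best_match, etc.) and the weighting are assumptions, it uses plain Levenshtein rather than Damerau-Levenshtein for brevity, and it omits the phonetic index and the custom difference-probability scoring.

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Character n-grams of a lowercased, padded string."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_ngram_index(standard_values, n=3):
    """Inverted index: n-gram -> set of standard values containing it (built once, O(n))."""
    index = defaultdict(set)
    for value in standard_values:
        for gram in ngrams(value, n):
            index[gram].add(value)
    return index

def candidates(query, index, n=3):
    """All standard values sharing at least one n-gram with the query."""
    found = set()
    for gram in ngrams(query, n):
        found |= index.get(gram, set())
    return found

def levenshtein(a, b):
    """Plain edit distance via dynamic programming (the slow, accurate stage)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_words(a, b):
    """Word-set overlap, tolerant of reordered or omitted words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def best_match(query, index, edit_weight=0.6):
    """Score only the indexed candidates and return (score, best value)."""
    scored = []
    for cand in candidates(query, index):
        edit_sim = 1 - levenshtein(query.lower(), cand.lower()) / max(len(query), len(cand))
        score = edit_weight * edit_sim + (1 - edit_weight) * jaccard_words(query, cand)
        scored.append((score, cand))
    return max(scored, default=(0.0, None))

# Hypothetical usage with made-up standard values:
standards = ["Acme Corporation", "Globex Industries", "Initech LLC"]
index = build_ngram_index(standards)
print(best_match("Acme Corp.", index))   # -> (score, "Acme Corporation")
```

The key design point is that the expensive pairwise comparisons only ever run against the handful of candidates the index returns, not against the full list of standard values.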