Finding Similar Items Under Edit Distance

November 15, 2018

3:00 PM

Halligan 102

Speaker: Sam McCauley, Wellesley College

Host: Diane Souvaine

Abstract

A fundamental problem in computer science is to, given a query text, find the most similar text in a large database. This similar text may differ by character replacements, insertions or deletions---in other words, we want the most similar item (or at least an approximately-most-similar item) under edit distance. This problem has widespread uses in text processing and computational biology. Previous approaches to this problem fell into three categories. Brute-force approaches search through the whole database, using very little preprocessing at the cost of large query time. Prefix-tree approaches can be much faster, but they have very large exponential terms, so they are only efficient when the desired text is very similar (has a small edit distance). Embedding approaches can overcome these large search costs regardless of the similarity of the desired text, but embeddings can only be used when the approximation factor is very large (in fact superconstant). I will be presenting a new method that overcomes these issues: a locality-sensitive hashing approach. This method uses a simple hash function, under which similar texts are more likely to hash together than dissimilar texts. Using this hash function, we can answer queries using a hash table lookup, achieving good bounds for all approximation factors and moderate distances to the most similar database text. I will also discuss some potential future directions of this approach, as well as some other recent work on related problems.

Bio

Sam McCauley is currently a post-doc at Wellesley College. He was previously a post-doc at IT University of Copenhagen, and next semester he will be visiting Bar Ilan University as a fellow in the Zuckerman STEM Leadership Program. He obtained his Ph.D. from Stony Brook University, advised by Michael Bender, and his B.A. in Computer Science and Mathematics from Tufts University. His research focuses on algorithms, particularly for problems inspired by practical large-scale databases. Recently he has been most interested in approximate membership queries and similarity search. His work has been recognized by multiple awards and fellowships, such as a best paper award at IPDPS, an NSF EAPSI Fellowship to perform research at National University of Singapore, and a Chateaubriand Fellowship to conduct research at ENS Lyon.