Distinguished Colloquium: Near-Optimal Hashing Algorithm for the Approximate Nearest Neighbor Problem

September 26, 2007

2:50 pm - 4:00 pm

Halligan 111

Abstract

The nearest neighbor problem is defined as follows: given a set of n data points in a d-dimensional space, construct a data structure which, given a query point q, returns the data point closest to the query. The problem is of importance in multiple areas of computer science. Nearest neighbor algorithms can be used for: classification (in machine learning), similarity search (in biological, text or visual databases), data quantization (in signal processing and compression), etc. Most of these applications involve high-dimensional data. Unfortunately, classic algorithms for geometric search problems, such as kd-trees, do not scale well as the dimension increases. In recent years, there has been extensive research done on *approximate* variants of the nearest neighbor problem. One of the algorithms, based on the idea of Locality-Sensitive Hashing (LSH), has been successfully used in several applied scenarios.

In this talk I will review that work, as well as describe recent developments in this area. In particular, I will show a new LSH-based algorithm, whose running time significantly improves over the earlier bounds. The running time of the new algorithm is provably close to the best possible in the class of hashing-based algorithms.

This is joint work with Alex Andoni (MIT).