Online Algorithms for Extreme Clustering and Automated Integration of Crowdsourced Feedback

October 4, 2018
Halligan 102
Speaker: Ari Kobren, University of Massachusetts, Amherst
Host: Norman Ramsey

Abstract

Clustering algorithms are essential tools in data science, where they are used in a wide variety of applications such as identifying functionality similar genes, discovering themes in large text corpora, visualization and dimensionality reduction, among others. In the past decade, clustering problems have significantly increased in complexity and scale—not only in the number of points, but also the number of clusters. Many modern clustering methods scale well to a large number of data points, N, but not to a large number of clusters, K. In this talk, I will introduce PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K--a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time. Then, I’ll present GRINCH, a more powerful incremental clustering algorithm that can be used to cluster a dataset with respect to any function that computes the similarity between two point sets. The key component of Grinch is its Graft subroutine, which efficiently reconfigures the hierarchy to facilitate the discovery of clusters with complex structure. We introduce a general notion of separability, which we call model-based separation, and prove that when a dataset respects this type of separation, GRINCH provably recovers trees with perfect dendrogram purity. Finally, I will discuss preliminary work in using GRINCH for automated integration of crowdsourced human feedback provided with respect to clustering under identity uncertainty, i.e., the precise ground-truth targets of the user feedback are uncertain. In particular, I will focus on entity resolution--a prevalent instance of extreme clustering.

Bio

Ari Kobren is Ph.D. student at the University of Massachusetts Amherst doing research in machine learning advised by Prof. Andrew McCallum. His research is focused on the design of large-scale algorithms for clustering and on mechanisms for automated integration of human feedback. Kobren’s work has been applied to information extraction, information integration and natural language processing. He has done two internships at Google: during the first he worked on the Google Knowledge Graph and during the second he worked on Google Maps. He is a recipient of the National Science Foundation Graduate Research Fellowship and the Paul Utgoff Memorial Scholarship. Prior to UMass, Ari earned a BS from Tufts University.