Asking the Right Questions: New Query Types for Active Learning

March 4, 2010
2:50 pm - 4:00 pm
Halligan 111
Speaker: Burr Settles, Carnegie Mellon University


The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer labeled training instances if it is allowed to choose the data from which it learns. In this talk, I present two recent active learning paradigms in which learning algorithms may pose novel types of "queries" to human annotators to great effect. We call these new paradigms "multiple-instance active learning" and "feature active learning."

In traditional active learning, a partially trained model selects new data instances to be labeled by a human annotator; these are added to the training set, and the process repeats. In a text classification task, for example, the learner might query for the labels of informative-looking documents. However, having a human read an entire document can be an inefficient use of time, particularly when only certain passages or keywords are relevant to the task at hand. Multiple-instance active learning addresses this problem by allowing the model to selectively obtain more focused labels at the passage level in cases where noisy document-level labels might be available (e.g., from hyperlinks or citation databases). Passage-level queries provide a direct training signal to the learner, and passages are less cumbersome for humans to read than whole documents. Likewise, feature active learning allows the learner to query for the labels of salient words (e.g., the query word "puck" might be labeled "hockey" in a sports article classification task), which naturally exploits the annotator's inherent domain knowledge. We show that such alternative query paradigms, especially when combined with intuitive user interfaces, can make more efficient use of human annotation effort.
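
For readers new to the area, the traditional loop described above (train, select an instance, ask the annotator, retrain) can be sketched in a few lines of Python. This is only an illustrative toy and not the speaker's method: the 1-D nearest-centroid model, the margin-based "most uncertain" selection rule, and the names train, uncertainty, active_learn, and oracle are all assumptions made for the example, with oracle standing in for the human annotator.

```python
def train(labeled):
    """Fit a toy 1-D nearest-centroid model: the mean of each class (0 and 1).

    Assumes the labeled set contains at least one example of each class.
    """
    sums = {0: [0.0, 0], 1: [0.0, 0]}
    for x, y in labeled:
        sums[y][0] += x
        sums[y][1] += 1
    return {y: total / count for y, (total, count) in sums.items()}


def uncertainty(model, x):
    """Higher when x is nearly equidistant from the two class centroids."""
    return -abs(abs(x - model[0]) - abs(x - model[1]))


def active_learn(pool, oracle, seed_set, budget):
    """Pool-based loop: retrain, query the most uncertain instance, repeat."""
    labeled = list(seed_set)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        model = train(labeled)
        query = max(pool, key=lambda x: uncertainty(model, x))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # the "human" supplies a label
    return train(labeled), labeled


# Hypothetical setup: the true class is 1 exactly when x > 0.5.
def oracle(x):
    return int(x > 0.5)


seed_set = [(0.0, 0), (1.0, 1)]            # one labeled example per class
pool = [i / 20 for i in range(1, 20)]      # unlabeled instances
model, labeled = active_learn(pool, oracle, seed_set, budget=5)
print(labeled[2:])                         # the five queried instances
```

In this toy setup the first query falls at 0.5, the point midway between the two seed examples, which is where the initial model is least certain. The query types discussed in the talk change what is being asked (a passage or a word rather than a whole instance), not this overall loop.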

[Joint work with Mark Craven, Soumya Ray, Gregory Druck, and Andrew McCallum.]

Bio: Burr Settles is a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University. He received a PhD from the University of Wisconsin-Madison in 2008 with a major in Computer Sciences and minors in Linguistics and Biology. His current research interests are focused on maximizing the use of unlabeled data and minimizing the cost of obtaining labeled data for applications in natural language processing and bioinformatics.