PhD Defense: Strategic Targeting of Outliers for Expert Review

June 14, 2010
10:00 am - 12:00 pm
Halligan 111A
Speaker: Umaa Rebbapragada, Tufts University


In the era of massive data sets, it is difficult for domain scientists to interact directly with their own data. Because the analysis of single examples may yield insights in the research process, it is important to use automated methods to highlight potentially interesting phenomena when hand inspection is not possible. This dissertation examines a particular sub-case of this problem: the use of machine learning to direct an expert's attention to potentially informative outliers. Outliers warrant study in both a positive and a negative sense. In the positive sense, outliers might be indicative of new scientific phenomena, whose study may pave the way to scientific discovery. In the negative sense, outliers might be worthy of elimination or alteration in order to move forward with the original research objectives.

This dissertation explores two novel manifestations of the anomaly detection problem that are motivated by domain scientists' need to mine their data sets for single outlying examples. The first is the unsupervised detection of anomalies in large sets of unsynchronized time series data for the purpose of aiding scientific discovery. This work is applied to astrophysics time series data. The second is the detection of label noise in training data in order to improve the supervised learning process. This work is applied to problems in remote sensing, medical text mining, and volcanology.

This thesis makes four contributions. First, we introduce a method called PCAD for the discovery of local and global outliers in large sets of unsynchronized time series data. Second, we perform a comprehensive review of methods for the detection of label noise in training data and introduce a new method called PWEM. Third, we introduce an interactive framework, called ICCN, that cleans training sets of label noise with help from a domain expert. Finally, we introduce a semi-supervised learning method, called collaborative learning, that synthesizes ideas from our research on label noise detection and uses them to minimize label noise during training data generation.