PhD Defense: Active Acquisition of Informative Training Data
The performance of a classifier built from labeled training data depends heavily on the quality of that data. In many domains, collecting high-quality training data is labor-intensive and expensive. This problem can be broken into two parts. First, the features must be able to distinguish between the classes of interest. Although selecting and generating informative features is an important problem, it is not the focus of this thesis. Second, the examples acquired must be informative. Ideally, one would gather a training set containing only relevant, non-redundant examples, and would acquire it efficiently, with minimal effort and resources. The time of the human aiding in data generation is precious, and we seek to use it wisely. By considering class proportions, this thesis makes three contributions to optimizing the use of human assistance in creating training data for computer-based classifiers. First, we identify a new class of supervised learning problems in which the process of generating data cannot be separated from the process of obtaining labels. This class of problems, which we call Active Class Selection (ACS), addresses the question: if one can collect $n$ additional training instances, how should they be distributed with respect to class? The second and third contributions improve training data collection for a previously identified problem, Active Learning (AL). AL addresses a question distinct from but related to ACS: if one has $n$ instances in an unlabeled pool $U$, which instances from $U$ should one have a human label? We address this problem in two ways. First, we demonstrate how ideas from ACS can be used to perform AL on multiclass datasets. Second, we address a largely neglected problem in AL: when should one stop labeling data because further labels will not increase classifier performance?
We also explore how to dynamically choose which AL method is best suited for a dataset at a given stage of AL.