Description: We focus on the problem of finding and eliminating class label noise, also known as mislabeled training data. We have proposed a novel framework that calculates the probability of class membership for each training instance and uses these probabilities as instance weights. These probabilities can be used to downweight noisy instances or correct them to their true labels. We have applied our methods to a land cover data set and are currently working on an application to verify labels on soil liquefaction data in earthquake-prone regions.
Collaborators: Mark Friedl, Boston University
Authors: Carla E. Brodley, Umaa Rebbapragada, Kevin Small, and Byron C. Wallace.
Artificial Intelligence Magazine
Authors: U. Rebbapragada, R. Lomasky, C. E. Brodley and M. Friedl
Proceedings of the 2008 IEEE International Geoscience and Remote Sensing Symposium
Authors: Umaa Rebbapragada, Carla E. Brodley
Proceedings of the 18th European Conference on Machine Learning
Abstract: We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance and uses the confidence in the current label as a weight during training. The probability vector should be calculated such that clean instances have high confidence in their current labels, while mislabeled instances have low confidence in their current labels and high confidence in their correct labels. Past research focuses on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting and are thus part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.
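The weighting idea above can be sketched in a few lines. This is a minimal illustration, not the paper's method: where the paper derives class membership probabilities from clustering, this sketch estimates them from an instance's k nearest neighbors, then uses the confidence in the current label as the instance weight.

```python
# Sketch of confidence-based instance weighting (illustrative only:
# k-nearest-neighbor label votes stand in for the paper's
# clustering-based probability estimates).
from collections import Counter

def label_confidences(points, labels, k=3):
    """For each instance, estimate a class distribution from its k
    nearest neighbors' labels; return confidence in the current label."""
    weights = []
    for i, p in enumerate(points):
        # squared distances to every other instance
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        weights.append(votes[labels[i]] / k)  # confidence in current label
    return weights

# Two tight clusters; the last point sits in cluster 'b' but is labelled 'a'.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels = ['a', 'a', 'a', 'b', 'b', 'b', 'a']  # last label is noise
w = label_confidences(points, labels, k=3)
print(w)  # mislabeled instance receives weight 0.0
```

A training algorithm that accepts per-instance weights would then downweight (here, entirely ignore) the mislabeled point; setting a hard threshold recovers discarding as a special case.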
Authors: Brodley, C.E. and Friedl, M.
Journal of Artificial Intelligence Research.
vol. 11, pp. 131-167
Abstract: This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.
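The majority-vote and consensus rules described above reduce to two small predicates over an ensemble's predictions. The sketch below is an illustration of the voting rules only, assuming the per-classifier predictions have already been produced:

```python
# Sketch of the two ensemble filtering rules: an instance is flagged
# as mislabeled when "enough" classifiers disagree with its label.

def majority_filter(predictions, label):
    """Flag if more than half of the classifiers mislabel the instance."""
    wrong = sum(1 for p in predictions if p != label)
    return wrong > len(predictions) / 2

def consensus_filter(predictions, label):
    """Flag only if every classifier mislabels the instance."""
    return all(p != label for p in predictions)

# Three hypothetical classifiers' predictions for one instance labelled 'x'
preds = ['y', 'y', 'x']
print(majority_filter(preds, 'x'))   # 2 of 3 disagree -> flagged
print(consensus_filter(preds, 'x'))  # not unanimous -> retained
```

The example makes the trade-off concrete: the majority filter discards this (possibly good) instance, while the conservative consensus filter retains it.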
Authors: Friedl, M., Brodley, C. E. and Strahler, A.
IEEE Transactions on Geoscience and Remote Sensing.
vol. 37, no. 2, pp. 969-977
Abstract: Classification of land cover from remotely sensed data at continental to global scales requires sophisticated algorithms and feature selection techniques to optimize classifier performance. The authors examine methods to maximize classification accuracies using decision trees to map land cover from multitemporal AVHRR imagery at continental and global scales. As part of their analysis they test the utility of "boosting", a new technique developed to increase classification accuracy by forcing the learning (classification) algorithm to concentrate on those training observations that are most difficult to classify. Their results show that boosting consistently reduces misclassification rates by 20-50% depending on the data set in question, and that most of the benefit gained by boosting is achieved after seven boosting iterations. They also assess the utility of including phenological metrics and geographic position as additional features to the classification algorithm. They find that using derived phenological metrics produces little improvement in classification accuracy relative to using an annual time series of NDVI data, but that geographic position provides substantial power for predicting land cover types at continental and global scales. However, in order to avoid generating spurious classification accuracies using geographic position, training data must be distributed evenly in geographic space.
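The core of the boosting idea referenced above is a reweighting loop: after each round, misclassified training observations gain weight so the next classifier concentrates on them. Below is a minimal sketch of one AdaBoost-style update; it is a generic illustration of that mechanism, not the specific boosting variant used in the paper.

```python
# One AdaBoost-style reweighting round: upweight mistakes,
# downweight correct predictions, renormalise to sum to 1.
import math

def adaboost_reweight(weights, correct, error):
    """weights: current instance weights; correct: per-instance booleans;
    error: the round's weighted error rate (0 < error < 0.5)."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

w0 = [0.25] * 4                      # uniform start over 4 instances
correct = [True, True, True, False]  # last instance misclassified
w1 = adaboost_reweight(w0, correct, error=0.25)
print(w1)  # the misclassified instance now carries half the total weight
```

After one round the single misclassified observation holds weight 0.5, which is exactly how the learner is "forced" onto the hard cases.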
Authors: Friedl, M. and Brodley, C. E.
Remote Sensing of Environment
vol. 61, no. 3
Abstract: Decision tree classification algorithms have significant potential for land cover mapping problems and have not been tested in detail by the remote sensing community relative to more conventional pattern recognition techniques such as maximum likelihood classification. In this paper we present several types of decision tree classification algorithms and evaluate them on three different remote sensing data sets. The decision tree classification algorithms tested include a univariate decision tree, a multivariate decision tree, and a hybrid decision tree capable of including several different types of classification algorithms within a single decision tree structure. Classification accuracies produced by each of these decision tree algorithms are compared with both maximum likelihood and linear discriminant function classifiers. Results from this analysis show that the decision tree algorithms consistently outperform the maximum likelihood and linear discriminant function classifiers in terms of classification accuracy. In particular, the hybrid tree consistently produced the highest classification accuracies for the data sets tested. More generally, the results from this work show that decision trees have several advantages for remote sensing applications by virtue of their relatively simple, explicit, and intuitive classification structure. Further, decision tree algorithms are strictly non-parametric and therefore make no assumptions regarding the distribution of input data, and are also flexible and robust with respect to non-linear and noisy relationships among input features and class labels.
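The univariate/multivariate distinction drawn in this abstract comes down to the test applied at each tree node. The sketch below is a generic illustration of that distinction (the function names and thresholds are hypothetical, not from the paper):

```python
# A univariate node branches on a single feature; a multivariate node
# branches on a linear combination of features (illustrative only).

def univariate_test(x, feature, threshold):
    """Branch on a single input feature."""
    return x[feature] <= threshold

def multivariate_test(x, coeffs, threshold):
    """Branch on a linear combination of all input features."""
    return sum(c * v for c, v in zip(coeffs, x)) <= threshold

x = [0.2, 0.9]  # one instance with two spectral features
print(univariate_test(x, feature=0, threshold=0.5))
print(multivariate_test(x, coeffs=[1.0, 1.0], threshold=1.0))
```

A hybrid tree, in these terms, is one that may place different kinds of tests (or even different classifiers) at different nodes of the same tree.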
Authors: Brodley, C. E. and Friedl, M. A.
Proceedings of the Thirteenth National Conference on Artificial Intelligence
pp. 799-805, August 1996
Abstract: This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of this technique is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed through the filter. Only instances that the filter classifies correctly are passed to the final learning algorithm. We present an empirical evaluation of the approach for the task of automated land cover mapping from remotely sensed data. Labeling error arises in these data from a multitude of sources including lack of consistency in the vegetation classification used, variable measurement techniques, and variation in the spatial sampling resolution. Our evaluation shows that for noise levels of less than 40%, filtering results in higher predictive accuracy than not filtering, and for levels of class noise less than or equal to 20% filtering allows the base-line accuracy to be retained. Our empirical results suggest that the ensemble filter approach is an effective method for identifying labeling errors, and further, that the approach will significantly benefit ongoing research to develop accurate and robust remote sensing-based methods to map land cover at global scales.
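The cross-validation filtering loop described in this abstract can be sketched compactly. This is a simplified single-classifier stand-in (a toy 1-nearest-neighbor model on 1-D data rather than the paper's ensemble): classifiers trained on the other folds predict each held-out instance, and only instances the filter classifies correctly are kept.

```python
def cv_filter(data, labels, n_folds, train_fn):
    """Keep only instances whose held-out prediction matches their label.
    train_fn(train_x, train_y) must return a predict(x) function."""
    keep = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(data)) if i % n_folds == fold]
        train_x = [data[i] for i in range(len(data)) if i % n_folds != fold]
        train_y = [labels[i] for i in range(len(data)) if i % n_folds != fold]
        model = train_fn(train_x, train_y)
        keep.extend(i for i in test_idx if model(data[i]) == labels[i])
    return sorted(keep)

def train_1nn(xs, ys):
    """Toy 1-nearest-neighbor filter classifier on 1-D data."""
    def predict(x):
        return min((abs(x - q), y) for q, y in zip(xs, ys))[1]
    return predict

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0]
labels = ['a', 'a', 'a', 'b', 'b', 'b', 'a']  # last label is noise
kept = cv_filter(data, labels, n_folds=7, train_fn=train_1nn)
print(kept)  # the mislabeled instance (index 6) is filtered out
```

Replacing `train_1nn` with an ensemble of learners, and the equality check with a majority or consensus vote, recovers the filters evaluated in the paper.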