Tufts ML Alumni
Associated Publications:
Authors: Carla E. Brodley, Umaa Rebbapragada, Kevin Small, and Byron C. Wallace.
Artificial Intelligence Magazine
Authors: U. Rebbapragada, P. Protopapas, C. E. Brodley, C. Alcock
Vol. 74, Issue 3, p. 281
Abstract: Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD’s reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
Authors: U. Rebbapragada, R. Lomasky, C. E. Brodley and M. Friedl
Proceedings of the 2008 IEEE International Geoscience and Remote Sensing Symposium
Authors: Umaa Rebbapragada, Carla E. Brodley
Proceedings of the 18th European Conference on Machine Learning
Abstract: We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance, and uses the confidence in the current label as a weight during training. The probability vector should be calculated such that clean instances have high confidence in their current label, while mislabeled instances have low confidence in their current label and high confidence in their correct label. Past research focuses on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting, and thus part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.
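As a rough illustration of the instance-weighting idea in the abstract above, the sketch below assumes that cluster assignments are already available from some clustering algorithm, and estimates each instance's class membership probabilities as the label frequencies within its cluster. The function name `label_confidence` and the toy data are hypothetical, and this is a simplified stand-in, not the method evaluated in the paper.

```python
import numpy as np

def label_confidence(clusters, labels, n_classes):
    """Estimate class membership probabilities from cluster composition.

    probs[i, c] is the fraction of instances in instance i's cluster
    that carry label c; weights[i] is the probability assigned to
    instance i's current label. (Hypothetical helper, for illustration.)
    """
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    probs = np.zeros((len(labels), n_classes))
    for k in np.unique(clusters):
        members = clusters == k
        counts = np.bincount(labels[members], minlength=n_classes)
        probs[members] = counts / counts.sum()
    weights = probs[np.arange(len(labels)), labels]
    return probs, weights

# Toy data: the fourth instance carries label 1 inside a cluster
# that is otherwise labeled 0, so it receives a low weight.
clusters = [0, 0, 0, 0, 1, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 1]
probs, weights = label_confidence(clusters, labels, n_classes=2)
print(weights)
```

In this sketch, discarding corresponds to thresholding the weights to 0 or 1, and correcting corresponds to relabeling each instance with the most probable class in `probs`, which is one way to read the claim that both are special cases of instance weighting.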
Current Research Topics:
Past Research Topics:
Description: We focus on the problem of finding and eliminating class label noise, also known as mislabeled training data. We have proposed a novel framework that calculates the probability of class membership for each training instance, and uses these probabilities as instance weights. These probabilities can be used to downweight noisy instances or to correct their labels. We have applied our methods to a land cover data set and are currently working on an application to verify labels on soil liquefaction data in earthquake-prone regions.
Description: We have multiple projects looking at several aspects of mining and learning with time series data, with direct applications in several domains, including astrophysics and predictive medicine. Recent projects include detecting anomalous time series using novel and efficient clustering algorithms, clustering and classification of time series data using kernel methods and Gaussian processes, and fast search and analysis algorithms for detecting events within time series.
This work is partly supported by NSF grants IIS-0803409 and IIS-0713259.
The data in this archive includes the "OGLE II" dataset as used by the machine learning group at Tufts University. The dataset includes time series of light measurements from three types of periodic variable stars from the Optical Gravitational Lensing Experiment (OGLE) survey. The data includes a total of 14087 time series, with (3425, 3390, 7272) in the categories (CEPH, RRL, EB).
The data was generated and kindly provided by other researchers. Please see the official OGLE site for more information on the survey, data, and discoveries. Various queries on this and other astronomy data can also be made there, as well as at the VizieR site and the Harvard time series site. If you use this data, please cite the original work (Szymanski, 2005, Acta Astron., 55, 43; Udalski, Kubiak and Szymanski, 1997, Acta Astron., 47, 319), as suggested on the OGLE site.
Our group has made use of this data in machine learning research for anomaly detection, classification, probabilistic modeling and period detection. Please see our time series project page and additional publications therein for more information about the dataset and tasks studied.
We provide the OGLE-II dataset in order to make it more readily accessible to machine learning researchers. To that end, we have packaged three versions of the data, as linked below. Please consult our papers for more information about the data, its processing, and experiments.
(1) The raw data: the original time series that are measured at irregular time points and are not folded. We also provide the folded versions, as well as the known period (and other properties) as found by the OGLE project: ogle2full.tar.gz
(2) A processed form of the data: each time series is folded according to its known period, and then re-sampled via interpolation at 50 regular sampling time points. Two versions are provided: the time series "as is" and after "universal phasing". In this form, each time series can be treated simply as a point in 50-dimensional Euclidean space and used directly by machine learning algorithms, providing an easy starting point to study the data. ogle50.libsvm and upogle50.libsvm
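For readers starting from the raw data, the folding and re-sampling step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: it uses linear interpolation, takes phase zero at time zero, and runs on a synthetic sinusoid. The function name `fold_and_resample` is hypothetical, and the exact interpolation scheme used to build the provided files may differ. Universal phasing is not covered here.

```python
import numpy as np

def fold_and_resample(times, mags, period, n_points=50):
    """Fold an irregularly sampled series by its period and resample it
    on a regular phase grid using linear interpolation.

    Assumptions (not taken from the dataset documentation): phase zero
    is at time zero, and interpolation is linear and periodic.
    """
    times = np.asarray(times, dtype=float)
    mags = np.asarray(mags, dtype=float)
    phase = (times / period) % 1.0            # phase of each sample in [0, 1)
    order = np.argsort(phase)                 # sort samples by phase
    grid = np.arange(n_points) / n_points     # regular phase grid in [0, 1)
    # period=1.0 makes the interpolation wrap around the phase boundary
    return np.interp(grid, phase[order], mags[order], period=1.0)

# Synthetic example: a noiseless sinusoid observed at irregular times.
rng = np.random.default_rng(0)
period = 2.5
times = np.sort(rng.uniform(0.0, 100.0, size=400))
mags = np.sin(2 * np.pi * times / period)
vec = fold_and_resample(times, mags, period)
print(vec.shape)
```

The resulting vector has one entry per phase grid point, which is the kind of fixed-length representation that lets each time series be handled as a point in 50-dimensional space.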