*Tufts ML Alumni*

I received my B.A. in Mathematics from the University of California, Berkeley in 1997. Shortly thereafter, I joined the internet media company CNET Networks in San Francisco, first as an HTML/JavaScript producer and then as a software engineer. In 2002, I transferred to the company's Cambridge, MA office and enrolled part-time in the post-baccalaureate program in the Tufts Computer Science department. In 2004, I enrolled full-time in the Ph.D. program, and earned my M.S. in Computer Science in 2006.

**Homepage:** http://www.cs.tufts.edu/~urebbapr

- Rebbapragada, U., Brodley, C. E., Sulla-Menashe, D., and Friedl, M., Active label correction,
*IEEE 12th International Conference on Data Mining (ICDM)*, 2012


- Carla E. Brodley, Umaa Rebbapragada, Kevin Small, and Byron C. Wallace, Challenges and opportunities in applied machine learning,
*Artificial Intelligence Magazine*, 33, 2012

  **Associated Research Topics:**

  - Time Series Data Mining
  - Finding and Eliminating Class Label Noise
  - Medical Text Data Mining
  - Active Class Selection/Active Learning

- U. Rebbapragada, P. Protopapas, C. E. Brodley, C. Alcock, Finding Anomalous Periodic Time Series: An Application to Catalogs of Periodic Variable Stars,
*Machine Learning*, Vol. 74, Issue 3, p. 281, 2009

  **Abstract:** Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD’s reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.

  **Url:** http://www.springerlink.com/content/l8r3689876024121/

- U. Rebbapragada, L. Mandrake, K. Wagstaff, D. Gleeson, R. Castaño, S. Chien, and C. E. Brodley, Improving Onboard Analysis of Hyperion Images by Filtering Mislabeled Training Data Examples,
*IEEE Aerospace Conference*, 2009


- L. Mandrake, K. Wagstaff, D. Gleeson, U. Rebbapragada, D. Tran, R. Castaño, S. Chien, R. Pappalardo, Onboard Detection of Naturally Occurring Sulfur Compounds on the Surface of a Glacier using an SVM and the Hyperion Multi-Spectral Instrument,
*IEEE Aerospace Conference*, 2009


- U. Rebbapragada, P. Protopapas, C. E. Brodley and C. Alcock, Anomaly Detection in Catalogs of Periodic Variable Stars,
*Astronomical Data Analysis Software and Systems XVIII*, 2008


- U. Rebbapragada, R. Lomasky, C. E. Brodley and M. Friedl, Generating High-Quality Training Data for Automated Land-Cover Mapping,
*Proceedings of the 2008 IEEE International Geoscience and Remote Sensing Symposium*, 2008


- Umaa Rebbapragada, Carla E. Brodley, Class Noise Mitigation Through Instance Weighting,
*Proceedings of the 18th European Conference on Machine Learning*, 2007

  **Abstract:** We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance, and uses the confidence on the current label as a weight during training. The probability vector should be calculated such that clean instances have a high confidence on its current label, while mislabeled instances have a low confidence on its current label and a high confidence on its correct label. Past research focuses on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting, and thus, part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.

  **Url:** http://www.springerlink.com/content/n887602v2255435r/

- D. Yankov, E. Keogh, U. Rebbapragada, Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets,
*IEEE International Conference on Data Mining*, 2007


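The instance-weighting framework summarized in the 2007 ECML abstract above can be illustrated with a small sketch. This is not the paper's implementation: the paper derives the class membership probabilities from clustering, whereas here they are simply assumed to be given. The function names, the integer label encoding, and the `threshold` parameter are all hypothetical and chosen only for illustration.

```python
def confidence_weights(prob_vectors, labels):
    """Weight each instance by the probability assigned to its current label."""
    # Assumption: labels are integers that index into each probability vector.
    return [p[y] for p, y in zip(prob_vectors, labels)]


def discard_weights(prob_vectors, labels, threshold=0.5):
    """Discarding viewed as a special case: weights collapse to 0 or 1."""
    # `threshold` is a hypothetical cutoff, not a value taken from the paper.
    weights = confidence_weights(prob_vectors, labels)
    return [1.0 if w >= threshold else 0.0 for w in weights]


def corrected_labels(prob_vectors):
    """Correcting viewed as a special case: relabel with the most probable class."""
    return [max(range(len(p)), key=lambda c: p[c]) for p in prob_vectors]


# Three instances, two classes; the third instance is probably mislabeled.
probs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
labels = [0, 1, 0]
weights = confidence_weights(probs, labels)   # [0.9, 0.8, 0.3]
kept = discard_weights(probs, labels)         # [1.0, 1.0, 0.0]
relabeled = corrected_labels(probs)           # [0, 1, 1]
```

The weights could then be passed to any learner that accepts per-instance weights; which learner to use is left open here.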

**Current Research Topics:**

- Finding and Eliminating Class Label Noise

**Description:** We focus on the problem of finding and eliminating class label noise, also known as mislabeled training data. We have proposed a novel framework that calculates the probability of class membership on each training instance, and uses these probabilities as instance weights. These probabilities can be used to down-weight noisy instances or to correct them to their true labels. We have applied our methods to a land cover data set and are currently working on an application to verify labels on soil liquefaction data in earthquake-prone regions.

- Time Series Data Mining

**Description:** We have multiple projects examining several aspects of mining and learning with time series data, with direct applications in domains including astrophysics and predictive medicine. Recent projects include detecting anomalous time series using novel and efficient clustering algorithms, clustering and classification of time series data using kernel methods and Gaussian processes, and fast search and analysis algorithms for detecting events within time series.

This work is partly supported by NSF grants IIS-0803409 and IIS-0713259.
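As a rough illustration of clustering-based anomaly detection for periodic time series that may be out of phase, the sketch below scores each series by its distance to the nearest centroid, taking the minimum over all cyclic shifts so that no explicit alignment is needed. It is a simplified stand-in, not the group's PCAD algorithm: the centroids are assumed to be given (the modified k-means step is not reproduced), all series are assumed to have equal length, and the function names are hypothetical.

```python
import math


def shift_invariant_distance(a, b):
    """Smallest Euclidean distance between `a` and any cyclic shift of `b`."""
    n = len(a)
    best = math.inf
    for s in range(n):
        d2 = sum((a[i] - b[(i + s) % n]) ** 2 for i in range(n))
        best = min(best, d2)
    return math.sqrt(best)


def anomaly_scores(series, centroids):
    """Score each series by its distance to the nearest centroid (larger = more anomalous)."""
    return [min(shift_invariant_distance(x, c) for c in centroids) for x in series]


def rank_anomalies(series, centroids):
    """Return series indices ordered from most to least anomalous."""
    scores = anomaly_scores(series, centroids)
    return sorted(range(len(series)), key=lambda i: scores[i], reverse=True)
```

The brute-force search over shifts costs O(n^2) per pair of series, which is one reason a practical method would need something more efficient on large catalogs.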

**Associated Data/Software:**

- OGLE II Dataset

**Description:** The data in this archive comprise the "OGLE II" dataset as used by the machine learning group at Tufts University. The dataset includes time series of light measurements from three types of periodic variable stars from the Optical Gravitational Lensing Experiment (OGLE) survey. It contains a total of 14087 time series, with 3425, 3390, and 7272 in the categories CEPH, RRL, and EB, respectively.

The data was generated and kindly provided by other researchers. Please see the official OGLE site for more information on the survey, data and discoveries. Various queries on this and other astronomy data can also be made there, as well as at the VizieR site and the Harvard time series site. **If using this data, please cite the original work** (Szymanski, 2005, Acta Astron., 55, 43 and Udalski, Kubiak and Szymanski, 1997, Acta Astron., 47, 319), as suggested on the OGLE site.

Our group has made use of this data in machine learning research for anomaly detection, classification, probabilistic modeling and period detection. Please see our time series project page and additional publications therein for more information about the dataset and tasks studied.

We provide the OGLE-II dataset in order to make it more readily accessible to machine learning researchers. To that end, we have packaged three versions of the data, as linked below. Please consult our papers for more information about the data, its processing, and our experiments.

(1) The raw data: the original time series that are measured at irregular time points and are not folded. We also provide the folded versions, as well as the known period (and other properties) as found by the OGLE project: ogle2full.tar.gz

(2) A processed form of the data: each time series is folded according to its known period and then re-sampled via interpolation at 50 regular sampling time points. Two versions are provided: the time series "as is" and after "universal phasing". In this form, each time series can be treated simply as a point in 50-D Euclidean space and used directly by machine learning algorithms, providing an easy starting point for studying the data: ogle50.libsvm and upogle50.libsvm
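The folding and re-sampling step described in (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the processing code used to build the archive: it assumes non-negative observation times, uses plain linear interpolation that wraps around the cycle, and does not attempt "universal phasing". The function names are hypothetical, and the exact label encoding and layout of `ogle50.libsvm` are not stated here, so `to_libsvm_line` only shows the general LIBSVM text format.

```python
from bisect import bisect_left


def fold_and_resample(times, values, period, n_points=50):
    """Fold a series on `period`, then linearly interpolate onto a regular phase grid."""
    # Phase of each observation in [0, 1), sorted so we can interpolate.
    folded = sorted(((t % period) / period, v) for t, v in zip(times, values))
    phases = [p for p, _ in folded]
    vals = [v for _, v in folded]
    # Pad one point on each side so interpolation wraps around the cycle.
    phases = [phases[-1] - 1.0] + phases + [phases[0] + 1.0]
    vals = [vals[-1]] + vals + [vals[0]]

    resampled = []
    for k in range(n_points):
        g = k / n_points
        i = bisect_left(phases, g)  # phases[i - 1] < g <= phases[i]
        x0, x1 = phases[i - 1], phases[i]
        y0, y1 = vals[i - 1], vals[i]
        resampled.append(y0 + (g - x0) / (x1 - x0) * (y1 - y0))
    return resampled


def to_libsvm_line(label, features):
    """Format one example as `<label> <index>:<value> ...` with 1-based indices."""
    # The `g` format keeps at most six significant digits per value.
    return " ".join([str(label)] + [f"{i}:{v:g}" for i, v in enumerate(features, start=1)])
```

Each call to `fold_and_resample` returns a list of 50 values by default, which is the "point in 50-D Euclidean space" view of a light curve.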