Clustering High-Dimensional Data

April 18, 2006
2:50 pm - 4:00 pm
Halligan 111
Speaker: Dr. Jennifer Dy, Northeastern University

Abstract

Creating effective algorithms for unsupervised learning is important because vast amounts of data preclude humans from manually labeling the category of each instance. In addition, human labeling is expensive and subjective. As a result, the majority of existing data is unlabeled. The goal of unsupervised learning, or cluster analysis, is to group "similar" objects together. "Similarity" is typically defined by a metric or a probability model, and these measures depend heavily on the features representing the data. Many clustering algorithms assume that the relevant features have already been determined by domain experts, but not all features are important. Moreover, many clustering algorithms fail when dealing with high-dimensional data. In this talk, I will present three approaches to clustering in high-dimensional spaces: 1. feature subset selection for unsupervised learning, through Gaussian mixtures and the maximum likelihood and scatter separability criteria; 2. feature selection and clustering in an interactive visualization environment; and 3. hierarchical feature transformation and clustering, through automated hierarchical mixtures of probabilistic principal component analyzers.

Biography: Dr. Jennifer G. Dy has been an assistant professor in the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, since 2002. She obtained her MS and PhD in 1997 and 2001, respectively, from the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, and her BS degree in 1993 from the Department of Electrical Engineering, University of the Philippines. She received an NSF CAREER award in 2004. She has been an editorial board member of the journal Machine Learning since 2004, was publications chair for the International Conference on Machine Learning in 2004, and is a program committee member for ACM SIGKDD, ICML, AAAI, and SIAM SDM. Her research interests include machine learning, data mining, statistical pattern recognition, and computer vision.