Description: In this project, we apply clustering techniques to identify interesting patterns in real-world data sets. In previous work, we explored how to perform automatic feature selection with clustering. Currently, we are addressing how cluster ensembles can improve performance.
Data from our driving application, Earth science, has two distinctive characteristics: it is often high dimensional and it is spatially structured. Both characteristics pose special challenges for clustering. First, high dimensionality causes fundamental difficulties for many traditional clustering algorithms. Second, spatial structure imposes spatial continuity constraints on the clustering solutions, resulting in a constrained clustering problem. The goal of this research is to address these issues and develop new approaches to clustering high-dimensional, spatially structured data sets.
Collaborators: Xiaoli Zhang Fern, Jennifer Dy
Authors: Fern, X. Z., Brodley, C. E., Friedl, M. A.
SIAM International Conference on Data Mining
Abstract: This paper addresses the task of analyzing the correlation between two related domains X and Y. Our research is motivated by an Earth Science task that studies the relationship between vegetation and precipitation. A standard statistical technique for such problems is Canonical Correlation Analysis (CCA). A critical limitation of CCA is that it can only detect linear correlation between the two domains that is globally valid throughout both data sets. Our approach addresses this limitation by constructing a mixture of local linear CCA models through a process we name correlation clustering. In correlation clustering, both data sets are clustered simultaneously according to the data's correlation structure such that, within a cluster, domain X and domain Y are linearly correlated in the same way. Each cluster is then analyzed using the traditional CCA to construct local linear correlation models. We present results on both artificial data sets and Earth Science data sets to demonstrate that the proposed approach can detect useful correlation patterns, which traditional CCA fails to discover.
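The limitation of a single global linear model can be illustrated with a toy example. The sketch below is not the correlation clustering algorithm from the paper: it assumes the cluster memberships are already known and uses one-dimensional domains, for which the first canonical correlation reduces to the absolute Pearson correlation. The data and the names `pearson`, `cluster_a`, and `cluster_b` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def corr_of(points):
    """Correlation between the X and Y coordinates of (x, y) pairs."""
    xs, ys = zip(*points)
    return pearson(xs, ys)

# Hypothetical data: two groups with opposite linear relationships.
x_vals = [-2.0, -1.0, 0.0, 1.0, 2.0]
cluster_a = [(x, 2.0 * x) for x in x_vals]    # y rises with x
cluster_b = [(x, -2.0 * x) for x in x_vals]   # y falls with x

global_r = corr_of(cluster_a + cluster_b)  # one model for all the data
local_r_a = corr_of(cluster_a)             # one model per (assumed known) cluster
local_r_b = corr_of(cluster_b)
```

On this data the pooled correlation vanishes because the two opposite trends cancel, while each group on its own is perfectly linearly correlated. Finding such groups automatically is the part that correlation clustering addresses.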
Authors: Dy, J. and Brodley, C. E.
Journal of Machine Learning Research
5, pp. 845-889, August
Abstract: In this paper, we identify two issues involved in developing an automated feature subset selection algorithm for unlabeled data: the need for finding the number of clusters in conjunction with feature selection, and the need for normalizing the bias of feature selection criteria with respect to dimension. We explore the feature selection problem and these issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. We present proofs on the dimensionality biases of these feature criteria, and present a cross-projection normalization scheme that can be applied to any criterion to ameliorate these biases. Our experiments show the need for feature selection, the need for addressing these two issues, and the effectiveness of our proposed solutions.
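For orientation, the scatter separability criterion and the cross-projection idea can be written down as follows. This is a sketch of the usual formulation, stated from memory rather than quoted from the paper, so the exact definitions should be checked against the paper itself. The symbols are assumptions introduced here: k clusters with mixing proportions \pi_j, means \mu_j and covariances \Sigma_j, and CRIT(S, C) denoting the criterion value when the data are represented by feature subset S and grouped by clustering C.

```latex
% Within-cluster scatter, between-cluster scatter, and overall mean
\[
S_w = \sum_{j=1}^{k} \pi_j \Sigma_j, \qquad
S_b = \sum_{j=1}^{k} \pi_j (\mu_j - M_o)(\mu_j - M_o)^{\top}, \qquad
M_o = \sum_{j=1}^{k} \pi_j \mu_j
\]
% Scatter separability criterion (larger is better)
\[
\mathrm{CRIT}_{\mathrm{scatter}} = \operatorname{trace}\left(S_w^{-1} S_b\right)
\]
% Cross-projection: score each subset's clustering in both feature spaces
\[
\mathrm{normalizedValue}(S_1, C_1) = \mathrm{CRIT}(S_1, C_1)\,\mathrm{CRIT}(S_2, C_1), \qquad
\mathrm{normalizedValue}(S_2, C_2) = \mathrm{CRIT}(S_2, C_2)\,\mathrm{CRIT}(S_1, C_2)
\]
```

Because the trace criterion does not decrease when features are added, comparing raw values across subsets of different sizes favors larger subsets. Evaluating each clustering in both feature spaces puts the two candidates on a common footing.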
Authors: Dy, J., Brodley, C. E., Kak, A., Broderick, L. S., and Aisen, A. M.
IEEE Transactions on Pattern Analysis and Machine Intelligence
vol. 25, pp. 373-378, March
Authors: Fern, X. Z. and Brodley, C.E.
The Twentieth International Conference on Machine Learning
Abstract: We investigate how random projection can best be used for clustering high dimensional data. Random projection has been shown to have promising theoretical properties. In practice, however, we find that it results in highly unstable clustering performance. Our solution is to use random projection in a cluster ensemble approach. Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data. To gain insights into the performance improvement obtained by our ensemble method, we analyze and identify the influence of the quality and the diversity of the individual clustering solutions on the final ensemble performance.
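The ensemble idea can be sketched in a few lines of plain Python. This is a simplified stand-in, not the method evaluated in the paper: it assumes k-means with hard labels as the base clusterer, a co-association matrix (the fraction of runs in which two points share a cluster) as the aggregate, and connected components above a threshold as the consensus step. All function names, parameters, and the toy data are hypothetical.

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_projection(data, out_dim, rng):
    """Project each row of `data` through a Gaussian random matrix."""
    in_dim = len(data[0])
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in range(in_dim)]
    return [[sum(row[i] * matrix[i][j] for i in range(in_dim))
             for j in range(out_dim)] for row in data]

def kmeans(points, k, rng, iters=10):
    """Plain k-means with farthest-first initialisation; returns hard labels."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(list(max(points, key=lambda p: min(sqdist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_association(data, k, out_dim, runs, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(data)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(random_projection(data, out_dim, rng), k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]

def consensus(similarity, threshold=0.5):
    """Connected components of the graph linking pairs above `threshold`."""
    n = len(similarity)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

# Toy data: two well-separated groups in 8 dimensions.
data_rng = random.Random(1)
blob_a = [[data_rng.uniform(-0.1, 0.1) for _ in range(8)] for _ in range(10)]
blob_b = [[10.0 + data_rng.uniform(-0.1, 0.1) for _ in range(8)] for _ in range(10)]
data = blob_a + blob_b

similarity = co_association(data, k=2, out_dim=3, runs=10)
final_labels = consensus(similarity)
```

Each run sees the data through a different random projection, so individual runs may disagree. Averaging co-membership over runs smooths out that instability before the final grouping is made.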
Authors: Dy, J. and Brodley, C. E.
The Seventeenth International Conference on Machine Learning
pp. 247-254, July
Abstract: This paper explores the problem of feature subset selection for unsupervised learning within the wrapper framework. Our approach performs feature subset selection wrapped around expectation-maximization (EM) clustering with order identification (identifying the number of clusters in the data). We investigate two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. When the "true" number of clusters k is unknown, our experiments on simulated Gaussian data and benchmark data sets show that incorporating the search for k within the feature selection procedure obtains better "class" accuracy than fixing k to be the number of classes. There are two reasons: 1) the "true" number of Gaussian components is not necessarily equal to the number of classes and 2) clustering with different feature subsets can result in different numbers of "true" clusters. Our empirical evaluation shows that feature selection reduces the number of features and improves clustering performance with respect to the chosen performance criteria.
Authors: Dy, J. and Brodley, C. E.
The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Abstract: For many feature selection problems, a human defines the features that are potentially useful, and then a subset is chosen from the original pool of features using an automated feature selection algorithm. In contrast to supervised learning, class information is not available to guide the feature search for unsupervised learning tasks. In this paper, we introduce Visual-FSSEM (Visual Feature Subset Selection using Expectation-Maximization Clustering), which incorporates visualization techniques, clustering, and user interaction to guide the feature subset search and to enable a deeper understanding of the data. Visual-FSSEM serves as both an exploratory and a multivariate-data visualization tool. We illustrate Visual-FSSEM on a high-resolution computed tomography lung image data set.
Authors: Dy, J., Brodley, C. E., Kak, A., Shyu, C., and Broderick, L. S.
Storage and Retrieval for Image and Video Databases VII, IS&T/SPIE Electronic Imaging
Abstract: This paper introduces a new approach called the 'customized-queries' approach to content-based image retrieval (CBIR). The customized-queries approach first classifies a query using the features that best differentiate the major classes and then customizes the query to that class by using the features that best distinguish the subclasses within the chosen major class. This research is motivated by the observation that the features which are most effective in discriminating among images from different classes may not be the most effective for retrieval of visually similar images within a class. This occurs for domains in which not all pairs of images within one class have equivalent visual similarity. We apply this approach to content-based retrieval of high-resolution tomographic images of patients with lung disease and show that this approach yields 82.8 percent retrieval precision. The traditional approach that performs retrieval using a single feature vector yields only 37.9 percent retrieval precision.
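The two-stage structure of a customized query can be sketched as follows. This is an illustration under stated assumptions, not the system described in the paper: a nearest-centroid rule stands in for the query classifier, squared Euclidean distance stands in for the similarity measure, and the feature indices are given by hand rather than selected automatically. The function `customized_query` and the toy database are hypothetical.

```python
def sqdist(a, b, features):
    """Squared Euclidean distance restricted to the given feature indices."""
    return sum((a[f] - b[f]) ** 2 for f in features)

def customized_query(query, database, class_features, within_class_features, top_n=2):
    """Two-stage retrieval.

    database: list of (feature_vector, major_class) pairs.
    class_features: indices assumed to separate the major classes.
    within_class_features: dict mapping each major class to the indices
        assumed to separate images inside that class.
    """
    # Group the database by major class and compute class centroids.
    by_class = {}
    for vec, label in database:
        by_class.setdefault(label, []).append(vec)
    centroids = {label: [sum(col) / len(vecs) for col in zip(*vecs)]
                 for label, vecs in by_class.items()}

    # Stage 1: classify the query using only the class-level features.
    predicted = min(centroids,
                    key=lambda lab: sqdist(query, centroids[lab], class_features))

    # Stage 2: rank images of that class using its own feature subset.
    feats = within_class_features[predicted]
    ranked = sorted(by_class[predicted], key=lambda v: sqdist(query, v, feats))
    return predicted, ranked[:top_n]

# Toy database: feature 0 separates classes "A" and "B"; feature 1 matters
# inside "A", feature 2 inside "B".
database = [
    ([0.0, 0.1, 5.0], "A"),
    ([0.1, 0.9, 1.0], "A"),
    ([0.2, 0.2, 9.0], "A"),
    ([1.0, 3.0, 0.1], "B"),
    ([0.9, 7.0, 0.8], "B"),
    ([1.1, 1.0, 0.2], "B"),
]
query = [0.1, 0.12, 2.0]

predicted, results = customized_query(
    query, database,
    class_features=[0],
    within_class_features={"A": [1], "B": [2]},
)
```

In this toy setup, ranking by all three features at once would put the image `[0.1, 0.9, 1.0]` first, because feature 2 dominates the distance even though it is irrelevant inside class "A". Restricting the second stage to the class-specific feature avoids that.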
Authors: Dy, J., Brodley, C. E., Kak, A., Shyu, C., and Broderick, L. S.
IEEE Conference on Computer Vision and Pattern Recognition
vol. 2, pp. 400-406, January 1999
Abstract: This paper makes two contributions. The first contribution is an approach called the "customized-queries" approach (CQA) to content-based image retrieval. The second is an algorithm called FSSEM that performs feature selection and clustering simultaneously. The customized-queries approach first classifies a query using the features that best differentiate the major classes and then customizes the query to that class by using the features that best distinguish the images within the chosen major class. This approach is motivated by the observation that the features that are most effective in discriminating among images from different classes may not be the most effective for retrieval of visually similar images within a class. This occurs for domains in which not all pairs of images within one class have equivalent visual similarity, i.e., subclasses exist. Because we are not given subclass labels, we must find the subclasses and, at the same time, the features that best discriminate among them. We use FSSEM to find these features. We apply this approach to content-based retrieval of high-resolution tomographic images of patients with lung disease and show that this approach radically improves the retrieval precision over the traditional approach that performs retrieval using a single feature vector.