PhD Defense: Clustering with Domain Knowledge

March 28, 2014
10:00 AM - 12:00 PM
Halligan 102
Speaker: Jingjing Liu, Tufts University


Clustering identifies major patterns or trends without any supervisory information, based on a combination of the assumed cluster structure and the observed data. The field of constrained clustering developed out of the need to find ways to incorporate domain knowledge when it is available. While it is possible that a fully unsupervised clustering algorithm might naturally find a solution that is consistent with the domain knowledge, constrained clustering is particularly useful when domain knowledge suggests that the default separation presented in feature space is not the one that is sought. This thesis explores ways to incorporate domain knowledge into clustering algorithms in order to obtain a customized clustering solution that fits the needs of the user. This thesis makes four contributions. First, motivated by the need to redefine class definitions so that distinctions between classes are better supported by the data, we propose a novel method of specifying class-level constraints that generates pairwise instance-level constraints. We then show how to incorporate class-level constraints into clustering algorithms. We present an application of this approach to redefining global land cover classes from remotely sensed data. Second, we propose a novel approach to removing the effect of confounding factors using constraint-based clustering. We apply this method to find homogeneous subgroups of Multiple Sclerosis patients. Third, we address the problem of incorporating feature relevance information and instance-level pairwise constraints in one setting by presenting a framework that uses random projection ensembles and spectral clustering. Fourth, we introduce a method for iterative metric learning via user-provided pairwise constraints and present results from a prototype implementation that allows real-time interaction with the user.