Somewhat Supervised Clustering: Exploring the Use of Predefined Categories in Clustering for Information Retrieval
Classification is the task of organizing a collection of documents into predefined categories. Conventional classifiers maintain a fixed category structure and static representations of the predefined categories. They therefore cannot adjust to a data set that contains occasional documents that do not fit the category structure, or to a data set that changes with time (e.g., news articles). Clustering, which groups documents into categories defined by some measure of similarity, is robust to changes in the data. It can create a new cluster for a document that does not belong to any existing cluster, and it can change the representation of clusters as the data changes. Our goal is to adapt clustering techniques to the classification task in order to benefit from these advantages. In this thesis, we considered the problem of using clustering for classification in the case where the data fits the predefined category structure and does not change with time. To use clustering for classification, we augmented a clustering algorithm with predefined categories. We tested this method on a data set that can be classified according to different sets of predefined categories. We found that our method is able to direct the clustering algorithm to classify documents into the given categories. In the simple case where the data does not change, its performance is comparable to that of a conventional classifier. This suggests that our method has potential for use in classification applications.