TaxaMiner: An Experimental Framework for Automated Taxonomy Bootstrapping

February 16, 2005
2:50 pm - 4:00 pm
Halligan 111
Host: Donna Slonim


Hierarchical taxonomies and thesauri are frequently used by content management systems for indexing, search, and categorization. They are also viewed as rudimentary ontologies for the emerging Semantic Web infrastructure. To date, however, developing taxonomies and thesauri has been a human-intensive process that demands substantial cost and time. It is therefore critical to investigate approaches that reduce human effort and resource commitments. To this end, we present an experimentation framework for automated taxonomy construction from a large corpus of documents. Our approach involves:

  • (a) generation of a document cluster hierarchy;
  • (b) extraction of a taxonomy from this hierarchy; and
  • (c) assignment of labels to nodes in this taxonomy.
We draw upon a suite of clustering and NLP techniques and identify parameters that form the basis of the experimentation framework. We also propose metrics for measuring taxonomy quality and evaluate the impact of these parameters on those metrics. The MEDLINE(R) database serves as the document corpus and the MeSH thesaurus as the gold standard. We present results that demonstrate how our techniques can improve the quality of the generated taxonomies, along with insights drawn from them.
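
The three steps (a) to (c) can be illustrated with a toy sketch. This is not the TaxaMiner implementation: the abstract does not name the clustering algorithm, the extraction rule, or the labeling method, so everything below is an assumption chosen for brevity. It uses bag-of-words vectors, greedy agglomerative merging by cosine similarity, a hypothetical min_docs pruning rule, and most-frequent-term labels, applied to four invented document titles rather than MEDLINE.

```python
from collections import Counter
from itertools import combinations
import math

STOPWORDS = {"of", "the", "in", "and", "for", "a"}  # illustrative only

def vectorize(text):
    """Bag-of-words term counts, lowercased, stopwords dropped."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_hierarchy(docs):
    """(a) Agglomerative clustering: repeatedly merge the two most similar clusters."""
    nodes = [{"terms": vectorize(d), "docs": [d], "children": []} for d in docs]
    while len(nodes) > 1:
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: cosine(nodes[p[0]]["terms"], nodes[p[1]]["terms"]))
        a, b = nodes[i], nodes[j]
        merged = {"terms": a["terms"] + b["terms"],
                  "docs": a["docs"] + b["docs"],
                  "children": [a, b]}
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

def extract_taxonomy(node, min_docs=2):
    """(b) Keep only clusters holding at least min_docs documents (assumed rule)."""
    kids = [extract_taxonomy(c, min_docs)
            for c in node["children"] if len(c["docs"]) >= min_docs]
    return {"terms": node["terms"], "docs": node["docs"], "children": kids}

def label(node, k=2):
    """(c) Label each taxonomy node with its k most frequent terms."""
    node["label"] = [t for t, _ in node["terms"].most_common(k)]
    for child in node["children"]:
        label(child, k)
    return node

# Invented toy corpus standing in for a real document collection.
docs = [
    "heart disease treatment",
    "heart disease risk",
    "lung cancer treatment",
    "lung cancer screening",
]
taxonomy = label(extract_taxonomy(build_hierarchy(docs)))
for child in taxonomy["children"]:
    print(child["label"], len(child["docs"]))
```

Here min_docs and k play the role of the tunable parameters the abstract mentions: changing them changes the shape and labels of the resulting taxonomy, which is the kind of effect a quality metric against a gold standard would then measure.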

Speaker Bio

Vipul Kashyap is a Senior Medical Informatician in the Clinical Informatics Research & Development group at Partners HealthCare System. He is currently the chief architect of a Clinical Knowledge Management portal being rolled out at Partners to enable search, browsing, and retrieval of the clinical content and assets of the Partners HealthCare System. Vipul received his PhD from the Department of Computer Science at Rutgers University in New Brunswick and has performed research on semantics and knowledge-based approaches for information and knowledge management. He was a co-project manager of a Knowledge Management effort at Telcordia Technologies (formerly known as Bellcore) focused on knowledge sharing and reuse across Telcordia's Professional Services Units. He was also a fellow at the National Library of Medicine and held a position at the Microelectronics and Computer Technology Corporation (MCC). Vipul has published two books on the topic of Semantics in Information Brokering and serves on the editorial boards of the International Journal of Knowledge and Learning and the International Journal of Metadata, Semantics, and Ontologies.