Department of Computer Science
Mississippi State University
email: lboggess@cs.msstate.edu, phone: (601)325-7507, fax: (601)325-8997
To this end, a preprocessor screens text for names and formulae of chemical substances. A mutual information statistic is used to capture meaningful collocations in the domain. The text is then tagged with syntactic and semantic categories by a rule-based tagger trained on about 160,000 words of tagged and hand-corrected text from the domain. Existing text is analyzed to determine
That is, there is a dual decision: whether to index a concept at all and how to distinguish among competing concepts.
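As an illustration of the collocation step mentioned above, the following is a minimal sketch of scoring adjacent word pairs with pointwise mutual information. It is not the project's implementation: the function name `pmi_bigrams`, the `min_count` threshold, and the restriction to adjacent pairs are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (in bits).

    Hypothetical helper for illustration only; the threshold and the use of
    adjacent pairs are assumptions, not details taken from the project.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable estimates, so skip them.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        # PMI = log2( P(x, y) / (P(x) * P(y)) )
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    text = "the heat capacity of the gas and the heat capacity of the liquid"
    ranked = sorted(pmi_bigrams(text.split()).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 3))
```

Pairs whose words occur together more often than their individual frequencies predict receive higher scores, which is why a domain term such as "heat capacity" outranks a pair that begins with a very frequent word such as "the".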
Our stated goal has been to attain a recall rate of 80% for a set of about 200 concepts in the domain of physical chemistry. In May of this year, for a test set of 15 articles, the results were as follows:
Agarwal, Rajeev. 1995. Semantic feature extraction from technical texts with limited human intervention. Ph.D. dissertation. Mississippi State University. http://www.cs.msstate.edu/artificial_intelligence/kudzu_papers/rajeev_dissertation.ps
Agarwal, Rajeev. 1995. Evaluation of semantic clusters. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.
Boggess, Lois and Julia Hodges. 1994. A knowledge-based approach to indexing scientific text. In Proceedings of the Human Language Technology Workshop, p. 458. Morgan Kaufmann Publishers.
Hodges, Julia, Shiyun Yie, Ray Reighart, and Lois Boggess. 1995. AIMS: Assisted indexing at Mississippi State. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/technicalReports.html#1995
Hodges, Julia, Shiyun Yie, Sonal Kulkarni, and Ray Reighart. 1995. AIMS: The index generation component. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/technicalReports.html#1995
Hodges, Julia, Shiyun Yie, and Ray Reighart. 1995. Evaluation of AIMS system. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/technicalReports.html#1995
Kulkarni, Sonal. 1995. Indexer: A tool to access index information from an object-oriented knowledge base. Master's project. Mississippi State University.
The KUDZU project has always been concerned with extracting information from a body of text in some particular domain. Sometimes the purpose of the extraction has been to add to or update knowledge already embedded in a knowledge base. Sometimes, as in the present indexing task, the purpose has been to identify the most relevant concepts associated with that text. In either case, it is necessary to recognize the many ways in which the same concepts can be expressed by writers who have the full freedom of a natural language such as English in which to express their ideas. Consequently, it is also necessary to build a system that can handle vocabulary and grammar that have not been anticipated, and that can relate previously unseen entities in the text to known concepts, when appropriate.
One element of our work is that we are constantly examining the properties of language in the aggregate. We use probabilistic tools to associate a newly encountered word, for instance, with known patterns in which it occurs. We use clustering algorithms to examine the contexts in which words are used, with the result that we can often group conceptually related words by "the company they keep". Because of the sheer size of vocabulary and complexity of the syntax of natural language, we are looking at new ways of examining those contexts, including color coding of properties of neighborhoods, so that visual examination is possible. We are also interested in techniques for reducing dimensionality of the linguistic data.
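The idea of grouping words by "the company they keep" can be sketched as follows: each word is represented by counts of the words that appear near it, and two words are compared by the cosine of their context vectors. This is an illustrative sketch only. The window size, the function names `context_vectors` and `cosine`, and the choice of cosine similarity are assumptions, and the sketch stops at pairwise similarity rather than reproducing whatever clustering algorithm the project uses.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Map each word to counts of the words seen within `window` positions.

    Hypothetical helper; the window size is an assumption for illustration.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    text = ("the acid dissolves in water the base dissolves in water "
            "the rate increases with temperature")
    vecs = context_vectors(text.split())
    print("acid ~ base:", cosine(vecs["acid"], vecs["base"]))
    print("acid ~ rate:", cosine(vecs["acid"], vecs["rate"]))
```

In this toy text, "acid" and "base" share most of their neighbors and so score as more similar to each other than "acid" does to "rate". A similarity measure of this kind could serve as the input to a clustering step.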
Grishman, Ralph. 1988. A very brief introduction to computational linguistics. (My copy was with the informal notes to the 1989 DARPA Speech and Natural Language Workshop, but the paper is not in the formal proceedings.)
Unfortunately, I can't suggest a good, short introduction to corpus-based language processing. A seminal paper is:
Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing. Association for Computational Linguistics.
User-centered Design. Our indexing project is not intended to stand alone as a replacement for human document analysts. Therefore it is essential that the human users responsible for the generation of all indexes from a document have access to the automatically generated indexes, the text that the system considers most relevant to the generation of that index, and easy means to augment and modify the indexes. Many of the issues in this aspect of our work are user-interface issues.