A KNOWLEDGE-BASED APPROACH TO INDEXING SCIENTIFIC DATA

Lois Boggess and Julia Hodges

Department of Computer Science
Mississippi State University

CONTACT INFORMATION

Department of Computer Science, Drawer 9637, Mississippi State, MS 39762

email: lboggess@cs.msstate.edu, phone: (601)325-7507, fax: (601)325-8997

WWW PAGE

http://www.cs.msstate.edu/artificial_intelligence/kudzu.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Semantics, Natural Language Analysis, Language Adaptation

PROJECT SUMMARY

The AIMS (Assisted Indexing at Mississippi State) project, an outgrowth of KUDZU, is progressing toward producing indexes (words and phrases describing the content) of scientific text. The system is evaluated by direct comparison with the indexes produced by professional document analysts who are domain specialists.

To this end, a preprocessor screens text for names and formulae of chemical substances. A mutual information statistic is used to capture meaningful collocations in the domain. The text is then tagged with syntactic and semantic categories by a rule-based tagger trained on about 160,000 words of tagged and hand-corrected text from the domain. Existing text is analyzed to determine

a) which concepts relate to the indexing task;
b) the relationships between those entities; and, because particular words and phrases can map onto more than one indexable concept,
c) what constraints can appropriately identify the correct target index among competing indexes.

That is, there is a dual decision: whether to index a concept at all and how to distinguish among competing concepts.

Our stated goal has been to attain a recall rate of 80% for a set of about 200 concepts in the domain of physical chemistry. In May of this year, for a test set of 15 articles, the results were as follows:

Length of articles ranged from 2,524 words to 13,427 words, totalling 101,895 words.
Precision ranged from a low of 27% to a high of 100%, with overall precision at 50%.
Overgeneration ranged from 0 to 73%, with overall overgeneration at 50%.,
Recall ranged from 50% to 100% with overall recall at 85%.

PROJECT REFERENCES

Agarwal, Rajeev. 1994 (Almost) automatic semantic feature extraction from technical text. In Proceedings of the Human Language Technology Workshop, pp. 378-83. Morgan Kaufman Publishers.

Agarwal, Rajeev. 1995. Semantic feature extraction from technical texts with limited human intervention. Ph.D. dissertation. Mississippi State University. http://www.cs.msstate.edu/artificial_intelligence/ kudzu_papers/rajeev_dissertation.ps

Agarwal, Rajeev. 1995. Evaluation of semantic clusters. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Boggess, Lois and Julia Hodges. 1994. A knowledge-based approach to indexing scientific text. In In Proceedings of the Human Language Technology Workshop, p. 458. Morgan Kaufman Publishers.

Hodges, Julia, Shiyun Yie, Ray Reighart, and Lois Boggess. 1995. AIMS: Assisted indexing at Mississippi State. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/ technicalReports.html#1995

Hodges, Julia, Shiyun Yie, Sonal Kulkarni, and Ray Reighart. 1995. AIMS: The index generation component. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/ technicalReports.html#1995

Hodges, Julia, Shiyun Yie, and Ray Reighart. 1995. Evaluation of AIMS system. MSU Technical Report. http://www.cs.msstate.edu/PUBLICATIONS/ technicalReports.html#1995

Kulkarni, Sonal. 1995. Indexer: A tool to access index information from an object-oriented knowledge base. Master's project. Mississippi State University.

AREA BACKGROUND

Our work is a confluence of two areas of expertise: corpus-based natural language processing and knowledge bases. It can be seen from at least three perspectives. One of these is closely related to information retrieval, one is related to broad-based study of the characteristics of language, and one is related to knowledge bases themselves.

The KUDZU project has always been related to extraction of information from a body of text from some particular domain. Sometimes the purpose of the extraction has been to add to or update existing knowledge already embedded in a knowledge base. Sometimes, as in the present indexing task, the purpose has been to identify the most relevant concepts associated with that text. In either case, it is necessary to recognize the many ways that the same concepts can be expressed by writers who have the full freedom of a natural language such as English in which to express their ideas. Consequently, it is also necessary to build a system which is capable of handling vocabulary and grammar that has not been anticipated, and which is able to relate previously unseen entities in that text to known concepts, when appropriate.

One element of our work is that we are constantly examining the properties of language in the aggregate. We use probabilistic tools to associate a newly encountered word, for instance, with known patterns in which it occurs. We use clustering algorithms to examine the contexts in which words are used, with the result that we can often group conceptually related words by "the company they keep". Because of the sheer size of vocabulary and complexity of the syntax of natural language, we are looking at new ways of examining those contexts, including color coding of properties of neighborhoods, so that visual examination is possible. We are also interested in techniques for reducing dimensionality of the linguistic data.

AREA REFERENCES

There is a nice, short, readable paper that motivates the study and use of traditional, syntax-based natural language processing using traditional parsing. It was copyrighted by Ralph Grishman in 1988.

Grishman, Ralph. 1988. A very brief introduction to computational linguistics. (My copy was with the informal notes to the 1989 DARPA Speech and Natural Language Workshop, but the paper is not in the formal proceedings.)

Unfortunately, I can't suggest a good, short introduction to corpus-based language processing. A seminal paper was

Church, Kenneth. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing. Association for Computational Linguistics.

RELATED PROGRAM AREAS

Virtual Environments. We are already working toward issues in visualization of high-dimensional data, as well as very large quantities of data. There seems to be almost no existing body of work on visualization related to text (as opposed to speech).

User-centered Design. Our indexing project is not intended to stand alone as a replacement for human document analysts. Therefore it is essential that the human users responsible for the generation of all indexes from a document have access to the automatically generated indexes, the text that the system considers most relevant to the generation of that index, and easy means to augment and modify the indexes. Many of the issues in this aspect of our work are user-interface issues.