AUTOMATICALLY LEARNING NATURAL LANGUAGE
FROM ON-LINE TEXT CORPORA

Eric Brill

Department of Computer Science
and
Center for Language and Speech Processing
Johns Hopkins University

CONTACT INFORMATION

Eric Brill, Department of Computer Science, Johns Hopkins University, Baltimore, Md. 21218-2694

brill@cs.jhu.edu

Phone: (410) 516-5560

Fax: (410) 516-6134

WWW PAGE

http://www.cs.jhu.edu/faculty/brill/home.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Corpus-Based Language Learning, Computational Linguistics, Natural Language Processing

PROJECT SUMMARY

Robust automated natural language processing has the potential to revolutionize the way we interact with machines and process information. To date, the major bottleneck inhibiting the creation of robust and accurate natural language processing systems has been linguistic knowledge acquisition: how a machine can obtain the linguistic sophistication necessary to process language accurately. This proposal describes work that will help break this knowledge acquisition bottleneck. As an alternative to manually providing a machine with linguistic knowledge, we will explore methods for automatically learning linguistic knowledge from on-line text corpora. As a starting point, we will explore ways of extending a recently developed corpus-based learning algorithm called transformation-based error-driven learning. Extending this algorithm, and carefully studying the strengths and weaknesses of this and other approaches to corpus-based learning, should lead to powerful programs for automatically learning the linguistic information that an advanced natural language processing system needs to be effective.

PROJECT REFERENCES

E. Brill, Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach, Proceedings of the 31st Meeting of the Association for Computational Linguistics, Columbus, Oh., 1993

E. Brill, Some Advances in Rule-Based Part of Speech Tagging, Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Wa., 1994

E. Brill, Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging, Computational Linguistics, 1995

E. Brill and P. Resnik, A Transformation-Based Approach to Prepositional Phrase Attachment Disambiguation, Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-1994), Kyoto, Japan, 1994

E. Brill, Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging, Proceedings of the Third ACL Workshop on Very Large Corpora, 1995

AREA BACKGROUND

A sentence is much more than just a linear sequence of symbols. A sentence annotated with lexical information (such as parts of speech, word senses and word classes) and with phrasal annotations that indicate hierarchical structure and the relationships between words can be processed more effectively than an unannotated sequence or bag of symbols. Given accurate linguistic annotation of a sentence, meaning can be extracted more easily and grammatical and semantic constraints can be checked more readily.

For a program to annotate a sentence accurately, it must be provided with a great deal of information about language. Until recently, such knowledge was typically hand-coded by language engineers, a time-consuming process that rarely resulted in accurate, robust systems. This linguistic knowledge acquisition bottleneck has made it difficult to create accurate linguistic annotation programs, and the resulting inability to accurately analyze the linguistic structure of a sentence has hindered the development of sophisticated natural language processing systems.

Over the past few years, there has been a major shift from trying to manually derive linguistic information to extracting this information automatically from on-line resources such as corpora, dictionaries and encyclopedias. In addition, a number of text corpora have been carefully annotated with linguistic information, and these corpora are also valuable resources for automatic knowledge acquisition. Programs that employ machine learning techniques to automatically learn linguistic information are becoming steadily more reliable as more sophisticated techniques are developed and larger training corpora are made available. It is our hope that with the development of accurate and portable linguistic annotation algorithms, it will be possible to create highly sophisticated natural language processing systems in the near future.

AREA REFERENCES

J. Allen, Natural Language Understanding, Benjamin/Cummings Publishing Co., 1995

K. Church and R. Mercer, Introduction to the Special Issue on Computational Linguistics Using Large Corpora, Computational Linguistics, 1993

E. Charniak, Statistical Language Learning, MIT Press, 1993

RELATED PROGRAM AREAS

Virtual Environments, Other Communication Modalities, Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities