AUTOMATICALLY LEARNING NATURAL LANGUAGE
FROM ON-LINE TEXT CORPORA
Eric Brill
Department of Computer Science
and
Center for Language and Speech Processing
Johns Hopkins University
CONTACT INFORMATION
Eric Brill,
Department of Computer Science,
Johns Hopkins University,
Baltimore, Md. 21218-2694
brill@cs.jhu.edu
Phone: (410) 516-5560
Fax: (410) 516-6134
WWW PAGE
http://www.cs.jhu.edu/faculty/brill/home.html
PROGRAM AREA
Speech and Natural Language Understanding.
KEYWORDS
Corpus-Based Language Learning,
Computational Linguistics,
Natural Language Processing
PROJECT SUMMARY
Robust automated natural language processing has the potential to
revolutionize the way we interact with machines and process
information. To date, the major bottleneck inhibiting the creation of
robust and accurate natural language processing systems has been the
problem of linguistic knowledge acquisition: how a machine can obtain
the linguistic sophistication necessary to process language
accurately. This proposal describes work that will help break this
knowledge acquisition bottleneck. As an alternative to manually
providing a machine with linguistic knowledge, methods will be
explored for automatically learning linguistic knowledge from on-line
text corpora. The starting point for this research is to extend a
recently developed corpus-based learning algorithm called
transformation-based error-driven learning. Extending this algorithm
and carefully studying the strengths and weaknesses of this and other
approaches to corpus-based learning should lead to programs that can
automatically learn the linguistic information an advanced natural
language processing system needs to be effective.
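The summary names transformation-based error-driven learning without
spelling out the procedure. As a rough illustration only, the sketch
below follows the general scheme described in the part-of-speech
tagging papers listed under PROJECT REFERENCES: tag every word with
its most frequent tag, compare the result with a hand-tagged corpus,
and greedily learn an ordered list of rewrite rules that each remove
the most remaining errors. The toy corpus, the single rule template
("change tag A to B when the previous tag is Z"), and all function
names are assumptions made for this example, not the project's code.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus, for illustration only.
CORPUS = [
    [("the", "DT"), ("can", "NN"), ("rusted", "VBD")],
    [("we", "PRP"), ("can", "MD"), ("run", "VB")],
    [("they", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("we", "PRP"), ("run", "VB")],
]


def most_frequent_tags(corpus):
    """Initial-state annotator: map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def apply_rule(rule, tags):
    """Change tag `frm` to `to` wherever the previous tag is `prev`."""
    frm, to, prev = rule
    return [
        to if i > 0 and tag == frm and tags[i - 1] == prev else tag
        for i, tag in enumerate(tags)
    ]


def count_errors(guesses, truths):
    """Number of positions where the guessed tag differs from the truth."""
    return sum(g != t for gs, ts in zip(guesses, truths) for g, t in zip(gs, ts))


def learn(corpus, max_rules=10):
    """Greedily learn an ordered rule list that reduces tagging errors."""
    lexicon = most_frequent_tags(corpus)
    truths = [[tag for _, tag in s] for s in corpus]
    guesses = [[lexicon[word] for word, _ in s] for s in corpus]
    rules = []
    while len(rules) < max_rules:
        # Propose one candidate rule per remaining error (single template).
        candidates = sorted({
            (gs[i], ts[i], gs[i - 1])
            for gs, ts in zip(guesses, truths)
            for i in range(1, len(gs))
            if gs[i] != ts[i]
        })
        if not candidates:
            break
        current = count_errors(guesses, truths)

        def gain(rule):
            # Net error reduction: fixes minus newly introduced errors.
            retagged = [apply_rule(rule, gs) for gs in guesses]
            return current - count_errors(retagged, truths)

        best = max(candidates, key=gain)
        if gain(best) <= 0:
            break
        guesses = [apply_rule(best, gs) for gs in guesses]
        rules.append(best)
    return lexicon, rules, count_errors(guesses, truths)


def tag(words, lexicon, rules, default="NN"):
    """Tag new text: initial guess from the lexicon, then rules in order."""
    tags = [lexicon.get(word, default) for word in words]
    for rule in rules:
        tags = apply_rule(rule, tags)
    return tags


if __name__ == "__main__":
    lexicon, rules, errors = learn(CORPUS)
    print("learned rules:", rules)
    print("remaining training errors:", errors)
    print(tag(["the", "can"], lexicon, rules))
```

The learned knowledge is a short, readable list of rules rather than
a table of probabilities. A realistic learner would use many rule
templates and a far larger annotated corpus.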
PROJECT REFERENCES
E. Brill, Automatic Grammar Induction and Parsing Free Text: A
Transformation-Based Approach, Proceedings of the 31st Meeting of the
Association for Computational Linguistics, Columbus, Oh., 1993
E. Brill, Some advances in rule-based part of speech tagging,
Proceedings of the Twelfth National Conference on Artificial
Intelligence (AAAI-94), Seattle, Wa., 1994
E. Brill, Transformation-Based Error-Driven Learning and Natural
Language Processing: A Case Study in Part of Speech Tagging,
Computational Linguistics, 1995
E. Brill and P. Resnik, A transformation-based approach to
prepositional phrase attachment disambiguation, Proceedings of the
Fifteenth International Conference on Computational Linguistics
(COLING-1994), Kyoto, Japan, 1994
E. Brill, Unsupervised Learning of Disambiguation Rules for Part of
Speech Tagging, Proceedings of the Third ACL Workshop on Very Large
Corpora, 1995
AREA BACKGROUND
A sentence is much more than just a linear sequence of symbols. A
sentence annotated with lexical information (such as parts of speech,
word senses and word classes) and with phrasal annotations indicating
hierarchical structure and the relationships between words can be
processed more effectively than an unannotated sequence or bag of
symbols. Given accurate linguistic annotation of a sentence, meaning
can be more easily extracted and grammatical and semantic constraints
can more readily be checked.
To annotate a sentence accurately, a program must be provided with a
great deal of information about language. Until recently, such
knowledge was typically hand-coded by language engineers, a
time-consuming process that rarely resulted in accurate, robust
systems. This linguistic knowledge acquisition bottleneck has made it
difficult to create accurate linguistic annotation programs, and the
resulting inability to accurately analyze the linguistic structure of
a sentence has hindered the development of sophisticated natural
language processing systems.
Over the past few years, there has been a major shift from trying to
manually derive linguistic information to extracting this information
automatically from on-line resources such as corpora, dictionaries and
encyclopedias. In addition, a number of text corpora have been
carefully annotated with linguistic information. These corpora are
also valuable resources for automatic knowledge acquisition. Programs
that use machine learning techniques to learn linguistic information
automatically are becoming steadily more reliable as more
sophisticated techniques are developed and larger training corpora
are made available. We hope that the development of accurate and
portable linguistic annotation algorithms will make it possible to
create highly sophisticated natural language processing systems in
the near future.
AREA REFERENCES
J. Allen, Natural Language Understanding, Benjamin/Cummings
Publishing Co., 1995
K. Church and R. Mercer, Introduction to the Special Issue on
Computational Linguistics Using Large Corpora, Computational
Linguistics, 1993
E. Charniak, Statistical Language Learning, MIT Press, 1993
RELATED PROGRAM AREAS
Virtual Environments,
Other Communication Modalities,
Adaptive Human Interfaces,
Intelligent Interactive Systems for Persons with Disabilities