The objective of the proposed research is to develop a system that reliably resolves ellipsis in English. The system will accept syntactically annotated input and produce output in which each elliptical expression is resolved, either by a non-elliptical paraphrase or by a link to its antecedent.

The system will be developed using the Penn Treebank, a syntactically annotated corpus of several million words that covers a wide range of texts in varying styles. The proposed project makes crucial use of such newly available syntactically annotated treebanks and their associated tools. The system will be tested on examples from the Treebank to determine the percentage of cases in which ellipsis is correctly resolved. The goal is a success rate of 95%, which preliminary research suggests is realistic.
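The evaluation measure described above is a simple accuracy computation. The following Python sketch illustrates it under stated assumptions: the `EllipsisCase` record, its field names, and the `success_rate` function are hypothetical names introduced here for illustration, not part of the proposed system or of the Penn Treebank tools, and a case is counted as correct only when the predicted antecedent matches the gold annotation exactly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EllipsisCase:
    """One test case (hypothetical record layout, assumed for illustration)."""
    ellipsis_id: str            # identifier of the elliptical expression
    gold_antecedent: str        # antecedent given by the reference annotation
    predicted_antecedent: str   # antecedent chosen by the system


def success_rate(cases: Sequence[EllipsisCase]) -> float:
    """Return the percentage of cases whose prediction matches the gold antecedent."""
    if not cases:
        raise ValueError("no test cases supplied")
    correct = sum(
        1 for case in cases
        if case.predicted_antecedent == case.gold_antecedent
    )
    return 100.0 * correct / len(cases)
```

Under this measure, the stated goal corresponds to `success_rate(cases) >= 95.0` on the test examples. Exact matching is only one possible criterion; a partial-overlap criterion would give a different figure.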

The system would represent a practical solution to a problem that confronts virtually any Natural Language Processing application that attempts to process English in a realistic setting. In addition, the system could be used as a tool in further annotating treebanks with ellipsis resolution information. Finally, the project would produce a massive amount of valuable data for theoreticians studying ellipsis and related phenomena.

The planned education activities during the proposed award period involve: teaching and developing a graduate and undergraduate course in Natural Language Processing (NLP), teaching and developing a variety of courses for non-majors, and teaching graduate and undergraduate courses in programming languages.


The ultimate goal of Natural Language Processing (NLP) is the development of computer systems that can communicate using Natural Language. Closely related to this practical goal is a more theoretical one: the development of an adequate model of the human cognitive abilities underlying the use of language. There is a synergistic, if often contentious, relationship between the computational activities in NLP and the more theoretical inquiries of linguists, logicians, and others. In NLP we rely on concepts, methods, and frameworks developed by the theoreticians, but we also often cause the theories to be modified, extended, and improved.

The field of NLP can be divided into three sub-areas: Syntactic Processing, which involves the determination of a parse tree for a given sentence; Semantic Interpretation, which involves determining the truth conditions, or logical structure, of a sentence; and Context and World Knowledge, which involves the integration of sentences with the surrounding discourse as well as the general situation. The study of discourse context has long been important in NLP, and in the past two decades the emphasis on discourse issues in NLP has contributed to a re-evaluation of the emphasis on individual sentences in theoretical linguistics, as seen in theories such as Discourse Representation Theory and Dynamic Semantics. Probably the most important current development in the field is the increased use of on-line corpora (that is, large bodies of text) as a means of developing and evaluating NL systems and theories of all kinds. I believe this development presents the opportunity to place the study of language, both computational and theoretical, on a sound empirical basis for the first time.


