BACKGROUND AND METHODOLOGY FOR THE REPRESENTATION
AND MANIPULATION OF LINGUISTIC CORPORA

Nancy Ide, Principal Investigator

Department of Computer Science
Vassar College
Poughkeepsie, New York 12601
USA

CONTACT INFORMATION

Department of Computer Science, Vassar College, Poughkeepsie, New York 12601, USA

tel : +1 (914) 437 5988

fax : +1 (914) 437 7187

email : ide@cs.vassar.edu

WWW PAGE

http://www.cs.vassar.edu/~ide/research

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Corpora, corpus linguistics, text encoding, text handling software, software design

PROJECT SUMMARY

This project is intended to provide a theoretical background and develop coherent methodologies for the representation, access, and manipulation of corpora intended for use in corpus-based natural language processing (NLP) research. The project builds on and continues a program of collaborative research, established in 1988, between Vassar College's Department of Computer Science and the Laboratoire Parole et Langage of the Centre National de la Recherche Scientifique (CNRS) in Aix-en-Provence, France.

The increasing interest in the use of large-scale textual resources for NLP research has led to the rapid proliferation of both massive amounts of textual data and text-handling tools. Much of the currently available data is marked up and annotated using ad hoc formats, most of which are entirely inconsistent with one another, and almost none of which has been developed on the basis of a sound model of text and text categories or with any serious consideration of the needs of corpus-based NLP research. Similarly, and for related reasons, there is enormous redundancy in the functionality of much existing corpus-handling software (part-of-speech taggers, statistics-gathering programs, etc.), because the same systems must be re-invented over and over again to accommodate specific input and output formats and platforms. Because such software is typically embedded in large, monolithic systems, the ability to modify it and re-use relevant pieces in other applications is severely limited. Here too, the lack of a principled basis for text software design is the cause of this redundancy and limited reusability.

Our goal is to develop a sound basis and methodology for corpus representation as well as for the design of corpus-handling tools. There is an obvious dependency between the two, which demands that they be developed hand in hand. The task involves: (1) analysis of the needs of corpus-based NLP research, both in terms of the kinds and degree of annotation required and the requirements for efficient processing, accessibility, etc.; (2) analysis of the general properties and configuration of corpora, analysis of the relevant structural and logical features of component text types, and the design of encoding mechanisms that can represent all required elements and features while accommodating the requirements determined in (1); and (3) specifications for text software design, coordinated with (2), with the aim of avoiding redundancy and maximizing modifiability, extensibility, and reusability.
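The dependency between encoding and tool design can be illustrated with a minimal sketch. It is not the project's actual encoding scheme or software: the one-token-per-line, tab-separated format and the names Token, read_tokens, write_tokens, toy_tagger, and tag_counts are all hypothetical, chosen only to show how small tools that share one input and output format can be recombined without rewriting any of them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Token:
    """One corpus token; the tag is optional so untagged text is also valid."""
    form: str
    tag: Optional[str] = None


def read_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Parse the hypothetical shared format: 'form' or 'form<TAB>tag' per line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        form, _, tag = line.partition("\t")
        yield Token(form, tag or None)


def write_tokens(tokens: Iterable[Token]) -> Iterator[str]:
    """Serialize tokens back into the same shared format."""
    for tok in tokens:
        yield tok.form if tok.tag is None else f"{tok.form}\t{tok.tag}"


def toy_tagger(tokens: Iterable[Token]) -> Iterator[Token]:
    """Stand-in for a real tagger: labels digit strings NUM, everything else WORD."""
    for tok in tokens:
        yield Token(tok.form, "NUM" if tok.form.isdigit() else "WORD")


def tag_counts(tokens: Iterable[Token]) -> Counter:
    """Statistics-gathering tool that depends only on the shared format."""
    return Counter(tok.tag for tok in tokens)


# Each tool reads and writes the same format, so stages can be chained freely.
raw = ["the", "corpus", "has", "3", "texts"]
tagged_lines = list(write_tokens(toy_tagger(read_tokens(raw))))
counts = tag_counts(read_tokens(tagged_lines))
```

Because every stage consumes and produces the same representation, the toy tagger could be replaced by a real one, or the counter by another analysis tool, without changing any other component.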

PROJECT REFERENCES

Ide, N. Corpus Encoding Standard. EAGLES Text Representation Group Report. Available from Istituto di Linguistica Computazionale, University of Pisa, Italy (1995) 75p.

Ide, N. Encoding standards for large text resources. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan (1994), 574-78.

Ide, N., Veronis, J. (eds.). Computers and the Humanities (Special Issue on the Text Encoding Initiative) 29:1-3 (1995).

Ide, N., Veronis, J. MULTEXT: Multilingual Text Tools and Corpora. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan, (1994) 588-92.

Ide, N., Veronis, J. What next after the Text Encoding Initiative? The need for text software. ACH Newsletter, Winter (1993), 1-3.

International Organization for Standardization, ISO 8879: Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). ISO (1986).

Sperberg-McQueen, C.M., Burnard, L., Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative, Chicago and Oxford (1994).

van Herwijnen, E. Practical SGML, Kluwer Academic Publishers, Boston (1991).

AREA BACKGROUND

Within the natural language processing (NLP) research community, the past few years have seen a burst of activity in the development of statistical methods which, applied to massive text data, have in turn enabled the development of comprehensive and robust models of language structure and use. Such models are increasingly recognized as an invaluable resource for NLP tasks, including machine translation.

The upsurge of interest in empirical methods for language modelling has led inevitably to a need for massive collections of texts of all kinds, including text collections which span genre, register, spoken and written data, etc., as well as domain- or application-specific collections, and, especially, multi-lingual collections with parallel translations. It is essential that such resources be reusable, to avoid massive duplication of effort; this in turn requires the development of standardized encoding formats for this data. It is no longer realistic to distribute data in ad hoc formats, since the effort and resources required to clean up and reformat the data for local use are at best costly, and in many cases prohibitive. Further, software to manipulate these texts, if it is itself to be reusable, also requires the existence of a standard encoding format.

At the same time, there is an increasing need for reliable, reusable software to access, manipulate, and analyze large-scale text resources. Existing tools for corpus-based NLP research, most of which are available from private or academic research institutions, are most often hard to get, hard to install, under-documented, unreliable, and occasionally expensive. These tools are typically embedded in large systems which are fundamentally incompatible, and are therefore hard to integrate, adapt or extend. No effort has been devoted to the development of coherent and consistent methods for software design, based on strong principles and a sound text model, and with an eye toward their maximum utility (especially for handling multi-lingual data).

This project is intended to address these needs.

AREA REFERENCES

Brown, P.F. et al. Word-sense disambiguation using statistical methods. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, (1991) 18-21, Berkeley, California.

Church, K. W. A stochastic parts program and noun phrase parser for unrestricted texts. In Proceedings of the Second Conference on Applied Natural Language Processing. Austin, Texas, (1988) 136-143.

Coombs, J.H., Renear, A.H., and DeRose, S.J. Markup systems and the future of scholarly text processing. Communications of the ACM 30, 11 (1987), 933-947.

Gale, W., Church, K.W. A Program for Aligning Sentences in Bilingual Corpora, Proceedings of the ACL Conference, Berkeley, (1991) 177-184.

Goldfarb, C.F. The SGML Handbook, Clarendon Press, Oxford (1990).

Ide, N., Veronis, J. (eds.). Computers and the Humanities (Special Triple Issue on the Text Encoding Initiative) 29, 1-3 (1995).

Ide, N., Walker, D. (Eds.) Computers and the Humanities (Special double issue on Common Methodologies in Computational Linguistics and Humanities Computing) 26, 5-6 (1993).

Liberman, M., Marcus, M. Tutorial on text corpora. Association for Computational Linguistics Annual Conference (1992).

RELATED PROGRAM AREAS

Adaptive human interfaces, Intelligent interactive systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

Development of representation schemes and data architectures for multi-media (text, speech, image data), in conjunction with study of human queries and modes of exploring such data.

Development of representation schemes and data architectures for linked complexes of text, speech, and image data, including multi-level analyses (e.g., morpho-syntax, prosody, etc.).

Development of representation schemes for Braille.