KNOWLEDGE ACQUISITION
FOR JAPANESE-ENGLISH MACHINE TRANSLATION

Ralph Grishman

Computer Science Department
New York University
715 Broadway, 7th Floor
New York, NY 10003

CONTACT INFORMATION

Prof. Ralph Grishman, Computer Science Department, New York University, 715 Broadway, 7th Floor, New York, NY 10003

email: grishman@cs.nyu.edu, phone: (212) 998-3497, fax: (212) 995-4123

WWW PAGE

http://cs.nyu.edu/cs/faculty/grishman/proteus.html

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Machine translation, natural language processing, linguistic knowledge acquisition

PROJECT SUMMARY

Machine translation systems require large amounts of detailed information about the correspondences between expressions in the source and target languages. System performance is limited to a great extent by our ability to encode such information manually. As an alternative, this project seeks to gather such correspondences from a pair of parallel corpora in the source and target languages, and then generalize from these correspondences to create rules for use in a translation system.

Our acquisition procedure begins by parsing the texts in the source and target languages and then performing syntactic regularization, producing regularized parse trees (roughly similar to LFG F-structures). A tree-matching procedure then aligns the corresponding trees from the source and target texts, producing a set of detailed correspondences between source and target structures. These correspondences will be generalized, to the extent possible, using a set of sublanguage (semantic) word classes. The result will be a set of rules for the transfer phase of a translation system.
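The tree-matching and generalization steps can be illustrated with a minimal sketch. It is not the project's actual procedure: the Node class, the align and generalize functions, the seed lexicon, the word classes, and the romanized example sentence are all hypothetical, and the matcher simply tries every one-to-one pairing of child nodes and ignores role labels.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Node:
    """A regularized parse-tree node: a head word plus role-labeled children."""
    head: str
    children: dict = field(default_factory=dict)  # role name -> Node


def align(src, tgt, lexicon):
    """Return (score, pairs) for the best alignment of two subtrees.

    The score counts aligned head pairs found in the (hypothetical) lexicon.
    """
    score = 1 if (src.head, tgt.head) in lexicon else 0
    s_kids = list(src.children.values())
    t_kids = list(tgt.children.values())
    k = min(len(s_kids), len(t_kids))
    # Try every one-to-one pairing of children (exhaustive; small trees only).
    if len(s_kids) <= len(t_kids):
        candidates = (list(zip(s_kids, p)) for p in permutations(t_kids, k))
    else:
        candidates = (list(zip(p, t_kids)) for p in permutations(s_kids, k))
    best_score, best_pairs = -1, []
    for pairing in candidates:
        results = [align(s, t, lexicon) for s, t in pairing]
        total = sum(r[0] for r in results)
        if total > best_score:
            best_score = total
            best_pairs = [pair for r in results for pair in r[1]]
    return score + best_score, [(src.head, tgt.head)] + best_pairs


def generalize(pairs, src_classes, tgt_classes):
    """Replace words by their sublanguage word class where one is known."""
    return [(src_classes.get(s, s), tgt_classes.get(t, t)) for s, t in pairs]


# Invented example: "Taro ga hon o yomu" / "Taro reads a book".
ja = Node("yomu", {"subj": Node("taro"), "obj": Node("hon")})
en = Node("read", {"subj": Node("Taro"), "obj": Node("book")})
lexicon = {("yomu", "read"), ("hon", "book")}  # assumed seed dictionary

score, pairs = align(ja, en, lexicon)
rule = generalize(
    pairs,
    {"taro": "PERSON", "hon": "DOCUMENT"},   # assumed source word classes
    {"Taro": "PERSON", "book": "DOCUMENT"},  # assumed target word classes
)
print(pairs)
print(rule)
```

In this sketch the pair taro/Taro is not in the lexicon; it is aligned only because the structure leaves no better pairing once hon/book is matched. The generalized output is a toy analogue of a transfer rule stated over word classes rather than individual words.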

To date we have implemented the parsing, tree-matching, and pattern acquisition procedures. These procedures are being tested on a Japanese programming language manual in collaboration with Monmouth University, and on Spanish texts (both introductory texts and computer "help" files) in collaboration with the Autonomous University of Madrid.

Our initial work aimed at demonstrating the feasibility of the alignment process on small parallel texts; we showed that the alignment could be iteratively improved by acquiring some vocabulary correspondences during an initial alignment and then using these correspondences to obtain a better alignment. More recently, we have drawn on new work on tree-matching algorithms to improve the efficiency of our alignment procedure. We are now preparing alignments of a larger corpus, which should allow us to acquire a substantial set of translation rules.

PROJECT REFERENCES

V. Teller, M. Kosaka, and R. Grishman. A Comparative Study of Japanese and English Sublanguage Patterns. Proc. 2nd Int'l Conf. on Theoretical and Methodological Issues in Machine Translation, Pittsburgh, PA, 1988.

R. Grishman and M. Kosaka. Combining Rationalist and Empiricist Approaches to Machine Translation. Proc. 4th Int'l Conf. on Theoretical and Methodological Issues in Machine Translation, Montreal, Canada, 1992.

Ralph Grishman. Iterative Alignment of Syntactic Structures for a Bilingual Corpus. Proceedings of the Second Annual Workshop on Very Large Corpora, Kyoto, Japan, August 1994.

Adam Meyers, Roman Yangarber, and Ralph Grishman. Alignment of Shared Forests for Bilingual Corpora. Proteus Project Memorandum #79, Computer Science Dept., New York University, 1995.

AREA BACKGROUND

Work on machine translation has been going on for nearly 50 years; it is now an area of some commercial importance. For most of this time, the focus was on the manual development of linguistic rules and large dictionaries. However, over the last several years, along with the rapid rise of corpus-based computational linguistics, there has been increasing interest in developing machine translation systems that can be based on or trained from parallel bilingual corpora.

In Japan, there has been considerable work on "Example-Based Machine Translation". Although much effort has been invested in developing word classes and rules for translating particular constructs, in some cases this has not been very successful because of the large number of idiosyncratic word patterns that require different translations. Several systems have therefore been built which, in place of rules, consult collections of paired English-Japanese examples, and use the most similar example as the basis for translation. These systems, however, are typically based on pre-selected examples rather than full parallel corpora.
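The retrieval step of such a system can be sketched in a few lines. This is an assumption-laden illustration, not a description of any of the cited systems: the similarity measure here is plain word overlap (Jaccard), whereas real example-based systems typically use richer measures, and the example pairs and function names are invented.

```python
def similarity(a, b):
    """Word-overlap (Jaccard) similarity between two phrases."""
    a_words, b_words = set(a.split()), set(b.split())
    union = a_words | b_words
    if not union:
        return 0.0
    return len(a_words & b_words) / len(union)


def most_similar_example(query, examples):
    """Return the stored (source, target) pair whose source best matches query."""
    return max(examples, key=lambda ex: similarity(query, ex[0]))


# Invented example base (English, romanized Japanese).
examples = [
    ("open the file", "fairu o hiraku"),
    ("close the window", "mado o shimeru"),
]
best = most_similar_example("open the new file", examples)
print(best)
```

The retrieved pair would then serve as the basis for the translation, with the differing words ("new" in this case) still to be handled by some adaptation step that this sketch omits.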

Researchers at IBM in Yorktown Heights adopted an almost purely statistical approach to machine translation. Using a very large corpus of parallel French and English text, they developed statistics on the frequency of translation of English word sequences into French word sequences, and on the different possible arrangements of these French word sequences. This statistical information was then used to select a most likely French translation of new English text. However, the use of word-based alignments, without any structural analysis of the input texts, greatly increases the burden on the statistical methods.
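As a rough illustration of what "statistics on the frequency of translation" means, the sketch below computes relative frequencies from word pairs that are assumed to be aligned already. That assumption is the main simplification: the statistical approach described above has to estimate such alignments from sentence-level data and also models word order, neither of which is attempted here. The function name and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def translation_table(aligned_pairs):
    """Estimate P(target word | source word) by relative frequency."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    return {
        src: {tgt: n / sum(c.values()) for tgt, n in c.items()}
        for src, c in counts.items()
    }


# Invented, pre-aligned (English, French) word pairs.
pairs = [
    ("house", "maison"),
    ("house", "maison"),
    ("house", "chambre"),
    ("the", "la"),
]
table = translation_table(pairs)
most_likely = max(table["house"], key=table["house"].get)
print(most_likely)
```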

Over the past several years, researchers at Kyoto University, the Nara Institute of Science and Technology, and Hitachi have developed alignment procedures for structurally-analyzed bilingual corpora, and conducted a few experiments on the use of these aligned corpora for translation. However, many questions remain open about the best alignment procedure and about how the resulting correspondences should be generalized and used for translation; our research is attempting to address some of these questions.

AREA REFERENCES

W. John Hutchins and Harold L. Somers. An Introduction to Machine Translation. Academic Press, 1992.

P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A Statistical Approach to Machine Translation. Computational Linguistics 16 (2), 1990.

E. Sumita and H. Iida. Experiments and Prospects of Example-Based Machine Translation. Proc. 29th Annl. Meeting Assn. Computational Linguistics, Berkeley, CA, 1991.

T. Utsuro, H. Ikeda, M. Yamane, Y. Matsumoto, and M. Nagao. Bilingual Text Matching using Bilingual Dictionary and Statistics. Proc. 15th Int'l Conf. on Computational Linguistics, Kyoto, 1994.

H. Kaji, Y. Kida, and Y. Morimoto. Learning Translation Templates from Bilingual Text. Proc. 14th Int'l Conf. on Computational Linguistics, Nantes, 1992.