Bootstrapping Monolingual Parsers from Multilingual Data
The creation of the Penn Treebank and similar datasets ca. 1990 produced a flowering of research on empirically trained syntactic parsers, which is now bearing fruit in information extraction and machine translation (e.g. Weischedel 2004, Quirk et al. 2005, Marcu et al. 2006). This revolution has bypassed most languages and domains, however, due to the expense of creating treebanks. Semi-supervised learning methods such as bootstrapping and co-training have the potential to leverage diverse sources of knowledge for robust statistical parsing in these new settings.
We argue that bootstrapping a parser from limited supervised data (~50-100 trees) is most promising when the model uses a rich set of redundant features, as in recent supervised models for scoring dependency parses (McDonald, Crammer, and Pereira, 2005). We show how to aid bootstrapping efficiently by drawing new features from a parser in another domain, or even another language, via parallel corpora or dictionaries. The resulting quasi-synchronous grammars extend prior bootstrapping work with synchronous grammars (Hwa et al. 2002, Smith & Smith 2004, Jansche 2005) and also have applications in translation modeling (Smith & Eisner 2006).
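The abstract does not spell out the bootstrapping loop itself, but the general Yarowsky-style recipe it builds on can be sketched in a few lines: train on a small seed set, label the unlabeled pool with the current model, absorb only confidently labeled items, and repeat. The toy sketch below (all names, the decision-list-style scorer, and the data are invented for illustration; a real parser would score whole trees, not feature bags) shows the shape of such a loop:

```python
from collections import Counter, defaultdict

def yarowsky_bootstrap(seed, unlabeled, rounds=5, threshold=0.6):
    """Toy Yarowsky-style self-training loop (illustrative only).

    seed: list of (feature-set, label) pairs; unlabeled: list of feature
    sets. Each round, label the pool by per-feature label votes and
    absorb items whose top label wins at least `threshold` of the votes.
    """
    labeled = list(seed)
    pool = [frozenset(f) for f in unlabeled]
    for _ in range(rounds):
        # per-feature label counts from the current labeled set
        counts = defaultdict(Counter)
        for feats, lab in labeled:
            for f in feats:
                counts[f][lab] += 1
        newly, rest = [], []
        for feats in pool:
            votes = Counter()
            for f in feats:
                votes.update(counts[f])
            if votes:
                lab, n = votes.most_common(1)[0]
                if n / sum(votes.values()) >= threshold:
                    newly.append((feats, lab))
                    continue
            rest.append(feats)
        if not newly:        # nothing confident left: stop early
            break
        labeled += newly
        pool = rest
    return labeled
```

Note how redundant features matter here: an item with no overlap with the seed can still be labeled in a later round, once other features it shares have been pulled in by earlier rounds.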
Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we present a family of new on-line bootstrapping algorithms that optimize a likelihood-like loss function with generalized entropy regularization. We show that this approach avoids the losses in accuracy incurred by EM-based learning. Combining diverse knowledge sources in a novel conditional model of graph spanning trees, we learn improved parsers for Czech, German, and other non-projective languages. We argue that these techniques are broadly applicable to bootstrapping in other NLP domains with a wealth of overlapping features.
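The spanning-tree view of non-projective dependency parsing mentioned above reduces decoding to a maximum spanning arborescence problem: given a score for every head-dependent arc, find the highest-scoring tree rooted at an artificial root node. The standard algorithm for this is Chu-Liu/Edmonds; the sketch below is a minimal dense-matrix version (the abstract does not give the decoder, so this is an assumption about the standard machinery, not the talk's method), with scores assumed to come from some trained arc-scoring model:

```python
def chu_liu_edmonds(score, root=0):
    """Maximum spanning arborescence over a dense score matrix, where
    score[h][d] is the score of attaching dependent d to head h.
    Returns {dependent: head} for every node except the root."""
    n = len(score)
    # dict-of-dicts so contracted super-nodes can be added as we recurse
    s = {h: {d: score[h][d] for d in range(n) if d != h} for h in range(n)}
    return _contract(s, set(range(n)) - {root}, root)

def _contract(s, nodes, root):
    cands = nodes | {root}
    # greedy step: every node takes its highest-scoring head
    head = {v: max((h for h in cands if h != v), key=lambda h: s[h][v])
            for v in nodes}
    cycle = _find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    c = max(cands) + 1          # fresh id for the contracted super-node
    s[c] = {}
    enter, leave = {}, {}       # original arcs hidden behind contracted ones
    outside = cands - cyc
    for h in outside:
        # arcs into the cycle are rescored by the gain over the arc they break
        best = max(cyc, key=lambda v: s[h][v] - s[head[v]][v])
        s[h][c] = s[h][best] - s[head[best]][best]
        enter[h] = best
    for d in outside - {root}:
        best = max(cyc, key=lambda h: s[h][d])
        s[c][d] = s[best][d]
        leave[d] = best
    sub = _contract(s, (nodes - cyc) | {c}, root)
    # expand: restore the cycle arcs, then break the one displaced by
    # the arc chosen to enter the contracted node
    h_c = sub.pop(c)
    result = {v: (leave[v] if h == c else h) for v, h in sub.items()}
    for v in cyc:
        result[v] = head[v]
    result[enter[h_c]] = h_c
    return result

def _find_cycle(head):
    # follow head pointers from each node; report the first cycle found
    for start in head:
        path, seen, v = [], set(), start
        while v in head and v not in seen:
            seen.add(v)
            path.append(v)
            v = head[v]
        if v in head and v in seen:
            return path[path.index(v):]
    return None
```

Because the decoder searches over all spanning trees rather than nested spans, crossing dependencies (common in Czech and German) come for free, which is what makes this formulation attractive for non-projective languages.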
Bio: David Smith received his A.B. in classics from Harvard University. An NSF graduate fellow, he is currently a Ph.D. student in computer science at Johns Hopkins University's Center for Language and Speech Processing. His interests are in machine translation, natural language parsing, and semi-supervised machine learning methods. David was formerly head programmer for the Perseus Digital Library Project at Tufts University, where he strayed from the path of classical philology toward text mining, geocoding, and information extraction.