Fall 2019 Course Descriptions

COMP 150-08 Working with Corpora

G. Crane
M 6:00p-9:00p, Halligan Hall 108
10+ Block

Cross-listed as CLS 191

This course introduces students to methods for working with corpora and more general collections of text. The course builds particularly upon services available in the Natural Language Toolkit (NLTK), but also considers other workflows (such as https://weblicht.sfs.uni-tuebingen.de/, https://nlp.stanford.edu/software/, https://gate.ac.uk/, etc.). We will consider how NLP can address two dimensions of scale: working with more materials than humans can read, and working with materials in more languages than human beings can learn. The challenge is not simply to work with large bodies of text in a handful of languages such as English, Chinese, Spanish and Arabic, but also in the 24 official languages of the EU, the 22 languages with official standing in India, and the historical languages of the human cultural record. “Language wrangling” involves applying all available methods to push beyond translations, whether produced by machines or humans, and to explore the language directly. Linguists have done this for centuries by adding rich annotation to individual texts. NLP and crowd-sourcing allow us to scale these methods up to large collections.

Who this course is for:

- Computer Science students who want to familiarize themselves with the methods and open questions associated with corpora.
- Students from the Humanities and Social Sciences who wish to analyze textual sources. While this course will serve students interested in Premodern studies, it can also be of particular interest to students in International Relations who wish to develop research projects working with current sources in various languages.

Approved as a category 2 elective in Data Science (analysis and interfaces).

Prerequisite: COMP 10, COMP 11, or instructor permission.
