Active Learning, Human Annotation and Morpho-Syntactic Analysis of Ancient Greek

December 21, 2020

11:00am-12:00pm ET

Sococo VH 209; Zoom

Speaker: Saeed Majidi

Host: Gregory Crane

Abstract

Automatically generated metadata from large collections is an essential component of digital libraries and is beginning to emerge as fundamental to the study of languages. Morpho-syntactic annotation captures the form of individual words and their function. Nonetheless, automated syntactic analysis is still an error-prone process, and producing accurate syntactic annotations requires significant human input. Syntactic analysis is challenging for human analysts as well, which makes automated syntactic analysis a complex task. The problem is compounded when annotating corpora of extinct historical languages, for which no native speakers are available.

In this thesis, we first target two main tasks in the morpho-syntactic annotation of a corpus: part-of-speech tagging and dependency parsing, with a focus on Ancient Greek and Latin texts. For these two tasks, we report the results of state-of-the-art methods and analyze the types of errors produced while annotating our corpus. By comparing the output of automated methods with annotations produced by non-expert humans, we found that the errors are similar for the task of dependency parsing. This important finding shows that we cannot simply rely on non-expert annotations to enhance the quality of syntactic annotations.

This thesis also explores the use of active learning for automated syntactic annotation. To reduce the cost of annotation and improve the quality of syntactic annotation on such corpora, we introduce a novel active learning framework for the dependency parsing task. Active learning is a machine learning method in which the learner is able to control the data from which it learns. The learner can ask an oracle, usually a domain expert, for the labels of the instances about which the model learned so far is uncertain. We formulate a novel framework in which a committee of dependency parsers collaborates to improve its efficiency using active learning techniques. Queries are created only from uncertain tokens, and the annotations of the remaining tokens of selected sentences are decided by a vote among the committee members. Our results show that the weighted Query-By-Committee (QBC) framework outperforms the state-of-the-art methods comprising uncertainty sampling and QBC, when there is a single parser to annotate the unchosen tokens. The experimental results show that we can save up to 85% of annotation cost and, at the same time, all of the parsers used in our experiments can reach 90% accuracy when given access to fully annotated data.
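The token-level selection described above can be illustrated with a minimal sketch. This is not the thesis implementation: the disagreement measure (weighted vote entropy, a common choice in Query-By-Committee work), the fixed threshold, the committee weights, and all function names are assumptions made for illustration. Each parser is represented only by the head index it predicts for each token of a sentence.

```python
import math
from collections import defaultdict


def vote_mass(votes, weights):
    """Sum the committee weights behind each proposed head."""
    mass = defaultdict(float)
    for head, weight in zip(votes, weights):
        mass[head] += weight
    return mass


def vote_entropy(votes, weights):
    """Weighted vote entropy for one token; 0 means full agreement."""
    mass = vote_mass(votes, weights)
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())


def split_sentence(predictions, weights, threshold):
    """Split tokens into oracle queries and committee-voted labels.

    predictions[p][t] is the head index parser p predicts for token t.
    weights[p] is a hypothetical reliability weight for parser p.
    threshold is a hypothetical cutoff on the vote entropy.
    """
    queries, voted = [], {}
    for t in range(len(predictions[0])):
        votes = [parser_heads[t] for parser_heads in predictions]
        if vote_entropy(votes, weights) > threshold:
            queries.append(t)  # uncertain token: ask the human expert
        else:
            mass = vote_mass(votes, weights)
            voted[t] = max(mass, key=mass.get)  # weighted majority vote
    return queries, voted


# Three hypothetical parsers annotating a four-token sentence.
predictions = [
    [2, 0, 2, 3],
    [2, 0, 4, 3],
    [2, 0, 1, 1],
]
weights = [0.5, 0.3, 0.2]
queries, voted = split_sentence(predictions, weights, threshold=0.6)
print("ask the oracle about tokens:", queries)
print("committee-voted heads:", voted)
```

In this example only the third token, on which all three parsers disagree, would be sent to the expert; the fourth token has a mild disagreement that the weighted vote settles. In a full active learning loop, the selected sentences would then be added to the training data and the parsers retrained, which this sketch omits.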

Please join the meeting in Sococo in VH 209. Login: tuftscs.sococo.com

Join Zoom meeting: https://tufts.zoom.us/j/98610939077

PASSWORD: see colloquium email

Dial by your location: +1 646 558 8656 US (New York)

Meeting ID: 986 1093 9077

Passcode: see colloquium email