Title: Rapid Training of Information Extraction with Local and Global Data Views

Candidate: Ang Sun
Advisor: Ralph Grishman

Abstract: This dissertation focuses on fast system development for Information Extraction (IE). State-of-the-art systems rely heavily on extensively annotated corpora, which are slow to build for a new domain or task. Moreover, previous systems are mostly built with local evidence, such as words in a short context window or features extracted at the sentence level. Such systems usually generalize poorly to new domains.

This dissertation presents novel approaches for rapidly training an IE system for a new domain or task based on both local and global evidence. Specifically, we present three systems: a relation type extension system based on active learning, a relation type extension system based on semi-supervised learning, and a cross-domain bootstrapping system for domain-adaptive named entity extraction.

The active learning procedure adopts features extracted at the sentence level as the local view and distributional similarities between relational phrases as the global view. It builds one classifier on each of the two views and requests human labels for the most informative contention points, i.e., examples on which the two classifiers disagree, so as to reduce annotation cost.
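As a rough illustration of selecting contention points, the sketch below assumes two already-trained view classifiers and returns the unlabeled examples on which they disagree. The function names, feature keys, and toy examples are hypothetical and are not taken from the dissertation; ranking the contention points by informativeness is omitted.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]


def contention_points(
    unlabeled: List[Example],
    local_predict: Callable[[Example], str],
    global_predict: Callable[[Example], str],
) -> List[Example]:
    """Return the examples on which the two view classifiers disagree."""
    return [x for x in unlabeled if local_predict(x) != global_predict(x)]


# Hypothetical stand-ins for the two view classifiers.
def local_view(x: Example) -> str:
    # Local view: a feature extracted from the sentence itself.
    return "EMPLOYMENT" if x["sentence_feature"] == "works_for" else "OTHER"


def global_view(x: Example) -> str:
    # Global view: the distributional cluster of the relational phrase.
    return "EMPLOYMENT" if x["phrase_cluster"] == "employment" else "OTHER"


examples = [
    {"sentence_feature": "works_for", "phrase_cluster": "employment"},
    {"sentence_feature": "works_for", "phrase_cluster": "location"},
    {"sentence_feature": "born_in", "phrase_cluster": "location"},
]

# Only the second example is sent to the annotator: the views disagree on it.
to_label = contention_points(examples, local_view, global_view)
```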

The semi-supervised system aims to learn a large set of accurate patterns for extracting relations between names from only a few seed patterns. It estimates the confidence of a name pair both locally and globally: locally by looking at the patterns that connect the pair in isolation; globally by incorporating the evidence from the clusters of patterns that connect the pair. The use of pattern clusters can prevent semantic drift and contribute to a natural stopping criterion for semi-supervised relation pattern discovery.

To adapt a named entity recognition system to a new domain, we propose a cross-domain bootstrapping algorithm, which iteratively learns a model for the new domain with labeled data from the original domain and unlabeled data from the new domain. We first use word clusters as global evidence to generalize features that are extracted from a local context window. We then select self-learned instances as additional training examples using multiple criteria, including some based on global evidence.
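The bootstrapping loop can be sketched in simplified form as follows. This is only an illustration under stated assumptions: the cluster table, the toy `train` and `predict_with_confidence` functions, and the single confidence threshold are all hypothetical, whereas the dissertation selects instances with multiple criteria.

```python
from typing import Dict, List, Tuple

# Hypothetical word clusters: tokens in the same cluster share a feature.
CLUSTERS = {"Paris": "c1", "London": "c1", "Berlin": "c1", "runs": "c2"}


def train(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Toy model: map each word cluster to the label seen with it."""
    return {CLUSTERS[tok]: label for tok, label in pairs if tok in CLUSTERS}


def predict_with_confidence(model: Dict[str, str], tok: str) -> Tuple[str, float]:
    """Predict via the token's cluster; tokens with no known cluster get confidence 0."""
    cluster = CLUSTERS.get(tok)
    if cluster in model:
        return model[cluster], 1.0
    return "O", 0.0


def bootstrap(labeled_source, unlabeled_target, threshold=0.9, max_iters=5):
    training = list(labeled_source)
    pool = list(unlabeled_target)
    model = train(training)
    for _ in range(max_iters):
        selected, keep = [], []
        for tok in pool:
            label, conf = predict_with_confidence(model, tok)
            # Keep only confidently self-labeled target-domain instances.
            (selected if conf >= threshold else keep).append((tok, label))
        if not selected:
            break  # nothing new to learn from; stop iterating
        training.extend(selected)
        pool = [tok for tok, _ in keep]
        model = train(training)
    return model, training


model, training = bootstrap(
    [("Paris", "LOC"), ("runs", "O")],  # labeled source-domain data
    ["London", "Berlin", "xyz"],        # unlabeled target-domain data
)
```

In this toy run, "London" and "Berlin" are added as training examples because they share a cluster with the labeled token "Paris", while "xyz" has no cluster and is never selected.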