Title : Data-driven Approaches for Paraphrasing across Language Variations

Candidate: Wei Xu
Advisor: Ralph Grishman

Abstract: Our language changes very rapidly, accompanying political, social and cultural trends, as well as the evolution of science and technology. The Internet, especially the social media, has accelerated this process of change. This poses a severe challenge for both human beings and natural language processing (NLP) systems, which usually only model a snapshot of language presented in the form of text corpora within a certain domain and time frame.

While much previous effort has investigated monolingual paraphrase and bilingual translation, we focus on modeling meaning-preserving transformations between variants of a single language. We use Shakespearean and Internet language as examples to investigate various aspects of this new paraphrase problem, including acquisition, generation, detection and evaluation.

A data-driven methodology is applied intensively throughout the course of this study. Several paraphrase corpora are constructed using automatic techniques, experts and crowdsourcing platforms. Paraphrase systems are trained and evaluated by using these data as a cornerstone. We show that even with a very noisy or a relatively small amount of parallel training data, it is possible to learn paraphrase models which capture linguistic phenomena. This work expands the scope of paraphrase studies to targeting different language variations, and more potential applications, such as text normalization and domain adaptation