This thesis proposes a novel approach for exploring Information
Extraction scenarios. Information Extraction, or IE, is the task
of finding events and relations in natural language texts that
meet a user's demand. However, it is often difficult to formulate,
or even define, events that both satisfy a user's need and are
technically feasible to extract. Furthermore, most existing IE
systems must be tuned for each new scenario in advance, using
appropriate training data. As a result, a system designer usually
needs to understand what a user wants to know in order to maximize
system performance, while the user has to understand how the system
will perform in order to maximize his or her satisfaction.
In this thesis, we focus on maximizing the variety of scenarios
that the system can handle instead of trying to improve the
accuracy on a particular scenario. In traditional IE systems, a
relation is defined a priori by a user and is identified by a set
of patterns that are manually crafted or acquired in advance. We
propose a technique called Unrestricted Relation Discovery, which
defers determining what is a relation and what is not until the
very end of the processing so that a relation can be defined a
posteriori. This lazy approach gives the system great flexibility
in the types of relations it can handle. Furthermore, we use the notion of
recurrent relations to measure how useful each relation is. In this
way, we can discover new IE scenarios without fully specifying
definitions or patterns. This leads to Preemptive Information
Extraction, in which a system offers the user a portfolio of
extractable relations and lets the user choose among them.
We used one year of news articles obtained from the Web as a
development set. We discovered dozens of scenarios that are
similar to the existing scenarios tried by many IE systems, as
well as scenarios that are relatively novel. We evaluated the
existing scenarios on the Automatic Content Extraction (ACE)
event corpus and obtained reasonable performance. We believe this
system will shed new light on IE research by providing a variety
of experimental IE scenarios.