User-Oriented Information Extraction

Peter von Etter

Helsinki May 16, 2011

UNIVERSITY OF HELSINKI

Department of Computer Science

Contents

1 Introduction
    1.1 What is Information Extraction
        1.1.1 Relation to information retrieval
        1.1.2 Relation to data mining
    1.2 Structure of this Thesis
2 Back end: The PULS IE System
    2.1 Example scenario
    2.2 Architecture
    2.3 Related Work
3 User-Oriented IE
    3.1 Overview
    3.2 Measuring Usefulness
    3.3 Relevance
    3.4 Confidence
    3.5 Aggregation
        3.5.1 Grouping disease events into outbreaks in PULS
        3.5.2 Fine-Grained Grouping of events
        3.5.3 Clustering in MedISys
    3.6 Domains and Scenarios
        3.6.1 Medical
        3.6.2 Business
        3.6.3 Security
4 Front End: Web Interface
    4.1 Table View
    4.2 List View
    4.3 Document View
    4.4 Map View
    4.5 Chart View
    4.6 Graph View
5 Evaluation and Results
    5.1 Evaluation of Event Detection
    5.2 Evaluation of the Uniqueness Heuristic
    5.3 Evaluation of outbreak aggregation
6 Technical Description
    6.1 Puls-IE
    6.2 Puls-Lib
        6.2.1 Puls.Conf
        6.2.2 Puls.Util
        6.2.3 Puls.Cronjob
        6.2.4 Puls.Model
        6.2.5 Puls.Auxil
        6.2.6 Puls.Groups
        6.2.7 Puls.Graph
        6.2.8 Puls.Lang-tools
        6.2.9 Puls.Rss-Feed
        6.2.10 Puls.Source
        6.2.11 Puls.Crud
    6.3 Puls-Web
    6.4 Integration between PULS and MedISys
7 Conclusion
References


1 Introduction

This section describes information extraction and how it relates to the neighbouring fields of information retrieval and data mining.

1.1 What is Information Extraction

The activity of automatically producing structured data from unstructured text written in natural language is known as information extraction (IE) [SFvdG08]. It entails the refinement of unstructured data, with the goal of making subsequent processing easier and more convenient. As anyone who has ever written computer programs might know, whatever the task, it is usually considerably easier to process properly structured data than to work on raw text.

Text written in natural language is an example of unstructured data. So if one wants to process the information, or facts, contained in human-written text with a program, one has to translate the information into a format that is more suitable for automatic processing. This transformation can be done manually, but it would make more sense to create a program, using natural language processing (NLP) techniques, that is able to accomplish the same thing, thus removing the need for manual labor. A program capable of this could be viewed as being able to understand natural text on some level.

The Internet contains vast amounts of unstructured data in the form of text. Much, if not most, of this text has been written by humans for humans. This makes life difficult when one needs to find information of the kind where a regular keyword-based search query is not fundamentally expressive enough to specify the search criteria or constraints properly. At best, the information one needs may be available directly, such as when one’s question already happens to have been answered by somebody else (e.g. in an online forum). At worst this can lead to the all too familiar, tedious session of trial and error with popular keyword-based search engines, where one, after many attempts, finally discovers the magic sequence of keywords for which the search engine returns satisfying results.

Suppose one would like to go on vacation and is faced with the problem of choosing between traveling to Gran Canaria or to Tenerife (both belong to the Canary Islands archipelago). One way of deciding which one to travel to would be to read the blogs of people who have visited either one and see what their opinions are. Did they have a good time or not? One would then decide to travel to the island that received the highest percentage of positive opinions.

A keyword-based search engine is probably going to have a hard time answering questions like “How many of these bloggers enjoyed Tenerife?”. It will certainly be able to find lots of pages that contain the keyword Tenerife and the various adjectives that one might include in the search query, but the actual counting of opinions is left to the user.

As an alternative approach, imagine a system that could perform IE on all the blog entries related to Tenerife or Gran Canaria. This system would extract any opinions stated there and store them as structured facts for later lookup. Each extracted fact would consist of data elements such as the location the fact pertains to, the opinion itself and perhaps the author of the opinion. Given a collection of extracted facts, it would then be possible to directly query the system for various statistics related to the resorts, such as the distribution of positive and negative opinions on which one would base the decision on where to fly.
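To make the idea concrete, the following sketch shows how such extracted facts might be stored and aggregated; the data and field names are invented for illustration:

    from collections import Counter

    # Hypothetical facts extracted from blog entries.
    facts = [
        {"location": "Tenerife", "opinion": "positive", "author": "blogger1"},
        {"location": "Tenerife", "opinion": "negative", "author": "blogger2"},
        {"location": "Gran Canaria", "opinion": "positive", "author": "blogger3"},
    ]

    # Share of positive opinions per location.
    totals = Counter(f["location"] for f in facts)
    positives = Counter(f["location"] for f in facts if f["opinion"] == "positive")
    for place in totals:
        print(place, positives[place] / totals[place])

Once the facts exist in structured form, the counting that defeats a keyword-based search engine becomes a one-line aggregation.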

1.1.1 Relation to information retrieval

Information retrieval (IR) is similar to IE in that in both areas the focus is on finding relevant information out of a large amount of unstructured data. One major difference between the two is that IR pertains to returning relevant documents to a user, whereas IE is concerned with extracting specific facts in structured form out of a collection of documents [MRS08]. Popular search engines on the Internet are examples of software utilizing IR techniques; the user types in a query to which the search engine returns relevant web pages.

A fact returned by an IE application is something that contains answers to a set of specific questions. If stored in a database, facts can easily be queried or aggregated in various ways to provide information which, if gathered manually, would require a person to examine each document, taking notes along the way. When the number of documents grows large, manual processing is not even an option in practice. This might be the case when an IR application returns a very large set of relevant documents: the user has no option other than to quickly scan through all the returned documents in the hope of finding the correct piece of information, or to refine the search query in order to receive a smaller, more manageable set of relevant documents.


There is a trade-off involved between IR and IE; the difference is basically a question of scope and granularity. With IR, one is able to retrieve documents based on keywords regardless of the topic of the documents, with minimal preprocessing and with no domain-specific knowledge. In other words, IR applications can be quite general. With IE, preprocessing along with domain-specific knowledge is typically required before facts can be extracted from a document. The facts are constrained to a narrow domain but represent a deeper understanding of the text than the results of an IR-based approach, because the IE results have structure. This allows a user to find answers to questions that simply could not be answered using plain IR. However, tuning an IE application to work in a new domain can be a time-consuming task, and methods that reduce the effort required to do so are a topic of active research [Yan00, Ril96].

An IE system can be combined with an IR system to improve efficiency [YBvE07, YvES08]. The IR system is used to filter out unwanted documents at an early stage based on keywords or phrases; the documents that pass the filter are then used as input to the IE system. The IE task is computationally expensive, so it may not be feasible to process all available documents. The filtering done by the IR system lets the IE system focus on the documents that are most likely to contain events, which means that it is possible to cover an order of magnitude more documents compared to an approach based on IE alone. A combination of IE and IR of this kind is described in section 6.4.
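A minimal sketch of this kind of coupling, with a hypothetical extract_events stub standing in for the expensive IE component:

    def keyword_filter(documents, keywords):
        # IR step: cheaply discard documents unlikely to contain events.
        return [d for d in documents if any(k in d.lower() for k in keywords)]

    def extract_events(document):
        # IE step: expensive analysis; a stub standing in for a full system.
        return []

    documents = ["Bird flu outbreak reported in Indonesia.", "Football results."]
    candidates = keyword_filter(documents, ["flu", "outbreak", "virus"])
    events = [e for d in candidates for e in extract_events(d)]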

1.1.2 Relation to data mining

Like IE and IR, data mining aims to find useful information in large amounts of data. Where IE focuses on extracting structured facts, and IR on returning relevant documents, data mining focuses on finding trends and patterns in structured data [HK05]. For example, given the products bought by customers at a store, data mining can be used to find interesting associations between products. The correlations can then be leveraged in various ways to increase sales; for instance, the store owner might choose to place products whose sales correlate highly in close proximity to one another, in the hope of triggering additional sales.
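As an illustration of the underlying idea (not of any particular data mining algorithm), a crude co-occurrence count over invented shopping baskets already surfaces such associations:

    from collections import Counter
    from itertools import combinations

    baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"milk"}]
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # ('bread', 'butter') co-occurs most often: a candidate association.
    print(pair_counts.most_common(1))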

Data mining (DM) techniques can be applied to the results produced by an IE system. This may aid in the discovery of non-obvious associations between the events produced by IE, which can be of great value to the people analyzing the database because it saves time and may help in making quicker decisions based on the data.

Data mining can also be used for error correction when the set of events has a large degree of redundancy, by using global information to correct local errors: an error in a single event can be corrected if there exists a large enough number of almost identical events that do not share this error [Jok05].

1.2 Structure of this Thesis

This thesis is structured as follows. Section 2 describes the PULS information extraction system. Section 3 describes our problem setting, the methods we employ to improve the usefulness of PULS, as well as the scenarios supported by the system. Section 4 describes the web interface provided to the users of this system. In section 5 we describe the performance of the aggregation and confidence methods that have been implemented to give the users richer information. Section 6 describes technical aspects of the system, and section 7 offers some closing thoughts.


2 Back end: The PULS IE System

PULS, the Pattern-based Understanding and Learning System, is being developed at the University of Helsinki to extract factual information from plain text. It is an IE system built on NLP technology that performs extraction using a set of knowledge bases [GHY02]. The system can be adapted to different domains by modifying the knowledge bases; the core of the system is domain independent. The items extracted are mainly events, i.e. activities that have taken place at a particular point in time in a particular location, involving one or more entities. An entity can be a thing, a person, a company or any other object of interest.

Each domain is associated with one or more event templates. A template defines the required attributes of the events one wants to extract from a domain. A template is said to have slots that should be filled with entities found in the input text whenever an event is instantiated. Some slots can be defined as optional, in order to facilitate more flexible extraction.
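As a sketch of this idea, a template with required and optional slots might be modelled as follows; the class and slot names are illustrative, not PULS's actual internals:

    from dataclasses import dataclass, field

    @dataclass
    class Template:
        name: str
        required_slots: list
        optional_slots: list = field(default_factory=list)

        def instantiate(self, **fills):
            # An event may only be created once all required slots are filled.
            missing = [s for s in self.required_slots if s not in fills]
            if missing:
                raise ValueError(f"unfilled required slots: {missing}")
            return {"event_type": self.name, **fills}

    outbreak = Template("disease-outbreak",
                        required_slots=["disease", "location"],
                        optional_slots=["date", "victims"])
    event = outbreak.instantiate(disease="bird flu", location="Indonesia")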

2.1 Example scenario

One domain, or scenario, in which PULS is currently able to extract events is the medical intelligence domain. Here, outbreaks of diseases are the events extracted by the system. The resulting event database is intended to assist medical professionals and organizations such as the European Centre for Disease Prevention and Control (ECDC) in assessing the threats posed by epidemics and in offering recommended courses of action when faced with a potential epidemic.

Without technology capable of automatic monitoring of diseases, medical professionals would have to manually sift through vast amounts of news reports in order to assemble an accurate overview of the epidemiologic situation of the world. This work is critical, since actual lives depend on having accurate, real-time medical intelligence that allows authorities to act as quickly as possible when faced with an epidemic, for instance, by sending out alerts or deploying personnel specialized in disease control. Therefore there exists a great need to employ all possible technology that will allow the medical professionals to perform their work more efficiently.

In the medical intelligence domain, the important entities involved in an event are:

- The disease in question
- The location in which the outbreak occurred
- The date on which the outbreak occurred
- The victims of the outbreak

As an example, consider the sentence

Indonesia reported 5 new cases of bird flu on Monday.

This sentence describes a disease outbreak where the entities involved are as follows:

Disease: bird flu
Location: Indonesia
Date: Monday
Victims: 5 new cases

These entities, taken together, form an event. The goal is to have a database filled with events such as this one, enabling a user to easily find out things like:

- The number of bird flu reports in Indonesia
- Which diseases have been reported in Indonesia
- The most commonly reported disease in Indonesia

or any other piece of information that can be computed using a standard relational database. It should be clear that having access to a database filled with structured facts allows for richer querying and aggregation of information than an approach based on IR only.
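Assuming the events live in a relational table such as a hypothetical events(disease, location, date, victims), these questions reduce to ordinary SQL; a sketch using Python's built-in sqlite3:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (disease TEXT, location TEXT, "
                "date TEXT, victims TEXT)")
    con.execute("INSERT INTO events VALUES "
                "('bird flu', 'Indonesia', 'Monday', '5 new cases')")

    # Number of bird flu reports in Indonesia.
    n = con.execute("SELECT COUNT(*) FROM events WHERE disease='bird flu' "
                    "AND location='Indonesia'").fetchone()[0]

    # Most commonly reported disease in Indonesia.
    top = con.execute("SELECT disease, COUNT(*) AS c FROM events "
                      "WHERE location='Indonesia' GROUP BY disease "
                      "ORDER BY c DESC LIMIT 1").fetchone()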

This transformation from sentence to table is essentially what IE is about. The transformation is comprised of several smaller sub-processes that are executed one after the other in a pipeline fashion until the extraction is complete.

2.2Architecture

The flow of information in the PULS system forms a pipeline. Raw documents are fed into the system, and extracted events are produced as the result of many sub-procedures. The steps needed to go from document to event are as follows (a schematic sketch follows the list):


1. Collect the documents
2. Pre-process the documents
3. Run the documents through the core IE module
4. Post-process the events produced by the IE module
5. Store the events in a database.
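The sketch below wires these stages together; every stage is a placeholder function, not the actual PULS code:

    def run_pipeline(collect, preprocess, extract, postprocess, store):
        # Schematic document-to-event pipeline; each argument is one stage.
        for document in collect():
            doc = preprocess(document)      # tokenisation, tagging, ...
            events = extract(doc)           # core IE module
            events = postprocess(events)    # inference rules, deduplication, ...
            store(events)                   # persist to the event database

    # Trivial wiring with stub stages:
    run_pipeline(lambda: ["doc1"], str.lower, lambda d: [], lambda e: e, print)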

Figure 1 shows a flow chart with the most important stages in the pipeline, going from publisher to user query. Documents are collected from publishers using various methods (4c) and fed to the IE system (1). After IE processing, any events discovered are stored in a database (4b). Users are then able to view and moderate the events through a web interface (4a). Areas of active research are represented by (3b); these include trend detection, cross-validation, noise reduction and data correction.

Adapting the system to a new domain entails linking the core IE engine to the relevant knowledge bases. This allows the system to extract events from documents in the new domain.

These knowledge bases contain primarily four kinds of knowledge: concepts, lexi- cons, patterns and predicates.

The concept knowledge base (concept base) is a hierarchical structure that gives the system knowledge about how things are related to each other in an ‘is a’ manner. The concept base is essentially an ontology: every concept in the hierarchy specializes its parent, except the root concept, which is the superclass of all other concepts and has no parent. For example, a concept base related to animals might contain the concept poodle. The parent of poodle might be dog, the parent of which might be mammal. The concept mammal in turn might be a subclass of the root concept animal.

The concept base enables the system to match patterns not only against specific words, but against whole classes of words. This makes pattern matching much more general and concise. For example, instead of creating separate patterns for all kinds of dogs, one can just make a single pattern that matches the class dog and be done with it. This pattern would then match poodle, German shepherd, collie and so on, spanning all types of dogs in the concept base.
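A minimal sketch of such a transitive ‘is a’ lookup, over an illustrative child-to-parent map:

    # Illustrative concept base: child -> parent ('is a' relation).
    parents = {"poodle": "dog", "collie": "dog", "dog": "mammal",
               "mammal": "animal"}

    def isa(concept, ancestor):
        # True if `concept` is `ancestor` or a (transitive) descendant of it.
        while concept is not None:
            if concept == ancestor:
                return True
            concept = parents.get(concept)
        return False

    assert isa("poodle", "dog") and isa("collie", "animal")

A single pattern written against dog then applies to every concept for which isa(concept, "dog") holds.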


[Figure 1 (flow chart): documents travel from publisher to user. Data collection (4c) feeds text documents and other corpora to the core IE engine (1), which consults the knowledge bases (lexicon/ontology, patterns, inference rules) maintained through the customization environment (2) and un/supervised learning of candidate knowledge (3a). Extracted facts are stored on the DB server (4b), and a web server (4a) answers user queries. Trend detection, cross-validation, noise reduction and data correction are marked as active research areas (3b).]

Figure 1: PULS Diagram


A lexicon is a dictionary that links actual words to concepts. It lets the system look up part-of-speech information and syntactic features for words. Some words map to a single concept, but a large number of words map to several concepts. For instance, a lexicon would let the system know that the word dog can indicate a verb or a noun, two separate concepts.

A pattern consists of a sequence of symbols designed to match some piece of natural text. Whenever certain patterns match, the system will attempt to extract the information required to form a fact from the surrounding text. It is these patterns that ultimately determine when the system should extract something from the text. In other words, they function as triggers that initiate the extraction process. A pattern, in the context of this system, can be thought of as a regular expression. The difference is that instead of matching individual letters like a ‘normal’ regular expression, patterns are used to match linguistic objects at a higher level. Sequences of words, concepts, noun or verb groups are all things that can be matched with patterns.

For instance, suppose we would like to match phrases like “We visited Tenerife” or “My girlfriend and I visited the beautiful Gran Canaria”. An appropriate pattern for these types of phrases could have the form

NP(person) VG(visit) NP(location),

where NP(person) signifies a noun phrase whose head is of type person, VG(visit) a verb group indicating that a visit took place, and NP(location) a noun phrase whose head is something the system has identified as a location (e.g. Tenerife). Here, person, visit and location are all concepts that exist in the concept base.

But patterns don’t do anything by themselves except match tokens. In order to become useful, a pattern needs to be associated with an action. When a pattern fires, the action is called, which will usually group smaller constituents together to form a larger group, or create a logical entity based on the text matched by the pattern. The pattern described above, for example, could be linked to an action that creates a logical visit entity, to indicate that a visit has taken place in the context of the document. Another way of looking at actions is that they are mappings from patterns to entities that carry some meaning for the system. An entity created by an action is added to an internal pool of entities that contains everything the system has seen in the document currently being processed. All facts produced by the system are based on the entities in this pool.
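Schematically, a pattern paired with an action could look like the following toy sketch. Real PULS patterns match linguistic objects such as noun phrases rather than raw regular expressions; the names and the regex here are purely illustrative:

    import re

    entity_pool = []  # everything seen in the document being processed

    def visit_action(match):
        # The action maps a surface match to a logical entity in the pool.
        entity_pool.append({"type": "visit",
                            "person": match.group("person"),
                            "location": match.group("location")})

    # Toy stand-in for the pattern NP(person) VG(visit) NP(location).
    visit_pattern = re.compile(r"(?P<person>We|I) visited (?P<location>\w+)")

    m = visit_pattern.search("We visited Tenerife last summer.")
    if m:
        visit_action(m)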


Text processing begins by performing a dictionary look-up on each word. After this, the system starts applying patterns to the input. The patterns are arranged in a pipeline configuration such that the results produced by a pattern can be used by subsequent patterns to form more complex entities. At the end of the pipeline are special inference rules that, by building on the results of the previous patterns, create the facts that will ultimately be reported by the system. Inference rules are actions that are triggered when certain entities have been extracted from the document. Inference rules can be domain specific or domain independent. Examples of inference rules are listed below, followed by a sketch of the first one in code:

If no date has been found for an event, assign the publication date of the document to the event

If no country was matched by an event pattern, take the country that is nearest the text that triggered the event.

If two events can be unified with respect to attribute values, merge them into a single event.
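The first rule above could be implemented along these lines; the field names are illustrative, not PULS's data model:

    def assign_default_date(event, document):
        # Inference rule sketch: fall back to the publication date
        # when no date was extracted for the event.
        if event.get("date") is None:
            event["date"] = document["publication_date"]
        return event

    event = {"disease": "cholera", "location": "Haiti", "date": None}
    document = {"publication_date": "2011-05-16"}
    assign_default_date(event, document)  # event now carries the document date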

Whenever one wishes to perform extraction, one must first provide the system with the appropriate domain-specific knowledge. This entails creating patterns, predicates, concepts and dictionaries that are relevant to the task at hand. In the case of the vacation scenario, a specialized lexicon containing the names of geographic locations is an example of relevant knowledge, as are patterns designed to pick up the visits and opinions of people. Furthermore, since the objective of the system is to extract structured data, a predicate for this extracted data needs to be defined beforehand.

A predicate represents what it is we want to learn. It is the set of attributes that we are interested in and serves as a template, determining what data the extracted facts should contain. A predicate for the vacation scenario described above could be defined in the following way:

Location | Rating

These two fields together constitute a single fact, also known as an event. Here location refers to a possible travel destination (Tenerife or Gran Canaria), and rating to the writer’s opinion of the location in question. This rather simple predicate could be augmented with additional attributes, such as the author’s name or the date of writing, if one wishes. It is also of practical importance that the predicate be easily representable as a relation in a relational database, since it should be as straightforward as possible to perform queries on the extracted data. A relational database does an adequate job of providing a query interface.

As there are many ways of expressing the quality of something (‘good’, ‘excellent’, ‘incredible’, etc.), it is a good idea to define a mapping from the different expressions onto a set of numbers. The simplest mapping is a binary one, where any positive opinion is mapped to 1 and any negative opinion to 0. Since opinions are highly subjective, the binary mapping is probably a safe bet, but a finer scale with different grades of goodness is also quite possible.
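Such a binary mapping could be as simple as the following sketch; the word lists are illustrative:

    POSITIVE = {"good", "excellent", "incredible", "wonderful"}
    NEGATIVE = {"bad", "awful", "disappointing"}

    def rating(expression):
        # Map an opinion word to 1 (positive) or 0 (negative); None if unknown.
        word = expression.lower()
        if word in POSITIVE:
            return 1
        if word in NEGATIVE:
            return 0
        return None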

2.3 Related Work

Information retrieval and information extraction have been thoroughly researched over recent decades, with abundant literature on both topics. Typically they are studied separately, with results reported in different fora, and they are considered different problem areas, since they employ quite different methods. Conceptually, IR and IE both serve a user’s information need, though they do so at different levels. It is understood that in real-world settings, IR and IE may be expected to interact, for example in a pipeline fashion. The possibilities of tighter interaction largely remain to be researched.

Gaizauskas and Robertson ([GR97]) investigated the costs and benefits of combining IR and IE, by first applying a search engine (Excite) and its summary extraction tool, and then extracting MUC-6 “management succession” events ([Def95]). The MUC-6 task is to track changes in corporate management: to find the manager’s post, the company, the current manager’s name, the reason why the post becomes vacant, and other relevant information about the management switch. The authors conclude that using IR as a filter before IE clearly results in a speed gain (since applying IE to all documents returned by the search engine would not have been possible), while the cost was a loss of 7% of the relevant documents. In further experiments by Robertson and Gaizauskas ([RG97]), precision rose by 32%, up to 100%, though at the cost of losing 65% of the retrieved relevant documents.

From an application point of view, to our knowledge, there are two other systems that attempt to gather information about infectious disease outbreaks from automatically collected news articles: Global Health Monitor [DHNKC08] and HealthMap [FMRB08]. The systems provide map interfaces for visualising the events found.

Global Health Monitor follows about 1500 RSS news feeds hourly, and matches words in the new articles against a taxonomy of about 4300 named entities: 50 names of infectious diseases, 243 country names, and 4000 province or city names. For place names, the taxonomy contains latitude-longitude information. The 50 disease names are organised into an ontology with properties relating to synonyms, symptoms, associated syndromes and hosts. The Global Health Monitor processing consists of four steps: (1) relevance decision (using Naïve Bayes classification); (2) named entity recognition (disease, location, person and organisation, using Support Vector Machine classification); (3) filtering of articles containing both disease and location names in the first half of the text, additionally retaining only those disease-location pairs that are frequently found in a separate reference corpus. Step (4) then visualises the successful matches on a map. Due to the rigorous filtering in steps (1) to (3), the system retains information on 25-30 locations and on about 40 infectious diseases a day. The system currently provides text analysis for English, though the underlying ontology includes terms in several other languages, including Japanese, Thai, and Vietnamese.

HealthMap monitors articles from the Google News aggregator and emails from the collaborative information-sharing portal ProMED-Mail (http://www.promedmail.org), and extracts information about infectious diseases and locations. After a manual moderation process, the results are stored in a database and visually presented on a map. Diseases and locations are identified in the text if words in the text exactly match the entities in the HealthMap taxonomy, which contains about 2300 location and 1100 disease names. Some disambiguation heuristics are applied to reduce redundancy (e.g., if the words “diarrhoea” and “shigellosis” are found, only the more specific entity “shigellosis” will be retained). HealthMap identifies between 20 and 30 disease outbreaks per day. More recent articles and those disease-location combinations reported in multiple news items and from different sources are highlighted on the map. The system developers point out the importance of using more news feeds, as their current results are focused toward the North-American continent. HealthMap currently displays articles in English, French, Spanish and Russian ([FMRB08] describes English processing only).

Some major differences between these two systems and MedISys are: (1) MedISys is not limited to infectious diseases, but also covers reports on symptoms, vaccines and medicines, nuclear and chemical incidents, bio-terrorism, and more. (2) MedISys covers 43 languages (for highly inflected languages it uses regular-expression patterns to simulate a simplified morphological analysis). (3) As the interests of its user base are wide, MedISys initially displays all articles mentioning a certain category (including reports on medicines and vaccines, etc.) and then offers the additional functionality of filtering for outbreaks, vaccines, legislation, etc.

There exist several other frameworks for performing information extraction as described in the previous section. A popular open-source package is GATE (General Architecture for Text Engineering) [CMBT02]. GATE contains an IE component called ANNIE, which is similar to the core system described in section 2.2 and section 6.


3 User-Oriented IE

This section describes the PULS IE system and the integration between PULS and its data sources. It provides an overview of how the prototype system is currently being used in the real world by real users.

In order to provide a useful system, PULS incorporates a number of features designed to reduce redundancy and increase the relevance of what is shown to users. This section describes the approaches taken.

3.1 Overview

Professionals in many fields need to sift through large volumes of information from multiple sources on a daily basis. Most European Union (EU) countries have a national organisation that continuously monitors the media for new threats to Public Health in their country, and for the latest events involving health threats. These threats range from outbreaks of communicable diseases and terrorism cases, such as the deliberate release of biological or chemical agents, to chemical or nuclear incidents. Typically, the staff of these organisations search their national and local newspapers and/or purchase electronic news from commercial providers such as Factiva or Lexis-Nexis. Until recently, relevant news articles were cut out from the printed press and compiled into an in-house newsletter, which was then discussed among specialists who had to decide on the appropriate action. As more news sources became available on-line, it became easier to find relevant articles and to compile and manage them electronically. At the same time, the number of available sources rose, and—due to increased travel and the consequent importing of infectious diseases—it became necessary to monitor the news of neighbouring countries and major travel destinations.

These and similar professional communities can benefit from text analysis software that identifies potentially relevant news items, and thereby increases the speed and efficiency of their work, which is otherwise slow and repetitive. The search functions of news aggregators, such as Factiva or Google News, allow users to formulate Boolean search word combinations that filter items from large collections. The European Commission’s Medical Information System, MedISys, in addition to providing keyword-based filtering, aggregates statistics about query matches, which enables it to provide early-warning signals by spotting sudden increases in media reports about any Public Health-related issue and alerting the interested user groups [BvdGB05].


While this functionality is in itself helpful for the communities in question, deeper text analysis can provide further advantages, beyond those provided by classic Information Retrieval (IR) and alerting. In this section, we describe the IR and early-warning functionality of MedISys, and how it inter-operates with the information extraction (IE) system PULS, which analyses the documents identified by MedISys, retrieves from them events, or structured facts about outbreaks of communicable disease, aggregates the events into a database, and highlights the extracted information in the text. Our evaluation confirms that event extraction helps to narrow down the selection of relevant articles found in the IR step (improving precision), while on the other hand missing a small number of relevant articles (lowering recall).

MedISys has proven to be useful and effective for finding and categorising documents from a large number of Web sources. To make the retrieved information even more useful for the end-user, it is natural to consider methodologies for deeper analysis of the texts, in particular, information extraction (IE) technology. After MedISys identifies documents where the alerts fire, IE can deliver more detailed information about the specific incidents of the diseases reported in those documents.

IE helps to boost precision, since keyword-based queries may trigger on documents which are off-topic but happen to mention the alerts in unrelated contexts. Pattern matching in IE provides the mechanism that assures that the keywords appear in relevant contexts only. This is of value to users who are interested in specific scenarios involving diseases—outbreaks and epidemics, vaccination campaigns, etc.—as opposed to users who wish to monitor documents that mention the diseases in a broader context.

For each document, the PULS IE system extracts a set of incidents reported in the text. An incident is a structured representation of an event involving some communicable disease, described in the text in natural language. An incident consists of a set of attributes: the location and country of the incident, the name of the disease, the date of the incident, and descriptive information about the victims—their type (people, animals, etc.), their number, whether they survived, etc. The incident may cover a single occurrence, “80 chickens died on the farm on Wednesday,” or a larger time interval, as in “Two people in the region have contracted the disease since the beginning of the year.” Text may also contain ‘periodic’ incidents: “according to authorities, 330 people die of malaria in Uganda daily” (these are not currently handled by the system).

The system also identifies events in which the disease is unknown, or undiagnosed, which are especially important for surveillance.

For example, the sentence:

The deadly Ebola outbreak has so far killed 16 people in Gabon

will trigger the creation of an incident—a record in a relational database—and assign the underlined values to the corresponding attributes. Each record extracted from the document is permanently stored, together with links to the exact offsets in the text where its attributes were found within the document.

For detailed information about the design principles behind PULS, see, e.g., [YJRH05, GHY03]. The system relies on several kinds of domain-independent and domain-specific knowledge bases. An example of domain-independent knowledge is the location hierarchy, containing names of countries, states or provinces, cities, etc. An example of a domain-specific knowledge base is the medical ontology, containing names of diseases, viruses, drugs, etc., organised in a conceptual hierarchy. The ontology currently contains 2,400 disease terms; 400 vectors (organisms that transmit disease, like rats, mosquitoes, etc.); 1,500 political entities—countries, their top-level divisions and name variants; over 70,000 location names (towns, cities, provinces).

PULS uses pattern matching to extract events; the system contains a domain-specific pattern base—a cascade of finite-state patterns, which map information from its syntactic representation in the sentence to its semantic representation in the database records. For example, the above sentence about Ebola will be matched by a pattern like:

NP(disease) VP(kill) NP(victim) [ ’in’ NP(location) ]

The pattern first matches a noun phrase (NP) of semantic type disease; “Ebola” is a descendant of the disease node in the ontology. Then it matches a verb phrase (VP) headed by the verb kill (or its synonyms in the ontology). The verb phrase also subsumes modifier elements, such as the auxiliary verb has, the adverbial phrase so far, etc. The square brackets indicate that the locative prepositional phrase is optional; in case the location is omitted in the sentence, it is inferred from the surrounding context.
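As a rough illustration only, a toy regular expression mimicking this pattern on the example sentence; the real pattern base matches semantic classes from the ontology, not fixed word lists:

    import re

    # Toy stand-in for NP(disease) VP(kill) NP(victim) [ 'in' NP(location) ].
    pattern = re.compile(
        r"(?P<disease>Ebola|cholera|malaria) outbreak.*?"
        r"killed (?P<victims>\d+ people)(?: in (?P<location>\w+))?")

    m = pattern.search("The deadly Ebola outbreak has so far killed "
                       "16 people in Gabon")
    if m:
        record = m.groupdict()  # 'location' may be None: inferred from context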

Populating the knowledge bases requires a significant investment of time and manual labour. PULS employs weakly-supervised learning to reduce the amount of manual labour as far as possible, by bootstrapping the knowledge bases from large, unannotated document collections [Yan03, LYG03].


3.2 Measuring Usefulness

Traditionally, information extraction has mostly been restricted to document-local analysis. This means that information found in one document is not used in the analysis of other documents. Here, the goal is to achieve high scores in precision and recall. These scores are measured using a pre-annotated test corpus containing events that are considered "correct" and should therefore be extracted by an information extraction system. The performance of an IE system is measured by running it over the test corpus and examining which events it picked up during the run.

Precision is the ratio of correctly extracted events to the total number of extracted events. Recall is the ratio of the number of correctly extracted events to the total number of correct events in the gold standard. Precision indicates how well the system is able to avoid extracting incorrect events, whereas recall indicates what proportion of correct events were found during the extraction.

Measuring the performance of an IE system requires that both precision and recall are computed. Using only one of the two results in an insufficient performance measure. If only precision is computed, an IE system need only find one correct event to get a perfect score, whereas if only recall is computed, an IE system can generate a large number of events in the hope of catching all the correct ones.

Usually, precision and recall are combined into a score called F-measure. The F-measure is defined as the harmonic mean of precision and recall [MRS08].
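In code, with true positives, false positives and false negatives counted against the gold standard:

    def f_measure(tp, fp, fn):
        # Harmonic mean of precision and recall.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    # E.g. 80 correct events, 20 spurious, 40 missed:
    print(round(f_measure(80, 20, 40), 3))  # 0.727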

In addition to precision and recall, the PULS project places emphasis on being useful to its users. This is not straightforward, as usefulness, or utility, is a subjective concept: it is defined entirely by the users’ needs, which means that obtaining a definition of usefulness is an exercise in user requirements gathering.

Our users are concerned with several factors regarding the events they are served:

They want to see as much relevant information as possible.

The information should be correct.

Important information should not be lost.

Information should not be repeated.

The following three sections describe the methods with which we tackle these issues.


3.3 Relevance

Even if an event is completely accurate, it may not always be of interest to the user. Therefore, in order to increase the usefulness of the system, we introduce the concept of relevance as a property of events orthogonal to correctness. PULS assigns a relevance score to each event it processes. The intent is to reduce the number of non-relevant events a user has to see, and to allow users to quickly find events that they consider relevant.

We take a machine-learning approach to this problem, as opposed to the heuristics-based one used for confidence (section 3.4). Two different classifiers have been implemented: a Naive Bayes classifier and an SVM classifier.

We have defined relevance on a five-point scale:

5. High relevance, breaking news
4. Quite relevant, important updates
3. Less relevant, current events but no new information
2. Low relevance, historical or otherwise non-current information
1. Not relevant, hypothetical, non-factive information

Determining when an event is relevant is a difficult problem, partly because of the subjective nature of relevance and partly because the IE process is never, in practice, completely accurate.

Therefore, we simplify the problem slightly by making the assumption that all users share the same notion of relevance. This allows us to use the evaluated events of all users as training data for building a single relevance classifier.

This means that if one user considers an event relevant, we assume that all other users agree. In reality, every user may have their own, unique view on what is relevant and what is not. Accommodating each user’s needs would thus require training user-specific classifiers, which in turn would require more training data.

We further simplify the task of predicting relevance by reducing it to a binary classification problem. The first class, relevant, contains scores 4 and 5. The second class, non-relevant, contains scores 1, 2 and 3.
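A sketch of such a shared binary classifier using scikit-learn; the snippets, scores and feature choice are invented for illustration, and the actual classifiers are evaluated in [vEHV10]:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    snippets = ["25 people admitted with malaria this week",
                "malaria killed thousands in the 19th century",
                "new cholera cases confirmed today",
                "a hypothetical outbreak scenario was discussed"]
    scores = [5, 2, 4, 1]                          # user-assigned relevance
    labels = [1 if s >= 4 else 0 for s in scores]  # relevant = scores 4 and 5

    X = CountVectorizer().fit_transform(snippets)
    classifier = MultinomialNB().fit(X, labels)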

An evaluation of the classifiers is described in [vEHV10].


3.4 Confidence

Sometimes the system misinterprets the text it is processing and extracts an event where, in reality, there is none, or extracts one that is somehow inaccurate. For example, it might get the location related to the event wrong. This is generally caused by the reference resolution component picking the incorrect antecedent for an anaphor appearing in the text.

It is therefore desirable to have some kind of idea of whether an extracted event accurately reflects whatever was actually stated in the input text. Of course the easiest way to do this would be to manually read the text and compare it with the event reported by the system. This works fine when dealing with a few events or when developing new features to the system, but with tens of thousands of events, a strong need arises for some automated method that enables the system to somehow differentiate between accurate events and inaccurate ones.

An event deemed accurate is labeled confident. Such an event has been judged by the system to be accurate with fairly high likelihood. This kind of classification is particularly useful when the extracted events are publicly viewable on a web site and one wants to make sure that as few inaccurate events as possible make their way there, even at the cost of being unable to display events that are accurate in reality but have been marked non-confident by the system for some reason. It is essentially a question of trading recall for greater precision: we are willing to sacrifice some, possibly accurate, events in order to have the end result contain the fewest inaccuracies possible.

An easy way of finding confident events would be to simply take all the events that were extracted from documents mentioning only one of each type of attribute we are interested in extracting. This is too harsh in practice, since such documents are very rare, and it would result in very few confident events. Therefore the requirements for an event to be considered confident need to be relaxed.

Currently, the system attempts to accomplish this using a simple uniqueness heuristic. The method described below is a slightly modified version of the one described in [YJ05, Jok05]. The basic idea is simple: given an event and the text from which the event was extracted, if a pre-selected set of attributes in the event fulfill one of the three criteria described below, then the event is labeled confident. Which attributes to select depends entirely on the scenario at hand.

The criteria are as follows (a sketch of the check follows the list):


Inside trigger The attribute is inside the event span. The event span is the piece of text that was matched by a pattern.

Unique in sentence The attribute is present and unique in the sentence containing the event span. This means that there are no other attributes of the same kind to choose from in that sentence.

Unique so far The attribute is present and unique in the text ranging from the beginning of the document to the end of the sentence containing the event span. A unique attribute may be mentioned several times in the document.
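Putting the three criteria together, a confidence check might look like this; the data model is illustrative, not PULS's internals:

    def is_confident(event, selected_types):
        # Every pre-selected attribute type must meet one of the criteria.
        for t in selected_types:
            in_trigger = event["in_trigger"][t]       # inside the event span?
            n_sentence = event["sentence_counts"][t]  # candidates in the sentence
            n_prior = event["prior_counts"][t]        # candidates seen so far
            if not (in_trigger or n_sentence == 1 or n_prior == 1):
                return False
        return True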

The intuition behind the heuristic is that it tries to identify events where the possibility of a mistake having been made is minimized. For example, suppose we have a pattern that is supposed to match phrases like “we really enjoyed our vacation”. Since no location is automatically picked up by the pattern, one needs to be searched for in the pool of entities seen so far (the prior context). If this pattern fired in a sentence that mentions only ‘Tenerife’, then it is much more likely that the destination for the vacation in question was indeed Tenerife than if the same sentence had also mentioned ‘Gran Canaria’, ‘Lanzarote’ and ‘Fuerteventura’. Still, it’s worth noting that it’s completely possible that the IE engine would have done the correct thing in both cases, but only the first case can be considered confident, because Tenerife is the only reasonable location to associate with the vacation.

In section 5 we present an evaluation of the uniqueness heuristic.

3.5Aggregation

This section describes the methods used to reduce redundancy by aggregating events into groups.

3.5.1Grouping disease events into outbreaks in PULS

PULS goes beyond the traditional IE paradigm in two respects. First, in a typical IE system, documents are processed separately and independently; facts found in one document do not interact with information found in other documents [YBvE+07]. Second, for each attribute in an extracted incident, the IE system stores only one value in the database record—the locally best guess for that attribute.


1. After PULS extracts information from each document locally, it attempts to globally unify the extracted facts into groups, which we call outbreaks or level 2 groups. An outbreak is a set of related incidents. Currently, incidents are related by simple heuristics: they must share the same disease name and the same country, and occur reasonably ’close’ in time. Closeness is determined by a time window, currently fixed at 15 days (the window could be made more sensitive, e.g., dependent on the disease type). A chain of incidents, any pair of which is separated by no more than the time window, is aggregated into the same group. Thus, an outbreak is a kind of ’bin’ containing related incidents, and provides an added level of abstraction above the ’low-level’ facts/incidents.

2. When PULS stores a record in the database, for each attribute, rather than storing a single value, PULS in general stores a distribution over a set of possible values. For example, the sample text from section 3.1 might read instead “Five more people died last week.” PULS will then try to fill in the missing attributes (i.e., the disease name, location) by searching for entities of the corresponding semantic type elsewhere in the discourse. In general, for a given attribute of an event, the document will contain several possible candidate entities, and each candidate will have a corresponding score—measuring how well it fits the event. The score depends on certain features of the candidate value. These features include whether the value is mentioned inside the trigger—the piece of text that triggered some pattern from the pattern base; whether it appears in the same sentence as the trigger; whether it appears before or after the sentence containing the trigger; whether this value is the unique value of its type in the sentence that contains the trigger (e.g., the sentence mentions only a single country, or disease); whether the value is unique in the entire document; etc.

Using a set of candidate values rather than a single candidate is helpful in two ways. First, it allows us to compute the confidence of an incident (described in the previous section). Second, it allows us to explore methods for recovery from locally-best but incorrect guesses by using global information [Jok05].
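A sketch of the chaining step from point 1, applied to incidents that already share a disease and country; the 15-day window is the one described above, while the data layout is illustrative:

    from datetime import date, timedelta

    WINDOW = timedelta(days=15)

    def chain_into_outbreaks(incidents):
        # Group date-sorted incidents into outbreaks: a new group starts
        # whenever the gap to the previous incident exceeds the window.
        groups, current = [], []
        for inc in sorted(incidents, key=lambda i: i["date"]):
            if current and inc["date"] - current[-1]["date"] > WINDOW:
                groups.append(current)
                current = []
            current.append(inc)
        if current:
            groups.append(current)
        return groups

    incidents = [{"date": date(2011, 1, 1)}, {"date": date(2011, 1, 10)},
                 {"date": date(2011, 3, 1)}]
    print(len(chain_into_outbreaks(incidents)))  # 2 outbreaks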

In section 5.3 we present an evaluation of outbreak grouping.

3.5.2 Fine-Grained Grouping of events

The PULS system receives large amounts of data each day. This data produces so many events that individually checking them all becomes difficult for the users. To alleviate this problem, PULS attempts to aggregate events even further, into fine-grained groups called level 1 groups. The idea is similar to level 2 grouping, except that it is applied to all scenarios and the time window that determines whether two events belong to the same group is exactly one day. The list view on the front end displays a list of level 1 groups (described in section 4.2).

3.5.3 Clustering in MedISys

Besides the accuracy of the MedISys filtering and categorisation, an important issue for users is multiple reporting: thanks to the high number of independent news sources, MedISys captures many reports that readers of one or a few news sources would miss, but the flip side of the coin is that the same story is reported many times. This causes extra work for the users and makes monitoring daily news a time-consuming task. The solution to this problem lies in the aggregation of reports into larger units. MedISys and PULS use different approaches to aggregation, which are not currently integrated.

MedISys presents news clusters to the users, grouping similar news reports arriving within at most 8 hours of each other. The short time window means that clusters normally contain articles published within the same day; if reporting continues steadily, however, articles from different days may be grouped into the same cluster. The similarity measure for news articles is cosine similarity over a simple vector-space representation of the first 200 word tokens of each article. This means that not only multiple reports of the same story, but also similar reports about different cases of the same disease may be grouped together. This method also allows users to ignore entire groups of non-relevant articles (e.g., discussions about vaccination campaigns) at once.
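The similarity measure itself is straightforward; a sketch of cosine similarity over bag-of-words vectors built from the first 200 tokens (the tokenisation here is deliberately naive):

    import math
    from collections import Counter

    def vectorize(text, max_tokens=200):
        return Counter(text.lower().split()[:max_tokens])

    def cosine(a, b):
        # Cosine similarity between two bag-of-words vectors.
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm = (math.sqrt(sum(v * v for v in a.values())) *
                math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    sim = cosine(vectorize("cholera outbreak in Haiti kills 12"),
                 vectorize("twelve dead in Haiti cholera outbreak"))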

3.6 Domains and Scenarios

This section describes the domains and scenarios supported by the PULS system. A domain signifies a general area of interest in the context of information extraction. A scenario is a more specific sub-area of interest belonging to some domain for which event templates have been defined.

The PULS system can be adapted to work with any number of scenarios. Each scenario is associated with a source of data. These sources can supply data either in batches or in real-time. The event extraction can be invoked manually or automatically at certain intervals.


Currently, documents from three domains are processed by PULS in real-time: Medical, Business and Security. For each domain, PULS extracts events from one or more scenarios.

The following subsections describe the domains in more detail.

3.6.1 Medical

The medical domain deals with health-related issues such as disease outbreaks. Currently PULS extracts events for one scenario in this domain, Epidemic Surveillance. The events extracted in this scenario are intended to aid medical professionals in tracking epidemics as they occur throughout the world. The scenario aims to provide an early warning system for people and organizations responsible for reacting to epidemic threats around the world. Examples of such organizations include the European Centre for Disease Prevention and Control (ECDC) and the World Health Organization (WHO).

The documents for this scenario are provided by MedISys (http://medusa.jrc.it). The entire pipeline is automated; a live feed supplies new documents every 10 minutes, which are immediately processed. Any events found promptly become visible on the web site. Each day, approximately 10,000 documents are received and processed.

Examples of sentences containing events for this scenario are:

Four children died this week in Santiago, located northwest of the capital of Santo Domingo.

Seven people have died and there are more than 570 suspected cases in the outbreak, centered in coastal villages around Port Moresby.

The disease has been confirmed in seven of South Africa’s nine provinces, and has infected 60 people.

3.6.2 Business

The business domain deals with business-related events such as companies buying other companies, or people getting nominated for positions within companies. The purpose of extracting events in this domain is to produce business intelligence. This type of information is useful, for instance, to companies planning future endeavours or to people intending to buy or sell stock.

In this domain, PULS supports six scenarios. They are:

Investments Events describe companies investing money in various undertakings, such as new infrastructure.

Acquisitions Events describe companies buying, or merging with, other companies.

New Products Events describe companies bringing new products to the market.

Nominations Events describe people being nominated for positions within companies.

Marketing Events describe companies marketing products or services, such as launching advertising campaigns.

Ownership Events describe ownership relations between companies.

PULS receives documents in batches, once per day. Each batch contains approximately 500 human-moderated documents.

The following sentences contain business events:

USD 70mn (EUR 52.95mn) has been invested in the new aircraft. (investments)

Motorola sells wireless unit to Nokia Siemens for USD 1.2bn (acquisitions)

Nissan plans advertising campaign to boost sales of Infiniti brand (marketing)

3.6.3 Security

In the security domain PULS currently supports one scenario, cross-border crime. The events in this scenario deal with criminal activities concerning more than one country. The scenario is divided into three sub-scenarios:

Illegal migration All types of events related to illegal migration. These include illegal entry attempts, illegal exit attempts and illegal stay.


Human trafficking Events related to trafficking of humans for various reasons. These include prostitution, forced labour, begging and organ transport.

Smuggling Events related to smuggling of various things, such as drugs, arms, goods or waste.

Examples of sentences containing security events are:

According to data released by the Romanian Interior Ministry, 591 Romanian citizens were repatriated from France in 2010. (illegal migration)

4 Bangladeshis held for human trafficking (human trafficking)

Air hostess smuggles cocaine into UK (smuggling)

4 Front End: Web Interface

PULS provides a web interface for examining and moderating events extracted by the back-end. Events are stored in a relational database, which makes it easy to implement various views on the data. The interface contains simple views (displaying just one table) as well as more complicated aggregate views. Any number of scenarios are supported, as well as any number of users. The views are generated dynamically based on the database contents, which means that if the database is updated at regular intervals, the web site can be used to track changes and new events.

In addition to viewing events, the interface lets users moderate events, or even create, through an easy drag-and-drop mechanism, new events that the IE system may have missed. The moderated events can later be used as training data for machine learning tasks or to assess the performance of the IE system.

This section describes the most important views provided by the web interface.

4.1 Table View

The table view is the most basic view provided by the interface. It is aimed at users who want to examine single events without any aggregation or who need a flexible method of sorting and searching the event database.

There exist two variations of the table view: the first one shows all events in the database, whereas the second one shows only events that have been labeled highly relevant by either the system or the users. See section 3.3 for a more detailed explanation of relevance in the PULS system.

Figure 2: Table View Screenshot

The table view displays a table of events in a concise manner with each row in the table representing one event extracted by the IE system. The most important attributes of the events make up the columns of the table. By default, the events are sorted according to recency, but the events can be sorted by any column in either ascending or descending order. Furthermore, constraints can be added to any column to narrow down the events shown. A snippet containing the sentence in which the event was found is displayed as a tool tip when hovering the mouse cursor over any row in the table.

Each event row may have a ‘Related’ link placed next to it if similar events exist in the database. Clicking on the link will constrain the table to show only these related events. This kind of event similarity is defined by our concept of level 2 groups, described in section 3.5.1 on page 20.

A screenshot of the table view is shown in figure 2.


Malaria - Vietnam 2010-02-03

Seven people found with cholera in Mekong Delta province | 11:10

www.thanhniennews.com

Also on Sunday, the Nam Tra My General Hospital in the central province of Quang Nam said they had admitted 25 people with malaria over the past week from Tra Tap Commune. [Relevance score: 4]

The local health agency found nearly 100 people affected with the mosquito-borne infectious disease, but they had not approached the hospital for treatment. [Relevance score: 4]

Figure 3: Event in snippet format

4.2 List View

The list view is intended to let users see the most recent events in the database. It takes a text-based approach, as opposed to the table-based approach of the table view: events are shown as snippets of text rather than as fields in a table. This gives less structured information than the table view, but a better idea of the overall quality of the event. To illustrate, consider the event presented in snippet format in figure 3.

The table format is compact, but only presents the facts extracted by the IE system. It is difficult to determine the accuracy of the event simply by looking at the row. The snippet format, on the other hand, is less compact, but gives the viewer more context in which to judge the event, namely, the sentence from which the event was extracted.

Besides detailed analysis of individual events, PULS also performs aggregation: similar events are grouped together, and duplicate events are greyed out.

This view displays events in reverse chronological order, arranged into fine-grained groups called level 1 groups. In general, a level 1 group is a set of events that share a number of attributes. In the medical scenario, each group contains events of a given disease and country reported on a given day.

The list view supports a more limited form of searching than the table view. Sorting is not supported, as the ordering of groups is chronological.

Only the headline and events from one document are shown for each group, along with an automatically generated headline for the entire group. In the medical scenario the headline is of the form disease - country.

Figure 4: List View Screenshot

A group can be expanded to show all documents and events in that group by clicking on an arrow that is displayed next to the group. The big stars indicate the relevance score of the whole group, while the small stars indicate the relevance of each event in that group. The relevance of a level 1 group is defined as the maximum relevance of all events belonging to that group. Groups with a low relevance are greyed out.

If a user wants to examine an event more closely, she can click on the text snippet. This will take her to the document view.

A screenshot of the list view is shown in figure 4.


4.3 Document View

A screenshot of the document view is shown in figure 5 on the following page.

The document view displays information about events found in a single document. The currently selected event is underlined in black; other events are underlined in blue. Other events can be selected by clicking on them. When an event is selected, details about the event are displayed in the box to the right. Details include extracted attributes, the publication date, and a link to the original article.

An event can be exported by clicking on one of the export links. This allows users to easily add events to their own private databases or spreadsheets.

The document view allows users to correct incorrectly extracted attributes of any event. This is done either by entering text manually into text fields or through a simple drag-and-drop mechanism implemented by most web browsers. When an event is corrected, the view is immediately updated to reflect the changes.

In addition to correcting events, the document view allows users to assign events relevance scores of their own. Users are encouraged to moderate events, as PULS is able to learn from the users’ ratings, which means that the more ratings it gets, the better it becomes at automatically determining the relevance of future events.

4.4 Map View

A screenshot of the map view is shown in figure 6 on page 31.

The map view uses Google Maps to plot events and outbreaks on a map. Any events displayed on the table view can be plotted. In other words, constraints specified in the table view are carried over to the map view.

The map view plots events based on the country and location slots of each event. If only the country slot is filled, the coordinates for the country are looked up in a database and a map marker is placed approximately at the center of the country. This marker represents all events in that particular country. If both slots are filled, the coordinates for the location (e.g. a city) are looked up and a map marker is placed at those coordinates. This marker represents all events in that particular location.
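As a sketch of this lookup logic, assuming a hypothetical lookup-coordinates function backed by the gazetteer databases described below:

    ;; LOOKUP-COORDINATES is hypothetical; it stands in for a query against
    ;; the gazetteer databases described below.
    (defun event-coordinates (country location)
      "Return plotting coordinates for an event."
      (if location
          (lookup-coordinates location country)  ; a specific place, e.g. a city
          (lookup-coordinates country nil)))     ; approximate center of the country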

The map view utilizes two databases to implement the location lookup. The first database is the GEOnet Names Server (GNS) database maintained by the U.S. National Geospatial-Intelligence Agency (NGA)4. It is a comprehensive database containing names and attributes of locations, cities and countries all over the world, except the United States of America.

The second database is the Geographic Names Information System maintained by the U.S. Geological Survey5. It is a comprehensive database containing names and attributes of locations and cities in the USA and Antarctica.

4http://www.nga.mil/, http://earth-info.nga.mil/gns/html/

5http://geonames.usgs.gov/

Figure 5: Document View Screenshot

Figure 6: Map view screenshot showing events from 2010

4.5 Chart View

Figure 7 on the following page shows a screenshot of the chart view summarizing events for the year 2010.

Figure 7: Chart view screenshot summarizing events from 2010

The chart view displays a horizontal bar chart that summarizes event data. It shows the number of events grouped by pre-defined fields. The constraints from the table view are carried over to the chart when the user clicks the chart link. The medical domain chart can group by disease, country, document source, or disease and country. The business domain chart can group by business sector, country, or both.

The chart is sorted in descending order, with the largest group at the top. The label for each group is a link that, when clicked, takes the user to a constrained table view showing only events in that group. To prevent the chart from becoming too big, a limit of 500 rows has been placed on the view.

4.6 Graph View

The graph view shows an interactive graph with entities discovered in the processed documents of the business domain. The view is based on a Java applet implemented by the Biomine Project6 for visualizing biological data, but has been integrated with the PULS front end to display business event data.

The graph displays a connected component of entities: for a given event, the view shows the entities that are directly and indirectly connected to it. Each node in the graph represents an entity belonging to some event. Each edge represents an event or relation between entities. For instance, if we have an acquisition event, in which one company bought another, then the graph would show the two companies as nodes and an edge between them representing the acquisition event.

6http://www.cs.helsinki.fi/group/biomine/

Figure 8: Graph View Screenshot 1

The graph can be manipulated by dragging nodes around. The graph has an auto-balancing feature: dragging a node will smoothly rearrange the layout of the whole graph. Clicking on an edge will open the corresponding event in the document view.

On the right-hand side, a list of entities involved in the current graph is shown. Hovering over an entity’s name will highlight the corresponding node in the graph viewport. Clicking on one of the entities, say a company, will zoom the graph viewport in on the company’s node.

Figure 8 shows a screenshot with a complete, rather large graph of events related to an event concerning the company Nokia. Figure 9 on the following page shows the same graph, with the view zoomed in on a part of it.


Figure 9: Graph View Screenshot 2


5 Evaluation and Results

In this section we evaluate the performance of the system as a whole, as well as the performance of the confidence measure and outbreak aggregation separately.

5.1 Evaluation of Event Detection

At the time of this evaluation PULS received on the order of 10,000 documents from MedISys each month. From 27% of these documents, PULS extracted about 6,000 incidents per month, on average.7 The remaining 73% of the documents processed by PULS yielded no incidents. This is as expected, since MedISys does not explicitly select for outbreaks, but for mentions of disease names in any context, and many documents may mention diseases in contexts unrelated to epidemics and outbreaks.

To estimate the proportion of documents rejected by PULS that contain missed events (false negatives), we manually checked 200 MedISys documents that produced no events. Among these documents, 14% contained an event that the IE system had missed.8 As PULS filtered out 73% of the incoming documents, and 14% of those were filtered incorrectly, about 73% × 86% ≈ 63% of the documents that arrived at PULS contained no epidemic events. In this way, the IE phase helps to distinguish reports about epidemic outbreaks from documents that mention diseases in other contexts.9

7In IE, it is typical for a relevant document to contain more than one incident, since often there are one or more main events, and other, related events are mentioned as part of background discussion.

8NB: this does not correspond to the false-negative rate. Computing the false-negative rate, recall at the document level, and recall at the level of events would require a more detailed evaluation, to be conducted in the future.

9MedISys has an optional boolean filter that tries to capture outbreaks by requiring the name of the disease to occur in combination with keywords like bedridden, hospital*, deadly, cases, etc. This has not yet been evaluated.

5.2 Evaluation of the Uniqueness Heuristic

So far, the uniqueness heuristic has merely been observed to work well “most of the time”. In order to gain a better understanding of how useful the heuristic is, and perhaps to find places where it could be improved, a small sample of confident events was selected for manual evaluation. The objective was to see how reliably the heuristic is able to identify accurate events. In other words, whenever an event is labeled confident, we want to know whether the event is actually accurate in reality.

In neighboring Laos, health officials confirmed on Thursday that bird flu had killed a 15-year-old girl, the first confirmed victim there of the H5N1 virus.

Disease: Avian Influenza
Country: Laos
Date: Thursday (2007-03-08)
Descriptor: a 15-year-old girl
Count: 1
Status: Dead

Figure 10: Example disease event.

We have a number of databases containing events for different scenarios. The events selected for this experiment are a subset of the confident events from one of these databases. This database contains events for an epidemics scenario, in which events represent outbreaks of diseases around the world. The events in this database have been extracted from various news-related web sites. See figure 10 for an example of a disease event. The text is an excerpt from one of the documents processed by the system, with the underlined portion of the text being the event span. The extracted attributes are listed below the text.

A total of 50 confident events were randomly selected for evaluation from the set of events produced during January 2007. The attributes that were chosen to be included in the uniqueness check are disease, country and date. In other words, the fields descriptor, count and status had no effect on the confidence of an event. Events were assigned one of the following three grades:

Accurate The event accurately reflects whatever was mentioned in the text. All three attributes are required to be correct.

Heuristic Inaccurate The event does not accurately reflect whatever was mentioned in the text due to the heuristic failing.

IE Inaccurate The event does not accurately reflect whatever was mentioned in the text due to an error on the IE system’s part. This could happen, for instance, if a pattern fires even though no activity related to diseases is mentioned in the text.


Figure 11 contains an example of a confident accurate event. The event is confident because the disease and date are mentioned in the event span (the underlined text) and the country, Hong Kong, is the only country mentioned in the text up to the event span. Thus, all three attributes fulfill one of the criteria described in the previous section. The event is accurate, since the text does indeed mention an occurrence of bird flu in Hong Kong on January 6.

A confident but inaccurate event due to an error in the IE system is shown in figure 12. Clearly, the excerpt shown in the figure does not depict any kind of disease activity, but because a pattern fired here, the date and country are unique in the sentence, and ‘lung cancer’ happens to be the only disease mentioned in the document so far, the event becomes confident. This is incorrect, since there is really no event to begin with. The event was extracted here because the pattern that fired is perhaps too permissive and might need to be adjusted.

A confident but inaccurate event due to the heuristic failing is shown in figure 13. Here, the IE system correctly extracted an event, but even though somebody “died suddenly”, no disease is mentioned as the cause of death. This is what causes the heuristic to fail, since the system will still try to find a disease for the event from the prior context, as part of the reference resolution process. In this case it found ‘avian influenza’, which is incorrect, since judging by the way the sentence is phrased, this person most likely died of some other cause. The event is confident because the date and country are present in the event span, and ’avian influenza’ is the only disease mentioned so far.

The results of the evaluation are listed in table 1. While the majority of the events evaluated were accurate, there is still some room for improvement. When judging how well the heuristic worked, it may make sense to ignore the ‘IE inaccurate’ events. They are essentially non-events and do not contribute toward the evaluation of the heuristic itself in any meaningful way. The underlying causes of these non-events are, for the most part, relatively easy to fix, while tweaking the heuristic is more time-consuming. When the ‘IE inaccurate’ events are ignored, the percentage of accurate events rises to 84% (36 of the remaining 43 events).

5.3 Evaluation of Outbreak Aggregation

Since outbreak aggregation is our primary means of reducing redundant information in the flow of news to the user, it is important to estimate the accuracy of the outbreak grouping.


Grade                  Count   Relative
Accurate                36     72%
Heuristic inaccurate     7     14%
IE inaccurate            7     14%
Total                   50     100%

Table 1: Evaluation results.

Hong Kong says dead goshawk carried bird flu virus

17 January 2007 Hong Kong confirmed on Wednesday that a bird of prey found in the city carried the H5N1 virus, the second such case this month. The dead crested goshawk was found on a hill behind a health clinic in the built-up Shek Kip Mei district in Kowloon on Jan 9. The Agriculture, Fisheries and Conservation Department confirmed the bird was infected with H5N1. Another bird, a scaly breasted munia, also tested positive for the H5N1 virus on Jan. 6. . . .

Disease: Avian influenza
Country: Hong Kong
Date: January 6 (2007-01-06)

Figure 11: Example ‘accurate’ event

Lung cancer is mentioned earlier in the document.

. . . U.S. track star Justin Gatlin, the current world record holder in the 100 meters and reigning Olympic champion in the event, revealed Sunday that he had tested positive for a performance-enhancing drug, The New York Times reported. . . .

Disease: Cancer

Country: USA

Date: Sunday (2006-12-31)

Figure 12: Example ‘IE inaccurate’ event.


Avian influenza is mentioned earlier in the document.

. . . Her predecessor as WHO chief, South Korea’s Lee Jong-Wook, died suddenly in office last year. . . .

Disease: Avian influenza
Country: South Korea
Date: Last year (2006)

Figure 13: Example ‘heuristic inaccurate’ event.

We analysed a randomly chosen set of medium-sized outbreaks: 20 outbreaks with approximately 10 incidents in each. For each incident we determined whether it was appropriately included in the outbreak. 68% of the incidents were correctly identified with their outbreaks. Three of the outbreaks (about 15%) were erroneous, i.e., based on incorrect confident incidents.10

Of all the incidents examined in this evaluation, 22.5% were confident (i.e., on average, the outbreaks contained only 2–3 confident incidents).

10It was interesting to observe that aggregation is often useful even when the outbreak consists entirely of incorrectly analysed incidents. For example, in high-profile cases picked up by major news agencies, reports are re-circulated through multiple sites worldwide. Because the text is very similar to the original report, the IE system extracts similar incidents from all reports, and correctly groups them together. Although some attribute is always analysed incorrectly, the error is consistent, and the grouping is still useful: it helps reduce the load on the user by aggregating related facts.


6 Technical Description

This section describes technical aspects of the PULS system and its components.

The PULS system consists of a fairly large body of code, written in several different programming languages by several different programmers. The initial version of the system was implemented in the mid-1980s. Parts of the original version remain in production use today.

The system is split up into several independent components, each responsible for one or more tasks. The major components are:

PULS-IE

PULS-LIB

PULS-WEB

6.1 Puls-IE

Puls-IE is the core IE system, implemented almost exclusively in Common Lisp; it consists of roughly 100 KLOC11. The task of Puls-IE is to take documents as input, analyze them, and output a “response” containing the events discovered, along with any necessary auxiliary information related to the events. This component is the heart of the system; everything else is built around it, either to pre-process, post-process or display data. A graphical customization environment allows developers or domain experts to customize the knowledge bases and thus affect the behaviour of the IE system.

The interactive nature of Lisp allows for an intuitive development environment in which changes to the knowledge bases can be rapidly tested. As the requirements placed on the system are constantly being updated, even changes to the internals of the IE system can be easily implemented. On several occasions, relatively advanced features have been added with only a few well-targeted patches.

For instance, a number of embedded domain-specific languages (DSLs) have been made available to the knowledge-base customizers with surprisingly little effort. The DSLs greatly reduce the time it takes to extend the behaviour of the system, and they also make it possible for non-programmers to customize it. DSLs have been created for defining concepts, patterns, dictionaries and inference rules.

11Kilo-lines of code; 1 KLOC = 1000 lines of source code.

The IE system can be used interactively or in batch mode. Interactive use takes place in the Lisp REPL12, where the user enters a sentence and the system outputs a response. In batch mode the system can process a large number of documents unattended and store the results in a database.

12Read-Eval-Print Loop

For batch mode operation, the system is compiled into what is known as a core: a memory image that has been saved into an executable file. Using cores allows even large programs to start up very quickly, which is critical when running the IE system as part of frequently invoked cron jobs. On many operating systems, multiple processes started from the same core will share the memory used by the core, which conserves memory.

Usually, the following steps are needed to create a core:

1. Start up the default Lisp environment

2. Load any number of libraries

3. Save the core

This procedure is usually automated with scripts or makefiles. The saved core will contain all the libraries that were loaded into the environment, and they will therefore be immediately available on start-up. It is significantly faster to load a core with libraries preloaded than to load the libraries at runtime. It is comparable to statically linking libraries into a program.
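As a minimal sketch, assuming the SBCL implementation (the Lisp implementation used is not named here) and a hypothetical puls-lib system name, such a script might look as follows:

    ;; build-core.lisp: a hypothetical core-building script for SBCL.
    (require :asdf)               ; step 1: the system-loading facility
    (asdf:load-system :puls-lib)  ; step 2: load the libraries to bake in
    ;; step 3: write the current memory image to disk as an executable file
    (sb-ext:save-lisp-and-die "puls.core" :executable t)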

6.2 Puls-Lib

Puls-Lib is a collection of modules implemented to perform a variety of tasks. The most important tasks of this component are:

Encapsulating the Puls-IE component to provide a simplified facade interface

Parsing the output produced by Puls-IE into data structures suitable for further processing.

Providing models (business logic) according to which the various databases can be manipulated. This includes all code related to aggregation and confidence.

Providing cronjob scripts that enable the processing of documents in real-time.

The package effectively encapsulates the databases by exposing a number of methods designed for adding, updating or deleting events or documents safely and in a consistent fashion.

Puls-Lib also includes general utility libraries, libraries for generating RSS feeds from the database contents, language-related libraries such as language identification, libraries for parsing input files received from document providers, and various other libraries required for the operation of the system.

Puls-Lib is almost exclusively implemented in Common Lisp and consists of approximately 65 KLOC.

As a general rule, all systems defined in Puls-Lib have a system definition file, also called an “.asd” file. It is similar to a makefile, in that it specifies which files the system depends on and the order in which they need to be loaded or compiled. The systems that define a namespace usually do so in a file called package.lisp or packages.lisp (namespaces in Common Lisp are called packages). Most namespaces defined in Puls-Lib use the prefix PULS. For instance, the general utility library defines the package PULS.UTIL.
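As an illustrative sketch (the component names are assumptions, not the actual file contents), a minimal system definition for the utility system might look like this:

    ;; puls.util.asd: a hypothetical system definition file.
    (asdf:defsystem :puls.util
      :serial t                          ; compile and load components in order
      :components ((:file "package")     ; defines the PULS.UTIL namespace
                   (:file "util")))      ; the utility functions themselves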

The following sections will describe the most important components of Puls-Lib in more detail.

6.2.1 Puls.Conf

The PULS.CONF module is used to manage and load configuration files. It exposes a few functions that allow loading of files at predefined locations (e.g. ~/.puls-lib/). Currently, configuration files are normal Lisp source files that are loaded in source form13. A configuration file does two things: first, it assigns values to parameters; second, it executes (re-)initialization hooks. A parameter in Common Lisp is a global variable with dynamic scope [Ste90]. By convention, we require that the configuration file be loadable into an empty Lisp environment. We also disallow all code except simple variable assignments or invocations of initialization hooks.

13Common Lisp can be either compiled (for speed) or executed directly from source files.


The configurable variables include:

Pathnames for storing documents

Foreign library pathnames

Database names, hostnames and ports

IE core pathnames

Mappings between models and specific IE cores

Mappings between models and databases

Ports and addresses for the web site

The rest of the system refers to the variables defined in this module. For developer convenience, the module supports reloading of configurations at runtime. Switching configurations is also supported.
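As a sketch, with hypothetical parameter and hook names, a configuration file might look like this:

    ;; ~/.puls-lib/config.lisp: a hypothetical configuration file.
    ;; Only simple assignments and invocations of initialization hooks appear.
    (setf puls.conf:*document-root* #p"/srv/puls/documents/")         ; document storage
    (setf puls.conf:*db-host*       "localhost")                      ; database host
    (setf puls.conf:*db-port*       5432)                             ; database port
    (setf puls.conf:*ie-core-path*  #p"/srv/puls/cores/medical.core") ; IE core
    (puls.conf:reinitialize-connections)                              ; a re-initialization hook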

6.2.2 Puls.Util

The PULS.UTIL module contains general (and generic) utilities that have accrued over time and proven useful. All other modules depend on this module. Examples of such utilities are:

Functions for list and hashtable manipulation

Logging functions

Macros that abstract away common programming patterns.

Functions for string manipulation and scanning.

6.2.3 Puls.Cronjob

The PULS.CRONJOB module contains code that should be executed at regular intervals from a cron job. This includes scripts and glue code for fetching documents from various sources and processing them. The actual tasks are performed by other modules; this module merely provides the top-level scripts that initiate the sequence of actions. When a cron job script is invoked, it takes the following actions (a sketch of such a script follows the list):


1. Download a new batch of documents from a predetermined location.

2. Parse the downloaded file, extracting the new documents.

3. Discard any previously seen documents.

4. Invoke the appropriate Puls-IE core, depending on the source and domain of the documents.

5. Parse the response of Puls-IE and store it in the database.

6. Publish an RSS file containing the latest events. Currently, this step is done only for the medical domain (see section 6.4).
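The following sketch illustrates these steps; every function and variable name is hypothetical, standing in for functionality provided by the other Puls-Lib modules:

    ;; A hypothetical top-level cron job script for the medical domain.
    (defun run-medical-cronjob ()
      (let* ((batch     (download-batch *medical-source-url*))      ; step 1
             (documents (parse-batch batch))                        ; step 2
             (new-docs  (remove-if #'previously-seen-p documents))  ; step 3
             (response  (invoke-ie-core :medical new-docs)))        ; step 4
        (store-response response)                                   ; step 5
        (publish-rss-feed :medical)))                               ; step 6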

6.2.4 Puls.Model

The PULS.MODEL module contains the business logic associated with manipulating the databases. It exposes functions for adding, deleting, or modifying events. These functions ensure that the database is always left in a consistent state, and that all dependencies are taken care of. The databases should only be manipulated through the interface provided by the models in this module. Each PULS scenario is associated with a subclass of the base model class. The module implements functionality that is common to all scenarios as well as scenario-specific functionality.

The basic methods implemented by all models are add-item, delete-item, update-item and match-item. For example, adding an event to the database entails creating an event object with the desired slot values, then applying add-item to a model object and the newly created event object. The model will infer the values of slots that were left empty, if possible. If the event is considered valid according to the model’s business logic, then the event is added to the database and any dependencies (such as event aggregation) are automatically seen to.
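As a sketch, assuming a hypothetical disease-event class and model variable, adding an event might look like this:

    ;; The class, slot and variable names below are hypothetical.
    (let ((event (make-instance 'disease-event
                                :disease "Avian Influenza"
                                :country "Laos"
                                :date    "2007-03-08")))
      ;; the model infers missing slots, validates the event, and sees to
      ;; dependencies such as aggregation before inserting it
      (add-item *medical-model* event))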

Some methods have complex dependencies. To deal with this issue, the model creates an internal list of actions, called a plan, that represents the actions that should be taken in order to successfully complete the method call. An identifier of the programmer’s choosing is linked to every action, as well as a priority. Once the plan has been constructed, each action is executed in order. The results of each action are stored in a table and made available to subsequent actions in the current plan.


Using this planning method makes code simpler, as one does not need to worry about taking some action twice; the planner ensures that every action for a given identifier occurs exactly once in the plan, and that it is executed at the right time (based on priority).
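The following self-contained sketch illustrates the idea; the representation is illustrative rather than the actual PULS interface, and lower priority numbers are assumed to execute earlier:

    (defstruct plan
      (actions (make-hash-table :test #'eq)))  ; identifier -> (priority . function)

    (defun add-action (plan id priority function)
      "Register FUNCTION under ID; a given ID occurs at most once in the plan."
      (unless (gethash id (plan-actions plan))
        (setf (gethash id (plan-actions plan)) (cons priority function))))

    (defun execute-plan (plan)
      "Execute the actions in priority order; each action receives a table
    holding the results of the actions executed before it."
      (let ((results (make-hash-table :test #'eq))
            (entries '()))
        (maphash (lambda (id entry) (push (cons id entry) entries))
                 (plan-actions plan))
        (dolist (entry (sort entries #'< :key #'cadr) results)
          (destructuring-bind (id priority . function) entry
            (declare (ignore priority))
            (setf (gethash id results) (funcall function results))))))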

6.2.5 Puls.Auxil

The PULS.AUXIL module implements code for parsing the output of the Puls-IE component. It exposes three functions: one for parsing the main output file of Puls-IE (known as the response file), one for parsing the auxiliary output file (known as the auxil file) and one for parsing the special document metadata files that specify the headline location, paragraph boundaries, publication date and other metadata about the document. The result of parsing these files is a simple data structure designed to hold key-value pairs.

6.2.6 Puls.Groups

The PULS.GROUPS module provides the core algorithm for computing level 1 and level 2 groups. The code is generic; it is up to the users of this module to adapt it to the appropriate scenario. The core algorithm for both types of groups is a one-pass algorithm that chains together events into groups on a timeline. A timeline is a set of events that share one or more attribute values. Each scenario may have timelines based on different attributes. In the medical domain, for instance, timelines consist of events that share the disease and country attributes.

The procedure for producing event groups is designed for batch computation. It can currently only compute the groups of an entire timeline. Updating individual groups (e.g. by adding one event) is not supported.

For a given timeline and time window n, the groups are computed as follows:

1. A single event is a group with start and end time equal to the start and end time of that event.

2. If the start or end time of a group is within n days of another group, merge the groups into a new group. Then set the start time of the merged group to the minimum start time of all events in the group. Set the end time of the merged group to the maximum end time of all events in the group. Repeat this step until no more groups are merged. (A code sketch of this procedure follows.)
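A self-contained sketch of the procedure, under the assumption that an event is represented as a (start . end) pair of integer day numbers:

    (defstruct group start end events)

    (defun group-timeline (events n)
      "Chain EVENTS on a single timeline into groups, merging any two
    groups whose time spans come within N days of each other."
      (let ((groups (mapcar (lambda (event)       ; step 1: one group per event
                              (make-group :start (car event)
                                          :end   (cdr event)
                                          :events (list event)))
                            events))
            (merged t))
        (loop while merged do                     ; step 2: repeat until stable
          (setf merged nil)
          (loop named scan
                for (a . rest) on groups do
                  (dolist (b rest)
                    (when (<= (- (max (group-start a) (group-start b))
                                 (min (group-end a) (group-end b)))
                              n)                  ; spans overlap or lie within N days
                      (setf (group-start a)  (min (group-start a) (group-start b))
                            (group-end a)    (max (group-end a) (group-end b))
                            (group-events a) (append (group-events a)
                                                     (group-events b))
                            groups           (remove b groups)
                            merged           t)
                      (return-from scan)))))
        groups))

For example, with a window of n = 7 days, the events (1 . 2), (5 . 6) and (30 . 31) yield two groups, one spanning days 1–6 and one spanning days 30–31.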


Level 1 groups are displayed on the Puls-Web list view, as described in section 4.2.

6.2.7 Puls.Graph

The PULS.GRAPH module provides code for computing the graph that is displayed in the graph view, described in section 4.6 on page 32.

The module creates a graph based on all data in the business domain. It queries the database for (node, edge, node) triplets, then computes all the connected components for these triplets using a breadth-first search. Nodes represent different objects, such as companies, persons and products. Edges represent relations between the objects, such as buy, employ and invest.

The edges are assigned URLs that point to the document view of the event represented by the edge on the PULS website.
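A self-contained sketch of the connected-components computation might look like this:

    (defun connected-components (triplets)
      "TRIPLETS is a list of (node edge node) lists. Returns a list of
    components, each component being a list of nodes."
      (let ((adjacency  (make-hash-table :test #'equal))
            (seen       (make-hash-table :test #'equal))
            (components '()))
        (dolist (triplet triplets)               ; build an adjacency table
          (destructuring-bind (a edge b) triplet
            (declare (ignore edge))
            (push b (gethash a adjacency))
            (push a (gethash b adjacency))))
        (loop for node being the hash-keys of adjacency
              unless (gethash node seen)
                do (let ((component '())
                         (queue (list node)))
                     (setf (gethash node seen) t)
                     (loop while queue do        ; breadth-first traversal
                       (let ((current (pop queue)))
                         (push current component)
                         (dolist (neighbor (gethash current adjacency))
                           (unless (gethash neighbor seen)
                             (setf (gethash neighbor seen) t)
                             (setf queue (nconc queue (list neighbor)))))))
                     (push component components)))
        components))

For instance, the triplets ((a buys b) (b employs c) (x sells y)) yield two components: one containing a, b and c, and one containing x and y.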

6.2.8 Puls.Lang-tools

The PULS.LANG-TOOLS module contains tools for performing multi-lingual information extraction. Currently French and Russian are supported. The French IE tool is not as sophisticated as Puls-IE, but it is able to extract medical events from French documents by matching keywords in the text. The Russian module serves as a wrapper for the AOT parser and morphological analyzer14. It outputs data that can be used as input for Puls-IE, which effectively allows us to leverage the machinery of Puls-IE to perform information extraction for Russian documents. Significant effort has gone into creating multi-lingual concept bases and dictionaries. Figure 14 shows a screenshot of the PULS web document page displaying an event in the security domain extracted from a Russian document.

14http://www.aot.ru/

Figure 14: Document view showing a Russian security event

This module also contains a tool for performing language identification on documents whose language is unknown. It works by measuring the frequencies of common words.
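A self-contained sketch of this idea, assuming the document has already been tokenized into a list of word strings and using deliberately tiny word lists:

    (defparameter *common-words*
      '((:english . ("the" "and" "of" "to" "in"))
        (:french  . ("le" "la" "et" "les" "des"))))

    (defun count-hits (words wordlist)
      "Count how many words in WORDS occur in WORDLIST."
      (count-if (lambda (word) (member word wordlist :test #'string-equal))
                words))

    (defun identify-language (words)
      "Guess the language whose common words are most frequent in WORDS."
      (car (reduce (lambda (best entry)
                     (if (> (count-hits words (cdr entry))
                            (count-hits words (cdr best)))
                         entry
                         best))
                   *common-words*)))

For example, (identify-language '("le" "chat" "et" "la" "souris")) evaluates to :french.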

6.2.9 Puls.Rss-Feed

The PULS.RSS-FEED module provides code for producing RSS feeds from events in the database. It is used as part of the integration between PULS and JRC, as described in section 6.4 on page 48. Here, data is periodically sent back and forth using simple RSS.

The RSS file contains a list of events, with data similar to what is shown on the document view.

6.2.10 Puls.Source

The PULS.SOURCE module offers functions for handling the raw data files that are received from various sources. It is able to parse XML files containing documents, convert them into the format understood by Puls-IE and store them for later processing. The module supports standard RSS format as well as other ad-hoc formats.

6.2.11 Puls.Crud

The PULS.CRUD module provides a simple object-relational mapping (ORM) that is intended to make manipulating the database less cumbersome. An ORM lets the programmer query and modify a database without having to write excessive amounts of SQL. Each table in the database is associated with a class. Querying the ORM results in a set of instances of some table class, one for each row returned.


An instance of a table class has slots that correspond to the fields in the table. Assigning values to these slots and then calling an update method on the instance will save the new values to the database. Users of this module are not forced to change the database through instances; queries can still be written in plain SQL if needed.

The module is able to examine the schema of any given database and automatically generate class definitions that correspond to every table in the database, thereby eliminating the need for defining table classes manually.
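As a sketch, with hypothetical class, slot and function names, typical use of the ORM might look like this:

    ;; All names below are hypothetical.
    (let ((events (select 'event :where '(= country "Laos"))))
      (dolist (event events)                   ; each EVENT is a table-class instance
        (setf (slot-value event 'relevance) 4) ; assign a new value to a field
        (update-record event)))                ; write the change back to the database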

6.3 Puls-Web

Puls-Web is a small website designed to display events in various ways and to let users evaluate events in the database.

The different views offered by the site were explained in section 4 on page 25.

The site consists of a tiny framework implemented in Common Lisp and the code that generates the views; initially the whole site was around 1 KLOC. It was intended to be a light-weight site for visualizing the contents of the database. The core framework has changed very little over the course of three years, but the code that generates the different views has since grown to approximately 15 KLOC.

6.4 Integration between PULS and MedISys

A special RSS tunnel has been set up between MedISys and PULS. MedISys forwards documents which it categorises as relevant to the medical domain through the tunnel to PULS. Currently, the documents arrive as plain text. This is done in addition to the normal processing on the MedISys side, where running averages are monitored for all alerts, etc. A document batch is sent every 10 minutes, with documents newly discovered on the Web.

On the PULS side, the IE system analyses all documents received from MedISys, and returns information that it extracted from the received documents back through the tunnel—in structured form (also at 10 minute intervals). This communication is asynchronous, while both sites are operating in real-time.

When PULS receives documents from MedISys, it performs the following steps:

First, the IE system analyses the documents, extracts incidents, and stores them in the database (at http://doremi.cs.helsinki.fi/jrc). Second, PULS uses document-local heuristics to compute the confidence of the attributes in the extracted incidents.

The confidence of an attribute is computed from the set of candidate values for that attribute, based on their scores, which are in turn based on the features, as explained in section 3.4. If the score of the best value exceeds a certain threshold, the attribute is considered confident. Some of the attributes of an incident are considered to be more important than others: in the case of epidemic events, these principal attributes are the disease name, location and date. If all principal attributes of an incident are confident, the entire incident is considered confident as well.15
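As a sketch of this decision, assuming each attribute carries a list of (value . score) candidate pairs and using an arbitrary threshold value:

    (defparameter *confidence-threshold* 0.8)  ; an illustrative value

    (defun attribute-confident-p (candidates)
      "CANDIDATES is a list of (value . score) pairs for one attribute."
      (and candidates
           (> (loop for candidate in candidates maximize (cdr candidate))
              *confidence-threshold*)))

    (defun incident-confident-p (incident)
      "An incident is confident when all of its principal attributes are."
      (every (lambda (attribute)
               (attribute-confident-p (getf incident attribute)))
             '(:disease :location :date)))     ; the principal attributes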

Third, the system aggregates the extracted incidents into outbreaks, across multiple documents and sources, as described in section 3.5.1. The aggregation process requires that at least one of the incidents in each outbreak chain must be confident (that is, chains composed entirely of non-confident incidents are discarded).

Finally, PULS returns a batch of recent incidents to MedISys, for displaying on its pages. The goal is to return a set of incidents with high confidence and low redundancy—a complete yet manageably-sized set for the user to explore. The batch is restricted to documents published within the last 10 days; from this period, PULS returns the most recent 50 incidents, filtering out duplicates: if multiple incidents of the same disease in the same location are reported, PULS returns only the most recent one.16
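A self-contained sketch of this filtering step, assuming incidents are property lists with :disease, :location and :date (the date given as an integer day number for easy comparison):

    (defun select-batch (incidents &key (limit 50))
      "Keep only the most recent incident per (disease, location) pair,
    then return at most LIMIT of those, most recent first."
      (let ((latest (make-hash-table :test #'equal)))
        (dolist (incident incidents)          ; keep the newest incident per key
          (let ((key (list (getf incident :disease)
                           (getf incident :location))))
            (when (or (not (gethash key latest))
                      (> (getf incident :date)
                         (getf (gethash key latest) :date)))
              (setf (gethash key latest) incident))))
        (let ((unique '()))
          (maphash (lambda (key incident)
                     (declare (ignore key))
                     (push incident unique))
                   latest)
          (subseq (sort unique #'> :key (lambda (i) (getf i :date)))
                  0 (min limit (length unique))))))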

On the MedISys side, the returned events are displayed in two views. The main MedISys page shows the five most recent events—these correspond to the most urgent news. For more detail, this box has a link to the batch of 50 most recent incidents. For the complete view, the recent list has a link to the PULS database.

15In the PULS tables, confident incidents are highlighted.

16Note that this implies that a recent event that was last reported more than 10 days ago will not appear in the result list, while an event from several months ago may appear, if it is mentioned in a very recently published report. This is a design choice that aims for a balance between recency of publication and recency of occurrence of an incident: both may be important to the user. Note also that in any case all events are available for browsing in the PULS database.


7 Conclusion

In this thesis we have described the PULS IE system and how it currently operates in production. Different ways of adding value to the users were described: confidence to promote correct events, aggregation to reduce redundancy and add structure between events and documents, and relevance to let users more easily find what they are interested in.

The combination of the two initially independent systems, MedISys and PULS, has led to a stronger application offering users complementary functionality through a unified user interface. For communicable disease outbreaks, which are covered by both systems, the combination of IR in MedISys and IE in PULS leads to additional advantages. Firstly, PULS’s computationally heavier methods only need to be applied to the document collection pre-filtered by MedISys. Secondly, the medical event extraction patterns act as an additional filter to identify only disease outbreak reports. MedISys is designed to capture not only disease outbreak reports, but also other news articles mentioning diseases. For users interested specifically in disease outbreaks, PULS’s event recognition helps reduce the number of reports by filtering out just under three quarters of incoming reports, of which about 14% are incorrectly filtered relevant reports.

The current status of integration can be taken further: the systems do not yet make full use of each other’s information aggregation methods. The categorisation of news items by MedISys can be useful for the analysis performed by PULS, and is yet to be utilised. The taxonomies used by the systems overlap, but have not yet been fully integrated. These and other issues are to be tackled in future work.

While we believe that the combination of IR in MedISys and IE in PULS provides added value, it is not a universal solution. An important strength of MedISys is its multilinguality: it currently monitors media reports in 43 languages. Developing PULS-style event extraction grammars for so many languages is not currently possible: porting the IE system to a new language requires pre-existing robust lower-level linguistic components (named entity tagger, ontology, parser) for each new language, which are unlikely to be available for all the languages covered by MedISys in the near future. However, focusing on the major languages for which lower-level linguistic resources have been developed is planned for future extensions.

We further plan to integrate a tool that automatically extracts terms from the comprehensive medical thesaurus MeSH (Medical Subject Headings)17, and to allow users to select articles by browsing and drilling down in the multilingual MeSH hierarchy. This will give the user an alternative entry point to the same information.

We need to resolve some technical problems to improve the quality of the input data. One problem relates to the way MedISys extracts textual content from source sites. Because the original focus of MedISys was on the keywords contained in the text, it ignored document layout information (such as headings, sub-headings, by- and date-lines, paragraph breaks, etc.), which provides important cues when detailed text analysis is required. The lack of this information is known to confuse the IE process, and needs to be addressed to improve IE accuracy.18

17www.nlm.nih.gov/mesh

18Extracting document layout accurately is a highly non-trivial problem, since source sites are completely unstandardized, and in general the layout is hard to infer automatically.


References

BvdGB05 Best, C., van der Goot, E., Blackler, K., Garcia, T. and Horby, D., Europe Media Monitor—system description. Technical Report EUR 22173 EN, 2005.

CMBT02 Cunningham, H., Maynard, D., Bontcheva, K. and Tablan, V., GATE: A framework and graphical development environment for robust NLP tools and applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.

Def95 Defence Advanced Research Projects Agency, Information extraction task: scenario on management succession. Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD, November 1995, Morgan Kaufmann, pages 167–176.

DHNKC08 Doan, S., Hung-Ngo, Q., Kawazoe, A. and Collier, N., Global Health Monitor—a web-based system for detecting and mapping infectious diseases. Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), 2008.

FMRB08 Freifeld, C., Mandl, K., Reis, B. and Brownstein, J., HealthMap: Global infectious disease monitoring through automated classification and visualization of internet media reports. Journal of the American Medical Informatics Association, 15, 2008, pages 150–157.

GHY02 Grishman, R., Huttunen, S. and Yangarber, R., Event extraction for infectious disease outbreaks. Proceedings of the 2nd Human Language Technology Conference (HLT 2002), San Diego, CA, March 2002.

GHY03 Grishman, R., Huttunen, S. and Yangarber, R., Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35,4(2003), pages 236–246.

GR97 Gaizauskas, R. and Robertson, A., Coupling information retrieval and information extraction: A new text technology for gathering information from the web. Proceedings of the 5th RIAO Computer-Assisted Information Searching on Internet, Montreal, Canada, June 1997, pages 356–370.

HK05 Han, J. and Kamber, M., Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

Jok05 Jokipii, L., Confidence measuring and data improvement of extracted information disease outbreak reports. Master's thesis, Department of Computer Science, University of Helsinki, 2005.

LYG03 Lin, W., Yangarber, R. and Grishman, R., Bootstrapped learning of semantic classes from positive and negative examples. Proceedings of the ICML Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, Washington, DC, August 2003.

MRS08 Manning, C. D., Raghavan, P. and Schütze, H., Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

RG97 Robertson, A. and Gaizauskas, R., On the marriage of information retrieval and information extraction. In Information retrieval research 1997: Proceedings of the 1997 annual BCS-IRSG colloquium on IR research, Aberdeen, Scotland, Furner, J. and Harper, D., editors, Springer-Verlag, London, 1997, pages 356–370.

Ril96 Riloff, E., Automatically generating extraction patterns from untagged text. Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96). The AAAI Press/MIT Press, 1996, pages 1044–1049.

SFvdG08 Steinberger, R., Fuart, F., van der Goot, E., Best, C., von Etter, P. and Yangarber, R., Text mining from the web for medical intelligence. In Mining Massive Data Sets for Security, Perrotta, D., Piskorski, J., Soulié-Fogelman, F. and Steinberger, R., editors, IOS Press, Amsterdam, the Netherlands, 2008.

Ste90 Steele, G. L., Common LISP: The Language. Digital Press, Bedford, MA, second edition, 1990.

vEHV10 von Etter, P., Huttunen, S., Vihavainen, A., Vuorinen, M. and Yangarber, R., Assessment of utility in web mining for the domain of public health. Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents, Los Angeles, California, USA, June 2010, Association for Computational Linguistics, pages 29–37. URL http://www.aclweb.org/anthology/W10-1105.

Yan00 Yangarber, R., Scenario Customization for Information Extraction. Ph.D. thesis, Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, September 2000. URL ftp://www.cs.nyu.edu/pub/theses/roman-yangarber-2000.ps.gz. Also appears in the NYU Thesis repository.

Yan03 Yangarber, R., Counter-training in discovery of semantic patterns. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 2003.

YBvE07 Yangarber, R., Best, C., von Etter, P., Fuart, F., Horby, D. and Steinberger, R., Combining information about epidemic threats from multiple sources. Proceedings of the MMIES Workshop, International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Borovets, Bulgaria, September 2007.

YJ05 Yangarber, R. and Jokipii, L., Redundancy-based correction of automatically extracted facts. Proceedings of the Conference on Empirical Methods in Natural Language Processing, HLT-EMNLP 2005, Vancouver, Canada, October 2005, pages 57–64.

YJRH05 Yangarber, R., Jokipii, L., Rauramo, A. and Huttunen, S., Extracting information about outbreaks of infectious epidemics. Proceedings of HLT-EMNLP 2005, Demonstration, Vancouver, Canada, October 2005.

YvES08 Yangarber, R., von Etter, P. and Steinberger, R., Content collection and analysis in the domain of epidemiology. Proceedings of DrMED-2008: International Workshop on Describing Medical Web Resources, at MIE-2008: the 21st International Congress of the European Federation for Medical Informatics, Göteborg, Sweden, 2008.
