Information Extraction Task Definition

3 Scenario-Independent and Scenario-Neutral Aspects of IE Task

3.1 - Template Definition
3.1.1 - BNF
3.1.2 - Fill Format
3.1.2.1 - Slot Types
3.1.2.2 - Object Identifiers
3.1.2.3 - Notation Reserved for Use in Answer Keys
3.2 - Fill Rules
3.2.1 - TEMPLATE Object
3.2.1.1 - DOC_NR Slot
3.2.1.2 - CONTENT Slot
3.2.2 - ORGANIZATION Object
3.2.2.1 - ORG_NAME Slot
3.2.2.2 - ORG_ALIAS Slot
3.2.2.3 - ORG_DESCRIPTOR Slot
3.2.2.4 - ORG_TYPE Slot
3.2.2.5 - ORG_LOCALE Slot
3.2.2.6 - ORG_COUNTRY Slot
3.2.2.7 - ORG_NATIONALITY Slot
3.2.3 - PERSON Object
3.2.3.1 - PER_NAME Slot
3.2.3.2 - PER_ALIAS Slot
3.2.3.3 - PER_TITLE Slot
3.2.4 - ARTIFACT Object
3.2.4.1 - ART_ID Slot
3.2.4.2 - ART_DESCRIPTOR Slot
3.2.4.3 - ART_TYPE Slot
3.2.5 - Template Element Slots
3.2.5.1 - LOCALE Slot
3.2.5.2 - COUNTRY Slot
3.2.5.3 - DATE Slot

3.1 Template Definition

3.1.1 BNF

/* Top-Level Object -- applies to Scenario Template subtask only */

<TEMPLATE> :=
	DOC_NR:			"NUMBER"^
	CONTENT:		<scenario-specific-object>*
	COMMENT:		" "-

/* Template Element Objects -- apply to Template Element subtask;
   apply selectively to Scenario Template subtask */

<ORGANIZATION> :=
	ORG_NAME:		"NAME"-
	ORG_ALIAS:		"ALIAS"*
	ORG_DESCRIPTOR:		"DESCRIPTOR"-
	ORG_TYPE:		{GOVERNMENT, COMPANY, OTHER}^
	ORG_LOCALE:		LOCALE-STRING {{LOC_TYPE}} *
	ORG_COUNTRY:		NORMALIZED-COUNTRY | COUNTRY-STRING *
	ORG_NATIONALITY:	NORMALIZED-COUNTRY-or-REGION | COUNTRY-or-REGION-STRING *
	OBJ_STATUS:		{OPTIONAL}-
	COMMENT:		" "-

<PERSON> :=
	PER_NAME:		"NAME"^
	PER_ALIAS:		"ALIAS"*
	PER_TITLE:		"TITLE"*
	OBJ_STATUS:		{OPTIONAL}-
	COMMENT:		" "-

<ARTIFACT> :=
	ART_ID:			"ID"-
	ART_DESCRIPTOR:		"DESCRIPTOR"-
	ART_TYPE:		{{scenario-specific-set-fill}}
	OBJ_STATUS:		{OPTIONAL}-
	COMMENT:		" "-


LOC_TYPE ::	{CITY, PROVINCE, COUNTRY, REGION, UNK}


/* Template Element Slots -- Apply to Scenario Template subtask only.
   Valence is scenario-dependent. */

LOCALE:    LOCALE-STRING {{LOC_TYPE}}

COUNTRY:    COUNTRY | COUNTRY-STRING

DATE:	    {BEFORE, AFTER, ON} DATE-EXP
	    | BETWEEN DATE-EXP DATE-EXP
		
DATE-EXP ::	([[01-31]]|{EA, MD, LT, EO, BO})[[01-12]][[00-99]YY]
		| {EA, MD, LT, EO, BO}
		  {FA, WI, SP, SU, 1Q, 2Q, 3Q, 4Q, 1F, 2F, 3F, 4F, FY}
		  [[00-99]YY]
		| {EA, MD, LT, EO, BO, FA, WI, SP, SU, 1Q, 2Q, 3Q, 4Q,
		     1F, 2F, 3F, 4F, FY}
		  [[00-99]YY]
		| [[01-12]][[00-99]YY]
		| [[00-99]] 
		| DESCRIPTOR

3.1.2 Fill Format

3.1.2.1 Slot Types

There are four kinds of slots in the template: set fill, string fill, normalized fill, and index fill (pointer). For purposes of scoring, normalized fills and string fills are equivalent: the scoring software strips the external double quotes from fills for slots that are defined as taking normalized fills or string fills.

SET FILL.

To be filled in by selection from a prespecified list of categories defined in the fill rules for a given slot.

STRING FILL.

To be filled in with an exact copy of a text string from the article under analysis. The fill may be enclosed in double quotes, if desired. See the "Tokenization Rules" document for information on what counts as a word token in certain special cases.

NORMALIZED FILL.

To be filled with a text string that is converted to a canonical form in accordance with the fill rules for a given slot. The fill may be enclosed in double quotes, if desired.

INDEX FILL (POINTER).

To be filled with the index of an object, i.e., a pointer to an object. The fill is to be enclosed in angled brackets.
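
As a rough illustration of the quote-stripping equivalence and the angled-bracket convention described above, the following Python sketch normalizes a fill before comparison. The function names are hypothetical, and this is not the MUC-6 scoring software; it only mirrors the behavior stated in this section.

```python
def strip_external_quotes(fill: str) -> str:
    """Remove one pair of enclosing double quotes, if present."""
    fill = fill.strip()
    if len(fill) >= 2 and fill[0] == '"' and fill[-1] == '"':
        return fill[1:-1]
    return fill


def is_pointer(fill: str) -> bool:
    """Index fills are enclosed in angled brackets, e.g. <PERSON-8910260100-1>."""
    fill = fill.strip()
    return len(fill) > 2 and fill.startswith("<") and fill.endswith(">")
```

Under this sketch, a string fill written with or without its optional double quotes compares equal once both sides have been passed through strip_external_quotes.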

3.1.2.2 Object Identifiers

All objects are identified by the object name (from the template BNF), the document number (from the DOCNO tag in the text), and a one-up number; a dash is used to separate those three elements. For Wall Street Journal articles, the dash internal to the value of DOCNO must be suppressed; thus, a valid ORGANIZATION object identifier for DOCNO 891026-0100 would be <ORGANIZATION-8910260100-1>.
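
The identifier rule above can be sketched in a few lines of Python. The function name make_object_id is a hypothetical helper, not part of any MUC-6 tool; it simply joins the three elements with dashes after suppressing the dash inside the DOCNO value.

```python
def make_object_id(object_name: str, docno: str, index: int) -> str:
    """Build an object identifier from the object name, DOCNO, and one-up number."""
    doc_nr = docno.replace("-", "")  # suppress the dash internal to DOCNO
    return f"<{object_name}-{doc_nr}-{index}>"


print(make_object_id("ORGANIZATION", "891026-0100", 1))
# prints <ORGANIZATION-8910260100-1>
```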

3.1.2.3 Notation Reserved for Use in Answer Keys

Legitimate ambiguity or vagueness in the text is reflected in the answer key by the presence of alternative acceptable fills. The "/" notation is reserved for this use; such fills are *not* to be generated by the system under evaluation. The notation allows the answer key to present alternate acceptable single fills for a slot, alternate sets of fills for a slot, optional fills (one fill or zero fills), and combinations thereof. An object is treated as optional if all pointers to it are either optional or in a list of alternatives.

Since the Template Element subtask does not include the creation of pointers to the template element objects, the optionality of ORGANIZATION, PERSON, and ARTIFACT objects is indicated via the OBJ_STATUS slot within the optional object itself. The OBJ_STATUS slot is not used for the Scenario Template subtask.

The COMMENT slot may contain notes that the analyst wants to record concerning the answer key. The slot is not scored. (Analysts should avoid entering double quotes within the comment, as they will prevent the template-filling tool, Tabula Rasa, from being able to reload the template file.)

3.2 Fill Rules

The input text contains some SGML tags, including TXT; the IE task is to be performed on the text delimited by the TXT, HL, DATELINE, and DD tags. (Note, however, that the DD tag sometimes doesn't appear at all, sometimes appears once, and sometimes appears twice.)

Lines within the <TXT> portion of the article that start with the "@" sign signify a table within the text and should NOT be annotated. (However, such lines may also appear within the <HL> portion of the article, and these should be annotated.)
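
A minimal sketch of the "@" rule, assuming the article has already been split into tagged sections and lines by some earlier step (that splitting is not shown, and the helper name annotatable_lines is hypothetical):

```python
def annotatable_lines(section: str, lines: list[str]) -> list[str]:
    """Return the lines of a tagged section that are subject to annotation.

    Lines starting with "@" mark a table and are skipped inside TXT only;
    in HL (and the other tagged sections) they are kept.
    """
    if section == "TXT":
        return [line for line in lines if not line.startswith("@")]
    return list(lines)
```

For example, annotatable_lines("TXT", ["@ 1994 1995", "Sales rose."]) keeps only the second line, while the same input with section "HL" keeps both.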

3.2.1 TEMPLATE Object

DEFINITION:

Top-level object. Applies to Scenario Template subtask only.

MINIMUM INSTANTIATION CONDITIONS:

For every Scenario Template, instantiate one TEMPLATE object.

3.2.1.1 DOC_NR Slot

DEFINITION:

Article identifier. To be copied from the DOCNO tagged string in the text. Normalize the string to remove any internal dashes, e.g., 870101-0001 becomes 8701010001. This slot is not scored; it is used only to assist people in reading the template.

3.2.1.2 CONTENT Slot

DEFINITION:

Pointer to object that captures info relevant to a given scenario. It is possible for CONTENT to have multiple values, corresponding to different relevant events described. Relevant events are defined as being different when the value of one slot in the scenario object is incompatible with the value of another.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

3.2.2 ORGANIZATION Object

DEFINITION:

Corporate, governmental, or other kind of organization.

MINIMUM INSTANTIATION CONDITIONS:

Text must refer to a particular organization and must provide fill for at least one of the following slots: ORG_NAME, ORG_DESCRIPTOR.

3.2.2.1 ORG_NAME Slot

DEFINITION:

The proper name of the organization, including any corporate designators (see reference document titled "Table of Corporate Designator Abbreviations"). If a document contains more than one variant of the name, the ORG_NAME slot is to be filled with the most complete variant.

MINIMUM INSTANTIATION CONDITIONS:

The name must appear in the text.

SPECIAL USAGE NOTES:

1. This slot has a 0 or 1 valence to allow the situation where an unnamed organization participates in an event (or relation) of interest and is perhaps referenced only by a descriptive phrase.

2. If an organization is changing name, report the current name as ORG_NAME and the past or future name as ORG_ALIAS.

3. See "Named Entity Task Definition" for information on treatment of names such as "McDonald's of Japan."

3.2.2.2 ORG_ALIAS Slot

DEFINITION:

Variant of the proper name entered in the ORG_NAME slot. There may be more than one value for this slot.

MINIMUM INSTANTIATION CONDITIONS:

The variant must appear explicitly in the text. This slot can be filled only if ORG_NAME is filled also.

SPECIAL USAGE NOTES:

1. Misspelled variants of the name reported in ORG_NAME are to be reported in ORG_ALIAS.

3.2.2.3 ORG_DESCRIPTOR Slot

DEFINITION:

Noun phrase describing or referring to an organization without naming it. This slot is not permitted to have more than one value.

MINIMUM INSTANTIATION CONDITIONS:

Text must provide a string that describes the organization and that does not fit the definition of the ORG_NAME slot. The string cannot be a pronoun, e.g., "it."

SPECIAL USAGE NOTES:

1. The answer key will provide alternative correct answers if the text supplies more than one substantive descriptor string. If the text provides one or more substantive descriptors in addition to an insubstantial one such as "the company" or "the organization," the answer key will contain only the substantive descriptors.

3.2.2.4 ORG_TYPE Slot

DEFINITION:

Categorization of organization as a corporate entity, a government entity, or some other kind of organizational entity.

MINIMUM INSTANTIATION CONDITIONS:

The ORG_TYPE fill should be based on evidence from the text or on world knowledge; the slot should never be left blank.

SPECIAL USAGE NOTES:

1. The categories that are to be used for ORG_TYPE are defined as follows:

COMPANY -- any profit-making or nonprofit legal (usually) entity, including universities, partnerships, corporations, proprietorships, consortiums, enterprises, government-owned corporations, etc.

GOVERNMENT -- the government of a country, state, municipality, etc., or government body such as a government ministry, agency, commission, or committee. In the case of a string such as "IBM announced a joint venture with China," report "China" as type GOVERNMENT unless there is evidence for a different type elsewhere in the text.

OTHER -- organizational entities that do not fit the above categories, such as "the Apache Indian tribe," "OPEC," "the Medellin cartel," "NATO."
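
The instantiation conditions stated so far for the ORGANIZATION object (3.2.2 through 3.2.2.4) can be checked mechanically. The sketch below is only an illustration: it assumes an object is held as a Python dict keyed by slot name, and check_organization is a hypothetical helper rather than part of the evaluation software.

```python
ORG_TYPES = {"GOVERNMENT", "COMPANY", "OTHER"}


def check_organization(org: dict) -> list[str]:
    """Return a list of rule violations for one ORGANIZATION object."""
    problems = []
    name = org.get("ORG_NAME")
    # At least one of ORG_NAME, ORG_DESCRIPTOR must be filled.
    if not name and not org.get("ORG_DESCRIPTOR"):
        problems.append("needs ORG_NAME or ORG_DESCRIPTOR")
    # ORG_ALIAS can be filled only if ORG_NAME is filled also.
    if org.get("ORG_ALIAS") and not name:
        problems.append("ORG_ALIAS requires ORG_NAME")
    # ORG_TYPE is never left blank and must come from the set fill.
    if org.get("ORG_TYPE") not in ORG_TYPES:
        problems.append("ORG_TYPE must be GOVERNMENT, COMPANY, or OTHER")
    return problems
```

An object with only a descriptor and a type, such as {"ORG_DESCRIPTOR": "the largest auto manufacturer", "ORG_TYPE": "COMPANY"}, passes with an empty list.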

3.2.2.5 ORG_LOCALE Slot

DEFINITION:

Specific place where an organization is located. Only the most specific place is to be reported. (This will enable accurate, automatic scoring.) The literal string that appears in the text, plus a categorization of the place name, appear in this slot as a complex (two-part) fill.

MINIMUM INSTANTIATION CONDITIONS:

The locale must be specifically mentioned in the text. NOTE: Except in the case of organizations of type GOVERNMENT, the name itself is not to be used as a source of information for the ORG_LOCALE slot.

SPECIAL USAGE NOTES:

1. NAMES

a. The "MUC-6 Reference Gazetteer" does not contain an exhaustive list of the place names that may be used to fill the ORG_LOCALE slot, nor does it usually provide alternative spellings for place names. Use UNK as locale type only if the type cannot be determined from the text.

b. If the text provides only a relative locale such as "near Tokyo" or "60 miles from Tokyo", report "Tokyo" as ORG_LOCALE name.

2. TYPES

a. The location categories that are to be used for ORG_LOCALE are defined as follows:

CITY -- a town, city, port, suburb, or other local settlement

PROVINCE -- a state, province, island or similar subnational geographically or politically defined area

COUNTRY -- a nation, country, colony, federation of countries such as the Commonwealth of Independent States (the former USSR), or other similar national entity

REGION -- an international region such as Eastern Europe, the Pacific Rim, or the Malay Archipelago

UNK -- a location whose possible type cannot be identified from evidence in the text or from world knowledge

b. The "MUC-6 Reference Gazetteer" uses more location categories than are to be reported in ORG_LOCALE. The following mappings apply:

PORT and AIRPORT in gazetteer are to be reported as CITY in ORG_LOCALE.

ISLAND in gazetteer is to be reported as PROVINCE in ORG_LOCALE.

ISLAND-GROUP in gazetteer is to be reported as either PROVINCE (if part of a single country) or as REGION (if part of an international region).

CONTINENT in gazetteer is to be reported as REGION in ORG_LOCALE.
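
The gazetteer mappings above amount to a small lookup table plus one case that needs extra information. The following Python sketch shows one way to encode them. It assumes the gazetteer category is available as an uppercase string; passing through categories that already match a LOC_TYPE value and falling back to UNK for anything else are assumptions of this sketch, not rules stated here.

```python
from typing import Optional

DIRECT_MAPPINGS = {
    "PORT": "CITY",
    "AIRPORT": "CITY",
    "ISLAND": "PROVINCE",
    "CONTINENT": "REGION",
}
LOC_TYPES = {"CITY", "PROVINCE", "COUNTRY", "REGION", "UNK"}


def to_loc_type(category: str, single_country: Optional[bool] = None) -> str:
    """Map a gazetteer category to the LOC_TYPE reported in ORG_LOCALE."""
    if category == "ISLAND-GROUP":
        # PROVINCE if part of a single country, REGION if international.
        if single_country is None:
            raise ValueError("ISLAND-GROUP needs single_country=True or False")
        return "PROVINCE" if single_country else "REGION"
    if category in DIRECT_MAPPINGS:
        return DIRECT_MAPPINGS[category]
    # Assumption: matching categories pass through; unknown ones become UNK.
    return category if category in LOC_TYPES else "UNK"
```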

3. ORG_LOCALE vs ORG_NATIONALITY

a. When a candidate fill for the ORG_LOCALE slot is the name of a country or a reference to a country, there is potential ambiguity as to whether the fill belongs in ORG_LOCALE or in ORG_NATIONALITY. For the purpose of maintaining consistency of extraction by the human analysts, the following kinds of text expressions have been identified as some typical ones that occasion a fill in ORG_LOCALE. (For info on the distinct criteria for ORG_NATIONALITY, see 3.2.2.7.)

<org> "of" <country name> /* "Honda Inc. of America" */

<org> "in" <country name> /* "Honda Inc. in America" */

<org> "based in" <country name> /* "GM Corp., based in America", "the largest auto manufacturer based in America" */

<org> "headquartered in" <country name> /* "GM Corp., headquartered in America" */

<country name>"-based" <org> /* "the U.S.-based company", "U.S.-based Rockwell International" */

<country name> /* "Spain" [i.e., a country name that metonymically represents the government of that country] */
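
To make the contrast between these expressions and the premodifier expressions of 3.2.2.7 concrete, here is a deliberately small regular-expression sketch in Python. The country and adjective lists are hypothetical stand-ins for a real gazetteer, the function name classify is invented for illustration, and the bare <country name> case (a country standing for its government) is not handled because it needs context that a pattern cannot supply.

```python
import re
from typing import Optional

# Hypothetical, tiny name lists for illustration only.
COUNTRY = r"(?:America|U\.S\.|Spain|Indonesia)(?!\w)"
ADJECTIVE = r"(?:American|Spanish|Indonesian|Japanese)"

# <org> of/in/based in/headquartered in <country>, or <country>-based <org>
LOCALE_CUES = re.compile(
    rf"\b(?:of|in|based in|headquartered in)\s+{COUNTRY}|{COUNTRY}-based\b"
)
# <country> <org>, <country>'s <org>, <adjective> <org>,
# "the domestic" <org>, "the nation's" <org>
NATIONALITY_CUES = re.compile(
    rf"(?<!\w)(?:{COUNTRY}(?:'s)?|{ADJECTIVE})\s+\w"
    rf"|\bthe (?:domestic|nation's)\s+\w"
)


def classify(phrase: str) -> Optional[str]:
    """Guess which slot a phrase would occasion a fill in, or None."""
    if LOCALE_CUES.search(phrase):
        return "ORG_LOCALE"
    if NATIONALITY_CUES.search(phrase):
        return "ORG_NATIONALITY"
    return None
```

Locale cues are tested first so that a phrase such as "Honda Inc. of America" is not misread as a premodifier when more text follows the country name.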

3.2.2.6 ORG_COUNTRY Slot

DEFINITION:

The country in which ORG_LOCALE is located. A defining list of country names is contained in "MUC-6 Country and Region List." (This list contains only canonical forms. NLP system developers must define their own mappings from the "MUC-6 Reference Gazetteer" and/or other gazetteer resources to this list.)

MINIMUM INSTANTIATION CONDITIONS:

To be reported only if ORG_LOCALE is filled with a locale of type CITY, PROVINCE, COUNTRY or, in some cases, UNK. Fill is to be inferred, if necessary.

SPECIAL USAGE NOTES:

1. If ORG_LOCALE is filled in by a name of type COUNTRY, report the country name in this slot as a normalized form drawn from "MUC-6 Country and Region List".

2. Note that the "MUC-6 Country and Region List" may not contain a complete list of countries. If a canonical form for the name of the country does not appear on the list, report the name in noun or adjective form (whichever appears in the text) as a string fill.

3.2.2.7 ORG_NATIONALITY Slot

DEFINITION:

The name of the home country or home region of an organization. A defining list of country names is contained in "MUC-6 Country and Region List." (This list contains only canonical forms. NLP system developers must define their own mappings from the "MUC-6 Reference Gazetteer" and/or other gazetteer resources to this list.)

MINIMUM INSTANTIATION CONDITIONS:

Text must specify the nationality, which often is done in phrases such as "the Japanese automaker," "Indonesia's largest electronics retailer," or "a US auto maker." Except in the case of organizations of type GOVERNMENT, the name (or alias) itself is not to be used as a source of information for the ORG_NATIONALITY slot. May be filled in by inference from the text or inference based on knowledge of geography, e.g., that Zurich is in Switzerland. Not to be filled in on the basis of general world knowledge alone. For example, if the text mentions "Swiss" or "Zurich" in the appropriate context, fill in SWITZERLAND, but if the article mentions "Boeing" without allusions to nationality, do not fill in UNITED STATES.

SPECIAL USAGE NOTES:

1. This slot may be filled with the name of type REGION rather than by a name of type COUNTRY if the text provides a general reference such as "Eastern European" or "Asian."

2. Note that the "MUC-6 Country and Region List" may not contain a complete list of countries and regions. If a canonical form for the name of the country or region does not appear on the list, report the name in noun or adjective form (whichever appears in the text) as a string fill.

3. As a default, assume that "American" refers to "United States."

4. There is potential ambiguity as to whether a text expression should be captured in ORG_LOCALE rather than in ORG_NATIONALITY. For the purpose of maintaining consistency of extraction by the human analysts, the following kinds of text expressions have been identified as some typical ones that occasion a fill in ORG_NATIONALITY. Note that the ORG_NATIONALITY fill is a premodifier in all cases. (For info on the distinct criteria for ORG_LOCALE, see 3.2.2.5.)

<country name> <org> /* "the Indonesia company", "the U.S. Government" */

<country name>"'s" <org> /* "Indonesia's company", "Spain's government", "U.S.'s Monsanto Inc." */

<nationality expressed in adjective form> <org> /* "the Spanish company", "the Spanish government" */

"the domestic" <org> /* "the domestic company" [requiring inference of referent] */

"the nation's" <org> /* "the nation's largest carrier" [requiring inference of referent] */

3.2.3 PERSON Object

DEFINITION:

An (unincorporated) person or family.

MINIMUM INSTANTIATION CONDITIONS:

Text must supply fill for PER_NAME slot. The guidelines for instantiating a PERSON object are the same as the guidelines given in "Named Entity Task Definition" for annotating person names.

3.2.3.1 PER_NAME Slot

DEFINITION:

The proper name of the person or family.

MINIMUM INSTANTIATION CONDITIONS:

The text must supply a person or family name.

3.2.3.2 PER_ALIAS Slot

DEFINITION:

Variant of the proper name reported in the PER_NAME slot. There may be more than one value for this slot.

MINIMUM INSTANTIATION CONDITIONS:

The variant must appear explicitly in the text. This slot can be filled only if PER_NAME is filled also.

SPECIAL USAGE NOTES:

1. Misspelled variants of the name reported in PER_NAME are to be reported in PER_ALIAS.

3.2.3.3 PER_TITLE Slot

DEFINITION:

An innate title such as "Dr." or "Ms.," as distinct from a person's role such as "President" or "CEO." (The latter would be captured by a scenario-specific template element such as a Relational object.)

MINIMUM INSTANTIATION CONDITIONS:

To be reported only if PER_NAME is filled. The text must explicitly mention the person's title.

3.2.4 ARTIFACT Object

DEFINITION:

A product or natural commodity. The nature of the specific artifact(s) to be reported is task-dependent and is therefore defined for a given Scenario Template subtask in the scenario task documentation.

MINIMUM INSTANTIATION CONDITIONS:

The text must supply a fill for at least one of the following slots: ART_ID, ART_DESCRIPTOR.

3.2.4.1 ART_ID Slot

DEFINITION:

A unique identifier for the artifact.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

3.2.4.2 ART_DESCRIPTOR Slot

DEFINITION:

Noun phrase describing or referring to an artifact without naming it. This slot is not permitted to have more than one value.

MINIMUM INSTANTIATION CONDITIONS:

Text must provide a string that describes the artifact and that does not fit the definition of the ART_ID slot. The string cannot be a pronoun, e.g., "it."

SPECIAL USAGE NOTES:

1. The answer key will provide alternative correct answers if the text supplies more than one substantive descriptor string. If the text provides only uninformative descriptors, e.g., "the product," the fills in the answer key will all be marked as optional.

3.2.4.3 ART_TYPE Slot

DEFINITION:

A categorization of the artifact. Inventory of categories depends on scenario definition.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

3.2.5 Template Element Slots

DEFINITION:

Task-independent slots (location and time data) that are separate from the predefined objects. They may be defined selectively for a given scenario, e.g., to provide the location and time of an event.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

SPECIAL USAGE NOTES:

1. These slots will not be part of the Template Element evaluation. Instead, one or more of them may play a role in one or more Scenario Template subtasks. In such cases, their role will be defined in the scenario task documentation.

3.2.5.1 LOCALE Slot

DEFINITION:

Specific locale of an entity or event.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

3.2.5.2 COUNTRY Slot

DEFINITION:

Country locale of an entity or event.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

3.2.5.3 DATE Slot

DEFINITION:

An absolute or relative date or date range.

MINIMUM INSTANTIATION CONDITIONS:

Depends on scenario definition.

SPECIAL USAGE NOTES:

1. The YY option and DESCRIPTOR option are to be used only if the article contains no DD tags. Use YY if only a partial date is given in the text, e.g., "on 27 March"; the output of extraction for that example would be "ON 2703YY". Use the DESCRIPTOR option if the text uses a time phrase that cannot be represented in the usual date format; for example, "last week" ("ON last week") or "Tuesday" ("ON Tuesday").

2. See separate documentation titled "A Revised Template Description for Time (v3)" and "Supplement to Time Treatment Used for MUC-5" for further information.
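
The numeric DATE-EXP forms and the YY placeholder from note 1 can be sketched as follows. This is a partial illustration only: the function names are hypothetical, the year is assumed to be supplied as a two-digit value, and the period qualifiers (EA, MD, 1Q, FY, and so on) as well as the BETWEEN form are left out.

```python
from typing import Optional


def date_exp(day: Optional[int] = None, month: Optional[int] = None,
             year: Optional[int] = None) -> str:
    """Format the purely numeric DATE-EXP alternatives: DDMMYY, MMYY, or YY."""
    yy = "YY" if year is None else f"{year:02d}"  # YY marks a missing year
    if month is not None:
        dd = "" if day is None else f"{day:02d}"
        return f"{dd}{month:02d}{yy}"
    if day is None and year is not None:
        return yy
    raise ValueError("unsupported combination of date parts")


def date_fill(relation: str, exp: str) -> str:
    """Combine BEFORE, AFTER, or ON with a DATE-EXP or descriptor string."""
    return f"{relation} {exp}"


print(date_fill("ON", date_exp(day=27, month=3)))
# prints ON 2703YY
```

A descriptor such as "last week" bypasses date_exp entirely and is passed straight to date_fill, giving "ON last week".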


Information Extraction Task Definition - 14 JUN 95