G22.2591 - Advanced Topics in Natural Language Processing - Spring
Name Recognition and Classification
Why name recognition?
Name recognition was introduced as a separate task in Message
Understanding Conference - 6 (see also Grishman and
Sundheim COLING 1996). Through earlier IE evaluations, system
developers had come to recognize that name recognition and classification was an important
part of text processing, even if it was not recognized as basic in
linguistic study. Making it a separate task encouraged research
to improve this technology, and emphasized its value for a range of
applications (document indexing, and later question answering). Nadeau and Sekine
2008 provide a good recent survey of work on this task.
For MUC-6, there were three name categories -- people, organizations,
and locations. Date, time, percentage, and currency expressions
were also included under name recognition. Some evaluations since
then have added individual categories ... artifact, facility, weapon,
... while others have developed definitions and taggers for larger
(both broader and more finely grained) sets of categories, with up to
200 categories (Sekine and
Nobata LREC 2004). However, almost all
studies have been done with the original set of three name
categories. Similar evaluations have been done for a number of
other languages; the CoNLL-2002
shared task covered Dutch and Spanish, and the CoNLL-2003 shared task
covered English and German.
These categories were developed for news reports, and provide good
coverage for that subject matter. Technical texts, however,
require very different sets of name categories. There has been
particular interest in this decade in molecular biology and genomics
texts. The GENIA
corpus is annotated with 36 categories, including proteins, DNA,
RNA, cell lines and cell types, and there have been multi-site
evaluations for tagging such text.
How to measure scores?
System performance is measured by comparison with a hand-prepared
key. The two basic measures are recall (= number of correct names
/ number of names in key) and precision (= number of correct names /
number of names in system response). As a single metric to rank
systems we normally use F-measure, the harmonic mean of recall and
precision (= 2/(1/recall + 1/precision)). The simplest rule is to
require perfect match -- you only get credit if you get the type,
start, and end of a name correct (this is the metric used, for example,
in the Japanese IREX
evaluation); the MUC evaluations used a more generous
scoring, with partial score for identifying a name of any sort, for
getting its type correct, and for getting its extent correct.
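As a concrete sketch of these formulas under exact-match scoring (the IREX-style rule; the name spans below are a made-up example):

```python
# Minimal sketch of exact-match NE scoring: a name counts as correct
# only if its type, start, and end all match the key.

def score(key, response):
    """key, response: sets of (type, start, end) triples."""
    correct = len(key & response)
    recall = correct / len(key) if key else 0.0
    precision = correct / len(response) if response else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return recall, precision, f

# Hypothetical example: the key has three names; the system finds two,
# one of them with the wrong extent.
key = {("PERSON", 0, 2), ("ORGANIZATION", 5, 7), ("LOCATION", 10, 11)}
response = {("PERSON", 0, 2), ("ORGANIZATION", 5, 6)}
print(score(key, response))   # recall 0.33, precision 0.5, F 0.4
```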
How well do people do?
In a small study for MUC-6 (17 articles), Sundheim reported <5%
interannotator (key-to-key) error. Agreement is probably enhanced
in languages where names are capitalized and for text where the
annotator is familiar with most of the names. Without
capitalization, it can be hard to tell unfamiliar organization names
from common noun phrases.
For a specific domain, it is possible to do very well with hand-coded
rules and dictionaries. On the MUC-6 evaluation (a very favorable
situation, where the source and general topic of the test data was
known in advance), the SRA system, based on hand-coded rules, got
F=96.4. Writing rules by hand, however, requires some skill and
effort. The hand-coded rules take advantage of
- known names (through lists of well-known places, organizations,
and people)
- characteristic suffixes for organizations (Corp., Associates,
...) and locations (Island, Bay)
- first names for people
- titles for people
- other mentions of the same name in an article
Note that sometimes the type decision is based upon left context, and
sometimes upon right context, so it would be difficult for taggers
which operate deterministically from left to right or from right to
left to perform optimally.
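To make the flavor of such rules concrete, here is a small illustrative sketch; the word lists are tiny hypothetical samples, not the actual SRA rule set:

```python
# Illustrative dictionary- and context-based name classification.
# All lists below are hypothetical samples.

ORG_SUFFIXES = {"Corp.", "Associates", "Inc."}
LOC_SUFFIXES = {"Island", "Bay"}
FIRST_NAMES = {"John", "Mary"}
TITLES = {"Mr.", "Ms.", "Dr."}

def classify_name(tokens, left_context):
    """Guess the type of a capitalized token sequence.  Some rules look
    at left context (titles) and others at the right end of the name
    (suffixes), mirroring the left/right-context issue noted above."""
    if left_context and left_context[-1] in TITLES:
        return "PERSON"
    if tokens[-1] in ORG_SUFFIXES:
        return "ORGANIZATION"
    if tokens[-1] in LOC_SUFFIXES:
        return "LOCATION"
    if tokens[0] in FIRST_NAMES:
        return "PERSON"
    return "UNKNOWN"

print(classify_name(["Acme", "Corp."], ["at"]))    # ORGANIZATION
print(classify_name(["John", "Smith"], ["Mr."]))   # PERSON
```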
Like POS tagging and chunking, named entity recognition has been tried
with many different machine learning methods. More than for the
syntactic tasks, performance on NE recognition depends on the variety
of resources which are brought to bear. CoNLL evaluations are
relatively 'pure' ... the systems basically just learn from the
provided training corpus. On the other hand, 'real' systems make
use of as many lists and as much training data as available. This
has a substantial effect on performance. In addition,
performance is strongly affected by the domain of the training and test
data. These two effects can make it difficult to compare results
across different evaluations.
As with chunking, NE tagging can be recast as a token classification
task. We will have an "O" tag (token is not part of a named
entity), and "B-X" and "I-X" tags for each name type X.
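For example, a minimal sketch of this recasting (the sentence and spans are made up):

```python
# Recasting NE annotation as per-token BIO tags: "B-X" marks the first
# token of a name of type X, "I-X" a continuation, "O" any other token.

tokens = ["Fred", "Smith", "works", "in", "New", "York", "."]
# Name spans as (type, start, end), end exclusive; hypothetical example.
names = [("PER", 0, 2), ("LOC", 4, 6)]

tags = ["O"] * len(tokens)
for ntype, start, end in names:
    tags[start] = "B-" + ntype
    for i in range(start + 1, end):
        tags[i] = "I-" + ntype

print(list(zip(tokens, tags)))
# [('Fred','B-PER'), ('Smith','I-PER'), ('works','O'), ('in','O'),
#  ('New','B-LOC'), ('York','I-LOC'), ('.','O')]
```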
Markov Models for Name Recognition
Chapter 6 of J&M 2nd Edition provides a description of HMMs, Maximum
Entropy, and Maximum Entropy Markov Models.
One of the simplest statistical, corpus-trained sequential models is
the Hidden Markov Model. HMMs are based on a generative model of a
sentence: we generate each word
in two steps, first selecting its part of speech based on the
parts of speech of the previous one or two words, and then selecting
the word given the part of speech. The probability of selecting
word w_i with tag t_i is then

    P(t_i | t_{i-1}) P(w_i | t_i)

Based on this model, we seek the most likely tag sequence for a sentence:

    argmax_T prod_i P(t_i | t_{i-1}) P(w_i | t_i)
The probabilities can be easily estimated from a tagged corpus, using
Maximum Likelihood Estimates. The most likely tag sequence can
then be determined using the Viterbi decoder.
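Here is a minimal sketch of a Viterbi decoder for this model; the transition and emission tables are assumed to hold MLE probabilities estimated as above, and the floor value is a stand-in for real smoothing:

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Find the tag sequence maximizing prod_i P(t_i|t_{i-1}) P(w_i|t_i).
    trans[(prev_tag, tag)] and emit[(tag, word)] hold MLE probabilities;
    FLOOR is a stand-in for real smoothing of unseen events."""
    FLOOR = 1e-10
    def lp(table, key):
        return math.log(table.get(key, FLOOR))

    # best[t] = (log-prob of the best path ending in tag t, that path)
    best = {t: (lp(trans, (start, t)) + lp(emit, (t, words[0])), [t])
            for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max((best[p][0] + lp(trans, (p, t)), best[p][1])
                              for p in tags)
            new_best[t] = (score + lp(emit, (t, w)), path + [t])
        best = new_best
    return max(best.values())[1]

# Tiny hypothetical model with two states.
tags = ["NAME", "O"]
trans = {("<s>", "O"): 0.9, ("<s>", "NAME"): 0.1,
         ("O", "O"): 0.8, ("O", "NAME"): 0.2,
         ("NAME", "NAME"): 0.5, ("NAME", "O"): 0.5}
emit = {("NAME", "Smith"): 0.4, ("O", "I"): 0.2, ("O", "met"): 0.1}
print(viterbi(["I", "met", "Smith"], tags, trans, emit))  # ['O', 'O', 'NAME']
```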
The NYU Jet system uses a straightforward HMM for named entity tagging.
The simplest HMM has a single state for
each name type, and a single state for not-a-name (NaN). However,
typically the first and last word of a name have different
distributions, and the words immediately before or after a word often
give a good indication of the name type (for example, 'Mr.' before a
name is a clear indication of a person, while 'near' before a name
probably indicates a location). Therefore, we were able to create
a more accurate model by having separate states for the words
immediately before and after a name, and for the first and last tokens
of a name. This added about 2 points to recall (89 to 91) and 4
points to precision (82 to 86).
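Schematically, the expanded state inventory might look like this (a hypothetical sketch; the actual Jet states may differ):

```python
# Hypothetical expanded state set: besides one "middle" state per name
# type and a not-a-name (NaN) state, add states for the first and last
# token of each name and for the tokens just before and after it.
NAME_TYPES = ["PERSON", "ORGANIZATION", "LOCATION"]

states = ["NaN"]
for t in NAME_TYPES:
    states += [f"before-{t}", f"{t}-first", f"{t}-middle",
               f"{t}-last", f"after-{t}"]
print(len(states), states[:6])   # 16 states in this sketch
```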
BBN's Nymble name tagger (Daniel M. Bikel; Scott Miller; Richard
Schwartz; Ralph Weischedel. Nymble: a
High-Performance Learning Name-finder. Proc. ANLP 97.) is
perhaps the best-known system for this task.
They used several techniques to enhance
performance over a basic HMM. Most notably, they used bigram
probabilities: they differentiated between the probability of
generating the first word of a name and subsequent words of a
name. The probability of generating the first word was made
dependent on the prior state; the probability of generating
subsequent words was made dependent on the prior word. The
probability of a state transition was made dependent on the prior
word. This had to be combined with smoothing to handle the case
of unseen bigrams.
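Schematically, the resulting factors can be written as follows (notation mine, simplified from the paper; w_i ranges over words, t_i over name-class states):

```latex
% Sketch of Nymble's conditioning (simplified; notation is mine).
P(t_i \mid t_{i-1}, w_{i-1})   % state transition, conditioned on the prior word
P(w_i \mid t_i, t_{i-1})       % first word generated in a new name class
P(w_i \mid t_i, w_{i-1})       % subsequent words: a word bigram within the class
```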
HMMs are generative models which produce a joint probability over
observation and label sequences; typically we compute P(new state
| prior state) and P(current word | current state). It is
difficult to represent long-range or multiple interacting features in
such a formalism. Instead, researchers have used functions which
compute the state probability given the input -- a formalism which
allows for a richer set of features.
Sekine et al. (Satoshi Sekine; Ralph
Grishman; Hiroyuki Shinnou. A Decision Tree
Method for Finding and Classifying Names in Japanese Texts
Sixth WVLC, 1998) used a decision tree method for Japanese named
entity recognition. The decision tree yielded information on the probability
of the various tags. A Viterbi algorithm then computed the most
likely tagging of the entire sentence.
Borthwick et al. (Andrew Borthwick; John Sterling; Eugene Agichtein;
Ralph Grishman. Exploiting Diverse
Knowledge Sources via Maximum Entropy in Named Entity Recognition
Sixth WVLC, 1998) used a
maximum entropy method
to compute the
tags. Again, a Viterbi decoder was used to select the best
tagging. By itself the method did fairly well (92.2 F on
dry-run). More interestingly, it could be combined with the
patterns of the NYU hand-coded-rule system, with each rule a separate
feature. The rule-based system by itself also got 92.2 F;
the combined system got 95.6 F, roughly on a par with the best
systems.
(Maximum Entropy Markov Models for Information Extraction and Segmentation,
Andrew McCallum, Dayne Freitag and Fernando Pereira, ICML-2000)
describes general Maximum Entropy Markov Models (MEMMs) as computing
P(current state | input, prior state) using Maximum Entropy
methods. McCallum notes
that the Borthwick model is somewhat weaker in that the current
state probability is conditioned only on the input, not on the
prior state, and that may be why it did not do quite as well as the
Nymble HMM model.
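In the usual maximum entropy parameterization, the MEMM state distribution takes the form (notation mine):

```latex
% MEMM: a maximum entropy distribution over the current state,
% conditioned on the observation and the prior state (notation mine).
% f_k are binary features, \lambda_k learned weights, Z a normalizer.
P(s_i \mid s_{i-1}, o_i) =
  \frac{1}{Z(o_i, s_{i-1})}
  \exp\!\Big(\sum_k \lambda_k f_k(o_i, s_i)\Big)
```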
He later describes a conditional random field (CRF) as an improvement
over MEMMs for the NE task (Early
Results for Named Entity Recognition with Conditional Random Fields,
Feature Induction and Web-Enhanced Lexicons, Andrew McCallum and Wei
Li, CoNLL 2003).
Another concern with HMMs is that the parameters learned may
not be the optimal ones for the ultimate classification task. As
an alternative, discriminative
methods are trained to make the
discrimination between classes directly. Collins ( Discriminative
Training Methods for Hidden Markov Models: Theory and Experiments with
Perceptron Algorithms, EMNLP 02; Collins and Duffy, ACL 2002)
described an approach using error-driven HMM training and reported a
15% reduction in
error rate on a named entity tagging task. Support Vector
Machines are currently the most widely used discriminative method in
NLP, and have been effectively applied to the named entity task (Efficient support
vector classifiers for named entity recognition, Hideki Isozaki and
Hideto Kazawa, COLING 2002).
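A minimal sketch of Collins-style error-driven (perceptron) training follows; the feature map is a toy illustration, and decode stands for a Viterbi search under the current weights (assumed given, not shown):

```python
from collections import defaultdict

def features(words, tags):
    """Toy feature map: counts of tag-tag transitions and word-tag pairs."""
    feats = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("trans", prev, t)] += 1
        feats[("emit", t, w)] += 1
        prev = t
    return feats

def perceptron_train(corpus, decode, tagset, epochs=5):
    """corpus: list of (words, gold_tags) pairs.
    decode(words, tagset, weights): Viterbi search for the best tag
    sequence under the current weights."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in corpus:
            guess = decode(words, tagset, weights)
            if guess != gold:
                # Error-driven update: raise the weights of features seen
                # in the gold tagging, lower those of the (wrong) guess.
                for f, v in features(words, gold).items():
                    weights[f] += v
                for f, v in features(words, guess).items():
                    weights[f] -= v
    return weights
```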
Looking ahead to next week ... unsupervised learning of names