G22.2590 - Natural Language Processing -- Spring 2006 -- Prof. Grishman
February 7, 2006
1. HMM: (a) [1.5 points] Consider an HMM with two states,
Cow and Duck, and a start and end state. Emission probabilities:
- In state Cow, the HMM can emit 'moo' (0.9 probability) or 'hello'
(0.1 probability).
- In state Duck, the HMM can emit 'quack' (0.6 probability) or 'hello'
(0.4 probability). The Duck has been studying English longer.
(Nothing is emitted in the start or end state.) Transition probabilities:
- From the start state, the HMM goes to state Cow with 1.0 probability.
- From state Cow, the HMM can remain in state Cow (0.5 probability),
go to state Duck (0.3 probability), or go to state end (0.2 probability).
- From state Duck, the HMM can remain in state Duck (0.5 probability),
go to state Cow (0.3 probability), or go to state end (0.2 probability).
Using the Viterbi algorithm, decode (find the most likely state sequence
for) 'moo hello quack'. What is the probability of emitting this sentence
from this state sequence? Show your work, so that you can get partial
credit even if you make an error.
(b) [1 point] Is there another state sequence which also generates 'moo hello
quack'? What is the total probability of emitting this sentence?
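One way to check a hand calculation for part 1 is a small script. The following is a minimal Python sketch, not part of Jet; the names TRANS, EMIT, viterbi, and forward are chosen here for illustration, and the sketch assumes that the final transition into the end state is counted as part of each path's probability.

```python
# Toy HMM from part 1. START and END are non-emitting states.
TRANS = {
    "START": {"Cow": 1.0},
    "Cow": {"Cow": 0.5, "Duck": 0.3, "END": 0.2},
    "Duck": {"Duck": 0.5, "Cow": 0.3, "END": 0.2},
}
EMIT = {
    "Cow": {"moo": 0.9, "hello": 0.1},
    "Duck": {"quack": 0.6, "hello": 0.4},
}
STATES = list(EMIT)


def viterbi(words):
    """Return (best state sequence, its probability), including the move to END."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (TRANS["START"].get(s, 0.0) * EMIT[s].get(words[0], 0.0), [s])
            for s in STATES}
    for word in words[1:]:
        step = {}
        for s in STATES:
            # Pick the predecessor that maximizes the path probability into s.
            prob, path = max(
                ((best[p][0] * TRANS[p].get(s, 0.0), best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s].get(word, 0.0), path + [s])
        best = step
    # Assumption: the transition into END is part of the path probability.
    prob, path = max(
        ((best[s][0] * TRANS[s].get("END", 0.0), best[s][1]) for s in STATES),
        key=lambda item: item[0],
    )
    return path, prob


def forward(words):
    """Return the total probability of the sentence, summed over all paths."""
    alpha = {s: TRANS["START"].get(s, 0.0) * EMIT[s].get(words[0], 0.0)
             for s in STATES}
    for word in words[1:]:
        alpha = {s: sum(alpha[p] * TRANS[p].get(s, 0.0) for p in STATES)
                    * EMIT[s].get(word, 0.0)
                 for s in STATES}
    return sum(alpha[s] * TRANS[s].get("END", 0.0) for s in STATES)


if __name__ == "__main__":
    sentence = "moo hello quack".split()
    path, prob = viterbi(sentence)
    print("best path:", path, "probability:", prob)
    print("total probability:", forward(sentence))
```

The viterbi function keeps only the best path into each state at every step, while forward sums over all predecessors; comparing the two results shows how much of the total probability the best path accounts for.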
2. JET HMM Tagger. [1.5 points] Try the Jet HMM tagger.
Submit the output for one correctly tagged sentence and for one sentence
with a single incorrect tag. Explain the error in terms of the emission
and transition probabilities in the HMM (file pos_hmm.txt). This is not a
lengthy calculation ... you need only compute the relative probabilities
of the two tag sequences. We recommend (so that you do not have to deal with
the back-off statistics of the tagger) that you choose an erroneous example
for which the word occurred with both the correct and incorrect parts of speech
in the training corpus.
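The relative probability of two tag sequences that differ in a single tag can be sketched as follows, assuming a first-order HMM in which each state corresponds to one tag: all factors shared by the two sequences cancel, leaving only the transition into the tag, the emission of the word, and the transition out of the tag. The function names and all numbers below are hypothetical and are not taken from pos_hmm.txt.

```python
def local_score(prev_tag, tag, next_tag, word, trans, emit):
    """Product of the three factors that change when one tag is swapped."""
    return trans[prev_tag][tag] * emit[tag][word] * trans[tag][next_tag]


def relative_probability(prev_tag, tag_a, tag_b, next_tag, word, trans, emit):
    """Ratio P(sequence with tag_a) / P(sequence with tag_b)."""
    return (local_score(prev_tag, tag_a, next_tag, word, trans, emit)
            / local_score(prev_tag, tag_b, next_tag, word, trans, emit))


# Made-up probabilities for illustration only; NOT values from pos_hmm.txt.
trans = {
    "DT": {"NN": 0.5, "VB": 0.25},
    "NN": {".": 0.25},
    "VB": {".": 0.0625},
}
emit = {"NN": {"run": 0.125}, "VB": {"run": 0.5}}

# How much more likely is "DT NN ." than "DT VB ." for the word "run"?
ratio = relative_probability("DT", "NN", "VB", ".", "run", trans, emit)
print(ratio)
```

A ratio above 1 means the sequence with the first tag is preferred. With real data, the probabilities would be obtained by dividing the counts in the HMM file by the corresponding state totals.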
Due February 14th.
Running the tagger:
Add the properties file tagPOS.jet to your props directory:
# JET properties file for POS tagging
Jet.dataPath = data
Tags.fileName = pos_hmm.txt
processSentence = tagPOS
On the "tagger" menu, turn on the "POS tagger trace".
- The tagger was trained on sentences ending in a period; be sure
to include a period when entering sentences or you may get bizarre tag assignments.
- The training corpus is taken from the Wall Street Journal. It
is therefore likely to do better on words you would expect to find in the
news, particularly the business news.
Analyzing the HMM file:
The pos_hmm file consists of a series of lines each beginning with a keyword:
- STATE state-name
  Defines a new state with name state-name. All following
  lines until the next STATE line are part of the definition of this state.
- ARC TO state-name [count]
  Indicates that there is an arc from the current state to the state
  named state-name. The count, which will be used to compute
  the probability of this transition, indicates how often the transition to
  state-name was observed. If absent, a count of 1 is assumed.
- EMIT token [count]
  Indicates that the current state can emit token token. The count,
  which will be used to compute the probability of this emission, indicates
  how often the emission of token was observed. If absent, a count
  of 1 is assumed.
- TAG tag
  Indicates that the current state is associated with tag tag.
  These tags are used to associate HMM states with annotations, as explained
An example of a simple file which matches a sequence of "oink"s and "quack"s:
ARC TO middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
Note that the file gives counts, not probabilities (these are actual counts
from a million words of text). To compute the emission and transition
probabilities, you also need to know the total count for a state. This
is not included in the pos_hmm file distributed with Jet, but we have created
a new pos_hmm file which provides
this additional information (as a count on each STATE line).
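The keyword format and the step from counts to probabilities described above can be sketched in Python as follows. This is an illustration, not Jet's own reader: parse_hmm and probabilities are hypothetical names, the STATE lines in the sample text (start, middle, end) are assumed for the example, and falling back to the sum of a state's counts when the STATE line carries no total is also an assumption.

```python
from collections import defaultdict


def parse_hmm(text):
    """Parse STATE / ARC TO / EMIT / TAG lines into per-state count tables."""
    states = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "STATE":
            current = {"total": None, "arcs": defaultdict(int),
                       "emits": defaultdict(int), "tag": None}
            # Optional total count on the STATE line (extended file only).
            if len(fields) > 2:
                current["total"] = int(fields[2])
            states[fields[1]] = current
        elif current is None:
            raise ValueError("line before first STATE: " + line)
        elif fields[0] == "ARC" and len(fields) >= 3 and fields[1] == "TO":
            # A missing count defaults to 1.
            current["arcs"][fields[2]] += int(fields[3]) if len(fields) > 3 else 1
        elif fields[0] == "EMIT":
            current["emits"][fields[1]] += int(fields[2]) if len(fields) > 2 else 1
        elif fields[0] == "TAG":
            current["tag"] = fields[1]
    return states


def probabilities(counts, total=None):
    """Turn a count table into probabilities.

    Assumption: if no state total is given, use the sum of the counts.
    """
    if total is None:
        total = sum(counts.values())
    return {key: count / total for key, count in counts.items()} if total else {}


# Sample input; the STATE lines here are assumed for illustration.
SAMPLE = """\
STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
"""

hmm = parse_hmm(SAMPLE)
middle = hmm["middle"]
emit_probs = probabilities(middle["emits"], middle["total"])
arc_probs = probabilities(middle["arcs"], middle["total"])
print(emit_probs)
print(arc_probs)
```

When a STATE line does carry a total count, passing it as the second argument to probabilities divides each arc or emission count by that total instead of by the sum of the listed counts.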