#### Assignment #4

February 14, 2008

1.  HMM:  (a) [1.5 points] Consider an HMM with two states, Cow and Duck, and a start and end state.  Emission probabilities:
• In state Cow, the HMM can emit 'moo' (with 0.9 probability) or 'hello' (0.1 probability).
• In state Duck, the HMM can emit 'quack' (0.6 probability) or 'hello' (0.4 probability). The Duck has been studying English longer.
(Nothing is emitted in the start or end state.)  Transition probabilities:
• From the start state, the HMM goes to state Cow with 1.0 probability (i.e., always).
• From state Cow, the HMM can remain in state Cow (0.5 probability), go to state Duck (0.3 probability), or go to state end (0.2 probability).
• From state Duck, the HMM can remain in state Duck (0.5 probability), go to state Cow (0.3 probability), or go to state end (0.2 probability).
Using the Viterbi algorithm, decode (find the most likely state sequence for) 'moo hello quack'.  What is the probability of emitting this sentence from this state sequence?  Show your work, so that you can get partial credit even if you make an error.

(b) [1 point] Is there another state sequence which also generates 'moo hello quack'?  What is the total probability of emitting this sentence?
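A hand-worked trellis can be checked against a short program. The following is a minimal Python sketch of Viterbi decoding and of the forward (total-probability) computation for an HMM with explicit start and end transitions. The function names `viterbi` and `total_probability` and the dictionary layout are assumptions made for this illustration and are not part of Jet; the probabilities are copied from the problem statement above.

```python
def viterbi(words, states, start_p, trans_p, emit_p, end_p):
    """Return (best state sequence, its probability) for the word sequence."""
    if not words:
        raise ValueError("need at least one word")
    # best[t][s] = (probability of the best path ending in s at time t, backpointer)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0), None)
             for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            # Choose the predecessor r that maximizes path probability into s.
            prob, prev = max((best[-1][r][0] * trans_p[r].get(s, 0.0), r)
                             for r in states)
            column[s] = (prob * emit_p[s].get(word, 0.0), prev)
        best.append(column)
    # Fold in the transition to the end state, then follow the backpointers.
    final_prob, last = max((best[-1][s][0] * end_p.get(s, 0.0), s) for s in states)
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, final_prob


def total_probability(words, states, start_p, trans_p, emit_p, end_p):
    """Forward algorithm: sum over all state sequences instead of taking the max."""
    if not words:
        raise ValueError("need at least one word")
    alpha = {s: start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0) for s in states}
    for word in words[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r].get(s, 0.0) for r in states)
                    * emit_p[s].get(word, 0.0)
                 for s in states}
    return sum(alpha[s] * end_p.get(s, 0.0) for s in states)


# Model from problem 1 (missing entries are treated as probability 0).
STATES = ["Cow", "Duck"]
START = {"Cow": 1.0}
TRANS = {"Cow": {"Cow": 0.5, "Duck": 0.3}, "Duck": {"Duck": 0.5, "Cow": 0.3}}
END = {"Cow": 0.2, "Duck": 0.2}
EMIT = {"Cow": {"moo": 0.9, "hello": 0.1}, "Duck": {"quack": 0.6, "hello": 0.4}}

sentence = "moo hello quack".split()
print(viterbi(sentence, STATES, START, TRANS, EMIT, END))
print(total_probability(sentence, STATES, START, TRANS, EMIT, END))
```

The only difference between the two functions is `max` versus `sum` over predecessor states, which is also the difference between parts (a) and (b).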

2.  JET HMM Tagger.  [1.5 points] Try the Jet HMM tagger.  Submit the output for one correctly tagged sentence and for one sentence with a single incorrect tag.  Explain the error in terms of the emission and transition probabilities in the HMM (file pos_hmm.txt).  This is not a lengthy calculation: you need only compute the relative probabilities of the two tag sequences.  We recommend (so that you do not have to deal with the back-off statistics of the tagger) that you choose an erroneous example for which the word occurred with both the correct and incorrect parts of speech in the training corpus.
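To see why the calculation is short, note that two tag sequences differing in a single position share almost all of their factors. As a sketch, assuming a first-order HMM with one state per tag, let word $w_i$ receive the incorrect tag $e$ instead of the correct tag $c$, with the same neighboring tags $t_{i-1}$ and $t_{i+1}$ in both sequences (these symbols are introduced here for illustration only). Everything else cancels in the ratio:

$$
\frac{P(\text{incorrect sequence})}{P(\text{correct sequence})}
= \frac{P(e \mid t_{i-1}) \; P(w_i \mid e) \; P(t_{i+1} \mid e)}
       {P(c \mid t_{i-1}) \; P(w_i \mid c) \; P(t_{i+1} \mid c)}
$$

A ratio greater than 1 means the HMM prefers the incorrect tag. Each factor is a count from pos_hmm.txt divided by the total count for the corresponding state.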

Due February 21st.

#### Running the tagger:

# JET properties file for POS tagging
Jet.dataPath     = data
Tags.fileName    = pos_hmm.txt
processSentence  = tagPOS

On the "tagger" menu, turn on the "POS tagger trace".

Cautions:
• The tagger was trained on sentences ending in a period;  be sure to include a period when entering sentences or you may get bizarre tag assignments.
• The training corpus is taken from the Wall Street Journal.  It is therefore likely to do better on words you would expect to find in the news, particularly the business news.

#### Analyzing the HMM file:

The pos_hmm file consists of a series of lines each beginning with a keyword:
STATE state-name
Defines a new state with name state-name.  All following lines until the next STATE line are part of the definition of this state.
ARC TO state-name [count]
Indicates that there is an arc from the current state to the state named state-name.  The count, which will be used to compute the probability of this transition, indicates how often the transition to state-name was observed.  If absent, a count of 1 is assumed.
EMIT token [count]
Indicates that the current state can emit token token.  The count, which will be used to compute the probability of this emission, indicates how often the emission of token was observed.  If absent, a count of 1 is assumed.
TAG tag
Indicates that the current state is associated with tag tag.  These tags are used to associate HMM states with annotations, as explained below.

An example of a simple file which matches a sequence of "oink"s and "quack"s is:

STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
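
The keyword format described above is simple enough to read with a few lines of code. The following Python sketch parses the example file into a dictionary of counts. The function name `parse_hmm` and the dictionary layout are assumptions for illustration and do not reflect how Jet itself reads the file; the sketch also assumes that counts are integers and that tokens contain no whitespace.

```python
EXAMPLE = """\
STATE start
ARC TO middle
STATE middle
EMIT quack 1
EMIT oink 2
ARC TO middle 2
ARC TO end 1
STATE end
"""


def parse_hmm(lines):
    """Parse STATE / ARC TO / EMIT / TAG lines into a dict keyed by state name."""
    states = {}
    current = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        keyword = fields[0]
        if keyword == "STATE":
            # Every line up to the next STATE line belongs to this state.
            current = {"arcs": {}, "emits": {}, "tags": []}
            states[fields[1]] = current
        elif current is None:
            raise ValueError("line before first STATE: " + raw)
        elif keyword == "ARC" and len(fields) >= 3 and fields[1] == "TO":
            # An absent count defaults to 1.
            current["arcs"][fields[2]] = int(fields[3]) if len(fields) > 3 else 1
        elif keyword == "EMIT":
            current["emits"][fields[1]] = int(fields[2]) if len(fields) > 2 else 1
        elif keyword == "TAG":
            current["tags"].append(fields[1])
        else:
            raise ValueError("unrecognized line: " + raw)
    return states


hmm = parse_hmm(EXAMPLE.splitlines())
for name, state in hmm.items():
    print(name, state["arcs"], state["emits"])
```
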
Note that the file gives counts, not probabilities (these are actual counts from a million words of text).  To compute the emission and transition probabilities, you also need to know the total count for a state.  This is not included in the pos_hmm file distributed with Jet, but we have created a new pos_hmm file which provides this additional information (as a count on each STATE line).
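
As a small illustration of turning counts into probabilities, the Python sketch below divides each count by the state's total. The helper name `to_probabilities` is hypothetical. When no total is supplied, the sketch assumes the total equals the sum of the listed counts; that holds for the oink/quack example above but may not hold for the real tagger file, which is why the count on the STATE line should be passed in when it is available.

```python
def to_probabilities(counts, total=None):
    """Convert a {name: count} dict into {name: probability}.

    `total` is the state's total count (for example, the count on the STATE
    line of the extended pos_hmm file). If omitted, the sum of the listed
    counts is used, which is an assumption of this sketch.
    """
    if total is None:
        total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Counts for state "middle" in the oink/quack example.
middle_emits = {"quack": 1, "oink": 2}
middle_arcs = {"middle": 2, "end": 1}

print(to_probabilities(middle_emits))  # emission probabilities
print(to_probabilities(middle_arcs))   # transition probabilities
```

With these counts, state middle emits "oink" twice as often as "quack" and returns to itself twice as often as it moves to the end state.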