G22.2590 - Natural Language Processing - Spring 2010  Prof. Grishman

Lecture 4 Outline

February 11, 2008

Parsers and their Problems, cont'd

Problems with the top-down backtracking parser
Problems with the bottom-up parser

Problem of ambiguity (sec. 13.2)

Dynamic programming parsing methods (sec. 13.4)
Example of dynamic programming parsing:  Earley parser (a chart parser) (sec. 13.4.2)
Even if we can generate all the parses with reasonable efficiency, what do we do with all these parses (all but one of which are wrong)?  We will look at a number of different approaches to this problem.
Capturing constraints in a context-free grammar
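
As an illustration of the dynamic programming idea behind the Earley chart parser listed above, here is a minimal recognizer sketch. The grammar, the lexicon, and names such as State and earley_recognize are made up for this example and are not from the lecture or from J&M. It also simplifies the textbook algorithm: the scanner advances a dotted rule directly over a part-of-speech category, the grammar is assumed to have no empty rules, and the function only reports whether a parse exists rather than building trees.

```python
from collections import namedtuple

# A dotted rule: lhs -> rhs, with the dot at index `dot`,
# for a constituent that began at input position `start`.
State = namedtuple("State", "lhs rhs dot start")

# Toy grammar (hypothetical): nonterminal -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
}

# Toy lexicon (hypothetical): word -> set of parts of speech.
LEXICON = {
    "the": {"Det"}, "a": {"Det"},
    "dog": {"N"}, "man": {"N"}, "park": {"N"},
    "saw": {"V", "N"},
    "in": {"P"},
}


def earley_recognize(words, grammar=GRAMMAR, lexicon=LEXICON, root="S"):
    """Return True if `words` can be derived from `root` (no empty rules assumed)."""
    chart = [[] for _ in range(len(words) + 1)]

    def add(position, state):
        # Each state is stored once per chart position; this is what
        # keeps left-recursive rules such as NP -> NP PP from looping.
        if state not in chart[position]:
            chart[position].append(state)

    add(0, State("GAMMA", (root,), 0, 0))  # dummy start state
    for i in range(len(words) + 1):
        # chart[i] may grow while we iterate; the loop visits new states too.
        for state in chart[i]:
            if state.dot < len(state.rhs):
                nxt = state.rhs[state.dot]
                if nxt in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for rhs in grammar[nxt]:
                        add(i, State(nxt, rhs, 0, i))
                elif i < len(words) and nxt in lexicon.get(words[i], ()):
                    # Scanner: the next word can have the expected category.
                    add(i + 1, State(state.lhs, state.rhs, state.dot + 1, state.start))
            else:
                # Completer: advance every state that was waiting for this constituent.
                for prev in chart[state.start]:
                    if prev.dot < len(prev.rhs) and prev.rhs[prev.dot] == state.lhs:
                        add(i, State(prev.lhs, prev.rhs, prev.dot + 1, prev.start))
    return State("GAMMA", (root,), 1, 0) in chart[len(words)]


if __name__ == "__main__":
    print(earley_recognize("the dog saw a man in the park".split()))
    print(earley_recognize("dog the saw".split()))
```

The first sentence is ambiguous in this toy grammar (the PP can attach to the NP or to the VP), yet the chart stores each dotted rule only once per position, which is the source of the efficiency gain over backtracking.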

Part-of-Speech Tagging (J&M chapter 5)

Role of parts-of-speech in grammar:  rules stated in terms of classes of words sharing syntactic properties

How fine should these classes be?
    Range of answers ... different part-of-speech 'tag' sets (J&M 5.2)
    Brown Corpus ... first large-scale tagged corpus
    Penn Tag Set ... used to tag Univ. of Pennsylvania Tree Bank (now several million words)
       (a detailed manual about Penn part-of-speech tagging is available from the Penn Treebank Project web site.)

The tagging task:  determining the tag of each word  (J&M 5.3)
    Not trivial:  many common words have several tags
        A dictionary will tell us which tags are possible for a word, independent of context.
        We could parse the sentence, and see which tags are used in the parses, but that's an expensive
        and difficult process (we might not always get a parse).
        Instead, we develop separate part-of-speech taggers, which can:
            Help parsing (reduce ambiguity).
            Resolve pronunciation ambiguities (for text-to-speech).
            Resolve semantic ambiguities.
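
To make the point about dictionary lookup concrete, the sketch below uses a tiny hand-made dictionary with Penn-style tags. The entries and the names TAG_DICTIONARY, possible_tags, and count_tag_sequences are assumptions for illustration only. It shows that a context-independent lookup leaves several candidate tag sequences even for a three-word sentence.

```python
# Hypothetical mini-dictionary: word -> tags it can take, independent of context.
TAG_DICTIONARY = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "back": {"JJ", "NN", "RB", "VB"},
    "book": {"NN", "VB"},
    "that": {"DT", "IN", "WDT"},
    "flight": {"NN"},
}


def possible_tags(word):
    """Return the set of tags the dictionary allows for `word` (empty if unknown)."""
    return TAG_DICTIONARY.get(word.lower(), set())


def count_tag_sequences(words):
    """Count the tag sequences a dictionary-only approach leaves open."""
    total = 1
    for word in words:
        # An unknown word is counted as one (unspecified) possibility.
        total *= max(1, len(possible_tags(word)))
    return total


if __name__ == "__main__":
    sentence = ["book", "that", "flight"]
    for word in sentence:
        print(word, sorted(possible_tags(word)))
    print(count_tag_sequences(sentence))
```

A tagger's job is to pick one sequence out of these candidates using context.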

Rule-based part-of-speech tagging (J&M 5.4)
    Ex:  Constraint-grammar tagger
    Needs large tagged corpus for testing

Statistical part-of-speech tagger (J&M 5.5)
    Needs large tagged corpus for training
    Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy
    For greater accuracy, need some information on adjacent words
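
The unigram baseline described above can be sketched as follows. The three-sentence training corpus is invented for this example (a real tagger would train on something like the Brown Corpus or the Penn Treebank), and the function names and the default tag for unknown words are assumptions, not part of the lecture.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical), using Penn-style tags.
TRAIN = [
    [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("empty", "JJ")],
    [("she", "PRP"), ("can", "MD"), ("swim", "VB")],
    [("he", "PRP"), ("can", "MD"), ("run", "VB")],
]


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


def accuracy(tagged_sentences, model):
    """Fraction of tokens whose predicted tag matches the reference tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        for (_, gold), (_, predicted) in zip(sentence, tag(words, model)):
            correct += gold == predicted
            total += 1
    return correct / total


if __name__ == "__main__":
    model = train_unigram(TRAIN)
    print(tag(["the", "can", "is", "empty"], model))
    print(accuracy(TRAIN, model))
```

Because "can" is most often a modal in this toy corpus, the tagger labels it MD even after "the". Fixing that kind of error is exactly what requires information about adjacent words.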