G22.2590 - Natural Language Processing - Spring 2010    Prof. Grishman

Lecture 12 Outline

April 15, 2010

Finish semantic interpretation with quantifiers.
Term project status.

Lexical Semantics [J&M Chap 19 and 20]

In our discussion of semantics up to now, we have focused on structural issues:  how to represent the relations between predicates and events and their arguments and modifiers;  how to represent quantification;  how to convert syntactic structure into semantic structure.   As our predicates we have used words, but this is problematic for a semantic representation:  one word may have several meanings (polysemy) and several words may have the same or nearly the same meaning (synonymy).  In this section we take a closer look at word meanings.

Terminology [J&M 19.1, 2]

    - multiple senses of a word
    - polysemy (and homonymy for totally unrelated senses ("bank"))
    - metonymy for certain types of regular, productive polysemy ("the White House" for the administration, "Washington" for the U.S. government)
    - zeugma (conjunction combining distinct senses) as test for polysemy ("serve")
    - synonymy:  when two words mean (more-or-less) the same thing
    - hyponymy:  X is a hyponym of Y if X denotes a more specific subclass of Y
        (X is the hyponym, Y is the hypernym)

WordNet [J&M 19.3]

    - large-scale database of lexical relations
    - freely available for interactive use or download
    - organized as a graph whose nodes are synsets (synonym sets)
        - each synset consists of 1 or more word senses which are considered synonymous
    - primary relation:  hyponym / hypernym
    - very fine sense distinctions
    - sense-annotated corpus (SemCor, subset of Brown corpus)
    - similar wordnets have been developed for many other languages (see the Global WordNet Association)
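
A quick illustration of browsing WordNet programmatically, through the NLTK interface.  This is a sketch only; it assumes the nltk package and its WordNet data have been downloaded, and the word and sense names are just examples.

    # Browse WordNet synsets and hypernyms through NLTK
    # (assumes: pip install nltk, then nltk.download('wordnet'))
    from nltk.corpus import wordnet as wn

    # all noun synsets (senses) for the polysemous word "bank"
    for synset in wn.synsets('bank', pos=wn.NOUN):
        print(synset.name(), '-', synset.definition())

    # the full hypernym path for one sense (from the root of WordNet down to this synset)
    bank1 = wn.synset('bank.n.01')
    print(' -> '.join(s.name() for s in bank1.hypernym_paths()[0]))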

Word Sense Disambiguation [J&M 20.1]

    - process of identifying the sense of a word in context
    - WSD evaluation:  against either WordNet senses or coarser senses (e.g., the main senses from a dictionary)
    - local cues (Weaver):  train a classifier using nearby words as features
        - either treat the words at specific positions relative to the target word as separate features
        - or treat all the words within a given window (e.g., 10 words wide) as a 'bag of words' (see the sketch after this list)
        - simple demo for 'interest'
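
A minimal sketch of these two feature schemes.  The tokenization, window size, and example sentence are arbitrary choices for illustration.

    # Two ways to build context features for WSD, as described above.

    def positional_features(tokens, i, k=2):
        """Words at fixed offsets -k..+k from the target token at index i."""
        feats = {}
        for offset in range(-k, k + 1):
            if offset != 0 and 0 <= i + offset < len(tokens):
                feats['word_at_%+d' % offset] = tokens[i + offset].lower()
        return feats

    def bag_of_words_features(tokens, i, window=5):
        """Unordered set of words within +/- window positions of the target."""
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        return {w.lower() for j, w in enumerate(tokens) if lo <= j < hi and j != i}

    tokens = "the bank raised its interest rate last week".split()
    target = tokens.index('interest')
    print(positional_features(tokens, target))
    print(bag_of_words_features(tokens, target))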

Simple supervised WSD algorithm:  naive Bayes [J&M 20.2.2]

        selected sense s' = argmax(sense s) P(s | F)
        where F is the set of context features (n different features)
            s' = argmax(s) P(F | s) P(s) / P(F)
               = argmax(s) P(F | s) P(s)
        If we now assume the features are conditionally independent given the sense
            P(F | s) =  product(i) P(f[i] | s)
            s' = argmax(s) P(s) product(i) P(f[i] | s)
        Maximum likelihood estimates for P(s) and P(f[i] | s) can be easily obtained by counting
            - some smoothing (e.g., add-one smoothing) is needed
        Works quite well at selecting best sense (not at estimating probabilities)
        But needs substantial annotated training data for each word
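
A toy implementation of this decision rule, using bag-of-words context features and add-one smoothing.  The training examples and sense labels below are invented purely for illustration.

    import math
    from collections import Counter, defaultdict

    # toy sense-annotated contexts for 'interest': (context words, sense)
    training = [
        (['bank', 'rate', 'loan', 'percent'], 'money_sense'),
        (['rate', 'federal', 'bank', 'cut'], 'money_sense'),
        (['hobby', 'music', 'great', 'reading'], 'attention_sense'),
        (['showed', 'great', 'topic', 'reading'], 'attention_sense'),
    ]

    sense_counts = Counter(sense for _, sense in training)
    feature_counts = defaultdict(Counter)     # sense -> counts of context words
    vocab = set()
    for words, sense in training:
        feature_counts[sense].update(words)
        vocab.update(words)

    def classify(context_words):
        """argmax over senses of log P(s) + sum_i log P(f_i | s), add-one smoothed."""
        best_sense, best_score = None, float('-inf')
        for sense in sense_counts:
            score = math.log(sense_counts[sense] / len(training))    # log P(s)
            total = sum(feature_counts[sense].values())
            for w in context_words:
                score += math.log((feature_counts[sense][w] + 1) /
                                  (total + len(vocab)))              # log P(f_i | s)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    print(classify(['the', 'bank', 'raised', 'its', 'rate']))        # money_sense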

Semi-supervised WSD algorithm [J&M 20.5]

        Based on Gale / Yarowsky's "one sense per discourse" observation
            (generally true for coarse word senses)
        Allows bootstrapping from a small set of sense-annotated seeds
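
A sketch of the bootstrapping loop.  The train and predict arguments are assumed callables, not part of the lecture (e.g., the naive Bayes classifier above, wrapped to return a sense together with a confidence score); the threshold and round limit are arbitrary choices.

    def bootstrap(seed_labeled, unlabeled, train, predict, threshold=0.9, rounds=10):
        """Repeatedly train on the labeled set and absorb unlabeled contexts
        that the current model classifies with high confidence."""
        labeled = list(seed_labeled)      # [(context_words, sense), ...]
        pool = list(unlabeled)            # [context_words, ...]
        for _ in range(rounds):
            model = train(labeled)
            newly_labeled, remaining = [], []
            for context in pool:
                sense, confidence = predict(model, context)
                if confidence >= threshold:
                    newly_labeled.append((context, sense))
                else:
                    remaining.append(context)
            if not newly_labeled:         # nothing confident left; stop early
                break
            labeled.extend(newly_labeled)
            pool = remaining
        return train(labeled)

The "one sense per discourse" observation can be folded in by propagating the sense assigned to confidently labeled occurrences to the other occurrences of the word in the same document.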

Identifying similar words

Distance metric for WordNet [J&M 20.6]

        Simplest metrics just use path length in WordNet
        More sophisticated metrics take account of the fact that going 'up' (to a hypernym) may represent different degrees of generalization in different cases
        Resnik introduced P(c):  for each concept (synset), P(c) = probability that a word in a corpus is an instance of the concept (matches the synset c or one of its hyponyms)
        Information content of a concept
            IC(c) = -log P(c)
        If LCS(c1, c2) is the lowest common subsumer of c1 and c2, the JC (Jiang-Conrath) distance between c1 and c2 is
            IC(c1) + IC(c2) - 2 IC(LCS(c1, c2))
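
These measures are available through NLTK's WordNet interface (assuming nltk plus its 'wordnet' and 'wordnet_ic' data packages are installed).  Note that NLTK reports Jiang-Conrath as a similarity, roughly the inverse of the distance above.

    # (assumes: nltk.download('wordnet') and nltk.download('wordnet_ic'))
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')   # P(c) estimated from the Brown corpus

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')

    print(dog.path_similarity(cat))            # simplest: path length only
    print(dog.res_similarity(cat, brown_ic))   # Resnik: IC of the lowest common subsumer
    print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath, as a similarity:
                                               # 1 / (IC(c1) + IC(c2) - 2 IC(LCS(c1,c2)))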

Similarity metric from corpora [J&M 20.7]

        Basic idea:  characterize words by their contexts;  words sharing more contexts are more similar
        Contexts can be defined either in terms of adjacency or in terms of dependency (syntactic relations)
        Given a word w and a context feature f, define pointwise mutual information PMI
            PMI(w,f) = log ( P(w,f) / (P(w) P(f)) )
        Given a list of contexts (words left and right) we can compute a context vector for each word.
        The similarity of two vectors (representing two words) can be computed in many ways;  a standard way is using the cosine (normalized dot product).
        See the Thesaurus demo by Patrick Pantel.
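
A small end-to-end sketch of this idea: build PMI-weighted context vectors from adjacency (a fixed window) over a toy corpus and compare them with the cosine.  The corpus, the window size, and the choice to keep only positive PMI values are illustrative assumptions.

    import math
    from collections import Counter, defaultdict

    # toy corpus; in practice the counts would come from a large corpus
    corpus = [
        "the bank raised the interest rate".split(),
        "the bank approved the loan".split(),
        "the river bank was muddy".split(),
        "the muddy shore of the river".split(),
    ]

    WINDOW = 2                           # adjacency-based contexts: +/- 2 words
    cooc = defaultdict(Counter)          # word -> counts of its context words
    total_pairs = 0

    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[w][sent[j]] += 1
                    total_pairs += 1

    def pmi_vector(w):
        """Context vector of PMI(w,f) values, keeping only the positive entries."""
        vec = {}
        p_w = sum(cooc[w].values()) / total_pairs
        for f, n_wf in cooc[w].items():
            p_f = sum(cooc[x][f] for x in cooc) / total_pairs
            pmi = math.log((n_wf / total_pairs) / (p_w * p_f))
            if pmi > 0:
                vec[f] = pmi
        return vec

    def cosine(u, v):
        """Normalized dot product of two sparse vectors."""
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    print(cosine(pmi_vector('bank'), pmi_vector('river')))
    print(cosine(pmi_vector('bank'), pmi_vector('shore')))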