DEPARTMENT OF COMPUTER SCIENCE
DOCTORAL DISSERTATION DEFENSE
CANDIDATE: MAHESH CHITRAO
STATISTICAL TECHNIQUES FOR PARSING MESSAGES
9:00 A.M., MONDAY, JANUARY 21, 1991
ROOM 101,
WARREN WEAVER HALL





Abstract

Message processing is the extraction of information about key events described in brief narratives concerning a narrow domain. This is a suitable task for natural language understanding, since the amount of world knowledge required is limited. However, the messages are often ill-formed, so the grammar that parses them must be quite forgiving. This often results in a proliferation of parses. The problem is compounded by the inability to construct a complete domain model that would resolve all the semantic ambiguity. Thus, selection of the correct parse becomes an important goal for such systems.

Structural preference is a technique which helps disambiguation by assigning a higher preference to certain syntactic structures. The idea of statistical parsing evolved from the desire to prefer certain structures over others on the basis of empirical observations, rather than ad hoc judgement. In the framework of statistical parsing, every production of the grammar is assigned a priority, which is computed from a statistical analysis of a corpus.
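
As a minimal sketch of how such priorities might be computed, the Python fragment below estimates the priority of each production as its relative frequency among productions with the same left-hand side in a set of training parses. The relative-frequency estimate, the function name production_priorities, and the toy corpus are assumptions made for illustration; the abstract does not state the actual formula.

```python
from collections import Counter


def production_priorities(parses):
    """Estimate a priority for every production seen in the training parses.

    parses: iterable of parses, each given as a list of (lhs, rhs) pairs,
            where rhs is a tuple of symbols.
    Returns a dict mapping (lhs, rhs) to count(lhs -> rhs) / count(lhs).
    This relative-frequency estimate is an assumption of this sketch.
    """
    rule_counts = Counter()
    lhs_counts = Counter()
    for parse in parses:
        for lhs, rhs in parse:
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


# Hypothetical two-parse corpus, invented for illustration only.
toy_corpus = [
    [("S", ("NP", "VP")), ("NP", ("N",)), ("VP", ("V", "NP")), ("NP", ("N",))],
    [("S", ("VP",)), ("VP", ("V", "NP")), ("NP", ("ADJ", "N"))],
]
priorities = production_priorities(toy_corpus)
```

Under this sketch, a production that is rarely used in the corpus receives a low priority relative to its competitors with the same left-hand side.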

There are two distinct methodologies that can be used for assigning these priorities. In Supervised Training, only the correct parses are used for training the grammar. On the other hand, Unsupervised Training uses parses independent of their semantic validity. After assigning the priorities, the parser searches for parses in a best-first order as dictated by these priorities.
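
The best-first search can be illustrated with a small sketch. The Python fragment below enumerates top-down, leftmost derivations of a token sequence from a priority queue, using the sum of negative log priorities as the cost, so that complete parses come out in decreasing order of the product of their production priorities. The cost function, the top-down strategy, the toy grammar, and the name best_first_parses are assumptions made for illustration, not a description of the PROTEUS parser. The sketch also assumes a grammar with no empty productions and priorities strictly greater than zero.

```python
import heapq
import itertools
import math


def best_first_parses(grammar, start, tokens):
    """Yield (score, productions) for each parse of tokens, best score first.

    grammar: dict mapping a nonterminal to a list of (rhs, priority) pairs,
             where rhs is a tuple of symbols and 0 < priority <= 1.
    Symbols that are not keys of grammar are treated as terminals.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares tuples of symbols
    agenda = [(0.0, next(tie), (start,), 0, ())]
    while agenda:
        cost, _, stack, pos, used = heapq.heappop(agenda)
        if not stack:
            if pos == len(tokens):
                yield math.exp(-cost), list(used)
            continue
        # Without empty productions, each pending symbol needs at least one token.
        if len(stack) > len(tokens) - pos:
            continue
        top, rest = stack[0], stack[1:]
        if top in grammar:
            # Expand the leftmost nonterminal with every competing production.
            for rhs, priority in grammar[top]:
                new_cost = cost - math.log(priority)
                heapq.heappush(
                    agenda,
                    (new_cost, next(tie), rhs + rest, pos, used + ((top, rhs),)),
                )
        elif tokens[pos] == top:
            # The terminal matches the next input token; consume it at no cost.
            heapq.heappush(agenda, (cost, next(tie), rest, pos + 1, used))


# Hypothetical grammar over part-of-speech tags, with invented priorities.
toy_grammar = {
    "S": [(("VP",), 1.0)],
    "VP": [(("V", "NP"), 0.6), (("V", "NP", "PP"), 0.4)],
    "NP": [(("N",), 0.7), (("N", "PP"), 0.3)],
    "PP": [(("P", "NP"), 1.0)],
}
results = list(best_first_parses(toy_grammar, "S", ["V", "N", "P", "N"]))
```

In this toy case the tag sequence has two parses, one attaching the prepositional phrase to the verb phrase and one attaching it to the noun phrase. Because every step adds a non-negative cost, the parse with the higher product of priorities is yielded first, and a caller that needs only the best parse can stop after the first result.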

When this scheme was incorporated into the PROTEUS message understanding system and applied to OPREP (U.S. Navy Operational) messages, two advantages were observed. First, parsing speed increased, because rare productions tended not to be used at all. Second, because the parses were generated in best-first order, the parses generated earlier tended to be more likely and semantically more acceptable.

The performance of the modified parsing algorithm was evaluated with and without several refinements, such as the use of context-sensitive statistics and the use of heuristic penalties. The relative performance of grammars trained by Supervised Training and by Unsupervised Training was also compared.