A Sublanguage Based Medical Language Processing System for
10:00 a.m., Monday, July 13, 1992
12th floor conference room, 719 Broadway
The major accomplishments reported in this thesis are:
- The development of a computer grammar for a nontrivial sublanguage of German. This grammar, using the LSP (Linguistic String Processor) grammar formalism, solves a number of parsing problems arising in free word order languages such as German.
- The development of an LSP-based information formatting system that obtains semantic representations of texts in a medical sublanguage of German.
- The confirmation of the sublanguage hypothesis (explained below).
In LSP grammar theory, the sentences of a language are derived from a collection of basic sentence types. The basic sentence types are described in terms of the major syntactic classes (e.g., noun, verb, adjective) of the language. Sentences are derived from these basic sentences by the insertion of optional structures called adjuncts, by conjoining, and by substituting words in the major classes. Insertion, conjoining, and substitution are constrained by co-occurrence restrictions between elements in the derived syntactic structures. These restrictions subcategorize the major word classes into subclasses and state which subclasses may co-occur in a sentence.
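The role of co-occurrence restrictions can be illustrated with a minimal sketch. It is not the LSP implementation: the lexicon, the subclass names (PATIENT, FINDING, SHOW, CHANGE), and the single noun-verb-noun sentence type are all hypothetical, chosen only to show how subclasses narrow what the major classes alone would allow.

```python
# Hypothetical lexicon: word -> (major class, subclass).
# The subclass names are illustrative assumptions, not LSP's actual classes.
LEXICON = {
    "patient": ("noun", "PATIENT"),
    "tumor": ("noun", "FINDING"),
    "shows": ("verb", "SHOW"),
    "heals": ("verb", "CHANGE"),
}

# Hypothetical co-occurrence restriction on the basic sentence type
# noun-verb-noun: only these subclass triples may co-occur.
ALLOWED_NVN = {("PATIENT", "SHOW", "FINDING")}


def satisfies_restrictions(subject, verb, obj):
    """Return True if the three words fit the noun-verb-noun sentence
    type and their subclasses are allowed to co-occur."""
    entries = [LEXICON[word] for word in (subject, verb, obj)]
    major_classes = [major for major, _ in entries]
    if major_classes != ["noun", "verb", "noun"]:
        return False
    subclasses = tuple(sub for _, sub in entries)
    return subclasses in ALLOWED_NVN


print(satisfies_restrictions("patient", "shows", "tumor"))
print(satisfies_restrictions("tumor", "shows", "patient"))
```

Both word sequences match the noun-verb-noun pattern at the level of major classes; only the subclass restriction separates the acceptable sequence from the unacceptable one.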
The sublanguage hypothesis elaborates LSP grammar theory in the following way. In a particular domain of discourse, the subcategorization of the major word classes reflects the underlying semantics of the domain. The basic sentence types of the language, represented by sublanguage subclasses instead of major word classes, can function as data structures (called information formats) representing the information of the domain.
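As a rough illustration of how a sentence type expressed in sublanguage subclasses can serve as a data structure, the sketch below models an information format as a record with one slot per subclass. The slot names, the subclass labels, and the `fill_format` helper are hypothetical and are not taken from the LSP system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InformationFormat:
    """One slot per (hypothetical) sublanguage subclass."""
    patient: Optional[str] = None
    verb: Optional[str] = None
    finding: Optional[str] = None
    body_part: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)


# Hypothetical mapping from subclass label to format slot.
SLOT_FOR_SUBCLASS = {
    "PATIENT": "patient",
    "SHOW": "verb",
    "FINDING": "finding",
    "BODYPART": "body_part",
}


def fill_format(tagged_words: List[Tuple[str, str]]) -> InformationFormat:
    """Place each (word, subclass) pair into its slot; words whose
    subclass has no slot are kept as modifiers."""
    fmt = InformationFormat()
    for word, subclass in tagged_words:
        slot = SLOT_FOR_SUBCLASS.get(subclass)
        if slot is None:
            fmt.modifiers.append(word)
        else:
            setattr(fmt, slot, word)
    return fmt


fmt = fill_format([
    ("patient", "PATIENT"),
    ("shows", "SHOW"),
    ("swelling", "FINDING"),
    ("knee", "BODYPART"),
    ("slight", "AMOUNT"),
])
print(fmt)
```

Because the slots are defined by subclasses rather than by particular words, the same record shape could in principle be filled from sentences in any language that shares the sublanguage.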
The LSP Medical Language Processor (LSP/MLP) is an information retrieval/information extraction system based on sublanguage and information formatting. It processes sentences in the English sublanguage of clinical reporting into information formats, which are in turn converted into database update records for a relational database. The information formats are derived from sublanguage co-occurrence information obtained from a corpus of discharge summaries.
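The step from an information format to a database update record might look roughly like the following. This is only a sketch under stated assumptions: the table name, its columns, the dictionary representation of a format, and the use of SQLite are all hypothetical, and the abstract does not describe the actual LSP/MLP database schema.

```python
import sqlite3


def format_to_record(patient_id, fmt):
    """Flatten a filled format (here a plain dict, an assumption)
    into one row for a hypothetical 'findings' table."""
    return (
        patient_id,
        fmt.get("finding"),
        fmt.get("body_part"),
        int(bool(fmt.get("negated", False))),
    )


# In-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE findings ("
    "patient_id TEXT, finding TEXT, body_part TEXT, negated INTEGER)"
)

record = format_to_record("p001", {"finding": "swelling", "body_part": "knee"})
conn.execute("INSERT INTO findings VALUES (?, ?, ?, ?)", record)
conn.commit()

rows = conn.execute(
    "SELECT patient_id, finding, body_part, negated FROM findings"
).fetchall()
print(rows)
```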
The German information formatting system implemented in this work processes German Arztbriefe (doctor letters) of cancer surgery patients into information formats. It confirms the sublanguage hypothesis because it re-uses the sublanguage information (co-occurrence information and formats) of the English LSP/MLP system in an equivalent sublanguage, showing that the sublanguage information reflects the semantics of the domain.