Patterns and Pattern Matching

The pattern matcher lies at the heart of the Jet toolset.  It allows the user to write regular expressions which match sequences of annotations, and to create new annotations.  Most of the language analysis in Jet is performed using a sequence of pattern matching steps.

Jet Pattern Language

A pattern collection is a sequence of pattern statements, where each pattern statement is terminated by a semicolon.  The basic language has two statement types, a pattern definition statement and a when statement, which indicates the action to be performed when a pattern is matched.

Pattern Definition

A pattern definition has the form
pattern-name := option1 | option2 | ... ;
where pattern-name is a sequence of letters beginning with a lower-case letter, and each option is a sequence of repeated pattern elements separated by spaces.  A repeated pattern element has one of the forms
pattern-element
pattern-element ?
pattern-element *
pattern-element +
to indicate, respectively, exactly one, zero or one, zero or more, or one or more instances of pattern-element.  A pattern-element in turn may be
a string:  "quack"
an annotation:  [type feature=value  feature=value ...]
the name of another pattern
an alternation:  ( option1 | option2 | ... )
an assignment pattern element:  variable = value
A string pattern element matches an annotation of type token spanning the specified string.  An annotation pattern element matches an annotation which has the specified type and features (and may have additional features).  For example, the pattern element
[constit cat=tv]
matches a constit annotation whose feature cat has the value 'tv'.  A test of a feature against the concept hierarchy (rather than a single value) has the form
feature ?isa(concept)
This feature test succeeds if the value of feature is a word associated with concept, or with some other concept that is a descendant of concept in the hierarchy.
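
The two kinds of feature test described above can be sketched in Python. This is only an illustration, not Jet's implementation: the dictionary layout for annotations and the words_of and children_of tables are hypothetical stand-ins for Jet's annotation and concept hierarchy structures.

```python
def matches(annotation, pattern_type, pattern_features):
    """True if the annotation has the pattern's type and every feature the
    pattern names; extra features on the annotation are ignored."""
    if annotation["type"] != pattern_type:
        return False
    features = annotation["features"]
    return all(name in features and features[name] == value
               for name, value in pattern_features.items())


def isa(word, concept, words_of, children_of):
    """True if word is associated with concept or with any of its descendants.

    words_of maps a concept to the words associated with it; children_of maps
    a concept to its immediate sub-concepts (both hypothetical tables)."""
    pending = [concept]
    visited = set()
    while pending:
        current = pending.pop()
        if current in visited:
            continue
        visited.add(current)
        if word in words_of.get(current, ()):
            return True
        pending.extend(children_of.get(current, ()))
    return False


# [constit cat=tv] against an annotation that carries one extra feature
verb = {"type": "constit", "features": {"cat": "tv", "number": "singular"}}
tv_match = matches(verb, "constit", {"cat": "tv"})
np_match = matches(verb, "constit", {"cat": "np"})

# A toy hierarchy: animal > mammal > dog
words_of = {"animal": {"creature"}, "dog": {"dog", "hound"}}
children_of = {"animal": ["mammal"], "mammal": ["dog"]}
hound_is_animal = isa("hound", "animal", words_of, children_of)
creature_is_dog = isa("creature", "dog", words_of, children_of)
```

Here the isa test walks down from the named concept, so a word attached to an ancestor (creature) does not satisfy a test against a descendant (dog).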

Variables

A variable name is a sequence of letters beginning with a capital letter.  A variable may be bound in a pattern in several ways.  An assignment pattern element
Variable = value
binds Variable to value.  At present, the only values allowed are integers.  A parenthesized pattern may be followed by a colon (:) and a variable name:
(pattern ) : Variable
This binds the variable to the span of the document matched by the pattern.

An annotation pattern element can specify a variable as the value of one of the features:  feature=Variable.  If the variable is unbound when this pattern element is matched against an annotation in the text, the variable will be bound to the value of this feature.  On the other hand, if the variable is already bound, the pattern element will match only if the value of the feature is equal to the value of the variable.  For example, the pattern element

[constit cat=vp number=Number]
(assuming this is the first appearance of Number), will bind Number to the value of the number feature.  The sequence
[constit cat=np number=Number]   [constit cat=vp number=Number]
can be used to ensure that the values of the number feature on the np and the vp are the same.
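
The bind-or-compare behavior of feature variables can be sketched as follows. This is a minimal illustration under stated assumptions, not Jet's code: annotations are plain dictionaries, feature values are strings, and the function name match_with_bindings is hypothetical. A pattern value beginning with a capital letter is treated as a variable, following the naming rule given above.

```python
def match_with_bindings(annotation, pattern_type, pattern_features, bindings):
    """Match one annotation pattern element.

    Returns the (possibly extended) variable bindings on success and None on
    failure.  The caller's bindings dictionary is never modified."""
    if annotation["type"] != pattern_type:
        return None
    result = dict(bindings)
    for feature, expected in pattern_features.items():
        if feature not in annotation["features"]:
            return None
        actual = annotation["features"][feature]
        if expected[:1].isupper():
            # Variable: bind it if unbound, otherwise require equality.
            if expected in result:
                if result[expected] != actual:
                    return None
            else:
                result[expected] = actual
        elif actual != expected:
            # Literal value: must match exactly.
            return None
    return result


np_singular = {"type": "constit", "features": {"cat": "np", "number": "singular"}}
vp_singular = {"type": "constit", "features": {"cat": "vp", "number": "singular"}}
vp_plural = {"type": "constit", "features": {"cat": "vp", "number": "plural"}}

np_pattern = {"cat": "np", "number": "Number"}
vp_pattern = {"cat": "vp", "number": "Number"}

# [constit cat=np number=Number] [constit cat=vp number=Number]
after_np = match_with_bindings(np_singular, "constit", np_pattern, {})
agreeing = match_with_bindings(vp_singular, "constit", vp_pattern, after_np)
disagreeing = match_with_bindings(vp_plural, "constit", vp_pattern, after_np)
```

The first element binds Number to singular; the second element then succeeds only for the vp whose number feature carries the same value.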

When Statements and Actions

When statements associate patterns with sequences of actions.  When the pattern is matched in a document, the associated actions are performed.  The when statement has the form
when pattern-name, action1, action2, ... ;
At present, three actions are implemented:  the add action, which adds an annotation; the print action; and the write action.

The add action

The add action adds an annotation to the text.  It has the form
add [annotation-type feature=value feature=value ...]
or
add [annotation-type feature=value feature=value ...] over variable
In the first form, the span of the new annotation is the text matched by the pattern.  In the second form, the variable must have been bound to a span as part of the pattern matching;  this is used as the span of the new annotation.

The print action

The print action has the form
print stringExpression
where stringExpression can be a string (enclosed in double quotes), a variable, or a sequence of two or more strings and variables separated by plus signs (+).  A variable in a stringExpression should have been bound to a span or an annotation as part of the pattern matching process;  the print action prints the text subsumed by that span or annotation.  If the stringExpression contains two or more items, they are concatenated and the result printed together on a single line.  The output is sent to the Jet console.
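
The evaluation of a stringExpression can be sketched as below. This is an assumption-laden illustration rather than Jet's code: the expression is represented as an already parsed list of (kind, value) pairs, variables are bound to character spans given as (start, end) offsets, and the function name is hypothetical.

```python
def evaluate_string_expression(items, bindings, document_text):
    """Concatenate the items of a string expression into one output line.

    items is a list of ("string", text) or ("variable", name) pairs;
    bindings maps a variable name to a (start, end) character span."""
    parts = []
    for kind, value in items:
        if kind == "string":
            parts.append(value)
        else:
            start, end = bindings[value]
            # A variable contributes the document text covered by its span.
            parts.append(document_text[start:end])
    return "".join(parts)


text = "Fred sleeps"
bindings = {"Subject": (0, 4)}

# print "subject = " + Subject
line = evaluate_string_expression(
    [("string", "subject = "), ("variable", "Subject")], bindings, text)
```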

The write action

The write action has the form
write stringExpression
It has the same semantics as the print action, except that the output is written to standard output.

Pattern Sets and Pattern Matching Process

The when statements are organized into pattern sets.  The statement
pattern set name;
indicates that all following when statements (until the next pattern set statement) belong to pattern set name.  The basic 'top level' operation in Jet is the application of a pattern set to a sentence.

The process begins by matching all patterns in the pattern set (i.e., all patterns referenced by when statements in the pattern set) starting at the first token of the sentence.  If several patterns match, we select the pattern which matches the longest portion of the text.  If several patterns match the same (longest) portion, we select the pattern whose when statement appeared first in the pattern file.  The actions associated with the selected pattern are then executed in sequence (if no pattern matches, no actions are performed).

The starting point for pattern matching is then advanced and the process is repeated.  If any of the actions created new annotations, the starting point is set to the maximum of the ends of these annotations.  If no new annotation was created, the starting point is advanced by one token.  The matching continues until the starting point reaches the end of the sentence.
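
The control loop described in this section can be sketched in Python. This is a simplified model under stated assumptions, not Jet's implementation: positions are token indices, each rule is a hypothetical (name, matcher, action) triple listed in file order, a matcher returns the end position of its match or None, and an action returns the end positions of any annotations it created. The guard that forces the position forward is an addition of this sketch, not something the text above specifies.

```python
def apply_pattern_set(sentence_end, rules, start=0):
    """Apply a pattern set to one sentence; return the rules fired, in order."""
    fired = []
    position = start
    while position < sentence_end:
        best = None  # (end of match, index of rule)
        for index, (name, matcher, action) in enumerate(rules):
            end = matcher(position)
            if end is None:
                continue
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if best is None or end > best[0]:
                best = (end, index)
        new_ends = []
        if best is not None:
            end, index = best
            name, _, action = rules[index]
            fired.append((name, position, end))
            new_ends = action(position, end)
        if new_ends:
            # Guard (an assumption of this sketch): always move forward.
            position = max(max(new_ends), position + 1)
        else:
            position += 1
    return fired


def phrase_matcher(tokens, words):
    """Build a matcher that succeeds when words occur at the given position."""
    def matcher(position):
        end = position + len(words)
        return end if tokens[position:end] == words else None
    return matcher


tokens = ["New", "York", "City", "is", "big"]
rules = [
    # The first two rules add an annotation over the match; the third only prints.
    ("short", phrase_matcher(tokens, ["New", "York"]), lambda s, e: [e]),
    ("long", phrase_matcher(tokens, ["New", "York", "City"]), lambda s, e: [e]),
    ("verb", phrase_matcher(tokens, ["is"]), lambda s, e: []),
]
fired = apply_pattern_set(len(tokens), rules)
```

At position 0 both short and long match, and long is selected because it covers more text. Matching then resumes at token 3, the end of the annotation that long created.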