The Tipster package provides the basic methods for recording information about documents.  It is loosely based on the 'Tipster Architecture' developed by R. Grishman as part of the Government-sponsored Tipster program.  The basic objects are Documents and Annotations:  a Document is a container for the text of the document and for a set of Annotations on that text.

In the course of processing, the Jet system builds up a lot of information about the words and phrases in a Document:  simple things like parts-of-speech for individual words and type information (person/company/location) for names, as well as more complex things like phrases and clauses (with internal structure).  We want to have a single class of object for capturing all of this information and associating it with a Document.  The class we use for this purpose is the Annotation.  An Annotation is associated with a Span (substring) of the text of a Document.  The Annotation has a type and a set of features with values.  For example, an Annotation can indicate that a portion of a Document is a sentence, or is a token with a given part-of-speech.  More complex structures can be built by having Annotations which point to other Annotations.
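As a rough illustration of this data model, the sketch below shows one way a span, a typed annotation with features, and a document could fit together.  All class and method names here (SketchSpan, SketchAnnotation, SketchDocument, annotate, with, textOf) are hypothetical and are not the actual classes of this package; half-open character offsets are likewise an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified classes for illustration only; not the Jet API.

/** A half-open character range [start, end) within a document's text. */
final class SketchSpan {
    final int start;
    final int end;

    SketchSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }
}

/** A typed annotation over a span, carrying named feature values. */
final class SketchAnnotation {
    final String type;
    final SketchSpan span;
    final Map<String, Object> features = new HashMap<>();

    SketchAnnotation(String type, SketchSpan span) {
        this.type = type;
        this.span = span;
    }

    /** Sets a feature and returns this annotation so calls can be chained. */
    SketchAnnotation with(String name, Object value) {
        features.put(name, value);
        return this;
    }
}

/** The text of a document plus the annotations placed on it. */
final class SketchDocument {
    final String text;
    final List<SketchAnnotation> annotations = new ArrayList<>();

    SketchDocument(String text) {
        this.text = text;
    }

    /** Creates an annotation over [start, end) and records it on the document. */
    SketchAnnotation annotate(String type, int start, int end) {
        if (end > text.length()) {
            throw new IllegalArgumentException("span extends past end of text");
        }
        SketchAnnotation a = new SketchAnnotation(type, new SketchSpan(start, end));
        annotations.add(a);
        return a;
    }

    /** Returns the substring of the text covered by the annotation. */
    String textOf(SketchAnnotation a) {
        return text.substring(a.span.start, a.span.end);
    }
}

final class AnnotationSketchDemo {
    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument("John Smith works at IBM.");
        SketchAnnotation name = doc.annotate("name", 0, 10).with("type", "person");
        SketchAnnotation verb = doc.annotate("token", 11, 16).with("pos", "verb");
        // A larger structure points to other annotations through its features.
        doc.annotate("clause", 0, 23).with("subject", name).with("verb", verb);
        // prints: name [John Smith] {type=person}
        System.out.println(name.type + " [" + doc.textOf(name) + "] " + name.features);
    }
}
```

Storing feature values as plain objects is what lets one annotation refer to another, which is how the clause in the example records its subject and verb.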

A Document is processed in a series of stages, such as tokenization, sentence splitting, dictionary look-up, and pattern matching.  Each stage uses the Annotations placed on the Document by previous stages, and adds its own Annotations to the Document.
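The staged processing described above might be sketched as follows.  This is only an illustration under assumed names (Mark, PipelineDoc, Stage, WhitespaceTokenizer, SentenceSplitter, Pipeline are all hypothetical), and the tokenization and sentence-splitting rules are deliberately naive; they are not the rules Jet applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a sketch of the idea, not the Jet API.

/** A mark of a given type over characters [start, end). */
final class Mark {
    final String type;
    final int start;
    final int end;

    Mark(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

/** Document text plus the marks added so far. */
final class PipelineDoc {
    final String text;
    final List<Mark> marks = new ArrayList<>();

    PipelineDoc(String text) {
        this.text = text;
    }

    /** Returns a fresh list of the marks of one type, in insertion order. */
    List<Mark> marksOfType(String type) {
        List<Mark> result = new ArrayList<>();
        for (Mark m : marks) {
            if (m.type.equals(type)) {
                result.add(m);
            }
        }
        return result;
    }
}

/** One processing stage: reads existing marks, adds its own. */
interface Stage {
    void process(PipelineDoc doc);
}

/** Marks each maximal run of non-whitespace characters as a token. */
final class WhitespaceTokenizer implements Stage {
    public void process(PipelineDoc doc) {
        int i = 0;
        int n = doc.text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(doc.text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(doc.text.charAt(i))) i++;
            if (i > start) {
                doc.marks.add(new Mark("token", start, i));
            }
        }
    }
}

/** Uses the token marks of an earlier stage; a sentence ends at a token ending in '.', '!' or '?'. */
final class SentenceSplitter implements Stage {
    public void process(PipelineDoc doc) {
        int sentenceStart = -1;
        int lastEnd = -1;
        for (Mark token : doc.marksOfType("token")) {
            if (sentenceStart < 0) sentenceStart = token.start;
            lastEnd = token.end;
            char last = doc.text.charAt(token.end - 1);
            if (last == '.' || last == '!' || last == '?') {
                doc.marks.add(new Mark("sentence", sentenceStart, token.end));
                sentenceStart = -1;
            }
        }
        // Any trailing tokens without final punctuation form a last sentence.
        if (sentenceStart >= 0) {
            doc.marks.add(new Mark("sentence", sentenceStart, lastEnd));
        }
    }
}

final class Pipeline {
    /** Runs the stages in order over the same document. */
    static void run(PipelineDoc doc, Stage... stages) {
        for (Stage stage : stages) {
            stage.process(doc);
        }
    }

    public static void main(String[] args) {
        PipelineDoc doc = new PipelineDoc("It rains. We stay in");
        run(doc, new WhitespaceTokenizer(), new SentenceSplitter());
        for (Mark m : doc.marks) {
            System.out.println(m.type + " " + m.start + "-" + m.end);
        }
    }
}
```

The order of stages matters in this sketch: the sentence splitter finds nothing if it runs before the tokenizer, because it works only from the token marks already on the document.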

Annotations provide a mark-up capability very similar to that of SGML or XML (although Annotations do not have to be nested the way SGML/XML mark-up is).  The Document class provides a method for converting selected Annotations on a Document to XML mark-up, and in the future will have a method for converting XML mark-up to Annotations.  In addition, the Document class provides a method for viewing a Document and highlighting selected Annotations (this is very primitive at present).
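To make the relation to XML concrete, here is a minimal sketch of turning stand-off annotations into inline mark-up.  The names Tag and XmlSketch.toXml are hypothetical and this is not the Document class's actual conversion method.  The sketch assumes the selected annotations nest properly and have valid offsets; because Annotations in general need not nest, it rejects crossing spans rather than guessing how to render them.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical names; a sketch of the idea, not the Jet API.

/** A stand-off annotation: a type over characters [start, end). */
final class Tag {
    final String type;
    final int start;
    final int end;

    Tag(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class XmlSketch {
    /** Escapes the characters that are special in XML character data. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Writes the text with an element around each tag; throws if tags cross. */
    static String toXml(String text, List<Tag> tags) {
        List<Tag> sorted = new ArrayList<>(tags);
        // Earlier start first; for equal starts, the longer (outer) span first.
        sorted.sort((a, b) -> a.start != b.start
                ? Integer.compare(a.start, b.start)
                : Integer.compare(b.end, a.end));

        StringBuilder out = new StringBuilder();
        Deque<Tag> open = new ArrayDeque<>();  // innermost open tag on top
        int pos = 0;                           // next character of text to copy
        for (Tag t : sorted) {
            // Close every open tag that ends at or before this tag's start.
            while (!open.isEmpty() && open.peek().end <= t.start) {
                pos = close(text, out, open.pop(), pos);
            }
            if (!open.isEmpty() && t.end > open.peek().end) {
                throw new IllegalArgumentException(
                        "crossing annotations: " + open.peek().type + " and " + t.type);
            }
            out.append(escape(text.substring(pos, t.start)));
            pos = t.start;
            out.append('<').append(t.type).append('>');
            open.push(t);
        }
        while (!open.isEmpty()) {
            pos = close(text, out, open.pop(), pos);
        }
        out.append(escape(text.substring(pos)));
        return out.toString();
    }

    /** Copies text up to the tag's end, writes its end tag, returns the new position. */
    private static int close(String text, StringBuilder out, Tag t, int pos) {
        out.append(escape(text.substring(pos, t.end)));
        out.append("</").append(t.type).append('>');
        return t.end;
    }

    public static void main(String[] args) {
        List<Tag> tags = new ArrayList<>();
        tags.add(new Tag("name", 0, 10));
        tags.add(new Tag("name", 20, 23));
        tags.add(new Tag("sentence", 0, 24));
        // prints: <sentence><name>John Smith</name> works at <name>IBM</name>.</sentence>
        System.out.println(toXml("John Smith works at IBM.", tags));
    }
}
```

Two spans such as 0-5 and 3-8 overlap without nesting; they are legal as annotations but have no direct inline XML form, which is why the sketch throws for them.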