In the course of processing, the Jet system builds up a lot of information about the words and phrases in a Document: simple things like parts of speech for individual words and type information (person/company/location) for names, as well as more complex things like phrases and clauses (with internal structure). We want a single class of object for capturing all of this information and associating it with a Document. The class we use for this purpose is the Annotation. An Annotation is associated with a Span (substring) of the text of a Document. It has a type and a set of features with values. For example, an Annotation can indicate that a portion of a Document is a sentence, or is a token with a given part of speech. More complex structures can be built by having Annotations that point to other Annotations.
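To make the data model concrete, here is a minimal sketch of a typed annotation over a span, with feature values that may themselves be annotations. The class names (SketchSpan, SketchAnnotation, SketchDocument) and the half-open offset convention are assumptions made for illustration; they are not Jet's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these class names are hypothetical, not Jet's API.
class AnnotationSketch {
    public static void main(String[] args) {
        String text = "Fred Smith runs Acme.";
        SketchDocument doc = new SketchDocument(text);

        // A name annotation carrying a type feature.
        SketchAnnotation name = new SketchAnnotation("name", new SketchSpan(0, 10));
        name.features.put("type", "person");
        doc.annotations.add(name);

        // A sentence annotation whose feature value points to another annotation.
        SketchAnnotation sentence =
                new SketchAnnotation("sentence", new SketchSpan(0, text.length()));
        sentence.features.put("subject", name);
        doc.annotations.add(sentence);

        for (SketchAnnotation a : doc.annotations) {
            System.out.println(a.type + " -> \"" + doc.textOf(a) + "\"");
        }
    }
}

/** A half-open character range [start, end) in a document's text (assumed convention). */
final class SketchSpan {
    final int start;
    final int end;

    SketchSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad span");
        }
        this.start = start;
        this.end = end;
    }
}

/** A typed label on a span, carrying named feature values. */
final class SketchAnnotation {
    final String type;
    final SketchSpan span;
    // Values are Objects so that a feature can hold a string or another annotation.
    final Map<String, Object> features = new LinkedHashMap<>();

    SketchAnnotation(String type, SketchSpan span) {
        this.type = type;
        this.span = span;
    }
}

/** The text plus every annotation placed on it so far. */
final class SketchDocument {
    final String text;
    final List<SketchAnnotation> annotations = new ArrayList<>();

    SketchDocument(String text) {
        this.text = text;
    }

    /** Returns the substring of the document that the annotation covers. */
    String textOf(SketchAnnotation a) {
        return text.substring(a.span.start, a.span.end);
    }
}
```

Storing feature values as plain objects is one simple way to let an annotation refer to other annotations; a real implementation may well use a dedicated feature-set type instead.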
A Document is processed in a series of stages, such as tokenization, sentence splitting, dictionary look-up, and pattern matching. Each stage uses the Annotations placed on the Document by previous stages, and adds its own Annotations to the Document.
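The staged processing described above can be sketched as a list of stages that share one document, where a later stage reads the annotations left by an earlier one. Everything here is hypothetical (the Stage interface, the Mark and PipelineDoc classes, and the deliberately naive tokenizing and sentence-splitting rules); it illustrates the flow of information, not how Jet's stages are actually written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical names, not Jet's API.
class PipelineSketch {
    public static void main(String[] args) {
        PipelineDoc doc = new PipelineDoc("Jet reads text. It adds annotations.");
        // Order matters: the sentence stage depends on token annotations.
        List<Stage> stages = List.of(new TokenStage(), new SentenceStage());
        for (Stage stage : stages) {
            stage.process(doc);
        }
        for (Mark m : doc.marks) {
            System.out.println(m.type + ": " + doc.text.substring(m.start, m.end));
        }
    }
}

/** One processing step; it may read existing marks and add new ones. */
interface Stage {
    void process(PipelineDoc doc);
}

/** A typed label over the half-open character range [start, end). */
final class Mark {
    final String type;
    final int start;
    final int end;

    Mark(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class PipelineDoc {
    final String text;
    final List<Mark> marks = new ArrayList<>();

    PipelineDoc(String text) {
        this.text = text;
    }

    /** Returns a copy of the marks of one type, in the order they were added. */
    List<Mark> ofType(String type) {
        List<Mark> result = new ArrayList<>();
        for (Mark m : marks) {
            if (m.type.equals(type)) {
                result.add(m);
            }
        }
        return result;
    }
}

/** Naive tokenizer: every run of non-whitespace characters is a token. */
final class TokenStage implements Stage {
    private static final Pattern TOKEN = Pattern.compile("\\S+");

    public void process(PipelineDoc doc) {
        Matcher m = TOKEN.matcher(doc.text);
        while (m.find()) {
            doc.marks.add(new Mark("token", m.start(), m.end()));
        }
    }
}

/** Naive splitter: a sentence ends at any token whose last character is '.'. */
final class SentenceStage implements Stage {
    public void process(PipelineDoc doc) {
        int sentenceStart = -1;
        for (Mark tok : doc.ofType("token")) {
            if (sentenceStart < 0) {
                sentenceStart = tok.start;
            }
            if (doc.text.charAt(tok.end - 1) == '.') {
                doc.marks.add(new Mark("sentence", sentenceStart, tok.end));
                sentenceStart = -1;
            }
        }
    }
}
```

The sentence stage never looks at raw whitespace; it works only from the token marks, which is the sense in which each stage builds on the previous ones.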
Annotations provide a mark-up capability very similar to that of SGML or XML (although Annotations do not have to be nested the way SGML/XML mark-up is). The Document class provides a method for converting selected Annotations on a Document to XML mark-up, and in the future will have a method for converting XML mark-up to Annotations. In addition, the Document class provides a method for viewing a Document and highlighting selected Annotations (this is very primitive at present).
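As a rough illustration of writing selected annotations out as XML mark-up, the sketch below inserts open and close tags for the chosen annotation types. The names (Tagged, XmlWriter, toXml) are hypothetical and do not reflect the signature of the Document method mentioned above. The sketch assumes the selected annotations happen to nest properly; because annotations in general may cross one another, crossing spans would produce output that is not well-formed XML, and features are not written as attributes here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: hypothetical names, not Jet's API.
class XmlSketch {
    public static void main(String[] args) {
        String text = "Fred Smith runs Acme.";
        List<Tagged> anns = List.of(
                new Tagged("sentence", 0, 21),
                new Tagged("name", 0, 10),
                new Tagged("token", 11, 15));
        // Only the selected types are written; "token" is skipped.
        System.out.println(XmlWriter.toXml(text, anns, Set.of("sentence", "name")));
    }
}

/** A typed label over the half-open character range [start, end). */
final class Tagged {
    final String type;
    final int start;
    final int end;

    Tagged(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

final class XmlWriter {
    /** One tag to be written at a character position. */
    private static final class Event {
        final int pos;
        final boolean open;
        final int length;
        final String type;

        Event(int pos, boolean open, int length, String type) {
            this.pos = pos;
            this.open = open;
            this.length = length;
            this.type = type;
        }
    }

    /** Writes the text with tags for the selected types; assumes those spans nest. */
    static String toXml(String text, List<Tagged> anns, Set<String> selected) {
        List<Event> events = new ArrayList<>();
        for (Tagged a : anns) {
            if (!selected.contains(a.type) || a.end <= a.start) {
                continue; // unselected or empty spans are skipped
            }
            events.add(new Event(a.start, true, a.end - a.start, a.type));
            events.add(new Event(a.end, false, a.end - a.start, a.type));
        }
        events.sort((a, b) -> {
            if (a.pos != b.pos) return Integer.compare(a.pos, b.pos);
            if (a.open != b.open) return a.open ? 1 : -1;        // closes before opens
            return a.open ? Integer.compare(b.length, a.length)  // outer opens first
                          : Integer.compare(a.length, b.length); // inner closes first
        });
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Event e : events) {
            out.append(escape(text.substring(cursor, e.pos)));
            out.append(e.open ? "<" + e.type + ">" : "</" + e.type + ">");
            cursor = e.pos;
        }
        out.append(escape(text.substring(cursor)));
        return out.toString();
    }

    /** Escapes the characters that would otherwise be read as mark-up. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For the sample input this prints `<sentence><name>Fred Smith</name> runs Acme.</sentence>`. Restricting output to a selected set of types is what makes this workable: the caller can choose types that are known to nest even when the full set of annotations on the text does not.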