OAK System
Proteus Project
Department of Computer Science
New York University
Availability
A beta version is available upon request.
Please send email to sekine@cs.nyu.edu.
I dispatch it once a month, so please bear with me
if you don't get a response for a while.
Here is the manual in ps format and pdf format.
General
The OAK system is a total English analyzer, which consists of
a sentence splitter, a tokenizer, a POS tagger, a stemmer,
a chunker, a Named Entity (NE) tagger, a dependency analyzer,
a parser, a function tagger and a regularizer.
It basically uses explicit rules, rather than probabilistic scores,
so that humans can modify them and hopefully improve the accuracy.
The rules are mostly extracted by transformation-based or
decision list learning methods, and they look like
regular expressions.
It can take input at any level (text, plain sentences,
tokenized, POS-tagged, chunk-tagged, dependency-tagged, parsed,
function-tagged or regularized sentences) and can produce output
at any of the same levels.
It can also handle different kinds of formats (plain, Penn Treebank's
tagged format, Penn Treebank's combined format, plain stemmed format,
stem with POS tag format, MUC format, Collins' parser format,
Tipster format, SGML format).
So it can be used as a filter or a simplifier, as well as an analyzer.
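The idea of entering and leaving the analysis at any level can be sketched as follows. This is only an illustration: the strictly linear ordering of the levels, the level names, and the `stages_between` helper are assumptions made for this sketch, not OAK's actual internals. NE tagging is left out because the list of levels above does not say where it fits.

```python
# Hypothetical sketch: levels in the order they are listed in the text above.
# OAK's real stage ordering and option names may differ.
LEVELS = [
    "text", "sentence", "tokenized", "pos-tagged", "chunk-tagged",
    "dependency-tagged", "parsed", "function-tagged", "regularized",
]

def stages_between(input_level, output_level):
    """Return the levels that must be produced to get from input to output."""
    start = LEVELS.index(input_level)
    end = LEVELS.index(output_level)
    if end < start:
        raise ValueError("output level must not precede input level")
    return LEVELS[start + 1 : end + 1]

print(stages_between("sentence", "pos-tagged"))
```

When the input and output levels are the same, the list is empty, which corresponds to using the system purely as a format filter.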
Current Situation (as of June 20, 2001)
- Sentence Splitter : Done. Quite good accuracy.
- Tokenizer : Done. Quite good accuracy.
- POS tagger and Stemmer : Done. POS tagger has a 13% lower error rate than Brill's tagger.
- Chunker : Implemented. Needs accuracy improvement.
- NE tagger : Implemented. It has 100 kinds of NE.
- Dependency Analyzer : Not yet.
- Parser : Not yet.
- Function Tagger : Implemented. Needs accuracy improvement.
- Regularizer : Adam Meyer is implementing this part.
Availability
We would like to make this tool available to anyone for research purposes.
However, if you really want it even though it is still at a development stage,
and you will be cooperative with us, we may provide it now.
Please contact sekine@cs.nyu.edu.
Manual
Here is the "under construction" manual: http://cs.nyu.edu/cs/projects/proteus/oak/manual.html
Demo: Snapshot
Tokenizer
LINUX> oak -i SENTENCE -o TOKENIZED
Oak System (0.6) March.13.2001 Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
-----
> "I'm a boy."
" I 'm a boy . "
Stemmer
LINUX> oak -i SENTENCE -o POSTAG -O STEM
Oak System (0.6) March.13.2001 Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
-----
> Tables aren't broken.
table be not break .
POS tagger
LINUX> oak -i SENTENCE -o POSTAG -O PTB_TAG
Oak System (0.6) March.13.2001 Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof./NNP Sekine/NNP promised/VBD to/TO create/VB this/DT program/NN by/IN December/NNP 2001/CD ./.
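Output in this word/TAG format is easy to post-process. The following is a minimal sketch of reading it back into (word, tag) pairs; the `parse_tagged` helper is a name chosen for this example and is not part of OAK.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

line = ("Prof./NNP Sekine/NNP promised/VBD to/TO create/VB this/DT "
        "program/NN by/IN December/NNP 2001/CD ./.")
pairs = parse_tagged(line)

# Collect the proper nouns from the demo sentence.
proper_nouns = [word for word, tag in pairs if tag == "NNP"]
print(proper_nouns)
```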
NE tagger
LINUX> oak -i SENTENCE -o NE -O MUC
Oak System (0.6) March.13.2001 Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
...
Loading NE rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof. <ENAMEX TYPE=PERSON>Sekine</ENAMEX> promised to create this program by <TIMEX TYPE=DATE>December 2001</TIMEX>.
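The tagged entities can be pulled out of MUC-style output with a regular expression. This sketch assumes only the three standard MUC tag names (ENAMEX, TIMEX, NUMEX) and non-nested tags; OAK's larger NE inventory may use other tags or types, and the `extract_entities` helper is a name chosen for this example.

```python
import re

# Matches e.g. <ENAMEX TYPE=PERSON>...</ENAMEX>, with or without quotes
# around the type. The backreference \1 requires the closing tag to have
# the same name as the opening tag.
ENTITY = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="?([A-Z_]+)"?>(.*?)</\1>')

def extract_entities(line):
    """Return (tag, type, text) triples for each tagged entity in the line."""
    return [(m.group(1), m.group(2), m.group(3)) for m in ENTITY.finditer(line)]

line = ("Prof. <ENAMEX TYPE=PERSON>Sekine</ENAMEX> promised to create this "
        "program by <TIMEX TYPE=DATE>December 2001</TIMEX>.")
entities = extract_entities(line)
for tag, ne_type, text in entities:
    print(tag, ne_type, text)
```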
Chunker
LINUX> oak -i SENTENCE -o CHUNK -O CONLL
Oak System (0.6) March.13.2001 Satoshi Sekine (NYU)
-----
Loading Dictionary ... done
Loading POS tagger rule ...done
Loading chunker quadgram ...done
Loading chunker rule ...done
-----
> Prof. Sekine promised to create this program by December 2001.
Prof. NNP B-NP
Sekine NNP I-NP
promised VBD B-VP
to TO I-VP
create VB I-VP
this DT B-NP
program NN I-NP
by IN B-PP
December NNP B-NP
2001 CD I-NP
. . O
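The B-/I-/O tags in the third column mark where each chunk begins and continues. A minimal sketch of grouping the lines above back into chunks, assuming three whitespace-separated columns (word, POS tag, chunk tag); the `chunks_from_conll` helper is a name chosen for this example and is not part of OAK.

```python
def chunks_from_conll(lines):
    """Group 'word POS chunk-tag' lines into (chunk_type, text) pairs."""
    chunks = []
    current_type, current_words = None, []
    for line in lines:
        word, pos, tag = line.split()
        # I-X continues the chunk in progress if the type matches.
        if tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)
            continue
        # Any other tag closes the chunk in progress.
        if current_words:
            chunks.append((current_type, " ".join(current_words)))
        if tag == "O":
            current_type, current_words = None, []
        else:
            # B-X starts a new chunk; a stray I-X is treated as a start too.
            current_type, current_words = tag[2:], [word]
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks

demo = [
    "Prof. NNP B-NP", "Sekine NNP I-NP", "promised VBD B-VP", "to TO I-VP",
    "create VB I-VP", "this DT B-NP", "program NN I-NP", "by IN B-PP",
    "December NNP B-NP", "2001 CD I-NP", ". . O",
]
chunks = chunks_from_conll(demo)
for chunk_type, text in chunks:
    print(chunk_type, text)
```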
Please send any comments or questions on this page by e-mail to sekine@cs.nyu.edu