DEPARTMENT OF COMPUTER SCIENCE

DOCTORAL DISSERTATION DEFENSE

Candidate: CHIA-HSIANG CHANG

Advisor: ROBERT PAIGE

**From Regular Expressions to DFA's Using Compressed NFA's**

11:30 a.m., Thursday, August 20, 1992

719 Broadway, 12th fl. conference room

Abstract

We show how to turn a regular expression *R* of length *r*
into an *O*(*s*) space
representation of McNaughton and Yamada's NFA, where
*s* is the number of occurrences of alphabet symbols in *R*, and *s* + 1 is the number of NFA states.
The standard adjacency list representation of McNaughton and Yamada's
NFA takes up
1 + 2*s* + *s*^{2} space in the worst case.
The adjacency list representation of the NFA produced by Thompson
takes up between 2*r* and 6*r* space,
where *r* can be arbitrarily larger than *s*.
Given any subset *V* of states in McNaughton and Yamada's NFA,
our representation can be used to
compute the set *U* of states one transition away from the states
in *V* in optimal time
*O*(|*V*| + |*U*|). Using McNaughton and Yamada's NFA directly, the same
calculation requires Θ(|*V*| × |*U*|) time in the worst case; using
Thompson's NFA, it requires Θ(*r*) time in the worst case.
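To illustrate the one-step transition computation that these bounds compare, here is a minimal sketch over a plain adjacency-list NFA (the encoding, names, and example NFA are ours for illustration, not the thesis's compressed representation; with a dense McNaughton–Yamada adjacency list, scanning the edges out of *V* is what can cost Θ(|*V*| × |*U*|)):

```python
def step(nfa, V, a):
    """One-step NFA transition: given a set V of states, return the set U
    of states reachable in one move on symbol a.
    nfa: dict mapping state -> dict mapping symbol -> set of successors."""
    U = set()
    for q in V:
        # Union in the successors of q on symbol a (empty set if none).
        U |= nfa.get(q, {}).get(a, set())
    return U

# Tiny epsilon-free NFA for the regular expression (a|b)*abb,
# states 0..3 with accepting state 3.
nfa = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
}
print(step(nfa, {0, 1}, 'b'))  # -> {0, 2}
```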

An implementation of our NFA representation confirms that it takes up an order of magnitude less space than McNaughton and Yamada's machine. An implementation to produce a DFA from our NFA representation by subset construction shows linear and quadratic speedups over subset construction starting from both Thompson's and McNaughton and Yamada's NFA's.

It also shows that the DFA produced from our NFA is as much as one order of magnitude smaller than DFA's constructed from the two other NFA's.
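The subset construction referred to above can be sketched as follows. This is the textbook determinization of an ε-free NFA (such as McNaughton and Yamada's), with each DFA state a frozenset of NFA states; it is an illustration of the general method, not the compressed-NFA variant implemented in the thesis:

```python
from collections import deque

def subset_construction(nfa, start, accepts, alphabet):
    """Determinize an epsilon-free NFA via subset construction.
    Returns (dfa, dfa_accepts): dfa maps frozenset-of-NFA-states ->
    dict symbol -> frozenset-of-NFA-states."""
    start_set = frozenset([start])
    dfa, dfa_accepts = {}, set()
    queue = deque([start_set])
    while queue:
        S = queue.popleft()
        if S in dfa:          # already expanded this DFA state
            continue
        dfa[S] = {}
        if S & accepts:       # any accepting NFA state makes S accepting
            dfa_accepts.add(S)
        for a in alphabet:
            # Move: union of one-step successors of every state in S.
            T = frozenset(t for q in S for t in nfa.get(q, {}).get(a, ()))
            if T:
                dfa[S][a] = T
                queue.append(T)
    return dfa, dfa_accepts

# Epsilon-free NFA for (a|b)*abb, states 0..3, accepting state 3.
nfa = {
    0: {'a': {0, 1}, 'b': {0}},
    1: {'b': {2}},
    2: {'b': {3}},
}
dfa, acc = subset_construction(nfa, 0, {3}, 'ab')
print(len(dfa))  # -> 4 DFA states for this example
```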

A UNIX egrep-compatible software tool called cgrep, based on our NFA representation, has been implemented. A benchmark shows that cgrep is dramatically faster than both UNIX egrep and GNU e?grep.

Throughout this thesis we stress the importance of syntax in the design of our algorithms. In particular, we exploit a method of program improvement in which costly repeated calculations are avoided by establishing and maintaining program invariants. This method of symbolic finite differencing has been used previously by Douglas Smith to derive efficient functional programs.