[FOM] regular expressions

Vaughan Pratt pratt at cs.stanford.edu
Wed Oct 31 02:49:21 EDT 2007


Adam Kolany wrote:
 > I am sorry for possibly trivial questions:
 >     1. Is the problem of equality of languages given by two regular
 > expresions equal decidable?


Normally one would just answer this in the affirmative and point to any 
good automata text such as Hopcroft and Ullman, Lewis and Papadimitriou, 
Sipser, etc.  However this is a nice example of where there is a really 
slick proof that is essentially categorical in nature (exhibit the 
initial automaton and then the final automaton in the category of all 
deterministic automata accepting L) yet which doesn't require knowing 
any category theory to appreciate.  (Another such is the proof that two 
algebras freely generated by sets of the same cardinality are 
isomorphic.)  I believe the late Joe Goguen was the first to point this 
out.  I like this story enough that I can't resist telling it here.

For this purpose a deterministic automaton is any graph with set Q of 
vertices, possibly infinite, with one start state and an arbitrary set 
of final states (possibly empty, possibly all of Q), such that there are 
exactly |S| edges leaving each state, called the *transitions* from that 
state, each labeled with a different letter of the alphabet S, and such 
that every state is reachable by a path from the start state.  Two or 
more of the edges from one state may go to the same state (in contrast 
to the directed graphs of graph theory which only allow one edge per 
ordered pair of vertices). There are no epsilon transitions, every edge 
must be labeled with a single letter.

The crucial propositions are as follows.

Proposition 1.  Given any language L, as a subset of S*, there exists a 
canonical deterministic automaton A_L recognizing L.

(L need not be regular, or recursive, or recursively enumerable, it can 
be any language whatsoever.)

Proposition 2.  For any regular expression denoting a language L, the 
A_L associated to L by Proposition 1 is finite, with a number of states 
at most exponential in the size of the regular expression, and can be 
obtained from the regular expression in time at most quadratic in the 
size of A_L.

Corollary.  Equivalence of regular expressions is decidable in 
exponential time.  (Meyer and Stockmeyer obtained in 1974 the tighter 
result that the problem is PSPACE-complete.)

To show Proposition 1, take an infinite tree T all of whose vertices are 
of out-degree the cardinality s of the alphabet S.  All are of in-degree 
1 save the root vertex which is of in-degree zero.  There are no 
leaves---all paths are infinite.

Label the s edges leaving each vertex of T with the respective s letters 
of S.

Label the vertices of T with words from S* according to the labels on 
the path to that vertex from the root.  Thus the root will be labeled 
epsilon (the empty word), the s vertices immediately below the root will 
be labeled with the s one-letter words, and the s^n vertices n levels 
below the root will be labeled with the s^n n-letter words, for all 
natural numbers n (which is why every path from the root must be infinite).

Make T an automaton by taking its start state to be the root vertex of T 
and its set of final states to be the vertices of T labeled with members 
of L.

It should be clear that T is a deterministic automaton accepting L.  (It 
is in fact the initial such in a suitable category of all such.)

Now associate to every vertex v of T the language w\L defined as the set 
of those words x in S* such that wx is in L, where w is the word 
labeling v per the above procedure.  Call two vertices with labels v and 
w *equivalent* when v\L = w\L, that is, when the two subtrees with 
respective roots v and w are made equivalent automata by taking the 
respective subtree's root to be the start state in each case.  (The 
final states remain marked as originally.)

One can now show that the automaton obtained by identifying equivalent 
vertices is the final automaton A_L in the category of all deterministic 
automata accepting L.  Finality establishes the canonical property 
promised by Proposition 1.

Both the meaning and significance of finality for A_L is that *any* 
automaton accepting L can be turned into A_L by identifying states. 
This is key to the last stage of the construction in Proposition 2.

It is further the case that A_L as defined above is finite if and only 
if L is regular.  (The initial automaton is always infinite as long as S 
is nonempty, the usual though not necessary assumption in automata 
theory.)  This automaton is the canonical deterministic finite state 
automaton A_L recognizing L, as promised by Proposition 1.  (It is also 
the minimal deterministic such, subject to a remark below.)

Proposition 2 is shown by a series of standard reductions.  First 
transform the regular expression into a nondeterministic finite state 
automaton A.  (The end product should have no epsilon transitions; a 
popular two-step approach first produces one with epsilon transitions 
and then eliminates them, but it can also be done in one step.)  Then 
"determinize" A by taking the set of states of A' to be the power set 
2^Q of the set Q of states of A.  Exercise: given a subset X of Q, 
determine the unique subset Y of Q that the transition from X labeled s 
must go to.  Then determine which subsets of Q should be the final 
states of A' and which the start state of A'.

The last step is to identify the equivalent states of A'.  (This is far 
from the fastest way but it is the easiest to describe.)  Start out with 
the very coarse equivalence relation such that two states are equivalent 
just when either both or neither are final (so there are only two 
equivalence classes, or one when all or no states are final).  Now 
refine this to the (new) equivalence relation holding between (old) 
equivalent states that also have (old) equivalent successor states. That 
is, if q is equivalent to q' before, and for all s the two states 
reached by the s-transitions from q and q' are equivalent before, then q 
and q' are still equivalent after, the new equivalence relation.  Refine 
to convergence (i.e. when nothing changes), necessarily in at most |2^Q| 
steps since no relation can be finer than the identity relation.  Now 
identify equivalent states.  The result is A_L.

To decide if two regular expressions are equivalent (denote the same 
language L), compute their respective canonical automata and test 
whether they are isomorphic.  (Isomorphism of automata is trivial to 
decide in comparison with graph isomorphism: start by pairing up the 
initial states, then the successor states of the initial states, and so 
on.  With graph isomorphism one does not have any good starting place in 
general.)

Note that this A_L is not necessarily minimal in the standard sense used 
in automata textbooks because some authors define "deterministic" to 
mean only that *at most* one s-transition leaves each state, whereas the 
definition needed for the above treatment calls for *exactly* one 
s-transition.  The difference is that the A_L produced above will in 
general (but not always) include a state that leads to no final state; 
the other definition eliminates that state (but no other) out of a sense 
of efficiency.

============
 >     2. Is there a (decidable) formal system for proving the above
 > equalities?

The short answer is that no finite axiomatization is known that is 
decidable.  If you omit decidable then there two good axiom systems, 
each with its own merits.

1.  Kleene algebras.  A Kleene algebra is an idempotent semiring with 
identity 1 and zero 0 satisfying 1+x+x*x* = x*, xy <= y  -->  x*y = y, 
and yx <= y  -->  yx* = y.  All these conditions are equations save the 
last two which are quasiequations (universal Horn clauses).  The 
equations of this theory are exactly the identities of regular algebra, 
such as a** = a*, a(ba)* = (ab)*a, and so on, namely what you asked for. 
  The first nontrivial result about axiomatizing these identities is due 
to V. Redko in 1967 who showed that no finite set of equations could 
axiomatize them.  Arturo Salomaa then gave an axiom schema for the 
theory.  In 1971 John Conway (of Life and surreal-numbers fame) wrote a 
book on the subject which includes the observation that the equational 
theory includes a four element algebra in which aa = a yet a* is not 
equal to a, i.e. a nonstandard model.  In 1990 Dexter Kozen gave a 
finite equational theory together with a quasiequation expressing 
induction, the equations of which were the desired equations and which 
did not have Conway's nonstandard model.  Kozen defined a Kleene algebra 
to be any model of this quasiequational (universal Horn) theory.  For 
more details see the relevant Wikipedia article at 
http://en.wikipedia.org/wiki/Kleene_algebra .

The equational fragment of Kleene algebras is decidable as per the first 
section above, but Redko showed that fragment has no standalone finite 
axiomatization.  The system axiomatized is not decidable, a common 
outcome when passing from an equational theory to a larger 
quasiequational theory.

2.  Action algebras.  In 1990 I posed and answered affirmatively the 
question of whether the equational theory of regular expressions had a 
finitely axiomatizable equational conservative extension.  That is, 
could there be a superset of the regular operations and a finitely 
axiomatized equational theory thereof, such that those equations in the 
language of regular expressions are exactly the regular identities you 
are asking about.  The extension amounts to adding "guards" or "tests" 
to the language of regular expressions, so one can think of the language 
of action algebra as guarded regular expressions.  The two new 
constructs are a --> b and a <-- b, understood as assertions "if a 
before then b now" and "a now if ever b in the future".  Concatenation 
ab is viewed as conjunction ("a and then b") while a+b is disjunction 
("a or b").  The crucial axiom pinning down * is (a --> a)* |- a --> a, 
where |- can be read as <= (less-or-equal), with a <= b in turn being 
viewed as an abbreviation for a+b = b.  (Or more simply (a --> a)* = a 
--> a which is equivalent but less elegant, asserting that a --> a is 
its own reflexive transitive closure.) Conway's nonstandard model is 
killed off by this equational theory, and moreover in every model of the 
theory a* is reflexive transitive closure in the sense that a is 
reflexive when 1 <= a and transitive when aa <= a.  For more details see 
the relevant Wikipedia article at 
http://en.wikipedia.org/wiki/Action_algebra .

It is a long-standing (17 years) open problem whether action algebra is 
decidable.  This has been looked at by a number of algebraists, mainly 
in Europe, so far without success.

Vaughan Pratt



More information about the FOM mailing list