## Lecture 2:  Regular Expressions

January 28, 2010

Text, Chapter 2

### Introduction

To understand natural language we need mechanisms for analyzing its structure and identifying particular types of constituents.

We will begin by looking at regular expressions;  although regular expressions will not be powerful enough to handle all the structural analysis tasks we will face, they will play a major role in our study of NLP.  Because they can be implemented very efficiently, they also play a significant role in many other computer science applications.

A regular expression (RE) is a pattern which can match one or more character strings.  We will begin by considering patterns over characters, but will in fact use REs mostly for patterns over words.

The simplest RE is a literal string, which matches exactly that sequence of characters:
`Grishman`
A set of characters in brackets matches any one of those characters:
`[GF]rishman`
A dash inside brackets denotes a range of characters:
`[F-I]rishman`
A question mark denotes optionality (zero or one instance of the preceding character):
`Grishmann?`
`moo?`
An asterisk (Kleene star) denotes zero or more instances of the preceding character:
`moo*`
while a plus denotes one or more instances:
`mo+`
A vertical stroke separates alternatives:
`cats|dogs`
Parentheses may be used to group strings:
`I like (cake|cookies)`
`I like brownies (very )+much`
\b forces the match to occur at a word boundary:
`\bcat\b`
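The operators above can be tried out directly with Java's `java.util.regex` package, whose syntax matches the notation used here.  The following quick check (not from the lecture; strings chosen for illustration) shows each operator matching or failing as described:

```java
import java.util.regex.Pattern;

// Exercises the RE operators introduced above using Pattern.matches,
// which tests whether the pattern matches the ENTIRE input string.
public class ReDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("[GF]rishman", "Frishman"));   // brackets: any one of G, F
        System.out.println(Pattern.matches("Grishmann?", "Grishman"));    // ?: final n is optional
        System.out.println(Pattern.matches("moo*", "mo"));                // *: zero or more o's
        System.out.println(Pattern.matches("mo+", "m"));                  // +: needs at least one o, so false
        System.out.println(Pattern.matches("cats|dogs", "dogs"));         // |: either alternative
        System.out.println(Pattern.matches("I like (cake|cookies)", "I like cake")); // grouping
    }
}
```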

### Finite-State Automata

A regular expression can be represented by a finite-state automaton, which is represented by a graph consisting of states and arcs.  The arcs correspond to transitions and are labeled with characters.  Among the states are a start state and one or more end (accepting) states.

If the labels on the arcs leaving each state are disjoint, the automaton can be implemented as a deterministic recognizer (a deterministic finite automaton or DFA).  If they are not disjoint, we require a non-deterministic recognizer (non-deterministic finite automaton or NDFA).  An NDFA must search for a match, for example by saving choice points and backtracking to a choice point whenever it gets stuck.  An NDFA can be converted to a DFA, although the DFA may have many more states.

The finite-state automaton (or the RE) can generate or recognize a set of strings;  this set constitutes the (formal) language defined by the automaton.
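As a concrete illustration (a sketch, not code from the lecture), here is a hand-built DFA recognizing the language of the RE `moo*`.  The states are 0 (start), 1 (saw `m`), and 2 (saw `mo`, possibly followed by more `o`'s); state 2 is the only accepting state, and any character with no matching arc sends the machine to a dead state:

```java
// Deterministic recognizer for the RE "moo*".
// Transitions: 0 -m-> 1 -o-> 2 -o-> 2; state 2 is accepting.
public class MooDfa {
    public static boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            switch (state) {
                case 0: state = (c == 'm') ? 1 : -1; break; // expect 'm'
                case 1: state = (c == 'o') ? 2 : -1; break; // expect the first 'o'
                case 2: state = (c == 'o') ? 2 : -1; break; // loop on further 'o's
            }
            if (state == -1) return false; // dead state: no arc matched
        }
        return state == 2; // accept only if we end in an accepting state
    }

    public static void main(String[] args) {
        System.out.println(accepts("mo"));    // true
        System.out.println(accepts("mooo"));  // true
        System.out.println(accepts("m"));     // false
    }
}
```

Because the labels on the arcs leaving each state are disjoint, no backtracking is ever needed: the recognizer runs in a single left-to-right pass over the input.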

### Java implementation of REs

Java has two primary classes for REs, Pattern and Matcher, both in the package java.util.regex.

The static method Pattern.compile translates a regular expression string into a Pattern:
`Pattern pattern = Pattern.compile(regularExpression);`
Then the matcher method creates a Matcher which matches the pattern against a given string:
`Matcher matcher = pattern.matcher(stringToBeMatched);`
Sun provides a tutorial on patterns.

Here's a basic matching program:
```java
import java.io.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {
    public static void main(String[] args) throws IOException {
        String patString = "cat";
        System.out.println("Regular expression: " + patString);
        Pattern pattern = Pattern.compile(patString);
        String target = "I play catch with my cat.";
        System.out.println("Target: " + target);
        Matcher matcher = pattern.matcher(target);
        boolean found = false;
        while (matcher.find()) {
            System.out.format("Found the text \"%s\" starting at " +
                "index %d and ending at index %d.%n",
                matcher.group(), matcher.start(), matcher.end());
            found = true;
        }
        if (!found) {
            System.out.format("No match found.%n");
        }
    }
}
```

### Formal and Natural Languages

Consider the last task (spoken numbers).  The regular expressions we have written define a formal language of spoken numbers.   However, as computational linguists, having that formal language is not enough.  We have to verify empirically that the language includes all (or almost all) the ways people speak numbers.

How can we empirically assess a regular expression in this way?  We need a corpus which has been annotated (by hand) to mark where the numbers occur.  Then we run our recognizer and have it automatically mark the matching expressions.  We count:
• correct:  system annotations which match hand annotations
• spurious:  system annotations which do not match a hand annotation
• missing:  hand annotations which do not match a system annotation
and then compute
• precision = correct / (correct + spurious)
• recall = correct / (correct + missing)
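As a toy worked example (the counts are invented for illustration), suppose the recognizer produced 8 correct, 2 spurious, and 4 missing annotations.  The metrics above can be computed as follows:

```java
// Computes precision and recall from hypothetical annotation counts.
public class Score {
    public static void main(String[] args) {
        int correct = 8, spurious = 2, missing = 4;
        // Cast to double to avoid integer division.
        double precision = (double) correct / (correct + spurious); // 8/10 = 0.800
        double recall    = (double) correct / (correct + missing);  // 8/12 = 0.667
        System.out.printf("precision = %.3f, recall = %.3f%n", precision, recall);
    }
}
```

Note the trade-off: a pattern that matches more liberally tends to raise recall (fewer missing) at the cost of precision (more spurious), and vice versa.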
...