• # Algorithmic Forays Part 4

General and Gameplay Programming

This series has since been revised and combined
In the last article, I presented the three main techniques in use for generating recognizers from regular expressions. Just to remind you, the techniques are:
• Build a NFA from the regex. "Simulate" the NFA to recognize input.
• Build a NFA from the regex. Convert the NFA to a DFA. Simulate the DFA to recognize input.
• Build a DFA directly from the regex. Simulate the DFA to recognize input. At first, I was determined to spare you from the whole DFA/NFA discussion and just use the third - direct DFA - technique for recognizer generation. Then, I changed my mind, for two reasons. First, the distinction between NFAs and DFAs in the regex world is important. Different tools use different techniques (for instance, Perl uses NFA while lex and egrep use DFA), and it is valuable to have at least a basic grasp of these topics. Second, and more important, I couldn't help falling to the charms of the NFA-from-regex construction algorithm. It is simple, robust, powerful and complete - in one word, beautiful. So, I decided to go for the second technique. [size="5"]Construction of a NFA from a regular expression Recall the basic building blocks of regular expressions: eps which represents "nothing" or "no input"; characters from the input alphabet (we used a and b most often in these articles); characters may be concatenated, like this: abb; alternation a|b meaning a or b; the star * meaning "zero or more of the previous"; and grouping (). What follows is Thompson's construction - an algorithm that builds a NFA from a regex. The algorithm is syntax directed, in the sense that it uses the syntactic structure of the regex to guide the construction process. The beauty and simplicity of this algorithm is in its modularity. First, construction of trivial building blocks is presented.
1. For eps, construct the NFA: Here i is a new start state and f is a new accepting state. It's easy to see that this NFA recognizes the regex eps
2. For some a from the input alphabet, construct the NFA: Again, it's easy to see that this NFA recognizes the trivial regex a.
Now, the interesting part of the algorithm: an inductive construction of complex NFAs from simple NFAs. More specifically, given that N(s) and N(t) are NFA's for regular expressions s and t, we'll see how to combine the NFAs N(s) and N(t) according to the combination of their regexes.
1. For the regular expression s|t, construct the following composite NFA N(s|t): The eps transitions into and out of the simple NFAs assure that we can be in either of them when the match starts. Any path from the initial to the final state must pass through either N(s) or N(t) exclusively. Thus we see that this composite NFA recognizes s|t
2. For the regular expression st (s and then t), construct the composite NFA NFA(st): The composite NFA will have the start state of N(s) and the end state of N(t). The accepting (final) state of N(s) is merged with the start state of N(t). Therefore, all paths going through the composite NFA must go through N(s) and then through N(t), so it indeed recognizes N(st).
3. For the regular expression s*, construct the composite NFA N(s*): Note how simply the notion of "zero or more" is represented by this NFA. From the initial state, either "nothing" is accepted with the eps transition to the final state or the "more than" is accepted by going into N(s). The eps transition inside N(s) denotes that N(s) can appear again and again.
4. For the sake of completeness: a parenthesized regular expression (s) has the same NFA as s, namely N(s).
As you can see, the algorithm covers all the building blocks of regular expressions, denoting their translations into NFAs. [size="5"]An example If you follow the algorithm closely, the following NFA will result for (our old friend,) the regex (a|b)*abb:

Report Article

## User Feedback

You need to be a member in order to leave a review