# Algorithmic Forays Part 5

nfa nfas states construction parse tree cpp state code

The article discusses and implements the process of taking an arbitrary regex and automatically converting it into an NFA.

Last time we met, you learned about Thompson's Construction - an algorithm for building NFAs from regular expressions. I also presented a "starter" implementation of a basic NFA class in C++. In this article, you'll see the connection between these two. I will present a complete C++ implementation of Thompson's Construction. Additionally, you will see how to represent regular expressions with parse trees, and get acquainted with the code that builds complete NFAs from these regular expressions.

**Implementing Thompson's Construction**

Thompson's Construction, as you probably remember (If you don't, it is a good time to skim through the previous article) tells us how to build NFAs from trivial regular expressions and then compose them into more complex NFAs. Let's start with the basics:

**The simplest regular expression**

The most basic regular expression is just some single character, for example

**a**. The NFA for such a regex is:

// Builds a basic, single input NFA // NFA build_nfa_basic(input in) { NFA basic(2, 0, 1); basic.add_trans(0, 1, in); return basic; }Just to remind you about our NFA implementation: The first line of the function creates a new (with no transitions yet) NFA of size 2 (that is: with 2 states), and sets state 0 to be the initial state and state 1 to be the final state. The second line adds a transition to the NFA that says "

**in**moves from state 0 to state 1". That's it - a simple regex, a simple construction procedure.

Note that this procedure is suited for the construction of an

**eps**transition as well (it was presented separately in the last article's discussion of Thompson's Construction).

**Some changes to the NFA class**

Last article's implementation of a simple NFA class was just a starting point and we have quite a few changes to make.

First of all, we need direct access to all of the class's data. Instead of providing get and set accessors (which I personally dislike), all of the class members (size, initial, final and trans_table) have been made public.

Recall what I told you about the internal representation inside the NFA class - it's a matrix representing the transition table of a graph. For each i and j, trans_table[][/arrayi][j] is the input that takes the NFA from state i to state j. It's NONE if there's no such transition (hence a lot of space is wasted - the matrix representation, while fast, is inefficient in memory).

Several new operations were added to the NFA class for our use in the NFA building functions. Their implementation can be found in nfa.cpp (one of the files in the .zip that comes with this article). For now, try to understand how they work (it's really simple stuff), later you'll see why we need them for the implementation of various Thompson Construction stages. It may be useful to have the code of nfa.cpp open in an editor, and to follow the code for the operations while reading these explanations. Here they are:

**append_empty_state**- I want to append a new, empty state to the NFA. This state will have no transitions to it and no transitions from it. If this is the transition table before the appending (a sample table with size 5 - states 0 to 4):

Then this is the table after the appending:

The shaded cells are the transitions of the original table (be they empty or not), and the white cells are the new table cells - containing NONE.**shift_states**- I want to rename all NFA's states, shifting them "higher" by some given number. For instance, if I have 5 states numbered 0 - 4, and I want to have the same states, just named 2 - 6, I will call shift_states(2), and will get the following table:

**fill_states**- I want to copy the states of some other NFA into the first table cells of my NFA. For instance, if I take the shifted table from above, and fill its first two states with a new small NFA, I will get (the new NFA's states are sky blue):

Note that using fill_states after shift_states is not incidental. These two operations were created to be used together - to concatenate two NFAs. You'll see how they are employed shortly.

**Implementing alternation: a|b**

Here is the diagram of NFA alternation from the previous article:

**Implementing concatenation: ab**

Here is the diagram of NFA concatenation from the previous article:

**Implementing star: a***

Here is the diagram of the NFA for

**a***from the previous article:

**eps**transitions added to implement the star operation.

**Specialty of NFAs constructed by Thompson's Construction**

You might have observed that all the NFAs constructed by Thompson's Construction have some very specific behavior. For instance, all the basic building blocks for single letters are similar, and the rest of the constructions just create new links between these states to allow for the alternation, concatenation and star operations. These NFAs are also special implementation-wise. For instance, note that in our NFA implementation, the first state is always the initial, and the last state is always final. You may have noted that this is useful in several operations.

**A complete NFA construction implementation**

With these operations implemented, we now have a full NFA construction implementation in nfa.cpp! For instance, the regex

**(a|b)*abb**can be built as follows:

NFA a = build_nfa_basic('a'); NFA b = build_nfa_basic('b'); NFA alt = build_nfa_alter(a, b); NFA str = build_nfa_star(alt); NFA sa = build_nfa_concat(str, a); NFA sab = build_nfa_concat(sa, b); NFA sabb = build_nfa_concat(sab, b);With these steps completed, sabb is the NFA representing

**(a|b)*abb**. Note how simple it's to build NFAs this way! There's no need to specify individual transitions like we did before. In fact, it's not necessary to understand NFAs at all - just build the desired regex from its basic blocks, and that's it.

**Even closer to complete automation**

Though it has now become much simpler to construct NFAs from regular expressions, it's still not as automatic as we'd like it to be. One still has to explicitly specify the regex structure. A useful representation of structure is an expression tree. For example, the construction above closely reflects this tree structure:

**(a|b)*abb**is represented here in a tree form, just like an arithmetic expression.

Such trees in the world of parsing and compilation are called expression trees, or parse trees. For example, to implement an infix calculator (one that can calculate, for example: 3*4 + 5), the expressions are first turned into parse trees, and then these parse trees are walked to make the calculations.

Note, this is, as always, an issue of representation. We have the regex in a textual representation:

**(a|b)*abb**and we want it in NFA representation. Now we're wondering how to turn it from one representation to the other. My solution is to use an intermediate representation - a parse tree. Going from a regex to a parse tree is similar to parsing arithmetic expressions, and going from a parse tree to a NFA will be now demonstrated, using the Thompson's Construction building blocks described in this chapter.

**From a parse tree to a NFA**

A parse tree in our case is just a binary tree, since no operation has more than two arguments. Concatenation and alternations have two arguments, hence their nodes in the tree have two children. Star has one argument - hence only the left child. Chars are the tree leaves. Take a good look at the tree drawn above, you'll see this very clearly.

Take a look at the file regex_parse.cpp that comes with this article. It has alot in it, but you only need to focus only on some specific things for now. First, let's look at the definition of parse_node:

typedef enum {CHR, STAR, ALTER, CONCAT} node_type; // Parse node // struct parse_node { parse_node(node_type type_, char data_, parse_node* left_, parse_node* right_) : type(type_), data(data_), left(left_), right(right_) {} node_type type; char data; parse_node* left; parse_node* right; };This is a completely normal definition of a binary tree node that contains data and some type by which it is identified. Let us ignore, for the moment, the question of how such trees are built from regexes (if you're very curious - it's all in regex_parse.cpp), and instead think about how to build NFAs from such trees. It's very straight forward since the parse tree representation is natural for regexes. Here is the code of the tree_to_nfa function from regex_parse.cpp:

NFA tree_to_nfa(parse_node* tree) { assert(tree); switch (tree->type) { case CHR: return build_nfa_basic(tree->data); case ALTER: return build_nfa_alter(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case CONCAT: return build_nfa_concat(tree_to_nfa(tree->left), tree_to_nfa(tree->right)); case STAR: return build_nfa_star(tree_to_nfa(tree->left)); default: assert(0); } }Not much of a rocket science, is it? The power of recursion and trees allows us to build NFAs from parse trees in just 18 lines of code...

**From a regex to a parse tree**

If you've already looked at regex_parse.cpp, you surely noted that it contains quite a lot of code, much more than I've show so far. This code is the construction of parse trees from actual regexes (strings like

**(a|b)*abb**).

I really hate to do this, but I won't explain how this regex-to-tree code works. This article series is complicated enough without getting into parsing. You'll just have to believe me it works (or study the code - it's there!). As a note, the parsing technique I employ to turn regexes into parse trees is called Recursive Descent parsing. It's an exciting topic and there is plenty of information available on it, if you are interested.

**We've come a long way...**

Let's see what we have accomplished: we have implemented the complete process of taking regular expressions and turning them into NFAs - automatically! Our program turns the regexes into parse trees, walks these parse trees and creates NFAs from them using Thompson's Construction.

To see this in action, compile the files nfa.cpp and regex_parse.cpp together (make sure nfa.h resides in the same directory). This would be the command for those who use GCC:

g++ -o regex2nfa regex_parse.cpp nfa.cpp

regex_parse.cpp contains a small main function that takes the first argument and displays the NFA resulting from it. For instance, running:

regex2nfa "(a|b)*abb"

Displays:

This NFA has 11 states: 0 - 10

The initial state is 0

The final state is 10

Transition from 0 to 1 on input EPS

Transition from 0 to 7 on input EPS

Transition from 1 to 2 on input EPS

Transition from 1 to 4 on input EPS

Transition from 2 to 3 on input a

Transition from 3 to 6 on input EPS

Transition from 4 to 5 on input b

Transition from 5 to 6 on input EPS

Transition from 6 to 1 on input EPS

Transition from 6 to 7 on input EPS

Transition from 7 to 8 on input a

Transition from 8 to 9 on input b

Transition from 9 to 10 on input b

Which is, if you recall, the exact same NFA we dutifully crafted by hand for

**(a|b)*abb**in the previous article. Now, we can generate it automatically!

**A note about the source code**

Along with this article, comes a .zip file containing all the code. It currently consists of 3 files: nfa.h, nfa.cpp and regex_parse.cpp (which contains a main() function). All the code, like all my future code for the series is in the public domain - you may do anything you wish with it. If you have any problem compiling and running it, feel free to contact me, I'll help. I'd also be quite happy to receive feedback and bug reports.

© Copyright by Eli Bendersky, 2003. All rights reserved.

## 2 Comments

Note: GameDev.net moderates article comments.