• entries
  • comments
  • views

ANL Expression Parsing

Sign in to follow this  


Over the last several years, my noise library (ANL) has been a sort of sink for whatever miscellaneous time I get in between work, kids, the house, game development, game playing, etc... It has evolved quite significantly from its roots as basically a libNoise knockoff. One of the key goals I have always had in mind has been to decrease the redundancy and boiler-plate involved in creating complex noise functions.

In the beginning, ANL followed libNoise in its structure. Noise functions were composed by chaining instances of various function classes. Creating a noise function was very clunky, as you had to instance the function objects then chain them manually by parameter passing. It was a pointer-indirection hell with a LOT of manual typing redundancy. Observe:

anl::CImplicitFractal frac(anl::FBM, anl::GRADIENT, anl::QUINTIC, octaves, freq, false);anl::CImplicitAutoCorrect ac(0,1);ac.setSource(&frac);anl::CImplicitSelect(0, 1, &ac, 0.5, 0.1);It worked, but it took a lot of typing. Over time, I simplified things with various interface classes and some Lua code that allowed me to specify function chains using Lua tables. The latest major re-write of the library formulates noise functions as simple arrays of instructions in a virtual-machine-ish fashion. The primary interface is quite similar to the old tree-builder interface added to the initial version, ie you can build modules in this manner:

k=anl::CKernel();k.select(k.constant(0), k.constant(1), k.simplefBm(anl::GRADIENT, anl::QUINTIC, 6, 2, 123123), k.constant(0.5), k.constant(0.1));This format is much more concise, but still requires the mechanism of making function calls on k.

Something I have wanted to implement for several years now is the ability to construct a noise function from an expression. The other day in the gd.net chat, we were talking about implementing simple calculator functionality to evaluate an expression, and it got me motivated to start working on an ExpressionBuilder class for ANL that can parse an expression string and construct a function chain from it. This morning, I pushed a commit of my initial work on this. This expression builder functionality lets you write expressions such as:

clamp(gradientBasis(3,rand)*0.5+0.5,0,1)+sin(x)*3and the code will parse the expression and generate the functions within a kernel, returning the index of the final function in the chain.

If you've never written an expression parser/evaluator, then know that there are essentially 3 steps to the process. 1) Split the input string into a stream of 'tokens' 2) Convert the token stream into a format the computer can easily evaluate 3) evaluate the final expression and return an answer.

The first simple tokenizing a new programmer is likely to encounter is a simple string.split operation, which splits up a string into chunks based on whitespace. Such a split operation might look like this:

std::vector tokenize(const std::string &s){ std::istringstream st(s); std::vector vec; std::string token; while(st >> token) vec.push_back(token); return vec;}This simple code accepts a string as input, and returns a vector of the individual tokens. However, the issue with a simple tokenizer like this is that individual tokens must be delineated with whitespace. The above example expression, then, would have to be written as:

clamp ( gradientBasis ( 3 , rand ) * 0.5 + 0.5 , 0 , 1 ) + sin ( x ) * 3which, obviously, is annoying as hell. Make a simple mistake of omitting a space, and suddenly what should be 2 separate tokens get merged into a single token that, probably, doesn't correcly match a token pattern. The expression parser wants tokens to 'match' a pattern in order to determine what the token is. You have your identfiers (start with a letter, can contain some combination of letters, numbers and _ characters), your numbers (contain digits, possibly a leading -, possibly a decimal point somewhere, etc..), your operators (math operations such as +, -, ^, etc..), your parentheses (open and close), and so forth. These tokens have to match the correct pattern so the evaluator knows what it is, but if two tokens get 'squished' together because you forgot a space, the result is likely a parse error, where the parser erroneously interprets the squished tokens as a different token altogether, or simply can't interpret it as any kind of valid token.

So, the 'simple' tokenizer won't work for general expressions. That means, you have to write a more robust (ie, more complicated) tokenizer. Such a thing is going to iterate the string character by character, attempting to match patterns as it goes. It will "eat" whitespace as it works (ie, before attempting to parse a token it will discard any leading whitespace), and attempt to build a token of a particular type based on the first non-whitespace character it encounters. Did it encounter a letter? Then the token is likely either a function name or a variable name, so parse an identifier. Starting from that first character, it will read characters until it encounters one that is not valid for an identifier: an operator, a parentheses, a comma, or whatever. As soon as this pattern-breaking character is encountered, then the tokenizer packages up the characters it has read, labels them as a FUNCTION or VARIABLE token, and inserts it into the token stream. Then it continues on, parsing the operator or parentheses it encountered.

At the end of the tokenizing, you end up with a token stream. This stream is essentially the same as the expression itself. It is in the same order, the chief difference is that it is 'split up' into easy-to-digest chunks, and each chunk is labeled as to what 'type' it is, ie NUMBER, FUNCTION, OPERATOR, and so forth. However, it's still not in a format that the computer can easily evaluate.

Computers are different from you and I. You or I could take an expression like x+3*y-9 and figure out an answer. You know from math class that multiplication comes first, so you're going to multiply 3 by y. Then you're going to add that to x, and subtract 9 from the whole thing. For us, it's 'easy' to interpret such an expression string. But the computer has a hard time interpreting it in the initial format. Part of is lies in the formal idea of operator 'precedence'. You know that multiplication comes first, so you skip to that part first. Your natural language processing and pattern processing 'brain' knows how to find the pieces that have to be calculated first. But a computer has to be specifically told which operations to do, and in what order to do them, and it is difficult to figure that out inherently from an expression such as we are used to looking at.

The trick is to convert the expression from its current format (commonly called 'infix') to a format that it can work with more easily, called 'postfix'. Infix simply means that the operators are 'inside' the operation. Postfix means that the operators come at the end of the operation. For example, the expression 4*3 in infix would equate to the expression 4 3 * in postfix. Similarly, the infix expression 5*(3+5) would convert to 5 3 5 + *.

A postfix operation specifies the operands first, followed by the operator to use on the operands. It is 'easy' for us to read an infix expression, but hard for us to read the postfix format of the expression, whereas it is 'easy' for the computer to read the postfix, and harder for it to read the infix. So, once an expression has been successfully converted into a stream of valid tokens, the next step is conversion from infix to postfix.

The algorithm I use is called the Shunting yard algorithm. This algo is so-named due to it's similarity to how rail cars are split up and assembled into trains in a rail-yard. The linked wikipedia article describes the algorithm fairly well. Essentially, it is a series of 'rules' for how each token in a stream is to be processed. The algorithm uses 2 data structures: the output vector (which will be a token stream converted to postfix) and an operator stack, onto which operator tokens or function tokens can be pushed. The algorithm works by iterating the token stream from first to last, and for each token:

1) If it's a number or variable token, push it into the output vector
2) If it's a function token, push it onto the operator stack
3) If it's an argument separator (comma), pop operators off of the top of the operator stack, until a left parentheses ( is encountered.
4) If it's an operator, compare the operator's precedence with the precedence of the operator currently at the top of the stack (if the stack is not empty) and if the precedence of the operator being considered is less-than-or-equal to the one on top of the stack, then pop that operator off the stack and push it into the output vector. Keep going until either the stack is empty, or the operator on top has less precedence. Then, push the operator being considered onto the stack.
5) If left parentheses, push it onto the stack
6) If right parentheses, then pop operators off of the stack and push them into the output vector, until you get to a ( parentheses. Pop that off and discard. If the next token is a function, pop it off the stack and push it into the vector.
7) Once the end of the input stream is reached, pop all remaining operators off of the stack and push them into the vector. Then, return the vector to caller.

The result should be the expression converted to postfix notation, barring any errors.

The postfix notation has the characteristic that all of the parentheses and commas in the expression are eliminated, so that no parentheses or comma tokens end up in the output stream. Only operands (number or variable), functions and operators are present in the stream. The operands and operators are ordered such that if the postfix token stream is evaluated, the order implied by the parentheses in the original expression is preserved. For example, in the expression 4*3+5 the resulting postfix will be 4 3 * 5 +, whereas with the expression 4*(3+5) the postfix will be 4 3 5 + *.

Evaluating a postfix is a fairly simple operation, involving yet another stack. The stack this time is used to hold operands (numbers or variable tokens).

To evaluate a postfix stream, simply iterate the stream, and for each token:

1) If it's a number, push the number onto the operand stack
2) If it's a variable, look up the value of the variable and push it onto the stack
3) If it's an operator, pop 2 values off the stack (right, left), perform the operation described by the operator using the two operands, and push the result onto the stack.
4) If it's a fuction, pop as many operands as the function requires off the stack, call the function, and push the result back onto the stack.

When all is said and done, there should be one value left on top of the stack. This is the result of the expression.

The evaluator I implemented for the ExpressionBuilder works similarly to this, except rather than using a stack for operands, I use a stack for CInstructionIndex values returned from the various functions of CKernel. When evaluating the postfix stream, if a token is a number or variable, then the number/variable is passed to a call to CKernel::constant() and the resulting instruction index is pushed on the stack. If a token is an operator, such as *, then 2 indices will be popped, then the corresponding math function in CKernel is called, and the result pushed onto the stack. ie, CKernel::multiply(left,right). And so it goes, until the final entry on the stack will be the index of the final function in the chain.

The ExpressionBuilder implements most of the functions in the CKernel interface. (I still have to figure out the best way to implement the fractal functions, though.) Some of the CKernel functions, such as x(), are implement as variable rather than functions, meaning that in the expression you can use x instead of having to use x(), in order to get the value of the x part of the input coordinate. Saves a little bit of typing. Similarly, radial is implemented as a variable rather than a function.

The ExpressionBuilder also implements 3 'special' variables: rand, rand01 and index. The rand token, when encountered, results in a PRNG call to get a random number, which is passed to CKernel::seed(). The rand01 token results in a PRNG call to get a random number converted into the range 0,1 and passed to CKernel::constant(). The index token is not yet implemented; it's what I'll be working on today. This token will allow you to 'index' the results of previously-evaluated expressions as a means of 'including' them into the current expression. For certain function chains that might be used repeatedly throughout a larger function chain, this is the way to go.

So far, the code I have pushed works. I haven't tested it very thoroughly, and I don't completely trust the tokenizer, but I'll continue to work on it throughout the days of my time off. I'm pretty happy to have finally started this project, though.

Sign in to follow this  


Recommended Comments

Given that you samples all seem to be valid C++ expressions, why not implement them as such, rather than going to all the trouble of parsing a string?


It seems that you could construct the return and parameter types of the various functions such that they would return a fully-functional generator object at the end.

Share this comment

Link to comment

Parsing from a string really isn't all that much trouble. It has the added benefit that I can store a module definition as a string in a text file, and use it as-is regardless of whether the app is written in C++ or Lua. I have a prototype graph/node based visual editor in the works that is being written in Lua, and having a consistent storage and serialization medium will be a benefit as I work further on it.

Share this comment

Link to comment
Nice work.

I prefer to build a node tree from expressions, then walk the tree recursively to calculate the result. It makes future expansion to more complicated things like functions, variables etc much more feasible.

But to be fair, I'm talking more about scripting languages, which in your case would probably be better solved by using a third party.

Share this comment

Link to comment

Well, the node tree structure is still there in ANL, it's just that I removed the pointer indirection and packed the nodes into a flat array, with array indexing forming the parent connections. Parsing an expression string still results in the same kind of node structure as always. Just a little more concise, and more easily stored as external data. It fills the same niche that my old Lua table descriptors did, but this sytem can be used the same in C++ as in Lua.

Share this comment

Link to comment

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now