Evaluating expressions.(Expression trees?)

Started by
12 comments, last by BeanDog 18 years, 5 months ago
Hi, I'm writing a script engine and have the scripts parsed and tokenized, including expressions but I can't manage to code something that evaluates the expressions. Can someone point me in the direction of a good easy to understand tutorial? I've found a couple but they've been complex and not easy to follow at all. What interests most are expression trees. Since my tokenized set up is already one object per token, adding a tree/leaf structure would be very simple i'm just not sure of two things. 1.How do I construct the tree? Do I parse from left to right of the expression? It being very important () parantheses(sp?) is taken into account. I'd be grateful if someone posted some psudeo code that does this, given tokens can be Multiply,Plus,Sub,Or Numeric (there are more but those are the essential ones) 2.Once I have a tree constructed, how do I evalute it maintaing operate precedence? A good tutorial on expression trees anywhere? Or anyone know of some code that lets me manually construct an expression tree by passing tokes and then can evaluate it?
Advertisement
You may have better luck searching for "abstract syntax tree".
Do you have any pages in paticular that may be of help. I just googled abstract syntax tree and tutorials on it and I couldn't find anything to do with expressions.

Bumpy bump.
Edit: Uh.. ok, my diagrams got completely messed up for some reason. Not sure what's going on there.

The aim is to end up with a tree that looks something like this:
example input:a + b * cexpression tree:  +       . / \      .a   *     .   / \    .  b   c   .

The tree can be evaluated recursively. Starting at the root node, you check which type of node it is (an operator, eg +, or a variable or a constant, or whatever else it could be, depending on your scripting language). If it's an operator, then you recursively evaluate its operands (in this case 'a', and 'b * c'), and then perform the operation on the two results.

An important point to note is that the operator precedence rules are applied when you constructed the tree, which is why it is:
  +       <-- correct / \      .a   *     .   / \    .  b   c   .NOT:    *     <-- WRONG   / \    .  +   c   . / \      .a   b     .


The tree is constructed by a parser (it's the parser that takes precedence rules into account) - search for 'recursive descent parser' for explanations of one way to construct a parser.

John B

[Edited by - JohnBSmall on November 5, 2005 10:56:50 AM]
The best thing about the internet is the way people with no experience or qualifications can pretend to be completely superior to other people who have no experience or qualifications.
I've just finished implementing such an expression parser (well, mine is a little different, but that doesn't matter). Anyway, I just used the Composite pattern for representing the expression data. Mine operates at compile-time and so it works off of some template trickery (it's much more similar to Boost.Lambda in nature), but the same general idea applies.
There's a parsing library in Boost that you should check out: Spirit.
It's good that you have a tokenizer done. A good parser is built atop a good tokenizer. I've just taken a class on grammars and parsing, so hopefully I can give you some useful advice.

What you need to define is some rules for your expression language (or syntax, whatever you want to call it). For instance, assuming your tokenizer gives you a series of tokens like "(", ")", "*", "314", "3.2", "+", etc., you could write the following rules (these are ripped shamelessly from the web site of Spirit, of boost.org):

group ::= '(' expression ')'
factor ::= integer | group
term ::= factor (('*' factor) | ('/' factor))*
expression ::= term (('+' term) | ('-' term))*

Here, ::= means that the item on the left is defined on the right. So, a group is an opening paren followed by an expression object (defined later), followed by a close paren. If something isn't in the right place as you try to parse the group, you have invalid input. The trailing asterisks mean a list of 0 or more of what comes before. So a term is a factor followed by an asterisk and a factor or a forward slash and a factor, 0 or more times. Make sense?

So, abstractly, we define a few ideas as groups, factors, terms, and expressions. One of them is the "start symbol", expression in this case. This means that anything in this language can be defined using the definition of expression given above.

I'll skip a lot of the crazy thinking problems and get right to the punch. To get this working easily, you need a class for each rule. So a class for expression, for term, for factor, and for group. So assuming your tokenizer has a GetNextToken() method, you could write something like this for the constructor for an expression:
Expression::Expression(Tokenizer &t){  /*  Since an expression must begin with a term, try  to parse a term from the current location in the  tokenizer stream.  If the term class can't parse  a valid term from here, then you have valid input.  */  m_Term1 = new Term(Tokenizer &t);  if(m_Term1.Error())  {    this->SetError();    return;  }  //Now we get 0 or more +'s and -'s followed by terms.  while(t.PeekToken() == "+" || t.PeekToken() == "-")  {    m_Operators.push_back(t.GetNextToken());    m_OtherTerms.push_back(new Term(Tokenizer &t));    if(m_OtherTerms.back()->Error())    {      this->SetError();      return;    }  }}


You do this for each rule, and believe it or not, the whole thing works! We wrote a mini-Prolog parser/interpreter for the class. It was a lot of code (40 rules or so), but it was pretty easy to write because it's so cut and dried. In fact, there are programs to write programs that do this, because all you really need to know is the rules to write the code. There aren't any tricks to it.

One thing that might throw you is the rule for factors. It could be either an integer or a group. So, just try to read an integer from the tokenizer. If you can't, try to read a group. If that fails, too, you have invalid input.

Then, on each class, write a little function to evaluate it. It will be recursive, really. For instance, some more pseudocode:
float Expression::Evaluate(){  float ret = m_Term1.Evaluate();  for(int i = 0; i < m_Operators.size(); i++)  {    if(m_Operators == "+")      ret += m_OtherTerms.Evaluate();    else      ret -= m_OtherTerms.Evaluate();  }}


The rules of precedence are built into the rules. You see, the expression (your start symbol, where you initially call the parse function, or here, the constructor), has the lowest precedence. The + and - operations will happen last. The terms that it calls Evaluate() on are the next order of precedence, * and /. Those in turn work on either literal numbers or parenthesized statements (which always get evaluated first). This will work to any depth with any statement you can be given.

Really, this is the accepted way to do this sort of thing. Of course, I would highly suggest that you check out Spirit, as stated above, as it will ease the coding a *LOT*, especially if your scripting language is more than just numerical evaluation. With Spirit, you could put together a full C parser in a few pages of code.

I hope this helps. I wish someone would've explained this to me a few years ago :-)



~BenDilts( void );
Thanks for the help, but I'm still having trouble understanding how the grammer is defined.

How do I parse the grammer and apply it to the tokens? I can't find any information on that. Do I run the grammer through the toknizer?

Could you explain it english steps please? I've not found a single tutorial that does anything but tell you what the grammer does, not how it does it.
The tokenizer is a very simple program. All it does is take in a stream of characters and put out a stream of higher-level tokens. For instance, a tokenizer would take in the following string:

(435-2)+4*2

and provide the following list of tokens:

OpenParen
Integer(435)
Minus
Integer(2)
CloseParen
Plus
Integer(4)
Times
Integer(2)

Here's a trace of what happens:
Still unparsed: (435-2)+4*2Then the grammar parser takes this list and starts at the beginning.  It callsthe expression parser (the start symbol of the grammar), which says, "OK, Ineed to get an 'expression'.  That means I need to begin with a term."    Still unparsed: (435-2)+4*2  So the 'expression' parser calls the 'term' parser.  The term parser says, "OK,  I need a 'term'.  That means I'm beginning with a 'factor'."  So the term  parser calls the factor parser.      Still unparsed: (435-2)+4*2    The factor parser says, "OK, we're supposed to have a factor right here.  I    begin with an integer."  So we peek at the first token in the stream.  It's    not an integer!  So that won't work.  But the factor can also be a group.    So the 'factor' parser calls the 'group' parser.      Still unparsed: (435-2)+4*2      The 'group' parser says, "OK, I need to get a group from this token      list."  The 'group' parser gets the first token from the list, and it is      indeed an open paren, the first thing needed in his definition.  Then he      sees that the next thing he needs to succeed is an expression.  So he      calls the expression parser.        Still unparsed: 435-2)+4*2        The expression parser says, "OK, I need to get an 'expression'.  That        means I need to begin with a term."  So the 'expression' parser calls        the 'term' parser.            Still unparsed: 435-2)+4*2          The term parser says, "OK, I need a 'term'.  That means I'm beginning          with a 'factor'."  So the term parser calls the factor parser.              Still unparsed: 435-2)+4*2            The factor parser says, "OK, we're supposed to have a factor right            here.  I begin with an integer."  So we peek at the first token in            the stream.  It is indeed an integer (435)!  So the factor class            takes that integer and stores it internally so that its Evaluate            function can return 435 (since this factor is merely an integer            rather than a group, its other option).  The factor parser returns            success.          Still unparsed: -2)+4*2          The term parser sees that the factor parser has returned          success, so a factor has been successfully read from the token          stream.  It peeks to see if the next token is a Times or Divide, but          it's not, so the term knows it has no more factors.  But it did find          enough to match its rule, so it returns success.        Still unparsed: -2)+4*2        The expression parser sees that it successfully found its first term.        It stores the instance of the term parser for calling Evaluate on        inside of the expression's own Evaluate function.         It peeks to see if the next token is Plus or Minus.  It is Minus, so        the Expression parser class remembers that it found a minus        sign for use in its Evaluate function.  It then calls the term        parser again.          Still unparsed: 2)+4*2          The term parser says, "OK, I need a 'term'.  That means I'm beginning          with a 'factor'."  So the term parser calls the factor parser.              Still unparsed: 2)+4*2            The factor parser says, "OK, we're supposed to have a factor right            here.  I begin with an integer."  So we peek at the first token in            the stream.  It is indeed an integer (2)!  So the factor class            takes that integer and stores it internally and returns            success.          Still unparsed: )+4*2          The term parser sees that the factor parser has returned          success, so a factor has been successfully read from the token          stream.  It peeks to see if the next token is a Times or Divide, but          it's not, so the term knows it has no more factors.  But it did find          enough to match its rule, so it returns success.        Still unparsed: )+4*2        The expression got another success result (this time the factor        parser found 2).  The expression parser stores this term internally for        use in its Evaluate function also.  The expression parser peeks the next        character to see if it's a Plus or Minus (it's not, it's a        CloseParen), so the expression knows that its list of terms is done.         It returns success.      Still unparsed: )+4*2      The group parser who called this instance of the expression parser sees      that the expression parser returned success, so it stores that      expression parser instance internally so it can return the expression's      Evaluate() as its own Evaluate() value.  The group parser reads a token      from the tokenizer and finds that it is indeed a CloseParen, so it      returns success (its definition is over).    Still unparsed: +4*2    The factor parser sees that the group parser returned success, so it    stores the group parser instance internally to call Evaluate() on.  The    factor parser returns success.  Still unparsed: +4*2  The term parser sees that the factor parser has returned  success, so a factor has been successfully read from the token  stream.  It peeks to see if the next token is a Times or Divide, but  it's not, so the term knows it has no more factors.  But it did find  enough to match its rule, so it returns success.Still unparsed: +4*2The expression parser sees that the term parser returned success, so it storesthat term internally for use later.  It peeks to see if the next token is Plusor Minus.  It is Plus, so it stores that internally.  It knows the next part ofthe expression must be another term, so it calls the term parser.Still unparsed: 4*2


I think you get the idea. Or if you don't, more text won't help you :-)

The point is, this will produce a sort of tree. The tree is much more complex than the one you wanted, but it will work every time. This expression ((435-2)+4*2) will produce the following tree:

                    expression                |        |          |            term         +           term              |                      |  |   |              factor             factor   *    factor              |                   |            |            group              integer      integer              |                  |               |       (  expression  )          4               2          |   |    |       term   -   term       |              |   factor            factor      |                 |  integer            integer      |                 |    435                 2


Each node in the tree can call Evaluate and return the value of everything below it, since its children are stored internally.

Hope that helps.



~BenDilts( void );

[Edited by - BeanDog on November 6, 2005 8:44:48 PM]

This topic is closed to new replies.

Advertisement