Sign in to follow this  
ScopeDynamo

Evaluating expressions.(Expression trees?)

Recommended Posts

Hi, I'm writing a script engine and have the scripts parsed and tokenized, including expressions but I can't manage to code something that evaluates the expressions. Can someone point me in the direction of a good easy to understand tutorial? I've found a couple but they've been complex and not easy to follow at all. What interests most are expression trees. Since my tokenized set up is already one object per token, adding a tree/leaf structure would be very simple i'm just not sure of two things. 1.How do I construct the tree? Do I parse from left to right of the expression? It being very important () parantheses(sp?) is taken into account. I'd be grateful if someone posted some psudeo code that does this, given tokens can be Multiply,Plus,Sub,Or Numeric (there are more but those are the essential ones) 2.Once I have a tree constructed, how do I evalute it maintaing operate precedence? A good tutorial on expression trees anywhere? Or anyone know of some code that lets me manually construct an expression tree by passing tokes and then can evaluate it?

Share this post


Link to post
Share on other sites
Edit: Uh.. ok, my diagrams got completely messed up for some reason. Not sure what's going on there.

The aim is to end up with a tree that looks something like this:

example input:
a + b * c

expression tree:
+ .
/ \ .
a * .
/ \ .
b c .


The tree can be evaluated recursively. Starting at the root node, you check which type of node it is (an operator, eg +, or a variable or a constant, or whatever else it could be, depending on your scripting language). If it's an operator, then you recursively evaluate its operands (in this case 'a', and 'b * c'), and then perform the operation on the two results.

An important point to note is that the operator precedence rules are applied when you constructed the tree, which is why it is:

+ <-- correct
/ \ .
a * .
/ \ .
b c .
NOT:
* <-- WRONG
/ \ .
+ c .
/ \ .
a b .



The tree is constructed by a parser (it's the parser that takes precedence rules into account) - search for 'recursive descent parser' for explanations of one way to construct a parser.

John B

[Edited by - JohnBSmall on November 5, 2005 10:56:50 AM]

Share this post


Link to post
Share on other sites
I've just finished implementing such an expression parser (well, mine is a little different, but that doesn't matter). Anyway, I just used the Composite pattern for representing the expression data. Mine operates at compile-time and so it works off of some template trickery (it's much more similar to Boost.Lambda in nature), but the same general idea applies.

Share this post


Link to post
Share on other sites
It's good that you have a tokenizer done. A good parser is built atop a good tokenizer. I've just taken a class on grammars and parsing, so hopefully I can give you some useful advice.

What you need to define is some rules for your expression language (or syntax, whatever you want to call it). For instance, assuming your tokenizer gives you a series of tokens like "(", ")", "*", "314", "3.2", "+", etc., you could write the following rules (these are ripped shamelessly from the web site of Spirit, of boost.org):

group ::= '(' expression ')'
factor ::= integer | group
term ::= factor (('*' factor) | ('/' factor))*
expression ::= term (('+' term) | ('-' term))*

Here, ::= means that the item on the left is defined on the right. So, a group is an opening paren followed by an expression object (defined later), followed by a close paren. If something isn't in the right place as you try to parse the group, you have invalid input. The trailing asterisks mean a list of 0 or more of what comes before. So a term is a factor followed by an asterisk and a factor or a forward slash and a factor, 0 or more times. Make sense?

So, abstractly, we define a few ideas as groups, factors, terms, and expressions. One of them is the "start symbol", expression in this case. This means that anything in this language can be defined using the definition of expression given above.

I'll skip a lot of the crazy thinking problems and get right to the punch. To get this working easily, you need a class for each rule. So a class for expression, for term, for factor, and for group. So assuming your tokenizer has a GetNextToken() method, you could write something like this for the constructor for an expression:

Expression::Expression(Tokenizer &t)
{
/*
Since an expression must begin with a term, try
to parse a term from the current location in the
tokenizer stream. If the term class can't parse
a valid term from here, then you have valid input.
*/
m_Term1 = new Term(Tokenizer &t);
if(m_Term1.Error())
{
this->SetError();
return;
}

//Now we get 0 or more +'s and -'s followed by terms.
while(t.PeekToken() == "+" || t.PeekToken() == "-")
{
m_Operators.push_back(t.GetNextToken());
m_OtherTerms.push_back(new Term(Tokenizer &t));
if(m_OtherTerms.back()->Error())
{
this->SetError();
return;
}
}
}


You do this for each rule, and believe it or not, the whole thing works! We wrote a mini-Prolog parser/interpreter for the class. It was a lot of code (40 rules or so), but it was pretty easy to write because it's so cut and dried. In fact, there are programs to write programs that do this, because all you really need to know is the rules to write the code. There aren't any tricks to it.

One thing that might throw you is the rule for factors. It could be either an integer or a group. So, just try to read an integer from the tokenizer. If you can't, try to read a group. If that fails, too, you have invalid input.

Then, on each class, write a little function to evaluate it. It will be recursive, really. For instance, some more pseudocode:

float Expression::Evaluate()
{
float ret = m_Term1.Evaluate();
for(int i = 0; i < m_Operators.size(); i++)
{
if(m_Operators[i] == "+")
ret += m_OtherTerms[i].Evaluate();
else
ret -= m_OtherTerms[i].Evaluate();
}
}


The rules of precedence are built into the rules. You see, the expression (your start symbol, where you initially call the parse function, or here, the constructor), has the lowest precedence. The + and - operations will happen last. The terms that it calls Evaluate() on are the next order of precedence, * and /. Those in turn work on either literal numbers or parenthesized statements (which always get evaluated first). This will work to any depth with any statement you can be given.

Really, this is the accepted way to do this sort of thing. Of course, I would highly suggest that you check out Spirit, as stated above, as it will ease the coding a *LOT*, especially if your scripting language is more than just numerical evaluation. With Spirit, you could put together a full C parser in a few pages of code.

I hope this helps. I wish someone would've explained this to me a few years ago :-)



~BenDilts( void );

Share this post


Link to post
Share on other sites
Thanks for the help, but I'm still having trouble understanding how the grammer is defined.

How do I parse the grammer and apply it to the tokens? I can't find any information on that. Do I run the grammer through the toknizer?

Could you explain it english steps please? I've not found a single tutorial that does anything but tell you what the grammer does, not how it does it.

Share this post


Link to post
Share on other sites
The tokenizer is a very simple program. All it does is take in a stream of characters and put out a stream of higher-level tokens. For instance, a tokenizer would take in the following string:

(435-2)+4*2

and provide the following list of tokens:

OpenParen
Integer(435)
Minus
Integer(2)
CloseParen
Plus
Integer(4)
Times
Integer(2)

Here's a trace of what happens:

Still unparsed: (435-2)+4*2
Then the grammar parser takes this list and starts at the beginning. It calls
the expression parser (the start symbol of the grammar), which says, "OK, I
need to get an 'expression'. That means I need to begin with a term."

Still unparsed: (435-2)+4*2
So the 'expression' parser calls the 'term' parser. The term parser says, "OK,
I need a 'term'. That means I'm beginning with a 'factor'." So the term
parser calls the factor parser.

Still unparsed: (435-2)+4*2
The factor parser says, "OK, we're supposed to have a factor right here. I
begin with an integer." So we peek at the first token in the stream. It's
not an integer! So that won't work. But the factor can also be a group.
So the 'factor' parser calls the 'group' parser.

Still unparsed: (435-2)+4*2
The 'group' parser says, "OK, I need to get a group from this token
list." The 'group' parser gets the first token from the list, and it is
indeed an open paren, the first thing needed in his definition. Then he
sees that the next thing he needs to succeed is an expression. So he
calls the expression parser.

Still unparsed: 435-2)+4*2
The expression parser says, "OK, I need to get an 'expression'. That
means I need to begin with a term." So the 'expression' parser calls
the 'term' parser.

Still unparsed: 435-2)+4*2
The term parser says, "OK, I need a 'term'. That means I'm beginning
with a 'factor'." So the term parser calls the factor parser.

Still unparsed: 435-2)+4*2
The factor parser says, "OK, we're supposed to have a factor right
here. I begin with an integer." So we peek at the first token in
the stream. It is indeed an integer (435)! So the factor class
takes that integer and stores it internally so that its Evaluate
function can return 435 (since this factor is merely an integer
rather than a group, its other option). The factor parser returns
success.

Still unparsed: -2)+4*2
The term parser sees that the factor parser has returned
success, so a factor has been successfully read from the token
stream. It peeks to see if the next token is a Times or Divide, but
it's not, so the term knows it has no more factors. But it did find
enough to match its rule, so it returns success.

Still unparsed: -2)+4*2
The expression parser sees that it successfully found its first term.
It stores the instance of the term parser for calling Evaluate on
inside of the expression's own Evaluate function.
It peeks to see if the next token is Plus or Minus. It is Minus, so
the Expression parser class remembers that it found a minus
sign for use in its Evaluate function. It then calls the term
parser again.

Still unparsed: 2)+4*2
The term parser says, "OK, I need a 'term'. That means I'm beginning
with a 'factor'." So the term parser calls the factor parser.

Still unparsed: 2)+4*2
The factor parser says, "OK, we're supposed to have a factor right
here. I begin with an integer." So we peek at the first token in
the stream. It is indeed an integer (2)! So the factor class
takes that integer and stores it internally and returns
success.

Still unparsed: )+4*2
The term parser sees that the factor parser has returned
success, so a factor has been successfully read from the token
stream. It peeks to see if the next token is a Times or Divide, but
it's not, so the term knows it has no more factors. But it did find
enough to match its rule, so it returns success.

Still unparsed: )+4*2
The expression got another success result (this time the factor
parser found 2). The expression parser stores this term internally for
use in its Evaluate function also. The expression parser peeks the next
character to see if it's a Plus or Minus (it's not, it's a
CloseParen), so the expression knows that its list of terms is done.
It returns success.

Still unparsed: )+4*2
The group parser who called this instance of the expression parser sees
that the expression parser returned success, so it stores that
expression parser instance internally so it can return the expression's
Evaluate() as its own Evaluate() value. The group parser reads a token
from the tokenizer and finds that it is indeed a CloseParen, so it
returns success (its definition is over).

Still unparsed: +4*2
The factor parser sees that the group parser returned success, so it
stores the group parser instance internally to call Evaluate() on. The
factor parser returns success.

Still unparsed: +4*2
The term parser sees that the factor parser has returned
success, so a factor has been successfully read from the token
stream. It peeks to see if the next token is a Times or Divide, but
it's not, so the term knows it has no more factors. But it did find
enough to match its rule, so it returns success.

Still unparsed: +4*2
The expression parser sees that the term parser returned success, so it stores
that term internally for use later. It peeks to see if the next token is Plus
or Minus. It is Plus, so it stores that internally. It knows the next part of
the expression must be another term, so it calls the term parser.

Still unparsed: 4*2


I think you get the idea. Or if you don't, more text won't help you :-)

The point is, this will produce a sort of tree. The tree is much more complex than the one you wanted, but it will work every time. This expression ((435-2)+4*2) will produce the following tree:


expression
| | |
term + term
| | | |
factor factor * factor
| | |
group integer integer
| | |
( expression ) 4 2
| | |
term - term
| |
factor factor
| |
integer integer
| |
435 2


Each node in the tree can call Evaluate and return the value of everything below it, since its children are stored internally.

Hope that helps.



~BenDilts( void );

[Edited by - BeanDog on November 6, 2005 8:44:48 PM]

Share this post


Link to post
Share on other sites
BeanDog, for future reference, you can write trees using s-expressions and not worry so much about the forum munging them. So the 3 + 2 * 5 becomes (+ 3 (* 2 5)) Where each ( is the beginning of a new left branch. Each ) is the end of a right branch and everything apart from the first element is a leaf.

Some more examples:

3 + 9 * (10 + 7) => (+ 3 (* 9 (+ 10 7)))
x - 4^3 => (- x (^ 4 3))
1 + 1 + 1 + 1 => (+ 1 (+ 1 (+ 1 1)))
3 => 3

Share this post


Link to post
Share on other sites
Quote:
Original post by flangazor
BeanDog, for future reference, you can write trees using s-expressions and not worry so much about the forum munging them. So the 3 + 2 * 5 becomes (+ 3 (* 2 5)) Where each ( is the beginning of a new left branch. Each ) is the end of a right branch and everything apart from the first element is a leaf.

Some more examples:

3 + 9 * (10 + 7) => (+ 3 (* 9 (+ 10 7)))
x - 4^3 => (- x (^ 4 3))
1 + 1 + 1 + 1 => (+ 1 (+ 1 (+ 1 1)))
3 => 3


scheme^^

Share this post


Link to post
Share on other sites
Thanks again, but it's still not making sense!

I have the tokenizer working perfectly. I can determine if a token is a literal, variable/class etc using multi-pass. My problem is I just understand how The grammer is processed in real terms.

Do I tokenize the grammer rules file as well as create the rules based on that or do I build each rule by hand?
I can't find a single tutorial on the subject.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this