Expression parsing?

Started by
14 comments, last by GameDev.net 17 years, 5 months ago
I'm currently writing my own scripting language, and it's time to write the compiler. The language itself (DruinkScript) has a C-like syntax, and I'm wondering how to process expressions in a nice way. For instance, you can define a variable with: int nFoo = 14 + 7 * 5 / (4 * 8 + func(7, "pies")); I'm kind of stuck on how to parse what's on the right of the assignment, considering I allow pretty much all non-pointer related C operators, including function calls. The book I was readin (Game Scripting Mastery), uses a hacky way to parse expressions, and I'm not 100% sure what it's doing. It splits the expression into two levels of precedence, and messes around with them. It also mentions stack based expression parsing, where I'd push the operands and operators onto two stacks, and then generate the I-code when I have an operator with the highest precedence on the top of the stack. However, this method doesn't seem particular efficient or easy to scale with more precedence levels. What I started thinking of doing was splitting the expression up into sub-expressions on brackets, that means that all the "stuff" in the sub-expressions can be parsed according to normal precedence rules. That eliminates the brackets problem, but still leaves how to handle the rest, and how to handle function calls (Although function calls are probably pretty straightforward). Anyone got any general hints or good links for me to look at?
Advertisement
The way I do it is like this:

First, I break the string down into a linked list of tokens. To make this very simple, I'll ignore variables and focus on numeric constants and operators only.

Numeric constant tokens are pretty simple, they just contain a number.

Operator tokens contain the operator "type" (from an enumeration), the position in the string it was (eg, it was 35 characters in) and how many opened parentheses are before it. Here's what I mean by the parentheses:

 .  . .  .  .   .  .  .3+(2+4/(4-3)+((5*4)/2)-1) 0  1 1  2  1   3  2  1  <- each operator token's parenthesis index.
(Once read in, parentheses are ignored, so I just get a chain of operators and numbers).

Now I go through and find all of the operator tokens and collect them into a list. I then sort this list, based on the following criteria:

  • Are the parenthesis indices different? If so, compare them and return the result.
  • Are the operators different? If so, compare then and return the result.
  • Compare the token's index within the original string and return the result. (For unary operators, the further right it is the higher precedence it is. For binary operators, the further left it is the higher precedence).


To sort the operator type (eg "Addition" vs "Shift Right") they are represented by an enumeration. I multiply the "real" value by 16, and return the precedence by diving by 16. To make this clearer, as an example, * might by 0, / might be 1 and + (which is the next level down in terms of order of precedence) skips ahead to 16. That way, to get the order of precedence, I divide by 16 and find that * and / are the same, but + is different.

Now I have this sorted list, I hammer through it, and use each operator token to "merge" the two tokens to either side. A bit like this:

       {4} {+} {5}        |___ ___|            |______ Grab the outer tokens and perform the operation on them.                   {4} {9} {5}            |______ Replace the operator with the result                    .  {9}  .        |_______|__ Delete the two constants.			If it's unary:	        {-} {1}             |___ Grab the rightmost token and perform the operation on it.		              .  {-1}              |__ Replace the token with the result and delete the operator.
That way, once you've finished going through your list of operators you'll only have one token left - and that's the result.

That's the basics, and it should (hopefully?) be simple enough to extend it to support "variable" tokens or "function" tokens. I've written it up in rather more detail here.

Hopefully that's not too confusing or convoluted. (My old version used a more direct approach - string manipulation, searching for operators using substrings - that sort of thing. Needless to say it's incredibly slow and full of hacks to keep it working).

[Website] [+++ Divide By Cucumber Error. Please Reinstall Universe And Reboot +++]

If you've got a copy of Stroustrup's The C++ Programming Language, there is a very elegant example of a math calculator that I was able to build upon to write a compiler that could handle arbitrarily complex expressions as well as functions, reference parameters etc.

I'm sure you are well aware of recursive descent parsing. The idea I've always used is to recursive descent parse an expression into a tree, then have a virtual emit() member on the node that will output the instructions for the vm to compute.

I've taken the approach in the past of a stack-based vm and the role of emit() is to emit code that always ends up with the result of the expression on the stack.

// simple and partial exampleclass node{public:    virtual ~node(){ }    virtual void emit()=0;};class lit_n : public node{private:   int val;public:   lit_n(int v) : val(v) { }   void emit(){ output(ic::push,val); }};class add_n : public node{private:    node *ln,*rn;public:    add_n(node *l,node *r) : ln(l),rn(r) { }    ~add_n(){ delete ln; delete rn; }    void emit()        { ln->emit(); rn->emit(); output(ic::add,0); }        // vm add instruction pops top two values, adds them and pushes the result};node *prim(bool get);node *term(bool get){    node *n=prim(get);    while(true)        {        switch(scan_type())            {            case plus : n=new add_n(n,prim(true)); break;            case minus: n=new sub_n(n,prim(true)); break;            default: return n;            }        }}node *prim(bool get){    scan(get); node *n;    switch(scan_type())        {        case number: n=new lit_n(value); scan(true); return n;        case lbracket: n=term(true); match(rbracket); scan(true); return n;        default: error(); return 0;        }}void expression_statement(bool get){    node *n=term(get); match(semi,false);    n->emit(); delete n;    output(ic::pop,0);}


You can expand this approach to functions by having a function node that contains a vector of node pointers that point to the argument and have calling emit() on a function node emit the code to push the params on the stack, call the function and end up with the return value on the stack.
Quote:Original post by Evil Steve
Anyone got any general hints or good links for me to look at?


(1) Define your grammar. Use EBNF to formalize it.

(2) Use standard techniques to elimiate left-recursion and perform left-factoring of your grammar into an LL(k)-parsable form.

(3) Whip up a basic recursive-descent parser.

(4) Fill in the action rules in appropriate places.

It's as simple as that. The ultimate reference is the dragon book.

There is a boost library you can use to write parsers but I find it, like many boost libraries, waaay too complex. Them b'ys shure are klever, but us mundanes just want a bit of butter for our bread.

Stephen M. Webb
Professional Free Software Developer

Quote:It also mentions stack based expression parsing, where I'd push the operands and operators onto two stacks, and then generate the I-code when I have an operator with the highest precedence on the top of the stack.
However, this method doesn't seem particular efficient or easy to scale with more precedence levels.


What a coincidence i was looking at this article the other day, im not sure about the effeciancy but it seems to me that it would scale fairly well because it only relies on "is the presidence of this operator higher then that one", anyway hope the article helps.
Woo, lots of replies [smile]

I've already got a basic recursive descent parser and a lexer working, so that's one thing. I also have Bjarne's book, I'll have to take a look at that calculator sample, it's been a while since I looked at the book, but that sounds promising.
I've currently got my grammar defined, but not in BNF notation (Since I couldn't remember BNF notation when I started it). I'll convert it to BNF soon (I don't have that much in it at the moment).

Bregma: Assuming you mean Boost Spirit - it terrifies me [smile]

Thanks for all the replies, I'll have a poke around and let you know how I get on...
Uhm, bison, uhm.

Seriously, though, you really do want to use bison.
All you need is a LL(k) grammar. Left-recursion elimination and factoring is done for you. This is very handy, since you can specify semantic actions for complex rules, leaving all the meat to the generated inner-engine.
Quote:Original post by deffer
Uhm, bison, uhm.

Seriously, though, you really do want to use bison.
All you need is a LL(k) grammar. Left-recursion elimination and factoring is done for you. This is very handy, since you can specify semantic actions for complex rules, leaving all the meat to the generated inner-engine.
This is mainly a learning experience, so I'd prefer to stay away from something like YACC or bison which will do the hard stuff for me. I like making life difficult for myself [smile]
Hi
I've learn't much from this series. It's quite old, but I think its useful.
The work done by bison is very low-level. Worry not, you'd still have your hands full with (among others):
- creating right grammar (not so easy)
- error recovery (quite hard to do it right)
- attaching semantic rules

Additionally, as bison is a proffesional tool, this knowledge could be most valued by your employer.

Ok, ok, I'm shutting up now.[grin]

This topic is closed to new replies.

Advertisement