Archived

This topic is now archived and is closed to further replies.

evolutional

Building Syntax Trees

Recommended Posts

After several hours of studying compilers and their theory I have built myself a fairly successful lexer using regular expressions. Now, I''ve hit a stumbling block - all the tutorials and articles I have read seem to use Bison/YACC to build a syntax tree of valid grammar and parse it. I understand how to create a YACC file can describe the context-free grammar (from the tutorial at Flipcode and Peroxide''s series) - I want to understand how to present valid grammar rules and build the tree myself. I''ve studied a variety of sources (Gamedev Fora/Flipcode/Peroxide/Crenshaw/Google, et al) for hours and have not come up with much. The things I''ve been reading from the search on GDNet seems to be pointing at Finite State Automata, but I''m not too sure. Has anyone else built their own syntax parser (using a stream of lexxed tokens as an input) without using YACC/Bison or similar that may be able to offer me a few pointers to resources? Thanks in advance

Share this post


Link to post
Share on other sites
The term you might want to google for is "recursive descent parser". This is pretty much the only type of parser it makes sense to write by hand. A recursive does put certain restrictions on your grammar - it has to be LL(1), which, among other things, means you have to eliminate all left-recursive rules from your grammar.

The principle behind a recursive descent parser is rather simple: You have a function/method per non-terminal, in which you read the next token and use it to determine which function to call next.

--
AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.
[Project site] [Blog] [RSS] [Browse the source] [IRC channel]

Share this post


Link to post
Share on other sites
Hey thanks for that, it seems as if I understood the term RDP without knowing it by name. A google has turned up some Comp. Science lecture notes on the subject. I''ll look into those for the time being...

Share this post


Link to post
Share on other sites
You can build an AST or parse tree with Yacc as you go along, that is how I think most people use it anyways. At each terminal you push, at each expression you restructure the tree.

If you do it by hand I can explain how to easily build the tree yourself, it is not hard.

Create a binary tree structure. Have a stack that you can push the nodes onto it.

parse_term();
push_top_node();
parse_term();
push_top_node();
new_top->left = pop();
new_top->right = pop();
push(new_top)


I really don't know the *right* way to do it, or how they teach you in school, I taught myself and came up with my own ideas on how to do it. For some types of expressions, like a C for-loop, it has 3 terms, so I create something called a node and split them on it. The leafs contain the expressions.

[edit - diagram is fubared ]

For something like a function, I have a vector of parse nodes in the node, and just store *the statements* in there. It is easy and it works.

Peace


[edited by - PeterTarkus on May 21, 2004 2:42:15 AM]

[edited by - PeterTarkus on May 21, 2004 2:42:47 AM]

[edited by - PeterTarkus on May 21, 2004 2:43:29 AM]

[edited by - PeterTarkus on May 21, 2004 2:44:02 AM]

Share this post


Link to post
Share on other sites
quote:
Original post by PeterTarkus
... interesting stuff here ...



Interesting - I''ve been looking at compiler theory books a lot over the past few days (and considering buying Game Scripting Mastery) and they never seem to describe it as simply as you do.

So let me get your right - I should be taking my token stream, looking for a terminal (eg: for/while) and creating tree nodes from there, linking the expressions as leafs.

I think I''m getting you - are you hard-coding these rules or do you have a grammar input definition like YACC seems to have?

Share this post


Link to post
Share on other sites
No, it''s a recursive descent parser. I have to warn you that recdescent parsers become very hard to manage as your language grows. That is why most people use Yacc/Lex or DFA parsers or whatever.

It is very simple I will show you an example



void parse()
{
while(1)
{
token = get_token();
switch(token)
{
case token_for: parse_for()
case token_while: parse_while()
.
.
.
}
while(token == tok_semicolon) token = get_token()
}
}



A good book for seeing how to build a rec descent parser is "Writing Compilers and Interpreters" by Ronald Mak.

I have a project with a simple language I made you can examine it to see how I did things if you want : http://xano.sourceforge.net/cull.html

Recursive descent is only good for small languages, you could not hope to build a C++ compiler in it you would go mad.

Share this post


Link to post
Share on other sites
quote:
Original post by PeterTarkus

snip

Recursive descent is only good for small languages, you could not hope to build a C++ compiler in it you would go mad.




I don''t think anyone here is aiming for the next C++... most people''s grammars could probably be left factored and have their left recursion eliminated in a reasonable amount of time.

Share this post


Link to post
Share on other sites
Yes you are correct :-)

Yeah, I was a bit **tired** before, building the syntax trees is pretty much like you got going. The syntax trees are used mostly for parsing expressions like 1+2*3+6*4 etc. The other structures
like for, while, etc are just conveniently stored in the structure.


struct ParseTreeNode
{
Token token;
ParseTreeNode * left, * right;
vector<ParseTreeNode*> statements;
}



For something like a while, I put the while into a node,
and then all the statements between BEGIN .. END go into the
statements list. This just makes it easy to manage.

The stack for expression parsing is as old as time you can find info on that in any good C/Pascal programming book.

Peace

[edited by - PeterTarkus on June 7, 2004 1:21:13 AM]

Share this post


Link to post
Share on other sites
The C# compiler is recursive descent. You can see the main parser file here: http://sharedsourcecli.sscli.net/source/browse/sharedsourcecli/clr/src/csharp/csharp/sccomp/parser.cpp?rev=1.1.1.1&content-type=text/vnd.viewcvs-markup

The JScript.NET compiler included in Rotor is also recursive descent, but written in C#: http://sharedsourcecli.sscli.net/source/browse/sharedsourcecli/jscript/engine/jsparser.cs?rev=1.1.1.1&content-type=text/vnd.viewcvs-markup

--
AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.
[Project site] [Blog] [RSS] [Browse the source] [IRC channel]

Share this post


Link to post
Share on other sites
quote:
Original post by downgraded
Out of interest, which features of C++/C# would make it harder to parse? Are we talking class structures or pointer indirection?
There are a bunch of wrinkles (particularly in C++) that require that the parser get all mixed up with the semantic analyzer. For instance i < j && k > l could either be an expression or a template instantiation, depending on what i is.

As long as you''re willing for your language to not ape C++, it shouldn''t be a big deal.

Share this post


Link to post
Share on other sites
That is interesting that C# is a recursive descent parser.

The C++ language requires alot of state management and I would think that complexity is what would drive you mad. The state management of the compiler is what becomes complex, not really the parser, but because the state management is mixed in with the parser it becomes a nightmare to remember what you were doing.

The original K&R C compilers were very simple really.


Peace

Share this post


Link to post
Share on other sites