Building Syntax Trees

Started by
11 comments, last by evolutional 19 years, 10 months ago
After several hours of studying compilers and their theory I have built myself a fairly successful lexer using regular expressions. Now, I''ve hit a stumbling block - all the tutorials and articles I have read seem to use Bison/YACC to build a syntax tree of valid grammar and parse it. I understand how to create a YACC file can describe the context-free grammar (from the tutorial at Flipcode and Peroxide''s series) - I want to understand how to present valid grammar rules and build the tree myself. I''ve studied a variety of sources (Gamedev Fora/Flipcode/Peroxide/Crenshaw/Google, et al) for hours and have not come up with much. The things I''ve been reading from the search on GDNet seems to be pointing at Finite State Automata, but I''m not too sure. Has anyone else built their own syntax parser (using a stream of lexxed tokens as an input) without using YACC/Bison or similar that may be able to offer me a few pointers to resources? Thanks in advance
Advertisement
The term you might want to google for is "recursive descent parser". This is pretty much the only type of parser it makes sense to write by hand. A recursive does put certain restrictions on your grammar - it has to be LL(1), which, among other things, means you have to eliminate all left-recursive rules from your grammar.

The principle behind a recursive descent parser is rather simple: You have a function/method per non-terminal, in which you read the next token and use it to determine which function to call next.

--
AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.
[Project site] [Blog] [RSS] [Browse the source] [IRC channel]
--AnkhSVN - A Visual Studio .NET Addin for the Subversion version control system.[Project site] [IRC channel] [Blog]
Hey thanks for that, it seems as if I understood the term RDP without knowing it by name. A google has turned up some Comp. Science lecture notes on the subject. I''ll look into those for the time being...
You can build an AST or parse tree with Yacc as you go along, that is how I think most people use it anyways. At each terminal you push, at each expression you restructure the tree.

If you do it by hand I can explain how to easily build the tree yourself, it is not hard.

Create a binary tree structure. Have a stack that you can push the nodes onto it.

parse_term();
push_top_node();
parse_term();
push_top_node();
new_top->left = pop();
new_top->right = pop();
push(new_top)


I really don't know the *right* way to do it, or how they teach you in school, I taught myself and came up with my own ideas on how to do it. For some types of expressions, like a C for-loop, it has 3 terms, so I create something called a node and split them on it. The leafs contain the expressions.

[edit - diagram is fubared ]

For something like a function, I have a vector of parse nodes in the node, and just store *the statements* in there. It is easy and it works.

Peace


[edited by - PeterTarkus on May 21, 2004 2:42:15 AM]

[edited by - PeterTarkus on May 21, 2004 2:42:47 AM]

[edited by - PeterTarkus on May 21, 2004 2:43:29 AM]

[edited by - PeterTarkus on May 21, 2004 2:44:02 AM]
quote:Original post by PeterTarkus
... interesting stuff here ...


Interesting - I''ve been looking at compiler theory books a lot over the past few days (and considering buying Game Scripting Mastery) and they never seem to describe it as simply as you do.

So let me get your right - I should be taking my token stream, looking for a terminal (eg: for/while) and creating tree nodes from there, linking the expressions as leafs.

I think I''m getting you - are you hard-coding these rules or do you have a grammar input definition like YACC seems to have?
The best explanation I have EVER read about this is in the book "Game Scripting Mastery" by Alex Varnese (sp?).

It''s a must-buy for anyone wanting to roll their own
No, it''s a recursive descent parser. I have to warn you that recdescent parsers become very hard to manage as your language grows. That is why most people use Yacc/Lex or DFA parsers or whatever.

It is very simple I will show you an example

void parse(){  while(1)  {    token = get_token();    switch(token)    {    case token_for: parse_for()    case token_while: parse_while()    .    .    .    }    while(token == tok_semicolon) token = get_token()   }}


A good book for seeing how to build a rec descent parser is "Writing Compilers and Interpreters" by Ronald Mak.

I have a project with a simple language I made you can examine it to see how I did things if you want : http://xano.sourceforge.net/cull.html

Recursive descent is only good for small languages, you could not hope to build a C++ compiler in it you would go mad.

quote:Original post by PeterTarkus

snip

Recursive descent is only good for small languages, you could not hope to build a C++ compiler in it you would go mad.



I don''t think anyone here is aiming for the next C++... most people''s grammars could probably be left factored and have their left recursion eliminated in a reasonable amount of time.
Yes you are correct :-)

Yeah, I was a bit **tired** before, building the syntax trees is pretty much like you got going. The syntax trees are used mostly for parsing expressions like 1+2*3+6*4 etc. The other structures
like for, while, etc are just conveniently stored in the structure.

struct ParseTreeNode{   Token token;   ParseTreeNode * left, * right;   vector<ParseTreeNode*> statements;}



For something like a while, I put the while into a node,
and then all the statements between BEGIN .. END go into the
statements list. This just makes it easy to manage.

The stack for expression parsing is as old as time you can find info on that in any good C/Pascal programming book.

Peace

[edited by - PeterTarkus on June 7, 2004 1:21:13 AM]
Out of interest, which features of C++/C# would make it harder to parse? Are we talking class structures or pointer indirection?

This topic is closed to new replies.

Advertisement