# Parser help

This topic is 4161 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Dear Thread Reader, I finaly finished my lexer (totaly homemade) and now Im going to work on my parser. But I'm not 100% sure my idea of how to do it is right, so here is what I think the best way to do this. Acording to the first token in the line expect what the other will be, is the other different then reexpect till there is one that matches or when there is nothing left to check against (sintax error). Say we have this line:
//Shared means that other scripts will be allowed to acces the var.
shared int SomeInt = 5*4;


When the parser sees 'shared' it expects a var or function declaration or a function implention. When it checks the next one it still expects the same, after the name he still expects both but when he hits '=' it knows its a var difinition sends the rest of the tokens '4*5;' to the solver wich will solve the math and send the awnser to init the var to. This similar pattern will be followed for function calling or other 'lines'. Is this the right way to it? Are there better ways? I don't want to use a premade parser so there is no point in trying to convince me to use a premade one.

##### Share on other sites
What I'd suggest:

Write down a BNF grammar for your language: now you have a formal definition instead of a fuzzy idea of what your language is.

Only then should you try to code anything.

##### Share on other sites
I'm sorry to say that that approach is not going to scale well at all. Unless what you are trying to do is pretty trivial, what I would suggest is formalizing your language as a grammar (in Backus-Naur form), and writing a recursive descent parser.

This is a somewhat complex process, but once you get your head around it, it is not too bad to actually implement. (I just finished a recursive descent parser for an independent study).

HTH

##### Share on other sites
that's right, BNF is the right way to go. Also, while it will be harder to get it going at first, if it's designed right, it's extremely easy to extend and make adjustment to your grammar.

...all that being said, I'm sure there's a tool out there that generates code for a parser, given a grammar. Though if you want the whole learning experience, it's worth trying it out for yourself first.

##### Share on other sites
Flex/Bison.
Lex/Yacc.

Note that a generic BNF form grammer is still too hard to parse, if I remember my parsing headaches right.

In effect, you want to form your grammer so that which branching of your grammer you are going down can be predicted based on the next token.

This can restrict your language in interesting ways.

Ie:
Chunk => scope type identifier ChunkTail;
ChunkTail => VarTail | FuncTail
VarTail => empty | = Expression
FuncTail => ( Arguements ) { FunctionBody }
...

Notice that from the very next token, you can tell which of the two possible grammer steps you follow. This is known as a LL(1) grammer, IIRC -- left lookahead 1 grammer.

##### Share on other sites
if you make your language syntactically simple it is easy to write a parser. I started with a fuzzy idea of how my language was going to work and ended up with a parser that is extremely simple and less than 650 lines. Here is some code from my language:

(int x) -> (int y) f{	y = x*x},char c,class C = (int a, int b, int c),()->() main{	print(f 5);	(int)->(int)@ fp;	fp=f@}

The algorithm is simple:

first I find all of the top level grouping symbol pairs ( ) { } [ ] and parse them (recursively) Then I do the -> operator, then the "adjacency operator" (the lack of an operator between two expressions is considered to be an implied adjacency operator). Then I do the rest of the operators, including the , and ; which are just plain operators. I have no prefix operators to make things much easier, just infix and postfix. After these steps I post-process the parse tree to recognize certain language constructs. I guess this is the parse tree to syntax tree step. That's it.

This technique probably is more effective for highly orthagonal languages like mine. It was much easier to hand write the parser than to use a tool like yacc or bison. I had to refactor constantly. At one point I had a large section of code where essentially the same code was replicated three times and held together with gotos. It was awesome. I eventually refactored it into a few simple loops but I don't think that I could have written it that way initially. I regret taking out the gotos but they had to go when the triple code went.

##### Share on other sites
since NotAYakk mentioned Flex/Bison etc.. i might as well throw boost.spirit out there.

It generates a recursive decent parser at compile time from the bnf.

Example:
subrule<0> identifier;subrule<1> modifier;subrule<2> type;subrule<3> expression;parse(start_of_script,      end_of_script,      (          identifier = alpha_p >> *(alnum_p | '_'),          modifier = "shared" | none_p,          type = str_p("int") | "unsigned int" | "float",          expression = ......,          modifier >> type >> identifier >> !('=' >> expression)      ));

##### Share on other sites
Note that recursive-decent is not the only way to parse things. Take a look at LR Parsers, which allow you to parse a larger class of grammers.

Also you appear to be mixing up interpretation and parsing. You don't want to do this. Form your entire syntax tree before interpreting things otherwise you'll make your parsing far too complex and hard to maintain.

##### Share on other sites
I forgot to mention the key part of how my parser works. After I load all of the Tokens into a container I turn them into Expression and throw pointers to them into a list. All of my parsing functions take a range (pair of iterators) in addition to their other arguments. Each parsing function gradually removes Expression*s from the list and in the end there is only one Expression* left in the list. Expression has a bunch of members but the key ones are:

Token token
Expression* left
Expression* right
Expression* contents

Initially each Expression is just a wrapper for the token that it was created from. Later an Expression representing the + operator will store pointers to its left and right operands in its left and right members. The left and right operands are removed from the list. If you have any questions let me know.

Also there is this program called Graphviz. I found it after I had finished the parser but it is so awesome. Give your parse tree the ability to write out a graphviz file, then use graphviz to turn it into an image showing you the nodes and pointers and contents.

1. 1
2. 2
frob
15
3. 3
4. 4
5. 5

• 14
• 13
• 14
• 69
• 22
• ### Forum Statistics

• Total Topics
632138
• Total Posts
3004319

×