Trouble in building parser for compiler.

Started by
2 comments, last by assainator 10 years, 9 months ago
Hello all,
I have a question related to parsing in a compiler. I'm having trouble coming up with a proper approach to parsing certain statements.
For example, take this grammar:
A:= literal | variable
B:= '+' | '-'
C_1:= A
C_2:= C, B, C
C_3:= '(', C, ')'
C:= C_1 | C_2 | C_3
D:= 'return', C, ';'
example of parsable code:
return (5+a) - 3;
My question is, what approach can I best use to convert these rules into functions?
My main concern is, how do I make sure that for "return 3 + 5;" the proper rules (D and C_2) are used?
After starting D and encountering the token '3', C_1 is already satisfied. But then the parser encounters '+' instead of the expected ';'.
I would like to write this myself for learning purposes so parser generators like yacc/bison are a no go.
Every time I think I've come up with something I end up with a function so long and ugly (and non-operational) that I can't help but think I'm missing something. Does anyone have some pointers?
Thanks a lot in advance.
"What? It disintegrated. By definition, it cannot be fixed." - Gru - Despicable Me

"Dude, the world is only limited by your imagination" - Me

Here are some options:

http://en.wikipedia.org/wiki/LR_parser

http://en.wikipedia.org/wiki/Recursive_descent_parser

I personally prefer the LR parser (because it can be extended to parallel and unrestricted parsing much more easily), but it's harder to produce helpful syntax error messages with it.

Personally - I know this is going to be unpopular - I suggest not thinking about the grammar at all.

There was a time when I was all about BNF and the like. My latest parser has been in use for a while now and I still have no explicitly written grammar for it.


how do I make sure that for "return 3 + 5;" the proper rules (D and C_2) are used?

I'd first do a keyword match - mainly for constructs such as for, which have fairly odd syntax. In this case we match return.

Then we find a literal. Looking ahead 1 token (the "1" in LL(1) and LR(1)) we find a + token, which is recognized as an operator. It's a binary op because we found it after a literal, so we fetch another operand. The point is: don't be greedy! Don't commit to a rule just because it matches right now.
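A minimal sketch of that lookahead idea, applied to the grammar from the question. Everything here is an assumption made for illustration: the Parser class, its method names, and the (kind, text) token pairs are hypothetical, and the tokens are taken to be already split out. Note that C_2 as written is left-recursive, so a function that parses it literally would call itself forever; the sketch replaces it with a loop that keeps consuming operands for as long as the next token is an operator.

```python
class Parser:
    """Hypothetical sketch: tokens are (kind, text) pairs, already split out."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Look at the next token without consuming it (one token of lookahead).
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", "")

    def expect(self, kind, text):
        # Consume the next token only if it is exactly the one required.
        if self.peek() != (kind, text):
            raise SyntaxError(f"expected {text!r}, found {self.peek()[1]!r}")
        self.pos += 1

    def parse_return(self):
        # D := 'return', C, ';'
        self.expect("keyword", "return")
        expr = self.parse_expr()
        self.expect("punct", ";")
        return ("return", expr)

    def parse_expr(self):
        # C_2 rewritten as a loop: operand { ('+' | '-') operand }
        left = self.parse_operand()
        while self.peek()[0] == "op":
            op = self.peek()[1]
            self.pos += 1
            left = (op, left, self.parse_operand())
        return left

    def parse_operand(self):
        # C_1 (literal or variable) or C_3 (parenthesized expression)
        kind, text = self.peek()
        if (kind, text) == ("punct", "("):
            self.pos += 1
            inner = self.parse_expr()
            self.expect("punct", ")")
            return inner
        if kind in ("literal", "variable"):
            self.pos += 1
            return (kind, text)
        raise SyntaxError(f"unexpected token {text!r}")


tokens = [("keyword", "return"), ("literal", "3"), ("op", "+"),
          ("literal", "5"), ("punct", ";")]
tree = Parser(tokens).parse_return()
# tree is ("return", ("+", ("literal", "3"), ("literal", "5")))
```

After reading 3, parse_expr does not return straight away: it peeks, sees the +, and only gives control back to parse_return once the next token is no longer an operator. That is why the ';' check in D never sees the '+'.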

At this point we have this expression parsed: the compiler will have to work out what 3 and 5 are so it can emit the proper ADD instruction.
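To make that last step concrete, here is a rough sketch of walking a parsed expression and emitting instructions. The tuple-shaped tree, the emit function, and the instruction names (PUSH_CONST, LOAD_VAR, ADD, SUB, RET) are all invented for this example and target an imaginary stack machine; a real backend would look different.

```python
def emit(node, out):
    """Append instructions for `node` to the list `out` and return it.

    Hypothetical tree shapes: ("literal", text), ("variable", name),
    ("+" or "-", left, right), and ("return", expr).
    """
    kind = node[0]
    if kind == "literal":
        out.append(f"PUSH_CONST {node[1]}")
    elif kind == "variable":
        out.append(f"LOAD_VAR {node[1]}")
    elif kind in ("+", "-"):
        # Operands first, then the operator: post-order for a stack machine.
        emit(node[1], out)
        emit(node[2], out)
        out.append("ADD" if kind == "+" else "SUB")
    elif kind == "return":
        emit(node[1], out)
        out.append("RET")
    else:
        raise ValueError(f"unknown node kind {kind!r}")
    return out


tree = ("return", ("+", ("literal", "3"), ("literal", "5")))
code = emit(tree, [])
# code is ["PUSH_CONST 3", "PUSH_CONST 5", "ADD", "RET"]
```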

I actually do expression assembly in the compiler; some would say that's down to poor design.


Every time I think I've come up with something I end up with a function so long and ugly (and non-operational) that I can't help but think I'm missing something. Does anyone have some pointers?

Are you trying to parse and compile at the same time? This will end in tears in my experience. Pre-tokenization is a must in my opinion (and no, I don't care about what GCC/CompilerX does).
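For what pre-tokenization could look like, here is a small sketch using Python's re module. The token kinds (keyword, variable, literal, op, punct), the TOKEN_RE pattern, and the tokenize function are assumptions made for this example; they cover only the tiny language from the question (integer literals, identifiers, + and -, parentheses, semicolons).

```python
import re

# One named group per token kind; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<literal>\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<op>[+-])
  | (?P<punct>[();])
""", re.VERBOSE)

KEYWORDS = {"return"}


def tokenize(source):
    """Split `source` into (kind, text) pairs before any parsing happens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(
                f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        kind = match.lastgroup  # name of the group that matched
        if kind == "ws":
            continue
        text = match.group()
        if kind == "word":
            # Identifiers and keywords share a pattern; split them here.
            kind = "keyword" if text in KEYWORDS else "variable"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("return 3 + 5;")
# tokens is [("keyword", "return"), ("literal", "3"), ("op", "+"),
#            ("literal", "5"), ("punct", ";")]
```

With this done up front, the parser only ever deals with whole tokens and never has to worry about whitespace or where a number ends.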

Object orientation might help you - in my experience this comes at a negligible cost. For example, you can have a loop "by keyword match" which is very compact and yet dispatches to the correct syntax handler without visible ifs. You'll need to provide a set of basic compiler features such as type lookup. I've done this successfully with a base interface.
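One way such a keyword-dispatch loop might be arranged is sketched below. This is a guess at the general shape, not the poster's actual code: the Statement interface, the two handler classes, the HANDLERS table, and the break statement are all hypothetical, and tokens are plain strings to keep the example short.

```python
from abc import ABC, abstractmethod


class Statement(ABC):
    """Hypothetical base interface: one subclass per statement keyword."""
    keyword = ""

    @abstractmethod
    def parse(self, tokens, pos):
        """Parse starting at tokens[pos]; return (node, next_pos)."""


class ReturnStatement(Statement):
    keyword = "return"

    def parse(self, tokens, pos):
        # Simplified: everything up to the next ';' is kept as raw tokens.
        # list.index raises ValueError if the ';' is missing.
        end = tokens.index(";", pos)
        return ("return", tokens[pos + 1:end]), end + 1


class BreakStatement(Statement):
    keyword = "break"

    def parse(self, tokens, pos):
        if tokens[pos + 1:pos + 2] != [";"]:
            raise SyntaxError("expected ';' after 'break'")
        return ("break",), pos + 2


# The dispatch table: keyword text -> handler object.
HANDLERS = {cls.keyword: cls() for cls in (ReturnStatement, BreakStatement)}


def parse_program(tokens):
    nodes = []
    pos = 0
    while pos < len(tokens):
        handler = HANDLERS.get(tokens[pos])
        if handler is None:
            raise SyntaxError(f"unknown statement start {tokens[pos]!r}")
        node, pos = handler.parse(tokens, pos)
        nodes.append(node)
    return nodes


program = parse_program(["return", "3", "+", "5", ";", "break", ";"])
# program is [("return", ["3", "+", "5"]), ("break",)]
```

Adding a new statement then means adding one class and one entry in the table; the loop itself stays the same.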

Performance-wise, I once had... a problem with my data import routines. So I made a Notepad++ script which would encode all that data in a program... which turned out to be about 3000 lines long. It took a while to process (around 10 seconds) but it was acceptable as a band-aid solution.

edit: two small clarifications.

Previously "Krohm"


how do I make sure that for "return 3 + 5;" the proper rules (D and C_2) are used?

Looking ahead 1 token (the "1" in LL(1) and LR(1)) we find a + token, which is recognized as an operator. It's a binary op because we found it after a literal, so we fetch another operand. The point is: don't be greedy! Don't commit to a rule just because it matches right now.

Thanks, that was my main thought-problem. For some reason, the idea of simply looking ahead didn't come to mind.

As for my 'long and ugly code', looking ahead makes the code a lot simpler.

Nypyren: Thanks for the links, I had read them before, but only in a theoretical context. Now I've read them in a practical context, thanks.

