LALR/LR Parser Generation

Started by
6 comments, last by Ectara 11 years, 7 months ago
I have a parser that's designed to output an abstract syntax tree, and I've run into the problem that certain ambiguities aren't resolved as expected. For instance, I have a sample grammar and a sample input; for the sake of simplicity, I'll leave out the terminal definitions:

Value : INT_CONST | IDENTIFIER ;
Products : Products '*' Value | Value ;
Sums : Sums '+' Products | Products ;
start : Sums ;


and an input of "A*2 + 1".

The parser reads in the grammar file and constructs a table of the non-terminals and the alternatives within each. It then attempts to shift incoming tokens onto the token stack, reducing whenever the topmost tokens match any alternative of a non-terminal, which sounds good in theory.

However, the bug comes from ambiguity in handling the integers and the identifier. The parser pushes the A, sees that the IDENTIFIER matches Value, reduces it, and pushes a Value onto the stack. Then it sees that Value matches Products, since a Products can be made from a single Value, so it reduces and pushes a Products onto the stack. Likewise, Products matches Sums, since a Sums can be made from a single Products, so it reduces and pushes a Sums onto the stack. Finally, Sums matches start, so it reduces and pushes a start onto the stack.

Then it pushes the asterisk, and after that, the 2. The parser then sees that the INT_CONST matches Value, reduces it, and pushes a Value onto the stack. Then it sees that Value matches Products, since it can be made from a single Value... all the way up to start, and adds another start to the stack.
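The failure mode described above can be reproduced in a few lines. This is a minimal sketch (the rule table and function names are my own, not from the original parser) of a shift-reduce loop that reduces eagerly whenever any right-hand side matches the top of the stack:

```python
# Minimal sketch of the eager "reduce whenever anything matches" strategy.
# The rule table encodes the sample grammar; names are illustrative.

RULES = [  # (right-hand side, left-hand side); longer rules listed first
    (("Products", "*", "Value"), "Products"),
    (("Sums", "+", "Products"), "Sums"),
    (("IDENTIFIER",), "Value"),
    (("INT_CONST",), "Value"),
    (("Value",), "Products"),
    (("Products",), "Sums"),
    (("Sums",), "start"),
]

def naive_parse(tokens):
    stack = []
    for tok in tokens:
        stack.append(tok)                 # shift
        reduced = True
        while reduced:                    # reduce as long as anything matches
            reduced = False
            for rhs, lhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)     # replace the handle with its non-terminal
                    reduced = True
                    break
    return stack

# "A*2 + 1" lexes to IDENTIFIER '*' INT_CONST '+' INT_CONST:
print(naive_parse(["IDENTIFIER", "*", "INT_CONST", "+", "INT_CONST"]))
# → ['start', '*', 'start', '+', 'start']  -- over-reduced; the parse is wrecked
```

The single-symbol alternatives (Value, Products, Sums) always match on their own, so every operand gets promoted all the way up to start before the next operator is even considered.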

I suppose I could modify the grammar to avoid this, but I'd like to support this kind of grammar. This leads me to believe that I need a special table with information on where to go next, rather than checking everything to see if it matches. The problem that must be accommodated is that this behavior could be correct in some instances, such as when the input is simply "A"; the parse can't fail by requiring the unused parts of the grammar to be present. I could read a token of look-ahead, but I can't figure out how to use it in this situation: looking ahead one token might help me guess what's next, but if the ambiguity occurs one more token out in the alternatives, then one token of look-ahead won't resolve the issue so naively. So surely there's another method besides looking ahead through the entire input to know for absolute sure. If anyone could point me in the right direction, I'd appreciate it. Online references are extremely confusing, making huge assumptions about what the reader knows without first telling the reader what they should know beforehand.
Well, this is the kind of situation where you need one token of look-ahead. It is, in most cases, sufficient. Whether it is or not depends, as you already noticed, on the language you want to parse (and the way you formalize it). C, for example, can be nicely parsed with an LALR(1) parser (that is, with one exception: the "dangling else" problem, which is a shift-reduce conflict, but that one can be resolved by preferring the shift). C++ is something different entirely. Parsing it with an LALR(1) parser has been attempted, but the result is... not very nice. Another type of parser, GLR, would be a better choice. So it really depends on what you want to do. If your language is not more complex than C - and given your question, it probably isn't - LALR(1) might be the right choice for you.

What are you using to generate the parser? If you feed your grammar to yacc (or its GNU implementation, bison), it will work nicely.
I have my own parser generator; it reads in my grammar format, then it generates a parser in memory, where I can then run the parser on input. It won't be as fast as a parser generated by YACC, but the ability to arbitrarily specify a grammar and parse with it without having to compile a separate parser program is essential. Also, it'd support Unicode.

The parser generator will read in the file and construct a table of the non-terminals and their alternatives. The part that actually does the parsing will attempt to act on this table. It could function like a state machine, seeing as each non-terminal and its alternatives determine the next token that needs to be read; however, it is non-deterministic. Do I need to generate an NFA, like for a regular language?
So, you read some grammar G1, say, in BNF, defining L(G1), and your program is now accepting L(G1). Then, on the fly, you read G2, and, by some magic trickery as it were, you now accept words from L(G2). Well, nice. But (setting aside the fact that I can think of neither a practical application nor theoretical value) how would that do something useful? You need to take, when you encounter a statement, a form of semantic action. Say, after playing with your example of addition and multiplication, you have the idea to allow assignments, i.e., you have an
assignment-expression : additive-expression | postfix-expression ASSIGNMENT_OP assignment-expression ;

...now you would recognize "a = b = 2+3*sqrt(5);" as an assignment-expression. But what would you *do*? How do you intend to take a semantic action of something you don't have any semantics associated with? There is actually a reason that we have a C-compiler and an Ada-compiler and so on and not one meta-compiler that needs to be fed with a grammar, and then can compile any given language.

Setting my reservations regarding the usefulness and feasibility aside, my answer to your initial question still stands: you need, for the grammar you specified, one token look-ahead.

Oh, and by the way: yacc does, by and large, not operate on characters, but on tokens, so Unicode should not be a problem; as for (f)lex, Google found me this: http://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode
And even if you stick with your original plan, I highly recommend using them to parse the grammar itself and to create the parser for that grammar.
Yes, but what do you do with this one token of look-ahead? The main problem I'm having is how to use it to deterministically choose the correct path to take; is there a specific structure in which I need to maintain the grammar information?
Unfortunately, this doesn't have a simple answer that fits in a forum post. The construction of a parsing table with look-ahead takes approximately twenty pages in the dragon book, going from page 221 to 240 to cover the algorithm, and continuing to page 247 to discuss how to optimize performance. While that's the book I have on hand, any book on compiler construction should have a comparable section on parsing table generation.
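As a taste of what those pages build up to: the SLR refinement of the table says to reduce by a rule A → α only when the look-ahead token is in FOLLOW(A), the set of terminals that can legally appear after an A. Here is a hedged sketch (the grammar layout and function names are my own) that computes FOLLOW sets for the grammar from the first post; note that '*' does not end up in FOLLOW(Sums), which is exactly why a Products on the stack must not be reduced to Sums when the look-ahead is '*':

```python
# Sketch of FOLLOW-set computation, one ingredient of an SLR parsing table
# ("reduce by A -> alpha only if the look-ahead is in FOLLOW(A)").
# The dict below is my own representation of the sample grammar.

GRAMMAR = {
    "start":    [("Sums",)],
    "Sums":     [("Sums", "+", "Products"), ("Products",)],
    "Products": [("Products", "*", "Value"), ("Value",)],
    "Value":    [("INT_CONST",), ("IDENTIFIER",)],
}

def first(symbol, seen=None):
    # FIRST set of a symbol; simple because this grammar has no empty productions
    if symbol not in GRAMMAR:
        return {symbol}                      # a terminal is its own FIRST set
    seen = seen if seen is not None else set()
    if symbol in seen:                       # guard against left recursion
        return set()
    seen.add(symbol)
    return set().union(*(first(rhs[0], seen) for rhs in GRAMMAR[symbol]))

def follow_sets(start="start"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add("$")                   # "$" marks end of input
    changed = True
    while changed:                           # iterate to a fixpoint
        changed = False
        for lhs, alts in GRAMMAR.items():
            for rhs in alts:
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:
                        continue             # only non-terminals get FOLLOW sets
                    new = first(rhs[i + 1]) if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

print(follow_sets())
# FOLLOW(Sums) = {'$', '+'}: '*' is not in it, so with look-ahead '*'
# a Products on the stack is NOT reduced to Sums.
```

This resolves exactly the conflict described in the first post, which is why the real construction is worth the twenty pages.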
I see. Thank you for telling me. My guess was to try a DFA, since that does sound oddly like what I need; I've written an NFA before, so this shouldn't be too hard. I suppose I'll try the DFA approach, then consult an actual book on the matter.
http://www.gnu.org/software/bison/manual/bison.html#Lookahead

This page was all I needed to get it working; I didn't need a rewrite and a new technique, I just needed to understand that (1) the look-ahead is _not_ part of the token stack, (2) when the tokens are reduced, the look-ahead is _not_ operated upon, and (3) the look-ahead is pushed onto the stack and replaced with a new one only after all reductions have been completed. The look-ahead is treated differently than the stack, and the key to resolving most conflicts was to check whether the top of the stack, together with the look-ahead token, could still match a non-terminal if more fitting tokens arrive. I should also be able to generate a meaningful error message from the last non-terminal that was a possible match.

The above link might give a better explanation than this paraphrasing, but it was all the push I needed.
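For what it's worth, the rule described above can be sketched in a few lines. This is not a full LALR(1) table, just that heuristic applied to the sample grammar: hold the look-ahead outside the stack, and skip a reduction whenever the matched right-hand side plus the look-ahead is still a viable prefix of a longer rule. The rule table, names, and the prefix test are my own simplification, not the poster's actual implementation:

```python
# Sketch of the one-token look-ahead fix: reduce before shifting, but only
# when the handle plus the look-ahead cannot grow into a longer rule.

RULES = [  # (right-hand side, left-hand side); longer rules listed first
    (("Products", "*", "Value"), "Products"),
    (("Sums", "+", "Products"), "Sums"),
    (("IDENTIFIER",), "Value"),
    (("INT_CONST",), "Value"),
    (("Value",), "Products"),
    (("Products",), "Sums"),
    (("Sums",), "start"),
]

def prefix_of_some_rhs(seq):
    """True if seq is a proper prefix of some rule's right-hand side."""
    return any(len(seq) < len(rhs) and rhs[:len(seq)] == seq for rhs, _ in RULES)

def parse_la1(tokens):
    stack = []
    for lookahead in list(tokens) + [None]:     # None marks end of input
        while True:                             # reduce before shifting the look-ahead
            for rhs, lhs in RULES:
                if (tuple(stack[-len(rhs):]) == rhs
                        and not prefix_of_some_rhs(rhs + (lookahead,))):
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    break                       # a reduction happened; rescan the rules
            else:
                break                           # no reduction applied; stop reducing
        if lookahead is not None:
            stack.append(lookahead)             # the look-ahead only now joins the stack
    return stack

print(parse_la1(["IDENTIFIER", "*", "INT_CONST", "+", "INT_CONST"]))   # → ['start']
print(parse_la1(["IDENTIFIER"]))                                       # → ['start']
```

With look-ahead '*', a Products on the stack is no longer reduced to Sums, because ('Products', '*') is a prefix of the Products '*' Value rule; a bare "A" still parses, because all the pending reductions fire once the end-of-input marker arrives.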
