Reduce/Reduce Conflict Resolution

Started by
7 comments, last by Ectara 11 years, 7 months ago
I have a LALR(1) parser that I'm writing, and I'm having trouble with a frequent occurence in a lot of grammars. There are non-terminals with identical definitions, and the parser is supposed to pick between them to go with the non-terminal it is to reduce them into. Additionally, there is also the expectation that not all tokens that comprise either of these non-terminals be reduced into them; some other non-terminals require it, as is seen in the example below.

As for the parser, it reads in grammar, and then generates a state machine in memory, which then can be used to parse a language described by the grammar. It works a lot like more conventional parser generators, except that rather than outputting code to be compiled into a state machine, it creates a state machine in memory that the parser then follows to process the input. It doesn't have features like combining similar state sequences in the machine into one chain, as that would break the requirement of needing the non-terminal names to be distinct, and to be able to tell which non-terminal it is.

As for the example, take this C grammar excerpt:


enumeration_const : IDENTIFIER
;

...

type_spec : VOID | CHAR | SHORT | INT | LONG | FLOAT
| DOUBLE | SIGNED | UNSIGNED
| struct_or_union_spec
| enum_spec
| typedef_name
;

...

typedef_name : IDENTIFIER
;

...

primary_exp : IDENTIFIER
| const
| STRING
| '(' exp ')'
;

...

const : INT_CONST
| CHAR_CONST
| FLOAT_CONST
| enumeration_const
;


Here, you can see that both enumeration_const and typedef_name clearly expect to be defined as a single identifier, and multiple other non-terminals expect either one or the other. In addition, primary_exp requires an IDENTIFIER that is not reduced to either of them. To resolve the reduce-reduce conflict, the parser currently reduces all identifiers to enumeration_const's, which is not good. My first guess is to somehow use the look-ahead to determine whether or not to reduce, and to what I should reduce based on against which non-terminal I'm checking the look-ahead. It'd have to work for any number of tokens, and not depend on the non-terminal's definition, so that has me stuck. I also can't use a symbol table from the parser to tell whether it is an enumeration or a type name, as it isn't supposed to be language dependant; not all languages have enumurations or named types, and some languages have more symbol types than this. I'd like to resolve it without any magic.

Lex and Yacc are good references, but they cannot be used as substitutes, and will not be. The parser must be completed; if this really truly is impossible for a LALR algorithm, then I'd be interested in learning how to express this concept in a manner that can be parsed by this parser. This parser serves an important purpose, and while each rule and non-terminal can have a callback to execute a function when it encounters one of that type, if I could do it without having to execute any code outside of the parser, it'd be preferable. The parser is written in C, and the language I'm currently testing is C. I have about all relevant standard library functions, and the parser cannot backtrack or guess incorrectly. The parser uses advanced file streams, and can be reading the input from memory, a physical file, a pipe, a network socket, a device node, a cat, etc, so the stream cannot be rewound past a character or two of ungetc(). While I have 32bit integers, I cannot guarantee the presence of a 64bit type.

Any advice would be appreciated, and let me know if more information is needed.
Advertisement

My first guess is to somehow use the look-ahead to determine whether or not to reduce, and to what I should reduce based on against which non-terminal I'm checking the look-ahead. It'd have to work for any number of tokens, and not depend on the non-terminal's definition, so that has me stuck.
[/quote]

That would be my guess. Looking around, it looks like there's a transform step for these conflicts that changes the (ambiguous non-terminal) rules from their specified grammar into more complex rules with the 1 step look-ahead.

Unfortunately, my experience was with top down parsing so it's hard to map that experience with how you're doing things.
Yeah, I see what you're saying. Thank you for taking the time to look it over. It looks like I'll need to build a new state machine for the parser; traversing a table that represents the grammar exactly as specified will only get me so far. I might be able to keep most of the code that traverses it, but I need to make a way to introduce subtle modifications to the grammar.

However, I can't quite tackle removing the ambiguity. Leaving the tokens as-is is the easier part, it would seem, if I could find a way to identify if a non-terminal becomes ambiguous when paired with another. This is all, of course, paired with the assumption that I can even parse this grammar with an LALR(k) parser.

So, I guess my tasks are:

1. Determine when a non-terminal makes another ambiguous.

2. Make a more complex state machine to handle this condition.

3. Find a way to traverse this new machine, after making it.

For 1, I might have to scan every state sequence and compare it with the one that is proposed to be added, and see if they are identical. If they are even one token off, the parser knows how to choose between them already. From there, I might mark it as being ambiguous.

For 2, I might add a special identification node indicating an ambiguity is about to occur and under what name it should be called in this instance, and remove the ambiguous definitions state sequences altogether, leaving a single state sequence named something unique.

For 3, if I follow 2 as described, I could traverse, and see if a non-terminal is matched containing the unique state sequence described earlier, and if so, fix the reference to it to indicate that it is of the type indicated by the identification node as I'm reducing. When the new non-terminal node is pushed on to the stack, it is now correctly resolved.

Does this sound logical?

Does this sound logical?


It sounds naively as a lot of work for what amounts to a special case, but I might just be misunderstanding things.

It shouldn't be a new state machine, it should be (assuming you're using a state machine already) refining some of the duplicate branches of the state transitions and replacing them with "them +1 lookahead" to disambiguate them. The state machine itself would behave similarly except for that transform.
I'm not sure that the problem can be reconciled so easily; considering two non-terminals like so:

n1 : IDENTIFIER ;
n2 : IDENTIFIER ;

n3 : n1 rule_1 rule_2 rule_3 ... rule_n '*'
n4 : n2 rule_1 rule_2 rule_3 ... rule_n '/'


It would seem that I would need to look ahead n + 1 tokens to find the last token that would tell which that identifier should be, if either of them. Seeing as n could be any whole number, the number of tokens I could encounter before knowing what to do would be indeterminable at the time that I need to parse the identifier because I have no room for guesswork.

The method I suggested is the first that comes to mind that would solve the problem of not knowing what it should be by waiting until the non-terminal containing it is reduced to then reduce it to the correct non-terminal.

The first problem I can think of with my method is telling the difference between an ordinary sequence of tokens and the same sequence of tokens that fits the definition of the ambiguous non-terminals. My plan isn't worth it if it doesn't solve all problems presented...
If I read the internet correctly, if you cannot disambiguate after the first lookahead then the grammar itself is ambiguous as far as LALR(1) is concerned.
enumeration_consts are identical to identifiers in both lexical and syntactic meaning, but not semantic meaning. Even a GLR parser (which are extended LR parsers that handle conflicts in the parse table by branching the parse stack) won't know what to do (it'll produce an output forest with every identifier-or-enumeration_const permuted).

The solutions are either:

  1. Remove enumeration_const from your grammar and do semantic analysis later to identify them.
  2. Treat enumeration_const as a terminal instead, which means the lexer needs to be able to recognize them before feeding them to the parser. This is harder since you require information from your work-in-progress parse tree. This will make your grammar non-ambiguous, though.

I generally use #1 (if anything is lexically identical, I treat them all as one terminal and do semantic analysis after the parse to determine what it actually is).

If I read the internet correctly, if you cannot disambiguate after the first lookahead then the grammar itself is ambiguous as far as LALR(1) is concerned.

When you put it that way, that makes a lot of sense. I get too caught up with trying to create contingency plans for everything. I'll have to see if I can resolve the conflict with one token of look-ahead; maybe the inverse of what I've been doing: Instead of always checking to see if the look-ahead could create a possible match to know when not to reduce, perhaps I should also check to see if I _should_ reduce. I'd need some way to know that the tokens on the stack could be reduced to the ambiguous non-terminals; maybe if I see that it could match either of them, and they have an ambiguous flag set, then I check to see if the look-ahead indicates that I should reduce to one or the other. If not, then I don't reduce, doing the opposite of what I usually do.



enumeration_consts are identical to identifiers in both lexical and syntactic meaning, but not semantic meaning. Even a GLR parser (which are extended LR parsers that handle conflicts in the parse table by branching the parse stack) won't know what to do (it'll produce an output forest with every identifier-or-enumeration_const permuted).

The solutions are either:

  1. Remove enumeration_const from your grammar and do semantic analysis later to identify them.
  2. Treat enumeration_const as a terminal instead, which means the lexer needs to be able to recognize them before feeding them to the parser. This is harder since you require information from your work-in-progress parse tree. This will make your grammar non-ambiguous, though.

I generally use #1 (if anything is lexically identical, I treat them all as one terminal and do semantic analysis after the parse to determine what it actually is).

Unfortunately, this isn't an option; many grammars make use of such facilities. I can't allow the feature not to be used; if a well known version of the C grammar from 1985 (http://www.lysator.l...-grammar-y.html) expects the ambiguity to be resolved, then I should probably support it. As for the second point, if I require knowing the purpose of the tree's data while parsing, that restricts the parser to only grammar that gives that amibiguous non-terminal a comparable value. To be more clear, if the language has ambiguous non-terminals, but doesn't store the results of either in any sort of container, then reading the tree won't help determine what it is. Additionally, if the language does store it, this restricts me by having to add new capabilites for each language, if they support different types of names for different things.

EDIT: I've decided to change the state machine a bit; the machine now does a table look-up with the current state and the look-ahead token to find where to go next. While it consumes more memory than my current method, it'll be much faster, and a solution might be more feasible with a new method. I'll report back with any new options that have opened up.
I've been trying for quite a while, but I can't quite get my head around it. How should I design this machine? Is it a web of states that are connected by pointers? Is it supposed to be several tables containing something, with something acting as an index? I'm not quite sure how to implement it; any guess I think of would result in an LL(k) parser.

This topic is closed to new replies.

Advertisement