As for the parser, it reads in grammar, and then generates a state machine in memory, which then can be used to parse a language described by the grammar. It works a lot like more conventional parser generators, except that rather than outputting code to be compiled into a state machine, it creates a state machine in memory that the parser then follows to process the input. It doesn't have features like combining similar state sequences in the machine into one chain, as that would break the requirement of needing the non-terminal names to be distinct, and to be able to tell which non-terminal it is.
As for the example, take this C grammar excerpt:
enumeration_const : IDENTIFIER ; ... type_spec : VOID | CHAR | SHORT | INT | LONG | FLOAT | DOUBLE | SIGNED | UNSIGNED | struct_or_union_spec | enum_spec | typedef_name ; ... typedef_name : IDENTIFIER ; ... primary_exp : IDENTIFIER | const | STRING | '(' exp ')' ; ... const : INT_CONST | CHAR_CONST | FLOAT_CONST | enumeration_const ;
Here, you can see that both enumeration_const and typedef_name clearly expect to be defined as a single identifier, and multiple other non-terminals expect either one or the other. In addition, primary_exp requires an IDENTIFIER that is not reduced to either of them. To resolve the reduce-reduce conflict, the parser currently reduces all identifiers to enumeration_const's, which is not good. My first guess is to somehow use the look-ahead to determine whether or not to reduce, and to what I should reduce based on against which non-terminal I'm checking the look-ahead. It'd have to work for any number of tokens, and not depend on the non-terminal's definition, so that has me stuck. I also can't use a symbol table from the parser to tell whether it is an enumeration or a type name, as it isn't supposed to be language dependant; not all languages have enumurations or named types, and some languages have more symbol types than this. I'd like to resolve it without any magic.
Lex and Yacc are good references, but they cannot be used as substitutes, and will not be. The parser must be completed; if this really truly is impossible for a LALR algorithm, then I'd be interested in learning how to express this concept in a manner that can be parsed by this parser. This parser serves an important purpose, and while each rule and non-terminal can have a callback to execute a function when it encounters one of that type, if I could do it without having to execute any code outside of the parser, it'd be preferable. The parser is written in C, and the language I'm currently testing is C. I have about all relevant standard library functions, and the parser cannot backtrack or guess incorrectly. The parser uses advanced file streams, and can be reading the input from memory, a physical file, a pipe, a network socket, a device node, a cat, etc, so the stream cannot be rewound past a character or two of ungetc(). While I have 32bit integers, I cannot guarantee the presence of a 64bit type.
Any advice would be appreciated, and let me know if more information is needed.
Edited by Ectara, 27 August 2012 - 09:39 AM.