Lexical Parsing With Meaningful Errors

Started by
3 comments, last by Ectara 11 years, 8 months ago
I have a lexical parser that I'm writing that parses as it reads the input with a shift and reduce parser that accumulates lexemes until it can determine that there is syntactic meaning to the topmost tokens on the stack, then it groups them as children to a node of the non-terminal's type, then pushes the new node. At the end of parsing, it results in an abstract syntax tree; however, if there is an error in parsing and it is malformed, I have the feeling that the tree will be incorrectly generated, with error information being difficult to recover. The tree generation uses the lexer's underlying internal functions for handling the input text, and returning the next lexeme. If the input is perfectly formed, it would result in a correct tree, without any line numbers or information associated with the input text. Is there a better way to parse the files that would preserve the information? With slight modification, I suppose I could track column and line numbers.
Advertisement

At the end of parsing, it results in an abstract syntax tree; however, if there is an error in parsing and it is malformed, I have the feeling that the tree will be incorrectly generated, with error information being difficult to recover.


If there is an error in parsing, you should never reach the end. You should error with 'expected X saw ___' if possible. Otherwise, keeping track of the 'deepest' attempted parse is usually sufficient to identify where things went south. If your parsing has a bug on the otherhand... shrug.

From experience, it is important to carry enough information to provide a filename (or other source identification string) and a line/column number - all the way through to code generation. Not only does it make things easier to debug, but you need to eventually output debugging symbols, which use this info so your output itself is debuggable.
Right, I understand this. Is there a better way to parse the files that would preserve the information? I could get the lexemes one by one, and keep track of it easily. However, parsing from top to bottom seems silly, when parsing from bottom to top makes the most sense. However, parsing from the bottom up cannot see the big picture, and thus when an error occurs, you wouldn't quite know it for a while, and even then, it is hard to know what was really expected. The method I have would work, but it is the unexpected error that would break its behavior.
I would suggest having a look at how Clang does its error handling, cause not only does it have the most informative errors and warnings I've ever seen, its doing it for some of the most complex programming languages available (in terms of processing), plus it runs very efficiently (both speed and memory wise).
I'm reading through all of its source, but it is extremely difficult to digest, especially with just about all of the code written in its headers. Does this mean I'm on the right track with a LALR parser using a shift-and-reduce method?

This topic is closed to new replies.

Advertisement