Lexical Parser

How would one go about writing a lexical parser? I have several tools and containers that I've written at my disposal to contribute to the job. Ultimately, I'd like to use it in my compiler: read in a token at a time, tell its type, and decide what to do based on that, though I know it is never that simple. I plan to have it read in BNF files and create a lexer state that processes input and returns tokens sequentially. One problem I foresee is knowing what many small tokens form when put together: if the largest non-terminal type is returned, how do I find out the smaller components so I can operate on them, short of returning some sort of tree? I'm not sure how to go about this, and any advice would be helpful.

And please, no suggestions to abandon the project, or use one already written. Far too frequently, I receive only discouraging responses.
Note that this is most likely not the optimal (or perhaps even practical) way of doing this. I need to do this for an assignment, and we are "forced" to use certain tools to create our own "scripting" language. What I do is the following:

I use the flex tool from Cygwin to scan the file, and with the help of a small script I can filter out all the different tokens I want to be able to parse (some of the usage examples I use come from this site). The scanner generates a C file whose functions you can override, so you can parse at your leisure. From that point on you can go through your tokens and create "paths" that lead you to a variable declaration (for example, you found the path "object.position.x = 10"), and with that path you can figure out how to achieve whatever you want. There are multiple ways to get meaningful data out of this, and I guess you can figure out what suits you best and how you want to parse them.
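If it helps to see what the C side of that looks like: the flex-generated scanner exposes yylex() and yytext, and you drive it with a loop roughly like the sketch below. The token codes here are made up purely for illustration; in practice they come from your own header or from a parser generator.

    #include <stdio.h>

    /* Provided by the flex-generated scanner (lex.yy.c). */
    extern int yylex(void);
    extern char *yytext;

    /* Hypothetical token codes; values above 255 avoid clashing with
       single-character tokens returned as themselves. */
    enum { TOK_IDENT = 256, TOK_NUMBER, TOK_ASSIGN };

    int main(void)
    {
        int tok;
        /* yylex() returns 0 once the input is exhausted. */
        while ((tok = yylex()) != 0)
            printf("token %d, lexeme '%s'\n", tok, yytext);
        return 0;
    }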

Hope it helps!
You seem to be trying to write a syntactic parser, not a lexical parser. A lexical parser eats characters and spits out tokens, while a syntactic parser eats tokens and spits out a parsing tree. People often refer to these as "lexer" and simply "parser", respectively.

If you want to take one token at a time and decide what to do based on that, you probably want to write an LL(1) parser. That Wikipedia page contains a section on how to build an LL(1) parsing table which I think answers your question.
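To make that concrete, an LL(1) parser is basically a stack plus a table lookup keyed on (nonterminal, lookahead). Here's a rough, hand-rolled sketch in C for a deliberately tiny grammar (S -> ( S ) S | empty), with the table folded into an if/else chain; it's only meant to show the mechanics, not to be how you'd structure a real one.

    #include <stdio.h>

    /* Toy grammar:  S -> ( S ) S  |  epsilon
       Terminals: '(' ')' and '$' (end of input).
       LL(1) table: on '(' expand S to "( S ) S"; on ')' or '$' expand S to
       epsilon; anything else is a syntax error. */

    static int parse(const char *input)
    {
        char stack[256];
        int top = 0;
        stack[top++] = '$';   /* end-of-input marker */
        stack[top++] = 'S';   /* start symbol */

        while (top > 0) {
            char sym  = stack[--top];
            char look = *input ? *input : '$';

            if (sym == 'S') {
                if (look == '(') {
                    /* Push the right-hand side "( S ) S" in reverse. */
                    stack[top++] = 'S';
                    stack[top++] = ')';
                    stack[top++] = 'S';
                    stack[top++] = '(';
                } else if (look == ')' || look == '$') {
                    /* epsilon production: push nothing */
                } else {
                    return 0;   /* no table entry: syntax error */
                }
            } else {
                /* A terminal on the stack must match the lookahead. */
                if (sym != look)
                    return 0;
                if (*input)
                    input++;
            }
        }
        return *input == '\0';
    }

    int main(void)
    {
        printf("%d\n", parse("(()())"));  /* prints 1: well formed */
        printf("%d\n", parse("(()"));     /* prints 0: error */
        return 0;
    }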
Are you making a lexer (fixed parsing for a known grammar) or a lexer generator (reading BNF or some other grammar and then working dynamically on a stream based on that)?

For the former, it's fairly easy to do it naively. Read in characters, switch on state and generate results. If it's the latter... You should really use an existing tool. Unless you're doing something quirky, this is a well-covered area that is difficult to do well but has been effectively solved for decades...
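For the fixed-grammar case, "read characters, switch on the current character, emit tokens" really is about this simple; a rough sketch (the token kinds and buffer size are picked arbitrarily for illustration):

    #include <ctype.h>
    #include <stdio.h>

    enum token_kind { TK_IDENT, TK_NUMBER, TK_PUNCT, TK_END };

    struct token {
        enum token_kind kind;
        char text[64];
    };

    /* Reads one token starting at *p and advances *p past it. */
    static struct token next_token(const char **p)
    {
        struct token t = { TK_END, "" };
        const char *s = *p;
        size_t n = 0;

        while (isspace((unsigned char)*s))      /* skip whitespace */
            s++;

        if (*s == '\0') {
            *p = s;
            return t;
        }

        if (isalpha((unsigned char)*s) || *s == '_') {
            t.kind = TK_IDENT;                   /* identifier / keyword */
            while ((isalnum((unsigned char)*s) || *s == '_') && n < sizeof t.text - 1)
                t.text[n++] = *s++;
        } else if (isdigit((unsigned char)*s)) {
            t.kind = TK_NUMBER;                  /* integer literal */
            while (isdigit((unsigned char)*s) && n < sizeof t.text - 1)
                t.text[n++] = *s++;
        } else {
            t.kind = TK_PUNCT;                   /* single-character operator etc. */
            t.text[n++] = *s++;
        }
        t.text[n] = '\0';
        *p = s;
        return t;
    }

    int main(void)
    {
        const char *src = "x1 = foo + 42;";
        struct token t;
        while ((t = next_token(&src)).kind != TK_END)
            printf("kind %d: '%s'\n", t.kind, t.text);
        return 0;
    }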

[edit:] If it's for a compiler, it's likely fixed, though most compilers can't just read a line in at a time and know the type of each token... Do you happen to have the grammar well defined, or can you give us a better idea of what you'd expect the inputs and outputs to be?
I guess I could work from tokens. However, that would require me to write something like what I already have for every type of input, just to create tokens from it, and I'm also looking for re-usability. My preprocessor reads a stream and creates tokens from it, and I could use those tokens for something; however, I'd ideally like to have this create the tokens. That said, it looks like I'll need both parsers at some point.

As for the suggestions to abandon writing this project and use an already written parser: no, thank you.

My preprocessor reads a stream and creates tokens from it...


So, it sounds like you have the tokenizer step done already. The only thing left during lexical analysis is to classify what kind of terminal each token is. For example, your terminal types might include: {identifier, keyword-void, keyword-for, keyword-if, operator-plus, operator-multiply, operator-not, ...}
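If the tokens are already split, that classification can be as simple as a lookup from lexeme to terminal kind; a minimal sketch (the keyword and operator lists are obviously abbreviated, and the names are just placeholders):

    #include <ctype.h>
    #include <string.h>

    enum terminal {
        T_IDENTIFIER, T_KEYWORD_VOID, T_KEYWORD_FOR, T_KEYWORD_IF,
        T_OP_PLUS, T_OP_MULTIPLY, T_OP_NOT, T_NUMBER, T_UNKNOWN
    };

    /* Classify one already-separated lexeme as a terminal kind. */
    enum terminal classify(const char *lexeme)
    {
        if (strcmp(lexeme, "void") == 0) return T_KEYWORD_VOID;
        if (strcmp(lexeme, "for")  == 0) return T_KEYWORD_FOR;
        if (strcmp(lexeme, "if")   == 0) return T_KEYWORD_IF;
        if (strcmp(lexeme, "+")    == 0) return T_OP_PLUS;
        if (strcmp(lexeme, "*")    == 0) return T_OP_MULTIPLY;
        if (strcmp(lexeme, "!")    == 0) return T_OP_NOT;
        if (isdigit((unsigned char)lexeme[0]))
            return T_NUMBER;
        if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_')
            return T_IDENTIFIER;
        return T_UNKNOWN;
    }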

After you've created terminals, you feed those to the parser (by parser, I mean the part that builds your abstract syntax tree).

I use table-based LR(0) parsing with a GLR-based state machine. This involves:

  • Build the parse table from the grammar productions. This is probably the hardest step to understand. I do this at runtime, whereas most popular parser generators do it as a pre-compile step (generating source code for the parser). This step is reusable across a wide variety of grammars.
  • Create a state machine that can follow the transitions in your table. An LR state machine is easy to write; the GLR extensions are much more complicated. This is a generic algorithm which can be reused for multiple parse tables.
  • Feed all of your tokenized terminals to the state machine, followed by a special 'end of input' terminal.
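To give a feel for the last two bullets, here is a bare-bones shift/reduce driver in C for a deliberately tiny grammar (E -> E + n | n), with the ACTION/GOTO tables written out by hand instead of generated. The names and encoding are just for illustration, and a GLR driver would fork this single stack wherever the table holds more than one entry.

    #include <stdio.h>

    /* Toy grammar:   1: E -> E + n    2: E -> n
       Terminals: n, +, and $ (end of input). */
    enum { T_n, T_plus, T_end, NUM_TERMS };
    enum { ERR, SHIFT, REDUCE, ACCEPT };

    struct action { int what; int arg; };    /* arg = target state or production */

    static const struct action ACTION[5][NUM_TERMS] = {
        /*             n            +            $          */
        /* 0 */ { {SHIFT, 2},  {ERR, 0},     {ERR, 0}    },
        /* 1 */ { {ERR, 0},    {SHIFT, 3},   {ACCEPT, 0} },
        /* 2 */ { {ERR, 0},    {REDUCE, 2},  {REDUCE, 2} },
        /* 3 */ { {SHIFT, 4},  {ERR, 0},     {ERR, 0}    },
        /* 4 */ { {ERR, 0},    {REDUCE, 1},  {REDUCE, 1} },
    };

    static const int GOTO_E[5]  = { 1, -1, -1, -1, -1 };  /* goto on nonterminal E */
    static const int RHS_LEN[3] = { 0, 3, 1 };            /* states popped per production */

    static int parse(const int *terms)        /* terms ends with T_end */
    {
        int stack[64];
        int top = 0;
        stack[top] = 0;                       /* start state */

        for (;;) {
            struct action a = ACTION[stack[top]][*terms];
            if (a.what == SHIFT) {
                stack[++top] = a.arg;         /* push the new state */
                terms++;                      /* consume the terminal */
            } else if (a.what == REDUCE) {
                top -= RHS_LEN[a.arg];        /* pop the right-hand side */
                stack[top + 1] = GOTO_E[stack[top]];
                top++;                        /* push the goto state for E */
            } else if (a.what == ACCEPT) {
                return 1;
            } else {
                return 0;                     /* syntax error */
            }
        }
    }

    int main(void)
    {
        int good[] = { T_n, T_plus, T_n, T_plus, T_n, T_end };
        int bad[]  = { T_n, T_plus, T_plus, T_end };
        printf("%d %d\n", parse(good), parse(bad));   /* prints "1 0" */
        return 0;
    }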

http://en.wikipedia...._parsing_tables

So, it sounds like you have the tokenizer step done already. The only thing left during lexical analysis is to classify what kind of terminal each token is.


I would like to have it tokenize, as I would think that trying to attach the type to a token after it has been processed is not an easy task. Not to mention, the preprocessor does quick and imperfect tokenization: it only ensures that tokens are whole, well-formed, and in a state where they can be operated upon.

I would like to have it tokenize, as I would think that trying to attach the type to a token after it has been processed is not an easy task. Not to mention, the preprocessor does quick and imperfect tokenization: it only ensures that tokens are whole, well-formed, and in a state where they can be operated upon.


It depends on the language. Some languages are designed so that the lexical stage is very easy. Natural languages like English are the exact opposite. GLR parsers are specifically designed to handle ambiguous situations like this. For example, if you see the word "run" in English, it can be a noun, an intransitive verb, or a transitive verb. To handle this, you can pass a GLR parser multiple options per token, and it'll analyze each possibility in parallel. Most programming languages specifically avoid this ambiguity, which makes things much easier.

Your thread has the "C Language" tag on it. Does that mean you want to parse C, or are you implementing it in C?

Your thread has the "C Language" tag on it. Does that mean you want to parse C, or are you implementing it in C?


Both. I have a couple specialized virtual processors already implemented, and I wrote a poor compiler for it before. I have a C preprocessor already written that can either give a container of tokens and line numbers, or write the processed output to a stream.

Past that, I have nothing. Ideally, it'd be neat to use this lexer for something other than the compiler, or another compiler. I don't want to hard-code C syntax into it, and I don't want to embed it in to the compiler; I planned it out to be a separate module that the compiler makes calls to in order to process its data.

I guess I'll first ask questions to see if I understand. First, does BNF describe a syntax for a lexical parser, or a syntax parser? If BNF is not used for a lexical parser, which type of input should I use?

Second, would I simply be returning tokens, without any type information or how they might relate to each other? If so, would the next parser then take those and interpret syntactic meaning? Also, how would I determine which non-terminal is of the highest order without scanning all of the tokens? If one terminal could be part of a non-terminal, and that non-terminal could belong to another, how would I best ascertain the highest non-terminal?

First, does BNF describe a syntax for a lexical parser, or a syntax parser?


Either. BNF for the lexical parser works with characters and the BNF for the syntactic parser works with tokens that the lexer generates. Depending on the grammar you're looking at, they might be mixed together though. It might be a grammar based on words with the implicit 'ignore all whitespace' rule.
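For instance (the rule names are invented just for illustration), the first two rules below are lexical-level BNF, building a token out of individual characters, while the last is syntactic-level BNF, combining whole tokens:

    <digit>  ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
    <number> ::= <digit> | <digit> <number>

    <expr>   ::= <number> | <expr> "+" <number>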


Second, would I simply be returning tokens, without any type information or how they might relate to each other?


Usually, the lexical grammar simply converts the stream of characters into a stream of tokens, omitting comments/whitespace/preprocessing directives. It might be prudent to include association, but it's usually too early to determine that.


If so, would the next parser then take those and interpret syntactic meaning?


Yes. The lexer makes tokens, the parser makes sure the tokens are in the right order and figures out things like "Are these parentheses a function call, or grouping for precedence in an expression?".
