Lexical Analyzer

How would I go about writing my own lexical analyzer. Just something that would just take a std::string, parse it into tokens and dump it in a std::list.
"Are you threatening me, Master Jedi?" - Chancellor Palpatine
Robust lexical analyzers generally use Finite State Automata: deterministic FSAs, which are constructed from nondeterministic FSAs. The nondeterministic FSA, in turn, is built from regular expressions. Any good book on compilers will teach you how to do these things; alternatively, do a web search for the terms I've capitalized. It's a heady area of computer science, but an accessible one.
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.


If you are going to use the STL, you might as well use things like string streams and tokenizers.
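For the original question, something this small might already do the job. A minimal sketch using std::istringstream; splitting on whitespace and storing the tokens as plain std::strings are my own assumptions here, not part of the question:

#include <list>
#include <sstream>
#include <string>

// Split a string on whitespace and collect the pieces into a std::list.
std::list<std::string> tokenize(const std::string& input)
{
    std::list<std::string> tokens;
    std::istringstream stream(input);
    std::string token;
    while (stream >> token) // operator>> skips whitespace between tokens
        tokens.push_back(token);
    return tokens;
}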

But if you want to do it yourself... like Sneftel said, you have to construct the regular expression for each of your tokens. Then you produce a state machine that will read each character until it finishes reading a token or reaches an error state.

For example, take the token for integer numbers. A regular expression could look like this:

(digit)+

with digit = (0|1|2|3|4|5|6|7|8|9)

That translates to "an integer is any sequence of at least one digit".

Then you can easily write a deterministic FSA to read it:

State 0: On digit, go to State 1.
         On anything else, go to the error state.
State 1: On digit, go to State 1.
         On anything else, go to State 2.
State 2: Integer read.
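As a rough sketch, that state machine translates almost line for line into C++. The function name lexInteger and the choice of returning the matched lexeme through an output parameter are just for illustration, not the only way to do it:

#include <cctype>
#include <string>

// Try to read an integer token starting at position pos.
// On success (State 2) returns true, stores the digits in lexeme and advances pos.
// On the error state returns false and leaves pos untouched.
bool lexInteger(const std::string& input, std::size_t& pos, std::string& lexeme)
{
    std::size_t i = pos;

    // State 0: we need at least one digit, anything else is an error.
    if (i >= input.size() || !std::isdigit(static_cast<unsigned char>(input[i])))
        return false;

    // State 1: keep consuming digits until something else shows up.
    while (i < input.size() && std::isdigit(static_cast<unsigned char>(input[i])))
        ++i;

    // State 2: integer read.
    lexeme = input.substr(pos, i - pos);
    pos = i;
    return true;
}

A full lexer is then just a loop that skips whitespace and tries each such token function in turn.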

There are a lot of tutorials around about FSAs and regular expressions. Good luck!

PS: If it's not for learning but for a real project, lex and flex are the way to go.
We have an article series that covers things like finite state automata, if you're interested in writing those yourself.

Another option in C++ is to use boost::spirit to create a lexer.
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


Not only that, but for complex grammars, auto-generated code can be quite slow compared to a hand-written lexer. Even then, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


If you want to waste your time nitpicking about something like that, go ahead.
Many times strtok() suffices. When strtok() fails, there are handrollers. When handrollers fail, there is Lex. When lex fails, you kill yourself.
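For the record, a typical strtok() chain might look like the sketch below. Keep in mind that strtok() writes into its buffer, so you need a writable copy of the std::string; the whitespace delimiter set is just an assumption:

#include <cstring>
#include <list>
#include <string>
#include <vector>

std::list<std::string> tokenizeWithStrtok(const std::string& input)
{
    // strtok writes '\0' terminators into its argument, so work on a mutable copy.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::list<std::string> tokens;
    for (char* tok = std::strtok(&buffer[0], " \t\n");
         tok != 0;
         tok = std::strtok(0, " \t\n"))
    {
        tokens.push_back(tok);
    }
    return tokens;
}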

Not to mention that Python, Perl, and other scripting languages tend to have a ton of tools to help that are fast and simple to use.

But rolling the DFA as pure C code is very simple anyway.
Quote:Original post by etothex
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


Not only that, but for complex grammars, auto-generated code can be quite slow compared to a hand-written lexer. Even then, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.


You make a good point, but 99% of the time you only need very basic tools.

If the OP wants to write a compiler, then by all means, this advice is good.
I am assuming the OP wants practical advice for daily chores, not writing the next C#2++ compiler and so on.
Quote:Original post by chaosgame
How would I go about writing my own lexical analyzer. Just something that would just take a std::string, parse it into tokens and dump it in a std::list.


Not sure what you are trying to do, but the easiest way to parse something is just to use yacc and define your grammar. It's pretty easy. It's not so hard to make your own parser, but the nice thing about yacc is that it's very easy to make changes. The only problem with shift/reduce parsing is that it's hard to get good error messages.

Of course, parsing begs for a Lisp-like data structure; it's really the easiest way to implement it.



EvilDecl81

