Lexical Analyzer

How would I go about writing my own lexical analyzer. Just something that would just take a std::string, parse it into tokens and dump it in a std::list.
"Are you threatening me, Master Jedi?" - Chancellor Palpatine
Robust lexical analyzers generally use Finite State Automata: deterministic FSAs, which are constructed from nondeterministic FSAs. The nondeterministic FSA, in turn, is built from regular expressions. Any good book on compilers will teach you how to do these things; alternatively, do a web search for the terms I've capitalized. It's a heady area of computer science, but an accessible one.
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.


If you are going to use the STL, you might as well use things like string streams and tokenizers.
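For the original question, something this small might already do the job. A minimal sketch using std::istringstream; splitting on whitespace and storing the tokens as plain std::strings are my own assumptions here, not part of the question:

#include <list>
#include <sstream>
#include <string>

// Split a string on whitespace and collect the pieces into a std::list.
std::list<std::string> tokenize(const std::string& input)
{
    std::list<std::string> tokens;
    std::istringstream stream(input);
    std::string token;
    while (stream >> token) // operator>> skips whitespace between tokens
        tokens.push_back(token);
    return tokens;
}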

But if you want to do it yourself... like Sneftel said, you have to construct the regular expression for each of your tokens. Then you produce a state machine that will read each character until it finishes reading a token or reaches an error state.

For example, take the token for integer numbers. A regular expression could look like this:

(digit)+

with digit = (0|1|2|3|4|5|6|7|8|9)

That translates to "an integer is any sequence of at least one digit".

Then you can easily write a deterministic FSA to read it:

State 0: On digit, go to State 1.
         On anything else, go to the error state.
State 1: On digit, go to State 1.
         On anything else, go to State 2.
State 2: Integer read.
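As a rough sketch, that state machine translates almost line for line into C++. The function name lexInteger and the choice of returning the matched lexeme through an output parameter are just for illustration, not the only way to do it:

#include <cctype>
#include <string>

// Try to read an integer token starting at position pos.
// On success (State 2) returns true, stores the digits in lexeme and advances pos.
// On the error state returns false and leaves pos untouched.
bool lexInteger(const std::string& input, std::size_t& pos, std::string& lexeme)
{
    std::size_t i = pos;

    // State 0: we need at least one digit, anything else is an error.
    if (i >= input.size() || !std::isdigit(static_cast<unsigned char>(input[i])))
        return false;

    // State 1: keep consuming digits until something else shows up.
    while (i < input.size() && std::isdigit(static_cast<unsigned char>(input[i])))
        ++i;

    // State 2: integer read.
    lexeme = input.substr(pos, i - pos);
    pos = i;
    return true;
}

A full lexer is then just a loop that skips whitespace and tries each such token function in turn.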

There are a lot of tutorials around about FSAs and regular expressions. Good luck!

PS: If it's not for learning but for a real project, lex and flex are the way to go.
We have an article series that covers things like finite state automata, if you're interested in writing those yourself.

Another option in C++ is to use boost::spirit to create a lexer.
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


Not only that, but for complex grammars, auto-generated code can be quite slow compared to a hand-written lexer. Even then, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


If you want to waste your time nitpicking about something like that, go ahead.
Many times strtok() suffices. When strtok() fails, there are handrollers. When handrollers fail, there is Lex. When lex fails, you kill yourself.
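For the record, a typical strtok() chain might look like the sketch below. Keep in mind that strtok() writes into its buffer, so you need a writable copy of the std::string; the whitespace delimiter set is just an assumption:

#include <cstring>
#include <list>
#include <string>
#include <vector>

std::list<std::string> tokenizeWithStrtok(const std::string& input)
{
    // strtok writes '\0' terminators into its argument, so work on a mutable copy.
    std::vector<char> buffer(input.begin(), input.end());
    buffer.push_back('\0');

    std::list<std::string> tokens;
    for (char* tok = std::strtok(&buffer[0], " \t\n");
         tok != 0;
         tok = std::strtok(0, " \t\n"))
    {
        tokens.push_back(tok);
    }
    return tokens;
}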

Not to mention that Python, Perl, and other scripting languages tend to have a ton of tools to help that are fast and simple to use.

But rolling the DFA as pure C code is very simple anyway.
Quote:Original post by etothex
Quote:Original post by Extrarius
Quote:Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc., look at lex/flex and other generators; why reinvent the wheel on something like that?

In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.


Not only that, but for complex grammars, auto-generated code can be quite slow compared to a hand-written lexer. Even then, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.


You make a good point, but 99% of the time you only need very basic tools.

If the OP wants to write a compiler, then by all means, this advice is good.
I am assuming the OP wants practical advice for daily chores, not writing the next C#2++ compiler and so on.
Quote:Original post by chaosgame
How would I go about writing my own lexical analyzer. Just something that would just take a std::string, parse it into tokens and dump it in a std::list.


Not sure what you are trying to do, but the easiest way to parse something is just to use yacc and define your grammar. It's pretty easy. It's not so hard to make your own parser, but the nice thing about yacc is that it's very easy to make changes. The only problem with shift/reduce parsing is that it's hard to get good error messages.

Of course, parsing begs for a Lisp-like data structure; it's really the easiest way to implement it.



EvilDecl81

