# Lexical Analyzer


## Recommended Posts

How would I go about writing my own lexical analyzer? Just something that takes a std::string, parses it into tokens, and dumps them in a std::list.

##### Share on other sites
Robust lexical analyzers generally use Finite State Automata: deterministic FSAs, which are produced from nondeterministic FSAs. The nondeterministic FSA, in turn, is built from regular expressions. Any good book on compilers will teach you how to do these things; alternatively, do a web search for the terms I've capitalized. It's a heady area of computer science, but an accessible one.

##### Share on other sites
You can write them by hand for simple things; it's really fairly simple. If you want to use more advanced techniques like DFAs, look at lex/flex and other generators. Why reinvent the wheel on something like that?
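To make "write them by hand" concrete, here is a minimal sketch of a hand-rolled lexer in the spirit of the original question (std::string in, std::list of tokens out). The function name `tokenize` and the token classes (numbers, identifiers, single-character symbols) are my own illustrative choices, not anything from the thread:

```cpp
#include <cctype>
#include <list>
#include <string>

// Minimal hand-rolled lexer: splits the input into number, identifier,
// and single-character symbol tokens, skipping whitespace.
std::list<std::string> tokenize(const std::string& input) {
    std::list<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;  // skip whitespace between tokens
        } else if (std::isdigit(c)) {
            std::size_t start = i;  // number: a run of digits
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) ++i;
            tokens.push_back(input.substr(start, i - start));
        } else if (std::isalpha(c)) {
            std::size_t start = i;  // identifier: a letter, then letters/digits
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i]))) ++i;
            tokens.push_back(input.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, input[i]));  // one-char symbol
            ++i;
        }
    }
    return tokens;
}
```

For example, `tokenize("x1 = 42 + y")` yields the five tokens `x1`, `=`, `42`, `+`, `y`. Real lexers add token kinds, source positions, and error reporting, but the dispatch-on-first-character loop is the whole idea.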

In the old days, every C programmer knew how to do these simple things. Nowadays, everyone wants to use the most esoteric techniques to do a child's job.

##### Share on other sites
If you are going to use the STL, you might as well use things like string streams and tokenizers.
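For the simplest case, whitespace-separated tokens, a string stream already does the splitting. A small sketch (the helper name `split_ws` is mine):

```cpp
#include <iterator>
#include <list>
#include <sstream>
#include <string>

// Whitespace tokenizing via a string stream: operator>> skips whitespace
// and reads one whitespace-delimited word at a time.
std::list<std::string> split_ws(const std::string& input) {
    std::istringstream stream(input);
    return std::list<std::string>(std::istream_iterator<std::string>(stream),
                                  std::istream_iterator<std::string>());
}
```

This won't separate `a+b` into three tokens, so it only covers the "dump words into a std::list" end of the problem, but it is hard to beat for effort.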

But if you want to do it yourself: as Sneftel said, you have to construct a regular expression for each of your tokens. Then you produce a state machine that reads characters until it finishes a token or reaches an error state.

For example, take the token for integer numbers. A regular expression could look like this:

digit+

with digit = (0|1|2|3|4|5|6|7|8|9)

That translates to "an integer is any sequence of at least one digit".

Then you can easily write a deterministic FSA to read it:

State 0: on digit, go to State 1;
         on anything else, go to the error state.
State 1: on digit, stay in State 1;
         on anything else, go to State 2 (token complete).
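The two-state integer recognizer above transcribes almost mechanically into code. A sketch (the function name `match_integer` and the consumed-length return convention are my own choices):

```cpp
#include <cctype>
#include <string>

// Runs the integer DFA starting at `pos`: State 0 is the start, State 1
// means "at least one digit seen" (accepting). Returns how many characters
// the token consumed, or 0 if no integer starts at `pos` (the error case).
std::size_t match_integer(const std::string& input, std::size_t pos) {
    int state = 0;
    std::size_t i = pos;
    while (i < input.size() &&
           std::isdigit(static_cast<unsigned char>(input[i]))) {
        state = 1;  // State 0 or 1: on digit, go to (or stay in) State 1
        ++i;
    }
    // Any other character sends us to State 2: accept if we reached
    // State 1, otherwise report failure (0 characters consumed).
    return state == 1 ? i - pos : 0;
}
```

A full lexer runs one such matcher per token class (or one combined DFA) at the current position and emits the longest match.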

There are a lot of tutorials around about FSAs and regular expressions. Good luck!

PS: If it's not for learning but for a real project, lex and lex++ are the way to go.

##### Share on other sites
We have an article series that covers things like finite state automata, if you're interested in writing those yourself.

Another option in C++ is to use boost::spirit to create a lexer.

##### Share on other sites
Quote:
Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc look at lex/flex and other generators, why reinvent the wheel on something like that? In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.

##### Share on other sites
Quote:
Original post by Extrarius
Quote:
Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc look at lex/flex and other generators, why reinvent the wheel on something like that? In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.

Not only that, but for complex grammars auto-generated code can be quite slow compared to a hand-written lexer. Even so, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.

##### Share on other sites
Quote:
Original post by Extrarius
Quote:
Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc look at lex/flex and other generators, why reinvent the wheel on something like that? In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.

Many times strtok() suffices. When strtok() fails, there are handrollers. When handrollers fail, there is Lex. When lex fails, you kill yourself.
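For the record, the strtok() approach mentioned above looks like this. Note that strtok() mutates its input buffer and keeps hidden static state, so here we tokenize a writable copy; the wrapper name `split_strtok` is my own:

```cpp
#include <cstring>
#include <list>
#include <string>

// Splits `input` on any of the characters in `delims` using C's strtok().
// strtok() writes '\0' over each delimiter it finds, so we work on a copy.
// Runs of delimiters are collapsed: empty fields are never produced.
std::list<std::string> split_strtok(const std::string& input,
                                    const char* delims) {
    std::string copy = input;  // writable, null-terminated buffer (C++11)
    std::list<std::string> tokens;
    for (char* p = std::strtok(&copy[0], delims); p != nullptr;
         p = std::strtok(nullptr, delims)) {
        tokens.push_back(p);
    }
    return tokens;
}
```

For example, `split_strtok("a,b,,c", ",")` gives `a`, `b`, `c`. The hidden state also makes strtok() unsafe across threads or interleaved tokenizations, which is usually the point where the hand-rollers take over.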

Not to mention that Python, Perl, and other scripting languages tend to have a ton of tools to help too, fast and simple to use.

But rolling the DFA as pure C code is very simple anyway.

##### Share on other sites
Quote:
Original post by etothex
Quote:
Original post by Extrarius
Quote:
Original post by Anonymous Poster
You can write them by hand for simple things. It's really fairly simple. If you want to use more advanced techniques like DFAs etc look at lex/flex and other generators, why reinvent the wheel on something like that? In the old days, every C programmer knew how to do these simple things. Now in this modern day, everyone wants to use the most esoteric techniques to do a child's job.
There are many valid reasons for doing it by hand, such as having readable code that fits your style and design. I've yet to see a generator that produced code I could use as anything more than a standalone converter.

Not only that, but for complex grammars auto-generated code can be quite slow compared to a hand-written lexer. Even so, the parser is by far the most processor-hungry part of a typical compiler.

And anyway, every real computer scientist needs to know DFAs, NFAs, regexes, formal grammars, and how to link them all together. They're essential to formal, theoretical computer science, as well as to understanding more complex automata and Turing machines, which in turn are essential to the study of algorithms.

You make a good point, but 99% of the time you only need very basic tools.

If the OP wants to write a compiler, then by all means, this advice is good.
I'm assuming the OP wants practical advice for daily chores, not for writing the next C#2++ compiler.

##### Share on other sites
Quote:
Original post by chaosgame
How would I go about writing my own lexical analyzer. Just something that would just take a std::string, parse it into tokens and dump it in a std::list.

Not sure what you are trying to do, but the easiest way to parse something is just to use yacc and define your grammar. It's pretty easy. It's not so hard to make your own parser, but the nice thing about yacc is that it's very easy to make changes. The only problem with shift/reduce parsing is that it's hard to get good error messages.

Of course, parsing begs for a Lisp-like data structure; it's really the easiest way to implement it.
