
Creating a lexer


I was wondering if anyone knows of any tutorials on programming your own lexer. The only ones I'm aware of assume you want to use Lex or Yacc, but I am more interested in learning the theory and how to create one myself. I have googled to no avail, and I can't afford the Dragon Book, so I would appreciate any help anyone could give me :) PS: I will buckle down and buy the book if there aren't any online tutorials.

If you insist on writing one by hand, you can use the techniques in the Algorithmic Forays article series here on GameDev to write a finite state machine that handles lexical analysis.
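
To give a rough idea of what the finite state machine approach looks like, here is a minimal sketch. It assumes only three token kinds (identifiers, integers, and single-character symbols), and the names `State`, `Token`, and `lexFsm` are made up for illustration; they are not taken from the Algorithmic Forays articles.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical state and token types, for illustration only.
enum class State { Start, Ident, Number };

struct Token {
    std::string kind;  // "ident", "number", or "symbol"
    std::string text;
};

std::vector<Token> lexFsm(const std::string& src) {
    std::vector<Token> out;
    State state = State::Start;
    std::string cur;  // text of the token being built
    // Run one step past the end so a '\0' sentinel flushes the last token.
    for (std::size_t i = 0; i <= src.size(); ) {
        const bool atEnd = (i == src.size());
        const unsigned char c = atEnd ? '\0' : static_cast<unsigned char>(src[i]);
        switch (state) {
        case State::Start:
            if (atEnd || std::isspace(c)) {
                ++i;  // nothing to emit
            } else if (std::isalpha(c) || c == '_') {
                state = State::Ident;   // re-examine c in the new state
            } else if (std::isdigit(c)) {
                state = State::Number;  // re-examine c in the new state
            } else {
                out.push_back({"symbol", std::string(1, static_cast<char>(c))});
                ++i;
            }
            break;
        case State::Ident:
            if (!atEnd && (std::isalnum(c) || c == '_')) {
                cur += static_cast<char>(c);
                ++i;
            } else {
                out.push_back({"ident", cur});
                cur.clear();
                state = State::Start;
            }
            break;
        case State::Number:
            if (!atEnd && std::isdigit(c)) {
                cur += static_cast<char>(c);
                ++i;
            } else {
                out.push_back({"number", cur});
                cur.clear();
                state = State::Start;
            }
            break;
        }
    }
    return out;
}
```

Calling `lexFsm("x1 = 42")` yields three tokens: an identifier `x1`, a symbol `=`, and a number `42`. A real lexer would need more states (strings, comments, multi-character operators), but the shape stays the same: one current state, one input character, one transition per step.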

Here is a great link to an excellent parser. The author goes into some detail outlining how things work, which should help.

Gold Parser

IMO lex is not that much simpler than writing your own.

A lot of people seem to insist on the 'state machine' pattern, but things become a lot easier when you use inline states. Here is an example from my current project:
    // String, TokenList, isalpha_, isalnum_, isoper, islongoper and
    // ScriptParseError are types and helpers from my own project.
    void tokenize(String s, TokenList& out) {
        size_t pos = 0;
        while (pos < s.size()) {
            char c = s.getChar(pos);
            // Determine token type based on first character
            if (isspace(c)) {
                // ignore
                ++pos;
            } else if (isalpha_(c)) {
                // identifier
                size_t start = pos;
                // find end
                while (pos < s.size() && isalnum_(s.getChar(pos))) {
                    ++pos;
                }
                out.pushBack(s.substr(start, pos - start));
            } else if (isdigit(c)) {
                // number
                size_t start = pos;
                while (pos < s.size() && isdigit(s.getChar(pos))) {
                    ++pos;
                }
                out.pushBack(s.substr(start, pos - start));
            } else if (isoper(c)) {
                // operator
                if (pos + 1 < s.size() && islongoper(s.substr(pos, 2))) {
                    // long operator
                    out.pushBack(s.substr(pos, 2));
                    pos += 2;
                } else {
                    out.pushBack(s.substr(pos, 1));
                    ++pos;
                }
            } else if (c == '"') {
                // string
                size_t start = pos;
                ++pos;
                while (pos < s.size()) {
                    char ch = s.getChar(pos);
                    if (ch == '"') break;
                    if (ch == '\\') ++pos;  // skip the escaped character
                    ++pos;
                }
                // the token keeps the opening quote but not the closing one
                out.pushBack(s.substr(start, pos - start));
                ++pos;  // step over the closing quote
            } else if (c == '#') {
                // comment until end of line
                while (pos < s.size() && s.getChar(pos) != '\n') {
                    ++pos;
                }
            } else {
                throw ScriptParseError("Unknown character in code");
            }
        }
    }

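This technique can handle the lexing for most practical languages. Things get slightly more difficult if you need to look ahead: in C++, for example, L"something" is a wide string literal, so after reading an L you have to check the next character before deciding that it starts an identifier.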
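
The L"something" case only needs one character of lookahead. The sketch below shows one way to handle it with the same inline style. It is a standalone illustration that uses `std::string` and `std::vector` rather than the project's own types, and `peek`, `skipString`, and `tokenizeWithLookahead` are hypothetical names.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Return the character `offset` places ahead of pos, or '\0' past the end.
static char peek(const std::string& s, std::size_t pos, std::size_t offset = 1) {
    return pos + offset < s.size() ? s[pos + offset] : '\0';
}

// pos points at an opening quote. Returns the index just past the closing
// quote, or s.size() if the string is unterminated.
static std::size_t skipString(const std::string& s, std::size_t pos) {
    ++pos;  // opening quote
    while (pos < s.size() && s[pos] != '"') {
        if (s[pos] == '\\') ++pos;  // skip the escaped character
        ++pos;
    }
    return pos < s.size() ? pos + 1 : s.size();
}

std::vector<std::string> tokenizeWithLookahead(const std::string& s) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[pos]);
        const std::size_t start = pos;
        if (std::isspace(c)) {
            ++pos;
            continue;
        } else if (c == 'L' && peek(s, pos) == '"') {
            // Lookahead says this L is a string prefix, not an identifier.
            pos = skipString(s, pos + 1);
        } else if (c == '"') {
            pos = skipString(s, pos);
        } else if (std::isalpha(c) || c == '_') {
            while (pos < s.size() &&
                   (std::isalnum(static_cast<unsigned char>(s[pos])) || s[pos] == '_')) {
                ++pos;
            }
        } else {
            ++pos;  // any other character is a one-character token
        }
        out.push_back(s.substr(start, pos - start));
    }
    return out;
}
```

The important detail is the order of the branches: the `L` followed by `"` check has to come before the generic identifier branch, otherwise `L` would be consumed as a one-letter identifier. Unlike the example above, this sketch keeps both quotes in the string token.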

Wow, thanks for all the links and replies.

@twanvl: You're probably right, but it seems like it would take longer (the worst case is that the symbol you need is the last case checked). Also, I need lookahead. Thanks, though.

@Codemonger: That's awesome! I haven't seen that before; I'm sure it will be very helpful.

rate++ both.
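
On the worry that the symbol you need might be the last case checked: one way to take branch order out of the picture is to classify the first character with a 256-entry lookup table and then switch on the class. This is only a sketch, assuming ASCII input and the default "C" locale; `CharClass`, `buildTable`, and `classify` are hypothetical names, and the operator set is just an example.

```cpp
#include <array>
#include <cctype>
#include <initializer_list>

enum class CharClass : unsigned char { Other, Space, Alpha, Digit, Oper, Quote, Hash };

// Build the table once; afterwards each lookup is a single array index.
std::array<CharClass, 256> buildTable() {
    std::array<CharClass, 256> t{};  // every entry starts as CharClass::Other
    for (int c = 0; c < 256; ++c) {
        if (std::isspace(c)) {
            t[c] = CharClass::Space;
        } else if (std::isalpha(c) || c == '_') {
            t[c] = CharClass::Alpha;
        } else if (std::isdigit(c)) {
            t[c] = CharClass::Digit;
        }
    }
    // Example operator characters; a real language would list its own.
    for (char op : {'+', '-', '*', '/', '=', '<', '>', '!'}) {
        t[static_cast<unsigned char>(op)] = CharClass::Oper;
    }
    t['"'] = CharClass::Quote;
    t['#'] = CharClass::Hash;
    return t;
}

CharClass classify(char c) {
    static const std::array<CharClass, 256> table = buildTable();
    return table[static_cast<unsigned char>(c)];
}
```

The tokenizer loop would then use `switch (classify(c))` in place of the if/else chain. Whether this is measurably faster depends on the compiler and the input, so it is worth profiling before assuming the chain is a bottleneck.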
