Sign in to follow this  
Drakkcon

Creating a lexer

Recommended Posts

I was wondering if anyone knows of any tutorials on programming your own lexer. The only ones I'm aware of all assume you want to use LEXX or YAC, but I am more interested in learning the theory and how to create one. I have googled to no avail, and I can't afford the Dragon Book so I would appreciate any help anyone could give me :) PS: I will buckle down and buy the book if there aren't any online tutorials.

Share this post


Link to post
Share on other sites
IMO lex is not that much simpler then writing your own.

A lot of people seem to insist on the 'state machine' patern, but things become a lot easier when use inline states. Here is an example from my current project:

Void tokenize(String s, TokenList& out) {
size_t pos = 0
while (pos < s.size()) {
char c = s.getChar(pos)
// Determine token type based on first character
if ( isspace(c) ) {
// ignore
++pos
} else if ( isalpha_(c) ) {
// identifier
size_t poss = pos // start
// find end
while ( pos < s.size() && isalnum_(s.getChar(pos)) ) {
++pos
}
out.pushBack(s.substr(poss, pos-poss))
} else if ( isdigit(c) ) {
// number
size_t poss = pos
while ( pos < s.size() && isdigit(s.getChar(pos)) ) {
++pos
}
out.pushBack(s.substr(poss, pos-poss))
} else if ( isoper(c) ) {
// operator
if ( pos+1 < s.size() && islongoper(s.substr(pos,2)) ) {
// long operator
out.pushBack(s.substr(pos, 2))
pos += 2;
} else {
out.pushBack(s.substr(pos, 1))
++pos
}
} else if (c=='"') {
// string
size_t poss = pos
++pos
while ( pos < s.size() ) {
char c = s#pos
if c=='"' : break
if c=='\\' : ++pos
++pos
}
out.pushBack(s.substr(poss, pos-poss))
++pos
} else if (c=='#') {
// comment untill end of line
while ( pos < s.size() && s.getChar(pos) != '\n' ) {
++pos
}
} else {
throw ScriptParseError("Unknown character in code")
++pos
}
}
}




This technique can handle the lexing for most practical languages (things get slightly more difficult if you need to look ahead, for example in C++ L"something" is a string).

Share this post


Link to post
Share on other sites
Wow, thanks for all the links and replies.

@twanvl: You're probably right, but it seems like it would take longer (worst case is that the symbol you need is last case). Also, I need look ahead. Thanks, though.

@Codemonger: That's awesome! I haven't seen that before, I'm sure it will be very helpfull.

rate++ both.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this