Jump to content

  • Log In with Google      Sign In   
  • Create Account






Basic Lua tokenising is go...

Posted by phantom, in Lua 16 January 2011 · 240 views

Over a few weekends leading up until Xmas and the last couple since then I have been playing around with boost::spirit and taking a quick look at ANTLR in order to setup some code to parse Lua and generate an AST.

Spirit looked promising, the ability to pretty much dump the Lua BNF into it was nice right up until I ran into Left Recursion and ended up in stack overflow land. I then went on a hunt for an existing example but that failed to compile, used all manner of boost::fusion magic and was generally a pain to work with.

I had a look at ANTLR last weekend and while dumping out a C++ parser using their GUI tool was easy enough the C++ docs are... lacking.. it seems and I couldn't make any headway when it came to using it.

This afternoon I decided to bite the bullet and just start doing it 'by hand'. Fortunately the Lua BNF isn't that complicated with a low number of 'keywords' to deal with and a syntax which shouldn't be too hard to build into a sane AST from a token stream.

I'm not doing things completely by hand; the token extraction is being handled by boost::tokeniser with a custom written skipper which dumps space and semi-colons, keeps the rest of the punctuation required by Lua and, importantly, it aware of floating point/double numbers so that it can correctly spit out a dot as a token when it makes sense.

Currently it doesn't deal with/hasn't been tested with octal or escaped characters and comments would probably cause things to explode, however I'll deal with them in the skipper at some point.

Given the following Lua;

foo = 42; bar = 43.4; rage = {} rage:fu() rage.omg = "wtf?"

The following token stream is pushed out;

<foo (28)> <= (18)> <42 (30)>
<bar (28)> <= (18)> <43.4 (29)>
<rage (28)> <= (18)> <{ (20)> <} (21)>
<rage (28)> <: (26)> <fu (28)> <( (24)> <) (25)>
<rage (28)> <. (27)> <omg (28)> <= (18)> <"wtf?" (31)>

Where the number is the token id found

There is a slight issue right now, such as when given this code;

foo = 42; bar <= 43.4; rage = {} rage:fu() rage.omg = "wtf?"

The token stream created is;

<foo (28)> <= (18)> <42 (30)>
<bar (28)> << (26)> <= (18)> <43.4 (29)>
<rage (28)> <= (18)> <{ (20)> <} (21)>
<rage (28)> <: (26)> <fu (28)> <( (24)> <) (25)>
<rage (28)> <. (27)> <omg (28)> <= (18)> <"wtf?" (31)>

Notice that it create two tokens for the '<=' sequence; this will probably need to be solved in the skipper as well.

So, once that is solved the next step will be the AST generation.. fun times...






Recent Entries

Recent Comments

PARTNERS