writing a scripting language parser in C++

Started by
22 comments, last by jwezorek 14 years, 2 months ago
I'm about to write a parser for a scripting language in C++ and was wondering which Boost libraries I can leverage to do this. I know about Boost::Spirit but, honestly, I feel like it's going to take me longer to understand it enough to use it effectively than it is for me to just write a recursive descent parser. If someone would like to talk me out of that belief I'm open to changing my mind. I was wondering what I can use to do tokenization, however. I'm looking at boost::tokenizer but all the example code I've seen is using it as a glorified string split() function. I need to actually tokenize i.e. end up with a vector of token objects that have an enum for the type of token and a possibly null string value. Can boost::tokenizer do this?
Advertisement
One word: ANTLR

http://www.antlr.org/
ANTLR strongly seconded. Parser generators are some of the most well-understood and refined pieces of software that exist today. Rolling a parser by hand anymore is a serious waste of effort, especially for nontrivial grammars.

If you are even passingly familiar with EBNF notation, you can specify the grammar with ease, and then having a parser generator create the code for you is a snap. If you're not familiar with the notation, it's easy to pick up, and learning how to specify grammars isn't very hard.

boost::spirit is a remarkable library, but it just doesn't match up to a good lexer/parser generator. The only reason I would accept for choosing spirit over ANTLR/flex/yacc/bison/et. al. is if you need to heavily develop the grammar and don't want to spend all your time regenerating your parser code. Even then, once the grammar stabilizes a bit, transitioning to a more robust tool is strongly recommended. There quickly comes a point for nontrivial grammars where you start spending long compile times in spirit, at which point you might as well just swap to a full-fledged generator.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Quote:
If you are even passingly familiar with EBNF notation, you can specify the grammar with ease, and then having a parser generator create the code for you is a snap.


I must be cursed or something, but I never found this to be the case. I've never once gotten a parser generator to do anything, even just pasting the example into the tool. It's not as though there's a huge userbase either, and the tutorials are pretty much 'paste this code in, and it works!', which is of course useless when it doesn't work.

Not that you shouldn't use a parser generator, but they're not a snap to use in my (perhaps accursed) experience.
So ANTLR will generate parsing code in C from a description of the grammar, I get that, but what will be the output of the generated C code when I call it to do parsing? -- that's the part I don't understand. How do you specify the back end?
Quote:Original post by Telastyn
Not that you shouldn't use a parser generator, but they're not a snap to use in my (perhaps accursed) experience.


Yeah, just looking around on the ANTLR site I can tell that it's going to be hard to get set up and doing something. The documentation looks bad, basically. I'm willing to give it a try, however.

The thing is this scripting language that I'm parsing is a newer version of a scripting language that I wrote the parser for 10 years ago -- so if it comes down to it, I have the old code. I mean with minor changes I could just re-use my old code for parsing expressions, for example, which is the hard part; however, it would kind of be ugly to do it that way -- cutting and pasting legacy code --, just wanted to do this cleanly this time.

But I don't have a ton of time for this project. Am already dealing with / getting feedback from the users of the app I'm working on with the scripting language not supported yet, etc ... so we'll see how this plays out.

[Edited by - jwezorek on February 11, 2010 8:10:19 PM]
I've basically just finished doing what you propose.

I wanted to make a simple scripting language to describe the creation of a directed acyclic graph and I used flex/bison.

flex/bison might be old but I found them easy to use, they work brilliantly together for lexical analysis and parsing and I had both those two steps done in a day coming from having no background in compilers at all.

http://www.flipcode.com/archives/Implementing_A_Scripting_Engine-Part_1_Overview.shtml

not sure if you've seen this tutorial. I used it for reference.

I also looked at the code for both the pixie and aqsis renderman compliant renderers.
They are both open source and use flex/bison for their implementation of their rsl compilers.

between these three resources and a bit of googling, It was quite straight forward to get the basics working.

I spent much more time on the semantics and code generation than the parsing using these generators.


I would STRONGLY, STRONGLY advise against implementing the scripting language yourself unless you have a special reason to do so (such as learning). Lua+Luabind is an insanely powerful and stupid simple VM and scripting language that can do anything you ever need, and is already bug-tested like crazy.

That being said, I recently finished implementing a sorta-simple scripting language in boost::spirit... frankly, it works, and its cool...I think it took me about 95% of the time in spirit as it would take me to implement it in recursive descent, and debugging was hell.

I think either way is a decent approach, but if you arent familiar with template meta-programming and aren't interested in learning, I would just implement it yourself recursive descent.
Quote:Original post by Steve132
I think either way is a decent approach, but if you arent familiar with template meta-programming and aren't interested in learning, I would just implement it yourself recursive descent.


Lua is not an option. I'm actually just writing a parser and serializer not implementing an interpreter. It's kind of hard to explain what I'm doing: I'm writing something that takes a scripting language as input and serializes it to a binary format that is set in stone and is a binary serialization of a scripting language that is slightly different than the one I'm taking as input. I thought this was going to be easier because I thought that the syntax of expressions was the same in both cases. Had that been the case I wouldn't have to parse expressions because in the old format they are serialized as strings, but, as it turns out, expressions are slightly different, so to do this right, I have to parse new style expressions and then serialize them as old-style expressions, if that makes any sense. The cleanest way to do what I have to do then is do a full parse of the new-style script.

I'm not interested in learning, I'm interested in getting this done. I've written parsers for nontrivial languages professionally twice before, both times I rolled my own -- once in Java where, at the time, there were no (good) tools I could have used, and once in C++ for which I could've used lex/yacc but didn't. I'm interested in boost::spirit or ANTLR to the extent that I can get this done faster using them than not.

[Edited by - jwezorek on February 12, 2010 4:33:42 AM]
Assuming performance is not that a problem and since you already have experience in writing parsers yourself, then maybe Parsing Expression Grammars (PEG) are alternative for you. Advantage: You don't have to hassle with ambiguities and debugging is really a bliss (recursive descent).

Here is a good description of how they work PEG Lib (C# implementation).

This topic is closed to new replies.

Advertisement