# writing a scripting language parser in C++

## Recommended Posts

jwezorek    2663
I'm about to write a parser for a scripting language in C++ and was wondering which Boost libraries I can leverage to do this. I know about Boost::Spirit but, honestly, I feel like it's going to take me longer to understand it enough to use it effectively than it is for me to just write a recursive descent parser. If someone would like to talk me out of that belief I'm open to changing my mind. I was wondering what I can use to do tokenization, however. I'm looking at boost::tokenizer but all the example code I've seen is using it as a glorified string split() function. I need to actually tokenize i.e. end up with a vector of token objects that have an enum for the type of token and a possibly null string value. Can boost::tokenizer do this?

##### Share on other sites
AAA    196
One word: ANTLR

http://www.antlr.org/

##### Share on other sites
ApochPiQ    23064
ANTLR strongly seconded. Parser generators are some of the most well-understood and refined pieces of software that exist today. Rolling a parser by hand anymore is a serious waste of effort, especially for nontrivial grammars.

If you are even passingly familiar with EBNF notation, you can specify the grammar with ease, and then having a parser generator create the code for you is a snap. If you're not familiar with the notation, it's easy to pick up, and learning how to specify grammars isn't very hard.

boost::spirit is a remarkable library, but it just doesn't match up to a good lexer/parser generator. The only reason I would accept for choosing spirit over ANTLR/flex/yacc/bison/et. al. is if you need to heavily develop the grammar and don't want to spend all your time regenerating your parser code. Even then, once the grammar stabilizes a bit, transitioning to a more robust tool is strongly recommended. There quickly comes a point for nontrivial grammars where you start spending long compile times in spirit, at which point you might as well just swap to a full-fledged generator.

##### Share on other sites
Telastyn    3777
Quote:
 If you are even passingly familiar with EBNF notation, you can specify the grammar with ease, and then having a parser generator create the code for you is a snap.

I must be cursed or something, but I never found this to be the case. I've never once gotten a parser generator to do anything, even just pasting the example into the tool. It's not as though there's a huge userbase either, and the tutorials are pretty much 'paste this code in, and it works!', which is of course useless when it doesn't work.

Not that you shouldn't use a parser generator, but they're not a snap to use in my (perhaps accursed) experience.

##### Share on other sites
jwezorek    2663
So ANTLR will generate parsing code in C from a description of the grammar, I get that, but what will be the output of the generated C code when I call it to do parsing? -- that's the part I don't understand. How do you specify the back end?

##### Share on other sites
jwezorek    2663
Quote:
 Original post by TelastynNot that you shouldn't use a parser generator, but they're not a snap to use in my (perhaps accursed) experience.

Yeah, just looking around on the ANTLR site I can tell that it's going to be hard to get set up and doing something. The documentation looks bad, basically. I'm willing to give it a try, however.

The thing is this scripting language that I'm parsing is a newer version of a scripting language that I wrote the parser for 10 years ago -- so if it comes down to it, I have the old code. I mean with minor changes I could just re-use my old code for parsing expressions, for example, which is the hard part; however, it would kind of be ugly to do it that way -- cutting and pasting legacy code --, just wanted to do this cleanly this time.

But I don't have a ton of time for this project. Am already dealing with / getting feedback from the users of the app I'm working on with the scripting language not supported yet, etc ... so we'll see how this plays out.

[Edited by - jwezorek on February 11, 2010 8:10:19 PM]

##### Share on other sites
kael00    100
I've basically just finished doing what you propose.

I wanted to make a simple scripting language to describe the creation of a directed acyclic graph and I used flex/bison.

flex/bison might be old but I found them easy to use, they work brilliantly together for lexical analysis and parsing and I had both those two steps done in a day coming from having no background in compilers at all.

http://www.flipcode.com/archives/Implementing_A_Scripting_Engine-Part_1_Overview.shtml

not sure if you've seen this tutorial. I used it for reference.

I also looked at the code for both the pixie and aqsis renderman compliant renderers.
They are both open source and use flex/bison for their implementation of their rsl compilers.

between these three resources and a bit of googling, It was quite straight forward to get the basics working.

I spent much more time on the semantics and code generation than the parsing using these generators.

##### Share on other sites
Steve132    433
I would STRONGLY, STRONGLY advise against implementing the scripting language yourself unless you have a special reason to do so (such as learning). Lua+Luabind is an insanely powerful and stupid simple VM and scripting language that can do anything you ever need, and is already bug-tested like crazy.

That being said, I recently finished implementing a sorta-simple scripting language in boost::spirit... frankly, it works, and its cool...I think it took me about 95% of the time in spirit as it would take me to implement it in recursive descent, and debugging was hell.

I think either way is a decent approach, but if you arent familiar with template meta-programming and aren't interested in learning, I would just implement it yourself recursive descent.

##### Share on other sites
jwezorek    2663
Quote:
 Original post by Steve132I think either way is a decent approach, but if you arent familiar with template meta-programming and aren't interested in learning, I would just implement it yourself recursive descent.

Lua is not an option. I'm actually just writing a parser and serializer not implementing an interpreter. It's kind of hard to explain what I'm doing: I'm writing something that takes a scripting language as input and serializes it to a binary format that is set in stone and is a binary serialization of a scripting language that is slightly different than the one I'm taking as input. I thought this was going to be easier because I thought that the syntax of expressions was the same in both cases. Had that been the case I wouldn't have to parse expressions because in the old format they are serialized as strings, but, as it turns out, expressions are slightly different, so to do this right, I have to parse new style expressions and then serialize them as old-style expressions, if that makes any sense. The cleanest way to do what I have to do then is do a full parse of the new-style script.

I'm not interested in learning, I'm interested in getting this done. I've written parsers for nontrivial languages professionally twice before, both times I rolled my own -- once in Java where, at the time, there were no (good) tools I could have used, and once in C++ for which I could've used lex/yacc but didn't. I'm interested in boost::spirit or ANTLR to the extent that I can get this done faster using them than not.

[Edited by - jwezorek on February 12, 2010 4:33:42 AM]

##### Share on other sites
unbird    8338
Assuming performance is not that a problem and since you already have experience in writing parsers yourself, then maybe Parsing Expression Grammars (PEG) are alternative for you. Advantage: You don't have to hassle with ambiguities and debugging is really a bliss (recursive descent).

Here is a good description of how they work PEG Lib (C# implementation).

##### Share on other sites
jwezorek    2663
I've decided to fool around with boost::spirit a little bit to get a feel for it. Does anyone know of some sample code that parses a toy, but not trivial, language? I found this project which implements a JSON parser but it's a little too much to get my head around.

##### Share on other sites
Steve132    433
Boost spirit has an example that parses XML into an AST. Read ALL the tutorials, then that example will be your bible.

##### Share on other sites
Steve132    433
I should warn you, however, that having done what you just did its not that easy. If you just want to "get it done" then just code it up yourself.

##### Share on other sites
jwezorek    2663
Quote:
 Original post by Steve132I should warn you, however, that having done what you just did its not that easy. If you just want to "get it done" then just code it up yourself.

Yeah, I know, I'm going to play with it today and over the weekend and then probably decide not to use it. :)

##### Share on other sites
ApochPiQ    23064
Quote:
 Original post by jwezorekI've decided to fool around with boost::spirit a little bit to get a feel for it. Does anyone know of some sample code that parses a toy, but not trivial, language? I found this project which implements a JSON parser but it's a little too much to get my head around.

Epoch language parser - written using boost::spirit and decidedly non-trivial [smile]

##### Share on other sites
jwezorek    2663
So I've been playing around with boost::spirit all day today, and, I don't know, it seems like doing anything slightly non-trivial is really a pain. I mean I want to like it but even in their miniXml example that thing is returning a recursive std::vector of boost::variants which isn't *that* complicated but still would be a pain to deal with.

Anything more complicated than a contrived example like that is going to naturally return a more complex composition of templatized gobbedly-gook, which the first thing you're going to want to do with is introspect, generate something useful, and throw away.

I mean, take expressions, if you extend the model that the documentation endorses your parser will end up returning vectors of variants that contain vectors of variants which seems pretty cumbersome to me. I don't see an easy way to get output that is a dynamically allocated expression node that contains a vector of pointers to expression nodes of various classes derived from an expression node base class, which is what I would want. I'm thinking it can be done with the [bind(f, _val, _1)] thingamajig; however, guess I'll try that tomorrow.

Also compile times are crazy. I'm just compiling little toy grammars that I'm learning with and it's taking 5 minutes to build. If the time explodes when I scale up to a real grammar I don't think this thing is practical for me to use in production code.

##### Share on other sites
theOcelot    498
I think most of the compilation time is just in pulling in all those huge headers. It won't take much longer as your grammars expand.

You might use Boost.Tokenizer to simply get string tokens, then have a simple post-processing step turn it into your own token type.

##### Share on other sites
jwezorek    2663
So, I set out today to write an expression parser with Spirit that outputs a plain-vanilla dynamically allocated expression tree rather than a templated vector/variant thing. I got it working -- it produces the correct output -- but I know it's not quite right. I works correctly but I couldn't manage to get some of the code to be the way that your supposed to use spirit, and I was wondering if someone who has experience with Spirit could help me out.

Here's my code (with the implementation of the expression hierarchy omitted for brevity):
#include "stdafx.h"#include <boost/config/warning_disable.hpp>#include <boost/spirit/include/qi.hpp>#include <boost/spirit/include/phoenix_core.hpp>#include <boost/spirit/include/phoenix_stl.hpp>#include <boost/spirit/include/phoenix_operator.hpp>#include <boost/spirit/include/phoenix_object.hpp>#include <boost/spirit/home/phoenix/bind/bind_function.hpp>#include <boost/fusion/include/io.hpp>#include <boost/fusion/include/std_pair.hpp>namespace qi = boost::spirit::qi;namespace ascii = boost::spirit::ascii;using namespace std;/*-------------------------------- Expression class --------------------------------*/enum OpType {    Add, Subtract, Multiply, Divide};class Expression {public:    virtual int eval() = 0;    virtual void output() = 0;    virtual ~Expression(){}};struct OpExprPair {    OpExprPair(OpType op = Add, Expression* expr = 0) :        _op(op), _expr(expr) {}    OpType _op;    Expression* _expr;};class CompoundExpression : public Expression {protected:    Expression* _left_side;    std::vector<OpExprPair> _right_side;     ...public:    CompoundExpression(Expression* l_s, const std::vector<OpExprPair>& r_s) :        _left_side(l_s), _right_side(r_s) {}    ...};class NumExpression : public Expression {protected:    int _value;public:    NumExpression(int n = 0) : _value(n) {}    ...};/*-------------------------------Helper functions-----------------------------------*/                                 void CreateNumExpression(Expression*& value, int n) {    value = new NumExpression(n);}void MakeOpExprPair(OpExprPair& value, OpType op, Expression* expr) {   value = OpExprPair(op, expr);}void MakeCompoundExpr(Expression*& value,         Expression* left_side,         const std::vector<OpExprPair>& right_side) {    if (right_side.empty()) {        value = left_side;        return;     }     value = new CompoundExpression(left_side, right_side);}/*-----------------------------------Grammar----------------------------------------*/struct additive_op_ : qi::symbols<char, OpType> {    additive_op_() {        add            ("+", Add)            ("-", Subtract)        ;    }} additive_op;struct multiplicative_op_ : qi::symbols<char, OpType> {    multiplicative_op_() {        add            ("*" , Multiply)            ("/" , Divide)        ;    }} multiplicative_op;template <typename Iterator>struct ExpressionParser : qi::grammar<Iterator,Expression*(), ascii::space_type> {    ExpressionParser() : ExpressionParser::base_type(start) {        using boost::spirit::arg_names::_val;        using boost::spirit::arg_names::_1;        using boost::spirit::arg_names::_2;        using boost::spirit::int_;        using boost::phoenix::bind;        using boost::phoenix::push_back;                start            = '['                >> expression [_val = _1]                >> ']'            ;                  expression            = ( term                 >> term_list              ) [bind(MakeCompoundExpr, _val, _1, _2)]            ;        term            = ( factor                 >> factor_list              ) [bind(MakeCompoundExpr, _val, _1, _2)]            ;                factor_list            = *(factor_pair [push_back(_val, _1)])            ;                    term_list            = *(term_pair [push_back(_val, _1)])             ;                    factor_pair            = (multiplicative_op >> factor) [bind(MakeOpExprPair,_val,_1,_2)]            ;                    term_pair           = (additive_op >> term) [bind(MakeOpExprPair,_val,_1,_2)]           ;                    factor            =   int_ [bind(CreateNumExpression, _val, _1)]            |   '(' >> expression [_val = _1] >> ')'             ;     }    qi::rule<Iterator, std::vector<OpExprPair>(), ascii::space_type>         factor_list,         term_list;    qi::rule<Iterator, OpExprPair(), ascii::space_type> term_pair, factor_pair;    qi::rule<Iterator, Expression*(), ascii::space_type> expression, factor, term, start;};/*----------------------------------------------------------------------------------*/int main() {    ExpressionParser<std::string::const_iterator> expression_parser;        string str;    while (getline(cin, str)) {        if (str.empty() || str[0] == 'q')            break;        using boost::spirit::ascii::space;        Expression* expr;        if (phrase_parse(str.begin(), str.end(), expression_parser, expr, space)) {            expr->output();            cout << "\nParse successful\n";        } else {            cout << "Nope\n";        }    }    return 0;}

The biggest problem is that the semantic action [bind(MakeOpExprPair,_val,_1,_2)] shouldn't be necessary. I should be able to tell boost::fusion about my struct, OpExprPair, using BOOST_FUSION_ADAPT_STRUCT and then the output attribute of that rule should just be an OpExprPair automatically. I couldn't get this to work. Also couldn't get it to work using an std::pair as OpExprPair, and according to everyone that's supposed to just work naturally if you #include <boost/fusion/include/std_pair.hpp>.

Further, I have a feeling that there is a way to get a rule like
factor_list = *(factor_pair)
to output a possibly empty vector of OpExprPair structs without using a semantic action but I have no idea how to do it.

Anyway, if anyone has been down this road before, I'd appreciate some input.

[Edited by - jwezorek on February 13, 2010 9:49:52 PM]

##### Share on other sites
ApochPiQ    23064
I've always just used straight functors for the semantic actions; those functors then access a wrapper class for the parser state, which decouples the generation of the AST from the spirit code. I don't really see the benefit of doing all the bind junk when you can just specify a trivial functor and get the same results more efficiently.

##### Share on other sites
jwezorek    2663
Quote:
 Original post by ApochPiQI've always just used straight functors for the semantic actions; those functors then access a wrapper class for the parser state, which decouples the generation of the AST from the spirit code.

I was trying to use regular functors to begin with but I couldn't get it to work. I'll have to look at your code again now that I understand spirit better.

I started to think that you have to use the phoenix::bind stuff to get a function called from a semantic action unless you are using boost::spirit::classic or whatever it is called -- is this not the case? Anyway, the _1 and _2 argument placeholders don't work with regular functors because they're phoenix entities (which makes sense to me now), so I didn't know how to get input to a naked functor from a semantic action. If it's always called on the output attribute of the parser object that the semantic action is associated with, does that mean that if you want to call a functor on the result of a sequence parse you have to make your functor take a boost::fusion::vector as an input argument?

[Edited by - jwezorek on February 13, 2010 11:38:29 PM]

##### Share on other sites
jwezorek    2663
If anyone cares, the problems that I was seeing were caused by the fact that I was unknowingly using Spirit 2 beta rather than Spirit 2.1. All of the super fancy automatic attribute conversion to std types stuff only works if you are including the very latest boost headers i.e. boost_1_41_0. Would've been nice if the documentation was clearer on this ... but I guess it is my own fault.

Yeah, so with the new auto conversion stuff my grammar reduces to this:
template <typename Iterator>struct ExpressionParser : qi::grammar<Iterator,Expression*(), ascii::space_type> {    ExpressionParser() : ExpressionParser::base_type(expression) {                using spirit::int_;        using spirit::_val;        using phoenix::bind;        using phoenix::push_back;        using spirit::_1;        using spirit::_2;                  expression            = ( term                 >> term_list              ) [bind(MakeCompoundExpr, _val, _1, _2)]            ;        term            = ( factor                 >> factor_list              ) [bind(MakeCompoundExpr, _val, _1, _2)]            ;                 factor_list            = *(multiplicative_op >> factor)            ;                    term_list            = *(additive_op >> term)             ;                factor            =   int_ [bind(CreateNumExpression, _val, _1)]            |   '(' >> expression [_val = _1] >> ')'             ;     }    qi::rule<Iterator, std::vector<OpExprPair>(), ascii::space_type>             factor_list, term_list;    qi::rule<Iterator, Expression*(), ascii::space_type>             expression, factor, term;};

The factor_list and term_list rules automagically generate std::vector<OpExprPair>'s when they parse without me having to do anything other than specify the attribute type; it's really pretty amazing.

Now I'm wondering if it's possible to get rid of term_list and factor_list, but it seems to need their attribute declaration to know what to cast to. I don't know how to get this type information to a subrule if it's even possible.

Also, compile time seems a little better with Spirit 2.1 but that might be just my imagination because I've pretty much decided to use this.

[Edited by - jwezorek on February 14, 2010 3:21:50 PM]

##### Share on other sites
jwezorek    2663
So, I'm still trying the boost::spirit route. It's working but my compile times are becoming ridiculous. Is there anyway I can speed it up -- restructure the code? Compile options? Anything? (short of buying a faster computer)

##### Share on other sites
ApochPiQ    23064
AFAIK you're pretty much stuck with the long compile times. You can improve things by dicing up your grammar a bit, but the big downside to spirit is that it is pretty mind-numbingly slow.

Of course, using a standard parser generator, you'd probably wait close to the same amount of time for any non-trivial grammar, so at least IME it's a wash.

##### Share on other sites
jwezorek    2663
Quote:
 Original post by ApochPiQAFAIK you're pretty much stuck with the long compile times. You can improve things by dicing up your grammar a bit, but the big downside to spirit is that it is pretty mind-numbingly slow.Of course, using a standard parser generator, you'd probably wait close to the same amount of time for any non-trivial grammar, so at least IME it's a wash.

Maybe I should make expressions a separate grammar defined in a separate .cpp file that is called by a top-level grammar that handles statements and control structures? Not sure if doing so would would be any better.