Sign in to follow this  

Reading a context free grammar in c++?

This topic is 2811 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi, In c++, I'm reading language like data from text files, and I've decided to use a context free grammar to interpret the data to avoid doing a lot of ifs/elseifs code. I've implemented context free grammars before, using Scala, where the functional and pattern matching aspects made it really easy to do. But I'm at a loss at how to go about it in an imperative language like c++. Has anyone any experience in this? Thanks.

Share this post


Link to post
Share on other sites
Sure! ANTLR is a good solution. I'd advise steering clear of boost.spirit, which is too clever by half and too wordy by about 20x.

Share this post


Link to post
Share on other sites
Generally, you use a lexer and parser generator to create C or C++ code that you call from your own code. Use your favorite search engine with the terms "C++ parser generator" and "C++ lexer" to find tools.

Share this post


Link to post
Share on other sites
Quote:
Original post by SiCrane
Generally, you use a lexer and parser generator to create C or C++ code that you call from your own code. Use your favorite search engine with the terms "C++ parser generator" and "C++ lexer" to find tools.


Oh okay, so it's not as simple as getting a library to do the lexical and syntactical analysis on it and then build a kind of 'source program tree' for me.

Quote:
Original post by Sneftel
Sure! ANTLR is a good solution. I'd advise steering clear of boost.spirit, which is too clever by half and too wordy by about 20x.


So this seems to be an example of what SiCrane was talking about.

I keep away from boost for now.

Thanks!

Share this post


Link to post
Share on other sites
Actually, it would be really useful to have a library to which you give a grammar and a piece of text and it returns a tree. Of course, not knowing the grammar at compile time probably means that it won't be as fast as more traditional solutions, but I am sure there are many applications for which the speed would still be acceptable.

I wonder how hard it would be to implement... Does anyone want to work with me in writing one?

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro
Actually, it would be really useful to have a library to which you give a grammar and a piece of text and it returns a tree. Of course, not knowing the grammar at compile time probably means that it won't be as fast as more traditional solutions, but I am sure there are many applications for which the speed would still be acceptable.

I wonder how hard it would be to implement... Does anyone want to work with me in writing one?


I wish this existed, would be useful for small things.

It would be hard to do on top of spirit: easy to define a grammar for the grammar specification but then to dynamically generate spirit objects for the target parser from the AST would be hard to do. Spirit.classic used to have "dynamic parsers" which might be the right thing but I think they've been deprecated and I'm not sure it would fit this use case anyway.

Share this post


Link to post
Share on other sites
Quote:
Original post by alvaro
Actually, it would be really useful to have a library to which you give a grammar and a piece of text and it returns a tree.
I'm not sure what that would be useful for. With no semantic backend, you wouldn't be able to go any further than the tree. Nevertheless, you can certainly do it.. just pass the output of yacc to tcc.

Share this post


Link to post
Share on other sites
Quote:
Original post by jwezorek
Quote:
Original post by alvaro
Actually, it would be really useful to have a library to which you give a grammar and a piece of text and it returns a tree. Of course, not knowing the grammar at compile time ...<snip>
I wish this existed, would be useful for small things.
Assuming I'm reading you correctly, then this is exactly what boost::xpressive does [smile]. Expressions can be mostly just regex-style strings, so grammars can be combined at run-time; then the result of a match can be returned as an iterable hierarchy of match groups. Don't think that you're limited to regular grammars though, expressions are allowed to be recursive so it's pretty simple to construct context-free grammars too.

Share this post


Link to post
Share on other sites
Quote:
Original post by Sneftel
Quote:
Original post by alvaro
Actually, it would be really useful to have a library to which you give a grammar and a piece of text and it returns a tree.
I'm not sure what that would be useful for. With no semantic backend, you wouldn't be able to go any further than the tree. Nevertheless, you can certainly do it.. just pass the output of yacc to tcc.


It could be templated on a factory and a treenode type. You pass it a factory so that it generates tree nodes with whatever semantics you want.

Share this post


Link to post
Share on other sites
Quote:
Original post by Malevolence
I've used Lex and Yacc for this sort of thing before, but that's C only. The extremely similar C++ equivalents are Flex and Bison.

Actually, both flex and bison output C code (though I think bison can also do C++). The difference between lex and flex, and between yacc and bison, is that the former are POSIX standards for behavior (based on early actual implementations), whereas the latter are modern implementations of the standard (plus some add-on behavior).

Share this post


Link to post
Share on other sites
Quote:
Original post by dmatter
Don't think that you're limited to regular grammars though, expressions are allowed to be recursive so it's pretty simple to construct context-free grammars too.


I didn't know this.

Xpressive lets you create an expression that includes itself as a sub-expression? Or a grammar of n mutually recursive expressions?

Share this post


Link to post
Share on other sites
Quote:
Original post by jwezorek
I didn't know this.

Xpressive lets you create an expression that includes itself as a sub-expression? Or a grammar of n mutually recursive expressions?
Certainly.
It's essentially a hybrid between boost::spirit and boost::regex. You can build spirit-style static grammars with it, then mix and match those with the dynamic regex-style expressions. By embedding expressions in one another by-reference they can be made to be recursive, they give the classic example in the docs of matching nested brackets.

Share this post


Link to post
Share on other sites
Quote:
Original post by dmatter
Certainly.
If I remember correctly it's built on top of boost::spirit. You can definitely build spirit-style static grammars with it too, then mix and match dynamic and static expressions together into a complete grammar.

Oh, if it's on top of spirit then it's going to have spirit's big problem: crazy compile times. That's what I was trying to get around.

I guess what I want is something like spirit but that is a less extreme example of template metaprogramming.

Like just have a banal polymorphic hierarchy of parser objects where each parser object has semantics like "parse a disjunction" or "parse a sequence". Have the whole thing templated on a pointer to an output tree node type. Have the output of each parser object be a pointer to the tree node type but have each parser object take a boost::function for creating the output tree node from the composition of the output from the parser object's child parsers. So for example, to instantiate a sequence parser object you'd need to supply it with a function that takes a vector of pointers to tree nodes and returns a new tree node pointer.

Share this post


Link to post
Share on other sites
I might get on board with a simpler Spirit type library. But I bet that at the beginning, Spirit was meant to be a simple, clean little library too, but to do the kind of stuff it does is just flat-out complicated. You'd have to have a rather cut-down feature list.

Of course, the other benefit of that would be that it would be easier to use. None of this typedefing three levels of iterators, scanners and rules. Just the basics.

Or maybe there should just be a simpler way of writing of plain recursive-descent grammars functional style. A couple support functions for pattern matching and tree generation, and let the user do the rest. Or we could combine them. Crap, idea overload...

Share this post


Link to post
Share on other sites
Quote:
Original post by theOcelot
I might get on board with a simpler Spirit type library. But I bet that at the beginning, Spirit was meant to be a simple, clean little library too, but to do the kind of stuff it does is just flat-out complicated. You'd have to have a rather cut-down feature list.


Yes, you're right ... In the sketch of a parser library that I tried to explain above, I am cutting down features. Namely, I'm saying that the output of each parser object -- where by "parser object" I mean the things that you compose together to construct your target grammar -- has to be a dynamically allocated instantiation of a class that derives from a base class that all of the parser objects take as a template parameter. You would be given a class hierarchy of parser object templates, would construct your grammar by composing instantiations of them, would provide a class hierarchy of output tree nodes, and would tell the parser objects how to create output tree nodes by supplying an appropriate boost::function object when instantiating a parser object. It would be verbose, yes, but, you know, so is spirit.

Spirit is a remarkable achievement and it has no drawbacks if your development box happens to be a very fast computer. But spirit is trying consciously to be a template metaprogramming library -- this was obviously one of the constraints that its designer put on it. What I'm saying is that if you don't mind using old-fashioned polymorphism/inheritance you could write something similar that may not be as clever but that would compile faster.

[Edited by - jwezorek on March 13, 2010 5:02:44 AM]

Share this post


Link to post
Share on other sites
Verbosity is one of the things I don't like about Spirit, along with the related problem of difficulty of use. If you're not using the default iterator type, Spirit is a pain. Or at least it was; I don't know about the new version, but from what I've seen the new version isn't very easy either... Would you support different iterator types?

Share this post


Link to post
Share on other sites
Quote:
Original post by theOcelot
Verbosity is one of the things I don't like about Spirit, along with the related problem of difficulty of use.


I know what you mean. The problem that I had with it was that the slowness of build times plus the inadequacy of the documentation led to a situation in which I was forced to do trial and error style programming with each trial and error cycle taking 5+ minutes to compile, very frustrating. However, once you get the basic structure of what you're trying to do working it isn't too bad, and you do get used to it. There's also some tricks that help with compile times. Splitting up your grammar across several compilation units helps and so forth.

Quote:
Original post by theOcelot
If you're not using the default iterator type, Spirit is a pain. Or at least it was; I don't know about the new version, but from what I've seen the new version isn't very easy either...


I've never used spirit with anything but the default iterator so I don't know if version 2.1 or 2.2 is better in this regard than classic.

Quote:
Original post by theOcelot
Would you support different iterator types?


Yeah, the parser objects would take a template argument for a token iterator. It would be intended to be used on input that was already tokenized rather than directly on characters like default usage of spirit.



Share this post


Link to post
Share on other sites
Quote:
Original post by jwezorek
Quote:
Original post by theOcelot
Would you support different iterator types?


Yeah, the parser objects would take a template argument for a token iterator. It would be intended to be used on input that was already tokenized rather than directly on characters like default usage of spirit.


Would you have tokenization included, or do clients have to do it themselves? I'd like to keep and improve upon Spirit's micro-parsing abilities.

Share this post


Link to post
Share on other sites
Quote:

Namely, I'm saying that the output of each parser object -- where by "parser object" I mean the things that you compose together to construct your target grammar -- has to be a dynamically allocated instantiation of a class that derives from a base class that all of the parser objects take as a template parameter.


Just for the record, I've implemented something like this for C# and Java. It is actually super helpful for things slightly more complex than a regex would handle. The main issue that it has compared to perhaps more proper parsing tools is trouble with recursive references and the 'or' parser. (I'm sure there's a proper term for the style of parser that suffers these problems, but it's beyond me). And there's actually two parts, one for lexing and another for parsing through the results for that. The compositioning and behavior is the same.


// define Parsers for Keyword,Identifier,etc.
Token = new LabelledParser("Token", Keyword | AnyLiteral | Identifier | BuiltInPunctuation | Symbol);

// or for the parser:
FieldDeclaration = new LabelledTokenParser("FieldDeclaration", FieldDeclarationModifierList + TypeExpression + Identifier+ (FieldInitializer | Semicolon));

Share this post


Link to post
Share on other sites
Quote:
Original post by theOcelot
Would you have tokenization included, or do clients have to do it themselves? I'd like to keep and improve upon Spirit's micro-parsing abilities.


I guess if I wrote this library I would include a simple default tokenizer as part of the library just so that it is useful out of the box. I haven't put any thought into the tokenizer yet...

Likewise, I'd include a function that takes a string describing a grammar and generates a parser for you (you'd have to somehow supply the boost::functions that tell the parser library how to map parser object output to tree nodes -- maybe you just call the function with one of the arguments being a vector of boost functions and then the rules in the grammar specification could refer to these functions by index) -- this functionality could be written using the parser library itself which would be kind of neat.

As for micro-parsing, if by that you mean parsing things that require small grammars rather than things like an entire scripting language. I think the most useful thing in that domain is if it's possible to skip the tokenization step all together and just use character iterators the way that spirit will let you. Honestly, I haven't put a lot of thought into supporting the character parsing use case. It might not be a big deal; I mean, you might get it basically for free if you set up the token iterator templatization correctly.

[Edited by - jwezorek on March 15, 2010 1:51:24 PM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Telastyn
Just for the record, I've implemented something like this for C# and Java.


Yeah, I wrote basically exactly what I'm describing (minus templates and function objects) in Java a few years ago for a job I was working that involved interpreting an "algebraic modeling language"; that's where this is coming from. Since Java doesn't have naked functions you pretty much had to implement recursive descent as a composition of parser objects -- I mean, there really weren't a lot of other options. In that version the client of the library had to inherit from the parser objects and specialize them to make a grammar: just seemed like how you're supposed to do things in Java.

Anyway, I've always kind of wanted to rewrite it in C++ and get rid of the Smalltalk-style OOP.

Quote:
Original post by Telastyn
The main issue that it has compared to perhaps more proper parsing tools is trouble with recursive references and the 'or' parser. (I'm sure there's a proper term for the style of parser that suffers these problems, but it's beyond me).


Well, I mean, what we're talking about here is just an implementation of a recursive descent parser using a set of mutually self-referential parsing objects rather than a set of mutually-recursive functions, which would be the textbook version. So you shouldn't have had any problems that you wouldn't have with any other recursive descent parser. If you had problems parsing disjunctions it might have been that your grammar accidently included left recursion? -- which recursive descent can't handle: it'll blow up the stack.

For disjunctions, my Java version just tried to apply its child parsers one after the other on the input until one of them succeeded; it returned the result of which ever succeeded first. I think this is called "short-circuiting" and is the behavior of the or-operator in Spirit. I think spirit also has another operator that tries to apply all of the child parsers and then returns the output of the one that succeeded and parsed the longest stretch of input; I'm not sure what that is called.

[Edited by - jwezorek on March 14, 2010 9:19:09 PM]

Share this post


Link to post
Share on other sites

This topic is 2811 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this