Reading a context-free grammar in C++?

Quote:Original post by theOcelot
Verbosity is one of the things I don't like about Spirit, along with the related problem of difficulty of use.


I know what you mean. The problem I had with it was that the slow build times plus the inadequate documentation forced me into trial-and-error programming, with each trial-and-error cycle taking 5+ minutes to compile -- very frustrating. However, once you get the basic structure of what you're trying to do working, it isn't too bad, and you do get used to it. There are also some tricks that help with compile times: splitting your grammar across several compilation units helps, and so forth.

Quote:Original post by theOcelot
If you're not using the default iterator type, Spirit is a pain. Or at least it was; I don't know about the new version, but from what I've seen the new version isn't very easy either...


I've never used Spirit with anything but the default iterator, so I don't know whether version 2.1 or 2.2 is better in this regard than classic.

Quote:Original post by theOcelot
Would you support different iterator types?


Yeah, the parser objects would take a template argument for a token iterator. It would be intended to be used on input that has already been tokenized, rather than directly on characters the way Spirit is used by default.
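Just to pin down what I mean, here's a minimal sketch of the idea; all of the names and the exact interface are hypothetical, nothing here is settled:

#include <iostream>
#include <iterator>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: every parser object is templated on the token iterator
// type, so the library runs over an already-tokenized stream rather than raw
// characters.
template <typename TokenIter, typename Node>
struct parser {
    // On success, returns the built node plus the position just past the
    // consumed tokens; on failure, returns a null node and the original position.
    virtual std::pair<std::unique_ptr<Node>, TokenIter>
    parse(TokenIter first, TokenIter last) const = 0;
    virtual ~parser() {}
};

// A trivial terminal parser that matches a single token equal to a given value.
template <typename TokenIter, typename Node>
struct token_parser : parser<TokenIter, Node> {
    typename std::iterator_traits<TokenIter>::value_type expected;

    explicit token_parser(typename std::iterator_traits<TokenIter>::value_type tok)
        : expected(tok) {}

    std::pair<std::unique_ptr<Node>, TokenIter>
    parse(TokenIter first, TokenIter last) const override {
        if (first != last && *first == expected)
            return std::make_pair(std::unique_ptr<Node>(new Node(*first)), std::next(first));
        return std::make_pair(std::unique_ptr<Node>(), first);
    }
};

// A toy AST node type, just for the example.
struct node {
    std::string text;
    explicit node(const std::string& s) : text(s) {}
};

int main() {
    std::vector<std::string> tokens = { "int", "x", ";" };
    token_parser<std::vector<std::string>::const_iterator, node> p("int");
    auto result = p.parse(tokens.begin(), tokens.end());
    std::cout << (result.first ? "matched: " + result.first->text : std::string("no match")) << "\n";
}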



Quote:Original post by jwezorek
Quote:Original post by theOcelot
Would you support different iterator types?


Yeah, the parser objects would take a template argument for a token iterator. It would be intended to be used on input that has already been tokenized, rather than directly on characters the way Spirit is used by default.


Would you have tokenization included, or do clients have to do it themselves? I'd like to keep and improve upon Spirit's micro-parsing abilities.
Quote:
Namely, I'm saying that the output of each parser object -- where by "parser object" I mean the things that you compose together to construct your target grammar -- has to be a dynamically allocated instantiation of a class that derives from a base class that all of the parser objects take as a template parameter.


Just for the record, I've implemented something like this for C# and Java. It's actually super helpful for things slightly more complex than a regex can handle. The main issue that it has compared to perhaps more proper parsing tools is trouble with recursive references and the 'or' parser. (I'm sure there's a proper term for the style of parser that suffers these problems, but it's beyond me). There are actually two parts: one for lexing, and another for parsing through the results of that. The composition and behavior are the same for both. For example:

// define Parsers for Keyword, Identifier, etc.
Token = new LabelledParser("Token",
    Keyword | AnyLiteral | Identifier | BuiltInPunctuation | Symbol);

// or for the parser:
FieldDeclaration = new LabelledTokenParser("FieldDeclaration",
    FieldDeclarationModifierList + TypeExpression + Identifier
    + (FieldInitializer | Semicolon));
Quote:Original post by theOcelot
Would you have tokenization included, or do clients have to do it themselves? I'd like to keep and improve upon Spirit's micro-parsing abilities.


I guess if I wrote this library I'd include a simple default tokenizer, just so that it's useful out of the box. I haven't put any thought into the tokenizer yet...
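Probably something bare-bones would do to start with. Just as a sketch of the kind of thing I have in mind (purely illustrative -- none of this is designed yet), a tokenizer that splits on whitespace and treats each punctuation character as its own token:

#include <cctype>
#include <string>
#include <vector>

// Bare-bones default tokenizer sketch: split on whitespace, and emit each
// punctuation character as its own single-character token.
std::vector<std::string> default_tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : src) {
        if (std::isspace(static_cast<unsigned char>(c))) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
        } else if (std::ispunct(static_cast<unsigned char>(c))) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
            tokens.push_back(std::string(1, c));
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// e.g. default_tokenize("x = y+1;") yields { "x", "=", "y", "+", "1", ";" }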

Likewise, I'd include a function that takes a string describing a grammar and generates a parser for you. You'd have to somehow supply the boost::functions that tell the parser library how to map parser-object output to tree nodes -- maybe you'd call the function with one of the arguments being a vector of boost::functions, and the rules in the grammar specification would refer to those functions by index. This functionality could be written using the parser library itself, which would be kind of neat.
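Interface-wise, maybe something along these lines -- a declaration-only sketch where every name is hypothetical, and std::function stands in for boost::function just to keep it self-contained:

#include <functional>
#include <memory>
#include <string>
#include <vector>

// Purely hypothetical interface sketch. The caller passes a textual grammar
// plus a vector of node-building callbacks; a rule in the grammar text refers
// to a callback by its index in that vector.
struct ast_node;                                   // user-defined tree node type
typedef std::shared_ptr<ast_node> node_ptr;
typedef std::function<node_ptr(const std::vector<node_ptr>&)> node_builder;

// Builds a parser from the grammar text and runs it on the token stream,
// starting from start_rule. A rule action written in the grammar text, say
// "=> @2", would invoke builders[2] on the children matched by that rule.
// (The "=> @n" action syntax is invented here just for illustration.)
node_ptr parse_with_grammar(const std::string& grammar_text,
                            const std::string& start_rule,
                            const std::vector<node_builder>& builders,
                            const std::vector<std::string>& tokens);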

As for micro-parsing: I assume by that you mean parsing things that require small grammars rather than, say, an entire scripting language. I think the most useful thing in that domain would be the ability to skip the tokenization step altogether and just use character iterators the way Spirit lets you. Honestly, I haven't put a lot of thought into supporting the character parsing use case. It might not be a big deal; I mean, you might get it basically for free if you set up the token iterator templatization correctly.

Quote:Original post by Telastyn
Just for the record, I've implemented something like this for C# and Java.


Yeah, I wrote basically exactly what I'm describing (minus templates and function objects) in Java a few years ago, for a job that involved interpreting an "algebraic modeling language"; that's where this is coming from. Since Java doesn't have naked functions, you pretty much had to implement recursive descent as a composition of parser objects -- I mean, there really weren't a lot of other options. In that version the client of the library had to inherit from the parser objects and specialize them to define a grammar: that just seemed like how you're supposed to do things in Java.

Anyway, I've always kind of wanted to rewrite it in C++ and get rid of the Smalltalk-style OOP.

Quote:Original post by Telastyn
The main issue that it has compared to perhaps more proper parsing tools is trouble with recursive references and the 'or' parser. (I'm sure there's a proper term for the style of parser that suffers these problems, but it's beyond me).


Well, I mean, what we're talking about here is just an implementation of a recursive descent parser using a set of mutually self-referential parsing objects rather than a set of mutually recursive functions, which would be the textbook version. So you shouldn't have had any problems that you wouldn't have with any other recursive descent parser. If you had problems parsing disjunctions, it might have been that your grammar accidentally included left recursion -- which recursive descent can't handle: it'll blow up the stack.
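To make the left-recursion failure concrete, here's a toy sketch (not from any particular library) of a left-recursive rule translated naively into recursive descent, plus the usual fix:

#include <cctype>
#include <iostream>
#include <string>

// Grammar (left-recursive):  expr := expr '+' term | term
// For this toy example, a term is a single digit.

bool parse_term(const std::string& input, size_t& pos) {
    if (pos < input.size() && std::isdigit(static_cast<unsigned char>(input[pos]))) {
        ++pos;
        return true;
    }
    return false;
}

// Naive translation of the left-recursive rule: parse_expr_left_recursive calls
// itself before consuming any input, so the recursion never bottoms out and the
// stack blows up. (Left here for illustration -- don't actually call it.)
bool parse_expr_left_recursive(const std::string& input, size_t& pos) {
    size_t saved = pos;
    if (parse_expr_left_recursive(input, pos) &&        // <-- infinite recursion
        pos < input.size() && input[pos] == '+' &&
        (++pos, parse_term(input, pos)))
        return true;
    pos = saved;
    return parse_term(input, pos);
}

// The usual fix: rewrite the rule without left recursion,
//   expr := term ('+' term)*
// which turns the recursion into a loop.
bool parse_expr(const std::string& input, size_t& pos) {
    if (!parse_term(input, pos)) return false;
    while (pos < input.size() && input[pos] == '+') {
        ++pos;
        if (!parse_term(input, pos)) return false;
    }
    return true;
}

int main() {
    size_t pos = 0;
    std::cout << (parse_expr("1+2+3", pos) && pos == 5 ? "parsed\n" : "failed\n");
}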

For disjunctions, my Java version just tried to apply its child parsers one after the other on the input until one of them succeeded; it returned the result of whichever succeeded first. I think this is called "short-circuiting" and is the behavior of the or-operator in Spirit. I think Spirit also has another operator that tries to apply all of the child parsers and then returns the output of the one that succeeded and parsed the longest stretch of input; I'm not sure what that is called.
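Here's a minimal sketch of that "first match wins" behavior, with invented names just to pin down the semantics; the longest-match variant would instead run every child and keep the success that consumed the most input:

#include <iostream>
#include <string>
#include <vector>

// A match is a success flag plus the position just past the consumed tokens.
struct match_result {
    bool ok;
    size_t end;
};

struct parser_base {
    virtual match_result parse(const std::vector<std::string>& toks, size_t pos) const = 0;
    virtual ~parser_base() {}
};

// Terminal parser: matches one specific token.
struct literal : parser_base {
    std::string tok;
    explicit literal(const std::string& t) : tok(t) {}
    match_result parse(const std::vector<std::string>& toks, size_t pos) const override {
        if (pos < toks.size() && toks[pos] == tok) return { true, pos + 1 };
        return { false, pos };
    }
};

// Short-circuiting 'or': try each child at the same position and return the
// first success; if none succeed, fail without consuming anything.
struct or_parser : parser_base {
    std::vector<const parser_base*> children;   // non-owning, for brevity
    match_result parse(const std::vector<std::string>& toks, size_t pos) const override {
        for (const parser_base* child : children) {
            match_result r = child->parse(toks, pos);
            if (r.ok) return r;
        }
        return { false, pos };
    }
};

int main() {
    literal a("foo"), b("bar");
    or_parser either;
    either.children = { &a, &b };
    std::vector<std::string> toks = { "bar" };
    std::cout << (either.parse(toks, 0).ok ? "matched\n" : "no match\n");
}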

Quote:Original post by jwezorek
If you had problems parsing disjunctions, it might have been that your grammar accidentally included left recursion -- which recursive descent can't handle: it'll blow up the stack.


There are also performance issues with backtracking; it's faster to parse a + (b | c) than (a + b) | (a + c), but doing the factoring automatically in the general case might not be so easy, depending on the overall design. :)
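For anyone following along, here's a toy illustration (not tied to any of the libraries discussed here) of why the factored form wins: the unfactored alternation re-parses "a" after backtracking.

#include <iostream>
#include <string>

// Counts how many times we attempt to parse 'a', to show the cost of
// backtracking in the unfactored form.
static int a_attempts = 0;

bool parse_a(const std::string& s, size_t& pos) {
    ++a_attempts;
    if (pos < s.size() && s[pos] == 'a') { ++pos; return true; }
    return false;
}
bool parse_b(const std::string& s, size_t& pos) {
    if (pos < s.size() && s[pos] == 'b') { ++pos; return true; }
    return false;
}
bool parse_c(const std::string& s, size_t& pos) {
    if (pos < s.size() && s[pos] == 'c') { ++pos; return true; }
    return false;
}

// (a + b) | (a + c): if the first alternative fails we backtrack to the start
// and parse 'a' all over again.
bool unfactored(const std::string& s, size_t& pos) {
    size_t start = pos;
    if (parse_a(s, pos) && parse_b(s, pos)) return true;
    pos = start;
    return parse_a(s, pos) && parse_c(s, pos);
}

// a + (b | c): 'a' is parsed exactly once, then the alternatives are tried.
bool factored(const std::string& s, size_t& pos) {
    if (!parse_a(s, pos)) return false;
    size_t after_a = pos;
    if (parse_b(s, pos)) return true;
    pos = after_a;
    return parse_c(s, pos);
}

int main() {
    size_t pos = 0;
    unfactored("ac", pos);
    std::cout << "unfactored: parsed 'a' " << a_attempts << " times\n";  // prints 2
    a_attempts = 0;
    pos = 0;
    factored("ac", pos);
    std::cout << "factored:   parsed 'a' " << a_attempts << " times\n";  // prints 1
}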

As it happens I've been working on a Python implementation the last few days. Actually I started with an existing package and have been cutting it down to size ;)

Quote:For disjunctions, my Java version just tried to apply its child parsers one after the other on the input until one of them succeeded; it returned the result of whichever succeeded first. I think this is called "short-circuiting" and is the behavior of the or-operator in Spirit.


That doesn't fix left-recursion, though.

Quote:I think Spirit also has another operator that tries to apply all of the child parsers and then returns the output of the one that succeeded and parsed the longest stretch of input; I'm not sure what that is called.


In the package I've been using, this one is called 'Or', and the short-circuiting version is called 'MatchFirst'. I'm not sure I'm happy with those names, though.
Quote:Original post by jwezorek
As for micro-parsing: I assume by that you mean parsing things that require small grammars rather than, say, an entire scripting language.

Yup. Spirit would be enormously more useful if it wasn't so tricky to set up for cases like that.
Quote:Honestly, I haven't put a lot of thought into supporting the character parsing use case. It might not be a big deal; I mean, you might get it basically for free if you set up the token iterator templatization correctly.

Yeah, I guess if every character is a token, you'd have it.

After some hunting, I found how to make Spirit use the longest match: with the longest_d directive, not exactly a separate rule. Here.
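If I'm reading the classic Spirit docs right, the canonical example there uses longest_d to disambiguate integers from reals; roughly something like this (untested, so take the details with a grain of salt):

#include <boost/spirit/include/classic.hpp>
#include <iostream>

int main() {
    using namespace boost::spirit::classic;

    // With a plain '|', "123.45" would match int_p first and stop at the dot;
    // longest_d keeps whichever alternative consumes the most input.
    parse_info<> info = parse("123.45", longest_d[ int_p | real_p ]);

    std::cout << (info.full ? "consumed the whole input\n"
                            : "stopped early\n");
}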
Quote:
If you had problems parsing disjunctions, it might have been that your grammar accidentally included left recursion -- which recursive descent can't handle: it'll blow up the stack.


Eh, yes; if I understand you correctly.
Quote:
Just for the record, I've implemented something like this for C# and Java.


Just a (belated) follow-up on this: I cleaned up the API for the C# version and it's now open source (MIT License):

Tangent Parser Framework
