How do you parsing a text file in c/c++

Started by
14 comments, last by MaulingMonkey 18 years, 7 months ago
If I were you, I'd look into learning flex and bison, as was previously suggested. There was also a fantastic tutorial on creating a scripting engine, on flipcode, which is sadly down at the moment. The tutorial was more geared towards parsing a scripting language, but gave a good introduction to flex and bison, which can be quite hard to learn at first.

I'm pretty new to parsing, but I've written a simple parser to read in an effect file format, which was formatted a bit like this (people who are familiar from Ogre will notice that the format is very similar).

I'll give you a very very brief overview of the process. I'm not a very good writer, but hopefully this will at least give you a basic idea of what happens. Please, someone else feel free to expand on what I've said, correct anything I've said that's wrong or too complex, and make it a bit more newbie friendly :)

I'm not the best person to help you, but I wanted to at least try, because I know how frustrating it can be getting a foothold into this particular topic! Apart from the Flipcode scripting tutorial, I've never found a simple parsing tutorial geared towards the newbie. They all seem to assume some kind of previous knowledge on the subject. (I've not looked for a long time though, so maybe things have improved since then)

OK

//comment////this is a simple effect file////blah blahEffect "terrain/dirt.ofx"{     Technique     {           Sorting          SKYBOX           CastShadows      ENABLED           LODLevel         0           Pass           {                DepthWrite    DISABLED                DepthFunc     LEQUAL                FogDensity    0.7                                CullingMode   CLOCKWISE           }     }}



Part 1: The lexer:

You use flex, to write a lexer, which basically reads in data from the file, and tells the parser what kind of tokens were found, and an optional value for the token. A token could be a single word, a number, a string, whatever.

Flex takes in a file, that describes all the tokens, and generates C or C++ code for the lexer.

In an attempt to explain what a token is, here's a list of tokens that flex will understand from this effect file, from my parser. The kind of tokens it understands depends on the input file to flex, basically you give flex an input file, and it generates a C++ program to parse the input.

Anyway, i'm rambling, so here's the list of the tokens my parser would get from this effect file.

Token:                           Token Type:===============================================//comment                        COMMENT (the parser is set to just skip these//                               COMMENT  until it finds a token that isn't a //this is a simple effect file   COMMENT  comment )//                               COMMENT//blah blah                      COMMENTEffect                           EFFECT{                                LEFT_CURLY_BRACKET"terrain/dirt.ofx"               STRINGTechnique                        TECHNIQUE{                                LEFT_CURLY_BRACKETSorting                          SORTINGSKYBOX                           SKYBOXCastShadows                      CAST_SHADOWSENABLED                          BOOLEANLODLevel                         LODLEVEL0                                UNSIGNED_INTEGERPass                             PASS{                                LEFT_CURLY_BRACKETDepthWrite                       DEPTHWRITEDISABLED                         BOOLEANDepthFunc                        DEPTHFUNCFogDensity			 FOGDENSITY0.7                              FLOATLEQUAL                           CMPFUNC_LEQUALCullingMode                      CLOCKWISE}                                RIGHT_CURLY_BRACKET}                                RIGHT_CURLY_BRACKET}                                RIGHT_CURLY_BRACKET



Part two to follow in a second, I am really paranoid about my browser crashing, and losing all this.

Sorry if this explanation is a bit confusing, as I say, I'm not a good writer.

[Edited by - Oxyacetylene on September 2, 2005 6:19:09 AM]
Advertisement
Part two: The parser

Similar to flex, there is a tool called bison that can work with flex to generate a parser.

Basically at the moment, all we have is a lexer that can take the input from the file, and find out what kind of tokens are present. If all we had was the lexer, then I could write the following effect file, and the parser wouldn't complain

{ Effect } Sorting DISABLED ENABLED { } TechniquePass Pass Pass ENABLED DISABLED { {{{{ }DepthFunc"hay guys whuts up in this effect file lol"{ } { } { } { }}}}}}}}{{{{{{


This effect file is clearly invalid, but the lexer would read this in quite happily.

What we need, is a program that will take tokens from the lexer, make sure that they are all in the right order, and generate some kind of output. The output could be whatever you want. My parser generates a syntax tree as its output (which I'll get into later)

The parser basically says, I want to see these tokens in the following order, if you find the right tokens in the right order, then wahey, bonus, if you don't then that's an error, and parsing has failed.

Here's a sample from the file that I used to get bison to generate a parser. Sorry, it's a bit cryptic, but I couldn't think of a better way to explain this

//An effect file consists of the token "Effect", followed by//a curly bracket, and effect block, a string, and a right curly bracketeffectfile: TOKEN_EFFECT TOKEN_STRING   TOKEN_LEFTCURLYBRACKET effectblock TOKEN_RIGHTCURLYBRACKET	;//an effect block consists of a techniqueeffectblock:	 technique;//A technique consists of the token "Technique", followed by a left//curly bracket, a list of technique statements, and a right curly brackettechnique: TOKEN_TECHNIQUE TOKEN_LEFTCURLYBRACKET techniquestatementlist TOKEN_RIGHTCURLYBRACKET;//A technique statement list consists of a //technique statement list, followed by a techniquestatement,//or nothing at all////This allows a technique statement list to contain any number of technique//statementstechniquestatmentlist: techniquestatementlist techniquestatement| /* empty */;//and so ontechniquestatment: recieveshadows| castshadows| passblock;



When the parser recognises one of these blocks, my parser generates a node in a syntax tree. A syntax tree is basically just a tree generated out of the tokens read in by the parser.

Here's a part of the syntax tree, generated from the effect file in the earlier post

             EFFECT             /     \               STRING       TECHNIQUE-----------                     /     |            \                                     SORTING   CASTSHADOWS  LODLEVEL                                 /        /               \                                  SKYBOX    ENABLED         UNSIGNED_INTEGER                                                                                                                              


My parser then reads in the syntax tree, and creates all the data structures
to hold the effect, all it's techniques and passes, etc.

Sorry, this explanation is too technical, and is really dire! :) Hopefully it will help somehow, despite its shortcomings. Someone throw me a bone here and write something better! :)

The source code for my engine, which includes the effect parser is available on request, but I'm not sure how easy it would be to understand for something new to parsing.
Of course, with boost::spirit there's no need to ever step outside of C++:
<disclaimer>
I do not claim to be an expert on boost::spirit. I'm sure there are better ways to do the following. One particular issue you may have is that I do not follow the boost::spirit style guide
</disclaimer>
#include <algorithm>#include <fstream>#include <iostream>#include <iterator>#include <boost/spirit.hpp>#include <boost/spirit/tree/ast.hpp>#include <boost/spirit/tree/tree_to_xml.hpp>#include <map>using namespace boost::spirit;class Parser	:	public grammar< Parser >{	public:		Parser();		static int const effectFileId = 1;		static int const effectId = 2;		static int const stringId = 3;		static int const effectBlockId = 4;		static int const techniqueId = 5;		static int const techniqueStatementListId = 6;		static int const techniqueStatementId = 7;		static int const sortingTechniqueStatementId = 8;		static int const castShadowsTechniqueStatementId = 9;		static int const lodLevelTechniqueStatementId = 10;		static int const passTechniqueStatementId = 11;		static int const passStatementListId = 12;		static int const passStatementId = 13;		static int const depthWritePassStatementId = 14;		static int const depthFuncPassStatementId = 15;		static int const fogDensityPassStatementId = 16;		static int const cullingModePassStatementId = 17;		static int const commentId = 18;		template < typename ScannerT >		class definition		{			public:				definition(Parser const &);				rule< ScannerT, parser_context<>, parser_tag< effectFileId > > const & start() const;			private:				rule< ScannerT, parser_context<>, parser_tag< commentId > > comment;				rule< ScannerT, parser_context<>, parser_tag< cullingModePassStatementId > > cullingModePassStatement;				rule< ScannerT, parser_context<>, parser_tag< fogDensityPassStatementId > > fogDensityPassStatement;				rule< ScannerT, parser_context<>, parser_tag< depthFuncPassStatementId > > depthFuncPassStatement;				rule< ScannerT, parser_context<>, parser_tag< depthWritePassStatementId > > depthWritePassStatement;				rule< ScannerT, parser_context<>, parser_tag< passStatementId > > passStatement;				rule< ScannerT, parser_context<>, parser_tag< passStatementListId > > passStatementList;				rule< ScannerT, parser_context<>, parser_tag< passTechniqueStatementId > > passTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< lodLevelTechniqueStatementId > > lodLevelTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< castShadowsTechniqueStatementId > > castShadowsTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< sortingTechniqueStatementId > > sortingTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< techniqueStatementId > > techniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< techniqueStatementListId > > techniqueStatementList;				rule< ScannerT, parser_context<>, parser_tag< techniqueId > > technique;				rule< ScannerT, parser_context<>, parser_tag< effectBlockId > > effectBlock;				rule< ScannerT, parser_context<>, parser_tag< stringId > > string;				rule< ScannerT, parser_context<>, parser_tag< effectId > > effect;				rule< ScannerT, parser_context<>, parser_tag< effectFileId > > effectFile;		};};Parser::Parser(){}template < typename ScannerT >Parser::definition< ScannerT >::definition(Parser const &)	:	comment	(		discard_node_d		[			lexeme_d			[				str_p("//") >>				*(anychar_p - ch_p('\n'))			]		]	),	cullingModePassStatement	(		discard_node_d		[			str_p("CullingMode")		] >>		(str_p("CLOCKWISE") | str_p("ANTICLOCKWISE"))	),	fogDensityPassStatement	(		discard_node_d		[			str_p("FogDensity")		] >>		real_p	),	depthFuncPassStatement	(		discard_node_d		[			str_p("DepthFunc")		] >>		(str_p("LEQUAL") | str_p("EQUAL") | str_p("LESS"))	),	depthWritePassStatement	(		discard_node_d		[			str_p("DepthWrite")		] >>		(str_p("ENABLED") | str_p("DISABLED"))	),	passStatement	(		depthWritePassStatement |		depthFuncPassStatement |		fogDensityPassStatement |		cullingModePassStatement	),	passStatementList	(		+passStatement	),	passTechniqueStatement	(		discard_node_d		[			str_p("Pass")		] >>		inner_node_d		[			str_p("{") >>			passStatementList >>			str_p("}")		]	),	lodLevelTechniqueStatement	(		discard_node_d		[			str_p("LODLevel")		] >>		uint_p	),	castShadowsTechniqueStatement	(		discard_node_d		[			str_p("CastShadows")		] >>		(str_p("ENABLED") | str_p("DISABLED"))	),	sortingTechniqueStatement	(		discard_node_d		[			str_p("Sorting")		] >>		(str_p("SKYBOX") | str_p("WORLD"))	),	techniqueStatement	(		sortingTechniqueStatement |		castShadowsTechniqueStatement |		lodLevelTechniqueStatement |		passTechniqueStatement	),	techniqueStatementList	(		*techniqueStatement	),	technique	(		discard_node_d		[			str_p("Technique")		] >>		inner_node_d		[			str_p("{") >>			techniqueStatementList >>			str_p("}")		]	),	effectBlock	(		technique	),	string	(		leaf_node_d		[			lexeme_d			[				ch_p('\"') >>				*(anychar_p - ch_p('\"')) >>				ch_p('\"')			]		]	),	effect	(		discard_node_d		[			str_p("Effect")		] >>		string >>		inner_node_d		[			str_p("{") >>			effectBlock >>			str_p("}")		]	),	effectFile	(		*comment >>		effect >>		*comment	){}template < typename ScannerT >rule< ScannerT, parser_context<>, parser_tag< Parser::effectFileId > > const & Parser::definition< ScannerT >::start() const{	return effectFile;}int main(){	Parser parser;	std::ifstream reader("input.txt");	reader.seekg(0, std::ios::end);	std::vector< char > input(std::streamsize(reader.tellg()) + 1, 0);	reader.seekg(0, std::ios::beg);	reader.read(&input[0], input.size());	tree_parse_info<> info = ast_parse(&input[0], parser, space_p);	if (!info.full)	{		std::cout << "Parse error\n\n";		std::cout << "\tParsed:\n";		std::copy< char const *, std::ostreambuf_iterator< char > >(&input[0], info.stop, std::ostreambuf_iterator< char >(std::cout));		return -1;	}	std::map<parser_id, std::string> rule_names;	rule_names[Parser::effectFileId] = "Effect File";	rule_names[Parser::effectId] = "Effect";	rule_names[Parser::stringId] = "String";	rule_names[Parser::effectBlockId] = "Effect Block";	rule_names[Parser::techniqueId] = "Effect Techniques";	rule_names[Parser::techniqueStatementListId] = "Techniques";	rule_names[Parser::techniqueStatementId] = "Technique";	rule_names[Parser::sortingTechniqueStatementId] = "Sorting Technique";	rule_names[Parser::castShadowsTechniqueStatementId] = "Shadow Casting Technique";	rule_names[Parser::lodLevelTechniqueStatementId] = "LOD Level Technique";	rule_names[Parser::passTechniqueStatementId] = "Pass Info";	rule_names[Parser::passStatementListId] = "Pass Info List";	rule_names[Parser::passStatementId] = "Pass Setting";	rule_names[Parser::depthWritePassStatementId] = "Depth Write Pass Setting";	rule_names[Parser::depthFuncPassStatementId] = "Depth Func Pass Setting";	rule_names[Parser::fogDensityPassStatementId] = "Fog Density Pass Setting";	rule_names[Parser::cullingModePassStatementId] = "Culling Mode Pass Setting";	rule_names[Parser::commentId] = "Comment";	tree_to_xml(std::cout, info.trees, "", rule_names);}

which, when run on Oxyacetylene's input file, produces the parse tree:
<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE parsetree SYSTEM "parsetree.dtd"><parsetree version="1.0">    <parsenode rule="Effect File">        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Effect">            <parsenode rule="String">                <value>"terrain/dirt.ofx"</value>            </parsenode>            <parsenode rule="Techniques">                <parsenode rule="Sorting Technique">                    <value>SKYBOX</value>                </parsenode>                <parsenode rule="Shadow Casting Technique">                    <value>ENABLED</value>                </parsenode>                <parsenode rule="LOD Level Technique">                    <value>0</value>                </parsenode>                <parsenode rule="Pass Info List">                    <parsenode rule="Depth Write Pass Setting">                        <value>DISABLED</value>                    </parsenode>                    <parsenode rule="Depth Func Pass Setting">                        <value>LEQUAL</value>                    </parsenode>                    <parsenode rule="Fog Density Pass Setting">                        <value>0.7</value>                    </parsenode>                    <parsenode rule="Culling Mode Pass Setting">                        <value>CLOCKWISE</value>                    </parsenode>                </parsenode>            </parsenode>        </parsenode>    </parsenode></parsetree>

<off-topic>Why do we not have a [source lang="xml"] option?</off-topic>

Enigma
Yeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonable sized parser.

It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.

I still use it to parse the commmand line in my console though, it's much better at this than the hand-coded solution I had previously.
well one thing is this is to help me learn programming better and also so I can parse my setting files for my games and other programes.

Also the things it needs to do it just look for one char and when it comes to it load every thing tell it comes to another one until the end of the file.

Thanks for the help this is a great starting point.
I apretate it very much
Quote:Original post by Oxyacetylene
Yeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonable sized parser.


I've used it with MinGW's GCC and not had such problems :-).

Quote:It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.


Again, no problems here. However, I broke my grammar down into many chunks with many rules, and didn't get into the optimization options with full template deduction. I certainly wasn't using all the fancy features.

At worst, it's worth looking into :-).

This topic is closed to new replies.

Advertisement