# How do you parsing a text file in c/c++

This topic is 4555 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

any one give me a clue on how to parse a text file in c/c++ I can load the file fine now I just need to look for some symbols and words and then load what is after those symbols and words into a verible. I know how to load text file and how to set a string or char from the text file to a verible. But I cant choose were to start loading. Can any one help with this? some simple code would help or some tutorials to read Thanks

##### Share on other sites
Parsing/lexing a file in a given language is a non-trivial task that keeps many very smart people very busy with research papers. Unfortunately, this also means that a single post is beyond the scope of the information that would need to be conveyed. But if you're interested in learning more about the topic, there are a lot of terrific books out there. One that comes highly recommended is the so-called Dragon Book, which gives a terrific introduction into the theory of parsing, compilers, lexers, and the like.

##### Share on other sites
We have some good codes in here, you can use and enjoy ;)

Bye

##### Share on other sites
Okay, let me partially retract my previous statement. Were you talking about parsing a C++ file (i.e., some source code), or were you talking about merely reading in some kind of simple file from input (e.g., a comma-separated list of latitude/longitude positions) and doing things with it?

If it's the latter, then a Google search on "C++ fstream file input tutorial" and similar terms will probably give you what you want. See here for a good starting point.

##### Share on other sites
It was the last one. There is no way with my skill I would be bale to make a compiler or that type of parser that would be way out of my leag.

Thanks that will get me started

thanks for all the help.I will probubly have more questions later.

##### Share on other sites
Depending how complex the data to parse, you may want to use Flex and Bison, with them you create parser code (in the form of a c or c++ file you just compile in the usual way) that decomposes the data into usable tokens.

It would be overkill to use them for parsing comma separated values, but I've succesfuly used them to create a .MAP parser and a Doom 3 MD5 model loader, just read the manuals, they're not complicated at all.

##### Share on other sites
A very, very good reference on this for beginner/intermediate skilled programmer is 'Exploring Programming and Computer Science with C++' by Owen Astrachan of Duke University. In Section 9.3 he presents 'Case Study, Removing Comments with State Machines'.

If you don't have access to this book, you can look through some of the material at http://www.cs.duke.edu/~ola/book.html and you can also download his sample code, which is quite instructive on its own. This file you want to study is 'decomment.cc'.

--random_thinker

##### Share on other sites
If the data is simple (ie: The format is deterministic) writing a parser is a matter of fstreaming and interpretting the data. If its more free form but relatively simple then rolling your own is an option, google for tokenization, parser, lexical annalysis, etc. If its something like a simple computer language or complex free-form data you might want to check out flex, bison, yacc, etc. I've also heard good things about boost::spirit.

##### Share on other sites
Or, depending on the data you're storing, you might want to check out TinyXML or a similar XML library.

XML is a really cool and flexible way to store all kinds of data.

##### Share on other sites
Quote:
 Original post by kingpinzsany one give me a clue on how to parse a text file in c/c++I can load the file fine now I just need to look for some symbols and words and then load what is after those symbols and words into a verible.I know how to load text file and how to set a string or char from the text file to a verible. But I cant choose were to start loading.Can any one help with this?some simple code would help or some tutorials to readThanks

use CRegExp library. 25kB sources

##### Share on other sites
If I were you, I'd look into learning flex and bison, as was previously suggested. There was also a fantastic tutorial on creating a scripting engine, on flipcode, which is sadly down at the moment. The tutorial was more geared towards parsing a scripting language, but gave a good introduction to flex and bison, which can be quite hard to learn at first.

I'm pretty new to parsing, but I've written a simple parser to read in an effect file format, which was formatted a bit like this (people who are familiar from Ogre will notice that the format is very similar).

I'll give you a very very brief overview of the process. I'm not a very good writer, but hopefully this will at least give you a basic idea of what happens. Please, someone else feel free to expand on what I've said, correct anything I've said that's wrong or too complex, and make it a bit more newbie friendly :)

I'm not the best person to help you, but I wanted to at least try, because I know how frustrating it can be getting a foothold into this particular topic! Apart from the Flipcode scripting tutorial, I've never found a simple parsing tutorial geared towards the newbie. They all seem to assume some kind of previous knowledge on the subject. (I've not looked for a long time though, so maybe things have improved since then)

OK

//comment////this is a simple effect file////blah blahEffect "terrain/dirt.ofx"{     Technique     {           Sorting          SKYBOX           CastShadows      ENABLED           LODLevel         0           Pass           {                DepthWrite    DISABLED                DepthFunc     LEQUAL                FogDensity    0.7                                CullingMode   CLOCKWISE           }     }}

Part 1: The lexer:

You use flex, to write a lexer, which basically reads in data from the file, and tells the parser what kind of tokens were found, and an optional value for the token. A token could be a single word, a number, a string, whatever.

Flex takes in a file, that describes all the tokens, and generates C or C++ code for the lexer.

In an attempt to explain what a token is, here's a list of tokens that flex will understand from this effect file, from my parser. The kind of tokens it understands depends on the input file to flex, basically you give flex an input file, and it generates a C++ program to parse the input.

Anyway, i'm rambling, so here's the list of the tokens my parser would get from this effect file.

Token:                           Token Type:===============================================//comment                        COMMENT (the parser is set to just skip these//                               COMMENT  until it finds a token that isn't a //this is a simple effect file   COMMENT  comment )//                               COMMENT//blah blah                      COMMENTEffect                           EFFECT{                                LEFT_CURLY_BRACKET"terrain/dirt.ofx"               STRINGTechnique                        TECHNIQUE{                                LEFT_CURLY_BRACKETSorting                          SORTINGSKYBOX                           SKYBOXCastShadows                      CAST_SHADOWSENABLED                          BOOLEANLODLevel                         LODLEVEL0                                UNSIGNED_INTEGERPass                             PASS{                                LEFT_CURLY_BRACKETDepthWrite                       DEPTHWRITEDISABLED                         BOOLEANDepthFunc                        DEPTHFUNCFogDensity			 FOGDENSITY0.7                              FLOATLEQUAL                           CMPFUNC_LEQUALCullingMode                      CLOCKWISE}                                RIGHT_CURLY_BRACKET}                                RIGHT_CURLY_BRACKET}                                RIGHT_CURLY_BRACKET

Part two to follow in a second, I am really paranoid about my browser crashing, and losing all this.

Sorry if this explanation is a bit confusing, as I say, I'm not a good writer.

[Edited by - Oxyacetylene on September 2, 2005 6:19:09 AM]

##### Share on other sites
Part two: The parser

Similar to flex, there is a tool called bison that can work with flex to generate a parser.

Basically at the moment, all we have is a lexer that can take the input from the file, and find out what kind of tokens are present. If all we had was the lexer, then I could write the following effect file, and the parser wouldn't complain

{ Effect } Sorting DISABLED ENABLED { } TechniquePass Pass Pass ENABLED DISABLED { {{{{ }DepthFunc"hay guys whuts up in this effect file lol"{ } { } { } { }}}}}}}}{{{{{{

This effect file is clearly invalid, but the lexer would read this in quite happily.

What we need, is a program that will take tokens from the lexer, make sure that they are all in the right order, and generate some kind of output. The output could be whatever you want. My parser generates a syntax tree as its output (which I'll get into later)

The parser basically says, I want to see these tokens in the following order, if you find the right tokens in the right order, then wahey, bonus, if you don't then that's an error, and parsing has failed.

Here's a sample from the file that I used to get bison to generate a parser. Sorry, it's a bit cryptic, but I couldn't think of a better way to explain this

//An effect file consists of the token "Effect", followed by//a curly bracket, and effect block, a string, and a right curly bracketeffectfile: TOKEN_EFFECT TOKEN_STRING   TOKEN_LEFTCURLYBRACKET effectblock TOKEN_RIGHTCURLYBRACKET	;//an effect block consists of a techniqueeffectblock:	 technique;//A technique consists of the token "Technique", followed by a left//curly bracket, a list of technique statements, and a right curly brackettechnique: TOKEN_TECHNIQUE TOKEN_LEFTCURLYBRACKET techniquestatementlist TOKEN_RIGHTCURLYBRACKET;//A technique statement list consists of a //technique statement list, followed by a techniquestatement,//or nothing at all////This allows a technique statement list to contain any number of technique//statementstechniquestatmentlist: techniquestatementlist techniquestatement| /* empty */;//and so ontechniquestatment: recieveshadows| castshadows| passblock;

When the parser recognises one of these blocks, my parser generates a node in a syntax tree. A syntax tree is basically just a tree generated out of the tokens read in by the parser.

Here's a part of the syntax tree, generated from the effect file in the earlier post

             EFFECT             /     \               STRING       TECHNIQUE-----------                     /     |            \                                     SORTING   CASTSHADOWS  LODLEVEL                                 /        /               \                                  SKYBOX    ENABLED         UNSIGNED_INTEGER

My parser then reads in the syntax tree, and creates all the data structures
to hold the effect, all it's techniques and passes, etc.

Sorry, this explanation is too technical, and is really dire! :) Hopefully it will help somehow, despite its shortcomings. Someone throw me a bone here and write something better! :)

The source code for my engine, which includes the effect parser is available on request, but I'm not sure how easy it would be to understand for something new to parsing.

##### Share on other sites
Of course, with boost::spirit there's no need to ever step outside of C++:
<disclaimer>
I do not claim to be an expert on boost::spirit. I'm sure there are better ways to do the following. One particular issue you may have is that I do not follow the boost::spirit style guide
</disclaimer>
#include <algorithm>#include <fstream>#include <iostream>#include <iterator>#include <boost/spirit.hpp>#include <boost/spirit/tree/ast.hpp>#include <boost/spirit/tree/tree_to_xml.hpp>#include <map>using namespace boost::spirit;class Parser	:	public grammar< Parser >{	public:		Parser();		static int const effectFileId = 1;		static int const effectId = 2;		static int const stringId = 3;		static int const effectBlockId = 4;		static int const techniqueId = 5;		static int const techniqueStatementListId = 6;		static int const techniqueStatementId = 7;		static int const sortingTechniqueStatementId = 8;		static int const castShadowsTechniqueStatementId = 9;		static int const lodLevelTechniqueStatementId = 10;		static int const passTechniqueStatementId = 11;		static int const passStatementListId = 12;		static int const passStatementId = 13;		static int const depthWritePassStatementId = 14;		static int const depthFuncPassStatementId = 15;		static int const fogDensityPassStatementId = 16;		static int const cullingModePassStatementId = 17;		static int const commentId = 18;		template < typename ScannerT >		class definition		{			public:				definition(Parser const &);				rule< ScannerT, parser_context<>, parser_tag< effectFileId > > const & start() const;			private:				rule< ScannerT, parser_context<>, parser_tag< commentId > > comment;				rule< ScannerT, parser_context<>, parser_tag< cullingModePassStatementId > > cullingModePassStatement;				rule< ScannerT, parser_context<>, parser_tag< fogDensityPassStatementId > > fogDensityPassStatement;				rule< ScannerT, parser_context<>, parser_tag< depthFuncPassStatementId > > depthFuncPassStatement;				rule< ScannerT, parser_context<>, parser_tag< depthWritePassStatementId > > depthWritePassStatement;				rule< ScannerT, parser_context<>, parser_tag< passStatementId > > passStatement;				rule< ScannerT, parser_context<>, parser_tag< passStatementListId > > passStatementList;				rule< ScannerT, parser_context<>, parser_tag< passTechniqueStatementId > > passTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< lodLevelTechniqueStatementId > > lodLevelTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< castShadowsTechniqueStatementId > > castShadowsTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< sortingTechniqueStatementId > > sortingTechniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< techniqueStatementId > > techniqueStatement;				rule< ScannerT, parser_context<>, parser_tag< techniqueStatementListId > > techniqueStatementList;				rule< ScannerT, parser_context<>, parser_tag< techniqueId > > technique;				rule< ScannerT, parser_context<>, parser_tag< effectBlockId > > effectBlock;				rule< ScannerT, parser_context<>, parser_tag< stringId > > string;				rule< ScannerT, parser_context<>, parser_tag< effectId > > effect;				rule< ScannerT, parser_context<>, parser_tag< effectFileId > > effectFile;		};};Parser::Parser(){}template < typename ScannerT >Parser::definition< ScannerT >::definition(Parser const &)	:	comment	(		discard_node_d		[			lexeme_d			[				str_p("//") >>				*(anychar_p - ch_p('\n'))			]		]	),	cullingModePassStatement	(		discard_node_d		[			str_p("CullingMode")		] >>		(str_p("CLOCKWISE") | str_p("ANTICLOCKWISE"))	),	fogDensityPassStatement	(		discard_node_d		[			str_p("FogDensity")		] >>		real_p	),	depthFuncPassStatement	(		discard_node_d		[			str_p("DepthFunc")		] >>		(str_p("LEQUAL") | str_p("EQUAL") | str_p("LESS"))	),	depthWritePassStatement	(		discard_node_d		[			str_p("DepthWrite")		] >>		(str_p("ENABLED") | str_p("DISABLED"))	),	passStatement	(		depthWritePassStatement |		depthFuncPassStatement |		fogDensityPassStatement |		cullingModePassStatement	),	passStatementList	(		+passStatement	),	passTechniqueStatement	(		discard_node_d		[			str_p("Pass")		] >>		inner_node_d		[			str_p("{") >>			passStatementList >>			str_p("}")		]	),	lodLevelTechniqueStatement	(		discard_node_d		[			str_p("LODLevel")		] >>		uint_p	),	castShadowsTechniqueStatement	(		discard_node_d		[			str_p("CastShadows")		] >>		(str_p("ENABLED") | str_p("DISABLED"))	),	sortingTechniqueStatement	(		discard_node_d		[			str_p("Sorting")		] >>		(str_p("SKYBOX") | str_p("WORLD"))	),	techniqueStatement	(		sortingTechniqueStatement |		castShadowsTechniqueStatement |		lodLevelTechniqueStatement |		passTechniqueStatement	),	techniqueStatementList	(		*techniqueStatement	),	technique	(		discard_node_d		[			str_p("Technique")		] >>		inner_node_d		[			str_p("{") >>			techniqueStatementList >>			str_p("}")		]	),	effectBlock	(		technique	),	string	(		leaf_node_d		[			lexeme_d			[				ch_p('\"') >>				*(anychar_p - ch_p('\"')) >>				ch_p('\"')			]		]	),	effect	(		discard_node_d		[			str_p("Effect")		] >>		string >>		inner_node_d		[			str_p("{") >>			effectBlock >>			str_p("}")		]	),	effectFile	(		*comment >>		effect >>		*comment	){}template < typename ScannerT >rule< ScannerT, parser_context<>, parser_tag< Parser::effectFileId > > const & Parser::definition< ScannerT >::start() const{	return effectFile;}int main(){	Parser parser;	std::ifstream reader("input.txt");	reader.seekg(0, std::ios::end);	std::vector< char > input(std::streamsize(reader.tellg()) + 1, 0);	reader.seekg(0, std::ios::beg);	reader.read(&input[0], input.size());	tree_parse_info<> info = ast_parse(&input[0], parser, space_p);	if (!info.full)	{		std::cout << "Parse error\n\n";		std::cout << "\tParsed:\n";		std::copy< char const *, std::ostreambuf_iterator< char > >(&input[0], info.stop, std::ostreambuf_iterator< char >(std::cout));		return -1;	}	std::map<parser_id, std::string> rule_names;	rule_names[Parser::effectFileId] = "Effect File";	rule_names[Parser::effectId] = "Effect";	rule_names[Parser::stringId] = "String";	rule_names[Parser::effectBlockId] = "Effect Block";	rule_names[Parser::techniqueId] = "Effect Techniques";	rule_names[Parser::techniqueStatementListId] = "Techniques";	rule_names[Parser::techniqueStatementId] = "Technique";	rule_names[Parser::sortingTechniqueStatementId] = "Sorting Technique";	rule_names[Parser::castShadowsTechniqueStatementId] = "Shadow Casting Technique";	rule_names[Parser::lodLevelTechniqueStatementId] = "LOD Level Technique";	rule_names[Parser::passTechniqueStatementId] = "Pass Info";	rule_names[Parser::passStatementListId] = "Pass Info List";	rule_names[Parser::passStatementId] = "Pass Setting";	rule_names[Parser::depthWritePassStatementId] = "Depth Write Pass Setting";	rule_names[Parser::depthFuncPassStatementId] = "Depth Func Pass Setting";	rule_names[Parser::fogDensityPassStatementId] = "Fog Density Pass Setting";	rule_names[Parser::cullingModePassStatementId] = "Culling Mode Pass Setting";	rule_names[Parser::commentId] = "Comment";	tree_to_xml(std::cout, info.trees, "", rule_names);}

which, when run on Oxyacetylene's input file, produces the parse tree:
<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE parsetree SYSTEM "parsetree.dtd"><parsetree version="1.0">    <parsenode rule="Effect File">        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Comment">        </parsenode>        <parsenode rule="Effect">            <parsenode rule="String">                <value>"terrain/dirt.ofx"</value>            </parsenode>            <parsenode rule="Techniques">                <parsenode rule="Sorting Technique">                    <value>SKYBOX</value>                </parsenode>                <parsenode rule="Shadow Casting Technique">                    <value>ENABLED</value>                </parsenode>                <parsenode rule="LOD Level Technique">                    <value>0</value>                </parsenode>                <parsenode rule="Pass Info List">                    <parsenode rule="Depth Write Pass Setting">                        <value>DISABLED</value>                    </parsenode>                    <parsenode rule="Depth Func Pass Setting">                        <value>LEQUAL</value>                    </parsenode>                    <parsenode rule="Fog Density Pass Setting">                        <value>0.7</value>                    </parsenode>                    <parsenode rule="Culling Mode Pass Setting">                        <value>CLOCKWISE</value>                    </parsenode>                </parsenode>            </parsenode>        </parsenode>    </parsenode></parsetree>

<off-topic>Why do we not have a [source lang="xml"] option?</off-topic>

Enigma

##### Share on other sites
Yeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonable sized parser.

It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.

I still use it to parse the commmand line in my console though, it's much better at this than the hand-coded solution I had previously.

##### Share on other sites
well one thing is this is to help me learn programming better and also so I can parse my setting files for my games and other programes.

Also the things it needs to do it just look for one char and when it comes to it load every thing tell it comes to another one until the end of the file.

Thanks for the help this is a great starting point.
I apretate it very much

##### Share on other sites
Quote:
 Original post by OxyacetyleneYeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonable sized parser.

I've used it with MinGW's GCC and not had such problems :-).

Quote:
 It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.

Again, no problems here. However, I broke my grammar down into many chunks with many rules, and didn't get into the optimization options with full template deduction. I certainly wasn't using all the fancy features.

At worst, it's worth looking into :-).