
How do you parse a text file in C/C++?

This topic is 4485 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.


Can anyone give me a clue on how to parse a text file in C/C++? I can load the file fine; now I just need to look for some symbols and words and then load what comes after those symbols and words into a variable. I know how to load a text file and how to set a string or char from the text file to a variable, but I can't choose where to start loading. Can anyone help with this? Some simple code or some tutorials to read would help. Thanks.

Parsing/lexing a file in a given language is a non-trivial task that keeps many very smart people very busy with research papers. Unfortunately, this also means that the information that would need to be conveyed is beyond the scope of a single post. But if you're interested in learning more about the topic, there are a lot of terrific books out there. One that comes highly recommended is the so-called Dragon Book, which gives a terrific introduction to the theory of parsing, compilers, lexers, and the like.

Okay, let me partially retract my previous statement. Were you talking about parsing a C++ file (i.e., some source code), or were you talking about merely reading in some kind of simple file from input (e.g., a comma-separated list of latitude/longitude positions) and doing things with it?

If it's the latter, then a Google search on "C++ fstream file input tutorial" and similar terms will probably give you what you want. See here for a good starting point.
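For the simple case, here is a minimal sketch of reading comma-separated latitude/longitude lines with the standard stream classes. The names `Position` and `readPositions` are made up for this illustration, and the "one pair per line" layout is an assumption about the file format.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One latitude/longitude pair read from a line such as "51.5,-0.12".
struct Position {
    double latitude;
    double longitude;
};

// Reads "latitude,longitude" lines from any input stream. A std::ifstream
// works the same way as a std::istringstream here.
// Lines that do not match the expected shape are skipped.
std::vector<Position> readPositions(std::istream& in) {
    std::vector<Position> result;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        Position p{};
        char comma = 0;
        // Extract number, separator, number; reject the line if any step fails.
        if (fields >> p.latitude >> comma >> p.longitude && comma == ',') {
            result.push_back(p);
        }
    }
    return result;
}
```

To use it with a real file, open a `std::ifstream` (from `<fstream>`) on the file name and pass it to `readPositions` in place of the string stream.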

It was the latter. There is no way, with my skill, that I would be able to make a compiler or that type of parser; that would be way out of my league.


Thanks, that will get me started.

Thanks for all the help. I will probably have more questions later.

Depending on how complex the data to parse is, you may want to use Flex and Bison. With them you generate parser code (in the form of a C or C++ file that you just compile in the usual way) that decomposes the data into usable tokens.

It would be overkill to use them for parsing comma-separated values, but I've successfully used them to create a .MAP parser and a Doom 3 MD5 model loader. Just read the manuals; they're not complicated at all.

A very, very good reference on this for beginner/intermediate programmers is 'Exploring Programming and Computer Science with C++' by Owen Astrachan of Duke University. In Section 9.3 he presents 'Case Study, Removing Comments with State Machines'.

If you don't have access to this book, you can look through some of the material at http://www.cs.duke.edu/~ola/book.html and you can also download his sample code, which is quite instructive on its own. The file you want to study is 'decomment.cc'.
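To give a feel for the state-machine idea, here is a small sketch that strips `//` comments. It is not the code from 'decomment.cc'; the function name `stripLineComments` and the three states are chosen for this illustration, and string literals containing `//` are deliberately not handled.

```cpp
#include <string>

// Removes "//" comments from source text using a small state machine.
// String literals are not handled; this is a sketch of the technique only.
std::string stripLineComments(const std::string& text) {
    enum State { Text, Slash, Comment };
    State state = Text;
    std::string out;
    for (char c : text) {
        switch (state) {
        case Text:
            if (c == '/') state = Slash;      // might start a comment
            else out += c;
            break;
        case Slash:
            if (c == '/') {
                state = Comment;              // "//" seen: start skipping
            } else {
                out += '/';                   // lone slash was ordinary text
                out += c;
                state = Text;
            }
            break;
        case Comment:
            if (c == '\n') {                  // comment ends at end of line
                out += c;
                state = Text;
            }
            break;
        }
    }
    if (state == Slash) out += '/';           // input ended on a lone slash
    return out;
}
```

Each character is looked at exactly once, and the only thing remembered between characters is the current state, which is what makes this approach easy to extend with more states (for example, one for block comments).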

--random_thinker

If the data is simple (i.e. the format is deterministic), writing a parser is a matter of reading the file with fstream and interpreting the data. If it's more free-form but relatively simple, then rolling your own is an option; google for tokenization, parser, lexical analysis, etc. If it's something like a simple computer language or complex free-form data, you might want to check out flex, bison, yacc, etc. I've also heard good things about boost::spirit.

Quote:
Original post by kingpinzs
Can anyone give me a clue on how to parse a text file in C/C++?

I can load the file fine; now I just need to look for some symbols and words and then load what comes after those symbols and words into a variable.

I know how to load a text file and how to set a string or char from the text file to a variable, but I can't choose where to start loading.

Can anyone help with this?

Some simple code or some tutorials to read would help.

Thanks


Use the CRegExp library. Its sources are about 25 kB.
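As an illustration of the regular-expression approach, here is a sketch that uses `std::regex` from the C++11 standard library rather than CRegExp, simply because its interface is standardized. The helper name `valueAfter` is hypothetical, and the "keyword, whitespace, value" layout is an assumption about the file.

```cpp
#include <regex>
#include <string>

// Returns the first run of non-whitespace characters that follows `key`
// and some whitespace, or an empty string if the key does not appear.
// `key` is pasted into the pattern verbatim, so it should contain only
// letters and digits.
std::string valueAfter(const std::string& text, const std::string& key) {
    const std::regex pattern("\\b" + key + "\\s+(\\S+)");
    std::smatch match;
    if (std::regex_search(text, match, pattern)) {
        return match[1].str();   // capture group 1 holds the value
    }
    return std::string();
}
```

For example, calling it on the text of a settings file with the key `FogDensity` would return whatever word follows that keyword.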

If I were you, I'd look into learning flex and bison, as was previously suggested. There was also a fantastic tutorial on creating a scripting engine, on flipcode, which is sadly down at the moment. The tutorial was more geared towards parsing a scripting language, but gave a good introduction to flex and bison, which can be quite hard to learn at first.

I'm pretty new to parsing, but I've written a simple parser to read in an effect file format, which looks a bit like the sample further down (people who are familiar with Ogre will notice that the format is very similar).

I'll give you a very very brief overview of the process. I'm not a very good writer, but hopefully this will at least give you a basic idea of what happens. Please, someone else feel free to expand on what I've said, correct anything I've said that's wrong or too complex, and make it a bit more newbie friendly :)

I'm not the best person to help you, but I wanted to at least try, because I know how frustrating it can be getting a foothold into this particular topic! Apart from the Flipcode scripting tutorial, I've never found a simple parsing tutorial geared towards the newbie. They all seem to assume some kind of previous knowledge on the subject. (I've not looked for a long time though, so maybe things have improved since then)

OK



//comment
//
//this is a simple effect file
//
//blah blah

Effect "terrain/dirt.ofx"
{

    Technique
    {
        Sorting SKYBOX
        CastShadows ENABLED
        LODLevel 0

        Pass
        {
            DepthWrite DISABLED
            DepthFunc LEQUAL
            FogDensity 0.7

            CullingMode CLOCKWISE
        }
    }

}






Part one: The lexer

You use flex to write a lexer, which basically reads in data from the file and tells the parser what kind of tokens were found, plus an optional value for each token. A token could be a single word, a number, a string, whatever.

Flex takes in a file that describes all the tokens and generates C or C++ code for the lexer.

In an attempt to explain what a token is, here's a list of tokens that the lexer in my parser will recognise in this effect file. The kinds of tokens it understands depend on the input file you give to flex: you describe the tokens, and flex generates the C or C++ code that scans the input for them.

Anyway, I'm rambling, so here's the list of the tokens my parser would get from this effect file.


Token:                          Token Type:
===============================================
//comment                       COMMENT
//                              COMMENT
//this is a simple effect file  COMMENT
//                              COMMENT
//blah blah                     COMMENT
Effect                          EFFECT
"terrain/dirt.ofx"              STRING
{                               LEFT_CURLY_BRACKET
Technique                       TECHNIQUE
{                               LEFT_CURLY_BRACKET
Sorting                         SORTING
SKYBOX                          SKYBOX
CastShadows                     CAST_SHADOWS
ENABLED                         BOOLEAN
LODLevel                        LODLEVEL
0                               UNSIGNED_INTEGER
Pass                            PASS
{                               LEFT_CURLY_BRACKET
DepthWrite                      DEPTHWRITE
DISABLED                        BOOLEAN
DepthFunc                       DEPTHFUNC
LEQUAL                          CMPFUNC_LEQUAL
FogDensity                      FOGDENSITY
0.7                             FLOAT
CullingMode                     CLOCKWISE
}                               RIGHT_CURLY_BRACKET
}                               RIGHT_CURLY_BRACKET
}                               RIGHT_CURLY_BRACKET

(The parser is set to just skip COMMENT tokens until it finds a token that isn't a comment.)
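For comparison with what flex generates, here is a hand-written sketch of a lexer for this kind of file. It is not the flex-generated code from the post: the `TokenType` values are a coarser, made-up set (every keyword is just a `Word`), and a token counts as a `Number` simply when it starts with a digit.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenType { Word, Number, String, LeftBrace, RightBrace };

struct Token {
    TokenType type;
    std::string text;
};

// Splits effect-file text into tokens, skipping whitespace and // comments.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;
        } else if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i;   // skip the comment
        } else if (c == '{') {
            tokens.push_back({TokenType::LeftBrace, "{"});
            ++i;
        } else if (c == '}') {
            tokens.push_back({TokenType::RightBrace, "}"});
            ++i;
        } else if (c == '"') {
            // Quoted string: keep the text between the quotes.
            const std::size_t close = src.find('"', i + 1);
            const std::size_t end = (close == std::string::npos) ? src.size() : close;
            tokens.push_back({TokenType::String, src.substr(i + 1, end - i - 1)});
            i = (close == std::string::npos) ? src.size() : close + 1;
        } else {
            // Read up to the next whitespace, brace or quote.
            const std::size_t start = i;
            while (i < src.size() && !std::isspace(static_cast<unsigned char>(src[i]))
                   && src[i] != '{' && src[i] != '}' && src[i] != '"') {
                ++i;
            }
            const std::string text = src.substr(start, i - start);
            const bool number = std::isdigit(static_cast<unsigned char>(text[0])) != 0;
            tokens.push_back({number ? TokenType::Number : TokenType::Word, text});
        }
    }
    return tokens;
}
```

A fuller lexer would map each keyword to its own token type, as in the table above, usually with a lookup table from word to type.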




Part two will follow in a second; I am really paranoid about my browser crashing and losing all this.

Sorry if this explanation is a bit confusing. As I say, I'm not a good writer.

[Edited by - Oxyacetylene on September 2, 2005 6:19:09 AM]

Part two: The parser

Similar to flex, there is a tool called bison that can work with flex to generate a parser.

Basically, at the moment all we have is a lexer that can take the input from the file and find out what kind of tokens are present. If all we had was the lexer, then I could write the following effect file, and nothing would complain:



{ Effect } Sorting

DISABLED ENABLED { }

Technique
Pass Pass Pass ENABLED DISABLED { {{{{ }
DepthFunc

"hay guys whuts up in this effect file lol"

{ } { } { } { }}}}}}}}{{{{{{




This effect file is clearly invalid, but the lexer would read this in quite happily.

What we need is a program that will take tokens from the lexer, make sure that they are all in the right order, and generate some kind of output. The output could be whatever you want. My parser generates a syntax tree as its output (which I'll get into later).

The parser basically says: I want to see these tokens in the following order. If it finds the right tokens in the right order, then wahey, bonus; if it doesn't, that's an error and parsing has failed.

Here's a sample from the file that I used to get bison to generate a parser. Sorry, it's a bit cryptic, but I couldn't think of a better way to explain this.



//An effect file consists of the token "Effect", followed by a string,
//a left curly bracket, an effect block, and a right curly bracket

effectfile
    : TOKEN_EFFECT TOKEN_STRING
      TOKEN_LEFTCURLYBRACKET effectblock TOKEN_RIGHTCURLYBRACKET
    ;

//an effect block consists of a technique
effectblock
    : technique
    ;

//A technique consists of the token "Technique", followed by a left
//curly bracket, a list of technique statements, and a right curly bracket
technique
    : TOKEN_TECHNIQUE TOKEN_LEFTCURLYBRACKET techniquestatementlist TOKEN_RIGHTCURLYBRACKET
    ;

//A technique statement list consists of a
//technique statement list, followed by a techniquestatement,
//or nothing at all
//
//This allows a technique statement list to contain any number of technique
//statements
techniquestatementlist
    : techniquestatementlist techniquestatement
    | /* empty */
    ;

//and so on
techniquestatement
    : recieveshadows
    | castshadows
    | passblock
    ;
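To show what "checking the order of the tokens" means without bison, here is a hand-written recursive-descent sketch for a cut-down version of the grammar. It is not what bison generates (bison produces a table-driven parser), the class name `EffectChecker` is invented for this example, and it assumes the tokens have already been split into strings, with quoted strings keeping their quote marks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Checks token order for a cut-down grammar:
//   effectfile : "Effect" STRING "{" technique "}"
//   technique  : "Technique" "{" statement* "}"
//   statement  : WORD WORD
class EffectChecker {
public:
    explicit EffectChecker(const std::vector<std::string>& tokens)
        : tokens_(tokens), pos_(0) {}

    // True only if the whole token list matches the grammar.
    bool parse() { return effectFile() && pos_ == tokens_.size(); }

private:
    bool atEnd() const { return pos_ >= tokens_.size(); }
    bool isBrace() const { return tokens_[pos_] == "{" || tokens_[pos_] == "}"; }

    // Consumes the next token if it equals `expected`.
    bool accept(const std::string& expected) {
        if (atEnd() || tokens_[pos_] != expected) return false;
        ++pos_;
        return true;
    }
    // Consumes a quoted string token such as "terrain/dirt.ofx".
    bool acceptString() {
        if (atEnd() || tokens_[pos_].size() < 2 || tokens_[pos_][0] != '"') return false;
        ++pos_;
        return true;
    }
    // Consumes any token that is not a brace.
    bool acceptWord() {
        if (atEnd() || isBrace()) return false;
        ++pos_;
        return true;
    }
    bool statement() { return acceptWord() && acceptWord(); }
    bool technique() {
        if (!accept("Technique") || !accept("{")) return false;
        while (!atEnd() && !isBrace()) {
            if (!statement()) return false;
        }
        return accept("}");
    }
    bool effectFile() {
        return accept("Effect") && acceptString() && accept("{")
            && technique() && accept("}");
    }

    std::vector<std::string> tokens_;
    std::size_t pos_;
};
```

Each grammar rule becomes one member function, which is why the code reads much like the grammar itself. The scrambled effect file shown earlier would be rejected at the very first token.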





When the parser recognises one of these blocks, my parser generates a node in a syntax tree. A syntax tree is basically just a tree generated out of the tokens read in by the parser.

Here's a part of the syntax tree, generated from the effect file in the earlier post



             EFFECT
            /      \
       STRING      TECHNIQUE
                  /    |     \
           SORTING CASTSHADOWS LODLEVEL
              |        |          |
           SKYBOX   ENABLED  UNSIGNED_INTEGER



My parser then reads in the syntax tree, and creates all the data structures
to hold the effect, all its techniques and passes, etc.
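As a rough sketch of that last step, the code below walks the children of a TECHNIQUE node and fills in a plain struct. The `Node` and `Technique` types are hypothetical, not taken from the engine mentioned in this post, and each leaf is assumed to store the actual value text (for example "0") rather than just its token type.

```cpp
#include <string>
#include <vector>

// A syntax tree node: a label plus any number of child nodes.
struct Node {
    std::string label;                 // e.g. "TECHNIQUE", "LODLEVEL", "SKYBOX"
    std::vector<Node> children;
};

// The data structure the engine would actually use at run time.
struct Technique {
    std::string sorting;
    bool castShadows = false;
    unsigned lodLevel = 0;
};

// Walks the children of a TECHNIQUE node and copies each setting into a
// Technique. Each setting node is assumed to have one child holding its
// value; std::stoul throws if the LODLEVEL value is not a number.
Technique buildTechnique(const Node& techniqueNode) {
    Technique result;
    for (const Node& setting : techniqueNode.children) {
        if (setting.children.empty()) continue;       // malformed: no value
        const std::string& value = setting.children[0].label;
        if (setting.label == "SORTING") {
            result.sorting = value;
        } else if (setting.label == "CASTSHADOWS") {
            result.castShadows = (value == "ENABLED");
        } else if (setting.label == "LODLEVEL") {
            result.lodLevel = static_cast<unsigned>(std::stoul(value));
        }
    }
    return result;
}
```

Keeping the tree walk separate from the parser means the grammar can change without touching the engine's own data structures.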

Sorry, this explanation is too technical, and is really dire! :) Hopefully it will help somehow, despite its shortcomings. Someone throw me a bone here and write something better! :)

The source code for my engine, which includes the effect parser, is available on request, but I'm not sure how easy it would be to understand for someone new to parsing.

Of course, with boost::spirit there's no need to ever step outside of C++:
<disclaimer>
I do not claim to be an expert on boost::spirit. I'm sure there are better ways to do the following. One particular issue you may have is that I do not follow the boost::spirit style guide
</disclaimer>
#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
#include <boost/spirit.hpp>
#include <boost/spirit/tree/ast.hpp>
#include <boost/spirit/tree/tree_to_xml.hpp>

using namespace boost::spirit;

class Parser
:
public grammar< Parser >
{
public:

Parser();

static int const effectFileId = 1;
static int const effectId = 2;
static int const stringId = 3;
static int const effectBlockId = 4;
static int const techniqueId = 5;
static int const techniqueStatementListId = 6;
static int const techniqueStatementId = 7;
static int const sortingTechniqueStatementId = 8;
static int const castShadowsTechniqueStatementId = 9;
static int const lodLevelTechniqueStatementId = 10;
static int const passTechniqueStatementId = 11;
static int const passStatementListId = 12;
static int const passStatementId = 13;
static int const depthWritePassStatementId = 14;
static int const depthFuncPassStatementId = 15;
static int const fogDensityPassStatementId = 16;
static int const cullingModePassStatementId = 17;
static int const commentId = 18;

template < typename ScannerT >
class definition
{

public:

definition(Parser const &);
rule< ScannerT, parser_context<>, parser_tag< effectFileId > > const & start() const;

private:

rule< ScannerT, parser_context<>, parser_tag< commentId > > comment;
rule< ScannerT, parser_context<>, parser_tag< cullingModePassStatementId > > cullingModePassStatement;
rule< ScannerT, parser_context<>, parser_tag< fogDensityPassStatementId > > fogDensityPassStatement;
rule< ScannerT, parser_context<>, parser_tag< depthFuncPassStatementId > > depthFuncPassStatement;
rule< ScannerT, parser_context<>, parser_tag< depthWritePassStatementId > > depthWritePassStatement;
rule< ScannerT, parser_context<>, parser_tag< passStatementId > > passStatement;
rule< ScannerT, parser_context<>, parser_tag< passStatementListId > > passStatementList;
rule< ScannerT, parser_context<>, parser_tag< passTechniqueStatementId > > passTechniqueStatement;
rule< ScannerT, parser_context<>, parser_tag< lodLevelTechniqueStatementId > > lodLevelTechniqueStatement;
rule< ScannerT, parser_context<>, parser_tag< castShadowsTechniqueStatementId > > castShadowsTechniqueStatement;
rule< ScannerT, parser_context<>, parser_tag< sortingTechniqueStatementId > > sortingTechniqueStatement;
rule< ScannerT, parser_context<>, parser_tag< techniqueStatementId > > techniqueStatement;
rule< ScannerT, parser_context<>, parser_tag< techniqueStatementListId > > techniqueStatementList;
rule< ScannerT, parser_context<>, parser_tag< techniqueId > > technique;
rule< ScannerT, parser_context<>, parser_tag< effectBlockId > > effectBlock;
rule< ScannerT, parser_context<>, parser_tag< stringId > > string;
rule< ScannerT, parser_context<>, parser_tag< effectId > > effect;
rule< ScannerT, parser_context<>, parser_tag< effectFileId > > effectFile;

};

};

Parser::Parser()
{
}

template < typename ScannerT >
Parser::definition< ScannerT >::definition(Parser const &)
:
comment
(
discard_node_d
[
lexeme_d
[
str_p("//") >>
*(anychar_p - ch_p('\n'))
]
]
),
cullingModePassStatement
(
discard_node_d
[
str_p("CullingMode")
] >>
(str_p("CLOCKWISE") | str_p("ANTICLOCKWISE"))
),
fogDensityPassStatement
(
discard_node_d
[
str_p("FogDensity")
] >>
real_p
),
depthFuncPassStatement
(
discard_node_d
[
str_p("DepthFunc")
] >>
(str_p("LEQUAL") | str_p("EQUAL") | str_p("LESS"))
),
depthWritePassStatement
(
discard_node_d
[
str_p("DepthWrite")
] >>
(str_p("ENABLED") | str_p("DISABLED"))
),
passStatement
(
depthWritePassStatement |
depthFuncPassStatement |
fogDensityPassStatement |
cullingModePassStatement
),
passStatementList
(
+passStatement
),
passTechniqueStatement
(
discard_node_d
[
str_p("Pass")
] >>
inner_node_d
[
str_p("{") >>
passStatementList >>
str_p("}")
]
),
lodLevelTechniqueStatement
(
discard_node_d
[
str_p("LODLevel")
] >>
uint_p
),
castShadowsTechniqueStatement
(
discard_node_d
[
str_p("CastShadows")
] >>
(str_p("ENABLED") | str_p("DISABLED"))
),
sortingTechniqueStatement
(
discard_node_d
[
str_p("Sorting")
] >>
(str_p("SKYBOX") | str_p("WORLD"))
),
techniqueStatement
(
sortingTechniqueStatement |
castShadowsTechniqueStatement |
lodLevelTechniqueStatement |
passTechniqueStatement
),
techniqueStatementList
(
*techniqueStatement
),
technique
(
discard_node_d
[
str_p("Technique")
] >>
inner_node_d
[
str_p("{") >>
techniqueStatementList >>
str_p("}")
]
),
effectBlock
(
technique
),
string
(
leaf_node_d
[
lexeme_d
[
ch_p('\"') >>
*(anychar_p - ch_p('\"')) >>
ch_p('\"')
]
]
),
effect
(
discard_node_d
[
str_p("Effect")
] >>
string >>
inner_node_d
[
str_p("{") >>
effectBlock >>
str_p("}")
]
),
effectFile
(
*comment >>
effect >>
*comment
)
{
}

template < typename ScannerT >
rule< ScannerT, parser_context<>, parser_tag< Parser::effectFileId > > const & Parser::definition< ScannerT >::start() const
{
return effectFile;
}

int main()
{
Parser parser;
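std::ifstream reader("input.txt");
if (!reader)
{
std::cout << "Could not open input.txt\n";
return -1;
}
reader.seekg(0, std::ios::end);
// One extra zero byte terminates the buffer for ast_parse.
std::vector< char > input(std::streamsize(reader.tellg()) + 1, 0);
reader.seekg(0, std::ios::beg);
// Read only the file's bytes, leaving the terminating zero untouched.
reader.read(&input[0], input.size() - 1);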
tree_parse_info<> info = ast_parse(&input[0], parser, space_p);
if (!info.full)
{
std::cout << "Parse error\n\n";
std::cout << "\tParsed:\n";
std::copy< char const *, std::ostreambuf_iterator< char > >(&input[0], info.stop, std::ostreambuf_iterator< char >(std::cout));
return -1;
}
std::map<parser_id, std::string> rule_names;
rule_names[Parser::effectFileId] = "Effect File";
rule_names[Parser::effectId] = "Effect";
rule_names[Parser::stringId] = "String";
rule_names[Parser::effectBlockId] = "Effect Block";
rule_names[Parser::techniqueId] = "Effect Techniques";
rule_names[Parser::techniqueStatementListId] = "Techniques";
rule_names[Parser::techniqueStatementId] = "Technique";
rule_names[Parser::sortingTechniqueStatementId] = "Sorting Technique";
rule_names[Parser::castShadowsTechniqueStatementId] = "Shadow Casting Technique";
rule_names[Parser::lodLevelTechniqueStatementId] = "LOD Level Technique";
rule_names[Parser::passTechniqueStatementId] = "Pass Info";
rule_names[Parser::passStatementListId] = "Pass Info List";
rule_names[Parser::passStatementId] = "Pass Setting";
rule_names[Parser::depthWritePassStatementId] = "Depth Write Pass Setting";
rule_names[Parser::depthFuncPassStatementId] = "Depth Func Pass Setting";
rule_names[Parser::fogDensityPassStatementId] = "Fog Density Pass Setting";
rule_names[Parser::cullingModePassStatementId] = "Culling Mode Pass Setting";
rule_names[Parser::commentId] = "Comment";
tree_to_xml(std::cout, info.trees, "", rule_names);
}




which, when run on Oxyacetylene's input file, produces the parse tree:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE parsetree SYSTEM "parsetree.dtd">
<parsetree version="1.0">
<parsenode rule="Effect File">
<parsenode rule="Comment">
</parsenode>
<parsenode rule="Comment">
</parsenode>
<parsenode rule="Comment">
</parsenode>
<parsenode rule="Comment">
</parsenode>
<parsenode rule="Comment">
</parsenode>
<parsenode rule="Effect">
<parsenode rule="String">
<value>"terrain/dirt.ofx"</value>
</parsenode>
<parsenode rule="Techniques">
<parsenode rule="Sorting Technique">
<value>SKYBOX</value>
</parsenode>
<parsenode rule="Shadow Casting Technique">
<value>ENABLED</value>
</parsenode>
<parsenode rule="LOD Level Technique">
<value>0</value>
</parsenode>
<parsenode rule="Pass Info List">
<parsenode rule="Depth Write Pass Setting">
<value>DISABLED</value>
</parsenode>
<parsenode rule="Depth Func Pass Setting">
<value>LEQUAL</value>
</parsenode>
<parsenode rule="Fog Density Pass Setting">
<value>0.7</value>
</parsenode>
<parsenode rule="Culling Mode Pass Setting">
<value>CLOCKWISE</value>
</parsenode>
</parsenode>
</parsenode>
</parsenode>
</parsenode>
</parsetree>




<off-topic>Why do we not have a [source lang="xml"] option?</off-topic>

Enigma

Yeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonably sized parser.

It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.

I still use it to parse the command line in my console though; it's much better at this than the hand-coded solution I had previously.

Well, one thing is that this is to help me learn programming better, and also so I can parse the settings files for my games and other programs.

Also, all it needs to do is look for one char and, when it comes to it, load everything until it comes to another one, and keep doing that until the end of the file.

Thanks for the help; this is a great starting point.
I appreciate it very much.
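That "find one char, load everything up to the next one, repeat until the end" job can be sketched with nothing more than `std::string::find`. The function name `extractBetween` and the choice of marker characters are made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collects every run of text that sits between `open` and the next `close`.
// The two markers may be the same character (for example a double quote).
std::vector<std::string> extractBetween(const std::string& text, char open, char close) {
    std::vector<std::string> found;
    std::size_t start = text.find(open);
    while (start != std::string::npos) {
        const std::size_t end = text.find(close, start + 1);
        if (end == std::string::npos) break;          // unmatched opener: stop
        found.push_back(text.substr(start + 1, end - start - 1));
        start = text.find(open, end + 1);             // look for the next opener
    }
    return found;
}
```

For a settings file, one way to use it is to read the whole file into a `std::string` first (for example by streaming the file's `rdbuf()` into a `std::ostringstream`) and then call `extractBetween` on that string.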

Quote:
Original post by Oxyacetylene
Yeah, boost::spirit is really nice, so that's an option. It didn't work very well for me though, because I'm using Visual Studio 2002, which doesn't support some of the features. Also, the template code ends up overflowing several compiler limits when I try to compile any reasonable sized parser.


I've used it with MinGW's GCC and not had such problems :-).

Quote:
It worked fine for a very, very cut down version of my effect file grammar, but when I tried to implement the whole grammar, the compiler choked.


Again, no problems here. However, I broke my grammar down into many chunks with many rules, and didn't get into the optimization options with full template deduction. I certainly wasn't using all the fancy features.

At worst, it's worth looking into :-).
