# How can I make a text parser?


Hi, so I got a question.

How can I make a text parser? like:

1. Open a file.

2. Read the contents of file.

3. Consider something like this in the text file:

NewObject MyObject

HP   = 100
DMG  = 5
Weapon = SomeWeapon

End


How can I make it find keywords such as "NewObject" and then save its value (which is "MyObject") to an std::map or something?

4. How can I make it identify wrong things in the text file? For example, if there is a "NewObject" keyword then there should be an "End" keyword; if not, it should produce an error.

Are there any good tutorials or C++ functions that would let me make this? Any help is appreciated. Thanks in advance...

##### Share on other sites
Thanks guys for the quick answers, I think I'm beginning to understand now. Though, is there also pattern matching in C++? In Lua I would do something like:
string.match(str, "(.*) = (%d+),(%d+),(%d+)")
to match something like,
something = 1,2,3

##### Share on other sites

Using templates you could create something like that, but it isn't an easy beginner task to write.

I don't believe there is something like that already in the standard library, but I'm not familiar with all the new C++11 and C++14 library extensions so I could be wrong. C++11 did add a regex library (#include <regex>).
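For reference, here is a minimal sketch of how the Lua pattern from the question might look with `<regex>`. The `Triple` struct and `matchTriple` function are names made up for this illustration, and the pattern assumes the numbers are plain non-negative integers that fit in an `int`.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type: the name before " = " and the three numbers after it.
struct Triple {
    std::string name;
    int a, b, c;
};

// Rough equivalent of Lua's string.match(str, "(.*) = (%d+),(%d+),(%d+)").
// Returns false if the whole line does not fit the pattern.
bool matchTriple(const std::string& line, Triple& out)
{
    // Raw string literal so the backslashes do not need doubling.
    static const std::regex pattern(R"((.*) = (\d+),(\d+),(\d+))");

    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;

    out.name = m[1].str();
    out.a = std::stoi(m[2].str()); // std::stoi throws on values too large for int
    out.b = std::stoi(m[3].str());
    out.c = std::stoi(m[4].str());
    return true;
}
```

Note that `std::regex_match` has to match the entire string, while `std::regex_search` finds a match anywhere inside it.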

##### Share on other sites

C++ has regex now, but regex alone isn't enough to implement a parser for a programming language. The basic shortcoming is that regex can't count things for itself, and counting tokens in one form or another is necessary (e.g. keeping track of matching brackets is a form of counting) in any programming language you'd want to use. Regex is sufficient for a more declarative syntax though, say, key-value pairs for an initialization file.

Regex is a reasonable tool for tokenizing symbols though -- you just need to program the parsing/semantic analysis around that.
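To make that concrete, here is a rough sketch for the file format in the original question: regex recognizes each line, and a small amount of hand-written state checks that every "NewObject" gets its "End". The names `parseObjects`, `Fields` and `Objects` are invented for this example, and it assumes keys and values are single words made of letters, digits and underscores.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

using Fields = std::map<std::string, std::string>;   // key -> value
using Objects = std::map<std::string, Fields>;       // object name -> fields

// Reads "NewObject Name ... key = value ... End" blocks from `in`.
// Returns false and describes the problem in `error` on the first bad line.
bool parseObjects(std::istream& in, Objects& objects, std::string& error)
{
    static const std::regex header(R"(\s*NewObject\s+(\w+)\s*)");
    static const std::regex field(R"(\s*(\w+)\s*=\s*(\w+)\s*)");
    static const std::regex footer(R"(\s*End\s*)");
    static const std::regex blank(R"(\s*)");

    std::string line;
    std::string current;     // name of the object currently being filled in
    bool inObject = false;
    int lineNo = 0;
    std::smatch m;

    while (std::getline(in, line)) {
        ++lineNo;
        const std::string where = "line " + std::to_string(lineNo) + ": ";
        if (std::regex_match(line, blank)) continue;

        if (std::regex_match(line, m, header)) {
            if (inObject) { error = where + "NewObject before End of " + current; return false; }
            current = m[1].str();
            objects[current];            // create the (possibly empty) object
            inObject = true;
        } else if (std::regex_match(line, footer)) {
            if (!inObject) { error = where + "End without NewObject"; return false; }
            inObject = false;
        } else if (inObject && std::regex_match(line, m, field)) {
            objects[current][m[1].str()] = m[2].str();
        } else {
            error = where + "unexpected text: " + line;
            return false;
        }
    }

    if (inObject) { error = "missing End for " + current; return false; }
    return true;
}
```

Any `std::istream` works as input, so the same function can be fed a `std::ifstream` for a real file or a `std::istringstream` when testing. Values are stored as strings here; converting "100" to a number would be a separate step.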

##### Share on other sites
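#include <fstream>
#include <string>

class MyObj
{
public:
std::string myStr;
int myInt;
float myFloat;

// Writes the fields in order, separated by single spaces.
void WriteToStream(std::ofstream & file) const
{
file << myStr << ' ' << myInt << ' ' << myFloat << std::endl;
}

// Reads the fields back in the same order.
// (The method name is assumed; the original signature was lost.)
void ReadFromStream(std::ifstream & file)
{
file >> myStr >> myInt >> myFloat;
}
};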


I think the simplest way to do this is as above. The first method uses a stream to output the class's fields in order, with a ' ' delimiter.

The second simply reads the same data back in. operator>> skips leading whitespace, so the space delimiter is handled automatically. This only works as long as myStr itself contains no whitespace.

You can write any number of these to the same file and read them in until eof.
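A small sketch of that read loop, assuming each record is a word, an int and a float separated by whitespace. `Record` and `readRecords` are names picked for this example only.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record with the same three field types as the class above.
struct Record {
    std::string name;
    int count;
    float scale;
};

// Reads records until the stream runs out of data or hits something
// that does not fit (for example a word where a number is expected).
std::vector<Record> readRecords(std::istream& in)
{
    std::vector<Record> records;
    Record r;
    // The extraction expression converts to false as soon as any field fails,
    // which is safer than looping on in.eof().
    while (in >> r.name >> r.count >> r.scale) {
        records.push_back(r);
    }
    return records;
}
```

Testing the extraction itself rather than `eof()` avoids the classic problem of processing one bogus record after the last real one.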

Otherwise if you really need something more complex, then you probably need to be using xml or json.

##### Share on other sites
If you want to be robust and well-designed, I recommend reading up on the following:

https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser

And examine the tools available to do most of the hard work for you:

https://en.wikipedia.org/wiki/Compiler-compiler
https://en.wikipedia.org/wiki/Category:Parser_generators

##### Share on other sites

How can I make a text parser?

I hope you realize this is an entire branch of computer science?

Really, if you're not writing a text parser as an end in itself, choose to use a widespread common format (XML, YAML, JSON, INI) and use a library. There are many, all of which have been tested.

Otherwise, look into writing lexical analyzers, defining grammars built on the lexemes and using semantic insertion, and creating parsers for those grammars. There are handy tools for that, like boost::spirit (if it's still around).

##### Share on other sites

Recursive descent parsers are pretty easy to implement.

Generally if you can make a game you shouldn't have a problem understanding simple RD parsers.
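To give an idea of the shape, here is a minimal recursive descent sketch for a made-up grammar of nested, parenthesized word lists (the kind of bracket matching mentioned earlier in the thread as something regex can't do). The grammar and the `NestedParser` class are invented for this illustration; it only validates the input and builds no data structure.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Grammar (hypothetical, for illustration only):
//   list := '(' { item } ')'
//   item := WORD | list
// Each grammar rule becomes one function; nesting is handled by recursion.
class NestedParser {
public:
    explicit NestedParser(const std::string& text) : text_(text), pos_(0) {}

    // True if the whole input is exactly one well-formed list.
    bool parse()
    {
        if (!parseList()) return false;
        skipSpaces();
        return pos_ == text_.size();
    }

private:
    void skipSpaces()
    {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }

    bool parseList()
    {
        skipSpaces();
        if (pos_ >= text_.size() || text_[pos_] != '(') return false;
        ++pos_;
        for (;;) {
            skipSpaces();
            if (pos_ >= text_.size()) return false;           // missing ')'
            if (text_[pos_] == ')') { ++pos_; return true; }  // end of this list
            if (!parseItem()) return false;
        }
    }

    // Only called with pos_ inside the text (parseList checks that first).
    bool parseItem()
    {
        if (text_[pos_] == '(') return parseList();           // nested list
        const std::size_t start = pos_;
        while (pos_ < text_.size() &&
               std::isalnum(static_cast<unsigned char>(text_[pos_]))) ++pos_;
        return pos_ > start;                                  // need at least one character
    }

    std::string text_;
    std::size_t pos_;
};
```

For example, `NestedParser("(a (b c) d)").parse()` accepts the input, while an input with a missing closing parenthesis is rejected.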

Personally, for anything more advanced than a key-value pair format, I'd be tempted to use YACC and BISON and friends and not reinvent the whole car, never mind its wheel...

##### Share on other sites

The other alternative is to use a binary format.  Text - despite being human-readable - is not always the most appropriate choice.

##### Share on other sites

Using templates you could create something like that, but it isn't an easy beginner task to write.

I don't believe there is something like that already in the standard library, but I'm not familiar with all the new C++11 and C++14 library extensions so I could be wrong. C++11 did add a regex library (#include <regex>).

I just looked it up on google, it seems it has what I need to make a simple text parser :D, I'll try and learn how to use it.

If this is a part of something bigger, don't reinvent the wheel. Go get a JSON or YAML parser.

Why?

1. Toolchains: jq lets you pretty-print and query JSON files. Editors have modes to help you work on the files. There are syntax verifiers and so on.

2. Not finding all the bugs and edge-cases again. Someone's already worked out what to do when you try to put a JSON text into a JSON value.

Well, the parser I'm trying to make isn't going to be big, it's just going to be for some initialization for my game.

Anyway, thanks guys for answering, I think I'll go with regex first :D

##### Share on other sites

I just looked it up on google, it seems it has what I need to make a simple text parser.

No, it has what you need for a text matcher, a part of lexical analysis (trust me, I implemented a significant part of the regex library for GCC).  It's kind of like saying a couple of ice cubes in your cocktail is an iceberg and can sink ocean liners (disclaimer: do not captain while drunk).

It will definitely do simple pattern matching using captures, which is probably good enough for extracting data from a basic file with some simple data using a crude grammar.  In fact, I'd recommend it except I know at least one of the guys who implemented a significant part of that library is a jerk and I wouldn't trust anything he wrote.

Do yourself a favour and write a suite of unit tests for each of your initialization file lines.  You will thank me.
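Such a test suite can start out very small. The sketch below uses plain `assert` and a hypothetical `parseField` helper (not from any library) that splits one "key = value" line, with one check per kind of line the format should accept or reject.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper under test: splits "key = value" into its two parts.
// Assumes key and value are single words of letters, digits and underscores.
bool parseField(const std::string& line, std::string& key, std::string& value)
{
    static const std::regex field(R"(\s*(\w+)\s*=\s*(\w+)\s*)");
    std::smatch m;
    if (!std::regex_match(line, m, field)) return false;
    key = m[1].str();
    value = m[2].str();
    return true;
}

// One group of assertions per kind of line.
void runFieldTests()
{
    std::string key, value;

    assert(parseField("HP   = 100", key, value));
    assert(key == "HP" && value == "100");

    assert(parseField("Weapon = SomeWeapon", key, value));
    assert(key == "Weapon" && value == "SomeWeapon");

    assert(parseField("  DMG=5  ", key, value));
    assert(key == "DMG" && value == "5");

    assert(!parseField("HP 100", key, value));    // missing '='
    assert(!parseField("= 100", key, value));     // missing key
    assert(!parseField("HP =", key, value));      // missing value
    assert(!parseField("HP = 1 2", key, value));  // trailing junk
}
```

Calling `runFieldTests()` at startup in debug builds is enough to catch regressions when the format changes. Keep in mind that `assert` compiles to nothing when `NDEBUG` is defined; a proper test framework is the usual next step.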

##### Share on other sites

Manually writing parsers, e.g. with regex, has been discussed. To show you the next step, here is a little example that generates a scanner, which produces tokens for you to process.

You probably don't want to do that now, but it never hurts to see what it would look like :)

Compile with

$ flex scanner.l
$ g++ lex.yy.c main.cpp


and run with "./a.out input_file.txt" (lex.yy.c is generated by flex)

Scanner specification (maps text to token numbers, e.g. "=" is translated to the TK_EQUAL value): scanner.l

%{
int line = 1;
char *text;

#include "tokens.h"
%}

%%

=                       { return TK_EQUAL; }
NewObject               { return KW_NEWOBJECT; }
End                     { return KW_END; }

[A-Za-z][A-Za-z0-9]*    { text = yytext; return TK_IDENTIFIER; }
[0-9]+                  { text = yytext; return TK_NUMBER; }
[ \t\r]                 ;
\n                      { line++; /* keep the line counter used in error messages up to date */ }

.                       { printf("Unrecognized character 0x%02x\n", yytext[0]); }

%%

int yywrap() {
return 1;
}

Glue file to share the common definitions: tokens.h

#ifndef TOKENS_H
#define TOKENS_H

#include <stdio.h> // For FILE.

extern int line;
extern char *text;

extern FILE *yyin; // Owned by the generated scanner.

int yylex();

enum Tokens {
TK_EOF,

TK_EQUAL,
TK_IDENTIFIER,
TK_NUMBER,

KW_NEWOBJECT,
KW_END,
};

#endif

Main program file, with the actual parser

#include <cstdio>
#include <cstdlib>
#include <string>
#include "tokens.h"

bool parse()
{
int tok;

tok = yylex();
if (tok == TK_EOF) return true;

if (tok != KW_NEWOBJECT) {
printf("Expected NewObject at line %d\n", line);
return false;
}

tok = yylex();
if (tok != TK_IDENTIFIER) {
printf("Expected object name after NewObject at line %d\n", line);
return false;
}

printf("Found object name \"%s\" at line %d\n", text, line);

for (;;) {
tok = yylex();
if (tok == KW_END) break; // End of the object.

if (tok != TK_IDENTIFIER) {
printf("Expected field key at line %d\n", line);
return false;
}
std::string key = text; // Save the key before it gets overwritten by the field value.

tok = yylex();
if (tok != TK_EQUAL) {
printf("Expected equal sign at line %d\n", line);
return false;
}

tok = yylex();
if (tok == TK_IDENTIFIER) {
printf("Found a field with a named value: \"%s :: %s\"\n", key.c_str(), text);
} else if (tok == TK_NUMBER) {
printf("Found a field with a number: \"%s :: %d\"\n", key.c_str(), atoi(text));
} else {
printf("Unknown field value at line %d\n", line);
return false;
}

// And loop for the next "key = value"
}

tok = yylex();
if (tok != TK_EOF) {
printf("EOF expected at line %d\n", line);
return false;
}

return true;
}

int main(int argc, char *argv[])
{
FILE *handle = (argc == 2) ? fopen(argv[1], "rt") : NULL;
if (handle == NULL) {
printf("File could not be opened\n");
exit(1);
}

yyin = handle; // Give handle to the scanner.

bool result = parse();

fclose(handle);
return result ? 0 : 1;
}

That was the main.cpp file. The parser just prints the values, but of course you could also store them in some data structure.

If you think the "parse" function is a bit repetitive, it is. You can step up and use a parser generator like bison, to get rid of it, and gain a lot of additional recognizing power at the same time.

I don't have a working example with a parser generator (it needs some new code, like the class definitions, and a bit of additional glue code), but the core parser input specification would look like

Program : KW_NEWOBJECT TK_IDENTIFIER Fields KW_END
{
$$ = new Program($2, $3);
}

Fields : Field
{
$$ = std::list<Field *>();
$$.push_back($1);
}

Fields : Fields Field
{
$$ = $1;
$$.push_back($2);
}

Field : TK_IDENTIFIER TK_EQUAL TK_IDENTIFIER
{
$$ = new NameField($1, $3);
}

Field : TK_IDENTIFIER TK_EQUAL TK_NUMBER
{
$$ = new NumberField($1, $3);
}

You just write the sequences that you want to match, and what code should be executed. The parser generator generates the recognizer that reads tokens from the scanner, and calls your code when appropriate.

##### Share on other sites

Or maybe, if you're not currently interested in how to write one, you could just grab an existing library (rapidjson, jsoncpp or tinyxml, for example) and start using it?

##### Share on other sites

Maybe writing a parser is too big or too complicated for me. I've downloaded pugixml, an XML parser. Maybe it's good :D

I guess I'll write my own parser when I have more time and more C++ skills.