louie999

How can I make a text parser?


louie999    280

Hi, so I got a question.

 

How can I make a text parser? like:

1. Open a file.

2. Read the contents of file.

3. Consider something like this in the text file:

NewObject MyObject

HP   = 100
DMG  = 5
Weapon = SomeWeapon

End

How can I make it find keywords such as "NewObject" and then save its value (which is "MyObject") to an std::map or something?

4. How can I make it identify wrong things in the text file? Like, if there is a "NewObject" keyword then there should be an "End" keyword; if not, it will produce an error.

 

Are there any good tutorials or C++ functions that would let me make this? Any help is appreciated. Thanks in advance...

 

louie999    280
Thanks guys for the quick answers :) I think I'm beginning to understand now. Though, is there also pattern matching in C++? Like in Lua I would do something like:
string.match(str, "(.*) = (%d+),(%d+),(%d+)")
to match something like,
something = 1,2,3


Using templates you could create something like that, but it isn't an easy beginner task to write.

 

I don't believe there is something like that already in the standard library, but I'm not familiar with all the new C++11 and C++14 library extensions so I could be wrong. C++11 did add a regex library (#include <regex>).
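
For reference, a minimal sketch of what the Lua call above might look like with <regex>. The names Triple and matchTriple are made up for this illustration, and the pattern is only a rough counterpart of the Lua one: std::regex_match requires the whole line to match, whereas Lua's string.match searches inside the string (std::regex_search would be the closer fit for that).

```cpp
#include <regex>
#include <string>

// Captures from a line such as "something = 1,2,3".
struct Triple {
    std::string name;
    int a, b, c;
};

// Hypothetical helper: fills `out` and returns true if the whole line
// has the shape "<name> = <int>,<int>,<int>", otherwise returns false.
bool matchTriple(const std::string &line, Triple &out)
{
    // Rough counterpart of the Lua pattern "(.*) = (%d+),(%d+),(%d+)".
    static const std::regex pattern(R"((.*) = (\d+),(\d+),(\d+))");

    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;

    // m[0] is the whole match; m[1]..m[4] are the four captures.
    out.name = m[1].str();
    out.a = std::stoi(m[2].str()); // std::stoi throws if the number is too large
    out.b = std::stoi(m[3].str());
    out.c = std::stoi(m[4].str());
    return true;
}
```

Calling matchTriple("something = 1,2,3", t) should leave "something" in t.name and 1, 2, 3 in t.a, t.b, t.c.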

Ravyne    14300

C++ has regex now, but regex alone isn't enough to implement a parser for a programming language -- the basic shortcoming is that regex can't count things for itself, and counting tokens in one form or another is necessary (e.g. keeping track of matching brackets is a form of counting) in any programming language you'd want to use. Regex is sufficient for a more declarative syntax though -- say, key-value pairs for an initialization file.

 

Regex is a reasonable tool for tokenizing symbols though -- you just need to program the parsing/semantic analysis around that.
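
As a rough sketch of that split, the code below lets regex recognise each line, while ordinary code keeps track of whether a "NewObject" is still waiting for its "End". The function name parseObjects, the Objects typedef and the error texts are all made up for this illustration, and it assumes one statement per line with values made of letters, digits and underscores.

```cpp
#include <istream>
#include <map>
#include <regex>
#include <string>

// Object name -> (field key -> field value, kept as text).
typedef std::map<std::string, std::map<std::string, std::string> > Objects;

// Parses the whole stream. Returns true on success; on failure returns
// false and describes the first problem in `error`.
bool parseObjects(std::istream &in, Objects &out, std::string &error)
{
    static const std::regex blankRe(R"(\s*)");
    static const std::regex newObjectRe(R"(\s*NewObject\s+(\w+)\s*)");
    static const std::regex endRe(R"(\s*End\s*)");
    static const std::regex fieldRe(R"(\s*(\w+)\s*=\s*(\w+)\s*)");

    std::string line, current; // `current` names the object being read
    bool inObject = false;     // this flag is the "counting" regex cannot do
    int lineNo = 0;
    std::smatch m;

    while (std::getline(in, line)) {
        ++lineNo;
        const std::string where = "line " + std::to_string(lineNo) + ": ";

        if (std::regex_match(line, blankRe)) continue;

        if (std::regex_match(line, m, newObjectRe)) {
            if (inObject) { error = where + "NewObject before End of \"" + current + "\""; return false; }
            current = m[1].str();
            out[current]; // create the object even if it has no fields
            inObject = true;
        } else if (std::regex_match(line, endRe)) {
            if (!inObject) { error = where + "End without NewObject"; return false; }
            inObject = false;
        } else if (std::regex_match(line, m, fieldRe)) {
            if (!inObject) { error = where + "field outside an object"; return false; }
            out[current][m[1].str()] = m[2].str();
        } else {
            error = where + "cannot parse \"" + line + "\"";
            return false;
        }
    }

    if (inObject) { error = "missing End for \"" + current + "\""; return false; }
    return true;
}
```

Any std::istream works as input, so the same function can read from an std::ifstream for the real file or from an std::istringstream in a test.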

EddieV223    1839
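#include <fstream>
#include <string>

class MyObj
{
public:
    // Write the fields in a fixed order, separated by single spaces.
    void WriteToStream(std::ofstream & file) const
    {
        file << myStr << ' ' << myInt << ' ' << myFloat << std::endl;
    }

    // Read the fields back in the same order; operator>> skips whitespace.
    // Note: myStr must not contain spaces, or the fields will misalign.
    void ReadFromStream(std::ifstream & file)
    {
        file >> myStr >> myInt >> myFloat;
    }

private:
    std::string myStr;
    int myInt;
    float myFloat;
};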

I think the simplest way to do this is as above. The first method uses a stream to output the class's fields in order, with a ' ' delimiter.

The second simply reads the same data back in. The stream needs no extra help with the delimiter, because formatted input with >> skips whitespace between values by default.

 

You can write any number of these to the same file and read them in until eof.

 

Otherwise if you really need something more complex, then you probably need to be using xml or json.

Nypyren    12074
If you want to be robust and well-designed, I recommend reading up on the following:

https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser

And examine the tools available to do most of the hard work for you:

https://en.wikipedia.org/wiki/Compiler-compiler
https://en.wikipedia.org/wiki/Category:Parser_generators

Bregma    9214

How can I make a text parser?


I hope you realize this is an entire branch of computer science?

Really, if you're not writing a text parser as an end in itself, choose to use a widespread common format (XML, YAML, JSON, INI) and use a library. There are many, all of which have been tested.

 

Otherwise, look into writing lexical analyzers, defining grammars built on the lexemes and using semantic insertion, and creating parsers for those grammars.  There are handy tools for that, like boost::spirit (if it's still around).

Brain    18906

Recursive descent parsers are pretty easy to implement.

 

Generally if you can make a game you shouldn't have a problem understanding simple RD parsers.

 

Personally, for anything more advanced than a key-value pair format, I'd be tempted to use YACC, BISON and friends and not reinvent the whole car, never mind its wheel...
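
To give an idea of the size of a simple RD parser, here is a sketch for the NewObject/End format from the first post, with one function per grammar rule. The class name Parser, the Fields typedef and the error messages are made up for this illustration, and it assumes tokens are separated by whitespace (so "HP = 100" works but "HP=100" does not).

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Fields;

class Parser {
public:
    // Assumption: tokens are separated by whitespace.
    explicit Parser(const std::string &input) : pos_(0)
    {
        std::istringstream in(input);
        std::string tok;
        while (in >> tok) tokens_.push_back(tok);
    }

    // file := object*
    std::map<std::string, Fields> parseFile()
    {
        std::map<std::string, Fields> objects;
        while (!atEnd()) parseObject(objects);
        return objects;
    }

private:
    // object := "NewObject" name field* "End"
    void parseObject(std::map<std::string, Fields> &objects)
    {
        expect("NewObject");
        const std::string name = next("object name");
        Fields &fields = objects[name];
        while (peek() != "End") {
            if (atEnd()) throw std::runtime_error("missing End for \"" + name + "\"");
            parseField(fields);
        }
        expect("End");
    }

    // field := key "=" value
    void parseField(Fields &fields)
    {
        const std::string key = next("field key");
        expect("=");
        fields[key] = next("field value");
    }

    bool atEnd() const { return pos_ == tokens_.size(); }

    // Current token without consuming it; empty string at end of input.
    std::string peek() const { return atEnd() ? std::string() : tokens_[pos_]; }

    // Consumes and returns the current token; `what` is used in the error.
    std::string next(const std::string &what)
    {
        if (atEnd()) throw std::runtime_error("expected " + what + ", found end of input");
        return tokens_[pos_++];
    }

    // Consumes the current token and checks that it is exactly `literal`.
    void expect(const std::string &literal)
    {
        const std::string tok = next("\"" + literal + "\"");
        if (tok != literal)
            throw std::runtime_error("expected \"" + literal + "\", found \"" + tok + "\"");
    }

    std::vector<std::string> tokens_;
    std::size_t pos_;
};
```

Errors are reported by throwing std::runtime_error, so a caller would wrap parseFile() in a try/catch and print what() when the file is malformed.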

louie999    280

Using templates you could create something like that, but it isn't an easy beginner task to write.

 

I don't believe there is something like that already in the standard library, but I'm not familiar with all the new C++11 and C++14 library extensions so I could be wrong. C++11 did add a regex library (#include <regex>).

I just looked it up on google, it seems it has what I need to make a simple text parser :D, I'll try and learn how to use it.

 

 

If this is a part of something bigger, don't reinvent the wheel. Go get a JSON or YAML parser.

 

Why?

 

1. Toolchains: jq lets you pretty-print and query JSON files. Editors have modes to help you work on the files. There are syntax verifiers and so on.

2. Not finding all the bugs and edge-cases again. Someone's already worked out what to do when you try to put a JSON text into a JSON value.

Well, the parser I'm trying to make isn't going to be big, it's just going to be for some initialization for my game.

 

Anyway, thanks guys for answering, I think I'll go with regex first :D

Bregma    9214


I just looked it up on google, it seems it has what I need to make a simple text parser.

No, it has what you need for a text matcher, a part of lexical analysis (trust me, I implemented a significant part of the regex library for GCC).  It's kind of like saying a couple of ice cubes in your cocktail is an iceberg and can sink ocean liners (disclaimer: do not captain while drunk).

 

It will definitely do simple pattern matching using captures, which is probably good enough for extracting data from a basic file with some simple data using a crude grammar.  In fact, I'd recommend it except I know at least one of the guys who implemented a significant part of that library is a jerk and I wouldn't trust anything he wrote.

 

Do yourself a favour and write a suite of unit tests for each of your initialization file lines.  You will thank me.
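
A small sketch of what such tests might look like, using nothing but assert. The function parseField is a hypothetical line parser written for this illustration (it is not from any library), and the accepted syntax is an assumption: keys and values made of letters, digits and underscores.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical line parser under test: splits "key = value" into its parts.
bool parseField(const std::string &line, std::string &key, std::string &value)
{
    static const std::regex fieldRe(R"(\s*(\w+)\s*=\s*(\w+)\s*)");
    std::smatch m;
    if (!std::regex_match(line, m, fieldRe)) return false;
    key = m[1].str();
    value = m[2].str();
    return true;
}

// One small check per kind of line the initialization file may contain.
void testFieldLines()
{
    std::string k, v;

    // Well-formed lines, with varying amounts of whitespace.
    assert(parseField("HP   = 100", k, v) && k == "HP" && v == "100");
    assert(parseField("Weapon = SomeWeapon", k, v) && k == "Weapon" && v == "SomeWeapon");
    assert(parseField("  DMG=5  ", k, v) && k == "DMG" && v == "5");

    // Malformed lines must be rejected, not silently half-parsed.
    assert(!parseField("HP = ", k, v));      // missing value
    assert(!parseField("= 100", k, v));      // missing key
    assert(!parseField("HP 100", k, v));     // missing equal sign
    assert(!parseField("HP = 1 0 0", k, v)); // trailing junk
}
```

Calling testFieldLines() at startup in debug builds, or from a tiny separate test program, is enough to catch a regex change that breaks an existing line format. Keep in mind that assert is compiled out when NDEBUG is defined.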

Alberth    9529

Manually writing parsers, e.g. with regexp, has been discussed. To show you the next step, generating a scanner that produces tokens which you then process, here is a little example.

You probably don't want to do that now, but it never hurts to see what it would look like :)

 

Compile with

$ flex scanner.l
$ g++ lex.yy.c main.cpp

and run with "./a.out input_file.txt" (lex.yy.c is generated by flex)

 

 

Scanner specification (text to number specification, e.g. "=" is translated to the "TK_EQUAL" value): scanner.l

%{
int line = 1;
char *text;

#include "tokens.h"
%}

%%

=                       { return TK_EQUAL; }
NewObject               { return KW_NEWOBJECT; }
End                     { return KW_END; }

[A-Za-z][A-Za-z0-9]*    { text = yytext; return TK_IDENTIFIER; }
[0-9]+                  { text = yytext; return TK_NUMBER; }
[ \t\r]                 ;
\n                      { line++; }

.                       { printf("Unrecognized character 0x%02x\n", yytext[0]); }

%%

int yywrap() {
        return 1;
}

Glue file to share the common definitions: tokens.h

#ifndef TOKENS_H
#define TOKENS_H

#include <stdio.h> // For FILE, so the header does not depend on include order.

extern int line;
extern char *text;

extern FILE *yyin; // Owned by the generated scanner.

int yylex();

enum Tokens {
        TK_EOF,

        TK_EQUAL,
        TK_IDENTIFIER,
        TK_NUMBER,

        KW_NEWOBJECT,
        KW_END,
};

#endif

Main program file, with the actual parser

#include <cstdio>
#include <cstdlib>
#include <string>
#include "tokens.h"

bool parse()
{
        int tok;

        tok = yylex();
        if (tok == TK_EOF) return true;

        if (tok != KW_NEWOBJECT) {
                printf("Expected NewObject at line %d\n", line);
                return false;
        }

        tok = yylex();
        if (tok != TK_IDENTIFIER) {
                printf("Expected object name after NewObject at line %d\n", line);
                return false;
        }

        printf("Found object name \"%s\" at line %d\n", text, line);

        for (;;) {
                tok = yylex();
                if (tok == KW_END) break; // End of the object.

                if (tok != TK_IDENTIFIER) {
                        printf("Expected field key at line %d\n", line);
                        return false;
                }
                std::string key = text; // Save the key before it gets overwritten by the field value.

                tok = yylex();
                if (tok != TK_EQUAL) {
                        printf("Expected equal sign at line %d\n", line);
                        return false;
                }

                tok = yylex();
                if (tok == TK_IDENTIFIER) {
                        printf("Found a field with a named value: \"%s :: %s\"\n", key.c_str(), text);
                } else if (tok == TK_NUMBER) {
                        printf("Found a field with a number: \"%s :: %d\"\n", key.c_str(), atoi(text));
                } else {
                        printf("Unknown field value at line %d\n", line);
                        return false;
                }

                // And loop for the next "key = value"
        }

        tok = yylex();
        if (tok != TK_EOF) {
                printf("EOF expected at line %d\n", line);
                return false; // Trailing input after End is an error too.
        }

        return true;
}

int main(int argc, char *argv[])
{
        FILE *handle = (argc == 2) ? fopen(argv[1], "rt") : NULL;
        if (handle == NULL) {
                printf("File could not be opened\n");
                exit(1);
        }

        yyin = handle; // Give handle to the scanner.

        bool result = parse();

        fclose(handle);
        return result ? 0 : 1;
}

Parser just prints the values, but of course you could also put it in some data structure. (main.cpp file)

 

If you think the "parse" function is a bit repetitive, it is. You can step up and use a parser generator like bison, to get rid of it, and gain a lot of additional recognizing power at the same time.

 

I don't have a working example with a parser generator (it needs some new code, like the class definitions, and a bit of additional glue code), but the core parser input specification would look like

Program : KW_NEWOBJECT TK_IDENTIFIER Fields KW_END
{
    $$ = new Program($2, $3);
}

Fields : Field
{
    $$ = std::list<Field *>();
    $$.push_back($1);
}

Fields : Fields Field
{
    $$ = $1;
    $$.push_back($2);
}

Field : TK_IDENTIFIER TK_EQUAL TK_IDENTIFIER
{
    $$ = new NameField($1, $3);
}

Field : TK_IDENTIFIER TK_EQUAL TK_NUMBER
{
    $$ = new NumberField($1, $3);
}

You just write the sequences that you want to match, and what code should be executed. The parser generator generates the recognizer that reads tokens from the scanner, and calls your code when appropriate.
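
The class definitions that the grammar actions refer to are not shown in the post. One guess at what they could look like is sketched below; only the class names and constructor argument order are taken from the actions (Program, NameField, NumberField, std::list<Field *>), everything else is an assumption. The glue that tells the parser generator which C++ type each symbol carries is not covered here.

```cpp
#include <list>
#include <string>

// Common base so that both kinds of field fit in one std::list<Field *>.
struct Field {
    explicit Field(const std::string &key) : key(key) {}
    virtual ~Field() {}
    std::string key;
};

// "Weapon = SomeWeapon": the value is another identifier.
struct NameField : Field {
    NameField(const std::string &key, const std::string &value)
        : Field(key), value(value) {}
    std::string value;
};

// "HP = 100": the value is a number.
struct NumberField : Field {
    NumberField(const std::string &key, int value)
        : Field(key), value(value) {}
    int value;
};

// One "NewObject <name> ... End" block. Assumption: the Program takes
// ownership of the Field pointers it is given and deletes them.
struct Program {
    Program(const std::string &name, const std::list<Field *> &fields)
        : name(name), fields(fields) {}
    ~Program() { for (Field *f : fields) delete f; }

    // Copying would delete every field twice, so it is disabled.
    Program(const Program &) = delete;
    Program &operator=(const Program &) = delete;

    std::string name;
    std::list<Field *> fields;
};
```

Code that consumes a Program can use dynamic_cast<NumberField *> (or a virtual function on Field) to tell the two kinds of field apart.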

ongamex92    3256

Or maybe, if you're not currently interested in how to write one, you could just grab an existing library (rapidjson/jsoncpp/tinyxml, for example) and start using it?

louie999    280

Maybe writing a parser is too big/too complicated for me. I've downloaded pugixml, an XML parser. Maybe it's good :D

 

I guess I'll write my own parser when I have more time and more C++ skills.
