How can I make a text parser?

Started by
17 comments, last by louie999 8 years, 6 months ago

Recursive descent parsers are pretty easy to implement.

Generally if you can make a game you shouldn't have a problem understanding simple RD parsers.

Personally for anything more advanced than a key value pair format, I'd be tempted to use YACC and BISON and friends and not reinvent the whole car never mind it's wheel...

Advertisement

The other alternative is to use a binary format. Text - despite being human-readable - is not always the most appropriate choice.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Recursive descent parsers are pretty easy to implement.

Generally if you can make a game you shouldn't have a problem understanding simple RD parsers.

Personally for anything more advanced than a key value pair format, I'd be tempted to use YACC and BISON and friends and not reinvent the whole car never mind it's wheel...

Personally... If I'm going to be using something more advanced than is necessary to parse json/xml using an off the shelf parser...
It suggests you should probably plug in a proper scripting solution (lua, pythong, etc), instead of designing your own language. Stick to standard formats (like JSON) for data. It'll save you a ton of useless wasted effort.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.

How can I make a text parser?


I hope you realize this is an entire branch of computer science?

Yea, but that can be said about almost anything.

"How do I calculate what five times four is?" -> That's a "whole branch" of math/science! smile.png

Or if I want to know, "How do clouds float in the air?", I don't actually need a degree in geophysics before asking that question - I can get the general gist of it, and learn from the answer, even if it's not the entirely 100% fully correct absolute answer that a geophysicist would actually know.

While yes, NIH syndrome as a default mode isn't ideal, some reinventing of the wheel is helpful in learning programming. You can't actually learn programming without writing a few things that others have already done better than you (here 'better' = cleaner, faster, and more flexible, and wrapped in a nice interface), and I personally think string parsing functions are excellent practice - they're self-contained data-in data-out functions, and don't require weaving code all throughout your game's logic, and further, they are one of the easiest things to debug, because they provide a ridiculously easy way to display the input and output. They also have a clear "this is correct" vs "this is wrong" result, and can be easily optimized to learn the basics of optimization.

And they're fun to write. ph34r.png

Using templates you could create something like that, but it isn't an easy beginner task to write.

I don't believe there is something like that already in the standard library, but I'm not familiar with all the new C++11 and C++14 library extensions so I could be wrong. C++11 did add a regex library (#include <regex>).

I just looked it up on google, it seems it has what I need to make a simple text parser :D, I'll try and learn how to use it.

If this is a part of something bigger, don't reinvent the wheel. Go get a JSON or YAML parser.

Why?

1. Toolchains: jq lets you pretty-print and query JSON files. Editors have modes to help you work on the files. There are syntax verifiers and so on.

2. Not finding all the bugs and edge-cases again. Someone's already worked out what to do what you try and put a JSON text into a JSON value..

Well, the parser I'm trying to make isn't going to be big, it's just going to be for some initialization for my game.

Anyway, thanks guys for answering, I think I'll go with regex first :D


I just looked it up on google, it seems it has what I need to make a simple text parser.

No, it has what you need for a text matcher, a part of lexical anaylsis (trust me, I implemented a significant part of the regex library for GCC). It's kind of like saying a couple of ice cubes in your cocktail is an iceburg and can sink ocean liners (disclaimer: do not captain while drunk).

It will definitely do simple pattern matching using captures, which is probably good enough for extracting data from a basic file with some simple data using a crude grammar. In fact, I'd recommend it except I know at least one of the guys who implemented a significant part of that library is a jerk and I wouldn't trust anything he wrote.

Do yourself a favour and write a suite of unit tests for each of your initialization file lines. You will thank me.

Stephen M. Webb
Professional Free Software Developer

Manually writing parsers, eg with regexp has been discussed. To show you the next step, generating a scanner, that produces tokens which you process, this little example.

You probably don't want to do that now, but it never hurts to see what it would look like :)

Compile with


$ flex scanner.l
$ g++ lex.yy.c main.cpp

and run with "./a.out input_file.txt" (lex.yy.c is generated by flex)

scanner specification (text to number specification, eg "=" is translated to "TK_EQUAL" value): scanner.l


%{
int line = 1;
char *text;

#include "tokens.h"
%}

%%

=                       { return TK_EQUAL; }
NewObject               { return KW_NEWOBJECT; }
End                     { return KW_END; }

[A-Za-z][A-Za-z0-9]*    { text = yytext; return TK_IDENTIFIER; }
[0-9]+                  { text = yytext; return TK_NUMBER; }
[ \t\r]                 ;
\n                      ;

.                       { printf("Unrecognized character 0x%02x\n", yytext[0]); }

%%

int yywrap() {
        return 1;
}

Glue file to share the common definitions: tokens.h


#ifndef TOKENS_H
#define TOKENS_H

extern int line;
extern char *text;

extern FILE *yyin; // Owned by the generated scanner.

int yylex();

enum Tokens {
        TK_EOF,

        TK_EQUAL,
        TK_IDENTIFIER,
        TK_NUMBER,

        KW_NEWOBJECT,
        KW_END,
};

#endif

Main program file, with the actual parser


#include <cstdio>
#include <cstdlib>
#include <string>
#include "tokens.h"

bool parse()
{
        int tok;

        tok = yylex();
        if (tok == TK_EOF) return true;

        if (tok != KW_NEWOBJECT) {
                printf("Expected NewObject at line %d\n", line);
                return false;
        }

        tok = yylex();
        if (tok != TK_IDENTIFIER) {
                printf("Expected object name after NewObject at line %d\n", line);
                return false;
        }

        printf("Found object name \"%s\" at line %d\n", text, line);

        for (;;) {
                tok = yylex();
                if (tok == KW_END) break; // End of the input.

                if (tok != TK_IDENTIFIER) {
                        printf("Expected field key at line %d\n", line);
                        return false;
                }
                std::string key = text; // Save name before it gets overwritten by a field name.

                tok = yylex();
                if (tok != TK_EQUAL) {
                        printf("Expected equal sign at line %d\n", line);
                        return false;
                }

                tok = yylex();
                if (tok == TK_IDENTIFIER) {
                        printf("Found a field with a named value: \"%s :: %s\"\n", key.c_str(), text);
                } else if (tok == TK_NUMBER) {
                        printf("Found a field with a number: \"%s :: %d\"\n", key.c_str(), atoi(text));
                } else {
                        printf("Unknown field value at line %d\n", line);
                        return false;
                }

                // And loop for the next "key = value"
        }

        tok = yylex();
        if (tok != TK_EOF) {
                printf("EOF expected at line %d\n", line);
        }

        return true;
}

int main(int argc, char *argv[])
{
        FILE *handle = (argc == 2) ? fopen(argv[1], "rt") : NULL;
        if (handle == NULL) {
                printf("File could not be opened\n");
                exit(1);
        }

        yyin = handle; // Give handle to the scanner.

        bool result = parse();

        fclose(handle);
        return result ? 0 : 1;
}

Parser just prints the values, but of course you could also put it in some data structure. (main.cpp file)

If you think the "parse" function is a bit repetitive, it is. You can step up and use a parser generator like bison, to get rid of it, and gain a lot of additional recognizing power at the same time.

I don't have a working example with a parser generator (it needs some new code, like the class definitions, and a bit additional glue code), but the core parser input specification would be like


Program : KW_NEWOBJECT TK_IDENTIFIER Fields KW_END
{
    $$ = new Program($2, $3);
}

Fields : Field
{
    $$ = std::list<Field *>();
    $$.push_back($1);
}

Fields : Fields Field
{
    $$ = $1;
    $$.push_back($2);
}

Field : TK_IDENTIFIER TK_EQUAL TK_IDENTIFIER
{
    $$ = new NameField($1, $3);
}

Field : TK_IDENTIFIER TK_EQUAL TK_NUMBER
{
    $$ = new NumberField($1, $3);
}

You just write the sequences that you want to match, and what code should be executed. The parser generator generates the recognizer that reads tokens from the scanner, and calls your code when appropriate.

Or maybe if you're not currently interested into how to write one, you may just grab an existing library, rapidjson/jsoncpp/tinyxml for example and just start using it?

Maybe writing a parser might be too big/too complicated for me, I've downloaded pugixml, a xml parser. Maybe it's good :D

I guess I'll write my own parser when I have more time and more C++ skills.

This topic is closed to new replies.

Advertisement