Very Simple Parser A Bit Similar To BBCodes

Started by
7 comments, last by Alberth 7 years, 8 months ago

At the moment my code looks like this:


tPrint("Hello "); tSetColor("yellow"); tPrint("world"); tSetColor("white"); tPrint(". ");
tDrawImage("images/morale.png"); tPrint("Morale "); tSetColor(255,100,80); tPrint("150"); tSetColor("white");

I wish it looked something like this:


tPrintRich("Hello [color=yellow]world[color=white]. [img=images/morale.png]Morale [color=255,100,80]150[color=white].");

So, I need some sort of a very simple parser, just two tags/codes: [color=XXX] and [img=XXX]. It's run in real time each frame, so while such things do not consume a lot of CPU overall (it's just text display), it should be reasonably fast.

In addition I need a strlen-like function:

int width=tStrlenRich("a[color=yellow]b[color=white]c");

where the returned width is 3 (i.e. with all codes skipped).
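A minimal sketch of that behaviour (a hypothetical implementation, assuming '[' never appears in ordinary text):

```cpp
#include <string>

// Hypothetical tStrlenRich: counts characters that are not part of a
// [tag=value] code. Assumes '[' and ']' only ever delimit codes.
int tStrlenRich(const std::string& s)
{
    int width = 0;
    bool inTag = false;
    for (char c : s)
    {
        if (c == '[')       inTag = true;   // entering a code, stop counting
        else if (c == ']')  inTag = false;  // code finished, resume counting
        else if (!inTag)    ++width;        // visible character
    }
    return width;
}
```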

How do I approach this? It probably does not make sense to pull in an existing parsing library; that would be overkill for this...

Stellar Monarch (4X, turn based, released): GDN forum topic - Twitter - Facebook - YouTube

What language(s) are you working with? There are a few options depending on what tools you have immediately accessible.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

You don't need a parser; just a lexical scanner will work, tokenizing the text into a sequence of tokens.

Alternatively, you can use regular expressions, or even look for "[" in the text 'manually', but that seems like more work at first sight.

If the syntax stays as simple as your example shows, then you won't even need a tokenizer, just a simple state-based reader, something that does:


while text < end
{
   if text is [
   {
        property : string
        value : string

        skip [
        while text not = and not ]
            property += text
        skip =
        while text not ]
            value += text
        skip ]

        switch property
            case color: set color to 'value'
            case img: draw image 'value'
   }
   else
   {
      output : string

      while text not [ and text < end
         output += text

      print output
   }
}

(pseudo code)
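In C++, that state reader might look roughly like this; purely an illustration, which returns (command, argument) pairs instead of calling tPrint/tSetColor/tDrawImage directly, so the caller can dispatch them:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch of the state reader above: walks the string once, emitting
// ("print", text) for plain runs and (property, value) for [prop=value]
// codes, e.g. ("color", "yellow") or ("img", "images/morale.png").
std::vector<std::pair<std::string, std::string>>
parseRich(const std::string& text)
{
    std::vector<std::pair<std::string, std::string>> out;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (text[i] == '[')
        {
            ++i;                                    // skip '['
            std::string property, value;
            while (i < text.size() && text[i] != '=' && text[i] != ']')
                property += text[i++];
            if (i < text.size() && text[i] == '=')
                ++i;                                // skip '='
            while (i < text.size() && text[i] != ']')
                value += text[i++];
            if (i < text.size())
                ++i;                                // skip ']'
            out.emplace_back(property, value);
        }
        else
        {
            std::string output;
            while (i < text.size() && text[i] != '[')
                output += text[i++];
            out.emplace_back("print", output);
        }
    }
    return out;
}
```

A single pass, no allocations beyond the output itself, which should be comfortably fast enough for per-frame text display.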

Edit

Even if you decide to use a tokenizer/parser, avoid something like ANTLR because it produces too-heavyweight code. Define your rules by hand instead; it is simpler than you might think, even for syntax schemas as complex as C++.

What language(s) are you working with? There are a few options depending on what tools you have immediately accessible.

C++. Added the proper tag to the post.

You don't need a parser; just a lexical scanner will work, tokenizing the text into a sequence of tokens.

Alternatively, you can use regular expressions, or even look for "[" in the text 'manually', but that seems like more work at first sight.

I see, tokenizer not a parser then :)

Is there a simple and fast tokenizer that suits my needs? Or should I rather just write one?

If the syntax stays as simple as your example shows, then you won't even need a tokenizer, just a simple state-based reader.



Even if you decide to use a tokenizer/parser, avoid something like ANTLR because it produces too-heavyweight code. Define your rules by hand instead; it is simpler than you might think, even for syntax schemas as complex as C++.

Hmmm...

Stellar Monarch (4X, turn based, released): GDN forum topic - Twitter - Facebook - YouTube

Is there a simple and fast tokenizer that suits my needs? Or should I rather just write one?


If you already depend on Boost, Spirit might be a relatively simple way to get a decent tokenizer. I'm not a big fan of writing one by hand unless you know your whole language from the start, know it's simple and never needs to be extended afterwards.
I see, tokenizer not a parser then :) Is there a simple and fast tokenizer that suits my needs? Or should I rather just write one?

lex & yacc are the de facto standard parsing tools, where 'lex' is the scanner generator (it builds a subroutine to convert text to tokens), and 'yacc' is the parser generator (it builds a routine to convert the token stream to a tree according to grammar production rules). The GNU versions of these are flex & bison. These tools are also used in production.

While they are designed to be used together, they are separate tools, and you can also write a manual scanner (usually quite easy, although you must handle the longest-match problem) and feed its output to the generated parser. The other way around is also possible, but since a parser is usually non-trivial or too big to write by hand, you almost never see this.

Simplest lex spec (untested, and my lex is a bit rusty):


\[ *color *= *[a-z]+ *\]        { /* return color as identifier */ }
\[ *color *= *[0-9]+ *, *[0-9]+ *, *[0-9]+ *\]   { /* return color as number */ }
\[ *img *= *[^] ]+ *\]     { /* return img link */ }
.   { /* return normal character */ }

This recognizes the three [ ] constructs, and "text" for anything else (note the dot rule on the last line). The disadvantage here is that it only splits the stream into separate cases; you'll still have to extract the color or link yourself afterwards. This can be improved by recognizing "[color=" for example, and then jumping to a color-value recognition state where you recognize the value. A similar approach works for image links too.

boost::spirit and lex/yacc are massive overkill for a simple string chunking problem. Shaarigan has the right idea.


Regular expressions, such as those supported in C++11 and newer, are probably your best bet for simple tag markup. Regexes are also super useful for other things, so they are worth learning as a tool if you don't already use them.

For instance, even if your code never uses a single regex, you can use them for find/replace operations in virtually every text editor out there. This is a massive time-saver for me when doing repetitive edits, for example.
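As a sketch of what that regex route could look like with std::regex from C++11 (the tag pattern here is an assumption based on the [color=...] and [img=...] examples above):

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Illustrative regex-based chunker: splits the string into ("text", run)
// entries for plain text and (tag, value) entries for [tag=value] codes.
// The pattern assumes tag names are word characters and values contain
// no ']' character.
std::vector<std::pair<std::string, std::string>>
chunkRich(const std::string& s)
{
    static const std::regex tag(R"(\[(\w+)=([^\]]+)\])");
    std::vector<std::pair<std::string, std::string>> out;
    auto last = s.cbegin();
    for (std::sregex_iterator it(s.begin(), s.end(), tag), end; it != end; ++it)
    {
        if (it->prefix().length() > 0)
            out.emplace_back("text", it->prefix().str());  // run before the tag
        out.emplace_back((*it)[1].str(), (*it)[2].str());  // (tag, value)
        last = (*it)[0].second;                            // end of this match
    }
    if (last != s.cend())
        out.emplace_back("text", std::string(last, s.cend())); // trailing text
    return out;
}
```

Note that std::regex is not the fastest regex engine around, but for short UI strings parsed once per frame it should be more than adequate.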

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

boost::spirit and lex/yacc are massive overkill for a simple string chunking problem
Except I advocated for only lex, which basically generates a string-chunking state machine for you, nothing more, in about 0.2 seconds after you have specified the set of patterns you want to recognize.

In particular, I didn't advocate for using yacc, which is indeed overkill for the problem.

This topic is closed to new replies.
