Acharis

Very Simple Parser A Bit Similar To Bbcodes


At the moment my code looks like this:

tPrint("Hello "); tSetColor("yellow"); tPrint("world"); tSetColor("white"); tPrint(". ");
tDrawImage("images/morale.png"); tPrint("Morale "); tSetColor(255,100,80); tPrint("150"); tSetColor("white");

I wish it looked something like this:

tPrintRich("Hello [color=yellow]world[color=white]. [img=images/morale.png]Morale [color=255,100,80]150[color=white].");

So I need some sort of very simple parser that handles just two tags/codes: [color=XXX] and [img=XXX]. It runs in real time every frame, so although text display does not consume much CPU overall, it should be reasonably fast.

 

In addition, I need a strlen-like function:

int width=tStrlenRich("a[color=yellow]b[color=white]c");

where the returned width is 3 (all codes are skipped).

 

 

How do I approach this? It probably does not make sense to pull in an existing parsing library; that would be overkill for this...

 


You don't need a parser; a lexical scanner will do, splitting the text into a sequence of tokens.

 

Alternatively, you can use regular expressions, or even look for "[" in the text 'manually', but that seems like more work at first sight.
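As a rough illustration of the "look for '[' manually" option, here is a minimal sketch of the length function from the question. The name tStrlenRich comes from the original post; the body is only an assumption about how it could work. It assumes every [...] span is a code with zero width, that an unterminated '[' counts as ordinary text, and it counts bytes rather than glyphs. It returns std::size_t instead of int and needs C++17 for std::string_view.

```cpp
#include <cstddef>
#include <string_view>

// Counts visible characters, skipping every [tag] span.
// Assumption: an unterminated '[' is treated as ordinary text.
std::size_t tStrlenRich(std::string_view text)
{
    std::size_t visible = 0;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t open = text.find('[', pos);
        if (open == std::string_view::npos) {
            visible += text.size() - pos;   // no more tags: count the rest
            break;
        }
        visible += open - pos;              // plain text before the tag
        const std::size_t close = text.find(']', open + 1);
        if (close == std::string_view::npos) {
            visible += text.size() - open;  // no closing bracket: count literally
            break;
        }
        pos = close + 1;                    // resume right after the tag
    }
    return visible;
}
```

Because it only searches and never copies, a scan like this should be cheap enough to run every frame, though that is worth measuring in the actual game.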


If the syntax stays as simple as your example shows, then you won't even need a tokenizer, just a state-based reader, something that does:

while text < end
{
   if text is [
   {
        property : string
        value : string

        skip [
        while text not = or ] and < end
            property += text
        if text is =
            skip =
        while text not ] and < end
            value += text
        skip ]

        switch property
            case color: set color to 'value'
            case img: draw image 'value'
   }
   else
   {
      output : string

      while text not [ and < end
         output += text

      print output
   }
}

(pseudo code)

 

Edit

 

Even if you decide to use a tokenizer/parser, just avoid something like ANTLR because it produces code that is too heavy. In that case you should define your rules by hand; it is simpler than you might think, even for syntax schemas such as C++.

Edited by Shaarigan


What language(s) are you working with? There are a few options depending on what tools you have immediately accessible.

C++. Added the proper tag to the post.

 

You don't need a parser; a lexical scanner will do, splitting the text into a sequence of tokens.

 

Alternatively, you can use regular expressions, or even look for "[" in the text 'manually', but that seems like more work at first sight.

I see, a tokenizer, not a parser then :)

Is there a simple and fast tokenizer that suits my needs? Or should I just write one?

 

If the syntax stays as simple as your example shows, then you won't even need a tokenizer, just a state-based reader, something that does:



Even if you decide to use a tokenizer/parser, just avoid something like ANTLR because it produces code that is too heavy. In that case you should define your rules by hand; it is simpler than you might think, even for syntax schemas such as C++.

Hmmm...


Is there a simple and fast tokenizer that suits my needs? Or should I just write one?


If you already depend on Boost, Spirit might be a relatively simple way to get a decent tokenizer. I'm not a big fan of writing one by hand unless you know your whole language from the start, know it's simple and never needs to be extended afterwards.

I see, a tokenizer, not a parser then :) Is there a simple and fast tokenizer that suits my needs? Or should I just write one?

lex & yacc are the de facto standard parsing tools, where 'lex' is the scanner generator (it builds a subroutine that converts text to tokens), and 'yacc' is the parser generator (it builds a routine that converts the token stream to a tree according to grammar production rules). The GNU versions of these are flex & bison. These tools are also used in production.

 

While they are designed to be used together, they are separate tools, and you can also write a manual scanner (quite easy usually, although there is the problem of longest match you must handle), and feed its output to the generated parser. The other way around is also possible, but since a parser is usually non-trivial or too big to write by hand, you almost never see this.

 

Simplest lex spec (untested, and my lex is a bit rusty):

\[ *color *= *[a-z]+ *\]        { /* return color as identifier */ }
\[ *color *= *[0-9]+ *, *[0-9]+ *, *[0-9]+ *\]   { /* return color as number */ }
\[ *img *= *[^] ]+ *\]     { /* return img link */ }
.   { /* return normal character */ }

This recognizes the three [ ] constructs, and "text" for anything else (note the dot at the start of the last line). The disadvantage here is that it only splits the stream into separate cases; you'll still have to extract the color or link yourself afterwards. This can be improved by recognizing "[color=", for example, and then jumping to a color value recognition state, where you recognize the value. A similar approach works for image links too.
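To illustrate the "extract the color yourself afterwards" step on the C++ side, here is a minimal sketch that tries to read a color value as an R,G,B triple. Rgb and parseRgb are hypothetical names, not part of any existing API. It assumes each component is a decimal integer in 0..255 with no surrounding spaces (unlike the lex patterns above, which allow spaces), and it needs C++17 for std::from_chars and std::optional. If it returns nothing, the caller can treat the value as a color name such as "yellow".

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical color type for this sketch.
struct Rgb { int r, g, b; };

// Parses "R,G,B" with each component in 0..255.
// Returns std::nullopt for anything else (for example a color name).
std::optional<Rgb> parseRgb(std::string_view value)
{
    if (value.empty()) return std::nullopt;

    int parts[3] = {0, 0, 0};
    const char* p = value.data();
    const char* const end = p + value.size();

    for (int k = 0; k < 3; ++k) {
        const auto result = std::from_chars(p, end, parts[k]);
        if (result.ec != std::errc{} || parts[k] < 0 || parts[k] > 255)
            return std::nullopt;            // not a number, or out of range
        p = result.ptr;
        if (k < 2) {
            if (p == end || *p != ',') return std::nullopt; // missing comma
            ++p;                            // skip ','
        }
    }
    if (p != end) return std::nullopt;      // trailing characters
    return Rgb{parts[0], parts[1], parts[2]};
}
```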

Edited by Alberth

boost::spirit and lex/yacc are massive overkill for a simple string chunking problem. Shaarigan has the right idea.


Regular expressions, such as those supported in C++11 and newer, are probably your best bet for simple tag markup. Regexes are also super useful for other things so are worth learning as a tool if you don't already use them.

For instance, even if your code never uses a single regex, you can use them for find/replace operations in virtually every text editor out there. This is a massive time-saver for me when doing repetitive edits, for example.
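A minimal sketch of the regex route with std::regex from C++11 might look like this. Tag, tagPattern, findTags and stripTags are hypothetical names chosen for illustration, and the pattern assumes only the two tags from the question, written without spaces. The regex is built once and reused, since constructing one per call would be wasteful in a per-frame path; whether std::regex matching itself is fast enough for that is something to measure rather than assume.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical token type for this sketch.
struct Tag {
    std::string name;   // "color" or "img"
    std::string value;  // text after '='
};

// Matches [color=...] and [img=...]; built once on first use.
const std::regex& tagPattern()
{
    static const std::regex pattern(R"(\[(color|img)=([^\]]*)\])");
    return pattern;
}

// Collects every tag in the order it appears in the text.
std::vector<Tag> findTags(const std::string& text)
{
    std::vector<Tag> tags;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), tagPattern());
         it != end; ++it) {
        tags.push_back({(*it)[1].str(), (*it)[2].str()});
    }
    return tags;
}

// Removes all tags, leaving only the visible text.
std::string stripTags(const std::string& text)
{
    return std::regex_replace(text, tagPattern(), "");
}
```

With this, the visible length from the question is just stripTags(text).size(), at the cost of building a temporary string.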


boost::spirit and lex/yacc are massive overkill for a simple string chunking problem
Except I advocated only for lex, which basically generates a string chunking state machine for you, nothing more, in about 0.2 seconds after you have specified the set of patterns you want to recognize.

 

In particular, I didn't advocate for using yacc, which is indeed overkill for the problem.
