Jump to content
  • Advertisement
Sign in to follow this  
sakky

parsing a string

This topic is 5041 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I have a question regarding lexical analysis. The purpose for the lexical analyzer is to break apart command line arguments. The analyzer will be used in one of my projects as part of a console. I have a theory of how to do this, but I was wondering if any of you had a better way. Google returns with horrible results that talk about analyzers for front end compilers. They won’t work and their design is different from what I need. My implementation language is C++ (just had to make sure). The semantics of the arguments are simple. Identifier = Value. Example: color=red. I’ve come to the conclusion that a simple lexical analysis of the example string would involve: 1) Read in line from file / console until end of line is found. 2) Read characters into buffer 1 until "=" is found. 3) Read characters into buffer 2 preceding "=" until end of string. 4) Search for keyword match with buffer 1 5) Convert buffer 2’s value (integer, float). 6) Return command identifier with value. The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?

Share this post


Link to post
Share on other sites
Advertisement

typedef std::list<boost::any> many;
typedef std::map<std::string, many> property_map;

property_map ParseStream(std::istream &in, const char *separators = 0) {

property_map result;

char line[1024];

while (i.getline(line, sizeof(line))) {
size_t len = strlen(line);
size_t sep = std::find(line, line + len, '=');

if (size + len == sep)
throw "ParseStream: Missing '='.";

char * key = line;

line[sep] = 0;

char * value = line + sep + 1;

many values; // yeah - this looks nice :)

if (seperators) {

char * token = strtok(value, separators);

while (token) {
values.push_back(token);
token = strtok(0, separators);
}
}

else {
values.push_back(value);
}

if (false == result.insert(std::make_pair(key, values)).second)
throw "ParseStream: Duplicate key detected.";
}

return result;
}


Something like that.
Result lookup is simple, too:

template<typename T>
bool GetValue(const property_map &properties, const std::string &key, T &value, size_t index = 0) {
property_map::const_iterator it = properties.find(key);

if (properties.end() == it)
// key is not registered
return false;

const many::const_iterator val = it->second.begin();

while (index) {

++val;

if (it->second.end() == val)
// value index out of bounds
return false;

--index;
}

if (false == boost:any_cast<T>(&val))
// value type doesn't match
return false;

value = boost::any_cast<T>(val);

// key was found and value type matches the request
return true;
}



These functions are from top of my head so no warranty on correctness [wink]

Regards,
Pat.

Share this post


Link to post
Share on other sites
Quote:
The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?


I didn't look at darookie's code, so I may be duplicating some or all of what he already said :-)

In my console system, commands and variables are hashed by their first letter, and then a binary search is used to find the entry. So number 4, finding the keyword, can be made pretty fast.

As for converting a string to a numerical value, I really don't know what the performance hit on something like that is. It's not something I've worried about.

To write my parser (which I'm sure I will rewrite), I studied the parsers from the Titan/Cake and Diesel Q3 rendering engines, and also read the article from one of the Game Programming Gems books. I haven't looked at it yet, but I think the Doom 3 sdk comes with id's parser, so you could look at that as well.

You could certainly write a very simple parser for your example. But writing a more generic parser can be an interesting challenge, and will give you a tool you can use in many contexts - your console, script parsing, reading text files, etc.

The GPG article describes a fairly deep parser that supports keywords, operators, etc. The others I mentioned are basically for parsing Q3 shaders. Haven't looked at the Doom 3 parser yet, but I'm curious to.

Good luck.

Share this post


Link to post
Share on other sites
I didn’t quite understand all that. I don’t like using std::namespace stuff. I need a fast way to find matching strings. I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console. The same underlying procedure is used in both processes.

For convenience, I’m thinking about using a symbol table. I wouldn’t have to do a secondary parse using a command identifier. I would only have to look up the given identifier in the table. The amount of symbols in the table is fixed. So an array with a set size will be used.

What about binary search? If I use numbers to represent the identifiers I could use a binary search to find the correct symbol in the table. Say I add all the ASCII char values into a number where A - Z would represent 1 – 26 for instance. I just add the corresponding values of each character. Then do a binary search for that number in the array. If the search fails then the command is invalid.

Then I just use the remaining character (value) to do what ever I need. Depending on the command, value will be converted to integer, float, or left as text. Actually I think I’ll leave float out because that conversion it time consuming. To convert an integer is as simple as char - '0'. So only to types of input are accepted, text and numbers(non float). So the code may look like this.


int atoi( char* str ) {
int val = 0;
while ( 1 ) {
if ( str > '0' && str < '0' ) {
val += *str - '0';
str++;
}
else {
val = -1;
break;
}
}
return val;
}

int getcmd( char* str, int sep ) {
int val = 0, i = 0;
while ( i < strlen( str ) ) {
if ( str[ i ] == '=' ) {
sep = i;
break;
}
val += 'a' - str[ i ] + 1;
i++;
}
return val;
}

int parse( char* str ) {
int index = 0, sep = 0, val = 0;
if ( getcmd( str, sep ) == -1 )
return -1;
index = binary_search_on_symbol_table( val );
if ( index == -1 )
return -1;
else {
val = atoi( str[ sep ] );
if ( val == -1 ) {
return -1;
set_parameter( index, val );
}
return 1;
}



That code sort of sucks but it something of what I’m talking about.

Share this post


Link to post
Share on other sites
I'd be interested in seeing your data that shows the slowness of all the "std namespace stuff." Lexical analysis isn't known for its speed. Coding it in assembly will do little more than further obfuscate the code.

Share this post


Link to post
Share on other sites
Why are you coding atoi yourself? What sort of speed are you after here?

If this is just part of a larger program, then I really don't see why you need to worry about micro-optimising like this.

If this is something else that is performance sensitive, then you should be thinking on the algorithmic level first before worrying about the details you see, most interested in. For example, get rid of the inefficient parts of your code, for instance by getting rid of the repeated strlen call and replacing it with (str[ i ] != 0 ). And perform step 4 before step 3; who cares what buffer 2 equals if the command identifier is invalid anyway?

Share this post


Link to post
Share on other sites
Quote:
Original post by Amnesty
use boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code [edit](in some cases)[/edit].

sakky:
Quote:

I didn’t quite understand all that.

That's exactly your problem. You can get rid of the std:: if it doesn't appeal to you by simply putting a
using namespace std;
above your parser.
The map container will provide you with a convinient O(log n) string lookup and using STL containers and functions will be less error prone and most of the time (depending on your skill level) faster than rolling your own functions.
Get familiar with the things you can get for free and understand how these can be used and ease up your life.
Quote:

I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console.

[oh]
This is like shooting a mosquito with a nuclear warhead...

Share this post


Link to post
Share on other sites
Quote:
I think I might code a lexical analyzer in assembly.


!?

Quote:
What about binary search?


Yes! Actually, I suggested that... A binary search will be fast, and a binary search plus a hash table will be even faster.

Anyway, I'll just echo the sentiment that you may not be focusing your optimization efforts in the most productive place.

Share this post


Link to post
Share on other sites
Quote:
Original post by darookie
Quote:
Original post by Amnesty
use boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code [edit](in some cases)[/edit].

--snip--


Using sub_rules and symbols were applicable I don't find spirit to much slower. It is a bit slower, but certainly not so slow I feel i should roll my own parser. Plus formulating a parser in spirit is so fast i can get on with the rest of my code :)

Also spirit v2.0 is in the works with 20x speed increase so I think it would be good to learn the how to use it. Should be out in a few months.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!