parsing a string

Started by
22 comments, last by daerid 19 years, 3 months ago
I have a question regarding lexical analysis. The purpose for the lexical analyzer is to break apart command line arguments. The analyzer will be used in one of my projects as part of a console. I have a theory of how to do this, but I was wondering if any of you had a better way. Google returns with horrible results that talk about analyzers for front end compilers. They won’t work and their design is different from what I need. My implementation language is C++ (just had to make sure). The semantics of the arguments are simple. Identifier = Value. Example: color=red. I’ve come to the conclusion that a simple lexical analysis of the example string would involve: 1) Read in line from file / console until end of line is found. 2) Read characters into buffer 1 until "=" is found. 3) Read characters into buffer 2 preceding "=" until end of string. 4) Search for keyword match with buffer 1 5) Convert buffer 2’s value (integer, float). 6) Return command identifier with value. The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?
Take back the internet with the most awsome browser around, FireFox
Advertisement
typedef std::list<boost::any> many;typedef std::map<std::string, many> property_map;property_map ParseStream(std::istream &in, const char *separators = 0) {    property_map result;    char line[1024];    while (i.getline(line, sizeof(line))) {        size_t len = strlen(line);        size_t sep = std::find(line, line + len, '=');        if (size + len == sep)           throw "ParseStream: Missing '='.";        char * key = line;        line[sep] = 0;        char * value = line + sep + 1;                many values;   // yeah - this looks nice :)         if (seperators) {                      char * token = strtok(value, separators);           while (token) {               values.push_back(token);               token = strtok(0, separators);           }        }        else {            values.push_back(value);        }        if (false == result.insert(std::make_pair(key, values)).second)            throw "ParseStream: Duplicate key detected.";    }    return result;}

Something like that.
Result lookup is simple, too:
template<typename T>bool GetValue(const property_map &properties, const std::string &key, T &value, size_t index = 0) {    property_map::const_iterator it = properties.find(key);    if (properties.end() == it)        // key is not registered        return false;    const many::const_iterator val = it->second.begin();    while (index) {        ++val;        if (it->second.end() == val)           // value index out of bounds           return false;        --index;    }        if (false == boost:any_cast<T>(&val))        // value type doesn't match        return false;    value = boost::any_cast<T>(val);    // key was found and value type matches the request    return true;}


These functions are from top of my head so no warranty on correctness [wink]

Regards,
Pat.
Quote:The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?


I didn't look at darookie's code, so I may be duplicating some or all of what he already said :-)

In my console system, commands and variables are hashed by their first letter, and then a binary search is used to find the entry. So number 4, finding the keyword, can be made pretty fast.

As for converting a string to a numerical value, I really don't know what the performance hit on something like that is. It's not something I've worried about.

To write my parser (which I'm sure I will rewrite), I studied the parsers from the Titan/Cake and Diesel Q3 rendering engines, and also read the article from one of the Game Programming Gems books. I haven't looked at it yet, but I think the Doom 3 sdk comes with id's parser, so you could look at that as well.

You could certainly write a very simple parser for your example. But writing a more generic parser can be an interesting challenge, and will give you a tool you can use in many contexts - your console, script parsing, reading text files, etc.

The GPG article describes a fairly deep parser that supports keywords, operators, etc. The others I mentioned are basically for parsing Q3 shaders. Haven't looked at the Doom 3 parser yet, but I'm curious to.

Good luck.
I didn’t quite understand all that. I don’t like using std::namespace stuff. I need a fast way to find matching strings. I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console. The same underlying procedure is used in both processes.

For convenience, I’m thinking about using a symbol table. I wouldn’t have to do a secondary parse using a command identifier. I would only have to look up the given identifier in the table. The amount of symbols in the table is fixed. So an array with a set size will be used.

What about binary search? If I use numbers to represent the identifiers I could use a binary search to find the correct symbol in the table. Say I add all the ASCII char values into a number where A - Z would represent 1 – 26 for instance. I just add the corresponding values of each character. Then do a binary search for that number in the array. If the search fails then the command is invalid.

Then I just use the remaining character (value) to do what ever I need. Depending on the command, value will be converted to integer, float, or left as text. Actually I think I’ll leave float out because that conversion it time consuming. To convert an integer is as simple as char - '0'. So only to types of input are accepted, text and numbers(non float). So the code may look like this.

int atoi( char* str ) {	int val = 0;	while ( 1 ) {		if ( str > '0' && str < '0' ) {			val += *str - '0';			str++;		}		else {			val = -1;			break;		}	}	return val;}int getcmd( char* str, int sep ) {	int val = 0, i = 0;	while ( i < strlen( str ) ) {		if ( str == '=' ) {<br>			sep = i;<br>			<span class="cpp-keyword">break</span>;<br>		}<br>		val += 'a' - str + <span class="cpp-number">1</span>;<br>		i++;<br>	}<br>	<span class="cpp-keyword">return</span> val;<br>}<br><br><span class="cpp-keyword">int</span> parse( <span class="cpp-keyword">char</span>* str ) {<br>	<span class="cpp-keyword">int</span> index = <span class="cpp-number">0</span>, sep = <span class="cpp-number">0</span>, val = <span class="cpp-number">0</span>;<br>	<span class="cpp-keyword">if</span> ( getcmd( str, sep ) == -<span class="cpp-number">1</span> )<br>		<span class="cpp-keyword">return</span> -<span class="cpp-number">1</span>;<br>	index = binary_search_on_symbol_table( val );<br>	<span class="cpp-keyword">if</span> ( index == -<span class="cpp-number">1</span> )<br>		<span class="cpp-keyword">return</span> -<span class="cpp-number">1</span>;<br>	<span class="cpp-keyword">else</span> {<br>		val = atoi( str[ sep ] );<br>		<span class="cpp-keyword">if</span> ( val == -<span class="cpp-number">1</span> ) {<br>			<span class="cpp-keyword">return</span> -<span class="cpp-number">1</span>;<br>		set_parameter( index, val );<br>	}<br>	<span class="cpp-keyword">return</span> <span class="cpp-number">1</span>;<br>}<br><br></pre></div><!–ENDSCRIPT–><br><br>That code sort of sucks but it something of what I’m talking about.<br>
Take back the internet with the most awsome browser around, FireFox
I'd be interested in seeing your data that shows the slowness of all the "std namespace stuff." Lexical analysis isn't known for its speed. Coding it in assembly will do little more than further obfuscate the code.
--God has paid us the intolerable compliment of loving us, in the deepest, most tragic, most inexorable sense.- C.S. Lewis
Why are you coding atoi yourself? What sort of speed are you after here?

If this is just part of a larger program, then I really don't see why you need to worry about micro-optimising like this.

If this is something else that is performance sensitive, then you should be thinking on the algorithmic level first before worrying about the details you see, most interested in. For example, get rid of the inefficient parts of your code, for instance by getting rid of the repeated strlen call and replacing it with (str != 0 ). And perform step 4 before step 3; who cares what buffer 2 equals if the command identifier is invalid anyway?
use boost.spirit
"I turned into a driveway I thought was mine and crashed into a tree I don't have."- a real insurance claim
Quote:Original post by Amnesty
use boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code [edit](in some cases)[/edit].

sakky:
Quote:
I didn’t quite understand all that.

That's exactly your problem. You can get rid of the std:: if it doesn't appeal to you by simply putting a
using namespace std;
above your parser.
The map container will provide you with a convinient O(log n) string lookup and using STL containers and functions will be less error prone and most of the time (depending on your skill level) faster than rolling your own functions.
Get familiar with the things you can get for free and understand how these can be used and ease up your life.
Quote:
I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console.

[oh]
This is like shooting a mosquito with a nuclear warhead...
Quote:I think I might code a lexical analyzer in assembly.


!?

Quote:What about binary search?


Yes! Actually, I suggested that... A binary search will be fast, and a binary search plus a hash table will be even faster.

Anyway, I'll just echo the sentiment that you may not be focusing your optimization efforts in the most productive place.
Quote:Original post by darookie
Quote:Original post by Amnesty
use boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code [edit](in some cases)[/edit].

--snip--


Using sub_rules and symbols were applicable I don't find spirit to much slower. It is a bit slower, but certainly not so slow I feel i should roll my own parser. Plus formulating a parser in spirit is so fast i can get on with the rest of my code :)

Also spirit v2.0 is in the works with 20x speed increase so I think it would be good to learn the how to use it. Should be out in a few months.
"I turned into a driveway I thought was mine and crashed into a tree I don't have."- a real insurance claim

This topic is closed to new replies.

Advertisement