# parsing a string

This topic is 4733 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

I have a question regarding lexical analysis. The purpose for the lexical analyzer is to break apart command line arguments. The analyzer will be used in one of my projects as part of a console. I have a theory of how to do this, but I was wondering if any of you had a better way. Google returns with horrible results that talk about analyzers for front end compilers. They won’t work and their design is different from what I need. My implementation language is C++ (just had to make sure). The semantics of the arguments are simple. Identifier = Value. Example: color=red. I’ve come to the conclusion that a simple lexical analysis of the example string would involve: 1) Read in line from file / console until end of line is found. 2) Read characters into buffer 1 until "=" is found. 3) Read characters into buffer 2 preceding "=" until end of string. 4) Search for keyword match with buffer 1 5) Convert buffer 2’s value (integer, float). 6) Return command identifier with value. The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?

##### Share on other sites
typedef std::list<boost::any> many;typedef std::map<std::string, many> property_map;property_map ParseStream(std::istream &in, const char *separators = 0) {    property_map result;    char line[1024];    while (i.getline(line, sizeof(line))) {        size_t len = strlen(line);        size_t sep = std::find(line, line + len, '=');        if (size + len == sep)           throw "ParseStream: Missing '='.";        char * key = line;        line[sep] = 0;        char * value = line + sep + 1;                many values;   // yeah - this looks nice :)         if (seperators) {                      char * token = strtok(value, separators);           while (token) {               values.push_back(token);               token = strtok(0, separators);           }        }        else {            values.push_back(value);        }        if (false == result.insert(std::make_pair(key, values)).second)            throw "ParseStream: Duplicate key detected.";    }    return result;}

Something like that.
Result lookup is simple, too:
template<typename T>bool GetValue(const property_map &properties, const std::string &key, T &value, size_t index = 0) {    property_map::const_iterator it = properties.find(key);    if (properties.end() == it)        // key is not registered        return false;    const many::const_iterator val = it->second.begin();    while (index) {        ++val;        if (it->second.end() == val)           // value index out of bounds           return false;        --index;    }        if (false == boost:any_cast<T>(&val))        // value type doesn't match        return false;    value = boost::any_cast<T>(val);    // key was found and value type matches the request    return true;}

These functions are from top of my head so no warranty on correctness [wink]

Regards,
Pat.

##### Share on other sites
Quote:
 The command identifier is parsed out with a switch statement and then the proper assignment is made to the proper variable. This seams to me that it would take a lot of time. Steps 4 and 5 would be the longest. There is still another process of elimination that takes place when I parse out the command identifier. Is there faster ways of implementing this?

I didn't look at darookie's code, so I may be duplicating some or all of what he already said :-)

In my console system, commands and variables are hashed by their first letter, and then a binary search is used to find the entry. So number 4, finding the keyword, can be made pretty fast.

As for converting a string to a numerical value, I really don't know what the performance hit on something like that is. It's not something I've worried about.

To write my parser (which I'm sure I will rewrite), I studied the parsers from the Titan/Cake and Diesel Q3 rendering engines, and also read the article from one of the Game Programming Gems books. I haven't looked at it yet, but I think the Doom 3 sdk comes with id's parser, so you could look at that as well.

You could certainly write a very simple parser for your example. But writing a more generic parser can be an interesting challenge, and will give you a tool you can use in many contexts - your console, script parsing, reading text files, etc.

The GPG article describes a fairly deep parser that supports keywords, operators, etc. The others I mentioned are basically for parsing Q3 shaders. Haven't looked at the Doom 3 parser yet, but I'm curious to.

Good luck.

##### Share on other sites
I didn’t quite understand all that. I don’t like using std::namespace stuff. I need a fast way to find matching strings. I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console. The same underlying procedure is used in both processes.

For convenience, I’m thinking about using a symbol table. I wouldn’t have to do a secondary parse using a command identifier. I would only have to look up the given identifier in the table. The amount of symbols in the table is fixed. So an array with a set size will be used.

What about binary search? If I use numbers to represent the identifiers I could use a binary search to find the correct symbol in the table. Say I add all the ASCII char values into a number where A - Z would represent 1 – 26 for instance. I just add the corresponding values of each character. Then do a binary search for that number in the array. If the search fails then the command is invalid.

Then I just use the remaining character (value) to do what ever I need. Depending on the command, value will be converted to integer, float, or left as text. Actually I think I’ll leave float out because that conversion it time consuming. To convert an integer is as simple as char - '0'. So only to types of input are accepted, text and numbers(non float). So the code may look like this.

int atoi( char* str ) {	int val = 0;	while ( 1 ) {		if ( str > '0' && str < '0' ) {			val += *str - '0';			str++;		}		else {			val = -1;			break;		}	}	return val;}int getcmd( char* str, int sep ) {	int val = 0, i = 0;	while ( i < strlen( str ) ) {		if ( str[ i ] == '=' ) {			sep = i;			break;		}		val += 'a' - str[ i ] + 1;		i++;	}	return val;}int parse( char* str ) {	int index = 0, sep = 0, val = 0;	if ( getcmd( str, sep ) == -1 )		return -1;	index = binary_search_on_symbol_table( val );	if ( index == -1 )		return -1;	else {		val = atoi( str[ sep ] );		if ( val == -1 ) {			return -1;		set_parameter( index, val );	}	return 1;}

That code sort of sucks but it something of what I’m talking about.

##### Share on other sites
I'd be interested in seeing your data that shows the slowness of all the "std namespace stuff." Lexical analysis isn't known for its speed. Coding it in assembly will do little more than further obfuscate the code.

##### Share on other sites
Why are you coding atoi yourself? What sort of speed are you after here?

If this is just part of a larger program, then I really don't see why you need to worry about micro-optimising like this.

If this is something else that is performance sensitive, then you should be thinking on the algorithmic level first before worrying about the details you see, most interested in. For example, get rid of the inefficient parts of your code, for instance by getting rid of the repeated strlen call and replacing it with (str[ i ] != 0 ). And perform step 4 before step 3; who cares what buffer 2 equals if the command identifier is invalid anyway?

use boost.spirit

##### Share on other sites
Quote:
 Original post by Amnestyuse boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code (in some cases)[/edit].

sakky:
Quote:
 I didn’t quite understand all that.

That's exactly your problem. You can get rid of the std:: if it doesn't appeal to you by simply putting a
using namespace std;
The map container will provide you with a convinient O(log n) string lookup and using STL containers and functions will be less error prone and most of the time (depending on your skill level) faster than rolling your own functions.
Get familiar with the things you can get for free and understand how these can be used and ease up your life.
Quote:
 I think I might code a lexical analyzer in assembly. I only need to read the command arguments from a CFG file or the console.

[oh]
This is like shooting a mosquito with a nuclear warhead...

##### Share on other sites
Quote:
 I think I might code a lexical analyzer in assembly.

!?

Quote:

Yes! Actually, I suggested that... A binary search will be fast, and a binary search plus a hash table will be even faster.

Anyway, I'll just echo the sentiment that you may not be focusing your optimization efforts in the most productive place.

##### Share on other sites
Quote:
Original post by darookie
Quote:
 Original post by Amnestyuse boost.spirit

This is the first (and probaly the only) time I'd like to advise against using a part of boost [smile].
From my exprience boost.tokenizer and boost.spirit are dog-slow compared to custom code (in some cases)[/edit].

--snip--

Using sub_rules and symbols were applicable I don't find spirit to much slower. It is a bit slower, but certainly not so slow I feel i should roll my own parser. Plus formulating a parser in spirit is so fast i can get on with the rest of my code :)

Also spirit v2.0 is in the works with 20x speed increase so I think it would be good to learn the how to use it. Should be out in a few months.

##### Share on other sites
Amnesty:
For complex parsing tasks I choose boost.spirit over custom code any time. But for such simple thing as 'key=value\n' it's quite a bit of overkill in my eyes.
I'm also very exited to hear about spirit v2.0 - can't wait to toy with it some more [smile].

##### Share on other sites
If you seperate them by spaces "somdata = value", then something like this might help:
char *parse_single_word( char *text, char *word_buffer, int word_length ){    char *p = text;    int count;    // skip leading white space    while( isspace( *p ) )        p++;    word_buffer[0] = *p;    count = 1;    // fill the word buffer    for ( ;*p && ' ' != *p; p++ )    {        if ( count >= word_length - 1 )        {            // do something about overunning the buffer        }        word_buffer[count++] = *p;    }    // null terminate the string    word_buffer[++count] = '\0';    // return the current position in the string    return p;}

That is off the top of my head so don't kill me if it doesn't work out of the box :) Basicly if you read a line into a buffer (text) you can pull the first word from it and place it into another buffer (word_buffer) and it returns the current location in the string so you can ultimately parse a string word by word.

for example given a line in a text file:
player_hp = 2000

the following could be used to parse that line assuming a variable named text_buf contains the above string with a null terminator:
char word[32];char operator[32];char value[32];ptext = parse_single_word( text_buf, word, 32 );// check for the word.. continue if necessary...ptext = parse_single_word( ptext, operator, 32 );// make sure the operator is ok or leave it out all together depending on the keyword// = operators are kind of unnecessary in this case I feel...ptext = parse_single_word( ptext, value, 32 );// turn the value into what you need it to be...repeat

Of course there is no real error checking etc but it can be done easily enough as this is a crude example, but I think something along these lines is pretty simple to use for simple text files. A similar technique can be used to read in variable length strings spanning multiple lines if you use a unique line terminator such as a # or ^ or ~ or some other uncommonly used punctuation mark.

Hope that helps

##### Share on other sites
Quote:
 Original post by darookieThe map container will provide you with a convinient O(log n) string lookup

This is not entirely true, because since string comparisons are not O(1), some cases exist where you can have, even with a balanced tree, a hard time finding what you're after.

OP: any chance using a program such as lex to generate a lexer for you? At least, this would get 4) out of the way in O(length of the keyword). (Or you can decide to create the deterministic automata yourself, which is tricky).

##### Share on other sites
Quote:
Original post by ToohrVyk
Quote:
 Original post by darookieThe map container will provide you with a convinient O(log n) string lookup

This is not entirely true, because since string comparisons are not O(1), some cases exist where you can have, even with a balanced tree, a hard time finding what you're after.

Agreed, still O(k * log n) is faster than a brute force linear search and you have to keep in mind that these special cases are rare (e.g. a large number of long almost equal keys that would differ only in, say, the last few characters).
Otherwise you can still resort to a hash map, though the implied memory footprint might not be worth the slight speed increase. Another method would be using a heap to keep the keys sorted (also ready-to-use from within the STL).
Anyway my main point was to rather use the tools that already exist and have well-known properties instead of trying to implement your own function, which is more work and error prone.

On the other hand - if it's just for fun or educational reasons then I'd also say go for it and implement a custom lexer/parser.

Regards,
Pat.

##### Share on other sites
I got this to parse a string like I wanted. But not all I need to do is add in the association with the symbol table.

// PARSER.CPP//#include <STDIO.H>#include <CTYPE.H>int parse( char* lpszString ){	char	szBuff[ 80 ];	int		nCopyIndex = 0, 			nValue = 0, 			nState = 1, 			nIndex = 0;	szBuff[ 0 ] = '\0';	while ( 1 )	{		switch ( nState )		{		case 1:			 // Check for new line characters			 if ( lpszString[ nIndex ] == '\n' || lpszString[ nIndex ] == '\0' )				 nState = 4;			 // White space is not allowed			 else			 if ( lpszString[ nIndex ] == ' ' )				 nState = 5;			 // Check for alpha characters			 else			 if ( isalpha( lpszString[ nIndex ] ) ) 			 {				 nIndex--;				 nState = 2;			 }			 // Check for numbers			 else			 if ( isdigit( lpszString[ nIndex ] ) )			 {				 nIndex--;				 nState = 3;			 }			 // Invalid character found in stream!			 else				 nState = 9;			 break;		case 2:			 while ( 1 )			 {				 // Check for new line characters				 if ( lpszString[ nIndex ] == '\n' || lpszString[ nIndex ] == '\0' )				 {					 nState = 5;					 break;				 }				 // White space is not allowed				 else				 if ( lpszString[ nIndex ] == ' ' )				 {					 nState = 5;					 break;				 }				 // Check for the separator				 else				 if ( lpszString[ nIndex ] == '=' )				 {					 szBuff[ nIndex ] = '\0';					 nState = 1;					 break;				 }				 else				 // Copy the characters				 szBuff[ nCopyIndex ] = lpszString[ nIndex ];				 nCopyIndex++;				 nIndex++;			 }			 break;		case 3:			 nCopyIndex = 0;			 while ( isdigit( lpszString[ nIndex ] ) )			 {				nValue += ( lpszString[ nIndex ] - '0' ) + ( 10 * nCopyIndex );				nCopyIndex++;				nIndex++;			 }			 nState = 4;			 break;		case 4:			 // TODO : Make symbol table reference for match			 printf( "Found token ( %s ) that has a value of %d\n", szBuff, nValue );			 return ( 0 );		case 5:			 printf( "Error in token stream!\n" );			 return ( -1 );		}				nIndex++;	}	return ( 0 );}int main( void ){	char	szCommandStr[ 80 ];	printf( "Enter a string: " );		scanf( "%s", szCommandStr );	parse( szCommandStr );	return ( 0 );}

It may not be the best code, but that is what I was looking for. By the way, the lexical analyzer will be extended upon into something else. But for the time being all I need to do is break apart the simple expressions Variable = Value.

##### Share on other sites
I can optimize my code a bunch. Technically, I don’t have to search for invalid characters. All I have to do is copy the string until I find the separator. I still need to check for end line / new line characters, but that’s about it. I can let some other phase deal with the task of finding if the returns token is a valid identifier.

After finding the separator I jump to the next phase to find the value. Value must be numeric. So all I need to do is check for non numeric entries. So then my code would look like this.

int parse( char* lpszString ) {	char szBuff[ 80 ] = "*";	int	 nValue = 0, nState = 0, nIndex = 0;	// Parse the string!	while ( 1 ) {		switch ( nState ) {		case 0:			 // Check for new line characters			 if ( lpszString[ nIndex ] == '\n' || lpszString[ nIndex ] == '\0' )				 nState = 2;			 // Check for separator			 else if ( lpszString[ nIndex ] == '=' ) {				 szBuff[ nIndex ] = '\0';				 nValue = atoi( &lpszString[ nIndex + 1 ] );				 nState = 1;			 }			 // Copy identifier			 else				 szBuff[ nIndex ] = lpszString[ nIndex ];			 break;		case 1:			 printf( "Found token \"%s\" with value of %d!\n", szBuff, nValue );			 return ( 1 );		}		nIndex++;	}	return ( 0 );}

##### Share on other sites
Actually, I’ve come up with something else too. In Q3, you didn’t have to use an assignment operator such as =. All you did was spelled the name of the variable and to the right was the value. Oh, I almost forgot the set keyword.

All I need to do is create a lexical analyzer that will break down a single line of strings. I guess I can use white space as delimiters. Any spaces, new lines, carriage returns will be considered as a break in the string.

So then the process is even more simplistic now! All is needed is to remove white space and return the chunks (tokens) between them. Or actually, I would push them onto a stack as I went on, or maybe not. I think the lex should be a sub routine that another part of the application uses.

Let’s say, I have a FSM that changes states based on the token received. If the token doesn’t satisfy the state an error has occurred. Okay, this just got real simple! Well, thanks for your help any ways guys :)

##### Share on other sites
Off the top of my head, and with help from here:

using namespace std;string key, value;// Loop reading the first part of the current line.// optional: remove leading whitespace from linewhile(cin >> ws && !getline(cin, key, '=').eof()) {  removeTrailingWhitespace(key);  // Check if the key corresponds to an actual keyword.  if !lookup(key) continue;  // Optional: remove whitespace immediately after the equals sign  cin >> ws;  // Read the value  getline(cin, value);  removeTrailingWhitespace(value);  // Update the information  set(key, value);}  void removeTrailingWhitespace(String & s) {  // optional: remove trailing whitespace from a key or value  // If you know a better way to do this please use that instead ;)  // Also I'm not actually sure this works? Will the .str() ignore the "read"  // whitespace characters as I want it to?  reverse(s.begin(), s.end());  istringstream temp(s);  temp >> ws; s = temp.str();  reverse(s.begin(), s.end());}

##### Share on other sites
Pretty much all I need to do is build something very similar to a command line argument parser. Like the one for Windows. All is needed is to parse through the string and build tokens separated by white space. The token are pushed onto a toke table. Then the table is walked through using a FSM. The FSM is just like a translation diagram. I can make a set of error definitions so that the FSM will return once an error occurs. So if a token was improper then the FSM would return E_INVALIDVALUE or E_INVALIDTOKEN. It would also return a pointer to the last token parsed. So then I could use the information to inform the user about the error.

I could go a little farther with this and allow the use of certain symbols to denote certain operations. Like a + or – sign would enable or disable certain features. And global variables can be manipulated by simply specifying their name fallowed by a space and the value to set. And if the user wants to know the value of a variable all they have to do is specify the name.

This is the code I’m using right now.
typedef struct _SYMBOL {	_SYMBOL* pNext;	CHAR* lpszString;} SYMBOL, *LPSYMBOL;LPSYMBOL					 g_lpSymbolTable= NULL;// ----------------------------------------------------------------------------// Name		: CONEvaluateTable// Desc		: Evaluates the symbols in the table for acceptable commands// ----------------------------------------------------------------------------HRESULT __stdcall CONEvaluateTable( void ){	LPSYMBOL				 lpSymbol	= NULL;	int						 nState		= 0;	if ( NULL == ( lpSymbol = g_lpSymbolTable ) )	{		return ( E_FAIL );	}	while ( NULL != lpSymbol )	{		MessageBox( NULL, lpSymbol->lpszString, "Console", MB_OK | MB_ICONEXCLAMATION );		lpSymbol = lpSymbol->pNext;	}	CONClearTable( );	return ( S_OK );}// ----------------------------------------------------------------------------// Name		: CONParseString// Desc		: Parses a string into tokens// ----------------------------------------------------------------------------HRESULT __stdcall CONParseString( void  ){	CHAR					 szBuffer[ MAX_PATH ];	int						 nIndex = 0, nStart = 0;	HRESULT					 hRet;	while ( 1 )	{		if ( g_szConsoleBuffer[ nIndex ] == '\0' || g_uCurrConIndex == nIndex )		{			strncpy( szBuffer, &g_szConsoleBuffer[ nStart ], nIndex - nStart );			return ( CONPushSymbol( szBuffer ) );		}		else if ( g_szConsoleBuffer[ nIndex ] == ' ' )		{			strncpy( szBuffer, &g_szConsoleBuffer[ nStart ], nIndex - nStart );			if ( FAILED( hRet = CONPushSymbol( szBuffer ) ) )			{				return ( hRet );			}			nStart = nIndex + 1;		}		nIndex++;	}	return ( S_OK );}// ----------------------------------------------------------------------------// Name		: CONPushSymbol// Desc		: Pushes a symbol on the symbol table// ----------------------------------------------------------------------------HRESULT __stdcall CONPushSymbol( LPCSTR lpszString ){	LPSYMBOL				 lpSymbol = NULL;	if ( NULL == lpszString )	{		return ( E_INVALIDARG );	}	if ( NULL == ( lpSymbol = new SYMBOL ) )	{		return ( E_OUTOFMEMORY );	}	else	{		if ( NULL == ( lpSymbol->lpszString = new CHAR[ strlen( lpszString ) + 1 ] ) )		{			SAFE_DELETE( lpSymbol );			return ( E_OUTOFMEMORY );		}		else		{			strncpy( lpSymbol->lpszString, lpszString, strlen( lpszString ) );		}	}	lpSymbol->pNext = g_lpSymbolTable;	g_lpSymbolTable = lpSymbol;	return ( S_OK );}// ----------------------------------------------------------------------------// Name		: CONClearTable// Desc		: Removes all symbols from the symbol table// ----------------------------------------------------------------------------HRESULT __stdcall CONClearTable( void ){	LPSYMBOL				 lpSymbol = NULL;	while ( NULL != g_lpSymbolTable )	{		lpSymbol = g_lpSymbolTable->pNext;		if ( NULL != g_lpSymbolTable->lpszString )		{			delete [ ] g_lpSymbolTable->lpszString;			g_lpSymbolTable->lpszString = NULL;		}		delete g_lpSymbolTable;		g_lpSymbolTable = lpSymbol;	}	return ( S_OK );}

I’m thinking about using std::string and std::list instead of building my own lists and string stuff. These thing is very simple.

##### Share on other sites
Have you considered making use of tools like flex/lex and bison/yacc?

Have you read the Dragon Book?

(Note: all of the above dates to the C era, so you will have to do your own adaptations to "modern C++ style" including proper use to std::strings.)

##### Share on other sites
Quote:
 Original post by darookiechar line[1024];while (i.getline(line, sizeof(line))) {

FYI, there's also the function std::getline, which will accept std::strings. That code is (nearly) equivilant to:

std::string line;while ( getline ( i , line ) ) {

The lack of this support directly in the class is one of the STLs odd shortcommings.

##### Share on other sites
Yes I have read the dragon book! That is where I’ve gotten all the information I know about compiler construction. But a compiler is not what I’m looking for. More like a stripped down version of the front end of a compiler. I just need to tokenize word in a string. Evaluate the tokens for valid commands and then process them.

By the way, wouldn’t a single or double linked list work better then a tree for syntax or semantics? I think they would because you could use an algorithm similar to a translation diagram for finding syntax or semantic errors. A tree just seams to complicate things even more. Why branch off into leaves when you can just go straight and evaluate syntax using an FSM as I spoke of in earlier posts?

##### Share on other sites
Quote:
 Original post by sakkyPretty much all I need to do is build something very similar to a command line argument parser. Like the one for Windows. All is needed is to parse through the string and build tokens separated by white space. The token are pushed onto a toke table. Then the table is walked through using a FSM. The FSM is just like a translation diagram. I can make a set of error definitions so that the FSM will return once an error occurs. So if a token was improper then the FSM would return E_INVALIDVALUE or E_INVALIDTOKEN. It would also return a pointer to the last token parsed. So then I could use the information to inform the user about the error.I’m thinking about using std::string and std::list instead of building my own lists and string stuff. These thing is very simple.
Good to see you've come to your senses and put away your sledgehammer-flyswat. Most of the ideas thrown around here are way more than you need. You should write a simple description of valid input and code it to parse exactly that (or return an error) with the minimal code possible.
Hell, I doubt there was ever any need whatsoever to optimise this. It isn't something that should even touch your framerate, and loading times differences would be unnoticeable. K.I.S.S!

##### Share on other sites
Here's probably the easiest way to do something like that:

#include <iostream>#include <string>#include <vector>#include <map>#include <boost/bind.hpp>#include <boost/algorithm/string.hpp>using namespace std;using namespace boost;typedef vector<string> stringvector_t;typedef map<string,string> optmap_t;optmap_t parse_opts(const string& cmdline){    optmap_t ret;    stringvector_t opts;    split(opts,cmdline,is_space());        // loop through each    stringvector_t::iterator current = opts.begin();    stringvector_t::iterator end = opts.end();    for(;current != end; ++current)    {        stringvector_t vals;        split(vals,*current,bind(equal_to<char>(),'=',_1));        ret[vals[0]] = vals[1];            }    return ret;}int main(int argc,char* argv[]){    string cmdline = "name1=value1 name2=value2 name3=value3";        optmap_t opts = parse_opts(cmdline);        optmap_t::iterator current = opts.begin();    optmap_t::iterator end = opts.end();    for(;current != end;++current)        cout << "Opt Name: " << current->first << ", Opt Value: " << current->second << '\n';        return 0;}

##### Share on other sites

This topic is 4733 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.