Daaark

Parsing Help

I'm trying to parse strings that look like this:

x;Text;More Text;g;d;gd;gf;r;
y;Text;More Text;g;d;gd;gf;r;
z;;More Text;g;d;gd;gf;r;

with strtok in C, using ";" as my delimiter. It works fine until I get to lines like z, which have two ;'s in a row. Calling strtok a second time on this line returns a pointer to 'More Text', and there is no warning in the strings that any line will have omitted text like line z. I shouldn't be getting 'More Text' until the third call, and this leads to a catastrophic failure later because I'm 'lost' in the strings I'm parsing and getting bad data. I have no clue how to check for this or how to prevent it.

If you're able to use C++, try:
template<class T>
class StringTok
{
public:
    StringTok( const T& seq,
               typename T::size_type pos = 0 )
        : seq_( seq ) , pos_( pos ) { }

    T operator()( const T& delim );

private:
    const T& seq_;
    typename T::size_type pos_;
};

template<class T>
T StringTok<T>::operator()
    ( const T& delim )
{
    T token;

    if( pos_ != T::npos )
    {
        // start of found token
        // (find_first_not_of skips any run of delimiters,
        // so empty fields are not reported)
        typename T::size_type first =
            seq_.find_first_not_of
                ( delim.c_str(), pos_ );
        if( first != T::npos )
        {
            // length of found token
            typename T::size_type num =
                seq_.find_first_of
                    ( delim.c_str(), first ) - first;
            // do all the work off to the side
            token = seq_.substr( first, num );

            // done; now commit using
            // nonthrowing operations only
            pos_ = first+num;
            if( pos_ != T::npos ) ++pos_;
            if( pos_ >= seq_.size() ) pos_ = T::npos;
        }
    }

    return token;
}

from Conversations: Al-Go-Rithms

If not, I'm sure you can rewrite it in C (minus the genericity).

Enigma

I assume you don't want to read the strings from files (but if you do, the code below should be modified so you don't waste memory storing the entire string before parsing it). What you need to do is read the first character, then if it's not a ';', then the string isn't blank, so you need to read the entire string. In code:

int index;
char* theEntireString;//this is the entire string you're going to parse
for (index = 0;index < strlen(theEntireString);){
    int c = theEntireString[index];
    if (c == ';'){
        //zero length string
        index++;
    }else{
        char stringBuffer[1024];//more than sufficient in most cases
        memset(stringBuffer,0,1024);
        sscanf(&theEntireString[index],"%[^;];",stringBuffer);
        index += (strlen(stringBuffer)+1);
    }
}



This code's basically useless as it is, but with a little modification, it'll give you what you want.

From strtok's manual page:
NAME
       strtok, strtok_r - extract tokens from strings

SYNOPSIS
       #include <string.h>

       char *strtok(char *s, const char *delim);

       char *strtok_r(char *s, const char *delim, char **ptrptr);

DESCRIPTION
       A `token' is a nonempty string of characters not occurring
       in the string delim, followed by \0 or by a character
       occurring in delim.
       ...

So strtok extracts tokens, and a token is never empty: a run of consecutive delimiters is treated as a single separator, which is why the empty field in line z is silently skipped. strtok is also not a very good function because it is not thread-safe.
Additionally, you should never assume that any file you read is correctly formatted; doing so creates a security hole in your program. Try reading and processing one character at a time, and use dynamic buffers or std::strings (probably with some maximum size) to store the text fields.
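
To see the man page's point concretely, here is a minimal sketch that counts how many tokens strtok reports for one line. The helper name count_strtok_tokens is made up for this illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok reports for one line.
   strtok writes '\0' into `line`, so pass a modifiable buffer. */
static size_t count_strtok_tokens(char *line)
{
    size_t count = 0;
    char *tok;

    assert(line != NULL);
    for (tok = strtok(line, ";"); tok != NULL; tok = strtok(NULL, ";"))
        ++count;
    return count;
}
```

A line such as "y;Text;More Text;g;" yields four tokens, while "z;;More Text;g;" yields only three, because the empty field between the two semicolons is never returned.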

Quote:
Original post by 255
thread-safe
Wrong thread? I'm not writing a multi-threaded app :)
Quote:
Additionally you should never assume that any file you read is correctly formatted.
The omitted characters are part of the correct format. :)

Gorax, in your example... would that code basically work like strtok? (I never got comfortable with char arrays and pointers.)

I have a massive amount of code using the strtok function, and it works just fine except for this one case of the omitted text. How can I just check if the pointer returned by the previous call to it is pointing to a ';'?

Like the equivalent of if (string[0] == ';')???

Would be nice to just hack in a check for now, and implement another solution later on.
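
One possible stopgap along those lines, as a sketch only: strtok overwrites the delimiter that ends a token with '\0' but leaves any delimiters it later skips untouched, so you can peek past the token it just returned and count the semicolons that follow. The helper name empty_fields_after is hypothetical, and it assumes you saved a pointer to the end of the line before the first strtok call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of empty fields that directly follow `tok`,
   the token most recently returned by strtok with ";" as the delimiter.
   `end` must point at the terminator of the ORIGINAL line, computed
   before the first strtok call (for example line + strlen(line)). */
static size_t empty_fields_after(const char *tok, const char *end)
{
    const char *p;
    size_t n = 0;

    assert(tok != NULL && end != NULL);
    p = tok + strlen(tok);      /* the '\0' strtok wrote, or the real end */
    if (p >= end)
        return 0;               /* tok was the last thing on the line */
    for (++p; p < end && *p == ';'; ++p)
        ++n;                    /* each extra ';' is one skipped empty field */
    return n;
}
```

After the call that returns "z" on the line "z;;More Text;g;", this reports one empty field, so the caller can record an empty string before asking strtok for the next token. It does not detect an empty field at the very start of a line, which would need a separate check of the first character.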

strtok will never return a pointer to a character that serves as a delimiter.

Your format looks like a comma-separated Excel format.

What you basically do is this (I'll write a little pseudocode):


char szLine[MAX_LINE_LENGTH];
char szToken[MAX_LINE_LENGTH];
while(fgets(szLine,MAX_LINE_LENGTH,streampointer)!=NULL)
{
    size_t i, n = 0;//n is the length of the token collected so far

    //fgets always terminates szLine; stop at the newline it keeps
    for(i=0;szLine[i]!='\0' && szLine[i]!='\n';i++)
    {
        if(szLine[i]!=';')
        {
            szToken[n++]=szLine[i];
        }
        else
        {
            szToken[n]='\0';
            //szToken now holds one complete token; n==0 is the ";;" case,
            //so insert an empty token into your token list here
            n=0;
        }
    }
}


That should work. I assume you know enough C++ to finish this piece of code.

P.S.: strtok is not thread-safe because it is not a reentrant function: it keeps its position in the string in a static pointer. Reentrant versions take a pointer to a pointer as an extra argument and store the position there.
P.P.S.: strtok is a damn inefficient way to parse a string into tokens. Say you have a string of at most MAX_LINE_LENGTH characters: you could scan the string for the number of tokens, create a pointer array, and store the address of the first character of each token in one of the pointers. If you do this with recursion, you only have to traverse the string a single time, using the stack as temporary storage for the token start addresses. Maybe I'll write such a parser into a little library sometime next week.
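
The pointer-array idea can be sketched roughly as follows. This version uses a plain loop and a caller-supplied array instead of recursion, and the name split_in_place and its parameters are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits `line` in place on `delim`, storing the start
   of each field in `fields`. Empty fields are kept as "". A trailing
   delimiter does not add a final empty field. Returns the field count,
   stopping early once `max_fields` pointers have been stored. */
static size_t split_in_place(char *line, char delim,
                             char **fields, size_t max_fields)
{
    size_t count = 0;
    char *start = line;
    char *p;

    assert(line != NULL && fields != NULL && delim != '\0');
    for (p = line; *p != '\0' && count < max_fields; ++p) {
        if (*p == delim) {
            *p = '\0';               /* terminate the current field */
            fields[count++] = start;
            start = p + 1;           /* next field begins after the delimiter */
        }
    }
    /* Text after the last delimiter (if any) is one more field. */
    if (*start != '\0' && count < max_fields)
        fields[count++] = start;
    return count;
}
```

The string is traversed once, nothing is copied, and a line like "z;;More Text;g;" comes back as four fields with the second one empty, so field positions stay aligned across lines.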

char stringBuffer[1024];//more than sufficient in most cases
memset(stringBuffer,0,1024);
sscanf(&theEntireString[index],"%[^;];",stringBuffer);

Classic buffer overrun exploit in the making.
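
For reference, the overrun comes from the unbounded %[^;] conversion: any field longer than 1023 characters writes past the end of stringBuffer. A bounded variant might look like the sketch below, where read_field and FIELD_MAX are names invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 1024   /* assumed buffer size, matching the quoted snippet */

/* Hypothetical helper: copies the field starting at `s` (up to the next ';'
   or the end of the string) into `out`, which must hold FIELD_MAX bytes.
   Returns the number of characters stored; longer fields are truncated. */
static size_t read_field(const char *s, char *out)
{
    assert(s != NULL && out != NULL);
    out[0] = '\0';
    /* %1023[^;] stores at most 1023 characters plus the terminating '\0'.
       The conversion fails on an empty field, leaving out as "". */
    if (sscanf(s, "%1023[^;]", out) != 1)
        out[0] = '\0';
    return strlen(out);
}
```

The width in the format string has to be kept one less than the buffer size by hand. A caller that advances by the returned length would also need to handle truncated fields, which this sketch leaves out.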

Here's a fairly minimal hack that will do what you want:
// all your header includes

// just like strtok, this code is not reentrant
char * singleDelimStrTok_storedInputString = NULL;

// safe strtok implementation that takes a single delimiter
char * singleDelimStrTok(char * inputString, char const * delimiter)
{
    if (!delimiter)
    {
        return NULL;
    }
    if (!inputString)
    {
        inputString = singleDelimStrTok_storedInputString;
    }
    else
    {
        singleDelimStrTok_storedInputString = inputString;
    }
    if (!singleDelimStrTok_storedInputString || !*singleDelimStrTok_storedInputString)
    {
        return NULL;
    }
    while (*singleDelimStrTok_storedInputString && *singleDelimStrTok_storedInputString != *delimiter)
    {
        ++singleDelimStrTok_storedInputString;
    }
    if (*singleDelimStrTok_storedInputString)
    {
        *singleDelimStrTok_storedInputString = '\0';
        ++singleDelimStrTok_storedInputString;
    }
    return inputString;
}

// the hack - must occur after all includes
#define strtok singleDelimStrTok

// the rest of your code


Enigma

[sad]

Funny how this thread is turning into a thread-safety one. I'm not writing, nor do I write, multithreaded apps. [lol] Is there an issue even in single-threaded apps?

Enigma, thanks. I'll try that out.

Reentrancy is not just an issue for multi-threaded apps:
void someOtherFunction()
{
    // some code
    result = strtok(someOtherString, someOtherDelimiters);
    // some code
}

void someFunction()
{
    // some code
    result = strtok(someString, someDelimiters);
    // some code
    result = strtok(NULL, someDelimiters);
    // some code
    if (someCondition)
    {
        someOtherFunction();
    }
    // some code
    result = strtok(NULL, someDelimiters); // uh-oh!
    // some code
}

If someCondition is false then someFunction will work as expected, but if someCondition is true then someOtherFunction will change the string that strtok sources from and the line marked "uh-oh!" will tokenise the wrong string.
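
One way around this, sketched here under the assumption of a single delimiter character, is to let each caller own the tokenizer state, in the style of strtok_r. The name field_tok_r is hypothetical, and unlike strtok_r this version returns empty fields as "".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reentrant tokenizer: same calling pattern as strtok_r, but
   with a single delimiter character and with empty fields returned as "".
   All state lives in *save, which the caller owns. */
static char *field_tok_r(char *s, char delim, char **save)
{
    char *start;
    char *p;

    assert(save != NULL && delim != '\0');
    start = (s != NULL) ? s : *save;
    if (start == NULL || *start == '\0') {
        *save = NULL;
        return NULL;                 /* nothing left to tokenize */
    }
    for (p = start; *p != '\0' && *p != delim; ++p)
        ;                            /* find the end of this field */
    if (*p == delim) {
        *p = '\0';
        *save = p + 1;               /* resume after the delimiter */
    } else {
        *save = p;                   /* reached the end of the string */
    }
    return start;
}
```

Because someFunction and someOtherFunction would each pass their own save pointer, a call in one cannot disturb the position held by the other.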

Enigma

Quote:
Original post by Enigma
Reentrancy is not just an issue for multi-threaded apps:
*** Source Snippet Removed ***
If someCondition is false then someFunction will work as expected, but if someCondition is true then someOtherFunction will change the string that strtok sources from and the line marked "uh-oh!" will tokenise the wrong string.

Enigma
But that's the fault of the coder for not properly checking return values.
