Sign in to follow this  
joe_bubamara

atoi & co

Recommended Posts

I am doing little .obj parse in C (not C++) and have to write some code to parse text data. Main parsing task is parsing actual numbers for coordinates, colors etc. I am using scanf & friens for main part of the job, but I am considering using atoi and atof instead. However those functions returns 0 if no conversion is performed??? How are you then supposed to know when an error occured and when a value of 0 is found? If I did't care about syntax and file validity checking, this would be fine, but I do. The documentation does not mention anything about error number being sett.

Share this post


Link to post
Share on other sites
Or something obvious like:
if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9')
handle_error();


That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.

Share this post


Link to post
Share on other sites
Quote:
Original post by Alan Kemp
You could use strtol(..). It can return (via the second paramater) a pointer to where ever it stopped parsing. You can use that to check the entire input string was converted.
You may also want to check whether the input was in range, e.g. strtol clips very small and very large values to LONG_MIN/LONG_MAX and sets errno to ERANGE to indicate the error.

E.g. something along these lines:
bool safe_atoi(const char *in, long *out) {
char *end;

errno = 0;
*out = strtol(in, &end, 10);

return end != in && !errno;
}


Quote:
Original post by samoth
Or something obvious like:
if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9')
handle_error();


That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.
That won't cover negative zero or cases with leading whitespace.

Share this post


Link to post
Share on other sites
If you can use C++ code, then a more reliable approach IMHO is to use a string stream:


std::string s = "1234";
std::istringstream stream(s);
int out = 0;
stream >> out;
if (stream.eof() && !stream.fail())
{
// parsed ok, 'out' is set to value
}
else
{
// some kind of error
}



This should also work for long/float/double.

Share this post


Link to post
Share on other sites
Quote:
Original post by samoth
Or something obvious like:
if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9')
handle_error();


That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.


yeah - it was obvous :-) thnks for opening my eyes!

I would just turn the clausuls in if and add extra test for + resp - sign before the number:



if( (*str == '-' || *str == '+')
str ++;
}
if(*str >= '0' && *str <= '9')
result = atoi(str);
else
str --; // back for - or + char



Feels like it should work. I will test it, and will also test if it is anything more efficient then sscanf - I am now done with .obj parser, so if I am going to change all that crap, it has to have some benefits, otherwise I don't care to do it.

Thnks for both you guys for answering and helping me out.

Share this post


Link to post
Share on other sites
Why not write the atoi yourself ? It's not that difficult, and saves you from having to call a stdlib function (which may depend on a couple of system settings, like locales etc.).
I just tried to post a little code snippet myself, but it seems like the forum software swallowed all the plus signs which turned it all into completely guff.

Share this post


Link to post
Share on other sites
Quote:
Original post by joe_bubamara
Feels like it should work. I will test it, and will also test if it is anything more efficient then sscanf - I am now done with .obj parser, so if I am going to change all that crap, it has to have some benefits, otherwise I don't care to do it.
You're still not accounting for whitespace. Whether or not this is a problem depends on how your parser splits up the strings, I suppose.
At any rate I'd strongly suggest making use of strtol instead if you want robust error detection, plus you won't have to manually tokenize the string to find the end of the number. Plus you get hexadecimal and octal literals for free by specifying base zero.

The other option is to do it manually of course: (untested code!)
bool safe_atoi(const char *in, long *out) {
char ch;
char prefix;
unsigned long value;

do
ch = isspace(*in++);
while(isspace(ch));

prefix = ch;

if(prefix == '+' || prefix == '-')
ch = *in++;

if(!isdigit(ch))
return false;

value = 0UL;

do {
unsigned long overflow = value;

ch -= '0';
value *= 10UL;
value += ch;

if(value < overflow)
return false;

ch = *in++;
} while(isdigit(ch));

if(prefix == '-')
return (*out = -(long) value) < 0L;
else
return (*out = +(long) value) >= 0L;
}

Share this post


Link to post
Share on other sites
Quote:
Original post by plastique
Why not write the atoi yourself ? It's not that difficult, and saves you from having to call a stdlib function (which may depend on a couple of system settings, like locales etc.).
I just tried to post a little code snippet myself, but it seems like the forum software swallowed all the plus signs which turned it all into completely guff.


Put your snippets into '[' code ']' and '[' /code ']' tags .... I have to add ' because of forum software, of course you use [ without '. Code tag will preserve tabs, spaces and other things. If you wish to have colorized syntax read the forum faq to see how it is used - I dislike that text area with colorized widget is so small so one sees only few lines of code at a time, so I almost never use it.

Yeah it is good question, where we should stop rolling own things instead of using libraries? :-). I believe that those conversion functions are done in optimized assembly, so that was reason why I am trying to use those rather then own thing.

cheers

Quote:
You're still not accounting for whitespace.
Yeah, you are right, but I do eat whitespaces elsewhere in my code. I do use whites as delimiters and they are all eaten until tokens are parsed. Anyway here is a loop to check for whites and I forgott to check for null pointer in snippet above, here is also check to see if end of string is found ('\0' == 0):


while(*str && isspace(*str))
str ++;

if( (*str && *str == '-' || *str == '+')
str ++;



if(*str && *str >= '0' && *str <= '9')
result = atoi(str);
else
str --; // back for - or + char


You don't have to find the end of the token nor to remove leading whitespaces. Itoa will eat itself until first - or + sign or first numeric value, and then convert any nymber of digits until it finds first non digit character. So it means that we don't have to remove trailing characters whatever they are; just be sure that we are throwing in a string that starts with whits or a number (inclusive + and - before the number).

docs: http://www.cplusplus.com/reference/clibrary/cstdlib/atoi/

[Edited by - joe_bubamara on July 23, 2009 8:32:49 AM]

Share this post


Link to post
Share on other sites
Quote:
Original post by joe_bubamara
Put your snippets into '[' code ']' and '[' /code ']' tags .... I have to add ' because of forum software, of course you use [ without '. Code tag will preserve tabs, spaces and other things. If you wish to have colorized syntax read the forum faq to see how it is used - I dislike that text area with colorized widget is so small so one sees only few lines of code at a time, so I almost never use it.


I tried to put the code into code/source formatters, but it had those issues. On second thought it also could have been the reply-preview that is buggy in this regard ...
I'll just try it again, regardless of how messed up the preview will look like
(it's not surprising that it's not very unlike the code the guy after my last post added to this thread).


inline bool StrToI32(const char * str, int32_t * number)
{
*number = 0; // just in case it wasn't initialized

while (*str == ' ') ++str;

int32_t sign = 1;
char character;

switch (*str)
{
case '-':
sign = -1;
// fall through
case '+':
++str;
// fall through
default:
character = *str;

if ((uint32_t) (character - '0') <= 9U) // is character a number ?
{
int32_t lastvalue = 0, value = 0;

do
{
value *= 10;
value += character - '0';

if (lastvalue > value)
{
return false; // overflow
}

lastvalue = value;
character = *(++str);
}
while ((uint32_t) (character - '0') <= 9U);

*number = value * sign;
return true;
}
break;
}
return false; // not found
}


Quote:
Original post by joe_bubamara
Yeah it is good question, where we should stop rolling own things instead of using libraries? :-). I believe that those conversion functions are done in optimized assembly, so that was reason why I am trying to use those rather then own thing.

cheers


It's probably optimized, but also you have to account for the function call itself, which normally will add about 60-100 cycles by itself, while a hand-rolled code could have the chance to be inlined (or forced to be) and integrate well into the code (with an additional advantage that registers can be reused and don't need to be push/popped from the stack each function call) :-)
Best to profile and see for yourself if it all makes sense in your particular piece of code.

Share this post


Link to post
Share on other sites
Quote:
Original post by plastiqueI tried to put the code into code/source formatters, but it had those issues. On second thought it also could have been the reply-preview that is buggy in this regard ...


Yes, the previewer is buggy. Your pluses are retained when you actually post it.

Share this post


Link to post
Share on other sites
I see your point with inlined code, I do agree with you, but I don't think it is of much use for me in this project. Imagine how a .obj file looks like - you would have to inline all this into many places - several places for almost any command parsed, since most commands take several parameters. Also note that 99% of numbers in .obj files are actually floats; and parsing floats is more jobb then parsing integers, so I believe it would add signifikant to binary. But it might come to use in some other project.

ps. your code needs few more checks for valid pointer; but I guess you wrote it just to illustrate the principle.

Share this post


Link to post
Share on other sites
Quote:
Original post by joe_bubamara
I see your point with inlined code, I do agree with you, but I don't think it is of much use for me in this project. Imagine how a .obj file looks like - you would have to inline all this into many places - several places for almost any command parsed, since most commands take several parameters. Also note that 99% of numbers in .obj files are actually floats; and parsing floats is more jobb then parsing integers, so I believe it would add signifikant to binary. But it might come to use in some other project.


Makes sense. A static function might be better in this regard.

Quote:
Original post by joe_bubamara
ps. your code needs few more checks for valid pointer; but I guess you wrote it just to illustrate the principle.


What do you mean ? Nope, this is working code. The only check that might be missing is to check if 'str' isn't null, but that's a bit like questioning the intelligence of the user of this function :P

Share this post


Link to post
Share on other sites
For careful parsing in C, don't use atoi or atof. As you've noted, if they don't like what they see, they just return 0, and if they do like what they see, you don't know how many characters they consumed.

But sscanf can be made to tell you exactly how many characters were consumed (%n), and can tell you how many fields were converted (return value), so with some care you can do a reasonably careful parser with it.

Or, as noted, you can write your own atoi(); that's easy. Writing your own atof() is not recommended, though, unless accuracy is not really an issue.

Share this post


Link to post
Share on other sites
Quote:
What do you mean ? Nope, this is working code. The only check that might be missing is to check if 'str' isn't null, but that's a bit like questioning the intelligence of the user of this function :P
ehh ... yepp .. and that is why you should always check for null :-), sometimes users are very stupid :-).

well seriously, source you through at it might come out from other parts of code where other errors could occured.

Every time you increment pointer you should check if it found '\0', unless you are perfectly sure how many valid characters are left in buffer. This is just my experience, I might be too carefull here.

As another point I went through great pain to eliminate any call to strtok function, exactly because of static memory it uses :-). Static functions are definitely forbidden; they should be thrown in deepest hole in the Earth and burried unless C/C++ standard specifies variables declared as static to live on thread stack space rather then global process space :-(.

I didn't know for %n option to scanf; however I go twice through every line and check number of tokens avialable; then I get second time and try to convert those tokens with scanf.

While we are talking about parsing and scanning, would you guys mind to take a look at little help library I wrote while writing this parser? I am done with .obj parser which parses all commands as specified in version 3.0 of parser (found on P. bourke's site) and will wait to release it until I finish .mtl. Meanwhile I got a bit bored and did this little abstraction layer for readin and tokenizing ascii files, which you can find on http://www.nextpoint.se. Check libToken project.

I don't have so much time for testing so maybe you can take a peek and give one or another thaught and opinion? It would be great ... sorry if I am using you too much here :-)

Share this post


Link to post
Share on other sites
Quote:
Original post by joe_bubamara
Every time you increment pointer you should check if it found '\0', unless you are perfectly sure how many valid characters are left in buffer. This is just my experience, I might be too carefull here.


If you go through the code I posted above you'll notice that only valid characters are checked (+, -, digits, whitespace), on all other characters the function will terminate (including the null terminator), that's why such additional checks are not necessary here.

Quote:
Original post by joe_bubamara
As another point I went through great pain to eliminate any call to strtok function, exactly because of static memory it uses :-). Static functions are definitely forbidden; they should be thrown in deepest hole in the Earth and burried unless C/C++ standard specifies variables declared as static to live on thread stack space rather then global process space :-(.


Don't confuse static functions with functions that make use of static data (like strtok, which is indeed not recommended as it also modifies the buffer you are tokenizing). The 'static' as how I meant it merely means the visibility of the function (= only visible in the current source file).

Share this post


Link to post
Share on other sites
Quote:
Don't confuse static functions with functions that make use of static data (like strtok, which is indeed not recommended as it also modifies the buffer you are tokenizing). The 'static' as how I meant it merely means the visibility of the function (= only visible in the current source file).


I can buy that; I was a bit fast there on word "static" :-).

I haven't compiled and run your code, I will do it later; but on glimpse it seems like you will do all "default" part on null character; also even if it + or - are followed by null character. Since '\0' == 0, you will get negative number which is <= 9u and you will get into the if statemnet.

As said didn't run through your code in debugger, will do it later.

Share this post


Link to post
Share on other sites
I just got an article on my little tokenizing library on the code proj :-) ... and I found several typos already hehe.

It would be nice if you can cast a look and give some opinion ...

http://www.codeproject.com/KB/cpp/libToken.aspx?fid=1544313

Share this post


Link to post
Share on other sites
Quote:
Original post by joe_bubamara
I haven't compiled and run your code, I will do it later; but on glimpse it seems like you will do all "default" part on null character; also even if it + or - are followed by null character. Since '\0' == 0, you will get negative number which is <= 9u and you will get into the if statemnet.


Admittedly, this comparison is a bit tricky since it uses the overflow behaviour of an integer (negative integer converted to unsigned integer will overflow to a rather high value), but it does the job in its efficient way (which is perfectly valid according to the C language specs as well of course) .

Just as an example:

int32_t si = -1;
uint32_t ui = (uint32_t) si;

As a result of the overflow the value of 'ui' is 0xFFFFFFFF now, the biggest possible uint32_t value (-2 would overflow to 0xFFFFFFFE, and so on ...).

Taken my example code,

(uint32_t) (character - '0') <= 9U

if the character is '\0', the result inside the brackets will be negative (-48 to be exact), but since it's casted to an unsigned integer it will overflow and that's why the comparison won't match here.

Hope this makes it more clear.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this