atoi & co

Recommended Posts

I am doing little .obj parse in C (not C++) and have to write some code to parse text data. Main parsing task is parsing actual numbers for coordinates, colors etc. I am using scanf & friens for main part of the job, but I am considering using atoi and atof instead. However those functions returns 0 if no conversion is performed??? How are you then supposed to know when an error occured and when a value of 0 is found? If I did't care about syntax and file validity checking, this would be fine, but I do. The documentation does not mention anything about error number being sett.

Share on other sites
You could use strtol(..). It can return (via the second paramater) a pointer to where ever it stopped parsing. You can use that to check the entire input string was converted.

Alan

Share on other sites
Or something obvious like:
if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9')
handle_error();

That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.

Share on other sites
Quote:
 Original post by Alan KempYou could use strtol(..). It can return (via the second paramater) a pointer to where ever it stopped parsing. You can use that to check the entire input string was converted.
You may also want to check whether the input was in range, e.g. strtol clips very small and very large values to LONG_MIN/LONG_MAX and sets errno to ERANGE to indicate the error.

E.g. something along these lines:
bool safe_atoi(const char *in, long *out) { char *end; errno = 0; *out = strtol(in, &end, 10); return end != in && !errno;}

Quote:
 Original post by samothOr something obvious like:if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9') handle_error();That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.
That won't cover negative zero or cases with leading whitespace.

Share on other sites
If you can use C++ code, then a more reliable approach IMHO is to use a string stream:

std::string s = "1234";std::istringstream stream(s);int out = 0;stream >> out;if (stream.eof() && !stream.fail()){  // parsed ok, 'out' is set to value}else{  // some kind of error}

This should also work for long/float/double.

Share on other sites
Quote:
 Original post by samothOr something obvious like:if((result = atoi(str)) == 0 && *str >= '0' && *str <= '9') handle_error();That's not pretty, but it should work just fine. If there is at least one numeric digit, atoi() will parse some value, it won't fail. So, zero means zero, not failure.

yeah - it was obvous :-) thnks for opening my eyes!

I would just turn the clausuls in if and add extra test for + resp - sign before the number:

if( (*str == '-' || *str == '+')    str ++;}if(*str >= '0' && *str <= '9')        result = atoi(str);else     str --; // back for - or + char

Feels like it should work. I will test it, and will also test if it is anything more efficient then sscanf - I am now done with .obj parser, so if I am going to change all that crap, it has to have some benefits, otherwise I don't care to do it.

Thnks for both you guys for answering and helping me out.

Share on other sites
@orangy - yeah I am quite used to streams in C++, but this project has to be in pure ansi C. No C++ allowed, but thks for interest.

Share on other sites
Why not write the atoi yourself ? It's not that difficult, and saves you from having to call a stdlib function (which may depend on a couple of system settings, like locales etc.).
I just tried to post a little code snippet myself, but it seems like the forum software swallowed all the plus signs which turned it all into completely guff.

Share on other sites
Quote:
 Original post by joe_bubamaraFeels like it should work. I will test it, and will also test if it is anything more efficient then sscanf - I am now done with .obj parser, so if I am going to change all that crap, it has to have some benefits, otherwise I don't care to do it.
You're still not accounting for whitespace. Whether or not this is a problem depends on how your parser splits up the strings, I suppose.
At any rate I'd strongly suggest making use of strtol instead if you want robust error detection, plus you won't have to manually tokenize the string to find the end of the number. Plus you get hexadecimal and octal literals for free by specifying base zero.

The other option is to do it manually of course: (untested code!)
bool safe_atoi(const char *in, long *out) { char ch; char prefix; unsigned long value; do  ch = isspace(*in++); while(isspace(ch)); prefix = ch; if(prefix == '+' || prefix == '-')  ch = *in++; if(!isdigit(ch))  return false; value = 0UL; do {  unsigned long overflow = value;  ch -= '0';  value *= 10UL;  value += ch;  if(value < overflow)   return false;  ch = *in++; } while(isdigit(ch)); if(prefix == '-')  return (*out = -(long) value) < 0L; else  return (*out = +(long) value) >= 0L;}

Share on other sites
Quote:
 Original post by plastiqueWhy not write the atoi yourself ? It's not that difficult, and saves you from having to call a stdlib function (which may depend on a couple of system settings, like locales etc.).I just tried to post a little code snippet myself, but it seems like the forum software swallowed all the plus signs which turned it all into completely guff.

Put your snippets into '[' code ']' and '[' /code ']' tags .... I have to add ' because of forum software, of course you use [ without '. Code tag will preserve tabs, spaces and other things. If you wish to have colorized syntax read the forum faq to see how it is used - I dislike that text area with colorized widget is so small so one sees only few lines of code at a time, so I almost never use it.

Yeah it is good question, where we should stop rolling own things instead of using libraries? :-). I believe that those conversion functions are done in optimized assembly, so that was reason why I am trying to use those rather then own thing.

cheers

Quote:
 You're still not accounting for whitespace.
Yeah, you are right, but I do eat whitespaces elsewhere in my code. I do use whites as delimiters and they are all eaten until tokens are parsed. Anyway here is a loop to check for whites and I forgott to check for null pointer in snippet above, here is also check to see if end of string is found ('\0' == 0):

while(*str && isspace(*str))        str ++;if( (*str && *str == '-' || *str == '+')    str ++;if(*str && *str >= '0' && *str <= '9')        result = atoi(str);else     str --; // back for - or + char

You don't have to find the end of the token nor to remove leading whitespaces. Itoa will eat itself until first - or + sign or first numeric value, and then convert any nymber of digits until it finds first non digit character. So it means that we don't have to remove trailing characters whatever they are; just be sure that we are throwing in a string that starts with whits or a number (inclusive + and - before the number).

docs: http://www.cplusplus.com/reference/clibrary/cstdlib/atoi/

[Edited by - joe_bubamara on July 23, 2009 8:32:49 AM]

Share on other sites
Quote:
 Original post by joe_bubamaraPut your snippets into '[' code ']' and '[' /code ']' tags .... I have to add ' because of forum software, of course you use [ without '. Code tag will preserve tabs, spaces and other things. If you wish to have colorized syntax read the forum faq to see how it is used - I dislike that text area with colorized widget is so small so one sees only few lines of code at a time, so I almost never use it.

I tried to put the code into code/source formatters, but it had those issues. On second thought it also could have been the reply-preview that is buggy in this regard ...
I'll just try it again, regardless of how messed up the preview will look like
(it's not surprising that it's not very unlike the code the guy after my last post added to this thread).

inline bool StrToI32(const char * str, int32_t * number){    *number = 0; // just in case it wasn't initialized    while (*str == ' ') ++str;    int32_t sign = 1;    char character;    switch (*str)    {        case '-':            sign = -1;            // fall through        case '+':            ++str;            // fall through        default:            character = *str;            if ((uint32_t) (character - '0') <= 9U) // is character a number ?            {                int32_t lastvalue = 0, value = 0;                do                {                    value *= 10;                    value += character - '0';                    if (lastvalue > value)                    {                        return false; // overflow                    }                    lastvalue = value;                    character = *(++str);                }                while ((uint32_t) (character - '0') <= 9U);                *number = value * sign;                return true;            }            break;    }    return false; // not found}

Quote:
 Original post by joe_bubamaraYeah it is good question, where we should stop rolling own things instead of using libraries? :-). I believe that those conversion functions are done in optimized assembly, so that was reason why I am trying to use those rather then own thing. cheers

It's probably optimized, but also you have to account for the function call itself, which normally will add about 60-100 cycles by itself, while a hand-rolled code could have the chance to be inlined (or forced to be) and integrate well into the code (with an additional advantage that registers can be reused and don't need to be push/popped from the stack each function call) :-)
Best to profile and see for yourself if it all makes sense in your particular piece of code.

Share on other sites
Quote:
 Original post by plastiqueI tried to put the code into code/source formatters, but it had those issues. On second thought it also could have been the reply-preview that is buggy in this regard ...

Yes, the previewer is buggy. Your pluses are retained when you actually post it.

Share on other sites
I see your point with inlined code, I do agree with you, but I don't think it is of much use for me in this project. Imagine how a .obj file looks like - you would have to inline all this into many places - several places for almost any command parsed, since most commands take several parameters. Also note that 99% of numbers in .obj files are actually floats; and parsing floats is more jobb then parsing integers, so I believe it would add signifikant to binary. But it might come to use in some other project.

ps. your code needs few more checks for valid pointer; but I guess you wrote it just to illustrate the principle.

Share on other sites
Quote:
 Original post by joe_bubamaraI see your point with inlined code, I do agree with you, but I don't think it is of much use for me in this project. Imagine how a .obj file looks like - you would have to inline all this into many places - several places for almost any command parsed, since most commands take several parameters. Also note that 99% of numbers in .obj files are actually floats; and parsing floats is more jobb then parsing integers, so I believe it would add signifikant to binary. But it might come to use in some other project.

Makes sense. A static function might be better in this regard.

Quote:
 Original post by joe_bubamaraps. your code needs few more checks for valid pointer; but I guess you wrote it just to illustrate the principle.

What do you mean ? Nope, this is working code. The only check that might be missing is to check if 'str' isn't null, but that's a bit like questioning the intelligence of the user of this function :P

Share on other sites
For careful parsing in C, don't use atoi or atof. As you've noted, if they don't like what they see, they just return 0, and if they do like what they see, you don't know how many characters they consumed.

But sscanf can be made to tell you exactly how many characters were consumed (%n), and can tell you how many fields were converted (return value), so with some care you can do a reasonably careful parser with it.

Or, as noted, you can write your own atoi(); that's easy. Writing your own atof() is not recommended, though, unless accuracy is not really an issue.

Share on other sites
Quote:
 What do you mean ? Nope, this is working code. The only check that might be missing is to check if 'str' isn't null, but that's a bit like questioning the intelligence of the user of this function :P
ehh ... yepp .. and that is why you should always check for null :-), sometimes users are very stupid :-).

well seriously, source you through at it might come out from other parts of code where other errors could occured.

Every time you increment pointer you should check if it found '\0', unless you are perfectly sure how many valid characters are left in buffer. This is just my experience, I might be too carefull here.

As another point I went through great pain to eliminate any call to strtok function, exactly because of static memory it uses :-). Static functions are definitely forbidden; they should be thrown in deepest hole in the Earth and burried unless C/C++ standard specifies variables declared as static to live on thread stack space rather then global process space :-(.

I didn't know for %n option to scanf; however I go twice through every line and check number of tokens avialable; then I get second time and try to convert those tokens with scanf.

While we are talking about parsing and scanning, would you guys mind to take a look at little help library I wrote while writing this parser? I am done with .obj parser which parses all commands as specified in version 3.0 of parser (found on P. bourke's site) and will wait to release it until I finish .mtl. Meanwhile I got a bit bored and did this little abstraction layer for readin and tokenizing ascii files, which you can find on http://www.nextpoint.se. Check libToken project.

I don't have so much time for testing so maybe you can take a peek and give one or another thaught and opinion? It would be great ... sorry if I am using you too much here :-)

Share on other sites
Quote:
 Original post by joe_bubamaraEvery time you increment pointer you should check if it found '\0', unless you are perfectly sure how many valid characters are left in buffer. This is just my experience, I might be too carefull here.

If you go through the code I posted above you'll notice that only valid characters are checked (+, -, digits, whitespace), on all other characters the function will terminate (including the null terminator), that's why such additional checks are not necessary here.

Quote:
 Original post by joe_bubamaraAs another point I went through great pain to eliminate any call to strtok function, exactly because of static memory it uses :-). Static functions are definitely forbidden; they should be thrown in deepest hole in the Earth and burried unless C/C++ standard specifies variables declared as static to live on thread stack space rather then global process space :-(.

Don't confuse static functions with functions that make use of static data (like strtok, which is indeed not recommended as it also modifies the buffer you are tokenizing). The 'static' as how I meant it merely means the visibility of the function (= only visible in the current source file).

Share on other sites
Quote:
 Don't confuse static functions with functions that make use of static data (like strtok, which is indeed not recommended as it also modifies the buffer you are tokenizing). The 'static' as how I meant it merely means the visibility of the function (= only visible in the current source file).

I can buy that; I was a bit fast there on word "static" :-).

I haven't compiled and run your code, I will do it later; but on glimpse it seems like you will do all "default" part on null character; also even if it + or - are followed by null character. Since '\0' == 0, you will get negative number which is <= 9u and you will get into the if statemnet.

As said didn't run through your code in debugger, will do it later.

Share on other sites
I just got an article on my little tokenizing library on the code proj :-) ... and I found several typos already hehe.

It would be nice if you can cast a look and give some opinion ...

http://www.codeproject.com/KB/cpp/libToken.aspx?fid=1544313

Share on other sites
Quote:
 Original post by joe_bubamaraI haven't compiled and run your code, I will do it later; but on glimpse it seems like you will do all "default" part on null character; also even if it + or - are followed by null character. Since '\0' == 0, you will get negative number which is <= 9u and you will get into the if statemnet.

Admittedly, this comparison is a bit tricky since it uses the overflow behaviour of an integer (negative integer converted to unsigned integer will overflow to a rather high value), but it does the job in its efficient way (which is perfectly valid according to the C language specs as well of course) .

Just as an example:

int32_t si = -1;
uint32_t ui = (uint32_t) si;

As a result of the overflow the value of 'ui' is 0xFFFFFFFF now, the biggest possible uint32_t value (-2 would overflow to 0xFFFFFFFE, and so on ...).

Taken my example code,

(uint32_t) (character - '0') <= 9U

if the character is '\0', the result inside the brackets will be negative (-48 to be exact), but since it's casted to an unsigned integer it will overflow and that's why the comparison won't match here.

Hope this makes it more clear.

Share on other sites
I was actually suspecting that you are counting on signed/unsigned overflow, but I didn't care much to details on sight reading.

Thnks for sharing; good to have in some project I guess.

Create an account

Register a new account

• Forum Statistics

• Total Topics
628360
• Total Posts
2982261

• 10
• 9
• 13
• 24
• 11