dave j

String handling in C


dave j    681
In a former colleague's code about 14 years ago:

sprintf(str, "%s %s %s", a, b);

The value of str was displayed on screen after this. This had been live at a bank for a couple of years before it was discovered. The reason it took so long to notice is that the value on the stack that was used for the third string's address happened to point to a byte containing zero.


gcc looks at the format string for printf & co and gives a warning I think. It has to be built in to the compiler (or via metadata related to a function declaration) since using variable length argument lists removes all checking to do with type and number of arguments...
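As a rough sketch of the "metadata related to a function declaration" idea: GCC and Clang accept a `format` attribute on your own variadic functions, which gives them the same checking as printf. The names `format_into` and `PRINTF_LIKE` are made up for illustration.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical macro: expands to the format attribute on GCC-compatible
   compilers (Clang also defines __GNUC__) and to nothing elsewhere. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, args_idx) \
    __attribute__((format(printf, fmt_idx, args_idx)))
#else
#define PRINTF_LIKE(fmt_idx, args_idx)
#endif

/* Argument 3 is the format string; the variadic arguments start at 4. */
int format_into(char *dst, size_t cap, const char *fmt, ...) PRINTF_LIKE(3, 4);

/* Hypothetical wrapper around vsnprintf; returns what vsnprintf returns. */
int format_into(char *dst, size_t cap, const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int n = vsnprintf(dst, cap, fmt, args);
    va_end(args);
    return n;
}
```

With format warnings enabled (GCC's -Wformat, which is part of -Wall), a call like `format_into(buf, sizeof buf, "%s %s %s", a, b)` should be flagged at compile time as having too few arguments for the format.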

Alternate-E    215

Could be worse, could be a web of pointers so convoluted that they point to nothing while trying to point to some embedded function, with an over-called string in it that still works for some reason.  *shudders*

patrrr    1323

gcc looks at the format string for printf & co and gives a warning I think. It has to be built in to the compiler (or via metadata related to a function declaration) since using variable length argument lists removes all checking to do with type and number of arguments...

 

Clang also does this, and it checks that the argument types match the format string. Really handy!

mhagain    13430

Care to elaborate? :D

I purposely stayed away from C's formatted output for the brief time I was learning C++.

You can overflow the buffer at any time, you don't know how long it is, you need to walk over the entire string in order to do any operation (which can easily lead to O(n^2) algorithms) - strings in C basically contain everything that one should not do if one was going to design a string library.  See http://en.wikipedia.org/wiki/C_string_handling#Criticism and http://www.joelonsoftware.com/articles/fog0000000319.html for more.

dave j    681

Does any C or even C++ compiler catch a mismatch like that? That's a terrible bug to have. Code reviews FTW.


This was an IBM C compiler which didn't perform any such checks. I don't think any did at the time.

The team were supposed to do code reviews and should have picked this up then. My job was developer support which included solving "our code's crashing and we don't know why" type problems. In this case I was given a memory dump and asked to figure out what was going wrong.

dave j    681

Care to elaborate? :D

I purposely stayed away from C's formatted output for the brief time I was learning C++.

Each %s in the string means there should be another parameter that is a pointer to a string. The line should look like:

sprintf(str, "%s %s %s", a, b, c);
Because the function is expecting another parameter on the stack to go with the third %s, it will use whatever is in the next memory location after the b. This could be anything!
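For what it's worth, here is a minimal sketch of a safer version of that line, assuming the buffer size is known at the call site. The helper name `join3` is hypothetical; snprintf itself is standard C99.

```c
#include <stdio.h>

/* Hypothetical helper: writes "a b c" into dst, never writing more than
   cap bytes. Returns 0 on success, -1 if the output was truncated or
   snprintf reported an error. */
int join3(char *dst, size_t cap, const char *a, const char *b, const char *c)
{
    /* Three %s conversions, three string arguments. */
    int n = snprintf(dst, cap, "%s %s %s", a, b, c);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return 0;
}
```

This does not stop anyone from leaving out an argument, so it still relies on compiler warnings or a review to catch the original mistake. It only bounds the write and reports truncation.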


Care to elaborate? :D

I purposely stayed away from C's formatted output for the brief time I was learning C++.

You can overflow the buffer at any time, you don't know how long it is, you need to walk over the entire string in order to do any operation (which can easily lead to O(n^2) algorithms) - strings in C basically contain everything that one should not do if one was going to design a string library.  See http://en.wikipedia.org/wiki/C_string_handling#Criticism and http://www.joelonsoftware.com/articles/fog0000000319.html for more.

Actually it isn't just strings, it's arrays in general that suffer from that (actually with generic arrays it's even worse - with strings at least you can expect it to stop when there's a zero, with an array the only way to be 100% sure of the length is to pass it separately). Strings just happen to be one specific application of an array (to the point all array operations work on them).
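A small sketch of the "pass it separately" point, with made-up names (`sum_ints`, `ARRAY_COUNT`): once an array is passed to a function it is just a pointer, so the element count has to travel alongside it.

```c
#include <stddef.h>

/* The parameter is a plain pointer; the callee cannot recover how many
   elements it points to, so the caller supplies the count. */
long sum_ints(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

/* Hypothetical macro: element count of a real array. It only works where
   the array declaration is visible, not on a pointer parameter. */
#define ARRAY_COUNT(arr) (sizeof(arr) / sizeof((arr)[0]))
```

A caller that declared `int v[] = {1, 2, 3, 4};` would write `sum_ints(v, ARRAY_COUNT(v))`. Inside `sum_ints` the same macro would silently give the wrong answer, which is exactly the trap.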

wintertime    4108

I always wonder when people just use printf-like functions with %s or even without a single %. Don't they know there are things like fputs, strcpy, strcat which don't need to parse possibly wrong format strings?


Then the code above would have been equivalent to this (bug included):

strcpy(str, a);
strcat(str, " ");
strcat(str, b);
strcat(str, " ");

I know that isn't optimal (it'll read all of the string thrice) and you can make it faster, but then the code becomes less clear and can be much harder to read. Not that this code isn't error-prone anyway - I wonder how many programmers end up reading the strcpy as strcat. So in that sense sprintf looks like a good thing because it makes the code more concise without giving up much on readability (if we're talking about just a single string then it's overkill though).
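For reference, one way the "make it faster" variant could look, as a sketch only: keep a pointer to the current end of the string so nothing gets rescanned. Both function names are hypothetical, and the bounds check is an addition the strcat version doesn't have.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src (with its terminator) to `end`, which
   must point at the current terminator inside a buffer that stops at
   `limit`. Returns the new end, or NULL if src does not fit. */
char *append_at(char *end, const char *limit, const char *src)
{
    size_t len = strlen(src);
    if (len >= (size_t)(limit - end))
        return NULL;               /* no room for src plus terminator */
    memcpy(end, src, len + 1);
    return end + len;
}

/* Builds "a b" in dst; returns 0 on success, -1 if it does not fit. */
int join_with_space(char *dst, size_t cap, const char *a, const char *b)
{
    if (cap == 0)
        return -1;
    const char *limit = dst + cap;
    char *end = dst;
    *end = '\0';
    if ((end = append_at(end, limit, a)) == NULL) return -1;
    if ((end = append_at(end, limit, " ")) == NULL) return -1;
    if ((end = append_at(end, limit, b)) == NULL) return -1;
    return 0;
}
```

Each source string is still measured once with strlen, but the destination is never walked again. Whether that is clearer than a single sprintf call is debatable.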

Kylotan    9860

Care to elaborate? :D

 

I purposely stayed away from C's formatted output for the brief time I was learning C++.

 

C doesn't have strings. It has arrays of characters, and some fancy goggles for the programmer which make those arrays look and act a bit like strings if you're very careful.

Khatharr    8812

Then the code above would have been equivalent to this (bug included):

strcpy(str, a);
strcat(str, " ");
strcat(str, b);
strcat(str, " ");
I know that isn't optimal (it'll read all of the string thrice) and you can make it faster, but then the code becomes less clear and can be much harder to read. Not that this code isn't error-prone anyway - I wonder how many programmers end up reading the strcpy as strcat. So in that sense sprintf looks like a good thing because it makes the code more concise without giving up much on readability (if we're talking about just a single string then it's overkill though).


char* unknown;
strcpy(str, a);
strcat(str, " ");
strcat(str, b);
strcat(str, " ");
strcat(str, unknown);
I fixed your bug for you, sir.

Hodgman    51231

"String handling in C" is a coding horror all on its own - no further comment is necessary.

I'd say "string manipulation in C" is a coding horror, but consuming read-only strings in C is refreshingly lacking in unnecessary abstraction.

 

In my C++ engine, I don't use any string classes. Instead I choose to use const char* for any strings, simply because I don't do any string manipulation at all, so the simplest solution works fine ;)

[edit] to clarify, this also means not using any of the C standard library functions that work on strings [/edit]

TheChubu    9448

Care to elaborate? :D

I purposely stayed away from C's formatted output for the brief time I was learning C++.

You can overflow the buffer at any time, you don't know how long it is, you need to walk over the entire string in order to do any operation (which can easily lead to O(n^2) algorithms) - strings in C basically contain everything that one should not do if one was going to design a string library.  See http://en.wikipedia.org/wiki/C_string_handling#Criticism and http://www.joelonsoftware.com/articles/fog0000000319.html for more.

 

 

Care to elaborate? :D

I purposely stayed away from C's formatted output for the brief time I was learning C++.

Each %s in the string means there should be another parameter that is a pointer to a string. The line should look like:

sprintf(str, "%s %s %s", a, b, c);
Because the function is expecting another parameter on the stack to go with the third %s, it will use whatever is in the next memory location after the b. This could be anything!

 

 

Care to elaborate? :D

 

I purposely stayed away from C's formatted output for the brief time I was learning C++.

 

C doesn't have strings. It has arrays of characters, and some fancy goggles for the programmer which make those arrays look and act a bit like strings if you're very careful.

Oh I see then it might trash the memory, thanks!


"String handling in C" is a coding horror all on its own - no further comment is necessary.

I'd say "string manipulation in C" is a coding horror, but consuming read-only strings in C is refreshingly lacking in unnecessary abstraction.

Only as long as you're reading it sequentially. If you ever need to know the length, you'll need to use strlen which traverses the entire string (and thereby is a performance penalty), and if you're using a variable-length encoding such as UTF-8, consider yourself screwed as all the functions work on chars rather than the proper characters (e.g. in that case strlen would return the number of bytes, rather than the number of characters).
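To illustrate the bytes-versus-characters point, here is a minimal sketch that counts Unicode code points in a string, assuming the input is valid UTF-8. The function name is made up, and a code point is still not always what a user would call a character (combining marks, for example).

```c
#include <stddef.h>

/* Counts code points in a NUL-terminated string, assuming valid UTF-8.
   Continuation bytes have the bit pattern 10xxxxxx; every other byte
   starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the UTF-8 encoding of "naïve" (written in C as "na\xC3\xAFve"), strlen reports 6 bytes while this reports 5 code points. Note that it still has to walk the whole string, just like strlen.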

dave j    681

Oh I see then it might trash the memory, thanks!


It's not just that. If the value that happens to be on the stack is not a valid address, using it as one can crash the program.

mhagain    13430

If you're lucky it trashes the memory and gives you a nice clean crash at the point where things started going wrong.

 

More normally, it seems to work OK, but at some arbitrary point later and in a completely different part of your code weird things start happening.

 

Yeah, strings are just arrays, but I think it's worth singling out strings here because if you're using an array you've normally got an extra level of awareness of what you're doing, whereas the CRT tries to look like it's pretending that strings are some kind of special case or something different, which may lead the unwary into thinking that they're OK.

Ectara    3097

It is very easy for users to make grave mistakes, and the CRT will have no way of detecting them. If you misuse strncpy(), and omit copying the null-terminator, then trying to use strlen() later on will invoke undefined behavior, and likely crash. Additionally, for reasons mentioned above, anything involving knowing the string's length is a nightmare. Unless you pass the length yourself, calculating the length all of the time is unacceptable, because it has to loop over all of the characters to find the end. Using strcat() repeatedly means repeatedly finding the length of the destination string, then writing characters to the end of it. There's no way around that, unless you keep track of the length of the destination string after the end of each operation (which might require getting the length of the source string before each operation. D'oh!) Also, don't forget that the count you give strncat() is the maximum number of characters to append, not the size of the destination; it always writes a null-terminator after them, so the destination needs room for one byte more than the count. Rookie mistake.
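A minimal sketch of the strncpy() pitfall and the usual workaround, assuming the destination size is known. The helper name `copy_terminated` is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: copies at most cap - 1 bytes of src into dst and
   always terminates the result. cap must be greater than 0.
   Returns 1 if src fit completely, 0 if it was truncated. */
int copy_terminated(char *dst, size_t cap, const char *src)
{
    /* strncpy writes no terminator when src has cap - 1 or more
       characters, so the last byte is set explicitly afterwards. */
    strncpy(dst, src, cap - 1);
    dst[cap - 1] = '\0';
    return strlen(src) < cap;
}
```

Without the explicit `dst[cap - 1] = '\0'`, copying "abcdef" into a 4-byte buffer would leave it unterminated, and a later strlen() on it would run off the end.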

You also don't have the luxuries of a higher-level string object: things like resizing the string whenever you please, iterating through characters starting from the _end_ of the string, finding the character in the middle of the string, etc. Since you could have gotten that string through any means, if it is determined that you don't have enough space in the string to store more characters, you might not have enough information to know whether the string was allocated with malloc(), on the stack, part of a memory-mapped file, or even garbage data. A std::basic_string has the ability to resize itself with its copy of its allocator class instance. A function that handles C-strings either must refuse to resize strings, or blindly trust that the caller has provided enough space for characters in the memory block. Worrying about this leads directly to using strncpy() and strncat(), and strncpy() has a case where a null-terminator isn't appended, for historical reasons!

In closing, yes, it has all of the shortcomings of an array, and it lures the inexperienced (and sometimes the experienced, too) into somewhat of a false sense of security, by hiding how fragile a C-string is, causing them to handle them wrong and then the fun begins.

Bearhugger    1276

I have to agree with Hodgman. In the case of C strings vs std::string, it's not black and white: there are cases where C strings are perfectly viable to use.

 

Generally, I prefer to use const char* in my public methods, and use std::string internally. The first reason is that different compilers (sometimes even different versions of the same compiler) will produce incompatible binaries from the same template library, which can cause very bad bugs when you call a method that takes a std::string across DLL boundaries. This basically forces you to distribute a runtime for every single compiler. (Ogre is the perfect example of this.) The second reason I restrict myself to C strings in public interfaces is that if a function takes a std::string for parameter, passing a constant literal constructs a temporary, as if you had written 'const std::string("Hello, world!")', in which case you're depending on the compiler and implementation to optimize it. On the other hand, passing a pointer to a char array consists of one very basic operation.

 

Of course, when it comes to string manipulation (or just storing copies), you'd be crazy to not use std::string. I 100% agree that strncat and friends are annoying, error-prone and very vulnerable to attacks. Basically, the only C functions I use are strlen() and wcslen() for getting the length of character arrays; for other operations there's no reason not to use std::string.

 

As for multi-byte strings, do people actually use that in game projects? For performance reasons, I'd rather use 8- or 16-bit characters anyway.

swiftcoder    18429

As for multi-byte strings, do people actually use that in game projects? For performance reasons, I'd rather use 8- or 16-bit characters anyway.

Since when were string operations a major performance bottleneck in games?

If you plan on localising for non-Latin scripts, you probably want to use UTF-8 or UTF-16 - both of which are variable-width encodings, so a character can take more than one code unit.

