String handling in C

32 replies to this topic

#21 Ectara   Crossbones+   -  Reputation: 3059


Posted 12 March 2013 - 12:39 PM

It is very easy for users to make grave mistakes, and the CRT will have no way of detecting them. If you misuse strncpy() and omit copying the null-terminator, then trying to use strlen() later on will invoke undefined behavior, and likely crash. Additionally, for reasons mentioned above, anything involving knowing the string's length is a nightmare. Unless you pass the length yourself, calculating the length all of the time is unacceptable, because it has to loop over all of the characters to find the end. Using strcat() repeatedly means repeatedly finding the length of the destination string, then writing characters to the end of it. There's no way around that, unless you keep track of the length of the destination string after the end of each operation (which might require getting the length of the source string before each operation. D'oh!) Also, don't forget, if you are using strncat(), that its size argument is the maximum number of characters to append, not the size of the destination buffer; it always writes a null-terminator, so you need to leave room for one. Rookie mistake.
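One way to avoid rescanning the destination on every strcat() call is to carry the length alongside the buffer. The following is only a minimal sketch under stated assumptions: the StrBuf type and the strbuf_init() and strbuf_append() names are hypothetical and not part of the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: a caller-owned buffer plus its current length. */
typedef struct {
    char  *data;  /* storage supplied by the caller */
    size_t len;   /* characters in use, not counting the terminator */
    size_t cap;   /* total size of data, including room for the terminator */
} StrBuf;

/* Wraps caller-owned storage and starts it out as an empty string. */
static void strbuf_init(StrBuf *sb, char *storage, size_t cap)
{
    assert(sb != NULL && storage != NULL && cap > 0);
    sb->data = storage;
    sb->len = 0;
    sb->cap = cap;
    sb->data[0] = '\0';
}

/* Appends src without rescanning the destination.
   Returns 1 on success, 0 (leaving the buffer unchanged) if it would not fit. */
static int strbuf_append(StrBuf *sb, const char *src)
{
    size_t n = strlen(src);                  /* the one scan we cannot avoid */
    if (n >= sb->cap - sb->len)              /* need n chars plus the terminator */
        return 0;
    memcpy(sb->data + sb->len, src, n + 1);  /* copies the terminator too */
    sb->len += n;
    return 1;
}
```

Each append still has to measure the source string, as the post points out, but the destination is never walked again, so building a long string from many pieces no longer costs time proportional to the square of its length.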

You also don't have the luxuries of a higher-level string object: things like resizing the string whenever you please, iterating through characters starting from the _end_ of the string, finding the character in the middle of the string, etc. Since you could have gotten that string through any means, if it is determined that you don't have enough space in the string to store more characters, you might not have enough information to know whether the string was allocated with malloc(), on the stack, part of a memory-mapped file, or even garbage data. A std::basic_string has the ability to resize itself with its copy of its allocator class instance. A function that handles C-strings either must refuse to resize strings, or blindly trust that the caller has provided enough space for characters in the memory block. Worrying about this leads directly to using strncpy() and strncat(), and strncpy() has a case where a null-terminator isn't appended, for historical reasons!
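The strncpy() case mentioned above arises when the source is at least as long as the count passed in: the function fills the buffer and stops without writing a terminator. A small wrapper can terminate explicitly and report truncation. This is a sketch only; copy_terminated() is a hypothetical name, and it assumes the caller passes the true capacity of the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copies src into dst (capacity cap) and always terminates.
   Returns 1 if the whole string fit, 0 if it was truncated. */
static int copy_terminated(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL && cap > 0);
    strncpy(dst, src, cap - 1);  /* on its own, leaves no terminator when src is too long */
    dst[cap - 1] = '\0';         /* so terminate explicitly */
    return strlen(src) < cap;
}
```

The function still has to trust the capacity it is given, which is exactly the limitation described in the post: nothing in a char pointer says how much room is behind it.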

In closing, yes, it has all of the shortcomings of an array, and it lures the inexperienced (and sometimes the experienced, too) into somewhat of a false sense of security by hiding how fragile a C-string is, so they handle it wrong, and then the fun begins.




#22 mhagain   Crossbones+   -  Reputation: 8279


Posted 12 March 2013 - 02:40 PM

Back on topic, I once did this:

 

char *buf = (char *) malloc (strlen (str + 1));
strcpy (buf, str);

 

Ouch!!!!
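For anyone who missed it: the + 1 is inside the strlen() call, so the code measures the string starting at its second character. For a non-empty string that makes the allocation two bytes too small for what strcpy() writes, and for an empty string strlen() reads past the terminator. A corrected version might look like the sketch below; copy_string() is a hypothetical name, and the allocation check is an addition not present in the snippet above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of str, or NULL if allocation fails.
   The caller releases it with free(). */
static char *copy_string(const char *str)
{
    size_t size;
    char *buf;
    assert(str != NULL);
    size = strlen(str) + 1;      /* length plus one byte for the terminator */
    buf = malloc(size);
    if (buf != NULL)
        memcpy(buf, str, size);  /* size includes the terminator */
    return buf;
}
```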


Edited by mhagain, 12 March 2013 - 04:24 PM.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.


#23 Bearhugger   Members   -  Reputation: 567


Posted 12 March 2013 - 07:14 PM

I have to agree with Hodgman. In the case of C strings vs std::string, it's not black and white: there are cases where C strings are perfectly viable to use.

 

Generally, I prefer to use const char* in my public methods, and use std::string internally. The first reason is that different compilers (sometimes even different versions of the same compiler) will produce incompatible binaries from the same template library, which can cause very bad bugs when you call a method that takes a std::string across DLL boundaries. This basically forces you to distribute a runtime for every single compiler. (Ogre is the perfect example of this.) The second reason I restrict myself to C strings in public interfaces is that if a function takes a std::string as a parameter, passing a string literal constructs a temporary, as if you had written 'std::string("Hello, world!")', in which case you're depending on the compiler and implementation to optimize it. On the other hand, passing a pointer to a char array consists of one very basic operation.

 

Of course, when it comes to string manipulation (or just storing copies), you'd be crazy to not use std::string. I agree 100% that strncat and friends are annoying, error-prone and very vulnerable to attacks. Basically, the only C functions I use are strlen() and wcslen() for getting the length of character arrays; for other operations there's no reason not to use std::string.

 

As for multi-byte strings, do people actually use that in game projects? For performance reasons, I'd rather use 8- or 16-bit characters anyway.



#24 Khatharr   Crossbones+   -  Reputation: 3040


Posted 12 March 2013 - 11:14 PM

But without C style string manipulation we don't get all of the exciting possibilities of buffer overrun exploits! Ahh... The good old days....


void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

#25 swiftcoder   Senior Moderators   -  Reputation: 10367


Posted 13 March 2013 - 08:53 AM

As for multi-byte strings, do people actually use that in game projects? For performance reasons, I'd rather use 8- or 16-bit characters anyway.

Since when were string operations a major performance bottleneck in games?

If you plan on localising for non-Latin scripts, you probably want to use UTF-8 or UTF-16, both of which are variable-width encodings (a single character can take more than one code unit).
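To illustrate what variable width means for C string code: in UTF-8, strlen() returns the number of bytes, not the number of characters. A rough sketch of counting code points follows; utf8_count() is a hypothetical name, and the function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a NUL-terminated UTF-8 string.
   Assumes the input is valid UTF-8; no validation is done. */
static size_t utf8_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte starts a code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Code points are still not the same thing as what a user perceives as characters (combining marks are counted separately), so this is only a first approximation.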

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#26 Kylotan   Moderators   -  Reputation: 3338


Posted 13 March 2013 - 11:15 AM

Yup - it's 2013, and well past time that we accepted that ASCII is not good enough.



#27 mhagain   Crossbones+   -  Reputation: 8279


Posted 13 March 2013 - 12:05 PM

Since when were string operations a major performance bottleneck in games?

 

Unless one is doing a lot of text-parsing at runtime (in which case it may be an appropriate topic for a new post in this subforum...)




#28 Sik_the_hedgehog   Crossbones+   -  Reputation: 1833


Posted 13 March 2013 - 12:21 PM

You're not supposed to do that; at worst you're supposed to do all text parsing at load time, leaving run-time parsing only for showing stuff on screen. Pretty much everything a game is bound to need can be converted to integers and such.

 

Though I do know of an engine that requires you to fetch all resources through strings. And you need to do this every time you intend to use a resource, and consider that just about every object in the map is bound to use at least one or two of those resources every frame... (a similar issue happens if you use string indices instead of integer indices for arrays)


Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.

#29 slicer4ever   Crossbones+   -  Reputation: 3983


Posted 13 March 2013 - 05:28 PM

Though I do know of an engine that requires you to fetch all resources through strings. And you need to do this every time you intend to use a resource, and consider that just about every object in the map is bound to use at least one or two of those resources every frame... (a similar issue happens if you use string indices instead of integer indices for arrays)

eh? that sounds really broken, can't you hold a pointer to the resource? or does the resource manager also control how the resource is used (which still sounds broken to me)?


Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#30 Olof Hedman   Crossbones+   -  Reputation: 2950


Posted 14 March 2013 - 06:34 AM

gcc looks at the format string for printf & co and gives a warning I think. It has to be built in to the compiler (or via metadata related to a function declaration) since using variable length argument lists removes all checking to do with type and number of arguments...

 

Clang also does this, plus, it also checks that the format string is correct with respect to argument types. Really handy!

 

Also LLVM (used with XCode for iOS and OSX). Indeed really handy!
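The "metadata related to a function declaration" can be applied to your own printf-style wrappers too. GCC and Clang both accept a format attribute for this. The sketch below uses hypothetical names (PRINTF_LIKE, format_message); only the attribute itself and vsnprintf() come from the compilers and the standard library.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* GCC and Clang understand this attribute; other compilers get an empty macro. */
#if defined(__GNUC__) || defined(__clang__)
#define PRINTF_LIKE(fmt, first) __attribute__((format(printf, fmt, first)))
#else
#define PRINTF_LIKE(fmt, first)
#endif

/* Hypothetical helper: argument 3 is the format string, and the variadic
   arguments it describes start at argument 4. */
static int format_message(char *buf, size_t cap, const char *fmt, ...) PRINTF_LIKE(3, 4);

/* Formats into buf and returns whatever vsnprintf() returns. */
static int format_message(char *buf, size_t cap, const char *fmt, ...)
{
    va_list args;
    int written;
    assert(buf != NULL && fmt != NULL);
    va_start(args, fmt);
    written = vsnprintf(buf, cap, fmt, args);
    va_end(args);
    return written;
}
```

With the attribute in place, a call such as format_message(buf, sizeof buf, "%d", "text") should draw a format warning when format checking is enabled (it is part of -Wall on both compilers), just as a mismatched printf() call would.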



#31 Kylotan   Moderators   -  Reputation: 3338


Posted 14 March 2013 - 07:20 AM

Pretty much everything a game is bound to need can be converted to integers and such.

 

What if your game mostly revolves around text?



#32 Sik_the_hedgehog   Crossbones+   -  Reputation: 1833


Posted 14 March 2013 - 10:13 PM

Though I do know of an engine that requires you to fetch all resources through strings. And you need to do this every time you intend to use a resource, and consider that just about every object in the map is bound to use at least one or two of those resources every frame... (a similar issue happens if you use string indices instead of integer indices for arrays)

eh? that sounds really broken, can't you hold a pointer to the resource? or does the resource manager also control how the resource is used (which still sounds broken to me)?

Custom scripting language >.>' But even then, it could have had a retrieve ID function or something, but nope, the string is the ID, so e.g. if you want to play a sound effect you need to pass the name of the sound effect, if you want to switch to a specific sprite you need to pass the name of the sprite, etc. I suppose it's sorta mitigated by hashing, but integers/pointers/whatever as IDs would still have been a ton faster compared to strings.

 

As far as I know that engine has never been pushed to its limits, so maybe that's why nobody has been bothered by it so far. I presume that will eventually happen some time in the future, though. The only upside of its approach is that it may be slightly easier for beginners to get running.



#33 BGB   Crossbones+   -  Reputation: 1554


Posted 14 March 2013 - 10:41 PM


Since when were string operations a major performance bottleneck in games?

 
Unless one is doing a lot of text-parsing at runtime (in which case it may be an appropriate topic for a new post in this subforum...)


or, if major pieces of engine infrastructure are based on strings...

(good or not, it can sometimes end up happening this way...).


this is basically things like using strings to identify things, and using constructs like:
if(!strcmp(str, "_foo_t"))
{
...
}else if(!strcmp(str, "_bar_t"))
{
...
}else if ...

which, if not careful, can end up eating a lot of time, and then one is left to try to figure out why "strcmp()" has jumped to the top of the list in the profiler (*1).

but, at least, one can intern the strings, and in these cases using '==' and '!=' on the pointers can lead to slightly faster string comparisons (but has other drawbacks, like often the need to cache literals in variables, or resort to ugly hacks). (if both are already interned, it is basically just the cost of the pointer comparison).
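a rough sketch of what interning could look like, under heavy assumptions: the intern() name, the fixed-size table and the linear search are all made up for illustration (a real interner would more likely use a hash table and grow on demand).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size intern table; a real one would hash and grow. */
#define INTERN_MAX 64
static char *intern_table[INTERN_MAX];
static size_t intern_count = 0;

/* Returns the canonical pointer for s, or NULL if the table is full
   or memory runs out. Equal strings always yield the same pointer. */
static const char *intern(const char *s)
{
    size_t i, size;
    char *copy;
    assert(s != NULL);
    for (i = 0; i < intern_count; ++i)
        if (strcmp(intern_table[i], s) == 0)
            return intern_table[i];   /* already interned */
    if (intern_count == INTERN_MAX)
        return NULL;
    size = strlen(s) + 1;
    copy = malloc(size);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, size);
    intern_table[intern_count++] = copy;
    return copy;
}
```

once both sides have gone through intern(), a check like 'a == b' stands in for '!strcmp(a, b)'. the cost has moved to the intern() call, which is why literals tend to get cached in variables.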

some of this may result because strings are self-describing and easier to use as decentralized unique IDs than integers, and generally also easier to work with than GUIDs.

ADD: another past trick is to basically use a hash-table to quickly map a string to an integer based index (the position of the string within an array of strings), and then use this index with a "switch()", which can at least generally be faster than a long strcmp() based if/else chain, and generally comparing favorably to a big nested switch (less awful looking, and also faster in many cases).
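the hash-table-plus-switch trick might look roughly like the sketch below. everything named here is an assumption for illustration: the type names, the table size, the FNV-1a style hash and the type_index() / describe() functions are not from any particular engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type names; the enum values are positions in type_names. */
enum { TYPE_FOO, TYPE_BAR, TYPE_BAZ, TYPE_COUNT };
static const char *const type_names[TYPE_COUNT] = { "_foo_t", "_bar_t", "_baz_t" };

#define SLOT_COUNT 8              /* power of two, larger than TYPE_COUNT */
static int slots[SLOT_COUNT];     /* holds index + 1; 0 marks an empty slot */
static int slots_ready = 0;

/* FNV-1a style hash, chosen only because it is short. */
static unsigned long hash_string(const char *s)
{
    unsigned long h = 2166136261UL;
    for (; *s != '\0'; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619UL;
    }
    return h;
}

/* Fills the table once, using linear probing on collisions. */
static void build_slots(void)
{
    int i;
    for (i = 0; i < TYPE_COUNT; ++i) {
        unsigned long pos = hash_string(type_names[i]) & (SLOT_COUNT - 1);
        while (slots[pos] != 0)
            pos = (pos + 1) & (SLOT_COUNT - 1);
        slots[pos] = i + 1;
    }
    slots_ready = 1;
}

/* Returns the index of name in type_names, or -1 if it is unknown. */
static int type_index(const char *name)
{
    unsigned long pos;
    assert(name != NULL);
    if (!slots_ready)
        build_slots();
    pos = hash_string(name) & (SLOT_COUNT - 1);
    while (slots[pos] != 0) {
        if (strcmp(type_names[slots[pos] - 1], name) == 0)
            return slots[pos] - 1;
        pos = (pos + 1) & (SLOT_COUNT - 1);
    }
    return -1;
}

/* Dispatches on the integer index instead of a strcmp() chain. */
static const char *describe(const char *name)
{
    switch (type_index(name)) {
    case TYPE_FOO: return "foo logic";
    case TYPE_BAR: return "bar logic";
    case TYPE_BAZ: return "baz logic";
    default:       return "unknown type";
    }
}
```

a lookup typically costs one hash plus one strcmp() on a hit, rather than one strcmp() per branch of the if/else chain. if the index is looked up once and stored, the per-iteration cost drops to the switch alone.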


*1: like, one time in my renderer (earlier on), I ended up profiling things, and observing that "strcmp()" was at the top of the profiler list. I then looked into it and found that this was because an inner loop (related to querying objects) was falling back to one of these strcmp() if/else chains (dispatching to the logic for each specific model type) for each iteration of the loop (which at the time was also a linear search over every object in the world).

things have improved at least slightly since then (much of this logic has since been moved to vtables, ...).

(actually, much of the engine runs on top of a dynamic type-system, itself based mostly around string-based type-ID names, which are used for pretty much every heap-allocated object in the engine, ...).

nevermind cases where strings and while-loops directly drive program logic in a few places (typically "type signature strings", ...), ...

and also the frequent use of strings to identify things like entity field-names, the contents of a database-like structure, file paths, ...


also if one builds parts of their logic on top of working with DOM-like XML trees or similar (like, using XML trees as a data-structure for representing other data), this can also involve using a lot of strings. historically, some code had also worked largely by walking XML trees and dispatching to logic, but most of this code went away as the performance was often a bit lacking (the only major examples left have since largely been relegated to offline tools).

a lot of other code uses walking Lisp-like lists instead, which are a bit faster. (lists are basically a tree-structure composed of linked-lists of "cons-cells", with each list holding a string identifying its contents, ...).


so, depending on the code, strings can be a big deal.

though, it makes sense to avoid a lot of stuff like this in performance-critical areas or as part of the main execution path.

Edited by cr88192, 14 March 2013 - 11:31 PM.





