Endurion

Use which kind of generic string?


Long time hobby developer here.

 

Over the years I built up a game framework based on Win32/DirectX. Since the days of old it's been using std::string for pretty much everything, simple lookup strings as well as file names.

 

Now the days of ASCII are going away. On Windows this means file names are Unicode. But I wouldn't want every little string being stored as Unicode. This forces me to distinguish between file names and other strings.

I'm not overly fond of UTF8, as variable character sizes sound like a major annoyance for GUI display.

 

How do others cope with this? Use different string types? Am I overthinking things by trying to avoid Unicode strings?

 


Currently we store our strings internally as UTF-8 and perform operations between UTF-8 strings. The cool thing about UTF-8 is that it works perfectly with std::string (or, in our case, a legacy string class implemented around char*), since a 0 byte is only found at the end of the string.
Under Windows we convert to Unicode/MBCS when needed (for example, for WinAPI functions such as OpenFile, SendMessage, CreateWindow...).
The Unix-based platforms already use UTF-8 as their native string encoding.

 

Actually, a good starting point would be to do what the guys from Autodesk Maya do with their MString (see the MString::asChar method, which returns the string value in a platform-native format).

BTW, a side hint:
In the MSVS debugger's watch window, if you type "mystringVariable, s8", the value of "mystringVariable" is displayed as a UTF-8 string and not as an MBCS string ^_^

Edited by imoogiBG

UTF8 works fine. Unicode sucks no matter how you deal with it, since it is way over-engineered and has way too many redundancies and special cases (unless you simply choose to ignore all these issues; then Unicode is a "just works" thing). But UTF8 not only plays nicely with your 40-year-old C string functions and is kinda intuitive otherwise (think sorting), it also has no more quirks than necessary. Using stuff like wchar_t, you may end up discovering that wide characters have different sizes on different systems. Surprise, surprise.
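
To illustrate the sorting remark: comparing UTF-8 strings byte by byte yields the same order as comparing them by code point, so the ordinary std::string comparison is enough. This is only a minimal sketch; sort_utf8 is a made-up helper name, and code point order is not the same thing as language-aware collation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: sorts UTF-8 strings with the plain byte-wise
// std::string comparison. std::string compares bytes as unsigned values,
// and UTF-8 is laid out so that this byte order equals code point order.
std::vector<std::string> sort_utf8(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}
```

For example, sorting "z", "\xC3\xA9" (U+00E9, e with acute accent) and "a" puts the accented letter last, because its lead byte 0xC3 is greater than any ASCII byte.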

I'm not overly fond of UTF8, as variable character sizes sound like a major annoyance for GUI display.

Well, with the "standard Unicode" strings under Windows, you have the worst of both worlds. You have wide characters, and you still have variable character sizes (since Windows uses UTF-16). Unless you use WinXP or Win2K, in which case you have "broken Unicode" (UCS-2). In that respect, UTF-8 is no worse, only better. Same amount of trouble, but it takes only about 50-60% as much storage (for anything non-Chinese). Plus, English text is just plain, normal, readable ASCII text even in your favorite hex editor (no zero bytes in between), and even European languages are 80-90% multibyte-free, plain ordinary ASCII characters.
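
As a rough illustration of why UTF-16 is still a variable-length encoding, here is a sketch that encodes a single code point into UTF-16 code units. The function name encode_utf16 is invented for this example, and it assumes the input is a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16.
// Code points below U+10000 fit in one 16-bit unit; everything above
// needs two units (a surrogate pair).
// Assumption: cp is a valid code point and not a surrogate itself.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

U+0041 ('A') comes out as one code unit, while U+1F600 comes out as the pair 0xD83D 0xDE00, so counting 16-bit units does not count characters either.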

The only real inconvenience about UTF8 is that -- under Windows -- you have to convert all strings that you are receiving from the OS in some way, and you may have to convert any strings you pass to an API, too. Takes 10 minutes of consideration once when writing your API wrapper. Or you may simply not care and use filenames which work with legacy functions! After all, most of the time, as a game developer, you get to choose the file name. Choose wisely, and forget about it.

All in all, UTF8 is no biggie most of the time. Under Linux, UTF8 is the default anyway, so no troubles at all there.


I avoid ever using wide characters - UTF8 is much better IMHO.

 

There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

In some places there's debug-helpers that use const char*'s which are assumed to point to string literals, or other immutable allocations with longer lifetimes than the pointers themselves.

 

The only place that strings are used is within the GUI stuff, where some 3rd-party library decodes UTF8 char streams into a series of glyphs to be rendered, and passes a stream of textured quads back to my renderer. Any char container is suitable for this.


There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

 

It's the trio of things that are commonly used, but an engine should never have to deal with: strings, files and quaternions.


If you want cross-platform internationalization, use UTF-8. You can convert UTF-8 to Windows wide strings (and the reverse) using MultiByteToWideChar (WideCharToMultiByte) with CP_UTF8 for use with Win32 APIs.

If you're only developing for Windows, with no plan for porting, just use Windows wide strings (UTF-16).

If you're not internationalizing your project, all this talk is pretty pointless; just enforce your native character encoding for internal strings, try to use the ANSI versions of Win32 functions, and if there is no ANSI version just tack L in front of the necessary string literals (e.g. L"I really hate that Windows doesn't make ANSI versions of some new functions."). :)
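
For the first case (UTF-8 internally, wide strings only at the Win32 boundary), the conversion with CP_UTF8 could look roughly like the sketch below. The helper names utf8_to_wide and wide_to_utf8 are made up for this example, the error handling is deliberately minimal, and the code is wrapped in #ifdef _WIN32 because it only builds on Windows.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 std::string -> UTF-16 std::wstring.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = static_cast<int>(utf8.size());
    // First call with no output buffer asks for the required length.
    const int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), len, nullptr, 0);
    if (needed <= 0) throw std::runtime_error("invalid UTF-8 input");
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), len, &wide[0], needed);
    return wide;
}

// Hypothetical helper: UTF-16 std::wstring -> UTF-8 std::string.
std::string wide_to_utf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    const int len = static_cast<int>(wide.size());
    // With CP_UTF8 the last two parameters must be null.
    const int needed = WideCharToMultiByte(CP_UTF8, 0, wide.data(), len,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0) throw std::runtime_error("conversion to UTF-8 failed");
    std::string utf8(static_cast<std::size_t>(needed), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), len,
                        &utf8[0], needed, nullptr, nullptr);
    return utf8;
}
#endif  // _WIN32
```

A wrapper around a wide Win32 function would then pass utf8_to_wide(path).c_str() to the W variant of the call and run anything the OS hands back through wide_to_utf8.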


Like others have said, std::string (and char*) is already UTF8-compatible as far as string containers are concerned. So it's very easy to recommend UTF8, as there's (almost) nothing special for you to do in terms of your string storage. Mainly it is just a matter of ensuring you invoke the Unicode-aware functions in various 3rd-party APIs (like Win32) and perform whatever conversion is necessary, such as converting from UTF8 to UTF16.

 

The only real issue to handle within your own code is that where multi-byte code points are concerned (a character expressed as a series of bytes), you need to be aware that std::string::length gives you the number of bytes in the string, which is not the same as the number of characters. So determining the length of a string in characters becomes a linear-time operation rather than a constant-time one. Similarly, indexing into the string at the Nth character is not the same as indexing to the Nth byte. In fact, you're always going to encounter those issues with any variable-length encoding; UTF-32 is a fixed-length encoding (like ASCII), so it conveniently avoids those issues, but it's not exactly memory efficient.
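
A small sketch of both points, assuming the input is already valid UTF-8: counting code points means walking the whole string and skipping continuation bytes (those of the form 10xxxxxx), and finding the Nth code point means the same kind of scan. The names utf8_length and utf8_offset are invented for this example. Note that even a code point is not always what a user perceives as one character (combining marks, for instance).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a UTF-8 string.
// Every byte that is not a continuation byte starts a new code point.
// Assumption: s holds valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Hypothetical helper: byte offset of the n-th code point (0-based),
// or std::string::npos if the string has fewer code points than that.
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (n == 0) return i;
            --n;
        }
    }
    return std::string::npos;
}
```

For the string "na\xC3\xAFve" (the word "naive" spelled with a diaeresis on the i), length() reports 6 bytes while utf8_length reports 5 code points, and the code point at index 3 ('v') starts at byte offset 4.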

Edited by dmatter


 

There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

 

It's the trio of things that are commonly used, but an engine should never have to deal with: strings, files and quaternions.

 

 

At some level, you're going to have to deal with files, even if that means running everything on top of your own special snowflake custom HAL.

 

I'm not sure about the problem with quaternions either, as they're an immensely useful construct.

 

Now strings?  Totally.  In game code I try to reserve them as much as possible for UI/external purposes.  And for the love of god, use UTF-8.

Edited by SeraphLance
