Use which kind of generic string?

7 comments, last by SeraphLance 7 years, 11 months ago

Long time hobby developer here.

Over the years I built up a game framework based on Win32/DirectX. Since the days of old it's been using std::string for pretty much everything, simple lookup strings as well as file names.

Now the days of ASCII are going away. On Windows this means file names are Unicode. But I wouldn't want every little string being stored as Unicode. This forces me to distinguish between file names and other strings.

I'm not overly fond of UTF8, as variable character sizes sound like a major annoyance for GUI display.

How do others cope with this? Use different string types? Am I overthinking things by trying to avoid Unicode strings?

Fruny: Ftagn! Ia! Ia! std::time_put_byname! Mglui naflftagn std::codecvt<eY'ha-nthlei!,char,mbstate_t>

Most systems I've worked with have std::string markers in the code; the localization system generates a different type of immutable localized string that is accepted for the UI.

Currently we store our strings internally as UTF8 and we do operations between UTF8 strings. The cool thing about UTF8 is that it works perfectly with std::string (or, in our case, a legacy string class that was implemented around char*), as a 0 byte will be found only at the end of the string.
Under Windows we convert to UNICODE/MBCS when needed (for example for WinAPI-specific functions such as OpenFile, SendMessage, CreateWindow...).
The Unix-based platforms already use UTF8 as their native string encoding.

Actually a good starting point would be to do what the guys from Autodesk Maya do with their MString (see the MString::asChar method, which returns the string value in a platform-native format).

BTW a side hint:
In MSVS, if you type "mystringVariable, s8" in the debugger's watch window, the value of "mystringVariable" is going to be displayed as a UTF8 string and not as an MBCS string ^_^

UTF8 works fine. Unicode sucks, no matter how you deal with it, since it is way over-engineered and has way too many redundancies and special cases (unless you simply choose to ignore all these issues -- then Unicode is a "just works" thing). But UTF8 not only plays nicely with your 40 year old C string functions and is kinda intuitive otherwise (think sorting), it also has no more quirks than necessary. Using stuff like wchar_t, you may end up discovering that wide characters have different sizes on different systems. Surprise, surprise.

I'm not overly fond of UTF8, as variable character sizes sound like a major annoyance for GUI display.

Well, with the "standard Unicode" strings under Windows, you have the worst of both worlds. You have wide characters, and you still have variable character sizes (since Windows uses UTF-16). Unless you use WinXP or Win2K, in which case you have "broken Unicode" (UCS-2). In that respect, UTF-8 is no worse, only better. Same amount of trouble, but it takes only about 50-60% as much storage (for anything non-Chinese). Plus, English text is just plain normal, readable ANSI English text even in your favorite hex editor (no zero bytes in between), and even European languages are 80-90% multibyte-free, plain ordinary ANSI characters.
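To make the "still variable character sizes" point concrete, here is a minimal C++ sketch of how many units a single code point takes in each encoding. The function names encode_utf16 and utf8_size are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Anything above U+FFFF needs a surrogate pair, i.e. two 16-bit units,
// which is why UTF-16 is a variable-width encoding too.
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}

// Number of bytes the same code point occupies in UTF-8.
std::size_t utf8_size(char32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // most accented Latin, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the BMP, including most CJK
    return 4;                    // everything beyond the BMP
}
```

An ASCII letter is one byte in UTF-8 against two in UTF-16, while a typical CJK character is three bytes against two, which matches the "anything non-Chinese" caveat above.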

The only real inconvenience about UTF8 is that -- under Windows -- you have to convert all strings that you are receiving from the OS in some way, and you may have to convert any strings you pass to an API, too. Takes 10 minutes of consideration once when writing your API wrapper. Or you may simply not care and use filenames which work with legacy functions! After all, most of the time, as a game developer, you get to choose the file name. Choose wisely, and forget about it.

All in all, UTF8 is no biggie most of the time. Under Linux, UTF8 is the default anyway, so no troubles at all there.

I avoid ever using wide characters - UTF8 is much better IMHO.

There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

In some places there are debug helpers that use const char*s which are assumed to point to string literals, or other immutable allocations with longer lifetimes than the pointers themselves.

The only place that strings are used is within the GUI stuff, where some 3rd party library decodes UTF8 char streams into a series of glyphs to be rendered, and passes a stream of textured quads back to my renderer. Any char container is suitable for this.

There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

It's the trio of things that are commonly used, but an engine should never have to deal with: strings, files and quaternions.

http://utf8everywhere.org/

Sean Middleditch – Game Systems Engineer – Join my team!

If you want cross-platform internationalization, use UTF-8. You can convert UTF-8 to Windows wide strings (and the reverse) using MultiByteToWideChar (WideCharToMultiByte) with CP_UTF8 for use with Win32 APIs.

If you're only developing for Windows, with no plan for porting, just use Windows wide strings (UTF-16).

If you're not internationalizing your project, all this talk is pretty pointless; just enforce your native character encoding for internal strings, try to use the ANSI versions of Win32 functions, and if there is no ANSI version just tack L in front of the necessary string literals (e.g. L"I really hate that Windows doesn't make ANSI versions of some new functions."). :)

Like others have said, std::string (and char*) is already UTF8-compatible so far as string containers are concerned. So it's very easy to recommend UTF8 as there's (almost) nothing particularly special for you to do in terms of your string storage. Mainly it is just a matter of ensuring you invoke the Unicode-aware functions in various 3rd-party APIs -- like Win32 -- and doing whatever conversion is necessary, such as converting from UTF8 to UTF16.
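For a feel of what that UTF8 to UTF16 conversion actually does, here is a hand-rolled, portable C++ sketch. On Windows you would more likely call MultiByteToWideChar with CP_UTF8 as mentioned earlier in the thread; the function name utf8_to_utf16 is hypothetical, and the validation is deliberately minimal (overlong encodings, for instance, are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode UTF-8 byte sequences into code points and re-encode them as UTF-16.
// Throws std::invalid_argument on truncated or obviously malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        }
    }
    assert(i == in.size());  // the loop consumes the input exactly
    return out;
}
```

In a wrapper you would keep std::string everywhere and only call something like this (or the OS function) right at the API boundary.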

The only real issue to handle within your own code is that where multi-byte code points are concerned (a character expressed as a series of bytes) you need to be aware that std::string::length will give you the number of bytes in the string, which is not the same as the number of characters. So determining the length of a string now becomes a linear-time operation rather than constant time. Similarly, indexing into the string at the Nth character is not the same as indexing to the Nth byte. In fact you're always going to encounter those issues with any variable-length encoding; there is UTF-32, which is a fixed-length encoding (like ASCII), so it's convenient insofar as it avoids those issues, but it's not exactly memory efficient.
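The byte-count versus character-count difference can be sketched in a few lines of C++. Both helper names (utf8_length, utf8_offset) are invented for this example, they count code points rather than user-perceived characters, and they assume the input is already valid UTF8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points by skipping continuation bytes (bit pattern 10xxxxxx).
// This is the linear-time replacement for std::string::length().
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Byte offset of the n-th code point (0-based); returns s.size() if the
// string holds fewer than n + 1 code points.
std::size_t utf8_offset(const std::string& s, std::size_t n) {
    // Sanity check: a string never holds more code points than bytes.
    assert(n <= s.size());
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return s.size();
}
```

For the UTF8 bytes of "Grüße", std::string::length() reports 7 while utf8_length reports 5, and the fourth character starts at byte 4 rather than byte 3.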

There's no string class in my engine (except GUI as below), because you should never be working with strings :lol:

It's the trio of things that are commonly used, but an engine should never have to deal with: strings, files and quaternions.

At some level, you're going to have to deal with files, even if that means running everything on top of your own special snowflake custom HAL.

I'm not sure about the problem with quaternions either, as they're an immensely useful construct.

Now strings? Totally. In game code I try to reserve them as much as possible for UI/external purposes. And for the love of god, use UTF-8.

This topic is closed to new replies.
