Typical Behavior Of A String

9 comments, last by Ectara 11 years, 2 months ago

What is the typical behavior of a wide-character string on a platform where it represents a UTF-16 code unit? When you subscript the string, does it return the code point at that index, doing the appropriate conversion, or does it simply return the code unit in its internal array? When you append a character to the string, does it accept it as a code point, and do the conversion into UTF-16 code units, or a single code unit, requiring you to break up any characters that require two code units into surrogate pairs?
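To make the code unit versus code point distinction concrete, here is a minimal sketch using std::u16string from C++11 (my own illustration, not tied to any particular string class); a character outside the Basic Multilingual Plane occupies two code units, and plain subscripting exposes them directly:

#include <cassert>
#include <string>

int main()
{
    // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
    std::u16string s = u"\U0001F600";

    assert(s.size() == 2);     // size() counts code units, not code points
    assert(s[0] == 0xD83D);    // high surrogate, not a code point by itself
    assert(s[1] == 0xDE00);    // low surrogate

    return 0;
}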

In my string class, I have a subscript operator that returns the code point by value, performing the appropriate conversions as necessary. I'm thinking of adding a method that allows you to access the code unit array, and this will return a reference to that element.

The question might also be extended to: where does the responsibility of the string end? Is the string just there to hold an array? If so, the reported length of variable-width encodings would be wrong. If the string class is aware of the correct length, it would also have the ability to return the code point, instead of code units. Is the string simply a container, or is it allowed to know what it is holding?

My string class is laid out like this:
There is a character type, and a storage type. For example, with UTF-8, the character type could be an unsigned int, and the storage type would be an unsigned char; the code points (character type) would be 32-bit values, each one distinct. The code units (storage type) would be 8-bit values in a sequence that could be converted into a whole Unicode code point.

There are separate length functions: length() and size() return the length in code points, while span() returns the size in code units. Algorithms that deal with the string would use the appropriate methods for their needs.

There are subscript functions that return the code point at that index in the logical string, and functions that return the code unit at that offset in the physical string.

The traits class defines the encoding and transformation format entirely. Depending on the traits, it could be a fixed-width character array, or a variable-width string encoding.
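Roughly, the interface looks like this (a minimal sketch only; the member names below are illustrative rather than my exact code):

#include <cstddef>
#include <vector>

// Sketch of the traits-based layout described above. Traits supplies the
// character (code point) type, the storage (code unit) type, and the
// encode/decode logic for the transformation format.
template <typename Traits>
class basic_ustring
{
public:
    using char_type    = typename Traits::char_type;     // e.g. char32_t code points
    using storage_type = typename Traits::storage_type;  // e.g. unsigned char for UTF-8

    std::size_t length() const;   // length in code points
    std::size_t size() const;     // also length in code points
    std::size_t span() const;     // size in code units

    char_type operator[](std::size_t index) const;   // decode the code point at a logical index
    storage_type& unit(std::size_t offset);          // raw code unit at a physical offset

    void push_back(char_type code_point);            // encode into one or more code units

private:
    std::vector<storage_type> units_;   // the encoded, physical sequence
};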

I feel that there is great utility in being able to subscript code points, as I can use a UTF-8 string, a UTF-16 string, and a UTF-32 string, and I can get the same values from all of them in the same way, so long as they contain the same code points in their respective transformation formats. My goal is to be able to write code that doesn't need to concern itself with the type of string passed; when it is unacceptable to instantiate a different version of a function for each given string type, I can use iterators to be encoding-agnostic.
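For example, a hypothetical function like the one below works unchanged for UTF-8, UTF-16, and UTF-32 strings, because subscripting always yields code points:

#include <cstddef>

// Hypothetical encoding-agnostic algorithm: counts ASCII digits by code
// point, regardless of how the string stores them internally.
template <typename String>
std::size_t count_ascii_digits(const String& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.length(); ++i)
        if (s[i] >= U'0' && s[i] <= U'9')
            ++count;
    return count;
}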

I accept that at any time, there may be invalid sequences in the string, and this is handled when the conversions are being made, because I think it is an unreasonable invariant to enforce that the string be valid at all times.

So, do I have it wrong, and my string class knows too much about the data it contains?

What is the typical behavior of a wide-character string on a platform where it represents a UTF-16 code unit? When you subscript the string, does it return the code point at that index, doing the appropriate conversion, or does it simply return the code unit in its internal array? When you append a character to the string, does it accept it as a code point, and do the conversion into UTF-16 code units, or a single code unit, requiring you to break up any characters that require two code units into surrogate pairs?

In my mind, a Unicode string class is pretty darn useless if you can't naively perform standard string formatting operations on it.

I can always use a std::vector<unsigned short> if I want the idiot behaviour.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

I agree, a Unicode string is rather worthless if I can't interact with it as a string of code points. On my development platform, wchar_t is 32-bit and defined to house UTF-32 code points, so I can't test out what the compiler is defined to do for UTF-16 strings. I figured I would ask, and see if anyone has experience with the ins and outs of how it is used, since I am having trouble finding anything through searching.

I think it was wrong that they added wchar_t to the language, because you can never be sure what it represents. Back then they seem to have thought any character would fit in 2 bytes (UCS-2), and later it got inconsistent when they wanted more than 65,536 characters, because, like many C types, they hadn't bothered to standardize the size at an exact number (I so much hate having to use two dozen types of 16-bit, 32-bit and 64-bit numbers, each with a different probability of being another size). I wish they hadn't retrofitted variable-length UTF-16 onto this and had instead just moved on to UTF-32 (plus UTF-8 when size or backwards compatibility matters).

I plan on only using the new char16_t and char32_t types (plus char) and ignoring wchar_t as much as possible to avoid problems with it.
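For reference, these are the C++11 types and the string literal prefixes that go with them (note that in C++20 the u8 prefix switches to the separate char8_t type):

#include <string>

const char*     utf8  = u8"text";   // UTF-8 code units (char-based in C++11/14/17)
const char16_t* utf16 = u"text";    // UTF-16 code units
const char32_t* utf32 = U"text";    // UTF-32 code points

std::u16string  s16 = u"\u00E9";    // 'é' as a single UTF-16 code unit
std::u32string  s32 = U"\u00E9";    // 'é' as a single UTF-32 code point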

Take a look at http://www.utf8everywhere.org/ ; they recommend transforming everything to UTF-8 and only converting back to something else the moment you call some old-fashioned library. I think that's a good idea, but I tend to like UTF-32 a bit more than they do because of its fixed-length nature, which makes internal calculations easier.

Retrofitting logic for all types of encodings into a string class doesn't seem like a good idea compared to this.

I think it was wrong that they added wchar_t to the language, because you can never be sure what it represents. Back then they seem to have thought any character would fit in 2 bytes (UCS-2), and later it got inconsistent when they wanted more than 65,536 characters, because, like many C types, they hadn't bothered to standardize the size at an exact number (I so much hate having to use two dozen types of 16-bit, 32-bit and 64-bit numbers, each with a different probability of being another size). I wish they hadn't retrofitted variable-length UTF-16 onto this and had instead just moved on to UTF-32 (plus UTF-8 when size or backwards compatibility matters).

I plan on only using the new char16_t and char32_t types (plus char) and ignoring wchar_t as much as possible to avoid problems with it.

Take a look at http://www.utf8everywhere.org/ ; they recommend transforming everything to UTF-8 and only converting back to something else the moment you call some old-fashioned library. I think that's a good idea, but I tend to like UTF-32 a bit more than they do because of its fixed-length nature, which makes internal calculations easier.

Retrofitting logic for all types of encodings into a string class doesn't seem like a good idea compared to this.

wchar_t was a dumb idea, due to the lack of standardization. Internally, I use either UTF-8 or UTF-32, depending on where I'm using it. I'm not using UTF-16; rather, I want to leave that possibility open in case a traits class is written for it. The code I wrote not too long ago added support for variable-width encodings, which made support for UTF-8 and UTF-16 possible, among others.

Typically, the rule is to pick the format that requires the least amount of conversion for your purposes. With a generic interface, I won't have to worry about conversions in my own code; I realize that I have to do a conversion for whichever libraries I come in contact with.

My goal is to abstract away what kind of string it is, so I don't have to deal with the raw logic everywhere I used it, like in C.

My goal is to abstract away what kind of string it is, so I don't have to deal with the raw logic everywhere I used it, like in C.

It is nice in theory.

If you need to work with the strings, then it implies you are going to need localization.

If you've ever worked with localization, you'd know that there are almost no operations you can do on strings if you want them to survive the localization process.

Here's what I propose, based on my experience: it has worked well on about 10 major projects I've worked on, and another 5 or so that didn't follow this pattern had nasty localization problems.

Start with keys in a localization database. For convenience, these keys are just plain strings. The database key, language, and gender combined result in a localized format string. More on this later.

Then you pass a regular string to the localization database and it returns a new object of class LocalizedString. DO NOT EVER have localized strings in a plain string object. DO NOT EVER allow regular strings to be inserted into localized strings. The two should always be kept separate and distinct lest nightmares occur.

You can then do only a very few things with a localized string. Beyond a handful of limited operations, it is destined for UI output only.

Example:


LocalizedString message = Localization::LocalizeString( player[i].Gender(), "GameOver", player[i].Name, player[i].Score );
DialogManager::ShowGlobalDialog(message); 

The first line looks up the string key GameOver in a string database. For English it would become something like "Great game {0.Name}. You scored {1.Number} points!". In other languages the string may have the items replaced in a different order, and they may be different for various genders.

In other words, you get an interface that looks like this:

LocalizedString Localization::LocalizeString( GenderFlag gender, std::string key, ... );

Basically the string substitution should be similar to C# and several other modern languages. You pass them as a parameter array and they are inserted in the order of {#.Format}. There should be male and female variants of each string. Format values can be things like Name, Number, Money, Date, Time, Hours, Minutes, and so on. Many of these need to be formatted differently in different languages, or marked in colors or fonts specific to your game's UI scheme. The two most absolutely necessary format values are {#.String}, meaning an unlocalized string that gets automatically localized during the process, and {#.LocalizedString}, meaning it has already been composed.
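As a very rough sketch of the lookup-and-substitute step (simplified for illustration: a flat table, string-only arguments, and no per-type or per-gender handling, all of which the real system must do properly):

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Opaque to callers in the real design; exposed here only so the sketch compiles.
struct LocalizedString { std::string text; };

LocalizedString LocalizeString(const std::map<std::string, std::string>& table,
                               const std::string& key,
                               const std::vector<std::string>& args)
{
    // Look up the format string for this key,
    // e.g. "Great game {0.Name}. You scored {1.Number} points!"
    std::string result = table.at(key);

    // Replace each {i.Format} placeholder with the already-formatted argument i.
    for (std::size_t i = 0; i < args.size(); ++i)
    {
        const std::string open = "{" + std::to_string(i) + ".";
        std::size_t pos;
        while ((pos = result.find(open)) != std::string::npos)
        {
            const std::size_t close = result.find('}', pos);
            if (close == std::string::npos)
                break;                                   // malformed placeholder, stop
            result.replace(pos, close - pos + 1, args[i]);
        }
    }
    return LocalizedString{result};
}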

There are only a few things you should be able to do with a localized string.

You can append them as an additional line. (Appending stuff on the same line can change the meaning in some languages.)

You can append them on a new line as part of a numbered list or as part of a bulleted list.

You can add an ellipsis key (that must also be localized, use Localization::Ellipsis())

The LocalizedString should never be convertible back to a regular string, since it adds confusion and mayhem. UI elements should only take LocalizedString values if they are already localized, or standard strings if they are not localized.

I have no use for a UTF-Whatever string. I have no use for various wide character formats. There are way too many formats that have diluted it down to the point of uselessness.

Instead, I love having these things:

std::string --- an unlocalized string I can modify as I see fit, using my own language rules. They are never displayed.

string keys --- string literal values that serve as an index to the string table.

LocalizedString --- a localized string that is basically unmodifiable, for output only.

Follow those rules and you will avoid creating Yet Another Broken Wide Char system.

Follow those rules and you will avoid creating Yet Another Broken Wide Char system.

I'll take that into consideration. These strings aren't designed to be an end-all, be-all for displaying text to users. I am well aware of localization issues, and I do plan to use tables of whichever sort for different localizations. I don't plan to create "Yet Another Broken Wide Char system."

std::string --- an unlocalized string I can modify as I see fit, using my own language rules. They are never displayed.


This is precisely what this string class is meant to do. I'm not trying to make gold from lead.

There's an obvious disconnect between strings in a framework (like the approach in frob's post) and strings as a sequence of characters (like std::basic_string<aargh> in C++): the former can be opaque data types that offer only well-behaved, limited operations, the latter are tied to a treacherous and insufficient low-level representation which cannot provide integrity.

Even sequences of 32-bit code points, without surrogate pairs, which are obviously the most amenable to C-style manipulation among proper Unicode string representations, offer pitfalls: letters may or may not be combined with their respective combining characters (which matters, e.g., for length-counting purposes), plus other normalization issues, stray byte order marks, undefined code points, etc.
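For instance, even in UTF-32 a single user-perceived character can be one code point or two, and the two spellings do not compare equal without normalization:

#include <cassert>
#include <string>

int main()
{
    std::u32string precomposed = U"\u00E9";    // U+00E9, 'é' as one code point
    std::u32string decomposed  = U"e\u0301";   // U+0065 plus U+0301 (combining acute accent)

    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);            // same visible character, different "length"
    assert(precomposed != decomposed);         // equal only after normalization

    return 0;
}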

Some applications can deal with these problems (for example, a data compression system can treat UTF-8, UTF-32 etc. as hints, be lenient with malformed strings and reconstruct exactly any input), but most systems need to use std::basic_string, char* or the like as aggressively encapsulated implementation details of their real, dependable string API and as an interchange format for libraries (for example font rendering, a spectacular can of worms).

Omae Wa Mou Shindeiru


I have no use for a UTF-Whatever string. I have no use for various wide character formats. There are way too many formats that have diluted it down to the point of uselessness.

Instead, I love having these things:

std::string --- an unlocalized string I can modify as I see fit, using my own language rules. They are never displayed.

string keys --- string literal values that serve as an index to the string table.

LocalizedString --- a localized string that is basically unmodifiable, for output only.

Follow those rules and you will avoid creating Yet Another Broken Wide Char system.


You say you have no use for a "UTF-Whatever" string - but what format is your LocalizedString in? In order to support different languages, the content of your localisation database must have a standard encoding - probably UTF-8 - and your LocalizedString class is probably returning instances of that. And whatever displays that string - in your post, it was the DialogManager::ShowGlobalDialog function - must also expect a standard encoding in order to be able to translate characters in the string into glyphs for the screen.

So what you have is, essentially, a UTF-8 (or 16, or 32) string, whether you like it or not!

Ideally you pick an encoding for strings - UTF-8 is best, at least for western developers - and use that exclusively. There's no reason why it can't implement pretty much all the same behaviour as a std::string does, as long as you don't expect the same performance. Languages like Python 3 and C# offer Unicode strings that are fully functional in this way.

Ultimately it's better to be aware of the encoding you use, and use it consistently, than to pretend you're not dealing with Unicode at all and unnecessarily limit your program's capabilities.

You say you have no use for a "UTF-Whatever" string - but what format is your LocalizedString in? In order to support different languages, the content of your localisation database must have a standard encoding - probably UTF-8 - and your LocalizedString class is probably returning instances of that. And whatever displays that string - in your post, it was the DialogManager::ShowGlobalDialog function - must also expect a standard encoding in order to be able to translate characters in the string into glyphs for the screen.

So what you have is, essentially, a UTF-8 (or 16, or 32) string, whether you like it or not!

It does not matter what is inside a LocalizedString object. It is encapsulated. It may be UTF-whatever internally. It may be an unknown array of bytes. The class can be opaque. It matters only to the UI layer. It may be Unicode, it may be character code pages. It may be custom glyphs. That is irrelevant to the consumers of the class. They exist only in the land of string keys and parameters to those keys.

Ultimately it's better to be aware of the encoding you use, and use it consistently, than to pretend you're not dealing with Unicode at all and unnecessarily limit your program's capabilities.

How does it limit your program's capabilities?

That's the real issue with localization: There is no one-size-fits-all set of string manipulations.

In English you can use simple word substitution in a few cases, but if you need to change gender or change from singular to plural you need to drastically revise the sentence; you cannot blindly use word replacement. You cannot simply include numbers as different territories mark numbers differently: 12345.67; 12,345.67; 12'345.67; 12.345,67; and 12.345'67 may be required based on your location. You cannot blindly display money or time for the same reason. You cannot compose sentences programmatically in a way that makes sense across all languages. You cannot even append two strings because doing so in a few Asian languages can potentially modify or negate the meaning of a message.
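To illustrate just the digit-grouping point, the same value comes out differently depending on which locale a stream is imbued with (locale names are platform-dependent, so the POSIX-style names below are only examples and may not be installed everywhere):

#include <iostream>
#include <locale>
#include <sstream>
#include <string>

// Format a number using a named locale, falling back to the default
// "C" locale if that locale is not installed on the system.
std::string format_number(double value, const char* locale_name)
{
    std::ostringstream out;
    try {
        out.imbue(std::locale(locale_name));
    } catch (const std::runtime_error&) {
        // Unknown locale name; keep the default locale.
    }
    out.setf(std::ios::fixed);
    out.precision(2);
    out << value;
    return out.str();
}

int main()
{
    std::cout << format_number(12345.67, "en_US.UTF-8") << "\n";  // typically 12,345.67
    std::cout << format_number(12345.67, "de_DE.UTF-8") << "\n";  // typically 12.345,67
}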

Sure, you can be aware of the encoding. But that is beside the point. The point is that localization is a black box. A string key and parameters go in to the box, and a translated object comes out. Once translated it becomes basically untouchable.

This topic is closed to new replies.
