What is the typical behavior of a wide-character string on a platform where it stores UTF-16 code units? When you subscript the string, does it perform the appropriate conversion and return the code point at that index, or does it simply return the code unit at that position in its internal array? When you append a character, does it accept a code point and convert it into UTF-16 code units, or does it accept a single code unit, requiring you to break any character that needs two code units into a surrogate pair yourself?
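To make the "append accepts a code point" alternative concrete, here is a minimal sketch of UTF-16 encoding. The free function `append_code_point` is hypothetical, standing in for a member of some string class; it splits any code point above U+FFFF into a surrogate pair so the caller never touches code units:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: append a code point to a UTF-16 code-unit buffer.
// Code points above U+FFFF are encoded as a high/low surrogate pair.
void append_code_point(std::vector<char16_t>& units, std::uint32_t cp) {
    if (cp < 0x10000) {
        units.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}
```

For example, appending U+1F600 produces the two code units 0xD83D and 0xDE00, while appending U+0061 produces the single unit 0x0061.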
In my string class, I have a subscript operator that returns the code point by value, performing the appropriate conversions as necessary. I'm thinking of adding a method that gives access to the underlying code unit array, returning a reference to the element at a given offset.
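A subscript that returns the code point by value necessarily decodes on the fly, which is why it cannot hand back a reference into the byte array. A minimal sketch of what such a subscript does for UTF-8 (the free function `code_point_at` is illustrative, not the actual member, and assumes the sequence is valid):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch: return the code point at logical index `i` in a
// UTF-8 byte sequence, assuming the sequence is well-formed.
std::uint32_t code_point_at(const std::string& s, std::size_t i) {
    std::size_t pos = 0;
    for (;;) {
        unsigned char b = static_cast<unsigned char>(s[pos]);
        // Sequence length from the leading byte.
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        if (i == 0) {
            // Mask off the length-marker bits of the leading byte.
            std::uint32_t cp =
                len == 1 ? b :
                len == 2 ? (b & 0x1Fu) :
                len == 3 ? (b & 0x0Fu) : (b & 0x07u);
            // Fold in the 6 payload bits of each continuation byte.
            for (std::size_t k = 1; k < len; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3Fu);
            return cp;
        }
        pos += len;  // skip to the next code point
        --i;
    }
}
```

So for the three-code-point string "a\xC3\xA9z" (a, é, z), index 1 yields U+00E9 even though the é occupies two code units.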
The question can be extended: where does the responsibility of the string end? Is the string just there to hold an array? If so, its reported length would be wrong for variable-width encodings. If the string class is aware of the correct length, it is also in a position to return code points instead of code units. Is the string simply a container, or is it allowed to know what it holds?
My string class is laid out like this:
There is a character type and a storage type. For example, with UTF-8, the character type could be an unsigned int and the storage type an unsigned char; the code points (character type) would be 32-bit values, each one distinct, and the code units (storage type) would be 8-bit values in sequences that can be converted into whole Unicode code points.
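The character/storage split might be expressed in a traits class roughly like this (the names `Utf8Traits` and `Utf16Traits` and the `max_units` member are my own illustration, not the actual class):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical trait layouts for the character/storage split described
// above: char_type holds one full code point, storage_type one code unit.
struct Utf8Traits {
    using char_type    = std::uint32_t;  // a 32-bit code point
    using storage_type = unsigned char;  // an 8-bit code unit
    static constexpr std::size_t max_units = 4;  // longest UTF-8 sequence
};

struct Utf16Traits {
    using char_type    = std::uint32_t;
    using storage_type = char16_t;       // a 16-bit code unit
    static constexpr std::size_t max_units = 2;  // surrogate pair
};
```

A string template parameterized on such a traits type can then expose `char_type` at its logical interface while storing `storage_type` internally.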
There are separate length functions: length() and size() return the length in code points, while span() returns the size in code units. Algorithms that deal with the string would use the appropriate methods for their needs.
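For UTF-8, the two measures can be sketched as free functions (standing in for the members): `span()` is just the raw unit count, while `length()` counts code points by skipping continuation bytes of the form 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Sketch of the span()/length() distinction for a UTF-8 string,
// assuming the sequence is well-formed.
std::size_t span_of(const std::string& s) {
    return s.size();  // code units
}

std::size_t length_of(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80)  // not a continuation byte: a new code point starts
            ++n;
    return n;  // code points
}
```

For "a\xC3\xA9z" (a, é, z), `span_of` reports 4 and `length_of` reports 3.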
There are subscript functions that return the code point at a given index in the logical string, and separate functions that return the code unit at a given offset in the physical storage.
The traits class defines the encoding and transformation format entirely. Depending on the traits, it could be a fixed-width character array, or a variable-width string encoding.
I feel there is great utility in being able to subscript code points: I can take a UTF-8 string, a UTF-16 string, and a UTF-32 string and get the same values from all of them in the same way, so long as they contain the same code points in their respective transformation formats. My goal is to write code that doesn't need to concern itself with the type of string passed in; when instantiating a different version of a function for each string type is unacceptable, I can use iterators to stay encoding-agnostic.
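The encoding-agnostic goal might look like this in use. The template assumes any string type whose iteration yields code point values, as described above; `contains_code_point` is a hypothetical example, and the same instantiation logic applies whether the backing storage is UTF-8, UTF-16, or UTF-32:

```cpp
#include <cstdint>
#include <vector>

// Sketch: generic algorithm over any string type whose begin()/end()
// iterators yield code points (not code units).
template <typename AnyUnicodeString>
bool contains_code_point(const AnyUnicodeString& s, std::uint32_t cp) {
    for (std::uint32_t c : s)   // identical loop regardless of encoding
        if (c == cp)
            return true;
    return false;
}
```

Here a `std::vector<std::uint32_t>` can stand in for a UTF-32-like string when trying the template out, since its iteration already yields code points directly.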
I accept that the string may contain invalid sequences at any time; these are handled when conversions are performed, because I think requiring the string to be valid at all times is an unreasonable invariant to enforce.
So, do I have it wrong, and my string class knows too much about the data it contains?