They don't need to be able to interact with each other for that.
I think I've been expressing myself poorly. Different types of strings are not allowed to interact with each other, unless you purposely pass a pointer to a datatype that happens to look like the storage type.
You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable.
Again, I think I'm not expressing what I really mean. One cannot be used where the other is expected; they are two completely different opaque types. However, a templated class or function that iterates through all of the characters in a string, and does something that doesn't depend upon the precise encoding, can use any of the string types through the same interface. Again, the incompatible string types don't interact with each other; the design simply lets a generic function that expects a string, and does something with that string, use the same functions, notation, and algorithm regardless of which string type it is given, so long as it doesn't need to know what kind of string it is.
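A minimal sketch of the kind of generic function I mean, assuming a character-level iterator and the charType typedef I've been describing (the exact names are placeholders, not a final interface):

```cpp
#include <cstddef>

// Counts occurrences of a character in any string type that exposes
// character-level iteration. The loop never needs to know whether the
// underlying storage is UTF-8, UTF-16, or a plain 1:1 array of characters.
template <typename StringType>
std::size_t countOccurrences(const StringType& str,
                             typename StringType::charType what)
{
    std::size_t count = 0;
    for (typename StringType::const_iterator it = str.begin();
         it != str.end(); ++it)
    {
        if (*it == what)  // iteration decodes code units into code points
            ++count;
    }
    return count;
}
```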
The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String.
Right, but there is no part of my code where I would ever try to make a string out of data that doesn't belong to its encoding.
We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.
I feel that is debatable. Consider, for instance, the wide-character string on Windows: you construct an std::wstring from an array of wchar_t, and there wchar_t is a UTF-16 code unit, i.e. its storage type. The fixed-width strings can be interpreted either way. The new std::u16string is another example that interacts with code units only. The only difference is that I also provide a method of decoding the character from the storage. In my opinion, it is well established that several widely used string classes interact with elements of their storage, even though that storage is used to represent logical characters.
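To illustrate with the standard classes themselves:

```cpp
#include <string>

int main()
{
    // std::wstring is constructed from an array of wchar_t; on Windows,
    // wchar_t is a UTF-16 code unit, the storage type, not a character.
    const wchar_t wide[] = { 0xD83D, 0xDE00, 0 };  // one character, two units
    std::wstring ws(wide);

    // std::u16string (C++11) likewise interacts only with char16_t code units.
    const char16_t units[] = { 0xD83D, 0xDE00, 0 };
    std::u16string us(units);
}
```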
I would have liked to allow both arrays of code points and arrays of code units, but for strings where the two are the same type, the overloads collide and it fails to compile. I haven't gotten around to a workaround, and to be honest, a string is allowed to have charType and storageType be the same type yet still be encoded, and still need to be processed in terms of both code points and code units. Since I can't have both code point and code unit input, I decided that it is best to have code unit input instead.
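Concretely, the collision looks something like this (a sketch; the names are illustrative):

```cpp
// With both constructors declared, any traits class where charType and
// storageType are the same type produces two constructors with identical
// signatures, which is ill-formed. Hence: code unit input only.
template <typename traits>
class BasicString
{
public:
    typedef typename traits::charType    charType;
    typedef typename traits::storageType storageType;

    BasicString(const storageType* codeUnits);  // the one I kept
    BasicString(const charType* codePoints);    // collides with the above
                                                // when the two types match
};
```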
After all, if you read UTF-16 from a file, why decode it from UTF-16 to Unicode code points, only to pass those to the string class so it can re-encode them right back into UTF-16, merely because it accepts Unicode code points? I feel it is a better design to create a string from the storage you will actually have on hand, the form requiring the least transformation; code points need to be converted to the internal representation, which is wasted work when the data is already in that representation to begin with. For adding characters, one can simply write a loop using the functions that handle single characters. If need be, I can add a specially named function that creates a string from a character array, should that loop turn out to be a performance issue.
If you have a stronger argument for accepting characters only, I'm open to hearing it.
Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.
I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.
I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it.
I'm not quite sure that I follow what you mean. Are you referring to the ValidatedCharacter type? I feel that it is very important to validate unknown characters: the character input functions accept an instance of charType, and a charType value may or may not be a valid character. After all, it's likely a simple integer type, so, using UTF-16 as an example, one would have to ensure that the character value is between 0 and 0x10FFFF and not within the surrogate range 0xD800 to 0xDFFF. If you allow a character that doesn't conform to these requirements to be entered into the string, the string is now invalid. Thus, it is necessary to ensure that the character is valid by some means before using it.
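The check itself is just a range test; a minimal sketch for Unicode code points:

```cpp
#include <cstdint>

// A value is a valid Unicode scalar value if it is at most 0x10FFFF and is
// not in the surrogate range 0xD800 to 0xDFFF, which is reserved for the
// UTF-16 encoding mechanism and never denotes a character on its own.
bool isValidScalarValue(std::uint32_t c)
{
    return c <= 0x10FFFF && (c < 0xD800 || c > 0xDFFF);
}
```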
As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.
For the former, yes, very much so. There's a private method in the ValidatedCharacter class, accessible only to the friend BasicString, that sets its validity without actually checking. When the BasicString class returns a character, all characters in the string are already valid by design, so it returns a ValidatedCharacter using the special constructor that unconditionally marks it valid. This satisfies the former statement.
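Roughly, the arrangement looks like this (a sketch; the template parameter lists are stand-ins for the real ones):

```cpp
template <typename charType, typename traits>
class ValidatedCharacter
{
public:
    // Public constructor: validates the character via the traits class.
    explicit ValidatedCharacter(charType c)
        : value(c), valid(traits::isValidCharacter(c)) {}

    bool isValid() const { return valid; }
    charType get() const { return value; }

private:
    // Trusted constructor: marks the character valid without checking.
    // Only BasicString may call it, and only for characters the string
    // already guarantees are valid by construction.
    ValidatedCharacter(charType c, bool) : value(c), valid(true) {}

    template <typename> friend class BasicString;

    charType value;
    bool valid;
};
```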
The latter, however, is where it differs slightly. The caller is allowed to use a single character of charType to interact with the string; this is the other half of the abstraction. A function that prints out the contents of a Unicode string can be templated to take its characters from a string in UCS-2, UCS-4, UTF-7, UTF-8, UTF-16, UTF-32, or any other Unicode transformation format, and simply iterate through the characters in the string; used with this string class, all of these formats are compatible in use, even though they diverge in construction. The caller doesn't need to know what the transformation format of the string is, so long as it uses the same individual character encoding. Thus, to say
and before data is in the UTF-8 string, it's just bytes and meaningless until validated.
isn't strictly true; the data could be a sequence of bytes representing a UTF-8 sequence, or it could be a whole Unicode code point. Since there are two forms of input, I believe it is equally important to validate both of them, and validating individual characters is trivial. Additionally, the caller can explicitly instantiate the ValidatedCharacter class and check whether a single character is valid, without that check depending on the format being used. (I edited the class definition I posted earlier; it was missing the function for the caller to check if the character was valid!)
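Given the ValidatedCharacter sketch above, and a hypothetical Utf16Traits class, that caller-side check might look like:

```cpp
// Vet a lone charType value before it ever reaches a string; the test is
// the same no matter which transformation format the target string uses.
bool isAppendable(char32_t c)
{
    ValidatedCharacter<char32_t, Utf16Traits> vc(c);
    return vc.isValid();  // false for, e.g., a lone surrogate like 0xD800
}
```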
Now, regarding when I said:
I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.
This is very important: by thinking of every string as having a character encoding (the POD charType) and a transformation format (the POD storageType), even strings whose encoding is the same as their transformation format, where each output is identical to its input, I can create one generic implementation that takes storage, applies a transformation, and gets a character, and vice-versa. The result is this: with this one implementation, I can instantiate just about any string format that follows these rules, if I define the appropriate traits type (there's a sketch of this after the list below). So far, I have a UTF-8 traits class, and a generic traits class that performs the 1:1 mapping like std::basic_string. When I implement UTF-16, I simply define the traits class for it (which will borrow a lot from the UTF-8 code, since the character encoding is the same while the transformation format is different), and I now have a UTF-16 string class! The benefits of this approach are three-fold:
- Speed of use. In a matter of maybe an hour, I can implement a "new" type of string class. The effort saved is tremendous compared to writing a separate class that would perform the same job at about the same speed while giving me the headache of debugging two classes, which leads me into my next point:
- Size and Simplicity. There is only one string class implementation! This means that as I iron out bugs and make the implementation more efficient, all of the string types benefit from the work. I need only catch a bug once for it to be fixed in every one of my strings, and I do not have to repeat myself for the same operations on a slightly different format; I only need to make sure that each traits class works properly. Additionally, since the class supports a templated allocator type, this is a much easier solution than writing duplicated code that also has to support custom allocators.
- Versatility. I can support formats that I haven't even thought of supporting yet. Likely many that I won't ever use personally, but since I'm only paying for the instantiations that I use, this is a non-existent problem. For the formats that are raw arrays of characters internally, the inline no-op transformation functions get optimized out. For the formats that don't need validation because all characters are valid, the inline no-op validation functions get optimized out. And, while I plan to implement several formats, if code that uses this library needs a format that I didn't deem necessary, a traits class is all it takes for it to be supported.
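To make the traits idea concrete, here is a sketch of the kind of traits classes involved; the member lists are illustrative, not my actual interface:

```cpp
struct IdentityTraits   // the 1:1 mapping, like std::basic_string<char>
{
    typedef char charType;
    typedef char storageType;
    // encode/decode are inline no-ops that the compiler optimizes away,
    // and validation always succeeds, so it is optimized away too.
};

struct Utf8Traits
{
    typedef char32_t charType;     // a whole Unicode code point
    typedef char     storageType;  // a UTF-8 code unit (one byte)
    // encode/decode translate between a code point and its 1 to 4 byte
    // UTF-8 sequence; validation performs the scalar-value range test.
};

template <typename traits /*, allocator */>
class BasicString { /* the single shared implementation */ };

// A "new" string type is then just an instantiation over a traits class:
typedef BasicString<IdentityTraits> NarrowString;
typedef BasicString<Utf8Traits>     UTF8String;
// Supporting UTF-16 later means writing a Utf16Traits class, nothing more.
```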
Sorry for the novel if that is more information than necessary.