UTF-8 String Validation Responsibility


I don't see why you need to mix string types that don't need validation and string types that do.

So that code that uses it can use the same interface without needing to know the difference. That was the whole reason.

Then have them provide the same interface. They don't need to be able to interact with each other for that. But see below...


I don't. I really don't. Somehow, text has to get into a string. If I read UTF-8 from a file, it goes into an array of code units before it goes into the string class. So, I need to interact with it there. The other functions are for convenience.

Right, but that instance of 'char* to UTF-8' is logically completely different from 'char* to std::string'. You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable. The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String. We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it. As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.


They don't need to be able to interact with each other for that.

I think I'm misrepresenting what I mean. Different types of strings are not allowed to interact with each other, unless you purposely pass a pointer to a datatype that happens to look like the storage type.

You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable.

Again, I think I'm not expressing what I really mean. One cannot be used where the other is expected; they are two completely different opaque types. However, a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface. Again, the incompatible string types don't interact with each other. It's just made so that a generic function that expects a string and does something with a string can use the same functions, notation, and algorithm for handling them, so long as it doesn't need to know what kind of string it is.

The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String.

Right, but there is no part of my code where I would ever try to make a string out of data that doesn't belong to its encoding.

We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

I feel that is debatable. For instance, in Windows, the wide-char string. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: its storage type. The fixed-width strings can be interpreted either way. The new specialization std::u16string is another example that interacts with code units only; the difference is that I also provide a method of decoding the character from the storage as well. In my opinion, it is well established that several widely used string classes interact with elements of their storage, even though that storage is used to represent logical characters.

I would have liked to allow both arrays of code points and arrays of code units, but for strings where the two are the same type, it would fail to compile. I didn't get around to making a workaround, and to be honest, a string is allowed to have charType and storageType be the same type, yet still be encoded and need to be processed in terms of code points and code units. Thus, I can't have both code point and code unit input, and I decided that it is best to have code unit input instead.

After all, if you read UTF-16 from a file, why would you decode it from UTF-16 to Unicode code points, then pass those to the string class, which re-encodes them back to UTF-16 because it accepts Unicode code points? I feel that it is a better design to create a string from the storage that you will have on hand, requiring the least amount of transformation; code points need to be converted to the internal representation, which is costly when the data is already in that representation to begin with. For adding characters, one can simply make a loop using the functions that handle single characters. If need be, I can make a specially named function that creates from a character array if it turns out to be a performance issue.
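To sketch the difference (hypothetical names throughout; readUtf16File, decodeUtf16, and the factory methods stand in for whatever the real interface provides):

#include <vector>

std::vector<char16_t> units = readUtf16File("text.txt"); // raw code units from disk

// Constructing directly from code units: validate once, copy once.
Utf16String direct = Utf16String::createFromStorage(units.data(), units.size());

// The code-point route: decode every unit to char32_t first, only for the
// string class to re-encode it back to UTF-16 internally - a wasted round trip.
std::vector<char32_t> points = decodeUtf16(units);
Utf16String roundTrip = Utf16String::createFromCodePoints(points.data(), points.size());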

If you have a stronger argument for accepting characters only, I'm open to hearing it.

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it.

I'm not quite sure that I follow what you mean. Are you referring to the ValidatedCharacter type? I feel that it is very important to validate unknown characters; the character input functions accept an instance of charType, and a given charType value may or may not be a valid character. After all, it's likely a simple integer type, so using UTF-16 as an example, one would have to ensure that the character value is between 0 and 0x10FFFF, and not between 0xD800 and 0xDFFF. If you allow a character that doesn't conform to these requirements to be entered into the string, the string is now invalid. Thus, it is necessary to ensure that the character is valid by some means before using it.
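For concreteness, a minimal sketch of that per-character check (the function name is mine, not from the class):

#include <cstdint>

// A Unicode scalar value must not exceed 0x10FFFF and must not fall in
// the surrogate range 0xD800-0xDFFF.
bool isValidScalarValue(std::uint32_t c)
{
    return c <= 0x10FFFF && (c < 0xD800 || c > 0xDFFF);
}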

As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

For the former, yes, very much so. There is a private method in the ValidatedCharacter class, accessible only by the friend BasicString, that sets its validity without actually checking. When the BasicString class returns a character, all characters in the string are already valid by design, so it returns a ValidatedCharacter using the special constructor that unconditionally marks it valid. This satisfies the former statement.
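A minimal sketch of that arrangement, with illustrative member names (the actual class definition was posted earlier in the thread):

template <typename charType, typename traitsType>
class ValidatedCharacter
{
public:
    // Public constructor: validates the raw character through the traits class.
    explicit ValidatedCharacter(charType c)
        : mCharacter(c), mValid(traitsType::isValidCharacter(c)) { }

    bool isValid() const { return mValid; }
    charType character() const { return mCharacter; }

private:
    // Trusted constructor: marks the character valid without checking.
    // Only BasicString uses it, for characters read out of an already-valid string.
    ValidatedCharacter(charType c, bool) : mCharacter(c), mValid(true) { }

    template <typename, typename> friend class BasicString; // parameter list illustrative

    charType mCharacter; // e.g. a 4-byte integer
    bool mValid;         // plus one byte of flag
};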

The latter, however, is where it differs slightly. The caller is allowed to use a single character of charType to interact with the string; this is the other half of the abstraction. A function that prints out the contents of a Unicode string can be templated to use a string in any format of UCS-2, UCS-4, UTF-7, UTF-8, UTF-16, UTF-32, or any other Unicode Transformation Format as the source of the characters, and simply iterate through the characters in the string; when used with this string class, these formats are all compatible in use, though they diverge in construction. The caller doesn't need to know what the transformation format of the string is, so long as it uses the same individual character encoding. Thus, to say

and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

it isn't strictly true; the data could be a sequence of UTF-8 code units, or it could be a whole Unicode code point. Since there are two forms of input, I believe it is equally important to validate both of them, and it is very trivial to validate the characters. Additionally, the caller can explicitly instantiate the ValidatedCharacter class, and check if a single character is valid without depending on the format being used. (I edited the class definition I posted earlier; it was missing the function for the caller to check whether the character was valid!)

Now, regarding when I said:

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

This is very important in that by thinking of all strings as having a character encoding (the POD charType) and a transformation format (the POD storageType), even ones whose encoding is the same as the transformation format, where each output is identical to its input, I can create a generic implementation that takes storage, applies a transformation, and gets a character, and vice versa. The result is this: with this one implementation, I can instantiate just about any string format that follows these rules, provided I define the appropriate traits type. So far, I have a UTF-8 traits class, and a generic traits class that performs the 1:1 mapping like std::basic_string (a sketch of such a traits interface follows the list below). However, when I implement UTF-16, I simply define the traits class for it (which will borrow a lot from the UTF-8 code, since the encoding is the same while the transformation format is different), and I then have a UTF-16 string class! The benefits of this approach are three-fold:

  • Speed of use. In a matter of maybe an hour, I can implement a "new" type of string class. The amount of effort saved is tremendous compared with writing a separate class that performs the same job at about the same speed, and gives me the headache of debugging two classes, which leads me into my next point:
  • Size and simplicity. There is only one string class implementation! This means that as I iron out bugs and make the implementation more efficient, all of the string types benefit from the work. I need only fix a bug once for it to be gone from all of my strings, and I do not have to repeat myself for the same operations on a slightly different format; I only need to make sure that each traits class works properly. Additionally, since the class supports a templated allocator type, it is a much easier solution than writing duplicated code that also supports custom allocators.
  • Versatility. I can support formats that I haven't even thought of supporting yet. Likely many that I won't ever use personally, but since I'm only paying for the instantiations that I use, this is a non-existent problem. For the formats that are raw arrays of characters internally, the inline no-op transformation functions get optimized out. For the formats that don't need validation because all characters are valid, the inline no-op validation functions get optimized out. And, while I plan to implement several formats, if code that uses this library needs a format that I didn't deem necessary, a traits class is all it takes for it to be supported.
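As a rough sketch of what such a traits interface might look like (the actual traits classes aren't shown in this thread, so the names and signatures here are illustrative):

#include <cstddef>

struct Utf8Traits
{
    typedef char32_t      charType;    // decoded Unicode code point
    typedef unsigned char storageType; // UTF-8 code unit

    static bool isValidCharacter(charType c);
    static std::size_t encodedLength(charType c);            // code units needed for c
    static std::size_t encode(charType c, storageType *out); // returns units written
    static charType decode(const storageType *in, std::size_t &consumed);
};

struct IdentityTraits // the 1:1 mapping, like std::basic_string<char>
{
    typedef char charType;
    typedef char storageType;

    // No-op validation and transformation; inline, so they optimize away.
    static bool isValidCharacter(charType) { return true; }
    static std::size_t encodedLength(charType) { return 1; }
    static std::size_t encode(charType c, storageType *out) { *out = c; return 1; }
    static charType decode(const storageType *in, std::size_t &consumed)
    {
        consumed = 1;
        return *in;
    }
};

// A hypothetical instantiation:
// typedef BasicString<Utf8Traits, DefaultAllocator> Utf8String;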

Sorry for the novel if that is more information than necessary.

I'm going to respond out-of-line this time because managing that many quotes would be tricky!

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

"For instance, in Windows, the wide-char string. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: it's storage type." - std::string and std::wstring are fundamentally broken from a Unicode point of view. It's called a wchar_t because it's meant to represent a character. But it does not represent a character, unless you're using UCS-2. It's convenient from a performance point of view to expect the character type and the storage type to be the same, but that's only useful with fixed-length encodings, which we're not really dealing with here. (Or weren't, until your last post!)

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

"the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input" - Sure. If your argument is that you want a function to be able to efficiently construct your UTF-8 strings out of bytes that you assume are already in UTF-8 form, that's fine (albeit something of a leaky abstraction - although labelling the string UTF-8 is itself a leaky abstraction to some degree). The issue in my mind is that std::string considers bytes and characters interchangeable, implying the data is already encoded before addition, and Unicode strings don't. So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

"I feel that it is very important to validate unknown characters" - Yes, but I don't see the worth of this class for it.

  • Do you have a stream of bytes which is presumed to be encoded correctly? Then you validate it by attempting to create a UTF-8 string out of it.
  • Do you have a single code point representing a character? Then again, you validate it by attempting to create a UTF-8 string out of it.

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.

I'd be tempted to go with something like:


class UTF8String
{
public:
    enum ERROR_CODE { SUCCESS, INVALID_CODE_POINT, INVALID_LENGTH, INVALID_PARAMETER };

    UTF8String(); // Construct an empty string

    // Initialize the string from the given data.
    // Each returns SUCCESS, or an error code describing the failure.
    ERROR_CODE InitFromASCII(const char *data, int codepage); // narrow text in a known codepage
    ERROR_CODE InitFromUTF8(const void *data);                // bytes presumed to be UTF-8
    ERROR_CODE InitFromCodePoint(int codePoint);              // a single Unicode code point
};

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in - for example, is a char* pointing at ASCII or UTF-8 data?
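Usage would then look something like this (buffer is assumed to hold raw bytes read from elsewhere):

UTF8String s;
if (s.InitFromUTF8(buffer) != UTF8String::SUCCESS)
{
    // Reject the input; s is still a valid (empty) string.
}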

By the way to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.
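Roughly, the trick being described (illustrative only; real std::string implementations differ in the details):

#include <cstddef>

class SmallString
{
    union
    {
        char  mSmall[16]; // in-place buffer: short strings avoid the heap entirely
        char *mHeap;      // used once the string outgrows the buffer
    };
    std::size_t mLength;
    bool mOnHeap; // records which union member is active
};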

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

My wording was ambiguous, and I apologize. What I really meant was a pointer to an array of storage units; that's the closest the class gets to an "array of characters" - an array of transformed character storage.

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

I agree, and this class was designed to make an effort to dodge that entirely, by abstracting away how many code units make up a code point. The majority of the functions (now that the overloads accepting pointers to storage arrays are removed) deal only with BasicString instances and ValidatedCharacter instances, removing how the characters are stored from view. You can still access the read-only storage and the size of the storage through special functions, but that is an implementation detail that has more use internally than externally.

So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

I agree. It is up to the validation function to determine if the sequence of code units is valid; the only remaining places where questionable code units are passed are the string creation methods, and the standalone validation methods that simply check for validity. I have been thoroughly convinced to drop all of the unsafe overloads.

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.

Simple in that no other classes need to be written, yes, but not simple by any other measure I can see. Regarding when I described its purpose earlier:

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

Finding a single character within a string incurs a relatively large amount of overhead and dynamic allocation if you first convert the single character to a string and then do a more expensive string search/comparison.

Just about any operation is faster in this class if you are working with only a single character, and it uses a lot less memory. Operations on single characters are very frequent with what I do. It really is worth it, in terms of measurable performance and dynamic allocation efficiency.

It obviously costs a lot more memory and time to make a whole temporary string out of one character just to throw it away when I'm done because I only wanted one character. To demonstrate it, these are the abstract steps to do such a thing:
  • Allocate an instance of the string (about 36 bytes for the default allocator, plus a pointer if dynamically allocated).
  • Call a factory method to create the string from a character:
    - Validate the character.
    - Calculate the length of the character in storage units (more bytes, plus allocator overhead).
    - Allocate storage for the character.
    - Transform the character to storage.
    - Perform bookkeeping on the string properties.
  • Pass the string by reference to the function:
    - Perform the more expensive string version of the operation, because the string could be any length other than one.
    - The function also has to decode the characters one at a time, transforming the storage back into characters again.
  • The destructor is called, freeing the storage.
  • Free the string instance.

OR

  • Create a ValidatedCharacter automatically (5 bytes).
  • Validate the character.
  • Pass the ValidatedCharacter.
  • Use the ValidatedCharacter's character member as-is.
  • Automatically reclaim the ValidatedCharacter instance.

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc. The benefits are well worth it to me. Additionally, you don't have a function with a return code to check; for example, when copying all characters of a UTF-16 string to a UTF-8 string, one can iterate through the first and append to the second quickly (see the sketch below). Since each character comes from a valid string, it even skips the validation check, making the operation just the copy of a small POD class.
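A sketch of that UTF-16 to UTF-8 copy, with hypothetical accessor names:

// Each characterAt() returns a ValidatedCharacter already marked valid
// (it came out of a valid string), so append() can skip the re-check.
for (std::size_t i = 0; i < utf16Source.length(); ++i)
    utf8Dest.append(utf16Source.characterAt(i));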

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in - for example, is a char* pointing at ASCII or UTF-8 data?

One problem is that this requires that the string already be instantiated, which means some constructor has already run. This results in a small penalty, which I will try to optimize regardless.

Additionally, having the user be explicit by calling explicitly named functions will not work; the class is templated over any encoding that follows certain rules, so it is possible that a given encoding cannot represent all of the code points that ASCII or UTF-8 can describe. This problem is alleviated, however, by only allowing input in the chosen encoding. I can swap the bool return for an enum at some point; I'm temporarily avoiding enumerations because I have to come up with a new list of return codes in the process of porting my C code to C++.

By the way to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.

A few implementations of std::basic_string use the short string optimization (most notably MSVC's), but I'd rather not. For a transformation format that isn't 1:1 in its code point to code unit size, it is difficult to determine a good buffer size. If you keep only 16 bytes around, that guarantees a minimum of only 4 valid UTF-8 characters, which hardly seems worth the effort. If you instead size the buffer for a maximum of 16 characters, UTF-7's largest valid character takes 8 bytes to store, requiring 128 bytes total. Any way you slice it, the benefit is too weak to justify it. I'd much prefer an implementation like GCC's, which uses a copy-on-write strategy, though I'm nowhere near that point yet. I am next going to optimize default-constructed strings to do no dynamic allocation, however.

Copy on write isn't used nearly as much as it used to be, since it doesn't play nicely with multiple threads (you need to keep locking the string even for reads, since another thread may try to write).

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

Copy on write isn't used nearly as much as it used to be, since it doesn't play nicely with multiple threads (you need to keep locking the string even for reads, since another thread may try to write).

Yeah, I know that locking is necessary. But it is one of many optimization options, and they serve different purposes. If I'm handling very large strings, copy on write would be worth it.

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc.

I still don't see a use for this. If I want to append a single character to a Unicode string, I can have an append or insert method that takes a code point as an integer. Yes, it will have to be able to deal with an invalid code point, but the alternative is that it will have to deal with a ValidatedCharacter - you still have to check the 'valid' flag to know that what you're adding is safe (which incidentally makes the class name a bit misleading). Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.

Part of this is because you've forced yourself to jump through hoops by having exception handling turned off, and your character class is an attempt to get back to stack-allocated cheap objects - but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.

To be honest I generally doubt the usefulness of per-character access in a unicode string anyway. Most per-character operations performed on strings in C or C++ are really operations performed on bytes in a byte array. When working with actual text it's hard to come up with real world use cases that involve tinkering with individual characters. The first ones that come to mind are things like Upper/Lower/Capitalise, but you can't do them correctly on individual characters - the German character ß becomes SS in upper case, for example. I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.

you still have to check the 'valid' flag to know that what you're adding is safe


You'll have to validate no matter what, and this provides a simpler way of doing so, by wrapping the call to the traits class's character validation function. The only difference between inserting a plain charType and inserting this class is that this way ensures that there isn't an invalid character. Otherwise, _every_ function must check whether the character is valid; with this class, the check happens in one place only, and the result can be re-used without the caller tampering with it. I see absolutely no reason why this is an inferior solution.

(which incidentally makes the class name a bit misleading)

What would you suggest?

Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.


Not entirely accurate; one way has the validity checked once, and everywhere that uses the character afterwards simply queries a flag. Without the class, every function must call the validation function, even on repeated operations with the same character. I can't see how this is inferior.

but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.


I don't see it that way. It provides the same securities as a full-blown string class with more efficiency. Even if I allowed adding a plain integer to the string, that would have the same implications.

To be honest I generally doubt the usefulness of per-character access in a unicode string anyway.


I am not against you leaving it out of your own string class. Keep in mind, this string class is not Unicode only; it handles other string types like a simple char string.

I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.


I would disagree heavily. If I read a configuration file into a char string, and I go to parse it, it would be ridiculous to treat every single character as its own string. It would be horridly inefficient.

EDIT:

Also, in the factory method, should I use the string's allocator to allocate the string instance itself? It seems like it would make the string hard to free; though if someone is using a custom allocator, they'd likely call the destructor manually in some way and then free the memory themselves, so it would be possible to use the string's allocator to allocate and free the string itself. The question is, does it make sense, and should this behavior be expected?
