# UTF-8 String Validation Responsibility


For these all-valid types, the current validation method would essentially be inlined to just returning true for every check, and thus be optimized out.

That's a good reason not to have a separate validation function, agreed.

But for every set of traits that do impose restrictions, you can't allow any eventuality wherein the contents of the string have not been successfully validated.

##### Share on other sites

But for every set of traits that do impose restrictions, you can't allow any eventuality wherein the contents of the string have not been successfully validated.

I agree. I was considering some sort of hybrid approach where a validated class could be implicitly constructed as a temporary to be passed immediately for the current behavior, or optionally instantiated explicitly and have its return code checked.

I really don't want to have error codes; there are a little over 65 functions that will need to have an out parameter to then have hundreds of lines of return code checking, when it is likely that no error will occur.

The main problem is, a lot of the functions have meaningful return values, so extra parameters are needed if I do go that route. I guess I thought I had finally gotten away from spending 50% of my effort writing code that simply checks the return code if it is okay to continue, or cleans up otherwise and propagates it upward. The one feature that would have revamped that has unacceptable overhead, it seems.

So, it seems the question was never assert() vs exceptions, because assert() is unacceptable behavior. Do you use return codes for everything?

Edited by Ectara

##### Share on other sites

I also came to the realization that validating strings internally might be difficult; the string class receives code unit arrays and their length in code points, not code units. In the example of UTF-8, if a code unit array that is only three units long is passed, and the lead unit indicates that there are four units, a function that only knows its length in characters might segfault when it attempts to validate the fourth unit. I can't find a way around this without knowing the length of the array in units.
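To make the truncation problem concrete, here is a minimal sketch of a validator that takes its length in code units rather than code points. The name `isValidUtf8` and its signature are assumptions for illustration, not part of the class discussed in this thread. Because the function knows how many bytes it may read, a lead unit that promises more continuation units than remain is rejected before anything past the end of the array is touched.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not the thread author's code): returns true if the
// `units` bytes starting at `s` are well-formed UTF-8. The length is given
// in code units, so truncated sequences are detected without overreading.
inline bool isValidUtf8(const unsigned char* s, std::size_t units) {
    assert(s != nullptr || units == 0);
    std::size_t i = 0;
    while (i < units) {
        const unsigned char lead = s[i];
        std::size_t need = 0;   // continuation units expected after the lead
        unsigned long cp = 0;   // code point decoded so far
        unsigned long min = 0;  // smallest code point allowed for this length
        if (lead < 0x80)                { need = 0; cp = lead;        min = 0; }
        else if ((lead & 0xE0) == 0xC0) { need = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { need = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { need = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;      // stray continuation unit or invalid lead

        // i < units holds here, so units - i - 1 cannot underflow.
        if (units - i - 1 < need) return false;  // sequence is truncated

        for (std::size_t k = 1; k <= need; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation unit
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += need + 1;
    }
    return true;
}
```

With this shape, the three-unit array with a four-unit lead from the example above simply fails validation instead of reading a fourth unit.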

##### Share on other sites

So, it seems the question was never assert() vs exceptions, because assert() is unacceptable behavior. Do you use return codes for everything?

I personally use exceptions for anything where I can't just log and abort(). They are far from perfect, but they are the best I currently have in C++.

Funnily enough, I think Perl may be the only language that gets this somewhat right. It has a bunch of convenient constructs to let you chain errors without excessive syntactic overhead:

```perl
my $var = myfunc() or return "myfunc failed";
```

##### Share on other sites

They are far from perfect, but they are the best I currently have in C++.

I agree; for all their faults, they do seem to be very useful in this regard. However, with the ever-looming possibility that exceptions might be disabled, it seems like they're out for me.

Funnily enough, I think Perl may be the only language that gets this somewhat right.

I wish other languages had simpler error handling routines. I'd settle for exceptions that handled only integer return codes, or something. The RTTI implied in checking which type the exception was can be killer. I've been reading that some embedded implementations perform this by doing string comparisons against the name of the type.

I have no idea how I am supposed to do all of these things without error codes. I can't use concatenation operators on my string class; those can't sensibly return an error code, and they only accept two operands. If I have anything but assertions or exceptions for error handling, it means I am not allowed to use operators or constructors that actually do anything. Is this the state of things, that without exceptions, we go back to C with classes?

##### Share on other sites

In the constructor, it only validates once. However, if it fails, there's a critical error that results in breaking into the debugger. So, in order to prevent this error, one must validate or sanitize before passing the data to the constructor. Or, do people not even make sure it is valid, and let the application abort to let them know?

It only breaks into the debugger because you've told it to. That is not usually what an exception thrown from a constructor does. It seems like you want (or wanted?) to use exceptions as assertions, but they're for different problems. Development-time errors can be fixed with static typing and assertions, and run-time errors on user input absolutely have to have conditional checks at some point along the pipeline. Both have their place. Exceptions can be used to implement both, but they're not necessary.

If you don't want to throw exceptions from a constructor, I humbly re-submit the idea of using a factory function to generate your strings. Either it validates the data and gives you a legit object or it rejects the data and tells you with a stern error code and a null pointer. This gives you one clear gateway between the unvalidated data and the UTF-8 strings.
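A rough sketch of that single gateway, under stated assumptions: `CheckedString`, `StringError`, and `create` are hypothetical names, the validity check is a placeholder that accepts only 7-bit ASCII where a real UTF-8 validator would go, and `std::unique_ptr` (C++11) is used so the null-on-failure pointer still cleans up after itself.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical error code for the sketch.
enum class StringError { None, InvalidEncoding };

class CheckedString {
public:
    // The only public way to turn raw code units into a string object.
    // Returns null and sets `err` when the data is rejected.
    static std::unique_ptr<CheckedString> create(const char* units,
                                                 std::size_t count,
                                                 StringError& err) {
        assert(units != nullptr);
        if (!unitsAreValid(units, count)) {
            err = StringError::InvalidEncoding;
            return nullptr;
        }
        err = StringError::None;
        return std::unique_ptr<CheckedString>(new CheckedString(units, count));
    }

    std::size_t sizeInUnits() const { return data_.size(); }

private:
    // Private: an instance can only exist if validation succeeded.
    CheckedString(const char* units, std::size_t count) : data_(units, count) {}

    // Placeholder check (assumption): accepts only 7-bit ASCII, standing in
    // for a full validator for the real encoding.
    static bool unitsAreValid(const char* units, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            if (static_cast<unsigned char>(units[i]) >= 0x80) return false;
        }
        return true;
    }

    std::string data_;
};
```

Callers check the result once, at creation; every function that later takes a `CheckedString` can rely on the type alone.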

I have 42 functions that accept a string of possibly invalid code units, and 24 functions that accept a possibly invalid character.

Sounds to me like you've got the abstraction in the wrong place. Why add unvalidated characters? Some languages solve this the brute force way - you can only add strings to strings, and characters are 1-length strings. So you wouldn't have all these functions that accept unvalidated data - you'd have them require the UTF-8 string and the caller bears the responsibility of passing in the correct type, just as it should for all the other types you pass into a function. You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also. Validate the data early, in creating the string, and then you don't need to worry when it comes to performing future operations on them.

The idea is that the Caller passes the correct types in, and the Callee returns the correct types out. This should be done to whatever degree your programming language allows. To the extent where the language can't enforce it or makes it tricky (eg. the function allows values from 1 to 100, but creating a type to enforce that is a hassle), you can check explicitly and consider an exception, an assertion, or an error code. But in cases where you can guarantee the correct data - ie. you have a type that enforces that constraint - then that type is what you should be passing in.

##### Share on other sites

That is not usually what an exception thrown from a constructor does.

I'm aware, but now I can't use exceptions, because of the possibility of them being disabled, which would result in termination regardless of whether an exception or an assertion was used.

I humbly re-submit the idea of using a factory function to generate your strings.

It looks like that's how it is going to be, just like in C. Though, I don't want to return a pointer to a dynamically allocated string, because I'd rather I be able to automatically manage its lifetime. This means that an invalid string might be wandering around if it fails initialization, or it might be double constructed if I tried to have a default, but valid, state. Is this the only way, return a newly allocated pointer?

Sounds to me like you've got the abstraction in the wrong place. Why add unvalidated characters?

I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well. I suppose I could ditch all of that, and accept only instances of the string class and instances of a unique validated character class.

It seems like an unnecessary operation to create temporary class instances for a string type that doesn't require validation. Perhaps some sort of template meta-programming that can check the traits class to see if validation is required, and then not enable the insecure functions.
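One way this could look, as a sketch only: the traits class exposes a compile-time flag, and the unchecked entry point is removed from overload resolution with `std::enable_if` whenever the flag says validation is required. The names `needsValidation`, `TraitsString`, `appendUnchecked`, and `HasAppendUnchecked` are invented for the example and do not come from the class in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical traits: one encoding where every value is valid, one where not.
struct RawByteTraits  { static constexpr bool needsValidation = false; };
struct Utf8LikeTraits { static constexpr bool needsValidation = true; };

template <typename Traits>
class TraitsString {
public:
    // Unchecked append. It only takes part in overload resolution when the
    // traits declare that no validation is needed.
    template <typename T = Traits,
              typename std::enable_if<!T::needsValidation, int>::type = 0>
    void appendUnchecked(const char* units, std::size_t count) {
        assert(units != nullptr);
        data_.append(units, count);
    }

    std::size_t sizeInUnits() const { return data_.size(); }

private:
    std::string data_;
};

// Detection helper so the effect can be verified at compile time.
template <typename S, typename = void>
struct HasAppendUnchecked : std::false_type {};

template <typename S>
struct HasAppendUnchecked<
    S, decltype(std::declval<S&>().appendUnchecked("", std::size_t(0)))>
    : std::true_type {};
```

For `TraitsString<Utf8LikeTraits>`, a call to `appendUnchecked` fails to compile, so the insecure path cannot be reached by accident.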

You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also.

If the resulting length is greater than the maximum length allowed, or something like that, then the caller must know that the string is impossible to access in entirety, or something like that. In this case, the resulting string would be illegitimate, because it violates the maximum length.

But in cases where you can guarantee the correct data - ie. you have a type that enforces that constraint - then that type is what you should be passing in.

I suppose you're right. I'm extremely reluctant to go back to how I always was, with checking return codes everywhere. I suppose I could make more of them simply fatal errors, like triggering assertions on receiving invalid parameters that violate the contract the function requires.

These changes also mean that I need to completely redesign the class, when I was so close to finishing it. This has me entirely frustrated.

##### Share on other sites

I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well. I suppose I could ditch all of that, and accept only instances of the string class and instances of a unique validated character class.

You are designing something for a fundamentally different purpose to std::basic_string, though. In general, std::basic_string doesn't give a damn whether its contents are valid, and for the most part, isn't even aware of the possibility of invalid strings.

##### Share on other sites

You are designing something for a fundamentally different purpose to std::basic_string, though. In general, std::basic_string doesn't give a damn whether its contents are valid, and for the most part, isn't even aware of the possibility of invalid strings.

Yeah, I've always been aware of that. I was originally planning on ignoring or replacing invalid characters as it encountered them, but it now seems like that'd be doing too much; it shouldn't be responsible for the data being valid.

##### Share on other sites

I humbly re-submit the idea of using a factory function to generate your strings.

It looks like that's how it is going to be, just like in C. Though, I don't want to return a pointer to a dynamically allocated string, because I'd rather I be able to automatically manage its lifetime. This means that an invalid string might be wandering around if it fails initialization, or it might be double constructed if I tried to have a default, but valid, state. Is this the only way, return a newly allocated pointer?

You can automatically manage its lifetime if you store the pointer in a smart pointer wrapper.

Alternatively you could just include a 'bad' or 'fail' flag (a bit like iostreams do), indicating that the object is not in a useful state. All member functions do nothing if the flag is set, and in debug mode they can assert if you like. This is reasonable if you're acting on data supplied by the programmer. If you're acting on data supplied by the user, then you might consider validating before construction, and you can expose the validation routine as a static function to permit that.
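A minimal sketch of the flag approach, loosely modelled on how iostreams report failure. `FlaggedString` and its members are hypothetical, the validity check is a 7-bit ASCII placeholder for a real validator, and the choice to make the flag sticky when a failed operand is appended is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class FlaggedString {
public:
    // Construction never throws on bad data; it records the failure instead.
    FlaggedString(const char* units, std::size_t count) : failed_(false) {
        assert(units != nullptr);
        for (std::size_t i = 0; i < count; ++i) {
            // Placeholder check (assumption): only 7-bit ASCII is accepted.
            if (static_cast<unsigned char>(units[i]) >= 0x80) {
                failed_ = true;
                return;
            }
        }
        data_.assign(units, count);
    }

    bool fail() const { return failed_; }
    std::size_t sizeInUnits() const { return data_.size(); }

    // Member functions do nothing once the flag is set. Appending a failed
    // operand marks this string as failed too (a choice made for the sketch).
    FlaggedString& append(const FlaggedString& other) {
        if (failed_) return *this;
        if (other.failed_) { failed_ = true; return *this; }
        data_ += other.data_;
        return *this;
    }

private:
    std::string data_;
    bool failed_;
};
```

The caller can run a chain of operations and test `fail()` once at the end, at the cost of not knowing which step went wrong.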

I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well.

Right, but basic_string is just a list of char anyway, with no encoding information. A pointer to a character array is just copying the data, which is guaranteed to be valid. You're thinking of a char* as "a string" but that's a bad way to look at it. It's a pointer to several instances of char - they are valid for a string with no encoding, but not valid for your string.

It seems like an unnecessary operation to create temporary class instances for a string type that doesn't require validation.

I don't see why you need to mix string types that don't need validation and string types that do. I think this is the downfall of many developers, often from English-speaking countries, who think of char* and std::string and UTF-8 all as text that should be easily interchangeable. Really you have to think of char* as bytes, UTF-8 as text, and std::string as a ham-fisted compromise between the two which isn't really useful for real world internationalised text.

You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also.

If the resulting length is greater than the maximum length allowed, or something like that, then the caller must know that the string is impossible to access in entirety, or something like that. In this case, the resulting string would be illegitimate, because it violates the maximum length.

I humbly suggest that you make the maximum size whatever you have room for in memory. If you try to add together more than 4GB of text, you probably have problems beyond Unicode issues.

Note that this problem you have raised is not unique to your string type. Basic types often overflow and/or raise exceptions because there's no good way to implement error codes with infix notation. Usually best just to accept the risk.
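As an illustration of the trade-off, the sketch below pairs a named function that can report a length overflow with an infix-friendly wrapper that treats the same condition as fatal. It uses `std::string` as a stand-in for the custom class; `tryAppend` and `concatOrDie` are hypothetical names, while `max_size()` is the standard member.

```cpp
#include <cassert>
#include <string>

// Named function: has room for an error result, so it can refuse politely.
inline bool tryAppend(std::string& dst, const std::string& src) {
    // Written as a subtraction so the check itself cannot overflow.
    if (src.size() > dst.max_size() - dst.size()) return false;
    dst.append(src);
    return true;
}

// Operator-style helper: no room for an error code, so a length overflow
// is treated as a contract violation and asserted.
inline std::string concatOrDie(const std::string& a, const std::string& b) {
    std::string out(a);
    const bool ok = tryAppend(out, b);
    assert(ok && "concatenation would exceed max_size()");
    (void)ok;  // silence unused-variable warnings when NDEBUG is defined
    return out;
}
```

An `operator+` could forward to something like `concatOrDie`, leaving `tryAppend` for the rare caller that genuinely expects to hit the limit.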

I'm extremely reluctant to go back to how I always was, with checking return codes everywhere.

That's why I think the error codes should be concentrated in one place, ie. the creation of the string. Then everything else can be enforced by type-checking. That's pretty much how Python and C# do it - you have a routine to get bytes into a Unicode string and vice versa, and nothing else needs to consider the chance of encoding errors.

##### Share on other sites

You can automatically manage its lifetime if you store the pointer in a smart pointer wrapper.

That strikes me as automatically manually managing its lifetime; I suppose I could simply provide an overload: one that takes a reference to a string, and one that returns a pointer to a string.

If you're acting on data supplied by the user, then you might consider validating before construction, and you can expose the validation routine as a static function to permit that.

I agree; I would only have an empty string constructor, and a copy constructor, aside from the static member function for character array initialization.

I don't see why you need to mix string types that don't need validation and string types that do.

So that code that uses it can use the same interface without needing to know the difference. That was the whole reason.

I think this is the downfall of many developers, often from English-speaking countries, who think of char* and std::string and UTF-8 all as text that should be easily interchangeable.

I don't. I really don't. Somehow, text has to get into a string. If I read UTF-8 from a file, it goes into an array of code units before it goes into the string class. So, I need to interact with it there. The other functions are for convenience.

The class has a character type, and a storage type. For ASCII, both are char. For UTF-8, the character type is an int, and the storage type is char. For UTF-16, the character type is an int, and the storage type is a short integer. I don't use string literals for UTF-8 text, nor do I use an std::string. I don't think of them as interchangeable; the storage type just coincidentally is a char array. This is placed for convenience, so someone doesn't have to create a string object, allocate a new internal storage array, copy the data, do the operation, free the data, then destroy the object. The fact that the array of storage units happens to be represented by a pointer to char is a pure coincidence.

I humbly suggest that you make the maximum size whatever you have room for in memory. If you try to add together more than 4GB of text, you probably have problems beyond Unicode issues.

Note that this problem you have raised is not unique to your string type. Basic types often overflow and/or raise exceptions because there's no good way to implement error codes with infix notation. Usually best just to accept the risk.

The max is the largest countable number; in other words, the most that can fit in the size type while reserving one value that corresponds to no valid index.

That's why I think the error codes should be concentrated in one place, ie. the creation of the string.

I agree. I can try my hardest to ensure that no errors can occur that aren't show-stoppers, and try to handle them up front, I guess.

##### Share on other sites

Here is my first implementation of a validated character type:

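```cpp
// charType, traitsType and _E_ASSERT are provided by the surrounding string class.
class ValidatedCharacter {
    charType c_;
    bool valid_;

    // Private: lets BasicString mark a character it has already validated
    // without checking it a second time.
    ValidatedCharacter(charType c, bool validity) : c_(c), valid_(validity) { }

public:
    // A default-constructed character is invalid until one is supplied.
    ValidatedCharacter() : c_(0), valid_(false) { }

    // Deliberately implicit: passing a bare charType validates it on the way in.
    ValidatedCharacter(charType c) : c_(c), valid_(traitsType::isCharValid(c)) { }

    ValidatedCharacter(const ValidatedCharacter & other) : c_(other.c_), valid_(other.valid_) { }

    // const, so the check also works on const instances and temporaries.
    bool isValid() const {
        return valid_;
    }

    // Reading an invalid character is a contract violation.
    operator charType() const {
        _E_ASSERT(valid_);

        return c_;
    }

    friend BasicString;
};
```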

This will offer the same interface as before, where passing a character will automatically implicitly construct a validated character class and check it on the way in. You can also manually instantiate ValidatedCharacter and check if it is valid before using it and have the option of reusing the result. If someone tries to read the character while it is marked as invalid, an assertion is triggered. Additionally, the string class has the ability to mark the character as valid unconditionally, when it returns a character from within the already validated string. Any tips on that, in the meantime? It seems like it works out nicely; since there's only one parameter to the visible constructors, an opaque class type will function in an ideal fashion by not requiring an explicit instantiation if it isn't necessary.

Edited by Ectara

##### Share on other sites
I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

##### Share on other sites

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

To find a single character within a string incurs a relatively large amount of overhead and dynamic allocation if you first convert the single character to a string, then do a more expensive string search/comparison.

Just about any operation is faster in this class if you are working with only a single character, and it uses a lot less memory. Operations on single characters are very frequent with what I do. It really is worth it, in terms of measurable performance and dynamic allocation efficiency.
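To illustrate why the single-character path can avoid allocation, here is a small sketch that encodes one code point into a four-byte stack buffer and searches for it directly. `std::string` stands in for the custom class, and `encodeUtf8` and `findCodePoint` are hypothetical names; the haystack is assumed to hold valid UTF-8 already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encodes one Unicode scalar value into `out`; returns the number of units.
inline std::size_t encodeUtf8(std::uint32_t cp, char out[4]) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out[0] = static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = static_cast<char>(0xF0 | (cp >> 18));
    out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out[3] = static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
}

// Returns the offset in code units of the first occurrence of `cp`, or
// std::string::npos. No temporary string object is created for the needle.
inline std::size_t findCodePoint(const std::string& utf8, std::uint32_t cp) {
    char needle[4];
    const std::size_t n = encodeUtf8(cp, needle);
    return utf8.find(needle, 0, n);
}
```

Because UTF-8 is self-synchronizing, a byte-wise match of a complete encoded sequence inside valid UTF-8 always falls on a code point boundary, so a plain substring search is sufficient here.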

##### Share on other sites

I have removed all publicly accessible functions that deal with using unvalidated text, and provided two main mechanisms for creating strings: a factory that returns a pointer to a dynamically allocated string instance or null on failure and a factory that takes a reference to a string and returns a bool indicating the result. I like having the choice of how I allocate a class instance, and being able to reuse already allocated instances. Additionally, there is a constructor that accepts text in a similar fashion (internally calling the factory that takes a reference), and triggers an assertion on failure, for strings created from internal text that is known to be valid.

And characters, as above, can be passed and implicitly validated, or explicitly instantiated and checked/reused. As soon as the character is read while invalidated, it triggers the assertion, so if a character originates from within the code, and it is found to be invalid, it fails as soon as it is used.

Again, I want to stress, the stuff that triggers an assertion on invalid text is for internal use only, that is absolutely not expected to be invalid; if I am manipulating invalid text where I shouldn't, I want it to abort as soon as possible.

If anyone can imagine a better way to do this, let me know. I'm starting to feel better about this, after all of the changes made.

On a side note, I have to say that I'm extremely thankful for the fact that I developed this class using TDD strategies, so after I removed unusable tests for functionality that no longer exists, the existing tests caught just about all of the immediate bugs that resulted from the rewrite, in addition to carefully placed assertions.

EDIT:

Also, in the factory method, should I use the string's allocator to allocate its own instance? It seems like it would make it hard to free it, though if someone is using a custom allocator, they'd be likely to call the destructor manually in some way, then free the memory themselves in some manner, so it would be possible to use the string's allocator to allocate/free the string itself. The question is, does it make sense, and should this behavior be expected.

Edited by Ectara

##### Share on other sites

I don't see why you need to mix string types that don't need validation and string types that do.

So that code that uses it can use the same interface without needing to know the difference. That was the whole reason.

Then have them provide the same interface. They don't need to be able to interact with each other for that. But see below...

I don't. I really don't. Somehow, text has to get into a string. If I read UTF-8 from a file, it goes into an array of code units before it goes into the string class. So, I need to interact with it there. The other functions are for convenience.

Right, but that instance of 'char* to UTF-8' is logically completely different from 'char* to std::string'. You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable. The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String. We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it. As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

##### Share on other sites

They don't need to be able to interact with each other for that.

I think I'm misrepresenting what I mean. Different types of strings are not allowed to interact with each other, unless you purposely pass a pointer to a datatype that happens to look like the storage type.

You go on to show that you do fully appreciate this difference - so the only thing I don't understand is why you talk about trying to implement an interchangeable interface, when the two are not comparable.

Again, I think I'm not expressing what I really mean. One cannot be used where the other is expected; they are two completely different opaque types. However, a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface. Again, the incompatible string types don't interact with each other. It's just made so that a generic function that expects a string and does something with a string can use the same functions, notation, and algorithm for handling them, so long as it doesn't need to know what kind of string it is.

The equivalent to std::string(char*) is UTF8String(int*). There is no legitimate part of your code where you have a char* and you could interchangeably create either std::string or UTF8String.

Right, but there is no part of my code where I would ever try to make a string out of data that doesn't belong to its encoding.

We don't build strings out of arrays of the storage type, because that is just an implementation detail - we build them out of arrays of the character type.

I feel that is debatable. For instance, in Windows, the wide-char string. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: its storage type. The fixed-width strings can be interpreted either way. The new specialization std::u16string is another example where it interacts with code units only; the only thing is, I also provide a method of decoding the character from the storage, as well. In my opinion, it is well established that several widely used string classes interact with elements of their storage, even though they are used to represent logical characters.

I would have liked to allow both arrays of code points and code units, but for strings where the two are the same, it'd fail to compile. I didn't get around to making a work-around, and to be honest, a string is allowed to have charType and storageType be the same type, but still have it be encoded, and need to be processed in terms of code points and code units. Thus, I can't have both code point and code unit input, and I decided that it is best to have code unit input instead.

After all, if you read UTF-16 from a file, why would you decode it from UTF-16 to Unicode code points, then pass those to the string class, where it re-encodes it back to UTF-16, because it accepts Unicode code points? I feel that it is a better design to create a string from the storage that you will have on hand that requires the least amount of transformation; code points need to be converted to the string's internal representation, which is wasteful if the data is already in that representation to begin with. For adding characters, one can simply make a loop using the functions that handle single characters. If need be, I can make a specially named function that creates from a character array if it turns out to be a performance issue.

If you have a stronger argument for accepting characters only, I'm open to hearing it.

Of course you do need a function that builds UTF8 strings out of bytes, performing the encoding operation - but that has no equivalent in std::string.

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

I don't see what you gain from that separate character type - surely that per-character validation operation is only half of the story since you already need to have a 'charType' in order to construct it.

I'm not quite sure that I follow what you mean. Are you referring to the ValidatedCharacter type? I feel that it is very important to validate unknown characters; the character input functions accept an instance of charType, and charType may or may not be a valid character. After all, it's likely a simple integer type, so using UTF-16 as an example, one would have to ensure that the character value is between 0 and 0x10FFFF, and not between 0xD800 and 0xDFFF. If you allow a character that doesn't conform to these requirements to be entered into the string, the string is now invalid. Thus, it is necessary to ensure that the character is valid by some means before using it.
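The range check described above is small enough to write out. This is a sketch with hypothetical names (`isUnicodeScalarValue`, `utf16UnitsFor`); in the design under discussion the equivalent test would presumably live in the traits class as `isCharValid`.

```cpp
#include <cassert>
#include <cstdint>

// True for values a Unicode string may contain as a character:
// within 0..0x10FFFF and not in the surrogate block 0xD800..0xDFFF.
constexpr bool isUnicodeScalarValue(std::uint32_t c) {
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

// Number of UTF-16 code units needed for a valid character. Passing an
// invalid value is a contract violation, hence the assertion.
inline int utf16UnitsFor(std::uint32_t c) {
    assert(isUnicodeScalarValue(c));
    return c < 0x10000 ? 1 : 2;
}
```

Note that the check is the same for UTF-8, UTF-16, and UTF-32, since it concerns the code point rather than its stored form.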

As I would imagine it, once the data is successfully into the UTF-8 string, all characters are valid by definition, and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

For the former, yes, very much so. There's a private constructor in the ValidatedCharacter class that is only accessible by the friend BasicString, that sets its validity without actually checking. When the BasicString class returns a character, all characters in the string are already valid by design, so it returns a ValidatedCharacter using the special constructor that unconditionally makes it valid. This satisfies the former statement.

The latter, however, is where it differs slightly. The caller is allowed to use a single character of charType to interact with the string; this is the other half of the abstraction. A function that prints out the contents of a Unicode string can be templated to use a string in any format of UCS-2, UCS-4, or UTF-7, UTF-8, UTF-16, UTF-32, or any other Unicode Transformation Format as the source of the characters, and simply iterate through the characters in the string; these are all compatible formats in use, though they diverge in construction, when used with this string class. The caller doesn't need to know what the transformation format of the string is, so long as it uses the same individual character encoding. Thus, to say

and before data is in the UTF-8 string, it's just bytes and meaningless until validated.

it isn't strictly true; the data could be a sequence of bytes representing a UTF-8 sequence, or it could be a whole Unicode code point. Since there are two forms of input, I believe it is equally important to validate both of them, and it is very trivial to validate the characters. Additionally, the caller can explicitly instantiate the ValidatedCharacter class, and check if a single character is valid without it depending on the format being used. (I edited the class definition I posted earlier; it was missing the function for the caller to check if the character was valid!)

Now, regarding when I said:

I'd argue that this isn't entirely true; the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input, and I'll explain why this is important in a moment.

This is very important in that by thinking of all strings as having a character encoding (the POD charType), and a transformation format (the POD storageType), even ones that have an encoding that is the same as transformation format, where each output is identical to its input, it allows me to create a generic implementation where it takes storage, applies a transformation, and gets a character, and vice-versa. The result is this: with this one implementation, I can instantiate just about any string format that follows these rules, if I define the appropriate traits type. I have, so far, a UTF-8 traits class, and a generic traits class that performs the 1:1 mapping like std::basic_string. However, when I implement UTF-16, I simply define the traits class for it (which will borrow a lot from the UTF-8 code, since the encoding is the same, while the transformation format is different), and I now have a UTF-16 string class! The benefits of this approach are three-fold:

• Speed of use. In a matter of maybe an hour, I can implement a "new" type of string class. The amount of effort saved is tremendous, over writing a separate class that will perform the same job at about the same speed, and give me a headache now having to debug two classes, which leads me into my next point:
• Size and Simplicity. There is only one string class implementation! This means that as I iron out bugs and make the implementation more efficient, all of the string types will benefit from the work. I need only catch a bug once to make sure it is in none of my strings, and I do not have to repeat myself for the same operations on a slightly different format. I only need to make sure that each traits class works properly. Additionally, since the class supports using a templated allocator type, it is a much easier solution than writing duplicated code that also supports custom allocators.
• Versatility. I can support formats that I haven't even thought of supporting yet. Likely many that I won't ever use personally, but since I'm only paying for the instantiations that I use, this is a non-existent problem. For the formats that are raw arrays of characters internally, the inline no-op transformation functions get optimized out. For the formats that don't need validation because all characters are valid, the inline no-op validation functions get optimized out. And, while I plan to implement several formats, if code that uses this library needs a format that I didn't deem necessary, a traits class is all it takes for it to be supported.
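To make the traits idea concrete, here is a minimal sketch of how one string implementation could be instantiated with different formats. It is not my actual class: the names `ByteTraitsSketch`, `Utf8TraitsSketch`, `BasicStringSketch`, `isValid`, and `encode` are made up for illustration, and `std::vector` stands in for the templated allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical traits for a 1:1 format: one storage unit per character.
struct ByteTraitsSketch {
    typedef std::uint32_t charType;
    typedef unsigned char storageType;

    static bool isValid(charType c) { return c <= 0xFF; }

    static void encode(charType c, std::vector<storageType>& out) {
        assert(isValid(c));  // callers are expected to validate first
        out.push_back(static_cast<storageType>(c));
    }
};

// Hypothetical traits for UTF-8: one to four storage units per character.
struct Utf8TraitsSketch {
    typedef std::uint32_t charType;
    typedef unsigned char storageType;

    // Valid means: inside the Unicode range and not a surrogate.
    static bool isValid(charType c) {
        return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
    }

    static void encode(charType c, std::vector<storageType>& out) {
        assert(isValid(c));  // callers are expected to validate first
        if (c < 0x80) {
            out.push_back(unit(c));
        } else if (c < 0x800) {
            out.push_back(unit(0xC0 | (c >> 6)));
            out.push_back(unit(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(unit(0xE0 | (c >> 12)));
            out.push_back(unit(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(unit(0x80 | (c & 0x3F)));
        } else {
            out.push_back(unit(0xF0 | (c >> 18)));
            out.push_back(unit(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(unit(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(unit(0x80 | (c & 0x3F)));
        }
    }

private:
    static storageType unit(charType v) { return static_cast<storageType>(v); }
};

// The single string implementation; the format comes entirely from Traits.
template <typename Traits>
class BasicStringSketch {
public:
    BasicStringSketch() : length_(0) {}

    // Returns false and leaves the string unchanged if c is not valid.
    bool append(typename Traits::charType c) {
        if (!Traits::isValid(c)) return false;
        Traits::encode(c, storage_);
        ++length_;
        return true;
    }

    std::size_t length() const { return length_; }               // in characters
    std::size_t storageSize() const { return storage_.size(); }  // in storage units

private:
    std::vector<typename Traits::storageType> storage_;
    std::size_t length_;
};
```

With this shape, `BasicStringSketch<Utf8TraitsSketch>` and `BasicStringSketch<ByteTraitsSketch>` share every line of the string logic, and for the 1:1 traits the inline `encode` reduces to a single push.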

Edited by Ectara

##### Share on other sites

I'm going to respond out-of-line this time because managing that many quotes would be tricky!

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

"For instance, in Windows, the wide-char string. You would construct an std::wstring from an array of wchar_t, which is a UTF-16 code unit: it's storage type." - std::string and std::wstring are fundamentally broken from a Unicode point of view. It's called a wchar_t because it's meant to represent a character. But it does not represent a character, unless you're using UCS-2. It's convenient from a performance point of view to expect the character type and the storage type to be the same, but that's only useful with fixed-length encodings, which we're not really dealing with here. (Or weren't, until your last post!)

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

"the function that builds an std::string out of bytes performs an encoding, one where each output is identical to the input" - Sure. If your argument is that you want a function to be able to efficiently construct your UTF-8 strings out of bytes that you assume are already in UTF-8 form, that's fine (albeit something of a leaky abstraction - although labelling the string UTF-8 is itself a leaky abstraction to some degree). The issue in my mind is that std::string considers bytes and characters interchangeable, implying the data is already encoded before addition, and Unicode strings don't. So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

"I feel that it is very important to validate unknown characters" - Yes, but I don't see the worth of this class for it.

• Do you have a stream of bytes which is presumed to be encoded correctly? Then you validate it by attempting to create a utf-8 string out of it.
• Do you have a single code point representing a character? Then again, you validate it by attempting to create a utf-8 string out of it.

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.
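Validating a byte stream by attempting to build a string from it, as the two bullets above describe, comes down to a check like the following. This is only a sketch under my own assumptions: `isValidUtf8` is a hypothetical free function, and it takes the length in code units, because a length in characters cannot protect against a truncated trailing sequence.

```cpp
#include <cassert>
#include <cstddef>

// Returns true if the first `size` bytes at `data` are well-formed UTF-8.
// `size` is a count of code units (bytes), not of characters.
bool isValidUtf8(const unsigned char* data, std::size_t size) {
    assert(data != 0 || size == 0);
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long minForLen[5] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = data[i];
        std::size_t len;
        unsigned long cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        // The lead byte promises more units than the array actually holds.
        if (len > size - i) return false;

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (len > 1 && cp < minForLen[len]) return false;  // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // surrogate
        if (cp > 0x10FFFF) return false;                   // out of range
        i += len;
    }
    return true;
}
```

A three-byte array whose lead byte announces four units is rejected by the `len > size - i` test before any out-of-bounds read happens, which is exactly the case a character count alone cannot catch.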

##### Share on other sites

I'd be tempted to go with something like:

    class UTF8String
    {
    public:
        enum ERROR_CODE { SUCCESS, INVALID_CODE_POINT, INVALID_LENGTH, INVALID_PARAMETER };

        UTF8String();  // Construct empty string

        // Initialize string from given data. Returns error code on failure.
        ERROR_CODE InitFromASCII(const char *data, int codepage);
        ERROR_CODE InitFromUTF8(const void *data);
        ERROR_CODE InitFromCodePoint(int codePoint);
    };

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in. For example, is a char * pointing at ASCII or UTF8 data?
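To show how the "always valid" guarantee might hold in practice, here is a minimal sketch of one such init function. The class name `Utf8StringSketch`, the `bytes()` accessor, and the use of `std::string` as backing storage are assumptions made for the example, not part of the interface above.

```cpp
#include <cassert>
#include <string>

class Utf8StringSketch {
public:
    enum ErrorCode { SUCCESS, INVALID_CODE_POINT };

    Utf8StringSketch() {}  // starts out as a valid empty string

    // Replaces the contents with a single code point.
    // On failure the previous (valid) contents are left untouched.
    ErrorCode InitFromCodePoint(unsigned long cp) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return INVALID_CODE_POINT;

        // Build into a temporary so a failure can never leave a half-written string.
        std::string tmp;
        if (cp < 0x80) {
            tmp += static_cast<char>(cp);
        } else if (cp < 0x800) {
            tmp += static_cast<char>(0xC0 | (cp >> 6));
            tmp += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            tmp += static_cast<char>(0xE0 | (cp >> 12));
            tmp += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            tmp += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            tmp += static_cast<char>(0xF0 | (cp >> 18));
            tmp += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            tmp += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            tmp += static_cast<char>(0x80 | (cp & 0x3F));
        }
        assert(!tmp.empty());
        bytes_.swap(tmp);
        return SUCCESS;
    }

    const std::string& bytes() const { return bytes_; }  // read-only storage

private:
    std::string bytes_;
};
```

The caller checks the returned code and otherwise keeps using the object; there is no state in which the object holds invalid data.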

By the way, to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.

##### Share on other sites

"a templated class or function that iterates through all of the characters in a string and does something that doesn't depend upon the precise encoding would be able to use the strings with the same interface." - The reason this came up is because you said "it accepts pointers to character arrays" - if you meant the type of 'character' varies depending on the type of string, then great. But if you meant the C++ type 'char', then that isn't equivalent.

My wording was ambiguous, and I apologize. What I really meant was a pointer to an array of storage units; that's the closest that the class will get to an "array of characters". All I really meant was an array of transformed character storage.

If you want to have a system which generalises to both fixed length and variable length and plays to both their strengths, great - but that will complicate the interface because the assumptions you can safely make about fixed length don't apply to variable length. There's a reason we still don't have proper Unicode support in C++, after all!

I agree, and this class was designed to make an effort to dodge that entirely, by abstracting how many code units there are to a code point. The majority of the functions (now that the overloads accepting pointers to storage arrays are removed) only deal with BasicString instances and ValidatedCharacter instances, removing how the characters are stored from view. You can still access the read-only storage and the size of the storage through special functions, but that is an implementation detail that has more use internally than externally.

So by copying that function signature, you only actually copy part of the interface in logical terms. Any bytes you pass to std::string yield a valid string. That isn't the case for a UTF-8 string.

I agree. It is up to the validation function to determine if the sequence of code units is valid; the only remaining places where questionable code units are passed are the string creation methods, and the standalone validation methods that simply check for validity. I have been thoroughly convinced to drop all of the unsafe overloads.

I can't see a use case for a separate character type that needs to know whether it is valid or not. A one-length string performs the same role and simplifies the code.

Simple in that no other classes need to be written, yes, but not simple by any other measures I can see. Regarding when I described its purpose earlier:

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

To find a single character within a string incurs a relatively large amount of overhead and dynamic allocation if you first convert the single character to a string, then do a more expensive string search/comparison.

Just about any operation is faster in this class if you are working with only a single character, and it uses a lot less memory. Operations on single characters are very frequent with what I do. It really is worth it, in terms of measurable performance and dynamic allocation efficiency.

It obviously costs a lot more memory and time to make a whole temporary string out of one character just to throw it away when I'm done because I only wanted one character. To demonstrate it, these are the abstract steps to do such a thing:
1. Allocate an instance of the string (about 36 bytes for the default allocator, plus a pointer if dynamically allocated).
2. Call a create string factory method to create from a character.
   - Validate the character.
   - Calculate the length of the character in storage units. (adds more bytes plus allocator overhead)
   - Allocate storage for the character.
   - Transform the character to storage.
   - Perform bookkeeping on the string properties.
3. Pass the string by reference to the function.
   - Do a more expensive string version, because it could be any length other than one, as well.
   - The function also has to decode the characters one at a time, which transforms the storage back to a character again.
4. Destructor is called, frees the storage.
5. Frees the string instance.

OR

1. Create ValidatedCharacter automatically (5 bytes)
2. Validates the character.
3. Pass the ValidatedCharacter.
4. Use the ValidatedCharacter's character member as-is.
5. Automatically reclaim the ValidatedCharacter instance.

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc. The benefits are pretty worth it to me. Additionally, you don't have a function with a return code to check; in an example of copying all characters of a UTF-16 string to a UTF-8 string, one can iterate through the first and append to the second quickly. Since it comes from a valid string, it even skips the validation check, making it just copying a small POD class.
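As a rough illustration of the second path, here is a sketch of what such a character wrapper and a single-character search could look like. The names `ValidatedCharSketch`, `findChar`, and `kNotFound` are hypothetical, the validity rule is hard-coded for Unicode instead of coming from a traits class, and a vector of decoded code points stands in for the string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A small POD-like wrapper: the character plus a flag computed exactly once.
class ValidatedCharSketch {
public:
    // Deliberately implicit, so a temporary can be created at the call site.
    ValidatedCharSketch(std::uint32_t c)
        : character_(c),
          valid_(c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF)) {}

    bool isValid() const { return valid_; }

    // Only meaningful when isValid() is true.
    std::uint32_t character() const {
        assert(valid_);
        return character_;
    }

private:
    std::uint32_t character_;  // 4 bytes of payload
    bool valid_;               // 1 byte of payload; callers cannot tamper with it
};

const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Finds the first occurrence of c; no allocation, no temporary string.
std::size_t findChar(const std::vector<std::uint32_t>& text,
                     ValidatedCharSketch c) {
    if (!c.isValid()) return kNotFound;  // one flag query, no re-validation
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == c.character()) return i;
    }
    return kNotFound;
}
```

The five bytes mentioned above are the payload; on many platforms alignment will likely pad `sizeof` to eight, which is still far smaller than a string instance plus its heap block.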

That way the string is always valid, and the initialization is done with functions that can return an error code. It doesn't force any heap allocation on you either. It also means the user needs to be explicit about what format their data is in. For example is a char * pointing at ASCII or UTF8 data?

One problem is that this requires the string to be instantiated already, which calls whichever constructor is used. This will result in a small penalty that I will try to optimize regardless.

Additionally, having the user be explicit by calling explicitly named functions will not work; the class is templated with any encoding that follows certain rules, so it is possible that a given encoding cannot represent all of the code points that either ASCII or UTF-8 can describe. This problem is alleviated, however, by only allowing input in the chosen encoding. I can swap the bool for an enum, at some point; I'm temporarily avoiding enumerations because I have to come up with a new list of return codes in the process of porting my C code to C++.

By the way to keep the heap allocation overhead of short strings down even further you can borrow a trick from std::string. It declares a small statically sized buffer within the class (usually around 16 bytes I believe). Strings which fit in that buffer can avoid heap allocations completely.

A few implementations of std::basic_string use short string optimization (most notably MSVC), but I'd rather not. For a transformation format that isn't 1:1 in its code point to code unit size, it is difficult to determine what is a good size. If you keep only 16 bytes around, it guarantees room for only 4 valid UTF-8 characters, which is hardly worth the effort. If you try to predict the maximum size of 16 characters, UTF-7's largest valid character takes 8 bytes to store, requiring 128 bytes total. Any way you try, the benefit is too weak to be worth making the effort. I'd much prefer an implementation like GCC's, where it uses a copy-on-write strategy, though I'm nowhere near that point yet. I am next going to optimize default-constructed strings to do no dynamic allocation, however.
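The sizing arithmetic behind that can be written down as compile-time checks. This is just a sketch: the helper names are made up, and the 8-byte UTF-7 figure is the one I quoted above, taken as an assumption here.

```cpp
#include <cstddef>

// Worst-case storage units per character for each format.
constexpr std::size_t kMaxUtf8UnitsPerChar = 4;
constexpr std::size_t kMaxUtf7UnitsPerChar = 8;  // assumed, as quoted in the post

// How many characters a fixed inline buffer is guaranteed to hold.
constexpr std::size_t charsGuaranteed(std::size_t bufferBytes,
                                      std::size_t maxUnitsPerChar) {
    return bufferBytes / maxUnitsPerChar;
}

// How large an inline buffer must be to always hold `chars` characters.
constexpr std::size_t bytesNeeded(std::size_t chars,
                                  std::size_t maxUnitsPerChar) {
    return chars * maxUnitsPerChar;
}

// A 16-byte buffer only guarantees 4 UTF-8 characters.
static_assert(charsGuaranteed(16, kMaxUtf8UnitsPerChar) == 4,
              "16 bytes guarantee four worst-case UTF-8 characters");

// Guaranteeing 16 characters in the 8-byte worst case needs 128 bytes.
static_assert(bytesNeeded(16, kMaxUtf7UnitsPerChar) == 128,
              "sixteen worst-case characters need 128 bytes");
```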

Edited by Ectara

##### Share on other sites

Copy on write isn't used nearly as much as it used to be since it doesn't play nice with multiple threads (need to keep locking the string, even if reading since another thread may try and do a write).

##### Share on other sites

Copy on write isn't used nearly as much as it used to be since it doesn't play nice with multiple threads (need to keep locking the string, even if reading since another thread may try and do a write).

Yeah, I know that locking is necessary. But, it is one of the many optimization options, and they do serve different purposes. If I'm handling very large strings, copy on write would be worth it.

##### Share on other sites

It uses a lot less memory, and can be several times faster for pretty common operations that involve only one character, like finding a single character, appending a single character, inserting a single character, replacing a single character, etc.

I still don't see a use for this. If I want to append a single character to a Unicode string, I can have an append or insert method that takes a code point as an integer. Yes, it will have to be able to deal with an invalid code point, but the alternative is that it will have to deal with a ValidatedCharacter - you still have to check the 'valid' flag to know that what you're adding is safe (which incidentally makes the class name a bit misleading). Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.

Part of this is because you've forced yourself to jump through hoops by having exception handling turned off, and your character class is an attempt to get back to stack-allocated cheap objects - but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.

To be honest, I generally doubt the usefulness of per-character access in a Unicode string anyway. Most per-character operations performed on strings in C or C++ are really operations performed on bytes in a byte array. When working with actual text it's hard to come up with real world use cases that involve tinkering with individual characters. The first ones that come to mind are things like Upper/Lower/Capitalise, but you can't do them correctly on individual characters - the German character ß becomes SS in upper case, for example. I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.

##### Share on other sites

you still have to check the 'valid' flag to know that what you're adding is safe

You'll have to validate no matter what, and this provides a simpler way of doing so, by wrapping the call to the traits class' character validation function. The only difference between inserting a plain charType and this, is that this way ensures that there isn't an invalid character. Otherwise, _every_ function must now check to see if the character is valid; with this class, it checks in one place only, and the result can be re-used without the caller tampering with it. I see absolutely no reason why this is an inferior solution.

(which incidentally makes the class name a bit misleading)

What would you suggest?

Both ways require that the append/insert/replace operation checks validity and has a way of dealing with a validity error.

Not entirely accurate; one way has the validity checked once, and then everywhere that uses it simply queries a flag to see if it is valid. Without the class, every function must call the validation function, even on repeated operations with the same character. I can't see how this is inferior.

but it just reintroduces the problem you originally had in that you can create invalid data. Being able to add this type into your string is basically poking a hole through the firewall you set up.

I don't see it that way. It provides the same securities as a full-blown string class with more efficiency. Even if I allowed adding a plain integer to the string, that would have the same implications.

To be honest I generally doubt the usefulness of per-character access in a unicode string anyway.

I am not against you leaving it out of your own string class. Keep in mind, this string class is not Unicode only; it handles other string types like a simple char string.

I would argue that legitimate character-level operations are rare enough that expecting them to be done with string instances is reasonable.

I would disagree heavily. If I read a configuration file into a char string, and I go to parse it, it would be ridiculous to treat every single character as its own string. It would be horridly inefficient.
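As a small example of the kind of parsing I mean, here is a sketch that splits a configuration line at its first '=' by looking at one character at a time. The function name `splitKeyValue` and the `key=value` line format are assumptions for the example, and std::string stands in for the simple char string; the outputs must be separate objects from the input line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Splits "key=value" at the first '='.
// Returns false (and leaves key/value untouched) if the line has no '='.
bool splitKeyValue(const std::string& line, std::string& key, std::string& value) {
    for (std::size_t i = 0; i < line.size(); ++i) {
        // A plain single-character comparison; no temporary string is built.
        if (line[i] == '=') {
            key.assign(line, 0, i);
            value.assign(line, i + 1, std::string::npos);
            // Everything except the '=' itself ended up in key or value.
            assert(key.size() + value.size() + 1 == line.size());
            return true;
        }
    }
    return false;
}
```

If every `line[i] == '='` had to go through a one-character string instead, each iteration would pay for construction, a string comparison, and destruction.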

##### Share on other sites

EDIT:

Also, in the factory method, should I use the string's allocator to allocate its own instance? It seems like it would make it hard to free it, though if someone is using a custom allocator, they'd be likely to call the destructor manually in some way, then free the memory themselves in some manner, so it would be possible to use the string's allocator to allocate/free the string itself. The question is, does it make sense, and should this behavior be expected?