
UTF-8 String Validation Responsibility


51 replies to this topic

#21 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 02:36 PM

In these situations you need to give the user some way to check the results. Either a by-reference error code, a factory function that returns a status, or a thrown exception.


...I suppose now that I've gone the route of avoiding exceptions, there is no way to return an error code from the constructor without an out parameter, and EVERY function that ever accepts external text must now have an out parameter that returns a status code.

I also would not recommend a separate validation pass, unless validation is very cheap, and conversion very expensive.


Then how do I establish the invariant that the text is now valid, if I don't check? Which validation is superfluous? If you are referring to what the caller would do to make sure the data is valid, they could validate or sanitize the data; one reports an error if it is invalid, the other provides a valid string even if the input is invalid.

I kind of feel like this is all going backwards on what was said previously.


#22 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 21 February 2013 - 03:37 PM

I would make validation an external operation, which produces an opaque type, and construct your string from the opaque type.
 
Something like:
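#include <cstddef>   // size_t, NULL
#include <cstdlib>   // std::abort

class StringUtf8;

// Opaque result of validation; only Validate() and StringUtf8 can see inside.
class ValidatedUtf8
{
	const char *data;
	size_t length;

	friend class StringUtf8;
	friend ValidatedUtf8 Validate(const char *input);

public:
	// const, so it can be called through the const reference below
	bool Valid() const {return length == 0 || data != NULL;}
};

ValidatedUtf8 Validate(const char *input) {
	// ...
}

class StringUtf8
{
	char *data;
	size_t length;
public:
	// copy() stands in for whatever routine duplicates the validated buffer
	StringUtf8(const ValidatedUtf8 &input) : data(NULL), length(input.length) {
		if (!input.Valid())
			std::abort();	// check before touching the data
		data = copy(input.data);
	}
};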
This has several advantages:
  • All inputs to your system will have been validated.
  • The programmer can check the results of validation, if they choose to.
  • You can fail fast on invalid data, since the programmer had the opportunity to check it.
  • The cost of validation is only paid once (validation can be expensive for long strings).

Edited by swiftcoder, 21 February 2013 - 03:38 PM.

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#23 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 04:08 PM


Something like:

This seems superfluous, too. If my string class already has an invariant that the contents must be valid, why would I have a second class that also is required to contain valid text, or nothing at all? This seems like an overly complicated design that splits existing functionality into two classes.

 

All inputs to your system will have been validated.


I have 42 functions that accept a string of possibly invalid code units, and 24 functions that accept a possibly invalid character. For a simple for loop that appends a character to the end of the string, now I need to have a large loop where a factory makes an opaque data type containing information on whether or not the data is valid, then pass that to the string. It seems like everywhere that uses a character array requires several lines of boilerplate code (constructing a string from "hello, world" now requires a temporary class instance, and several more lines to prove to the string class that the text is valid.)

The programmer can check the results of validation, if they choose to.


I have a static member function that uses the character traits to check if a character array is valid.

 

You can fail fast on invalid data, since the programmer had the opportunity to check it.


You can check at the same points without this extra class.

 

The cost of validation is only paid once (validation can be expensive for long strings).

This is also true, but when compiling in release mode, the assertions disappear, and thus the inner validation is not done; you'd then be paying for it only once.

Surely there must be a better solution than doubling the amount of effort required to use the class, even when you absolutely know that the text about to be entered is valid (because you just created it programmatically, or it is a predefined string).



#24 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 21 February 2013 - 04:21 PM

This is also true, but when compiling in release mode, the assertions disappear, and thus the inner validation is not done; you'd then be paying for it only once.

It's not actually feasible to do this, in the real world.

The programmer may pass only valid strings during development, and thus never discover the need to manually run validation. When his software launches into the wild, and everyone loads their own data, your string class suddenly contains invalid utf8, and now you have the potential for crashes and security flaws...

I admit that my solution isn't the most elegant, but if you don't want to use exceptions, out parameters, or factory functions, I can't think of a markedly cleaner way.




#25 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 06:29 PM

The programmer may pass only valid strings during development, and thus never discover the need to manually run validation. When his software launches into the wild, and everyone loads their own data, your string class suddenly contains invalid utf8, and now you have the potential for crashes and security flaws...


I admit that it might sound insecure, but there's also the problem that the class is templated, and has different behavior based on the traits provided; it is possible for an encoding to be provided, where all strings and characters are valid, like a default char string. Thus for some string types, the extra validation step that is now required is a complete waste of time, but the interface now requires it. For these all-valid types, the current validation method would essentially be inlined to just returning true for every check, and thus be optimized out.
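A minimal sketch of that point, for illustration only. The names `PlainCharTraits`, `AsciiTraits`, `isValid` and `acceptText` are hypothetical and are not taken from the class being discussed; 7-bit ASCII is used as a stand-in for an encoding that does impose restrictions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical traits for a plain char string: every code unit is acceptable.
struct PlainCharTraits {
    static bool isValid(const char*, std::size_t) { return true; }
};

// Hypothetical traits for 7-bit ASCII: any unit with the high bit set is rejected.
struct AsciiTraits {
    static bool isValid(const char* units, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            if (static_cast<unsigned char>(units[i]) > 0x7F) return false;
        }
        return true;
    }
};

// The string code calls the same hook for every traits class. For
// PlainCharTraits the call is a constant `true`, which an optimizing
// compiler can typically inline and fold away.
template <typename Traits>
bool acceptText(const char* units, std::size_t count) {
    return Traits::isValid(units, count);
}

// Small usage example.
inline void traitsExample() {
    assert(acceptText<PlainCharTraits>("\xFF", 1));   // nothing is ever rejected
    assert(!acceptText<AsciiTraits>("\xFF", 1));      // high bit set, rejected
}
```

Whether the check really disappears depends on the compiler and optimization level; the interface stays the same either way.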

#26 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 21 February 2013 - 07:06 PM

For these all-valid types, the current validation method would essentially be inlined to just returning true for every check, and thus be optimized out.

That's a good reason not to have a separate validation function, agreed.
 
But for every set of traits that do impose restrictions, you can't allow any eventuality wherein the contents of the string have not been successfully validated.




#27 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 07:35 PM

But for every set of traits that do impose restrictions, you can't allow any eventuality wherein the contents of the string have not been successfully validated.


I agree. I was considering some sort of hybrid approach where a validated class could be implicitly constructed as a temporary to be passed immediately for the current behavior, or optionally instantiated explicitly and have its return code checked.

I really don't want to have error codes; there are a little over 65 functions that will need to have an out parameter to then have hundreds of lines of return code checking, when it is likely that no error will occur.

The main problem is, a lot of the functions have meaningful return values, so extra parameters are needed if I do go that route. I guess I thought I had finally gotten away from spending 50% of my effort writing code that simply checks the return code if it is okay to continue, or cleans up otherwise and propagates it upward. The one feature that would have revamped that has unacceptable overhead, it seems.

So, it seems the question was never assert() vs exceptions, because assert() is unacceptable behavior. Do you use return codes for everything?


Edited by Ectara, 21 February 2013 - 07:38 PM.


#28 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 09:08 PM

I also came to the realization that validating strings internally might be difficult; the string class receives code unit arrays and their length in code points, not code units. In the example of UTF-8, if a code unit array that is only three units long is passed, and the lead unit indicates that there are four units, a function that only knows its length in characters might segfault when it attempts to validate the fourth unit. I can't find a way around this without knowing the length of the array in units.
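To make the truncation problem concrete, here is a rough sketch of a structural UTF-8 check that takes the array length in code units and reports the code point count as a result. The function names are hypothetical. It only checks lead units, sequence lengths and continuation units; it deliberately does not reject overlong three- and four-unit forms or encoded surrogates, which a complete validator would also have to do.

```cpp
#include <cassert>
#include <cstddef>

// Length of the sequence announced by a UTF-8 lead unit, or 0 if the unit
// cannot start a sequence (continuation units, 0xC0, 0xC1, 0xF5..0xFF).
inline std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Structural check bounded by the number of code units actually available.
// On success, codePoints holds the number of sequences seen.
inline bool checkUtf8Structure(const char* units, std::size_t unitCount,
                               std::size_t& codePoints) {
    codePoints = 0;
    std::size_t i = 0;
    while (i < unitCount) {
        const std::size_t len =
            utf8SequenceLength(static_cast<unsigned char>(units[i]));
        // A lead unit that promises more units than remain is a truncated
        // sequence; reject it before reading past the end of the array.
        if (len == 0 || len > unitCount - i) return false;
        for (std::size_t k = 1; k < len; ++k) {
            if ((static_cast<unsigned char>(units[i + k]) & 0xC0) != 0x80) return false;
        }
        i += len;
        ++codePoints;
    }
    return true;
}

// Usage: a four-unit lead followed by only two continuation units is rejected.
inline void truncationExample() {
    std::size_t n = 0;
    assert(!checkUtf8Structure("\xF0\x9F\x98", 3, n));
}
```

With only a length in code points, the bound `unitCount - i` is not available, which is exactly the problem described above.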



#29 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 21 February 2013 - 09:14 PM

So, it seems the question was never assert() vs exceptions, because assert() is unacceptable behavior. Do you use return codes for everything?

I personally use exceptions for anything where I can't just log and abort(). They are far from perfect, but they are the best I currently have in C++.

 

Funnily enough, I think Perl may be the only language that gets this somewhat right. It has a bunch of convenient constructs to let you chain errors without excessive syntactic overhead:

my $var = myfunc $0 or return "myfunc failed"



#30 Ectara   Crossbones+   -  Reputation: 2745


Posted 21 February 2013 - 10:16 PM

They are far from perfect, but they are the best I currently have in C++.


I agree; for all their faults, they do seem to be very useful in this regard. However, with the ever-looming possibility that exceptions might be disabled, it seems like they're out for me.


Funnily enough, I think Perl may be the only language that gets this somewhat right.


I wish other languages had simpler error handling routines. I'd settle for exceptions that handled only integer return codes, or something. The RTTI implied in checking which type the exception was can be killer. I've been reading that some embedded implementations perform this by doing string comparisons against the name of the type.

I have no idea how I am supposed to do all of these things without error codes. I can't use concatenation operators on my string class; those can't sensibly return an error code, and they only accept two operands. If I have anything but assertions or exceptions for error handling, it means I am not allowed to use operators or constructors that actually do anything. Is this the state of things, that without exceptions, we go back to C with classes?

#31 Kylotan   Moderators   -  Reputation: 3324


Posted 22 February 2013 - 06:19 AM

In the constructor, it only validates once. However, if it fails, there's a critical error that results in breaking into the debugger. So, in order to prevent this error, one must validate or sanitize before passing the data to the constructor. Or, do people not even make sure it is valid, and let the application abort to let them know?

 

It only breaks into the debugger because you've told it to. That is not usually what an exception thrown from a constructor does. It seems like you want (or wanted?) to use exceptions as assertions, but they're for different problems. Development-time errors can be fixed with static typing and assertions, and run-time errors on user input absolutely have to have conditional checks at some point along the pipeline. Both have their place. Exceptions can be used to implement both, but they're not necessary.

 

If you don't want to throw exceptions from a constructor, I humbly re-submit the idea of using a factory function to generate your strings. Either it validates the data and gives you a legit object or it rejects the data and tells you with a stern error code and a null pointer. This gives you one clear gateway between the unvalidated data and the UTF-8 strings.
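Roughly, such a factory could look like the sketch below. `Utf8String`, `Utf8Status` and `placeholderIsValid` are made-up names for illustration, `std::string` is only a stand-in for the real storage, and the validity check is deliberately incomplete (it rejects only bytes that can never occur in UTF-8).

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

enum Utf8Status { Utf8Ok, Utf8InvalidInput, Utf8OutOfMemory };

// Placeholder check: rejects only bytes that never appear in UTF-8
// (0xC0, 0xC1, 0xF5..0xFF). A real validator checks whole sequences.
inline bool placeholderIsValid(const char* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char u = static_cast<unsigned char>(units[i]);
        if (u == 0xC0 || u == 0xC1 || u >= 0xF5) return false;
    }
    return true;
}

class Utf8String {
    std::string units_;  // stand-in storage

    // Private: raw code units can only enter through create().
    Utf8String(const char* units, std::size_t count) : units_(units, count) {}

public:
    // The single gateway: a validated object, or null plus a status code.
    static Utf8String* create(const char* units, std::size_t count,
                              Utf8Status& status) {
        if (!placeholderIsValid(units, count)) {
            status = Utf8InvalidInput;
            return nullptr;
        }
        Utf8String* s = new (std::nothrow) Utf8String(units, count);
        status = s ? Utf8Ok : Utf8OutOfMemory;
        return s;
    }

    std::size_t unitCount() const { return units_.size(); }
};

// Usage: the caller checks once, at the boundary.
inline void factoryExample() {
    Utf8Status status = Utf8Ok;
    Utf8String* s = Utf8String::create("abc", 3, status);
    assert(s != nullptr && status == Utf8Ok);
    delete s;
}
```

Everything that later accepts a `Utf8String` can rely on the type alone and needs no error code of its own.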

 

I have 42 functions that accept a string of possibly invalid code units, and 24 functions that accept a possibly invalid character.

 

Sounds to me like you've got the abstraction in the wrong place. Why add unvalidated characters? Some languages solve this the brute force way - you can only add strings to strings, and characters are 1-length strings. So you wouldn't have all these functions that accept unvalidated data - you'd have them require the UTF-8 string and the caller bears the responsibility of passing in the correct type, just as it should for all the other types you pass into a function. You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also. Validate the data early, in creating the string, and then you don't need to worry when it comes to performing future operations on them.

 

The idea is that the Caller passes the correct types in, and the Callee returns the correct types out. This should be done to whatever degree your programming language allows. To the extent where the language can't enforce it or makes it tricky (eg. the function allows values from 1 to 100, but creating a type to enforce that is a hassle), you can check explicitly and consider an exception, an assertion, or an error code. But in cases where you can guarantee the correct data - ie. you have a type that enforces that constraint - then that type is what you should be passing in.



#32 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 09:21 AM

That is not usually what an exception thrown from a constructor does.


I'm aware, but now I can't use exceptions, because of the possibility of them being disabled, which would result in termination regardless of whether an exception or an assertion was used.

 

I humbly re-submit the idea of using a factory function to generate your strings.

It looks like that's how it is going to be, just like in C. Though, I don't want to return a pointer to a dynamically allocated string, because I'd rather I be able to automatically manage its lifetime. This means that an invalid string might be wandering around if it fails initialization, or it might be double constructed if I tried to have a default, but valid, state. Is this the only way, return a newly allocated pointer?

Sounds to me like you've got the abstraction in the wrong place. Why add unvalidated characters?


I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well. I suppose I could ditch all of that, and accept only instances of the string class and instances of a unique validated character class.

 

It seems like an unnecessary operation to create temporary class instances for a string type that doesn't require validation. Perhaps some sort of template meta-programming that can check the traits class to see if validation is required, and then not enable the insecure functions.
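One way that idea could be sketched in C++11, assuming the traits class exposes a compile-time flag. The names `needsValidation`, `BasicText` and `HasRawAppend` are hypothetical, and `std::string` stands in for the real storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical traits: the flag says whether raw input could be invalid.
struct PlainTraits { static const bool needsValidation = false; };
struct Utf8Traits  { static const bool needsValidation = true;  };

template <typename Traits>
class BasicText {
    std::string units_;  // stand-in storage

public:
    // Raw append takes part in overload resolution only when the traits
    // say that every input is valid.
    template <typename T = Traits>
    typename std::enable_if<!T::needsValidation, void>::type
    append(const char* units, std::size_t count) {
        units_.append(units, count);
    }

    // Appending an existing (already valid) string is always available.
    void append(const BasicText& other) { units_ += other.units_; }

    std::size_t unitCount() const { return units_.size(); }
};

// Compile-time probe: is append(const char*, size_t) callable on S?
template <typename S, typename = void>
struct HasRawAppend : std::false_type {};

template <typename S>
struct HasRawAppend<S, decltype(std::declval<S&>().append(
                           static_cast<const char*>(nullptr), std::size_t(0)))>
    : std::true_type {};

static_assert(HasRawAppend<BasicText<PlainTraits>>::value,
              "plain strings keep the raw overload");
static_assert(!HasRawAppend<BasicText<Utf8Traits>>::value,
              "validated strings do not expose it");
```

Calling `append("hi", 2)` on a `BasicText<Utf8Traits>` then fails to compile, so code written against the all-valid interface cannot silently bypass validation when the traits change.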

 

You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also.


If the resulting length is greater than the maximum length allowed, or something like that, then the caller must know that the string is impossible to access in entirety, or something like that. In this case, the resulting string would be illegitimate, because it violates the maximum length.

But in cases where you can guarantee the correct data - ie. you have a type that enforces that constraint - then that type is what you should be passing in.

I suppose you're right. I'm extremely reluctant to go back to how I always was, with checking return codes everywhere. I suppose I could make more of them simply fatal errors, like triggering assertions on receiving invalid parameters that violate the contract the function requires.

These changes also mean that I need to completely redesign the class, when I was so close to finishing it. This has me entirely frustrated.
 



#33 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 22 February 2013 - 09:32 AM

I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well. I suppose I could ditch all of that, and accept only instances of the string class and instances of a unique validated character class.

You are designing something for a fundamentally different purpose to std::basic_string, though. In general, std::basic_string doesn't give a damn whether its contents are valid, and for the most part, isn't even aware of the possibility of invalid strings.




#34 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 11:51 AM

You are designing something for a fundamentally different purpose to std::basic_string, though. In general, std::basic_string doesn't give a damn whether its contents are valid, and for the most part, isn't even aware of the possibility of invalid strings.


Yeah, I've always been aware of that. I was originally planning on ignoring or replacing invalid characters as it encountered them, but it now seems like that'd be doing too much; it shouldn't be responsible for the data being valid.

#35 Kylotan   Moderators   -  Reputation: 3324


Posted 22 February 2013 - 02:42 PM

I humbly re-submit the idea of using a factory function to generate your strings.

It looks like that's how it is going to be, just like in C. Though, I don't want to return a pointer to a dynamically allocated string, because I'd rather I be able to automatically manage its lifetime. This means that an invalid string might be wandering around if it fails initialization, or it might be double constructed if I tried to have a default, but valid, state. Is this the only way, return a newly allocated pointer?

 
You can automatically manage its lifetime if you store the pointer in a smart pointer wrapper.

Alternatively you could just include a 'bad' or 'fail' flag (a bit like iostreams do), indicating that the object is not in a useful state. All member functions do nothing if the flag is set, and in debug mode they can assert if you like. This is reasonable if you're acting on data supplied by the programmer. If you're acting on data supplied by the user, then you might consider validating before construction, and you can expose the validation routine as a static function to permit that.
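A rough sketch of the flag approach, with made-up names (`FlaggedText`, `placeholderIsValid`) and a deliberately incomplete validity check. Treating a failed operand as contagious is an assumption of this sketch, not something the suggestion requires.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Placeholder check: rejects only bytes that never appear in UTF-8.
inline bool placeholderIsValid(const char* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char u = static_cast<unsigned char>(units[i]);
        if (u == 0xC0 || u == 0xC1 || u >= 0xF5) return false;
    }
    return true;
}

class FlaggedText {
    std::string units_;  // stand-in storage
    bool failed_;        // set once, in the spirit of an iostream fail bit

public:
    FlaggedText(const char* units, std::size_t count)
        : units_(), failed_(!placeholderIsValid(units, count)) {
        if (!failed_) units_.assign(units, count);
    }

    bool fail() const { return failed_; }
    std::size_t unitCount() const { return units_.size(); }

    void append(const FlaggedText& other) {
        // A debug build could assert(!failed_) here instead of returning.
        if (failed_) return;      // failed objects ignore operations
        if (other.failed_) {      // assumption: failure spreads to the result
            failed_ = true;
            return;
        }
        units_ += other.units_;
    }
};
```

The caller can run a whole chain of operations and test `fail()` once at the end, much like checking a stream after several reads.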
 

I initially designed it after the std::basic_string class; it was mature, very commonly used, and provided features that were incredibly useful in a generic manner. As a result, it accepts pointers to character arrays, as well.

Right, but basic_string is just a list of char anyway, with no encoding information. A pointer to a character array is just copying the data, which is guaranteed to be valid. You're thinking of a char* as "a string" but that's a bad way to look at it. It's a pointer to several instances of char - they are valid for a string with no encoding, but not valid for your string.

 

It seems like an unnecessary operation to create temporary class instances for a string type that doesn't require validation.

 

I don't see why you need to mix string types that don't need validation and string types that do. I think this is the downfall of many developers, often from English-speaking countries, who think of char* and std::string and UTF-8 all as text that should be easily interchangeable. Really you have to think of char* as bytes, UTF-8 as text, and std::string as a ham-fisted compromise between the two which isn't really useful for real world internationalised text.

 

 

You also mention needing to check a return code from a concatenation operation - why? If both operands are legitimate strings, the result will be legitimate also.

If the resulting length is greater than the maximum length allowed, or something like that, then the caller must know that the string is impossible to access in entirety, or something like that. In this case, the resulting string would be illegitimate, because it violates the maximum length.

 

I humbly suggest that you make the maximum size whatever you have room for in memory. If you try to add together more than 4GB of text, you probably have problems beyond Unicode issues.

Note that this problem you have raised is not unique to your string type. Basic types often overflow and/or raise exceptions because there's no good way to implement error codes with infix notation. Usually best just to accept the risk.

 

I'm extremely reluctant to go back to how I always was, with checking return codes everywhere.

 

That's why I think the error codes should be concentrated in one place, ie. the creation of the string. Then everything else can be enforced by type-checking. That's pretty much how Python and C# do it - you have a routine to get bytes into a Unicode string and vice versa, and nothing else needs to consider the chance of encoding errors.



#36 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 07:19 PM

You can automatically manage its lifetime if you store the pointer in a smart pointer wrapper.


That strikes me as automatically manually managing its lifetime; I suppose I could simply provide an overload: one that takes a reference to a string, and one that returns a pointer to a string.

 

If you're acting on data supplied by the user, then you might consider validating before construction, and you can expose the validation routine as a static function to permit that.


I agree; I would only have an empty string constructor, and a copy constructor, aside from the static member function for character array initialization.

I don't see why you need to mix string types that don't need validation and string types that do.

So that code that uses it can use the same interface without needing to know the difference. That was the whole reason.
 

I think this is the downfall of many developers, often from English-speaking countries, who think of char* and std::string and UTF-8 all as text that should be easily interchangeable.

I don't. I really don't. Somehow, text has to get into a string. If I read UTF-8 from a file, it goes into an array of code units before it goes into the string class. So, I need to interact with it there. The other functions are for convenience.

The class has a character type, and a storage type. For ASCII, both are char. For UTF-8, the character type is an int, and the storage type is char. For UTF-16, the character type is an int, and the storage type is a short integer. I don't use string literals for UTF-8 text, nor do I use an std::string. I don't think of them as interchangeable; the storage type just coincidentally is a char array. This is placed for convenience, so someone doesn't have to create a string object, allocate a new internal storage array, copy the data, do the operation, free the data, then destroy the object. The fact that the array of storage units happens to be represented by a pointer to char is a pure coincidence.
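As an illustration of the character type / storage type split described above, traits along these lines would match that description. The struct names and the `encode` helper are hypothetical; the type choices follow the paragraph, and the sketch assumes `int` is wide enough to hold any Unicode code point.

```cpp
#include <cassert>
#include <cstddef>

struct AsciiTraits {
    typedef char charType;     // one character...
    typedef char storageType;  // ...is one storage unit
    static std::size_t encode(charType c, storageType* out) {
        out[0] = c;
        return 1;
    }
};

struct Utf8Traits {
    typedef int  charType;     // a whole code point
    typedef char storageType;  // stored as 1 to 4 eight-bit units
    // Assumes c is a valid Unicode scalar value.
    static std::size_t encode(charType c, storageType* out) {
        if (c < 0x80) {
            out[0] = static_cast<char>(c);
            return 1;
        }
        if (c < 0x800) {
            out[0] = static_cast<char>(0xC0 | (c >> 6));
            out[1] = static_cast<char>(0x80 | (c & 0x3F));
            return 2;
        }
        if (c < 0x10000) {
            out[0] = static_cast<char>(0xE0 | (c >> 12));
            out[1] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out[2] = static_cast<char>(0x80 | (c & 0x3F));
            return 3;
        }
        out[0] = static_cast<char>(0xF0 | (c >> 18));
        out[1] = static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (c & 0x3F));
        return 4;
    }
};

struct Utf16Traits {
    typedef int   charType;     // a whole code point
    typedef short storageType;  // stored as 1 or 2 sixteen-bit units
    // Assumes c is a valid Unicode scalar value.
    static std::size_t encode(charType c, storageType* out) {
        if (c < 0x10000) {
            out[0] = static_cast<short>(c);
            return 1;
        }
        const int v = c - 0x10000;  // split into a surrogate pair
        out[0] = static_cast<short>(0xD800 + (v >> 10));
        out[1] = static_cast<short>(0xDC00 + (v & 0x3FF));
        return 2;
    }
};
```

A pointer to `Utf8Traits::storageType` happens to be a `char*`, but nothing in the interface treats it as a C string.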

I humbly suggest that you make the maximum size whatever you have room for in memory. If you try to add together more than 4GB of text, you probably have problems beyond Unicode issues.

Note that this problem you have raised is not unique to your string type. Basic types often overflow and/or raise exceptions because there's no good way to implement error codes with infix notation. Usually best just to accept the risk.


The max is the maximum number countable, so in other words, the most that can fit in the size type while allowing one that corresponds to no valid index.
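One possible reading of that limit, sketched with hypothetical names (`sizeType`, `noIndex`, `maxLength`, `concatFits`): the largest value of the size type is reserved as the "no valid index" marker, and the maximum length sits just below it so that even a past-the-end position stays distinguishable. The exact choice in the class may differ.

```cpp
#include <cassert>
#include <cstddef>

typedef std::size_t sizeType;

// Reserved value meaning "no such index", in the spirit of npos.
const sizeType noIndex = static_cast<sizeType>(-1);

// Assumption of this sketch: lengths stop one short of the reserved value.
const sizeType maxLength = noIndex - 1;

// Overflow-safe test of whether two lengths can be concatenated.
// Written as a subtraction so the sum itself is never computed.
inline bool concatFits(sizeType lhs, sizeType rhs) {
    return lhs <= maxLength && rhs <= maxLength - lhs;
}
```

A concatenation routine could run this check first and treat a `false` result as a contract violation, rather than threading an error code through an operator.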

That's why I think the error codes should be concentrated in one place, ie. the creation of the string.


I agree. I can try my hardest to assure that no errors can occur that aren't show-stoppers, and try to handle them up front, I guess.

#37 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 08:57 PM

Here is my first implementation of a validated character type:
 

class ValidatedCharacter{
        charType c_;
        bool valid_;
        
        // Used by BasicString to mark characters taken from an already validated string.
        ValidatedCharacter(charType c, bool validity) : c_(c), valid_(validity) { }
        
public:
        ValidatedCharacter(void) : c_(0), valid_(false) { }
        
        // Intentionally not explicit, so a raw character converts (and is checked) on the way in.
        ValidatedCharacter(charType c) : c_(c), valid_(traitsType::isCharValid(c)) { }
        
        ValidatedCharacter(const ValidatedCharacter & other) : c_(other.c_), valid_(other.valid_) { }
                        
        // const, so validity can also be queried on const objects.
        bool isValid(void) const{
                return valid_;
        }
        
        operator charType(void) const{
                _E_ASSERT(valid_);
                
                return c_;
        }
        
        friend BasicString;
};

This will offer the same interface as before, where passing a character will automatically implicitly construct a validated character class and check it on the way in. You can also manually instantiate ValidatedCharacter and check if it is valid before using it and have the option of reusing the result. If someone tries to read the character while it is marked as invalid, an assertion is triggered. Additionally, the string class has the ability to mark the character as valid unconditionally, when it returns a character from within the already validated string. Any tips on that, in the meantime? It seems like it works out nicely; since there's only one parameter to the visible constructors, an opaque class type will function in an ideal fashion by not requiring an explicit instantiation if it isn't necessary.


Edited by Ectara, 23 February 2013 - 12:59 PM.


#38 swiftcoder   Senior Moderators   -  Reputation: 9540


Posted 22 February 2013 - 09:38 PM

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?



#39 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 10:11 PM

I don't understand the motivation behind dealing with validated characters individually, versus validated strings?

To find a single character within a string incurs a relatively large amount of overhead and dynamic allocation if you first convert the single character to a string, then do a more expensive string search/comparison.

Just about any operation is faster in this class if you are working with only a single character, and it uses a lot less memory. Operations on single characters are very frequent with what I do. It really is worth it, in terms of measurable performance and dynamic allocation efficiency.
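To show the shape of a single-character operation without any temporary string, here is a small self-contained sketch. `CheckedChar` and `countChar` are hypothetical names, the 7-bit ASCII rule is only a stand-in for a real per-encoding check, and `std::string` stands in for the string class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

class CheckedChar {
    int c_;
    bool valid_;

public:
    // Not explicit on purpose: a raw character converts, and is checked,
    // at the call site. The ASCII range test is a placeholder rule.
    CheckedChar(int c) : c_(c), valid_(c >= 0 && c <= 0x7F) {}

    bool isValid() const { return valid_; }

    // Reading an invalid character is a programming error.
    int value() const {
        assert(valid_);
        return c_;
    }
};

// Counts occurrences of one character. No string is built for the needle
// and nothing is allocated.
inline std::size_t countChar(const std::string& text, CheckedChar ch) {
    const int wanted = ch.value();
    std::size_t n = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (static_cast<unsigned char>(text[i]) == wanted) ++n;
    }
    return n;
}
```

A call such as `countChar(text, 'a')` validates implicitly, while a `CheckedChar` built once can be tested with `isValid()` and then reused across many calls.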



#40 Ectara   Crossbones+   -  Reputation: 2745


Posted 22 February 2013 - 11:21 PM

I have removed all publicly accessible functions that deal with using unvalidated text, and provided two main mechanisms for creating strings: a factory that returns a pointer to a dynamically allocated string instance or null on failure and a factory that takes a reference to a string and returns a bool indicating the result. I like having the choice of how I allocate a class instance, and being able to reuse already allocated instances. Additionally, there is a constructor that accepts text in a similar fashion (internally calling the factory that takes a reference), and triggers an assertion on failure, for strings created from internal text that is known to be valid.
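The reference-taking factory and the asserting constructor could be laid out roughly as below. `Text`, `create` and `placeholderIsValid` are illustrative names only, the validity check is deliberately incomplete, and `std::string` stands in for the real storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Placeholder check: rejects only bytes that never appear in UTF-8.
inline bool placeholderIsValid(const char* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned char u = static_cast<unsigned char>(units[i]);
        if (u == 0xC0 || u == 0xC1 || u >= 0xF5) return false;
    }
    return true;
}

class Text {
    std::string units_;  // stand-in storage

public:
    Text() {}

    // For text that originates inside the program and is expected to be
    // valid: a failure here is a bug, so it asserts instead of reporting.
    Text(const char* units, std::size_t count) {
        const bool ok = create(*this, units, count);
        assert(ok);
        (void)ok;  // silences the unused-variable warning under NDEBUG
    }

    // For external text: fills an existing instance and reports the result.
    // On failure the target is left unchanged.
    static bool create(Text& out, const char* units, std::size_t count) {
        if (!placeholderIsValid(units, count)) return false;
        out.units_.assign(units, count);
        return true;
    }

    std::size_t unitCount() const { return units_.size(); }
};
```

Because the target object already exists, the same instance can be reused for every line read from a file, and the caller decides where it lives.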

And characters, as above, can be passed and implicitly validated, or explicitly instantiated and checked/reused. As soon as the character is read while invalidated, it triggers the assertion, so if a character originates from within the code, and it is found to be invalid, it fails as soon as it is used.

Again, I want to stress, the stuff that triggers an assertion on invalid text is for internal use only, that is absolutely not expected to be invalid; if I am manipulating invalid text where I shouldn't, I want it to abort as soon as possible.

If anyone can imagine a better way to do this, let me know. I'm starting to feel better about this, after all of the changes made.

 

On a side note, I have to say that I'm extremely thankful for the fact that I developed this class using TDD strategies, so after I removed unusable tests for functionality that no longer exists, the existing tests caught just about all of the immediate bugs that resulted from the rewrite, in addition to carefully placed assertions.

 

EDIT:

Also, in the factory method, should I use the string's allocator to allocate its own instance? It seems like it would make it hard to free it, though if someone is using a custom allocator, they'd be likely to call the destructor manually in some way, then free the memory themselves in some manner, so it would be possible to use the string's allocator to allocate/free the string itself. The question is, does it make sense, and should this behavior be expected.


Edited by Ectara, 23 February 2013 - 12:26 AM.




