Ectara

UTF-8 String Validation Responsibility

Ectara    3097

I'm implementing a UTF-8 string class, and I'm unsure where the responsibility for validating the string's contents should lie. My options seem to be these:

1. Make it an invariant that the contained string must be valid. If code attempts to insert invalid characters, ignore them or replace them with the replacement character (U+FFFD).

 

2. Contain possibly invalid code units, and provide sanitized output to code that tries to read the string. This seems like a bad combination of the first, and the next.

 

3. Contain raw code units, and place the responsibility on anything that uses the string to check that the string is valid before using it. It can then filter out the invalid characters, or replace them.

 

The first would be very simple; validation upon inserting new characters would simplify the logic of handling the characters inside.
The second would be simple for entering code units, as it will go unchecked, but every time the string is used, the logic would be complicated, and error prone.

The third would be like what I see in a lot of places that use UTF-8 encoding, but it would be error-prone and things would be more complicated, like calculating the length of strings in code units, and advancing a certain number of code points ahead.

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.
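
For concreteness, the check that option 1 would run on every insertion might look like the following. This is a minimal sketch, not code from the thread: `is_valid_utf8` is a hypothetical name, and it assumes strict rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>

// Hypothetical helper: returns true if [s, s + n) is well-formed UTF-8.
// Assumes strict rules: overlong forms, surrogates (U+D800..U+DFFF) and
// values above U+10FFFF are all rejected.
bool is_valid_utf8(const unsigned char *s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        std::size_t len = 0;      // total bytes in this sequence
        unsigned long cp = 0;     // decoded code point
        unsigned long lowest = 0; // smallest code point allowed for this length
        if (b < 0x80) { ++i; continue; }  // ASCII byte
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; lowest = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is cut off by the end of input
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

Under option 1 a check like this runs once, at the point of insertion; options 2 and 3 push the same work onto every reader of the string.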

swiftcoder    18437

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.

Ectara    3097

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.

I keep referring back to the argument that if someone wants unsanitized storage, they can just use a vector of code units; I guess what holds me back from following this is that the string functions are made to handle strings, so vectors can't be passed, aside from passing a pointer to the internal storage. But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.

swiftcoder    18437

But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.

To my mind, the principle you want to follow here is fail fast, fail early.

 

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...

Ectara    3097

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...


I wholly agree with failing at where the problem lies (I used to be a fan of safely undoing what I had done so far, reporting failure, and then letting the caller decide what to do about it), though I guess I have concerns about using exceptions in general. I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions. I was considering having a macro definition that decides whether it would ignore invalid characters or replace them with the replacement character.

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

swiftcoder    18437

I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions.

Are you intending this library for use on a platform that does not support exceptions?

 

I don't suppose it terribly matters what mechanism you use to inform the programmer of an error condition, so long as:

  • The error handling mechanism is consistent throughout your API.
  • The programmer has the option to handle the error.
  • The error causes the program/debugger to stop at the given location unless handled.

To my mind, exceptions are the easiest way to achieve those ends - though continuations and callback functors are both viable given the right supporting toolset.

 

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

You definitely need some way to inform the programmer of the error.

 

If you don't use exceptions, then you have a whole raft of thorny issues to consider:

  • How do you signal failure from a constructor?
  • How do you signal failure from an append operator?
  • How do you make sure that the programmer is checking the error state?

Ectara    3097

How do you signal failure from a constructor?
How do you signal failure from an append operator?
How do you make sure that the programmer is checking the error state?


Coming from C, I'm still very used to returning error codes; however, I haven't written a function in C++ that uses them yet. I've marked places where an error should be handled, but I've yet to decide how I will implement it.

Exceptions seem extremely ideal, but I want to maintain cross-platform ability for games that I write to at least the platforms I have on hand; not all of them support standard exceptions, and some take a heavy performance hit. I'm thinking some kind of assertion system that aborts on serious error. Not sure yet; that's one reason that I've left error handling placeholders.

Ectara    3097

Additionally, there is a new problem of how to handle creating a string from unsanitized code units: do I sanitize the input, through omission or substitution, so that the invariant holds? If someone passes in an array of five UTF-8 character sequences, of which four are valid and one is an invalid sequence of bytes, the resulting length can change based on how I handle it, and the contents can change drastically based on where the invalid bytes are within the string. I see people throwing an exception when bad text is inserted; this seems rather harsh, as it means the text must be sanitized before it ever reaches the string.

Edited by Ectara

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.

What kind of functionality are you putting together for UTF-8 handling? Are some operations performance critical? How large amounts of strings might have to be processed?

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

Ectara    3097

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.


Having everything be valid internally seems to be the best way to go. It seems that I should treat invalid text being assigned to the string as a fatal error, and provide a separate function that takes input iterators and an output iterator and writes sanitized output to the destination, to allow easy sanitization if the string itself will not sanitize data.
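
A sanitizing function along those lines could be sketched as below. The name `sanitize_utf8` and the replacement policy (one U+FFFD per rejected byte, then resynchronize on the next byte) are assumptions made for illustration. The sketch also asks for forward iterators rather than pure input iterators, because it looks ahead before committing to a sequence.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: copies [first, last) to out, replacing every byte that
// does not begin a well-formed UTF-8 sequence with U+FFFD (bytes EF BF BD).
// Assumed policy: one replacement character per rejected byte.
template <class ForwardIt, class OutputIt>
OutputIt sanitize_utf8(ForwardIt first, ForwardIt last, OutputIt out) {
    while (first != last) {
        unsigned char b = static_cast<unsigned char>(*first);
        std::size_t len = 0;               // 0 means "not a valid lead byte"
        unsigned long cp = 0, lowest = 0;  // decoded value and overlong limit
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; lowest = 0x10000; }

        // Look ahead at the continuation bytes without consuming them yet.
        ForwardIt next = first;
        bool ok = (len != 0);
        if (ok) ++next;
        for (std::size_t k = 1; ok && k < len; ++k, ++next) {
            if (next == last) { ok = false; break; }  // truncated sequence
            unsigned char c = static_cast<unsigned char>(*next);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        ok = ok && cp >= lowest && !(cp >= 0xD800 && cp <= 0xDFFF) && cp <= 0x10FFFF;

        if (ok) {
            while (first != next) *out++ = *first++;  // copy the sequence verbatim
        } else {
            *out++ = '\xEF'; *out++ = '\xBF'; *out++ = '\xBD';
            ++first;  // drop one byte, then try again from the next one
        }
    }
    return out;
}
```

Writing through an output iterator (for example `std::back_inserter` on a buffer) keeps the sanitizing step outside the string class, so the string itself only ever has to validate.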

The immediate concern is that I need to make a temporary copy of the sanitized data before I can pass it to the string. It can be done in blocks, and appended to the string, for a similar cost to resizing one giant vector. If I can add raw code units to the string's storage, it'd seem to be a method of injecting invalid characters into the string.

 

What kind of functionality are you putting together for UTF-8 handling?
Are some operations performance critical? How large amounts of strings
might have to be processed?

It has all of the features of an std::basic_string, plus more. Many operations are performance critical; while it won't be used as much, UTF-8 being slower to use than a fixed-width encoding by definition puts it at a disadvantage. The largest amount of strings will be difficult to predict. A game that is more text-heavy might use it all of the time, especially if it is in a HUD. UTF-8 is the transformation format of choice in my scripting language, so it will be used there a lot. I also will use it for configuration files, and localization data. Just about everywhere where text will be printed to the screen, or read from a file intended to be edited. The strings can grow quite large, too, when handling an entire script file, or if one reads an entire configuration file. It would be in my best interest to streamline the performance by removing all things that would check for validity from the string handling, and make sure that it is valid upon entry.

It seems that I could sanitize upon assigning to a string, but it seems like that'd be unwise, and that it should be a separate function.

 

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

Nope. I have read up on them, though.

Kylotan    9994

To me, the answer is clear - you validate the data as it is added to the class. There's no point having a UTF-8 class that doesn't ensure that it is always valid UTF-8 all of the time. It should validate on the way in, not require that you have pre-validated data. What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing. At the very least, your validation and string creation function should be a static member of your UTF-8 class, ie. a factory that takes the data and returns the valid UTF-8 string.

 

I don't think performance will be an issue. Languages like C# and Python 3 use Unicode strings everywhere and the problem doesn't really come up. The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.
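
As a rough illustration of why the length in characters is cheap to compute and cache: in valid UTF-8, every code point contributes exactly one byte that is not a continuation byte (`10xxxxxx`). A minimal sketch, assuming the input has already been validated; `count_code_points` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is ALREADY KNOWN
// to be valid UTF-8. On invalid input the result is meaningless.
std::size_t count_code_points(const std::string &utf8) {
    std::size_t count = 0;
    for (unsigned char b : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((b & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

A string class could compute this once when the contents change and store the result, so later length queries never rescan the bytes.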

Ectara    3097

It should validate on the way in, not require that you have pre-validated data.


Well, if it is going to throw some sort of exception and abort on finding that the text isn't valid, then in places where the text isn't always expected to be valid, the text needs to be sanitized first, and thus made valid. Then the string class asserts its validity.

Are you saying that the string should do the sanitizing, or the validation?

 

What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing.


I agree with this, though my mentality is if they are interacting with the string class in a generic manner (the string class is a templated class), then the data should already be valid; when the data is being created, then one must deal with specifics.

 

At the very least, your validation and string creation function should be a static member of your UTF-8 class, ie. a factory that takes the data and returns the valid UTF-8 string.


This one troubles me. Having a constructor that can do the work seems like a very powerful asset to me, and replacing it with a static string factory would be a large paradigm shift in how the string will be used; having a static function for creation, then having a destructor do cleanup seems like an anti-pattern to me. Surely this isn't the only way of doing things.

 

The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.


I agree. Hence, why I went with UTF-8 over UTF-32.

Kylotan    9994

I'm not entirely sure what your distinction between sanitizing and validation is. To sanitize an input essentially requires validating it, so I'm using the terms interchangeably. Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error. If you need to transform the data, that is a different concept, and requires a different approach.

 

You have 2 possible situations really:

  1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
  2. You have text data in a different encoding that you need to have in UTF-8 form.

ASCII data can obviously be handled either way, providing you're 100% confident it's ASCII. If you're not confident, you have to explicitly choose one of the above, because there's no single correct way to transform multiple arbitrary encodings and get correct information out. You can't 'clean' a non-UTF-8 string to make it UTF-8 because there's no information telling you exactly what codepoints the invalid characters should be.

 

Method 1 can check that the bytes are valid UTF-8, and fail if not. It wouldn't take any action to change invalid data, hence it would be wrong to say it was sanitizing it.

Method 2 would need to decode the characters to their Unicode code points, then re-encode as UTF-8. Again, it's not really sanitizing it, just converting it.

 

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.
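
One way the factory variant could look without exceptions is sketched below. It assumes C++17 for `std::optional`; `Utf8String`, `from_bytes`, and `well_formed` are hypothetical names, and the private constructor is what keeps unvalidated bytes out.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

class Utf8String {
public:
    // Hypothetical factory: the single step that either yields a valid
    // object or reports (via std::nullopt) that the bytes were not UTF-8.
    static std::optional<Utf8String> from_bytes(std::string bytes) {
        if (!well_formed(bytes)) return std::nullopt;
        return Utf8String(std::move(bytes));
    }

    const std::string &bytes() const { return bytes_; }

private:
    // Private, so the only way in is through from_bytes().
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {}

    // Strict check: rejects overlong forms, surrogates, and values > U+10FFFF.
    static bool well_formed(const std::string &s) {
        std::size_t i = 0, n = s.size();
        while (i < n) {
            unsigned char b = static_cast<unsigned char>(s[i]);
            std::size_t len = 0;
            unsigned long cp = 0, lowest = 0;
            if (b < 0x80) { ++i; continue; }
            else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; lowest = 0x80; }
            else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; lowest = 0x800; }
            else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; lowest = 0x10000; }
            else return false;               // invalid lead byte
            if (n - i < len) return false;   // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                unsigned char c = static_cast<unsigned char>(s[i + k]);
                if ((c & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (c & 0x3F);
            }
            if (cp < lowest || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
                return false;
            i += len;
        }
        return true;
    }

    std::string bytes_;  // invariant: always well-formed UTF-8
};
```

The caller has to unwrap the result before it can touch the string, which makes the error hard to ignore even though nothing is thrown.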
 

Ectara    3097

I'm not entirely sure what your distinction between sanitizing and validation is.


I interpret validation as checking the data's validity, and marking an error condition when the data isn't valid. On the other hand, I interpret sanitization as removing or replacing invalid data, resulting in data that will pass validation.

 

Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error.


This is something I want to avoid, so I'm leaning toward the string not sanitizing on entry, but just validating.

1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
2. You have text data in a different encoding that you need to have in UTF-8 form.


The first is most likely, and what I am handling. The second will be handled elsewhere; I can easily use character iterators to convert UTF-16 to UTF-8 and vice-versa. Other encodings will require a custom function to get the character, convert it to its equivalent encoding, then put it in the destination string.

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.


That makes a lot more sense, and I agree; if the constructor can't use exceptions, then it can't be used when an exceptional condition arises without violating RAII: the caller would have to take extra steps to ensure everything went fine, and add an extra cleanup path for every instance of the code in case it failed.

I'm having some concerns about exceptions that I'll raise in another thread.

Ectara    3097

Another question on the string validation responsibility: if I am validating on the way in, then to avoid an error, the caller needs to ensure that the string is valid. This effectively requires that a string be validated twice: once by the caller to check for errors, and again when the text is actually inserted. Is this the right way to do it?

Ectara    3097

In the constructor, it only validates once. However, if it fails, there's a critical error that results in breaking into the debugger. So, in order to prevent this error, one must validate or sanitize before passing the data to the constructor. Or, do people not even make sure it is valid, and let the application abort to let them know?

swiftcoder    18437

Or, do people not even make sure it is valid, and let the application abort to let them know?

Abort is the wrong way to go here - validation errors are not necessarily fatal, and it is up to the caller to make that determination.

 

In these situations you need to give the user some way to check the results. Either a by-reference error code, a factory function that returns a status, or a thrown exception.

 

I also would not recommend a separate validation pass, unless validation is very cheap, and conversion very expensive. If validation has roughly the same complexity as conversion (and I assume that both are O(N), or thereabouts), you don't want to incur two passes over the data...

Edited by swiftcoder

Ectara    3097

In these situations you need to give the user some way to check the results. Either a by-reference error code, a factory function that returns a status, or a thrown exception.


...I suppose now that I've gone the route of avoiding exceptions, there is no way to return an error code from the constructor without an out parameter, and EVERY function that ever accepts external text must now have an out parameter that returns a status code.

I also would not recommend a separate validation pass, unless validation is very cheap, and conversion very expensive.


Then how do I establish the invariant that the text is now valid, if I don't check? Which validation is superfluous? If you are referring to what the caller would do to make sure the data is valid, they could validate or sanitize the data; one reports an error if it is invalid, the other would provide a valid string, even if the input is invalid.

I kind of feel like this is all going backwards on what was said previously.

swiftcoder    18437
I would make validation an external operation, which produces an opaque type, and construct your string from the opaque type.
 
Something like:
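#include <cstddef>
#include <cstdlib>
#include <cstring>

class StringUtf8;

// Opaque result of validation; only Validate() and StringUtf8 can see inside.
class ValidatedUtf8
{
	const char *data = nullptr;
	std::size_t length = 0;

	friend class StringUtf8;
	friend ValidatedUtf8 Validate(const char *input);

public:
	// Valid if it is empty, or if it points at data that passed the check.
	bool Valid() const {return length == 0 || data != nullptr;}
};

ValidatedUtf8 Validate(const char *input) {
	ValidatedUtf8 result;
	// ... check input and fill in result
	return result;
}

class StringUtf8
{
	char *data;
	std::size_t length;
public:
	StringUtf8(const ValidatedUtf8 &input) : data(nullptr), length(input.length) {
		if (!input.Valid())
			std::abort();	// fail fast: the caller had the chance to check Valid()
		data = new char[length + 1];
		if (length != 0)
			std::memcpy(data, input.data, length);
		data[length] = '\0';
	}
	StringUtf8(const StringUtf8 &) = delete;	// copying left out for brevity
	StringUtf8 &operator=(const StringUtf8 &) = delete;
	~StringUtf8() {delete[] data;}
};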
This has several advantages:
  • All inputs to your system will have been validated.
  • The programmer can check the results of validation, if they choose to.
  • You can fail fast on invalid data, since the programmer had the opportunity to check it.
  • The cost of validation is only paid once (validation can be expensive for long strings).
Edited by swiftcoder

Ectara    3097


Something like:

This seems superfluous, too. If my string class already has an invariant that the contents must be valid, why would I have a second class that also is required to contain valid text, or nothing at all? This seems like an overly complicated design that splits existing functionality into two classes.

 

All inputs to your system will have been validated.


I have 42 functions that accept a string of possibly invalid code units, and 24 functions that accept a possibly invalid character. For a simple for loop that appends a character to the end of the string, I now need a large loop where a factory makes an opaque data type containing information on whether or not the data is valid, then passes that to the string. It seems like everywhere that uses a character array requires several lines of boilerplate code (constructing a string from "hello, world" now requires a temporary class instance, and several more lines to prove to the string class that the text is valid).

The programmer can check the results of validation, if they choose to.


I have a static member function that uses the character traits to check if a character array is valid.

 

You can fail fast on invalid data, since the programmer had the opportunity to check it.


You can check at the same points without this extra class.

 

The cost of validation is only paid once (validation can be expensive for long strings).

This is also true, but when compiling in release mode, the assertions disappear, and thus the inner validation is not done; you'd then be paying for it only once.

Surely there must be a better solution than doubling the amount of effort required to use the class, even when you absolutely know that the text about to be entered is valid (because you just created it programmatically, or it is a predefined string).

swiftcoder    18437

This is also true, but when compiling in release mode, the assertions disappear, and thus the inner validation is not done; you'd then be paying for it only once.

It's not actually feasible to do this, in the real world.

The programmer may pass only valid strings during development, and thus never discover the need to manually run validation. When his software launches into the wild, and everyone loads their own data, your string class suddenly contains invalid utf8, and now you have the potential for crashes and security flaws...

I admit that my solution isn't the most elegant, but if you don't want to use exceptions, out parameters, or factory functions, I can't think of a markedly cleaner way.

Ectara    3097

The programmer may pass only valid strings during development, and thus never discover the need to manually run validation. When his software launches into the wild, and everyone loads their own data, your string class suddenly contains invalid utf8, and now you have the potential for crashes and security flaws...


I admit that it might sound insecure, but there's also the problem that the class is templated, and has different behavior based on the traits provided; it is possible for an encoding to be provided, where all strings and characters are valid, like a default char string. Thus for some string types, the extra validation step that is now required is a complete waste of time, but the interface now requires it. For these all-valid types, the current validation method would essentially be inlined to just returning true for every check, and thus be optimized out.
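
The traits arrangement described here might be sketched as follows. Every name is hypothetical (`RawCharTraits`, `AsciiTraits`, `BasicString`, `is_valid`), and 7-bit ASCII stands in for an encoding that can reject input, purely to keep the example short; a UTF-8 traits class would slot in the same way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical traits for a plain byte string: every sequence is valid,
// so the check is a constant the compiler can fold away.
struct RawCharTraits {
    static constexpr bool is_valid(const char *, std::size_t) { return true; }
};

// Hypothetical traits for 7-bit ASCII, standing in for any encoding
// (such as UTF-8) in which some byte sequences are invalid.
struct AsciiTraits {
    static bool is_valid(const char *s, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            if (static_cast<unsigned char>(s[i]) > 0x7F) return false;
        return true;
    }
};

template <class Traits>
class BasicString {
public:
    // Returns false and leaves the contents unchanged if Traits rejects the input.
    bool assign(const char *s, std::size_t n) {
        if (!Traits::is_valid(s, n)) return false;
        data_.assign(s, n);
        return true;
    }
    std::size_t size() const { return data_.size(); }

private:
    std::string data_;  // invariant: always valid according to Traits
};
```

With `RawCharTraits` the validity test reduces to `true`, so an optimizing compiler can drop the branch, while the interface stays identical for encodings that do need checking.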
