UTF-8 String Validation Responsibility

50 comments, last by Ectara 11 years, 2 months ago

To me, the answer is clear - you validate the data as it is added to the class. There's no point having a UTF-8 class that doesn't ensure that it always holds valid UTF-8. It should validate on the way in, not require that you have pre-validated data. What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing. At the very least, your validation and string creation function should be a static member of your UTF-8 class, i.e. a factory that takes the data and returns the valid UTF-8 string.

I don't think performance will be an issue. Languages like C# and Python 3 use Unicode strings everywhere and the problem doesn't really come up. The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.
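As a rough illustration of the cached-length point, here is a minimal C++ sketch of counting code points in a buffer that is assumed to already be valid UTF-8. The function name `count_code_points` is hypothetical, and "length in characters" is taken here to mean code points, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is ASSUMED to be valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

A string class could run this once when the contents change and store the result, so that asking for the length afterwards is a constant-time lookup.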


It should validate on the way in, not require that you have pre-validated data.


Well, if it is going to throw some sort of exception and abort on finding that the text isn't valid, then in places where the text isn't always expected to be valid, the text needs to be sanitized first, and thus made valid. Then the string class asserts its validity.

Are you saying that the string should do the sanitizing, or the validation?

What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing.


I agree with this, though my thinking is that if code is interacting with the string class in a generic manner (the string class is a templated class), then the data should already be valid; it is when the data is being created that one must deal with specifics.

At the very least, your validation and string creation function should be a static member of your UTF-8 class, i.e. a factory that takes the data and returns the valid UTF-8 string.


This one troubles me. Having a constructor that can do the work seems like a very powerful asset to me, and replacing it with a static string factory would be a large paradigm shift in how the string will be used; having a static function for creation, then having a destructor do cleanup seems like an anti-pattern to me. Surely this isn't the only way of doing things.

The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.


I agree. Hence, why I went with UTF-8 over UTF-32.

I'm not entirely sure what your distinction between sanitizing and validation is. To sanitize an input essentially requires validating it, so I'm using the terms interchangeably. Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error. If you need to transform the data, that is a different concept, and it requires a different approach.

You have 2 possible situations really:

  1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
  2. You have text data in a different encoding that you need to have in UTF-8 form.

ASCII data can obviously be handled either way, providing you're 100% confident it's ASCII. If you're not confident, you have to explicitly choose one of the above, because there's no single correct way to transform multiple arbitrary encodings and get correct information out. You can't 'clean' a non-UTF-8 string to make it UTF-8 because there's no information telling you exactly what codepoints the invalid characters should be.

Method 1 can check that the bytes are valid UTF-8, and fail if not. It wouldn't take any action to change invalid data, hence it would be wrong to say it was sanitizing it.
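For reference, a check of the kind Method 1 describes could look roughly like the C++ sketch below. It only reports whether the bytes are well-formed UTF-8 and never modifies them. The name `is_valid_utf8` is hypothetical, and the byte ranges follow the Unicode standard's table of well-formed UTF-8 sequences (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Takes no corrective action.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;               // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the FIRST continuation byte

        if (lead <= 0x7F)                      { extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { extra = 1; }
        else if (lead == 0xE0)                 { extra = 2; lo = 0xA0; }  // rejects overlong forms
        else if (lead == 0xED)                 { extra = 2; hi = 0x9F; }  // rejects surrogates
        else if (lead >= 0xE1 && lead <= 0xEF) { extra = 2; }
        else if (lead == 0xF0)                 { extra = 3; lo = 0x90; }  // rejects overlong forms
        else if (lead >= 0xF1 && lead <= 0xF3) { extra = 3; }
        else if (lead == 0xF4)                 { extra = 3; hi = 0x8F; }  // rejects > U+10FFFF
        else { return false; }  // 0x80..0xC1 and 0xF5..0xFF never start a sequence

        if (n - i - 1 < extra) {
            return false;  // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if (c < lo || c > hi) {
                return false;
            }
            lo = 0x80;  // later continuation bytes use the full range
            hi = 0xBF;
        }
        i += extra + 1;
    }
    return true;
}
```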

Method 2 would need to decode the characters to their Unicode code points, then re-encode as UTF-8. Again, it's not really sanitizing it, just converting it.
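To make the decode-then-re-encode step concrete, here is a small C++ sketch that uses ISO-8859-1 (Latin-1) as the source encoding, chosen only because each Latin-1 byte maps directly to the code point with the same value. The function names `append_utf8` and `latin1_to_utf8` are hypothetical, and other source encodings would need their own decoding step.

```cpp
#include <string>

// Appends the UTF-8 encoding of code point `cp` to `out`.
// Returns false and appends nothing if `cp` is a surrogate or above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;  // surrogates are not valid scalar values
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;
    }
    return true;
}

// Decodes Latin-1 (byte value == code point) and re-encodes as UTF-8.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char byte : latin1) {
        append_utf8(out, byte);  // cannot fail: all values are <= 0xFF
    }
    return out;
}
```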

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.
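The two options described above might be sketched like this in C++. Everything here is an assumption made for illustration: the class name `Utf8String`, the factory name `try_create`, and the use of C++17's `std::optional` as the "no object" result for builds without exceptions. The validator decodes each sequence and rejects overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0, min = 0;  // min = smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else { return false; }
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

class Utf8String {
public:
    // Option A: the constructor validates and throws on bad input.
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_)) {
            throw std::invalid_argument("input is not well-formed UTF-8");
        }
    }

    // Option B: a static factory for code that cannot use exceptions.
    static std::optional<Utf8String> try_create(std::string bytes) {
        if (!is_valid_utf8(bytes)) {
            return std::nullopt;
        }
        return Utf8String(std::move(bytes), Prevalidated{});
    }

    const std::string& bytes() const { return bytes_; }

private:
    struct Prevalidated {};  // tag: the caller has already validated
    Utf8String(std::string bytes, Prevalidated) : bytes_(std::move(bytes)) {}

    std::string bytes_;
};
```

Either way there is a single step that yields a valid object or reports failure, and an object that exists is always valid, so the destructor needs no special cleanup path.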

I'm not entirely sure what your distinction between sanitizing and validation is.


I interpret validation as checking the data's validity, and marking an error condition when the data isn't valid. On the other hand, I interpret sanitization as removing or replacing invalid data, resulting in data that will pass validation.

Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error.


This is something I want to avoid, so I'm leaning toward the string not sanitizing on entry, but just validating.

1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
2. You have text data in a different encoding that you need to have in UTF-8 form.


The first is most likely, and what I am handling. The second will be handled elsewhere; I can easily use character iterators to convert UTF-16 to UTF-8 and vice versa. Other encodings will require a custom function to get each character, convert it to its equivalent in the destination encoding, then put it in the destination string.

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.


That makes a lot more sense, and I agree. If the constructor can't use exceptions, then it can't report an exceptional condition without violating RAII: every use of the class would need extra steps to check that construction succeeded, plus an extra cleanup path in case it failed.

I'm having some concerns about exceptions that I'll raise in another thread.

I'm having some concerns about exceptions that I'll raise in another thread.

Link me to that thread when you do - I'm currently working on a design for a robust exception handling alternative, and I'm interested to see concerns and alternate use cases.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Link me to that thread when you do


Here's the link; I'll send it in a PM as well.
http://www.gamedev.net/topic/639123-portable-use-and-disabling-of-exceptions/

Another question on the string validation responsibility: if I am validating on the way in, then to avoid an error, the caller needs to ensure that the string is valid beforehand. This effectively requires that a string be validated twice: once by the caller to check for errors, and again when the text is actually inserted. Is this the right way to do it?

No. Why should there be more than a single point of validation, in the constructor?

In the constructor, it only validates once. However, if it fails, there's a critical error that results in breaking into the debugger. So, in order to prevent this error, one must validate or sanitize before passing the data to the constructor. Or, do people not even make sure it is valid, and let the application abort to let them know?

Or, do people not even make sure it is valid, and let the application abort to let them know?

Abort is the wrong way to go here - validation errors are not necessarily fatal, and it is up to the caller to make that determination.

In these situations you need to give the user some way to check the results. Either a by-reference error code, a factory function that returns a status, or a thrown exception.

I also would not recommend a separate validation pass, unless validation is very cheap, and conversion very expensive. If validation has roughly the same complexity as conversion (and I assume that both are O(N), or thereabouts), you don't want to incur two passes over the data...
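As a sketch of folding validation into the conversion pass, the C++ function below converts UTF-16 to UTF-8 and detects unpaired surrogates in the same loop, reporting the outcome through a returned status so the caller decides whether the error is fatal. The names `ConvertStatus` and `utf16_to_utf8` are hypothetical, and the choice to leave `out` untouched on failure is an assumption, not something specified in this thread.

```cpp
#include <cstddef>
#include <string>
#include <utility>

enum class ConvertStatus { Ok, UnpairedSurrogate };

// Converts UTF-16 to UTF-8 in one pass, validating as it goes.
// On failure, `out` is left unchanged.
ConvertStatus utf16_to_utf8(const std::u16string& in, std::string& out) {
    std::string result;
    result.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one next
            if (i + 1 >= in.size()) return ConvertStatus::UnpairedSurrogate;
            const char32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) return ConvertStatus::UnpairedSurrogate;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;  // consumed the low surrogate as well
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return ConvertStatus::UnpairedSurrogate;  // low surrogate with no high one
        }

        // Encode the code point as UTF-8.
        if (cp <= 0x7F) {
            result += static_cast<char>(cp);
        } else if (cp <= 0x7FF) {
            result += static_cast<char>(0xC0 | (cp >> 6));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0xFFFF) {
            result += static_cast<char>(0xE0 | (cp >> 12));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            result += static_cast<char>(0xF0 | (cp >> 18));
            result += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    out = std::move(result);
    return ConvertStatus::Ok;
}
```

The caller checks the status once and never needs to pre-validate the input, so the data is only traversed a single time.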


