UTF-8 String Validation Responsibility


51 replies to this topic

#1 Ectara   Crossbones+   -  Reputation: 3058

Posted 19 February 2013 - 05:28 PM

I'm having a problem with implementing a UTF-8 string: I'm unsure where the responsibility should lie for validating the string's contents. My options seem to be these:

1. Make it an invariant that the contained string must be valid. If invalid characters are attempted to be inserted, ignore them, or replace them with the replacement character.

 

2. Contain possibly invalid code units, and provide sanitized output to code that tries to read the string. This seems like a bad combination of the first, and the next.

 

3. Contain raw code units, and place the responsibility on anything that uses the string to check whether the string is valid before using it. It can then filter out the invalid characters, or replace them.

 

The first would be very simple; validation upon inserting new characters would simplify the logic of handling the characters inside.
The second would be simple for entering code units, as they would go unchecked, but every time the string is used, the logic would be complicated and error-prone.

The third would be like what I see in a lot of places that use UTF-8 encoding, but it would be error-prone and things would be more complicated, like calculating the length of strings in code units, and advancing a certain number of code points ahead.
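
To make the contrast concrete, here is roughly what counting and advancing by code points look like once the contents are guaranteed valid (option 1). This is only a minimal sketch assuming C++17; the function names are hypothetical and not taken from any existing library. With unvalidated storage (option 3), each loop would also have to cope with truncated and malformed sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True for bytes of the form 10xxxxxx, which only ever continue a sequence.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Number of code points in `s`. Assumes `s` is already valid UTF-8, so every
// byte that is not a continuation byte starts exactly one code point.
inline std::size_t code_point_count(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;
    }
    return n;
}

// Byte offset reached after moving `count` code points forward from `pos`,
// clamped to s.size(). Assumes valid UTF-8 and `pos` on a code point boundary.
inline std::size_t advance_code_points(std::string_view s, std::size_t pos,
                                       std::size_t count) {
    assert(pos <= s.size());
    while (count > 0 && pos < s.size()) {
        ++pos;  // step over the lead byte
        while (pos < s.size() && is_continuation(s[pos])) ++pos;
        --count;
    }
    return pos;
}
```

Both functions are linear in the number of bytes visited, which is the cost being weighed against a fixed-width encoding.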

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.



#2 swiftcoder   Senior Moderators   -  Reputation: 10364

Posted 19 February 2013 - 05:40 PM

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.


Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#3 Ectara   Crossbones+   -  Reputation: 3058

Posted 19 February 2013 - 06:02 PM

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.

I keep referring back to the argument that if someone wants unsanitized storage, they can just use a vector of code units; I guess what holds me back from following this is that the string functions are made to handle strings, so vectors can't be passed, aside from passing a pointer to the internal storage. But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.



#4 swiftcoder   Senior Moderators   -  Reputation: 10364

Posted 19 February 2013 - 06:29 PM

But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.

To my mind, the principle you want to follow here is fail fast, fail early.

 

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...
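
A fail-fast constructor along these lines might look like the sketch below. It assumes C++17 and that exceptions are available; `Utf8String`, `sequence_length`, and `is_valid_utf8` are hypothetical names for illustration, and the byte ranges follow the Unicode table of well-formed UTF-8 byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Length (1-4) of the well-formed UTF-8 sequence at s[i], or 0 if malformed.
inline std::size_t sequence_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = at(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;  // rejects overlong forms
        if (b0 == 0xED) hi = 0x9F;  // rejects surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;  // rejects overlong forms
        if (b0 == 0xF4) hi = 0x8F;  // rejects values above U+10FFFF
    } else {
        return 0;  // stray continuation byte, or C0, C1, F5..FF
    }
    if (s.size() - i < len) return 0;  // truncated sequence
    if (at(i + 1) < lo || at(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if ((at(i + k) & 0xC0) != 0x80) return 0;
    }
    return len;
}

inline bool is_valid_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

// Hypothetical string type whose invariant is "always valid UTF-8".
class Utf8String {
public:
    // Throws at the point of loading rather than letting bad bytes travel.
    explicit Utf8String(std::string bytes) : bytes_(std::move(bytes)) {
        if (!is_valid_utf8(bytes_)) {
            throw std::invalid_argument("Utf8String: input is not valid UTF-8");
        }
    }
    const std::string& bytes() const noexcept { return bytes_; }

private:
    std::string bytes_;
};
```

The caller can still catch `std::invalid_argument` and decide whether the failure is fatal.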


Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#5 Ectara   Crossbones+   -  Reputation: 3058

Posted 19 February 2013 - 06:45 PM

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...


I wholly agree with failing at where the problem lies (I used to be a fan of safely undoing what I had done so far, reporting failure, and then letting the caller decide what to do about it), though I guess I have concerns about using exceptions in general. I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions. I was considering having a macro definition that decides whether it would ignore invalid characters or replace them with the replacement character.

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

#6 swiftcoder   Senior Moderators   -  Reputation: 10364

Posted 19 February 2013 - 06:57 PM

I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions.

Are you intending this library for use on a platform that does not support exceptions?

 

I don't suppose it terribly matters what mechanism you use to inform the programmer of an error condition, so long as:

  • The error handling mechanism is consistent throughout your API.
  • The programmer has the option to handle the error.
  • The error causes the program/debugger to stop at the given location unless handled.

To my mind, exceptions are the easiest way to achieve those ends - though continuations and callback functors are both viable given the right supporting toolset.

 

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

You definitely need some way to inform the programmer of the error.

 

If you don't use exceptions, then you have a whole raft of thorny issues to consider:

  • How do you signal failure from a constructor?
  • How do you signal failure from an append operator?
  • How do you make sure that the programmer is checking the error state?

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#7 Ectara   Crossbones+   -  Reputation: 3058

Posted 19 February 2013 - 07:14 PM

How do you signal failure from a constructor?
How do you signal failure from an append operator?
How do you make sure that the programmer is checking the error state?


Coming from C, I'm still very used to returning error codes; however, I haven't written a function in C++ that uses them yet. I've marked places where an error should be handled, but I've yet to decide how I will implement it.

Exceptions seem extremely ideal, but I want to maintain cross-platform ability for games that I write to at least the platforms I have on hand; not all of them support standard exceptions, and some take a heavy performance hit. I'm thinking some kind of assertion system that aborts on serious error. Not sure yet; that's one reason that I've left error handling placeholders.

#8 Ectara   Crossbones+   -  Reputation: 3058

Posted 19 February 2013 - 10:17 PM

Additionally, there is a new problem of how to handle creation of a string from unsanitized code units: do I sanitize the input, through omission or substitution, so that the invariant is in place? If someone passes in an array of five UTF-8 character sequences, of which four are valid and one is an invalid sequence of bytes, the resulting length can change based on how I handle it, and the contents can change drastically based on where the invalid bytes are within the string. I see people throwing an exception when bad text is inserted; this seems rather harsh, as it means the text must be sanitized before it ever reaches the string.


Edited by Ectara, 20 February 2013 - 08:49 AM.


#9 Yrjö P.   Crossbones+   -  Reputation: 1412

Posted 20 February 2013 - 01:12 AM

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.

What kind of functionality are you putting together for UTF-8 handling? Are some operations performance critical? How much string data might have to be processed?

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

#10 Ectara   Crossbones+   -  Reputation: 3058

Posted 20 February 2013 - 09:37 AM

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.


Having everything be valid internally seems to be the best way to go. It seems that I should treat invalid text being assigned to the string as a fatal error, and have a function that does something like taking input iterators and an output iterator, and sanitizing the output to a destination, to allow easy sanitation if the string will not sanitize data.
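
A separate sanitizing function of that kind could be sketched as follows. This is an illustration under assumptions rather than a finished design: it takes a `std::string_view` instead of a pair of input iterators to keep lookahead simple, it assumes C++17, all names are hypothetical, and it replaces every offending byte with U+FFFD, which is only one of several possible replacement policies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>

// Length (1-4) of the well-formed UTF-8 sequence at s[i], or 0 if malformed.
inline std::size_t sequence_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = at(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else {
        return 0;
    }
    if (s.size() - i < len) return 0;
    if (at(i + 1) < lo || at(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if ((at(i + k) & 0xC0) != 0x80) return 0;
    }
    return len;
}

// Copies `in` to `out`, substituting U+FFFD for each byte that does not
// begin a well-formed sequence. The output is always valid UTF-8.
template <class OutputIt>
OutputIt sanitize_utf8(std::string_view in, OutputIt out) {
    static constexpr char kReplacement[] = "\xEF\xBF\xBD";  // U+FFFD
    for (std::size_t i = 0; i < in.size();) {
        const std::size_t n = sequence_length(in, i);
        if (n == 0) {
            out = std::copy(kReplacement, kReplacement + 3, out);
            ++i;  // skip only the offending byte, then resynchronize
        } else {
            out = std::copy(in.data() + i, in.data() + i + n, out);
            i += n;
        }
    }
    return out;
}

// Convenience wrapper that sanitizes into a new std::string.
inline std::string sanitized_copy(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    sanitize_utf8(in, std::back_inserter(out));
    return out;
}
```

Because the output goes through an output iterator, the result could be written in blocks or straight into the destination, which bears on the temporary-copy concern.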

The immediate concern is that I need to make a temporary copy of the sanitized data before I can pass it to the string. It can be done in blocks, and appended to the string, for a similar cost to resizing one giant vector. If I can add raw code units to the string's storage, it'd seem to be a method of injecting invalid characters into the string.

 

What kind of functionality are you putting together for UTF-8 handling? Are some operations performance critical? How much string data might have to be processed?

It has all of the features of an std::basic_string, plus more. Many operations are performance critical; while it won't be used as much, UTF-8 being slower to use than a fixed-width encoding by definition puts it at a disadvantage. The largest amount of strings will be difficult to predict. A game that is more text-heavy might use it all of the time, especially if it is in a HUD. UTF-8 is the transformation format of choice in my scripting language, so it will be used there a lot. I also will use it for configuration files, and localization data. Just about everywhere where text will be printed to the screen, or read from a file intended to be edited. The strings can grow quite large, too, when handling an entire script file, or if one reads an entire configuration file. It would be in my best interest to streamline the performance by removing all things that would check for validity from the string handling, and make sure that it is valid upon entry.

It seems that I could sanitize upon assigning to a string, but it seems like that'd be unwise, and that it should be a separate function.

 

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

Nope. I have read up on them, though.



#11 Kylotan   Moderators   -  Reputation: 3338

Posted 20 February 2013 - 11:11 AM

To me, the answer is clear - you validate the data as it is added to the class. There's no point having a UTF-8 class that doesn't ensure that it is always valid UTF-8 all of the time. It should validate on the way in, not require that you have pre-validated data. What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing. At the very least, your validation and string creation function should be a static member of your UTF-8 class, i.e., a factory that takes the data and returns the valid UTF-8 string.

 

I don't think performance will be an issue. Languages like C# and Python 3 use Unicode strings everywhere and the problem doesn't really come up. The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.
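
As a rough illustration of caching the length in characters, the sketch below keeps a running code point count next to the bytes. The class and member names are hypothetical, C++17 is assumed, and the input to `append` is assumed to be valid UTF-8 that has been checked elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical wrapper that updates a cached code point count on append.
class CountedUtf8 {
public:
    // Precondition (assumed, not fully checked): `piece` is valid UTF-8.
    void append(std::string_view piece) {
        // Cheap sanity check: a piece must start on a code point boundary.
        assert(piece.empty() || !is_continuation(piece.front()));
        bytes_.append(piece.data(), piece.size());
        for (char c : piece) {
            if (!is_continuation(c)) ++code_points_;
        }
    }

    std::size_t size_bytes() const noexcept { return bytes_.size(); }
    std::size_t length() const noexcept { return code_points_; }  // O(1)

private:
    static bool is_continuation(char c) {
        return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
    }

    std::string bytes_;
    std::size_t code_points_ = 0;
};
```

The count is paid for once, while the data is already being touched, instead of on every length query.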



#12 Ectara   Crossbones+   -  Reputation: 3058

Posted 20 February 2013 - 11:26 AM

It should validate on the way in, not require that you have pre-validated data.


Well, if it is going to throw some sort of exception and abort on finding that the text isn't valid, then in places where the text isn't always expected to be valid, the text needs to be sanitized first, and thus made valid. Then the string class asserts its validity.

Are you saying that the string should do the sanitizing, or the validation?

 

What if you switch to UTF-16 later? Then you'd have 2 (or more) places that need changing.


I agree with this, though my mentality is if they are interacting with the string class in a generic manner (the string class is a templated class), then the data should already be valid; when the data is being created, then one must deal with specifics.

 

At the very least, your validation and string creation function should be a static member of your UTF-8 class, i.e., a factory that takes the data and returns the valid UTF-8 string.


This one troubles me. Having a constructor that can do the work seems like a very powerful asset to me, and replacing it with a static string factory would be a large paradigm shift in how the string will be used; having a static function for creation, then having a destructor do cleanup seems like an anti-pattern to me. Surely this isn't the only way of doing things.

 

The need for random access to arbitrary places in the text is rare, length in characters can be cached, etc.


I agree. Hence, why I went with UTF-8 over UTF-32.

#13 Kylotan   Moderators   -  Reputation: 3338

Posted 20 February 2013 - 01:40 PM

I'm not entirely sure what your distinction between sanitizing and validation is. To sanitize an input essentially requires validating it, so I'm using the terms interchangeably. Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error. If you need to transform the data, that is a different concept and requires a different approach.

 

You have 2 possible situations really:

  1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
  2. You have text data in a different encoding that you need to have in UTF-8 form.

ASCII data can obviously be handled either way, providing you're 100% confident it's ASCII. If you're not confident, you have to explicitly choose one of the above, because there's no single correct way to transform multiple arbitrary encodings and get correct information out. You can't 'clean' a non-UTF-8 string to make it UTF-8 because there's no information telling you exactly what codepoints the invalid characters should be.

 

Method 1 can check that the bytes are valid UTF-8, and fail if not. It wouldn't take any action to change invalid data, hence it would be wrong to say it was sanitizing it.

Method 2 would need to decode the characters to their Unicode code points, then re-encode as UTF-8. Again, it's not really sanitizing it, just converting it.
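
As a sketch of Method 2, the example below converts ISO-8859-1 (Latin-1) text, chosen here only because each of its bytes maps directly to the code point with the same value. The function names are hypothetical and C++17 is assumed; other source encodings would need their own decoding step in front of the same encoder.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of one Unicode scalar value to `out`.
inline void append_utf8(std::string& out, char32_t cp) {
    // Surrogates and values above U+10FFFF have no UTF-8 encoding.
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode step: a Latin-1 byte is its own code point (U+0000..U+00FF).
// Encode step: write that code point back out as UTF-8.
inline std::string latin1_to_utf8(std::string_view latin1) {
    std::string out;
    out.reserve(latin1.size());
    for (char c : latin1) {
        append_utf8(out, static_cast<unsigned char>(c));
    }
    return out;
}
```

Nothing is being repaired here: every input byte has a defined meaning in the source encoding, so the result is a conversion and cannot fail.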

 

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.
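
For the case without exceptions, a static factory might look like the following sketch. It uses `std::optional`, so it assumes C++17; `Utf8String`, `from_bytes`, and `sequence_length` are hypothetical names, and a status code or out-parameter would serve just as well where `std::optional` is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Length (1-4) of the well-formed UTF-8 sequence at s[i], or 0 if malformed.
inline std::size_t sequence_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = at(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else {
        return 0;
    }
    if (s.size() - i < len) return 0;
    if (at(i + 1) < lo || at(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if ((at(i + k) & 0xC0) != 0x80) return 0;
    }
    return len;
}

class Utf8String {
public:
    // The one clear step: yields a valid string, or std::nullopt if the
    // bytes are not well-formed UTF-8. No exception is thrown for bad input.
    static std::optional<Utf8String> from_bytes(std::string bytes) {
        for (std::size_t i = 0; i < bytes.size();) {
            const std::size_t n = sequence_length(bytes, i);
            if (n == 0) return std::nullopt;
            i += n;
        }
        return Utf8String(std::move(bytes));
    }

    const std::string& bytes() const noexcept { return bytes_; }

private:
    // Private so that every instance has passed through from_bytes.
    explicit Utf8String(std::string valid) noexcept : bytes_(std::move(valid)) {}

    std::string bytes_;
};
```

The destructor still does the cleanup as usual; only construction is routed through the factory, so a failed conversion never produces a half-built object.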
 



#14 Ectara   Crossbones+   -  Reputation: 3058

Posted 20 February 2013 - 02:44 PM

I'm not entirely sure what your distinction between sanitizing and validation is.


I interpret validation as checking the data's validity, and marking an error condition when the data isn't valid. On the other hand, I interpret sanitation as removing or replacing invalid data, resulting in data that will pass validation.

 

Sometimes sanitizing means removing or fixing bad data and carrying on but generally that is a bad approach because it hides the error.


This is something I want to avoid, so I'm leaning toward the string not sanitizing on entry, but just validating.

1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
2. You have text data in a different encoding that you need to have in UTF-8 form.


The first is most likely, and what I am handling. The second will be handled elsewhere; I can easily use character iterators to convert UTF-16 to UTF-8 and vice-versa. Other encodings will require a custom function to get the character, convert it to its equivalent encoding, then put it in the destination string.

I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.


That makes a lot more sense, and I agree; if the constructor can't use exceptions, then the constructor can't be used when an exceptional condition arises without violating RAII by having to do extra steps like ensuring everything went fine, and adding an extra cleanup path in case it failed for every instance of the code.

I'm having some concerns about exceptions that I'll raise in another thread.

#15 swiftcoder   Senior Moderators   -  Reputation: 10364

Posted 20 February 2013 - 03:02 PM

I'm having some concerns about exceptions that I'll raise in another thread.

Link me to that thread when you do - I'm currently working on a design for a robust exception handling alternative, and I'm interested to see concerns and alternate use cases.


Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]


#16 Ectara   Crossbones+   -  Reputation: 3058

Posted 20 February 2013 - 03:28 PM

Link me to that thread when you do


Here's the link; I'll send it in a PM as well.
http://www.gamedev.net/topic/639123-portable-use-and-disabling-of-exceptions/

#17 Ectara   Crossbones+   -  Reputation: 3058

Posted 21 February 2013 - 10:19 AM

Another question on the string validation responsibility: if I am validating on the way in, then to avoid an error, the caller needs to ensure that the string is valid. This effectively requires that a string be validated twice: once to check for errors, and again when the text is actually inserted. Is this the right way to do it?



#18 Kylotan   Moderators   -  Reputation: 3338

Posted 21 February 2013 - 01:32 PM

No. Why should there be more than a single point of validation, in the constructor?



#19 Ectara   Crossbones+   -  Reputation: 3058

Posted 21 February 2013 - 02:08 PM

In the constructor, it only validates once. However, if it fails, there's a critical error that results in breaking into the debugger. So, in order to prevent this error, one must validate or sanitize before passing the data to the constructor. Or, do people not even make sure it is valid, and let the application abort to let them know?



#20 swiftcoder   Senior Moderators   -  Reputation: 10364

Posted 21 February 2013 - 02:13 PM

Or, do people not even make sure it is valid, and let the application abort to let them know?

Abort is the wrong way to go here - validation errors are not necessarily fatal, and it is up to the caller to make that determination.

 

In these situations you need to give the user some way to check the results. Either a by-reference error code, a factory function that returns a status, or a thrown exception.

 

I also would not recommend a separate validation pass, unless validation is very cheap, and conversion very expensive. If validation has roughly the same complexity as conversion (and I assume that both are O(N), or thereabouts), you don't want to incur two passes over the data...
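
One way to avoid the second pass is to validate while copying and roll back on failure, as in the sketch below. The names are hypothetical, C++17 is assumed, and `src` is assumed not to point into `dst`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Length (1-4) of the well-formed UTF-8 sequence at s[i], or 0 if malformed.
inline std::size_t sequence_length(std::string_view s, std::size_t i) {
    assert(i < s.size());
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    const unsigned char b0 = at(i);
    if (b0 < 0x80) return 1;
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;
        if (b0 == 0xED) hi = 0x9F;
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;
        if (b0 == 0xF4) hi = 0x8F;
    } else {
        return 0;
    }
    if (s.size() - i < len) return 0;
    if (at(i + 1) < lo || at(i + 1) > hi) return 0;
    for (std::size_t k = 2; k < len; ++k) {
        if ((at(i + k) & 0xC0) != 0x80) return 0;
    }
    return len;
}

// Appends `src` to `dst` while validating it, reading the input only once.
// On malformed input, restores `dst` to its previous length and returns
// false, leaving the caller to decide whether that is fatal.
// Precondition (assumed): `src` does not alias `dst`.
inline bool append_validated(std::string& dst, std::string_view src) {
    const std::size_t old_size = dst.size();
    dst.reserve(old_size + src.size());
    for (std::size_t i = 0; i < src.size();) {
        const std::size_t n = sequence_length(src, i);
        if (n == 0) {
            dst.resize(old_size);  // roll back the partial copy
            return false;
        }
        dst.append(src.data() + i, n);
        i += n;
    }
    return true;
}
```

The trade-off is that a failure late in a long input wastes the copying done so far, which seems acceptable if invalid input is the exceptional case.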


Edited by swiftcoder, 21 February 2013 - 02:14 PM.

Tristam MacDonald - Software Engineer @Amazon - [swiftcoding]




