UTF-8 String Validation Responsibility

50 comments, last by Ectara 11 years, 1 month ago

I'm having a problem implementing a UTF-8 string: I'm unsure where the responsibility should lie for validating the string's contents. My options seem to be these:

1. Make it an invariant that the contained string must be valid. If an attempt is made to insert invalid characters, ignore them or replace them with the replacement character.

2. Contain possibly invalid code units, and provide sanitized output to code that tries to read the string. This seems like a bad combination of the first and third options.

3. Contain raw code units, and place the responsibility on anything that uses the string to check that the string is valid before using it. It can then filter out the invalid characters, or replace them.

The first would be very simple; validation upon inserting new characters would simplify the logic of handling the characters inside.
The second would be simple for entering code units, as they would go unchecked, but every time the string is used, the logic would be complicated and error-prone.

The third would be like what I see in a lot of places that use UTF-8 encoding, but it would be error-prone and things would be more complicated, like calculating the length of strings in code units, and advancing a certain number of code points ahead.

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.


So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

So, which do you think would be the best behavior? I feel like the first would be ideal, but it might provide unexpected behavior for people that purposely input bad data, and expect it to be output verbatim, or people would expect to have to sanitize input themselves, by their own rules.

The first is ideal.

If someone doesn't want sanitised storage, they can always use a std::vector<uchar>.

I keep referring back to the argument that if someone wants unsanitized storage, they can just use a vector of code units; I guess what holds me back from following this is that the string functions are made to handle strings, so vectors can't be passed, aside from passing a pointer to the internal storage. But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.

But, on the other hand, passing unsanitized strings around seems counterproductive, so leaving this ability in seems unreasonable.

To my mind, the principle you want to follow here is fail fast, fail early.

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

I'd much rather have an exception thrown right when I attempt to load a string, than spend hours debugging the output end of my program to try and figure out what scrambled my characters...


I wholly agree with failing at where the problem lies (I used to be a fan of safely undoing what I had done so far, reporting failure, and then letting the caller decide what to do about it), though I guess I have concerns about using exceptions in general. I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions. I was considering having a macro definition that decides whether it would ignore invalid characters or replace them with the replacement character.

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

I'd like to avoid them, and so far, I use them simply to gracefully unwind and call destructors; I'd like no (correct) behavior to change if I disable exceptions.

Are you intending this library for use on a platform that does not support exceptions?

I don't suppose it terribly matters what mechanism you use to inform the programmer of an error condition, so long as:

  • The error handling mechanism is consistent throughout your API.
  • The programmer has the option to handle the error.
  • The error causes the program/debugger to stop at the given location unless handled.

To my mind, exceptions are the easiest way to achieve those ends - though continuations and callback functors are both viable given the right supporting toolset.

Is it worth it to use exceptions for this? It seems like it might provide behavior that is easier to anticipate and understand.

You definitely need some way to inform the programmer of the error.

If you don't use exceptions, then you have a whole raft of thorny issues to consider:

  • How do you signal failure from a constructor?
  • How do you signal failure from an append operator?
  • How do you make sure that the programmer is checking the error state?
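
One possible way to answer those three questions without exceptions is sketched below, assuming C++17. The names `Utf8String`, `from_bytes`, and `try_append` are hypothetical and not taken from any library mentioned in this thread. A static factory returning `std::optional` stands in for a constructor that can fail, a named `try_append` stands in for an append operator, and `[[nodiscard]]` asks the compiler to warn when the result is ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

class Utf8String {
public:
    Utf8String() = default;  // the empty string is trivially valid

    // Stands in for a validating constructor: yields std::nullopt on bad input.
    [[nodiscard]] static std::optional<Utf8String> from_bytes(std::string_view bytes) {
        if (!is_valid(bytes)) return std::nullopt;
        return Utf8String(std::string(bytes));
    }

    // Stands in for operator+=: appends only if `bytes` is valid UTF-8.
    // On failure the string is left unchanged and false is returned.
    [[nodiscard]] bool try_append(std::string_view bytes) {
        if (!is_valid(bytes)) return false;
        data_.append(bytes);
        return true;
    }

    const std::string& bytes() const { return data_; }

private:
    // Private so that unvalidated bytes cannot bypass the factory.
    explicit Utf8String(std::string validated) : data_(std::move(validated)) {}

    // Rejects stray or missing continuation bytes, truncated sequences,
    // overlong forms, surrogates, and values above U+10FFFF.
    static bool is_valid(std::string_view s) {
        for (std::size_t i = 0; i < s.size();) {
            const auto b0 = static_cast<unsigned char>(s[i]);
            std::size_t len = 0;
            std::uint32_t cp = 0, min_cp = 0;
            if (b0 < 0x80)                { ++i; continue; }
            else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
            else return false;  // continuation byte or invalid lead byte
            if (len > s.size() - i) return false;  // truncated sequence
            for (std::size_t k = 1; k < len; ++k) {
                const auto b = static_cast<unsigned char>(s[i + k]);
                if ((b & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
            i += len;
        }
        return true;
    }

    std::string data_;
};
```

A caller would write something like `auto s = Utf8String::from_bytes(input); if (!s) { /* handle the error */ }`. This covers the "option to handle the error" point, but `[[nodiscard]]` only produces a warning, so it does not stop the program at the failure site the way an uncaught exception would.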

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

How do you signal failure from a constructor?
How do you signal failure from an append operator?
How do you make sure that the programmer is checking the error state?


Coming from C, I'm still very used to returning error codes; however, I haven't written a function in C++ that uses them yet. I've marked places where an error should be handled, but I've yet to decide how I will implement it.

Exceptions seem extremely ideal, but I want to maintain cross-platform ability for games that I write to at least the platforms I have on hand; not all of them support standard exceptions, and some take a heavy performance hit. I'm thinking some kind of assertion system that aborts on serious error. Not sure yet; that's one reason that I've left error handling placeholders.

Additionally, there is a new problem of how to handle creation of a string from unsanitized code units: do I perform sanitization on the input, through omission or substitution, so that the invariant is in place afterwards? If someone passes in an array of five UTF-8 character sequences, of which four are valid and one is an invalid sequence of bytes, the resulting length can change based on how I handle it, and the contents can change drastically based on where the invalid bytes are within the string. I see people throwing an exception when bad text is inserted; this seems rather harsh, as it means the text must be sanitized before it ever reaches the string.

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.

What kind of functionality are you putting together for UTF-8 handling? Are some operations performance critical? How many strings, and how large, might have to be processed?

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

For a general purpose system, keep everything internally in one format and assume it's all validated. Validate all input before it's allowed to wander around in the system.


Having everything be valid internally seems to be the best way to go. It seems that I should treat invalid text being assigned to the string as a fatal error, and provide a separate function that takes input iterators and an output iterator and writes sanitized output to a destination, to allow easy sanitization given that the string itself will not sanitize data.
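
A rough sketch of such a standalone sanitizing function follows. The name `sanitize_utf8` is hypothetical. It assumes forward iterators rather than strict single-pass input iterators, because it needs to look ahead and then re-read the bytes of a sequence, and it assumes a substitution policy of one U+FFFD (bytes EF BF BD) per rejected byte. Other policies, such as omission, would be equally possible.

```cpp
#include <cstdint>
#include <iterator>
#include <string>

// Copies [first, last) to `out`. Well-formed UTF-8 sequences are copied
// unchanged; every byte that cannot start a well-formed sequence is replaced
// by U+FFFD. Returns the advanced output iterator.
template <typename FwdIt, typename OutIt>
OutIt sanitize_utf8(FwdIt first, FwdIt last, OutIt out) {
    while (first != last) {
        const auto b0 = static_cast<unsigned char>(*first);
        int len = 0;                       // 0 means "not a valid lead byte"
        std::uint32_t cp = 0, min_cp = 0;  // min_cp rejects overlong encodings
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }

        bool ok = (len != 0);
        FwdIt next = first;
        if (ok) ++next;
        // Look ahead over the expected continuation bytes.
        for (int k = 1; ok && k < len; ++k, ++next) {
            if (next == last) { ok = false; break; }  // truncated sequence
            const auto b = static_cast<unsigned char>(*next);
            if ((b & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (b & 0x3F);
        }
        ok = ok && cp >= min_cp && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        if (ok) {
            for (; first != next; ++first) *out++ = *first;  // copy the sequence
        } else {
            *out++ = '\xEF'; *out++ = '\xBF'; *out++ = '\xBD';  // U+FFFD
            ++first;  // skip only the offending byte, then resynchronize
        }
    }
    return out;
}
```

Used as `sanitize_utf8(raw.begin(), raw.end(), std::back_inserter(clean));`, the output can grow to three times the input size in the worst case, since each rejected byte becomes three bytes.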

The immediate concern is that I need to make a temporary copy of the sanitized data before I can pass it to the string. It can be done in blocks, and appended to the string, for a similar cost to resizing one giant vector. If I can add raw code units to the string's storage, it'd seem to be a method of injecting invalid characters into the string.

What kind of functionality are you putting together for UTF-8 handling? Are some operations performance critical? How many strings, and how large, might have to be processed?

It has all of the features of a std::basic_string, plus more. Many operations are performance critical; while it won't be used as much, UTF-8 is by definition slower to use than a fixed-width encoding, which puts it at a disadvantage. The number and size of the strings will be difficult to predict. A game that is more text-heavy might use it all of the time, especially if it is in a HUD. UTF-8 is the transformation format of choice in my scripting language, so it will be used there a lot. I also will use it for configuration files and localization data: just about everywhere that text will be printed to the screen, or read from a file intended to be edited. The strings can grow quite large, too, when handling an entire script file, or if one reads an entire configuration file. It would be in my best interest to streamline the performance by removing all validity checks from the string handling, and to make sure that the data is valid upon entry.
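
To illustrate what dropping the checks buys, here is a minimal sketch of two code point operations that rely on the data already being valid UTF-8. The function names are hypothetical. Because every code point in valid UTF-8 has exactly one byte that is not a continuation byte (the 10xxxxxx pattern), neither function has to decode or verify anything.

```cpp
#include <cstddef>
#include <string_view>

// True for bytes of the form 10xxxxxx.
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Length in code points. Assumes `s` is valid UTF-8.
inline std::size_t code_point_count(std::string_view s) {
    std::size_t n = 0;
    for (char c : s) {
        if (!is_continuation(c)) ++n;  // count only the first byte of each sequence
    }
    return n;
}

// Byte offset reached after moving `count` code points forward from byte
// offset `pos`, clamped to s.size(). Assumes `s` is valid UTF-8 and that
// `pos` lies on a code point boundary.
inline std::size_t advance_code_points(std::string_view s, std::size_t pos,
                                       std::size_t count) {
    while (count > 0 && pos < s.size()) {
        ++pos;                                                     // skip the lead byte
        while (pos < s.size() && is_continuation(s[pos])) ++pos;  // and its tail
        --count;
    }
    return pos;
}
```

On unvalidated input these functions would still terminate without reading out of bounds, but the counts and offsets they return would not be meaningful, which is the cost the third option pushes onto every caller.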

It seems that I could sanitize upon assigning to a string, but it seems like that'd be unwise, and that it should be a separate function.

+ have you considered using existing libraries?
https://sites.google.com/site/icusite/
http://utfcpp.sourceforge.net/

Nope. I have read up on them, though.

This topic is closed to new replies.
