I'm implementing a UTF-8 string type, and I'm unsure where the responsibility for validating the string's contents should lie. My options seem to be these:
1. Make it an invariant that the contained string is always valid. If code tries to insert invalid characters, ignore them or replace them with the replacement character (U+FFFD).
2. Store possibly invalid code units, and provide sanitized output to code that reads the string. This seems like a bad combination of the first option and the next.
3. Store raw code units, and make anything that uses the string responsible for checking that it is valid before using it. The caller can then filter out the invalid characters or replace them.
The first would be very simple; validation upon inserting new characters would simplify the logic of handling the characters inside.
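To make the first option concrete, here is a minimal sketch in Python. The class name `Utf8String` and its methods are hypothetical, and it assumes that replacing bad input with U+FFFD (rather than dropping it) is the chosen policy.

```python
class Utf8String:
    """Hypothetical wrapper whose stored bytes are always valid UTF-8."""

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, raw: bytes) -> None:
        # errors="replace" turns malformed input into U+FFFD, so the
        # re-encoded bytes are always valid UTF-8.
        # Caveat: a multi-byte sequence split across two append() calls
        # is treated as malformed in this sketch.
        text = raw.decode("utf-8", errors="replace")
        self._data += text.encode("utf-8")

    def code_unit_len(self) -> int:
        return len(self._data)

    def code_point_len(self) -> int:
        # Counting non-continuation bytes is enough because the
        # invariant guarantees well-formed data.
        return sum(1 for b in self._data if (b & 0xC0) != 0x80)

    def __bytes__(self) -> bytes:
        return bytes(self._data)
```

Because the invariant is enforced at the single entry point, every other method can skip validation entirely.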
The second would be simple for entering code units, as they would go unchecked, but the logic that runs every time the string is read would be complicated and error-prone.
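A rough sketch of the second option, again in Python with hypothetical names (`LazyUtf8String`, `raw`, `text`). It assumes readers want U+FFFD substituted for malformed input; a real implementation would also have to sanitize every other read path, such as indexing and iteration.

```python
class LazyUtf8String:
    """Hypothetical wrapper: stores anything, sanitizes only on read."""

    def __init__(self) -> None:
        self._raw = bytearray()

    def append(self, raw: bytes) -> None:
        # No checking at all on the way in.
        self._raw += raw

    def raw(self) -> bytes:
        # Verbatim code units, possibly invalid.
        return bytes(self._raw)

    def text(self) -> str:
        # Sanitized view: malformed input becomes U+FFFD.
        return self._raw.decode("utf-8", errors="replace")
```

One consequence of this design is that a multi-byte sequence arriving in two separate pieces still reads back correctly, because nothing is judged until the whole buffer is decoded.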
The third matches what I see in a lot of places that use UTF-8 encoding, but it would be error-prone and would complicate operations such as calculating the length of a string in code points or advancing a certain number of code points ahead.
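For the third option, the burden moves to each caller. The sketch below (hypothetical helper names, plain `bytes` standing in for the raw string) shows the kind of check every consumer would need before doing something as basic as counting code points.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_code_points(data: bytes) -> int:
    """Count code points, refusing input that is not valid UTF-8."""
    if not is_valid_utf8(data):
        # The caller decides the policy; this sketch simply rejects.
        raise ValueError("data is not valid UTF-8")
    # Every code point has exactly one non-continuation byte.
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```

Any caller that forgets the validity check gets a meaningless count on malformed input, which is the error-proneness described above.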
So, which do you think would be the best behavior? I feel like the first would be ideal, but it might surprise people who purposely input bad data and expect it to be output verbatim, or people who expect to sanitize input themselves, by their own rules.