I'm not entirely sure what your distinction between sanitizing and validation is.
I take validation to mean checking the data and flagging an error condition when it isn't valid. Sanitization, on the other hand, I take to mean removing or replacing invalid data so that the result will pass validation.
Sometimes sanitizing means removing or fixing bad data and carrying on, but generally that is a bad approach because it hides the error.
This is something I want to avoid, so I'm leaning toward having the string only validate its input on entry, not sanitize it.
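To make the distinction concrete, here is a rough sketch of the two behaviours side by side. The function names (`utf8_sequence_length`, `is_valid_utf8`, `sanitize_utf8`) are made up for illustration and are not part of my class. Replacing each bad byte with U+FFFD is just one possible sanitizing policy that I am assuming here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Length (1-4) of the well-formed UTF-8 sequence starting at s[i], or 0 if
// the bytes there are not well-formed. Requires i < s.size().
std::size_t utf8_sequence_length(std::string_view s, std::size_t i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    auto is_cont = [&](std::size_t k) {
        return k < s.size() && (byte(k) & 0xC0) == 0x80;
    };
    const unsigned char b0 = byte(i);
    if (b0 < 0x80) return 1;                        // ASCII
    if (b0 >= 0xC2 && b0 <= 0xDF)                   // 2-byte lead (0xC0/0xC1 are overlong)
        return is_cont(i + 1) ? 2 : 0;
    if (b0 >= 0xE0 && b0 <= 0xEF) {                 // 3-byte lead
        if (!is_cont(i + 1) || !is_cont(i + 2)) return 0;
        const unsigned char b1 = byte(i + 1);
        if (b0 == 0xE0 && b1 < 0xA0) return 0;      // overlong encoding
        if (b0 == 0xED && b1 > 0x9F) return 0;      // UTF-16 surrogate range
        return 3;
    }
    if (b0 >= 0xF0 && b0 <= 0xF4) {                 // 4-byte lead
        if (!is_cont(i + 1) || !is_cont(i + 2) || !is_cont(i + 3)) return 0;
        const unsigned char b1 = byte(i + 1);
        if (b0 == 0xF0 && b1 < 0x90) return 0;      // overlong encoding
        if (b0 == 0xF4 && b1 > 0x8F) return 0;      // above U+10FFFF
        return 4;
    }
    return 0;                                       // stray continuation or invalid lead
}

// Validation: report whether the data is well-formed, change nothing.
bool is_valid_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) return false;
        i += n;
    }
    return true;
}

// Sanitization (assumed policy): replace every offending byte with U+FFFD
// and carry on, so the result always passes is_valid_utf8.
std::string sanitize_utf8(std::string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const std::size_t n = utf8_sequence_length(s, i);
        if (n == 0) {
            out += "\xEF\xBF\xBD";                  // U+FFFD REPLACEMENT CHARACTER
            ++i;
        } else {
            out.append(s.data() + i, n);
            i += n;
        }
    }
    return out;
}
```

The sanitizer never fails, which is exactly the problem: the caller can no longer tell that the input was bad.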
1. You have byte data that you require to be already in UTF-8, and essentially want to convert its type into an instance of your UTF-8 class.
2. You have text data in a different encoding that you need to have in UTF-8 form.
The first is the most likely case, and it is the one I am handling. The second will be handled elsewhere; I can easily use character iterators to convert UTF-16 to UTF-8 and vice versa. Other encodings will require a custom function that reads each character, converts it to its equivalent in the target encoding, and then puts it in the destination string.
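For the UTF-16 case, the conversion could look roughly like the sketch below. It uses a plain index loop instead of my character iterators, and the names `append_utf8` and `utf16_to_utf8` are hypothetical. Returning an empty `std::optional` on an unpaired surrogate is my assumption about how to report failure; it keeps to the validate-don't-sanitize idea.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of cp to out. The caller must pass a valid
// Unicode scalar value (not a surrogate, not above U+10FFFF).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts UTF-16 code units to UTF-8. Returns std::nullopt if the input
// contains an unpaired surrogate instead of silently repairing it.
std::optional<std::string> utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (i + 1 >= in.size()) return std::nullopt;
            const char32_t low = in[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) return std::nullopt;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            ++i;                                     // consumed the low surrogate too
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   // low surrogate with no partner
            return std::nullopt;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The reverse direction would be the mirror image: decode each UTF-8 sequence to a code point, then emit one or two UTF-16 code units.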
I would argue that your string constructor (or constructors, more likely) needs to look at the data, and either create a valid UTF-8 string out of it, or fail. If you can use an exception, that is ideal. If you can't use exceptions, then doing the work in the constructor isn't ideal, hence my suggestion of a static factory method or methods. But basically there should be one clear step which either yields up a valid UTF-8 object or tells you that the data you provided is wrong.
That makes a lot more sense, and I agree. If the constructor can't use exceptions, then it can't handle an exceptional condition without violating RAII: every instance of the code would need an extra step to check that construction went fine, plus an extra cleanup path in case it failed.
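As I understand the suggestion, the two options would look something like the sketch below. `Utf8String`, `from_bytes`, and the private `AlreadyValidated` tag are placeholder names for illustration, not my actual class, and the validator is a compact stand-in.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Compact validator: decodes each sequence and rejects truncated input,
// bad continuation bytes, overlong forms, surrogates, and values > U+10FFFF.
bool is_valid_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp, min;
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;
        if (s.size() - i < len) return false;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

class Utf8String {
public:
    // Option 1: the constructor validates and throws, so an object that
    // exists is always valid and no post-construction check is needed.
    explicit Utf8String(std::string_view bytes) : data_(checked(bytes)) {}

    // Option 2: a static factory for code that can't use exceptions.
    // One clear step that yields a valid object or nothing at all.
    static std::optional<Utf8String> from_bytes(std::string_view bytes) {
        if (!is_valid_utf8(bytes)) return std::nullopt;
        return Utf8String(std::string(bytes), AlreadyValidated{});
    }

    const std::string& bytes() const { return data_; }

private:
    struct AlreadyValidated {};  // tag: skips the second validation pass
    Utf8String(std::string data, AlreadyValidated) : data_(std::move(data)) {}

    static std::string checked(std::string_view bytes) {
        if (!is_valid_utf8(bytes))
            throw std::invalid_argument("input is not well-formed UTF-8");
        return std::string(bytes);
    }

    std::string data_;
};
```

Either way there is never a half-constructed object to check or clean up, which is the point of the argument above. Note that the factory still allocates, so it is not exception-free in the strict sense.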
I'm having some concerns about exceptions that I'll raise in another thread.