
# Problem With XML UTF-8 Encoding


## Recommended Posts

I have a stupid problem. My XML file starts with <?xml version="1.0" encoding="UTF-8" standalone="yes"?> and I want to write characters like ??ó??? into it from C++ code. The file already contains characters of this kind, so its encoding is fine. I can write plain ASCII characters, but when I tried

std::string za="?a?aba";
file.insert(234,za,0,za.size());

I got the error "The XML data is invalid according to the schema" when opening the file.

When I change
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

to

<?xml version="1.0" encoding="Windows-1250" standalone="yes"?>
the characters that I wrote to the file were correct, but all the special characters already in the file were broken.
The same happens with ISO-8859-2.

I also tried editing the XML in plain Notepad and got the same error, "The XML data is invalid according to the schema".
I also tried
std::string za="?a?aba"; with \uXXXX escapes for the Unicode characters, but I always got '?' in the file.
How can I solve this?
Edited by widmowyfox

##### Share on other sites
If you promise a file will contain UTF-8 (by specifying that encoding), then you must honor that promise.

That means your source file containing the
std::string za="?a?aba";
must be properly encoded as UTF-8, or you must convert the string in code from whatever encoding you have to UTF-8 before writing it into the file. If we are talking about MSVC, you will probably need to open the source file in a decent text editor (for example Notepad++; plain Windows Notepad will not do) and convert the source to UTF-8, probably including a leading BOM: all MSVC versions I have worked with so far do not recognize UTF-8 without the BOM, even though adding one is bad practice.
In my experience once a file has been converted to UTF-8 (including the BOM), MSVC is willing to work with it as expected, including converting pasted strings into proper UTF-8.

As said above, Windows Notepad is completely useless once encoding becomes important. Notepad++ (or any other decent text editor) will not only tell you which encoding it is interpreting your text file as, it will also let you convert between encodings.
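If the source file's encoding cannot be trusted, one option is to build the UTF-8 bytes in code so that no non-ASCII character ever appears in the source. The sketch below is only an illustration: `append_utf8` and `example_insert` are hypothetical names, and U+017C is an arbitrary non-ASCII character chosen for the example, since the characters in the original post did not survive.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
// Returns false and appends nothing for surrogates or values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogates are not characters
    if (cp > 0x10FFFF) return false;                 // outside the Unicode range
    if (cp < 0x80) {                                 // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                       // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                         // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Usage sketch mirroring the insert call from the question.
void example_insert() {
    std::string file = "<name></name>";  // stand-in for the XML text
    std::string za;
    append_utf8(za, 0x017C);             // U+017C becomes the bytes C5 BC
    za += "aba";
    file.insert(6, za, 0, za.size());    // insert right after "<name>"
    assert(file == "<name>\xC5\xBC" "aba</name>");
}
```

Because the literal bytes are spelled out, the result does not depend on how the editor or compiler interprets the source file.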

##### Share on other sites

You can start a text file with the byte order mark that matches its encoding.

For UTF-8, that is the byte sequence 0xEF 0xBB 0xBF.

Most editors and tools out there recognize the byte order mark, Notepad included.
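As a rough illustration of handling that three-byte sequence in code, the sketch below checks for, strips, and prepends the UTF-8 BOM on a buffer already held in a std::string. All function names are made up for this example; when the bytes go to disk on Windows, the stream would presumably need to be opened in binary mode so they are written unchanged.

```cpp
#include <cassert>
#include <string>

// The UTF-8 byte order mark: 0xEF 0xBB 0xBF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if `bytes` begins with the UTF-8 BOM.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Return `bytes` without a leading BOM (unchanged if there is none).
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(kUtf8Bom.size()) : bytes;
}

// Return `bytes` with exactly one leading BOM.
std::string with_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes : kUtf8Bom + bytes;
}

// Usage sketch: round-trip a small buffer.
void example_bom() {
    const std::string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    const std::string marked = with_utf8_bom(xml);
    assert(has_utf8_bom(marked));
    assert(strip_utf8_bom(marked) == xml);
}
```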

##### Share on other sites

I've always found that the best way to deal with UTF8 in MSVC is to be more explicit. Start with a std::wstring, then explicitly convert it to a UTF8 std::string using whichever conversion process you prefer. I've still had issues even when saving the file in Notepad++ with an explicit encoding, then loading it in Visual Studio and trying to work with it after the fact. In my code the same goes in reverse: I load UTF8 but then convert to std::wstring, since Windows has never really been a fan of UTF8.
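The post leaves the conversion process open. One possibility, sketched here with a hand-written loop and a hypothetical function name `wide_to_utf8`, is shown below; on Windows, the WideCharToMultiByte API with CP_UTF8 is a common alternative. The sketch assumes wchar_t holds UTF-16 code units when it is 16 bits wide (as on MSVC) and whole code points when it is 32 bits wide, and it replaces anything malformed with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: convert wide text to UTF-8.
// Surrogate pairs are combined; unpaired surrogates and out-of-range
// values are replaced with U+FFFD.
std::string wide_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = static_cast<char32_t>(in[i]);
        // High surrogate followed by low surrogate: one code point above U+FFFF.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t lo = static_cast<char32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;  // replacement character for malformed input
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage sketch: the \u escape keeps the source file itself pure ASCII.
void example_wide() {
    assert(wide_to_utf8(L"\u017C" L"aba") == "\xC5\xBC" "aba");
}
```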

##### Share on other sites

> I've always found that the best way to deal with UTF8 in MSVC is to be more explicit. Start with a std::wstring, then explicitly convert it to a UTF8 std::string using whichever conversion process you prefer.

Hell no, that way lies madness. The only situation in which you should even consider std::wstring is when you have to deal explicitly with the Windows API. Ideally all of that should be hidden behind at least a thin layer of paint which takes UTF-8 like everyone else these days. Especially on MSVC, where wchar_t is 16 bits and std::wstring therefore holds UTF-16, it is just the worst of both worlds: variable-length (like UTF-8), with all the disadvantages of non-byte code units.

UTF-8 in source files for string literals works fine. The only thing that needs coercing is the MSVC IDE, and having the UTF-8 BOM (as both I and frob already said) is enough from at least MSVC 2012 onwards.
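One way to check that a literal really ended up as UTF-8 in the compiled program is to validate its bytes before writing them to the XML file. The function below is a minimal sketch with a made-up name, `is_valid_utf8`; it rejects stray or missing continuation bytes, overlong forms, surrogates, and values above U+10FFFF. A string that was saved in a single-byte code page typically fails it, because a lone high byte such as 0xBF is not valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                 // stray continuation or invalid lead byte
        if (i + len > n) return false;     // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode
        i += len;
    }
    return true;
}

// Usage sketch: the same text as UTF-8 bytes and as a single-byte code page.
void example_validate() {
    assert(is_valid_utf8("\xC5\xBC" "aba"));   // two-byte sequence, then ASCII
    assert(!is_valid_utf8("\xBF" "aba"));      // lone high byte
}
```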