Reading/Writing UTF8 with filestreams

6 comments, last by SeanMiddleditch 9 years, 10 months ago

Hello,

I've just recently run into the problem that I cannot load certain characters such as ä, ü and ö from my XML files. I'm using a std::wfstream, so how do I configure it to load those characters? (When I load a file now, I always get two nonsense chars in place of the one expected wide char.)

I tried:


std::wfstream m_stream;

std::locale locale("");
m_stream.imbue(locale);

m_stream.open(stFilename.c_str(), std::ios::in);

but it doesn't work, regardless of whether I open the stream before or after imbuing the locale, even though my machine is set to German. Any ideas?


What exactly do you mean by "nonsense-chars"? It's expected for ä, ü or ö to be encoded as two bytes in UTF8.

If you know your file is encoded in UTF8, you should open it as a std::fstream, not a std::wfstream. Load your text as a std::string (that's the nice thing about UTF8: everything fits into normal chars). If you need it as something else (like UTF16 or UTF32), use a UTF conversion library of your choice.
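A minimal sketch of that suggestion, assuming the file really is UTF8: read the bytes into a std::string through a narrow stream and leave them untouched. The helper name ReadFileBytes and its "empty string on failure" policy are made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns the file's bytes unchanged. The bytes are
// assumed to be UTF-8, but nothing here validates or converts them.
std::string ReadFileBytes(const std::string& path)
{
    // Binary mode keeps the runtime from translating line endings.
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return std::string(); // assumed policy: empty string on failure

    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With this, an "ä" in the file simply shows up as the two chars 0xC3 0xA4 inside the std::string, which is what UTF8 prescribes.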


What exactly do you mean by "nonsense-chars"? It's expected for ä, ü or ö to be encoded as two bytes in UTF8.

Yeah, I meant just that, I got one char with value 195 (A with ~ on top) and another one with 132. I was not aware that this was supposed to be that way.

I am using std::wstring throughout my engine though, and do a lot of manual text parsing, where I require those chars to fit in one wchar_t.

So I guess I have to use a conversion library? Do you recall any off the top of your head whose source is available and whose licence allows me to compile the code within my engine? I don't really want to add another external dependency...

195 is 11000011 in binary, so its top three bits are 110. That marks the lead byte of a two-byte UTF8 sequence, which is how every code point from U+0080 to U+07FF is encoded (Wikipedia has an explanation of that).
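To make the bit arithmetic concrete, here is a small sketch that decodes exactly the two values from the post above (195 and 132). The function names are made up for illustration, and the sketch deliberately ignores details such as overlong encodings.

```cpp
#include <cstdint>

// True if the byte has the form 110xxxxx (lead byte of a two-byte sequence).
bool IsTwoByteLead(unsigned char b) { return (b & 0xE0) == 0xC0; }

// True if the byte has the form 10xxxxxx (continuation byte).
bool IsContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Combines the 5 payload bits of the lead byte with the 6 payload bits of
// the continuation byte. The caller is expected to have checked both bytes
// with the predicates above.
std::uint32_t DecodeTwoByte(unsigned char lead, unsigned char cont)
{
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

For 195 and 132 the payload bits are 00011 and 000100, so the result is (3 << 6) | 4 = 196, i.e. U+00C4, which is 'Ä'.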

I'm usually not converting away from UTF8 but shouldn't codecvt in C++11 do the job?


I'm usually not converting away from UTF8 but shouldn't codecvt in C++11 do the job?

I tried it, but without success:

See anything wrong with my usage? The std::u16string still contains 195, 132 in the first two entries, and for loading I'm using a std::fstream now. From what I've read, though, std::wfstream should already use the right codecvt overload anyway... any ideas?

EDIT: Got it, this does the trick:


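		void Parser::ParseValue(Node& node)
		{
			std::string stValue;

			// read raw UTF-8 bytes until the next tag opens or the stream ends
			while(m_stream)
			{
				// peek() returns an int so that EOF can be told apart from a real char
				const int next = m_stream.peek();
				if(next == std::char_traits<char>::eof() || next == '<')
					break;

				const char c = static_cast<char>(m_stream.get());

				if(c == '\n')
					m_line++;
				else
					stValue += c;
			}

			// convert the UTF-8 bytes to UTF-16 code units; needs <codecvt> and <locale>
			// and assumes a 16-bit wchar_t (as on Windows)
			std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
			const auto stValueW = conversion.from_bytes(stValue);

			node.SetValue(stValueW);
		}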

I was supposed to use wstring_convert. This still mystifies me, as the translated char of 'Ä', for example, comes out as 196, which would fit nicely into the UTF8-space; there's also the thing about std::wfstream being supposed to do the conversion itself... Now I have some more work when loading these files, plus I've got to apply this everywhere. OK, I don't have that many file accesses apart from that XML parser of mine, so it shouldn't be much trouble...

I am using std::wstring throughout my engine though, and do a lot of manual text parsing, where I require those chars to fit in one wchar_t.


Down this path lies madness. wstring is in a compiler/vendor-specific format with no standards. The vendor formats are often either broken (UCS-2) or wasteful (UCS-4) or just useless (ASCII/EBCDIC). Just use a regular 8-bit std::string and interpret it as UTF-8 anywhere it matters (which is surprisingly few places outside of display). Convert to wide strings only in places you have to, like when talking to native OS path routines on certain platforms (or better, use a library that abstracts those details and just works in UTF-8 itself). The price of converting a path here and there is completely swamped by the price of the actual I/O itself, so don't worry about perf.
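As a rough illustration of "convert only at the boundary", the following sketch turns a UTF-8 std::string into UTF-16 code units without any library support. The name Utf8ToUtf16 is made up, malformed input is replaced with U+FFFD as an assumed policy, and overlong encodings are not rejected, so treat it as a starting point rather than a validated converter.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to UTF-16 code units.
// Malformed input becomes U+FFFD; overlong forms are not rejected.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += kReplacement; ++i; continue; } // stray or invalid byte

        // Collect 6 payload bits from each continuation byte.
        bool ok = i + extra < utf8.size();
        for (std::size_t k = 1; ok && k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }

        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            out += kReplacement;
            ++i; // resynchronise on the next byte
            continue;
        }

        if (cp < 0x10000)
        {
            out += static_cast<char16_t>(cp);
        }
        else
        {
            cp -= 0x10000; // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += extra + 1;
    }
    return out;
}
```

The idea is that the rest of the program keeps plain std::string, and a call like this appears only right before a wide-character OS API is invoked.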

http://www.utf8everywhere.org/

Sean Middleditch – Game Systems Engineer – Join my team!


Down this path lies madness. wstring is in a compiler/vendor-specific format with no standards. The vendor formats are often either broken (UCS-2) or wasteful (UCS-4) or just useless (ASCII/EBCDIC). Just use a regular 8-bit std::string and interpret it as UTF-8 anywhere it matters (which is surprisingly few places outside of display). Convert to wide strings only in places you have to, like when talking to native OS path routines on certain platforms (or better, use a library that abstracts those details and just works in UTF-8 itself). The price of converting a path here and there is completely swamped by the price of the actual I/O itself, so don't worry about perf.

Sounds plausible, I guess I'll switch to std::string once I've finished the few things I currently want to get done. I really hope global search and replace doesn't let me down, otherwise it's going to be another fun afternoon doing nothing but repeatedly compiling, correcting a few compile errors, and so on... (aren't those great?) Actually, I've already got wrappers for most file/path accesses, so it's not a problem to change the underlying routines globally. I guess I'll just add a wrapper for file loading instead of using plain std::fstream for that matter. I'm also thinking about making a custom string wrapper. I can implement the underlying storage using std::string anyway, but then at least I can switch the implementation without changing everything, plus I could add custom stuff like conversion routines, etc...

I can implement the underlying storage using std::string anyway, but then at least I can switch the implementation without changing everything, plus I could add custom stuff like conversion routines, etc...


This also leads to madness. Remember separation of concerns. Your string needs to store a sequence of characters and that's it. Everything else should be in a different class/module. Even std::string is considered to be a train-wreck by many people in the same committee that created it, due to all the searching and modification routines built into it.

If you write your own string, I'd recommend the thinnest, simplest possible wrapper around a self-managed char*. Something like:

#include <cstddef> // std::size_t, std::nullptr_t

class string final {
public:
  string();
  string(std::nullptr_t);
  string(string const& src);
  string(string && src);
  string(char const* begin, char const* end);
  string(char const* begin, std::size_t length);
  explicit string(const char* zstr); // NUL-terminated input

  ~string();

  string& operator=(string const& src);
  string& operator=(string && src);
  string& operator=(std::nullptr_t);

  bool empty() const;
  explicit operator bool() const;
  std::size_t size() const;

  char operator[](std::size_t index) const; // read-only access

  const char* begin() const;
  const char* c_str() const; // always NUL-terminated
  const char* end() const;
};
And that's it. I'd highly recommend you use 8-bit characters, always. Never treat the NUL byte as special (except in the zstr constructor, which calls std::strlen to compute the length), but always keep your string NUL-terminated for the cases where you have to interact with old C-style APIs via c_str. Don't make your strings mutable; rather, use a separate StringBuilder or the like for cases where you need to programmatically generate a string. Any other manipulation function (make_lower, find_substr, etc.) can and should be in a separate utility class or be free functions; this keeps your string class simpler and less buggy, and it makes it easier to offer a single interface for all types of string via overloads and ADL (your string, std::string, std::wstring, char*, vector<char>, etc.).
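A sketch of what the free-function approach could look like for make_lower. Everything here is an assumption for illustration: it lowers ASCII letters only, returns a std::string, and has one core routine over a [begin, end) byte range with thin overloads forwarding to it.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Core routine: lowers ASCII letters in [begin, end). Bytes of multi-byte
// UTF-8 sequences are never in 'A'..'Z', so they pass through unchanged.
std::string make_lower(const char* begin, const char* end)
{
    std::string out(begin, end);
    for (char& c : out)
        if (c >= 'A' && c <= 'Z')
            c = static_cast<char>(c - 'A' + 'a');
    return out;
}

// Thin overloads: one interface for several string-like types.
inline std::string make_lower(const std::string& s)
{
    return make_lower(s.data(), s.data() + s.size());
}

inline std::string make_lower(const char* zstr)
{
    return make_lower(zstr, zstr + std::strlen(zstr));
}

inline std::string make_lower(const std::vector<char>& v)
{
    return make_lower(v.data(), v.data() + v.size());
}
```

A custom string class would get the same treatment: one more overload next to the others, with no change to the class itself.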

Note that comparison operators are not defined. Make those separate (free) operators. This is necessary anyway for layering with comparing string with char* and other string types. Comparison with strings also benefits a lot from offering a number of functor comparators since strings can be compared so many different ways (character case, sorting of numeric ranges, etc.).
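For instance, a comparator functor for case-insensitive ordering might look like the sketch below. The names ascii_lower and less_ascii_nocase are made up, it works on std::string only for brevity, and it folds ASCII letters only (proper Unicode case folding needs a real library).

```cpp
#include <algorithm>
#include <string>

// ASCII-only lowercase; every other byte passes through unchanged.
inline char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Strict weak ordering that ignores ASCII case. Usable with std::sort,
// std::set, std::map and similar.
struct less_ascii_nocase
{
    bool operator()(const std::string& a, const std::string& b) const
    {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](char x, char y) {
                return static_cast<unsigned char>(ascii_lower(x)) <
                       static_cast<unsigned char>(ascii_lower(y));
            });
    }
};
```

Other orderings (natural sorting of numbers, locale-aware collation) would be further functors alongside this one, which is the point of keeping comparison out of the string class.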

It's also handy to make a variant of the above called something like 'string_ref' or 'string_range' or 'string_view' that does not own its contents. This makes it easy to pass subsets of strings around to algorithms without making copies. It's basically the same as the above, only you don't need move operators (because copying is trivial), you don't need a destructor (because it too is trivial), and you can't include c_str (because arbitrary subsets of a string will not be NUL-terminated). string_view is so handy it's part of a TS for a future version of C++.
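A minimal sketch of such a non-owning view, following the description above: no destructor, no move operations, no c_str. The class name string_view_sketch and the substr member are illustrative additions, not part of the interface listed earlier.

```cpp
#include <cstddef>
#include <cstring>

// Non-owning view over a [begin, end) range of chars. The viewed memory
// must outlive the view; copying the view copies two pointers only.
class string_view_sketch final {
public:
    string_view_sketch() : m_begin(nullptr), m_end(nullptr) {}
    string_view_sketch(const char* begin, const char* end)
        : m_begin(begin), m_end(end) {}
    explicit string_view_sketch(const char* zstr)
        : m_begin(zstr), m_end(zstr + std::strlen(zstr)) {}

    bool empty() const { return m_begin == m_end; }
    std::size_t size() const { return static_cast<std::size_t>(m_end - m_begin); }
    char operator[](std::size_t index) const { return m_begin[index]; }

    const char* begin() const { return m_begin; }
    const char* end() const { return m_end; }

    // Sub-range without copying; pos and count are clamped to the view.
    string_view_sketch substr(std::size_t pos, std::size_t count) const
    {
        if (pos > size()) pos = size();
        if (count > size() - pos) count = size() - pos;
        return string_view_sketch(m_begin + pos, m_begin + pos + count);
    }

private:
    const char* m_begin;
    const char* m_end;
};
```

A parser could hand out views like this for node values instead of building a new string per value, as long as the underlying buffer stays alive.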

update: forgot operator[] and notes on comparison operators.

Sean Middleditch – Game Systems Engineer – Join my team!

