Simple code for reading and writing a file in UTF-16 mode

Started by
5 comments, last by Khatharr 8 years ago

I have an input file named "Input.txt". I prepared it in Notepad++ and converted it to UTF-16 via "Notepad++ Menu > Encoding > Convert to UCS-2 LE BOM". I want to read this file line by line and then write those lines to a file named "Output.txt".

Here is my code:


#include <vector>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

int wmain(int argc, wchar_t *argv[])
{
	std::vector<std::wstring> Lines;
	std::wstring Line;
	const std::wstring LINE_END = L"\n";

	std::wifstream InputFileStream(L"C:\\Users\\Administrator\\Desktop\\Test\\Input.txt", std::wifstream::in);
	InputFileStream.imbue(std::locale(InputFileStream.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
	while (std::getline(InputFileStream, Line))
	{
		if (Line.size())	// Delete the '\r' character from the line ending mark if it exists.
		{
			if (Line.back() == L'\r')
			{
				Line.pop_back();
			}
		}
		Lines.emplace_back(std::move(Line));
	}
	InputFileStream.close();
	
	std::wofstream OutputFileStream(L"C:\\Users\\Administrator\\Desktop\\Test\\Output.txt", std::ios_base::out | std::ios_base::trunc);
	OutputFileStream.imbue(std::locale(OutputFileStream.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
	
	for (uintmax_t i=0; i<Lines.size(); i++)
	{
		OutputFileStream << Lines[i] /*<< LINE_END*/ << std::endl;
	}
	OutputFileStream.close();
	
	return 0;
}

Input.txt

Hello World!

Variable1 = 12.345
Variable2 = "Test"

End

Output.txt

Hello World!??
Variable1 = 12.345????????????????????
?????

Input.txt [image: xIhxBIQ.png]

Output.txt [image: GYWuOcF.png]

Requirements

  • The input and output data must always be kept in UTF-16 encoding.
  • The same text must appear at the output.
  • The line ending format may change. \r\n line endings in the input file may change to \n in the output file, and vice versa.
  • There must be cyclic consistency between the input and output: if I feed the output of run n-1 to run n as input, run n must produce correct and exactly identical output. As stated in the previous item, the line endings may change after the first run.

How do I make this code run?

Are you sure you need to imbue the input/output streams? Since you do not want to modify the encoding that seems like one additional point of failure...

No, I'm not sure. I read the documentation of std::codecvt, but I didn't understand much. I can't find an online example matching my use case. I am totally stuck.

The problem seems to be std::endl outputting ASCII line endings instead of UTF-16 encoded line endings (notice the missing 00 between them, which is there in the input).

This shifts everything after it by one byte, so the code units are no longer aligned to 16-bit boundaries. The only reason the second line works at all is that there are two line endings after the first line, which brings the text back into alignment.
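To make the shift concrete, here is a small sketch. It assumes the extra byte comes from text-mode newline translation, where each 0x0A byte is expanded to 0x0D 0x0A after the text has already been encoded as UTF-16. The helper names to_utf16le and simulate_text_mode are made up for illustration; the second one only imitates that translation and is not a real library call.

```cpp
#include <string>
#include <vector>

// Encode UTF-16 code units as little-endian bytes (low byte first).
std::vector<unsigned char> to_utf16le(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    for (char16_t unit : text)
    {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF));
        bytes.push_back(static_cast<unsigned char>(unit >> 8));
    }
    return bytes;
}

// Hypothetical stand-in for a text-mode stream on Windows (an assumption):
// every 0x0A byte becomes 0x0D 0x0A, with no knowledge of UTF-16.
std::vector<unsigned char> simulate_text_mode(const std::vector<unsigned char>& bytes)
{
    std::vector<unsigned char> out;
    for (unsigned char b : bytes)
    {
        if (b == 0x0A)
        {
            out.push_back(0x0D);
        }
        out.push_back(b);
    }
    return out;
}
```

For u"A\nB" the encoded bytes are 41 00 0A 00 42 00. After the simulated translation they become 41 00 0D 0A 00 42 00, so pairing the bytes from the start yields 0x0041, 0x0A0D, 0x4200 and a dangling 00.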

Unfortunately, I have no idea why std::endl would behave in this way when you have set the locale on the stream. I would expect that to work, but I can't say for sure, since I haven't used UTF-16 strings much; I usually keep my strings in UTF-8.

I would just try to read/write the data using standard streams on char16_t instead of char while not trying to change the locale or anything else.
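One way to follow this advice without touching locales is to go a step further and use plain byte streams in binary mode, assembling the char16_t code units by hand. Note that this swaps in std::ifstream/std::ofstream rather than a char16_t stream. The sketch below assumes the file really is little-endian UTF-16 and does no surrogate validation; read_utf16le and write_utf16le are hypothetical names, and the narrow std::string path is an assumption (the original code uses a wide path).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole UTF-16LE file into a std::u16string, dropping a leading BOM.
std::u16string read_utf16le(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());

    std::u16string text;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        // Low byte first, then high byte.
        text.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    if (!text.empty() && text.front() == char16_t(0xFEFF))
    {
        text.erase(0, 1);   // strip the byte order mark
    }
    return text;
}

// Write a std::u16string as UTF-16LE with a BOM. Nothing is translated,
// so "\r\n" and "\n" in the text reach the file exactly as given.
void write_utf16le(const std::string& path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    const char bom[2] = { '\xFF', '\xFE' };
    out.write(bom, 2);
    for (char16_t unit : text)
    {
        const char pair[2] = { static_cast<char>(unit & 0xFF),
                               static_cast<char>(unit >> 8) };
        out.write(pair, 2);
    }
}
```

Because the bytes pass through untouched, feeding the output of one run into the next should reproduce the same file, which is the cyclic consistency the question asks for.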

Well, actually I would avoid UTF-16 like the plague.

On Windows (but preferably everywhere else too), always open file streams in binary mode, because text-mode file streams may replace end-of-line characters:


std::wofstream OutputFileStream(L"C:\\Users\\Administrator\\Desktop\\Test\\Output.txt", std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
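Applied to both streams of the original program, the binary flag might look like the sketch below. It keeps the std::codecvt_utf16 facet from the question (deprecated since C++17), handles the BOM by hand instead of relying on facet header flags, and writes L'\n' rather than std::endl. The function names read_lines and write_lines, the Utf16Le alias, and the narrow std::string paths are assumptions for illustration. Where wchar_t is 16 bits wide, this facet covers UCS-2 only.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

using Utf16Le = std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>;

// Read lines from a UTF-16LE file opened in binary mode.
std::vector<std::wstring> read_lines(const std::string& path)
{
    std::wifstream in;
    in.imbue(std::locale(in.getloc(), new Utf16Le));   // imbue before opening
    in.open(path, std::ios::in | std::ios::binary);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(in, line))
    {
        if (lines.empty() && !line.empty() && line.front() == wchar_t(0xFEFF))
        {
            line.erase(0, 1);   // drop the byte order mark on the first line
        }
        if (!line.empty() && line.back() == L'\r')
        {
            line.pop_back();    // binary mode leaves '\r' in place
        }
        lines.push_back(line);
    }
    return lines;
}

// Write lines to a UTF-16LE file opened in binary mode, using "\n" endings.
void write_lines(const std::string& path, const std::vector<std::wstring>& lines)
{
    std::wofstream out;
    out.imbue(std::locale(out.getloc(), new Utf16Le));
    out.open(path, std::ios::out | std::ios::trunc | std::ios::binary);

    out.put(wchar_t(0xFEFF));   // BOM, encoded as FF FE by the facet
    for (const std::wstring& line : lines)
    {
        out << line << L'\n';   // encoded as 0A 00, no flush per line
    }
}
```

A call such as write_lines("Output.txt", read_lines("Input.txt")) would then copy the text; running it again on its own output should give the same bytes, since the line endings are already "\n" after the first pass.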

I have no idea why std::endl would behave in this way


Because endl is a turd that only causes problems. You don't need to flush the stream at that point either (whoops! endl does that too) because all it will do is harm performance. Just use your LINE_END instead or just use L"\n", which is less typing.
