vcGamer

Reading Unicode text files in C++


Hi,

I use Visual C++ 6. How can I read Unicode text files? (My file contains both English and Persian characters; Persian is written like Arabic.) How can I read and parse such a file? I want to read it and save it as binary. What is the solution?

You read it like any other file. Do you want to load the text into Unicode strings, or what else do you want to do with it?
You can read more about Unicode here: http://en.wikipedia.org/wiki/Unicode.
I would guess your file is in http://en.wikipedia.org/wiki/UTF-16 format, though there is also the http://en.wikipedia.org/wiki/UTF-8 format.
Usually you can just read the raw file data: the first two bytes will be a byte order marker, and after that each pair of bytes is a single 16-bit wchar_t representing a character, but it can depend on the exact format of the file.
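For example, assuming the file is UTF-16 little-endian with a 0xFF 0xFE marker at the start (which is what Notepad on Windows typically writes), reading it could look something like this. It's a sketch using modern C++ headers rather than anything VC6-specific, and it decodes into raw 16-bit code units rather than wchar_t so it behaves the same on any platform:

```cpp
#include <cstdint>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Read a file's raw bytes, skip a UTF-16LE BOM (0xFF 0xFE) if present,
// and combine each byte pair into a 16-bit code unit (little-endian).
std::vector<uint16_t> readUtf16Le(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(in)),
                                     std::istreambuf_iterator<char>());

    std::size_t start = 0;
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        start = 2; // skip the BOM

    std::vector<uint16_t> units;
    for (std::size_t i = start; i + 1 < bytes.size(); i += 2)
        units.push_back(static_cast<uint16_t>(bytes[i] | (bytes[i + 1] << 8)));
    return units;
}
```

On Windows (where wchar_t is 16 bits) you could copy those units straight into a wstring; the function name and error handling here are just placeholders.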

To expand a little on what Erik said: Unicode specifies a mapping between characters/combining marks and numbers, so a chunk of Unicode text is an array of such numbers.

When it comes time to write those numbers to a file, we have to decide how to encode them into a sequence of bytes. A number of such encodings are defined as part of the Unicode specification(s); the most common are probably UTF-8 and UTF-16, though there are others too.

UTF-8 takes each Unicode number (called a code point) and expresses it as one to four 8-bit values. The advantages of UTF-8 are that you can write UTF-8-encoded bytes to a file without having to worry about endianness, and UTF-8-encoded strings can be passed to many of the C and C++ standard library string manipulation functions.
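The one-to-four-byte expansion can be sketched as follows. This is a minimal encoder for a single code point; it doesn't validate its input or reject surrogate values, and the function name is my own:

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
// ASCII stays one byte; higher code points get a multi-byte
// sequence of a lead byte plus 0x80-prefixed continuation bytes.
std::string encodeUtf8(uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                                        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, the Persian/Arabic letter beh (U+0628) comes out as the two bytes 0xD8 0xA8.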

UTF-16 takes each code point and expresses it as either one or two 16-bit values. When it comes time to write a UTF-16-encoded string to a file, one has to decide which endianness to use. At the other end, you also have to know which endianness was used in order to decode the bytes back into UTF-16 code units. The advantage of UTF-16 is that you're more likely to end up with a tighter data representation when your text includes a lot of characters from languages that didn't originate in western Europe.
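The one-or-two-unit rule works by splitting code points above U+FFFF into a surrogate pair. A minimal sketch (no input validation, hypothetical function name):

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as one or two 16-bit UTF-16 code units.
// Code points up to U+FFFF fit in a single unit; anything above
// is offset by 0x10000 and split into a high/low surrogate pair.
std::vector<uint16_t> encodeUtf16(uint32_t cp)
{
    std::vector<uint16_t> units;
    if (cp < 0x10000) {
        units.push_back(static_cast<uint16_t>(cp));
    } else {
        uint32_t v = cp - 0x10000;
        units.push_back(static_cast<uint16_t>(0xD800 | (v >> 10)));   // high surrogate
        units.push_back(static_cast<uint16_t>(0xDC00 | (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

Persian text sits entirely in the single-unit range, which is why it tends to be more compact in UTF-16 (2 bytes per character) than in UTF-8 (2 bytes for the letters, but the structure of mixed text can grow).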

OK, so you have a file. Chances are it's encoded as UTF-8, UTF-16 little-endian, or UTF-16 big-endian. How do we tell which?

Often, a file will start with what's known as a Byte Order Mark, or BOM. The BOM indicates which byte-level encoding of the code points was chosen when the file was saved. If the file doesn't have a BOM, you either have to know/assume which encoding was used, or you can employ some heuristics.
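BOM detection itself is just a comparison against a few well-known byte patterns. A sketch (the function name is my own, and remember a missing BOM doesn't mean the file isn't Unicode):

```cpp
#include <cstddef>
#include <string>

// Guess a file's encoding from its first few bytes by looking for a BOM.
// Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "unknown".
std::string sniffBom(const unsigned char* b, std::size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)                 return "UTF-16LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)                 return "UTF-16BE";
    return "unknown";
}
```

Note the check order matters in general BOM sniffing (e.g. a UTF-32LE BOM, not handled here, begins with the same two bytes as UTF-16LE).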

Note that if somebody tells you a file is UTF-16 encoded, that's still not enough: you also need to ask them which endianness was used.
