unicode consideration

Started by
8 comments, last by smart_idiot 18 years, 11 months ago
Hi game developers. I want to convert the C++ project I'm currently developing from ANSI to Unicode. What is the best way to handle this process? One step at a time:
1) Which macro should I use to switch between "bla" and L"bla"? I've seen Microsoft's _T(""), but I don't know if it is the most elegant way.
2) Do you use typedefs for the standard containers? Such as:

#include <sstream>

#ifdef UNICODE
typedef std::wstringstream tstringstream; // wide stream for Unicode builds
#else
typedef std::stringstream tstringstream;  // narrow stream otherwise
#endif

3) Which macro for char and character strings? Is a typedef like "tchar" better?
4) Which encoding? I need my *.ini files (on Windows systems) to handle Unicode. I've seen that Notepad can open the "Unicode", "Unicode big endian" and "UTF-8" encodings, so I think it would be a good idea to choose from these three. Which is the most widely used? Can you give me a link to a list of signatures for all encodings? I'm talking about the first 4 bytes that in theory identify the encoding used.
P.S. Thanks to whoever will lose time answering my stupid question.
[ILTUOMONDOFUTURO]
What is your target platform?
Under Windows just using _T() around your string literals, and TCHAR in your declarations can be enough to get it to compile for either.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
My program ships on Windows platforms. What about the encoding to use?
[ILTUOMONDOFUTURO]
Windows uses little-endian UTF-16 (Notepad just calls it "Unicode"). Files saved as big-endian you will have to convert using ((WideChar & 0xFF) << 8) | ((WideChar >> 8) & 0xFF) in your code (where WideChar is the wide character you read; the parentheses matter because + and << both bind tighter than &), and UTF-8 you will have to convert with MultiByteToWideChar(CP_UTF8, ...).
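For reference, the big-endian to little-endian swap described above can be written as a small function. This is only a sketch: the name swapBytes16 is made up for illustration, and it works on one 16-bit code unit at a time, so a whole file would need it applied to every unit after the byte order mark.

```cpp
#include <cstdint>

// Swap the two bytes of one UTF-16 code unit (big-endian <-> little-endian).
// swapBytes16 is a hypothetical helper name, not a Windows API.
// Every mask and shift is parenthesised: + and << bind tighter than &.
inline std::uint16_t swapBytes16(std::uint16_t unit)
{
    return static_cast<std::uint16_t>(((unit & 0xFFu) << 8) | ((unit >> 8) & 0xFFu));
}
```

Applying it twice gives back the original value, which is a handy sanity check when testing a converter.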
----Erzengel des Lichtes光の大天使Archangel of LightEverything has a use. You must know that use, and when to properly use the effects.♀≈♂?
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
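To make the length problem concrete, here is a minimal sketch of counting code points instead of bytes. The function name utf8CodePointCount is an assumption for illustration, and it assumes the string is already well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Hypothetical helper; assumes well-formed input and does not validate it.
inline std::size_t utf8CodePointCount(const std::string &s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) // lead byte or plain ASCII
            ++count;
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (an e with acute accent stored as two bytes), size() reports 6 while this function reports 5. Erasing a single byte from the middle of that two-byte sequence is exactly the kind of edit that breaks the string.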
Chess is played by three people. Two people play the game; the third provides moral support for the pawns. The object of the game is to kill your opponent by flinging captured pieces at his head. Since the only piece that can be killed is a pawn, the two armies agree to meet in a pawn-infested area (or even a pawn shop) and kill as many pawns as possible in the crossfire. If the game goes on for an hour, one player may legally attempt to gouge out the other player's eyes with his King.
Hmm, I think Microsoft's use of little-endian is due to the fact that (from what I've seen) the recent NT series uses it internally. I've decided to go with:
typedef tchar (char,wchar_t)
typedef tstring (string, wstring)
#define T() ((),L())
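Written out in full, the shorthand above might look like the sketch below. It is one possible reading, not a definitive version: the macro is called TXT here rather than T (a one-letter function-like macro would also rewrite T() inside template code), and TXT_IMPL and makeGreeting are hypothetical names added for illustration.

```cpp
#include <sstream>
#include <string>

#ifdef UNICODE
typedef wchar_t tchar;
#define TXT_IMPL(x) L##x   // paste the L prefix onto the literal
#else
typedef char tchar;
#define TXT_IMPL(x) x      // leave the literal untouched
#endif

// Extra level of indirection so macro arguments are expanded before pasting.
#define TXT(x) TXT_IMPL(x)

// Build the string types on tchar instead of switching each one separately.
typedef std::basic_string<tchar> tstring;
typedef std::basic_stringstream<tchar> tstringstream;

// Hypothetical example function: compiles unchanged in both builds.
inline tstring makeGreeting(const tstring &name)
{
    tstringstream out;
    out << TXT("Hello, ") << name << TXT("!");
    return out.str();
}
```

Basing tstring and tstringstream on std::basic_string<tchar> and std::basic_stringstream<tchar> keeps the #ifdef in one place, so only tchar and the literal macro depend on UNICODE.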
Why does the standard use L" " before a Unicode literal? Does it mean Long? Wouldn't S" " (Short) have been better?
p.s. thanks
[ILTUOMONDOFUTURO]
Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)


And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)

But sure, if you can guarantee that your program will never be used by anyone outside the US/UK, UTF8 works fine. ;)
Quote:Original post by Spoonbender
Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)


And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)


What are you on about? UTF-8 is a Unicode encoding, which allows for representing any Unicode code point (including those outside the BMP).
UTF-8 is a Unicode encoding using 8-bit code units. It maps directly to ASCII for the lower values (0 to 127), but it does some tricks when representing values beyond that range. A Unicode code point needs up to 21 bits (the maximum is U+10FFFF), so a UTF-8 character can have up to 3 extra bytes (for a total of 4) to represent other characters. It does have the advantage of not being affected by endian issues.
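The "tricks" are easier to see in code. The sketch below encodes a single code point into 1 to 4 bytes; encodeUtf8 is a hypothetical name, and returning an empty string for invalid input is just a choice made for this example.

```cpp
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8.
// Hypothetical helper: returns an empty string for out-of-range values
// and for UTF-16 surrogates (U+D800..U+DFFF), which are not valid here.
inline std::string encodeUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80)                       // 0xxxxxxx: plain ASCII
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)                 // 110xxxxx 10xxxxxx
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)               // 1110xxxx 10xxxxxx 10xxxxxx
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                  // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because the output is a sequence of single bytes in a fixed order, there is nothing to swap between machines, which is why endianness does not come into it.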

There is also UTF-16 (both big and little endian) and UTF-32 (also both big and little endian) that fall under the standard.
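These encodings are usually told apart by the byte order mark (BOM) at the start of the file, which is the signature asked about at the top of the thread. The following is a rough sketch only: the Encoding enum and detectBom are made-up names, and a file without a BOM simply comes back as unknown.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum Encoding
{
    ENC_UNKNOWN,
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
};

// Inspect up to the first four bytes of a buffer for a byte order mark.
inline Encoding detectBom(const unsigned char *data, std::size_t size)
{
    // UTF-32 LE is tested before UTF-16 LE because both start with FF FE.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return ENC_UTF32_LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return ENC_UTF32_BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UTF16_LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN; // no BOM: the caller has to guess or assume a default
}
```

One caveat: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be misread as UTF-32 here, so treat the result as a strong hint rather than proof.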

Unicode Website

You can even download the official unicode standard book there.

"I can't believe I'm defending logic to a turing machine." - Kent Woolworth [Other Space]

UTF-8 isn't a codepage, if that is what you were thinking. Characters have variable length, ranging from 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits it to 4). It can hold every possible Unicode character.

I wrote some iterator adaptors for dealing with UTF-8 strings, to make life easier. Here is some code from my drawText function as an example:

UTF8::InputAdaptor<std::string::const_iterator> pos(string.begin(), string.end());
const UTF8::InputAdaptor<std::string::const_iterator> end(string.end());

for(; pos != end; ++pos)
{
  const Font::Glyph &glyph(font.getGlyph(*pos, font_height));

  if(x+glyph.width >= 0)
    drawGlyph<bpp>(glyph, x, y, p, c);

  if((x += glyph.width) >= width)
    break;
}


Note that my adaptor can optionally take two extra iterators; one for one past the end, and one for the beginning. This is protection from malformed strings so it doesn't end up trying to read a multi-byte character and skipping over the beginning or end, which of course would be bad.
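The adaptor itself isn't shown in the thread, so the following is not its implementation, just a rough sketch of the same kind of protection: a forward decoder that is handed the end iterator and never reads at or past it. The name decodeUtf8 and the choice to return U+FFFD for bad input are assumptions made for this example, and it does not reject overlong encodings or surrogates.

```cpp
#include <string>

// Decode the code point starting at `it` without reading at or past `end`.
// Precondition: it != end. Advances `it` past the bytes consumed.
// Malformed or truncated sequences yield U+FFFD and consume a single byte,
// so the caller always makes progress. Hypothetical helper for illustration.
inline unsigned long decodeUtf8(std::string::const_iterator &it,
                                std::string::const_iterator end)
{
    const unsigned long replacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(*it++);

    unsigned long cp = 0;
    int extra = 0;
    if (lead < 0x80)
        return lead;                                          // ASCII
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else
        return replacement;             // stray continuation or invalid lead

    std::string::const_iterator probe = it;
    for (int i = 0; i < extra; ++i)
    {
        if (probe == end)
            return replacement;         // sequence runs off the end
        const unsigned char c = static_cast<unsigned char>(*probe);
        if ((c & 0xC0) != 0x80)
            return replacement;         // expected a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++probe;
    }
    it = probe;                         // commit only after a full sequence
    return cp;
}
```

A caller would loop with while (it != end) and pass each result to something like getGlyph. Stepping backwards safely would need the begin iterator as well, which is presumably what the second extra iterator mentioned above is for.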

