
unicode consideration

General and Gameplay Programming

Started by bjogio May 22, 2005 12:42 PM

8 comments, last by smart_idiot 18 years, 11 months ago

bjogio

242

Author

May 22, 2005 12:42 PM

Hi gamedevelopers, I want to modify the C++ project I'm currently developing in order to switch from ANSI to Unicode. What is the best way to handle this process? One step at a time. 1) What macro should I use to handle "bla" versus L"bla"? I've seen Microsoft's _T(" "), but I don't know if it is the most elegant way. 2) Do you use typedefs for the standard containers? Such as:


#ifdef UNICODE
typedef std::wstringstream tstringstream;  // wide stream when building for Unicode
#else
typedef std::stringstream tstringstream;   // narrow stream otherwise
#endif

3) What macro for char and char strings? Is a typedef like "tchar" better? 4) What encoding? I need my *.ini files (on Windows systems) to handle Unicode. I've seen that Notepad can open the "Unicode", "Unicode big endian" and "UTF-8" encodings, so I think it would be a good idea to choose from these three. Which is the most widely used? Can you provide a link to a list of signatures for all the encodings? I'm talking about the first 4 bytes that in theory identify the encoding used. P.S. Thanks to whoever takes the time to answer my stupid question.

[ILTUOMONDOFUTURO]

iMalc

2,466

May 22, 2005 02:11 PM

What is your target platform?
Under Windows just using _T() around your string literals, and TCHAR in your declarations can be enough to get it to compile for either.

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

bjogio

242

Author

May 22, 2005 02:34 PM

My program ships on Win* platforms. What about the encoding to use?

[ILTUOMONDOFUTURO]

Erzengeldeslichtes

336

May 22, 2005 05:07 PM

Windows uses little-endian Unicode (Notepad just calls it "Unicode"). Files saved as big-endian you will have to convert using ((WideChar & 0xFF) << 8) | ((WideChar >> 8) & 0xFF) in your code (where WideChar is the wide character you read), and UTF-8 you will have to convert with MultiByteToWideChar(CP_UTF8, ...).
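For illustration, the big-endian to little-endian conversion can be written as a small portable function. This is only a sketch: swapBytes is a made-up name, and it works on one 16-bit code unit at a time. The parentheses matter, because in C++ + binds tighter than <<, which in turn binds tighter than &.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of one UTF-16 code unit,
// turning a big-endian value into a little-endian one (or back).
std::uint16_t swapBytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>(((unit & 0xFF) << 8) | ((unit >> 8) & 0xFF));
}
```

Applying it to every 16-bit unit read from a "Unicode big endian" file gives the values Windows expects. The UTF-8 case is different and still needs a real decoder such as MultiByteToWideChar.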

----Erzengel des Lichtes光の大天使Archangel of LightEverything has a use. You must know that use, and when to properly use the effects.♀≈♂?

smart_idiot

1,298

May 22, 2005 11:25 PM

UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
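To make the length problem concrete, here is a minimal sketch that counts code points instead of bytes. countCodePoints is a hypothetical name, and the function assumes the string is well-formed UTF-8; it simply skips continuation bytes, which always look like 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new
// code point. Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" ("héllo"), size() reports 6 bytes while countCodePoints reports 5 code points. Erasing a single byte out of the middle of the two-byte "é" is exactly the kind of breakage described above.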

Chess is played by three people. Two people play the game; the third provides moral support for the pawns. The object of the game is to kill your opponent by flinging captured pieces at his head. Since the only piece that can be killed is a pawn, the two armies agree to meet in a pawn-infested area (or even a pawn shop) and kill as many pawns as possible in the crossfire. If the game goes on for an hour, one player may legally attempt to gouge out the other player's eyes with his King.

bjogio

242

Author

May 23, 2005 01:38 AM

Hmm, I think Microsoft's use of little-endian is due to the fact that (as I've seen) the latest NT series uses it internally. I've decided to go with:
typedef tchar (char,wchar_t)
typedef tstring (string, wstring)
#define T() ((),L())
Why does the standard use L" " before Unicode literals? Does it mean Long? Wouldn't S" " (Short) have been better?
P.S. Thanks
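Spelled out as compilable C++, the shorthand above could look roughly like the sketch below. It assumes the build defines UNICODE for wide builds, as the Windows headers expect. TSTR is a made-up macro name used here instead of a bare T, because a function-like macro called T would also rewrite any T(x) expression inside template code.

```cpp
#include <sstream>
#include <string>

#ifdef UNICODE
typedef wchar_t tchar;
#define TSTR(x) L##x   // paste the L prefix onto the literal
#else
typedef char tchar;
#define TSTR(x) x      // leave the literal narrow
#endif

// The standard string and stream classes are templates on the character
// type, so one typedef each covers both builds.
typedef std::basic_string<tchar> tstring;
typedef std::basic_stringstream<tchar> tstringstream;
```

With these in place, code such as tstring name = TSTR("bla"); compiles unchanged in both configurations.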

[ILTUOMONDOFUTURO]

Spoonbender

1,258

May 23, 2005 07:57 AM

Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)

And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)

But sure, if you can guarantee that your program will never be used by anyone outside the US/UK, UTF8 works fine. ;)

Zahlman

1,682

May 23, 2005 12:11 PM

Quote:Original post by Spoonbender
Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)

And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)

What are you on about? UTF-8 is a Unicode encoding, which allows for representing any Unicode code point (including those outside the BMP).

Rattrap

3,386

May 23, 2005 04:09 PM

UTF-8 is a Unicode encoding built from 8-bit units. It maps directly to ASCII for the values 0 to 127, but it does some tricks when representing anything above that, using sequences of several bytes. Unicode code points run up to U+10FFFF (so they fit comfortably in a 32-bit number), and a UTF-8 character can have up to 3 extra sets of 8 bits (for a total of 4 bytes) to represent them. It does have the advantage of not being affected by endian issues.
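As a rough sketch of those "tricks", the function below encodes a single code point into 1 to 4 bytes. encodeUtf8 is a hypothetical name, and returning an empty string for values that cannot be encoded is just one possible convention.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for
// values above U+10FFFF, which are not valid code points to encode.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx (plain ASCII)
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 stays the single byte 41, U+00E9 becomes C3 A9, U+20AC becomes E2 82 AC, and U+1F600 becomes F0 9F 98 80. Since the output is a byte sequence with a fixed order, there is nothing for endianness to affect.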

There is also UTF-16 (both big and little endian) and UTF-32 (also both big and little endian) that fall under the standard.
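Files in these encodings often start with a byte-order mark (BOM) that tells them apart: EF BB BF for UTF-8, FF FE for UTF-16 little-endian, FE FF for UTF-16 big-endian, FF FE 00 00 for UTF-32 little-endian, and 00 00 FE FF for UTF-32 big-endian. A minimal detection sketch follows. The Encoding enum and detectBom are made-up names, and a file with no BOM simply comes back as Unknown.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch below.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Hypothetical helper: guess the encoding from a buffer's leading bytes.
Encoding detectBom(const unsigned char* data, std::size_t size) {
    // Check the 4-byte UTF-32 marks first, because the UTF-32 LE mark
    // (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE &&
        data[2] == 0x00 && data[3] == 0x00)
        return Encoding::Utf32LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 &&
        data[2] == 0xFE && data[3] == 0xFF)
        return Encoding::Utf32BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;  // no BOM: the caller has to pick a default
}
```

One known ambiguity: a UTF-16 little-endian file whose first character after the BOM is U+0000 has the same leading bytes as UTF-32 little-endian, so this check reports it as UTF-32. That is rare in text files, but it is the reason the BOM only identifies the encoding "in theory".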

Unicode Website

You can even download the official unicode standard book there.

"I can't believe I'm defending logic to a turing machine." - Kent Woolworth [Other Space]

smart_idiot

1,298

May 23, 2005 04:47 PM

UTF-8 isn't a codepage, if that is what you were thinking. Characters have variable length, ranging from 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but RFC 3629 limits it to 4). It can hold every possible Unicode character.

I wrote some iterator adaptors for dealing with UTF-8 strings, to make life easier. Here is some code from my drawText function as an example:

UTF8::InputAdaptor<std::string::const_iterator> pos(string.begin(), string.end());
const UTF8::InputAdaptor<std::string::const_iterator> end(string.end());
for(; pos != end; ++pos)
{
  const Font::Glyph &glyph(font.getGlyph(*pos, font_height));

  if(x+glyph.width >= 0)
    drawGlyph<bpp>(glyph, x, y, p, c);

  if((x += glyph.width) >= width)
    break;
}

Note that my adaptor can optionally take two extra iterators; one for one past the end, and one for the beginning. This is protection from malformed strings so it doesn't end up trying to read a multi-byte character and skipping over the beginning or end, which of course would be bad.
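The UTF8::InputAdaptor code itself is not shown in this thread, so the following is not that adaptor, only a sketch of the same kind of protection. decodeNext is a hypothetical function that decodes one code point and never reads at or past the end iterator. Malformed or truncated input yields U+FFFD and consumes a single byte. It does not reject overlong encodings or surrogates, which a stricter decoder would.

```cpp
#include <string>

// Hypothetical helper: decode the code point starting at `it` and advance
// `it` past it. Never dereferences `end`. On malformed input it returns
// U+FFFD (the replacement character) and consumes only one byte.
char32_t decodeNext(std::string::const_iterator& it,
                    std::string::const_iterator end) {
    const char32_t replacement = 0xFFFD;
    if (it == end) return replacement;

    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;                       // single-byte ASCII

    int extra;                                          // continuation bytes expected
    char32_t cp;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return replacement;                            // stray continuation or invalid lead

    std::string::const_iterator probe = it;
    for (int i = 0; i < extra; ++i) {
        if (probe == end) return replacement;           // sequence cut off by end of string
        const unsigned char byte = static_cast<unsigned char>(*probe);
        if ((byte & 0xC0) != 0x80) return replacement;  // not a continuation byte
        cp = (cp << 6) | (byte & 0x3F);
        ++probe;
    }
    it = probe;                                         // commit only after a full sequence
    return cp;
}
```

A loop such as for (auto it = s.cbegin(); it != s.cend(); ) { char32_t cp = decodeNext(it, s.cend()); ... } then always terminates, even on a string that ends in the middle of a multi-byte character.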
