# unicode consideration


## Recommended Posts

Hi game developers. I want to modify the C++ project I'm currently developing in order to switch from ANSI to Unicode. What is the best way to handle this process? One step at a time; first questions:

1) What macro should I use to handle "bla" vs L"bla"? I've seen Microsoft's _T("") but I don't know if it is the most elegant way.

2) Do you use typedefs for the standard containers? Such as:
#ifdef UNICODE
typedef std::wstringstream tstringstream;
#else
typedef std::stringstream tstringstream;
#endif


3) What macro for char and char strings? Is a typedef like "tchar" better?

4) What encoding should I use? I need my *.ini files (on Windows systems) to handle Unicode. I've seen that Notepad opens the "Unicode", "Unicode big endian", and "UTF-8" encodings, so I think it would be a good idea to choose one of those three. Which is the most widely used? Can you provide me a link with a signature list for all the encodings? I'm talking about the first few bytes that, in theory, identify the encoding used.

P.S. Thanks to whoever loses time answering my newbie questions.
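On the signature question: those leading bytes are the byte-order mark (BOM), and the three Notepad encodings each have a well-known one. A minimal sketch of detecting them (the function name is my own, not from any library):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Identify a text encoding from its leading byte-order mark (BOM).
// Standard signatures: UTF-8 = EF BB BF, UTF-16 LE = FF FE,
// UTF-16 BE = FE FF (Notepad's "Unicode" and "Unicode big endian").
std::string detectBom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16 LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16 BE";
    return "unknown (no BOM)";
}
```

Note the BOM is optional; a file without one is usually assumed to be in the system code page or UTF-8.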

##### Share on other sites
Under Windows, just using _T() around your string literals and TCHAR in your declarations can be enough to get it to compile either way.

##### Share on other sites
My program ships on Win* platforms; what about the encoding to use?

##### Share on other sites
Windows uses little-endian Unicode (UTF-16 LE; Notepad just calls it "Unicode"). Files saved as big-endian you will have to convert in your code with ((WideChar & 0xFF) << 8) | ((WideChar >> 8) & 0xFF), where WideChar is the wide character you read; note the parentheses, since << and & have surprising precedence. UTF-8 you will have to convert with MultiByteToWideChar(CP_UTF8, ...).
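The byte swap is easiest to get right when wrapped in a small helper so the precedence issues live in one place (a sketch; the function name is my own):

```cpp
#include <cassert>
#include <cstdint>

// Swap the two bytes of one UTF-16 code unit read from a big-endian
// file so it can be used on little-endian Windows. Parenthesize
// carefully: & binds more loosely than <<, so the naive expression
// without parentheses computes something else entirely.
std::uint16_t swapBytes(std::uint16_t w)
{
    return static_cast<std::uint16_t>(((w & 0xFF) << 8) | ((w >> 8) & 0xFF));
}
```

Applied to every code unit after reading a "Unicode big endian" file, this yields the little-endian form Windows expects.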

##### Share on other sites
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
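The length caveat is easy to demonstrate: std::string::size() counts bytes, while counting characters means skipping UTF-8 continuation bytes. A sketch (the helper name is my own):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string stored in a plain
// std::string. Continuation bytes have the form 10xxxxxx, so every
// byte NOT matching that pattern starts a new code point.
std::size_t codePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) // not a continuation byte
            ++count;
    return count;
}
```

For "h\xC3\xA9llo" ("héllo"), size() reports 6 bytes but there are only 5 characters, which is exactly the mismatch the post warns about.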

##### Share on other sites
Hmm, I think Microsoft's use of little-endian is due to the fact that (as I've seen) the recent NT series uses it internally. I've decided to go with:

typedef tchar (char / wchar_t)
typedef tstring (std::string / std::wstring)
#define T(x) (x / L x)

Why does the standard use the L"" prefix before wide literals? Does it mean "long"? Wouldn't S"" (short) have been better?
p.s. thanks
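One way to spell out the typedefs sketched above, mirroring the Windows UNICODE convention (the t-prefixed names are a common convention, not standard C++, and the macro here is an illustration rather than Microsoft's _T):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Select narrow or wide text types based on the UNICODE define,
// in the style of Windows' <tchar.h>.
#ifdef UNICODE
typedef wchar_t            tchar;
typedef std::wstring       tstring;
typedef std::wstringstream tstringstream;
#define T(x) L##x   // token-paste the L prefix onto the literal
#else
typedef char               tchar;
typedef std::string        tstring;
typedef std::stringstream  tstringstream;
#define T(x) x
#endif
```

With this in one header, the rest of the code can use tchar, tstring, and T("...") everywhere and flip between builds with a single define.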

##### Share on other sites
Quote:
Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)

And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)

But sure, if you can guarantee that your program will never be used by anyone outside the US/UK, UTF-8 works fine. ;)

##### Share on other sites
Quote:
Original post by Spoonbender
Quote:
Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.

(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)

And the problem of not all international characters being represented, of some characters varying from font to font, and so on and so on... ;)

What are you on about? UTF-8 is a Unicode encoding, which allows for representing any Unicode code point (including those outside the BMP).

##### Share on other sites
UTF-8 is a Unicode encoding using 8-bit code units. It maps directly to ASCII for values 0 through 127, but it does some tricks to represent code points beyond that range (any value above 0x7F becomes a multi-byte sequence). A Unicode code point fits in a 32-bit number, and a UTF-8 character can have up to 3 extra bytes (for a total of four 8-bit units) to represent other characters. It does have the advantage of not being affected by endian issues.

There is also UTF-16 (both big and little endian) and UTF-32 (also both big and little endian) that fall under the standard.

Unicode Website
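The multi-byte scheme described above can be sketched as a small encoder (a hypothetical helper, not from the thread; it follows the 1-to-4-byte format of RFC 3629):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8. ASCII (<= 0x7F) is one byte;
// larger code points get a lead byte whose high bits state the sequence
// length, followed by continuation bytes of the form 10xxxxxx.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                       // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {                      // 1110xxxx + 2 continuations
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                        // 11110xxx + 3 continuations
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because the sequence is defined byte by byte, the output is identical on big- and little-endian machines, which is the endian advantage mentioned above.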

##### Share on other sites
UTF-8 isn't a codepage, if that is what you were thinking. Characters have variable length, ranging from 1 to 4 bytes (the original design allowed up to 6, before the standard restricted it to 4). It can hold every possible Unicode character.

I wrote some iterator adaptors for dealing with UTF-8 strings, to make life easier. Here is some code from my drawText function as an example:

```cpp
UTF8::InputAdaptor<std::string::const_iterator> pos(string.begin(), string.end());
const UTF8::InputAdaptor<std::string::const_iterator> end(string.end());

for(; pos != end; ++pos)
{
  const Font::Glyph &glyph(font.getGlyph(*pos, font_height));

  if(x + glyph.width >= 0)
    drawGlyph<bpp>(glyph, x, y, p, c);

  if((x += glyph.width) >= width)
    break;
}
```

Note that my adaptor can optionally take two extra iterators; one for one past the end, and one for the beginning. This is protection from malformed strings so it doesn't end up trying to read a multi-byte character and skipping over the beginning or end, which of course would be bad.
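A minimal sketch of what such an adaptor does underneath (my own illustration; unlike the adaptor described above, it performs no protection against malformed input):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode the UTF-8 sequence starting at s[i], advance i past it, and
// return the code point. The lead byte's high bits give the sequence
// length; each continuation byte contributes its low 6 bits.
std::uint32_t decodeUtf8(const std::string& s, std::size_t& i)
{
    unsigned char c = s[i++];
    if (c < 0x80)
        return c;                       // single-byte ASCII
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    std::uint32_t cp = c & (0x3F >> extra); // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (s[i++] & 0x3F);   // fold in continuation bytes
    return cp;
}
```

A real adaptor, as the post notes, would also check that the continuation bytes exist and have the right form before consuming them.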
