#ifdef UNICODE
typedef std::wstringstream tstringstream;
#else
typedef std::stringstream tstringstream;
#endif
unicode consideration
Hi gamedevelopers
I want to convert the C++ project I'm currently developing from ANSI to Unicode. What is the best way to handle this process? One step at a time; my first questions:
1) Which macro should I use to switch between "bla" and L"bla"? I've seen Microsoft's _T(" ") but I don't know if it is the most elegant way.
2) Do you use typedefs for the standard containers and streams, such as a tstringstream that maps to std::stringstream or std::wstringstream?
3) Which macro for char and char strings? Or is a typedef like "tchar" better?
4) Which encoding? I need my *.ini files (on Windows systems) to handle Unicode. I've seen that Notepad can open the "Unicode", "Unicode big endian" and "UTF-8" encodings, so I think it would be a good idea to choose one of these three. Which is the most widely used?
Can you give me a link to a list of signatures for all the encodings? I mean the first 4 bytes that in theory identify the encoding used.
P.S. Thanks to whoever will lose time answering my stupid question.
What is your target platform?
Under Windows just using _T() around your string literals, and TCHAR in your declarations can be enough to get it to compile for either.
Windows uses little-endian UTF-16 (Notepad just calls it "Unicode"). Files saved as big-endian you will have to convert in your code using ((WideChar & 0xFF) << 8) | ((WideChar >> 8) & 0xFF) (where WideChar is the wide character you read), and UTF-8 you will have to convert with MultiByteToWideChar(CP_UTF8, ...)
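A minimal sketch of signature (BOM) detection and of the big-endian byte swap, written in portable C++ instead of with the Win32 call. The names Encoding, detectBom, and swapBytes are made up for illustration. The signature bytes are the standard byte order marks: EF BB BF for UTF-8, FF FE for little-endian UTF-16, and FE FF for big-endian UTF-16. UTF-32 signatures are deliberately not handled here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file for a byte order mark (BOM).
// Files without a BOM come back as Unknown; the caller must then guess
// or fall back to a default. UTF-32 BOMs are not checked in this sketch.
inline Encoding detectBom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;
}

// Swap the two bytes of a 16-bit code unit (big-endian <-> little-endian).
inline std::uint16_t swapBytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>(((unit & 0xFF) << 8) | ((unit >> 8) & 0xFF));
}
```

Note the explicit parentheses in swapBytes: in C++ the shift operators bind tighter than &, and + binds tighter than both, so an unparenthesized version of the swap does not compute what it appears to.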
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.
(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
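To make the length caveat concrete, here is a small sketch (the function name utf8Length is hypothetical) that counts code points by skipping UTF-8 continuation bytes, which always have the bit pattern 10xxxxxx. It assumes the string is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is
// NOT a continuation byte (10xxxxxx). Assumes well-formed input.
inline std::size_t utf8Length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For a string such as "na" "\xC3\xAF" "ve" (the word "naïve" in UTF-8), size() reports 6 bytes while utf8Length reports 5 code points. That gap is why erasing or inserting at an arbitrary byte index can split a character in half.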
Hmm, I think Microsoft uses little-endian because (as far as I've seen) the recent NT series uses it internally. I've decided to go with:
typedef tchar (char,wchar_t)
typedef tstring (string, wstring)
#define T() ((),L())
Why does the standard use L" " as the prefix for wide literals? Does it mean Long? Wouldn't S" " (Short) have been better?
p.s. thanks
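Spelled out as compilable C++, the shorthand above could look roughly like this. It is only a sketch: tchar and tstring follow the post, but the literal macro is named TSTR() here instead of T(), an assumption made because a function-like macro called T would clash with any template code that writes T(...) on a type parameter.

```cpp
#include <string>

// Sketch of the post's tchar / tstring / literal macro idea.
// TSTR is a hypothetical stand-in for the post's T().
#ifdef UNICODE
typedef wchar_t tchar;
#define TSTR_IMPL(x) L##x
#else
typedef char tchar;
#define TSTR_IMPL(x) x
#endif

// The extra level of indirection lets TSTR() accept a macro that
// expands to a literal, not only a literal written in place.
#define TSTR(x) TSTR_IMPL(x)

// basic_string<tchar> is std::string or std::wstring depending on the build.
typedef std::basic_string<tchar> tstring;
```

Usage is the same in both builds, for example tstring name = TSTR("bla"); and tchar first = TSTR('b');.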
Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.
(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
And the problem of not all international characters being represented, of some character varying from font to font, and so on and so on... ;)
But sure, if you can guarantee that your program will never be used by anyone outside the US/UK, UTF8 works fine. ;)
Quote:Original post by Spoonbender
Quote:Original post by smart_idiot
UTF-8 is nice. You can store your text in normal strings, and, at least here in Linux with GCC, can be used with no extra work. Liek magik.
(There's the obvious problems of not being able to trust the string's length and inserting/removing characters with normal string functions can break them, but I won't tell if you won't.)
And the problem of not all international characters being represented, of some character varying from font to font, and so on and so on... ;)
What are you on about? UTF-8 is a Unicode encoding, which allows for representing any Unicode code point (including those outside the BMP).
UTF-8 is a Unicode encoding that uses 8-bit code units. It maps directly to ASCII for the values 0-127, and it uses multi-byte sequences for anything above that. A Unicode code point fits in a 32-bit number (the valid range is U+0000 to U+10FFFF), so a UTF-8 character can need up to (iirc) 3 extra bytes, for a total of 4 bytes. It also has the advantage of not being affected by endianness issues.
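As a rough illustration of those "tricks", the sketch below encodes one code point into 1 to 4 bytes. The function name encodeUtf8 is hypothetical, and returning an empty string for invalid input is just one possible error policy.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that are not valid scalar values
// (surrogates D800-DFFF, or anything above 10FFFF).
inline std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx: plain ASCII, one byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate: invalid
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+0041 ('A') stays a single byte 41, while U+20AC (the euro sign) becomes the three bytes E2 82 AC.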
There is also UTF-16 (both big and little endian) and UTF-32 (also both big and little endian) that fall under the standard.
Unicode Website
You can even download the official unicode standard book there.
UTF-8 isn't a codepage, if that is what you were thinking. Characters have variable length, ranging from 1-4 bytes (the original design allowed up to 6, but RFC 3629 restricts it to 4). It can hold every possible Unicode character.
I wrote some iterator adaptors for dealing with UTF-8 strings, to make life easier. Here is some code from my drawText function as an example:
UTF8::InputAdaptor<std::string::const_iterator> pos(string.begin(), string.end());
const UTF8::InputAdaptor<std::string::const_iterator> end(string.end());
for(; pos != end; ++pos)
{
    const Font::Glyph &glyph(font.getGlyph(*pos, font_height));
    if(x+glyph.width >= 0)
        drawGlyph<bpp>(glyph, x, y, p, c);
    if((x += glyph.width) >= width)
        break;
}
Note that my adaptor can optionally take two extra iterators; one for one past the end, and one for the beginning. This is protection from malformed strings so it doesn't end up trying to read a multi-byte character and skipping over the beginning or end, which of course would be bad.
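The UTF8::InputAdaptor class itself isn't shown in the thread, so the following is not that code. It is only a guess at the kind of bounds-checked decoding such an adaptor would need, with a hypothetical function nextCodePoint that refuses to read past the end iterator. It is simplified: overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstdint>
#include <string>

// Decode the code point starting at `it` without ever reading at or
// past `end`. Precondition: it != end.
// On success, `it` is advanced past the whole sequence.
// On malformed or truncated input, returns U+FFFD and advances `it`
// by exactly one byte so the caller can resynchronize.
inline std::uint32_t nextCodePoint(std::string::const_iterator& it,
                                   std::string::const_iterator end) {
    const std::uint32_t kReplacement = 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // single-byte (ASCII) character

    int extra = 0;
    std::uint32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return kReplacement;  // stray continuation byte or invalid lead

    std::string::const_iterator probe = it;
    for (int i = 0; i < extra; ++i) {
        if (probe == end) return kReplacement;  // sequence runs off the end
        const unsigned char c = static_cast<unsigned char>(*probe);
        if ((c & 0xC0) != 0x80) return kReplacement;  // not a continuation
        cp = (cp << 6) | (c & 0x3F);
        ++probe;
    }
    it = probe;
    return cp;
}
```

A caller would loop with while (it != end) and pass each returned code point to something like getGlyph. The end check inside the loop is what prevents a truncated multi-byte character at the end of the string from causing an out-of-range read.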