# Ascii/Unicode code, duplication avoidable?


## Recommended Posts

Hi,

Once upon a time, when Unicode was no big thing, I built a codebase which consisted of pure ANSI/ASCII functions. During a project I noticed I needed to work with Unicode, so I created typedefs (utilizing the Windows define UNICODE) and changed all the string-based methods to use a variable typedef (think TCHAR for std::string/std::wstring). This worked fine so far.

Now I face one problem: in my latest project I work with mixed code. Some of it will use "standard" strings, other parts Unicode strings. Obviously the define only creates one set of functions, so the other is not available.

What is the usual way to work around this? I could either:
1) Create duplicates of all those functions, or
2) Use templates, but then I'd have to move all the code from the .cpp into the header file, which feels very unclean (besides having a few side effects with static variables).

I'd choose 1), but before I start the tedious journey I'd like to know if there are other possibilities.

##### Share on other sites
#2 is the only way I know of to avoid #1. Although if you decide to make ANSI/Unicode duplicate functions, you could always make one of them a proxy function that converts the string to the appropriate type and then calls the "real" version.
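A minimal sketch of that proxy idea; `CountChars` is a hypothetical example function (not from the thread), and the naive element-by-element narrowing shown here is lossy for anything outside ASCII, which is exactly the drawback raised later in the discussion:

```cpp
#include <string>

// The "real" implementation works on narrow strings only.
int CountChars(const std::string& s) {
    return static_cast<int>(s.size());
}

// Proxy overload: converts the wide string, then forwards to the real version.
// Warning: this naive narrowing mangles anything outside the ASCII range.
int CountChars(const std::wstring& ws) {
    std::string narrow(ws.begin(), ws.end());
    return CountChars(narrow);
}
```

Only the wide overload needs to be written per function; the narrow one stays the single point of maintenance.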

##### Share on other sites
Maybe you could add a UTF-8 interface to your Unicode classes. If your data is English, it will be exactly the same, and if it's German (since you're from Austria) it will still be 90% the same, as for most other European or Latin languages. You might have a small overhead for the conversion, but that's it. Most of your data goes straight through.
Actually, an entire two functions (Ansi2Uni and Uni2Ansi) might do...

It only really gets expensive for Asian languages, but those are problematic to deal with either way, if you don't have Unicode. Your ANSI components won't work with languages that have a few thousand distinct characters anyway.
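A minimal sketch of those two functions, under the assumption that the "ANSI" side is Latin-1 (ISO 8859-1), whose 256 code values map 1:1 onto the first 256 Unicode code points; anything outside that range is necessarily lost on the way back:

```cpp
#include <string>

// Latin-1 -> wide: each byte zero-extends directly to its Unicode code point.
std::wstring Ansi2Uni(const std::string& in) {
    std::wstring out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out.push_back(static_cast<wchar_t>(c));
    return out;
}

// Wide -> Latin-1: anything above U+00FF cannot be represented and is
// replaced by '?', illustrating why round-tripping exotic characters fails.
std::string Uni2Ansi(const std::wstring& in) {
    std::string out;
    out.reserve(in.size());
    for (wchar_t c : in)
        out.push_back(c < 256 ? static_cast<char>(c) : '?');
    return out;
}
```

For real Windows code pages (or UTF-8) the conversions would instead go through WideCharToMultiByte/MultiByteToWideChar, but the boundary shape stays the same: two functions.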

##### Share on other sites
I guess it's #1 then. As for "foreign" chars: my wife is Thai, so that has some nice training effects on my code. And that's a tad more complicated than just the German umlauts ;)

I'd rather not make proxy functions (internal conversion destroys the more exotic chars unless I carry the encoding along with each string).

Anyway, thanks both of you for your input! (hands out rating++)

##### Share on other sites
3) Use UTF-8 strings where possible, and only convert to wide characters (e.g. Microsoft's UNICODE, or the ISO standard UCS-4, as appropriate) where necessary. No code duplication, no macros, no templates. One single code path for everything.

Unless you're actually parsing strings (and in an internationalized environment, good luck with that) you should stick to UTF-8 as much as possible.

##### Share on other sites
I would use something like this

```cpp
// Foo.h
int Foo(std::string str);
int Foo(std::wstring str);

// Foo.cpp
template<class T>
int RealFoo(T str)
{
    return whatever;
}

int Foo(std::string str)
{
    return RealFoo<std::string>(str);
}

int Foo(std::wstring str)
{
    return RealFoo<std::wstring>(str);
}
```

The template stays in the .cpp file; the two thin overloads in the header are the only public surface, so nothing template-related leaks out.

##### Share on other sites
Correct me if I'm completely off, but UTF-8 strikes me as a bit complicated due to the varying character lengths.
The string functions also need to work with old char* API calls; with umlauts that wouldn't work anymore.

Thanks though ;)

##### Share on other sites
UTF-8 has the nice property that many functions will work with it without really supporting it. Even though those functions "see" the actual content as crap (for example, Thai is entirely unreadable as raw UTF-8 bytes), it's still a valid 8-bit character string. Therefore it will pass through most string functions, such as strcpy() or strcat(), without a single problem. As long as you don't really need to identify distinct characters, everything is fine.
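That pass-through behavior is easy to demonstrate: byte-oriented C functions count and copy the bytes without ever interpreting them as characters. A small sketch (the helper names are made up for the example):

```cpp
#include <cstring>

// "né" encoded as UTF-8: 'n' is one byte, 'é' (U+00E9) is two (0xC3 0xA9).

// strlen() reports bytes, not characters: 3 here, although "né" is
// only 2 characters.
std::size_t Utf8DemoByteLength() {
    return std::strlen("n\xC3\xA9");
}

// strcpy() performs a blind byte copy with no UTF-8 awareness, yet the
// multi-byte sequence survives intact.
bool Utf8SurvivesStrcpy() {
    const char* utf8 = "n\xC3\xA9";
    char buf[8];
    std::strcpy(buf, utf8);
    return std::strcmp(buf, utf8) == 0;
}
```

The byte/character distinction only bites once you start indexing or splitting the string, which is the "identify distinct characters" case mentioned above.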

##### Share on other sites
If this application is just for Windows and you don't need to support Windows 9x, forget ANSI (this includes the use of TCHAR). Prefer wide-character strings. If you can rewrite your ANSI-based functions to use wide characters, do so. The best way to avoid duplication while supporting Unicode is to do everything in Unicode.

##### Share on other sites
Quote:
Original post by Endurion
Correct me when i'm completely off, but UTF-8 strikes me as a bit complicated due to the different character lengths.

Like I said, unless you're parsing the strings. If you're doing that, then yes UTF-8 is more complicated. If you're not parsing strings, UTF-8 is as complicated as good-ol' ASCII.

If you're parsing strings in internationalized software you're probably up shift creek anyhow, since you're dealing with, say, composed characters in Thai or Korean, or mixed left-to-right and right-to-left strings in Hebrew and Arabic, or odd parses like "ll" in Spanish or "ch" in Czech, not to mention the infamous German ß.
Quote:
 The string functions also need to work with old char* API calls; with umlauts that wouldn't work anymore.

It depends on what you're trying to do. UTF-8 is a char* API. ASCII is UTF-8. The problems only crop up if your char* API is a CP-850 (or ISO 8859-1, or MAC-1000) API. Then it's not gonna give expected results if you pass in UTF-8. If an API is designed for Unicode (not UNICODE), it will work with pretty much any language. If it's designed for only a single language, it will only work with that single language.

So like I said, stick to UTF-8 where you can, and convert where you have to.
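As an illustration of "convert where you have to", here is a minimal UTF-8 to UTF-32 decoder sketch; it performs no validation of malformed input, and on Windows the equivalent boundary conversion would normally be done with MultiByteToWideChar using CP_UTF8 rather than hand-rolled:

```cpp
#include <string>

// Decode well-formed UTF-8 into one char32_t code point per character.
// The leading byte's high bits tell us the sequence length; each
// continuation byte contributes its low 6 bits.
std::u32string Utf8ToUtf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if      (lead < 0x80) { cp = lead;        len = 1; }  // ASCII
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 2-byte seq
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 3-byte seq
        else                  { cp = lead & 0x07; len = 4; }  // 4-byte seq
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

This is the only place where the variable character length matters; everything up to this boundary can treat the string as opaque bytes.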
