mackron

UTF-8 char* strings


G'day, I'm currently writing a Unicode string manipulation library and I was curious whether there are any platforms that assume UTF-8 encoding for their str*() family of C library functions. That way I could use those routines instead of writing my own. So, are there any platforms that do this? I had a quick look on Google, but couldn't find anything succinct. I'll keep looking... Thanks a lot.

Quote:
Original post by mackron
[are there] any platforms that assume UTF-8 encoding for their str*() family of C library functions? That way I can use those routines instead of writing my own.


I'm afraid that, AFAIK, the standard C library has no such thing.
The good news is that you don't have to write your own; unicode.org already has!

If you're still interested in writing your own, there is this article I came across recently.

The standard C library is... kind of... older than Unicode (and certainly older than the popular adoption of Unicode), so no, it's not going to work that way. The signatures for these functions are defined by the standard to accept and return char*, and the definition of e.g. strlen() requires it to count all the chars until a '\0', which rules out UTF-8 (since it's a variable-length encoding).

There are certainly lots of existing code snippets for doing the necessary translation work.

Thanks guys.

Quote:

The good news is that you don't have to write your own; unicode.org already has!

Yeah I saw that, and I've based all of my conversion functions off of that code.

Quote:

The standard C library is... kind of... older than Unicode (and certainly older than the popular adoption of Unicode), so no, it's not going to work that way. The signatures for these functions are defined by the standard to accept and return char*, and the definition of e.g. strlen() requires it to count all the chars until a '\0', which rules out UTF-8 (since it's a variable-length encoding).


I always thought strlen() is supposed to calculate the number of bytes and not necessarily the number of characters. I think I might have read that from MSDN...

So will the str*() family of functions only work with ASCII strings?

Thanks again.


Quote:
Original post by mackron
I always thought strlen() is supposed to calculate the number of bytes and not necessarily the number of characters. I think I might have read that from MSDN...


That is correct, str* works on bytes not characters.

Quote:
Original post by mackron
So will the str*() family of functions only work with ASCII strings?


With the above in mind, str* will still "work" with UTF-8 strings, depending on what it is you actually expect it to do (i.e. it doesn't "work" if you expect it to count characters, but it does "work" if you expect it to count bytes).
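
To make the bytes-versus-characters distinction concrete, here is a minimal sketch in C. It assumes the input is valid, NUL-terminated UTF-8, and `utf8_count_code_points` is a hypothetical helper name, not something the standard library provides. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the number of code points, while strlen() keeps reporting bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h> /* strlen(), for comparing byte length with the count below */

/* Hypothetical helper: number of code points in a NUL-terminated string.
 * Assumption: s is valid UTF-8. No validation is performed here. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new code point. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "caf\xC3\xA9" ("café" with a precomposed é), strlen() returns 5 while the helper returns 4. Note that this still counts code points, not user-perceived characters.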

By the way, I should just point out that even wcslen and friends don't really count "characters" either. They count wchar_t elements, which on Windows are UTF-16 code units. After all, "é" is one character, but wcslen(L"é") would return 2, because the string is actually <U+0065 (LATIN SMALL LETTER E), U+0301 (COMBINING ACUTE ACCENT)>.
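
A small sketch of the wcslen point, written with explicit code point values so it does not depend on how the compiler decodes the source file. The array names and `count_skipping_combining_marks` are hypothetical, and skipping only the Combining Diacritical Marks block (U+0300 to U+036F) is a rough approximation made for illustration; real "character" counting needs proper grapheme cluster segmentation.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* "é" as a single precomposed code point: U+00E9. */
static const wchar_t precomposed[] = { 0x00E9, 0 };

/* "é" as two code points: U+0065 followed by U+0301. */
static const wchar_t decomposed[] = { 0x0065, 0x0301, 0 };

/* Hypothetical helper: counts wchar_t elements, skipping anything in the
 * Combining Diacritical Marks block (U+0300..U+036F). This is only an
 * approximation of "user-perceived characters". */
size_t count_skipping_combining_marks(const wchar_t *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != L'\0'; ++s) {
        if (*s < 0x0300 || *s > 0x036F)
            ++count;
    }
    return count;
}
```

Here wcslen(precomposed) is 1 and wcslen(decomposed) is 2, even though both display as "é"; the helper returns 1 for both.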

Quote:
Original post by Codeka
By the way, I should just point out that even wcslen and friends don't really count "characters" either. They count wchar_t elements, which on Windows are UTF-16 code units. After all, "é" is one character, but wcslen(L"é") would return 2, because the string is actually <U+0065 (LATIN SMALL LETTER E), U+0301 (COMBINING ACUTE ACCENT)>.

The funny thing about Unicode is that there are several ways to get certain characters: é.
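
One practical consequence for the str*() functions, sketched in C: the two spellings of "é" are different byte sequences in UTF-8, so a byte-wise comparison treats them as different strings unless both are normalized first. The names below are hypothetical, and normalization itself is not shown.

```c
#include <assert.h>
#include <string.h>

/* UTF-8 for U+00E9 (precomposed é): 2 bytes. */
static const char e_acute_precomposed[] = "\xC3\xA9";

/* UTF-8 for U+0065 U+0301 (e + combining acute accent): 3 bytes. */
static const char e_acute_decomposed[] = "e\xCC\x81";

/* Hypothetical helper: nonzero when the two strings are byte-for-byte equal. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

same_bytes(e_acute_precomposed, e_acute_decomposed) is 0, and strlen() gives 2 and 3 respectively, although both strings render as the same character.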

Quote:
Original post by ToohrVyk
The funny thing about Unicode is that there are several ways to get certain characters: é.


That was kind of my point [smile]
