Percent-encoding non-English characters - any Win32 API?

12 comments, last by ApochPiQ 15 years, 3 months ago
Hi, I need to create an IUri object from a character string that can also contain non-English characters. CreateUri() fails and returns E_FAIL when I pass it a string containing non-English characters, so I think they need to be percent-encoded. InternetCanonicalizeUrl() and UrlEscape() are both useful only for converting unsafe characters. Is there any Win32 API for converting alphanumeric characters into percent-encoded form? Thanks, M
What type/class are you using for your strings? (Also, what language are you writing in?)

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

According to this document, it sounds like InternetCanonicalizeUrl() will automatically encode all characters outside the US-ASCII character set.
Hi,
I looked up InternetCanonicalizeUrl(). It looks like it does encode non-English characters as well. Thanks a lot.

-M
Hi,
I tried InternetCanonicalizeUrl() with dwFlags = ICU_BROWSER_MODE. It doesn't convert non-English URLs into percent-encoded form, though it does convert reserved characters into their corresponding encodings. I'm using VC++ and all my strings are WCHAR*.

Thanks,
M
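From my research, raw Unicode characters are not legal in a URL: RFC 3986 restricts URIs to a limited ASCII subset, so anything else has to be percent-encoded. RFC 3986 recommends converting text to UTF-8 before percent-encoding it (and RFC 3987 builds IRIs on the same idea), but older implementations do different things with Unicode characters, so you can't assume that whatever receives the URL will interpret the bytes the way you intended.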

So the question becomes, why do you have this requirement? What exactly is supposed to be accomplished here?

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

I'm trying to create an IUri object by passing a URL string which could be in any language. I'm calling CreateUri(), which fails for non-English characters. This is for a word processing app.

Thanks,
M
Well, the really lazy way would be to reinterpret_cast the string to a char* and loop through each byte, doing a simple conversion to % codes as you go (in case you aren't aware, the % codes are literally the byte values in hexadecimal). This will thoroughly mangle the URL and it probably won't work, because URIs are not intended to contain characters outside a very limited subset of US-ASCII. In other words, if you create a URL this way, chances are whatever serves that URL will have no idea what you want [wink]
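
To make the "byte values in hexadecimal" point concrete, here is a minimal sketch of the byte-wise conversion. The function name percentEncodeBytes is made up for illustration, and the choice to leave only the RFC 3986 unreserved characters untouched is an assumption of this sketch. That choice suits a single URL component rather than a whole URL, since ':' and '/' get escaped too.

```cpp
#include <string>

// Hypothetical helper: percent-encode a sequence of raw bytes.
// Assumption: only the RFC 3986 "unreserved" characters (ASCII letters,
// digits, '-', '.', '_', '~') pass through unchanged; every other byte
// is written as '%' followed by its value as two uppercase hex digits.
std::string percentEncodeBytes(const std::string& bytes) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        const bool unreserved =
            (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
            (b >= '0' && b <= '9') ||
            b == '-' || b == '.' || b == '_' || b == '~';
        if (unreserved) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += kHex[b >> 4];    // high nibble
            out += kHex[b & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Note that the output depends entirely on which bytes go in. The little-endian UTF-16 bytes of U+062A (0x2A, 0x06) come out as "%2A%06", while the UTF-8 bytes of the same character (0xD8, 0xAA) come out as "%D8%AA". Feeding reinterpret_cast UTF-16 memory into a loop like this is exactly the mangling described above.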

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Hi Apoch,
Thanks for the prompt replies. When I could not find any API, converting the characters to their corresponding hex values was what I tried doing. But I'm running into some weird issues. For example, the Arabic character L'ت' is showing up as 1578 (decimal). This cannot be right. I'm passing the string as a CString and calling PercentEncode(strRaw). See anything wrong in what I'm doing?

Thanks,
M
First, you need to know what encoding your input is using. Is it already UTF-8? UTF-16? Even UTF-32? What is the endianness of the encoding?

All that will affect how you do this.

(By the way, 1578 is the correct code point number for the character you posted. It smells like little-endian UTF-16 to me, which is pretty much what Windows does, but that's just a guess.)
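
As a rough illustration of why the encoding matters: 1578 is 0x062A, a single UTF-16 code unit, and it has to be turned into bytes before it can be percent-encoded. The sketch below assumes the receiving side expects UTF-8, which is what RFC 3986 recommends. The names utf16ToUtf8 and percentEncodeUtf16 are hypothetical, std::u16string stands in for a WCHAR* string, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: convert UTF-16 code units to UTF-8 bytes.
// Assumption of this sketch: unpaired surrogates become U+FFFD.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the pair into one code point above U+FFFF.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: UTF-16 text -> UTF-8 bytes -> %XX escapes.
// ASCII letters, digits, '-', '.', '_' and '~' are left as they are.
std::string percentEncodeUtf16(const std::u16string& text) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : utf16ToUtf8(text)) {
        const bool unreserved =
            (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
            (b >= '0' && b <= '9') ||
            b == '-' || b == '.' || b == '_' || b == '~';
        if (unreserved) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += kHex[b >> 4];
            out += kHex[b & 0x0F];
        }
    }
    return out;
}
```

With this, percentEncodeUtf16(u"\u062A") yields "%D8%AA": the single code unit 1578 becomes two UTF-8 bytes, each escaped separately. On Windows, WideCharToMultiByte with CP_UTF8 can do the UTF-16 to UTF-8 step in place of the hand-written loop.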


Additionally - I'd recommend using ICU if you want good Unicode support from C/C++. It's a great library and handles (as well as explains) the finer points of how Unicode works.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]
