Unicode Letters and Localization

This topic is 3899 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

I'm attempting to generate a texture (for use in OpenGL fonts) holding all of the possible Unicode letters of a given language. For example, if the user's computer is localized to Japanese, it would render all of the possible Japanese characters to this texture. My problem is that I am unsure how to (A) detect the user's localized language (this needs to be cross-platform, so it has to work on Windows, Mac, and Linux) and (B) obtain all of the renderable Unicode characters for that language (for example, I don't want to render 'delete', of course). Any help is greatly appreciated.

There are POSIX locale functions, which should work everywhere (or the standard C++ <locale> facilities).
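As a rough illustration of the locale route, here is a minimal sketch in Python, whose `locale` module wraps the C `setlocale` call. The helper names `language_from_locale_name` and `detect_language` and the `"en"` fallback are assumptions made for this example. It only understands POSIX-style names such as `ja_JP.UTF-8`; Windows can report names like `Japanese_Japan.932`, which would need a separate lookup table.

```python
import locale


def language_from_locale_name(name, default="en"):
    """Extract a language code from a POSIX-style locale name.

    "ja_JP.UTF-8" gives "ja". Names that do not look like
    language[_TERRITORY] (for example "C" or "POSIX") give `default`.
    """
    if not name:
        return default
    # Drop the encoding (".UTF-8") and the modifier ("@euro").
    base = name.split(".")[0].split("@")[0]
    if base in ("", "C", "POSIX"):
        return default
    # Split language from territory; accept "-" as a separator too.
    parts = base.replace("-", "_").split("_")
    lang = parts[0].lower()
    # Assumption: only 2- or 3-letter language codes are accepted.
    if all(p.isalpha() for p in parts) and len(lang) in (2, 3):
        return lang
    return default


def detect_language(default="en"):
    """Best-effort guess at the user's language from the environment."""
    try:
        # An empty string selects the user's default locale; the call
        # returns the name of the locale that was set.
        name = locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return default
    return language_from_locale_name(name, default)


if __name__ == "__main__":
    print(detect_language())
```

Falling back to a default instead of raising keeps the font setup working on machines with a missing or misconfigured locale.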

Unicode reserves contiguous ranges of code points (blocks) for groups of related characters. Here is a site which lists them.
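To make the range idea concrete, the sketch below walks a few block ranges and keeps only code points that are assigned and not control characters, which is one way to skip things like 'delete'. The `LANGUAGE_BLOCKS` table and the `renderable_chars` helper are hypothetical, and the table is deliberately incomplete (Japanese text also uses punctuation and fullwidth forms from other blocks).

```python
import unicodedata

# Hypothetical language-to-block table; extend it from a Unicode block list.
LANGUAGE_BLOCKS = {
    "en": [(0x0000, 0x007F)],   # Basic Latin
    "ja": [(0x3040, 0x309F),    # Hiragana
           (0x30A0, 0x30FF),    # Katakana
           (0x4E00, 0x9FFF)],   # CJK Unified Ideographs
}


def renderable_chars(ranges):
    """Return the characters in `ranges` (inclusive) that are worth drawing."""
    chars = []
    for first, last in ranges:
        for cp in range(first, last + 1):
            ch = chr(cp)
            # General categories starting with "C" cover control, format,
            # surrogate, private-use, and unassigned code points.
            if not unicodedata.category(ch).startswith("C"):
                chars.append(ch)
    return chars


if __name__ == "__main__":
    print(len(renderable_chars(LANGUAGE_BLOCKS["en"])))
    print(len(renderable_chars(LANGUAGE_BLOCKS["ja"])))
```

For Basic Latin this keeps the 95 characters from space through `~` and drops the control codes, including U+007F (delete). Whether the font actually contains a glyph for each remaining character still has to be checked against the font itself.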

That said, this is likely a solved problem (or at least the problem you're ultimately trying to solve is). Take a look and see how others do it.

Can I ask why you want to do this?

Some languages (like Japanese/Chinese) have a rather large number of possible characters.

Other languages have combinatorial problems, where the rendering and positioning of a particular Unicode character depends on what comes before and after it.

Basically, rendering a string of Unicode characters correctly takes more work than you can handle by rendering each character individually.
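A small example of that dependence is combining marks: an accent stored as its own code point has to be positioned over the preceding base character, so drawing code points one at a time from a texture misplaces it. The sketch below groups each base character with the marks that follow it. The `clusters` helper is an assumption for illustration and is much simpler than full grapheme cluster segmentation.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# "café" spelled as e followed by U+0301 COMBINING ACUTE ACCENT.
word = "cafe\u0301"

if __name__ == "__main__":
    print(len(word), len(clusters(word)))
```

Here the string holds five code points but only four units a reader would call letters. Normalizing to NFC first folds this particular pair into a single precomposed character, but not every base-plus-mark combination has a precomposed form.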

Is there a good reason why you wouldn't just render all of the glyphs of a given font to the texture? It would take up more memory, I suppose, and I can see how that might end up being a real constraint.

As NotAYakk mentioned above, there isn't a one-to-one correspondence between Unicode characters (\uxxxx) and the glyphs that get rendered on screen. Arabic is full of examples of this, where three characters might call for a single glyph, but in a different order they'd call for three glyphs.
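One place where the character-to-glyph mismatch is visible in Unicode itself is the legacy Arabic presentation forms, which encode some glyph shapes as compatibility characters. The sketch below looks up the lam-alef ligature and shows that it stands for two ordinary characters. The `underlying_characters` helper is an assumption for this example; real fonts select such ligatures through their own shaping tables, not through these code points.

```python
import unicodedata

# A single compatibility code point that encodes a ligature glyph.
LAM_ALEF = unicodedata.lookup("ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM")


def underlying_characters(glyph_char):
    """Return the ordinary characters a compatibility form stands for."""
    return unicodedata.normalize("NFKC", glyph_char)


if __name__ == "__main__":
    for ch in underlying_characters(LAM_ALEF):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The ligature is one glyph, yet the text that produces it is two characters (lam followed by alef), so a texture indexed purely by character would never contain it.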

Wikipedia actually has a good article on Unicode (http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters) that details how the BMP divides up Unicode characters. You could probably use these ranges to pull glyphs out of a given font, but I think you'd miss the glyphs that result from character combinations this way.

As a complete guess, maybe font files themselves give more clues about the locale of their glyphs? The mapping has to be done somehow. It looks like OpenType tags might have what you need: http://en.wikipedia.org/wiki/OpenType

Quote:
Original post by Valere
Is there a good reason why you wouldn't just render all of the glyphs of a given font to the texture? It would take up more memory, I suppose, and I can see how that might end up being a real constraint.


Because not all languages are written like English, where each Latin letter is separate and distinctive regardless of context (the only "context sensitive" part of the Latin alphabet would be capitalization).

By comparison, many other writing systems can have the "glyph" for each letter altered based on the glyphs around it, the tense, the gender, or any other form of language context.

For example, take the Arabic alphabet, which is composed of 28 letters (built from roughly 18 basic shapes, told apart by dots) that represent sounds, much like English. However, each of those letters can have up to 4 different forms. The reason is that the Arabic writing system flows from one letter to the next within a word: think of it like cursive writing, only it's a requirement. Each Arabic letter has up to 4 forms: an isolated form, a beginning form, a middle form, and an end form. While they often look similar, the beginning/middle forms often have very noticeable differences from the ending/isolated forms.
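The positional forms can be inspected through Unicode's legacy Arabic presentation forms, which give each shape its own compatibility code point. The sketch below collects whichever of the four forms exist for a letter. The `positional_forms` helper is an assumption for illustration; a real renderer would get these shapes from the font's shaping rules.

```python
import unicodedata

POSITIONS = ("ISOLATED", "FINAL", "INITIAL", "MEDIAL")


def positional_forms(letter_name):
    """Map position name to presentation-form character, where one exists."""
    forms = {}
    for position in POSITIONS:
        try:
            forms[position.lower()] = unicodedata.lookup(
                f"{letter_name} {position} FORM")
        except KeyError:
            # Some letters never join to the following letter, so they
            # have no initial or medial form.
            pass
    return forms


if __name__ == "__main__":
    for name in ("ARABIC LETTER BEH", "ARABIC LETTER ALEF"):
        forms = positional_forms(name)
        print(name, {pos: f"U+{ord(ch):04X}" for pos, ch in forms.items()})
```

Beh has all four forms, while alef has only an isolated and a final form, which is why the count is "up to 4" per letter. All of the forms of one letter correspond to the same character in the text.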

I don't mean to be argumentative, Michalson, but that's the exact point I was attempting to make with the rest of my post. :)

The way I was taught internationalization, a character is a single logical unit of text (the 28 letters of Arabic), while a glyph is a single visual unit of text. One glyph may represent more than one character, and one logical character may have more than one glyph depending on context.

I assume the OP is trying to develop a localizable app in OpenGL, and wants to optimize text draw rate by using a texture that he can grab glyphs from. If memory is a real constraint, then a texture containing up to 65,000 or so glyphs might be too big. If not, then it's the easiest way to avoid the hassle of trying to identify which glyphs belong to the current locale.
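To put a number on the memory question, here is a back-of-the-envelope sketch. The `atlas_size` helper, the square grid layout, the fixed cell size, and the one-byte-per-pixel format are all assumptions for illustration; a real atlas would pack variable-width glyphs more tightly.

```python
import math


def atlas_size(glyph_count, cell_px, bytes_per_pixel=1):
    """Return (texture side in pixels, size in bytes) for a square grid."""
    # Smallest number of cells per side whose square holds every glyph.
    cells_per_side = math.isqrt(glyph_count - 1) + 1
    side_px = cells_per_side * cell_px
    return side_px, side_px * side_px * bytes_per_pixel


if __name__ == "__main__":
    for cell in (16, 32):
        side, size = atlas_size(65536, cell)
        print(f"{cell}px cells: {side}x{side} texture, {size / 2**20:.0f} MiB")
```

Under these assumptions, 65,536 glyphs at 16x16 pixels need a 4096x4096 texture (16 MiB), and at 32x32 pixels an 8192x8192 texture (64 MiB), which may exceed the maximum texture size on some hardware.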

Of course, that doesn't address some of the complexities of dealing with the CJK families, like external glyphs, but that may not be necessary for what the OP needs.

