the_edd

Unicode: how to determine current/future combining character codepoint allocation?


Section 7.9 of the Unicode 6.0 standard mentions a number of blocks containing combining marks. One of these blocks, 'Combining Marks for Symbols', runs from U+20D0 to U+20FF. A number of code points in that range are not assigned (at least not in version 6.0). The implication seems to be that those unassigned code points are reserved for additional combining marks in future revisions. Is there any way of deducing this purely from the files in the Unicode database? I'd like to avoid having to trawl the Unicode PDF for mentions of potential future code point allocations.


I see that the database labels combining characters for assigned code points with General Category values of 'Mn', 'Mc', or 'Me', but since unassigned code points aren't listed in the database it doesn't look like I can infer their intended future use. But maybe I've missed a trick somewhere?
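To make the observation above concrete, here is a minimal sketch that walks the U+20D0..U+20FF block and reports each code point's General Category. It uses Python's standard `unicodedata` module as a stand-in for reading the database files directly, so it reflects whichever Unicode version the Python build ships (not necessarily 6.0). The helper name `block_categories` is made up for illustration.

```python
import unicodedata

# Block range taken from the post above: U+20D0..U+20FF.
BLOCK_START, BLOCK_END = 0x20D0, 0x20FF

def block_categories(start, end):
    """Map each code point in [start, end] to its General Category."""
    return {cp: unicodedata.category(chr(cp)) for cp in range(start, end + 1)}

cats = block_categories(BLOCK_START, BLOCK_END)

# 'Cn' is the category reported for unassigned code points.
assigned = {cp: gc for cp, gc in cats.items() if gc != "Cn"}
unassigned = [cp for cp, gc in cats.items() if gc == "Cn"]

print("Unicode version:", unicodedata.unidata_version)
print("categories of assigned code points:", sorted(set(assigned.values())))
print("unassigned:", " ".join(f"U+{cp:04X}" for cp in unassigned))
```

The unassigned code points all come back as 'Cn', with nothing to distinguish them from unassigned code points in any other block, which is the gap the question is about.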

To ask the question another way, what's a reasonably efficient way of determining if a code point is a combining mark? Should/could the algorithm include as-yet-unassigned code points in its domain?
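One reasonably efficient approach, sketched here under assumptions, is to precompute sorted runs of code points whose General Category is 'Mn', 'Mc', or 'Me' and then binary-search that table. The table below is built from Python's `unicodedata` module rather than from the database files, and the names `build_mark_ranges` and `is_combining_mark` are hypothetical. Unassigned code points fall outside every run and so answer False; whether they should is exactly the open question.

```python
import bisect
import unicodedata

MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def build_mark_ranges():
    """Collapse all combining-mark code points into sorted (start, end) runs."""
    ranges = []
    for cp in range(0x110000):
        if unicodedata.category(chr(cp)) in MARK_CATEGORIES:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp          # extend the current run
            else:
                ranges.append([cp, cp])     # start a new run
    starts = [r[0] for r in ranges]
    ends = [r[1] for r in ranges]
    return starts, ends

# Built once; scanning the full code space takes a moment in pure Python.
STARTS, ENDS = build_mark_ranges()

def is_combining_mark(cp):
    """Binary-search the run table: O(log n) per query."""
    i = bisect.bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= ENDS[i]

print(is_combining_mark(0x0301))  # COMBINING ACUTE ACCENT
print(is_combining_mark(0x0041))  # LATIN CAPITAL LETTER A
```

A production implementation would more likely generate the table offline from the database files and ship it as static data, but the lookup side would look much the same.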

Off the top of my head, you could use your favorite Unicode library's mechanism to retrieve the combining class for the code point. Only combining marks have a non-zero combining class. For example, ICU's UCharacter.getCombiningClass().
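As a rough Python analogue of that call (not ICU itself), `unicodedata.combining()` returns the canonical combining class. The sketch below prints it next to the General Category for a few code points; the helper name `describe` is made up. One thing worth checking when trying this: a non-zero class does imply a combining mark, but some marks have class 0, so the combining class alone can under-count.

```python
import unicodedata

def describe(cp):
    """Return (General Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)

# A letter, a nonspacing mark, an enclosing mark, and a spacing mark.
for cp in (0x0041, 0x0301, 0x20DD, 0x0903):
    gc, ccc = describe(cp)
    print(f"U+{cp:04X}  gc={gc}  ccc={ccc}")
```

U+20DD (COMBINING ENCLOSING CIRCLE) and U+0903 (DEVANAGARI SIGN VISARGA) are both marks with combining class 0, so a test based on the General Category is the safer way to answer "is this a combining mark?".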

Right, but I want to go one layer deeper than that; I'm curious as to how something like UCharacter.getCombiningClass() would be implemented from scratch. How does one decide from 'first principles' whether or not e.g. U+20F1 should be counted as a combining character? The implication in the PDF is that it should be counted as such, but the U+20F1 code point is not mentioned in the database (as far as I can see) as it is unassigned.

In other words, what parts of the Unicode database should I look at to confirm that UCharacter.getCombiningClass() is correctly implemented? Could a script be written (in theory) that unambiguously checks getCombiningClass() against every single code point (including unassigned ones) given only the Unicode database files as input?
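A sketch of what such a script might start from, assuming the semicolon-separated UnicodeData.txt layout (field 0 is the code point, field 2 the General Category, field 3 the canonical combining class). The embedded `SAMPLE` is a handful of lines in that layout, not the full file, and `parse_unicode_data` and `lookup` are hypothetical names. Code points missing from the file get the documented defaults for unassigned code points: General Category 'Cn' and combining class 0.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
20DD;COMBINING ENCLOSING CIRCLE;Me;0;NSM;;;;;N;ENCLOSING CIRCLE;;;;
20F0;COMBINING ASTERISK ABOVE;Mn;230;NSM;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_unicode_data(text):
    """Return {code point: (general category, combining class)} for listed entries."""
    table = {}
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name, gc, ccc = int(fields[0], 16), fields[1], fields[2], int(fields[3])
        if name.endswith(", First>"):
            range_start = cp            # remember where the range opens
            continue
        # A "Last" line closes the range opened by the preceding "First" line.
        first = range_start if name.endswith(", Last>") else cp
        range_start = None
        for c in range(first, cp + 1):
            table[c] = (gc, ccc)
    return table

def lookup(table, cp):
    # Code points absent from the file are unassigned: gc=Cn, ccc=0 by default.
    return table.get(cp, ("Cn", 0))

TABLE = parse_unicode_data(SAMPLE)
for cp in (0x0301, 0x20F0, 0x20F1, 0xB000):
    print(f"U+{cp:04X}", lookup(TABLE, cp))
```

Each (category, class) pair could then be compared with whatever the library under test returns. That covers unassigned code points too, but only by asserting the defaults; it says nothing about what a future version might assign there.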

I don't know for certain whether a function could reliably predict that future code points will be combining characters, but I doubt it. IIRC, there are still segments of the Unicode range that are reserved for potential future usage. If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range-based logic.
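If some range-based logic were wanted anyway, the one database file that does describe ranges is Blocks.txt, whose lines have the form `start..end; Block Name`. The sketch below is a heuristic only, under the assumption that a block name containing "Combining" hints at the intended use of its unassigned code points. Block membership is not a character property, and nothing in the file promises what future assignments will be. `BLOCKS_SAMPLE` is a short excerpt in the Blocks.txt layout, and the function names are hypothetical.

```python
BLOCKS_SAMPLE = """\
0300..036F; Combining Diacritical Marks
0370..03FF; Greek and Coptic
1DC0..1DFF; Combining Diacritical Marks Supplement
20D0..20FF; Combining Diacritical Marks for Symbols
FE20..FE2F; Combining Half Marks
"""

def parse_blocks(text):
    """Parse 'start..end; name' lines into (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        span, name = (part.strip() for part in line.split(";", 1))
        start, end = (int(x, 16) for x in span.split(".."))
        blocks.append((start, end, name))
    return blocks

def block_name(blocks, cp):
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return "No_Block"   # value used for code points outside every listed block

def maybe_future_mark(blocks, cp):
    """Heuristic guess from the block name alone; assignment status is not checked."""
    return "Combining" in block_name(blocks, cp)

BLOCKS = parse_blocks(BLOCKS_SAMPLE)
print(block_name(BLOCKS, 0x20F1), maybe_future_mark(BLOCKS, 0x20F1))
```

This would flag U+20F1 as a likely future combining mark, but it is a guess layered on top of the database rather than something the database states.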


"If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range based logic."

Right, that agrees with my reading. Thanks for having a look.
