Unicode: how to determine current/future combining character codepoint allocation?

Started by the_edd March 25, 2012 11:18 PM

3 comments, last by the_edd 12 years, 1 month ago

the_edd

2,109

Author

March 25, 2012 11:18 PM

The Unicode 6.0 standard, section 7.9, mentions a number of blocks containing combining marks. One of these blocks, 'Combining Marks for Symbols', runs from U+20D0 to U+20FF. Now, a number of code points in that range are not assigned (at least not in version 6.0). The implication seems to be that those unassigned code points are reserved for additional combining marks in future revisions. Is there any way of deducing this implication purely from the files in the Unicode database? I'd like to avoid having to trawl the Unicode PDF document for mentions of potential future code point allocations.

I see that the database labels combining characters for assigned code points with General Category values of 'Mn', 'Mc', or 'Me', but since unassigned code points aren't listed in the database it doesn't look like I can infer their intended future use. But maybe I've missed a trick somewhere?

To ask the question another way, what's a reasonably efficient way of determining if a code point is a combining mark? Should/could the algorithm include as-yet-unassigned code points in its domain?

http://www.mr-edd.co.uk
http://bitbucket.org/edd

SiCrane

11,840

March 25, 2012 11:36 PM

Off the top of my head, you could use your favorite Unicode library's mechanism to retrieve the combining class for the code point. Only combining marks have a non-zero combining class. For example: ICU's UCharacter.getCombiningClass().
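For anyone following along, here is a minimal sketch of that kind of lookup. It uses Python's standard unicodedata module instead of ICU (a swap made only to keep the example self-contained), and is_combining_mark is a hypothetical helper name. It also checks the General Category, because the combining class on its own looks too strict as a test: enclosing marks such as U+20DD are combining marks whose combining class is 0.

```python
import unicodedata


def combining_class(cp: int) -> int:
    """Canonical combining class of a code point (0 when it has none)."""
    return unicodedata.combining(chr(cp))


def is_combining_mark(cp: int) -> bool:
    """True if the General Category is Mn, Mc or Me.

    Hypothetical helper: it tests the category rather than the combining
    class, because some marks (e.g. enclosing marks) have class 0.
    """
    return unicodedata.category(chr(cp)) in ("Mn", "Mc", "Me")


if __name__ == "__main__":
    for cp in (0x0041, 0x0301, 0x20DD, 0x20F1):
        ch = chr(cp)
        print(f"U+{cp:04X}", unicodedata.category(ch),
              combining_class(cp), is_combining_mark(cp))
```

The results reflect whichever Unicode version the Python build ships with. A code point that is unassigned in that version is reported with category 'Cn' and combining class 0, so nothing is predicted about future allocations.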

the_edd

2,109

Author

March 26, 2012 12:44 AM

Right, but I want to go one layer deeper than that; I'm curious as to how something like UCharacter.getCombiningClass() would be implemented from scratch. How does one decide from 'first principles' whether or not e.g. U+20F1 should be counted as a combining character? The implication in the PDF is that it should be counted as such, but the U+20F1 code point is not mentioned in the database (as far as I can see) as it is unassigned.

In other words, what parts of the Unicode database should I look at to confirm that UCharacter.getCombiningClass() is correctly implemented? Could a script be written (in theory) that unambiguously checks getCombiningClass() against every single code point (including unassigned ones) given only the Unicode database files as input?
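One possible shape for such a script, as a sketch under some assumptions: UnicodeData.txt is a semicolon-separated file with the code point in field 0, the General Category in field 2 and the canonical combining class in field 3; large ranges appear as a pair of '<..., First>' and '<..., Last>' lines; and, per the default values described in UAX #44, any code point not listed has General Category 'Cn' and combining class 0. The helper names and the few sample lines are for illustration only. A real run would read the full file and pass in a wrapper around the library function being checked.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Default for code points that UnicodeData.txt does not list (assumption:
# gc=Cn, ccc=0, following the defaults described in UAX #44).
DEFAULT: Tuple[str, int] = ("Cn", 0)

# A few lines in the UnicodeData.txt layout; only fields 0-3 are used here.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
20DD;COMBINING ENCLOSING CIRCLE;Me;0;NSM;;;;;N;ENCLOSING CIRCLE;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""


def parse_unicode_data(lines: Iterable[str]) -> Dict[int, Tuple[str, int]]:
    """Map each listed code point to (general_category, combining_class)."""
    table: Dict[int, Tuple[str, int]] = {}
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, gc, ccc = fields[1], fields[2], int(fields[3])
        if name.endswith(", First>"):
            range_start = cp          # remember where the range begins
        elif name.endswith(", Last>"):
            for c in range(range_start, cp + 1):
                table[c] = (gc, ccc)  # expand the whole range
            range_start = None
        else:
            table[cp] = (gc, ccc)
    return table


def lookup(table: Dict[int, Tuple[str, int]], cp: int) -> Tuple[str, int]:
    """Properties for any code point, falling back to the default."""
    return table.get(cp, DEFAULT)


def mismatches(table: Dict[int, Tuple[str, int]],
               get_combining_class: Callable[[int], int]) -> List[int]:
    """Code points where the function under test disagrees with the table."""
    return [cp for cp in range(0x110000)
            if get_combining_class(cp) != lookup(table, cp)[1]]


if __name__ == "__main__":
    table = parse_unicode_data(SAMPLE.splitlines())
    print(lookup(table, 0x0301), lookup(table, 0x20F1))
```

Under that reading, the answer for U+20F1 is unambiguous for any given version of the database: it counts as unassigned ('Cn', class 0) until a later version lists it, whatever the block description suggests about future use.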

http://www.mr-edd.co.uk
http://bitbucket.org/edd

SiCrane

11,840

March 26, 2012 01:46 PM

I don't know if it's possible to write a function that can successfully predict whether future code points will be combining characters, but I don't think so. IIRC, there are still segments of the Unicode range that are reserved for potential future usage. If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range-based logic.
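To illustrate what a purely data-driven lookup can look like, here is a small sketch. It is not ICU's implementation (ICU has its own compact data structures); it is a simplified stand-in that collapses per-code-point values into runs and binary-searches them. The names build_runs and lookup_run are hypothetical. Unassigned code points just fall into runs carrying the default value, so no guess about future allocations is involved.

```python
import bisect
from typing import List, Sequence, Tuple


def build_runs(values: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Collapse per-code-point values into runs of equal value.

    Returns (starts, run_values): starts[i] is the first code point of
    run i and run_values[i] is the value shared by that run.
    """
    starts: List[int] = []
    run_values: List[int] = []
    for cp, value in enumerate(values):
        if not run_values or run_values[-1] != value:
            starts.append(cp)
            run_values.append(value)
    return starts, run_values


def lookup_run(starts: List[int], run_values: List[int], cp: int) -> int:
    """Find the run containing cp by binary search and return its value."""
    return run_values[bisect.bisect_right(starts, cp) - 1]


if __name__ == "__main__":
    import unicodedata

    # Fill the table from Python's bundled copy of the Unicode database.
    values = [unicodedata.combining(chr(cp)) for cp in range(0x110000)]
    starts, run_values = build_runs(values)
    print(len(starts), "runs;", "U+0301 ->",
          lookup_run(starts, run_values, 0x0301))
```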

the_edd

2,109

Author

March 26, 2012 09:41 PM

SiCrane wrote: "If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range based logic."

Right, that agrees with my reading. Thanks for having a look.

http://www.mr-edd.co.uk
http://bitbucket.org/edd

This topic is closed to new replies.
