Unicode: how to determine current/future combining character codepoint allocation?

3 comments, last by the_edd 12 years, 1 month ago
The Unicode 6.0 standard, section 7.9, mentions a number of blocks containing combining marks. One of these blocks, 'Combining Diacritical Marks for Symbols', runs from U+20D0 to U+20FF. A number of code points in that range are not assigned (at least not in version 6.0). The implication seems to be that those unassigned code points are reserved for additional combining marks in future revisions. Is there any way of deducing this purely from the files in the Unicode database? I'd like to avoid having to trawl the Unicode PDF for mentions of potential future code point allocations.


I see that the database labels combining characters for assigned code points with General Category values of 'Mn', 'Mc', or 'Me', but since unassigned code points aren't listed in the database it doesn't look like I can infer their intended future use. But maybe I've missed a trick somewhere?

To ask the question another way, what's a reasonably efficient way of determining if a code point is a combining mark? Should/could the algorithm include as-yet-unassigned code points in its domain?
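
For reference, the General Category test described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the unicodedata module bundled with the interpreter (which reflects whatever Unicode version that Python build ships, not necessarily 6.0); the name is_combining_mark is made up for this example.

```python
import unicodedata

# General Category values that the Unicode database uses for combining marks.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def is_combining_mark(cp: int) -> bool:
    """Return True if the code point's General Category is Mn, Mc, or Me.

    Hypothetical helper: it relies on the interpreter's bundled Unicode
    database, which reports unassigned code points as 'Cn'.
    """
    return unicodedata.category(chr(cp)) in MARK_CATEGORIES

if __name__ == "__main__":
    # U+20F1 is the unassigned code point discussed in this thread; what it
    # prints depends on the Unicode version of the bundled database.
    for cp in (0x0041, 0x0301, 0x20DD, 0x20F1):
        print(f"U+{cp:04X} {unicodedata.category(chr(cp))} {is_combining_mark(cp)}")
```

Because the lookup is a single table query per code point, it is cheap; whether unassigned code points belong in the function's domain is a separate question, and this sketch simply reports them as non-marks.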
Off the top of my head, you could use your favorite Unicode library's mechanism to retrieve the combining class for the code point. Only combining marks have a non-zero combining class. For example: ICU's UCharacter.getCombiningClass().
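
A small sketch of the same idea using Python's standard unicodedata module rather than ICU (an assumption made purely for illustration; describe is a made-up helper name). It also shows a caveat worth keeping in mind: a non-zero combining class implies a combining mark, but some marks, such as the enclosing marks, have class 0, so the combining class on its own does not identify every mark.

```python
import unicodedata

def describe(cp: int) -> tuple[str, int]:
    """Return (General Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)

if __name__ == "__main__":
    # U+0301 COMBINING ACUTE ACCENT has a non-zero class; U+20DD COMBINING
    # ENCLOSING CIRCLE is a mark (Me) whose class is 0.
    for cp in (0x0041, 0x0301, 0x20DD):
        gc, ccc = describe(cp)
        print(f"U+{cp:04X} gc={gc} ccc={ccc}")
```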
Right, but I want to go one layer deeper than that; I'm curious as to how something like UCharacter.getCombiningClass() would be implemented from scratch. How does one decide from 'first principles' whether or not e.g. U+20F1 should be counted as a combining character? The implication in the PDF is that it should be counted as such, but the U+20F1 code point is not mentioned in the database (as far as I can see) as it is unassigned.

In other words, what parts of the Unicode database should I look at to confirm that UCharacter.getCombiningClass() is correctly implemented? Could a script be written (in theory) that unambiguously checks getCombiningClass() against every single code point (including unassigned ones) given only the Unicode database files as input?
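
To make the "check every single code point" idea concrete, here is a rough sketch of such a comparison loop. Everything in it is an assumption for illustration: find_mismatches is a made-up name, the candidate stands in for something like getCombiningClass(), and the reference would be whatever function you derive from the database files.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF

def find_mismatches(candidate, reference):
    """Compare two combining-class functions over the whole code space.

    Both arguments are callables taking a one-character string and returning
    an int. Returns a list of (code point, expected, actual) tuples.
    """
    mismatches = []
    for cp in range(MAX_CODE_POINT + 1):
        ch = chr(cp)  # includes surrogates and unassigned code points
        expected = reference(ch)
        actual = candidate(ch)
        if expected != actual:
            mismatches.append((cp, expected, actual))
    return mismatches

if __name__ == "__main__":
    # A deliberately broken candidate that claims every class is 0.
    broken = lambda ch: 0
    bad = find_mismatches(broken, unicodedata.combining)
    print(f"{len(bad)} code points disagree with the reference")
```

The loop itself is trivial; the open part of the question is what the reference should return for code points the database does not list.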
I doubt it's possible to write a function that reliably predicts whether future code points will be combining characters. IIRC, there are still segments of the Unicode range that are reserved for potential future usage. If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range-based logic.
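
A from-scratch lookup along those lines might look like the sketch below. It assumes the semicolon-separated layout of UnicodeData.txt (code point, name, General Category, combining class, ...) and the convention, as I understand it from the database documentation (UAX #44), that code points not listed in the file take default values: General Category Cn and combining class 0. The SAMPLE lines are a hand-picked excerpt in that format, and the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed defaults for code points that UnicodeData.txt does not list.
DEFAULT = ("Cn", 0)

# A few lines in the UnicodeData.txt format, for illustration only.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
20D0;COMBINING LEFT HARPOON ABOVE;Mn;230;NSM;;;;;N;NON-SPACING LEFT HARPOON ABOVE;;;;
20DD;COMBINING ENCLOSING CIRCLE;Me;0;NSM;;;;;N;ENCLOSING CIRCLE;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCB;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""

def parse_unicode_data(lines):
    """Return a sorted list of (first, last, category, combining_class)."""
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, gc, ccc = fields[1], fields[2], int(fields[3])
        if name.endswith(", First>"):
            range_start = cp  # remember the start of a First/Last pair
        elif name.endswith(", Last>"):
            ranges.append((range_start, cp, gc, ccc))
            range_start = None
        else:
            ranges.append((cp, cp, gc, ccc))
    ranges.sort()
    return ranges

def lookup(ranges, cp):
    """Return (category, combining_class), using DEFAULT if cp is unlisted."""
    # (cp, 0x110000) sorts after every entry whose first code point is <= cp.
    i = bisect_right(ranges, (cp, 0x110000)) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2], ranges[i][3]
    return DEFAULT

if __name__ == "__main__":
    table = parse_unicode_data(SAMPLE.splitlines())
    for cp in (0x0301, 0x20DD, 0x20F1, 0x4E01):
        print(f"U+{cp:04X}", lookup(table, cp))
```

Under that assumption there is no range-based guessing: an unassigned code point such as U+20F1 simply reports the defaults until a later version of the database assigns it.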

"If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range-based logic."

Right, that agrees with my reading. Thanks for having a look.

