Unicode: how to determine current/future combining character codepoint allocation?


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

4 replies to this topic

#1 e‍dd   Members   -  Reputation: 2105

Posted 25 March 2012 - 05:18 PM

The Unicode 6.0 standard, section 7.9, mentions a number of blocks containing combining marks. One of these blocks, 'Combining Marks for Symbols', runs from U+20D0 to U+20FF. There are a number of code points in that range which are not assigned (at least not in version 6.0). The implication seems to be that those unassigned code points are reserved for additional combining marks in future revisions. Is there any way of deducing this purely from the files in the Unicode database? I'd like to avoid having to trawl the Unicode PDF document for mentions of potential future code point allocations.


I see that the database labels combining characters for assigned code points with General Category values of 'Mn', 'Mc', or 'Me', but since unassigned code points aren't listed in the database it doesn't look like I can infer their intended future use. But maybe I've missed a trick somewhere?

To ask the question another way, what's a reasonably efficient way of determining if a code point is a combining mark? Should/could the algorithm include as-yet-unassigned code points in its domain?
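For reference, a minimal sketch of the General Category test described above, written in Python and assuming the standard `unicodedata` module (which reflects whatever Unicode version the interpreter ships with) is an acceptable stand-in for reading the database files directly. The helper name `is_combining_mark` is made up for illustration.

```python
import unicodedata

# General_Category values that identify combining marks.
MARK_CATEGORIES = {"Mn", "Mc", "Me"}

def is_combining_mark(code_point: int) -> bool:
    """Return True if the code point's General_Category is Mn, Mc, or Me."""
    return unicodedata.category(chr(code_point)) in MARK_CATEGORIES

if __name__ == "__main__":
    for cp in (0x0301, 0x20DD, 0x0041):
        print(f"U+{cp:04X}: {is_combining_mark(cp)}")
```

Unassigned code points are reported with category `Cn`, so this test covers the whole code point range but never classifies an unassigned code point as a mark.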

#2 SiCrane   Moderators   -  Reputation: 9675

Posted 25 March 2012 - 05:36 PM

Off the top of my head, you could use your favorite Unicode library's mechanism to retrieve the canonical combining class for the code point. Only combining marks have a non-zero combining class (though the converse doesn't hold: some marks, such as the enclosing marks, have class 0). Ex: ICU4J's UCharacter.getCombiningClass()
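A quick way to look at the combining class next to the General Category, sketched in Python with the standard `unicodedata` module rather than ICU (an assumption made purely for brevity; `describe` and `combining_class` are illustrative names):

```python
import unicodedata

def combining_class(code_point: int) -> int:
    """Canonical combining class as reported by Python's unicodedata."""
    return unicodedata.combining(chr(code_point))

def describe(code_point: int) -> str:
    """Format the General_Category and combining class of one code point."""
    ch = chr(code_point)
    return (f"U+{code_point:04X} gc={unicodedata.category(ch)} "
            f"ccc={unicodedata.combining(ch)}")

if __name__ == "__main__":
    # Two nonspacing marks, an enclosing mark, a spacing mark, and a letter.
    for cp in (0x0301, 0x20D0, 0x20DD, 0x0903, 0x0041):
        print(describe(cp))
```

Printing both properties side by side makes it easy to spot code points where the category says "mark" but the combining class is 0.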

#3 e‍dd   Members   -  Reputation: 2105

Posted 25 March 2012 - 06:44 PM

Right, but I want to go one layer deeper than that; I'm curious as to how something like UCharacter.getCombiningClass() would be implemented from scratch. How does one decide from 'first principles' whether or not e.g. U+20F1 should be counted as a combining character? The implication in the PDF is that it should be counted as such, but the U+20F1 code point is not mentioned in the database (as far as I can see) as it is unassigned.

In other words, what parts of the Unicode database should I look at to confirm that UCharacter.getCombiningClass() is correctly implemented? Could a script be written (in theory) that unambiguously checks getCombiningClass() against every single code point (including unassigned ones) given only the Unicode database files as input?
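One way such a checking script might look, as a rough Python sketch. It assumes the combining class is field 3 of each `UnicodeData.txt` line and that, as I understand UAX #44, any code point not listed there defaults to class 0. The embedded sample lines stand in for the real file, and `unicodedata.combining` stands in for the library function under test; all function names are hypothetical.

```python
import unicodedata

# A few lines in UnicodeData.txt format. Field 0 is the code point,
# field 1 the name, field 2 the General_Category, field 3 the combining class.
SAMPLE_UNICODE_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
20D0;COMBINING LEFT HARPOON ABOVE;Mn;230;NSM;;;;;N;NON-SPACING LEFT HARPOON ABOVE;;;;
20DD;COMBINING ENCLOSING CIRCLE;Me;0;NSM;;;;;N;ENCLOSING CIRCLE;;;;
"""

def parse_combining_classes(text):
    """Map code point -> non-zero combining class from UnicodeData.txt text."""
    table = {}
    range_start = None
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name, ccc = int(fields[0], 16), fields[1], int(fields[3])
        if name.endswith(", First>"):
            # Ranges are written as a First/Last pair of lines.
            range_start = cp
        elif name.endswith(", Last>"):
            if ccc:
                for c in range(range_start, cp + 1):
                    table[c] = ccc
            range_start = None
        elif ccc:
            table[cp] = ccc
    return table

def expected_combining_class(table, cp):
    """Unlisted (including unassigned) code points default to 0."""
    return table.get(cp, 0)

def find_mismatches(table, code_points, library_ccc):
    """Return the code points where the library disagrees with the table."""
    return [cp for cp in code_points
            if library_ccc(cp) != expected_combining_class(table, cp)]

if __name__ == "__main__":
    table = parse_combining_classes(SAMPLE_UNICODE_DATA)
    lookup = lambda cp: unicodedata.combining(chr(cp))
    print(find_mismatches(table, [0x41, 0x301, 0x20D0, 0x20DD], lookup))
```

With the full `UnicodeData.txt` for the same Unicode version as the library, `code_points` could be `range(0x110000)`. Under the default-value assumption, an unassigned code point such as U+20F1 is simply expected to report 0.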

#4 SiCrane   Moderators   -  Reputation: 9675

Posted 26 March 2012 - 07:46 AM

I don't know if it's possible to write a function that can successfully predict whether future code points will be combining characters, but I don't think so. IIRC, there are still segments of the Unicode range that are reserved for potential future usage. If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range-based logic.
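For contrast, here is roughly what range-based logic could look like if someone wanted it: a Python sketch that looks a code point up in `Blocks.txt`-style data and guesses from the block name. This is only a heuristic for illustration; nothing here is confirmed as something the database promises or that ICU does, and the function names and the sample excerpt are assumptions.

```python
import bisect

# Excerpt in Blocks.txt format: "start..end; block name".
SAMPLE_BLOCKS = """\
# Sample block ranges
0000..007F; Basic Latin
0300..036F; Combining Diacritical Marks
20D0..20FF; Combining Diacritical Marks for Symbols
"""

def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_range, name = line.split(";")
        start, end = code_range.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks

def block_name(blocks, cp):
    """Return the name of the block containing cp, or None."""
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None

def guess_reserved_for_combining(blocks, cp):
    """Heuristic guess only: is cp inside a block named 'Combining ...'?"""
    name = block_name(blocks, cp)
    return name is not None and "Combining" in name

if __name__ == "__main__":
    blocks = parse_blocks(SAMPLE_BLOCKS)
    print(block_name(blocks, 0x20F1), guess_reserved_for_combining(blocks, 0x20F1))
```

A block name is a hint about intent at best, so a guess like this says nothing definite about what a future version will assign to a given code point.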

#5 e‍dd   Members   -  Reputation: 2105

Posted 26 March 2012 - 03:41 PM

SiCrane wrote: "If I'm reading the code correctly, ICU's getCombiningClass() just grabs the information from the Unicode database and doesn't try to do any range based logic."

Right, that agrees with my reading. Thanks for having a look.



