Re: Choosing the best way to go full UNICODE [message #48272 is a reply to message #48259]
Tue, 13 June 2017 16:31
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
Well, this was quite frankly not necessary and a huge waste of time, but I managed to get my Unicode data down from 130K to 68K. It covers 3 planes, with character type and upper, lower and title case mappings. I guess the nonexistent users of my library will be happy.
I should probably have gone with your compressed scheme, but I'm stubborn. We'll see what the future holds, since only now am I getting to writing the decomposition API.
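For illustration, a common way to get per-codepoint property data down to this kind of size is a two-stage (trie-like) table, where 256-codepoint blocks are deduplicated and shared through an index table. A minimal C++ sketch of that idea; the block size and names are illustrative, not the actual scheme used here:

#include <cstdint>
#include <map>
#include <vector>

// Two-stage lookup: the flat property array is split into 256-codepoint
// blocks, identical blocks are stored once, and stage1 maps each block of
// the codepoint range to its shared copy. Many blocks repeat (unassigned
// ranges, uniform scripts), which is where the size reduction comes from.
struct TwoStageTable {
    std::vector<uint16_t> stage1;  // one index per 256-codepoint block
    std::vector<uint8_t>  stage2;  // unique 256-entry blocks, concatenated

    void Build(const std::vector<uint8_t>& flat) {  // flat.size() multiple of 256
        std::map<std::vector<uint8_t>, uint16_t> seen;
        for(size_t b = 0; b + 256 <= flat.size(); b += 256) {
            std::vector<uint8_t> block(flat.begin() + b, flat.begin() + b + 256);
            auto it = seen.find(block);
            if(it == seen.end()) {
                it = seen.emplace(block, uint16_t(stage2.size() / 256)).first;
                stage2.insert(stage2.end(), block.begin(), block.end());
            }
            stage1.push_back(it->second);
        }
    }
    uint8_t Get(uint32_t cp) const {
        return stage2[stage1[cp >> 8] * 256 + (cp & 0xFF)];
    }
};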
[Updated on: Tue, 13 June 2017 17:44]
Re: Choosing the best way to go full UNICODE [message #48288 is a reply to message #48287]
Wed, 14 June 2017 23:31
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
PS: that is a special composition. I went over the data again and again and found no good reason to handle box decomposition.
It is not like U++ will check whether the font supports that character and, if not, decompose it and build the CJK characters in a small box on the fly.
Decompositions that start with <smth> are all special: <font> means you can decompose that character if you are doing font substitution to an approximation; <square> means the code point is multiple characters arranged in a square; <fraction> means you have something like ½ as a single code point and can decompose it into 1/2, using 3 code points.
I chose to ignore all of these for now, since I can't figure out how to offer any worthwhile feature related to these special substitutions. I don't even need the normal decompositions, but it is pretty cool to decompose diacritics and replace some bits, since I'm a European and my native language uses diacritics.
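For reference, these tags live in field 5 of UnicodeData.txt; U+00BD (½), for example, carries "<fraction> 0031 2044 0032". A minimal C++ sketch of reading that field (a hypothetical helper, not part of any library discussed here):

#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parsed form of UnicodeData.txt field 5, e.g. "<fraction> 0031 2044 0032".
// A leading <tag> marks a compatibility decomposition; no tag means the
// decomposition is canonical.
struct Decomposition {
    std::string tag;                      // empty => canonical
    std::vector<uint32_t> codepoints;
};

Decomposition ParseDecomposition(const std::string& field)
{
    Decomposition d;
    std::istringstream ss(field);
    std::string item;
    while(ss >> item) {
        if(item.size() >= 2 && item.front() == '<' && item.back() == '>')
            d.tag = item.substr(1, item.size() - 2);   // "<fraction>" -> "fraction"
        else
            d.codepoints.push_back(uint32_t(std::stoul(item, nullptr, 16)));
    }
    return d;
}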
As for NFC and NFD, I only found two good use cases: string equality and search. With the normalization forms, you effectively compare glyphs rather than code points, without actually building glyphs. If two strings look the same on your display but have different code points due to diacritics, it is very useful to be able to tell that they are visually identical. Basically, I want "ț" encoded as a precomposed character and "ț" encoded as a "t" with a combining mark to be identified as the same string.
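A minimal sketch of that equality check in C++, with a deliberately tiny decomposition table covering just the "ț" example (U+021B canonically decomposes to U+0074 U+0326); real code would use the full table and also apply canonical reordering of combining marks:

#include <cstdint>
#include <vector>

// Tiny illustrative canonical decomposition: only U+021B is mapped here.
static std::vector<uint32_t> Decompose(uint32_t cp)
{
    if(cp == 0x021B)                   // LATIN SMALL LETTER T WITH COMMA BELOW
        return { 0x0074, 0x0326 };     // "t" + COMBINING COMMA BELOW
    return { cp };
}

static std::vector<uint32_t> Decompose(const std::vector<uint32_t>& s)
{
    std::vector<uint32_t> out;
    for(uint32_t cp : s)
        for(uint32_t d : Decompose(cp))
            out.push_back(d);
    return out;
}

// Precomposed { 0x021B } and decomposed { 0x0074, 0x0326 } compare equal.
bool CanonicallyEqual(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b)
{
    return Decompose(a) == Decompose(b);
}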
Re: Choosing the best way to go full UNICODE [message #48302 is a reply to message #48288]
Mon, 19 June 2017 10:03
mirek
Messages: 14257  Registered: November 2005
Ultimate Member
My understanding is that if a decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.
I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' them into a single codepoint. One of the reasons is that canonical compositions are unique, but multiple codepoints can share the same compatibility decomposition (found that out the hard way during testing).
In any case, I have added a bool parameter
int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool& canonical);
to the 'decompose' API, and Compose is now not using noncanonical decompositions.
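A usage sketch of that API, assuming the int return value is the number of codepoints written to t[] (not confirmed in this thread):

#include <Core/Core.h>
#include <cstdio>
using namespace Upp;

CONSOLE_APP_MAIN
{
    dword t[MAX_DECOMPOSED];
    bool canonical;
    // U+00BD (vulgar fraction one half) has the compatibility decomposition
    // <fraction> 0031 2044 0032, so 'canonical' should come back false and,
    // per the note above, the result should never be recomposed.
    int n = UnicodeDecompose(0x00BD, t, canonical);
    for(int i = 0; i < n; i++)
        std::printf("U+%04X ", (unsigned)t[i]);
    std::printf("\ncanonical: %s\n", canonical ? "yes" : "no");
}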
I believe that my "Unicode INFO" code is now complete. In the end, it is about 12KB of data (6KB compressed and 6KB of 'fast tables' for the first 2048 codepoints).
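The 'fast tables' part presumably means a directly indexed hot path for codepoints below 2048 (covering ASCII, Latin, Greek, Cyrillic...), with everything else falling through to the compressed data. A sketch of that split, with illustrative names only, not U++'s actual ones:

#include <cstdint>

extern const uint8_t fast_info[2048];     // generated table, one entry per codepoint
uint8_t LookupCompressed(uint32_t cp);    // hypothetical slow path into compressed data

inline uint8_t GetInfo(uint32_t cp)
{
    return cp < 2048 ? fast_info[cp]      // O(1) for the common scripts
                     : LookupCompressed(cp);
}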
Documentation needs updating. Then the next part would be updating / deprecating those ToLower/ToUpper routines for Strings, and most importantly, implementing "apparent character logic".