Choosing the best way to go full UNICODE
Re: Choosing the best way to go full UNICODE [message #48223 is a reply to message #48222]
Tue, 06 June 2017 13:39
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
mirek wrote on Tue, 06 June 2017 14:21:
    cbpporter wrote on Tue, 06 June 2017 11:18:
        Characters are uniquely defined, and so are the canonical composition and decomposition rules, together with the compatibility substitutions.
    I suppose so. Still learning.
    Your lib is BSD? Available somewhere?

Oh yeah, the Unicode spec is huge. But it also recommends that you implement only as much as you need, not the whole thing.
The lib is a bit more complicated. It is Apache License 2.0, but not really released yet. More precisely, the more advanced Unicode parts only exist on my disks for now; they are not committed. But when they are committed, it will be under Apache. (How does future licensing work, anyway?)
But I can share that, with the caveat that the only parts I'm really working on that are meant for the final lib are the conversion and encoding of UnicodeData.txt. I'm not planning on composition handling for now, since the lib is not about GUI or displaying text. My encoding scheme handles four things: category, upper, lower and title case.
I would still like to add script information to that data, and you probably can't dodge adding canonical normalization support forever.
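For reference, extracting those four properties from one line of UnicodeData.txt could look like the minimal sketch below. The field layout follows the Unicode Character Database documentation (field 0 = codepoint, 2 = general category, 12-14 = simple upper/lower/titlecase mappings); the struct and function names are illustrative, not the library's actual code.

[code]
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// One record per codepoint, holding exactly the four encoded properties.
// A value of 0 in a case field means "the character maps to itself".
struct CharInfo {
    uint32_t    cp;
    std::string category;  // general category, e.g. "Lu", "Ll", "Lt"
    uint32_t    upper, lower, title;
};

static uint32_t ParseHex(const std::string& s)
{
    return s.empty() ? 0 : (uint32_t)std::stoul(s, nullptr, 16);
}

// Parses one semicolon-separated line of UnicodeData.txt.
bool ParseLine(const std::string& line, CharInfo& out)
{
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ';'))
        f.push_back(field);
    // getline drops a trailing empty titlecase field, so 14 fields suffice
    if (f.size() < 14)
        return false;
    auto fld = [&](size_t i) { return i < f.size() ? f[i] : std::string(); };
    out.cp       = ParseHex(f[0]);   // field 0: codepoint in hex
    out.category = f[2];             // field 2: general category
    out.upper    = ParseHex(fld(12)); // field 12: simple uppercase mapping
    out.lower    = ParseHex(fld(13)); // field 13: simple lowercase mapping
    out.title    = ParseHex(fld(14)); // field 14: simple titlecase mapping
    return true;
}
[/code]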
Re: Choosing the best way to go full UNICODE [message #48236 is a reply to message #48235]
Thu, 08 June 2017 10:43
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
mirek wrote on Thu, 08 June 2017 11:26:
    cbpporter wrote on Thu, 08 June 2017 10:00:
        My priority is CodeEditor right now, but right after that I do want to take care of canonical decomposition on my side too.
    What do you plan?

I guess you missed my saga: http://www.ultimatepp.org/forums/index.php?t=msg&th=9945&start=0&
I'm guessing I have about 3-4 hours more of work until I have a preview version done. Then I'll upload it there, include it in my daily builds and test it for a couple of weeks.

mirek wrote on Thu, 08 June 2017 11:26:
    Quote:
        I'll gather stats like what % of characters can be decomposed, and into how many characters on average, to come up with an optimal scheme. For Latin languages I expect a few hundred with at most 3 characters in the decomposition, with most having 2.
    That is my estimate too. I even think that 3 codepoints is so sparse that the basic table should only store 2 (which means a single base char + a single combining mark) - that will allow for a denser table - and 3-codepoint characters should be handled as exceptions.

That's why I'm gathering data, to make informed decisions. I'll get back to you with the stats.

mirek also wrote:
    How do you want to handle compatibility decomposition? Like: http://www.fileformat.info/info/unicode/char/0149/index.htm

My plan is to have a flag for compatibility decompositions vs. the normal ones, with it off by default. I'm not sure, but I think you can exclude them all if you don't want to bother with them. Unicode can be complicated.
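To make the 2-codepoint-table idea above concrete, here is a minimal sketch (my illustration, not code from either library): a dense table sorted by codepoint storing (precomposed -> base, mark) pairs, plus a small side list for the rare 3-codepoint decompositions. The names and the toy entries are assumptions for demonstration only.

[code]
#include <algorithm>
#include <cstdint>
#include <vector>

// Dense entry: one precomposed codepoint -> base char + combining mark.
struct Decomp2 { uint32_t cp, base, mark; };
// Exception entry for the rare 3-codepoint decompositions.
struct Decomp3 { uint32_t cp, c0, c1, c2; };

// In a real build these tables would be generated from UnicodeData.txt
// and sorted by cp; a few hand-written entries serve as a demo here.
static const std::vector<Decomp2> decomp2 = {
    { 0x00C1, 0x0041, 0x0301 }, // Á -> A + COMBINING ACUTE ACCENT
    { 0x00E9, 0x0065, 0x0301 }, // é -> e + COMBINING ACUTE ACCENT
};
static const std::vector<Decomp3> decomp3 = {
    { 0x1E1C, 0x0045, 0x0327, 0x0306 }, // Ḝ -> E + cedilla + breve (full NFD)
};

// Writes the canonical decomposition of cp into out[0..2] and returns its
// length; returns 0 when cp has no decomposition (i.e. f(cp) = cp).
int CanonDecompose(uint32_t cp, uint32_t out[3])
{
    auto it = std::lower_bound(decomp2.begin(), decomp2.end(), cp,
                  [](const Decomp2& d, uint32_t c) { return d.cp < c; });
    if (it != decomp2.end() && it->cp == cp) {
        out[0] = it->base;
        out[1] = it->mark;
        return 2;
    }
    auto jt = std::lower_bound(decomp3.begin(), decomp3.end(), cp,
                  [](const Decomp3& d, uint32_t c) { return d.cp < c; });
    if (jt != decomp3.end() && jt->cp == cp) {
        out[0] = jt->c0; out[1] = jt->c1; out[2] = jt->c2;
        return 3;
    }
    return 0;
}
[/code]

Binary search keeps lookups cheap, and the exception list stays tiny if the stats being gathered above confirm that 3-codepoint decompositions really are rare.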
Re: Choosing the best way to go full UNICODE [message #48238 is a reply to message #48191]
Thu, 08 June 2017 13:00
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
Not quite 100% done analyzing the data, but here is what I think I'll do:
- Respect the 3-plane convention. Unicode has 17 planes, with the first 3 in active use. Plane 14 is used, but it is very specialized and only has 368 allocated code points. It is so specialized that I'll exclude it too, the same as I do all planes except planes 0-2. All excluded planes have the property that every mapping function is the identity: f(cp) = cp.
- I'll ignore all special substitutions: sub- and superscript, font, circle, square, fractions and, of course, compatibility substitutions. I won't be using a flag for now; I'll just exclude them.
- I'll ignore all CJK COMPATIBILITY IDEOGRAPHs. There is no way a general-purpose library can provide a satisfactory use case for these. If you really need such substitutions, you will probably use a more capable third-party library. f(CJK COMPATIBILITY IDEOGRAPH) = CJK COMPATIBILITY IDEOGRAPH.
All of this, combined with my two-table solution with a chunk size of 256 to 1024 (see the sketch below), will leave me with around 8000-9000 bytes of data in each executable that does decomposition. Final numbers will be determined once the implementation is done and round-trip testing is complete.
I think this is a reasonable subset that can handle NFD, at the small price of a flat 9K in exe size, plus the size of the actual methods.
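A minimal sketch of a two-table (chunked) lookup of the kind described above, with illustrative names and toy data; the real stage tables would be generated offline from UnicodeData.txt, and the size win comes from many chunks sharing the same (mostly identity) data block.

[code]
#include <cstdint>

constexpr uint32_t CHUNK   = 256;      // chunk size; the post suggests 256 to 1024
constexpr uint32_t MAX_CP  = 0x30000;  // planes 0-2 only, per the exclusion rule
constexpr uint32_t NCHUNKS = MAX_CP / CHUNK;

// Stage 2: de-duplicated data blocks. Block 0 is the all-zero "identity"
// block that every uninteresting chunk shares - that sharing is where the
// size savings come from. Two toy blocks for the demo.
static const uint16_t chunk_data[2][CHUNK] = {
    { 0 },  // block 0: identity everywhere
    { 0 },  // block 1: would hold real per-codepoint values
};

// Stage 1: one byte per 256-codepoint chunk, selecting its data block.
// All zeros in the demo, i.e. everything maps to the identity block.
static const uint8_t chunk_index[NCHUNKS] = { 0 };

uint16_t LookupProperty(uint32_t cp)
{
    if (cp >= MAX_CP)   // excluded planes: f(cp) = cp, no table space spent
        return 0;
    return chunk_data[chunk_index[cp / CHUNK]][cp % CHUNK];
}
[/code]

With these assumed sizes, stage 1 costs 768 index bytes for planes 0-2 and each distinct 256-entry block costs 512 bytes, so on the order of 16 distinct blocks would land right in the 8000-9000 byte range quoted above.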
Re: Choosing the best way to go full UNICODE [message #48243 is a reply to message #48238]
Sun, 11 June 2017 13:57
mirek
Messages: 14257  Registered: November 2005
Ultimate Member
cbpporter wrote on Thu, 08 June 2017 13:00:
    [full stats post quoted above]
I have managed to squeeze the complete composition data into a 3.8 KB table...
Interesting observation: with UnicodeCompose / UnicodeDecompose in place and the first 2048 codepoints covered by a "fast table", there is no need for any further tables for ToUpper, ToLower or ToAscii.
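A sketch of why that observation could hold, with toy data standing in for the generated tables: for a codepoint above the fast-table range, decompose it, case-map the base (which lands back inside the fast table), then recompose. The helper names and signatures here are my assumptions, not the actual U++ API.

[code]
#include <cstdint>

// Everything below this limit gets a direct table lookup in the real
// scheme; a trivial ASCII-only stand-in serves here.
constexpr uint32_t FAST_LIMIT = 2048;

static uint32_t FastUpper(uint32_t cp)
{
    return cp >= 'a' && cp <= 'z' ? cp - 'a' + 'A' : cp;
}

// Toy pair table standing in for the generated compose/decompose data:
// U+1E80/U+1E81 (W/w WITH GRAVE) decompose to W/w + COMBINING GRAVE ACCENT.
struct Pair { uint32_t cp, base, mark; };
static const Pair pairs[] = {
    { 0x1E80, 0x0057, 0x0300 },
    { 0x1E81, 0x0077, 0x0300 },
};

static bool Decompose(uint32_t cp, uint32_t& base, uint32_t& mark)
{
    for (const Pair& p : pairs)
        if (p.cp == cp) { base = p.base; mark = p.mark; return true; }
    return false;
}

static uint32_t Compose(uint32_t base, uint32_t mark)
{
    for (const Pair& p : pairs)
        if (p.base == base && p.mark == mark)
            return p.cp;
    return base; // no precomposed form; a real caller would keep the mark
}

// ToUpper above the fast-table range: decompose, case the base (which is
// almost always < 2048), recompose. The 2048-entry table is thus the only
// case table needed; ToUpperCp(0x1E81) == 0x1E80 with the toy data.
uint32_t ToUpperCp(uint32_t cp)
{
    if (cp < FAST_LIMIT)
        return FastUpper(cp);
    uint32_t base, mark;
    if (Decompose(cp, base, mark))
        return Compose(ToUpperCp(base), mark);
    return cp; // no decomposition and above the table: maps to itself
}
[/code]

ToLower would fall out the same way, and ToAscii could simply keep the base and drop the mark instead of recomposing.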