Re: Choosing the best way to go full UNICODE [message #48191 is a reply to message #48189] Wed, 31 May 2017 15:06
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
OK, let's see it in action.

I still think that MString is a textbook case of the Unicode indexability fallacy, but maybe I'm wrong.

And if it works, maybe it is fine.

Please let me know if you need some help for the basic stuff. I can also review Unicode conformity of algorithms.

Do you want to go for full UTF-8 minimal valid sequence validation and overlong prevention?
Code Points        First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F     00..7F
U+0080..U+07FF     C2..DF     80..BF
U+0800..U+0FFF     E0         A0..BF      80..BF
U+1000..U+CFFF     E1..EC     80..BF      80..BF
U+D000..U+D7FF     ED         80..9F      80..BF
U+E000..U+FFFF     EE..EF     80..BF      80..BF
U+10000..U+3FFFF   F0         90..BF      80..BF     80..BF
U+40000..U+FFFFF   F1..F3     80..BF      80..BF     80..BF
U+100000..U+10FFFF F4         80..8F      80..BF     80..BF


With the rest error escaped?
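
The table above can be turned into a check almost line by line. The following is only a minimal sketch; `Utf8SequenceLength` is a hypothetical name chosen for illustration and is not a U++ function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the length (1..4) of the well-formed UTF-8 sequence starting at s,
// or 0 if the bytes do not match any row of the table.
// Hypothetical helper for illustration only, not part of U++.
inline int Utf8SequenceLength(const uint8_t *s, size_t avail)
{
    if(avail == 0)
        return 0;
    uint8_t b0 = s[0];
    if(b0 <= 0x7F)
        return 1;
    if(b0 < 0xC2 || b0 > 0xF4) // stray continuation byte, overlong lead C0/C1, or lead above F4
        return 0;
    int len = b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
    if(avail < (size_t)len)
        return 0;
    // The allowed range of the second byte depends on the lead byte.
    uint8_t lo = 0x80, hi = 0xBF;
    switch(b0) {
    case 0xE0: lo = 0xA0; break; // excludes overlong 3-byte forms
    case 0xED: hi = 0x9F; break; // excludes surrogates U+D800..U+DFFF
    case 0xF0: lo = 0x90; break; // excludes overlong 4-byte forms
    case 0xF4: hi = 0x8F; break; // excludes values above U+10FFFF
    }
    if(s[1] < lo || s[1] > hi)
        return 0;
    for(int i = 2; i < len; i++)
        if(s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}
```

A return value of 0 is the point where a caller would apply whatever error escaping it chooses.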
Re: Choosing the best way to go full UNICODE [message #48192 is a reply to message #48191] Wed, 31 May 2017 15:25
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Not sure about that, but what I know for sure is that I want to put decoding in a single template (unlike the current charset.cpp) so that it can be fixed easily...
Re: Choosing the best way to go full UNICODE [message #48193 is a reply to message #48192] Wed, 31 May 2017 15:34
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
OK, after thinking about it, overlong should be error escaped.

I think that the basic routine should produce some error flag if it does error escape.

I also think that (maybe behind a flag) I would like the basic encoding extended to the full 32 bits (with an error flag). The reason is that it could be handy outside character use (e.g. storing relative offsets of something).
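
A decoding routine that reports an error flag while error-escaping could look roughly like the sketch below. The escape convention used here (each offending byte becomes the code point 0xEE00 + byte) and the names `Utf8Decoded` and `DecodeUtf8Escaping` are assumptions made for illustration; the thread does not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Utf8Decoded {
    std::vector<uint32_t> text; // decoded code points, including escaped bytes
    bool had_error = false;     // true if at least one byte was error-escaped
};

// Hypothetical sketch: decodes UTF-8, escaping every byte that does not start
// a well-formed sequence as 0xEE00 + byte (assumed convention).
inline Utf8Decoded DecodeUtf8Escaping(const std::string& in)
{
    // Smallest code point that legitimately needs a sequence of the given length.
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    Utf8Decoded r;
    size_t i = 0;
    while(i < in.size()) {
        uint8_t b0 = (uint8_t)in[i];
        int len = b0 < 0x80 ? 1
                : (b0 & 0xE0) == 0xC0 ? 2
                : (b0 & 0xF0) == 0xE0 ? 3
                : (b0 & 0xF8) == 0xF0 ? 4 : 0;
        bool ok = len != 0 && i + len <= in.size();
        uint32_t cp = len == 1 ? b0 : b0 & (0xFF >> (len + 1));
        for(int k = 1; ok && k < len; k++) {
            uint8_t b = (uint8_t)in[i + k];
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, values above U+10FFFF and surrogates.
        ok = ok && cp >= min_cp[len] && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
        if(ok) {
            r.text.push_back(cp);
            i += len;
        }
        else {
            r.text.push_back(0xEE00 + b0);
            r.had_error = true;
            i++;
        }
    }
    return r;
}
```

Here the overlong check is done on the decoded value rather than on byte ranges; both approaches reject the same inputs.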
Re: Choosing the best way to go full UNICODE [message #48194 is a reply to message #48193] Wed, 31 May 2017 15:38
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Wed, 31 May 2017 16:34

I also think that (maybe behind a flag) I would like the basic encoding extended to the full 32 bits (with an error flag). The reason is that it could be handy outside character use (e.g. storing relative offsets of something).

Sorry, I do not understand what you mean here...
Re: Choosing the best way to go full UNICODE [message #48195 is a reply to message #48194] Wed, 31 May 2017 15:50
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Ah, it is not really Unicode related. But sometimes you have a set of 32-bit (or 31-bit) numbers where you know that most of them are <128 (but some are not) and you want to store them efficiently. So I was thinking that it would be nice to reuse the very same code for this...
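
As an illustration of that idea, the original UTF-8 bit layout extends naturally to sequences of up to 6 bytes, which is enough for 31-bit values. The sketch below assumes that layout; `PutVar` and `GetVar` are hypothetical names, and the decoder deliberately does not reject overlong forms, since for plain numbers they are harmless.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v (at most 31 bits) using the UTF-8 bit layout extended to 6 bytes.
inline void PutVar(std::string& out, uint32_t v)
{
    assert(v <= 0x7FFFFFFF);
    if(v < 0x80) {
        out += (char)v;
        return;
    }
    int len = v < 0x800 ? 2 : v < 0x10000 ? 3 : v < 0x200000 ? 4 : v < 0x4000000 ? 5 : 6;
    // Lead byte: 'len' high bits set, a zero bit, then the top payload bits.
    out += (char)(((0xFF00 >> len) & 0xFF) | (v >> (6 * (len - 1))));
    for(int k = len - 2; k >= 0; k--)
        out += (char)(0x80 | ((v >> (6 * k)) & 0x3F));
}

// Reads one value at pos and advances pos; returns false on malformed input.
inline bool GetVar(const std::string& in, size_t& pos, uint32_t& v)
{
    if(pos >= in.size())
        return false;
    uint8_t b0 = (uint8_t)in[pos];
    int ones = 0; // number of leading 1 bits in the lead byte
    while(ones < 8 && (b0 & (0x80 >> ones)))
        ones++;
    if(ones == 1 || ones > 6) // continuation byte or 0xFE/0xFF
        return false;
    int len = ones == 0 ? 1 : ones;
    if(pos + len > in.size())
        return false;
    uint32_t r = ones == 0 ? b0 : b0 & (0x7F >> ones);
    for(int k = 1; k < len; k++) {
        uint8_t b = (uint8_t)in[pos + k];
        if((b & 0xC0) != 0x80)
            return false;
        r = (r << 6) | (b & 0x3F);
    }
    v = r;
    pos += len;
    return true;
}
```

Values below 128 cost one byte each, which is the case the post is optimizing for.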
Re: Choosing the best way to go full UNICODE [message #48208 is a reply to message #48195] Mon, 05 June 2017 17:51
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Utf[8,16,32] <=> Utf[8,16,32] conversion routines committed...

Any tips for good unicode classification data (in single file)?
Re: Choosing the best way to go full UNICODE [message #48214 is a reply to message #48208] Tue, 06 June 2017 09:28
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
You mean Unicode General Category?

I have that encoded in my two table scheme.

If that is all you want to include, it should be extremely small, something like (number of characters / chunk size) * 1.2.

If you mean scripts, that is a bit more complex, mostly because I want the data small.

I'm also working on Unicode names. It turns out that there are under 200 unique words in the names. I want to use a huge string buffer where each byte in the string is a word, not a character.
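
The post does not spell out the two table scheme. One common reading is a generic two-stage lookup: the code point range is cut into fixed-size chunks, identical chunks are stored once, and a small index maps each chunk to its stored copy. The sketch below shows that generic technique only; `TwoStageTable` and `BuildTwoStage` are hypothetical names and this is not cbpporter's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct TwoStageTable {
    size_t chunk = 256;
    std::vector<uint16_t> index;  // stage 1: one entry per chunk of code points
    std::vector<uint8_t>  blocks; // stage 2: unique chunks, concatenated

    // Returns the stored property, or 0 for code points outside the table.
    uint8_t Get(uint32_t cp) const {
        size_t c = cp / chunk;
        if(c >= index.size())
            return 0;
        return blocks[index[c] * chunk + cp % chunk];
    }
};

// Builds the two stages from a flat per-code-point array, sharing equal chunks.
inline TwoStageTable BuildTwoStage(const std::vector<uint8_t>& flat, size_t chunk)
{
    TwoStageTable t;
    t.chunk = chunk;
    std::map<std::vector<uint8_t>, uint16_t> seen; // chunk contents -> block number
    for(size_t pos = 0; pos < flat.size(); pos += chunk) {
        std::vector<uint8_t> block(chunk, 0);
        for(size_t k = 0; k < chunk && pos + k < flat.size(); k++)
            block[k] = flat[pos + k];
        auto it = seen.find(block);
        if(it == seen.end()) {
            it = seen.emplace(block, (uint16_t)seen.size()).first;
            t.blocks.insert(t.blocks.end(), block.begin(), block.end());
        }
        t.index.push_back(it->second);
    }
    return t;
}
```

The saving comes from large runs of code points that share the same property values, so many chunks collapse into a single stored block.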
Re: Choosing the best way to go full UNICODE [message #48219 is a reply to message #48214] Tue, 06 June 2017 10:41
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Tue, 06 June 2017 09:28
You mean Unicode General Category?


I need to make some sense of it all... :)

Not yet sure what exactly I will need, but for now I am pretty sure I would like to have the info that e.g. Č is a character based on C with the ˇ combining character, that C is uppercase, and that there is a corresponding lowercase c.

Now I can see that I can have

"Latin Capital Letter C with caron"

which is probably OK, but I am not sure it is without ambiguities.

Re: Choosing the best way to go full UNICODE [message #48220 is a reply to message #48219] Tue, 06 June 2017 11:18
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
There is no easy solution here I'm afraid.

You probably know the file:
http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

What I do is read that file, compile all the information in RAM and write out C++ tables.

The file has a lot, but not all, of the needed information. The question is how much of it you need and how you are going to store it.

I don't understand the final point you were making about ambiguities? Characters are uniquely defined, so are the canonical composition and decomposition rules, together with compatibility substitutions.
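
To make the "read that file" step concrete: each line of UnicodeData.txt holds 15 semicolon-separated fields, with the code point, name, general category, decomposition mapping and the simple upper/lower/title mappings among them. Below is a minimal parsing sketch; `UcdRecord`, `SplitFields`, `HexOr` and `ParseUcdLine` are hypothetical names, and the code is not taken from the extractor discussed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct UcdRecord {
    uint32_t    code = 0;
    std::string name;
    std::string category;      // General Category, e.g. "Lu"
    std::string decomposition; // raw field, e.g. "0043 030C"
    uint32_t    upper = 0, lower = 0, title = 0;
};

// Splits one line on ';', keeping empty fields.
inline std::vector<std::string> SplitFields(const std::string& line)
{
    std::vector<std::string> f(1);
    for(char c : line) {
        if(c == ';')
            f.emplace_back();
        else
            f.back() += c;
    }
    return f;
}

// Parses a hex field; an empty field yields the fallback value.
// std::stoul throws on non-hex text, which this sketch does not handle.
inline uint32_t HexOr(const std::string& s, uint32_t fallback)
{
    return s.empty() ? fallback : (uint32_t)std::stoul(s, nullptr, 16);
}

inline bool ParseUcdLine(const std::string& line, UcdRecord& r)
{
    std::vector<std::string> f = SplitFields(line);
    if(f.size() != 15 || f[0].empty())
        return false;
    r.code = HexOr(f[0], 0);
    r.name = f[1];
    r.category = f[2];
    r.decomposition = f[5];
    // An empty case mapping means the character maps to itself.
    r.upper = HexOr(f[12], r.code);
    r.lower = HexOr(f[13], r.code);
    r.title = HexOr(f[14], r.code);
    return true;
}
```

For the Č example from earlier in the thread, the relevant line is `010C;LATIN CAPITAL LETTER C WITH CARON;Lu;0;L;0043 030C;;;;N;LATIN CAPITAL LETTER C HACEK;;;010D;`, which carries the base letter, the combining caron and the lowercase mapping in one record.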
Re: Choosing the best way to go full UNICODE [message #48222 is a reply to message #48220] Tue, 06 June 2017 13:21
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Tue, 06 June 2017 11:18

Characters are uniquely defined, so are the canonical composition and decomposition rules, together with compatibility substitutions.


I suppose so. Still learning.

Your lib is BSD? Available somewhere?
Re: Choosing the best way to go full UNICODE [message #48223 is a reply to message #48222] Tue, 06 June 2017 13:39
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Tue, 06 June 2017 14:21
cbpporter wrote on Tue, 06 June 2017 11:18

Characters are uniquely defined, so are the canonical composition and decomposition rules, together with compatibility substitutions.


I suppose so. Still learning.

Your lib is BSD? Available somewhere?

Oh yeah, the Unicode spec is huge. But it also recommends that you implement only as much as you need, not the whole thing.

The lib is a bit more complicated. It is Apache Version 2.0, but not really released yet. More precisely, the more advanced Unicode parts so far exist only on my disks; they are not committed. But when they are committed, it will be under Apache. How does future licensing work? :)

But I can share that, with the only caveat that the only things I'm really doing, and that are meant for the final lib, are conversion and encoding of UnicodeData.txt. I'm not planning on composition handling for now since the lib is not about GUI or displaying text. My encoding scheme handles 4 things: category, upper, lower and title case.

I would still like to add script to that data, and you probably can't dodge adding canonical normalization support forever.
Re: Choosing the best way to go full UNICODE [message #48224 is a reply to message #48223] Tue, 06 June 2017 13:58
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
As an example, here is my extractor.

It is part of the workbench projects: small, poorly written fire-and-forget programs, so expect it to be messy. Super messy and not maintained.

This one generates two tables which are then used by the real code to power the functionality for 3 Unicode planes.
  • Attachment: udb.zip
    (Size: 223.72KB, Downloaded 81 times)
Re: Choosing the best way to go full UNICODE [message #48234 is a reply to message #48224] Thu, 08 June 2017 10:00
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
My priority is CodeEditor right now, but right after that I want to take care of canonical decomposition on my side too.

I'll gather stats like what % of characters can be decomposed and into how many characters on average to come up with an optimal scheme. For Latin languages I expect a few hundred with at most 3 characters in decomposition, with most having 2.
Re: Choosing the best way to go full UNICODE [message #48235 is a reply to message #48234] Thu, 08 June 2017 10:26
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Thu, 08 June 2017 10:00
My priority is CodeEditor right now, but right after that I want to take care of canonical decomposition on my side too.


What do you plan?

Quote:

I'll gather stats like what % of characters can be decomposed and into how many characters on average to come up with an optimal scheme. For Latin languages I expect a few hundred with at most 3 characters in decomposition, with most having 2.


That is my estimate too. I even think that 3-codepoint decompositions are so sparse that the basic table should only store 2 codepoints (which means a single base char + a single combining mark) - that will allow for a denser table, and 3-codepoint characters should be handled as exceptions.
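
One way to keep the table at two code points per entry is sketched below. Note that it swaps the explicit exception list for a different technique: it applies the pair table recursively, which works whenever the first element of a pair is itself a composed character (as with U+01FA, whose mapping in UnicodeData.txt is 00C5 0301). The names `DecompPair`, `FindPair` and `Decompose` and the three-entry sample table are illustrative assumptions, not U++ code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

struct DecompPair {
    uint32_t cp, first, second;
};

// Tiny hand-picked sample, sorted by cp; a real table would be generated.
static const DecompPair kPairs[] = {
    { 0x00C5, 0x0041, 0x030A }, // A with ring above = A + combining ring above
    { 0x010C, 0x0043, 0x030C }, // C with caron = C + combining caron
    { 0x01FA, 0x00C5, 0x0301 }, // A with ring above and acute = U+00C5 + combining acute
};

// Binary search for the entry of cp; returns nullptr if cp has no pair.
inline const DecompPair *FindPair(uint32_t cp)
{
    const DecompPair *end = std::end(kPairs);
    const DecompPair *it = std::lower_bound(std::begin(kPairs), end, cp,
        [](const DecompPair& p, uint32_t c) { return p.cp < c; });
    return it != end && it->cp == cp ? it : nullptr;
}

// Appends the full decomposition of cp to out.
inline void Decompose(uint32_t cp, std::vector<uint32_t>& out)
{
    if(const DecompPair *p = FindPair(cp)) {
        Decompose(p->first, out); // the first element may itself be composed
        out.push_back(p->second);
    }
    else
        out.push_back(cp);
}
```

With this layout a three-code-point result costs two lookups instead of a wider table row.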
Re: Choosing the best way to go full UNICODE [message #48236 is a reply to message #48235] Thu, 08 June 2017 10:43
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Thu, 08 June 2017 11:26
cbpporter wrote on Thu, 08 June 2017 10:00
My priority is CodeEditor right now, but right after that I want to take care of canonical decomposition on my side too.


What do you plan?


I guess you missed my saga: http://www.ultimatepp.org/forums/index.php?t=msg&th=9945&start=0&

I'm guessing I have about 3-4 hours more of work and I'll have a preview version done. Then I'll upload it there and include it into my daily builds and test it for a couple of weeks.

mirek wrote on Thu, 08 June 2017 11:26

Quote:

I'll gather stats like what % of characters can be decomposed and into how many characters on average to come up with an optimal scheme. For Latin languages I expect a few hundred with at most 3 characters in decomposition, with most having 2.


That is my estimate too. I even think that 3-codepoint decompositions are so sparse that the basic table should only store 2 codepoints (which means a single base char + a single combining mark) - that will allow for a denser table, and 3-codepoint characters should be handled as exceptions.

That's why I'm gathering data to make informed decisions. I'll get back to you with the stats.

How do you want to handle compatibility decomposition? Like: http://www.fileformat.info/info/unicode/char/0149/index.htm

My plan is to have a flag for compatibility decompositions vs normal ones, with it being off by default. I'm not sure, but I think you can exclude them all if you don't want to bother with them. Unicode can be complicated.
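
In UnicodeData.txt a compatibility decomposition is marked by a tag in angle brackets at the start of the decomposition field; for U+0149 the field is `<compat> 02BC 006E`, while canonical mappings such as `0043 030C` carry no tag. A flag like the one described can therefore be derived while parsing. The sketch below uses the hypothetical names `Decomposition`, `ParseDecomposition` and `UseDecomposition`.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct Decomposition {
    bool                  compat = false; // true if the field started with a <tag>
    std::string           tag;            // tag name without the brackets, e.g. "compat"
    std::vector<uint32_t> cps;            // decomposed code points
};

// Parses the decomposition field of a UnicodeData.txt record.
// Assumes well-formed input: an optional <tag> followed by hex code points.
inline Decomposition ParseDecomposition(const std::string& field)
{
    Decomposition d;
    std::istringstream in(field);
    std::string tok;
    while(in >> tok) {
        if(tok.front() == '<') {
            d.compat = true;
            d.tag = tok.substr(1, tok.size() - 2);
        }
        else
            d.cps.push_back((uint32_t)std::stoul(tok, nullptr, 16));
    }
    return d;
}

// Compatibility mappings are ignored unless explicitly allowed (off by default).
inline bool UseDecomposition(const Decomposition& d, bool allow_compat = false)
{
    return !d.cps.empty() && (allow_compat || !d.compat);
}
```

Excluding compatibility mappings entirely then amounts to never passing `allow_compat = true`.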
Re: Choosing the best way to go full UNICODE [message #48237 is a reply to message #48236] Thu, 08 June 2017 11:22
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
Here are the primary stats:
- There are only 5721 characters that have a decomposition.
- 2060 of them have a normal decomposition. All of these are two characters long, so representing them is easy. Unfortunately the highest CP is 2FA1D, but that is CJK compatibility, which I think should be ignored. Without those, the highest codepoint is 1D1C0. And if you ignore a bunch of stuff like Hebrew, hiragana and musical notation, I think one could stop even as low as the aptly named character: http://www.fileformat.info/info/unicode/char/2adc/index.htm

Going lower than 0x2adc will soon cut off stuff like Greek. You need 0x2000 to not cut off Greek.

- The rest of the decompositions are 2 to 4 characters long, but there are some weird exceptions, like an 18-character one.
- If you stop at 0x2000/Greek, the max CP in a decomposition is 8190. If you stop at 0x2adc, it is 12297.

So a dumb scheme for the first 0x2adc characters or a bit higher would be 43,888 bytes (0x2adc = 10972 entries × 4 bytes). In my lib I already have (256 * 256 * PLANES / BS) * 2 + count2 * 8 = 131456 bytes of data used by uppercase/lowercase, so even 43K more is pushing it. I'll investigate how to sparsely represent both the first 0x2adc characters with exactly 2-character-long decompositions and the entire Unicode range.

Re: Choosing the best way to go full UNICODE [message #48238 is a reply to message #48191] Thu, 08 June 2017 13:00
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
Not quite 100% done analyzing the data, but here is what I think I'll do:
- respect the 3 plane convention. Unicode has 17 planes, with the first 3 in active use. Plane 14 is used, but it is specific and only has 368 allocated code points. It is so specific that I'll exclude it, the same as I do all planes except planes 0-2. All excluded planes have the property that for any function f, f(cp) = cp.
- I'll ignore all special substitutions: sub and superscript, font, circle, square, fractions and of course compatibility substitutions. I won't be using a flag for now, just exclude them.
- I'll ignore all CJK COMPATIBILITY IDEOGRAPHs. There is no way a general purpose library can handle these satisfactorily. If you really need such substitutions, you will probably use a more competent third party library. f(CJK COMPATIBILITY IDEOGRAPH) = CJK COMPATIBILITY IDEOGRAPH

All these combined with my two table solution, with a chunk size of 256 to 1024 will leave me with around 8000-9000 bytes of data in each executable that does decomposition. Final numbers will be determined once implementation is done and round trip testing is complete.

I think this is a reasonable subset that can handle NFD, at the small price of a flat 9K in exe size, + the size of the actual methods.
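
The plane rule from the list above is cheap to express in code: the plane of a code point is `cp >> 16`, and anything above plane 2 is returned unchanged. A minimal sketch, with `Plane`, `kMaxPlane` and `MapWithPlaneLimit` as hypothetical names:

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kMaxPlane = 2; // planes 0..2 are backed by tables

// Each plane covers 0x10000 code points.
constexpr uint32_t Plane(uint32_t cp)
{
    return cp >> 16;
}

// Applies table_lookup only inside the supported planes; elsewhere f(cp) = cp.
template <class F>
uint32_t MapWithPlaneLimit(uint32_t cp, F table_lookup)
{
    return Plane(cp) > kMaxPlane ? cp : table_lookup(cp);
}
```

The lookup callable stands in for whichever table-backed function (upper, lower, decomposition) is being wrapped.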

Re: Choosing the best way to go full UNICODE [message #48243 is a reply to message #48238] Sun, 11 June 2017 13:57
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
I have managed to squeeze complete composition into a 3.8KB table... :)

Interesting observation: With UnicodeCompose / Decompose, and with the first 2048 codepoints covered by a "fast table", there is no need for further tables for ToUpper, ToLower, ToAscii.
Re: Choosing the best way to go full UNICODE [message #48246 is a reply to message #48243] Mon, 12 June 2017 09:39
cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
What? How? Need to check it out.

I made a mistake in estimating how much data I actually need, and the 9K went up to 34K. So I started working on a 3 table solution :).
Re: Choosing the best way to go full UNICODE [message #48247 is a reply to message #48246] Mon, 12 June 2017 10:13
mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Really trivial. 4 Vectors of dwords (original code, 3 decomposed codes), then delta-encode them (that will result in most values being the same), then ZCompress... (that will RLE and Huffman those same values).
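
The delta step can be sketched as follows. This shows only the delta encoding and its inverse on a single vector; the compression pass is left out, and `DeltaEncode` and `DeltaDecode` are hypothetical names, not the functions used in U++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replaces each value by its difference from the previous one.
// Unsigned arithmetic wraps modulo 2^32, so decreasing sequences round-trip too.
inline std::vector<uint32_t> DeltaEncode(const std::vector<uint32_t>& v)
{
    std::vector<uint32_t> out(v.size());
    uint32_t prev = 0;
    for(size_t i = 0; i < v.size(); i++) {
        out[i] = v[i] - prev;
        prev = v[i];
    }
    return out;
}

// Inverse of DeltaEncode: a running sum restores the original values.
inline std::vector<uint32_t> DeltaDecode(const std::vector<uint32_t>& d)
{
    std::vector<uint32_t> out(d.size());
    uint32_t acc = 0;
    for(size_t i = 0; i < d.size(); i++) {
        acc += d[i];
        out[i] = acc;
    }
    return out;
}
```

For example, the consecutive codes 0x00C0..0x00C3 become 0xC0, 1, 1, 1, and their shared base letter 0x41 becomes 0x41, 0, 0, 0: long runs of identical small values, which is what a general-purpose compressor handles well.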

That said, I am only doing decomposition of characters in UnicodeData.txt.

Also, I perhaps need to add this

https://en.wikipedia.org/wiki/Korean_language_and_computers#Hangul_in_Unicode

too...
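
Hangul syllables are not listed individually with decompositions in UnicodeData.txt because they decompose arithmetically. The sketch below follows the standard Hangul syllable decomposition arithmetic from the Unicode Standard; `DecomposeHangul` is a hypothetical name and this is not the code that was eventually added to U++.

```cpp
#include <cassert>
#include <cstdint>

// Constants of the Unicode Hangul syllable decomposition algorithm.
constexpr uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;
constexpr uint32_t NCount = VCount * TCount; // 588
constexpr uint32_t SCount = LCount * NCount; // 11172

// Writes 2 or 3 jamo to out and returns their count,
// or returns 0 if cp is not a precomposed Hangul syllable.
inline int DecomposeHangul(uint32_t cp, uint32_t out[3])
{
    if(cp < SBase || cp >= SBase + SCount)
        return 0;
    uint32_t s = cp - SBase;
    out[0] = LBase + s / NCount;            // leading consonant
    out[1] = VBase + (s % NCount) / TCount; // vowel
    uint32_t t = s % TCount;                // trailing consonant index, 0 = none
    if(t == 0)
        return 2;
    out[2] = TBase + t;
    return 3;
}
```

Since the mapping is pure arithmetic, it adds no table data at all.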