Choosing the best way to go full UNICODE
Re: Choosing the best way to go full UNICODE [message #48248 is a reply to message #48247] - cbpporter, Mon, 12 June 2017 10:21
If you use an up-to-date UnicodeData or the one I uploaded and you go through the entire file, you should have 100% coverage of all 120k Unicode codepoints. But do take care to cover the gaps.

I was thinking about calling compress too, but I'm not sure. That's why I came up with the table scheme. Even if I manage to compress it massively, I don't want three planes of Unicode eating up a ton of RAM in a "dumb" decompressed massive memory layout.
Re: Choosing the best way to go full UNICODE [message #48249 is a reply to message #48248] - mirek, Mon, 12 June 2017 10:28
It does not need to eat up RAM. I am using Indexes after decompression.

Frankly, I still need to calculate how much memory it takes, but I think it will be about 100K.
Re: Choosing the best way to go full UNICODE [message #48250 is a reply to message #48249] - mirek, Mon, 12 June 2017 10:31
120K of Index data. Entirely acceptable for me.
Re: Choosing the best way to go full UNICODE [message #48251 is a reply to message #48189] - cbpporter, Mon, 12 June 2017 10:53
For how many code points do you have upper/lower support?
Re: Choosing the best way to go full UNICODE [message #48252 is a reply to message #48251] - mirek, Mon, 12 June 2017 10:57
It is all work in progress; at this moment I only have the composition/decomposition.

However, as I have written before, if you cover the first 2048 codepoints with a separate 'fast' table (which I plan to do anyway), it is possible to implement lowercase/uppercase just by decomposing, altering the first codepoint (using the 'fast' table) and then recomposing. That seems to work for all codepoints > 2048.
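
A minimal sketch of that decompose/alter/recompose idea in plain C++; fast_upper, Decompose and Recompose are assumed placeholders here, not the actual U++ API:

    #include <cstdint>
    #include <vector>

    // Hypothetical fast table: lowercase -> uppercase map for the first
    // 2048 codepoints (filled from UnicodeData.txt elsewhere).
    extern uint32_t fast_upper[2048];

    // Assumed helpers, standing in for the real routines:
    std::vector<uint32_t> Decompose(uint32_t cp);                       // canonical decomposition
    uint32_t              Recompose(const std::vector<uint32_t>& cps);  // canonical recomposition

    uint32_t ToUpperSlow(uint32_t cp)
    {
        if (cp < 2048)
            return fast_upper[cp];      // direct hit in the fast table
        std::vector<uint32_t> d = Decompose(cp);
        if (d.empty())
            return cp;                  // no decomposition, no case mapping known
        if (d[0] < 2048)
            d[0] = fast_upper[d[0]];    // case-map the base character only
        return Recompose(d);            // put the marks back on
    }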
Re: Choosing the best way to go full UNICODE [message #48253 is a reply to message #48252] - cbpporter, Mon, 12 June 2017 11:37
Interesting things in uppbox.

I might be able to reduce my 130 KiB UnicodeData.

But dammit, this is not what I'm supposed to be doing right now! You managed to sidetrack me :).
Re: Choosing the best way to go full UNICODE [message #48254 is a reply to message #48253] - mirek, Mon, 12 June 2017 11:41
cbpporter wrote on Mon, 12 June 2017 11:37
Interesting things in uppbox.

I might be able to reduce my 130 KiB UnicodeData.

But dammit, this is not what I'm supposed to be doing right now! You managed to sidetrack me :).


...or you can wait for me to finish... :)
Re: Choosing the best way to go full UNICODE [message #48257 is a reply to message #48254] - cbpporter, Mon, 12 June 2017 12:50
No, I'll do it too. Reducing the 130 KiB will be welcome but really not a priority. If I had time, I would do it right now.

But the CodeEditor rework is a bust for now. There is no easy way to map a CodeEditor to all the CSyntax objects that are created and retroactively update them. Plus, I represent syntax in expensive-to-copy structures, so I need to rework that. But scheduling is not on my side, so for some time more I'll continue using the CodeEditor fork.

Plus, I spent most of the day today first isolating Pdb from ide/Debuggers and then from IDE. It almost compiles, but I have one major problem left:

I can't find AK_ADDWATCH.

I searched all of U++ and WinSDK and I couldn't find where it is defined.
Re: Choosing the best way to go full UNICODE [message #48258 is a reply to message #48257] - cbpporter, Mon, 12 June 2017 13:06
Found it!

Shenanigans again with those include-file tricks you like :).

Anyway, it compiles and links. Crashes on start even though nothing is called. But I'll get it to work!
Re: Choosing the best way to go full UNICODE [message #48259 is a reply to message #48258] - cbpporter, Mon, 12 June 2017 14:20
Back to composition: I tested out the three-table method. I can't get it under 24000 bytes. I need to represent 2116 code points, and that alone is almost 17000 bytes; the rest is index data.

But seeing that the case data is only 7189 * 8 bytes, yet is currently represented as 130 KiB, I guess it would be better to touch that up and leave composition as is. There must be a better way to compactly represent 7189 code points out of 200k. That is super sparse.
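
For reference, one standard way to store such a sparse mapping compactly is a two-stage (block) index; the sizes in this sketch are illustrative, not the poster's actual layout:

    #include <cstdint>

    // Two-stage lookup: codepoints are grouped into 256-entry blocks.
    // 'stage1' maps a block number to a block id; all blocks with no case
    // data share block 0, so only the populated blocks cost memory.
    extern const uint16_t stage1[0x110000 / 256];  // block number -> block id
    extern const uint32_t stage2[][256];           // block id -> per-codepoint data

    uint32_t CaseData(uint32_t cp)
    {
        if (cp >= 0x110000)
            return 0;
        return stage2[stage1[cp >> 8]][cp & 0xFF]; // 0 = no case mapping
    }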
Re: Choosing the best way to go full UNICODE [message #48272 is a reply to message #48259] - cbpporter, Tue, 13 June 2017 16:31
Well, this was quite frankly not necessary and a huge waste of time, but I managed to get my Unicode data down from 130 KiB to 68 KiB. It includes 3 planes with character type, upper, lower and title case. I guess the nonexistent users of my library will be happy :).

I should probably have gone with your compressed scheme, but I'm stubborn. We'll see what the future holds, since only now am I getting to writing the decomposition API.

[Updated on: Tue, 13 June 2017 17:44]

Re: Choosing the best way to go full UNICODE [message #48277 is a reply to message #48272] - cbpporter, Wed, 14 June 2017 11:07
Why the hell am I worrying about the size of executables? It is not like hello world isn't already 400 KiB under TDM :).

Anyway, Mirek, please let me know when the new support is in Core in a nightly build.

There is pretty much only one way to convert from UTF-8 to UTF-16 and co., so comparing codes is a pretty good method of spotting errors.

Thank you!
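
For reference, the conversion mentioned above is essentially mechanical; a simplified UTF-8 to UTF-16 decoder might look like this (it assumes well-formed input; real code must reject overlong, truncated and surrogate sequences):

    #include <cstdint>
    #include <string>

    std::u16string Utf8ToUtf16(const std::string& s)
    {
        std::u16string out;
        for (size_t i = 0; i < s.size(); ) {
            uint32_t cp;
            unsigned char c = s[i];
            if (c < 0x80)      { cp = c; i += 1; }
            else if (c < 0xE0) { cp = (c & 0x1F) << 6 | (s[i+1] & 0x3F); i += 2; }
            else if (c < 0xF0) { cp = (c & 0x0F) << 12 | (s[i+1] & 0x3F) << 6
                                    | (s[i+2] & 0x3F); i += 3; }
            else               { cp = (c & 0x07) << 18 | (s[i+1] & 0x3F) << 12
                                    | (s[i+2] & 0x3F) << 6 | (s[i+3] & 0x3F); i += 4; }
            if (cp < 0x10000)
                out += char16_t(cp);
            else {                                   // encode as surrogate pair
                cp -= 0x10000;
                out += char16_t(0xD800 + (cp >> 10));
                out += char16_t(0xDC00 + (cp & 0x3FF));
            }
        }
        return out;
    }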
Re: Choosing the best way to go full UNICODE [message #48280 is a reply to message #48277] - cbpporter, Wed, 14 June 2017 12:07
It looks like decomposition is not as simple as I first thought. Take a look at:
http://www.fileformat.info/info/unicode/char/1f80/index.htm

This character decomposes into a composed character and a mark. So I guess decomposition needs to be recursive.
Re: Choosing the best way to go full UNICODE [message #48281 is a reply to message #48277] - mirek, Wed, 14 June 2017 12:17
cbpporter wrote on Wed, 14 June 2017 11:07
Why the hell am I worrying about the size of executables? It is not like hello world isn't already 400 KiB under TDM :).


Ahem, I still do, to an extent. Surely 10 KB is nothing, but 500 KB would be too much - just for Unicode support that 99% of users are never going to use.

Quote:

Anyway, Mirek, please let me know when the new support is in Core in a nightly build.

There is pretty much only one way to convert from UTF-8 to UTF-16 and co., so comparing codes is a pretty good method of spotting errors.

Thank you!


What I have implemented is already in trunk Core (and that means in nightly).

A lot is missing. I want to update the existing 'fast' tables - there is some new information that I would like to have, like

- IsRTL
- IsWide
- IsSymbol
- IsControl

etc...
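
Such 'fast' tables are typically one flag byte per codepoint queried with bit tests; a hedged sketch follows, with invented flag names and layout rather than U++'s actual tables:

    #include <cstdint>

    // Illustrative property flags packed into one byte per codepoint.
    enum : uint8_t {
        UPROP_RTL     = 0x01,
        UPROP_WIDE    = 0x02,
        UPROP_SYMBOL  = 0x04,
        UPROP_CONTROL = 0x08,
    };

    extern const uint8_t fast_props[2048];   // flags for the first 2048 codepoints

    // Hypothetical fallback into the compressed tables for higher codepoints.
    uint8_t SlowProps(uint32_t cp);

    inline uint8_t Props(uint32_t cp)
    {
        return cp < 2048 ? fast_props[cp]    // one array access in the common case
                         : SlowProps(cp);    // rare path: decompressed index lookup
    }

    inline bool IsRTL(uint32_t cp)     { return Props(cp) & UPROP_RTL; }
    inline bool IsWide(uint32_t cp)    { return Props(cp) & UPROP_WIDE; }
    inline bool IsSymbol(uint32_t cp)  { return Props(cp) & UPROP_SYMBOL; }
    inline bool IsControl(uint32_t cp) { return Props(cp) & UPROP_CONTROL; }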
Re: Choosing the best way to go full UNICODE [message #48282 is a reply to message #48280] - mirek, Wed, 14 June 2017 12:30
Good find. What do you suggest doing about that?

- I can leave the current code as is and perhaps add a "FullDecompose" variant.
- I can decompose it to 3 codepoints outright.

I think I like the second option better...
Re: Choosing the best way to go full UNICODE [message #48283 is a reply to message #48282] - cbpporter, Wed, 14 June 2017 12:42
mirek wrote on Wed, 14 June 2017 13:30
Good find. What do you suggest doing about that?

- I can leave the current code as is and perhaps add a "FullDecompose" variant.
- I can decompose it to 3 codepoints outright.

I think I like the second option better...


Here is what I am implementing right now:
1. A Decompose method. You give it a code point and it gives you the raw UnicodeData.txt data. That problematic character will still give you two results. This is already done.

But in practice I doubt it will ever be used, so much so that it barely qualifies as a public method. Instead, everybody will use...

2. A ToNFD() method. NFD is the canonical decomposition. The problematic character will result in 3 code points. This will be the main public method.

So my main method will be the method you prefer, giving 3 results. I just gave it the Unicode name.
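
A sketch of the recursion ToNFD needs, with RawDecompose standing in for the single-level UnicodeData.txt lookup of method 1 (an assumed helper, not either poster's actual code):

    #include <cstdint>
    #include <vector>

    // Assumed lookup: returns the raw (single-level) canonical decomposition
    // from UnicodeData.txt, or an empty vector if there is none.
    std::vector<uint32_t> RawDecompose(uint32_t cp);

    // NFD-style recursion: keep decomposing until every codepoint is atomic.
    // (Full NFD also requires canonical reordering of combining marks by
    // combining class, omitted here.)
    static void DecomposeInto(uint32_t cp, std::vector<uint32_t>& out)
    {
        std::vector<uint32_t> d = RawDecompose(cp);
        if (d.empty()) {
            out.push_back(cp);       // atomic: no further decomposition
            return;
        }
        for (uint32_t c : d)
            DecomposeInto(c, out);   // recurse: U+1F80 -> U+03B1 U+0313 U+0345
    }

    std::vector<uint32_t> ToNFD(uint32_t cp)
    {
        std::vector<uint32_t> out;
        DecomposeInto(cp, out);
        return out;
    }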
Re: Choosing the best way to go full UNICODE [message #48286 is a reply to message #48280] - mirek, Wed, 14 June 2017 19:09
Even more fun is:

U+3311

That expands to two katakana characters and a mark, and both katakana characters further expand to a base katakana and a mark.

All in all it expands to 5 codepoints.

The trouble is that recompose must account for all combinations. I guess it will have to be multipass...
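
The multipass recompose described above could look roughly like this; ComposePair is an assumed pair-composition lookup, and the real canonical composition algorithm must additionally respect combining classes and blocking:

    #include <cstdint>
    #include <vector>

    // Assumed lookup: canonical composition of a pair, or 0 if none exists.
    uint32_t ComposePair(uint32_t a, uint32_t b);

    // Naive multipass recomposition: repeatedly merge adjacent pairs until
    // a full pass makes no progress.
    void Recompose(std::vector<uint32_t>& cps)
    {
        bool changed = true;
        while (changed) {
            changed = false;
            for (size_t i = 0; i + 1 < cps.size(); i++) {
                uint32_t c = ComposePair(cps[i], cps[i + 1]);
                if (c) {
                    cps[i] = c;
                    cps.erase(cps.begin() + i + 1);
                    changed = true;    // composed something, try another pass
                }
            }
        }
    }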
Re: Choosing the best way to go full UNICODE [message #48287 is a reply to message #48286] - cbpporter, Wed, 14 June 2017 23:19
I hope the NFD algorithm chapter on unicode.org covers that: the hows and whys.

The whole thing is pretty crazy though. I'll have excellent Unicode support eventually, but there is no way to get it under 100 KiB of data.

But I did experiment with Zlib, and all the tables can be squashed down a ton, except the case table, which only goes down to 50%. The only problem is that I don't have any Zlib support in my library yet. Plus, I would like to add conditional compilation.

You inspired me with the plugin system: I would like uncompressed data if the z plugin is absent, otherwise automatic compression. I verified that the exe size growth due to zlib is outweighed by the table compression. Deflate is pretty small.
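
The decompress-on-first-use pattern with zlib's uncompress() is straightforward; in this sketch the embedded compressed array and its sizes are placeholders:

    #include <vector>
    #include <zlib.h>

    // Hypothetical embedded, deflate-compressed case table and its sizes;
    // in a real build these would be generated from UnicodeData.txt.
    extern const unsigned char case_table_z[];
    extern const unsigned long case_table_z_len;   // compressed size
    extern const unsigned long case_table_len;     // uncompressed size

    // Decompress on first use and keep the result cached.
    const unsigned char *CaseTable()
    {
        static std::vector<unsigned char> table;
        if (table.empty()) {
            table.resize(case_table_len);
            uLongf dlen = case_table_len;
            if (uncompress(table.data(), &dlen, case_table_z, case_table_z_len) != Z_OK)
                table.assign(case_table_len, 0);   // fail soft: zeroed table
        }
        return table.data();
    }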
Re: Choosing the best way to go full UNICODE [message #48288 is a reply to message #48287] - cbpporter, Wed, 14 June 2017 23:31
PS: that is a special composition. I went over the data again and again and found no good reason to handle box decomposition.

It is not like U++ will check whether the font supports that character and, if not, decompose it and build the CJK glyph in a small box on the fly.

Decompositions that start with <smth> are all special: <font> means you can decompose the character if you are doing font substitution to an approximation; <square> means the code point is multiple characters arranged in a square; <fraction> means you have a fraction like 1/2 as a single code point and you can decompose it into 3 code points.

I chose to ignore all of these for now, since I can't figure out how to offer any worthwhile feature related to these special substitutions. I don't even need normal decompositions, but it is pretty cool to decompose diacritics and replace some bits, since I'm a European and my native language uses diacritics.

As for NFC and NFD, I only found two good use cases: string equality and search. With the normalization forms you don't compare code points, but glyphs, without building glyphs. If two strings look the same on your display but have different code points due to diacritics, it is very useful to be able to tell whether they are visually identical or not. Basically, I want "ț" encoded as a precomposed character and "ț" encoded as a "t" with a combining mark to be identified as the same string.
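
That equality check reduces to normalizing both sides to NFD before comparing; a sketch building on the per-codepoint ToNFD shown earlier (NormalizeNFD is an assumed helper):

    #include <cstdint>
    #include <vector>

    // Per-codepoint NFD from the earlier sketch.
    std::vector<uint32_t> ToNFD(uint32_t cp);

    std::vector<uint32_t> NormalizeNFD(const std::vector<uint32_t>& s)
    {
        std::vector<uint32_t> out;
        for (uint32_t cp : s) {
            std::vector<uint32_t> d = ToNFD(cp);
            out.insert(out.end(), d.begin(), d.end());
        }
        return out;
    }

    // U+021B (precomposed "t with comma below") and U+0074 U+0326
    // ("t" + combining comma below) normalize to the same sequence.
    bool CanonicallyEqual(const std::vector<uint32_t>& a,
                          const std::vector<uint32_t>& b)
    {
        return NormalizeNFD(a) == NormalizeNFD(b);
    }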
Re: Choosing the best way to go full UNICODE [message #48302 is a reply to message #48288] - mirek, Mon, 19 June 2017 10:03
My understanding is that if the decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.

I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' these into a single codepoint - one of the reasons is that canonical compositions are unique, but multiple codepoints can share the same compatibility decomposition (found that out the hard way during testing).

In either case, I have added a bool flag:

int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool& canonical);

to the 'decompose' API, and Compose now does not use noncanonical decompositions.

I believe that my "Unicode INFO" code is now complete. In the end, it is about 12KB of data (6KB compressed and 6KB of 'fast tables' for the first 2048 codepoints).

Documentation needs updating. Then the next part would be updating / deprecating those ToLower/ToUpper routines for Strings, and most importantly, implementing "apparent character logic".
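
Given the declared signature, a call site would presumably look like the following; the dword typedef and the value of MAX_DECOMPOSED are assumptions here, only the signature comes from the post:

    // Sketch of calling the new API.
    using dword = unsigned int;          // U++ defines dword; assumed 32-bit here
    const int MAX_DECOMPOSED = 8;        // placeholder, not the real value

    int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool& canonical);

    void Example()
    {
        dword t[MAX_DECOMPOSED];
        bool canonical;
        int n = UnicodeDecompose(0x1F80, t, canonical);
        // t[0..n-1] now holds the decomposition; per the post, it can be used
        // for comparing either way, but recompose only when 'canonical' is true.
        (void)n;
    }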