Re: Choosing the best way to go full UNICODE [message #48272 is a reply to message #48259]
Tue, 13 June 2017 16:31
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
Well, this was quite frankly not necessary and a huge waste of time, but I managed to get my Unicode data down from 130K to 68K. It covers 3 planes, with character type and upper, lower and title case mappings. I guess the nonexistent users of my library will be happy.
I should probably have gone with your compressed scheme, but I'm stubborn. We'll see what the future holds, since only now am I getting to writing the decomposition API.
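For illustration, a common way to get per-codepoint property data down to this kind of size is a two-stage (trie-like) table, where 256-codepoint blocks are deduplicated and shared through an index table. A minimal C++ sketch of that idea; the block size and names are illustrative, not the actual scheme used here:

#include <cstdint>
#include <map>
#include <vector>

// Two-stage lookup: the flat property array is split into 256-codepoint
// blocks, identical blocks are stored once, and stage1 maps each block of
// the codepoint range to its shared copy. Many blocks repeat (unassigned
// ranges, uniform scripts), which is where the size reduction comes from.
struct TwoStageTable {
    std::vector<uint16_t> stage1;  // one index per 256-codepoint block
    std::vector<uint8_t>  stage2;  // unique 256-entry blocks, concatenated

    void Build(const std::vector<uint8_t>& flat) {  // flat.size() multiple of 256
        std::map<std::vector<uint8_t>, uint16_t> seen;
        for(size_t b = 0; b + 256 <= flat.size(); b += 256) {
            std::vector<uint8_t> block(flat.begin() + b, flat.begin() + b + 256);
            auto it = seen.find(block);
            if(it == seen.end()) {
                it = seen.emplace(block, uint16_t(stage2.size() / 256)).first;
                stage2.insert(stage2.end(), block.begin(), block.end());
            }
            stage1.push_back(it->second);
        }
    }
    uint8_t Get(uint32_t cp) const {
        return stage2[stage1[cp >> 8] * 256 + (cp & 0xFF)];
    }
};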
[Updated on: Tue, 13 June 2017 17:44]
Re: Choosing the best way to go full UNICODE [message #48288 is a reply to message #48287]
Wed, 14 June 2017 23:31
cbpporter
Messages: 1427  Registered: September 2007
Ultimate Contributor
PS: that is a special composition. I went over the data again and again and found no good reason to handle box decomposition.
It is not like U++ will check whether the font supports that character and, if not, decompose it and build the CJK characters in a small box on the fly.
Decompositions that start with <smth> are all special: <font> means you can decompose that character if you are doing font substitution to an approximation; <square> means the code point is multiple characters arranged in a square; <fraction> means you have something like ½ as a single code point and can decompose it into 1/2, using 3 code points.
I chose to ignore all of these for now, since I can't figure out how to offer any worthwhile feature related to these special substitutions. I don't even need the normal decompositions, but it is pretty cool to decompose diacritics and replace some bits, since I'm a European and my native language uses diacritics.
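For reference, these tags live in field 5 of UnicodeData.txt; U+00BD (½), for example, carries "<fraction> 0031 2044 0032". A minimal C++ sketch of reading that field (a hypothetical helper, not part of any library discussed here):

#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parsed form of UnicodeData.txt field 5, e.g. "<fraction> 0031 2044 0032".
// A leading <tag> marks a compatibility decomposition; no tag means the
// decomposition is canonical.
struct Decomposition {
    std::string tag;                      // empty => canonical
    std::vector<uint32_t> codepoints;
};

Decomposition ParseDecomposition(const std::string& field)
{
    Decomposition d;
    std::istringstream ss(field);
    std::string item;
    while(ss >> item) {
        if(item.size() >= 2 && item.front() == '<' && item.back() == '>')
            d.tag = item.substr(1, item.size() - 2);   // "<fraction>" -> "fraction"
        else
            d.codepoints.push_back(uint32_t(std::stoul(item, nullptr, 16)));
    }
    return d;
}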
As for NFC and NFD, I only found two good use cases: string equality and search. With the normalization forms, you effectively compare glyphs rather than code points, without actually building glyphs. If two strings look the same on your display but have different code points due to diacritics, it is very useful to be able to tell that they are visually identical. Basically, I want "ț" encoded as a precomposed character and "ț" encoded as a "t" with a combining mark to be identified as the same string.
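A minimal sketch of that equality check in C++, with a deliberately tiny decomposition table covering just the "ț" example (U+021B canonically decomposes to U+0074 U+0326); real code would use the full table and also apply canonical reordering of combining marks:

#include <cstdint>
#include <vector>

// Tiny illustrative canonical decomposition: only U+021B is mapped here.
static std::vector<uint32_t> Decompose(uint32_t cp)
{
    if(cp == 0x021B)                   // LATIN SMALL LETTER T WITH COMMA BELOW
        return { 0x0074, 0x0326 };     // "t" + COMBINING COMMA BELOW
    return { cp };
}

static std::vector<uint32_t> Decompose(const std::vector<uint32_t>& s)
{
    std::vector<uint32_t> out;
    for(uint32_t cp : s)
        for(uint32_t d : Decompose(cp))
            out.push_back(d);
    return out;
}

// Precomposed { 0x021B } and decomposed { 0x0074, 0x0326 } compare equal.
bool CanonicallyEqual(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b)
{
    return Decompose(a) == Decompose(b);
}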
Re: Choosing the best way to go full UNICODE [message #48302 is a reply to message #48288]
Mon, 19 June 2017 10:03
mirek
Messages: 14257  Registered: November 2005
Ultimate Member
My understanding is that if a decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.
I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' them into a single codepoint. One of the reasons is that canonical compositions are unique, but multiple codepoints can share the same compatibility decomposition (found that out the hard way during testing).
In any case, I have added a bool parameter
int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool& canonical);
to the 'decompose' API, and Compose is now not using noncanonical decompositions.
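A usage sketch of that API, assuming the int return value is the number of codepoints written to t[] (not confirmed in this thread):

#include <Core/Core.h>
#include <cstdio>
using namespace Upp;

CONSOLE_APP_MAIN
{
    dword t[MAX_DECOMPOSED];
    bool canonical;
    // U+00BD (vulgar fraction one half) has the compatibility decomposition
    // <fraction> 0031 2044 0032, so 'canonical' should come back false and,
    // per the note above, the result should never be recomposed.
    int n = UnicodeDecompose(0x00BD, t, canonical);
    for(int i = 0; i < n; i++)
        std::printf("U+%04X ", (unsigned)t[i]);
    std::printf("\ncanonical: %s\n", canonical ? "yes" : "no");
}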
I believe that my "Unicode INFO" code is now complete. In the end, it is about 12KB of data (6KB compressed and 6KB of 'fast tables' for the first 2048 codepoints).
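The 'fast tables' part presumably means a directly indexed hot path for codepoints below 2048 (covering ASCII, Latin, Greek, Cyrillic...), with everything else falling through to the compressed data. A sketch of that split, with illustrative names only, not U++'s actual ones:

#include <cstdint>

extern const uint8_t fast_info[2048];     // generated table, one entry per codepoint
uint8_t LookupCompressed(uint32_t cp);    // hypothetical slow path into compressed data

inline uint8_t GetInfo(uint32_t cp)
{
    return cp < 2048 ? fast_info[cp]      // O(1) for the common scripts
                     : LookupCompressed(cp);
}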
Documentation needs updating. Then the next part would be updating / deprecating those ToLower/ToUpper routines for Strings, and most importantly, implementing "apparent character logic".