Will UPP support full UNICODE (21bits long codepoint)? [message #54538] Sun, 09 August 2020 10:40
chivstyle
Messages: 4
Registered: August 2020
Location: China
Junior Member
Hi,

Ctrls can't display characters that are out of range, such as 𐐇. How can this be fixed, please?
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54541 is a reply to message #54538] Tue, 11 August 2020 17:22
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
chivstyle wrote on Sun, 09 August 2020 10:40
Hi,

Ctrls can't display characters that are out of range, such as 𐐇. How can this be fixed, please?


I am sorry, but at the moment we support only 16-bit Unicode. Extended support was considered, but as there has not been enough pressure from the user base until now, it was postponed.

But I guess it is time to fix this.

One problem is that moving to 32-bit codepoints is only part of the problem; a real solution should deal with all composition issues. But I guess it is a good start...

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54564 is a reply to message #54541] Sat, 15 August 2020 01:02
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
I would love to see this in U++ too.

Quote:
One problem is that moving to 32-bit codepoints is only part of the problem; a real solution should deal with all composition issues. But I guess it is a good start...


As a first step, I can send in the updated tables for Upp::UnicodeCombine() (the full set of canonical compositions, including codepoints > 16 bit). Currently this function is missing a lot of compositions anyway...

They can replace the existing tables (and maybe we can cast the dword table results down to word for the time being, ignoring codepoints > 65535? Later we can switch to dword).

But beware, it is a long list of tables that would deserve another file (unicode.i maybe?):

comb300
comb301
comb302
comb303
comb304
comb306
comb307
comb308
comb309
comb30a
comb30b
comb30c
comb30f
comb311
comb313
comb314
comb31b
comb323
comb324
comb325
comb326
comb327
comb328
comb32d
comb32e
comb330
comb331
comb338
comb342
comb345
comb5b4
comb5b7
comb5b8
comb5b9
comb5bc
comb5bf
comb5c1
comb5c2
comb653
comb654
comb655
comb93c
comb9bc
comb9be
comb9d7
comba3c
combb3c
combb3e
combb56
combb57
combbbe
combbd7
combc56
combcc2
combcd5
combcd6
combd3e
combd57
combdca
combdcf
combddf
combf72
combf74
combf80
combfb5
combfb7
comb102e
comb1b35
comb3099
comb309a
comb110ba
comb11127
comb1133e
comb11357
comb114b0
comb114ba
comb114bd
comb115af
comb11930
comb1d165
comb1d16e
comb1d16f
comb1d170
comb1d171
comb1d172


What do you think?

Best regards,
Oblivion


Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54566 is a reply to message #54564] Sat, 15 August 2020 01:39
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 01:02
I would love to see this in U++ too.

Quote:
One problem is that moving to 32-bit codepoints is only part of the problem; a real solution should deal with all composition issues. But I guess it is a good start...


As a first step, I can send in the updated tables for Upp::UnicodeCombine() (the full set of canonical compositions, including codepoints > 16 bit). Currently this function is missing a lot of compositions anyway...

They can replace the existing tables (and maybe we can cast the dword table results down to word for the time being? Later we can switch to dword).

But beware, it is a long list of tables that would deserve another file (unicode.i maybe?)


What do you think?

Best regards,
Oblivion


Combine is fine, but...

After a lot of thinking, I believe that one step forward is to have

int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);

functions...

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54569 is a reply to message #54566] Sat, 15 August 2020 11:12
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
Quote:
int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);


That requires width tables.

How about generating the width tables with the uniset scripts (used by xterm et al.)?

They can generate width tables for double-width, ambiguous-width, unknown-width and combining characters from the Unicode database (latest version).
They are quite comprehensive.

I have already started using them here (for double-width and combining characters, at the moment): https://github.com/ismail-yilmaz/upp-components/blob/master/CtrlLib/Terminal/Cell.cpp

We could put these into a generic GetCharWidth(int c) function, then use them in GetGraphemeLength(const wchar *s)?

This might not be the definitive solution, but it is battle-tested out in the wild.

The only real downside is that the tables might need updating, albeit infrequently.
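
As a rough illustration of the table-driven approach described above, here is a minimal C++ sketch of a GetCharWidth-style lookup that binary-searches sorted codepoint intervals. The Interval struct, the InTable helper and the tiny table excerpts are assumptions made for this example; they are neither the uniset output nor the code from the linked Cell.cpp, and real tables would be generated from the Unicode database.

```cpp
#include <cstddef>
#include <cstdint>

// Closed codepoint interval [first, last]. Hypothetical type for this sketch.
struct Interval {
    uint32_t first;
    uint32_t last;
};

// Illustrative excerpts only, sorted by codepoint. Not complete tables.
static const Interval kCombining[] = {
    { 0x0300, 0x036F },   // Combining Diacritical Marks
    { 0x1AB0, 0x1AFF },   // Combining Diacritical Marks Extended
    { 0x20D0, 0x20FF },   // Combining Diacritical Marks for Symbols
};

static const Interval kDoubleWidth[] = {
    { 0x1100, 0x115F },   // Hangul Jamo leading consonants
    { 0x4E00, 0x9FFF },   // CJK Unified Ideographs
    { 0xAC00, 0xD7A3 },   // Hangul Syllables
};

// Binary search over a sorted, non-overlapping interval table.
template <size_t N>
static bool InTable(const Interval (&table)[N], uint32_t c)
{
    size_t lo = 0, hi = N;
    while(lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if(c > table[mid].last)
            lo = mid + 1;
        else if(c < table[mid].first)
            hi = mid;
        else
            return true;
    }
    return false;
}

// Returns the display width in cells: 0 for combining marks,
// 2 for double-width characters, 1 for everything else.
int GetCharWidth(uint32_t c)
{
    if(InTable(kCombining, c))
        return 0;
    if(InTable(kDoubleWidth, c))
        return 2;
    return 1;
}
```

With these excerpts, GetCharWidth('A') gives 1, GetCharWidth(0x0301) gives 0 and GetCharWidth(0x4E2D) gives 2. Keeping the data as sorted intervals means a table refresh only replaces the arrays, not the lookup code.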


Best regards,
Oblivion


[Updated on: Sat, 15 August 2020 11:15]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54570 is a reply to message #54569] Sat, 15 August 2020 11:33
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 11:12
Quote:
int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);


That requires width tables.

How about generating the width tables with the uniset scripts (used by xterm et al.)?

They can generate width tables for double-width, ambiguous-width, unknown-width and combining characters from UnicodeData.txt (latest version).
They are quite comprehensive.

I have already started using them here (for double-width and combining characters, at the moment): https://github.com/ismail-yilmaz/upp-components/blob/master/CtrlLib/Terminal/Cell.cpp

We could put these into a generic GetCharWidth(int c) function, then use them in GetGraphemeLength(const wchar *s)?

This might not be the definitive solution, but it is battle-tested out in the wild.

The only real downside is that the tables might need updating, albeit infrequently.


Best regards,
Oblivion


Just to make sure we are on the same page: I am not speaking about graphical width here, but about the number of bytes (or words) that form a single grapheme ("combined character").

EDIT: To be clearer still: the number of bytes (or words) of the first grapheme at s.
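
To make that definition concrete, here is a minimal C++ sketch of the UTF-8 overload, under a deliberately simplified rule: a grapheme is one codepoint followed by any number of combining marks from U+0300..U+036F. This is an assumption for illustration only, not the U++ implementation and not the full grapheme cluster algorithm; DecodeUtf8 and IsCombiningMark are hypothetical helpers, and the decoder does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstdint>

// Decodes one UTF-8 sequence at s into cp and returns its length in bytes.
// Returns 0 only at the zero terminator; a malformed or truncated sequence
// is consumed as a single byte and reported as U+FFFD.
static int DecodeUtf8(const char *s, uint32_t& cp)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    if(p[0] == 0) { cp = 0; return 0; }
    if(p[0] < 0x80) { cp = p[0]; return 1; }
    int n = (p[0] & 0xE0) == 0xC0 ? 2
          : (p[0] & 0xF0) == 0xE0 ? 3
          : (p[0] & 0xF8) == 0xF0 ? 4 : 0;
    if(n == 0) { cp = 0xFFFD; return 1; }
    cp = p[0] & (0xFF >> (n + 1));            // payload bits of the lead byte
    for(int i = 1; i < n; i++) {
        if((p[i] & 0xC0) != 0x80) { cp = 0xFFFD; return 1; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return n;
}

// Simplified stand-in for a real combining-mark table.
static bool IsCombiningMark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Number of bytes of the first grapheme at s (0 for an empty string).
int GraphemeLength(const char *s)
{
    assert(s);
    uint32_t cp;
    int len = DecodeUtf8(s, cp);
    if(len == 0)
        return 0;
    for(;;) {
        int n = DecodeUtf8(s + len, cp);
        if(n == 0 || !IsCombiningMark(cp))
            break;
        len += n;                              // mark belongs to this grapheme
    }
    return len;
}
```

For "e" followed by U+0301 (bytes 65 CC 81) this returns 3, and for 𐐇 (bytes F0 90 90 87) it returns 4. A wchar overload would follow the same loop with a UTF-16 decoder in place of DecodeUtf8.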

Mirek

[Updated on: Sat, 15 August 2020 12:27]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54571 is a reply to message #54570] Sat, 15 August 2020 12:15
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
Ah yes, I was thinking about the graphical width (in units), sorry.

But don't we still need to detect base char + combining char(s), which will eventually require tables for the operation to be fast? (Also for decomposition?)

Best regards,
Oblivion


Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54572 is a reply to message #54571] Sat, 15 August 2020 12:44
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 12:15
Ah yes, I was thinking about the graphical width (in units), sorry.

But don't we still need to detect base char + combining char(s), which will eventually require tables for the operation to be fast? (Also for decomposition?)



Very likely yes, to implement GetGraphemeLength. But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54573 is a reply to message #54572] Sat, 15 August 2020 13:15
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
Quote:
But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.


Yeah, there are at least surrogate pairs (since U++ uses 16-bit wchar), ligatures and, IIRC, some Hangul graphemes. Anything else that I am missing?

Surrogate pairs are rather well formed, but ligatures and multi-codepoint CJK/Devanagari stuff may pose problems...


Edit: Ah yes, what I was missing is explained in the grapheme clusters section: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


[Updated on: Sat, 15 August 2020 13:35]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54574 is a reply to message #54573] Sat, 15 August 2020 14:07
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 13:15
Quote:
But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.


Yeah, there are at least surrogate pairs (since U++ uses 16-bit wchar), ligatures and, IIRC, some Hangul graphemes. Anything else that I am missing?


Correct me if I am wrong:

I do not think surrogate pairs count here - those are just a way to encode codepoints.
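
To illustrate why a surrogate pair is only an encoding detail, here is a short C++ sketch that reads one codepoint from UTF-16 text. The function and helper names are assumptions for this example (not U++ API), and the input is assumed to be zero-terminated so that looking one unit ahead is safe.

```cpp
#include <cstdint>

inline bool IsHighSurrogate(uint16_t w) { return (w & 0xFC00) == 0xD800; }
inline bool IsLowSurrogate(uint16_t w)  { return (w & 0xFC00) == 0xDC00; }

// Reads one codepoint from zero-terminated UTF-16 text at s.
// *units receives the number of 16-bit units consumed (1 or 2).
uint32_t ReadUtf16(const uint16_t *s, int *units)
{
    if(IsHighSurrogate(s[0]) && IsLowSurrogate(s[1])) {
        *units = 2;
        // 10 payload bits from each half, offset by 0x10000
        return 0x10000 + ((uint32_t(s[0]) - 0xD800) << 10)
                       + (uint32_t(s[1]) - 0xDC00);
    }
    *units = 1;
    return s[0];   // BMP codepoint, or an unpaired surrogate passed through
}
```

The character from the first post, 𐐇 (U+10407), is stored as the two units D801 DC07, and ReadUtf16 turns them back into the single codepoint 0x10407. Grapheme handling can then work on codepoints regardless of how they were encoded.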

Ligatures are probably not a concern either; they can be considered characters of their own.

But Hangul graphemes are what worries me... Not that I have searched too much, but so far I have failed to find a simple algorithm for combining Hangul characters...

Anyway, to explain my way of thinking: so far, in the GUI, we have had wchar == grapheme equivalence. That means, for example, that inside EditString, when the user enters a character, it is simply inserted at the cursor position in the text.

For full Unicode, we will need to change 16-bit wchars to graphemes. That means that the "length" of the text will now be the number of graphemes, a position in the text will be the position of a grapheme, etc...

Of course, in a later phase we will need to deal with graphics too, but I think that might be relatively easy compared to changing all the text editing routines. (To deal with graphics, Font::operator[](int chr) should probably change to something like Font::operator[](const char *grapheme)...)

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54575 is a reply to message #54574] Sat, 15 August 2020 15:11
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
You are correct.

Quote:
But Hangul graphemes are what worries me... Not that I have searched too much, but so far I have failed to find a simple algorithm for combining Hangul characters...


I'm no expert on East Asian languages either, but I've been studying the Unicode database, and related existing code that deals with Unicode, for some time. (The one piece of source code that can give an idea is in Perl.)

Maybe this could help (as a starter)?: https://metacpan.org/pod/Lingua::KO::Hangul::Util

P.S. The code does not support Hangul letters (jamo) assigned after Unicode 5.2, but it can serve as a starting point. (See also the references at the bottom of the page.)


Also: the Hangul syllable types database for composition/decomposition: http://www.unicode.org/Public/UNIDATA/HangulSyllableType.txt

Best regards,
Oblivion



[Updated on: Sat, 15 August 2020 15:38]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54578 is a reply to message #54574] Mon, 17 August 2020 07:22
chivstyle
Messages: 4
Registered: August 2020
Location: China
Junior Member
You are right. The renderer does not support UNICODE; it treats each WCHAR as a character, which is not always right. Some other routines, such as GetWidth [Font] and the subroutines in Font.cpp, should be changed to support UNICODE.

I now have a TTF font file that supports almost all UNICODE codepoints. I rendered some CJK characters with FreeType, and it worked.

So, I think it's not a complicated job to support UNICODE.
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54581 is a reply to message #54578] Mon, 17 August 2020 10:17
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
chivstyle wrote on Mon, 17 August 2020 07:22

So, I think it's not a complicated job to support UNICODE.


Actually, this is not true :)

It is relatively easy to support 32-bit codepoints. But that is only a tiny part of the problem.

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54583 is a reply to message #54573] Mon, 17 August 2020 11:01
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 13:15
Quote:
But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.


Yeah, there are at least surrogate pairs (since U++ uses 16-bit wchar), ligatures and, IIRC, some Hangul graphemes. Anything else that I am missing?

Surrogate pairs are rather well formed, but ligatures and multi-codepoint CJK/Devanagari stuff may pose problems...


Edit: Ah yes, what I was missing is explained in the grapheme clusters section: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries


I guess this might be the path forward:

https://en.wikipedia.org/wiki/HarfBuzz

It looks like most toolkits simply use HarfBuzz anyway... :)

Mirek
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54585 is a reply to message #54583] Mon, 17 August 2020 11:50
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
Quote:
It looks like most toolkits simply use HarfBuzz anyway...


Well, this seems to be the best option, but I was even afraid of suggesting it, as it means another dependency (and possibly a lot of work) :)


By the way, if you think it's OK, in the meantime we can have better precomposition support. I've attached CharSet.cpp with a patched UnicodeCombine for full precomposition support (for 16-bit UCS canonicals only).

(I can also send the extractor code for uppbox if needed)

Best regards,
Oblivion
  • Attachment: CharSet.cpp
    (Size: 125.71KB, Downloaded 13 times)


[Updated on: Mon, 17 August 2020 11:52]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54586 is a reply to message #54585] Mon, 17 August 2020 14:04
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
Oblivion wrote on Mon, 17 August 2020 11:50
Quote:
It looks like most toolkits simply use HarfBuzz anyway...


Well, this seems to be the best option, but I was even afraid of suggesting it, as it means another dependency (and possibly a lot of work) :)


By the way, if you think it's OK, in the meantime we can have better precomposition support. I've attached CharSet.cpp with a patched UnicodeCombine for full precomposition support (for 16-bit UCS canonicals only).

(I can also send the extractor code for uppbox if needed)

Best regards,
Oblivion


I am sorry to say this, since it is mostly due to the lack of docs, but I think all of this is already better covered by

int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool only_canonical);
Vector<dword> UnicodeDecompose(dword codepoint, bool only_canonical);

- the reason these were never really documented is that the above functions are a sort of abandoned effort from a previous attempt at better Unicode support. Anyway, they use a quite effective z-compressed table (so as not to increase the .exe size too much). This table (and others) are produced directly from the Unicode tables by uppbox/unicode. And they should also support more than one combining mark...
Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54587 is a reply to message #54586] Mon, 17 August 2020 15:17
Oblivion
Messages: 828
Registered: August 2007
Experienced Contributor
:o

I see. Until we have better Unicode support, I think I can use this "undocumented" feature. Thanks!

Best regards,
Oblivion


Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54590 is a reply to message #54587] Tue, 18 August 2020 16:55
mirek
Messages: 13053
Registered: November 2005
Ultimate Member
I have optimised some UnicodeInfo routines and obsoleted UnicodeCombine, changing its implementation to use UnicodeCompose.

I have also checked the Hangul issue, and it rather seems that it is not an issue at all. First of all, the algorithm is really trivial. Second, maybe it is not even needed, as Korean texts probably typically contain composed characters anyway, so perhaps it is even better to treat jamo as individual graphemes.
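
For reference, the arithmetic behind Hangul syllable composition, as defined in the Unicode Standard, can be sketched in a few lines of C++. The ComposeHangul name and the convention of returning 0 when the pair does not compose are assumptions for this example; this is not the U++ code.

```cpp
#include <cstdint>

namespace hangul {
constexpr uint32_t SBase = 0xAC00;   // first precomposed syllable
constexpr uint32_t LBase = 0x1100;   // first leading consonant
constexpr uint32_t VBase = 0x1161;   // first vowel
constexpr uint32_t TBase = 0x11A7;   // one before the first trailing consonant
constexpr uint32_t LCount = 19, VCount = 21, TCount = 28;
constexpr uint32_t SCount = LCount * VCount * TCount;   // 11172 syllables
}

// Composes two codepoints into one Hangul syllable, or returns 0 if the
// pair does not compose under the arithmetic rule.
uint32_t ComposeHangul(uint32_t a, uint32_t b)
{
    using namespace hangul;
    // Leading consonant + vowel -> LV syllable
    if(a >= LBase && a < LBase + LCount && b >= VBase && b < VBase + VCount)
        return SBase + ((a - LBase) * VCount + (b - VBase)) * TCount;
    // LV syllable + trailing consonant -> LVT syllable
    if(a >= SBase && a < SBase + SCount && (a - SBase) % TCount == 0
       && b > TBase && b < TBase + TCount)
        return a + (b - TBase);
    return 0;
}
```

For example, U+1112 followed by U+1161 composes to U+D558, and adding the trailing consonant U+11AB yields U+D55C (한). No table is needed because the precomposed syllables are laid out in a regular grid.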

Mirek

[Updated on: Tue, 18 August 2020 16:55]

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54611 is a reply to message #54590] Thu, 20 August 2020 07:10
deep
Messages: 227
Registered: July 2011
Location: Bangalore
Experienced Member
Hi,

As part of this, can we also get Indian character sets working?
There is some similarity between Korean and Indian scripts.

I am willing to work on this with some guidance.


Warm Regards

Deepak