Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54564 is a reply to message #54541]
Sat, 15 August 2020 01:02
Oblivion
Messages: 1206 Registered: August 2007
Senior Contributor
I would love to see this in U++ too.
One problem is that moving to 32-bit codepoints is only part of the problem; a real solution should deal with all composition issues. But I guess it is a good start...
As a first step, I can send in updated tables for Upp::UnicodeCombine() (the full set of canonical compositions, including codepoints > 16 bit). Currently this function is missing a lot of compositions anyway...
They can replace the existing tables (and maybe we can cast the dword table results down to word for the time being, ignoring codepoints > 65535? Later we can switch to dword).
But beware, it is a long list of tables that would deserve another file (unicode.i maybe?):
comb300
comb301
comb302
comb303
comb304
comb306
comb307
comb308
comb309
comb30a
comb30b
comb30c
comb30f
comb311
comb313
comb314
comb31b
comb323
comb324
comb325
comb326
comb327
comb328
comb32d
comb32e
comb330
comb331
comb338
comb342
comb345
comb5b4
comb5b7
comb5b8
comb5b9
comb5bc
comb5bf
comb5c1
comb5c2
comb653
comb654
comb655
comb93c
comb9bc
comb9be
comb9d7
comba3c
combb3c
combb3e
combb56
combb57
combbbe
combbd7
combc56
combcc2
combcd5
combcd6
combd3e
combd57
combdca
combdcf
combddf
combf72
combf74
combf80
combfb5
combfb7
comb102e
comb1b35
comb3099
comb309a
comb110ba
comb11127
comb1133e
comb11357
comb114b0
comb114ba
comb114bd
comb115af
comb11930
comb1d165
comb1d16e
comb1d16f
comb1d170
comb1d171
comb1d172
What do you think?
Best regards,
Oblivion
Github page: https://github.com/ismail-yilmaz
upp-components: https://github.com/ismail-yilmaz/upp-components
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat
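
For illustration, the table idea in the post above might look roughly like the following sketch. It is not the actual U++ table layout: `CombEntry`, `comb301_sample`, `comb110ba_sample`, `Compose32` and `Compose16` are hypothetical names, and the tables hold only a handful of entries. It shows one table per combining mark with 32-bit results, plus an interim 16-bit wrapper that ignores results above 65535.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical table entry: base codepoint -> precomposed codepoint.
struct CombEntry {
    std::uint32_t base;
    std::uint32_t composed;
};

// Tiny excerpt of what a table for U+0301 (combining acute accent) could hold.
// Sorted by base so that it can be binary-searched.
static const CombEntry comb301_sample[] = {
    { 0x0041, 0x00C1 }, // A -> A with acute
    { 0x0045, 0x00C9 }, // E -> E with acute
    { 0x0061, 0x00E1 }, // a -> a with acute
    { 0x0065, 0x00E9 }, // e -> e with acute
};

// Excerpt for U+110BA (Kaithi sign nukta): a result beyond 16 bits.
static const CombEntry comb110ba_sample[] = {
    { 0x11099, 0x1109A },
};

// Binary search in one table; returns 0 when no composition exists.
std::uint32_t Compose32(const CombEntry *t, std::size_t n, std::uint32_t base)
{
    std::size_t lo = 0, hi = n;
    while(lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if(t[mid].base < base)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n && t[lo].base == base ? t[lo].composed : 0;
}

// Interim 16-bit variant: results above 0xFFFF count as "no composition".
std::uint16_t Compose16(const CombEntry *t, std::size_t n, std::uint32_t base)
{
    std::uint32_t c = Compose32(t, n, base);
    return c > 0xFFFF ? 0 : static_cast<std::uint16_t>(c);
}
```

With this layout, switching to dword later would only mean dropping the `Compose16` wrapper; the tables themselves stay unchanged.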

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54566 is a reply to message #54564]
Sat, 15 August 2020 01:39
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 01:02:
I would love to see this in U++ too.
One problem is that moving to 32-bit codepoints is only part of the problem; a real solution should deal with all composition issues. But I guess it is a good start...
As a first step, I can send in updated tables for Upp::UnicodeCombine() (the full set of canonical compositions, including codepoints > 16 bit). Currently this function is missing a lot of compositions anyway...
They can replace the existing tables (and maybe we can cast the dword table results down to word for the time being? Later we can switch to dword).
But beware, it is a long list of tables that would deserve another file (unicode.i maybe?)
What do you think?
Best regards,
Oblivion
Combine is fine, but..
After a lot of thinking, I believe that one step forward is to have
int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);
functions...
Mirek

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54570 is a reply to message #54569]
Sat, 15 August 2020 11:33
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 11:12:
Quote:
int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);
That requires width tables.
How about generating the width tables with the uniset scripts (used by xterm et al.)?
They can generate width tables for double-width, ambiguous-width, unknown-width and combining characters from UnicodeData.txt (latest version).
They are quite comprehensive.
I have already started using them here (for double-width and combining characters, ATM): https://github.com/ismail-yilmaz/upp-components/blob/master/CtrlLib/Terminal/Cell.cpp
We can put these into a generic GetCharWidth(int c) function, then use it in GetGraphemeLength(const wchar *s)?
This might not be the definitive solution, but it is the battle-tested one out in the wild.
The only real downside is that the tables might need updating, albeit infrequently.
Best regards,
Oblivion
Just to make sure we are on the same page: I am not speaking about graphical width here, but about the number of bytes (or words) that form a single grapheme ("combined character").
EDIT: Still not clear enough: the number of bytes of the first grapheme at s.
Mirek
[Updated on: Sat, 15 August 2020 12:27]
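
As a rough sketch of the semantics described above ("number of bytes of the first grapheme at s"), the following assumes valid UTF-8 input and treats only U+0300..U+036F as combining marks. That range check is a placeholder: a real implementation would need proper Unicode tables. All names (`DecodeUtf8`, `IsCombiningSample`, `GraphemeLengthSketch`) are hypothetical and are not U++ functions.

```cpp
#include <cstdint>

// Decodes one UTF-8 sequence at s, stores the codepoint in cp and returns
// the number of bytes consumed (0 at the terminating NUL).
// Assumes well-formed UTF-8; no validation is done.
static int DecodeUtf8(const char *s, std::uint32_t& cp)
{
    const unsigned char *u = reinterpret_cast<const unsigned char *>(s);
    if(u[0] == 0) { cp = 0; return 0; }
    if(u[0] < 0x80) { cp = u[0]; return 1; }
    if((u[0] & 0xE0) == 0xC0) {
        cp = ((u[0] & 0x1Fu) << 6) | (u[1] & 0x3Fu);
        return 2;
    }
    if((u[0] & 0xF0) == 0xE0) {
        cp = ((u[0] & 0x0Fu) << 12) | ((u[1] & 0x3Fu) << 6) | (u[2] & 0x3Fu);
        return 3;
    }
    cp = ((u[0] & 0x07u) << 18) | ((u[1] & 0x3Fu) << 12)
       | ((u[2] & 0x3Fu) << 6) | (u[3] & 0x3Fu);
    return 4;
}

// Placeholder test: only the "Combining Diacritical Marks" block.
static bool IsCombiningSample(std::uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Number of bytes of the first grapheme at s: one base codepoint plus all
// combining marks that directly follow it.
int GraphemeLengthSketch(const char *s)
{
    std::uint32_t cp;
    int len = DecodeUtf8(s, cp);
    if(len == 0)
        return 0;
    for(;;) {
        int n = DecodeUtf8(s + len, cp);
        if(n == 0 || !IsCombiningSample(cp))
            break;
        len += n;
    }
    return len;
}
```

For example, for the text "e" followed by U+0301 (bytes 0x65 0xCC 0x81) the sketch reports 3 bytes, while for plain "abc" it reports 1.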

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54574 is a reply to message #54573]
Sat, 15 August 2020 14:07
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Oblivion wrote on Sat, 15 August 2020 13:15:
Quote:
But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.
Yeah, there are at least surrogate pairs (since U++ uses 16-bit wchar), ligatures and IIRC some Hangul graphemes. Anything else that I am missing?
Correct me if I am wrong:
I do not think surrogate pairs are an issue here - those are just a way to encode codepoints.
Ligatures are probably not a concern either; they can be considered characters of their own.
But Hangul graphemes are what worries me... Not that I have searched too much, but so far I have failed to find a simple algorithm for combining Hangul characters...
Anyway, to explain my way of thinking: So far, in the GUI, we have had wchar == grapheme equivalence. That means, for example, that inside EditString, when the user enters a character, it is simply inserted at the cursor position in the text.
For full Unicode, we will need to change 16-bit wchars to graphemes. That means that the "length" of a text will now be the number of graphemes, a position in the text will be the position of a grapheme, etc...
Of course, in a later phase, we will need to deal with graphics too, but I think that might be relatively easy compared to changing all the text editing routines. (To deal with graphics, Font::operator[](int chr) should probably change to something like Font::operator[](const char *grapheme)...)
Mirek
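
To illustrate the point that surrogate pairs are only an encoding of a single codepoint, here is a minimal sketch of reading one codepoint from 16-bit text. The function names are hypothetical (not U++ API), and unpaired surrogates are simply returned as they are.

```cpp
#include <cstdint>

inline bool IsHighSurrogate(std::uint16_t w) { return w >= 0xD800 && w <= 0xDBFF; }
inline bool IsLowSurrogate(std::uint16_t w)  { return w >= 0xDC00 && w <= 0xDFFF; }

// Reads one codepoint from zero-terminated UTF-16 text at s and stores the
// number of 16-bit units consumed (1 or 2) in units.
std::uint32_t ReadCodepointUtf16(const std::uint16_t *s, int& units)
{
    if(IsHighSurrogate(s[0]) && IsLowSurrogate(s[1])) {
        units = 2;
        return 0x10000u + ((std::uint32_t(s[0]) - 0xD800u) << 10)
                        + (std::uint32_t(s[1]) - 0xDC00u);
    }
    units = 1;
    return s[0];
}
```

For instance, the units 0xD834 0xDD65 decode to the single codepoint U+1D165, one of the combining marks from the table list earlier in the thread.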

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54586 is a reply to message #54585]
Mon, 17 August 2020 14:04
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Oblivion wrote on Mon, 17 August 2020 11:50:
Quote:
It looks like most toolkits simply use HarfBuzz anyway...
Well, this seems to be the best option, but I was even afraid of suggesting it, as it means another dependency (and possibly a lot of work).
By the way, if you think it's OK, in the meantime we can have better precomposition support. I've attached Charset.cpp with a patched UnicodeCombine for full precomposition support (for 16-bit UCS canonicals only).
(I can also send the extractor code for uppbox if needed.)
Best regards,
Oblivion
I am sorry to say this, because it is mostly due to a lack of docs, but I think all this is already better covered by
int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool only_canonical);
Vector<dword> UnicodeDecompose(dword codepoint, bool only_canonical);
- the reason why these were never really documented is that the above functions are a sort of abandoned effort from a previous attempt at better Unicode support. Anyway, they use a quite effective z-compressed table (so as not to increase the .exe size too much). This table (and others) is produced directly from the Unicode tables by uppbox/unicode. And they should also support more than one combining mark...
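
To show what "more than one combining mark" means for decomposition, here is a self-contained sketch. It is not the U++ implementation and does not use its compressed tables: `DecompEntry`, `decomp_sample` and `DecomposeSketch` are hypothetical, and the two-entry table is only enough to demonstrate the recursion.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical canonical decomposition entry: codepoint -> (first, second).
struct DecompEntry {
    std::uint32_t cp, first, second;
};

static const DecompEntry decomp_sample[] = {
    { 0x00C2, 0x0041, 0x0302 }, // A with circumflex -> A + combining circumflex
    { 0x1EA4, 0x00C2, 0x0301 }, // A with circumflex and acute -> U+00C2 + combining acute
};

// Appends the full decomposition of cp to out.
static void DecomposeInto(std::uint32_t cp, std::vector<std::uint32_t>& out)
{
    for(const DecompEntry& e : decomp_sample)
        if(e.cp == cp) {
            DecomposeInto(e.first, out); // the first part may decompose further
            out.push_back(e.second);
            return;
        }
    out.push_back(cp); // no entry: the codepoint stands for itself
}

std::vector<std::uint32_t> DecomposeSketch(std::uint32_t cp)
{
    std::vector<std::uint32_t> out;
    DecomposeInto(cp, out);
    return out;
}
```

Here U+1EA4 expands to three codepoints (U+0041, U+0302, U+0301): one base letter followed by two combining marks.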

Re: Will UPP support full UNICODE (21bits long codepoint)? [message #54590 is a reply to message #54587]
Tue, 18 August 2020 16:55
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
I have optimised some UnicodeInfo routines and obsoleted UnicodeCombine, changing its implementation to use UnicodeCompose.
I have also checked the Hangul issue, and it rather seems that it is not an issue at all. First of all, the algorithm is really trivial; second, maybe it is not even needed, as Korean texts probably typically contain composed characters anyway, so perhaps it is even better to treat jamo as individual graphemes.
Mirek
[Updated on: Tue, 18 August 2020 16:55]
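
For reference, the "really trivial" algorithm is presumably the arithmetic Hangul syllable composition from the Unicode Standard. A minimal sketch follows; the constants come from the standard, while the function name `ComposeHangul` is hypothetical and not part of U++.

```cpp
#include <cstdint>

// Constants of the Hangul syllable composition algorithm (Unicode Standard).
const std::uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
const std::uint32_t LCount = 19, VCount = 21, TCount = 28;

// Composes a leading consonant l, a vowel v and an optional trailing
// consonant t (pass 0 for none) into one precomposed syllable.
// Returns 0 if the jamo are outside the composable ranges.
std::uint32_t ComposeHangul(std::uint32_t l, std::uint32_t v, std::uint32_t t = 0)
{
    if(l < LBase || l >= LBase + LCount)
        return 0;
    if(v < VBase || v >= VBase + VCount)
        return 0;
    std::uint32_t s = SBase + ((l - LBase) * VCount + (v - VBase)) * TCount;
    if(t == 0)
        return s;
    if(t <= TBase || t >= TBase + TCount)
        return 0;
    return s + (t - TBase);
}
```

For example, U+1112 + U+1161 + U+11AB compose to U+D55C, and U+1100 + U+1161 compose to U+AC00.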