Hi,
Ctrls can't display characters that are out of range, such as 𐐇, etc. How can I fix this issue, please?
One problem is that going to 32-bit codepoints is just part of the problem; a real solution should deal with all composing issues. But I guess it is a good start...
I would love to see this in U++ too.
As a first step, I can send in the updated tables for Upp::UnicodeCombine() (full set of canonical compositions, including codepoints > 16 bit). Currently this function is missing a lot of compositions anyway...
They can replace the existing tables (and maybe we can cast down the dword table results to word for the time being? Later we can switch to dword).
But beware, it is a long list of tables that would deserve another file (unicode.i maybe?):
comb300 comb301 comb302 comb303 comb304 comb306 comb307 comb308 comb309 comb30a comb30b comb30c comb30f comb311 comb313 comb314 comb31b comb323 comb324 comb325 comb326 comb327 comb328 comb32d comb32e comb330 comb331 comb338 comb342 comb345 comb5b4 comb5b7 comb5b8 comb5b9 comb5bc comb5bf comb5c1 comb5c2 comb653 comb654 comb655 comb93c comb9bc comb9be comb9d7 comba3c combb3c combb3e combb56 combb57 combbbe combbd7 combc56 combcc2 combcd5 combcd6 combd3e combd57 combdca combdcf combddf combf72 combf74 combf80 combfb5 combfb7 comb102e comb1b35 comb3099 comb309a comb110ba comb11127 comb1133e comb11357 comb114b0 comb114ba comb114bd comb115af comb11930 comb1d165 comb1d16e comb1d16f comb1d170 comb1d171 comb1d172
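To make the table idea concrete, here is a minimal sketch of a pair-keyed canonical composition lookup. It is not the actual Upp::UnicodeCombine() code: the names Composition, kCompositions, ComposePair and ComposePair16 are hypothetical, and the table holds only a handful of sample entries instead of the full generated set. ComposePair16 shows one possible interim behaviour for a 16-bit API: report "no composition" when the result does not fit, instead of truncating it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// One canonical composition: base + combining mark -> precomposed codepoint.
struct Composition {
    uint32_t mark;
    uint32_t base;
    uint32_t composed;
};

// Sample entries only (hypothetical table), sorted by (mark, base).
static const Composition kCompositions[] = {
    {0x0300,  0x0041,  0x00C0},  // A + grave
    {0x0300,  0x0045,  0x00C8},  // E + grave
    {0x0300,  0x0061,  0x00E0},  // a + grave
    {0x0301,  0x0041,  0x00C1},  // A + acute
    {0x0301,  0x0065,  0x00E9},  // e + acute
    {0x110BA, 0x11099, 0x1109A}, // Kaithi pair above 16 bits
};

static bool CompositionLess(const Composition& a, const Composition& b)
{
    return a.mark != b.mark ? a.mark < b.mark : a.base < b.base;
}

// Returns the composed codepoint, or 0 if the pair has no entry in the table.
uint32_t ComposePair(uint32_t base, uint32_t mark)
{
    const Composition *begin = std::begin(kCompositions);
    const Composition *end = std::end(kCompositions);
    assert(std::is_sorted(begin, end, CompositionLess)); // binary search needs this
    const Composition key = {mark, base, 0};
    const Composition *it = std::lower_bound(begin, end, key, CompositionLess);
    return (it != end && it->mark == mark && it->base == base) ? it->composed : 0;
}

// Interim 16-bit facade: results that do not fit into 16 bits count as "none".
uint16_t ComposePair16(uint32_t base, uint32_t mark)
{
    uint32_t c = ComposePair(base, mark);
    return c <= 0xFFFF ? static_cast<uint16_t>(c) : 0;
}
```

With this layout, switching to dword later only means dropping the facade; the tables themselves stay unchanged.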
What do you think?
Best regards,
Oblivion
int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);
Quote:int GraphemeLength(const char *s);
int GraphemeLength(const wchar *s);
That requires width tables.
How about generating the width tables with uniset scripts (used by xterm et al.)?
They can generate width tables for doublewidth, ambiguous width, unknown width and combining chars from the UnicodeData.txt (latest version).
They are quite comprehensive.
I already started using them here (for double width and combining chars, ATM): https://github.com/ismail-yilmaz/upp-components/blob/master/CtrlLib/Terminal/Cell.cpp
We can put these into a generic GetCharWidth(int c) function, then utilize them in GetGraphemeLength(const wchar *s)?
This might not be the definitive solution, but it is the battle-tested one out in the wild.
The only real downside is that the tables might need updating, albeit infrequently.
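As a rough illustration of the proposed GetCharWidth(int c), the sketch below does a binary search over sorted range tables. The ranges are a few hand-picked samples, not output of the uniset scripts, and the names Range, kZeroWidth, kDoubleWidth and InRanges are hypothetical. Generated tables would simply replace the two arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Range {
    uint32_t first, last; // inclusive bounds
};

// Sample zero-width (combining) ranges, sorted and non-overlapping.
static const Range kZeroWidth[] = {
    {0x0300, 0x036F}, // Combining Diacritical Marks
    {0x20D0, 0x20F0}, // Combining Diacritical Marks for Symbols
    {0xFE20, 0xFE2F}, // Combining Half Marks
};

// Sample double-width ranges, sorted and non-overlapping.
static const Range kDoubleWidth[] = {
    {0x1100,  0x115F},  // Hangul Jamo initial consonants
    {0x4E00,  0x9FFF},  // CJK Unified Ideographs
    {0xAC00,  0xD7A3},  // Hangul Syllables
    {0x20000, 0x2FFFD}, // Supplementary Ideographic Plane
};

// Binary search for c in a sorted table of inclusive ranges.
static bool InRanges(const Range *table, size_t count, uint32_t c)
{
    size_t lo = 0, hi = count;
    while(lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if(c > table[mid].last)
            lo = mid + 1;
        else if(c < table[mid].first)
            hi = mid;
        else
            return true;
    }
    return false;
}

// Returns 0 for combining chars, 2 for double-width chars, 1 otherwise.
int GetCharWidth(int c)
{
    assert(c >= 0 && c <= 0x10FFFF);
    uint32_t u = static_cast<uint32_t>(c);
    if(InRanges(kZeroWidth, sizeof(kZeroWidth) / sizeof(kZeroWidth[0]), u))
        return 0;
    if(InRanges(kDoubleWidth, sizeof(kDoubleWidth) / sizeof(kDoubleWidth[0]), u))
        return 2;
    return 1;
}
```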
Best regards,
Oblivion
Ah yes, I was thinking about the graphical width (in units), sorry.
But don't we still need to detect base char + combining char(s), which will eventually require tables to be a fast operation? (also for decomposition?).
But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.
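A deliberately simplified sketch of the "base char + combining char(s)" case follows. It assumes 32-bit code units (char32_t) rather than the wchar signature above, uses a tiny hypothetical IsCombiningMark() check instead of real tables, and ignores every other source of multi-codepoint graphemes, so it is only a starting point.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a table lookup: a few combining mark blocks only.
static bool IsCombiningMark(char32_t c)
{
    return (c >= 0x0300 && c <= 0x036F)
        || (c >= 0x20D0 && c <= 0x20F0)
        || (c >= 0xFE20 && c <= 0xFE2F);
}

// Number of code units in the first grapheme of a zero-terminated string:
// one base character plus any combining marks that directly follow it.
int GraphemeLength(const char32_t *s)
{
    assert(s != nullptr);
    if(*s == 0)
        return 0;
    int n = 1;
    while(s[n] != 0 && IsCombiningMark(s[n]))
        n++;
    return n;
}
```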
Quote:But I am not at the moment sure whether combining characters are the only source of multi-codepoint graphemes.
Yeah, there are at least surrogate pairs (since U++ uses 16-bit wchar), ligatures and, IIRC, some Hangul graphemes. Is there anything else that I am missing?
But Hangul graphemes are what worry me... Not that I have searched much, but so far I have failed to find a simple algorithm for combining Hangul characters...
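For reference, the Unicode Standard describes an arithmetic composition for precomposed Hangul syllables (section 3.12, Conjoining Jamo Behavior), so no table is needed for this part. A minimal sketch follows; the constants come from the standard, while the function name ComposeHangul and its "return 0 when not composable" convention are assumptions made here.

```cpp
#include <cassert>
#include <cstdint>

// Constants from the Unicode Standard, section 3.12.
static const uint32_t kSBase = 0xAC00; // first precomposed syllable
static const uint32_t kLBase = 0x1100; // first leading consonant
static const uint32_t kVBase = 0x1161; // first vowel
static const uint32_t kTBase = 0x11A7; // one before the first trailing consonant
static const uint32_t kLCount = 19, kVCount = 21, kTCount = 28;
static const uint32_t kSCount = kLCount * kVCount * kTCount; // 11172

// Composes L + V into an LV syllable, or LV + T into an LVT syllable.
// Returns 0 if the two codepoints do not compose.
uint32_t ComposeHangul(uint32_t first, uint32_t second)
{
    // Leading consonant + vowel -> LV syllable
    if(first >= kLBase && first < kLBase + kLCount &&
       second >= kVBase && second < kVBase + kVCount) {
        uint32_t s = kSBase + ((first - kLBase) * kVCount + (second - kVBase)) * kTCount;
        assert(s < kSBase + kSCount);
        return s;
    }
    // LV syllable (no trailing consonant yet) + trailing consonant -> LVT syllable
    if(first >= kSBase && first < kSBase + kSCount &&
       (first - kSBase) % kTCount == 0 &&
       second > kTBase && second < kTBase + kTCount)
        return first + (second - kTBase);
    return 0;
}
```

For example, U+1112 followed by U+1161 composes to U+D558, and adding U+11AB gives U+D55C.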
So, I think it's not a complicated job to support UNICODE.
Surrogate pairs are rather well formed, but ligatures and multi-codepoint CJK/Devanagari stuff may pose problems...
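The surrogate pair part really is mechanical. Below is a minimal sketch, assuming UTF-16 text stored as zero-terminated char16_t; the helper names are hypothetical and not part of U++.

```cpp
#include <cassert>
#include <cstdint>

inline bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool IsLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a valid high/low surrogate pair into one 32-bit codepoint.
uint32_t DecodeSurrogatePair(char16_t hi, char16_t lo)
{
    assert(IsHighSurrogate(hi) && IsLowSurrogate(lo));
    return 0x10000 + ((uint32_t(hi) - 0xD800) << 10) + (uint32_t(lo) - 0xDC00);
}

// Number of 16-bit units used by the codepoint at the start of a
// zero-terminated string; an unpaired surrogate is counted as one unit.
int CodePointUnits(const char16_t *s)
{
    return IsHighSurrogate(s[0]) && IsLowSurrogate(s[1]) ? 2 : 1;
}
```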
Edit: Ah yes, what I missed is explained under the grapheme clusters section: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
It looks like most toolkits simply use HarfBuzz anyway...
Quote:It looks like most toolkits simply use HarfBuzz anyway...
Well, this seems to be the best option, but I was even afraid of suggesting it, as it means another dependency (and possibly a lot of work).
By the way, if you think it's OK, in the meantime we can have better precomposition support. I've attached Charset.cpp with the patched UnicodeCombine for full precomposition support (for 16-bit UCS canonicals only).
(I can also send the extractor code for uppbox if needed)
Best regards,
Oblivion
void Test(int ch, Font fnt)
{
    TIMING("Glyph");
    HFONT hfont = GetWin32Font(fnt, 0);
    VERIFY(hfont);
    if(hfont) {
        HDC hdc = CreateIC("DISPLAY", NULL, NULL, NULL);
        HFONT ohfont = (HFONT) ::SelectObject(hdc, hfont);
        GLYPHMETRICS gm;
        memset(&gm, 0, sizeof(gm));
        MAT2 m_matrix;
        memset8(&m_matrix, 0, sizeof(m_matrix));
        m_matrix.eM11.value = 1;
        m_matrix.eM22.value = 1;
        int gsz = GetGlyphOutlineW(hdc, ch, GGO_NATIVE|GGO_UNHINTED|GGO_METRICS, &gm, 0, NULL, &m_matrix);
        if(gsz == GDI_ERROR)
            DLOG("Failed " << ch);
        if(gm.gmCellIncX != 75)
            DLOG(ch << " " << gm.gmCellIncX << ", " << gm.gmCellIncY << ", " << gm.gmptGlyphOrigin.x);
        ::SelectObject(hdc, ohfont);
        ::DeleteDC(hdc);
    }
}

GUI_APP_MAIN
{
    for(int i = 32; i < 100000; i++)
        Test(i, Font().Height(100).FaceName("MingLiU-ExtB"));
}
If we do not want to use Uniscribe, just good old GDI,
Quote:If we do not want to use Uniscribe, just good old GDI,
Why not? Does U++ also take care of devices that rely exclusively on GDI?
for phase 3, we can either decide to use platform specific library (Uniscribe/pango), which would mean encapsulating it to something platform independent, or just use Harfbuzz
Quote:for phase 3, we can either decide to use platform specific library (Uniscribe/pango), which would mean encapsulating it to something platform independent, or just use Harfbuzz
To my knowledge, the latest version of Pango can be compiled on Windows too (but satisfying its dependencies might not be worth the effort). Also, the latest version seems to support HarfBuzz as a backend.