Home » U++ Library support » U++ Libraries and TheIDE: i18n, Unicode and Internationalization » 16 bits wchar
Re: 16 bits wchar [message #17229 is a reply to message #17215] Sun, 03 August 2008 14:51
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
I've finished Label too and that's about it for the immediate support that I need for CJK. I fixed the previous problem of characters not being drawn by using a hardcoded font name for those problematic ranges. I guess Windows font support is not perfect either :). A better solution would be to determine whether the font can display the character and, if not, change the font, probably using a list determined at application startup. But I'm afraid that isn't simple to do with the current font rendering methods, so we should get back to it at the next text output engine refactoring (maybe when we do it for Linux, where it is more needed).

I only had to update a couple of U++ functions, and I wrote different encoding conversion functions which I explicitly call instead of the standard ones to limit my changes to specific parts of code and let the rest use the defaults.

I will probably need an updated edit control as well, but for now I'm pretty happy with my over 13000 unique characters displayed, which means full JIS support.

BTW, the Core2000 font available on the Internet has a number of broken codepoints, drawing the wrong characters in several cases. It is pretty hard to notice unless you know what to look for, so if anybody is using it, try "HAN NOM A" instead, which hasn't shown any errors so far.

The question is what now. Since I'm happy with my fixes and nobody else seems to have needs regarding CJK support, I could just rename the couple of functions I modified and override Paint in a control that inherits from Label, thus keeping my changes local and U++ version agnostic. Of course, I will release a package in Bazaar for those who for some particular reason need more than Unicode 1.1 support, but a fair warning is due: my changes are strongly biased towards Japanese characters, so Chinese or Korean specific issues might still exist.

Or I could merge my changes into my installed version of U++, use it for a while to see if there are other problems (Qtf and the edit controls are sure not to enjoy surrogate pairs) and continue researching how to best migrate U++ entirely to the new scheme. I'm only going to do this if you want these changes and if you want them relatively soon, i.e. in 1-2 dev releases. If not, I'll go with the first variant, because I still need to implement EUC-JP encoding support, for which I need huge conversion tables :).
Re: 16 bits wchar [message #17241 is a reply to message #17208] Mon, 04 August 2008 15:03
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Sat, 02 August 2008 07:27


Do you know of other key functions or classes that I need to look over to get basic output working? And could you explain in a few words how font compositioning works for U++. I found the code, but font compositioning is not used when I try to draw text. It will probably need to be modified to get it to work with surrogates also.


Well, U++ uses, obviously, the 16-bit XFT variants in DrawText. I suspect that maybe we would need to use the 32-bit variants and convert surrogate pairs to them.

Mirek
Re: 16 bits wchar [message #17242 is a reply to message #17229] Mon, 04 August 2008 15:07
mirek
Well, is my understanding correct that your method leads to non-BMP support in UTF-8/String and non-BMP support in WString via surrogate pairs?

The first is fine and a very good achievement (except that we still have the font issue in X11).

Second (WString) is still a bit problematic - WString should be the means of manipulating unicode texts on per-character basis (e.g. in editor).

This is a really unhappy situation :) I am still very undecided whether to introduce LString or make wchar 32-bit (and convert everything in Win32, plus take a modest performance impact on everything).

Mirek
Re: 16 bits wchar [message #17245 is a reply to message #17242] Mon, 04 August 2008 15:53
cbpporter
luzr wrote on Mon, 04 August 2008 16:07

Well, is my understanding correct that your method leads to non-BMP support in UTF-8/String and non-BMP support in WString via surrogate pairs?

The first is fine and a very good achievement (except that we still have the font issue in X11).


Yes, what I'm trying to achieve is the full Unicode code range for conversions, plus experimental support for size calculation, fonts and output.

Under Linux we will either go with fonts determined at startup for certain ranges of Unicode, or we need to do full font pooling on every display operation and somehow cache the results. I don't know how slow a font enquiry is, but with my 1GB of fonts it is pretty slow with full pooling (i.e. in Opera or Character Map).

Quote:

WString should be the means of manipulating unicode texts on per-character basis (e.g. in editor).



Using multiple code units per character doesn't prevent the use of a text editor or any other means of manipulating Unicode texts. It just needs slightly smarter methods for some operations. I know that using only one word is convenient, but Unicode says that there can be up to two words per codepoint, and there is no workaround other than using 32 bits, which is not much better, because even with UTF-32 there isn't a 1:1 relationship between a character and the display operation for that character. Non-whitespace characters, separators, control characters, combining characters and others must be filtered out, and the end result is the same as if you used 16-bit chars (where the same operations must be done, and I don't think they are done right now). Have you ever tried using combining characters in U++? And even worse, using combining characters with zero-width placeholders…

I think that, one by one, all methods that take a string and traverse it must be reviewed and altered to use a new style of traversal. This only applies to codepoint based addressing, like in GetTextSize. This could be done in a unified way, with iterators, or even with "fake index" iterators (which would be a little slower than plain iterators, which should have the same performance as index based traversal).
Anyway, performance shouldn't be a problem, because I've been experimenting with a faster conversion method, which uses local caches for short strings up to a static length, bypassing the general algorithm (traverse the data, compute code points, determine if escaping is necessary, return the length, then recompute the data); the second computation is done only for long strings. It should be faster, but I'm not done benchmarking yet because Linux console apps refuse to print anything since today (and to connect to MySQL, but that is unrelated).

This way there is actually no need for WString, except that it helps as an optimization because Win32 uses it. In the end, we will probably need a full text layout engine, breaking text into multiple segments and drawing them one by one to support composition, multi-character composition and RTL.
Re: 16 bits wchar [message #17248 is a reply to message #17245] Mon, 04 August 2008 17:14
mirek
cbpporter wrote on Mon, 04 August 2008 09:53


Using multiple code units per character doesn't prevent the use of a text editor or any other means of manipulating Unicode texts. It just needs slightly smarter methods for some operations. I know that using only one word is convenient, but Unicode says that there can be up to two words per codepoint, and there is no workaround other than using 32 bits, which is not much better, because even with UTF-32 there isn't a 1:1 relationship between a character and the display operation for that character. Non-whitespace characters, separators, control characters, combining characters and others must be filtered out, and the end result is the same as if you used 16-bit chars (where the same operations must be done, and I don't think they are done right now).



Well, this rather sounds like we should kick out WString altogether and keep just UTF-8 :)

Quote:


This way there is actually no need for WString, except that it helps as an optimization because Win32 uses it. In the end, we will probably need a full text layout engine, breaking text into multiple segments and drawing them one by one to support composition, multi-character composition and RTL.


Ah, right :)

OTOH, on the logical level, I still see characters on the screen. And those characters should be edited on a per-character basis.

Maybe we just need a smarter encoding than Unicode? :)

It makes me think: realistically, there are a lot of "reserved" positions in the BMP. Could we just use them for this?

Mirek
Re: 16 bits wchar [message #17253 is a reply to message #17248] Mon, 04 August 2008 22:47
cbpporter
luzr wrote on Mon, 04 August 2008 18:14


Well, this rather sounds like we should kick out WString altogether and keep just UTF-8 :)


I think that we should keep both, and even add LString eventually, just for the sake of completeness. In another package, if you're worried about exe size.

Quote:


Maybe we just need a smarter encoding than Unicode? :)

It makes me think: realistically, there are a lot of "reserved" positions in the BMP. Could we just use them for this?


Well, there is nothing better than Unicode AFAIK. It may sometimes seem like there is too much fuss around it, but if you were in my place and had to deal with other legacy encodings (EUC, EUC-JP, Shift JIS, JIS and a couple of ISOs, many of which don't guarantee round-trip conversion), you would see that Unicode is a true blessing. Great that I have iconv to ease the burden a little.

And BTW, Unicode forbids the use of reserved or unassigned code points for any purpose :).

Anyway, I ran my benchmarks on my Windows machine, where console output still works. I did the tests with some experimental methods which are not complete, so the results could be a little inaccurate, but they are still interesting enough to post.

I used 3 methods to convert two UTF-8 data sets to UTF-16. The first method is the standard U++ FromUtf8. The second is my FromUtf8SR, which takes 4-byte characters into account, and the third is the highly experimental FromUtf8SR2. The first data set consists of 200 Latin characters, representing 200 code points (the letter c 200 times). The second consists of 100 kanji, 3 bytes each, totaling 300 bytes. On second thought, I should have used data sets of the same size. All conversions are run 1000000 times.

In Debug mode (FromUtf8 / FromUtf8SR / FromUtf8SR2):
Latin: 3125 / 3203 / 2078
Kanji: 3891 / 3906 / 2406

Nothing too impressive here. The first, standard method is a little faster than mine, and the experimental one is considerably faster.

In Release mode (FromUtf8 / FromUtf8SR / FromUtf8SR2):
Latin: 484 / 485 / 390
Kanji: 4718 / 3157 / 812

Here, for kanji, my method really is a lot faster. But in Release mode, FromUtf8 on an all-kanji input is slower than in Debug mode. Can someone verify this? Maybe I messed something up.

As I said, my experimental method is really experimental and not complete yet (I hope it is also thread safe). I hope I'm not on a wild goose chase (is that the expression?) and didn't miss something that would render my experimental method useless or wrong, because the numbers are great!
Re: 16 bits wchar [message #17257 is a reply to message #17253] Tue, 05 August 2008 00:03
mirek
cbpporter wrote on Mon, 04 August 2008 16:47
Well there is nothing better than Unicode AFAIK. It may seem sometimes like there is too much fuss with it, but if you are in my place and have to deal with other legacy encodings, you would have to deal with EUC, EUC-JP, ShiftJIS, JIS and a couple of ISOs, where a lot of these encoding don't guarantee round-trip conversion, and you'll see that Unicode is a true blessing. Great that I have iConv to ease the burden a little.

And BTW, Unicode forbids the use of the reserved or unassigned code points for any use Smile.

I obviously do not understand the depth of the problem; anyway:

One code-point corresponds, at the end of the process, to one font glyph. Is that correct?

Meanwhile, it can be made of several Unicode words/dwords. Correct?

If yes, how many codepoints do we need in *existing fonts*?

If we can fit all possible font glyphs into 64K codepoints, the problem is solved. Of course, we would need some more conversion routines between our "UnicodeEx" and the "real Unicode"...

Mirek

[Updated on: Tue, 05 August 2008 00:04]


Re: 16 bits wchar [message #17258 is a reply to message #17257] Tue, 05 August 2008 00:12
cbpporter
I was trying to finish my methods, but I came to the conclusion that the result was far too complicated and I wouldn't be able to maintain it. But then I tried something else. Something a lot simpler.

Add this to CharSet.cpp (or any other package except MakeList, to escape the aggressive link optimizer if the method is in the same package):
WString FromUtf8Op(const char *_s, int len)
{
	if (len >= 8000)
		return FromUtf8(_s, len);
	
	const byte *s = (const byte *)_s;
	const byte *lim = s + len;
	//int tlen = utf8len(_s, len);
	//WStringBuffer result(tlen);
	wchar buf[33000];
	wchar *t = buf;
	if(len > 4)
		while(s < lim - 4) {
			unsigned code = (byte)*s++;
			if(code < 0x80)
				*t++ = code;
			else
			if(code < 0xC2)
				*t++ = 0xEE00 + code;
			else
			if(code < 0xE0) {
				word c = ((code - 0xC0) << 6) + s[0] - 0x80;
				if(s[0] >= 0x80 && s[0] < 0xc0 && c >= 0x80 && c < 0x800)
					*t++ = c;
				else {
					*t++ = 0xEE00 + code;
					*t++ = 0xEE00 + s[0];
				}
				s += 1;
			}
			else
			if(code < 0xF0) {
				word c = ((code - 0xE0) << 12) + ((s[0] - 0x80) << 6) + s[1] - 0x80;
				if(s[0] >= 0x80 && s[0] < 0xc0 && s[1] >= 0x80 && s[1] < 0xc0 && c >= 0x800
				   && !(c >= 0xEE00 && c <= 0xEEFF))
					*t++ = c;
				else {
					*t++ = 0xEE00 + code;
					*t++ = 0xEE00 + s[0];
					*t++ = 0xEE00 + s[1];
				}
				s += 2;
			}
			else
				*t++ = 0xEE00 + code;
		}
	while(s < lim) {
		word code = (byte)*s++;
		if(code < 0x80)
			*t++ = code;
		else
		if(code < 0xC0)
			*t++ = 0xEE00 + code;
		else
		if(code < 0xE0) {
			if(s > lim - 1) {
				*t++ = 0xEE00 + code;
				break;
			}
			word c = ((code - 0xC0) << 6) + s[0] - 0x80;
			if(s[0] >= 0x80 && s[0] < 0xc0 && c >= 0x80 && c < 0x800)
				*t++ = c;
			else {
				*t++ = 0xEE00 + code;
				*t++ = 0xEE00 + s[0];
			}
			s += 1;
		}
		else
		if(code < 0xF0) {
			if(s > lim - 2) {
				*t++ = 0xEE00 + code;
				while(s < lim)
					*t++ = 0xEE00 + *s++;
				break;
			}
			word c = ((code - 0xE0) << 12) + ((s[0] - 0x80) << 6) + s[1] - 0x80;
			if(s[0] >= 0x80 && s[0] < 0xc0 && s[1] >= 0x80 && s[1] < 0xc0 && c >= 0x800
			   && !(c >= 0xEE00 && c <= 0xEEFF))
				*t++ = c;
			else {
				*t++ = 0xEE00 + code;
				*t++ = 0xEE00 + s[0];
				*t++ = 0xEE00 + s[1];
			}
			s += 2;
		}
		else
			*t++ = 0xEE00 + code;
	}
	*t = 0;
	//ASSERT(t - ~result == tlen);
	return WString(buf, t - buf);
}


Then try out this test package to see if there is really a performance improvement. It contains a simple benchmark.
  • Attachment: MakeList.rar
    (Size: 0.79KB, Downloaded 362 times)
Re: 16 bits wchar [message #17260 is a reply to message #17258] Tue, 05 August 2008 00:14
mirek
Could you give a small bit of explanation about what you are trying to achieve? I am lost now :)
Re: 16 bits wchar [message #17261 is a reply to message #17260] Tue, 05 August 2008 00:18
cbpporter
I'm just trying to get my chars displayed :). And while doing that, I wrote some overcomplicated conversions between encodings, which had better performance for short strings where most characters are CJK.

And what I posted in the previous message is a benchmark which tests a new and a lot simpler method that gets a comparable speedup. Sorry if I'm not that clear, but it is almost 2 past midnight here.
Re: 16 bits wchar [message #17262 is a reply to message #17261] Tue, 05 August 2008 00:20
mirek
You mean UTF-8 to the *correct* UTF-16 (with surrogate pairs)?

Mirek
Re: 16 bits wchar [message #17263 is a reply to message #17262] Tue, 05 August 2008 00:24
cbpporter
My code in my package handles surrogate pairs, but that's not what I posted. What I posted just now is a quick patch based on the default U++ method, which should behave 100% the same way, without extra surrogate support or anything else. It is not for inclusion in Core; it is just a test to see whether the performance gain is somehow local to my machine, and to get some extra eyes on it to check that it is not due to some other cause.
Re: 16 bits wchar [message #17264 is a reply to message #17263] Tue, 05 August 2008 00:26
mirek
Yes.

It looks like the basis for the optimization is avoiding utf8len, right?

I guess it is a good idea, I will think about it :)

Mirek
Re: 16 bits wchar [message #17265 is a reply to message #17264] Tue, 05 August 2008 00:32
cbpporter
Yes, because utf8len does pretty much the same work as the conversion function. Most strings that get converted are shorter than that arbitrary limit I imposed, so we get the performance benefit; and if they are longer, one extra function call won't make a difference.

I had a very complicated version, but then I decided to simply use an if, even if it leads to some code duplication.

Also, on my machine, if I change the line where code is defined (word code = *s++) to either byte or unsigned, I get an unexplained performance boost.
Re: 16 bits wchar [message #17266 is a reply to message #17265] Tue, 05 August 2008 00:51
mirek
cbpporter wrote on Mon, 04 August 2008 18:32


Also, on my machine, if I change the line where code is defined: word code = *s++ to either byte or unsigned, I also get an unexplained performance boost.


Sometimes it pays off to look at the assembly :)

Anyway, might I ask you to think about / comment on the codepoint == glyph and distinct(codepoint) < 64K claims?

Mirek
Re: 16 bits wchar [message #17267 is a reply to message #17266] Tue, 05 August 2008 10:42
mirek
Hm, I have been thinking about our problem a lot....

I believe that we should do one important thing first: scan all available fonts and count/list all the codepoints there...

Mirek
Re: 16 bits wchar [message #17268 is a reply to message #17266] Tue, 05 August 2008 12:03
cbpporter
luzr wrote on Tue, 05 August 2008 01:51


Anyway, might I ask you to think about / comment codepoint == glyph and distinct(codepoint) < 64K claims?


I really can't imagine how that would be possible.

First of all, how do you expect to squeeze almost 100K characters into 64K? Some kind of dynamic character set loading would be needed, and even then a string could not contain every possible character.

And second, in Unicode codepoint != glyph. The 90k+ codepoints can theoretically be combined to produce an endless number of glyphs. Think of Unicode as a comparatively feature-poor Qtf: codepoints are commands. 99% of the commands are "print glyph X", but the rest let you manipulate the layout and appearance of glyphs. It is not visual manipulation, like with fonts, but manipulation that alters the abstract concept of a glyph, like adding diacritics.

The reason this is not that obvious is that the Win API handles it for you automatically. Most users and even developers are not familiar with this process, and if their input data happens to contain such characters, Win controls will display them correctly. All common diacritics are handled pretty well, but uncommon ones are often handled incorrectly. This could be one of U++'s strong points in the future. When all font issues are resolved (probably not before 2009.1 :P), if we offered full combining character support algorithmically where fonts fail, we would certainly be in a relatively unique position.

But since we don't use native controls, we are more exposed to these issues. Under Windows, when you use such text in non-editable controls in U++, you get the correct result; but if you use an EditString, for example, you have to press the cursor keys multiple times to step through a character which visually is made of only one glyph but uses several code points as its representation.

This problem can be addressed relatively easily, by updating a couple of functions and making sure that the Windows API always gets full chunks of text.

Under Linux, such support is a lot poorer. Since we send text to X one codepoint at a time, no composition can take place. And I don't even know whether the X methods in use can handle such texts. All my experiments in U++ gave the same result: diacritics are removed and the rest of the characters are displayed as whitespace. KDE editors seemed quite happy with such codes, while gedit displayed the characters correctly but without composing them in the same place, so basically it did no better than U++ would with font pooling.

As always, I come to the same conclusion: nobody really cares about proper internationalization and Unicode (except Qt/KDE, which seems to have the best support of all, comparable to and maybe better than Windows, but seemingly poorer because of the available fonts).

Quote:


Hm, I was thinking about our problem a lot....

I believe that we should do one important thing first - scan all available fonts and count/list all codepoints there...


Yes, that would help under Windows and is a must under Linux. We could even use some "heuristics", i.e. if a font has 2 Arabic characters, there is a high probability that it handles all Arabic characters from that given Unicode range. Maybe we can get away with splitting all codepoints into ranges on a per-script basis and only testing some key characters, but I can't be sure without testing.
Re: 16 bits wchar [message #17277 is a reply to message #17268] Tue, 05 August 2008 15:12
mirek
cbpporter wrote on Tue, 05 August 2008 06:03

luzr wrote on Tue, 05 August 2008 01:51


Anyway, might I ask you to think about / comment codepoint == glyph and distinct(codepoint) < 64K claims?


I really can't imagine how that would be possible.

First of all, how do you expect to squeeze almost 100K characters into 64K? Some kind of dynamic character set loading would be needed, and even then a string could not contain every possible character.



Yes; meanwhile I have studied it a little more, and you are right.

Quote:


And second, in Unicode codepoint != glyph. The 90k+ codepoints can theoretically be combined to produce an endless number of glyphs. Think of Unicode as a comparatively feature-poor Qtf: codepoints are commands. 99% of the commands are "print glyph X", but the rest let you manipulate the layout and appearance of glyphs. It is not visual manipulation, like with fonts, but manipulation that alters the abstract concept of a glyph, like adding diacritics.



Well, I was studying this as well and came to the conclusion that combining is of little concern.

First, AFAIK, basic Unicode "compliance" does not require it.

Second, all important ("real") combining codepoints have precomposed characters in Unicode.

IMO, I would regard combining as a sort of formatting info, similar to '\n' or '\t': something we need to be aware of (and, in fact, we already are, sort of; see UnicodeCombine...) but do not need to actively support in editors etc...

BTW, that UnicodeCombine is exactly the sort of support that makes sense.

Quote:


All common diacritics are handled pretty well, but uncommon ones are often handled incorrectly.



This is because there is no general way to create a combined glyph....

Quote:


Under Windows, when you use such text in non editable controls in U++, you get correct result, but if you use an EditString for example, you have to press cursor keys multiple times to step through a character which visually is made out of only one glyph, but uses several code points as representation.



That does not make sense to me... :)

Quote:


This problem can be relatively easily addressed, by updating a couple of functions and making sure than Windows API always gets full chunks of text.



IMO, this would in fact be pretty hard to address. Or it would result in a confusing user interface.

Quote:


Under Linux, such support is a lot poorer. Since we send to X text one codepoint at a time, no composition can take place.



Actually, we do not. The interface accepts strings. But I doubt it manages combining.

Quote:


And I don't even know whether the X methods in use can handle such texts. All my experiments in U++ gave the same result: diacritics are removed and the rest of the characters are displayed as whitespace. KDE editors seemed quite happy with such codes, while gedit displayed the characters correctly but without composing them in the same place, so basically it did no better than U++ would with font pooling.



We will, I promise :) (Well, I would rather describe it as "font substitution"...)

Quote:


As always, I come to the same conclusion: nobody really cares for proper internationalization and Unicode



The question is how much combining really helps... IMO, it is not worth the enormous trouble it brings...

Quote:


Yes, that would help under Windows and is a must under Linux. We could even use some "heuristics", i.e. if a font has 2 Arabic characters, there is a high probability that it handles all Arabic characters from that given Unicode range. Maybe we can get away with splitting all codepoints into ranges on a per-script basis and only testing some key characters, but I can't be sure without testing.



Oh, for the beginning, I was rather thinking about an "offline experimental scan" to find out what is really going on :)

Maybe we should then match "standard substitution fonts" for all basic fonts.

Another interesting point is what then happens to my heuristic "glyph fixing" for characters 256-512 (U++ synthesizes missing glyphs there by combining characters 0-256). So it is sort of an alternative approach to font substitution. But I would keep it, as it results in better looking texts.

Mirek
Re: 16 bits wchar [message #17278 is a reply to message #17277] Tue, 05 August 2008 15:19
mirek
PS: I am now leaning toward

typedef int wchar;
Re: 16 bits wchar [message #17282 is a reply to message #17277] Tue, 05 August 2008 15:57
cbpporter
luzr wrote on Tue, 05 August 2008 16:12


Well, I was studying this as well and came to conclusion that combining is of little concern.

First, AFAIK, basic Unicode "compliance" does not require it.

Second, all important ("real") combining codepoints have characters in Unicode.


Sure, it does not require it, but it is relatively easy to implement. I have a pretty clear idea of how to do it. But you are right, it's not a priority right now.

Quote:


This is because there is no general way how to create combined glyph....


Compute the size of the base character, retrieve the alignment of the character it is combined with, align both in a rect whose size is the maximum of the two, and draw. Basically, in pseudocode:
draw(curx, cury, basechar);
draw(curx + deltax, cury + deltay, combinedchar);

Finding the deltas is not that easy, but doable. This is pretty much what Qt does (empirically determined) and it is near perfect.

Quote:


Does not make sense to me... Smile


What doesn't make sense? Basically, what I said is that you can't feed composed characters to an editable control and expect alignment and keyboard/mouse navigation to work.

Quote:


IMO, this would be pretty hard to address in fact. Or result in confusing user interface.


I don't understand why this affects the user interface. Everything looks the same from the point of view of the user.

Quote:


Actually, we do not. Interface accepts strings. But I doubt it manages combining.


You are doing:
for(int i = 0; i < n; i++) {
	wchar h = text[i];
	XftDrawString16(..., (FcChar16 *)&h, 1);
}

That's drawing characters one at a time (if the angle is not zero). If a composition system is available, it will not trigger in this case, because drawing the characters used for composition one at a time doesn't make any sense.
But if the angle is zero, you are right. I guess it does not do composition.

The offline scan seems like a good idea.
