Re: 16 bits wchar [message #17300 is a reply to message #17282] Wed, 06 August 2008 13:33
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
Well, I was going to go live today with my changes. I had been issue-free for a couple of days and I wanted to merge everything into Core on my local setup to see if there are any issues that I haven't discovered yet.

But I read something interesting. It seems that Unicode 5.1 gives quite specific information about what to do with ill-formed text and, most importantly, where the boundaries of such text are. In 5.0 there was a lot of room for interpretation, so pragmatically speaking this is a good change.

To quote Unicode:
Quote:

A process which interprets a Unicode string must not interpret any ill-formed code unit subsequences in the string as characters. (See conformance clause C10.) Furthermore, such a process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences.

This does slightly change the result of error escaping in some cases. I don't know whether it is important or not. I'll have to think about it.
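To make the quoted rule concrete, here is a minimal sketch of a UTF-8 decoder that follows it. It is not the U++ error-escaping code: the function name DecodeUtf8Lenient is made up for illustration, and substituting U+FFFD for each ill-formed subsequence is an assumption (U++ may escape errors differently). The point is only that the byte which breaks a sequence is not consumed, so a well-formed neighbour is never swallowed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into code points. Each ill-formed subsequence becomes one
// U+FFFD. The byte that breaks a sequence is NOT consumed; it is examined
// again as the possible start of the next sequence.
std::u32string DecodeUtf8Lenient(const std::string& in)
{
    std::u32string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(in[i++]);
        if (b0 < 0x80) { out.push_back(b0); continue; }   // ASCII

        int extra = 0;                       // continuation bytes expected
        std::uint8_t lo = 0x80, hi = 0xBF;   // allowed range of the 2nd byte
        char32_t cp = 0;
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            extra = 1; cp = b0 & 0x1F;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            extra = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;       // rejects overlong forms
            if (b0 == 0xED) hi = 0x9F;       // rejects encoded surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            extra = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;       // rejects overlong forms
            if (b0 == 0xF4) hi = 0x8F;       // rejects values above U+10FFFF
        } else {
            out.push_back(0xFFFD);           // stray continuation, C0, C1, F5..FF
            continue;
        }

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            if (i >= n) { ok = false; break; }            // truncated at end of input
            const std::uint8_t b = static_cast<std::uint8_t>(in[i]);
            if (b < lo || b > hi) { ok = false; break; }  // leave b for the next round
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;            // later bytes use the plain range
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

With this, a truncated 4-byte sequence followed by 'B' decodes as one replacement character and then 'B'; the 'B' is not absorbed into the error.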
Re: 16 bits wchar [message #17322 is a reply to message #17300] Thu, 07 August 2008 08:41
cbpporter
OK, the merge went well and, except for some issues that I expected, no new ones appeared. New characters can even be inserted and displayed in Qtf, but the way text metrics are handled in Qtf makes it look and behave slightly differently, which suggests that Qtf uses a more manual text layout scheme than the other output methods in U++. I will look into it.

I'm using this text image:
index.php?t=getfile&id=1303&private=0

This is pretty much a reference Windows rendering with font Arial 24. First character is CJK, second is 'i', third CJK, fourth CJK from SIP, then CJK again and Latin 'M'.

And here is a side by side comparison in three different applications:
index.php?t=getfile&id=1304&private=0

First is OpenOffice, second is Notepad, and third is U++ with a Label and an EditField.

So let me congratulate OpenOffice for completely forgetting to display my SIP character! Not even a black box. But if you try to use the cursor to navigate, it will act as if there were an invisible character at that position. Even the super-beta KOffice for Windows, which is an unusable piece of software, gets it right. Notepad and Wordpad can handle it too: Notepad renders it all, while Wordpad renders a black box, since it takes the font specification literally and doesn't seem to do font pooling. Changing the font results in correct display, though.

Next is U++. As you can see, the display works fine, except I don't understand why Arial(24) does not look the same as in all the other applications. It looks smaller, even without font zooming. I need to fix this somehow.
  • Attachment: font2.PNG
    (Size: 1.25KB, Downloaded 944 times)
  • Attachment: font.PNG
    (Size: 4.33KB, Downloaded 883 times)
Re: 16 bits wchar [message #17326 is a reply to message #17322] Thu, 07 August 2008 16:10
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Thu, 07 August 2008 02:41


Next is U++. As you can see, the display works fine, except I don't understand why Arial(24) does not look the same as in all the other applications. It looks smaller, even without font zooming. I need to fix this somehow.


Arial(24) is not Arial 24pt (but Arial 24 pixels).

Besides, pt are only really meaningful when you print something.

Mirek
Re: 16 bits wchar [message #17327 is a reply to message #17326] Thu, 07 August 2008 17:33
cbpporter
luzr wrote on Thu, 07 August 2008 17:10


Arial(24) is not Arial 24pt (but Arial 24 pixels).

Besides, pt are only really meaningful when you print something.

Mirek


Aren't points the international consensus for delivering consistent and resolution independent font sizes?
Re: 16 bits wchar [message #17328 is a reply to message #17327] Thu, 07 August 2008 17:40
mirek
cbpporter wrote on Thu, 07 August 2008 11:33

luzr wrote on Thu, 07 August 2008 17:10


Arial(24) is not Arial 24pt (but Arial 24 pixels).

Besides, pt are only really meaningful when you print something.

Mirek


Aren't points the international consensus for delivering consistent and resolution independent font sizes?


Yes. On paper.

On a display, zoom capability makes it moot, especially as long as most fonts are hint-optimized for pixel sizes. And you can never say with 100% certainty what the DPI of the monitor is.

That is why Arial(24) is 24 *pixels*.

Mirek
Re: 16 bits wchar [message #17330 is a reply to message #17328] Thu, 07 August 2008 18:37
cbpporter
So if I want points, I have to manually compute how many pixels the given size in points would take and use that.
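For reference, the conversion itself is short: a point is 1/72 inch, so pixels = points * DPI / 72. The sketch below assumes you already know the DPI you want to target (on Win32 it can be queried with GetDeviceCaps and LOGPIXELSY); PointsToPixels is a hypothetical helper name, not something from U++.

```cpp
#include <cassert>
#include <cmath>

// Convert a size in typographic points (1 pt = 1/72 inch) to device pixels,
// rounded to the nearest whole pixel.
int PointsToPixels(double points, double dpi)
{
    assert(points >= 0 && dpi > 0);
    return static_cast<int>(std::lround(points * dpi / 72.0));
}
```

At 96 DPI, 24 pt comes out as 32 pixels, which would account for Arial(24) looking smaller than "Arial 24" in the other applications. Going by the replies above, the result is what you would then pass as the pixel height, e.g. Arial(PointsToPixels(24, 96)).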
Re: 16 bits wchar [message #17331 is a reply to message #17330] Thu, 07 August 2008 20:01
mirek
Yes.

Mirek
Re: 16 bits wchar [message #17351 is a reply to message #17331] Fri, 08 August 2008 13:34
cbpporter
I fixed Qtf to accept 4-byte UTF-8. It is strange that it doesn't accept it when passed directly, but if you copy & paste into a control like RichEdit, it has no problems. Probably because only ParseQtf cares about correct codes, and copy/paste is not checked.

Then I continued fixing EditField navigation. And here I found some interesting problems. Using FontInfo, I keep getting wrong widths for SIP (non BMP) characters, even though it uses the right code points.
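For readers following along: with a 16-bit wchar, a SIP (plane 2) code point does not fit in one unit and has to be stored as a surrogate pair, which is why width lookups indexed by a single 16-bit value cannot see it. A minimal sketch of the mapping, using standard C++ string types rather than WString; the function names are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Append one code point to a UTF-16 string. Code points above U+FFFF are
// written as a high surrogate followed by a low surrogate.
void AppendUtf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
        return;
    }
    cp -= 0x10000;                                               // 20 bits remain
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // top 10 bits
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low 10 bits
}

// Rebuild the code point from a valid surrogate pair.
char32_t CombineSurrogates(char16_t hi, char16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
}
```

For example, U+20000 (the first code point of plane 2) is stored as the two units D840 DC00, so any per-character metric has to be looked up after combining the pair, not per 16-bit unit.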

So I did some testing and rendered a text composed out of a SIP character and a BMP character using six different fonts:
index.php?t=getfile&id=1305&private=0

The first sample uses StdFont, whatever that may be, and it has alignment problems. The second is Arial, and the third and fourth are the Windows Japanese fonts MS Mincho and MS Gothic. The standard CJK Windows fonts should obviously be the best choice, yet they have the worst alignment problems. Numbers 5 and 6 are HAN NOM A (Plane 0) and HAN NOM B (Plane 2), free fonts that have all the needed characters.

As you can see, most of the samples are not rendered correctly. It is OK to have such problems when mixing Latin fonts with CJK fonts, but the last 4 fonts are all CJK. The problem is that Windows uses its fallback font exclusively for SIP characters. I couldn't even find a Win API function that, when enumerating Unicode ranges, uses anything larger than a word. And even if a font contains a SIP character, Windows font pooling does not manage to find it. It is clear from the screenshot that the first character is always drawn from the same font and is somehow coerced by the font rendering engine to look more like the selected font. But for CJK fonts, making them look more like Arial or Verdana doesn't make much sense, and I'm sure that users would not appreciate this. It is clear that all the first characters are drawn from HAN NOM B, because this is my system fallback plane 2 font. If I disable it I get this result:
index.php?t=getfile&id=1306&private=0

Only the 5th sample can draw the first character, because it has its font given explicitly, and it messes up the second one, because it takes it from a different font.

So my conclusion is the following: Windows tries to render with the given font, for example Times New Roman. When it finds a non-BMP character, it changes the font to the system fallback font for that given font and tries to apply Times New Roman hinting and weight to it. And it fails pretty badly in most cases. This is probably why using FontInfo gives wrong widths: it fails to change the font to the fallback, and tries to return font metrics taking into account the current font and maybe some other fonts, but not the fallback.

Then I tried to bypass the whole automatic fallback system and composed my text manually. Here is the result:
index.php?t=getfile&id=1307&private=0

It is obviously a lot better, competing in correctness with sample number 6. Yet it is based on sample number 3, using the fallback and a standard CJK Windows font, but without letting Windows apply some freaky font transformations that don't really work.

So a possible solution would include providing a StdPlane2() function and a modified DrawTextOp function. This way the only font that you have to choose is for BMP CJK. The SIP characters are always drawn with the same font, and it is up to you to choose a font for BMP that fits from a stylistic point of view. Even if the styles don't fit, the sizes will fit a lot better, because Windows is relatively good at giving a glyph with the size you requested, and even if it is not perfect, it is going to be a lot better than in screenshot number one, samples 3 and 4.
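A rough sketch of the first half of that idea: split the text into runs by Unicode plane, so that each run can be drawn with its own font. PlaneRun and SplitByPlane are hypothetical names, the text is assumed to be already decoded to 32-bit code points, and the drawing calls are left out because StdPlane2() and the modified DrawTextOp do not exist yet.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A maximal run of consecutive code points that belong to the same plane.
struct PlaneRun {
    std::size_t begin;   // index of the first code point of the run
    std::size_t count;   // number of code points in the run
    int         plane;   // 0 = BMP, 2 = SIP, ...
};

// Split decoded text into runs, one per change of plane.
std::vector<PlaneRun> SplitByPlane(const std::u32string& text)
{
    std::vector<PlaneRun> runs;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const int plane = static_cast<int>(text[i] >> 16);   // plane = cp / 0x10000
        if (!runs.empty() && runs.back().plane == plane)
            ++runs.back().count;                              // extend the current run
        else
            runs.push_back({i, 1, plane});                    // start a new run
    }
    return runs;
}
```

The drawing routine would then walk the runs, pick the user's font for plane 0 and the plane 2 font for plane 2, draw each run, and advance x by the width measured with the font that was actually used for that run.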

What do you think?

PS: And under Linux, what do you think about using some font rendering API that is a little smarter than Xft? Maybe something from GNOME or Pango? What is your attitude regarding new dependencies :)?
  • Attachment: test0.PNG
    (Size: 2.12KB, Downloaded 962 times)
  • Attachment: test1.PNG
    (Size: 1.88KB, Downloaded 1147 times)
  • Attachment: test2.PNG
    (Size: 0.62KB, Downloaded 1206 times)
Re: 16 bits wchar [message #17354 is a reply to message #17351] Fri, 08 August 2008 15:32
mirek
cbpporter wrote on Fri, 08 August 2008 07:34


Then I continued fixing EditField navigation. And here I found some interesting problems. Using FontInfo, I keep getting wrong widths for SIP (non BMP) characters, even though it uses the right code points.



No surprise, FontInfo only supports BMP.

Other than that, the rest of your message indicates what a mess all this is :)

Quote:


PS: And under Linux, what do you think about using some font rendering API that is a little smarter than Xft? Maybe something from GNOME or Pango? What is your attitude regarding new dependencies :)?



Well, I think we will have to solve this issue in Win32 too... and the solution there will be common for both platforms.

Oh well, I think we will have to start with wchar -> int.... That will solve quite a lot of problems (I bet QTF will start working etc...). Besides, int based WString can be quite useful outside text handling too :)

Then we will have to look into font substitution techniques...

Mirek
Re: 16 bits wchar [message #17356 is a reply to message #17354] Fri, 08 August 2008 15:47
cbpporter
luzr wrote on Fri, 08 August 2008 16:32


No surprise, FontInfo only supports BMP.

Other than that, the rest of your message indicates what a mess all this is :)


Well, I'm pretty sure that I fixed it to work outside of the BMP, but not to handle plane-based fallback fonts.

Quote:


Oh well, I think we will have to start with wchar -> int.... That will solve quite a lot of problems (I bet QTF will start working etc...). Besides, int based WString can be quite useful outside text handling too :)


Sure, that would be good for a start. Even better would be to abstract away such details by using some kind of a string iterator class. Most processing is done by *s++ and similar constructs, and these can be emulated by fast and convenient iterators, which all return 32-bit results when used both with String and WString (and DString, and...).
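Something along these lines, as a minimal sketch for the 16-bit case only. Utf16Reader is a made-up name, it works on std::u16string rather than WString, and returning U+FFFD for an unpaired surrogate is an assumption about error handling. A UTF-8 reader with the same two methods would let the calling code stay identical for both string types.

```cpp
#include <cstddef>
#include <string>

// Reads a UTF-16 buffer one code point at a time.
// The reader only stores pointers, so it must not outlive the string.
class Utf16Reader {
public:
    explicit Utf16Reader(const std::u16string& s)
        : p_(s.data()), end_(s.data() + s.size()) {}

    bool AtEnd() const { return p_ == end_; }

    // The counterpart of *s++: return the next code point as 32 bits and
    // advance. Call only while !AtEnd(). Unpaired surrogates yield U+FFFD.
    char32_t Next()
    {
        const char32_t u = *p_++;
        if (u < 0xD800 || u > 0xDFFF)
            return u;                                    // plain BMP unit
        if (u <= 0xDBFF && p_ != end_ && *p_ >= 0xDC00 && *p_ <= 0xDFFF) {
            const char32_t lo = *p_++;                   // consume the low surrogate
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        return 0xFFFD;                                   // lone surrogate
    }

private:
    const char16_t* p_;
    const char16_t* end_;
};

// Example use: count code points (not 16-bit units) in a string.
std::size_t CountCodePoints(const std::u16string& s)
{
    Utf16Reader r(s);
    std::size_t n = 0;
    for (; !r.AtEnd(); r.Next())
        ++n;
    return n;
}
```

The trade-off is that sequential processing stays cheap, but anything that needs the n-th character or a character count now costs a linear scan instead of simple index arithmetic.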

And for me personally, using 32 bits is pretty much out of the question for production code, because I have very strict RAM needs and I may be forced to replace String and WString with wchar[3] (not null terminated) for most of my database. I hope it doesn't come to this because that would be a terrible mess...
Re: 16 bits wchar [message #17357 is a reply to message #17356] Fri, 08 August 2008 18:25
mirek
cbpporter wrote on Fri, 08 August 2008 09:47


Sure, that would be good for a start. Even better would be to abstract away such details by using some kind of a string iterator class. Most processing is done by *s++ and similar constructs, and these can be emulated by fast and convenient iterators, which all return 32-bit results when used both with String and WString (and DString, and...).



IMO it only looks simple.

Consider only the simple fact that you might want to display the column number in TheIDE :)

Quote:


And for me personally, using 32 bits is pretty much out of the question for production code, because I have very strict RAM needs and I may be forced to replace String and WString with wchar[3] (not null terminated) for most of my database. I hope it doesn't come to this because that would be a terrible mess...



I think WString in fact should only be used as "transient uncompressed form". Just like it already is everywhere, except EditField.

If you really have very strict memory requirements, using something like ZCompress on a UTF-8 String would have superior results anyway... :)

Hm, OTOH, using only 3 bytes per character in WString is perhaps not that bad an idea :)

Mirek
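As a side note on the 3-bytes-per-character idea: the largest Unicode code point is U+10FFFF, which needs 21 bits, so every code point fits in 24 bits. A minimal sketch of such a packing follows; Pack24 and At24 are hypothetical names, the little-endian byte order is an arbitrary choice, and nothing like this exists in WString.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Store every code point in exactly 3 bytes, least significant byte first.
std::vector<std::uint8_t> Pack24(const std::u32string& text)
{
    std::vector<std::uint8_t> out;
    out.reserve(text.size() * 3);
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF);                  // 21 bits always fit in 24
        out.push_back(static_cast<std::uint8_t>(cp & 0xFF));
        out.push_back(static_cast<std::uint8_t>((cp >> 8) & 0xFF));
        out.push_back(static_cast<std::uint8_t>((cp >> 16) & 0xFF));
    }
    return out;
}

// Random access stays O(1): character i lives at bytes 3*i .. 3*i+2.
char32_t At24(const std::vector<std::uint8_t>& packed, std::size_t index)
{
    assert(index * 3 + 2 < packed.size());
    const std::uint8_t* p = packed.data() + index * 3;
    return char32_t(p[0]) | (char32_t(p[1]) << 8) | (char32_t(p[2]) << 16);
}
```

Compared with a 32-bit character this saves a quarter of the memory while keeping fixed-width indexing; the cost is unaligned 3-byte reads instead of a single word load.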
Re: 16 bits wchar [message #17363 is a reply to message #17357] Sat, 09 August 2008 01:45
cbpporter
Here is a little demo of my effort thus far. Nothing too fancy, just a window with an EditField. Keyboard navigation, editing and selection work great, and it no longer looks like crap even though I'm using two different fonts for rendering. If you use all the keyboard shortcuts it may still be possible to mess up the cursor position, since I didn't investigate all shortcuts, and mouse selection is not fixed yet.

The predefined text consists of SIP, BMP, BMP, space, Latin, space, SIP, space, Latin, BMP, space, SIP. You will need to download HAN NOM A and HAN NOM B, and set up HAN NOM B as the plane 2 fallback font. Maybe in the future we can do a little guesswork: if a font can print a character from a given plane and a registry setting for that plane is missing, we could still use it as a fallback only in U++.

You can find instructions here: http://winvnkey.sourceforge.net/webhelp/surrogate_fonts.htm. The Internet Explorer settings are not necessary.

edit: link was missing.
  • Attachment: TestCJK.rar
    (Size: 381.38KB, Downloaded 340 times)

[Updated on: Sat, 09 August 2008 09:02]

Re: 16 bits wchar [message #17984 is a reply to message #17363] Fri, 05 September 2008 19:13
cbpporter
Oh crap, I have just overwritten my Unicode modifications with svn checkout :'(. I can recover about half of it from my UnicodeEx package, but it is still a lot of extra work.

When you're ready to get back to this subject, I think I should do the work on a branch on SVN to avoid such problems.
Re: 16 bits wchar [message #18032 is a reply to message #17984] Sun, 07 September 2008 13:24
mirek
:(

Anyway, I think that T++ is now the priority; 32-bit wchar is just next...

(In fact, going to 32-bit wchar will not be that simple; some performance investigations of WString will be necessary....)

Mirek