Basic character set analyzer [message #19546] |
Sat, 13 December 2008 11:26 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
Before starting to write a text output method capable of font substitution, using a little program to see what fonts contain which characters is a good idea (Mirek's idea).
I have already gotten a lot of valuable information from this little program. If anybody wants to give it a try, please run it and post its result here (or send it as a PM if you don't want to fill up the space here) so that I can tell whether my assumptions hold for different versions of Linux. I would like to see results both from people who just installed a normal Linux and never bothered to look over the font list, and from people who have a localized version of Linux or who manually installed a font to be able to use non-English characters.
The program covers code ranges from Basic Latin to Arabic. I did not go any further yet, because a better interface and way to present the information is clearly needed (I'm thinking tables, ranges and percentages instead of the current character list). But this will be enough to verify my assumptions and write a basic method which, for now, will handle only the above code ranges.
PS:
To make this work, you will have to add this to the FontInfo declaration in Draw/Draw.h:
    bool HasChar(int codePoint);
    bool HasCharRange(int startCp, int endCp);
    bool CharRangeEmpty(int startCp, int endCp);
and this to Draw/DrawText.cpp:
    // True if the font has a glyph for the given code point.
    bool FontInfo::HasChar(int codePoint)
    {
        return XftCharExists(Xdisplay, ptr->xftfont, codePoint);
    }

    // True if every code point in [startCp, endCp] has a glyph.
    bool FontInfo::HasCharRange(int startCp, int endCp)
    {
        for (int i = startCp; i <= endCp; i++)
            if (!XftCharExists(Xdisplay, ptr->xftfont, i))
                return false;
        return true;
    }

    // True if no code point in [startCp, endCp] has a glyph.
    bool FontInfo::CharRangeEmpty(int startCp, int endCp)
    {
        for (int i = startCp; i <= endCp; i++)
            if (XftCharExists(Xdisplay, ptr->xftfont, i))
                return false;
        return true;
    }
These methods are not final, so add them only for the purpose of running this test.
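For the "tables, ranges and percentages" presentation mentioned above, a per-range coverage figure could be computed roughly like this. This is only a sketch: CodeRange, HasCharFn and CoveragePercent are hypothetical names, and the predicate stands in for FontInfo::HasChar so that the example compiles without X11.

```cpp
#include <functional>
#include <string>

// Hypothetical description of a Unicode block (not part of U++).
struct CodeRange {
    std::string name;
    int first;
    int last; // inclusive
};

// Stand-in for FontInfo::HasChar, so the sketch does not depend on Xft.
using HasCharFn = std::function<bool(int)>;

// Percentage (0..100) of code points in the range that the font reports as present.
double CoveragePercent(const HasCharFn& hasChar, const CodeRange& r)
{
    int total = r.last - r.first + 1;
    if (total <= 0)
        return 0.0;
    int present = 0;
    for (int cp = r.first; cp <= r.last; cp++)
        if (hasChar(cp))
            present++;
    return 100.0 * present / total;
}
```

A result of 100 corresponds to HasCharRange returning true and a result of 0 to CharRangeEmpty returning true, so one number per range could replace both checks in a report.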
|
|
|
Re: Basic character set analyzer [message #19571 is a reply to message #19546] |
Wed, 17 December 2008 11:21 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
Quote: |
I've continued experimenting and reached an interesting conclusion: X does some very basic character substitution and composition for Latin characters.
I tested pretty much every font that doesn't have Latin-1 Supplement support, and it seems that if the given font has the basic Latin alphabetic characters, it will also manage to draw those from the supplement by doing on-the-fly composition. The results are pretty good. I think we can skip substitution for the supplement if alphabetic characters are available. If the alphabet is not available, we need to substitute those characters, probably using the same method you used for replacing Latin characters for Font::COMPOSED.
Also, X tries to compose missing characters from Latin Extended-A. Some fonts handle this OK, but most will either fail or draw a rectangular outline and superimpose the root character of the composition. E.g. if I have a very strange "L" with dots on its right, I will get a box and a normal L. In this case I don't know which is better. Using substitution will result in a correct character, but quite likely the typeface will be different. Using the X scheme, we get the same typeface, but the result is ugly.
|
This is what I was just about to post when I realized: maybe X is not the one doing the composition. Maybe it's U++, with the mechanism I mentioned earlier! Am I right? Or is the composition used only under Windows? I couldn't figure it out by testing with other applications, because all of them feature font substitution in one form or another. Does anybody know a very primitive plain X editor that lets you choose its font? Or maybe a font exploration tool for Linux?
|
|
|
Re: Basic character set analyzer [message #19572 is a reply to message #19571] |
Wed, 17 December 2008 17:18 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
I found the code responsible for composition, and it seems only code points 0x0100 to 0x017F are subject to U++ composition. So either X11 does its own composition for Latin-1 Supplement, or the Xft API is lying to us about which characters are available (or there is some other reason).
I'll try to determine more. But for now, first step is going to be to make characters available when basic Latin is missing in font. I think this is a good idea. Even if you use Dingbats or some other specialized fonts, I think it would be useful to be able to print basic Latin characters without having to use two explicit fonts. I'll use StdFont as a basic Latin fallback, since this font will always contain the needed characters.
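The Basic Latin fallback described above might look something like the following. It is a minimal sketch under my own assumptions: SketchFont and PickFont are hypothetical, and a set of code points stands in for what HasChar would report for a real font.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a font plus its glyph coverage (not a U++ type).
struct SketchFont {
    std::string name;
    std::set<int> glyphs;
    bool HasChar(int cp) const { return glyphs.count(cp) != 0; }
};

// Chooses the font to draw cp with: the requested font when it has the glyph,
// the standard font for printable Basic Latin (0x20..0x7E) when it does not,
// and otherwise the requested font again (no substitution in this sketch).
const SketchFont& PickFont(const SketchFont& requested, const SketchFont& stdFont, int cp)
{
    if (requested.HasChar(cp))
        return requested;
    bool basicLatin = cp >= 0x20 && cp <= 0x7E;
    if (basicLatin && stdFont.HasChar(cp))
        return stdFont;
    return requested;
}
```

With a single fixed fallback the substitution pool stays at one extra font, which fits the rendering speed concern raised in this post.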
I also noticed that using a lot of fonts slows down rendering to a crawl. I'll have to look over the code to see if some caching can be done or some bottleneck avoided, but basically this means that we must keep the substitution pool as small as possible.
PS: How was the current composition behavior established? How did you determine, for example, that you need to draw the little line at an offset of font.GetHeight() / 13? Did you find some reference material, or was it experimental and you went with what looked good?
[Updated on: Wed, 17 December 2008 17:24]
|
Re: Basic character set analyzer [message #19874 is a reply to message #19826] |
Wed, 28 January 2009 19:53 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
luzr wrote on Sun, 25 January 2009 11:56 |
I wonder how is this effort different from existing code?
|
Well I hope it is going to be different in at least one detail:
It is going to work now (as soon as I'm ready). I've been nagging forever about the horrible font support under Unix, yet the situation has remained the same. Don't take this the wrong way: I am well aware how busy people are providing genuine improvements all over U++, but this domain has largely been ignored. I understand that this is not a focus area right now, but I don't have time to wait until it becomes one. And while Painter might help here, I do need a version which works with plain X, because I have no intention of shipping with AGG.
I believe I am very close to a general, powerful and most importantly good looking mechanism. Here is a sample screenshot:
As you can see, the manual placement of diacritics (rendered in red) is almost as good as the native one. There are still some bugs left, but the most important part is that the mechanism is general. It is based on precise bounding boxes, so I can use for diacritics any renderable character, not just special ones. This opens the way to full composition support with very little extra work.
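To illustrate what placement based on precise bounding boxes can mean, here is a small sketch. It is not the code behind the screenshot: Box, Offset and PlaceAbove are hypothetical, y is assumed to grow downward, and the gap value is an arbitrary parameter.

```cpp
// Tight pixel box of a rendered glyph; right and bottom are exclusive.
struct Box { int left, top, right, bottom; };

// How far the diacritic has to be moved from where its box currently is.
struct Offset { int dx, dy; };

// Centers the mark horizontally over the base and puts the mark's bottom
// edge gap pixels above the base's top edge.
Offset PlaceAbove(const Box& base, const Box& mark, int gap)
{
    int baseCenter2 = base.left + base.right; // twice the center, avoids fractions
    int markCenter2 = mark.left + mark.right;
    Offset o;
    o.dx = (baseCenter2 - markCenter2) / 2;
    o.dy = (base.top - gap) - mark.bottom;
    return o;
}
```

Because only the two boxes are used, the same function would work for any renderable character used as a mark, not just dedicated diacritic glyphs. Marks that attach below or beside the base would need their own rule.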
You know that you've been staring too much at this site when the disappearance of a link bothers you. Anyway, the wiki was in bad shape and I can't say I'm going to miss it. It wasn't really that useful, since we don't really have the kind of community that favors wikis.
|
|
|
|
Re: Basic character set analyzer [message #19888 is a reply to message #19887] |
Fri, 30 January 2009 09:15 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
luzr wrote on Fri, 30 January 2009 09:53 |
I still do not get it. I thought we are already doing this in Draw/ComposeText.cpp.
I guess I will have to wait for the actual code...
|
Yes, we are doing it, but we could just as well not do it. A lot of fonts have these characters, so the code is not used.
And in places where the code is used, it looks very bad. I posted a screenshot a while ago with exactly how the text looks:
This is completely unreadable. I don't need all the characters that are above, but I don't like half baked solutions, so I would like them all to work.
The problem is that even though rendering is OK for Latin fonts, it is horrible for non-Latin fonts and I must use almost non-Latin fonts exclusively. Plus I need composition for non-Latin characters, so I might as well do it first for Latin ones, where the implementation is several orders of magnitude easier.
|
|
|
Re: Basic character set analyzer [message #19889 is a reply to message #19888] |
Fri, 30 January 2009 09:28 |
|
mirek
Messages: 13975 Registered: November 2005
|
Ultimate Member |
|
|
cbpporter wrote on Fri, 30 January 2009 03:15 |
luzr wrote on Fri, 30 January 2009 09:53 |
I still do not get it. I thought we are already doing this in Draw/ComposeText.cpp.
I guess I will have to wait for the actual code...
|
Yes, we are doing it, but we could just as well not do it. A lot of fonts have these characters, so the code is not used.
And in places where the code is used, it looks very bad. I posted a screenshot a while ago with exactly how the text looks:
This is completely unreadable. I don't need all the characters that are above, but I don't like half baked solutions, so I would like them all to work.
|
Good, getting somewhere.
I believe the problem with the above font is that it does not have diacritics characters defined.
What is your solution to the problem?
Quote: |
The problem is that even though rendering is OK for Latin fonts, it is horrible for non-Latin fonts and I must use almost non-Latin fonts exclusively. Plus I need composition for non-Latin characters, so I might as well do it first for Latin ones, where the implementation is several orders of magnitude easier.
|
I wonder how do you plan to do composition for non-Latin characters if basic glyphs are not defined in the font?
Hm, maybe it is just terminology issue - by composition I mean creating a new glyph by composing two other glyphs (basic latin character + diacritics glyph) from the same font.
Mirek
|
|
|
Re: Basic character set analyzer [message #19890 is a reply to message #19889] |
Fri, 30 January 2009 09:59 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
luzr wrote on Fri, 30 January 2009 10:28 |
Good, getting somewhere.
I believe the problem with the above font is that it does not have diacritics characters defined.
What is your solution to the problem?
|
That's exactly the problem. I'm using two-way substitution, replacing both the base character and the diacritic when necessary. Diacritics look similar enough across fonts, so there are no problems with substituting them. For the base character, I'm using StdFont. This way characters substituted across different fonts will at least look the same, even though they might differ visually from the current font. But this is not a big problem for Latin, because most fonts either have all characters from a range or have none. There are some that have an arbitrary subset of those characters, but I'm guessing that the subset is biased toward an existing language, and since people will probably use that font to render that language, we shouldn't have many cases where substitution results in ugly text.
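The two-way substitution can be pictured as choosing a font for the base and a font for the diacritic independently. The sketch below assumes hypothetical names (SketchFont, CompositionPlan, PlanComposition) and a separate markFallback font for diacritics; the post itself only names StdFont as the fallback for the base character.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a font plus its glyph coverage (not a U++ type).
struct SketchFont {
    std::string name;
    std::set<int> glyphs;
    bool HasChar(int cp) const { return glyphs.count(cp) != 0; }
};

// Which font draws which half of the composed character.
struct CompositionPlan {
    const SketchFont* baseFont;
    const SketchFont* markFont;
    bool ok; // false if even the fallbacks lack one of the glyphs
};

// Prefers the current font for each half and falls back per half.
CompositionPlan PlanComposition(const SketchFont& current, const SketchFont& stdFont,
                                const SketchFont& markFallback, int base, int mark)
{
    CompositionPlan p;
    p.baseFont = current.HasChar(base) ? &current : &stdFont;
    p.markFont = current.HasChar(mark) ? &current : &markFallback;
    p.ok = p.baseFont->HasChar(base) && p.markFont->HasChar(mark);
    return p;
}
```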
Quote: |
Hm, maybe it is just terminology issue - by composition I mean creating a new glyph by composing two other glyphs (basic latin character + diacritics glyph) from the same font.
|
That's for Latin composition (which I'm implementing right now). For non-Latin composition, I create a new glyph based on a non-Latin little drawing and combine it with another one (which could be considered a diacritic).
The reasons why I'm not altering your code and am writing new code instead are:
1. I want to have full composition with arbitrary bases and diacritics to handle all Unicode composition characters. I can place a '-' on something, but I can also place any printable character.
2. I am implementing these methods as functions that take a Draw object as their first parameter, so that I can use the code without having to integrate it into Draw. While some minor additions to Draw are necessary (a method to determine if a font has a character and one to determine the exact bounding box of a character), getting these accepted is going to be easier than a full-blown Unicode composition engine with a heavy bias toward Latin and CJK. And BTW, I'm still completely against Utf32, so it's safer this way.
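Composition with arbitrary bases and diacritics needs a mapping from a precomposed code point to its parts. The sketch below shows the idea with a handful of canonical decompositions from Latin Extended-A. The Decomposition struct and both functions are hypothetical, and a real table would be generated from the Unicode Character Database rather than typed by hand.

```cpp
#include <map>

// Base character plus one combining mark (both are Unicode code points).
struct Decomposition { int base; int mark; };

// A small sample of canonical decompositions; deliberately incomplete.
inline const std::map<int, Decomposition>& SampleDecompositions()
{
    static const std::map<int, Decomposition> table = {
        {0x0100, {'A', 0x0304}}, // A with macron
        {0x0101, {'a', 0x0304}}, // a with macron
        {0x010C, {'C', 0x030C}}, // C with caron
        {0x010D, {'c', 0x030C}}, // c with caron
        {0x0161, {'s', 0x030C}}, // s with caron
        {0x017E, {'z', 0x030C}}, // z with caron
    };
    return table;
}

// Looks up cp; fills out and returns true when a decomposition is known.
bool Decompose(int cp, Decomposition& out)
{
    const std::map<int, Decomposition>& table = SampleDecompositions();
    std::map<int, Decomposition>::const_iterator it = table.find(cp);
    if (it == table.end())
        return false;
    out = it->second;
    return true;
}
```

Characters without a canonical decomposition, such as U+0141 (L with stroke), are simply absent from such a table and would have to be handled by substitution instead.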
|
|
|
Re: Basic character set analyzer [message #19903 is a reply to message #19890] |
Sat, 31 January 2009 13:22 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
I've made great progress today and all fonts are capable of displaying almost all Latin characters and compositions. There is one last problem and this is a big one.
Let's say that I'm trying to draw 'à' (a plus `). First I draw 'a', and then a '`'. With the new methods I added, when I interrogate the font for the top and bottom of '`', I will get something in the vicinity of 2 for the top (it doesn't start exactly at the top of the line) and something in the vicinity of 5 for the bottom (these numbers are just an example). I need this info, rather than ascent/height, and it works for most fonts. But there are some fonts which will return 0 for the top and the height of the font for the bottom. So instead of getting (2, 5), I will get (0, 18), and this ruins all computations and causes the diacritic to be severely misplaced. This is not the correct value for the API calls that I have made, and I think it is the fault of the font designer, who did not alter the vertical extent information of the character and just went with the default font height.
I hope I was successful in explaining this issue.
I don't know how I could fix this, save for rendering the character into a white bitmap buffer, shrinking the bitmap line by line until I have reached the minimal height, and caching the result. I'm afraid such a method will be slow, and anyway I would rather not take such extreme measures.
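For reference, the line-by-line shrinking would amount to something like the sketch below. It assumes the glyph has already been rendered into a grayscale buffer where 0 means background (the opposite of the white buffer mentioned above, chosen only to keep the example short). VerticalExtent and TightVerticalExtent are hypothetical names.

```cpp
#include <vector>

// Tight vertical extent of the ink; bottom is exclusive.
struct VerticalExtent { int top; int bottom; bool empty; };

// Drops blank rows from the top and from the bottom of the bitmap.
VerticalExtent TightVerticalExtent(const std::vector<std::vector<unsigned char>>& rows)
{
    auto rowHasInk = [](const std::vector<unsigned char>& row) {
        for (unsigned char px : row)
            if (px != 0)
                return true;
        return false;
    };
    int top = 0;
    int bottom = static_cast<int>(rows.size());
    while (top < bottom && !rowHasInk(rows[top]))
        top++;
    while (bottom > top && !rowHasInk(rows[bottom - 1]))
        bottom--;
    return { top, bottom, top == bottom };
}
```

The scan touches each pixel at most once, so the cost per glyph is proportional to the bitmap area. Rendering the glyph into the buffer is likely the more expensive part, which is why the result would need to be cached per font and character.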
There is one other solution: define one global font for diacritics and draw all diacritics with that font. This will cause repeated renderings of 'à' (a plus `) with different fonts to have a visually different base, but the diacritic is going to look constant. I think this is a reasonable compromise.
What do you think?
|
Re: Basic character set analyzer [message #19920 is a reply to message #19918] |
Mon, 02 February 2009 11:19 |
cbpporter
Messages: 1401 Registered: September 2007
|
Ultimate Contributor |
|
|
luzr wrote on Mon, 02 February 2009 10:40 | Well, my original plan was to perform composition if both glyphs are available in the font (current algorithm), then look into other fonts to get the complete required glyph...
Mirek
|
Yes, I thought about that but I tried with this method because this way at least the basic character will look the same way as the font. Maybe it's worth a shot to try it your way also.
But anyway, we should get to the modifications in Draw needed to handle these algorithms. Nothing needs to be changed, but some methods must be added.
The first one is HasChar from the first post in this thread (I no longer need HasCharRange and CharRangeEmpty). So I propose that we either add HasChar or, if you have a better method to do it (maybe one for Win also), I'm open to suggestions.
PS: HasChar seems pretty fast. I ran it for every font on application startup for code ranges from zero to the end of the Arabic range, and I didn't notice any slowdown, so I guess it is fast enough to render all the text that can appear on screen at once. At least one call to HasChar must be made for every character that will be printed, so maybe later we can cache the results somewhere in the home folder.
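An in-memory version of such a cache could be as simple as the sketch below. GlyphCache is a hypothetical class, the probe function stands in for the real HasChar call of one font, and saving the results to a file in the home folder is left out.

```cpp
#include <functional>
#include <unordered_map>
#include <utility>

// Remembers the answer for each code point so the font is asked only once.
class GlyphCache {
public:
    explicit GlyphCache(std::function<bool(int)> probe) : probe_(std::move(probe)) {}

    bool HasChar(int cp)
    {
        auto it = known_.find(cp);
        if (it != known_.end())
            return it->second;       // answered from the cache
        bool present = probe_(cp);   // first request: ask the font
        known_.emplace(cp, present);
        return present;
    }

private:
    std::function<bool(int)> probe_;
    std::unordered_map<int, bool> known_;
};
```

One cache instance per font would be needed, since the answer depends on the font being asked.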
|
|
|
|