Re: 16 bits wchar [message #8059 is a reply to message #8036]
Mon, 05 February 2007 23:07
mirek

riri wrote on Mon, 05 February 2007 11:19 | Hi all!
It's been a long time since I last posted to this forum.
Just a metaphysical (and maybe ridiculous) question: I saw that WString uses 16-bit integers as internal character values; is that suitable for every language, given that not all Unicode code points can be represented in 65,536 values?
#ifdef PLATFORM_WINCE
typedef WCHAR wchar;
#else
typedef word wchar;
#endif
Again, it may be a stupid question, but if I understood correctly, the internal string representation is in Unicode format, no?
|
Well, the main problem is that Win32 GDI output works with 16-bit characters -> wchar better be 16-bit.
Other than that, yes, it works in most cases. Unicode characters above 0xFFFF are quite special (like Tolkien's alphabet) and not supported by any fonts.
Mirek
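For reference, a minimal sketch (plain C++, not part of U++) of what storing a code point above 0xFFFF in 16-bit wchars actually requires - splitting it into a UTF-16 surrogate pair:

#include <cstdint>
#include <cstdio>

// Split a code point above 0xFFFF into a UTF-16 surrogate pair.
void EncodeSurrogatePair(uint32_t codepoint, uint16_t out[2])
{
    uint32_t v = codepoint - 0x10000;          // 20 significant bits remain
    out[0] = (uint16_t)(0xD800 + (v >> 10));   // high (lead) surrogate
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF)); // low (trail) surrogate
}

int main()
{
    uint16_t pair[2];
    EncodeSurrogatePair(0x1D11E, pair);        // U+1D11E MUSICAL SYMBOL G CLEF
    printf("%04X %04X\n", pair[0], pair[1]);   // prints D834 DD1E
    return 0;
}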

Re: 16 bits wchar [message #11785 is a reply to message #8059]
Tue, 25 September 2007 22:03
cbpporter

I was quite unhappy when I found out that U++ is not Unicode standard compliant with its "UTF-16" (what it implements is actually UCS-2). There are a lot of programs with poor Unicode support, which is partially because the STL doesn't support full Unicode either.
In theory it would be quite unforgivable for an application to handle just a subset of the standard. But how does the situation look in practice?
To answer this question I did a number of relatively thorough tests, which took me about two hours. I used my computer at work, which runs Windows XP SP2. The first step was to determine whether the OS supports surrogate pairs. After some testing (and research) I found that surrogate pair support can be enabled easily and is in fact enabled by default, so in principle Windows has no problem using this kind of character (although individual pieces of software can). Next I found a font which displays about 20000 characters with codes above 0xFFFF, installed it, and surprise surprise, it worked.
Next I tested a couple of applications. At first I wanted to give exact results, but I found it boring to write them up and concluded that you would find it boring to read them. In short, Notepad and WordPad both display correctly and identify two code units as one code point. Opera doesn't identify code points correctly in some files and cannot do copy operations (it truncates characters to the lower 16 bits). Internet Explorer works correctly, but it couldn't use the correct registry entries to display the characters, so it showed a little black rectangle. And the viewer from Total Commander is really ill-equipped for these kinds of tasks.
Next I wanted to test U++, but I got strange results when trying to find the length of a string, even when using only normal characters (with codes below 0xFFFF).
I took one of the examples and slightly modified it:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
    SetDefaultCharset(CHARSET_UTF8);
    WString x = "";
    for(int i = 280; i < 300; i++)
        x.Cat(i);
    DUMP(x);
    DUMP(x.GetLength());
    DUMP(x.GetCount());

    String y = x.ToString();
    DUMP(y);
    DUMP(y.GetLength());
    DUMP(y.GetCount());

    y.Cat(" (appended)");
    x = y.ToWString();
    DUMP(x);
    DUMP(x.GetLength());
    DUMP(x.GetCount());
}
I got these results:
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
x.GetLength() = 20
x.GetCount() = 20
y = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
y.GetLength() = 40
y.GetCount() = 40
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½ (appended)
x.GetLength() = 31
x.GetCount() = 31
Apart from the fact that the chars are mangled, the lengths don't seem to be OK. I may have understood incorrectly, but AFAIK GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
I also started researching the exact UTF encoding methods and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.

Re: 16 bits wchar [message #11789 is a reply to message #11785]
Tue, 25 September 2007 23:18
mirek

Quote: |
GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
|
I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names exist because each fits a different scenario better (the same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic there, except that conversions between the two can be performed - the conversions are the one and only place where encoding logic exists.
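A small sketch of why the counts in the example above differ (this only uses the conversions as described; the exact output assumes CHARSET_UTF8 as the default charset): each code point in the range 0x80-0x7FF occupies two bytes in the UTF-8 String but a single 16-bit unit in the WString.

#include <Core/Core.h>
using namespace Upp;

CONSOLE_APP_MAIN
{
    SetDefaultCharset(CHARSET_UTF8);
    WString w;
    w.Cat(280);              // U+0118: one 16-bit unit in WString
    String s = w.ToString(); // UTF-8 bytes 0xC4 0x98: two bytes in String
    DUMP(w.GetCount());      // 1
    DUMP(s.GetCount());      // 2
}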
Quote: |
I also started researching the exact UTF encoding methods and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.
|
Actually, it is not that I am unconcerned here. Anyway, I think that the only reasonable approach is perhaps to change wchar to 32-bit characters OR to introduce an LString.
The problem is that in that case you immediately have to perform conversions for all Win32 system calls... that is why I have concluded that it is not worth the trouble for now. (E.g. RTL clearly is the priority.)
Anyway, any research in this area is welcome. And perhaps you could fix the UTF-8 functions to support UTF-16 (so far, everything above 0xFFFF is basically ignored).
Mirek

Re: 16 bits wchar [message #11796 is a reply to message #8036]
Wed, 26 September 2007 01:56
sergei

As much as I'd like to see RTL in U++, I agree that Unicode should, if possible, be fixed. RTL is built upon Unicode, so a solid base - Unicode string storage - is essential. Who knows, maybe tomorrow someone will need Linear B.
I was thinking of UTF-32 as a possible main storage format. I wrote a simple benchmark to see what the speeds are with the three character sizes. Here are the results (source attached):
Size     Iterations   8-bit   16-bit   32-bit
64       10000000     2281    2125     2172
128      5000000      1625    1453     2391
256      2500000      1328    1515     1578
512      1250000      1375    1141     1141
1024     625000       1172    953      984
2048     312500       1094    875      906
4096     156250       1109    938      859
8192     78125        1110    890      922
16384    39062        1000    813      4047
32768    19531        1000    2250     3906
65536    9765         1656    2172     3812
131072   4882         1625    2125     3782
262144   2441         1593    2110     3781
524288   1220         1563    2109     3984
IMHO, 32-bit values aren't much worse than 16-bit. For search/replace operations, non-32-bit values would have significant overhead for characters outside the Basic Multilingual Plane.
Converting UTF-32 to other formats shouldn't be a problem. But what I like most is that a character would be the same as a cell (unlike UTF-16, which might use 20 cells to store 19 characters).
Edit: I didn't mention that I tested only basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without taking over the private-use space).
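To illustrate the char-equals-cell point, a rough sketch (plain C++, not U++ API): counting code points in a UTF-16 buffer has to account for surrogate pairs, while in a UTF-32 buffer the code-point count is simply the cell count.

#include <cstdint>
#include <cstddef>

// Count code points in a UTF-16 buffer: a lead surrogate followed by a
// trail surrogate forms one code point occupying two cells.
size_t CodePointsUtf16(const uint16_t *s, size_t len)
{
    size_t n = 0;
    for(size_t i = 0; i < len; i++) {
        if(s[i] >= 0xD800 && s[i] < 0xDC00 && i + 1 < len &&
           s[i + 1] >= 0xDC00 && s[i + 1] < 0xE000)
            i++;                      // skip the trail surrogate
        n++;
    }
    return n;
}

// In UTF-32 every cell is a code point, so the count is just the length.
size_t CodePointsUtf32(const uint32_t *, size_t len)
{
    return len;
}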
-
Attachment: UniCode.cpp
(Size: 1.31KB, Downloaded 467 times)

Re: 16 bits wchar [message #11797 is a reply to message #11789]
Wed, 26 September 2007 07:43
cbpporter

sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested only basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without taking over the private-use space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
luzr wrote on Tue, 25 September 2007 23:18 |
I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names exist because each fits a different scenario better (the same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic there, except that conversions between the two can be performed - the conversions are the one and only place where encoding logic exists.
|
Then I don't understand how you can insert the values 280-299 into an 8-bit fixed-character-length format. Are they translated to some code page? And if the values are 8-bit and there are 20 of them, why do I have a string of length 40 in the output? And why is the length of the same string 40 and not 20 when I switch over to the wide string?

Re: 16 bits wchar [message #11809 is a reply to message #11797]
Wed, 26 September 2007 14:55
sergei

cbpporter wrote on Wed, 26 September 2007 07:43 |
sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested only basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without taking over the private-use space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
|
Well, 4MB of memory would hold 1 million characters. Do you typically need more, even for a rather complex GUI app? With 512MB/1GB of memory on many computers and 200GB hard drives, I don't think space is a serious issue now. I was more worried about performance - memory allocation and access are somewhat slower (but not always; for the 256-8k sizes it's quite good).
The issue isn't UTF8-EE; that's more of a side effect. The main gain is that a char equals a cell. That is, LString (or whatever the name) can always be treated as UTF-32, unlike WString, which might hold 20 plain wchars or a UTF-16 string of unknown length. It is even worse with UTF-8, where the String length would almost always differ from the number of characters stored. Replacing a char is a trivial operation in UTF-32, but might require shifting in UTF-8/16 (if the chars require different amounts of space). Searching for a char from the end (backwards) would require testing, for every match, whether it is the second/third/fourth unit of some sequence. Actually, even simpler - how do you supply a multibyte char to some search/replace function in UTF-8/16? As an integer? That would require a conversion for every operation.
Unlike the current situation, where String is either a sequence of chars OR a UTF-8 string, LString would always be both a sequence of ints/unsigned ints AND a UTF-32 string. String could be left for single-char storage (like data from a file or ASCII-only strings), WString for OS interop, and LString could supply conversions to/from both.
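A sketch of the backwards-search point above (generic C++, not U++ code): in UTF-8 you first have to back up over continuation bytes to find where the code point starts, whereas with 32-bit cells every index already is a code point.

#include <cstddef>

// Given a byte position inside a UTF-8 buffer, back up to the first byte of
// the code point it belongs to; continuation bytes have the form 10xxxxxx.
size_t Utf8CodePointStart(const unsigned char *s, size_t pos)
{
    while(pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
// With UTF-32 no such scan is needed: s[pos] is always a whole code point.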

Re: 16 bits wchar [message #11813 is a reply to message #8036]
Wed, 26 September 2007 16:54
sergei

Theoretically, String could be used "exclusively" for UTF-8 and WString for UTF-16; "normal strings" could be Vector<char> and Vector<wchar>. All operations - (reverse) find/replace char/substring, trim (truncate), starts/endswith, left/right/mid, cat (append), insert - are applicable to Vectors as well (and maybe should be implemented as algorithms for all containers). Extra considerations might be the terminating '\0' (maybe not necessary - normal strings aren't for interop with the OS, where '\0' is used; for internal work there's GetCount) and the conversion functions (already partially implemented).
P.S. Does anyone know why chars/wchars tend to be signed? IMHO unsigned character values are much clearer - after all, the ASCII codes we use are unsigned (in hex).
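One concrete illustration of the signed-char pitfall (generic C++, unrelated to U++ internals): where plain char is signed, bytes above 0x7F get sign-extended, and comparisons against byte values silently fail unless you cast to unsigned first.

#include <cstdio>

int main()
{
    const char *s = "\xC4\x98";       // UTF-8 encoding of U+0118
    char c = s[0];
    if(c == 0xC4)                     // false where char is signed: c is -60
        printf("matched\n");
    if((unsigned char)c == 0xC4)      // correct: compare as an unsigned byte
        printf("matched after cast\n");
    return 0;
}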

Re: 16 bits wchar [message #11921 is a reply to message #11829]
Mon, 01 October 2007 13:24
cbpporter

I finally finished my Unicode research (it took longer than planned because of computer games... ). I read a good chunk of the Unicode Standard 5.0, looked over its official sample implementation and studied U++'s String, WString and Stream classes a little.
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode, and I propose the Least Complex Encoding (TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write, these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some of the data is ill-formed.
Next, there should be a method to Validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
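A minimal sketch of what such an explicit validation pass might look like (a hypothetical helper, not the existing CheckUtf8; it only checks sequence structure, not overlong forms or surrogate code points):

#include <cstddef>

bool IsStructurallyValidUtf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while(i < len) {
        unsigned char b = s[i];
        int follow;
        if(b < 0x80)      follow = 0;   // ASCII
        else if(b < 0xC2) return false; // continuation byte or overlong lead
        else if(b < 0xE0) follow = 1;   // 2-byte sequence
        else if(b < 0xF0) follow = 2;   // 3-byte sequence
        else if(b < 0xF5) follow = 3;   // 4-byte sequence
        else              return false; // 0xF5..0xFF never appear in UTF-8
        if(i + follow >= len)
            return false;               // truncated sequence
        for(int k = 1; k <= follow; k++)
            if((s[i + k] & 0xC0) != 0x80)
                return false;           // expected a continuation byte
        i += follow + 1;
    }
    return true;
}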

Re: 16 bits wchar [message #11925 is a reply to message #11921]
Mon, 01 October 2007 14:28
mirek

cbpporter wrote on Mon, 01 October 2007 07:24 |
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
|
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
Quote: |
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode, and I propose the Least Complex Encoding (TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write, these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some of the data is ill-formed.
|
Well, the basic requirement there is that converting UTF-8 with invalid sequences to WString and back must result in an equal String. This feat is successfully achieved by UTF8-EE.
Also, I do not think that any string manipulation routine anywhere should ever have to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
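A sketch of the general idea behind that kind of error escaping (the mapping used here, 0xEE00 + byte, is only an assumption for illustration; the actual UTF8-EE scheme lives in Core/Charset.cpp): each invalid byte is parked in a reserved 16-bit range on the way to WString, so converting back can restore exactly the original bytes.

#include <cstdint>

// Escape one invalid UTF-8 byte into a reserved 16-bit code (assumed range).
inline uint16_t EscapeInvalidByte(unsigned char b)   { return (uint16_t)(0xEE00 + b); }

// Recognize and undo the escape when converting WString back to String,
// which makes String -> WString -> String lossless even for malformed input.
inline bool IsEscaped(uint16_t w)                    { return (w & 0xFF00) == 0xEE00; }
inline unsigned char UnescapeByte(uint16_t w)        { return (unsigned char)(w & 0xFF); }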
Quote: |
Next, there should be a method to Validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
|
bool CheckUtf8(const String& src);
You can add CheckUtf16
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
Mirek

Re: 16 bits wchar [message #11938 is a reply to message #11925]
Wed, 03 October 2007 06:16
cbpporter

luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note, though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
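For illustration, a minimal sketch of the 4-byte encoding in question (plain C++, not the actual patch):

#include <cstdint>

// Encode a code point in the range 0x10000..0x10FFFF as four UTF-8 bytes.
void EncodeUtf8FourBytes(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}
// Example: U+1D11E encodes as F0 9D 84 9E.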
Quote: |
Also, I do not think that any string manipulation routine anywhere should ever have to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
|
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
Quote: |
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
|
I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
I started working on GetUtf8. I tried to keep everything as close as possible to your style of designing things, but I have two questions.
1. I couldn't find any function that reads or writes UTF-8 strings (only a single char). The rest of the functions read using plain byte storage. This is OK for storing strings, but when loading them I need a UTF-8-aware method.
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
The issue with this is the invalid lead byte range 0x80-0xC1, which is handled by your second if clause. These values are invalid in UTF-8, but you still decode them using their value and the value of the next character. If this is done for error escaping, the UTF-8 standard expects you to error-escape only the current character and start processing the next one, not to build the error-escaped code from more than the absolute minimum number of code units (in this case one).
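To illustrate the point, a sketch of the decoding being suggested (hypothetical code, not the actual Stream::GetUtf8): an ill-formed lead byte consumes and reports exactly one byte, instead of being combined with the bytes that follow.

#include <cstdint>
#include <cstddef>

// Decode one UTF-8 sequence starting at s[i] and advance i. An ill-formed
// lead byte (0x80..0xC1 or 0xF5..0xFF) consumes only that single byte and
// sets the 'bad' flag, so decoding resumes cleanly at the next byte.
uint32_t DecodeUtf8One(const unsigned char *s, size_t len, size_t& i, bool& bad)
{
    unsigned char b = s[i++];
    bad = false;
    int follow;
    uint32_t cp;
    if(b < 0x80)       return b;                     // ASCII
    else if(b < 0xC2)  { bad = true; return b; }     // invalid lead byte
    else if(b < 0xE0)  { follow = 1; cp = b & 0x1F; }
    else if(b < 0xF0)  { follow = 2; cp = b & 0x0F; }
    else if(b < 0xF5)  { follow = 3; cp = b & 0x07; }
    else               { bad = true; return b; }     // invalid lead byte
    for(int k = 0; k < follow; k++) {
        if(i >= len || (s[i] & 0xC0) != 0x80) {      // truncated or bad continuation
            bad = true;
            return b;
        }
        cp = (cp << 6) | (s[i++] & 0x3F);
    }
    return cp;
}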

Re: 16 bits wchar [message #11939 is a reply to message #11938]
Wed, 03 October 2007 10:11
mirek

cbpporter wrote on Wed, 03 October 2007 00:16 |
luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note, though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
|
Well, so what is the result then? WString is now 16-bit. Utf8 conversions are basically String<->WString (ok, also char * <-> WString).
Quote: |
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
|
Which exactly?
Quote: |
I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
|
That is why it is 16-bit now. But if you really need a solution for UCS-4, a 32-bit character type and conversions are the only option.
Quote: |
1. I couldn’t find any function that reads or writes UTF-8 strings (only a single char).
|
String ToUtf8(wchar code);
String ToUtf8(const wchar *s, int len);
String ToUtf8(const wchar *s);
String ToUtf8(const WString& w);
WString FromUtf8(const char *_s, int len);
WString FromUtf8(const char *_s);
WString FromUtf8(const String& s);
bool utf8check(const char *_s, int len);
int utf8len(const char *s, int len);
int utf8len(const char *s);
int lenAsUtf8(const wchar *s, int len);
int lenAsUtf8(const wchar *s);
bool CheckUtf8(const String& src);
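For example, a quick round trip using the functions above (the behaviour for characters above 0xFFFF is exactly the part under discussion):

#include <Core/Core.h>
using namespace Upp;

CONSOLE_APP_MAIN
{
    WString w;
    w.Cat(0x118);               // U+0118 LATIN CAPITAL LETTER E WITH OGONEK
    String  u = ToUtf8(w);      // WString -> UTF-8 String
    WString b = FromUtf8(u);    // UTF-8 String -> WString
    DUMP(u.GetCount());         // 2 bytes
    DUMP(b.GetCount());         // 1 wchar
    DUMP(CheckUtf8(u));         // true for well-formed UTF-8
}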
Quote: |
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
|
Oops, you are right, something is really missing in Stream. Anyway, GetUtf8 in Stream is quite an auxiliary (and maybe wrong) addition. The real meat is in Charset.h/.cpp.
Mirek