16 bits wchar (U++ Library support » U++ Libraries and TheIDE: i18n, Unicode and Internationalization)
Re: 16 bits wchar [message #8059 is a reply to message #8036]
Mon, 05 February 2007 23:07
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| riri wrote on Mon, 05 February 2007 11:19 | Hi all!
It has been a long time since I last posted to this forum.
Just a metaphysical (and maybe ridiculous) question: I saw that WString uses 16-bit integers as internal character values; is it suitable for every language, given that not all Unicode code points can be represented in 65536 values?
#ifdef PLATFORM_WINCE
typedef WCHAR wchar;
#else
typedef word wchar;
#endif
Again, it may be a stupid question, but if I understood correctly, the internal string representation is in Unicode format, no?
|
Well, the main problem is that Win32 GDI output works with 16-bit characters -> wchar had better be 16-bit.
Other than that, yes, it works in most cases. Unicode characters above 0xFFFF are quite special (like Tolkien's alphabet) and hardly supported by any fonts.
Mirek
Re: 16 bits wchar [message #11785 is a reply to message #8059]
Tue, 25 September 2007 22:03
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
I was quite unhappy when I found out that U++ is not Unicode standard compliant with its "UTF-16" (what it implements is actually UCS-2). There are a lot of programs with poor Unicode support, which is partially because the STL doesn't support full Unicode either.
In theory it would be quite unforgivable for an application to handle just a subset of the standard. But how does the situation look in practice?
To answer this question I did a number of relatively thorough tests, which took me about two hours. I used my computer at work, which runs Windows XP SP2. The first part was to determine whether the OS supports surrogate pairs. After some testing (and research) I found that surrogate pairs can be enabled easily and are enabled by default. Windows theoretically has no problem using these kinds of characters (but individual pieces of software can). Next I found a font which displays about 20000 characters with codes above 0xFFFF, installed it, and surprise surprise, it worked.
Next I tested a couple of applications. At first I wanted to give exact results, but I found it boring to write them up and concluded that you would find it boring to read them. In short, Notepad and WordPad both display correctly and identify two code units as one code point. Opera doesn't identify code points correctly in some files and cannot do copy operations correctly (it truncates to the lower 16 bits). Internet Explorer works correctly, but it couldn't use the correct registry entries to display the characters, so it used a little black rectangle. And the viewer from Total Commander is really ill-equipped for these kinds of tasks.
Next I wanted to test U++, but I got strange results when trying to find the length of a string, even when using only normal characters (with codes below 0xFFFF).
I took one of the examples and slightly modified it:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	SetDefaultCharset(CHARSET_UTF8);
	WString x = "";
	for(int i = 280; i < 300; i++)
		x.Cat(i);
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
	String y = x.ToString();
	DUMP(y);
	DUMP(y.GetLength());
	DUMP(y.GetCount());
	y.Cat(" (appended)");
	x = y.ToWString();
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
}
I got these results:
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
x.GetLength() = 20
x.GetCount() = 20
y = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
y.GetLength() = 40
y.GetCount() = 40
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½ (appended)
x.GetLength() = 31
x.GetCount() = 31
Except for the fact that the chars are mangled, the lengths don't seem to be OK. I may have understood incorrectly, but AFAIK GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
I also started researching the exact encoding methods of UTF, and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.
Re: 16 bits wchar [message #11789 is a reply to message #11785]
Tue, 25 September 2007 23:18
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| Quote: |
GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
|
I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names exist because each fits better in a different scenario (the same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic than that, except that conversions between the two can be performed - the conversions are where the one and only piece of encoding logic lives.
| Quote: |
I also started researching the exact encoding methods of UTF, and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.
|
Actually, it is not that I am unconcerned here. Anyway, I think the only reasonable approach is perhaps to change wchar to 32-bit characters OR introduce LString.
The problem is that in that case you immediately have to perform conversions for all Win32 system calls... that is why I have concluded that it is not worth the trouble for now. (E.g. RTL is clearly the priority.)
Anyway, any research in this area is welcome. And perhaps you could fix the UTF-8 functions to support UTF-16 surrogates (so far, everything above 0xFFFF is basically ignored).
Mirek
[Updated on: Tue, 25 September 2007 23:18]
Re: 16 bits wchar [message #11796 is a reply to message #8036]
Wed, 26 September 2007 01:56
sergei (Member; Messages: 94; Registered: September 2007)
As much as I'd like to see RTL in U++, I agree that Unicode should, if possible, be fixed. RTL is built upon Unicode, so a solid base - Unicode string storage - is essential. Who knows, maybe tomorrow someone will need Linear B.
I was thinking of UTF-32 as a possible main storage format. I wrote a simple benchmark to see what the speeds are with the three character sizes. Here are the results (source attached):
Size: 64; Iterations: 10000000; 8: 2281; 16: 2125; 32: 2172;
Size: 128; Iterations: 5000000; 8: 1625; 16: 1453; 32: 2391;
Size: 256; Iterations: 2500000; 8: 1328; 16: 1515; 32: 1578;
Size: 512; Iterations: 1250000; 8: 1375; 16: 1141; 32: 1141;
Size: 1024; Iterations: 625000; 8: 1172; 16: 953; 32: 984;
Size: 2048; Iterations: 312500; 8: 1094; 16: 875; 32: 906;
Size: 4096; Iterations: 156250; 8: 1109; 16: 938; 32: 859;
Size: 8192; Iterations: 78125; 8: 1110; 16: 890; 32: 922;
Size: 16384; Iterations: 39062; 8: 1000; 16: 813; 32: 4047;
Size: 32768; Iterations: 19531; 8: 1000; 16: 2250; 32: 3906;
Size: 65536; Iterations: 9765; 8: 1656; 16: 2172; 32: 3812;
Size: 131072; Iterations: 4882; 8: 1625; 16: 2125; 32: 3782;
Size: 262144; Iterations: 2441; 8: 1593; 16: 2110; 32: 3781;
Size: 524288; Iterations: 1220; 8: 1563; 16: 2109; 32: 3984;
IMHO, 32-bit values aren't much worse than 16-bit. And for search/replace operations, non-32-bit encodings would have significant overhead for characters outside the Basic Multilingual Plane.
Converting UTF-32 to other formats shouldn't be a problem. But what I like most is that a character would be the same as a cell (unlike UTF-16, which might use 20 cells to store 19 characters).
Edit: I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
-
Attachment: UniCode.cpp
(Size: 1.31KB, Downloaded 572 times)
[Updated on: Wed, 26 September 2007 02:30]
Re: 16 bits wchar [message #11797 is a reply to message #11789]
Wed, 26 September 2007 07:43
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think that UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
| luzr wrote on Tue, 25 September 2007 23:18 |
I am afraid you expect too much. GetLength returns exactly the same number as GetCount, two names in this case are there because of each fits better for different scenario (same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString array of 16bit words. Not much more logic there, except that conversions between two can be performed - in conversions there is one and only encoding logic.
|
Then I don't understand how you can insert the values 280-299 into an 8-bit fixed-length character format. Are they translated to some code page? And if the values are 8-bit and there are 20 of them, why do I get a string of length 40 in the output? And why is the length of the same string 40 and not 20 when I switch over to the wide string?
[Updated on: Wed, 26 September 2007 07:44]
Re: 16 bits wchar [message #11809 is a reply to message #11797]
Wed, 26 September 2007 14:55
sergei (Member; Messages: 94; Registered: September 2007)
| cbpporter wrote on Wed, 26 September 2007 07:43 |
| sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think that UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
|
Well, 4MB of memory would yield 1 million characters. Do you typically need more, even for a rather complex GUI app? With 512MB/1GB of memory on many computers and 200GB hard drives, I don't think space is a serious issue now. I was more worried about performance - memory allocation and access is somewhat slower (but not always; for sizes of 256 bytes to 8KB it's quite good).
The issue isn't UTF8-EE; it's more of a side effect. The main gain is that a char equals a cell. That is, LString (or whatever the name) can always be treated as UTF-32, unlike WString, which might be 20 wchars or an unknown-length UTF-16 string. It's even worse with UTF-8, where the String length would almost always differ from the number of characters stored. Replacing a char is a trivial operation in UTF-32, but it might require shifting in UTF-8/16 (if the chars require different amounts of space). Searching for a char from the end (backwards) would require testing every match to see whether it's the second/third/fourth unit of some sequence. Actually, even simpler - how do you supply a multibyte char to some search/replace function in UTF-8/16? As an integer? That would require a conversion for every operation.
Unlike the current situation, where String is either a sequence of chars OR a UTF-8 string, LString would always be both a sequence of ints/unsigned ints AND a UTF-32 string. String could be left for single-char storage (like data from a file, or ASCII-only strings), WString for OS interop, and LString could supply conversions to/from both.
Re: 16 bits wchar [message #11813 is a reply to message #8036]
Wed, 26 September 2007 16:54
sergei (Member; Messages: 94; Registered: September 2007)
Theoretically, String could be used "exclusively" for UTF-8 and WString for UTF-16, while "normal strings" could be Vector<char> and Vector<wchar>. All operations - (reverse) find/replace of char/substring, trim/truncate, starts/endswith, left/right/mid, cat/append, insert - apply to Vectors as well (and maybe should be implemented as algorithms for all containers). Extra considerations might be a terminating '\0' (maybe not necessary - normal strings aren't for interop with the OS, where '\0' is used; for internal work there's GetCount) and conversion functions (already partially implemented).
P.S. Does anyone know why chars/wchars tend to be signed? IMHO unsigned character values are much clearer - after all, the ASCII codes we use are unsigned (in hex).
Re: 16 bits wchar [message #11921 is a reply to message #11829]
Mon, 01 October 2007 13:24
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
I finally finished my Unicode research (it took longer than planned because of computer games...). I read a good chunk of the Unicode Standard 5.0, looked over its official sample implementation and studied U++'s String, WString and Stream classes a little.
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding(TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write cycle these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.
Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
Re: 16 bits wchar [message #11925 is a reply to message #11921]
Mon, 01 October 2007 14:28
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Mon, 01 October 2007 07:24 |
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
|
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
| Quote: |
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding(TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write cycle these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.
|
Well, the basic requirement there is that converting UTF-8 with invalid sequences to WString and back must result in an identical String. This feat is successfully achieved by UTF8-EE.
Also, I do not think that any string manipulation routine anywhere should ever need to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
| Quote: |
Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
|
bool CheckUtf8(const String& src);
You can add CheckUtf16 
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
Mirek
Re: 16 bits wchar [message #11938 is a reply to message #11925]
Wed, 03 October 2007 06:16
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
| Quote: |
Also, I do not think that any string manipulation routine anywhere should ever need to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
|
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
| Quote: |
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
|
I think you should keep UTF-16 as the default for Win32 and UTF-32 as the default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
I started working on GetUtf8. I tried to keep everything as close as possible to your style of designing things, but I have two questions.
1. I couldn't find any function that reads or writes UTF-8 strings (only a single char). The rest of the functions read using plain byte storage. This is OK for storing strings, but when loading them, I need a UTF-8-aware method.
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
The issue with this is the invalid range 0x80-0xC1, which is handled by your second if clause. These values are invalid as lead bytes in UTF-8, but you still decode them using their value and the value of the next byte. If this is done for error escaping, the UTF-8 standard expects you to error-escape only the current byte and start processing the next one, not to build the error-escaped code from more than the absolute minimum number of code units (in this case, one).
Re: 16 bits wchar [message #11939 is a reply to message #11938]
Wed, 03 October 2007 10:11
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Wed, 03 October 2007 00:16 |
| luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
|
Well, so what is the result then? WString is now 16-bit. Utf8 conversions are basically String<->WString (ok, also char * <-> WString).
| Quote: |
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
|
Which exactly?
| Quote: |
I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
|
That is why it is 16-bit now. But if you really need a solution for UCS-4, a 32-bit character type plus conversions is the only option.
| Quote: |
1. I couldn’t find any function that reads or writes UTF-8 strings (only a single char).
|
String ToUtf8(wchar code);
String ToUtf8(const wchar *s, int len);
String ToUtf8(const wchar *s);
String ToUtf8(const WString& w);
WString FromUtf8(const char *_s, int len);
WString FromUtf8(const char *_s);
WString FromUtf8(const String& s);
bool utf8check(const char *_s, int len);
int utf8len(const char *s, int len);
int utf8len(const char *s);
int lenAsUtf8(const wchar *s, int len);
int lenAsUtf8(const wchar *s);
bool CheckUtf8(const String& src);
| Quote: |
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
|
Oops, you are right, something is really missing in Stream. Anyway, GetUtf8 in Stream is a quite auxiliary (and maybe wrong) addition. The real meat is in Charset.h/.cpp.
Mirek
Re: 16 bits wchar [message #11944 is a reply to message #11942]
Wed, 03 October 2007 12:10
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Wed, 03 October 2007 04:36 |
| luzr wrote on Wed, 03 October 2007 10:26 | Error escaping in Stream:
The error escaping in GetUtf8 is impossible, as it returns only a single int - you do not know you have to escape until you read more than a single character from the input, and then you need more than one wchar to be returned...
|
It depends on what that int represents and what kind of error escaping is used. For UTF-8, there are only a small number of byte values that are invalid, and they could be escaped to non-character code points, or even to a small region of the Private Use Area (for example 0xFFF00-0xFFFFF). The Private Use Area has approximately 130000 reserved code points which are guaranteed not to appear in public Unicode data (they are reserved for private processing only, not data interchange).
|
Ah, but that is not the problem - AFAIK.
The trouble is e.g. an invalid 6-byte sequence, which you detect at byte 6. In this case, you cannot reasonably return anything escaped from Stream::GetUtf8. You would need more than a 32-bit value for any reasonable output.
BTW, the private area is exactly what the "real" Utf8 functions use, just the range is 0xEE00 - 0xEEFF (I did not want to spoil the beginning of the range, and 0xEExx nicely resonates with "Error Escape").
However, please check the fixed version Stream::GetUtf8():
int Stream::GetUtf8()
{
	int code = Get();
	if(code <= 0) {
		LoadError();
		return -1;
	}
	if(code < 0x80)
		return code;
	else
	if(code < 0xC0)
		return -1;
	else
	if(code < 0xE0) {
		if(IsEof()) {
			LoadError();
			return -1;
		}
		return ((code - 0xC0) << 6) + Get() - 0x80;
	}
	else
	if(code < 0xF0) {
		int c0 = Get();
		int c1 = Get();
		if(c1 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xE0) << 12) + ((c0 - 0x80) << 6) + c1 - 0x80;
	}
	else
	if(code < 0xF8) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		if(c2 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF0) << 18) + ((c0 - 0x80) << 12) + ((c1 - 0x80) << 6) + c2 - 0x80;
	}
	else
	if(code < 0xFC) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		if(c3 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF8) << 24) + ((c0 - 0x80) << 18) + ((c1 - 0x80) << 12) +
		       ((c2 - 0x80) << 6) + c3 - 0x80;
	}
	else
	if(code < 0xFE) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		int c4 = Get();
		if(c4 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xFC) << 30) + ((c0 - 0x80) << 24) + ((c1 - 0x80) << 18) +
		       ((c2 - 0x80) << 12) + ((c3 - 0x80) << 6) + c4 - 0x80;
	}
	else {
		LoadError();
		return -1;
	}
}
BTW, thinking further about the UTF-8 -> UTF-16 surrogate conversion, I am afraid that it can in fact cause some problems in the code.
The primary motivation for "Error Escape" is that when a file that is not representable by UCS-2 wchars is loaded into the editor (e.g. the IDE), or if it simply has UTF-8 errors, there are two requirements:
- Parts of the file with correct and representable UTF-8 encoding must be editable
- Invalid parts must not be damaged by loading/saving.
I am afraid that with real surrogate pairs in the editor, the editor logic can go bad; it really expects that a single wchar represents one code point. There would be visual artifacts, with Win32 interpreting surrogate pairs correctly (while U++ considers each half of a pair a single character).
What a nice bunch of problems to solve. And we have not even started to consider the REAL problems.
Mirek
Re: 16 bits wchar [message #11946 is a reply to message #11944]
Wed, 03 October 2007 14:43
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| Quote: |
However, please check the fixed version Stream::GetUtf8():
|
Thank you! You should have said that you would fix it that quickly and I wouldn't have tried it myself. Shouldn't the second if clause be < 0xC2?
| Quote: |
What is the point of spreading encoding-related stuff all over the application? Stream works with bytes, end of story. I do not want to end up with multiple methods for everything that can handle text.
|
Yes, I agree, Stream should work with bytes. But text processing should never work with raw bytes, except in legacy mode.
And considering the problem regarding escaping, AFAIK, if the sixth byte is invalid, you need to signal an error for the first byte and continue decoding from the second byte as a new code point.
Also, six-byte UTF-8 is no longer considered correct and should only be used when legacy data needs to be processed. But since 4 bytes allow well over 1 million code points, I doubt there is any data stored in the six-byte format. CESU-8 is another thing, but that is not supported, so it's not a problem.
[Updated on: Wed, 03 October 2007 14:52]
Re: 16 bits wchar [message #11963 is a reply to message #11960]
Thu, 04 October 2007 19:49
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Thu, 04 October 2007 17:33 | OK, patch applied. And you are right about 0xC2; I missed the fact that everything 0xC0 and 0xC1 could encode is representable by a single byte...
Mirek
|
Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine, and what you have - I think 3 or 4.) I hope there is no code that depends on six-byte UTF-8, but I doubt that this will be an issue for U++.
I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed UTF-8. When transmitted to the GUI, it is converted to valid UTF-16, and if needed you can convert it back to the same UTF-8. This system works, but it kind of creates a bias toward UTF-16. I know that there are objective reasons for this, and UTF-16 is the best choice for Windows and a reasonable one for other systems, but I would like to be able to process all Unicode formats without regard to OS interaction, efficiency and other issues. If I want to write an i18n GUI application, I'll use WString. If I want to write a console app which specializes in UTF-8 or UTF-32, I can process those in their native format without the need for conversions.
In order to do this, UTF-8 that is corrected only during conversion will no longer suffice. The error escaping must be done directly on the UTF-8; this way there will be no need to error-escape at conversions, only at load and save.
This way the normal methods will remain the same. For example, you could still use FromUtf8(in.GetLine()) and all your methods without modification. If you want to do special UTF processing (not needed in normal apps), you will use a new API which takes a "raw" UTF-8 string and escapes it if needed, with something like:
String ToUtf8(char code);
String ToUtf8(const char *s, int len);
String ToUtf8(const char *s);
String ToUtf8(const String& w);
or some other name, to avoid confusion with the wide-char variants.
You would use something like ToUtf8(in.GetLine()) to get valid UTF-8 from the input, for example. You just need to un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.
Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?
Re: 16 bits wchar [message #12136 is a reply to message #11963]
Fri, 12 October 2007 11:52
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Thu, 04 October 2007 13:49 |
| luzr wrote on Thu, 04 October 2007 17:33 | OK, patch applied. And you are right about 0xC2; I missed the fact that everything 0xC0 and 0xC1 could encode is representable by a single byte...
Mirek
|
Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine, and what you have - I think 3 or 4.) I hope there is no code that depends on six-byte UTF-8, but I doubt that this will be an issue for U++.
I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed UTF-8. When transmitted to the GUI, it is converted to valid UTF-16, and if needed you can convert it back to the same UTF-8. This system works, but it kind of creates a bias toward UTF-16.
|
I do not think that THIS creates a bias toward UTF-16 - for UCS-4 (meaning 32-bit integers), there is IMO no need to change anything in the error escaping method.
| Quote: |
You would use something like ToUtf8(in.GetLine()) to get valid UTF-8 from the input, for example. You just need to un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.
Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?
|
Well, actually, I do not see the problem that this is supposed to solve. I guess that if you are interested in valid UTF-8 only, there is no need for escaping at all - I guess it could/should be handled by an error message...
Mirek
Re: 16 bits wchar [message #12140 is a reply to message #12138]
Fri, 12 October 2007 13:54
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Fri, 12 October 2007 11:59 | P.S.: Really, more and more we are dealing with this, more and more it is apparent that the real solution is
|
Yes, these conversions are tricky, but they can be done. If you use wchar as a 32-bit value, that would simplify things, as you would only need two conversion functions, to UCS-4 and back, and all the fuss could be ignored. This would be a great idea for the GUI. But if I can create some useful things for the other standards too and you don't mind including them, I don't know why we shouldn't do it.
| luzr wrote on Fri, 12 October 2007 11:59 |
Anyway, what might be a good idea for now is Utf8 <-> Utf16 conversion utilities, what do you think?
|
After I finish my round-trip conversion code, I'll get right to it.
| luzr wrote on Fri, 12 October 2007 11:59 |
Also an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error-escapement? I can imagine a couple of scenarios where this might be very useful... E.g. what are we supposed to do with invalid UCS-4 values after all?
|
Yes, that would also be a good alternative. I chose the EExx encoding for two reasons:
1. You already use this approach.
2. Private code units are less likely to be found in external sources than overlong sequences, but I guess this depends a lot on circumstances. And as for invalid UCS-4, there are only unpaired surrogates and a couple more values, so I'm sure we can find a good place for them somewhere in the private planes (0x0EExxx, for example).
And can I use exceptions in these conversion routines?
I really need to read up on the differences between UCS4 and UTF-32.
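A tiny sketch of the idea in point 2 above, with made-up names: park unpaired surrogates (the bulk of the invalid UCS-4 values) in the suggested 0x0EExxx region so they survive a round trip. This illustrates the proposal, not existing U++ code, and it ignores the other invalid values (e.g. anything above 0x10FFFF).

```cpp
#include <cstdint>

// Map an unpaired surrogate (0xD800-0xDFFF) into the 0x0EExxx region
// proposed above; everything else passes through unchanged.
uint32_t EscapeLoneSurrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDFFF ? 0x0EE000 + (c - 0xD800) : c;
}

// Inverse mapping, restoring the original surrogate value.
uint32_t UnescapeLoneSurrogate(uint32_t c)
{
    return c >= 0x0EE000 && c <= 0x0EE7FF ? 0xD800 + (c - 0x0EE000) : c;
}
```

As with the EExx bytes, this scheme becomes ambiguous if the input legitimately contains code points in the chosen region.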
|
|
|
|
|
|
| Re: 16 bits wchar [message #12144 is a reply to message #12140] |
Fri, 12 October 2007 17:03   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Fri, 12 October 2007 07:54 |
| luzr wrote on Fri, 12 October 2007 11:59 |
Also an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error-escapement? I can imagine a couple of scenarios where this might be very useful... E.g. what are we supposed to do with invalid UCS-4 values after all?
|
Yes, that would also be a good alternative. I choose the EExx encoding out of two reasons:
|
Actually, I would keep EExx for ill-formed utf8 anyway. What I was getting at was rather the fact that UTF-8 is a sort of Huffman encoding.
In practice, there are a lot of cases where you have to store a set of offsets or indices efficiently which are "small" (e.g. lower than 128) in most cases, but in exceptional cases can be larger.
Using "full" UTF-8 would provide a nice compression algorithm here...
(Note that such use is completely unrelated to UNICODE, but why not reuse the existing code?)
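To make the compression idea concrete, here is a sketch of storing offsets in the UTF-8-style variable-length form: small values cost a single byte, larger ones grow as needed. It uses the original six-byte form of UTF-8 (invalid as Unicode text today, but fine as a private storage format), so any value below 2^31 fits. The function names are invented for the example.

```cpp
#include <cstdint>
#include <string>

// Append v (below 2^31) using the UTF-8-style variable-length code.
void PutVarUtf8(std::string& out, uint32_t v)
{
    if(v < 0x80)
        out += char(v);
    else if(v < 0x800) {
        out += char(0xC0 | (v >> 6));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x10000) {
        out += char(0xE0 | (v >> 12));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x200000) {
        out += char(0xF0 | (v >> 18));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x4000000) {
        out += char(0xF8 | (v >> 24));
        out += char(0x80 | ((v >> 18) & 0x3F));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else {
        out += char(0xFC | (v >> 30));
        out += char(0x80 | ((v >> 24) & 0x3F));
        out += char(0x80 | ((v >> 18) & 0x3F));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
}

// Read one value back, advancing i past the sequence.
uint32_t GetVarUtf8(const std::string& s, size_t& i)
{
    uint8_t b = static_cast<uint8_t>(s[i++]);
    int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2
                             : b < 0xF8 ? 3 : b < 0xFC ? 4 : 5;
    static const uint8_t lead_mask[] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
    uint32_t v = b & lead_mask[extra];
    while(extra-- > 0)
        v = (v << 6) | (static_cast<uint8_t>(s[i++]) & 0x3F);
    return v;
}
```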
Mirek
|
|
|
|
|
|
|
|
|
|
|
|
| Re: 16 bits wchar [message #12248 is a reply to message #12186] |
Sun, 21 October 2007 20:19   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Tue, 16 October 2007 05:13 | OK, I fixed all the bugs I could find, and judging by the number of test runs I did, both automatic and manual, I'm reasonably sure that the algorithms are correct. Any input string can be EE-ed to valid Utf-8 and back, even if the original input is too short.
There is only one issue left. If the original input contains one of our codes for EE-ing (range EE00-EEFF), the algorithm will gladly accept it as a valid sequence, thus preserving its representation. But when you undo the EE-ing, it will think that the input sequence was generated, so it will destroy that character and replace it with an incorrect one-byte character. We knew from the start that this issue would arise when the input contains these codes (which it normally shouldn't), but it would be nice if the algorithm detected them and either EE-ed them too or just gave an error.
Which method would you prefer?
|
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
|
|
|
| Re: 16 bits wchar [message #12253 is a reply to message #12248] |
Sun, 21 October 2007 23:46   |
cbpporter
Messages: 1428 Registered: September 2007
|
Ultimate Contributor |
|
|
| luzr wrote on Sun, 21 October 2007 20:19 |
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8
The routines are done and tested; I'll post them on Monday (I don't have them on my home computer, which brings up the problem of submitting to the forum - can I zip up my whole file for you or something?). I'm not sure if this is what you wanted to know.
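For context, a check along the lines of the CheckUtf8 call above could look like this. It is a deliberately simplified sketch with a hypothetical name, not the routine being posted: it rejects 4-byte sequences outright and does not catch overlong forms, so a real validator needs more checks than this.

```cpp
#include <cstdint>
#include <string>

// Returns true when every byte belongs to a structurally well-formed
// 1-3 byte UTF-8 sequence (lead byte of the right shape, followed by
// the right number of continuation bytes).
bool CheckUtf8Sketch(const std::string& s)
{
    size_t i = 0, n = s.size();
    while(i < n) {
        uint8_t b = static_cast<uint8_t>(s[i]);
        int extra = b < 0x80 ? 0
                  : (b & 0xE0) == 0xC0 ? 1
                  : (b & 0xF0) == 0xE0 ? 2 : -1;
        if(extra < 0 || i + extra >= n)   // bad lead byte or truncated tail
            return false;
        for(int k = 1; k <= extra; k++)
            if((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80)
                return false;             // missing continuation byte
        i += extra + 1;
    }
    return true;
}
```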
Now that I'm done with this: you said that some Utf8 <-> Utf16 conversion could be useful for now. I can also do this on Monday, but I'm not sure what you want, because you already have such a conversion. Do you want me to update it to Unicode 5.0, or do you want me to create code which handles surrogate pairs? As for controls that don't handle these correctly, I could then make them compatible too. This is quite trivial for controls that don't edit their caption, and those that do are mostly derived from a common base class, so it shouldn't be that hard.
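The surrogate-pair half of such a conversion is small. A sketch (the function name is made up, and the input is assumed to be a valid code point, i.e. not itself in the surrogate range):

```cpp
#include <cstdint>
#include <vector>

// Append one code point to a UTF-16 unit stream; code points above
// 0xFFFF become a high/low surrogate pair - the part a UCS-2-only
// conversion leaves out.
void PushUtf16(std::vector<uint16_t>& out, uint32_t c)
{
    if(c < 0x10000)
        out.push_back(static_cast<uint16_t>(c));
    else {
        c -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<uint16_t>(0xD800 + (c >> 10)));   // high surrogate
        out.push_back(static_cast<uint16_t>(0xDC00 + (c & 0x3FF))); // low surrogate
    }
}
```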
|
|
|
|
| Re: 16 bits wchar [message #12254 is a reply to message #12253] |
Sun, 21 October 2007 23:57   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Sun, 21 October 2007 17:46 |
| luzr wrote on Sun, 21 October 2007 20:19 |
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8
The routines are done and tested; I'll post them on Monday (I don't have them on my home computer, which brings up the problem of submitting to the forum - can I zip up my whole file for you or something?). I'm not sure if this is what you wanted to know.
|
Ah, I see.
Anyway, what are "other methods" supposed to do?
(I just want to see the bigger picture - IME, the only reasonable way of working with codepoints is to convert them to WString...).
Mirek
P.S.: Consider another aspect too - I have to be a little bit hesitant when adding things to Core - everything in chrset.cpp will bloat the Linux binaries...
|
|
|
|