16 bits wchar (U++ Library support » U++ Libraries and TheIDE: i18n, Unicode and Internationalization)
Re: 16 bits wchar [message #8059 is a reply to message #8036]
Mon, 05 February 2007 23:07
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| riri wrote on Mon, 05 February 2007 11:19 | Hi all!
It has been a long time since I last posted to this forum.
Just a metaphysical (and maybe ridiculous) question: I saw that WString uses 16-bit integers as internal character values; is it suitable for every language, given that not all Unicode code points can be represented in 65536 values?
#ifdef PLATFORM_WINCE
typedef WCHAR wchar;
#else
typedef word wchar;
#endif
Again, it may be a stupid question, but if I understood correctly, the internal string representation is in Unicode format, no?
|
Well, the main problem is that Win32 GDI output works with 16-bit characters -> wchar had better be 16-bit.
Other than that, yes, it works in most cases. Unicode characters above 0xFFFF are quite special (like Tolkien's alphabet) and hardly supported by any fonts.
Mirek
Re: 16 bits wchar [message #11785 is a reply to message #8059]
Tue, 25 September 2007 22:03
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
I was quite unhappy when I found out that U++ is not Unicode standard compliant with its "UTF-16" (what it implements is actually UCS-2). There are a lot of programs with poor Unicode support, which is partially because the STL doesn't support full Unicode either.
In theory it would be quite unforgivable for an application to handle just a subset of the standard. But how does the situation look in practice?
To answer this question I did a number of relatively thorough tests, which took me about two hours. I used my computer at work, which runs Windows XP SP2. The first part was to determine whether the OS supports surrogate pairs. After some testing (and research) I found that surrogate pairs can be enabled easily and are enabled by default. Windows theoretically has no problem using these kinds of characters (but individual pieces of software can). Next I found a font which displays about 20000 characters with codes above 0xFFFF, installed it, and surprise surprise, it worked.
Next I tested a couple of applications. At first I wanted to give exact results, but I found it boring to write them up and concluded that you would find it boring to read them. In short, Notepad and WordPad both display correctly and identify two code units as one code point. Opera doesn't identify code points correctly in some files and cannot do copy operations correctly (it truncates to the lower 16 bits). Internet Explorer works correctly, but it couldn't use the correct registry entries to display the characters, so it used a little black rectangle. And the viewer from Total Commander is really ill-equipped for these kinds of tasks.
Next I wanted to test U++, but I got strange results when trying to find the length of a string, even when using only normal characters (with codes below 0xFFFF).
I took one of the examples and slightly modified it:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	SetDefaultCharset(CHARSET_UTF8);
	WString x = "";
	for(int i = 280; i < 300; i++)
		x.Cat(i);
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
	String y = x.ToString();
	DUMP(y);
	DUMP(y.GetLength());
	DUMP(y.GetCount());
	y.Cat(" (appended)");
	x = y.ToWString();
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
}
I got these results:
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
x.GetLength() = 20
x.GetCount() = 20
y = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
y.GetLength() = 40
y.GetCount() = 40
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½ (appended)
x.GetLength() = 31
x.GetCount() = 31
Except for the fact that the chars are mangled, the lengths don't seem to be OK. I may have understood incorrectly, but AFAIK GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
I also started researching the exact encoding methods of UTF, and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.
Re: 16 bits wchar [message #11789 is a reply to message #11785]
Tue, 25 September 2007 23:18
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| Quote: |
GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.
|
I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names exist because each fits better in a different scenario (the same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic than that, except that conversions between the two can be performed - the conversions are where the one and only piece of encoding logic lives.
| Quote: |
I also started researching the exact encoding methods of UTF, and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multichar strings. I think I will have to use iterators instead.
|
Actually, it is not that I am unconcerned here. Anyway, I think the only reasonable approach is perhaps to change wchar to 32-bit characters OR introduce LString.
The problem is that in that case you immediately have to perform conversions for all Win32 system calls... that is why I have concluded that it is not worth the trouble for now. (E.g. RTL is clearly the priority.)
Anyway, any research in this area is welcome. And perhaps you could fix the UTF-8 functions to support UTF-16 surrogates (so far, everything above 0xFFFF is basically ignored).
Mirek
[Updated on: Tue, 25 September 2007 23:18]
Re: 16 bits wchar [message #11796 is a reply to message #8036]
Wed, 26 September 2007 01:56
sergei (Member; Messages: 94; Registered: September 2007)
As much as I'd like to see RTL in U++, I agree that Unicode should, if possible, be fixed. RTL is built upon Unicode, so a solid base - Unicode string storage - is essential. Who knows, maybe tomorrow someone will need Linear B.
I was thinking of UTF-32 as a possible main storage format. I wrote a simple benchmark to see what the speeds are with the three character sizes. Here are the results (source attached):
Size: 64; Iterations: 10000000; 8: 2281; 16: 2125; 32: 2172;
Size: 128; Iterations: 5000000; 8: 1625; 16: 1453; 32: 2391;
Size: 256; Iterations: 2500000; 8: 1328; 16: 1515; 32: 1578;
Size: 512; Iterations: 1250000; 8: 1375; 16: 1141; 32: 1141;
Size: 1024; Iterations: 625000; 8: 1172; 16: 953; 32: 984;
Size: 2048; Iterations: 312500; 8: 1094; 16: 875; 32: 906;
Size: 4096; Iterations: 156250; 8: 1109; 16: 938; 32: 859;
Size: 8192; Iterations: 78125; 8: 1110; 16: 890; 32: 922;
Size: 16384; Iterations: 39062; 8: 1000; 16: 813; 32: 4047;
Size: 32768; Iterations: 19531; 8: 1000; 16: 2250; 32: 3906;
Size: 65536; Iterations: 9765; 8: 1656; 16: 2172; 32: 3812;
Size: 131072; Iterations: 4882; 8: 1625; 16: 2125; 32: 3782;
Size: 262144; Iterations: 2441; 8: 1593; 16: 2110; 32: 3781;
Size: 524288; Iterations: 1220; 8: 1563; 16: 2109; 32: 3984;
IMHO, 32-bit values aren't much worse than 16-bit. And for search/replace operations, non-32-bit encodings would have significant overhead for characters outside the Basic Multilingual Plane.
Converting UTF-32 to other formats shouldn't be a problem. But what I like most is that a character would be the same as a cell (unlike UTF-16, which might use 20 cells to store 19 characters).
Edit: I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
-
Attachment: UniCode.cpp
(Size: 1.31KB, Downloaded 572 times)
[Updated on: Wed, 26 September 2007 02:30]
Re: 16 bits wchar [message #11797 is a reply to message #11789]
Wed, 26 September 2007 07:43
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think that UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
| luzr wrote on Tue, 25 September 2007 23:18 |
I am afraid you expect too much. GetLength returns exactly the same number as GetCount, two names in this case are there because of each fits better for different scenario (same thing as 0 and '\0').
Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString array of 16bit words. Not much more logic there, except that conversions between two can be performed - in conversions there is one and only encoding logic.
|
Then I don't understand how you can insert the values 280-299 into an 8-bit fixed-length character format. Are they translated to some code page? And if the values are 8-bit and there are 20 of them, why do I get a string of length 40 in the output? And why is the length of the same string 40 and not 20 when I switch over to the wide string?
[Updated on: Wed, 26 September 2007 07:44]
Re: 16 bits wchar [message #11809 is a reply to message #11797]
Wed, 26 September 2007 14:55
sergei (Member; Messages: 94; Registered: September 2007)
| cbpporter wrote on Wed, 26 September 2007 07:43 |
| sergei wrote on Wed, 26 September 2007 01:56 |
I didn't mention that I tested basic read/write performance. UTF handling would add overhead to the 8- and 16-bit formats, but not to the 32-bit format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full Unicode, so there's plenty of space to escape to (without overtaking the private space).
|
The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double that of UTF-16. And I don't think that UTF8-EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?
|
Well, 4MB of memory would yield 1 million characters. Do you typically need more, even for a rather complex GUI app? With 512MB/1GB of memory on many computers and 200GB hard drives, I don't think space is a serious issue now. I was more worried about performance - memory allocation and access is somewhat slower (but not always; for sizes of 256 bytes to 8KB it's quite good).
The issue isn't UTF8-EE; it's more of a side effect. The main gain is that a char equals a cell. That is, LString (or whatever the name) can always be treated as UTF-32, unlike WString, which might be 20 wchars or an unknown-length UTF-16 string. It's even worse with UTF-8, where the String length would almost always differ from the number of characters stored. Replacing a char is a trivial operation in UTF-32, but it might require shifting in UTF-8/16 (if the chars require different amounts of space). Searching for a char from the end (backwards) would require testing every match to see whether it's the second/third/fourth unit of some sequence. Actually, even simpler - how do you supply a multibyte char to some search/replace function in UTF-8/16? As an integer? That would require a conversion for every operation.
Unlike the current situation, where String is either a sequence of chars OR a UTF-8 string, LString would always be both a sequence of ints/unsigned ints AND a UTF-32 string. String could be left for single-char storage (like data from a file, or ASCII-only strings), WString for OS interop, and LString could supply conversions to/from both.
Re: 16 bits wchar [message #11813 is a reply to message #8036]
Wed, 26 September 2007 16:54
sergei (Member; Messages: 94; Registered: September 2007)
Theoretically, String could be used "exclusively" for UTF-8 and WString for UTF-16, while "normal strings" could be Vector<char> and Vector<wchar>. All operations - (reverse) find/replace of char/substring, trim/truncate, starts/endswith, left/right/mid, cat/append, insert - apply to Vectors as well (and maybe should be implemented as algorithms for all containers). Extra considerations might be a terminating '\0' (maybe not necessary - normal strings aren't for interop with the OS, where '\0' is used; for internal work there's GetCount) and conversion functions (already partially implemented).
P.S. Does anyone know why chars/wchars tend to be signed? IMHO unsigned character values are much clearer - after all, the ASCII codes we use are unsigned (in hex).
Re: 16 bits wchar [message #11921 is a reply to message #11829]
Mon, 01 October 2007 13:24
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
I finally finished my Unicode research (it took longer than planned because of computer games...). I read a good chunk of the Unicode Standard 5.0, looked over its official sample implementation and studied U++'s String, WString and Stream classes a little.
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding(TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write cycle these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.
Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
Re: 16 bits wchar [message #11925 is a reply to message #11921]
Mon, 01 October 2007 14:28
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Mon, 01 October 2007 07:24 |
I think the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.
|
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
| Quote: |
The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding(TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write cycle these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.
|
Well, the basic requirement there is that converting UTF-8 with invalid sequences to WString and back must result in an identical String. This feat is successfully achieved by UTF8-EE.
Also, I do not think that any string manipulation routine anywhere should ever need to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
| Quote: |
Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
|
bool CheckUtf8(const String& src);
You can add CheckUtf16 
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
Mirek
Re: 16 bits wchar [message #11938 is a reply to message #11925]
Wed, 03 October 2007 06:16
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
| Quote: |
Also, I do not think that any string manipulation routine anywhere should ever need to be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert back. I think that in the long run it might even be faster.
|
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
| Quote: |
Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back in Win32 everywhere... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards compatibility problems...
|
I think you should keep UTF-16 as the default for Win32 and UTF-32 as the default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
I started working on GetUtf8. I tried to keep everything as close as possible to your style of designing things, but I have two questions.
1. I couldn't find any function that reads or writes UTF-8 strings (only a single char). The rest of the functions read using plain byte storage. This is OK for storing strings, but when loading them, I need a UTF-8-aware method.
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
The issue with this is the invalid range 0x80-0xC1, which is handled by your second if clause. These values are invalid as lead bytes in UTF-8, but you still decode them using their value and the value of the next byte. If this is done for error escaping, the UTF-8 standard expects you to error-escape only the current byte and start processing the next one, not to build the error-escaped code from more than the absolute minimum number of code units (in this case, one).
Re: 16 bits wchar [message #11939 is a reply to message #11938]
Wed, 03 October 2007 10:11
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Wed, 03 October 2007 00:16 |
| luzr wrote on Mon, 01 October 2007 14:28 |
I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.
|
Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
|
Well, so what is the result then? WString is now 16-bit. Utf8 conversions are basically String<->WString (ok, also char * <-> WString).
| Quote: |
That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.
|
Which exactly?
| Quote: |
I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.
|
That is why it is 16-bit now. But if you really need a solution for UCS-4, a 32-bit character type plus conversions is the only option.
| Quote: |
1. I couldn’t find any function that reads or writes UTF-8 strings (only a single char).
|
String ToUtf8(wchar code);
String ToUtf8(const wchar *s, int len);
String ToUtf8(const wchar *s);
String ToUtf8(const WString& w);
WString FromUtf8(const char *_s, int len);
WString FromUtf8(const char *_s);
WString FromUtf8(const String& s);
bool utf8check(const char *_s, int len);
int utf8len(const char *s, int len);
int utf8len(const char *s);
int lenAsUtf8(const wchar *s, int len);
int lenAsUtf8(const wchar *s);
bool CheckUtf8(const String& src);
| Quote: |
2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.
Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if(code <= 0xDF)
    compute 2 byte value
else if(code <= 0xEF)
    compute 3 byte value
else if(...)
    pretty much just read them and return "space"
|
Oops, you are right, something is really missing in Stream. Anyway, GetUtf8 in Stream is a quite auxiliary (and maybe wrong) addition. The real meat is in Charset.h/.cpp.
Mirek
Re: 16 bits wchar [message #11944 is a reply to message #11942]
Wed, 03 October 2007 12:10
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Wed, 03 October 2007 04:36 |
| luzr wrote on Wed, 03 October 2007 10:26 | Error escaping in Stream:
The error escaping in GetUtf8 is impossible, as it returns only a single int - you do not know you have to escape until you read more than a single character from the input, and then you need more than one wchar to be returned...
|
It depends on what that int represents and what kind of error escaping is used. For UTF-8, there are only a small number of byte values that are invalid, and they could be escaped to non-character code points, or even to a small region of the Private Use Area (for example 0xFFF00-0xFFFFF). The Private Use Area has approximately 130000 reserved code points which are guaranteed not to appear in public Unicode data (they are reserved for private processing only, not data interchange).
|
Ah, but that is not the problem - AFAIK.
The trouble is e.g. an invalid 6-byte sequence, which you detect at byte 6. In this case, you cannot reasonably return anything escaped from Stream::GetUtf8. You would need more than a 32-bit value for any reasonable output.
BTW, the private area is exactly what the "real" Utf8 functions use, just the range is 0xEE00 - 0xEEFF (I did not want to spoil the beginning of the range, and 0xEExx nicely resonates with "Error Escape").
However, please check the fixed version Stream::GetUtf8():
int Stream::GetUtf8()
{
	int code = Get();
	if(code <= 0) {
		LoadError();
		return -1;
	}
	if(code < 0x80)
		return code;
	else
	if(code < 0xC0)
		return -1;
	else
	if(code < 0xE0) {
		if(IsEof()) {
			LoadError();
			return -1;
		}
		return ((code - 0xC0) << 6) + Get() - 0x80;
	}
	else
	if(code < 0xF0) {
		int c0 = Get();
		int c1 = Get();
		if(c1 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xE0) << 12) + ((c0 - 0x80) << 6) + c1 - 0x80;
	}
	else
	if(code < 0xF8) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		if(c2 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF0) << 18) + ((c0 - 0x80) << 12) + ((c1 - 0x80) << 6) + c2 - 0x80;
	}
	else
	if(code < 0xFC) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		if(c3 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF8) << 24) + ((c0 - 0x80) << 18) + ((c1 - 0x80) << 12) +
		       ((c2 - 0x80) << 6) + c3 - 0x80;
	}
	else
	if(code < 0xFE) {
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		int c4 = Get();
		if(c4 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xFC) << 30) + ((c0 - 0x80) << 24) + ((c1 - 0x80) << 18) +
		       ((c2 - 0x80) << 12) + ((c3 - 0x80) << 6) + c4 - 0x80;
	}
	else {
		LoadError();
		return -1;
	}
}
BTW, thinking further about the UTF-8 -> UTF-16 surrogate conversion, I am afraid that it can in fact cause some problems in the code.
The primary motivation for "Error Escape" is that when a file that is not representable by UCS-2 wchars is loaded into the editor (e.g. the IDE), or if it simply has UTF-8 errors, there are two requirements:
- Parts of the file with correct and representable UTF-8 encoding must be editable
- Invalid parts must not be damaged by loading/saving.
I am afraid that with real surrogate pairs in the editor, the editor logic can go bad; it really expects that a single wchar represents one code point. There would be visual artifacts, with Win32 interpreting surrogate pairs correctly (while U++ considers each half of a pair a single character).
What a nice bunch of problems to solve. And we have not even started to consider the REAL problems.
Mirek
Re: 16 bits wchar [message #11946 is a reply to message #11944]
Wed, 03 October 2007 14:43
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| Quote: |
However, please check the fixed version Stream::GetUtf8():
|
Thank you! You should have said that you would fix it that quickly and I wouldn't have tried it myself. Shouldn't the second if clause be < 0xC2?
| Quote: |
What is the point of spreading encoding-related stuff all over the application? Stream works with bytes, end of story. I do not want to end up with multiple methods for everything that can handle text.
|
Yes, I agree, Stream should work with bytes. But text processing should never work with raw bytes, except in legacy mode.
And considering the problem regarding escaping, AFAIK, if the sixth byte is invalid, you need to signal an error for the first byte and continue decoding from the second byte as a new code point.
Also, six-byte UTF-8 is no longer considered correct and should only be used when legacy data needs to be processed. But since 4 bytes allow well over 1 million code points, I doubt there is any data stored in the six-byte format. CESU-8 is another thing, but that is not supported, so it's not a problem.
[Updated on: Wed, 03 October 2007 14:52]
Re: 16 bits wchar [message #11963 is a reply to message #11960]
Thu, 04 October 2007 19:49
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Thu, 04 October 2007 17:33 | OK, patch applied. And you are right about 0xC2; I missed the fact that everything 0xC0 and 0xC1 could encode is representable by a single byte...
Mirek
|
Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine, and what you have - I think 3 or 4.) I hope there is no code that depends on six-byte UTF-8, but I doubt that this will be an issue for U++.
I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed UTF-8. When transmitted to the GUI, it is converted to valid UTF-16, and if needed you can convert it back to the same UTF-8. This system works, but it kind of creates a bias toward UTF-16. I know that there are objective reasons for this, and UTF-16 is the best choice for Windows and a reasonable one for other systems, but I would like to be able to process all Unicode formats without regard to OS interaction, efficiency and other issues. If I want to write an i18n GUI application, I'll use WString. If I want to write a console app which specializes in UTF-8 or UTF-32, I can process those in their native format without the need for conversions.
In order to do this, UTF-8 that is corrected only during conversion will no longer suffice. The error escaping must be done directly on the UTF-8; this way there will be no need to error-escape at conversions, only at load and save.
This way the normal methods will remain the same. For example, you could still use FromUtf8(in.GetLine()) and all your methods without modification. If you want to do special UTF processing (not needed in normal apps), you will use a new API which takes a "raw" UTF-8 string and escapes it if needed, with something like:
String ToUtf8(char code);
String ToUtf8(const char *s, int len);
String ToUtf8(const char *s);
String ToUtf8(const String& w);
or some other name, to avoid confusion with the wide-char variants.
You would use something like ToUtf8(in.GetLine()) to get valid UTF-8 from the input, for example. You just need to un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.
Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?
Re: 16 bits wchar [message #12136 is a reply to message #11963]
Fri, 12 October 2007 11:52
mirek (Ultimate Member; Messages: 14290; Registered: November 2005)
| cbpporter wrote on Thu, 04 October 2007 13:49 |
| luzr wrote on Thu, 04 October 2007 17:33 | OK, patch applied. And you are right about 0xC2; I missed the fact that everything 0xC0 and 0xC1 could encode is representable by a single byte...
Mirek
|
Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine, and what you have - I think 3 or 4.) I hope there is no code that depends on six-byte UTF-8, but I doubt that this will be an issue for U++.
I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed UTF-8. When transmitted to the GUI, it is converted to valid UTF-16, and if needed you can convert it back to the same UTF-8. This system works, but it kind of creates a bias toward UTF-16.
|
I do not think that THIS creates a bias toward UTF-16 - for UCS-4 (meaning 32-bit integers), there is IMO no need to change anything in the error escaping method.
| Quote: |
You would use something like ToUtf8(in.GetLine()) to get valid UTF-8 from the input, for example. You just need to un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.
Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?
|
Well, actually, I do not see the problem that this is supposed to solve. I guess that if you are interested in valid UTF-8 only, there is no need for escaping at all - I guess it could/should be handled by an error message...
Mirek
Re: 16 bits wchar [message #12140 is a reply to message #12138]
Fri, 12 October 2007 13:54
cbpporter (Ultimate Contributor; Messages: 1428; Registered: September 2007)
| luzr wrote on Fri, 12 October 2007 11:59 | P.S.: Really, more and more we are dealing with this, more and more it is apparent that the real solution is
|
Yes, these conversions are tricky, but they can be done. If you use wchar as a 32-bit value, that would simplify things, as you would only need two conversion functions, to UCS-4 and back, and all the fuss could be ignored. This would be a great idea for the GUI. But if I can create some useful things for the other standards too and you don't mind including them, I don't know why we shouldn't do it.
| luzr wrote on Fri, 12 October 2007 11:59 |
Anyway, what might be a good idea for now is Utf8 <-> Utf16 conversion utilities, what do you think?
|
After I finish my round-trip conversion code, I'll get right to it.
| luzr wrote on Fri, 12 October 2007 11:59 |
Also an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error-escapement? I can imagine a couple of scenarios where this might be very useful... E.g. what are we supposed to do with invalid UCS-4 values after all?
|
Yes, that would also be a good alternative. I chose the EExx encoding for two reasons:
1. You already use this approach.
2. Private code units are less likely to be found in external sources than overlong sequences, but I guess this depends a lot on circumstances. And as for invalid UCS-4, there are only unpaired surrogates and a couple more values, so I'm sure we can find a good place for them somewhere in the private planes (0x0EExxx, for example).
And can I use exceptions in these conversion routines?
I really need to read up on the differences between UCS4 and UTF-32.
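A tiny sketch of the idea in point 2 above, with made-up names: park unpaired surrogates (the bulk of the invalid UCS-4 values) in the suggested 0x0EExxx region so they survive a round trip. This illustrates the proposal, not existing U++ code, and it ignores the other invalid values (e.g. anything above 0x10FFFF).

```cpp
#include <cstdint>

// Map an unpaired surrogate (0xD800-0xDFFF) into the 0x0EExxx region
// proposed above; everything else passes through unchanged.
uint32_t EscapeLoneSurrogate(uint32_t c)
{
    return c >= 0xD800 && c <= 0xDFFF ? 0x0EE000 + (c - 0xD800) : c;
}

// Inverse mapping, restoring the original surrogate value.
uint32_t UnescapeLoneSurrogate(uint32_t c)
{
    return c >= 0x0EE000 && c <= 0x0EE7FF ? 0xD800 + (c - 0x0EE000) : c;
}
```

As with the EExx bytes, this scheme becomes ambiguous if the input legitimately contains code points in the chosen region.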
|
|
|
|
|
|
| Re: 16 bits wchar [message #12144 is a reply to message #12140] |
Fri, 12 October 2007 17:03   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Fri, 12 October 2007 07:54 |
| luzr wrote on Fri, 12 October 2007 11:59 |
Also an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error-escapement? I can imagine a couple of scenarios where this might be very useful... E.g. what are we supposed to do with invalid UCS-4 values after all?
|
Yes, that would also be a good alternative. I choose the EExx encoding out of two reasons:
|
Actually, I would keep EExx for ill-formed utf8 anyway. What I was getting at was rather the fact that UTF-8 is a sort of Huffman encoding.
In practice, there are a lot of cases where you have to store a set of offsets or indices efficiently which are "small" (e.g. lower than 128) in most cases, but in exceptional cases can be larger.
Using "full" UTF-8 would provide a nice compression algorithm here...
(Note that such use is completely unrelated to UNICODE, but why not reuse the existing code?)
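To make the compression idea concrete, here is a sketch of storing offsets in the UTF-8-style variable-length form: small values cost a single byte, larger ones grow as needed. It uses the original six-byte form of UTF-8 (invalid as Unicode text today, but fine as a private storage format), so any value below 2^31 fits. The function names are invented for the example.

```cpp
#include <cstdint>
#include <string>

// Append v (below 2^31) using the UTF-8-style variable-length code.
void PutVarUtf8(std::string& out, uint32_t v)
{
    if(v < 0x80)
        out += char(v);
    else if(v < 0x800) {
        out += char(0xC0 | (v >> 6));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x10000) {
        out += char(0xE0 | (v >> 12));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x200000) {
        out += char(0xF0 | (v >> 18));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else if(v < 0x4000000) {
        out += char(0xF8 | (v >> 24));
        out += char(0x80 | ((v >> 18) & 0x3F));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
    else {
        out += char(0xFC | (v >> 30));
        out += char(0x80 | ((v >> 24) & 0x3F));
        out += char(0x80 | ((v >> 18) & 0x3F));
        out += char(0x80 | ((v >> 12) & 0x3F));
        out += char(0x80 | ((v >> 6) & 0x3F));
        out += char(0x80 | (v & 0x3F));
    }
}

// Read one value back, advancing i past the sequence.
uint32_t GetVarUtf8(const std::string& s, size_t& i)
{
    uint8_t b = static_cast<uint8_t>(s[i++]);
    int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2
                             : b < 0xF8 ? 3 : b < 0xFC ? 4 : 5;
    static const uint8_t lead_mask[] = { 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };
    uint32_t v = b & lead_mask[extra];
    while(extra-- > 0)
        v = (v << 6) | (static_cast<uint8_t>(s[i++]) & 0x3F);
    return v;
}
```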
Mirek
|
|
|
|
|
|
|
|
|
|
|
|
| Re: 16 bits wchar [message #12248 is a reply to message #12186] |
Sun, 21 October 2007 20:19   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Tue, 16 October 2007 05:13 | OK, I fixed all the bugs I could find, and judging by the number of test runs I did, both automatic and manual, I'm reasonably sure that the algorithms are correct. Any input string can be EE-ed to valid Utf-8 and back, even if the original input is too short.
There is only one issue left. If the original input contains one of our codes for EE-ing (range EE00-EEFF), the algorithm will gladly accept it as a valid sequence, thus preserving its representation. But when you undo the EE-ing, it will think that the input sequence was generated, so it will destroy that character and replace it with an incorrect one-byte character. We knew from the start that this issue would arise when the input contains these codes (which it normally shouldn't), but it would be nice if the algorithm detected them and either EE-ed them too or just gave an error.
Which method would you prefer?
|
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
|
|
|
| Re: 16 bits wchar [message #12253 is a reply to message #12248] |
Sun, 21 October 2007 23:46   |
cbpporter
Messages: 1428 Registered: September 2007
|
Ultimate Contributor |
|
|
| luzr wrote on Sun, 21 October 2007 20:19 |
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8
The routines are done and tested; I'll post them on Monday (I don't have them on my home computer, which brings up the problem of submitting to the forum - can I zip up my whole file for you or something?). I'm not sure if this is what you wanted to know.
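For context, a check along the lines of the CheckUtf8 call above could look like this. It is a deliberately simplified sketch with a hypothetical name, not the routine being posted: it rejects 4-byte sequences outright and does not catch overlong forms, so a real validator needs more checks than this.

```cpp
#include <cstdint>
#include <string>

// Returns true when every byte belongs to a structurally well-formed
// 1-3 byte UTF-8 sequence (lead byte of the right shape, followed by
// the right number of continuation bytes).
bool CheckUtf8Sketch(const std::string& s)
{
    size_t i = 0, n = s.size();
    while(i < n) {
        uint8_t b = static_cast<uint8_t>(s[i]);
        int extra = b < 0x80 ? 0
                  : (b & 0xE0) == 0xC0 ? 1
                  : (b & 0xF0) == 0xE0 ? 2 : -1;
        if(extra < 0 || i + extra >= n)   // bad lead byte or truncated tail
            return false;
        for(int k = 1; k <= extra; k++)
            if((static_cast<uint8_t>(s[i + k]) & 0xC0) != 0x80)
                return false;             // missing continuation byte
        i += extra + 1;
    }
    return true;
}
```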
Now that I'm done with this: you said that some Utf8 <-> Utf16 conversion could be useful for now. I can also do this on Monday, but I'm not sure what you want, because you already have such a conversion. Do you want me to update it to Unicode 5.0, or do you want me to create code which handles surrogate pairs? As for controls that don't handle these correctly, I could then make them compatible too. This is quite trivial for controls that don't edit their caption, and those that do are mostly derived from a common base class, so it shouldn't be that hard.
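The surrogate-pair half of such a conversion is small. A sketch (the function name is made up, and the input is assumed to be a valid code point, i.e. not itself in the surrogate range):

```cpp
#include <cstdint>
#include <vector>

// Append one code point to a UTF-16 unit stream; code points above
// 0xFFFF become a high/low surrogate pair - the part a UCS-2-only
// conversion leaves out.
void PushUtf16(std::vector<uint16_t>& out, uint32_t c)
{
    if(c < 0x10000)
        out.push_back(static_cast<uint16_t>(c));
    else {
        c -= 0x10000;                                               // 20 bits remain
        out.push_back(static_cast<uint16_t>(0xD800 + (c >> 10)));   // high surrogate
        out.push_back(static_cast<uint16_t>(0xDC00 + (c & 0x3FF))); // low surrogate
    }
}
```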
|
|
|
|
| Re: 16 bits wchar [message #12254 is a reply to message #12253] |
Sun, 21 October 2007 23:57   |
 |
mirek
Messages: 14290 Registered: November 2005
|
Ultimate Member |
|
|
| cbpporter wrote on Sun, 21 October 2007 17:46 |
| luzr wrote on Sun, 21 October 2007 20:19 |
Well, I might sound stupid now, but I got a little bit lost with regard to what problem we are really trying to solve.
In fact, I have already asked in some of the previous posts...
My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.
Well, here is what might help me: do you have any real-world scenario that can be solved using your routines? Considering one may tell us something about what we are trying to do.
Mirek
|
Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8
The routines are done and tested; I'll post them on Monday (I don't have them on my home computer, which brings up the problem of submitting to the forum - can I zip up my whole file for you or something?). I'm not sure if this is what you wanted to know.
|
Ah, I see.
Anyway, what are "other methods" supposed to do?
(I just want to see the bigger picture - IME, the only reasonable way of working with codepoints is to convert them to WString...).
Mirek
P.S.: Consider another aspect too - I have to be a little bit hesitant when adding things to Core - everything in chrset.cpp will bloat the Linux binaries...
|
|
|
|