16 bits wchar
Re: 16 bits wchar [message #11944 is a reply to message #11942] Wed, 03 October 2007 12:10
mirek
cbpporter wrote on Wed, 03 October 2007 04:36

luzr wrote on Wed, 03 October 2007 10:26

Error escaping in Stream:

The error escaping in GetUtf8 is impossible, as it returns only a single int - you do not know you have to escape until you have read more than a single character from the input - and then you need more than one wchar to be returned...

It depends on what that int represents and what kind of error escaping is used. For Utf-8, there are only a small number of byte values that are invalid, and they could be escaped to non-character code points or even to a small region of the Private Use Area (for example FFF00-FFFFF). The Private Use Area has approximately 130,000 reserved code points which are guaranteed not to appear in public Unicode data (they are reserved for private processing only, not data interchange).


Ah, but that is not the problem - AFAIK.

The trouble is e.g. an invalid 6-byte sequence, which you only detect at byte 6. In that case, you cannot reasonably return anything escaped from Stream::GetUtf8 - you would need more than a single 32-bit value for any reasonable output.

BTW, the private area is exactly what the "real" Utf8 functions use, just the range is 0xEE00 - 0xEEFF (I did not want to spoil the beginning of the range, and 0xEExx nicely resonates with "Error Escape" :)
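
For illustration, escaping and un-escaping a single bad byte is then trivial (a sketch only, not the actual library code; byte and wchar are the U++ typedefs):

inline wchar EscapeByte(byte b)    { return (wchar)(0xEE00 + b); }  // bad byte -> escape code
inline bool  IsEscape(wchar c)     { return c >= 0xEE00 && c <= 0xEEFF; }
inline byte  UnescapeByte(wchar c) { return (byte)(c - 0xEE00); }   // escape code -> original byte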

However, please check the fixed version of Stream::GetUtf8():

int Stream::GetUtf8()
{
	int code = Get();
	if(code <= 0) {          // EOF (NUL is rejected too)
		LoadError();
		return -1;
	}
	if(code < 0x80)          // single byte (ASCII)
		return code;
	else
	if(code < 0xC0)          // stray continuation byte
		return -1;
	else
	if(code < 0xE0) {        // 2-byte sequence
		if(IsEof()) {
			LoadError();
			return -1;
		}
		return ((code - 0xC0) << 6) + Get() - 0x80;
	}
	else
	if(code < 0xF0) {        // 3-byte sequence
		int c0 = Get();
		int c1 = Get();
		if(c1 < 0) {         // Get() keeps returning -1 past EOF, so testing the last byte suffices
			LoadError();
			return -1;
		}
		return ((code - 0xE0) << 12) + ((c0 - 0x80) << 6) + c1 - 0x80;
	}
	else
	if(code < 0xF8) {        // 4-byte sequence
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		if(c2 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF0) << 18) + ((c0 - 0x80) << 12) + ((c1 - 0x80) << 6) + c2 - 0x80;
	}
	else
	if(code < 0xFC) {        // 5-byte sequence (legacy form)
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		if(c3 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xF8) << 24) + ((c0 - 0x80) << 18) + ((c1 - 0x80) << 12) +
		       ((c2 - 0x80) << 6) + c3 - 0x80;
	}
	else
	if(code < 0xFE) {        // 6-byte sequence (legacy form)
		int c0 = Get();
		int c1 = Get();
		int c2 = Get();
		int c3 = Get();
		int c4 = Get();
		if(c4 < 0) {
			LoadError();
			return -1;
		}
		return ((code - 0xFC) << 30) + ((c0 - 0x80) << 24) + ((c1 - 0x80) << 18) +
		       ((c2 - 0x80) << 12) + ((c3 - 0x80) << 6) + c4 - 0x80;
	}
	else {                   // 0xFE/0xFF can never appear in UTF-8
		LoadError();
		return -1;
	}
}
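
For context, a typical consumer loop over this might look as follows (a sketch only; it assumes WString::Cat(int) and glosses over the fact that values above 0xFFFF do not fit a 16-bit wchar):

WString ReadAllUtf8(Stream& in)
{
	WString r;
	for(;;) {
		int c = in.GetUtf8();
		if(c < 0)
			break;      // EOF or malformed sequence, both reported as -1
		r.Cat(c);
	}
	return r;
}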


BTW, thinking further about the UTF-8 -> UTF-16 surrogate conversion, I am afraid that it can in fact cause some problems in the code.

The primary motivation for "Error Escape" is that when a file that is not representable by UCS-2 wchars is loaded into the editor (e.g. TheIDE), or if it simply has UTF-8 errors, there are two requirements:

- Parts of file with correct and representable UTF-8 encoding must be editable

- Invalid parts must not be damaged by loading/saving.

I am afraid that with real surrogate pairs in the editor, the editor logic can go bad; it really expects that a single wchar represents one code point. (For example, U+1D11E arrives as the pair D834 DD1E and would be counted as two characters.) There would be visual artifacts, with Win32 interpreting surrogate pairs correctly (while U++ considers them single characters).

What a nice bunch of problems to solve :) And we have not even started to consider REAL problems :)

Mirek
Re: 16 bits wchar [message #11946 is a reply to message #11944] Wed, 03 October 2007 14:43
cbpporter
Quote:


However, please check the fixed version of Stream::GetUtf8():


Thank you! You should have said that you would fix it that quickly, and I wouldn't have tried it myself :). Shouldn't the second if clause be < 0xC2?

Quote:


What is the point of spreading encoding-related stuff all over the application? Stream works with bytes, end of story. I do not want to end up with multiple methods for everything that can handle text.


Yes, I agree, Stream should work with bytes. But text processing should never work with bytes, unless in legacy mode.

And regarding the escaping problem, AFAIK, if the sixth byte is invalid, you need to signal an error for the first byte and continue decoding from the second byte as a new code point.

Also, six-byte Utf-8 is no longer considered valid, and should only be used when legacy data needs to be processed. But since 4 bytes allow well over 1 million code points, I doubt there is any data stored in the six-byte format. CESU-8 is another thing, but that is not supported, so it's not a problem.

[Updated on: Wed, 03 October 2007 14:52]


Re: 16 bits wchar [message #11951 is a reply to message #11946] Wed, 03 October 2007 21:40
mirek
Quote:


And regarding the escaping problem, AFAIK, if the sixth byte is invalid, you need to signal an error for the first byte and continue decoding from the second byte as a new code point.



Note that not even that is quite possible, unless I added a buffer to Stream for rejected sequence continuations...

Mirek
Re: 16 bits wchar [message #11959 is a reply to message #11951] Thu, 04 October 2007 13:15
cbpporter
OK, then we should leave Stream the way you intended. It serves its purpose well without extra buffers, and I don't want 20 variants of Stream with assorted kinds of buffers (like in Java).

So I am going to concentrate on CharSet and String. I created a function to check whether a UTF-8 sequence is correct or not. I know that you have such a function (I even reused most of it), but we use different versions of Unicode. Mine is (or will be) compliant with the changes made after November 2003, while yours follows an older version.

I tested it a little and am going to try to find some test data so I can fully debug it, but it looks something like this:
bool utf8check5(const char *_s, int len)
{
	const byte *s = (const byte *)_s;
	const byte *lim = s + len;
	int codePoint = 0;
	while(s < lim) {
		word code = (byte)*s++;
		if(code >= 0x80) {
			if(code < 0xC2)            // continuation byte or overlong 0xC0/0xC1 lead
				return false;
			else
			if(code < 0xE0) {          // 2-byte sequence
				if(s >= lim || *s < 0x80 || *s >= 0xC0)
					return false;
				codePoint = ((code - 0xC0) << 6) + *s - 0x80;
				if(codePoint < 0x80 || codePoint > 0x07FF)   // reject overlong forms
					return false;
				s++;
			}
			else
			if(code < 0xF0) {          // 3-byte sequence
				if(s + 1 >= lim ||
				   s[0] < 0x80 || s[0] >= 0xC0 ||
				   s[1] < 0x80 || s[1] >= 0xC0)
					return false;
				codePoint = ((code - 0xE0) << 12) + ((s[0] - 0x80) << 6) + s[1] - 0x80;
				if(codePoint < 0x0800 || codePoint > 0xFFFF)
					return false;
				// note: surrogate code points D800-DFFF are not rejected here
				s += 2;
			}
			else
			if(code < 0xF5) {          // 4-byte sequence, lead byte capped at 0xF4
				if(s + 2 >= lim ||
				   s[0] < 0x80 || s[0] >= 0xC0 ||
				   s[1] < 0x80 || s[1] >= 0xC0 ||
				   s[2] < 0x80 || s[2] >= 0xC0)
					return false;
				codePoint = ((code - 0xF0) << 18) + ((s[0] - 0x80) << 12) + ((s[1] - 0x80) << 6) + s[2] - 0x80;
				if(codePoint < 0x010000 || codePoint > 0x10FFFF)
					return false;
				s += 3;
			}
			else                       // 0xF5..0xFF: outside the Unicode range
				return false;
		}
	}
	return true;
}
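
A minimal usage sketch (LoadFile is from Core; the file name is just an example):

#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	String data = LoadFile("input.txt");   // any byte source will do
	if(!utf8check5(data, data.GetLength()))
		Cout() << "input is not well-formed UTF-8\n";
}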
Re: 16 bits wchar [message #11960 is a reply to message #11959] Thu, 04 October 2007 17:33
mirek
OK, patch applied. And you are right about 0xC2 - I missed the fact that anything 0xC0 or 0xC1 could encode is already representable by a single byte...

Mirek
Re: 16 bits wchar [message #11963 is a reply to message #11960] Thu, 04 October 2007 19:49
cbpporter
luzr wrote on Thu, 04 October 2007 17:33

OK, patch applied. And you are right about 0xC2 - I missed the fact that anything 0xC0 or 0xC1 could encode is already representable by a single byte...

Mirek

Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine - versus what you have - I think 3 or 4.) I hope there is no code that depends on six-byte Utf-8, but I doubt that this will be an issue for U++.

I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed Utf-8. When transmitted to the GUI, it is converted to valid Utf-16, and if needed you can convert it back to the same Utf-8. This system works, but it kind of creates a bias toward Utf-16. I know that there are objective reasons for this, and Utf-16 is the best choice for Windows and a reasonable one for other systems, but I would like to be able to process all Unicode formats without regard to OS interaction, efficiency and other issues. If I want to write an i18n GUI application, I'll use WString. If I want to write a console app which specializes in Utf-8 or Utf-32, I can process those in their native format without the need for conversions.

In order to do this, Utf-8 that is only corrected during conversion will no longer suffice. The error escaping must be done directly on the Utf-8; this way there will be no need to error-escape at conversions, only at load and save.

This way the normal methods will remain the same. For example, you could still use FromUtf8(in.GetLine()) and all your methods without modification. If you want to do special Utf processing (not needed in normal apps), you will use a new API which takes a "raw" Utf-8 string and escapes it if needed with something like:
String  ToUtf8(char code);
String  ToUtf8(const char *s, int len);
String  ToUtf8(const char *s);
String  ToUtf8(const String& w);

or some other name, to avoid confusion with the wide-char variants.

You would use something like ToUtf8(in.GetLine()) to get valid Utf-8 from the input, for example, and just un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.

Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?

Re: 16 bits wchar [message #12132 is a reply to message #11963] Fri, 12 October 2007 10:25
cbpporter
I created a function that takes a valid or invalid Utf-8 string and returns the length in bytes of the corresponding error-escaped Utf-8 string. The function utf8codepointEE is an internal function and should not be made public.

// Decodes one code point from [s, z); returns -1 if the buffer ends
// mid-sequence. Sets dep to the number of input bytes consumed and
// lmod to the number of bytes the (possibly escaped) code point will
// occupy in the error-escaped Utf-8 output.
inline int utf8codepointEE(const byte *s, const byte *z, int &lmod, int &dep)
{
	if (s < z)
	{
		word code = (byte)*s++;
		int codePoint = 0;

		if(code < 0x80)                 // plain ASCII byte
		{
			dep = 1;
			lmod = 1;
			return code;
		}
		else if (code < 0xC2)           // invalid lead byte: escape it
		{
			dep = 1;
			lmod = 3;                   // U+EExx encodes as 3 bytes
			return 0xEE00 + code;
		}
		else if (code < 0xE0)           // 2-byte sequence
		{
			if(s >= z)
				return -1;
			if (s[0] < 0x80 || s[0] >= 0xC0)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			codePoint = ((code - 0xC0) << 6) + *s - 0x80;
			if(codePoint < 0x80 || codePoint > 0x07FF)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			else
			{
				dep = 2;
				lmod = 2;
				return codePoint;
			}
		}
		else if (code < 0xF0)           // 3-byte sequence
		{
			if(s + 1 >= z)
				return -1;
			if(s[0] < 0x80 || s[0] >= 0xC0 || s[1] < 0x80 || s[1] >= 0xC0)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			codePoint = ((code - 0xE0) << 12) + ((s[0] - 0x80) << 6) + s[1] - 0x80;
			if(codePoint < 0x0800 || codePoint > 0xFFFF)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			else
			{
				dep = 3;
				lmod = 3;
				return codePoint;
			}
		}
		else if (code < 0xF5)           // 4-byte sequence
		{
			if(s + 2 >= z)
				return -1;
			if(s[0] < 0x80 || s[0] >= 0xC0 || s[1] < 0x80 || s[1] >= 0xC0 ||
			   s[2] < 0x80 || s[2] >= 0xC0)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			codePoint = ((code - 0xF0) << 18) + ((s[0] - 0x80) << 12) +
			            ((s[1] - 0x80) << 6) + s[2] - 0x80;
			if(codePoint < 0x010000 || codePoint > 0x10FFFF)
			{
				dep = 1;
				lmod = 3;
				return 0xEE00 + code;
			}
			else
			{
				dep = 4;                // valid 4-byte sequence passes through unchanged
				lmod = 4;
				return codePoint;
			}
		}
		else                            // 0xF5..0xFF: invalid lead byte
		{
			dep = 1;
			lmod = 3;
			return 0xEE00 + code;
		}
	}
	else
		return -1;
}

int utf8lenEE(const char *_s, int len)
{
	const byte *s = (const byte *)_s;
	const byte *lim = s + len;
	int length = 0;
	while(s < lim) {
		int lmod, dep;
		int codePoint = utf8codepointEE(s, lim, lmod, dep);
		if (codePoint == -1)    // truncated sequence at the end of the buffer
			return -1;

		length += lmod;         // output bytes needed
		s += dep;               // input bytes consumed
	}
	return length;
}
Re: 16 bits wchar [message #12134 is a reply to message #12132] Fri, 12 October 2007 11:27
cbpporter
And the functions for error-escaping (a low-level encoder plus the driver):

inline byte *putUtf8(byte *s, int codePoint)
{
	if (codePoint < 0x80)                  // 1 byte
		*s++ = codePoint;
	else if (codePoint < 0x0800)           // 2 bytes
	{
		*s++ = 0xC0 | (codePoint >> 6);
		*s++ = 0x80 | (codePoint & 0x3F);
	}
	else if (codePoint <= 0xFFFF)          // 3 bytes
	{
		*s++ = 0xE0 | (codePoint >> 12);
		*s++ = 0x80 | ((codePoint >> 6) & 0x3F);
		*s++ = 0x80 | (codePoint & 0x3F);
	}
	else                                   // 4 bytes
	{
		*s++ = 0xF0 | (codePoint >> 18);
		*s++ = 0x80 | ((codePoint >> 12) & 0x3F);
		*s++ = 0x80 | ((codePoint >> 6) & 0x3F);
		*s++ = 0x80 | (codePoint & 0x3F);
	}
	return s;
}

String ToUtf8EE(const char *_s, int _len)
{
	int tlen = utf8lenEE(_s, _len);
	if(tlen < 0)                    // input ends in a truncated sequence
		return "";
	StringBuffer result(tlen);

	const byte *s = (const byte *)_s;
	const byte *lim = s + _len;

	byte *z = (byte *)~result;
	int length = 0;
	while(s < lim) {
		int lmod, dep;
		int codePoint = utf8codepointEE(s, lim, lmod, dep);
		if (codePoint == -1)
			return "";

		length += lmod;
		s += dep;

		z = putUtf8(z, codePoint);
	}
	ASSERT(length == tlen);
	return result;
}


Now I only need to implement the reverse operation and do some round-trip conversions (a large number of random chars should do the trick) to make sure everything is correct.
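
The reverse operation will probably look roughly like this (an untested sketch of the idea, reusing utf8codepointEE and putUtf8 from above and assuming the String::Cat overloads; not the final code):

String FromUtf8EE(const char *_s, int len)
{
	const byte *s = (const byte *)_s;
	const byte *lim = s + len;
	String result;
	while(s < lim) {
		int lmod, dep;
		int codePoint = utf8codepointEE(s, lim, lmod, dep);
		if(codePoint < 0)
			return "";
		s += dep;
		if(codePoint >= 0xEE00 && codePoint <= 0xEEFF)
			result.Cat((int)(codePoint - 0xEE00));   // undo the escape: emit the raw byte
		else {
			byte h[4];
			result.Cat((const char *)h, (int)(putUtf8(h, codePoint) - h));
		}
	}
	return result;
}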
Re: 16 bits wchar [message #12136 is a reply to message #11963] Fri, 12 October 2007 11:52
mirek
cbpporter wrote on Thu, 04 October 2007 13:49

luzr wrote on Thu, 04 October 2007 17:33

OK, patch applied. And you are right about 0xC2 - I missed the fact that anything 0xC0 or 0xC1 could encode is already representable by a single byte...

Mirek

Did you replace the other one, or do you plan to support both versions of Unicode? (5 - mine - versus what you have - I think 3 or 4.) I hope there is no code that depends on six-byte Utf-8, but I doubt that this will be an issue for U++.

I will tell you a little about what I'm implementing next. Right now you have a system which allows the use of ill-formed Utf-8. When transmitted to the GUI, it is converted to valid Utf-16, and if needed you can convert it back to the same Utf-8. This system works, but it kind of creates a bias toward Utf-16.



I do not think that THIS creates a bias toward Utf-16 - for UCS-4 (meaning 32-bit integers), there is IMO no need to change anything in the error escaping method.

Quote:


You would use something like ToUtf8(in.GetLine()) to get valid Utf-8 from the input, for example, and just un-error-escape on store. Again, these two extra steps will not be necessary in normal apps.

Do you find any utility in this (not from a GUI programmer's standpoint, but from a generic library's standpoint)?



Well, actually, I do not see the problem that this is supposed to solve. If you are interested in valid utf8 only, there is no need for escaping at all - I guess it could/should then be handled by an error message...

Mirek
Re: 16 bits wchar [message #12138 is a reply to message #12136] Fri, 12 October 2007 11:59
mirek
P.S.: Really, the more we deal with this, the more apparent it is that the real solution is

typedef int32 wchar;


The only trouble is those UCS4 <-> UTF-16 conversions, but IMO that is not that big a trouble.
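
The surrogate math itself is simple (a sketch for illustration, using the U++ word and Vector types; not committed code):

void Ucs4ToUtf16(int c, Vector<word>& out)
{
	if(c < 0x10000)
		out.Add((word)c);                      // BMP code point: one 16-bit unit
	else {
		c -= 0x10000;                          // 20 payload bits remain
		out.Add((word)(0xD800 + (c >> 10)));   // high surrogate
		out.Add((word)(0xDC00 + (c & 0x3FF))); // low surrogate
	}
}

int Utf16ToUcs4(word hi, word lo)
{
	return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}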

Anyway, what might be a good idea for now are Utf8 <-> Utf16 conversion utilities - what do you think?

Also, an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error escapement? I can imagine a couple of scenarios where this might be very useful... E.g., what are we supposed to do with invalid UCS-4 values, after all?

Mirek
Re: 16 bits wchar [message #12140 is a reply to message #12138] Fri, 12 October 2007 13:54
cbpporter
luzr wrote on Fri, 12 October 2007 11:59

P.S.: Really, the more we deal with this, the more apparent it is that the real solution is

typedef int32 wchar;



Yes, these conversions are tricky, but they can be done. If you use wchar as a 32-bit value, that would simplify things: you would only need two conversion functions, to UCS-4 and back, and all the fuss could be ignored. This would be a great idea for the GUI. But if I can create some useful things for the other standards too, and you don't mind including them, I don't know why we shouldn't do it.

luzr wrote on Fri, 12 October 2007 11:59


Anyway, what might be a good idea for now are Utf8 <-> Utf16 conversion utilities - what do you think?


After I finish my round-trip conversion code, I'll get right to it.

luzr wrote on Fri, 12 October 2007 11:59


Also, an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error escapement? I can imagine a couple of scenarios where this might be very useful... E.g., what are we supposed to do with invalid UCS-4 values, after all?


Yes, that would also be a good alternative. I chose the EExx encoding for two reasons:
1. You already use this approach.
2. Private code units are less likely to be found in external sources than overlong sequences, but I guess this depends a lot on circumstances. And as for invalid UCS-4, there are only unpaired surrogates and a couple more values; I'm sure we can find a good place for them somewhere in the private planes (0x0EExxx, for example).

And can I use exceptions in these conversion routines?

I really need to read up on the differences between UCS-4 and UTF-32.
Re: 16 bits wchar [message #12142 is a reply to message #12140] Fri, 12 October 2007 16:25
cbpporter
I would also like to know how much effort it would take for controls which can be edited to have an optional hWnd.
Re: 16 bits wchar [message #12144 is a reply to message #12140] Fri, 12 October 2007 17:03
mirek
cbpporter wrote on Fri, 12 October 2007 07:54


luzr wrote on Fri, 12 October 2007 11:59


Also, an interesting question: while longer UTF-8 sequences are invalid, would it not actually be a good idea to accept them as a form of error escapement? I can imagine a couple of scenarios where this might be very useful... E.g., what are we supposed to do with invalid UCS-4 values, after all?


Yes, that would also be a good alternative. I chose the EExx encoding for two reasons:



Actually, I would keep EExx for ill-formed utf8 anyway. What I was getting at was rather the fact that UTF-8 represents a sort of Huffman encoding.

In practice, there are a lot of cases where you have to store a set of offsets or indices efficiently - they are "small" (e.g. lower than 128) in most cases, but in exceptional cases they can be larger.

Using "full" UTF-8 would provide a nice compression algorithm here...

(Note that such use is completely unrelated to Unicode, but why not reuse the existing code? :)
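
Something like this packer, say (a sketch of the idea only - PackNumber is a hypothetical name, not existing code; it handles values up to 2^31 - 1 in the classic 1..6-byte pattern):

byte *PackNumber(byte *t, dword v)
{
	if(v < 0x80) {                      // the common case costs one byte
		*t++ = (byte)v;
		return t;
	}
	int n = v < 0x800 ? 1 : v < 0x10000 ? 2 : v < 0x200000 ? 3
	                      : v < 0x4000000 ? 4 : 5;   // continuation byte count
	static const byte lead[] = { 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
	*t++ = lead[n - 1] | (byte)(v >> (6 * n));       // length-marking lead byte
	while(--n >= 0)
		*t++ = 0x80 | (byte)((v >> (6 * n)) & 0x3F); // 6 payload bits per byte
	return t;
}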

Mirek
Re: 16 bits wchar [message #12179 is a reply to message #12144] Mon, 15 October 2007 15:01
cbpporter
It seems that somewhere a bug managed to sneak in. I used some code to test whether a string keeps the same value after error-escaping and un-error-escaping, and once in a while I get an assertion error that the strings are not equal; even more rarely, I get one in String.h at line 567, in function Zero. I'll try to find and fix this bug.

And it would be really nice if TheIDE would take me to the correct file and line number after a failed assertion, rather than to some line in the AssertFailed function.

#include <Core/Core.h>
#include <cstdlib>
#include <ctime>

using namespace Upp;

const int StrLen = 10;
const int BufferSize = StrLen + 1;   // room for the terminating zero

CONSOLE_APP_MAIN
{
	char s[BufferSize];

	srand((unsigned)time(NULL));

	for (int j = 0; j < 10000; j++)
	{
		for (int i = 0; i < StrLen; i++)
			s[i] = rand() % 254 + 1;   // random non-zero bytes

		s[StrLen] = 0;                 // terminate inside the buffer (s[BufferSize] would write past it)

		String first = s;
		String second = ToUtf8EE(first, first.GetLength());

		String back = FromUtf8EE(second, second.GetLength());

		DUMP(first);
		DUMP(second);
		DUMP(back);

		if (second == "" || back == "")
			continue;

		ASSERT(first == back);
	}
}
Re: 16 bits wchar [message #12180 is a reply to message #12179] Mon, 15 October 2007 16:49
cbpporter
OK, I fixed the bug (it was just a bit operation that set one extra bit, but very hard to find), but there is still approximately a 1 in 10,000 chance that my conversion functions fail the length equality test. I need to find out why, but that is going to be something for tomorrow.
Re: 16 bits wchar [message #12186 is a reply to message #12180] Tue, 16 October 2007 11:13
cbpporter
OK, I fixed all the bugs I could find, and judging by the number of test runs I have done, both automatically and manually, I'm reasonably sure that the algorithms are correct. Any input string can be EE-ed to valid Utf-8 and back, even if the original input is truncated.

There is only one issue left. If the original input contains one of our codes for EE-ing (range EE00-EEFF), the checker will gladly accept it as a valid sequence, thus preserving its representation. But when you undo the EE-ing, it will think that the sequence was generated by the escaper, so it will destroy that character and replace it with an incorrect one-byte character. We knew from the start that this issue would arise when the input contains these codes (which it normally shouldn't), but it would be nice if the algorithm would detect these codes and either EE them too or just give an error (a sketch of the detection part follows).
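
For the detection, a predicate like this would do (a sketch only; the name is made up):

inline bool IsEECode(int codePoint)
{
	return codePoint >= 0xEE00 && codePoint <= 0xEEFF;
}

// In utf8codepointEE's valid branches, one would then, schematically:
//   if(IsEECode(codePoint)) { dep = 1; lmod = 3; return 0xEE00 + code; }
// i.e. re-escape the lead byte (or return an error instead).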

Which method would you prefer?
Re: 16 bits wchar [message #12246 is a reply to message #12179] Sun, 21 October 2007 20:14
mirek
cbpporter wrote on Mon, 15 October 2007 09:01

It seems that somewhere a bug managed to sneak in. I used some code to test whether a string keeps the same value after error-escaping and un-error-escaping, and once in a while I get an assertion error that the strings are not equal; even more rarely, I get one in String.h at line 567, in function Zero. I'll try to find and fix this bug.

And it would be really nice if TheIDE would take me to the correct file and line number after a failed assertion, rather than to some line in the AssertFailed function.



You can get there through the stack frames list.

Mirek
Re: 16 bits wchar [message #12248 is a reply to message #12186] Sun, 21 October 2007 20:19
mirek
cbpporter wrote on Tue, 16 October 2007 05:13

OK, I fixed all the bugs I could find, and judging by the number of test runs I have done, both automatically and manually, I'm reasonably sure that the algorithms are correct. Any input string can be EE-ed to valid Utf-8 and back, even if the original input is truncated.

There is only one issue left. If the original input contains one of our codes for EE-ing (range EE00-EEFF), the checker will gladly accept it as a valid sequence, thus preserving its representation. But when you undo the EE-ing, it will think that the sequence was generated by the escaper, so it will destroy that character and replace it with an incorrect one-byte character. We knew from the start that this issue would arise when the input contains these codes (which it normally shouldn't), but it would be nice if the algorithm would detect these codes and either EE them too or just give an error.

Which method would you prefer?


Well, I might now sound stupid, but I got a little bit lost with regard to what problem we are really trying to solve.

In fact, I have already asked in some of previous posts...

My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.

Well, what might help me: do you have any real-world scenario that can be solved using your routines? Maybe considering it will tell us something about what we are trying to do.

Mirek
Re: 16 bits wchar [message #12253 is a reply to message #12248] Sun, 21 October 2007 23:46
cbpporter
luzr wrote on Sun, 21 October 2007 20:19


Well, I might now sound stupid, but I got a little bit lost with regard to what problem we are really trying to solve.

In fact, I have already asked in some of previous posts...

My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.

Well, what might help me: do you have any real-world scenario that can be solved using your routines? Maybe considering it will tell us something about what we are trying to do.

Mirek


Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
    s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8

The routines are done and tested; I'll post them on Monday (I don't have them on my home computer - which brings up the problem of forum submission; can I zip you my whole file or something?). I'm not sure if this is what you wanted to know.

Now that I'm done with this: you said that some Utf8 <-> Utf16 conversion could be useful for now. I can also do this on Monday, but I'm not sure what you want, because you already have such a conversion. Do you want me to update it to Unicode 5.0, or do you want me to create code which handles surrogate pairs? As for controls that don't handle these correctly, I could then make them compatible too. This is quite trivial for controls that don't edit their caption, and those that do are mostly derived from one base class, so it shouldn't be that hard.
Re: 16 bits wchar [message #12254 is a reply to message #12253] Sun, 21 October 2007 23:57
mirek
cbpporter wrote on Sun, 21 October 2007 17:46

luzr wrote on Sun, 21 October 2007 20:19


Well, I might now sound stupid, but I got a little bit lost with regard to what problem we are really trying to solve.

In fact, I have already asked in some of previous posts...

My suggestion back then was that perhaps, if we are after rigid Unicode processing, we should not error-escape at all.

Well, what might help me: do you have any real-world scenario that can be solved using your routines? Maybe considering it will tell us something about what we are trying to do.

Mirek


Well, my routines are meant to be used this way:
// obtain a possibly invalid Utf-8 in s
if (!CheckUtf8(s))
    s = ToUtf8EE(s);
// pass s to other methods handling only valid Utf-8

The routines are done and tested; I'll post them on Monday (I don't have them on my home computer - which brings up the problem of forum submission; can I zip you my whole file or something?). I'm not sure if this is what you wanted to know.



Ah, I see.

Anyway, what are "other methods" supposed to do?

(I just want to see the bigger picture - IME, the only reasonable way of working with code points is to convert to WString...)

Mirek

P.S.: Consider another aspect too - I have to be a little hesitant when adding things to Core; everything in chrset.cpp will bloat the Linux binaries...