string filtering bug [message #22517] Tue, 21 July 2009 14:33
Zbych
Hi,

String filtering functions for some reason treat all input data as bytes. This causes incorrect filter behaviour if the string contains non-ASCII letters (e.g. Polish 'ą', 'ę') encoded in Unicode or UTF-8.

WString Filter(const wchar *s, int (*filter)(int))
{
	WString result;
	while(*s) {
		int c = (*filter)((char)*s++);
//                               ^^^^^^^^^ bug, should be wchar
		if(c) result.Cat(c);
	}
	return result;
}
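
To illustrate the narrowing problem marked above, here is a minimal sketch in standard C++ rather than U++. The names `wchar16`, `DropAOgonek`, `FilterNarrow` and `FilterWide` are hypothetical, and `char16_t` is only an assumed stand-in for the 16-bit `wchar`; this is not the actual fix that went into the library.

```cpp
#include <string>

// Hypothetical stand-in for a 16-bit wchar (assumption, not the U++ type).
using wchar16 = char16_t;

// Example filter: drops U+0105 (a with ogonek), keeps everything else.
int DropAOgonek(int c) { return c == 0x0105 ? 0 : c; }

// Narrowing variant: the cast to char discards the high byte,
// so the filter never sees the real character value.
std::u16string FilterNarrow(const wchar16 *s, int (*filter)(int))
{
	std::u16string result;
	while(*s) {
		int c = filter((char)*s++);
		if(c) result.push_back((wchar16)c);
	}
	return result;
}

// Wide variant: the filter receives the full 16-bit value.
std::u16string FilterWide(const wchar16 *s, int (*filter)(int))
{
	std::u16string result;
	while(*s) {
		int c = filter(*s++);
		if(c) result.push_back((wchar16)c);
	}
	return result;
}
```

With the input `{ 'z', 0x0105, 'b' }`, the wide variant drops the middle character as intended, while the narrowing variant hands the filter the value 0x05 and writes that corrupted value to the result.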




String Filter(const char *s, int (*filter)(int))
{
	String result;
	while(*s) {
		int c = (*filter)((byte)*s++);
//                               ^^^^^^^ problem when s is UTF-8
		if(c) result.Cat(c);
	}
	return result;
}
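
One possible way around the UTF-8 problem is to decode each code point, pass it to the filter, and re-encode the result. The sketch below uses only standard C++; `DecodeUtf8`, `AppendUtf8`, `FilterUtf8` and `StripOgonek` are hypothetical names, and the code assumes well-formed UTF-8 input and does no validation. It is not how U++ itself handles this.

```cpp
#include <cstddef>
#include <string>

// Decodes the code point starting at s[i] and advances i past it.
// Assumes well-formed UTF-8; no validation is performed.
int DecodeUtf8(const std::string& s, std::size_t& i)
{
	unsigned char b = (unsigned char)s[i++];
	int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : b >= 0xC0 ? 1 : 0;
	if(extra == 0) return b;
	int c = b & (0x3F >> extra);            // payload bits of the lead byte
	while(extra-- > 0 && i < s.size())
		c = (c << 6) | ((unsigned char)s[i++] & 0x3F);
	return c;
}

// Appends code point c to out, encoded as UTF-8.
void AppendUtf8(std::string& out, int c)
{
	if(c < 0x80)
		out += (char)c;
	else if(c < 0x800) {
		out += (char)(0xC0 | (c >> 6));
		out += (char)(0x80 | (c & 0x3F));
	}
	else if(c < 0x10000) {
		out += (char)(0xE0 | (c >> 12));
		out += (char)(0x80 | ((c >> 6) & 0x3F));
		out += (char)(0x80 | (c & 0x3F));
	}
	else {
		out += (char)(0xF0 | (c >> 18));
		out += (char)(0x80 | ((c >> 12) & 0x3F));
		out += (char)(0x80 | ((c >> 6) & 0x3F));
		out += (char)(0x80 | (c & 0x3F));
	}
}

// Applies filter to whole code points instead of single bytes.
std::string FilterUtf8(const std::string& s, int (*filter)(int))
{
	std::string result;
	std::size_t i = 0;
	while(i < s.size()) {
		int c = filter(DecodeUtf8(s, i));
		if(c) AppendUtf8(result, c);
	}
	return result;
}

// Example filter: maps U+0105 and U+0119 to plain 'a' and 'e'.
int StripOgonek(int c) { return c == 0x0105 ? 'a' : c == 0x0119 ? 'e' : c; }
```

The filter callback keeps the same `int (*)(int)` signature as in the code above, so existing filters that only look at ASCII values would behave as before; only the values they receive for non-ASCII input change.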

Re: string filtering bug [message #22520 is a reply to message #22517] Tue, 21 July 2009 17:17
mirek
Thanks. First one fixed (hopefully), second one I have to think through...

Mirek
Re: string filtering bug [message #22521 is a reply to message #22520] Tue, 21 July 2009 21:28
Zbych
luzr wrote on Tue, 21 July 2009 17:17

Thanks. First one fixed (hopefully), second one I have to think through...


Thanks.
I think that the same kind of bug is in the [] operator of String: it returns the n-th byte instead of the n-th letter. This example works fine:

	SetDefaultCharset(CHARSET_UTF8);
	String first_name = "John";
	String second_name = "Wayne";
	String login = first_name[0] + second_name;
	PromptOK(login);


But this one doesn't:

	SetDefaultCharset(CHARSET_UTF8);
	String first_name = "Ąohn";
	String second_name = "Wayne";
	String login = first_name[0] + second_name;
	PromptOK(login);
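
Why the second example breaks can be shown in standard C++, assuming the first letter is a two-byte UTF-8 character such as Ą (U+0104, bytes 0xC4 0x84). Index 0 then yields only the lead byte, which is not a valid character on its own. `Utf8SeqLen` and `FirstLetter` are hypothetical helpers, not U++ functions.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by lead byte b.
// Assumes b is a valid lead byte, not a continuation byte.
std::size_t Utf8SeqLen(unsigned char b)
{
	return b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
}

// Returns the first letter (code point) of s as a UTF-8 substring.
std::string FirstLetter(const std::string& s)
{
	return s.empty() ? std::string()
	                 : s.substr(0, Utf8SeqLen((unsigned char)s[0]));
}
```

For a `first_name` holding the bytes `"\xC4\x84" "ohn"`, `first_name[0]` is the single byte 0xC4, whereas `FirstLetter(first_name) + "Wayne"` keeps both bytes of the letter together.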

Re: string filtering bug [message #22522 is a reply to message #22521] Tue, 21 July 2009 21:44
cbpporter
This is not a bug but a design choice. String is not UTF-8 aware; it holds 8-bit characters. You can store UTF-8 in it, but U++ doesn't handle code points unless you convert to WString, and even then only 16-bit ones. That is not true Unicode, and it will fail if you do real internationalization.

Generally you can store UTF-8 without problems, but if you need to iterate over a UTF-8 string you are left to your own devices. UTF-8 (and even UTF-32) is not indexable by letter. There is no way to implement a fast []. There is a way to implement an amortized-cost [], but the best way is to use an iterator, which has excellent performance, with the limitation that you can only move forward or backward in a linear fashion. Fortunately, in most cases this is what you need.
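
The linear forward and backward stepping described above can be sketched in standard C++ without decoding anything, by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx). `IsContinuation`, `NextLetter`, `PrevLetter` and `LetterCount` are hypothetical names, the input is assumed to be well-formed UTF-8, and "letter" here means code point.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool IsContinuation(char b) { return ((unsigned char)b & 0xC0) == 0x80; }

// Returns the byte offset of the next code point start (or s.size()).
std::size_t NextLetter(const std::string& s, std::size_t pos)
{
	if(pos < s.size()) pos++;
	while(pos < s.size() && IsContinuation(s[pos])) pos++;
	return pos;
}

// Returns the byte offset of the previous code point start (or 0).
std::size_t PrevLetter(const std::string& s, std::size_t pos)
{
	if(pos > 0) pos--;
	while(pos > 0 && IsContinuation(s[pos])) pos--;
	return pos;
}

// Counts code points by walking the string once; cost is linear in s.size().
std::size_t LetterCount(const std::string& s)
{
	std::size_t n = 0;
	for(std::size_t pos = 0; pos < s.size(); pos = NextLetter(s, pos)) n++;
	return n;
}
```

Each step costs at most a few byte comparisons, but reaching the n-th letter still requires n steps from a known position, which is why a constant-time [] is not available.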
Re: string filtering bug [message #22524 is a reply to message #22522] Wed, 22 July 2009 10:34
Zbych
cbpporter wrote on Tue, 21 July 2009 21:44

This is not a bug, rather a design choice.


Ok, I checked that there is information about this in help:
Quote:

String works with 8 bit characters.


but I think that there should be a warning about String and UTF-8 (since UTF-8 is the default encoding for UPP).
Re: string filtering bug [message #22541 is a reply to message #22524] Sun, 26 July 2009 03:27
mirek
Zbych wrote on Wed, 22 July 2009 04:34


but I think that there should be a warning about String and UTF-8 (since UTF-8 is the default encoding for UPP).



OK, I have tried my best...