Re: 16 bits wchar [message #11959 is a reply to message #11951] Thu, 04 October 2007 13:15
cbpporter
Messages: 1427
Registered: September 2007
Ultimate Contributor
OK, then we should leave Stream the way you intended. It serves its purpose well without extra buffers, and I don't want 20 variants of Stream with assorted kinds of buffers (like in Java).

So I am going to concentrate on CharSet and String. I created a function to check whether a UTF-8 sequence is valid. I know that you have such a function (I even reused most of it), but we target different versions of Unicode: mine is compliant (or will be) with the changes made after November 2003, while yours is older.

I have tested it a little and am going to look for some test data so I can fully debug it, but it looks something like this:
bool utf8check5(const char *_s, int len)
{
	const byte *s = (const byte *)_s;
	const byte *lim = s + len;
	int codePoint = 0;
	while(s < lim) {
		word code = (byte)*s++;
		if(code >= 0x80) {
			if(code < 0xC2)
				return false;
			else
			if(code < 0xE0) {
				if(s >= lim || *s < 0x80 || *s >= 0xc0)
					return false;
				codePoint = ((code - 0xC0) << 6) + *s - 0x80;
				if(codePoint < 0x80 || codePoint > 0x07FF)
					return false;
				s++;
			}
			else
			if(code < 0xF0) {
				if(s + 1 >= lim ||
				   s[0] < 0x80 || s[0] >= 0xc0 ||
				   s[1] < 0x80 || s[1] >= 0xc0)
				   	return false;
				codePoint = ((code - 0xE0) << 12) + ((s[0] - 0x80) << 6) + s[1] - 0x80;
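				// reject overlong forms and UTF-16 surrogates (U+D800..U+DFFF), which are not valid in UTF-8
				if(codePoint < 0x0800 || codePoint > 0xFFFF ||
				   (codePoint >= 0xD800 && codePoint <= 0xDFFF))
					return false;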
				s += 2;
			}
			else
			if(code < 0xF5) {
				if(s + 2 >= lim ||
				   s[0] < 0x80 || s[0] >= 0xc0 ||
				   s[1] < 0x80 || s[1] >= 0xc0 ||
				   s[2] < 0x80 || s[2] >= 0xc0)
				   	return false;
				codePoint = ((code - 0xf0) << 18) + ((s[0] - 0x80) << 12) + ((s[1] - 0x80) << 6) + s[2] - 0x80;
				if(codePoint < 0x010000 || codePoint > 0x10FFFF)
					return false;
				s += 3;
			}
			else
				return false;
		}
	}
	return true;
}
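
For the test data mentioned above, here is a minimal sketch of a table of known-good and known-bad byte sequences. The names IsValidUtf8, Utf8Case and CountUtf8Mismatches are hypothetical and not part of U++. IsValidUtf8 is a standalone stand-in written only for this sketch, so the example compiles without the U++ byte and word types. It rejects overlong forms, UTF-16 surrogates and anything above U+10FFFF, as RFC 3629 (November 2003) requires.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical standalone checker, not U++ API.
static bool IsCont(unsigned char b) { return (b & 0xC0) == 0x80; }

bool IsValidUtf8(const char *text, int len)
{
	assert(len >= 0);
	const unsigned char *s = (const unsigned char *)text;
	const unsigned char *lim = s + len;
	while(s < lim) {
		unsigned lead = *s++;
		int extra;         // continuation bytes expected after the lead byte
		unsigned cp, min;  // decoded code point, smallest value legal for this length
		if(lead < 0x80) continue;  // plain ASCII
		else if(lead >= 0xC2 && lead < 0xE0) { extra = 1; cp = lead & 0x1F; min = 0x80; }
		else if(lead >= 0xE0 && lead < 0xF0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
		else if(lead >= 0xF0 && lead < 0xF5) { extra = 3; cp = lead & 0x07; min = 0x10000; }
		else return false;  // stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF
		if(lim - s < extra) return false;  // sequence truncated by end of input
		for(int i = 0; i < extra; i++) {
			if(!IsCont(s[i])) return false;
			cp = (cp << 6) | (s[i] & 0x3F);
		}
		s += extra;
		if(cp < min || cp > 0x10FFFF) return false;     // overlong or out of range
		if(cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
	}
	return true;
}

struct Utf8Case { const char *bytes; int len; bool expected; };

static const Utf8Case kCases[] = {
	{ "abc",                  3, true  },  // ASCII
	{ "\xC3\xA9",             2, true  },  // U+00E9
	{ "\xE2\x82\xAC",         3, true  },  // U+20AC
	{ "\xF0\x9F\x98\x80",     4, true  },  // U+1F600
	{ "\xF4\x8F\xBF\xBF",     4, true  },  // U+10FFFF, the last valid code point
	{ "\xC0\x80",             2, false },  // overlong encoding of U+0000
	{ "\xE0\x80\x80",         3, false },  // overlong three-byte form
	{ "\xED\xA0\x80",         3, false },  // U+D800, a surrogate
	{ "\xF4\x90\x80\x80",     4, false },  // U+110000, above the Unicode range
	{ "\xE2\x82",             2, false },  // truncated three-byte sequence
	{ "\x80",                 1, false },  // stray continuation byte
	{ "\xF8\x88\x80\x80\x80", 5, false },  // five-byte form, removed from UTF-8 in 2003
};

// Returns how many table entries the checker classifies differently than expected.
int CountUtf8Mismatches()
{
	int bad = 0;
	for(const Utf8Case& c : kCases)
		if(IsValidUtf8(c.bytes, c.len) != c.expected) {
			std::printf("mismatch on case with first byte 0x%02X\n", (unsigned)(unsigned char)c.bytes[0]);
			bad++;
		}
	return bad;
}
```

Calling CountUtf8Mismatches() from a test should return 0. The same table can be fed to utf8check5 by swapping the call inside the loop, which gives a quick way to see which classes of malformed input a given checker catches.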
 