16 bits wchar [message #8036] Mon, 05 February 2007 17:19
riri
Messages: 18
Registered: February 2006
Location: France
Promising Member
Hi all!

It's been a long time since I last posted to this forum :)

Just a metaphysical (and maybe ridiculous) question: I saw that WString uses 16-bit integers as its internal character values; is that suitable for every language, since not all Unicode code points can be represented in 65,536 values?
#ifdef PLATFORM_WINCE
typedef WCHAR              wchar;
#else
typedef word               wchar;
#endif


Again, it may be a stupid question, but if I understood correctly, the internal string representation is in Unicode format, no?
Re: 16 bits wchar [message #8059 is a reply to message #8036] Mon, 05 February 2007 23:07
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
riri wrote on Mon, 05 February 2007 11:19

Hi all!

It's been a long time since I last posted to this forum :)

Just a metaphysical (and maybe ridiculous) question: I saw that WString uses 16-bit integers as its internal character values; is that suitable for every language, since not all Unicode code points can be represented in 65,536 values?
#ifdef PLATFORM_WINCE
typedef WCHAR              wchar;
#else
typedef word               wchar;
#endif


Again, it may be a stupid question, but if I understood correctly, the internal string representation is in Unicode format, no?


Well, the main problem is that Win32 GDI output works with 16-bit characters -> wchar had better be 16-bit.

Other than that, yes, it works in most cases. Unicode characters above 0xFFFF are quite special (like Tolkien's alphabet) and not supported by any fonts.
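
For illustration, a minimal sketch (plain C++, not part of the U++ API) of how a code point above 0xFFFF would have to be split into a UTF-16 surrogate pair if such characters were to be supported:

#include <cstdio>

// Encode one supplementary code point (0x10000..0x10FFFF) as a UTF-16
// surrogate pair. Illustration only.
void ToSurrogatePair(unsigned cp, unsigned short& hi, unsigned short& lo)
{
    cp -= 0x10000;                                // 20 bits remain
    hi = (unsigned short)(0xD800 + (cp >> 10));   // high surrogate
    lo = (unsigned short)(0xDC00 + (cp & 0x3FF)); // low surrogate
}

int main()
{
    unsigned short hi, lo;
    ToSurrogatePair(0x1D11E, hi, lo);  // U+1D11E MUSICAL SYMBOL G CLEF
    printf("%04X %04X\n", hi, lo);     // prints D834 DD1E
    return 0;
}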


Mirek
Re: 16 bits wchar [message #11785 is a reply to message #8059] Tue, 25 September 2007 22:03
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
I was quite unhappy when I found out that U++ is not Unicode standard compliant with its "UTF-16" (what it implements is actually UCS-2). There are a lot of programs with poor Unicode support, which is partially because the STL doesn't support full Unicode either.

In theory it would be quite unforgivable for an application to handle just a subset of the standard. But how does the situation look in practice?

To answer this question I did a number of relatively thorough tests which took me about two hours. I used my computer at work, which runs Windows XP SP2. The first part was to determine whether the OS supports surrogate pairs. After some testing (and research) I found that surrogate pairs can be enabled easily and are enabled by default. Windows theoretically has no problem using this kind of character (but individual pieces of software can). Next I found a font which displays about 20000 characters with codes above 0xFFFF, installed it, and surprise surprise, it worked.

Next I tested a couple of applications. At first I wanted to give exact results, but I found it boring to write them and concluded that you would find it boring to read them. In short, Notepad and WordPad both display the characters correctly and identify two code units as one code point. Opera doesn't identify code points correctly in some files and cannot copy them correctly (it truncates them to their lower 16 bits). Internet Explorer works correctly, but it couldn't use the correct registry entries to display the characters, so it used a little black rectangle. And the viewer from Total Commander is really ill-equipped for these kinds of tasks.

Next I would like to test U++, but I get strange results when trying to find the length of a string when using only normal characters (with codes below 0xFFFF).

I took one of the examples and slightly modified it:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	SetDefaultCharset(CHARSET_UTF8);

	WString x = "";
	for(int i = 280; i < 300; i++)
		x.Cat(i);
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
	
	String y = x.ToString();
	DUMP(y);
	DUMP(y.GetLength());
	DUMP(y.GetCount());
	
	y.Cat(" (appended)");
	x = y.ToWString();
	
	DUMP(x);
	DUMP(x.GetLength());
	DUMP(x.GetCount());
}

I got these results:
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
x.GetLength() = 20
x.GetCount() = 20
y = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½
y.GetLength() = 40
y.GetCount() = 40
x = ─ÿ─Ö─Ü─¢─£─¥─₧─ƒ─á─í─ó─ú─ñ─Ñ─ª─º─¿─⌐──½ (appended)
x.GetLength() = 31
x.GetCount() = 31


Apart from the fact that the chars are mangled, the lengths don't seem to be OK. I may have understood incorrectly, but AFAIK GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.

I also started researching the exact encoding methods of UTF and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find an efficient way to index multi-char strings. I think I will have to use iterators instead.
Re: 16 bits wchar [message #11789 is a reply to message #11785] Tue, 25 September 2007 23:18
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Quote:


GetLength should return the length in code units and GetCount the number of real characters, i.e. code points.



I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names are there because each fits better in a different scenario (same thing as 0 and '\0').

Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic there, except that conversions between the two can be performed - and the conversions are the one and only place with encoding logic.
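
To illustrate that model, here is a minimal sketch reusing the calls from the example above (SetDefaultCharset, Cat, ToString, GetCount); the comments state the counts expected under this interpretation:

#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	SetDefaultCharset(CHARSET_UTF8);

	WString w;
	w.Cat(0x118);             // one 16-bit word in the WString
	String s = w.ToString();  // the UTF-8 encoding happens only here

	DUMP(w.GetCount());       // 1 (16-bit words)
	DUMP(s.GetCount());       // 2 (UTF-8 bytes: 0xC4 0x98)
}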

Quote:


I also started researching the exact encoding methods of UTF and I will add full Unicode support to strings. It will be for personal use, but if anybody is interested I will post my results. Right now I'm trying to find and efficient way to index multichar strings. I think I will have to use iterators instead.



Actually, it is not that I am not worried here. Anyway, I think that the only reasonable approach is perhaps to change wchar to 32-bit characters OR introduce LString.

The problem is that in that case you immediately have to perform conversions for all Win32 system calls... that is why I have concluded that it is not worth the trouble for now. (E.g. RTL clearly is the priority).

Anyway, any research in this area is welcome. And perhaps you could fix UTF-8 functions to support UTF-16 (so far, everything >0xffff is basically ignored).

Mirek

[Updated on: Tue, 25 September 2007 23:18]


Re: 16 bits wchar [message #11796 is a reply to message #8036] Wed, 26 September 2007 01:56
sergei
Messages: 94
Registered: September 2007
Member
As much as I'd like to see RTL in U++, I agree that Unicode should, if possible, be fixed. RTL is built upon Unicode, so a solid base - Unicode string storage - is essential. Who knows, maybe tomorrow someone will need Linear B.

I was thinking of UTF-32 as a possible main storage format. I wrote a simple benchmark to see what the speeds are with the three character sizes. Here are the results (source attached):

Size: 64; Iterations: 10000000; 8: 2281; 16: 2125; 32: 2172;
Size: 128; Iterations: 5000000; 8: 1625; 16: 1453; 32: 2391;
Size: 256; Iterations: 2500000; 8: 1328; 16: 1515; 32: 1578;
Size: 512; Iterations: 1250000; 8: 1375; 16: 1141; 32: 1141;
Size: 1024; Iterations: 625000; 8: 1172; 16: 953; 32: 984;
Size: 2048; Iterations: 312500; 8: 1094; 16: 875; 32: 906;
Size: 4096; Iterations: 156250; 8: 1109; 16: 938; 32: 859;
Size: 8192; Iterations: 78125; 8: 1110; 16: 890; 32: 922;
Size: 16384; Iterations: 39062; 8: 1000; 16: 813; 32: 4047;
Size: 32768; Iterations: 19531; 8: 1000; 16: 2250; 32: 3906;
Size: 65536; Iterations: 9765; 8: 1656; 16: 2172; 32: 3812;
Size: 131072; Iterations: 4882; 8: 1625; 16: 2125; 32: 3782;
Size: 262144; Iterations: 2441; 8: 1593; 16: 2110; 32: 3781;
Size: 524288; Iterations: 1220; 8: 1563; 16: 2109; 32: 3984;

IMHO, 32-bit values aren't much worse than 16-bit. For search/replace operations, non-32-bit values would have significant overhead for characters outside the Basic Multilingual Plane.

Converting UTF-32 to other formats shouldn't be a problem. But what I like most is that a character would be the same as a cell (unlike UTF-16, which might use 20 cells to store 19 characters).
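
A small sketch (plain C++, illustration only) of what that difference means in practice: with 16-bit cells, counting characters requires a scan that skips surrogate pairs, whereas with 32-bit cells the i-th cell is always the i-th character:

#include <cstddef>

// Count code points in a UTF-16 buffer: a high surrogate consumes the
// following low surrogate, so cells != characters. Illustration only.
size_t CodePoints16(const unsigned short *s, size_t cells)
{
    size_t n = 0;
    for(size_t i = 0; i < cells; i++) {
        if(s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < cells)
            i++;                  // skip the low surrogate
        n++;
    }
    return n;
}

// With 32-bit cells no such scan is needed: the code point count equals
// the cell count, and s[i] is always the i-th character.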

Edit: I didn't mention that I tested basic read/write performance. UTF handling would add overhead to 8 and 16 formats, but not to 32 format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full unicode, so there's plenty of space to escape to (without overtaking private space).
  • Attachment: UniCode.cpp
    (Size: 1.31KB, Downloaded 435 times)

[Updated on: Wed, 26 September 2007 02:30]


Re: 16 bits wchar [message #11797 is a reply to message #11789] Wed, 26 September 2007 07:43
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
sergei wrote on Wed, 26 September 2007 01:56


I didn't mention that I tested basic read/write performance. UTF handling would add overhead to 8 and 16 formats, but not to 32 format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full unicode, so there's plenty of space to escape to (without overtaking private space).


The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double the size of UTF-16. And I don't think that UTF-8EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?

luzr wrote on Tue, 25 September 2007 23:18


I am afraid you expect too much. GetLength returns exactly the same number as GetCount; the two names are there because each fits better in a different scenario (same thing as 0 and '\0').

Rather than thinking in terms of UTF-8 / UTF-16... String is just an array of bytes, WString an array of 16-bit words. There is not much more logic there, except that conversions between the two can be performed - and the conversions are the one and only place with encoding logic.


Then I don't understand how you can insert the values 280-300 into an 8-bit fixed-length character format. Are they translated to some code page? And if the values are 8-bit and there are 20 of them, why do I get a string of length 40 in the output? And why is the length of the same string 40 and not 20 when I switch over to the wide string?

[Updated on: Wed, 26 September 2007 07:44]


Re: 16 bits wchar [message #11798 is a reply to message #11797] Wed, 26 September 2007 08:48
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 26 September 2007 01:43


make sure to use a more permissive validation scheme. And what is RTL anyway?



A right-to-left language like Hebrew.

Quote:


Then I don't understand how you can insert the values 280-300 into an 8-bit fixed-length character format.



You cannot, and you are not doing that. You are creating a WString and then converting it to String, using the default encoding, which you have set to UTF-8.

In UTF-8, the values 280-300 get converted to two-byte sequences.
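
A worked example of that conversion (plain C++, just to show the arithmetic behind the doubled length):

#include <cstdio>

int main()
{
    // UTF-8 encoding of code point 280 (U+0118), the first value in the loop above.
    unsigned cp = 280;
    unsigned char b1 = 0xC0 | (cp >> 6);    // lead byte 110xxxxx -> 0xC4
    unsigned char b2 = 0x80 | (cp & 0x3F);  // continuation 10xxxxxx -> 0x98
    printf("%02X %02X\n", b1, b2);          // prints C4 98
    // 20 such code points -> 40 UTF-8 bytes, hence GetLength() = 40 above.
    return 0;
}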

Quote:


Are they translated to some code page?



ToString/ToWString uses default encoding ("default charset").

Mirek
Re: 16 bits wchar [message #11809 is a reply to message #11797] Wed, 26 September 2007 14:55
sergei
Messages: 94
Registered: September 2007
Member
cbpporter wrote on Wed, 26 September 2007 07:43

sergei wrote on Wed, 26 September 2007 01:56


I didn't mention that I tested basic read/write performance. UTF handling would add overhead to 8 and 16 formats, but not to 32 format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full unicode, so there's plenty of space to escape to (without overtaking private space).


The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double the size of UTF-16. And I don't think that UTF-8EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?




Well, 4MB of memory would hold 1 million characters. Do you typically need more, even for a rather complex GUI app? With 512MB/1GB of memory on many computers and 200GB hard drives, I don't think space is a serious issue now. I was more worried about performance - memory allocation and access is somewhat slower (but not always; for 256-8k sizes it's quite good).

The issue isn't UTF-8EE, it's more of a side effect. The main gain is that a char equals a cell. That is, LString (or whatever the name) can always be treated as UTF-32, unlike WString, which might be 20 wchars or an unknown-length UTF-16 string. It is even worse with UTF-8, where the String length would almost always be different from the number of characters stored. Replacing a char is a trivial operation in UTF-32, but might require shifting in UTF-8/16 (if the chars require different amounts of space). Searching for a char from the end (backwards) would require testing every match to see whether it is the second/third/fourth unit of some sequence. Actually, even simpler - how do you supply a multibyte char to some search/replace function in UTF-8/16? As an integer? That would require a conversion for every operation.

Unlike currently, when String is either a sequence of chars OR a UTF-8 string, LString would always be a sequence of ints/unsigned ints AND a UTF-32 string. String could be left for single-char storage (like data from a file or ASCII-only strings), WString for OS interop, and LString could supply conversions to/from both.

Re: 16 bits wchar [message #11812 is a reply to message #11809] Wed, 26 September 2007 15:37
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
Well, UTF-32 is the UNIX approach (wchar_t is 32 bits there).
It would certainly be a start for Unicode support, and it can be done easily by creating a class based on String, changing char to dword, and adding some new I/O functions plus functions to convert to the other UTF formats. But I would still like to see full UTF support, maybe not in normal strings, but in special Unicode strings. This evening I will try to implement a GetULength() function and look over the String and WString implementations to decide which functions would work with Unicode as they are and which need modification (for example, a find operation doesn't need to be changed, but a find starting at an index does).
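
A possible sketch of such a count over a UTF-8 byte buffer (GetULength is only the hypothetical name from the paragraph above; it counts lead bytes and assumes the input is well-formed):

// Hypothetical GetULength(): count code points in a UTF-8 buffer by
// counting everything except continuation bytes (10xxxxxx). Sketch only.
int GetULength(const char *s, int len)
{
	int n = 0;
	for(int i = 0; i < len; i++)
		if(((unsigned char)s[i] & 0xC0) != 0x80)
			n++;
	return n;
}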

Re: 16 bits wchar [message #11813 is a reply to message #8036] Wed, 26 September 2007 16:54
sergei
Messages: 94
Registered: September 2007
Member
Theoretically, String could be used "exclusively" for UTF-8 and WString for UTF-16. "Normal strings" could be Vector<char> and Vector<wchar>. All operations - (reverse) find/replace char/substring, trim (truncate), starts/endswith, left/right/mid, cat (append), insert - are applicable to Vectors as well (and maybe should be implemented as algorithms for all containers). Extra considerations might be a closing '\0' (maybe not necessary - normal strings aren't for interop with the OS, where '\0' is used; for internal work there's GetCount), and conversion functions (already partially implemented).

P.S. does anyone know why chars/wchars tend to be signed? IMHO unsigned character values are much more clear - after all the ASCII codes we use are unsigned (in hex).
Re: 16 bits wchar [message #11822 is a reply to message #11813] Wed, 26 September 2007 19:11
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
sergei wrote on Wed, 26 September 2007 16:54


P.S. does anyone know why chars/wchars tend to be signed? IMHO unsigned character values are much more clear - after all the ASCII codes we use are unsigned (in hex).


I think it is an artifact left over from C, where char was also used a lot for storing 8-bit integers and booleans (on 16-bit systems both int and short would often be 16 bits long, so an 8-bit integer type was needed).
Re: 16 bits wchar [message #11829 is a reply to message #11797] Wed, 26 September 2007 22:40
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 26 September 2007 01:43

sergei wrote on Wed, 26 September 2007 01:56


I didn't mention that I tested basic read/write performance. UTF handling would add overhead to 8 and 16 formats, but not to 32 format. I also remembered the UTF8-EE issue. UTF-32 could solve it easily. IIRC only 21 bits are needed for full unicode, so there's plenty of space to escape to (without overtaking private space).


The only problem with UTF-32 is the storage space. It is two to four times the size of UTF-8 and almost always double the size of UTF-16. And I don't think that UTF-8EE is such a big issue; you just have to make sure to use a more permissive validation scheme. And what is RTL anyway?



Not necessarily. The current way of handling this is that everything is mass-stored as UTF-8 and only converted to UCS-2 for processing.

I guess this system should stand.

The only real trouble (and the main reason why sizeof(wchar) is 2) is Win32 compatibility. I do not feel good about converting every text to UTF-16 for displaying on the screen... while, in reality, for 99% of applications UCS-2 is enough...

Mirek
Re: 16 bits wchar [message #11921 is a reply to message #11829] Mon, 01 October 2007 13:24
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
I finally finished my Unicode research (it took longer than planned because of computer games... :P). I read a good chunk of the Unicode Standard 5.0, looked over their official sample implementation, and studied U++'s String, WString and Stream classes a little.

I think that the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.

The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding (TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write, these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.

Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.
Re: 16 bits wchar [message #11925 is a reply to message #11921] Mon, 01 October 2007 14:28
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Mon, 01 October 2007 07:24


I think that the first thing that must be done is to extend PutUtf8 and GetUtf8 so that they correctly read values outside the BMP. This is not too difficult and I will try to implement and test it.



I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.

Quote:


The only issue is how to handle ill-formed values. I came to the conclusion that read and write operations must recognize compliant encodings, but they must also process ill-formed characters and insert them into the stream. If the stream is set to strict, it will throw an exception. If not, it will still encode. I propose the Least Complex Encoding (TM) possibility. Non-atomic Unicode-aware string manipulation functions should not fail when encountering such characters, so after a read, process and write, these ill-formed values (which could be essential to other applications) will be preserved. In this scenario, only functions that display the string must be aware that some data is ill-formed.



Well, the basic requirement there is that converting UTF-8 with invalid sequences to WString and back must result in an equal String. This feat is successfully achieved by UTF8EE.
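
A minimal sketch of that requirement, using the FromUtf8/ToUtf8 conversion functions listed elsewhere in this thread (the specific invalid bytes are only an example):

#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	// Round-trip requirement: even a String containing invalid UTF-8
	// bytes must come back unchanged after FromUtf8/ToUtf8 (UTF8EE).
	String s = "valid \xC4\x98 plus invalid \xFF\xC0 bytes";
	WString w = FromUtf8(s);   // invalid bytes are error-escaped
	String back = ToUtf8(w);   // ...and restored on the way back
	ASSERT(back == s);
}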

Also, I do not think that any string manipulation routine anywhere should ever be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert it back. I think that in the long run, it might even be faster.

Quote:


Next, there should be a method to validate the string, and a way to convert strings containing ill-formed sequences to error-escaped strings and back, so we can use atomic string processing if needed. This conversion should be done explicitly, so no general performance overhead is introduced.


bool CheckUtf8(const String& src);

You can add CheckUtf16 :)

Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back everywhere in Win32... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards-compatibility problems...

Mirek
Re: 16 bits wchar [message #11938 is a reply to message #11925] Wed, 03 October 2007 06:16
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
luzr wrote on Mon, 01 October 2007 14:28


I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.



Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).
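
For reference, a sketch of the 4-byte UTF-8 form (plain C++, not the actual U++ code):

#include <cstdio>

// 4-byte UTF-8 form for code points 0x10000..0x10FFFF:
// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Illustration only.
void Utf8Put4(unsigned cp, unsigned char out[4])
{
	out[0] = 0xF0 | (cp >> 18);
	out[1] = 0x80 | ((cp >> 12) & 0x3F);
	out[2] = 0x80 | ((cp >> 6) & 0x3F);
	out[3] = 0x80 | (cp & 0x3F);
}

int main()
{
	unsigned char b[4];
	Utf8Put4(0x1D11E, b);  // U+1D11E
	printf("%02X %02X %02X %02X\n", b[0], b[1], b[2], b[3]);  // F0 9D 84 9E
	return 0;
}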

Quote:


Also, I do not think that any string manipulation routine anywhere should ever be aware of the UTF-8 or UTF-16 encoding. It is much more practical to convert to WString, process, and (eventually) convert it back. I think that in the long run, it might even be faster.


That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.

Quote:


Anyway, seriously, I believe that the ultimate solution is to go with sizeof(wchar) = 4... The only trouble is converting this to UTF-16 and back everywhere in Win32... OTOH, the good news is that once the system code is fixed, the transition does not pose too many backwards-compatibility problems...


I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.



I started working on GetUtf8. I tried to keep everything as close as possible to your style of designing things, but I have two questions.

1. I couldn’t find any function that reads or writes UTF-8 strings (only a single char). The rest of the functions read using plain byte storage. This is OK for storing strings, but when loading them, I need a UTF-8-aware method.

2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.

Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if (code <= 0xDF)
    compute 2 byte value
else if (code <= 0xEF)
    compute 3 byte value
else if (...)
    pretty much just read them and return "space"


The issue with this is the invalid value range 0x80-0xC1, which is handled by your second if clause. These values are invalid as lead bytes in UTF-8, but you still decode them using their value and the value of the next character. If this is done for error-escaping, the UTF-8 standard expects you to error-escape only the current character and start processing the next one, not to build the error-escaped code from more than the absolute minimum number of code units (in this case one).
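
A minimal decode sketch of the rule described above (illustration only, not the U++ code; 3- and 4-byte forms are omitted): a byte in the invalid lead range 0x80..0xC1 is handled on its own and decoding resumes at the very next byte.

#include <vector>

// Decode UTF-8, escaping each invalid lead byte individually (here it is
// simply replaced by U+FFFD as a placeholder for the chosen escape scheme).
std::vector<unsigned> DecodeUtf8Sketch(const unsigned char *s, const unsigned char *end)
{
	std::vector<unsigned> out;
	while(s < end) {
		unsigned char c = *s++;
		if(c < 0x80)
			out.push_back(c);                                  // ASCII
		else if(c < 0xC2 || s == end)
			out.push_back(0xFFFD);                             // escape only this byte
		else if(c < 0xE0)
			out.push_back(((c & 0x1F) << 6) | (*s++ & 0x3F));  // 2-byte sequence
		else
			out.push_back(0xFFFD);                             // 3/4-byte forms omitted here
	}
	return out;
}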
Re: 16 bits wchar [message #11939 is a reply to message #11938] Wed, 03 October 2007 10:11
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 03 October 2007 00:16

luzr wrote on Mon, 01 October 2007 14:28


I guess fixing Utf8 routines to provide UTF16 surrogate support (for now) is a good idea.



Great! On a side note though, I am extending the Utf8 methods to handle 4-byte encodings, not UTF-16 surrogate pairs (which are illegal in UTF-8).



Well, so what is the result then? WString is now 16-bit. Utf8 conversions are basically String<->WString (ok, also char * <-> WString).

Quote:


That could be an acceptable compromise. But a few processing functions couldn't hurt when you really want to process that string in place.



Which exactly?

Quote:


I think you should keep UTF-16 as default for Win32 and UTF-32 as default for Linux. Win32 and .NET both use UTF-16 (with surrogates - Win98 doesn't support surrogates, but the rest do), so I think the future of character encoding for GUI purposes is pretty well defined.



That is why it is 16-bit now. But if you really need a solution for UCS-4, a 32-bit character plus conversions is the only option.

Quote:


1. I couldn’t find any function that reads or writes UTF-8 strings (only a single char).



String  ToUtf8(wchar code);
String  ToUtf8(const wchar *s, int len);
String  ToUtf8(const wchar *s);
String  ToUtf8(const WString& w);

WString FromUtf8(const char *_s, int len);
WString FromUtf8(const char *_s);
WString FromUtf8(const String& s);

bool utf8check(const char *_s, int len);

int utf8len(const char *s, int len);
int utf8len(const char *s);
int lenAsUtf8(const wchar *s, int len);
int lenAsUtf8(const wchar *s);

bool    CheckUtf8(const String& src);


Quote:


2. Your GetUtf8 method is quite straightforward, but I'm afraid it does not decode values correctly.

Here is a pseudo code of what you do:
if(code <= 0x7F)
    compute 1 byte value
else if (code <= 0xDF)
    compute 2 byte value
else if (code <= 0xEF)
    compute 3 byte value
else if (...)
    pretty much just read them and return "space"





Oops, you are right, something is really missing in Stream. Anyway, GetUtf8 in Stream is quite an auxiliary (and maybe wrong) addition. The real meat is in Charset.h/.cpp.

Mirek
Re: 16 bits wchar [message #11940 is a reply to message #11939] Wed, 03 October 2007 10:23
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
I know about those functions, but what I was looking for is something like String& Stream::ReadUtf8Line(). I don't want to read an arbitrary number of bytes and then convert them to an encoding afterwards. This makes Unicode feel more like an afterthought than something supported by the library.

But I still need to analyze some of your methods and then I'll be ready to reimplement them for full support. WString or its equivalent will still be 16-bit, but it will also contain surrogate pairs. Most of the GUI code should not be affected by this, but more experiments are needed before I can be sure.
Re: 16 bits wchar [message #11941 is a reply to message #11939] Wed, 03 October 2007 10:26
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Error escaping in Stream:

The error escaping in GetUtf8 is impossible, as it returns only a single int - you do not know you have to escape until you read more than a single character from the input - and then you need more than one wchar to be returned...
Re: 16 bits wchar [message #11942 is a reply to message #11941] Wed, 03 October 2007 10:36
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
luzr wrote on Wed, 03 October 2007 10:26

Error escaping in Stream:

The error escaping in GetUtf8 is impossible, as it returns only a single int - you do not know you have to escape until you read more than a single character from the input - and then you need more than one wchar to be returned...

It depends on what that int represents and what kind of error escaping is used. For UTF-8, there are only a small number of byte values that are invalid, and they could be escaped to non-character code points or even to a small region of the Private Use Area (for example FFF00-FFFFF). The Private Use Area has approximately 130000 reserved code points which are guaranteed not to appear in public Unicode data (they are reserved for private processing only, not data interchange).
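
A hypothetical escape mapping into that range (FFF00-FFFFF holds exactly 256 values, one per possible byte, so the escaping has an exact inverse):

// Hypothetical mapping of an invalid input byte into the range mentioned
// above; lossless because every byte value gets its own code point.
inline unsigned      EscapeByte(unsigned char b)  { return 0xFFF00 + b; }
inline bool          IsEscaped(unsigned cp)       { return cp >= 0xFFF00 && cp <= 0xFFFFF; }
inline unsigned char UnescapeByte(unsigned cp)    { return (unsigned char)(cp - 0xFFF00); }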
Re: 16 bits wchar [message #11943 is a reply to message #11940] Wed, 03 October 2007 10:42
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 03 October 2007 04:23

I know about those functions, but what I was looking for is something like String& Stream::ReadUtf8Line(). I don't want to read an arbitrary number of bytes and then convert them to an encoding afterwards. This makes Unicode feel more like an afterthought than something supported by the library.



What is wrong with FromUtf8(in.GetLine()) ?

What is the point of spreading encoding-related stuff all over the application? Stream works with bytes, end of story. I do not want to end up with multiple methods for everything that can handle text.
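
A short usage sketch of that composition (the file name is just a placeholder):

#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	FileIn in("input.txt");              // placeholder path
	if(in.IsOpen())
		while(!in.IsEof()) {
			WString line = FromUtf8(in.GetLine());
			// ... process line ...
		}
}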

Mirek