U++ forum: Welcome to the forum

Search on this site

Search in forums

Home » Developing U++ » U++ Developers corner » WString vs Grapheme Cluster idea (with possible flaw)

Show: Today's Messages :: Show Polls :: Message Navigator
E-mail to friend

WString vs Grapheme Cluster idea (with possible flaw) [message #57103]

Tue, 25 May 2021 14:25

mirek
Messages: 14265
Registered: November 2005

Ultimate Member

One of things that remains to be resolved in time is full unicode support. Which basically means that instead of working with UTF-16 characters, we need to start working with (extended) grapheme clusters.

In other words, where now there is wchar, in future needs to be some type with variable number of bytes. Which of course causes all kinds of problems...

However, recently I have got an interesting idea how to solve this "on cheap". Unicode defines codepoints up to about 300 000. Now if we redefine wchar to be 32 bit long, we have about 4 billions of "unusued" bit patterns in wchar. My idea then is to use them as "virtual characters" that eventually map to extended grapheme cluster. It would work this way:

When converting UTF-8 to WString, there would be a global map "grapheme cluster" -> "virtual character". When new grapheme cluster would be encountered, it would be added into this global map and in WString there would always just be 32-bit value.

This would make everything supersimple - e.g. current Ctrl::Key could keep working as it is, every piece of code with WString would work exactly the same (as in most cases, like in EditField, grapheme cluster is exactly what we want to work with). Draw::DrawText would just detect that the character is virtual and obtained the grapheme cluster from the global map...

Now the possible flaw in the reasoning: 4 billions is a large number, but is it large enough to stop worrying about it?

I think in the regular use, it is just fine. But what about webservices and dedicated hackers? What to do if we run out of virtual characters?

Realistically, it is probably ok... But are there any ideas how to eventually handle this (what to do when we reach 4 billions of virtual characters)?

Mirek

Report message to a moderator

Re: WString vs Grapheme Cluster idea (with possible flaw) [message #57106 is a reply to message #57103]

Tue, 25 May 2021 19:06

Didier
Messages: 736
Registered: November 2008
Location: France

Contributor

Hello Mirek,

I don't master all these caracter support problems but one thing that comes to my mind when I read you're proposition is:

With 300 000 graphemes in a table ==> you get 1.2 MB of used RAM : no problem
With 4 000 000 000 graphemes in a table ==> you get 12 GB of used RAM : this will be a problem before you run out of virtual characters

But I may misunderstand what you propose

Report message to a moderator

Re: WString vs Grapheme Cluster idea (with possible flaw) [message #57110 is a reply to message #57106]

Wed, 26 May 2021 08:24

mirek
Messages: 14265
Registered: November 2005

Ultimate Member

Didier wrote on Tue, 25 May 2021 19:06

Hello Mirek,

I don't master all these caracter support problems but one thing that comes to my mind when I read you're proposition is:

With 300 000 graphemes in a table ==> you get 1.2 MB of used RAM : no problem
With 4 000 000 000 graphemes in a table ==> you get 12 GB of used RAM : this will be a problem before you run out of virtual characters

But I may misunderstand what you propose

Yes, I guess you are right. Bad idea after all.

Means we will need something like GString, where individual characters are (potentially) represented as String...

Mirek

Report message to a moderator

Re: WString vs Grapheme Cluster idea (with possible flaw) [message #57114 is a reply to message #57110]

Wed, 26 May 2021 10:52

Didier
Messages: 736
Registered: November 2008
Location: France

Contributor

Well it really depends on how this is done.

One thing is for sure, all 4 000 000 000 graphemes won't be used at the same time
If we look at how virtual memory is managed, it's just about the same: we have a huge virtual memory but use only a tiny bit of it.
I don't know if it's feasable for graphemes since this has to be performant and we still would have the indirection table to manage which could be the killer

Why not use UTF32 ? sounds like the same (from my very far and poor UTF knowledge)

Report message to a moderator

Re: WString vs Grapheme Cluster idea (with possible flaw) [message #57115 is a reply to message #57114]

Wed, 26 May 2021 11:27

mirek
Messages: 14265
Registered: November 2005

Ultimate Member

Didier wrote on Wed, 26 May 2021 10:52

Well it really depends on how this is done.

One thing is for sure, all 4 000 000 000 graphemes won't be used at the same time
If we look at how virtual memory is managed, it's just about the same: we have a huge virtual memory but use only a tiny bit of it.
I don't know if it's feasable for graphemes since this has to be performant and we still would have the indirection table to manage which could be the killer

I think it would indeed work in most of cases. But those remaining ones probably make it unfeasible...

Quote:

Why not use UTF32 ? sounds like the same (from my very far and poor UTF knowledge)

Single grapheme can contain multiple unicode codepoints (that is, multiple UTF32 "characters"), yet we need to treat it as single character in most of scenarios. UTF32 is incomplete solution.

Mirek

[Updated on: Wed, 26 May 2021 11:27]

Report message to a moderator

Previous Topic:	Using BLITZ in release mode CLANG
Next Topic:	Using Pen with U++

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

] [

]

Current Time: Sun Jul 06 07:17:13 CEST 2025

Total time taken to generate the page: 0.03361 seconds