cbpporter:
mirek wrote on Mon, 19 June 2017 11:03
My understanding is that if the decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.
I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' these into a single codepoint - one of the reasons is that canonical compositions are unique, but the same compatibility decomposition can belong to multiple codepoints (found that out the hard way during testing).
That's why you read the spec!
Everything is convention based.
Compatibility decomposition, and everything marked 'compatibility' in Unicode, means that the character would not be part of Unicode and has no reason to exist in a standalone standard; it had to be added to be compatible with another standard.
Canonical decomposition is the only one needed for comparison and search.
Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.
I really don't think that users expect a hollowed-out C to return true when compared to a plain C.
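mirek's "found out the hard way" point above is easy to reproduce. A minimal sketch, assuming ICU4C for illustration (ICU is not the U++ API): several codepoints share the same compatibility decomposition "1", so there is no unique way back, while canonical composition is unambiguous.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;  // error checking elided for brevity
    const icu::Normalizer2 *nfkd = icu::Normalizer2::getNFKDInstance(status);
    const icu::Normalizer2 *nfc  = icu::Normalizer2::getNFCInstance(status);

    // U+00B9 (superscript one), U+2081 (subscript one) and U+2460 (circled one)
    // all have the compatibility decomposition "1", tagged <super>, <sub>, <circle>.
    icu::UnicodeString sup(u"\u00B9"), sub(u"\u2081"), circ(u"\u2460");
    bool same = nfkd->normalize(sup, status) == nfkd->normalize(sub, status)
             && nfkd->normalize(sub, status) == nfkd->normalize(circ, status);
    std::cout << "shared NFKD image: " << same << '\n';  // 1 -- no unique way to recompose

    // Canonical composition is unique: "e" + U+0301 always recomposes to U+00E9.
    std::cout << "canonical recomposition unique: "
              << (nfc->normalize(icu::UnicodeString(u"e\u0301"), status)
                  == icu::UnicodeString(u"\u00E9")) << '\n';  // 1
}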
mirek:
I think this is where we differ.
I really believe that if I, as a user, have a long document and I am searching for "Come", I really want to find the one starting with the hollowed C too. Or, e.g., U+FB01 (the "fi" ligature) should match when searching for "fi".
(not that I am going to implement that anytime soon, but IMO this is the expected behaviour).
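Concretely, that expectation is exactly the NFD vs NFKD difference. A minimal sketch, again assuming ICU4C: canonical normalization leaves U+2102 (the hollowed C) and U+FB01 alone, while compatibility normalization maps them to "C" and "fi", so a compatibility-normalized search finds the match.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfd  = icu::Normalizer2::getNFDInstance(status);
    const icu::Normalizer2 *nfkd = icu::Normalizer2::getNFKDInstance(status);

    icu::UnicodeString doc(u"\u2102ome here");  // "ℂome here", U+2102 = hollowed C
    icu::UnicodeString needle(u"Come");         // already NFKD-stable; in general, normalize both sides

    // Canonical normalization does not equate U+2102 with 'C'...
    std::cout << nfd->normalize(doc, status).indexOf(needle) << '\n';   // -1, not found
    // ...but compatibility normalization does (NFKD(U+2102) == "C").
    std::cout << nfkd->normalize(doc, status).indexOf(needle) << '\n';  // 0, found

    // Same story for the ligature: NFKD(U+FB01) == "fi".
    std::cout << (nfkd->normalize(icu::UnicodeString(u"\uFB01"), status)
                  == icu::UnicodeString(u"fi")) << '\n';                // 1
}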
cbpporter:
Yeah, that's why those decompositions are special and should not be used day-to-day.
Canonical decomposition in Unicode is not just something to convert to on demand to solve some task. It can be used as the sole encoding format for all your strings, because it doesn't change the meaning of the string. Example: LoadFile("c:\\utf8.txt"). You could return the string as is, or you could return one of the two canonical Unicode forms (NFD or NFC), with the original string never stored.
On the other hand, <font>, <circle> and the other compatibility decompositions change the meaning of the text. They are computed when needed, and you can't store your strings in that form 100% of the time.
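A minimal sketch of that storage-vs-on-demand distinction, assuming ICU4C and a hypothetical LoadNormalized() helper (this is not U++'s LoadFile): NFC is idempotent and leaves the "fi" ligature untouched, so it can be the single stored form; NFKC rewrites the ligature, so it cannot.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

// Hypothetical helper: normalize whatever bytes arrive from the stream to NFC.
icu::UnicodeString LoadNormalized(const std::string &utf8, UErrorCode &status) {
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    return nfc->normalize(icu::UnicodeString::fromUTF8(utf8), status);
}

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // "cafe" + combining acute (U+0301) + U+FB01 "fi" ligature, as raw UTF-8 bytes.
    icu::UnicodeString s = LoadNormalized("cafe\xCC\x81 \xEF\xAC\x81le", status);

    const icu::Normalizer2 *nfc  = icu::Normalizer2::getNFCInstance(status);
    const icu::Normalizer2 *nfkc = icu::Normalizer2::getNFKCInstance(status);
    std::cout << (nfc->normalize(s, status) == s) << '\n';   // 1: NFC is idempotent, safe to store
    std::cout << (nfkc->normalize(s, status) == s) << '\n';  // 0: NFKC rewrites the ligature -- meaning changes
}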
mirek:
Quote:
Yeah, that's why those decompositions are special and should not be used day-to-day.
Searching documents is day-to-day use. I think you got this wrong:
Quote:
Compatibility decomposition, and everything marked 'compatibility' in Unicode, means that the character would not be part of Unicode and has no reason to exist in a standalone standard; it had to be added to be compatible with another standard.
Canonical decomposition is the only one needed for comparison and search.
Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.
It really is not about another standard (well, it can be, but not only). It is about equivalence when searching.
cbpporter:
Day-to-day use: a standard encoding scheme, i.e. a guaranteed form to keep a string in.
Reading from a non-normalized stream and getting NFD or NFC back is not an error and can be standard behavior in a library.
Not day-to-day use: an on-demand encoding scheme.
Reading from a non-normalized stream and getting <font> substitutions back is an error. Taking a string, converting it on the fly to its <font> substitution, doing a search to find that "C" as in the example, and then discarding the converted string is not an error.
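That convert-search-discard pattern could look like the following sketch, again assuming ICU4C. ContainsCompat is a hypothetical name, and NFKC_Casefold additionally folds case, which is a common choice for searching but an extra assumption here.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Hypothetical helper: compare under compatibility equivalence. The normalized
// copies exist only for the duration of the call; nothing <font>-substituted
// is ever stored.
bool ContainsCompat(const icu::UnicodeString &haystack,
                    const icu::UnicodeString &needle) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *n = icu::Normalizer2::getNFKCCasefoldInstance(status);
    if (U_FAILURE(status))
        return false;
    return n->normalize(haystack, status).indexOf(n->normalize(needle, status)) >= 0;
}

// Usage: ContainsCompat(icu::UnicodeString(u"\u2102ome here"), icu::UnicodeString(u"Come"))
// returns true, while the stored document text keeps its original U+2102.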
There is no way I'm getting Unicode wrong.
The only misunderstandings come from my use of English or from my being too vague in my expressions.