cbpporter:
mirek wrote on Mon, 19 June 2017 11:03
My understanding is that if the decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.
I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' these into a single codepoint - one of the reasons is that canonical compositions are unique, but the same compatibility decomposition can belong to multiple codepoints (found that out the hard way during testing).
That's why you read the spec!
Everything is convention based.
Compatibility decomposition, and everything marked 'compatibility' in Unicode, means that the character would not be part of Unicode and has no reason to exist in a standalone standard; it had to be added to be compatible with another standard.
Canonical decomposition is the only one needed for comparison and search.
Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.
I really don't think that users expect a hollowed-out C to return true when compared to a plain C.
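mirek's "found out the hard way" point above is easy to reproduce. A minimal sketch, assuming ICU4C for illustration (ICU is not the U++ API): several codepoints share the same compatibility decomposition "1", so there is no unique way back, while canonical composition is unambiguous.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;  // error checking elided for brevity
    const icu::Normalizer2 *nfkd = icu::Normalizer2::getNFKDInstance(status);
    const icu::Normalizer2 *nfc  = icu::Normalizer2::getNFCInstance(status);

    // U+00B9 (superscript one), U+2081 (subscript one) and U+2460 (circled one)
    // all have the compatibility decomposition "1", tagged <super>, <sub>, <circle>.
    icu::UnicodeString sup(u"\u00B9"), sub(u"\u2081"), circ(u"\u2460");
    bool same = nfkd->normalize(sup, status) == nfkd->normalize(sub, status)
             && nfkd->normalize(sub, status) == nfkd->normalize(circ, status);
    std::cout << "shared NFKD image: " << same << '\n';  // 1 -- no unique way to recompose

    // Canonical composition is unique: "e" + U+0301 always recomposes to U+00E9.
    std::cout << "canonical recomposition unique: "
              << (nfc->normalize(icu::UnicodeString(u"e\u0301"), status)
                  == icu::UnicodeString(u"\u00E9")) << '\n';  // 1
}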
mirek:
I think this is where we differ.
I really believe that if I, as a user, have a long document and I am searching for "Come", I really want to find the one starting with the hollowed C too. Or, e.g., U+FB01 (the "fi" ligature) should match when searching for "fi".
(not that I am going to implement that anytime soon, but IMO this is the expected behaviour).
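Concretely, that expectation is exactly the NFD vs NFKD difference. A minimal sketch, again assuming ICU4C: canonical normalization leaves U+2102 (the hollowed C) and U+FB01 alone, while compatibility normalization maps them to "C" and "fi", so a compatibility-normalized search finds the match.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfd  = icu::Normalizer2::getNFDInstance(status);
    const icu::Normalizer2 *nfkd = icu::Normalizer2::getNFKDInstance(status);

    icu::UnicodeString doc(u"\u2102ome here");  // "ℂome here", U+2102 = hollowed C
    icu::UnicodeString needle(u"Come");         // already NFKD-stable; in general, normalize both sides

    // Canonical normalization does not equate U+2102 with 'C'...
    std::cout << nfd->normalize(doc, status).indexOf(needle) << '\n';   // -1, not found
    // ...but compatibility normalization does (NFKD(U+2102) == "C").
    std::cout << nfkd->normalize(doc, status).indexOf(needle) << '\n';  // 0, found

    // Same story for the ligature: NFKD(U+FB01) == "fi".
    std::cout << (nfkd->normalize(icu::UnicodeString(u"\uFB01"), status)
                  == icu::UnicodeString(u"fi")) << '\n';                // 1
}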
cbpporter:
Yeah, that's why those decompositions are special and should not be used day-to-day.
Canonical decomposition in Unicode is not just something to convert to on demand to solve some task. It can be used as the sole encoding format for all your strings, because it doesn't change the meaning of the string. Example: LoadFile("c:\\utf8.txt"). You could return the string as is, or you could return one of the two canonical Unicode forms (NFD or NFC), with the original string never stored.
On the other hand, <font>, <circle> and the other compatibility decompositions change the meaning of the text. They are computed when needed, and you can't store your strings in that form 100% of the time.
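A minimal sketch of that storage-vs-on-demand distinction, assuming ICU4C and a hypothetical LoadNormalized() helper (this is not U++'s LoadFile): NFC is idempotent and leaves the "fi" ligature untouched, so it can be the single stored form; NFKC rewrites the ligature, so it cannot.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

// Hypothetical helper: normalize whatever bytes arrive from the stream to NFC.
icu::UnicodeString LoadNormalized(const std::string &utf8, UErrorCode &status) {
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    return nfc->normalize(icu::UnicodeString::fromUTF8(utf8), status);
}

int main() {
    UErrorCode status = U_ZERO_ERROR;
    // "cafe" + combining acute (U+0301) + U+FB01 "fi" ligature, as raw UTF-8 bytes.
    icu::UnicodeString s = LoadNormalized("cafe\xCC\x81 \xEF\xAC\x81le", status);

    const icu::Normalizer2 *nfc  = icu::Normalizer2::getNFCInstance(status);
    const icu::Normalizer2 *nfkc = icu::Normalizer2::getNFKCInstance(status);
    std::cout << (nfc->normalize(s, status) == s) << '\n';   // 1: NFC is idempotent, safe to store
    std::cout << (nfkc->normalize(s, status) == s) << '\n';  // 0: NFKC rewrites the ligature -- meaning changes
}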
mirek:
Quote:
Yeah, that's why those decompositions are special and should not be used day-to-day.
Searching documents is day-to-day use. I think you got this wrong:
Quote:
Compatibility decomposition, and everything marked 'compatibility' in Unicode, means that the character would not be part of Unicode and has no reason to exist in a standalone standard; it had to be added to be compatible with another standard.
Canonical decomposition is the only one needed for comparison and search.
Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.
It really is not about another standard (well, it can be, but not only). It is about equivalence when searching.
cbpporter:
Day-to-day use: a standard encoding scheme, i.e. a guaranteed form to keep a string in.
Reading from a non-normalized stream and getting NFD or NFC back is not an error and can be standard behavior in a library.
Not day-to-day use: an on-demand encoding scheme.
Reading from a non-normalized stream and getting <font> substitutions back is an error. Taking a string, converting it on the fly to its <font> substitution, doing a search to find that "C" as in the example, and then discarding the converted string is not an error.
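That convert-search-discard pattern could look like the following sketch, again assuming ICU4C. ContainsCompat is a hypothetical name, and NFKC_Casefold additionally folds case, which is a common choice for searching but an extra assumption here.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>

// Hypothetical helper: compare under compatibility equivalence. The normalized
// copies exist only for the duration of the call; nothing <font>-substituted
// is ever stored.
bool ContainsCompat(const icu::UnicodeString &haystack,
                    const icu::UnicodeString &needle) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *n = icu::Normalizer2::getNFKCCasefoldInstance(status);
    if (U_FAILURE(status))
        return false;
    return n->normalize(haystack, status).indexOf(n->normalize(needle, status)) >= 0;
}

// Usage: ContainsCompat(icu::UnicodeString(u"\u2102ome here"), icu::UnicodeString(u"Come"))
// returns true, while the stored document text keeps its original U+2102.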
There is no way I'm getting Unicode wrong.
The only misunderstandings come from my use of English or from my being too vague in my expressions.