Re: Choosing the best way to go full UNICODE [message #48303 is a reply to message #48302] Mon, 19 June 2017 10:22
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
mirek wrote on Mon, 19 June 2017 11:03
My understanding is that if a decomposition sequence starts with "<", it is 'compatibility'; if not, it is 'canonical'.

I believe that you should use compatibility sequences e.g. for comparing, but you should never 'recompose' these into a single codepoint. One of the reasons is that canonical compositions are unique, but the same compatibility decomposition can belong to multiple codepoints (found that out the hard way during testing).

That's why you read the spec!

Everything is convention based.

Compatibility decomposition, like everything marked as compatibility in Unicode, covers things that would not otherwise be part of Unicode and have no reason to exist in a standalone standard; they had to be added to be compatible with another standard.

Canonical is the only one that is needed for comparing and searching.

Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.

And the rest can be ignored. Like <font> substitutions:
http://www.fileformat.info/info/unicode/char/2102/index.htm

I really don't think that users expect a hollowed-out C to compare equal to a plain C.
Re: Choosing the best way to go full UNICODE [message #48304 is a reply to message #48302] Mon, 19 June 2017 10:23
cbpporter
int UnicodeDecompose(dword codepoint, dword t[MAX_DECOMPOSED], bool& canonical);

PS: Unicode doesn't and shouldn't tell you if the decomposition is canonical or not. You ask it for one or the other, never both in the same string.
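
For illustration, a minimal sketch of the "ask for one or the other" idea, assuming the mapping comes from the decomposition field of UnicodeData.txt, where a leading <tag> marks a compatibility mapping. ParseDecompositionField and DecomposeOne are hypothetical names, not U++ functions, and only one level of decomposition is handled here (a full decomposition applies the mappings recursively).

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parsed form of the decomposition field of one UnicodeData.txt entry.
struct Decomposition {
    bool compatibility = false;             // true if the field starts with a "<tag>"
    std::string tag;                        // e.g. "font" or "compat"; empty if canonical
    std::vector<std::uint32_t> codepoints;  // the mapped code points
};

// Hypothetical helper: parses fields such as "0041 0300" or "<font> 0043".
Decomposition ParseDecompositionField(const std::string& field)
{
    Decomposition d;
    std::istringstream in(field);
    std::string token;
    while (in >> token) {
        if (token.size() >= 2 && token.front() == '<' && token.back() == '>') {
            d.compatibility = true;
            d.tag = token.substr(1, token.size() - 2);
        } else {
            d.codepoints.push_back(
                static_cast<std::uint32_t>(std::stoul(token, nullptr, 16)));
        }
    }
    return d;
}

// The caller picks the kind of decomposition. When a canonical decomposition
// is requested, compatibility mappings are ignored and the code point is
// returned unchanged.
std::vector<std::uint32_t> DecomposeOne(std::uint32_t cp,
                                        const std::string& field,
                                        bool want_compat)
{
    Decomposition d = ParseDecompositionField(field);
    if (d.codepoints.empty() || (d.compatibility && !want_compat))
        return {cp};
    return d.codepoints;
}
```

With this shape, DecomposeOne(0x2102, "<font> 0043", false) leaves U+2102 alone, while passing true yields U+0043, so the two kinds never get mixed in one result.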
Re: Choosing the best way to go full UNICODE [message #48305 is a reply to message #48303] Mon, 19 June 2017 10:40
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
cbpporter wrote on Mon, 19 June 2017 10:22

I really don't think that users expect a hollowed-out C to compare equal to a plain C.


I think this is where we differ.

I really believe that if I, as a user, have a long document and I am searching for "Come", I really want to find the one starting with the hollowed C too. Or, e.g., you want U+FB01 (the "fi" ligature) to match when searching for "fi".

(not that I am going to implement that anytime soon, but IMO this is the expected behaviour).
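
A rough sketch of what such a search could look like, assuming a tiny hand-written sample of compatibility mappings instead of the real Unicode data. kCompatSample, CompatFold and CompatContains are made-up names for illustration, not existing U++ API, and the folded strings are temporaries that are discarded after the search.

```cpp
#include <string>
#include <unordered_map>

// Tiny illustrative subset of compatibility mappings (not the full table).
static const std::unordered_map<char32_t, std::u32string> kCompatSample = {
    {U'\u2102', U"C"},   // DOUBLE-STRUCK CAPITAL C:         <font> 0043
    {U'\uFB01', U"fi"},  // LATIN SMALL LIGATURE FI:         <compat> 0066 0069
    {U'\uFF23', U"C"},   // FULLWIDTH LATIN CAPITAL LETTER C: <wide> 0043
};

// Replaces every code point found in the sample table by its mapping.
std::u32string CompatFold(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        auto it = kCompatSample.find(c);
        if (it != kCompatSample.end())
            out += it->second;
        else
            out += c;
    }
    return out;
}

// Searches on folded copies; the originals are left untouched.
bool CompatContains(const std::u32string& text, const std::u32string& needle)
{
    return CompatFold(text).find(CompatFold(needle)) != std::u32string::npos;
}
```

Here CompatContains(U"\u2102ome", U"Come") and CompatContains(U"de\uFB01ne", U"fi") both match. Note that U+2102 and U+FF23 fold to the same "C", which is why the folded text cannot be recomposed back into a single codepoint.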
Re: Choosing the best way to go full UNICODE [message #48306 is a reply to message #48305] Mon, 19 June 2017 10:51
cbpporter
Yeah, that's why those decompositions are special and should not be used in day to day use.

Canonical decomposition in Unicode is not something you convert to only when some task needs it. It can be used as the sole encoding form for all your strings, because it doesn't change the meaning of the string. Example: LoadFile("c:\\utf8.txt"). You could return the string as is, or you could return one of the two canonical Unicode forms (NFD or NFC), with the original string never stored.

On the other hand, <font>, <box> and other non-standard decompositions change the meaning of the text. They are computed when needed, and you can't keep your strings stored in that form 100% of the time.
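
To make the difference concrete, a small sketch assuming a two-entry sample of canonical mappings. kCanonicalSample and ToCanonicalDecomposed are hypothetical names, and canonical reordering of combining marks, which a real NFD needs, is left out.

```cpp
#include <string>
#include <unordered_map>

// Two sample canonical decompositions; real data has many more entries.
static const std::unordered_map<char32_t, std::u32string> kCanonicalSample = {
    {U'\u00C0', U"A\u0300"},  // LATIN CAPITAL LETTER A WITH GRAVE: 0041 0300
    {U'\u00E9', U"e\u0301"},  // LATIN SMALL LETTER E WITH ACUTE:   0065 0301
};

// Canonical-only decomposition over the sample table. Code points that have
// only a compatibility mapping, such as U+2102, pass through unchanged.
std::u32string ToCanonicalDecomposed(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        auto it = kCanonicalSample.find(c);
        if (it != kCanonicalSample.end())
            out += it->second;
        else
            out += c;
    }
    return out;
}
```

Both spellings of "café" (precomposed U+00E9, or "e" followed by U+0301) come out identical, so the result can be stored as the string itself, while U+2102 stays U+2102.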
Re: Choosing the best way to go full UNICODE [message #48307 is a reply to message #48306] Mon, 19 June 2017 10:58
mirek
cbpporter wrote on Mon, 19 June 2017 10:51
Yeah, that's why those decompositions are special and should not be used in day to day use.


Searching documents is day to day use. I think you got this wrong:

Quote:

Compatibility decomposition, like everything marked as compatibility in Unicode, covers things that would not otherwise be part of Unicode and have no reason to exist in a standalone standard; they had to be added to be compatible with another standard.

Canonical is the only one that is needed for comparing and searching.

Compatibility decomposition and non-compatibility decomposition are separate entities with separate names, and the compatibility one should not be used unless you are trying to be compatible with another standard.


It really is not about another standard (well, it can be, but not only). It is about equivalence when searching.
Re: Choosing the best way to go full UNICODE [message #48308 is a reply to message #48307] Mon, 19 June 2017 11:07
cbpporter
Day to day use: standard encoding scheme, i.e. guaranteed form to have a string in.

Reading from a non-normalized stream and getting NFD or NFC is not an error and can be standard behavior in a library.

Not day to day use: on demand encoding scheme.

Reading from a non-normalized stream and getting <font> substitutions is an error. Taking a string, converting it on the fly to a <font> substitution, doing a search to find that "C" like in the example, and then discarding the converted string is not an error.

There is no way I'm getting Unicode wrong.

The only misunderstandings are related to my use of English or to being too vague in my expression :)