Strings with national specific characters are wrongly sorted - Sort [message #57468] Wed, 25 August 2021 13:01
Klugier
Hello,

Today I found that Sort returns wrong results for strings containing national-specific characters:
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
	Vector<WString> vec = { "Zbig", "Ąć", "Ęc", "Ala", "Edward" };
	Sort(vec);
	
	for (const auto& s : vec)
	{
		Cout() << s << "\n";
	}
}


The results are:
Ala
Edward
Zbig
Ąć
Ęc


and should be:
Ala
Ąć
Edward
Ęc
Zbig


This is probably a corner case, because these words don't exist in Polish, but the error is there anyway. I believe it is more severe when these characters are in the middle of the string and we have a lot of such words.

Here is the article about the Polish alphabet and the order of its letters.

Klugier


Ultimate++ - one framework to rule them all.

[Updated on: Wed, 25 August 2021 17:49]


Re: Strings with national specific characters are wrongly sorted - Sort [message #57470 is a reply to message #57468] Thu, 26 August 2021 13:09
busiek
For the Polish alphabet I use a function like this to map a wchar into something comparable according to our needs:
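auto CharRepr = [](wchar c)
{
    wchar d = ToLower(c);
    // 'ł' is folded to plain 'l' by hand before the ToAscii call
    d = d == u'ł' ? 'l' : d;
    // Tuples compare lexicographically: letter flag first, then the ASCII
    // base letter, then the original character as a tie-breaker
    return MakeTuple(IsLetter(c), ToAscii(d), c);
};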

Then, to compare, I map each WString into the appropriate vector of tuples.

Cheers,
Kuba
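
The mapping described in this post can be sketched in standard C++ so that it stands alone without U++. This is only an illustration under assumptions, not the poster's actual code: BaseLetter, CharKey, SortKey and SortPolish are hypothetical names, BaseLetter is a hand-written stand-in for the ToLower/ToAscii pair that covers only Polish letters, and the letter test is a simplified substitute for IsLetter.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for ToLower + ToAscii: folds a character to its
// lowercase ASCII base letter. Only Polish letters are handled.
char32_t BaseLetter(char32_t c)
{
    switch (c) {
    case 0x0104: case 0x0105: return U'a'; // Ą ą
    case 0x0106: case 0x0107: return U'c'; // Ć ć
    case 0x0118: case 0x0119: return U'e'; // Ę ę
    case 0x0141: case 0x0142: return U'l'; // Ł ł
    case 0x0143: case 0x0144: return U'n'; // Ń ń
    case 0x00D3: case 0x00F3: return U'o'; // Ó ó
    case 0x015A: case 0x015B: return U's'; // Ś ś
    case 0x0179: case 0x017A: return U'z'; // Ź ź
    case 0x017B: case 0x017C: return U'z'; // Ż ż
    }
    return (c >= U'A' && c <= U'Z') ? c + (U'a' - U'A') : c;
}

// Per-character key: (is it a letter, base letter, original code point).
using Key = std::tuple<bool, char32_t, char32_t>;

Key CharKey(char32_t c)
{
    char32_t base = BaseLetter(c);
    bool letter = base >= U'a' && base <= U'z'; // simplified letter test
    return Key(letter, base, c);
}

// Maps a whole string to its vector of per-character keys.
std::vector<Key> SortKey(const std::u32string& s)
{
    std::vector<Key> key;
    key.reserve(s.size());
    for (char32_t c : s)
        key.push_back(CharKey(c));
    return key;
}

// Sorts by comparing the key vectors lexicographically.
void SortPolish(std::vector<std::u32string>& v)
{
    std::sort(v.begin(), v.end(),
              [](const std::u32string& a, const std::u32string& b) {
                  return SortKey(a) < SortKey(b);
              });
}
```

Calling SortPolish on the five words from the first post places each accented letter right after its base letter, because the original code point is only used to break ties. The keys are rebuilt on every comparison here for brevity; for large inputs it would likely be cheaper to compute them once per string.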
Re: Strings with national specific characters are wrongly sorted - Sort [message #57473 is a reply to message #57468] Fri, 27 August 2021 09:51
mirek
This is not an error; the base [W]String comparison simply compares character values.
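
To illustrate why comparing character values yields the reported order, here is a small sketch in standard C++ (not U++ code; SortByCodePoint is a hypothetical name). It relies only on the fact that 'Z' is U+005A while 'Ą' is U+0104 and 'Ę' is U+0118, so any accented Polish capital compares greater than every ASCII letter.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorts with the default operator<, which compares strings element by
// element using the numeric value of each code point.
std::vector<std::u32string> SortByCodePoint(std::vector<std::u32string> v)
{
    std::sort(v.begin(), v.end());
    return v;
}
```

Applied to the five words from the first post, this puts "Zbig" ahead of both accented words, which matches the results reported there.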

You need to use NLS-specific sorting in this situation: LanguageInfo::Compare. That said, it is really defined specifically only for CZ, and even there it would need improvement. OTOH, the generic routine should at least work better than the result you get.

BTW, language-specific sorting is an extremely difficult topic if it is to be done right for many languages...

Mirek

[Updated on: Fri, 27 August 2021 10:18]


Re: Strings with national specific characters are wrongly sorted - Sort [message #57474 is a reply to message #57473] Fri, 27 August 2021 09:57
mirek
BTW, as I am revisiting this issue after a long long time, it looks like we should connect here to host platform APIs in the future.
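
One way to lean on the host platform from plain C++ is the standard library's locale collation, sketched below. This is not how U++ does it and not a proposal for its API: LocaleOrClassic and SortWithLocale are hypothetical names, and a locale name such as "pl_PL.UTF-8" is an assumption that holds on typical glibc systems only when that locale is installed; other platforms use different names.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the named locale, or the classic "C" locale when the host does
// not provide it (std::locale's constructor throws std::runtime_error).
std::locale LocaleOrClassic(const char *name)
{
    try {
        return std::locale(name);
    }
    catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Sorts strings with the collation rules of the given locale.
// std::locale::operator() compares two strings through the locale's
// std::collate facet, so the locale itself can serve as the comparator.
void SortWithLocale(std::vector<std::string>& v, const std::locale& loc)
{
    std::sort(v.begin(), v.end(), loc);
}
```

A call such as SortWithLocale(words, LocaleOrClassic("pl_PL.UTF-8")) with UTF-8 encoded words should follow the host's Polish collation where that locale exists, and falls back to plain byte order otherwise. The quality of the result then depends entirely on the platform's locale data.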