LoadFile problem with accented chars [message #19983] Sat, 07 February 2009 22:27
koldo
Messages: 3358
Registered: August 2008
Senior Veteran
Hello all

In a program I am reading text files written with Notepad.

To do that I simply use String LoadFile(String fileName) and parse the resulting String.

When the file contains only plain ASCII characters everything works, but when I type accented characters like á, é, ... in Notepad, the file looks corrupted. For example, in that case a simple \r char ends up being read as the int -17.

I have seen that TheIde handles this correctly. My guess is that TheIde detects the file's charset and converts the content to a UTF-8 String.

I have tried to pull parts of TheIde's code into my program, but without success. Do you know what to do?

Best regards
Koldo



Best regards
Iñaki
Re: LoadFile problem with accented chars [message #19987 is a reply to message #19983] Sun, 08 February 2009 08:06
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
U++ never does any encoding conversion on Stream *content* (it may do some conversion of the *file name*, e.g. from UTF-8 to Unicode).

I suspect that the error is in processing the file. Hard to say what the problem is without knowing more.

Mirek
Re: LoadFile problem with accented chars [message #20000 is a reply to message #19983] Sun, 08 February 2009 22:11
koldo
Hello luzr

It seems to be a matter of Notepad itself. If the file contains only 7-bit characters there is no problem, but after adding characters like á, Notepad itself seems to change the file's charset.

Using this test program:
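CONSOLE_APP_MAIN
{
	String data = LoadFile("C:\\test.txt");
	// Print every byte of the file; char is signed here, so bytes >= 0x80 show up as negative values
	for (int i = 0; i < data.GetCount(); ++i) 
		puts(Format("%d: %d", i, data[i]));	
	getchar();	// Keep the console window open
}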

With test.txt containing just "a-á", I initially get this output:

0: 97
1: 45
2: -31

but after saving and reopening the file a few times, I get this:

0: -1
1: -2
2: 97
3: 0
4: 45
5: 0
6: -31
7: 0

and yesterday I got yet another output... The answer is that Notepad adds a "BOM" to the file when it decides the content needs a wider encoding.

A BOM (Byte Order Mark, http://unicode.org/faq/utf_bom.html#BOM) is a byte signature at the beginning of a file that indicates its encoding. For example:

- EF BB BF means UTF-8
- FF FE means UTF-16, little-endian

So yesterday Notepad saved the file as UTF-8 (beginning with -17 == 0xEF read as a signed char) and today it saved it as UTF-16 little-endian (beginning with -1 == 0xFF).
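
To illustrate the signatures listed above, here is a minimal sketch of a BOM check in standard C++ (deliberately not U++ code). The names `Bom` and `DetectBom` are made up for this example and are not part of any library. It only knows the three signatures mentioned in this thread, so a UTF-32 little-endian file (FF FE 00 00) would be reported as UTF-16 little-endian.

```cpp
#include <cstddef>
#include <string>

// Encodings this sketch can recognise from a leading byte order mark.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of `data` and report which BOM, if any, is present.
// Bytes are compared as unsigned char, so 0xEF is seen as 239 rather than -17.
Bom DetectBom(const std::string& data)
{
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(data[i]); };

    if (data.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (data.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (data.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;   // No signature: the encoding has to be guessed or assumed
}
```

Checking the size before indexing means that empty or one-byte files are simply reported as having no BOM.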

Sorry, perhaps it is not easy, but do you know how to load a text file and convert it to UTF-8 so that it displays properly in U++ programs? When I put these characters into U++ controls I get strange symbols and errors. It would also be great for parsing them.

Best regards
Koldo



Re: LoadFile problem with accented chars [message #20002 is a reply to message #20000] Mon, 09 February 2009 08:12
mirek
Why not interpret it yourself?

I suggest implementing these:

WString LoadBOMW(Stream& s);
WString LoadFileBOMW(const char *path);
void    SaveBOMUtf8(Stream& s, const WString& data);
bool    SaveFileBOMUtf8(const char *path, const WString& data);

String  LoadBOM(Stream& s); // Default encoding, usually utf-8
String  LoadFileBOM(const char *path);
void    SaveBOMUtf8(Stream& s, const String& data);
bool    SaveFileBOMUtf8(const char *path, const String& data);


I would be glad to add them to Core.

Mirek
Re: LoadFile problem with accented chars [message #20004 is a reply to message #20002] Mon, 09 February 2009 08:47
koldo
Hello luzr

No problem. One question: is there already code inside U++ to convert a string from/to a given charset (so as not to reinvent the wheel)? I see functions with promising names such as:

ToUnicode
FromUnicode
ConvertCharset
CheckUtf8
FromUtf8
ToUtf8

Best regards
Koldo


Re: LoadFile problem with accented chars [message #20011 is a reply to message #20004] Mon, 09 February 2009 17:28
mirek
Sure!

Core/Charset.h

They work :)

Mirek
Re: LoadFile problem with accented chars [message #20016 is a reply to message #19983] Tue, 10 February 2009 09:23
koldo
Hello luzr

I have had problems. When reading a UTF-16 little-endian file, ToUtf8 does not seem to convert it correctly.

I have used this function:
String  LoadFileBOM(const char *path)
{
	String s = LoadFile(path);
	if (((s[0]&0xFF) == 0xFF) && ((s[1]&0xFF) == 0xFE))					// UTF16 Little Endian
		s = ToUtf8(s.Mid(2).ToWString());
	else if (((s[0]&0xFF) == 0xEF) && ((s[1]&0xFF) == 0xBB) && ((s[2]&0xFF) == 0xBF))	// UTF8
		s = s.Mid(3);
	return s;
}


that is called from here:

String s = LoadFileBOM("demo_u_16le.txt");
String ss;
	
for (int i = 0; i < s.GetCount(); ++i)
	ss.Cat(Format("%d: %0x;\n", i, s[i]&0xFF));
	
ss.Cat(s);
TestLineEdit.SetData(ss);
TestEditString.SetData(ss);
TestDocEdit.SetData(ss);


As the file only contains "Aupá", the output should be:

Quote:

0: 41;
1: 75;
2: 70;
3: c3;
4: a1;
Aupá


but it is:

Quote:

0: 41;
1: 0;
2: 75;
3: 0;
4: 70;
5: 0;
6: e1;
7: 0;
Aup


Another question. When loading a UTF-8 file with the code above, LineEdit and DocEdit display it correctly but EditString shows a strange char. Does EditString not handle UTF-8, or am I doing something wrong?

Best regards
Koldo


Re: LoadFile problem with accented chars [message #20032 is a reply to message #20016] Wed, 11 February 2009 15:05
koldo
Hello luzr

Everything solved. I enclose some of the proposed functions:

String  LoadFileBOM(const char *path)
{
	String s = LoadFile(path);
	if (((s[0]&0xFF) == 0xFF) && ((s[1]&0xFF) == 0xFE))	{			// UTF16 Little Endian
		StringBuffer ws = s.Mid(2);
		s = ToUtf8((wchar *)ws.Begin(), ws.GetCount()*sizeof(char)/sizeof(wchar));
	} else if (((s[0]&0xFF) == 0xEF) && ((s[1]&0xFF) == 0xBB) && ((s[2]&0xFF) == 0xBF))	// UTF8
		s = s.Mid(3);
	else 										// May be ISO8859-1
		s = ToUtf8(ToUnicode(s, CHARSET_ISO8859_1));
	return s;
}
bool SaveBOMUtf8(Stream& out, const String& data) {
	if(!out.IsOpen() || out.IsError()) 
		return false;
	unsigned char bom[] = {0xEF, 0xBB, 0xBF};
	out.Put(bom, 3);
	out.Put((const char *)data, data.GetLength());
	out.Close();
	return out.IsOK();
}
bool SaveFileBOMUtf8(const char *path, const String& data)
{
	FileOut out(path);
	return SaveBOMUtf8(out, data);
}


When loading, it checks whether the BOM is UTF-16 little-endian or UTF-8. If there is no BOM, the file is assumed to be ISO8859-1. It always returns a UTF-8 String.
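
The ISO8859-1 fallback is the simplest of the three cases, because every ISO8859-1 byte value is also its Unicode code point. A minimal sketch of that conversion in standard C++ follows; `Latin1ToUtf8` is a hypothetical name used only here, and it is not how ToUnicode/ToUtf8 are implemented in U++.

```cpp
#include <string>

// Convert ISO 8859-1 bytes to UTF-8.
// Bytes below 0x80 are copied unchanged; bytes 0x80..0xFF become two-byte sequences.
std::string Latin1ToUtf8(const std::string& in)
{
    std::string out;
    out.reserve(in.size() * 2);                             // Worst case: every byte doubles
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);   // Avoid negative values
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));      // Lead byte: 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F));    // Continuation byte: 10xxxxxx
        }
    }
    return out;
}
```

For the "á" in the test file (0xE1) this yields the bytes C3 A1, which matches the expected output shown earlier in the thread.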

When saving, it always writes UTF-8.
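
For readers without U++ at hand, the saving side can be sketched with standard C++ streams. This is only an illustration under assumed names (`WriteUtf8WithBom` and `Utf8WithBom` are invented for this example); unlike the function above, it leaves closing the stream to the caller.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Write the three-byte UTF-8 signature followed by the (already UTF-8) text.
// Returns false if the stream reports an error.
bool WriteUtf8WithBom(std::ostream& out, const std::string& utf8)
{
    static const char bom[] = {'\xEF', '\xBB', '\xBF'};
    out.write(bom, sizeof bom);
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return out.good();
}

// Convenience wrapper that returns the bytes that would end up in the file.
std::string Utf8WithBom(const std::string& utf8)
{
    std::ostringstream out;
    WriteUtf8WithBom(out, utf8);
    return out.str();
}
```

Writing to a real file would use a `std::ofstream` opened with `std::ios::binary`, so that no newline translation touches the bytes.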

If they look right I will write the rest of the functions.

There was no problem with EditString. The error was mine: it handles UTF-8 but not ISO8859-1 chars.

Best regards
Koldo




[Updated on: Wed, 11 February 2009 15:07]


Re: LoadFile problem with accented chars [message #20033 is a reply to message #20032] Wed, 11 February 2009 15:25
cbpporter
Messages: 1401
Registered: September 2007
Ultimate Contributor
s = ToUtf8(s.Mid(2).ToWString());

Well, looking at this code from the old version, I believe you should indeed be getting the results you were getting (the wrong ones).

When calling ToWString, the input is assumed to be UTF-8, but you have UTF-16 data stuffed inside a String (UTF-8). That explains the extra zeros.

On the other hand, the new version may work, but it is a little heavy on conversions and allocations, so I wouldn't use it on large files. Especially for the UTF-8 BOM, I believe the solution would be a LoadFile that detects the BOM and allocates and fills a buffer/String directly, without the BOM. I'll have to check out LoadFile before I can elaborate on this.
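
To make the point about the extra zeros concrete: UTF-16 data has to be read two bytes at a time, and each 16-bit unit (or surrogate pair) is then re-encoded. The following is a minimal sketch in standard C++, not the U++ implementation; `AppendUtf8` and `Utf16LEToUtf8` are names invented for this example. It expects the bytes after the BOM, ignores a trailing odd byte, and replaces unpaired surrogates with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as UTF-8.
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode UTF-16 little-endian bytes (BOM already removed) into UTF-8.
std::string Utf16LEToUtf8(const std::string& bytes)
{
    const std::size_t units = bytes.size() / 2;   // A trailing odd byte is ignored
    auto unit = [&](std::size_t i) -> std::uint32_t {
        std::uint32_t lo = static_cast<unsigned char>(bytes[2 * i]);
        std::uint32_t hi = static_cast<unsigned char>(bytes[2 * i + 1]);
        return lo | (hi << 8);                    // Low byte comes first in little-endian
    };

    std::string out;
    for (std::size_t i = 0; i < units; ++i) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units) {
            std::uint32_t low = unit(i + 1);
            if (low >= 0xDC00 && low <= 0xDFFF) { // Valid surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;                          // Unpaired surrogate: replacement character
        AppendUtf8(out, cp);
    }
    return out;
}
```

Fed the eight bytes 41 00 75 00 70 00 E1 00 from the "Aupá" test file, this produces 41 75 70 C3 A1, the output koldo was expecting.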
Re: LoadFile problem with accented chars [message #20035 is a reply to message #20033] Wed, 11 February 2009 19:26
koldo
Sorry cbpporter

In a few hours I will post a more optimized proposal.

Best regards
Koldo


Re: LoadFile problem with accented chars [message #20036 is a reply to message #20035] Thu, 12 February 2009 01:13
koldo
Hello luzr and all

Here I enclose the "String" version of the functions.

LoadStreamBOM now handles UTF-16 LE & BE, UTF-8 and ISO8859_1 text files. It is more optimized but also more complex than the first version.

Best regards
Koldo


String LoadStreamBOM(Stream& in) 
{
	if(in.IsOpen()) {
		in.ClearError();
		int size = (int)in.GetLeft();
		if((dword)size != 0xffffffff) {
			unsigned char header[3];								// Get 3 bytes header
			if (!in.GetAll(&header, 3))
				return String::GetVoid();
			if ((header[0] == 0xFF) && (header[1] == 0xFE)) {		// Check header
				StringBuffer s(size-2);								// UTF16 Little Endian		
				s[0] = header[2];									// This char is not header
				if (!in.GetAll(s.Begin()+1, size-3))
					return String::GetVoid();						// Conversion
				return ToUtf8((wchar *)s.Begin(), (size-2)*sizeof(char)/sizeof(wchar));
			} else if ((header[0] == 0xFE) && (header[1] == 0xFF)) {		
				StringBuffer s(size-2);								// UTF16 Big Endian		
				s[0] = header[2];									// This char is not header
				if (!in.GetAll(s.Begin()+1, size-3))
					return String::GetVoid();
				for (int i = 0; i < size-2; i += 2) {	// Change from big to little endian
					unsigned char aux = s[i];			// by changing byte order
					s[i] = s[i+1];
					s[i+1] = aux;
				}													// Conversion
				return ToUtf8((wchar *)s.Begin(), (size-2)*sizeof(char)/sizeof(wchar));
			} else if ((header[0] == 0xEF) && (header[1] == 0xBB) && (header[2] == 0xBF))
				return in.Get(size-3);								// UTF8. No conversion required
			else {																
				StringBuffer s(size);								// Maybe ISO8859-1
				s[0] = header[0];									// Three chars are not header
				s[1] = header[1];									// so inserted into the StringBuffer
				s[2] = header[2];
				if (!in.GetAll(s.Begin()+3, size-3))
					return String::GetVoid();
				return ToUtf8(ToUnicode(s.Begin(), size, CHARSET_ISO8859_1));	// Conversion
			}
		}
	}
	return String::GetVoid();
}
String LoadFileBOM(const char *filename) 
{
	FileIn in(filename);
	return LoadStreamBOM(in);
}
bool SaveBOMUtf8(Stream& out, const String& data) {
	if(!out.IsOpen() || out.IsError()) 
		return false;
	unsigned char bom[] = {0xEF, 0xBB, 0xBF};
	out.Put(bom, 3);
	out.Put((const char *)data, data.GetLength());
	out.Close();
	return out.IsOK();
}
bool SaveFileBOMUtf8(const char *path, const String& data)
{
	FileOut out(path);
	return SaveBOMUtf8(out, data);
}
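
The big-endian branch above works by swapping each pair of bytes so that the little-endian conversion can be reused. As a stand-alone illustration in standard C++ (the name `SwappedUtf16ByteOrder` is invented for this sketch), the same idea looks like this; the loop condition `i + 1 < size` is a defensive choice of this sketch so that a file with an odd number of bytes never touches a byte past the end.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Return a copy of `bytes` with every pair of bytes swapped,
// turning UTF-16 big-endian data into little-endian (and vice versa).
// A trailing odd byte, if any, is left where it is.
std::string SwappedUtf16ByteOrder(std::string bytes)
{
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        std::swap(bytes[i], bytes[i + 1]);
    return bytes;
}
```

Applying the function twice gives back the original bytes, which makes it easy to check.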


Re: LoadFile problem with accented chars [message #20040 is a reply to message #20036] Thu, 12 February 2009 18:02
mirek
I am afraid this is not really consistent with the rest of U++ handling of charsets.

The problem is that you always convert to UTF-8. I think we should rather convert to the active default encoding (which usually IS UTF-8, but this is how things work and in fact, some of my applications depend on it).

I also think that we should have the 'W' variant (returning WString) first, then the 'String' variant with conversion - that will cost nothing....

Mirek
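
The layering suggested here (decode to a wide string first, then narrow to whatever the default encoding is) can be sketched in standard C++ roughly as follows. This is only an illustration of the idea, not the U++ design: `Charset`, `PutUtf8` and `FromWide` are hypothetical names, only two target charsets are modelled, and characters that do not fit ISO8859-1 are replaced with '?'.

```cpp
#include <string>

// Target encodings known to this sketch (a stand-in for a "default charset" setting).
enum class Charset { Utf8, Latin1 };

// Append one code point to `out` as UTF-8.
void PutUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Narrow a decoded (wide) string to the requested charset.
// The wide loader does the BOM work once; the narrow variant is just this step on top.
std::string FromWide(const std::u32string& wide, Charset target)
{
    std::string out;
    for (char32_t cp : wide) {
        if (target == Charset::Utf8)
            PutUtf8(out, cp);
        else if (cp <= 0xFF)
            out += static_cast<char>(static_cast<unsigned char>(cp));
        else
            out += '?';   // Not representable in ISO8859-1
    }
    return out;
}
```

With this split, the same decoded text gives "a" followed by C3 A1 when the target is UTF-8 and "a" followed by E1 when it is ISO8859-1, so applications that rely on a non-UTF-8 default keep working.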
Re: LoadFile problem with accented chars [message #20042 is a reply to message #20040] Fri, 13 February 2009 09:50
koldo
Hello luzr

No problem.

I have done it, and I have prepared a simple demo that saves text files in different encodings and loads them into EditString, LineEdit and DocEdit controls.

It is tested on XP (MinGW and MSC). This afternoon I will test it on GNU/Linux and post it.

Best regards
Koldo

(I do not know Czech so I hope this text is not inappropriate)

  • Attachment: Screen.JPG (Size: 22.87KB)


Re: LoadFile problem with accented chars [message #20045 is a reply to message #20042] Fri, 13 February 2009 11:05
mirek
Well, before reading your post, I might have duplicated some of your recent efforts - based on your code, I have added BOM support to Core (Core/Bom.cpp).

I hope a couple of additional problems are solved, e.g. the system charset is used if no BOM header is detected, and String and WString share a single function body while avoiding unnecessary conversion (utf8->wstring->utf8).

Mirek
Re: LoadFile problem with accented chars [message #20051 is a reply to message #20045] Fri, 13 February 2009 19:09
koldo
Hello luzr

After testing it on Linux, here I enclose the code with the demo.

Thinking about the function names, perhaps they are not the best, as not many people know that many apparently plain text files have this BOM.

Perhaps handling the BOM should be the default behaviour of the LoadFile and SaveFile functions, and bypassing the BOM should be only an option.

Best regards
Koldo
  • Attachment: DemoBOM.7z (Size: 2.64KB)


Re: LoadFile problem with accented chars [message #20057 is a reply to message #20051] Sun, 15 February 2009 00:05
mirek
koldo wrote on Fri, 13 February 2009 13:09


Perhaps handling the BOM should be the default behaviour of the LoadFile and SaveFile functions, and bypassing the BOM should be only an option.



LoadFile / SaveFile must be able to load normal binary files. BOM handling is relatively high-level for them.

Mirek