LoadFile problem with accented chars [message #19983]
Sat, 07 February 2009 22:27
koldo (Senior Veteran), Messages: 3361, Registered: August 2008
Hello all
In a program I am reading text files written with Notepad. To do that I simply use String LoadFile(String fileName) and parse through the String that was read.
When the file contains only simple characters everything works, but when I enter accented characters like á, é, ... in Notepad, the file looks as if it were corrupted. For example, in that case a simple char \r is read as the int -17.
I have seen that TheIDE handles this correctly. It seems (this is just a guess) that TheIDE detects the file charset and converts it properly to a UTF-8 String.
I have tried to bring parts of the TheIDE code into my program, but without success. Do you know what to do?
Best regards
Koldo
Re: LoadFile problem with accented chars [message #20000 is a reply to message #19983]
Sun, 08 February 2009 22:11
koldo (Senior Veteran), Messages: 3361, Registered: August 2008
Hello luzr
It seems it is a matter of Notepad itself. If the file contains only 7-bit characters there is no problem, but after adding characters like á, Notepad itself changes the file's charset.
Using this test program:

CONSOLE_APP_MAIN
{
	String data = LoadFile("C:\\test.txt");
	for (int i = 0; i < data.GetCount(); ++i)
		puts(Format("%d: %d", i, data[i]));
	getchar();
}
with test.txt containing a simple "a-á", I initially get this output:

0: 97
1: 45
2: -31

but after saving and opening the file a few times, I get this:

0: -1
1: -2
2: 97
3: 0
4: 45
5: 0
6: -31
7: 0

and yesterday I got yet another output... The answer is that Notepad adds a "BOM" to the file when it thinks the text requires a wider encoding.
The BOM (Byte Order Mark, http://unicode.org/faq/utf_bom.html#BOM) is a byte signature at the beginning of a file that indicates its encoding. For example:
- EF BB BF means UTF-8
- FF FE means UTF-16, little-endian
So yesterday Notepad saved the file as UTF-8 (it begins with -17 == 0xEF) and today it saved it as UTF-16, little-endian (it begins with -1 == 0xFF).
Sorry, perhaps it is not easy, but do you know how to load a text file and convert it into UTF-8 so that it is displayed properly in U++ programs? When I enter these characters into U++ controls I get strange symbols and errors. It would also be great for parsing them.
Best regards
Koldo
Re: LoadFile problem with accented chars [message #20002 is a reply to message #20000]
Mon, 09 February 2009 08:12
mirek (Ultimate Member), Messages: 13980, Registered: November 2005
koldo wrote on Sun, 08 February 2009 16:11 | It seems it is a matter of Notepad itself. [...] The answer is that Notepad adds a "BOM" to the file when it thinks the text requires a wider encoding. [...] EF BB BF means UTF-8; FF FE means UTF-16, little-endian. |
Why not interpret it yourself?
I suggest implementing these:

WString LoadBOMW(Stream& s);
WString LoadFileBOMW(const char *path);
void SaveBOMUtf8(Stream& s, const WString& data);
bool SaveFileBOMUtf8(const char *path, const WString& data);
String LoadBOM(Stream& s); // default encoding, usually UTF-8
String LoadFileBOM(const char *path);
void SaveBOMUtf8(Stream& s, const String& data);
bool SaveFileBOMUtf8(const char *path, const String& data);

I would be glad to add them to Core.
Mirek
Re: LoadFile problem with accented chars [message #20032 is a reply to message #20016]
Wed, 11 February 2009 15:05
koldo (Senior Veteran), Messages: 3361, Registered: August 2008
Hello luzr
Everything is solved. I enclose some of the proposed functions:
String LoadFileBOM(const char *path)
{
	String s = LoadFile(path);
	if (((s[0]&0xFF) == 0xFF) && ((s[1]&0xFF) == 0xFE)) {	// UTF-16 little-endian
		StringBuffer ws = s.Mid(2);
		s = ToUtf8((wchar *)ws.Begin(), ws.GetCount()*sizeof(char)/sizeof(wchar));
	} else if (((s[0]&0xFF) == 0xEF) && ((s[1]&0xFF) == 0xBB) && ((s[2]&0xFF) == 0xBF))	// UTF-8
		s = s.Mid(3);
	else							// maybe ISO8859-1
		s = ToUtf8(ToUnicode(s, CHARSET_ISO8859_1));
	return s;
}

bool SaveBOMUtf8(Stream& out, const String& data)
{
	if(!out.IsOpen() || out.IsError())
		return false;
	unsigned char bom[] = {0xEF, 0xBB, 0xBF};
	out.Put(bom, 3);
	out.Put((const char *)data, data.GetLength());
	out.Close();
	return out.IsOK();
}

bool SaveFileBOMUtf8(const char *path, const String& data)
{
	FileOut out(path);
	return SaveBOMUtf8(out, data);
}
When loading, it checks the BOM to see whether the file is UTF-16 little-endian or UTF-8. If there is no BOM, the file is assumed to be ISO8859-1. It always returns a UTF-8 String.
When saving, it always saves as UTF-8.
If these look right I will do the rest of the functions.
There was no problem with EditString after all. My error came from the fact that it handles UTF-8 but not ISO8859-1 characters.
Best regards
Koldo
[Updated on: Wed, 11 February 2009 15:07]
Re: LoadFile problem with accented chars [message #20036 is a reply to message #20035]
Thu, 12 February 2009 01:13
koldo (Senior Veteran), Messages: 3361, Registered: August 2008
Hello luzr and all
Here I enclose the "String" version of the functions.
LoadStreamBOM now handles UTF-16 LE & BE, UTF-8 and ISO8859-1 text files; it is more optimized, but also more complex, than the first version.
String LoadStreamBOM(Stream& in)
{
	if(in.IsOpen()) {
		in.ClearError();
		int size = (int)in.GetLeft();
		if((dword)size != 0xffffffff) {
			unsigned char header[3];		// get the 3-byte header
			if (!in.GetAll(&header, 3))
				return String::GetVoid();
			if ((header[0] == 0xFF) && (header[1] == 0xFE)) {	// UTF-16 little-endian
				StringBuffer s(size-2);
				s[0] = header[2];		// this byte is not part of the header
				if (!in.GetAll(s.Begin()+1, size-3))
					return String::GetVoid();
				return ToUtf8((wchar *)s.Begin(), (size-2)*sizeof(char)/sizeof(wchar));
			} else if ((header[0] == 0xFE) && (header[1] == 0xFF)) {	// UTF-16 big-endian
				StringBuffer s(size-2);
				s[0] = header[2];		// this byte is not part of the header
				if (!in.GetAll(s.Begin()+1, size-3))
					return String::GetVoid();
				for (int i = 0; i < size-2; i += 2) {	// convert big- to little-endian
					unsigned char aux = s[i];	// by swapping the byte order
					s[i] = s[i+1];
					s[i+1] = aux;
				}
				return ToUtf8((wchar *)s.Begin(), (size-2)*sizeof(char)/sizeof(wchar));
			} else if ((header[0] == 0xEF) && (header[1] == 0xBB) && (header[2] == 0xBF))
				return in.Get(size-3);		// UTF-8: no conversion required
			else {
				StringBuffer s(size);		// maybe ISO8859-1
				s[0] = header[0];		// the three bytes are not a header,
				s[1] = header[1];		// so put them back into the StringBuffer
				s[2] = header[2];
				if (!in.GetAll(s.Begin()+3, size-3))
					return String::GetVoid();
				return ToUtf8(ToUnicode(s.Begin(), size, CHARSET_ISO8859_1));	// conversion
			}
		}
	}
	return String::GetVoid();
}

String LoadFileBOM(const char *filename)
{
	FileIn in(filename);
	return LoadStreamBOM(in);
}

bool SaveBOMUtf8(Stream& out, const String& data)
{
	if(!out.IsOpen() || out.IsError())
		return false;
	unsigned char bom[] = {0xEF, 0xBB, 0xBF};
	out.Put(bom, 3);
	out.Put((const char *)data, data.GetLength());
	out.Close();
	return out.IsOK();
}

bool SaveFileBOMUtf8(const char *path, const String& data)
{
	FileOut out(path);
	return SaveBOMUtf8(out, data);
}
Best regards
Iñaki