Character set support

While in an ideal world every text resource would be encoded in Unicode or UTF-8, in practice we have to deal with many 8-bit encodings. U++ has extensible support for various encodings (charsets). It directly defines the following constants to express a charset (names are self-explanatory):

CHARSET_ISO8859_1

CHARSET_ISO8859_2

CHARSET_ISO8859_3

CHARSET_ISO8859_4

CHARSET_ISO8859_5

CHARSET_ISO8859_6

CHARSET_ISO8859_7

CHARSET_ISO8859_8

CHARSET_ISO8859_9

CHARSET_ISO8859_10

CHARSET_ISO8859_13

CHARSET_ISO8859_14

CHARSET_ISO8859_15

CHARSET_ISO8859_16

CHARSET_WIN1250

CHARSET_WIN1251

CHARSET_WIN1252

CHARSET_WIN1253

CHARSET_WIN1254

CHARSET_WIN1255

CHARSET_WIN1256

CHARSET_WIN1257

CHARSET_WIN1258

CHARSET_KOI8_R

CHARSET_CP852

CHARSET_MJK

CHARSET_CP850

There are also some special charset values:

CHARSET_DEFAULT

Represents the "default" charset. The default charset can be set using SetDefaultCharset and can be used instead of a "real" charset in most charset-related operations (it is usually the default value of the charset parameter). Guaranteed to be equal to 0.

CHARSET_TOASCII

This charset is used in charset conversion; as its name suggests, it appears to select conversion to basic ASCII (see ToAscii).

CHARSET_UTF8

UTF-8 encoding.

enum DEFAULTCHAR = 0x1f

This special value is used as result of conversion in place of characters that do not exist in target charset.

Function List

byte GetDefaultCharset()

Returns the current default charset.

void SetDefaultCharset(byte charset)

Sets the default charset. This exists to support legacy applications; new applications should always use UTF-8.

byte ResolveCharset(byte charset)

If charset is CHARSET_DEFAULT, returns GetDefaultCharset(), otherwise returns charset.
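
The relationship between CHARSET_DEFAULT, SetDefaultCharset and ResolveCharset can be pictured with a small standalone sketch. This is plain C++, not U++ code: the function names, the charset codes other than 0 for the default, and the initial default are all assumptions made for illustration.

```cpp
#include <cstdint>

using CharsetCode = std::uint8_t;

// Hypothetical codes; only "default == 0" is taken from the text above.
constexpr CharsetCode kCharsetDefault = 0;
constexpr CharsetCode kCharsetLatin1  = 1;
constexpr CharsetCode kCharsetLatin2  = 2;

// Process-wide default; the initial value is an arbitrary choice of this sketch.
inline CharsetCode& DefaultSlot() {
    static CharsetCode current = kCharsetLatin1;
    return current;
}

inline CharsetCode GetDefault() { return DefaultSlot(); }
inline void SetDefault(CharsetCode charset) { DefaultSlot() = charset; }

// Replaces the "default" placeholder with the charset it currently stands for.
inline CharsetCode Resolve(CharsetCode charset) {
    return charset == kCharsetDefault ? GetDefault() : charset;
}
```

In this sketch, functions that take a charset parameter would call Resolve first, so that passing 0 always means "whatever the default currently is".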

byte AddCharSet(const char *name, const word *table, byte systemcharset = CHARSET_DEFAULT)

Adds a new charset named name. table must point to 128 elements containing UNICODE code-points for character values 128-255. Character codes that are not defined in UNICODE or in the new charset should be set to CUNDEF. systemcharset can contain the equivalent "typical" charset of the host platform as optional auxiliary information. Returns a code for the new charset. table must exist until the end of the program (only a pointer to it is stored).
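
To make the table layout concrete, here is a standalone sketch of how a 128-entry table for character values 128-255 can be consulted. It does not call U++; kUndefined is a stand-in for CUNDEF with an assumed value, the toy table is invented, and the sketch assumes that values 0-127 map to themselves.

```cpp
#include <array>
#include <cstdint>

// Stand-in for CUNDEF; the real value is not given in this text.
constexpr std::uint16_t kUndefined = 0xFFFF;

using HighTable = std::array<std::uint16_t, 128>;

// Builds a toy table: every slot undefined except
// 0x80 -> U+20AC (euro sign) and 0xE9 -> U+00E9 (e with acute).
inline HighTable MakeToyTable() {
    HighTable t;
    t.fill(kUndefined);
    t[0x80 - 128] = 0x20AC;
    t[0xE9 - 128] = 0x00E9;
    return t;
}

// Values 0..127 are assumed to map to themselves;
// values 128..255 are looked up at index (value - 128).
inline std::uint16_t DecodeByte(const HighTable& table, unsigned char ch) {
    return ch < 128 ? ch : table[ch - 128];
}
```

Element 0 of the table therefore describes character value 128, and element 127 describes character value 255.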

byte AddCharSetE(const char *name, word *table, byte systemcharset = CHARSET_DEFAULT)

This is similar to AddCharSet, but any CUNDEF values in table are replaced with characters in the special private range 0xee00-0xeeff. U++ uses this area as a "unicode error escape"; mapping there makes it possible to convert to unicode and back losslessly even if some characters do not have assigned code-points. table must exist until the end of the program (only a pointer to it is stored).
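
The lossless round trip can be illustrated with a standalone sketch. It is not the U++ implementation: kUndefined stands in for CUNDEF with an assumed value, and the mapping "0xEE00 + original character value" is inferred from the 0xEExx description used elsewhere on this page.

```cpp
#include <array>
#include <cstdint>

constexpr std::uint16_t kUndefined  = 0xFFFF; // stand-in for CUNDEF (assumed value)
constexpr std::uint16_t kEscapeBase = 0xEE00; // start of the error-escape range

using HighTable = std::array<std::uint16_t, 128>;

// Gives every undefined slot the escape code 0xEE00 + character value.
inline void FillEscapes(HighTable& table) {
    for (int i = 0; i < 128; ++i)
        if (table[i] == kUndefined)
            table[i] = static_cast<std::uint16_t>(kEscapeBase + 128 + i);
}

inline std::uint16_t Decode(const HighTable& table, unsigned char ch) {
    return ch < 128 ? ch : table[ch - 128];
}

// Reverse lookup; returns -1 when the code point is not in the charset.
inline int Encode(const HighTable& table, std::uint16_t cp) {
    if (cp < 128)
        return cp;
    for (int i = 0; i < 128; ++i)
        if (table[i] == cp)
            return 128 + i;
    return -1;
}
```

Once the escapes are filled in, every one of the 256 character values decodes to a distinct code, so encoding the result gives back the original value.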

const char *CharsetName(byte charset)

Returns the name of charset code.

int CharsetCount()

Returns the total count of charsets (UTF-8 excluded). It is guaranteed that the numeric value of the charset code for a "real" charset is in the range 1...CharsetCount().

int CharsetByName(const char *name)

Tries to find a charset code by name. Comparison is case-insensitive and ignores any characters other than digits and letters. If the charset is not identified, returns 0 (which is the same as CHARSET_DEFAULT).
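
The name-matching rule described above can be sketched as follows. This is a standalone illustration, not U++ code; NormalizeCharsetName and SameCharsetName are hypothetical helper names, and the sketch only considers ASCII letters and digits.

```cpp
#include <cctype>
#include <string>

// Keeps only letters and digits and lower-cases them.
inline std::string NormalizeCharsetName(const std::string& name) {
    std::string out;
    for (unsigned char ch : name)
        if (std::isalnum(ch))
            out += static_cast<char>(std::tolower(ch));
    return out;
}

// Two names match when their normalized forms are identical.
inline bool SameCharsetName(const std::string& a, const std::string& b) {
    return NormalizeCharsetName(a) == NormalizeCharsetName(b);
}
```

Under this rule, spellings such as "ISO-8859-2", "iso_8859-2" and "ISO8859 2" all refer to the same charset.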

byte SystemCharset(byte charset)

Attempts to retrieve the host platform's typical charset for charset. If unsuccessful, returns 0.

int ToUnicode(int chr, byte charset)

Converts an 8-bit encoded character to unicode. charset cannot be CHARSET_UTF8.

int FromUnicode(wchar wchr, byte charset, int defchar = DEFAULTCHAR)

Converts unicode character to 8-bit encoding. If codepoint does not exist in given charset, returns defchar. charset cannot be CHARSET_UTF8.

void ToUnicode(wchar *ws, const char *s, int n, byte charset)

Converts an array of 8-bit characters to UNICODE. Both arrays, ws and s, must have (at least) n elements. charset cannot be CHARSET_UTF8.

void FromUnicode(char *s, const wchar *ws, int n, byte charset, int defchar = DEFAULTCHAR)

Converts an array of UNICODE characters to 8-bit encoding. Both arrays, s and ws, must have (at least) n elements. If a code-point does not exist in the given charset, defchar is stored instead. charset cannot be CHARSET_UTF8.
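
The role of defchar can be shown with a standalone sketch of the reverse direction. It does not use U++: the table layout follows the AddCharSet description, kUndefined is an assumed stand-in for CUNDEF, and EncodeChar and EncodeText are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <string>

constexpr int kDefaultChar = 0x1f;            // DEFAULTCHAR as listed above
constexpr std::uint16_t kUndefined = 0xFFFF;  // stand-in for CUNDEF (assumed value)

using HighTable = std::array<std::uint16_t, 128>;

// Maps one code point to an 8-bit value, or to defchar when it has no mapping.
inline int EncodeChar(const HighTable& table, std::uint16_t cp, int defchar = kDefaultChar) {
    if (cp < 128)
        return cp;
    if (cp == kUndefined)   // never treat the "undefined" marker as a match
        return defchar;
    for (int i = 0; i < 128; ++i)
        if (table[i] == cp)
            return 128 + i;
    return defchar;
}

// Converts a whole UTF-16 text, one output byte per input unit.
inline std::string EncodeText(const HighTable& table, const std::u16string& text,
                              int defchar = kDefaultChar) {
    std::string out;
    out.reserve(text.size());
    for (char16_t cp : text)
        out += static_cast<char>(EncodeChar(table, cp, defchar));
    return out;
}
```

Passing a visible replacement such as '?' as defchar makes unmapped characters easier to spot than the default control code 0x1f.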

void ConvertCharset(char *t, byte tcharset, const char *s, byte scharset, int n)

Converts an array of 8-bit characters s with encoding scharset to another 8-bit array t with encoding tcharset. Both arrays must have (at least) n elements. Neither tcharset nor scharset can be CHARSET_UTF8.

WString ToUnicode(const String& src, byte charset)

Converts src encoded in charset to UNICODE. charset can be CHARSET_UTF8. Invalid bytes are error-escaped using 0xEExx private range.

WString ToUnicode(const char *src, int n, byte charset)

Converts n characters starting at src encoded in charset to UNICODE. charset can be CHARSET_UTF8. Invalid bytes are error-escaped using 0xEExx private range.
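
The following standalone sketch shows one way ill-formed UTF-8 can be error-escaped into the 0xEExx range. It is not the U++ decoder: the validity rules (rejecting overlong forms, surrogates and values above 0x10FFFF), the one-escape-per-bad-byte policy and the 32-bit output type are choices of this sketch.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into code points. A byte that does not begin a well-formed
// sequence is emitted as 0xEE00 + byte and decoding resumes at the next byte.
inline std::u32string DecodeUtf8Escaped(const std::string& src) {
    std::u32string out;
    const std::size_t n = src.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(src[i]);
        int extra = -1;      // continuation bytes expected; -1 = invalid lead byte
        char32_t cp = 0;
        char32_t min = 0;    // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; min = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min = 0x10000; }

        // The whole sequence must fit into the remaining input.
        bool ok = extra >= 0 && i + static_cast<std::size_t>(extra) < n;
        for (int k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(src[i + k]);
            if ((c & 0xC0) != 0x80)
                ok = false;
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) {
            out += cp;
            i += 1 + static_cast<std::size_t>(extra);
        } else {
            out += static_cast<char32_t>(0xEE00 + lead);
            i += 1;
        }
    }
    return out;
}
```

Because each bad byte xx becomes 0xEExx, a converter going the other way can restore the original byte, which is what makes the round trip lossless.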

String FromUnicodeBuffer(const wchar *src, int len, byte charset = CHARSET_DEFAULT, int defchar = DEFAULTCHAR)

Converts len UNICODE characters from src to 8-bit encoding charset. charset can be CHARSET_UTF8. Error-escape characters 0xEExx are converted to xx bytes. If code-point does not exist in target encoding, defchar is used as result of conversion.

String FromUnicodeBuffer(const wchar *src)

Same as FromUnicodeBuffer(src, wstrlen(src)).

String FromUnicode(const WString& src, byte charset = CHARSET_DEFAULT, int defchar = DEFAULTCHAR)

Converts UNICODE src to 8-bit encoding charset. charset can be CHARSET_UTF8. Error-escape characters 0xEExx are converted to xx bytes. If code-point does not exist in target encoding, defchar is used as result of conversion.

String ToCharset(byte charset, const String& s, byte scharset = CHARSET_DEFAULT, int defchar = DEFAULTCHAR)

Converts s, encoded in scharset, to charset. charset can be CHARSET_UTF8. Error-escape characters can be used if one of the charsets is CHARSET_UTF8. If a code-point does not exist in the target encoding, defchar is used as the result of conversion.

bool IsLetter(int c)

Returns true if c < 2048 and it represents a letter.

bool IsUpper(int c)

Returns true if c < 2048 and it is upper-case UNICODE code-point.

bool IsLower(int c)

Returns true if c < 2048 and it is lower-case UNICODE code-point.

int ToUpper(int c)

If c < 2048 and it is lower-case, returns respective UNICODE upper-case character, otherwise returns c.

int ToLower(int c)

If c < 2048 and it is upper-case, returns respective UNICODE lower-case character, otherwise returns c.

int ToAscii(int c)

Returns UNICODE c 'converted' to basic ASCII. Conversion is performed by removing any diacritical marks. If such conversion is not possible, returns 32 (space).

char ToUpperAscii(int c)

Same as ToUpper(ToAscii(c)) (but faster).

char ToLowerAscii(int c)

Same as ToLower(ToAscii(c)) (but faster).

bool IsLetter(char c)

bool IsLetter(signed char c)

Returns IsLetter((byte)c).

bool IsUpper(char c)

bool IsUpper(signed char c)

Returns IsUpper((byte)c).

bool IsLower(char c)

bool IsLower(signed char c)

Returns IsLower((byte)c).

int ToUpper(char c)

int ToUpper(signed char c)

Returns ToUpper((byte)c).

int ToLower(char c)

int ToLower(signed char c)

Returns ToLower((byte)c).

int ToAscii(char c)

int ToAscii(signed char c)

Returns ToAscii((byte)c).

char ToUpperAscii(signed char c)

Same as ToUpper(ToAscii(c)).

char ToLowerAscii(signed char c)

Same as ToLower(ToAscii(c)).

char ToUpperAscii(char c)

Same as ToUpper(ToAscii(c)).

char ToLowerAscii(char c)

Same as ToLower(ToAscii(c)).

bool IsLetter(wchar c)

Returns IsLetter((int)c).

bool IsUpper(wchar c)

Returns IsUpper((int)c).

bool IsLower(wchar c)

Returns IsLower((int)c).

int ToUpper(wchar c)

Returns ToUpper((int)c).

int ToLower(wchar c)

Returns ToLower((int)c).

int ToAscii(wchar c)

Returns ToAscii((int)c).

char ToUpperAscii(wchar c)

Same as ToUpper(ToAscii(c)).

char ToLowerAscii(wchar c)

Same as ToLower(ToAscii(c)).

bool IsDigit(int c)

Returns true if c is a digit: c >= '0' && c <= '9'.

bool IsAlpha(int c)

Returns true if c is an ASCII alphabetic character: c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z'.

bool IsAlNum(int c)

Returns true if c is either a digit or an ASCII alphabetic character.

bool IsLeNum(int c)

Returns true if c is either a digit or a UNICODE letter < 2048.

bool IsPunct(int c)

Returns true if: c != ' ' && !IsAlNum(c).

bool IsSpace(int c)

Returns true if c is one of ' ', '\f', '\n', '\r', '\v', '\t'.

bool IsXDigit(int c)

Returns true if c is a hexadecimal digit (0-9, a-f, A-F).
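
The ASCII predicates above are simple enough to restate directly in code. The sketch below is standalone C++ written from the formulas in the descriptions; the names carry a C suffix to make clear they are not the U++ functions themselves.

```cpp
inline bool IsDigitC(int c)  { return c >= '0' && c <= '9'; }

inline bool IsAlphaC(int c)  { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }

inline bool IsAlNumC(int c)  { return IsDigitC(c) || IsAlphaC(c); }

// Literal reading of the rule: anything that is neither a space nor alphanumeric.
inline bool IsPunctC(int c)  { return c != ' ' && !IsAlNumC(c); }

inline bool IsSpaceC(int c) {
    return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\v' || c == '\t';
}

inline bool IsXDigitC(int c) {
    return IsDigitC(c) || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}
```

Taken literally, the IsPunct rule also accepts control characters such as '\n' and any non-ASCII value, so callers that want only visible ASCII punctuation may need an additional range check.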

bool IsDoubleWidth(int c)

Returns true if c is a double-width UNICODE character (like CJK ideograph).

String Utf8ToAscii(const String& src)

Returns the UTF-8 String 'converted' to basic ASCII. Conversion is performed by removing any diacritical marks. Characters for which such conversion is not possible are replaced with 32 (space).

String Utf8ToUpperAscii(const String& src)

Same as ToUpper(ToAscii(src)) but faster.

String Utf8ToLowerAscii(const String& src)

Same as ToLower(ToAscii(src)) but faster.

void ToUpper(wchar *t, const wchar *s, int len)

Converts UNICODE array to upper-case.

void ToLower(wchar *t, const wchar *s, int len)

Converts UNICODE array to lower-case.

void ToAscii(wchar *t, const wchar *s, int len)

Converts UNICODE array to basic ASCII (see ToAscii).

void ToUpper(wchar *s, int len)

Converts UNICODE array to upper-case.

void ToLower(wchar *s, int len)

Converts UNICODE array to lower-case.

void ToAscii(wchar *s, int len)

Converts UNICODE array to basic ASCII (see ToAscii).

bool IsLetter(int c, byte charset)

Returns true if character c encoded using 8-bit charset is letter.

bool IsUpper(int c, byte charset)

Returns true if character c encoded using 8-bit charset is upper-case letter.

bool IsLower(int c, byte charset)

Returns true if character c encoded using 8-bit charset is lower-case letter.

int ToUpper(int c, byte charset)

Converts character c encoded using 8-bit charset to upper-case if it is letter, otherwise returns it unchanged.

int ToLower(int c, byte charset)

Converts character c encoded using 8-bit charset to lower-case if it is letter, otherwise returns it unchanged.

int ToAscii(int c, byte charset)

Converts character c encoded using 8-bit charset to basic ASCII character by removing diacritical markings. If c is not letter, returns it unchanged.

void ToUpper(char *t, const char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to upper-case (using ToUpper). Stores result to t.

void ToLower(char *t, const char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to lower-case (using ToLower). Stores result to t.

void ToAscii(char *t, const char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to basic ASCII (using ToAscii). Stores result to t.

void ToUpper(char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to upper-case (using ToUpper). Stores result back to s.

void ToLower(char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to lower-case (using ToLower). Stores result back to s.

void ToAscii(char *s, int len, byte charset = CHARSET_DEFAULT)

Converts array s of len characters with encoding charset to basic ASCII (using ToAscii). Stores result back to s.

WString InitCaps(const wchar *s)

Converts the input zero-terminated UNICODE string so that the first letter of each word (a letter that follows whitespace) is upper-case and the rest are lower-case.

WString InitCaps(const WString& s)

Converts the UNICODE string so that the first letter of each word (a letter that follows whitespace) is upper-case and the rest are lower-case.
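
The capitalization rule can be sketched for plain ASCII text as follows. This is a standalone illustration, not the U++ implementation: InitCapsAscii is a hypothetical name, it handles only ASCII letters, and it treats the start of the string as a word start.

```cpp
#include <cctype>
#include <string>

inline std::string InitCapsAscii(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    bool word_start = true;   // true at the beginning and right after whitespace
    for (unsigned char ch : s) {
        if (std::isspace(ch)) {
            word_start = true;
            out += static_cast<char>(ch);
        } else {
            // First character of a word goes upper-case, the rest lower-case.
            out += static_cast<char>(word_start ? std::toupper(ch) : std::tolower(ch));
            word_start = false;
        }
    }
    return out;
}
```

With this sketch, "hELLO wORLD" becomes "Hello World"; whitespace is copied through unchanged.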

WString ToUpper(const WString& w)

Converts UNICODE string to upper-case.

WString ToLower(const WString& w)

Converts UNICODE string to lower-case.

WString ToAscii(const WString& w)

Converts UNICODE string to basic ASCII by removing diacritical markings.

String InitCaps(const char *s, byte charset = CHARSET_DEFAULT)

Converts the input zero-terminated 8-bit string encoded in charset so that the first letter of each word (a letter that follows whitespace) is upper-case and the rest are lower-case. charset can be CHARSET_UTF8.

String ToUpper(const String& s, byte charset = CHARSET_DEFAULT)

Converts the input 8-bit string encoded in charset to upper-case. charset can be CHARSET_UTF8.

String ToLower(const String& s, byte charset = CHARSET_DEFAULT)

Converts the input 8-bit string encoded in charset to lower-case. charset can be CHARSET_UTF8.

String ToAscii(const String& s, byte charset = CHARSET_DEFAULT)

Converts the input 8-bit string encoded in charset to basic ASCII by removing diacritical markings. charset can be CHARSET_UTF8.

String ToUpperAscii(const String& s, byte charset)

Same as ToUpper(ToAscii(s), charset), but faster.

String ToLowerAscii(const String& s, byte charset)

Same as ToLower(ToAscii(s), charset), but faster.

String ToUpper(const char *s, byte charset = CHARSET_DEFAULT)

Converts the input zero-terminated 8-bit string encoded in charset to upper-case. charset can be CHARSET_UTF8.

String ToLower(const char *s, byte charset = CHARSET_DEFAULT)

Converts the input zero-terminated 8-bit string encoded in charset to lower-case. charset can be CHARSET_UTF8.

String ToAscii(const char *s, byte charset = CHARSET_DEFAULT)

Converts the input zero-terminated 8-bit string encoded in charset to basic ASCII by removing diacritical markings. charset can be CHARSET_UTF8.

WString LoadStreamBOMW(Stream& in, byte def_charset)

Reads the stream into a UNICODE string, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in def_charset.

WString LoadStreamBOMW(Stream& in)

Reads the stream into a UNICODE string, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in the host-defined encoding (e.g. set by the Linux locale).

String LoadStreamBOM(Stream& in, byte def_charset)

Reads the stream into an 8-bit string with default encoding, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in def_charset.

String LoadStreamBOM(Stream& in)

Reads the stream into an 8-bit string with default encoding, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in the host-defined encoding (e.g. set by the Linux locale).

WString LoadFileBOMW(const char *path, byte def_charset)

Reads the file into a UNICODE string, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in def_charset. If the file cannot be read, returns WString::GetVoid().

WString LoadFileBOMW(const char *path)

Reads the file into a UNICODE string, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in the host-defined encoding (e.g. set by the Linux locale). If the file cannot be read, returns WString::GetVoid().

String LoadFileBOM(const char *path, byte def_charset)

Reads the file into an 8-bit string with default encoding, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in def_charset. If the file cannot be read, returns String::GetVoid().

String LoadFileBOM(const char *path)

Reads the file into an 8-bit string with default encoding, honoring an optional UNICODE BOM mark. If there is no BOM, the text is considered to be in the host-defined encoding (e.g. set by the Linux locale). If the file cannot be read, returns String::GetVoid().

bool SaveStreamBOM(Stream& out, const WString& data)

Saves data to the stream in 16-bit UNICODE format, with a BOM header. Returns true on success.

bool SaveFileBOM(const char *path, const WString& data)

Saves data to the file in 16-bit UNICODE format, with a BOM header. Returns true on success.

bool SaveStreamBOMUtf8(Stream& out, const String& data)

Saves the 8-bit string data, given in default encoding, to the stream as UTF-8 with a BOM header. Returns true on success.

bool SaveFileBOMUtf8(const char *path, const String& data)

Saves the 8-bit string data, given in default encoding, to the file as UTF-8 with a BOM header. Returns true on success.

bool Utf8BOM(Stream& in)

Tests for and skips UTF-8 BOM mark in the seekable Stream in.
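
The "test and skip" behavior, and the reason the stream has to be seekable, can be sketched with standard C++ streams. This is an illustration using std::istream rather than the U++ Stream class, and SkipUtf8Bom is a hypothetical name.

```cpp
#include <istream>
#include <sstream>

// Returns true and leaves the stream just past the BOM if one is present;
// otherwise restores the original read position and returns false.
inline bool SkipUtf8Bom(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char buf[3] = {0, 0, 0};
    in.read(buf, 3);
    const bool found = in.gcount() == 3 &&
                       static_cast<unsigned char>(buf[0]) == 0xEF &&
                       static_cast<unsigned char>(buf[1]) == 0xBB &&
                       static_cast<unsigned char>(buf[2]) == 0xBF;
    if (!found) {
        in.clear();       // a short read sets eofbit/failbit
        in.seekg(start);  // rewind, which is why seeking must be supported
    }
    return found;
}
```

When no BOM is found, the bytes that were read for the test have to be given back, which is only possible by seeking to the earlier position.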

WString FromUtf8(const char *_s, int len)

Converts UTF-8 to UNICODE string. Any wrong bytes and sequences are converted to private 0xEExx range. Deprecated, use ToUtf16.

WString FromUtf8(const char *_s)

Converts zero-terminated UTF-8 string to UNICODE. Any wrong bytes and sequences are converted to private 0xEExx range. Deprecated, use ToUtf16.

WString FromUtf8(const String& s)

Converts UTF-8 string to UNICODE. Any wrong bytes and sequences are converted to private 0xEExx range. Deprecated, use ToUtf16.

bool utf8check(const char *_s, int len)

Checks whether array contains a valid UTF-8 sequence. Deprecated, use CheckUtf8.

int utf8len(const char *s, int len)

Returns the number of UNICODE characters in UTF-8 text. Error-escaped 0xEExx characters for ill-formed parts of UTF-8 are correctly accounted for. Deprecated, use Utf16Len.

int utf8len(const char *s)

Returns the number of UNICODE characters in zero-terminated UTF-8 text. Error-escaped 0xEExx characters for ill-formed parts of UTF-8 are correctly accounted for. Deprecated, use Utf16Len.

int lenAsUtf8(const wchar *s, int len)

Returns number of bytes of UNICODE text when UTF-8 encoded. Deprecated, use Utf8Len.

int lenAsUtf8(const wchar *s)

Returns number of bytes of UNICODE zero-terminated text when UTF-8 encoded. Deprecated, use Utf8Len.
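
How the UTF-8 length of UNICODE text follows from the code-point values can be shown with a standalone sketch. It works on 32-bit code points and standard UTF-8 length rules only; it does not model U++'s 0xEExx error-escape handling, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-8 bytes needed for one code point.
inline std::size_t Utf8ByteCount(char32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Total UTF-8 length of a text, without actually encoding it.
inline std::size_t Utf8TextLength(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text)
        total += Utf8ByteCount(cp);
    return total;
}
```

Knowing the length in advance allows the output buffer to be allocated once before the text is encoded.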
