Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » Developing U++ » U++ Developers corner » Choosing the best way to go full UNICODE
Choosing the best way to go full UNICODE [message #48150] Sat, 27 May 2017 16:58 Go to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Hi,

last week, for the first time, I have encountered a problem caused by presence of characters outside of UCS2 range (did you know that emoji characters have assigned codepoints? I did not either. Yet our www backend was failing because somebody send "thumbs up" in the message... Smile

So perhaps it is time for full support. The mission target is that these characters are correctly loaded/stored to/from LineEdit, DocEdit, EditString, RichText.

Now I can see 3 paths to achieve this:

(1) Extend WString to 32-bits (so that sizeof(wchar) is 4 bytes). This is the most straightforward and right now my favorite. Downsides is that I am a little bit afraid about backward compatibility and that the performance will be worse than WString. Performance is less of an issue because I believe that in most cases, texts are stored as UTF-8 String anyway, but worth checking.

(2) Add 32-bit LString (or "UString"?) and perhaps deprecate WString. Fixes the problem with compatibility, but fattens the API.

(3) Keep WString and somehow fix things to work with surrogate pairs.

Any thoughts?

Mirek
Re: Choosing the best way to go full UNICODE [message #48151 is a reply to message #48150] Sat, 27 May 2017 20:02 Go to previous messageGo to next message
Zbych is currently offline  Zbych
Messages: 313
Registered: July 2009
Senior Member
Correct me if I am wrong, but you need to keep wchar 2-bytes long to be compatible with WinApi, ToSystemCharsetW and FromSystemCharsetW depend on this size.
So I would vote for DString (like DWORD).
Re: Choosing the best way to go full UNICODE [message #48152 is a reply to message #48151] Sat, 27 May 2017 20:48 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
Good point, but I do not think this is a showstopper. Clearly

String FromSystemCharsetW(const wchar *src)

would have to be

String FromSystemCharsetW(const WCHAR *src)

and a bit of fixing would to be done. The advantage is that it would be all compile-time errors.
Re: Choosing the best way to go full UNICODE [message #48153 is a reply to message #48152] Sat, 27 May 2017 20:51 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
A good contraargument is search in *.h files for WString. There is quite a lot of functions that would need DString variant implemented.
Re: Choosing the best way to go full UNICODE [message #48160 is a reply to message #48153] Mon, 29 May 2017 14:36 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
Hi Mirek!

Welcome to my problems, circa 1-2 years in of using U++ Laughing.

I didn't know if you remember, but I was pressuring you left and right to fix these issues, with occasional fixes and what not.

Some fixes managed to get in, but there was no real interest in it, so my solution was to eventually deprecate whole String support in U++, using it as a pure storage. String + custom conversions methods + some custom things, like ToUpper, fixes the problems.

U++ is at least a decade behind on Unicode, with absolutely no good reason. It is not like the issue can't all be fixed in like 3 weeks. I'm sure it would fail 100% of the more difficult compatibility tests.

The good news is that these issues are so rare that you are extraordinarily unlikely to encounter them as an American or European.

As anecdotal proof, this is the first time you encountered issues.

I'm using a very dense data representation method, but even like this, full support is almost 200 KiB of data in every single executable. This is without more advanced stuff, like scripts. I believe 180 KiB is the minimum data that can be stored for the first 3 Unicode planes, this vs. your 2048 entry table in U++, but this gives you lower, upper, title case and Unicode category. I have unit-tests covering these, making sure they are always correct.

String and WString are used for storage. Mostly String.

Contrary to popular belief, Unicode code points are not indexable, so DString does not help. Unicode is glyph based. This + DString being the least beneficent method of storage makes DString not really needed.

In most cases, you can treat the String as opaque.

When not, the ideal solution is an amortized string walker and light algorithm rewrites. You can create an indexed string walker, that seeks, but if you ask it for indices in order, the seek overhead is basically zero. This is only needed for glyph based algorithms, like case conversion and rendering. Or a traditional string walker without indices, a bit faster, but a lit uglier to use, with Get and ++/-- methods.
Re: Choosing the best way to go full UNICODE [message #48162 is a reply to message #48160] Mon, 29 May 2017 19:40 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Mon, 29 May 2017 14:36

Contrary to popular belief, Unicode code points are not indexable, so DString does not help. Unicode is glyph based. This + DString being the least beneficent method of storage makes DString not really needed.


I really feel like I am missing something here, but what does it mean "not indexable"?

Quote:

When not, the ideal solution is an amortized string walker and light algorithm rewrites. You can create an indexed string walker, that seeks, but if you ask it for indices in order, the seek overhead is basically zero. This is only needed for glyph based algorithms, like case conversion and rendering. Or a traditional string walker without indices, a bit faster, but a lit uglier to use, with Get and ++/-- methods.


What is the result of 'Get'?

Mirek
Re: Choosing the best way to go full UNICODE [message #48165 is a reply to message #48162] Tue, 30 May 2017 10:31 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Mon, 29 May 2017 20:40

I really feel like I am missing something here, but what does it mean "not indexable"?


Well the perceived problem with Utf8 is that you have a string s[i] and s[i+1] are code units, not a code points. It is not indexable.

So the proposed solution is to use Utf32, where s[i] and s[i+1] are code points. Code points are indexable.

But, here is the issue. Most, but not all algorithms fall into opaque ones or glyph based. In opaque ones you care more about the data in bulk and don't really interpret each code unit. In glyph based ones, s[i] and s[i+1] are code points, but that doesn't help you. You need to determine glyph[i] and glyph[i + 1], so you are back to Utf8 levels of boundary determination even with Utf32. You need "indexable" glyphs, not indexable code units, but they are not.

So the solution is to use a string walker.
Quote:

What is the result of 'Get'?

Get return the code unit.

The code still remains largely the same, you go from something like:
char* s = string_8bits;
while (s ...) {
    if (*s)
        ....
    s++;
}


to

StringWalker s = string_utf8;
while (s ...) {
    if (*s)
        ....
    s++;
}


In the constructor, you set up the first code unit.

*s, or Get, returns it.

++ seeks the next code unit.

In the same class, or a different one, you implement the GlyphWalker mechanics.

With a GlyphWalker you can do things like "visual" strlen, where you pass it in A C1 C2 B C3 C4, where Cx are punctuation marks, and the first glyph boundary is A and the second B, with the returned length of 2. The same class handles right to left languages, with the help of block boundary getters.

As you may know, I am implementing my own Core-like library, and one of its goals is to have perfect Unicode support. I can help you with some pieces of code, but I need to know what you want to do.

To expand upon the example I have given before, one needs ToUpper.

ToUpper is hard to get small and fast, since there is no rhyme and reason to it. I tested the data and weird values pop up all the time.

What i did is encode all Unicode characters in two tables.

One is a table of words. The entire Unicode spectrum is divided into chunks of 8 to 32 characters. The word table index is the chunk index. If all the characters in the chunk are well behaved, all you need is the original table.

As an example, characters 0 tot 7 have the same properties and no cases, so the upper bits of the word table are zero and the lower bits are the properties.

If the chunk is not well behaved, the upper bits are the index to a second table of qwords and the lower bits are zero.

I tested all chunk sizes and surprisingly 8 characters/chunk occupy the least amount of RAM.

Using the two tables, you get something similar to this ancient code that is no longer up to date:

dword _ToUpper(dword c) {
	dword plane = c >> 16;
	if (plane < PLANES) {
		dword group = c / BS;
		dword offs = c % BS;
		word gdata = db1[group];
		if (gdata & 0x8000) {
			gdata &= 0x7FFF;
			return db2[gdata * BS + offs] & 0x3FFFF;
		}
		else
			return c;
	}
	return c;
}


This is the fastest and most compact scheme I could come up with without some RLE or other compression method. But this is meant to serve 1114112 code units across all 17 planes. Not just 2048 characters. Even if you somehow manage to squeeze 1114112 characters into a word table, which is BTW impossible, when in some cases upper/title/lower case are all over 16 bits in size, that is still 1114112 * bytes. My scheme compresses the range down to 131456 bytes across the two tables, with the caveat that for planes that are not in use, planes above 2, c = ToUpper(c).

Did a quick test, and if somehow in the future, all 1114112 are assigned random values, you probably will have a 600+ KiB table.

Maybe you can come up with and even faster or compact scheme, but I wanted it to be near constant cost.
Re: Choosing the best way to go full UNICODE [message #48166 is a reply to message #48165] Tue, 30 May 2017 11:03 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Tue, 30 May 2017 10:31

Get return the code unit.


What is 'code unit' in your teminology?

If I go by definition here: https://en.wikipedia.org/wiki/Character_encoding, it does not make sense with your explanation.

Now to explain what I am trying to do now is to provide some means to add full support to U++ editors in future. To that end, I need to define "cursor position" within the UNICODE text.

So far I believe that a "index of UNICODE codepoint" is a good cursor position. Alternatively, grapheme could be a good cursor position too, but then sometimes you would like to edit grapheme internals, like remove some combining mark, which could be difficult if grapheme is indivisible.

Anyway, it is definitely somthing to think about.

Mirek
Re: Choosing the best way to go full UNICODE [message #48167 is a reply to message #48166] Tue, 30 May 2017 11:23 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
I meant code point.

Get is the code point.

And what I claimed is that having code point = code unit only half-way solves the problem and you get the least favorite and advantageous encoding in DString/Utf32.

The index of Utf code-point can be easily solved with a string walker that can work for any encoding, from utf8 to 32 and there is no need to tie yourself down with DString supremacy.

Because once you reach "full" Unicode support with DString, you know that most of the non-opaque string code in U++ will use the DString implementation and everything will be biased towards 32 bit.

I prefer 1-3 string classes with an 8 bit bias and the same amount of "String walker" classes and once one implements this, I believe in practice the amount of switching from one encoding to another gets minimized.

I my library I only have String, with plans to add WString someday, and the complexity of Utf8 has never made me wish for DString. Conversion to Utf16 only happens in WinAPI context anyway. Plus on the web, Utf8 is the standard. You will probably find that across the globe, on average, Utf8 is still the smaller method of storage. And in CJK context, Utf16 makes a lot of sense, even with occasional surrogate pairs. Historic/academic CJK is where Utf32 shines, but still, in such contexts, Utf is not the most popular.

Anyway, I would go with the same approach I went so many years ago.

Conversion from Utf8 to Ucs2 must be upgraded to Unicode 9.0 compliant Utf8 to Utf16. Utf8 1-4 code units must be converted to a single code point. Correct escaping and error recovery must be implemented. The Unicode standard give you exact deterministic outcomes for any illformated sequences, and combined with the EE you already have in U++, you can get non-destructive error handling. This will leave you with the ability to read and write Unicode. The rest of U++ won't be helped by this, but it is still an important step and will solve a few problems.

Then, one by one, I would convert pieces of code over to a standardized traversal mechanic, be it a string walker class or something else you come up with.
Re: Choosing the best way to go full UNICODE [message #48170 is a reply to message #48167] Tue, 30 May 2017 11:45 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Tue, 30 May 2017 11:23
I meant code point.

Get is the code point.

And what I claimed is that having code point = code unit only half-way solves the problem and you get the least favorite and advantageous encoding in DString/Utf32.

The index of Utf code-point can be easily solved with a string walker that can work for any encoding, from utf8 to 32 and there is no need to tie yourself down with DString supremacy.

Because once you reach "full" Unicode support with DString, you know that most of the non-opaque string code in U++ will use the DString implementation and everything will be biased towards 32 bit.


Interestingly, this is not true at all in current U++ with WString. Most data are always in utf8 String. WString is usually only used temporary to process data. Which was the original intention going to 32 bits 'DString'.

Actually, rethinking graphem problem, maybe we could have

GraphemString

with

String operator[](int i) // returns complete grapheme at position i
int GetCount() // returns number of graphemes

etc...

It would be an 'index' over Utf8 String, grapheme addressed.

See, my problem can be demonstrated by

TextCtrl::Ln

There is the storage for lines of editors. Now 'len' is number of 'cursor characters'. When changes are done to the line, e.g. inserting something at cursor position, text is unpacked to WString, Insert at cursor position is done, then text is packed back to utf8.

Now maybe, if I decide that grapheme is the 'cursor character', going to GraphemeString would be relatively simple. And maybe even faster than using WString...

(All that said, all this will need some way how to provide GUI input / Draw output for graphemes.... but that is a story for another day).

[Updated on: Tue, 30 May 2017 11:46]

Report message to a moderator

Re: Choosing the best way to go full UNICODE [message #48179 is a reply to message #48170] Wed, 31 May 2017 10:30 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
It looks like there are many possible ways to go forward. We can try several things and probably a lot of things will work.

As long as we understand that there is no universal way to make Unicode indexable, but on a case by case basis, you can. The only thing you can universally do is to iterate linearly over Unicode.

But I still think I gave you a partial solution so many years ago.

To reiterate:
1. Utf8 to Utf16 and vice-versa must be fixed under all scenarios. We also need to add Utf8 to Utf32, but that is trivial compared to Utf16. So proper error recovery must be implemented and 4 byte long sequences must be converted to surrogate pairs.
2. The Unicode table must be expanded to more than 2048 characters. Maybe not full range, but Unicode is based on blocks. We can move a bit closer to the CJK block, because for CJK, nobody expects more than the basics. Probably 8k at least.
3. Take the opportunity to clean stuff up. Core has had several clean-ups and now C++11 shook thing up. So this might be the perfect opportunity to make sure the designs and interfaces are future proof.
4. Implement DString? I don't know yet. On the other hand, implementing DSting is easy and it can be dropped in its own header and available for use.

Re: Choosing the best way to go full UNICODE [message #48180 is a reply to message #48179] Wed, 31 May 2017 11:00 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 31 May 2017 10:30
It looks like there are many possible ways to go forward. We can try several things and probably a lot of things will work.

As long as we understand that there is no universal way to make Unicode indexable, but on a case by case basis, you can. The only thing you can universally do is to iterate linearly over Unicode.


If you can iterate linearly, Unicode is indexable...

Quote:

But I still think I gave you a partial solution so many years ago.

To reiterate:
1. Utf8 to Utf16 and vice-versa must be fixed under all scenarios. We also need to add Utf8 to Utf32, but that is trivial compared to Utf16. So proper error recovery must be implemented and 4 byte long sequences must be converted to surrogate pairs.


Agreed.

Quote:

2. The Unicode table must be expanded to more than 2048 characters. Maybe not full range, but Unicode is based on blocks. We can move a bit closer to the CJK block, because for CJK, nobody expects more than the basics. Probably 8k at least.


Not so sure about this - not that important IMO at this point. So I will not get correct ToUpper for many characters - that has little impact on most applications.

Quote:

4. Implement DString? I don't know yet. On the other hand, implementing DSting is easy and it can be dropped in its own header and available for use.


I am now leaning against it. Vector<int> is good enough for utf32 - what we eventually need to do with it.

I am now really thinking that "multibyte" String is the solution. The one that returns a variable sequence of bytes for each position. I am now even thinking this does not need to be bound to graphemes only.

The longterm point with that is to replace WString as processing facility in editors.
Re: Choosing the best way to go full UNICODE [message #48181 is a reply to message #48180] Wed, 31 May 2017 11:30 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Wed, 31 May 2017 12:00

If you can iterate linearly, Unicode is indexable...


Sorry, you don't get it. I'm using a very specific meaning of indexable, probably that is the problem.

A vector is indexable. You can reach v[7] without going though 0 to 6 and 0 though 6 can't do anything to change the "offset" of 7. A list is not. You need to traverse it to get list[7].

You need to traverse a String s to get to s[7]. The contents up to 7 can change the offset of 7. Unicode strings are not indexable.

Indexable data structures are predictable contant-cost no need to iterate over them to get some "groundind". Non-indexable does not mean that you can't reach them though an index. It means that you can get only obtain that index with a linear traversal O(index) traversal and you need to reach than index at least once.


mirek wrote on Wed, 31 May 2017 12:00

I am now really thinking that "multibyte" String is the solution. The one that returns a variable sequence of bytes for each position. I am now even thinking this does not need to be bound to graphemes only.

The longterm point with that is to replace WString as processing facility in editors.


That sound to me like you trying to make String indexable. Which can't be done with Unicode. How are you going to determine the sequence of bytes that you must return for a position without traversing it linearly?

If you give it a position as input, this still needs to be a deterministic constant cost memory jump. That should be the bread and butter operation. In the few cases where you need your multibyte behavior, there you abandon indexing randomly into the string (string[7] is forbidden) and iterate over it, going left to right, returning multiple bytes for each logical "position".

Let me sum up the Unicode fallacy:
1. Code points are only indexable in Utf32. Which is not always helpful because
2. Glyps/Graphemes are never indexable.

The common misconception is that 1 and/or 2 are false.

mirek wrote on Wed, 31 May 2017 12:00

I am now leaning against it. Vector<int> is good enough for utf32 - what we eventually need to do with it.


I agree, that is enough. And in some context even C vectors will work.

We do need a standardized functions that convert from code units to a single code point to fill up these structures.

mirek wrote on Wed, 31 May 2017 12:00

Not so sure about this - not that important IMO at this point. So I will not get correct ToUpper for many characters - that has little impact on most applications.

Half measures again...
Re: Choosing the best way to go full UNICODE [message #48183 is a reply to message #48181] Wed, 31 May 2017 12:07 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 31 May 2017 11:30
mirek wrote on Wed, 31 May 2017 12:00

If you can iterate linearly, Unicode is indexable...


Sorry, you don't get it. I'm using a very specific meaning of indexable, probably that is the problem.

A vector is indexable. You can reach v[7] without going though 0 to 6 and 0 though 6 can't do anything to change the "offset" of 7. A list is not. You need to traverse it to get list[7].


I guess with "indexable" I was was not specific enough.

Imagine that you traverse through Utf8 String and each codepoint (or grapheme cluster) you store in individual cell of

Vector<String>

Then the result is definitely indexable. Or am I missing something? (Of course, I would not use Vector<String> for optimized version, but that is just implementation issue).

mirek wrote on Wed, 31 May 2017 12:00

That sound to me like you trying to make String indexable. Which can't be done with Unicode. How are you going to determine the sequence of bytes that you must return for a position without traversing it linearly?


Traverse it and store positions. (It is functional equivalent to splitting to Vector<String> as described above).

mirek wrote on Wed, 31 May 2017 12:00

Not so sure about this - not that important IMO at this point. So I will not get correct ToUpper for many characters - that has little impact on most applications.

Half measures again...
[/quote]

I do not mean that it is not important at all. Just that choosing proper basic interfaces is more pressing at the moment.
Re: Choosing the best way to go full UNICODE [message #48184 is a reply to message #48183] Wed, 31 May 2017 12:26 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
mirek wrote on Wed, 31 May 2017 13:07

Then the result is definitely indexable. Or am I missing something?


Yes, performance!

Unicode Strings are not indexable and indeed you can make them indexable with Vector<String> (or better).

But that is a bit of Sisyphean act. The conflict between them being non indexable and you forcing them to be indexable will result in performance and memory overhead. Like I said before, you can make a list indexable by traverse and store but you rarely would do this in practice, instead replace your random access algorithm with a linear traversal one if possible.

Now, there are some mighty complex algorithms which probably will call for this, where we will traverse and store.

But for the rest, I still think that traverse and store into a indexable structure is the worst case scenario.

You still traverse the string once, but do not store only the current code point and maybe have a few "last" positions to keep track of some other characters from previous positions.

And I would still advise the use of a StringWalker class, one that can seek to a random position, but ideally the algorithm will never use this capability! After a seek (or just initialization) it will store the current code point and a few more fields, like begin of the code point, size of the sequence. Then it as ++ and -- to go one code point up or down. This class or a separate one can do the same for glyphs.

Using such a class (or embedding this functionality directly into String) to traverse the string once from beginning to end to process each codepoint/glyph will have almost zero performance overhead.

The important part is not to make the confusion that such a class makes string index-able, i.e. only random seek if can't avoid it. And random seek with small jumps. Writing the algorithm in such a way that StringWalker is List<int>, not Vector<int>.

Re: Choosing the best way to go full UNICODE [message #48185 is a reply to message #48184] Wed, 31 May 2017 12:40 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 31 May 2017 12:26
mirek wrote on Wed, 31 May 2017 13:07

Then the result is definitely indexable. Or am I missing something?


Yes, performance!


No.

Really, the issue at hand is this. Current code works like (inserting character into LineEdit):

WString unpacked_line = line[i].ToWString();
unpacked_line.Insert(cursor_position, something);
line[i] = unpacked_line.ToString();


I am pretty sure that replacing WString with the new "Indexed unicode" String type will have similar performance, if not better, plus code will not change significantly.

That said, I am really focused at this very issue, this is what I really need to solve.

For most "low-level" tasks like ToUpper walker is better approach.
Re: Choosing the best way to go full UNICODE [message #48186 is a reply to message #48185] Wed, 31 May 2017 13:12 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
Well since in that example you wont to use some new form of storage and not just a DString, this is more of an implementation detail and not a public API.

I was discussing public API, not LineEdit implementation detail.

I want the parts that are publicly visible from U++ and often used to no longer be a decade behind on Unicode.

As for the LineEdit, there are multiple solutions. I'm pretty sure one can make that work with Utf8 with 1-4 code unit string, but it is a bit more complicated. The only complication is of course getting and changing the cursor for insertion, right? And related to that, but on a side note, my String, when doing find, insert and other such operations, does have multiple variants. You can insert in the middle both a byte like 32 for space or my "Char", which is a DWord, so inserting 猫 into it can be done and it will result in 3 added characters. Basically there are two variants for everything, code points and code units. You can insert int the middle of a Utf8 string a Utf8/16/32 string, with on the fly conversion, and it helps if the Insert returns the end point of the insertion. You already know the beginning.

For LineEdit, DString, Vector<int> or some multi-byte string would all work I think. I comes down to what one is willing to implement.
Re: Choosing the best way to go full UNICODE [message #48187 is a reply to message #48186] Wed, 31 May 2017 13:20 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 31 May 2017 13:12
Well since in that example you wont to use some new form of storage and not just a DString, this is more of an implementation detail and not a public API.

I was discussing public API, not LineEdit implementation detail.


But that is just one use of that class. This situation repeats (basically everywhere there is a defined 'position', so requires a tool to handle it. So I am pretty sure it is a useful public api. More useful than WString.

Quote:

For LineEdit, DString, Vector<int> or some multi-byte string would all work I think. I comes down to what one is willing to implement.


I guess we have started with (your correct) argument that DString or Vector<int> are not enough, because we edit characters, not codepoints...

[Updated on: Wed, 31 May 2017 13:21]

Report message to a moderator

Re: Choosing the best way to go full UNICODE [message #48188 is a reply to message #48187] Wed, 31 May 2017 13:43 Go to previous messageGo to next message
cbpporter is currently offline  cbpporter
Messages: 1400
Registered: September 2007
Senior Contributor
Without RLE, I think you can get around this by positions not representing characters, but sequence starts.

As a mandatory condition, every time the position is updated, it is guaranteed to be at the start of a sequence.

As a more complicated example, let's say you have a selection of text, with a "begin" and "end" pos. You handle a key press. Taking your start pos as a sequence start, you determine the sequence end. This means looking to see how many code units it is, seeking over combination marks and ligatures. Basically on the fly glyph analysis. With sequence start end end you know for a fact that everything between these two values must go. You do the same for the end position. As an optimization, you can mark everything for deletion between the start sequence begin and end sequence end. The text marked to be replaced will replaced with multiple code units.

The real challenge is to standardize these operations so you don't have to repeat them.

Maybe we need some GlyphInfoExtractor class or something. Something when given a random sequence of code units and a valid code point start, it can handle such common operations?

Here is a sample from unciode.org:
index.php?t=getfile&id=5305&private=0


This is 14 code units, 5 code units, 4 glyphs. The user will see and recognize 4 items as more or less "atomic", so we should focus on this.

We need and API that can locate each glyph start and allow us to replace glyphs 2 and 3 with an already properly encoded glyph sequence, like ſʒ, as an example.

This can be done on the fly with something high level like:
s = s.GlyphMid(0, 1) + "ſʒ" + s.GlyphMid(3, 1)
or
s.GlyphReplace(2, 3, "ſʒ")

or we can go lower level. Or we can go into multi-byte String territory.

PS: the high level stuff still is StringWalker territory.



Re: Choosing the best way to go full UNICODE [message #48189 is a reply to message #48188] Wed, 31 May 2017 14:41 Go to previous messageGo to previous message
mirek is currently offline  mirek
Messages: 12565
Registered: November 2005
Ultimate Member
cbpporter wrote on Wed, 31 May 2017 13:43
Without RLE, I think you can get around this by positions not representing characters, but sequence starts.


But that is just implementation issue. Semantically, it is the same.

FYI, my new "String" implementation would go like this

class MString {
String text;
int character_count; // number of code units groups, e.g. graphemes
union {
byte *offsets; // for tiny strings
word *woffsets; // for small strings
dword *loffsets; // for really big strings
};

String operator[](int pos) { // not often used likely, more for demo
int pos1 = GetOffset(pos);
return text.Mid(pos1, GetOffset(pos + 1) - pos1);
}
}

Quote:

The real challenge is to standardize these operations so you don't have to repeat them.


Yes. Hence MString. Note that with MString, after you perform Insert of another MString or Remove, you just update offset table, no need to iterate anything again.

Quote:

PS: the high level stuff still is StringWalker territory.


Depends on what how you define "high level" Smile

For me, really high level abstraction is where you abstract from those little pesky issues and handle everything in high-level units (graphemes / characters).
Previous Topic: Some addition proposals
Next Topic: Help needed with link errors (serversocket)
Goto Forum:
  


Current Time: Thu Jul 02 11:13:32 CEST 2020

Total time taken to generate the page: 0.02510 seconds