January 17, 2011
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm in short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.

Andrei
January 17, 2011
On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
>> Moreover, even if you ignore Hebrew as a tiny, insignificant minority,
>> you cannot do the same for Arabic which has over one *billion* people
>> that use that language.
>
> I hope that the medium type works 'good enough' for those languages,
> with the high level type needed for advanced usages.  At a minimum,
> comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, which is supposedly written in a given natural language (as indicated, for instance, by an HTML header), does not also contain terms from other languages? There are various occasions for this: quotations, foreign words, pointers...

A side issue is raised by precomposed code points for composite characters. For most languages of the world, I guess (though I am unsure), all "official" characters have single-code-point representations. Good, but unfortunately this is not enforced by the standard (indeed, the decomposed form can sensibly be considered the base form, but that is another topic).
So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single code points, to be safe one would have to normalise to NFC (Normalisation Form Composed) anyway. Then where is the actual gain? In fact, it is a loss, because NFC is more costly than NFD (Decomposed): the standard NFC algorithm first decomposes to NFD to obtain a unique representation, which can then be (re)composed via simple mappings.
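To make the issue concrete, here is a tiny D example of why some normalisation step is unavoidable (nothing but string literals and a unittest block):

unittest
{
    string nfc = "\u00E9";     // U+00E9: precomposed 'é' (NFC form)
    string nfd = "e\u0301";    // U+0065 'e' + U+0301 combining acute (NFD form)

    assert(nfc != nfd);        // same character, different code sequences
    assert(nfc.length == 2);   // 2 UTF-8 code units
    assert(nfd.length == 3);   // 3 UTF-8 code units
}

Without normalising one side or the other, a plain comparison of the two spellings of 'é' fails.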

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard itself, but unfortunately they often refer to it.)

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On Mon, 17 Jan 2011 10:00:57 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm in short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing combining characters to write benchmarks. (You do need it for testing purposes.) Most text won't contain combining characters anyway, so after you implement graphemes, just benchmark them on regular text.

True, benchmarking doesn't really apply to combining characters, because we have nothing to compare against.  The current scheme fails on them anyway, so by default the new one would be the best solution.

-Steve
January 17, 2011
On Mon, 17 Jan 2011 10:14:19 -0500, spir <denis.spir@gmail.com> wrote:

> On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
>>> Moreover, even if you ignore Hebrew as a tiny, insignificant minority,
>>> you cannot do the same for Arabic which has over one *billion* people
>>> that use that language.
>>
>> I hope that the medium type works 'good enough' for those languages,
>> with the high level type needed for advanced usages.  At a minimum,
>> comparison and substring should work for all languages.
>
> Hello Steven,
>
> How does an application know that a given text, which is supposedly written in a given natural language (as indicated, for instance, by an HTML header), does not also contain terms from other languages? There are various occasions for this: quotations, foreign words, pointers...
>
> A side issue is raised by precomposed code points for composite characters. For most languages of the world, I guess (though I am unsure), all "official" characters have single-code-point representations. Good, but unfortunately this is not enforced by the standard (indeed, the decomposed form can sensibly be considered the base form, but that is another topic).
> So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single code points, to be safe one would have to normalise to NFC (Normalisation Form Composed) anyway. Then where is the actual gain? In fact, it is a loss, because NFC is more costly than NFD (Decomposed): the standard NFC algorithm first decomposes to NFD to obtain a unique representation, which can then be (re)composed via simple mappings.
>
> For further information:
> Unicode's normalisation algos: http://unicode.org/reports/tr15/
> list of technical reports: http://unicode.org/reports/
> (Unicode's technical reports are far more readable than the standard itself, but unfortunately they often refer to it.)

I'll reply to this to save you the trouble.  I have reversed my position since writing a lot of these posts.

In summary, I think strings should default to an element type of a grapheme, which should be implemented via a slice of the original data.  Updated string type forthcoming.
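Roughly, the shape I have in mind is something like the sketch below. It is only a sketch, not the forthcoming type, and the combining-mark test is deliberately reduced to the four ranges quoted earlier in this thread:

import std.utf : decode;

// Simplified combining-mark test (only the four ranges quoted earlier in
// the thread; the real type will need the full Unicode data).
bool isCombining(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)
        || (c >= 0x1DC0 && c <= 0x1DFF)
        || (c >= 0x20D0 && c <= 0x20FF)
        || (c >= 0xFE20 && c <= 0xFE2F);
}

// A grapheme is just a slice of the original data: one base code point
// plus any following combining marks; no copying, no re-encoding.
struct Grapheme
{
    string slice;
}

// Extract the grapheme starting at code-unit index i, advancing i past it.
Grapheme nextGrapheme(string s, ref size_t i)
{
    immutable start = i;
    decode(s, i);                  // skip the base code point
    size_t j = i;
    while (j < s.length && isCombining(decode(s, j)))
        i = j;                     // include the combining mark just decoded
    return Grapheme(s[start .. i]);
}

unittest
{
    string s = "e\u0301tude";      // decomposed 'é' + "tude"
    size_t i = 0;
    assert(nextGrapheme(s, i).slice == "e\u0301");
    assert(nextGrapheme(s, i).slice == "t");
}

The point is that a Grapheme never copies or re-encodes anything; it is just a view into the original array.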

-Steve
January 17, 2011
On 01/15/2011 09:46 PM, foobar wrote:
> I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-Unicode-expert programmer that D simply does NOT support, e.g., normalization.
>
> As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.

In a few days, D will have an external library able to deal with those issues, hopefully correctly and clearly for client programmers. Possibly its design is not the best approach (especially for efficiency: Michel has made me doubt that, and my competence in this field is close to nothing). But it has the merit of existing and of providing a clear example of the correct semantics. Let us use it as a base for experimentation.

Then, everything can be redesigned from scratch if we realise I was initially completely wrong. In any case, it would certainly be a far easier and faster job to do now, after having explored the issues at length and with a reference implementation at hand.

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 01/15/2011 11:45 PM, Michel Fortin wrote:
> That said, I'm sure if someone could redesign Unicode by breaking
> backward-compatibility we'd have something simpler. You could probably
> get rid of pre-combined characters and reduce the number of
> normalization forms. But would you be able to get rid of normalization
> entirely? I don't think so. Reinventing Unicode is probably not worth it.

I think like you about precomposed characters: they bring no real gain (even for easing the transition from legacy charsets, since texts must be decoded anyway, and mapping to single or multiple code points then costs nothing extra).
But they complicate the design by introducing two parallel representation schemes (one character <--> one "code pile" versus one character <--> one precomposed code point), and they place a heavy burden on software (and programmers) with respect to correct indexing, slicing, comparison, search, counting, etc. This is where normalisation forms enter the game.
My choice would be:
* decomposed form only
* ordering imposed by the standard at text-composition time
==> no normalisation because everything is normalised from scratch.

What remains is only what I call "piling". But we cannot easily get rid of that without separators in the standard UTF encodings.
I had the idea of UTF-33 ;-): an alternative, freely agreed-upon encoding that just says (in addition to UTF-32) that the content is already normalised (NFD-decomposed and ordered), either produced that way initially or already processed. Software can then happily read texts in and only think about piling if needed. UTF-33+ would add "grapheme" separators (a costly solution in terms of space) to get rid of piling as well.
The aim being to avoid stupidly doing the same job multiple times on the same text.

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 01/14/2011 04:50 PM, Michel Fortin wrote:
>> This might be a good time to see whether we need to address graphemes
>> systematically. Could you please post a few links that would educate
>> me and others in the mysteries of combining characters?
>
> As usual, Wikipedia offers a good summary and a couple of references.
> Here's the part about combining characters:
> <http://en.wikipedia.org/wiki/Combining_character>.
>
> There's basically four ranges of code points which are combining:
> - Combining Diacritical Marks (0300–036F)
> - Combining Diacritical Marks Supplement (1DC0–1DFF)
> - Combining Diacritical Marks for Symbols (20D0–20FF)
> - Combining Half Marks (FE20–FE2F)
>
> A code point followed by one or more code points in these ranges is
> conceptually a single character (a grapheme).

Unfortunately, things are complicated by _prepend_ combining marks, which appear in a code sequence _before_ the base character.
The Unicode algorithm is described in http://unicode.org/reports/tr29/ section 3 (human-readable ;-). See especially the first table in section 3.1.
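To give a rough idea of the shape of those rules, here is a drastically simplified D sketch covering only GB9 ("don't break before Extend"), GB9b ("don't break after Prepend") and the default GB10 ("otherwise, break"); the CR/LF, Control, Hangul and SpacingMark rules are omitted, as is the table mapping code points to their break property:

enum GraphemeBreak { Other, Prepend, Extend }   // a small subset of the real property values

// Is there a grapheme boundary between two adjacent code points whose
// break properties are `left` and `right`?
bool isBoundary(GraphemeBreak left, GraphemeBreak right)
{
    if (right == GraphemeBreak.Extend)  return false; // GB9:  do not break before Extend
    if (left  == GraphemeBreak.Prepend) return false; // GB9b: do not break after Prepend
    return true;                                      // GB10: break everywhere else
}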

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>> 
>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>> No doubt it's easier to implement it that way. The problem is that in
>>>> most cases it won't be used. How many people really know what is a
>>>> grapheme?
>>> 
>>> How many people really should care?
>> 
>> I think the only people who should *not* care are those who have
>> validated that the input does not contain any combining code point. If
>> you know the input *can't* contain combining code points, then it's safe
>> to ignore them.
> 
> I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating the inputs to make sure they don't contain combining code points. As far as I know, no one is doing that, so that means everybody should use algorithms capable of handling multi-code-point graphemes. If someone indeed is doing this validation, he'll probably also be smart enough to make his algorithms work with dchars.

That said, no one should really have to care but those who implement the string manipulation functions. The idea behind making the grapheme the element type is to make it easier to write grapheme-aware string manipulation functions, even if you don't know about graphemes. But the reality is probably more mixed than that.

- - -

I gave some thought to all this and came to an interesting realization that made me refine the proposal. The new proposal is perhaps as disruptive as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes, and what is a 'character'
Fact 3: most people won't bother with all this; they'll just use the basic language facilities and assume everything works correctly if it works correctly for them

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not sure we agree on this, but let's continue.

From the above three facts, we can deduce that a user won't want to bother with byDchar, byGrapheme, or byWhatever when using algorithms. You were annoyed by having to write byDchar everywhere, so you changed the element type to always be dchar, and now you don't have to write byDchar anymore. That's understandable and perfectly reasonable.

The problem is of course that it doesn't give you correct results. Most of the time what you really want is graphemes; dchar just happens to be a good approximation that works most of the time.

Iterating by grapheme is somewhat problematic, and it degrades performance. Same for comparing graphemes for normalized equivalence. That's all true. I'm not too sure what we can do about that. It can be optimized, but it's very understandable that some people won't be satisfied by the performance and will want to avoid graphemes.

Speaking of optimization, I do understand that iterating by grapheme using the range interface won't give you the best performance. It's certainly convenient as it enables the reuse of existing algorithms with graphemes, but more specialized algorithms and interfaces might be more suited.

One observation I made with dchar as the default element type is that not all algorithms really need to deal with dchar. If I'm searching for code point 'a' in a UTF-8 string, decoding code units into code points is a waste of time. Why? Because the only way to represent code point 'a' in UTF-8 is the code unit 'a'. And almost the same optimization applies to graphemes: if you're searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you have to do is search for the UTF-8 code unit 'a', then check whether that code unit is followed by a combining mark code point to confirm it really is an 'a' and not part of a composed grapheme. Iterating the string by code unit is enough for these cases, and it'd increase performance by a lot.
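A rough D sketch of that search, just to make the idea concrete (the combining test is the same simplified four-range check mentioned earlier in the thread, not the real property lookup):

import std.utf : decode;

bool isCombining(dchar c)
{
    return (c >= 0x0300 && c <= 0x036F)
        || (c >= 0x1DC0 && c <= 0x1DFF)
        || (c >= 0x20D0 && c <= 0x20FF)
        || (c >= 0xFE20 && c <= 0xFE2F);
}

// Grapheme-aware search for an ASCII character in a UTF-8 string, scanning
// code units and decoding only when a candidate match is found.
// Returns the code-unit index of the match, or size_t.max if none.
size_t findAsciiGrapheme(string haystack, char needle)
{
    foreach (i; 0 .. haystack.length)
    {
        if (haystack[i] != needle) continue;   // cheap code-unit comparison
        size_t j = i + 1;                      // i is ASCII, so j starts a code point
        if (j >= haystack.length || !isCombining(decode(haystack, j)))
            return i;                          // a plain 'a', not 'a' + combining mark
    }
    return size_t.max;
}

unittest
{
    // the first 'a' carries a combining acute accent, so it is skipped
    assert(findAsciiGrapheme("a\u0301bca", 'a') == 5);
}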

So making dchar the default element type is no doubt convenient, because it abstracts things enough that generic algorithms can work with strings, but it has a performance penalty that you don't always need. My example used UTF-8; it applies even more to UTF-16. And it applies to grapheme-aware manipulations too.

This penalty with generic algorithms comes from the fact that they take a predicate of the form "a == 'a'" or "a == b", which is ill-suited to strings because you always need to fully decode the string (by dchar or by grapheme) just to call the predicate. Given that comparing characters for anything other than equality or membership in a set is something you rarely do, generic algorithms miss a big optimization opportunity here.

- - -

So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific Unicode-aware algorithms should be used instead; they can share the same name if their usage is similar

Todo 2: to use a generic algorithm with a string, you must dress the string using one of toDchar, toGrapheme, or toCodeUnits; this way your intentions are clear (see the usage sketch after this list)

Todo 3: string-specific algorithms can be implemented as simple wrappers for generic algorithms with the string dressed correctly for the task, or they can implement more sophisticated algorithms to increase performance
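To illustrate the intent, usage would look something like this. Note that toDchar, toGrapheme and toCodeUnits are only the proposed adapters from Todo 2, so none of this exists yet and the fragment is not compilable as-is:

// hypothetical usage only -- none of these adapters exist yet:
auto g = find(s.toGrapheme(),  "e\u0301"); // grapheme-aware, correct by default
auto d = find(s.toDchar(),     'x');       // code-point level, when you know it's safe
auto u = find(s.toCodeUnits(), 'x');       // raw code units, fastest
// find(s, 'x');                           // naked string: rejected (Todo 1)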

There are two major benefits to this approach:

Benefit 1: if indeed you really don't want the performance penalty that comes with checking for composed graphemes, you can bypass it at some specific places in your code using byDchar, or you can disable it altogether by modifying the string-specific algorithms and recompiling Phobos.

Benefit 2: we don't have to rush to implementing graphemes in the Unicode-aware algorithms. Just make sure the interface for string-specific algorithms *can* accept graphemes, and we can roll out support for them at a later time once we have a decent implementation.

Also, all this leaves open the question of what to do when someone uses the string as a range. In my opinion, it should either iterate on code units (because the string is actually an array, and because that's what foreach does) or simply disallow iteration (asking that you dress the string first using toCodeUnits, toDchar, or toGrapheme).

Do you like that more?


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 17, 2011
On 01/15/2011 12:21 AM, Michel Fortin wrote:
> Also, it'd really help this discussion to have some hard numbers about
> the cost of decoding graphemes.

Text has a perf module that provides such numbers (for the different stages of Text object construction). The measured algorithms are not yet stabilised, so said numbers change regularly, but in the right direction ;-)
You can try the current version at https://bitbucket.org/denispir/denispir-d/src (the perf module is called chrono.d)

For information: recently, the cost of full text construction (decoding, normalisation (both decomposition & ordering), and piling) was about 5 times that of decoding alone, the heavy part (~70%) being piling. But Stephan just informed me about a new gain in piling that I have not yet tested.
This performance places our library between Windows' native tools and ICU in terms of speed, which is imo rather good for a brand new tool written in a still unstable language.

I have carefully read your arguments that Text's approach of systematically "piling" and normalising source texts is not the right one from an efficiency point of view, even for strict use cases of universal text manipulation (because the relative space cost would indirectly cause a time cost due to cache effects). Instead, you state we should "pile" and/or normalise on the fly. But, like you, I am rather doubtful on this point without any numbers available.
So, let us produce some benchmark results on both approaches if you like.


Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
> On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/15/11 9:25 PM, Jonathan M Davis wrote:
>>> Considering that strings are already dealt with specially in order to
>>> have an
>>> element of dchar, I wouldn't think that it would be all that
>>> distruptive to make
>>> it so that they had an element type of Grapheme instead. Wouldn't
>>> that then fix
>>> all of std.algorithm and the like without really disrupting anything?
>>
>> It would make everything related a lot (a TON) slower, and it would
>> break all client code that uses dchar as the element type, or is
>> otherwise unprepared to use Graphemes explicitly. There is no question
>> there will be disruption.
>
> I would have agreed with you last week. Now I understand that using
> dchar is just as useless for unicode as using char.
>
> Will it be slower? Perhaps. A TON slower? Probably not.
>
> But it will be correct. Correct and slow is better than incorrect and
> fast. If I showed you a shortest-path algorithm that ran in O(V) time,
> but didn't always find the shortest path, would you call it a success?
>
> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm in short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.
>
> -Steve

Hello Steve & Andrei,


I see two questions: (1) should we provide Unicode correctness by default or not (together with the related points about level of abstraction & normalisation)? (2) what is the best way to implement such correctness?
Let us put (1) aside for a while; in any case, nothing prevents us from experimenting while waiting for an agreement. Such experiments would in fact feed the debate with real facts instead of "airy" ideas.

It seems there are two opposite approaches to Unicode correctness. Mine was to build a type that systematically abstracts away the issues created by UCS (that real, whole characters are coded by mini-arrays of code points I call "code piles", that those piles have variable lengths, _and_ that characters may even have several representations). Then, in my wild guesses, every text manipulation method should obviously be "flash fast", actually faster than any on-the-fly algorithm by several orders of magnitude. But Michel has made me doubt that point.

The other approach is precisely to provide the needed abstraction ("piling" and normalisation) on the fly, as proposed by Michel, and as Objective-C does, IIUC. This way seems to me closer to a kind of redesign of Steven's new String type and/or Andrei's VLERange.

As you say, we need real timing numbers to decide. I think we should measure at least two routines (a rough benchmark skeleton follows the list):
* indexing (or better, iteration?), which only requires "piling"
* counting occurrences of a given character or slice, which requires both piling and normalisation
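Roughly, the harness could look like the sketch below. It is only a skeleton: the timing plumbing assumes Phobos's std.datetime.stopwatch, and the two routine bodies are placeholders to be filled with the Text-based and the on-the-fly implementations respectively.

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.file  : readText;
import std.stdio : writefln;

// Run `routine` a fixed number of times and report the total elapsed time.
void bench(string name, void delegate() routine, size_t runs = 100)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. runs)
        routine();
    writefln("%-45s %6d ms for %d runs", name, sw.peek.total!"msecs", runs);
}

void main()
{
    auto text = readText("unicode.txt");   // the sample text mentioned below

    bench("iteration (piling only)", () {
        // walk `text` grapheme by grapheme
    });
    bench("count occurrences (piling + normalisation)", () {
        // count occurrences of a given character or slice, e.g. "e\u0301"
    });
}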

I do not feel like implementing such routines for the on-the-fly version, and have no time for this in the coming days; but if anyone volunteers, feel free to rip code and data from Text's current implementation if it helps.

As source text, we can use the one at https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt (already my source for perf measures). It has at least the merit of being a text (about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude --but then again, I know next to nothing about code performance.]


Denis
_________________
vita es estrany
spir.wikidot.com