January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm on short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.

Oh, one more thing. You don't need a lot of Unicode text containing 
combining characters to write benchmarks. (You do need it for testing 
purposes.) Most text won't contain combining characters anyway, so after 
you implement graphemes, just benchmark them on regular text.

Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
>> Moreover, even if you ignore Hebrew as a tiny insignificant minority,
>> you cannot do the same for Arabic, which has over one *billion* people
>> using that language.
>
> I hope that the medium type works 'good enough' for those languages,
> with the high level type needed for advanced usages.  At a minimum,
> comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, which is supposedly 
written in a given natural language (as indicated, for instance, by an 
HTML header), does not also hold terms from other languages? There are 
various occasions for this: quotations, foreign loanwords, pointers...

A side issue is raised by precomposed codes for composite characters. 
For most languages of the world, I guess (but am unsure), all "official" 
characters have single-code representations. Good, but unfortunately 
this is not enforced by the standard (instead, the decomposed form can 
sensibly be considered the base form, but that is another topic).
So even if one knows for sure that all characters of all texts an 
app will ever deal with can be mapped to single codes, to be safe one 
would have to normalise to NFC anyway (Normalization Form C). Then, 
where is the actual gain? In fact, it is a loss, because NFC is more 
costly than NFD (Normalization Form D) --actually, the standard NFC algo 
first decomposes to NFD to get a unique representation that can 
then be more easily (re)composed via simple mappings.
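
For illustration, here is the relationship between the two forms 
(Python's stdlib normaliser is used here only as a convenient reference 
implementation; the thread's context is of course D):

```python
import unicodedata

decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT (the NFD spelling)
precomposed = "\u00e9"   # U+00E9, the precomposed 'e-acute' (the NFC spelling)

# The two forms render identically but differ as code sequences.
assert decomposed != precomposed

# NFC maps both spellings to the single precomposed code point...
assert unicodedata.normalize("NFC", decomposed) == precomposed
# ...while NFD maps both back to the decomposed sequence.
assert unicodedata.normalize("NFD", precomposed) == decomposed
```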

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard 
itself, but unfortunately they often refer to it.)

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Mon, 17 Jan 2011 10:00:57 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail@erdani.org> wrote:

> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm on short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing  
> combining characters to write benchmarks. (You do need it for testing  
> purposes.) Most text won't contain combining characters anyway, so after  
> you implement graphemes, just benchmark them on regular text.

True, benchmarking doesn't apply to combining characters because we have  
nothing to compare against.  The current scheme fails on them anyway, so  
by default the new one would be the best solution.

-Steve
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On Mon, 17 Jan 2011 10:14:19 -0500, spir <denis.spir@gmail.com> wrote:

> On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
>>> Moreover, even if you ignore Hebrew as a tiny insignificant minority,
>>> you cannot do the same for Arabic, which has over one *billion* people
>>> using that language.
>>
>> I hope that the medium type works 'good enough' for those languages,
>> with the high level type needed for advanced usages.  At a minimum,
>> comparison and substring should work for all languages.
>
> Hello Steven,
>
> How does an application know that a given text, which is supposedly  
> written in a given natural language (as indicated, for instance, by an  
> HTML header), does not also hold terms from other languages? There are  
> various occasions for this: quotations, foreign loanwords, pointers...
>
> A side issue is raised by precomposed codes for composite characters.  
> For most languages of the world, I guess (but am unsure), all "official"  
> characters have single-code representations. Good, but unfortunately  
> this is not enforced by the standard (instead, the decomposed form can  
> sensibly be considered the base form, but that is another topic).
> So even if one knows for sure that all characters of all texts an  
> app will ever deal with can be mapped to single codes, to be safe one  
> would have to normalise to NFC anyway (Normalization Form C). Then,  
> where is the actual gain? In fact, it is a loss, because NFC is more  
> costly than NFD (Normalization Form D) --actually, the standard NFC algo  
> first decomposes to NFD to get a unique representation that can  
> then be more easily (re)composed via simple mappings.
>
> For further information:
> Unicode's normalisation algos: http://unicode.org/reports/tr15/
> list of technical reports: http://unicode.org/reports/
> (Unicode's technical reports are far more readable than the standard  
> itself, but unfortunately they often refer to it.)

I'll reply to this to save you the trouble.  I have reversed my position  
since writing a lot of these posts.

In summary, I think strings should default to an element type of a  
grapheme, which should be implemented via a slice of the original data.   
Updated string type forthcoming.

-Steve
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/15/2011 09:46 PM, foobar wrote:
> I'd like to have full Unicode support. I think it is a good thing for D to have in order to expand in the world. As an alternative, I'd settle for loud errors that make it absolutely clear to the non-Unicode-expert programmer that D simply does NOT support e.g. normalization.
>
> As Spir already said, Unicode is something few understand, and even its own official docs do not explain such issues properly. We should not confuse users even further with incomplete support.

In a few days, D will have an external library able to deal with those 
issues, hopefully correctly and clearly for client programmers. 
Possibly its design is not the best possible approach (especially for 
efficiency: Michel has made me doubt that, and my competence in this 
field is close to nothing). But it has the merit of existing and of 
providing a clear example of the correct semantics. Let us use it as a 
base for experimentation.

Then everything can be redesigned from scratch if we realise I was 
initially completely wrong. In any case, it would certainly be a far 
easier and faster job to do now, after having explored the issues at 
length, and with a reference implementation at hand.

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/15/2011 11:45 PM, Michel Fortin wrote:
> That said, I'm sure if someone could redesign Unicode by breaking
> backward-compatibility we'd have something simpler. You could probably
> get rid of pre-combined characters and reduce the number of
> normalization forms. But would you be able to get rid of normalization
> entirely? I don't think so. Reinventing Unicode is probably not worth it.

I think like you about precomposed characters: they bring no real gain 
(even for easing the passage from historic charsets, since texts must be 
decoded anyway, and then mapping to single or multiple codes is trivial). 
But they complicate the design by proposing two parallel representation 
schemes (one character <--> one "code pile" versus one character <--> 
one precomposed code). And they impose much weight on software 
(and programmers) for correct indexing, slicing, comparison, search, 
count, etc. This is where normalisation forms enter the game.
My choice would be:
* decomposed form only
* ordering imposed by the standard at text-composition time
==> no normalisation, because everything is normalised from scratch.
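
That fixed ordering already exists inside NFD: combining marks are 
sorted by canonical combining class, so every canonically equivalent 
sequence gets exactly one spelling. A small illustration in Python 
(standard Unicode behaviour, not specific to any library discussed here):

```python
import unicodedata

# 'q' followed by COMBINING DOT ABOVE (class 230) and then
# COMBINING DOT BELOW (class 220), written in the "wrong" order:
s = "q\u0307\u0323"

# NFD sorts the marks by canonical combining class, so any
# equivalent input ends up with one fixed spelling.
assert unicodedata.normalize("NFD", s) == "q\u0323\u0307"
```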

Only what I call "piling" remains. But we cannot easily get rid of it 
--not without separators in the standard UTF encodings.
I had the idea of UTF-33 ;-): an alternative, freely agreed-upon encoding 
that just says (in addition to UTF-32) that the content is already 
normalised (NFD-decomposed and ordered): either produced that way 
initially or already processed. So software can happily read texts in 
and only worry about piling if needed. UTF-33+ would add "grapheme" 
separators (a costly solution in terms of space) to get rid of piling.
The aim is to avoid stupidly doing the same job multiple times on the 
same text.

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/14/2011 04:50 PM, Michel Fortin wrote:
>> This might be a good time to see whether we need to address graphemes
>> systematically. Could you please post a few links that would educate
>> me and others in the mysteries of combining characters?
>
> As usual, Wikipedia offers a good summary and a couple of references.
> Here's the part about combining characters:
> <http://en.wikipedia.org/wiki/Combining_character>.
>
> There's basically four ranges of code points which are combining:
> - Combining Diacritical Marks (0300–036F)
> - Combining Diacritical Marks Supplement (1DC0–1DFF)
> - Combining Diacritical Marks for Symbols (20D0–20FF)
> - Combining Half Marks (FE20–FE2F)
>
> A code point followed by one or more code points in these ranges is
> conceptually a single character (a grapheme).

Unfortunately, things are complicated by _prepended_ combining marks, 
which appear in a code sequence _before_ the base character.
The Unicode algorithm is described here: 
http://unicode.org/reports/tr29/ section 3 (humanly readable ;-). See 
especially the first table in section 3.1.

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/16/11 3:20 PM, Michel Fortin wrote:
>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> said:
>> 
>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>> No doubt it's easier to implement it that way. The problem is that in
>>>> most cases it won't be used. How many people really know what is a
>>>> grapheme?
>>> 
>>> How many people really should care?
>> 
>> I think the only people who should *not* care are those who have
>> validated that the input does not contain any combining code point. If
>> you know the input *can't* contain combining code points, then it's safe
>> to ignore them.
> 
> I agree. Now let me ask again: how many people really should care?

As I said: all those people who are not validating their inputs to make 
sure they don't contain combining code points. As far as I know, no one 
is doing that, so that means everybody should use algorithms capable of 
handling multi-code-point graphemes. If someone indeed is doing this 
validation, he'll probably also be smart enough to make his algorithms 
work with dchars.
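
That validation is cheap to express. A sketch in Python (the helper 
name is mine, not an API from the thread):

```python
import unicodedata

def has_combining(text):
    # True if any code point is a combining mark, i.e. the text may
    # contain multi-code-point graphemes and per-code-point (dchar-style)
    # iteration is not safe.
    return any(unicodedata.combining(ch) != 0 or
               unicodedata.category(ch).startswith("M")
               for ch in text)

assert not has_combining("plain ASCII text")
assert has_combining("e\u0301")      # 'e' + combining acute accent
```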

That said, no one should really have to care but those who implement 
the string manipulation functions. The idea behind making the grapheme 
the element type is to make it easier to write grapheme-aware string 
manipulation functions, even if you don't know about graphemes. But the 
reality is probably more mixed than that.

- - -

I gave some thought about all this, and came to an interesting 
realizations that made me refine the proposal. The new proposal is 
disruptive perhaps as much as the first, but in a different way.

But first, let's state a few facts to reframe the current discussion:

Fact 1: most people don't know Unicode very well
Fact 2: most people are confused by code units, code points, graphemes, 
and what a 'character' is
Fact 3: most people won't bother with all this; they'll just use the 
basic language facilities and assume everything works correctly if it 
works correctly for them

Now, let's define two goals:

Goal 1: make most people's string operations work correctly
Goal 2: make most people's string operations work fast

To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not 
sure we agree on this, but let's continue.

From the above three facts, we can deduce that a user won't want to 
bother with using byDchar, byGrapheme, or byWhatever when using 
algorithms. You were annoyed by having to write byDchar everywhere, so 
you changed the element type to always be dchar and you don't have to 
write byDchar anymore. That's understandable and perfectly reasonable.

The problem, of course, is that it doesn't give you correct results. 
Most of the time what you really want is to use graphemes; dchar just 
happens to be a good approximation that works most of the time.

Iterating by grapheme is somewhat problematic, and it degrades 
performance. Same for comparing graphemes for normalized equivalence. 
That's all true. I'm not too sure what we can do about that. It can be 
optimized, but it's very understandable that some people won't be 
satisfied by the performance and will want to avoid graphemes.

Speaking of optimization, I do understand that iterating by grapheme 
using the range interface won't give you the best performance. It's 
certainly convenient as it enables the reuse of existing algorithms 
with graphemes, but more specialized algorithms and interfaces might be 
more suited.

One observation I made with having dchar as the default element type is 
that not all algorithms really need to deal with dchar. If I'm 
searching for code point 'a' in a UTF-8 string, decoding code units 
into code points is a waste of time. Why? Because the only way to 
encode code point 'a' is as the single code unit 'a'. And guess what? 
Almost the same optimization applies to graphemes: if you're 
searching for 'a' in a grapheme-aware manner in a UTF-8 string, all you 
have to do is search for the UTF-8 code unit 'a', then check whether 
that code unit is followed by a combining mark code point to confirm it 
is really an 'a', not part of a composed grapheme. Iterating the string 
by code unit is enough for these cases, and it'd increase performance 
by a lot.
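
A sketch of this code-unit search, in Python over UTF-8 bytes (the 
function name and shape are mine, not an actual Phobos API; a D 
implementation would scan the char array the same way):

```python
import unicodedata

def find_ascii_grapheme(haystack: bytes, needle: str) -> int:
    # Find an ASCII character in UTF-8 bytes as a whole grapheme:
    # match the raw byte (an ASCII byte can never occur inside a UTF-8
    # multi-byte sequence), then reject the hit if the next code point
    # is a combining mark.
    b = ord(needle)          # needle is assumed ASCII
    i = haystack.find(b)
    while i >= 0:
        rest = haystack[i + 1:].decode("utf-8")
        if not rest or not unicodedata.combining(rest[0]):
            return i
        i = haystack.find(b, i + 1)
    return -1

text = "a\u0301ba".encode("utf-8")   # accented 'a', then 'b', then plain 'a'
assert find_ascii_grapheme(text, "a") == 4   # skips the accented 'a' at 0
```

Decoding the whole tail on every candidate hit is wasteful; a real 
implementation would only peek at the next code point, but the byte-level 
scan is the part that saves the full decode.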

So making dchar the default type is no doubt convenient because it 
abstracts things enough so that generic algorithms can work with 
strings, but it has a performance penalty that you don't always need. I 
made an example using UTF-8, it applies even more to UTF-16. And it 
applies to grapheme-aware manipulations too.

This penalty with generic algorithms comes from the fact that they take 
a predicate of the form "a == 'a'" or "a == b", which is ill-suited for 
strings because you always need to fully decode the string (by dchar or 
by grapheme) just to call the predicate. Given that comparing characters 
for anything other than equality or membership in a set is very rarely 
done, generic algorithms miss a big optimization opportunity here.

- - -

So here's what I think we should do:

Todo 1: disallow generic algorithms on naked strings: string-specific 
Unicode-aware algorithms should be used instead; they can share the 
same name if their usage is similar

Todo 2: to use a generic algorithm with a string, you must dress the 
string using one of toDchar, toGrapheme, toCodeUnits; this way your 
intentions are clear

Todo 3: string-specific algorithms can be implemented as simple wrappers 
around generic algorithms with the string dressed correctly for the 
task, or they can implement more sophisticated algorithms to increase 
performance

There are two major benefits to this approach:

Benefit 1: if indeed you really don't want the performance penalty that 
comes with checking for composed graphemes, you can bypass it at some 
specific places in your code using byDchar, or you can disable it 
altogether by modifying the string-specific algorithms and recompiling 
Phobos.

Benefit 2: we don't have to rush to implementing graphemes in the 
Unicode-aware algorithms. Just make sure the interface for 
string-specific algorithms *can* accept graphemes, and we can roll out 
support for them at a later time once we have a decent implementation.

Also, all this leaves open the question of what to do when someone uses 
the string as a range. In my opinion, it should either iterate on code 
units (because the string is actually an array, and because that's what 
foreach does) or simply disallow iteration (asking that you dress the 
string first using toCodeUnits, toDchar, or toGrapheme).

Do you like that more?


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/15/2011 12:21 AM, Michel Fortin wrote:
> Also, it'd really help this discussion to have some hard numbers about
> the cost of decoding graphemes.

Text has a perf module that provides such numbers (on the different 
stages of Text object construction). The measured algos are not yet 
stabilised, so those numbers change regularly -- but in the right 
direction ;-)
You can try the current version at 
https://bitbucket.org/denispir/denispir-d/src (the perf module is called 
chrono.d).

For information: recently, the cost of full text construction (decoding, 
normalisation (both decomposition & ordering), and piling) was about 5 
times that of decoding alone, the heavy part (~70%) being piling. But 
Stephan just informed me about a new gain in piling I have not yet tested.
This performance places our library between Windows native tools and 
ICU in terms of speed, which is imo rather good for a brand new tool 
written in a still-unstable language.

I have carefully read your arguments that Text's approach of 
systematically "piling" and normalising source texts is not the right 
one from an efficiency point of view, even for strict use cases of 
universal text manipulation (because the relative space cost would 
indirectly cause a time cost due to cache effects). Instead, you state 
we should "pile" and/or normalise on the fly. But I am, like you, 
rather doubtful on this point without any numbers available.
So, let us produce some benchmark results for both approaches if you like.


Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/17/2011 01:44 PM, Steven Schveighoffer wrote:
> On Sun, 16 Jan 2011 13:06:16 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 1/15/11 9:25 PM, Jonathan M Davis wrote:
>>> Considering that strings are already dealt with specially in order to
>>> have an
>>> element of dchar, I wouldn't think that it would be all that
>>> distruptive to make
>>> it so that they had an element type of Grapheme instead. Wouldn't
>>> that then fix
>>> all of std.algorithm and the like without really disrupting anything?
>>
>> It would make everything related a lot (a TON) slower, and it would
>> break all client code that uses dchar as the element type, or is
>> otherwise unprepared to use Graphemes explicitly. There is no question
>> there will be disruption.
>
> I would have agreed with you last week. Now I understand that using
> dchar is just as useless for unicode as using char.
>
> Will it be slower? Perhaps. A TON slower? Probably not.
>
> But it will be correct. Correct and slow is better than incorrect and
> fast. If I showed you a shortest-path algorithm that ran in O(V) time,
> but didn't always find the shortest path, would you call it a success?
>
> We need to get some real numbers together. I'll see what I can create
> for a type, but someone else needs to supply the input :) I'm on short
> supply of unicode data, and any attempts I've made to create some result
> in failure. I have one example of one composed character in this thread
> that I can cling to, but in order to supply some real numbers, we need a
> large amount of data.
>
> -Steve

Hello Steve & Andrei,


I see two questions: (1) should we provide Unicode correctness as a 
default or not (with the related points about level of abstraction & 
normalisation)? (2) what is the best way to implement such correctness?
Let us put aside (1) for a while; anyway, nothing prevents us from 
experimenting while waiting for an agreement. Such an experiment would 
in fact feed the debate with real facts instead of "airy" ideas.

It seems there are two opposite approaches to Unicode correctness. Mine 
was to build a type that systematically abstracts away UCS-created 
issues (that real whole characters are coded by mini-arrays of codes I 
call "code piles", that those piles have variable lengths, _and_ that 
characters may even have several representations). Then, in my wild 
guesses, every text manipulation method should obviously be "flash 
fast", actually faster than any on-the-fly algo by several orders of 
magnitude. But Michel has made me doubt that point.

The other approach is precisely to provide the needed abstraction 
("piling" and normalisation) on the fly, as proposed by Michel, and as 
Objective-C does, IIUC. This way seems to me closer to a kind of 
redesign of Steven's new String type and/or Andrei's VLERange.

As you say, we need real timing numbers to decide. I think we should 
measure at least 2 routines:
* indexing (or better iteration?) which only requires "piling"
* counting occurrences of a given character or slice, which requires 
both piling and normalisation
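
For the second routine, a reference semantics is easy to state in 
Python (my own naming; the stdlib normaliser stands in for Text's 
normalisation step, and the loop stands in for its piling):

```python
import unicodedata

def count_occurrences(text: str, target: str) -> int:
    # Count `target` as a full character: normalise both to NFD so that
    # equivalent spellings match, then pile base + following combining
    # marks into clusters before comparing.
    text = unicodedata.normalize("NFD", text)
    target = unicodedata.normalize("NFD", target)
    piles = []
    for ch in text:
        if piles and unicodedata.combining(ch):
            piles[-1] += ch
        else:
            piles.append(ch)
    return sum(1 for p in piles if p == target)

# Precomposed and decomposed 'e-acute' count as the same character:
assert count_occurrences("caf\u00e9 cafe\u0301", "\u00e9") == 2
```

Timing this against an on-the-fly variant (normalise/pile only around 
candidate matches) would give exactly the comparison discussed above.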

I do not feel like implementing such routines for the on-the-fly 
version, and have no time for this in the coming days; but if anyone 
volunteers, feel free to rip code and data from Text's current 
implementation if it may help.

As source text, we can use the one at 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt 
(already my source for perf measures). Its only merit is being a text 
(about Unicode!) in twelve rather different languages.

[My intuitive guess is that Michel is wrong by orders of magnitude --but 
again, I know next to nothing about code performance.]


Denis
_________________
vita es estrany
spir.wikidot.com