January 17, 2011
On 1/17/11 10:34 AM, Michel Fortin wrote:
> On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>
>>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>>> No doubt it's easier to implement it that way. The problem is that in
>>>>> most cases it won't be used. How many people really know what is a
>>>>> grapheme?
>>>>
>>>> How many people really should care?
>>>
>>> I think the only people who should *not* care are those who have
>>> validated that the input does not contain any combining code point. If
>>> you know the input *can't* contain combining code points, then it's safe
>>> to ignore them.
>>
>> I agree. Now let me ask again: how many people really should care?
>
> As I said: all those people who are not validating the inputs to make
> sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text contains combining code points.

I have worked in NLP for years, and still do. I even worked on Arabic text (albeit Romanized). I work with Wikipedia. I use Unicode all the time, but I have yet to have trouble with a combining character. I was only vaguely aware of their existence up until this discussion, waved them away, and guess what - it worked for me.
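For concreteness, a minimal illustration of what a combining code point is (this assumes normalize and NFD as std.uni later provided; they were not in Phobos at the time of this thread):

    import std.uni : normalize, NFD;

    void main()
    {
        // 'é' can be spelled in two canonically equivalent ways; only the
        // second spelling involves a combining code point.
        string precomposed = "\u00E9";   // LATIN SMALL LETTER E WITH ACUTE
        string decomposed  = "e\u0301";  // 'e' + COMBINING ACUTE ACCENT

        assert(precomposed != decomposed);                 // different code points
        assert(normalize!NFD(precomposed) == decomposed);  // same "character"
    }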

It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes. Even NSString exposes the underlying UTF-16 encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the cases where graphemes are important), you need font information. And - surprise! - some fonts do NOT support combining characters and print the signs next to one another instead of superimposing them, so the "wrong" method of counting characters is more informative.

> As far as I know, no one
> is doing that, so that means everybody should use algorithms capable of
> handling multi-code-point graphemes. If someone indeed is doing this
> validation, he'll probably also be smart enough to make his algorithms
> to work with dchars.

I am not sure everybody should use graphemes.

> That said, no one should really have to care but those who implement the
> string manipulation functions. The idea behind making the grapheme the
> element type is to make it easier to write grapheme-aware string
> manipulation functions, even if you don't know about graphemes. But the
> reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.

> - - -
>
> I gave some thought about all this, and came to an interesting
> realizations that made me refine the proposal. The new proposal is
> disruptive perhaps as much as the first, but in a different way.
>
> But first, let's state a few facts to reframe the current discussion:
>
> Fact 1: most people don't know Unicode very well
> Fact 2: most people are confused by code units, code points, graphemes,
> and what is a 'character'
> Fact 3: most people won't bother with all this, they'll just use the
> basic language facilities and assume everything works correctly if it
> works correctly for them

Nice :o).

> Now, let's define two goals:
>
> Goal 1: make most people's string operations work correctly
> Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
> sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".

>  From the above 3 facts, we can deduce that a user won't want to bother
> with byDchar, byGrapheme, or byWhatever when using algorithms. You
> were annoyed by having to write byDchar everywhere, so you changed the
> element type to always be dchar and you don't have to write byDchar
> anymore. That's understandable and perfectly reasonable.
>
> The problem is of course that it doesn't give you correct results. Most
> of the time what you really want is to use graphemes; dchar just happens
> to be a good approximation of that, one that works most of the time.

Again, it's a matter of tradeoffs. I chose dchar because char was plain _wrong_ most of the time, not because char was a pretty darn good approximation that worked for most people most of the time. The fact remains that dchar _is_ a pretty darn good approximation that also has pretty darn good speed. So I'd say that I _still_ want to use dchar most of the time.

Committing to graphemes would complicate APIs for _everyone_ and would make things slower for _everyone_ for the sake of combining characters that _never_ occur in _most_ people's text. This is bad design, pure and simple. A good design is to cater for the majority and provide dedicated APIs for the few.

> Iterating by grapheme is somewhat problematic, and it degrades
> performance.

Yes.

> Same for comparing graphemes for normalized equivalence.

Yes, although I think you can optimize code such that comparing two strings wholesale only has a few more comparisons on the critical path. That would still be slower, but not as slow as iterating by grapheme in a naive implementation.
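Roughly, something like this hypothetical helper (not a Phobos API; normalize and NFD are assumed as in later std.uni), where the extra cost on the critical path is a couple of cheap checks:

    import std.algorithm : all;
    import std.ascii : isASCII;
    import std.uni : normalize, NFD;

    bool graphemeAwareEqual(string a, string b)
    {
        // Fast paths on the critical path: identical bytes are always equal,
        // and two pure-ASCII strings cannot be canonically equivalent unless
        // they are byte-identical.
        if (a == b) return true;
        if (a.all!isASCII && b.all!isASCII) return false;

        // Slow path: compare canonical (NFD) normalizations.
        return normalize!NFD(a) == normalize!NFD(b);
    }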

> That's all true. I'm not too sure what we can do about that. It can be
> optimized, but it's very understandable that some people won't be
> satisfied by the performance and will want to avoid graphemes.

I agree.

> Speaking of optimization, I do understand that iterating by grapheme
> using the range interface won't give you the best performance. It's
> certainly convenient as it enables the reuse of existing algorithms with
> graphemes, but more specialized algorithms and interfaces might be more
> suited.

Even the specialized algorithms will be significantly slower.

> One observation I made with having dchar as the default element type is
> that not all algorithms really need to deal with dchar. If I'm searching
> for code point 'a' in a UTF-8 string, decoding code units into code
> points is a waste of time. Why? because the only way to represent code
> point 'a' is by having code point 'a'.

Right. That's why many algorithms in std are specialized for such cases.

> And guess what? Almost the same
> optimization can apply to graphemes: if you're searching for 'a' in a
> grapheme-aware manner in a UTF-8 string, all you have to do is search
> for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
> by a combining mark code point to confirm it is really an 'a', not a
> composed grapheme. Iterating the string by code unit is enough for these
> cases, and it'd increase performance by a lot.

Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.

I'm not saying that we shouldn't implement the correct operations! I'm just not convinced they should be the default.
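For the single-code-point case Michel describes, the optimization could look roughly like this (findGraphemeA is a hypothetical helper; isMark is assumed as in later std.uni, decode is std.utf):

    import std.uni : isMark;
    import std.utf : decode;

    // Index of the first code unit of a plain 'a' grapheme (an 'a' not
    // followed by a combining mark), or size_t.max if there is none.
    size_t findGraphemeA(string haystack)
    {
        foreach (i, char c; haystack)       // scan code units, no decoding
        {
            if (c != 'a') continue;
            size_t next = i + 1;
            if (next == haystack.length)
                return i;                   // 'a' at the very end
            if (!isMark(decode(haystack, next)))
                return i;                   // not part of a composed grapheme
        }
        return size_t.max;
    }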

> So making dchar the default type is no doubt convenient because it
> abstracts things enough so that generic algorithms can work with
> strings, but it has a performance penalty that you don't always need. I
> made an example using UTF-8, it applies even more to UTF-16. And it
> applies to grapheme-aware manipulations too.

It is true that UTF manipulation incurs overhead. The tradeoff has many dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not sufficient for most people, the UTF decoding overhead is not that high... it's difficult to find the sweetest spot.

> This penalty with generic algorithms comes from the fact that they take
> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
> strings because you always need to fully decode the string (by dchar or
> by graphemes) for the purpose of calling the predicate. Given that
> comparing characters for something other than equality or membership in
> a set is very rarely something you do, generic algorithms miss a big
> optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.

> - - -
>
> So here's what I think we should do:
>
> Todo 1: disallow generic algorithms on naked strings: string-specific
> Unicode-aware algorithms should be used instead; they can share the same
> name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.

> Todo 2: to use a generic algorithm with a string, you must dress the
> string using one of toDchar, toGrapheme, toCodeUnits; this way your
> intentions are clear

Breaks a lot of existing code. Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they are implicitly two things at the same time. Asking for a representation should be explicit.
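For what it's worth, the usage pattern such explicit dressings would give looks roughly like this; the sketch below uses the byCodeUnit/byGrapheme names that later versions of Phobos actually adopted rather than Michel's toCodeUnits/toGrapheme spellings:

    import std.range.primitives : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "noe\u0308l";   // "noël" with a combining diaeresis

        auto units     = s.byCodeUnit.walkLength;  // 6 UTF-8 code units
        auto points    = s.walkLength;             // 5 code points
        auto graphemes = s.byGrapheme.walkLength;  // 4 graphemes
    }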

> Todo 3: string-specific algorithms can be implemented as simple wrappers
> for generic algorithms with the string dressed correctly for the task,
> or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to.

This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)

> There's two major benefits to this approach:
>
> Benefit 1: if indeed you really don't want the performance penalty that
> comes with checking for composed graphemes, you can bypass it at some
> specific places in your code using byDchar, or you can disable it
> altogether by modifying the string-specific algorithms and recompiling
> Phobos.
>
> Benefit 2: we don't have to rush to implementing graphemes in the
> Unicode-aware algorithms. Just make sure the interface for
> string-specific algorithms *can* accept graphemes, and we can roll out
> support for them at a later time once we have a decent implementation.

I'm not seeing the drawbacks. It hurts everyone for the sake of a few, breaks existing code, makes all string processing a mess, and would-be users will throw their hands in the air at the simplest examples - but we'll have the satisfaction of high-fiving one another, telling ourselves that we did the right thing.

> Also, all this is leaving the question open as to what to do when
> someone uses the string as a range. In my opinion, it should either
> iterate on code units (because the string is actually an array, and
> because that's what foreach does) or simply disallow iteration (asking
> that you dress the string first using toCodeUnit, toDchar, or toGrapheme).
>
> Do you like that more?

This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.


Andrei
January 17, 2011
On 1/17/11 10:55 AM, spir wrote:
> On 01/15/2011 12:21 AM, Michel Fortin wrote:
>> Also, it'd really help this discussion to have some hard numbers about
>> the cost of decoding graphemes.
>
> Text has a perf module that provides such numbers (on different stages
> of Text object construction) (but the measured algos are not yet
> stabilised, so that said numbers regularly change, but in the right
> sense ;-)
> You can try the current version at
> https://bitbucket.org/denispir/denispir-d/src (the perf module is called
> chrono.d)
>
> For information, recently, the cost of full text construction: decoding,
> normalisation (both decomp & ordering), piling, was about 5 times
> decoding alone. The heavy part (~ 70%) being piling. But Stephan just
> informed me about a new gain in piling I have not yet tested.
> This performance places our library in-between Windows native tools and
> ICU in terms of speed. Which is imo rather good for a brand new tool
> written in a still unstable language.
>
> I have carefully read your arguments on Text's approach to
> systematically "pile" and normalise source texts not being the right
> one from an efficiency point of view. Even for strict use cases of
> universal text manipulation (because the relative space cost would
> indirectly cause time cost due to cache effects). Instead, you state we
> should "pile" and/or normalise on the fly. But I am, similarly to you,
> rather doubtful on this point without any numbers available.
> So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my expectation; UTF adds for certain primitives up to 3x overhead compared to ASCII, and I expect combining character handling to bring about as much on top of that.

Your work and Steve's won't go to waste; one way or another we need to add grapheme-based processing to D. I think it would be great if later on a Phobos submission was made.


Andrei
January 17, 2011
On 01/17/2011 05:34 PM, Michel Fortin wrote:
> As I said: all those people who are not validating the inputs to make
> sure they don't contain combining code points. As far as I know, no one
> is doing that, so that means everybody should use algorithms capable of
> handling multi-code-point graphemes. If someone indeed is doing this
> validation, he'll probably also be smart enough to make his algorithms
> to work with dchars.

Actually, there are at least 2 special cases:
* apps that only deal with pre-Unicode source stuff
* apps that only deal with source stuff "mechanically" generated by text-producing software which itself guarantees single-code-only graphemes

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
> On 1/17/11 10:55 AM, spir wrote:
>> On 01/15/2011 12:21 AM, Michel Fortin wrote:
>>> Also, it'd really help this discussion to have some hard numbers about
>>> the cost of decoding graphemes.
>>
>> Text has a perf module that provides such numbers (on different stages
>> of Text object construction) (but the measured algos are not yet
>> stabilised, so that said numbers regularly change, but in the right
>> sense ;-)
>> You can try the current version at
>> https://bitbucket.org/denispir/denispir-d/src (the perf module is called
>> chrono.d)
>>
>> For information, recently, the cost of full text construction: decoding,
>> normalisation (both decomp & ordering), piling, was about 5 times
>> decoding alone. The heavy part (~ 70%) being piling. But Stephan just
>> informed me about a new gain in piling I have not yet tested.
>> This performance places our library in-between Windows native tools and
>> ICU in terms of speed. Which is imo rather good for a brand new tool
>> written in a still unstable language.
>>
>> I have carefully read your arguments on Text's approach to
>> systematically "pile" and normalise source texts not being the right
>> one from an efficiency point of view. Even for strict use cases of
>> universal text manipulation (because the relative space cost would
>> indirectly cause time cost due to cache effects). Instead, you state we
>> should "pile" and/or normalise on the fly. But I am, similarly to you,
>> rather doubtful on this point without any numbers available.
>> So, let us produce some benchmark results on both approaches if you like.
>
> Congrats on this great work. The initial numbers are in keeping with my
> expectation; UTF adds for certain primitives up to 3x overhead compared
> to ASCII, and I expect combining character handling to bring about as
> much on top of that.
>
> Your work and Steve's won't go to waste; one way or another we need to
> add grapheme-based processing to D. I think it would be great if later
> on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly the interface, when you have time for that (no hurry) at https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type considering true characters as natural elements.
* It is a textual type providing a client interface of common text manipulation methods similar to ones in common high-level languages.
(including the fact that a character is a singleton string)
* The repo also holds the main module (unicodedata) of Text's sister lib (dunicode), providing access to various unicode algos and data.
(We are about to merge the 2 libs into a new repository.)

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm on short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing
> combining characters to write benchmarks. (You do need it for testing
> purposes.) Most text won't contain combining characters anyway, so after
> you implement graphemes, just benchmark them on regular text.

Correct. For this reason, we do not use the same source at all for correctness and performance testing.
It is impossible to define a typical or representative source (who judges?). But at the very minimum, source texts for perf measurement should mix languages as diverse as possible, including some material from those known to be problematic and/or atypical (English, Korean, Hebrew...).
The following (ripped and composed from ICU data sets) is just that: https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt

Content:
12 natural languages
34767 bytes = UTF-8 code units
--> 20133 code points
--> 22033 normal codes (NFD decomposed)
--> 19205 piles = true characters

Denis
_________________
vita es estrany
spir.wikidot.com

January 17, 2011
On 1/17/11 12:23 PM, spir wrote:
> Andrei, would you have a look at Text's current state, mainly
> the interface, when you have time for that (no hurry) at
> https://bitbucket.org/denispir/denispir-d/src
> It is actually a bit more than just a string type considering true
> characters as natural elements.
> * It is a textual type providing a client interface of common text
> manipulation methods similar to ones in common high-level languages.
> (including the fact that a character is a singleton string)
> * The repo also holds the main module (unicodedata) of Text's sister lib
> (dunicode), providing access to various unicode algos and data.
> (We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode. That being said, there are a few things I disagree about and I don't think it can be integrated into Phobos. One thing is that it looks a lot more like D1 code than D2. D2 code of this kind is automatically expected to play nice with the rest of Phobos (ranges and algorithms). As it is, the code is an island that implements its own algorithms (mostly by equivalent handwritten code).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but major efficiency issues. To be frank I think it's a disaster. I think a representation building on UTF strings directly is bound to be vastly better.

* 163: equality does what std.algorithm.equal does.

* 174: equality also does what std.algorithm.equal does (possibly with a custom pred)

* 189: TextException is unnecessary

* 340: Unless properly motivated, iteration with opApply is archaic and inefficient.

* 370: Why lose the information that the result is in fact a single Pile?

* 430, 456, 474: contains, indexOf, count and probably others should use generic algorithms, not duplicate them.

* 534: replace is std.array.replace

* 623: copy copies the piles shallowly (not sure if that's a problem)

As I mentioned before - why not focus on defining a Grapheme type (what you call Pile, but using UTF encoding) and defining a ByGrapheme range that iterates a UTF-encoded string by grapheme?
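Something along these lines, say; a bare-bones sketch of such a range that yields slices of the original UTF-8 string instead of dchar[] piles (ByGraphemeSlice/byGraphemeSlice are hypothetical names, and graphemeStride is assumed to report the code-unit length of the grapheme starting at a given index, as std.uni later provided):

    import std.uni : graphemeStride;

    struct ByGraphemeSlice
    {
        string str;

        @property bool empty() { return str.length == 0; }

        // Each element is a slice of the original string covering exactly
        // one grapheme, so nothing is decoded into dchar[] or copied.
        @property string front()
        {
            return str[0 .. graphemeStride(str, 0)];
        }

        void popFront()
        {
            str = str[graphemeStride(str, 0) .. $];
        }
    }

    ByGraphemeSlice byGraphemeSlice(string s) { return ByGraphemeSlice(s); }

With this, byGraphemeSlice("e\u0301toile\u0301") yields six slices (the first one being "e\u0301"), and input-range algorithms from std.algorithm work on it unchanged.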


Andrei
January 17, 2011
On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/17/11 10:34 AM, Michel Fortin wrote:
>> As I said: all those people who are not validating the inputs to make
>> sure they don't contain combining code points.
> 
> The question (which I see you keep on dodging :o)) is how much text contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as Unicode support improves in operating systems and applications. Shouldn't we future-proof Phobos?


> It does not serve us well to rigidly claim that the only good way of doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


> Even NSString exposes the underlying UTF-16 encoding and provides dedicated functions for grapheme-based processing. For one thing, if you care about the width of a word in printed text (one of the cases where graphemes are important), you need font information. And - surprise! - some fonts do NOT support combining characters and print the signs next to one another instead of superimposing them, so the "wrong" method of counting characters is more informative.

Generally what OS X does in those cases is display that character in another font. That said, counting graphemes is never a good way to tell how much space some text will take (unless the application enforces a fixed width per grapheme). It's more useful for telling the number of characters in a text document, similar to a word count.


>> That said, no one should really have to care but those who implement the
>> string manipulation functions. The idea behind making the grapheme the
>> element type is to make it easier to write grapheme-aware string
>> manipulation functions, even if you don't know about graphemes. But the
>> reality is probably more mixed than that.
> 
> The reality is indeed more mixed. Inevitably at some point the API needs to answer the question: "what is the first character of this string?" Transparency is not possible. You break all string code out there.

I'm not sure what you mean by that.


>> - - -
>> 
>> I gave some thought about all this, and came to an interesting
>> realizations that made me refine the proposal. The new proposal is
>> disruptive perhaps as much as the first, but in a different way.
>> 
>> But first, let's state a few facts to reframe the current discussion:
>> 
>> Fact 1: most people don't know Unicode very well
>> Fact 2: most people are confused by code units, code points, graphemes,
>> and what is a 'character'
>> Fact 3: most people won't bother with all this, they'll just use the
>> basic language facilities and assume everything works correctly if it
>> works correctly for them
> 
> Nice :o).
> 
>> Now, let's define two goals:
>> 
>> Goal 1: make most people's string operations work correctly
>> Goal 2: make most people's string operations work fast
> 
> Goal 3: don't break all existing code
> Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


>> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
>> sure we agree on this, but let's continue.
> 
> I think we disagree about what "most" means. For you it means "people who don't understand Unicode well but deal with combining characters anyway". For me it's "the largest percentage of D users across various writing systems".

It's not just D users, it's also the users of programs written by D users. I can't count how many times I've seen accented characters mishandled on websites and elsewhere, and I have an aversion to doing the same thing to people of other cultures and languages.

If the operating system supports combining marks, users have an expectation that applications running on it will deal with them correctly too, and they'll (rightfully) blame your application if it doesn't work. Same for websites.

I understand that in some situations you don't want to deal with graphemes even if you theoretically should, but I don't think it should be the default.


>> One observation I made with having dchar as the default element type is
>> that not all algorithms really need to deal with dchar. If I'm searching
>> for code point 'a' in a UTF-8 string, decoding code units into code
>> points is a waste of time. Why? because the only way to represent code
>> point 'a' is by having code point 'a'.
> 
> Right. That's why many algorithms in std are specialized for such cases.
> 
>> And guess what? Almost the same
>> optimization can apply to graphemes: if you're searching for 'a' in a
>> grapheme-aware manner in a UTF-8 string, all you have to do is search
>> for the UTF-8 code unit 'a', then check if the 'a' code unit is followed
>> by a combining mark code point to confirm it is really an 'a', not a
>> composed grapheme. Iterating the string by code unit is enough for these
>> cases, and it'd increase performance by a lot.
> 
> Unfortunately it all breaks as soon as you go beyond one code point. You can't search efficiently, you can't compare efficiently. Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 haystack; I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a regular expression operating at code-unit level. With that you can search efficiently for the needle directly in code units without having to decode and/or normalize the whole haystack.
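In that spirit, a very small (admittedly naive) sketch: normalize only the needle, in both canonical forms, and search the haystack's bytes for either spelling. containsEquivalent is a hypothetical helper; this is only correct if the haystack itself is consistently normalized, and a real implementation would still check grapheme boundaries around the hit:

    import std.algorithm : canFind;
    import std.uni : normalize, NFC, NFD;

    bool containsEquivalent(string haystack, string needle)
    {
        // Normalize only the (short) needle, in both canonical forms, then
        // do plain substring searches; the haystack is never decoded or
        // normalized wholesale.
        return haystack.canFind(normalize!NFC(needle))
            || haystack.canFind(normalize!NFD(needle));
    }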


>> This penalty with generic algorithms comes from the fact that they take
>> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
>> strings because you always need to fully decode the string (by dchar or
>> by graphemes) for the purpose of calling the predicate. Given that
>> comparing characters for something other than equality or membership in
>> a set is very rarely something you do, generic algorithms miss a big
>> optimization opportunity here.
> 
> How can we improve that? You can't argue for an inefficient scheme just because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable predicates? You can't implement the above optimization if you can't guarantee the predicate is "==". That said, perhaps we can detect "==" and only apply the optimization then.

Being able to specify the predicate doesn't gain you much for strings, because a < 'a' doesn't make much sense. All you need to check for is equality with some value or membership of a given character set, both of which can use the optimization above.


>> So here's what I think we should do:
>> 
>> Todo 1: disallow generic algorithms on naked strings: string-specific
>> Unicode-aware algorithms should be used instead; they can share the same
>> name if their usage is similar
> 
> I don't understand this. We already do this, and by "Unicode-aware" we understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not have made it very clear either).

The problem I see currently is that you rely on dchar being the element type. That should be an implementation detail, not something client code can see or rely on. By making it an implementation detail, you can later make grapheme-aware algorithms the default without changing the API. Since you're the gatekeeper to Phobos, you can make this change conditional on getting an acceptable level of performance out of the grapheme-aware algorithms, or on other factors like the amount of combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using dchar) but not commit indefinitely to it. Someone who really wants to work in code points can use toDchar, someone who wants to deal with graphemes uses toGraphemes, and someone who doesn't care won't choose anything and gets the default compromise behaviour.

All you need to do for this is document it, and try to make sure the string APIs don't force the implementation to work with code points.
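One concrete way to do that is to report positions as code-unit offsets into the original string, which stay valid whether the implementation matched by dchar or by grapheme. A hypothetical sketch (indexOfText is not a Phobos function; the body is a naive placeholder):

    import std.string : indexOf;

    // Returns a code-unit offset into `haystack`, or -1 if not found. The
    // same signature could later be backed by a grapheme-aware
    // implementation without any client-visible change.
    ptrdiff_t indexOfText(string haystack, string needle)
    {
        return haystack.indexOf(needle);   // naive placeholder body
    }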


>> Todo 2: to use a generic algorithm with a string, you must dress the
>> string using one of toDchar, toGrapheme, toCodeUnits; this way your
>> intentions are clear
> 
> Breaks a lot of existing code.
> 
> Won't fly with Walter unless it solves world hunger. Nevertheless I have to say I like it; the #1 thing I'd change about the built-in strings is that they implicitly are two things at the same time. Asking for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I tried to explain above: if you want to be sure you're working with graphemes or dchar, say it.

Also, it said nothing about iteration or foreach, so I'm not sure why it wouldn't fly with Walter. It can stay as it is, except for one thing: you and Walter should really get on the same wavelength regarding ElementType!(char[]) and foreach(c; string). I don't care that much which is the default, but they absolutely need to be the same.


>> Todo 3: string-specific algorithms can be implemented as simple wrappers
>> for generic algorithms with the string dressed correctly for the task,
>> or they can implement more sophisticated algorithms to increase performance
> 
> One thing I like about the current scheme is that all bidirectional-range algorithms work out of the box with all strings, and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you should restrict the algorithms that work out of the box to those which can also work with graphemes. This way you can change the behaviour in the future and support graphemes by a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work with toDchar.

So what doesn't work with graphemes? Predicates such as "a < b" for instance. That's pretty much it.


> This will have trouble passing Walter's wanking test. Mine too; every time I need to write a bunch of forwarding functions, that's a signal something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but we implement it using dchar for the time being. Some function signatures might have to differ a bit.


>> Do you like that more?
> 
> This is not about liking. I like doing the right thing as much as you do, and I think Phobos shows that. Clearly doing the right thing through and through is handling combining characters appropriately. The problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that everything be done with graphemes. I'm not even asking that anything be done with graphemes by default. I'm only asking that we keep the API clean enough so we can switch to graphemes by default in the future without having to rewrite all the code everywhere to use byGrapheme. If this isn't the right balance, I don't know what is.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 17, 2011
On 1/17/11 2:29 PM, Michel Fortin wrote:
> The problem I see currently is that you rely on dchar being the element
> type. That should be an implementation detail, not something client code
> can see or rely on.

But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!!

SuperDuperText txt;
auto c = giveMeTheFirstCharacter(txt);

What is the type of c? That is visible to the client!
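For concreteness, the three candidate answers, sketched with the byCodeUnit/byGrapheme helpers later Phobos versions provide (Grapheme being whatever element type the grapheme range yields):

    import std.range.primitives : front;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    void main()
    {
        string s = "e\u0301tude";         // "étude" with a decomposed 'é'

        auto unit  = s.byCodeUnit.front;  // char: the UTF-8 code unit 'e'
        auto point = s.front;             // dchar: the code point 'e'
        auto graph = s.byGrapheme.front;  // Grapheme: 'e' + U+0301 together
    }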


Andrei
January 17, 2011
On 2011-01-17 15:49:26 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> On 1/17/11 2:29 PM, Michel Fortin wrote:
>> The problem I see currently is that you rely on dchar being the element
>> type. That should be an implementation detail, not something client code
>> can see or rely on.
> 
> But at some point you must be able to talk about individual characters in a text. It can't be something that client code doesn't see!!!

It seems that it can. NSString only exposes individual UTF-16 code units directly (or semi-directly via an accessor method), even though searching and comparing is grapheme-aware. I'm not saying it's a good design, but it certainly can work in practice.

In any case, I didn't mean to say the client code shouldn't be aware of the characters in a string. I meant that the client shouldn't assume the algorithm works at the same layer as ElementType!(string) for a given string type. Even if ElementType!(string) is dchar, the default function you get if you don't use any of toCodeUnit, toDchar, or toGrapheme can work at the dchar or grapheme level if it makes more sense that way.

In other words, the client says: "I have two strings, compare them!" The client didn't specify if they should be compared by char, wchar, dchar, or by normalized grapheme; so we do what's sensible. That's what I call the 'default' string functions, those you get when you don't ask for anything specific. They should have a signature making them able to work at the grapheme level, even though they might not for practical reasons (performance). This way if it becomes more important or practical to support graphemes, it's easy to evolve to them.


> SuperDuperText txt;
> auto c = giveMeTheFirstCharacter(txt);
> 
> What is the type of c? That is visible to the client!

That depends on how you implement the giveMeTheFirstCharacter function. :-)

More seriously, you have four choice:

1. code unit
2. code point
3. grapheme
4. require the client to state explicitly which kind of 'character' he wants; 'character' being an overloaded word, it's reasonable to ask for disambiguation.

You and Walter can't seem to agree on 1 versus 2, regarding foreach and ranges. To keep things consistent with what I said above I'd tend to say 4, but that's weird for something that looks like an array. My second choice goes for 1 when it comes to consistency, 3 when it comes to correctness, and 2 when it comes to being practical.

Given something is going to be inconsistent either way, I'd say any of the above is acceptable. But please make sure you and Walter agree on the default element type for ranges and foreach.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

January 17, 2011
On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
> On 1/17/11 12:23 PM, spir wrote:
>> Andrei, would you have a look at Text's current state, mainly
>> the interface, when you have time for that (no hurry) at
>> https://bitbucket.org/denispir/denispir-d/src
>> It is actually a bit more than just a string type considering true
>> characters as natural elements.
>> * It is a textual type providing a client interface of common text
>> manipulation methods similar to ones in common high-level languages.
>> (including the fact that a character is a singleton string)
>> * The repo also holds the main module (unicodedata) of Text's sister lib
>> (dunicode), providing access to various unicode algos and data.
>> (We are about to merge the 2 libs into a new repository.)
>
> I think this is solid work that reveals good understanding of Unicode.
> That being said, there are a few things I disagree about and I don't
> think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers did -- but we just discovered it.)

> One thing is that it looks a lot
> more like D1 code than D2. D2 code of this kind is automatically
> expected to play nice with the rest of Phobos (ranges and algorithms).
> As it is, the code is an island that implements its own algorithms
> (mostly by equivalent handwritten code).

Right. We initially wanted precisely that: to let it play nicely with the rest of the new Phobos. This mainly meant providing a range interface, which also gives access to std.algorithm routines. But we were blocked by current bugs related to ranges. I have posted about those issues (you may remember having replied to that post).

> In detail:
>
> * Line 130: representing a text as a dchar[][] has its advantages but
> major efficiency issues. To be frank I think it's a disaster. I think a
> representation building on UTF strings directly is bound to be vastly
> better.

I don't understand your point. Where is the difference with D's builtin types, then?

Also, which efficiency issue do you mean? On the cost of text object construction we do agree, and I have given some data. But this happens only once; it is an investment intended to provide correctness first, and efficiency for _every_ operation on the constructed text.
As for the speed of such methods/algorithms operating _correctly_ on universal text: precisely because there is no alternative to Text (yet), there are also no performance data available to judge by.

(What about comparing Objective-C's NSString to Text's current performance for indexing, slicing, searching, counting,...? Even in its current experimental stage, I bet it would not be ridiculous, rather the opposite. But I may be completely wrong.)

> * 163: equality does what std.algorithm.equal does.
>
> * 174: equality also does what std.algorithm.equal does (possibly with a
> custom pred)

Right, these are unimportant tool functions at the "pile" level. (Initially introduced because the built-in "==" showed strange inefficiency in our case. We may test again later.)

> * 189: TextException is unnecessary

Agreed.

> * 340: Unless properly motivated, iteration with opApply is archaic and
> inefficient.

See the range bug mentioned above. opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
	foreach(i, char ; text) {...}

> * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used to internally represent a true character, while Text is the important thing.
At one time we had to choose whether or not to make Pile a prominently exposed type as well. I chose (after some exchange on the topic) not to, for a few reasons:
* Simplicity: one type does the whole job well.
* Avoid confusion due to conflict with historic string types whose elements (codes = characters) were atomic thingies. This was also a reason not to name it simply "Character"; "Pile" for me was supposed to evoke the technical side rather than the meaningful side.
* Lightness of the interface: if we expose Pile prominently, then we need to duplicate all methods that may take or return a single character, like searching, counting, replacing etc., and also possibly indexing and iteration.

In fact, the resulting interface is more or less like a string type in high-level languages such as Python; with the motivating difference that it operates correctly on universal text.

Now, it seems you rather expect, maybe, the character/pile type to be the important thing and Text to just be a sequence of them? (possibly even unnecessary to be defined formally)

> * 430, 456, 474: contains, indexOf, count and probably others should use
> generic algorithms, not duplicate them.
>
> * 534: replace is std.array.replace

I had to write the algorithms because most of those in std.algorithm require a range interface, IIUC; and also for testing purposes.

> * 623: copy copies the piles shallowly (not sure if that's a problem)

I had the same question.

> As I mentioned before - why not focus on defining a Grapheme type (what
> you call Pile, but using UTF encoding) and defining a ByGrapheme range
> that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. It seems to me that Text as it is provides clients with an interface as simple and clear as possible, while operating correctly in the background.

It seems that if you just build a ByGrapheme iterator, then you have no choice but to abstract on the fly (constructing piles on the fly for operations like indexing, and normalising them in addition for searching, counting...).
As I said in other posts, this may be the right thing to do from an efficiency point of view, but this remains to be proven. In fact I bet the opposite: that -- with the same implementation language and the same investment in optimisation -- the approach of defining a true textual type like Text is inevitably more efficient by orders of magnitude (*). Again, Text's initial construction cost is an investment. Prove me wrong (**).

> Andrei

Denis


(*) Except, probably, for the choice of making the ElementType a singleton Text (which seems costly).
(**) I'm now aware of the heavy speed loss Text certainly suffers from representing characters as mini-arrays, but I guess it is only marginally relevant compared to the gain of not piling and normalising for every operation.
_________________
vita es estrany
spir.wikidot.com