January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 10:34 AM, Michel Fortin wrote:
> On 2011-01-16 18:58:54 -0500, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 1/16/11 3:20 PM, Michel Fortin wrote:
>>> On 2011-01-16 14:29:04 -0500, Andrei Alexandrescu
>>> <SeeWebsiteForEmail@erdani.org> said:
>>>
>>>> On 1/15/11 10:45 PM, Michel Fortin wrote:
>>>>> No doubt it's easier to implement it that way. The problem is that in
>>>>> most cases it won't be used. How many people really know what a
>>>>> grapheme is?
>>>>
>>>> How many people really should care?
>>>
>>> I think the only people who should *not* care are those who have
>>> validated that the input does not contain any combining code point. If
>>> you know the input *can't* contain combining code points, then it's safe
>>> to ignore them.
>>
>> I agree. Now let me ask again: how many people really should care?
>
> As I said: all those people who are not validating the inputs to make
> sure they don't contain combining code points.

The question (which I see you keep on dodging :o)) is how much text 
contains combining code points.

I have worked in NLP for years, and still do. I even worked on Arabic 
text (albeit Romanized). I work with Wikipedia. I use Unicode all the 
time, but I have yet to run into trouble with a combining character. I 
was only vaguely aware of their existence up until this discussion, 
waved them away, and guess what - it worked for me.

It does not serve us well to rigidly claim that the only good way of 
doing anything Unicode is to care about graphemes. Even NSString exposes 
the underlying UTF-16 encoding and provides dedicated functions for 
grapheme-based processing. For one thing, if you care about the width of 
a word in printed text (one of the cases where graphemes are important), 
you need font information. And - surprise! - some fonts do NOT support 
combining characters and print the marks next to the base character 
instead of superimposing them, so the "wrong" method of counting 
characters is more informative.

> As far as I know, no one
> is doing that, so that means everybody should use algorithms capable of
> handling multi-code-point graphemes. If someone indeed is doing this
> validation, he'll probably also be smart enough to make his algorithms
> work with dchars.

I am not sure everybody should use graphemes.

> That said, no one should really have to care but those who implement the
> string manipulation functions. The idea behind making the grapheme the
> element type is to make it easier to write grapheme-aware string
> manipulation functions, even if you don't know about graphemes. But the
> reality is probably more mixed than that.

The reality is indeed more mixed. Inevitably at some point the API needs 
to answer the question: "what is the first character of this string?" 
Transparency is not possible. You break all string code out there.
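
To make that concrete, count the "characters" of one short string at 
each layer; a sketch in D (byGrapheme stands for the kind of grapheme 
range discussed in this thread - std.uni much later shipped one under 
exactly that name):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme;

    void main()
    {
        // "noël" spelled with a combining diaeresis: 'e' followed by U+0308
        string s = "noe\u0308l";
        writeln(s.length);                // 6: UTF-8 code units
        writeln(s.walkLength);            // 5: code points (dchar)
        writeln(s.byGrapheme.walkLength); // 4: graphemes, what a reader sees
    }

Whatever "first character" API we settle on has to pick one of those 
three answers, and the client will see the choice.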

> - - -
>
> I gave some thought about all this, and came to an interesting
> realizations that made me refine the proposal. The new proposal is
> disruptive perhaps as much as the first, but in a different way.
>
> But first, let's state a few facts to reframe the current discussion:
>
> Fact 1: most people don't know Unicode very well
> Fact 2: most people are confused by code units, code points, graphemes,
> and what is a 'character'
> Fact 3: most people won't bother with all this, they'll just use the
> basic language facilities and assume everything works correctly if it
> works correctly for them

Nice :o).

> Now, let's define two goals:
>
> Goal 1: make most people's string operations work correctly
> Goal 2: make most people's string operations work fast

Goal 3: don't break all existing code
Goal 4: make most people's string-based code easy to write and understand

> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
> sure we agree on this, but let's continue.

I think we disagree about what "most" means. For you it means "people 
who don't understand Unicode well but deal with combining characters 
anyway". For me it's "the largest percentage of D users across various 
writing systems".

> From the above 3 facts, we can deduce that a user won't want to bother
> with byDchar, byGrapheme, or byWhatever when using algorithms. You
> were annoyed by having to write byDchar everywhere, so you changed the
> element type to always be dchar and you don't have to write byDchar
> anymore. That's understandable and perfectly reasonable.
>
> The problem is of course that it doesn't give you correct results. Most
> of the time what you really want is to use graphemes; dchar just happens
> to be a good approximation that works most of the time.

Again, it's a matter of tradeoffs. I chose dchar because char was plain 
_wrong_ most of the time, not because char was a pretty darn good 
approximation that worked for most people most of the time. The fact 
remains that dchar _is_ a pretty darn good approximation that also has 
pretty darn good speed. So I'd say that I _still_ want to use dchar most 
of the time.

Committing to graphemes would complicate APIs for _everyone_ and would 
make things slower for _everyone_ for the sake of combining characters 
that _never_ occur in _most_ people's text. This is bad design, pure and 
simple. A good design is to cater for the majority and provide dedicated 
APIs for the few.

> Iterating by grapheme is somewhat problematic, and it degrades
> performance.

Yes.

> Same for comparing graphemes for normalized equivalence.

Yes, although I think you can optimize code such that comparing two 
strings wholesale only has a few more comparisons on the critical path. 
That would be still slower, but not as slow as iterating by grapheme in 
a naive implementation.
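
A sketch of that idea, assuming a canonical-normalization routine 
(today's std.uni provides normalize; it is used here only to fix ideas): 
byte-identical strings are always canonically equal, so the expensive 
path runs only on a mismatch.

    import std.uni : NFD, normalize;

    // Cheap comparison first; decode and normalize only when it fails.
    bool canonEqual(string a, string b)
    {
        if (a == b) return true;                      // no decoding at all
        return normalize!NFD(a) == normalize!NFD(b);  // slow, correct path
    }

A sharper implementation would normalize incrementally around the first 
mismatching bytes instead of normalizing both strings wholesale.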

> That's all true. I'm not too sure what we can do about that. It can be
> optimized, but it's very understandable that some people won't be
> satisfied by the performance and will want to avoid graphemes.

I agree.

> Speaking of optimization, I do understand that iterating by grapheme
> using the range interface won't give you the best performance. It's
> certainly convenient as it enables the reuse of existing algorithms with
> graphemes, but more specialized algorithms and interfaces might be more
> suited.

Even the specialized algorithms will be significantly slower.

> One observation I made with having dchar as the default element type is
> that not all algorithms really need to deal with dchar. If I'm searching
> for code point 'a' in a UTF-8 string, decoding code units into code
> points is a waste of time. Why? because the only way to represent code
> point 'a' is by having code point 'a'.

Right. That's why many algorithms in std are specialized for such cases.

> And guess what? Almost the same
> optimization can apply to graphemes: if you're searching for 'a' in a
> grapheme-aware manner in a UTF-8 string, all you have to do is search
> for the UTF-8 code unit 'a', then check whether the 'a' code unit is
> followed by a combining mark code point to confirm it is really an 'a',
> not a composed grapheme. Iterating the string by code unit is enough for
> these cases, and it'd increase performance by a lot.

Unfortunately it all breaks as soon as you go beyond one code point. You 
can't search efficiently, you can't compare efficiently. Boyer-Moore and 
friends are out.

I'm not saying that we shouldn't implement the correct operations! I'm 
just not convinced they should be the default.
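
For the record, the quoted single-code-point trick is easy to sketch 
(findPlainA is a made-up name; isMark from std.uni stands in for the 
combining-mark test):

    import std.uni : isMark;
    import std.utf : decode;

    // Scan raw UTF-8 code units for 'a', then peek at the next code
    // point to rule out a composed grapheme such as 'a' + U+0300.
    ptrdiff_t findPlainA(string haystack)
    {
        foreach (i; 0 .. haystack.length)
        {
            if (haystack[i] != 'a') continue;
            size_t next = i + 1;
            if (next == haystack.length || !isMark(decode(haystack, next)))
                return i; // a bare 'a', not the start of a composed grapheme
        }
        return -1;
    }

No decoding happens until a candidate match is found, which is exactly 
where the speedup comes from.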

> So making dchar the default type is no doubt convenient because it
> abstracts things enough so that generic algorithms can work with
> strings, but it has a performance penalty that you don't always need. I
> made an example using UTF-8, it applies even more to UTF-16. And it
> applies to grapheme-aware manipulations too.

It is true that UTF manipulation incurs overhead. The tradeoff has many 
dimensions: UTF-16 is bulkier and less cache friendly, ASCII is not 
sufficient for most people, the UTF decoding overhead is not that 
high... it's difficult to find the sweetest spot.

> This penalty with generic algorithms comes from the fact that they take
> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
> strings because you always need to fully decode the string (by dchar or
> by graphemes) for the purpose of calling the predicate. Given that
> comparing characters for anything other than equality or membership in
> a set is something you very rarely do, generic algorithms miss a big
> optimization opportunity here.

How can we improve that? You can't argue for an inefficient scheme just 
because what we have isn't as efficient as it could possibly be.

> - - -
>
> So here's what I think we should do:
>
> Todo 1: disallow generic algorithms on naked strings: string-specific
> Unicode-aware algorithms should be used instead; they can share the same
> name if their usage is similar

I don't understand this. We already do this, and by "Unicode-aware" we 
understand using dchar throughout. This is transparent to client code.

> Todo 2: to use a generic algorithm with a string, you must dress the
> string using one of toDchar, toGrapheme, toCodeUnits; this way your
> intentions are clear

Breaks a lot of existing code. Won't fly with Walter unless it solves 
world hunger. Nevertheless I have to say I like it; the #1 thing I'd 
change about the built-in strings is that they implicitly are two things 
at the same time. Asking for representation should be explicit.
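
To make the intent concrete: the dressing functions map directly onto 
explicit range adapters over one and the same array. The names 
byCodeUnit, byDchar, and byGrapheme below are illustrative (Phobos much 
later grew adapters with exactly these names):

    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    void main()
    {
        string s = "étoilé";
        auto units     = s.byCodeUnit;  // immutable(char): no decoding
        auto points    = s.byDchar;     // dchar: decoded lazily
        auto graphemes = s.byGrapheme;  // Grapheme: marks kept attached
    }

Each view is explicit at the call site - the representation is asked 
for, never implied.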

> Todo 3: string-specific algorithms can be implemented as simple wrappers
> for generic algorithms with the string dressed correctly for the task,
> or they can implement more sophisticated algorithms to increase performance

One thing I like about the current scheme is that all 
bidirectional-range algorithms work out of the box with all strings, and 
lend themselves to optimization whenever you want to.

This will have trouble passing Walter's wanking test. Mine too; every 
time I need to write a bunch of forwarding functions, that's a signal 
something went wrong somewhere. Remember MFC? :o)

> There's two major benefits to this approach:
>
> Benefit 1: if indeed you really don't want the performance penalty that
> comes with checking for composed graphemes, you can bypass it at some
> specific places in your code using byDchar, or you can disable it
> altogether by modifying the string-specific algorithms and recompiling
> Phobos.
>
> Benefit 2: we don't have to rush to implementing graphemes in the
> Unicode-aware algorithms. Just make sure the interface for
> string-specific algorithms *can* accept graphemes, and we can roll out
> support for them at a later time once we have a decent implementation.

I'm not seeing the drawbacks. Hurts everyone for the sake of a few, 
breaks existing code, makes all string processing a mess, would-be users 
will throw their hands in the air upon seeing the simplest examples, but 
we'll have the satisfaction of high-fiving one another and telling 
ourselves that we did the right thing.

> Also, all this is leaving the question open as to what to do when
> someone uses the string as a range. In my opinion, it should either
> iterate on code units (because the string is actually an array, and
> because that's what foreach does) or simply disallow iteration (asking
> that you dress the string first using toCodeUnit, toDchar, or toGrapheme).
>
> Do you like that more?

This is not about liking. I like doing the right thing as much as you 
do, and I think Phobos shows that. Clearly doing the right thing through 
and through is handling combining characters appropriately. The problem 
is keeping all desiderata in careful balance.


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 10:55 AM, spir wrote:
> On 01/15/2011 12:21 AM, Michel Fortin wrote:
>> Also, it'd really help this discussion to have some hard numbers about
>> the cost of decoding graphemes.
>
> Text has a perf module that provides such numbers (on different stages
> of Text object construction). The measured algos are not yet
> stabilised, so said numbers change regularly - but in the right
> direction ;-)
> You can try the current version at
> https://bitbucket.org/denispir/denispir-d/src (the perf module is called
> chrono.d)
>
> For information: recently, the cost of full text construction - decoding,
> normalisation (both decomposition & ordering), piling - was about 5 times
> that of decoding alone, the heavy part (~70%) being piling. But Stephan
> just informed me about a new gain in piling I have not yet tested.
> This performance places our library in between Windows native tools and
> ICU in terms of speed, which is imo rather good for a brand new tool
> written in a still unstable language.
>
> I have carefully read your arguments on why Text's approach of
> systematically "piling" and normalising source texts is not the right
> one from an efficiency point of view, even for strict use cases of
> universal text manipulation (because the relative space cost would
> indirectly cause time cost due to cache effects). Instead, you state we
> should "pile" and/or normalise on the fly. But I am, like you,
> rather doubtful on this point without any numbers available.
> So, let us produce some benchmark results on both approaches if you like.

Congrats on this great work. The initial numbers are in keeping with my 
expectation; UTF adds for certain primitives up to 3x overhead compared 
to ASCII, and I expect combining character handling to bring about as 
much on top of that.

Your work and Steve's won't go to waste; one way or another we need to 
add grapheme-based processing to D. I think it would be great if later 
on a Phobos submission was made.


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/17/2011 05:34 PM, Michel Fortin wrote:
> As I said: all those people who are not validating the inputs to make
> sure they don't contain combining code points. As far as I know, no one
> is doing that, so that means everybody should use algorithms capable of
> handling multi-code-point graphemes. If someone indeed is doing this
> validation, he'll probably also be smart enough to make his algorithms
> work with dchars.

Actually, there are at least 2 special cases:
* apps that only deal with pre-Unicode source material
* apps that only deal with source material "mechanically" generated by 
text-producing software which itself guarantees single-code-point graphemes

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/17/2011 06:36 PM, Andrei Alexandrescu wrote:
> On 1/17/11 10:55 AM, spir wrote:
>> On 01/15/2011 12:21 AM, Michel Fortin wrote:
>>> Also, it'd really help this discussion to have some hard numbers about
>>> the cost of decoding graphemes.
>>
>> Text has a perf module that provides such numbers (on different stages
>> of Text object construction). The measured algos are not yet
>> stabilised, so said numbers change regularly - but in the right
>> direction ;-)
>> You can try the current version at
>> https://bitbucket.org/denispir/denispir-d/src (the perf module is called
>> chrono.d)
>>
>> For information: recently, the cost of full text construction - decoding,
>> normalisation (both decomposition & ordering), piling - was about 5 times
>> that of decoding alone, the heavy part (~70%) being piling. But Stephan
>> just informed me about a new gain in piling I have not yet tested.
>> This performance places our library in between Windows native tools and
>> ICU in terms of speed, which is imo rather good for a brand new tool
>> written in a still unstable language.
>>
>> I have carefully read your arguments on why Text's approach of
>> systematically "piling" and normalising source texts is not the right
>> one from an efficiency point of view, even for strict use cases of
>> universal text manipulation (because the relative space cost would
>> indirectly cause time cost due to cache effects). Instead, you state we
>> should "pile" and/or normalise on the fly. But I am, like you,
>> rather doubtful on this point without any numbers available.
>> So, let us produce some benchmark results on both approaches if you like.
>
> Congrats on this great work. The initial numbers are in keeping with my
> expectation; UTF adds for certain primitives up to 3x overhead compared
> to ASCII, and I expect combining character handling to bring about as
> much on top of that.
>
> Your work and Steve's won't go to waste; one way or another we need to
> add grapheme-based processing to D. I think it would be great if later
> on a Phobos submission was made.

Andrei, would you have a look at Text's current state, mainly 
the interface, when you have time for that (no hurry), at 
https://bitbucket.org/denispir/denispir-d/src
It is actually a bit more than just a string type that treats true 
characters as natural elements:
* It is a textual type providing a client interface of common text 
manipulation methods similar to the ones in common high-level languages 
(including the fact that a character is a singleton string).
* The repo also holds the main module (unicodedata) of Text's sister lib 
(dunicode), providing access to various Unicode algos and data.
(We are about to merge the 2 libs into a new repository.)

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/17/2011 04:00 PM, Andrei Alexandrescu wrote:
> On 1/17/11 6:44 AM, Steven Schveighoffer wrote:
>> We need to get some real numbers together. I'll see what I can create
>> for a type, but someone else needs to supply the input :) I'm on short
>> supply of unicode data, and any attempts I've made to create some result
>> in failure. I have one example of one composed character in this thread
>> that I can cling to, but in order to supply some real numbers, we need a
>> large amount of data.
>
> Oh, one more thing. You don't need a lot of Unicode text containing
> combining characters to write benchmarks. (You do need it for testing
> purposes.) Most text won't contain combining characters anyway, so after
> you implement graphemes, just benchmark them on regular text.

Correct. For this reason, we do not use the same source at all for 
correctness and performance testing.
It is impossible to define a typical or representative source (who 
judges?). But at the very minimum, source texts for perf measurement 
should mix languages as diverse as possible, including some material 
from languages known to be problematic and/or atypical (English, Korean, 
Hebrew...).
The following (ripped and composed from ICU data sets) is just that: 
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/data/unicode.txt

Content:
12 natural languages
34767 bytes = UTF-8 code units
--> 20133 code points
--> 22033 normal codes (NFD-decomposed)
--> 19205 piles = true characters

Denis
_________________
vita es estrany
spir.wikidot.com
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 12:23 PM, spir wrote:
> Andrei, would you have a look at Text's current state, mainly
> the interface, when you have time for that (no hurry), at
> https://bitbucket.org/denispir/denispir-d/src
> It is actually a bit more than just a string type that treats true
> characters as natural elements:
> * It is a textual type providing a client interface of common text
> manipulation methods similar to the ones in common high-level languages
> (including the fact that a character is a singleton string).
> * The repo also holds the main module (unicodedata) of Text's sister lib
> (dunicode), providing access to various Unicode algos and data.
> (We are about to merge the 2 libs into a new repository.)

I think this is solid work that reveals good understanding of Unicode. 
That being said, there are a few things I disagree about and I don't 
think it can be integrated into Phobos. One thing is that it looks a lot 
more like D1 code than D2. D2 code of this kind is automatically 
expected to play nice with the rest of Phobos (ranges and algorithms). 
As it is, the code is an island that implements its own algorithms 
(mostly by equivalent handwritten code).

In detail:

* Line 130: representing a text as a dchar[][] has its advantages but 
major efficiency issues. To be frank I think it's a disaster. I think a 
representation building on UTF strings directly is bound to be vastly 
better.

* 163: equality does what std.algorithm.equal does.

* 174: equality also does what std.algorithm.equal does (possibly with a 
custom pred)

* 189: TextException is unnecessary

* 340: Unless properly motivated, iteration with opApply is archaic and 
inefficient.

* 370: Why lose the information that the result is in fact a single Pile?

* 430, 456, 474: contains, indexOf, count and probably others should use 
generic algorithms, not duplicate them.

* 534: replace is std.array.replace

* 623: copy copies the piles shallowly (not sure if that's a problem)

As I mentioned before - why not focus on defining a Grapheme type (what 
you call Pile, but using UTF encoding) and defining a ByGrapheme range 
that iterates a UTF-encoded string by grapheme?
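
To sketch the shape I have in mind (Grapheme and byGrapheme are the 
proposed names, not existing code - std.uni did eventually ship both 
under these names, long after this thread):

    import std.stdio : writeln;
    import std.uni : byGrapheme, Grapheme;

    void main()
    {
        // Lazily walk a UTF-encoded string one user-perceived
        // character at a time; no dchar[][] materialization needed.
        foreach (Grapheme g; "e\u0301toile\u0301".byGrapheme)
            writeln(g[]); // g[] ranges over the grapheme's code points
    }

The string stays in its UTF encoding; piles exist only transiently, as 
Grapheme values produced by the range.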


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-17 12:33:04 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/17/11 10:34 AM, Michel Fortin wrote:
>> As I said: all those people who are not validating the inputs to make
>> sure they don't contain combining code points.
> 
> The question (which I see you keep on dodging :o)) is how much text 
> contains combining code points.

Not much, right now.

The problem is that the answer to this question is likely to change as 
Unicode support improves in operating systems and applications. 
Shouldn't we future-proof Phobos?


> It does not serve us well to rigidly claim that the only good way of 
> doing anything Unicode is to care about graphemes.

For the time being we can probably afford it.


> Even NSString exposes the underlying UTF-16 encoding and provides
> dedicated functions for grapheme-based processing. For one thing, if
> you care about the width of a word in printed text (one of the cases
> where graphemes are important), you need font information. And -
> surprise! - some fonts do NOT support combining characters and print
> the marks next to the base character instead of superimposing them, so
> the "wrong" method of counting characters is more informative.

Generally what OS X does in those cases is display the 
character in another font. That said, counting graphemes is never a good 
way to tell how much space some text will take (unless the application 
enforces a fixed width per grapheme). It's more useful for telling the 
number of characters in a text document, similar to a word count.


>> That said, no one should really have to care but those who implement the
>> string manipulation functions. The idea behind making the grapheme the
>> element type is to make it easier to write grapheme-aware string
>> manipulation functions, even if you don't know about graphemes. But the
>> reality is probably more mixed than that.
> 
> The reality is indeed more mixed. Inevitably at some point the API 
> needs to answer the question: "what is the first character of this 
> string?" Transparency is not possible. You break all string code out 
> there.

I'm not sure what you mean by that.


>> - - -
>> 
>> I gave some thought about all this, and came to an interesting
>> realizations that made me refine the proposal. The new proposal is
>> disruptive perhaps as much as the first, but in a different way.
>> 
>> But first, let's state a few facts to reframe the current discussion:
>> 
>> Fact 1: most people don't know Unicode very well
>> Fact 2: most people are confused by code units, code points, graphemes,
>> and what is a 'character'
>> Fact 3: most people won't bother with all this, they'll just use the
>> basic language facilities and assume everything works correctly if it
>> works correctly for them
> 
> Nice :o).
> 
>> Now, let's define two goals:
>> 
>> Goal 1: make most people's string operations work correctly
>> Goal 2: make most people's string operations work fast
> 
> Goal 3: don't break all existing code
> Goal 4: make most people's string-based code easy to write and understand

Those are worthy goals too.


>> To me, goal 1 trumps goal 2, even if goal 2 is also important. I'm not
>> sure we agree on this, but let's continue.
> 
> I think we disagree about what "most" means. For you it means "people 
> who don't understand Unicode well but deal with combining characters 
> anyway". For me it's "the largest percentage of D users across various 
> writing systems".

It's not just D users, it's also the users of programs written by D 
users. I can't count how many times I've seen accented characters 
mishandled on websites and elsewhere, and I have an aversion 
to doing the same thing to people of other cultures and languages.

If the operating system supports combining marks, users have an 
expectation that applications running on it will deal with them 
correctly too, and they'll (rightfully) blame your application if it 
doesn't. The same goes for websites.

I understand that in some situations you don't want to deal with 
graphemes even if you theoretically should, but I don't think it should 
be the default.


>> One observation I made with having dchar as the default element type is
>> that not all algorithms really need to deal with dchar. If I'm searching
>> for code point 'a' in a UTF-8 string, decoding code units into code
>> points is a waste of time. Why? because the only way to represent code
>> point 'a' is by having code point 'a'.
> 
> Right. That's why many algorithms in std are specialized for such cases.
> 
>> And guess what? Almost the same
>> optimization can apply to graphemes: if you're searching for 'a' in a
>> grapheme-aware manner in a UTF-8 string, all you have to do is search
>> for the UTF-8 code unit 'a', then check whether the 'a' code unit is
>> followed by a combining mark code point to confirm it is really an 'a',
>> not a composed grapheme. Iterating the string by code unit is enough for
>> these cases, and it'd increase performance by a lot.
> 
> Unfortunately it all breaks as soon as you go beyond one code point. 
> You can't search efficiently, you can't compare efficiently. 
> Boyer-Moore and friends are out.

Ok. Say you were searching for the needle "étoilé" in a UTF-8 
haystack; I see two ways to extend the optimization described above:

1. search for the easy part "toil", then check its surrounding 
graphemes to confirm it's really "étoilé"
2. search for a code point matching 'é' or 'e', then confirm that the 
code points following it form the right graphemes.

Implementing the second one can be done by converting the needle to a 
regular expression operating at code-unit level. With that you can 
search efficiently for the needle directly in code units without having 
to decode and/or normalize the whole haystack.
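
One building block of the second strategy, sketched for a single 
grapheme (startsWithEAcute is a made-up helper; a full matcher would 
chain one such check per grapheme of the needle, or compile the needle 
into the code-unit-level regex just described):

    import std.utf : decode;

    // Does the text at index i start with "é", either precomposed
    // (U+00E9) or decomposed ('e' + U+0301)? On success, end is set
    // just past the match. A complete check would also reject extra
    // trailing combining marks.
    bool startsWithEAcute(string s, size_t i, out size_t end)
    {
        if (i >= s.length) return false;
        size_t j = i;
        dchar c = decode(s, j); // reads one code point, advances j
        if (c == '\u00E9') { end = j; return true; }
        if (c == 'e' && j < s.length && decode(s, j) == '\u0301')
        {
            end = j;
            return true;
        }
        return false;
    }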


>> This penalty with generic algorithms comes from the fact that they take
>> a predicate of the form "a == 'a'" or "a == b", which is ill-suited for
>> strings because you always need to fully decode the string (by dchar or
>> by graphemes) for the purpose of calling the predicate. Given that
>> comparing characters for anything other than equality or membership in
>> a set is something you very rarely do, generic algorithms miss a big
>> optimization opportunity here.
> 
> How can we improve that? You can't argue for an inefficient scheme just 
> because what we have isn't as efficient as it could possibly be.

You ask what's inefficient about generic algorithms having customizable 
predicates? You can't implement the above optimization if you can't 
guarantee the predicate is "==". That said, perhaps we can detect "==" 
and apply the optimization only then.

Being able to specify the predicate doesn't gain you much for strings, 
because a < 'a' doesn't make much sense. All you need to check for is 
equality with some value or membership in a given character set, both of 
which can use the optimization above.


>> So here's what I think we should do:
>> 
>> Todo 1: disallow generic algorithms on naked strings: string-specific
>> Unicode-aware algorithms should be used instead; they can share the same
>> name if their usage is similar
> 
> I don't understand this. We already do this, and by "Unicode-aware" we 
> understand using dchar throughout. This is transparent to client code.

That's probably because you haven't understood the intent (I might not 
have made it very clear either).

The problem I see currently is that you rely on dchar being the element 
type. That should be an implementation detail, not something client 
code can see or rely on. By making it an implementation detail, you can 
later make grapheme-aware algorithms the default without changing the 
API. Since you're the gatekeeper to Phobos, you can make this change 
conditional on getting an acceptable level of performance out of the 
grapheme-aware algorithms, or on other factors like the amount of 
combining characters you encounter in the wild in the next few years.

So the general string functions would implement your compromise (using 
dchar) but not commit to it indefinitely. Someone who really wants to 
work in code points can use toDchar, someone who wants to deal with 
graphemes uses toGraphemes, and someone who doesn't care chooses 
nothing and gets the default compromise behaviour.

All you need to do for this is document it, and try to make sure the 
string APIs don't force the implementation to work with code points.


>> Todo 2: to use a generic algorithm with a string, you must dress the
>> string using one of toDchar, toGrapheme, toCodeUnits; this way your
>> intentions are clear
> 
> Breaks a lot of existing code.
> 
> Won't fly with Walter unless it solves world hunger. Nevertheless I 
> have to say I like it; the #1 thing I'd change about the built-in 
> strings is that they implicitly are two things at the same time. Asking 
> for representation should be explicit.

No, it doesn't break anything. This is just the continuation of what I 
tried to explain above: if you want to be sure you're working with 
graphemes or dchar, say it.

Also, it said nothing about iteration or foreach, so I'm not sure why 
it wouldn't fly with Walter. It can stay as it is, except for one 
thing: you and Walter should really get on the same wavelength 
regarding ElementType!(char[]) and foreach(c; string). I don't care 
that much which is the default, but they absolutely need to be the same.


>> Todo 3: string-specific algorithms can be implemented as simple wrappers
>> for generic algorithms with the string dressed correctly for the task,
>> or they can implement more sophisticated algorithms to increase performance
> 
> One thing I like about the current scheme is that all 
> bidirectional-range algorithms work out of the box with all strings, 
> and lend themselves to optimization whenever you want to.

I like this as the default behaviour too. I think however that you 
should restrict the algorithms that work out of the box to those which 
can also work with graphemes. This way you can change the behaviour in 
the future and support graphemes by a simple upgrade of Phobos.

Algorithms that don't work with graphemes would still work with toDchar.

So what doesn't work with graphemes? Predicates such as "a < b" for 
instance. That's pretty much it.


> This will have trouble passing Walter's wanking test. Mine too; every 
> time I need to write a bunch of forwarding functions, that's a signal 
> something went wrong somewhere. Remember MFC? :o)

The idea is that we write the API as it would apply to graphemes, but 
we implement it using dchar for the time being. Some function 
signatures might have to differ a bit.


>> Do you like that more?
> 
> This is not about liking. I like doing the right thing as much as you 
> do, and I think Phobos shows that. Clearly doing the right thing 
> through and through is handling combining characters appropriately. The 
> problem is keeping all desiderata in careful balance.

Well then, don't you find it balanced enough? I'm not asking that 
everything be done with graphemes. I'm not even asking that anything be 
done with graphemes by default. I'm only asking that we keep the API 
clean enough so we can switch to graphemes by default in the future 
without having to rewrite all the code everywhere to use byGrapheme. 
Isn't that the right balance?


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 1/17/11 2:29 PM, Michel Fortin wrote:
> The problem I see currently is that you rely on dchar being the element
> type. That should be an implementation detail, not something client code
> can see or rely on.

But at some point you must be able to talk about individual characters 
in a text. It can't be something that client code doesn't see!!!

SuperDuperText txt;
auto c = giveMeTheFirstCharacter(txt);

What is the type of c? That is visible to the client!


Andrei
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 2011-01-17 15:49:26 -0500, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 1/17/11 2:29 PM, Michel Fortin wrote:
>> The problem I see currently is that you rely on dchar being the element
>> type. That should be an implementation detail, not something client code
>> can see or rely on.
> 
> But at some point you must be able to talk about individual characters 
> in a text. It can't be something that client code doesn't see!!!

It seems that it can. NSString only exposes individual UTF-16 code 
units directly (or semi-directly via an accessor method), even though 
searching and comparing are grapheme-aware. I'm not saying it's a good 
design, but it certainly can work in practice.

In any case, I didn't mean to say the client code shouldn't be aware of 
the characters in a string. I meant that the client shouldn't assume 
the algorithm works at the same layer as ElementType!(string) for a 
given string type. Even if ElementType!(string) is dchar, the default 
function you get if you don't use any of toCodeUnit, toDchar, or 
toGrapheme can work at the dchar or grapheme level if it makes more 
sense that way.

In other words, the client says: "I have two strings, compare them!" 
The client didn't specify if they should be compared by char, wchar, 
dchar, or by normalized grapheme; so we do what's sensible. That's what 
I call the 'default' string functions, those you get when you don't ask 
for anything specific. They should have a signature making them able to 
work at the grapheme level, even though they might not for practical 
reasons (performance). This way if it becomes more important or 
practical to support graphemes, it's easy to evolve to them.


> SuperDuperText txt;
> auto c = giveMeTheFirstCharacter(txt);
> 
> What is the type of c? That is visible to the client!

That depends on how you implement the giveMeTheFirstCharacter function. :-)

More seriously, you have four choices:

1. code unit
2. code point
3. grapheme
4. require the client to state explicitly which kind of 'character' he 
wants; 'character' being an overloaded word, it's reasonable to ask for 
disambiguation.

You and Walter can't seem to agree between 1 and 2, 
regarding foreach and ranges. To keep things consistent with what I 
said above I'd tend to say 4, but that's weird for something that looks 
like an array. My second choice is 1 for consistency, 3 for 
correctness, and 2 for practicality.

Given something is going to be inconsistent either way, I'd say any of 
the above is acceptable. But please make sure you and Walter agree on 
the default element type for ranges and foreach.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
January 17, 2011
Re: VLERange: a range in between BidirectionalRange and RandomAccessRange
On 01/17/2011 07:57 PM, Andrei Alexandrescu wrote:
> On 1/17/11 12:23 PM, spir wrote:
>> Andrei, would you have a look at Text's current state, mainly
>> the interface, when you have time for that (no hurry), at
>> https://bitbucket.org/denispir/denispir-d/src
>> It is actually a bit more than just a string type that treats true
>> characters as natural elements:
>> * It is a textual type providing a client interface of common text
>> manipulation methods similar to the ones in common high-level languages
>> (including the fact that a character is a singleton string).
>> * The repo also holds the main module (unicodedata) of Text's sister lib
>> (dunicode), providing access to various Unicode algos and data.
>> (We are about to merge the 2 libs into a new repository.)
>
> I think this is solid work that reveals good understanding of Unicode.
> That being said, there are a few things I disagree about and I don't
> think it can be integrated into Phobos.

We are exploring a new field. (Except for the work Objective-C designers 
did -- but we just discovered it.)

> One thing is that it looks a lot
> more like D1 code than D2. D2 code of this kind is automatically
> expected to play nice with the rest of Phobos (ranges and algorithms).
> As it is, the code is an island that implements its own algorithms
> (mostly by equivalent handwritten code).

Right. Initially we wanted precisely that: to let it play nicely with the 
rest of the new Phobos. This mainly meant providing a range interface, 
which also gives access to std.algorithm routines. But we were blocked by 
current bugs related to ranges. I have posted about those issues (you may 
remember replying to that post).

> In detail:
>
> * Line 130: representing a text as a dchar[][] has its advantages but
> major efficiency issues. To be frank I think it's a disaster. I think a
> representation building on UTF strings directly is bound to be vastly
> better.

I don't understand your point. How would that differ from D's builtin 
types, then?

Also, which efficiency issue do you mean? On text object 
construction we do agree, and I have given some data. But construction 
happens only once; it is an investment intended to provide correctness 
first, and efficiency for _every_ operation on the constructed text.
As for the speed of such methods / algorithms operating _correctly_ on 
universal text: precisely because there is no alternative to Text (yet), 
there are also no performance data available to judge.

(What about comparing Objective-C's NSString to Text's current 
performance for indexing, slicing, searching, counting,...? Even in its 
current experimental stage, I bet it would not be ridiculous, rather the 
opposite. But I may be completely wrong.)

> * 163: equality does what std.algorithm.equal does.
>
> * 174: equality also does what std.algorithm.equal does (possibly with a
> custom pred)

Right, these are unimportant helper functions at the "pile" level. 
(Initially introduced because the builtin "==" showed strange 
inefficiency in our case. I may test again later.)

> * 189: TextException is unnecessary

Agreed.

> * 340: Unless properly motivate, iteration with opApply is archaic and
> inefficient.

See the range bug mentioned above; opApply is the only workaround AFAIK.
Also, ranges cannot yet provide indexed iteration like
	foreach(i, char ; text) {...}
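
For reference, the workaround amounts to this (a sketch over the 
dchar[][] pile representation discussed in this thread, not the actual 
Text code):

    struct Text
    {
        dchar[][] piles; // one pile per true character

        // opApply supplies the indexed iteration that the range
        // interface could not express at the time:
        //     foreach (i, pile; someText) { ... }
        int opApply(int delegate(size_t, dchar[]) dg)
        {
            foreach (i, p; piles)
                if (auto r = dg(i, p)) return r;
            return 0;
        }
    }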

> * 370: Why lose the information that the result is in fact a single Pile?

I don't know what information loss you mean.

Generally speaking, Pile is more or less an implementation detail used 
to internally represent a true character, while Text is the important thing.
At one point we had to choose whether to make Pile an openly exposed type 
as well, or not. I chose (after some exchange on the topic) not to, 
for a few reasons:
* Simplicity: one type does the whole job well.
* Avoiding confusion with historic string types, whose elements 
(codes = characters) were atomic thingies. This was also a reason not 
to name it simply "Character"; "Pile" for me was meant to evoke the 
technical side rather than the meaningful side.
* Lightness of the interface: if we expose Pile openly, then we need 
to double all methods that may take or return a single character, like 
searching, counting, replacing etc., and also possibly indexing and 
iteration.

In fact, the resulting interface is more or less like a string type in 
high-level languages such as Python; with the motivating difference that 
it operates correctly on universal text.

Now, it seems you would rather expect the character/pile type to be 
the important thing, and Text to be just a sequence of them? (possibly 
not even needing to be defined formally)

> * 430, 456, 474: contains, indexOf, count and probably others should use
> generic algorithms, not duplicate them.
>
> * 534: replace is std.array.replace

I had to write the algorithms because most of those in std.algorithm 
require a range interface, IIUC; and also for testing purposes.

> * 623: copy copies the piles shallowly (not sure if that's a problem)

I had the same question.

> As I mentioned before - why not focus on defining a Grapheme type (what
> you call Pile, but using UTF encoding) and defining a ByGrapheme range
> that iterates a UTF-encoded string by grapheme?

Dunno. This simply was not my approach. It seems to me Text as it is 
provides clients with an interface as simple and clear as possible, 
while operating correctly in the background.

It seems that if you just build a ByGrapheme iterator, you have no 
choice but to abstract on the fly (constructing piles on the fly for 
operations like indexing, and normalising them as well for searching, 
counting...).
As I said in other posts, this may be the right thing to do from an 
efficiency point of view, but that remains to be proven. I bet the 
opposite, in fact: that --with the same implementation language and the 
same investment in optimisation-- the approach of defining a true 
textual type like Text is inevitably more efficient by orders of 
magnitude (*). Again, Text's initial construction cost is an investment. 
Prove me wrong (**).

> Andrei

Denis


(*) Except, probably, for the choice of making the ElementType a 
singleton Text (seems costly).
(**) I'm now aware of the heavy speed loss Text certainly suffers from 
representing characters as mini-arrays, but I guess it is marginally 
relevant compared to the gain of not piling and normalising on every 
operation.
_________________
vita es estrany
spir.wikidot.com