November 27, 2013
27-Nov-2013 22:12, H. S. Teoh wrote:
> On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
>> On 11/27/13 7:43 AM, Jakob Ovrum wrote:
>>> On that note, I tried to use std.uni to write a simple example of how
>>> to correctly handle this in D, but it became apparent that std.uni
>>> should expose something like `byGrapheme` which lazily transforms a
>>> range of code points to a range of graphemes (probably needs a
>>> `byCodePoint` to do the converse too). The two extant grapheme
>>> functions, `decodeGrapheme` and `graphemeStride`, are *awful* for
>>> string manipulation (granted, they are probably perfect for text
>>> rendering).
>>
>> Yah, byGrapheme would be a great addition.
> [...]
>
> +1. This is better than the GraphemeString / i18nString proposal
> elsewhere in this thread, because it discourages people from using
> graphemes (poor performance) unless where actually necessary.

I could have sworn we had byGrapheme somewhere, well apparently not :(
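It shouldn't be hard to add, either - roughly, a lazy byGrapheme can be a thin wrapper over the existing decodeGrapheme. A sketch, just to give the flavor (not a final API):

import std.range;
import std.uni : Grapheme, decodeGrapheme;

// Sketch only: a lazy range of graphemes on top of std.uni.decodeGrapheme.
struct ByGrapheme(Range)
{
    private Range _input;
    private Grapheme _front;
    private bool _empty;

    this(Range input)
    {
        _input = input;
        popFront(); // prime the first grapheme
    }

    @property bool empty() const { return _empty; }
    @property Grapheme front() { return _front; }

    void popFront()
    {
        if (_input.empty)
            _empty = true;
        else
            _front = decodeGrapheme(_input); // consumes one grapheme cluster
    }
}

auto byGrapheme(Range)(Range input) { return ByGrapheme!Range(input); }

With that, walkLength("noe\u0308l".byGrapheme) gives 4, which is the count people usually expect.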

BTW I believe that GraphemeString could still be a valuable addition. I know of at least one good implementation that gives you O(1) grapheme access with a nice memory footprint. It has many benefits, but the chief problems with it are:
a) It doesn't solve the interchange problem at all - you'd have to encode on write/re-code on read
b) It relies on having global shared state across the whole program, and that's the real show-stopper about it

In any case it's a direction well worth exploring.
>
>
> T
>


-- 
Dmitry Olshansky
November 27, 2013
27-Nov-2013 22:54, Jacob Carlborg wrote:
> On 2013-11-27 18:56, Dicebot wrote:
>
>> +1
>>
>> Working with graphemes is a rather expensive thing to do performance-wise.
>> I like how D makes this fact obvious and provides a continuous transition
>> through abstraction levels here. It is important to make the costs
>> obvious.
>
> I think it's missing a final high-level abstraction. As with the rest of
> the abstractions, you're not forced to use them.
>

This could give an idea of what the Perl folks do to make a grapheme feel like a unit of a string:
http://www.parrot.org/content/ucs-4-nfg-and-how-grapheme-tables-makes-it-awesome

You seriously don't want this kind of behind-the-scenes work taking place in a systems language.
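For the curious, the core of that approach boils down to roughly the following (my own rough rendering of the idea, not Parrot's code): every grapheme cluster that needs more than one code point gets interned into one program-wide table and replaced by a synthetic negative id, so a string becomes a flat array of fixed-width "grapheme numbers" with O(1) indexing.

import std.array : array;
import std.uni : Grapheme;

// Rough sketch of the "grapheme table" idea; the names are made up.
struct GraphemeTable
{
    dstring[] clusters;   // id -> code points of an interned cluster
    int[dstring] ids;     // code points -> id

    // Map one grapheme cluster to a single fixed-width "grapheme number".
    int intern(Grapheme g)
    {
        if (g.length == 1)
            return cast(int) g[0];             // plain code point, stored as-is
        auto key = g[].array.idup;             // the cluster's code points as a dstring
        if (auto p = key in ids)
            return *p;
        clusters ~= key;
        ids[key] = -cast(int) clusters.length; // negative id: not a real code point
        return ids[key];
    }
}

__gshared GraphemeTable nfgTable; // the program-wide shared state such a scheme needs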

P.S. The linked text presents some incorrect "facts" about Unicode that I'm not to be held responsible for :) I do believe, however, that the general idea described is interesting and worth trying out in addition to what we have in std.uni.

-- 
Dmitry Olshansky
November 27, 2013
27-Nov-2013 20:22, Wyatt wrote:
> On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
>>
>> trouble following all that (e.g. Isn't "noe\u0308l" a grapheme
>>
> Whoops, overzealous pasting.  That is, "e\u0308", which composes to
> "ë".  A grapheme cluster seems to represent one printed character: "...a
> horizontally segmentable unit of text, consisting of some grapheme base
> (which may consist of a Korean syllable) together with any number of
> nonspacing marks applied to it."
>
> Is that about right?

As much as the standard defines it (actually, it talks about grapheme boundaries, and a grapheme is whatever happens to be in between).


More specifically, D's std.uni follows the notion of the extended grapheme cluster. There is no need to stick with ugly legacy crap.
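E.g. with the functions we already have (assuming a UTF-8 string, so the cluster length is counted in code units):

import std.uni : graphemeStride;

void main()
{
    auto s = "e\u0308l";               // 'e' + combining diaeresis + 'l'
    assert(graphemeStride(s, 0) == 3); // first cluster: 'e' (1 code unit) + U+0308 (2 code units)
    assert(graphemeStride(s, 3) == 1); // 'l' is a cluster of its own
}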

See also
http://www.unicode.org/reports/tr29/
>
> -Wyatt


-- 
Dmitry Olshansky
November 27, 2013
27-Nov-2013 20:18, Wyatt wrote:
> On Wednesday, 27 November 2013 at 15:43:11 UTC, Jakob Ovrum wrote:

> It
> honestly surprised me how many things in std.uni don't seem to work on
> ranges.
>

Which ones? Or do you mean more like isAlpha(rangeOfCodepoints)?



-- 
Dmitry Olshansky
November 27, 2013
On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:
>
> i18nString sounds like a range of graphemes to me.
>
Maybe.  If I had called it...say, "normalisedString"?  Would you still think that?  That was an off-the-cuff name because my morning brain imagined that this sort of thing would be useful for user input where you can't make assumptions about its form.

> I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.
>
Okay, hold up.  It's a bit late to prevent everyone from diving down this rabbit hole, but let me be clear:

This really isn't about graphemes.  Not really.  They may be involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think he's being unfair in expecting "noël" to have a length of four no matter how it was composed.  I don't think it's unfair to expect that "noël".take(3) returns "noë", and I don't think it's unfair that reversing it should be "lëon".  All the places where his expectations were defied (and more!) are implementation details.
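(For the record, getting that behaviour today takes something like the following - the reverseByGrapheme helper is made up, nothing in Phobos spells it out for you:)

import std.array : appender;
import std.uni : Grapheme, decodeGrapheme;

// Made-up helper: reverse a string by grapheme clusters, so that
// "noe\u0308l" comes back as "le\u0308on" instead of having the
// combining mark detached from its 'e'.
string reverseByGrapheme(string s)
{
    Grapheme[] graphemes;
    auto rest = s;
    while (rest.length)
        graphemes ~= decodeGrapheme(rest); // consumes one cluster at a time

    auto result = appender!string();
    foreach_reverse (g; graphemes)
        foreach (i; 0 .. g.length)         // re-emit the cluster's code points in original order
            result.put(g[i]);
    return result.data;
}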

While I stated before that I don't necessarily have anything against people learning more about unicode, neither do I fundamentally believe that's something a lot of people _need_ to worry about.  I'm not saying the default string in D should change or anything crazy like that.  All I'm suggesting is maybe, rather than telling people they should read a small book about the most arcane stuff imaginable and then explaining which tool does what when that doesn't take, we could just tell them "Here, use this library type where you need it" with the admonishment that it may be too slow if abused.  I think THAT could be useful.

> In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.

See, this sways me only a little bit.  The reason for that is, often, convenience greatly trumps elegance or performance.  Sure I COULD write something in C to look for obvious bad stuff in my syslog, but would I bother when I have a shell with pipes, grep, cut, and sed?  This all isn't to say I don't LIKE performance and elegance; but I live, work, and play on both sides of this spectrum, and I'd like to think they can peacefully coexist without too much fuss.

-Wyatt
November 28, 2013
On 11/27/2013 9:22 AM, Jakob Ovrum wrote:
> In D, we can write code that is both Unicode-correct and highly performant,
> while still being simple and pleasant to read. To write such code, one must have
> a modicum of understanding of how Unicode works (in order to choose the right
> tools from the toolbox), but I think it's a novel compromise.

Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes.

http://dlang.org/glossary.html#narrow strings

Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.
November 28, 2013
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright wrote:
> Sadly, std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges. This means that all algorithms on strings will be crippled as far as performance goes.
>
> http://dlang.org/glossary.html#narrow strings
>
> Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.

Decoding by default means that algorithms can work reasonably with strings without being designed specifically for strings. The algorithms can then later be specialized for narrow strings, which I believe is happening for a few algorithms in std.algorithm like substring search.

Decoding is still available as a separate layer through std.utf, when more control over decoding is required.
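To illustrate the kind of specialization I mean (just a sketch, not actual Phobos code):

import std.range;
import std.string : representation;
import std.traits : isNarrowString;
import std.utf : decode;

// Generic version: any range of dchar, including auto-decoded strings.
size_t countChar(R)(R r, dchar needle) if (isInputRange!R && !isNarrowString!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
        if (r.front == needle) ++n;
    return n;
}

// Narrow-string specialization: for an ASCII needle, compare raw code units
// and skip decoding entirely; otherwise decode explicitly via std.utf.
size_t countChar(S)(S s, dchar needle) if (isNarrowString!S)
{
    size_t n;
    if (needle < 0x80)
    {
        foreach (u; s.representation)      // raw code units, no decoding
            if (u == needle) ++n;
        return n;
    }
    for (size_t i = 0; i < s.length; )
        if (decode(s, i) == needle) ++n;   // decode() advances i
    return n;
}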
November 28, 2013
Walter Bright:

> This means that all algorithms on strings will be crippled
> as far as performance goes.

If you want to sort an array of chars you need to use a dchar[], or code like this:

import std.algorithm : sort;
import std.string : representation;
char[] word = "just a test".dup;
auto sword = cast(char[])word.representation.sort().release; // sort raw code units, no decoding

See:
http://d.puremagic.com/issues/show_bug.cgi?id=10162

Bye,
bearophile
November 28, 2013
On Thursday, 28 November 2013 at 09:02:12 UTC, Walter Bright
wrote:
> Sadly,

I think it's great. It means that, by default, your strings will always be handled correctly. I think there are quite a few algorithms that were written without ever taking strings into account, but still happen to work with them.

> std.array is determined to decode (i.e. convert to dchar[]) all your strings when they are used as ranges.
> This means that all algorithms on
> strings will be crippled as far as performance goes.

Quite a few algorithms in array/algorithm/string actually *don't* decode the string when they don't need to.

> Very, very few operations on strings need decoding. The decoding should have gone into a separate layer.

Which operations are you thinking of in std.array that decode
when they shouldn't?
November 28, 2013
On 11/28/2013 5:24 AM, monarch_dodra wrote:
> Which operations are you thinking of in std.array that decode
> when they shouldn't?

front() in std.array looks like:

@property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~ T.stringof);
    size_t i = 0;
    return decode(a, i); // std.utf.decode: reads one code point starting at i
}

So anytime I write a generic algorithm using empty, front, and popFront(), it decodes the strings, which is a large pessimization.
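For example (made-up code, but it shows the effect): the same generic function decodes every code point when handed a string, while handing it the raw representation skips decoding entirely:

import std.range;
import std.string : representation;

// Generic: for a string, front/popFront go through decode() above, so every
// code point gets decoded even though we only ever compare against ASCII.
size_t countSpaces(R)(R r)
{
    size_t n;
    for (; !r.empty; r.popFront())
        if (r.front == ' ') ++n;
    return n;
}

void main()
{
    auto s = "noël and friends";
    auto slow = countSpaces(s);                // decodes each code point
    auto fast = countSpaces(s.representation); // iterates raw ubyte code units
    assert(slow == fast);
}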