November 27, 2013
On Wed, Nov 27, 2013 at 06:22:41PM +0100, Jakob Ovrum wrote:
> On Wednesday, 27 November 2013 at 16:15:53 UTC, Wyatt wrote:
> >I don't remember if it was brought up before, but this makes me wonder if something like an i18nString should exist for cases where it IS important.  Making i18n stuff as simple as it looks like it "should" be has merit, IMO.  (Maybe there's even room for a std.string.i18n submodule?)
> >
> >-Wyatt
> 
> What would it do that std.uni doesn't already?
> 
> i18nString sounds like a range of graphemes to me.

Maybe it should be called graphemeString?

I'm not sure what this has to do with i18n, though. Properly done i18n should use Unicode line-breaking algorithms and other such standardized functions, rather than manipulating graphemes directly (which fails to take into account double-width characters, language-specific decomposition rules, and many other gotchas, not to mention that it performs poorly). AFAIK std.uni already provides a way to extract graphemes when you need it (e.g., for rendering fonts), so there's really no reason to default to graphemeString everywhere in your program. *That* is a sign of poorly written code, IMNSHO.


> I would like a convenient function in std.uni to get such a range of graphemes from a range of points, but I wouldn't want to elevate it to any particular status; that would be a knee-jerk reaction. D's granularity when it comes to Unicode is because there is an appropriate level of representation for each domain. Shoe-horning everything into a range of graphemes is something we should avoid.
> 
> In D, we can write code that is both Unicode-correct and highly performant, while still being simple and pleasant to read. To write such code, one must have a modicum of understanding of how Unicode works (in order to choose the right tools from the toolbox), but I think it's a novel compromise.

Agreed.


T

-- 
MASM = Mana Ada Sistem, Man!
November 27, 2013
On Wednesday, 27 November 2013 at 17:37:48 UTC, Jakob Ovrum wrote:
> On Wednesday, 27 November 2013 at 17:30:22 UTC, Jacob Carlborg wrote:
>> On 2013-11-27 18:22, Jakob Ovrum wrote:
>>
>>> What would it do that std.uni doesn't already?
>>
>> A class/struct that handles all these normalizations and other stuff automatically.
>
> Sounds terrible :)

+1

Working with graphemes is a rather expensive thing to do performance-wise. I like how D makes this fact obvious and provides a continuous transition through abstraction levels here. It is important to make the costs obvious.
November 27, 2013
On 11/27/13 7:43 AM, Jakob Ovrum wrote:
> On that note, I tried to use std.uni to write a simple example of how to
> correctly handle this in D, but it became apparent that std.uni should
> expose something like `byGrapheme` which lazily transforms a range of
> code points to a range of graphemes (probably needs a `byCodePoint` to
> do the converse too). The two extant grapheme functions,
> `decodeGrapheme` and `graphemeStride`, are *awful* for string
> manipulation (granted, they are probably perfect for text rendering).

Yah, byGrapheme would be a great addition.

Andrei
November 27, 2013
On Wed, Nov 27, 2013 at 10:07:43AM -0800, Andrei Alexandrescu wrote:
> On 11/27/13 7:43 AM, Jakob Ovrum wrote:
> >On that note, I tried to use std.uni to write a simple example of how to correctly handle this in D, but it became apparent that std.uni should expose something like `byGrapheme` which lazily transforms a range of code points to a range of graphemes (probably needs a `byCodePoint` to do the converse too). The two extant grapheme functions, `decodeGrapheme` and `graphemeStride`, are *awful* for string manipulation (granted, they are probably perfect for text rendering).
> 
> Yah, byGrapheme would be a great addition.
[...]

+1. This is better than the GraphemeString / i18nString proposal elsewhere in this thread, because it discourages people from using graphemes (poor performance) except where actually necessary.
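
To illustrate one case where graphemes actually are necessary, here is a rough sketch of string reversal. It assumes a Phobos release in which std.uni exposes `byGrapheme` and `byCodePoint` (i.e. later than the one discussed in this thread); `reverseByCodePoint` and `reverseByGrapheme` are names made up for the example.

```d
import std.array : array;
import std.conv : text, to;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: reverse code point by code point.
// Cheap, but a combining mark ends up attached to the wrong base character.
string reverseByCodePoint(string s)
{
    return s.retro.array.to!string;
}

// Hypothetical helper: reverse grapheme by grapheme.
// More expensive, but each cluster (base + combining marks) stays intact.
string reverseByGrapheme(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    import std.stdio : writeln;

    string s = "noe\u0308l"; // "noël" spelled with a combining diaeresis
    writeln(reverseByCodePoint(s)); // diaeresis lands on the 'l'
    writeln(reverseByGrapheme(s));  // diaeresis stays on the 'e'
}
```

Everything that does not reorder code points (searching, slicing, comparing normalized input) can stay on the cheaper levels.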


T

-- 
He who laughs last thinks slowest.
November 27, 2013
On 2013-11-27 18:56, Dicebot wrote:

> +1
>
> Working with graphemes is a rather expensive thing to do performance-wise.
> I like how D makes this fact obvious and provides a continuous transition
> through abstraction levels here. It is important to make the costs obvious.

I think it's missing a final high-level abstraction. As with the rest of the abstractions, you wouldn't be forced to use it.

-- 
/Jacob Carlborg
November 27, 2013
On 11/27/2013 8:18 AM, Wyatt wrote:
> It honestly surprised me how
> many things in std.uni don't seem to work on ranges.

Many things in Phobos either predate ranges, or are written by people who aren't used to ranges and don't think in terms of ranges. It's an ongoing issue, and one we need to improve upon.

And, of course, you're welcome to pitch in and help with pull requests on the documentation and implementation!

November 27, 2013
On 11/27/2013 06:45 AM, David Nadlinger wrote:
> On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
>> Through Reddit I have seen this small comparison of Unicode handling between different programming languages:
>>
>> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>>
>> D+Phobos seem to fail most things (it produces BAFFLE):
>> http://dpaste.dzfl.pl/a5268c435
>
> If you need to perform this kind of operation on Unicode strings in D, you can call normalize (std.uni) on the string first to make sure it is in one of the Normalization Forms. For example, just appending .normalize to your strings (which defaults to NFC) would make the code produce the "expected" results.
>
> As far as I'm aware, this behavior is the result of a deliberate decision, as normalizing strings on the fly isn't really cheap.
>
> David
>
I don't like the overhead, and I don't know how important this is, but perhaps the best way to solve it would be to have string include a "normalization" byte, saying whether it was normalized, and if so in what way.  That there can be multiple ways of normalizing is painful, but it *is* the standard.  And this would allow normalization to be skipped whenever the comparison of two strings showed the same normalization (or lack thereof).  What to do if they're normalized differently is a bit of a puzzle, but most reasonable solutions would work for most cases, so you just need a way to override the defaults.
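
A minimal sketch of what such a tag might look like, purely as an illustration: `Form`, `TaggedString`, `tagNFC` and `sameText` are hypothetical names (nothing like them exists in Phobos); only `normalize` and `NFC` come from std.uni. The fallback for mismatched or unknown tags is one of the "reasonable solutions" mentioned above, chosen arbitrarily here.

```d
import std.uni : normalize, NFC;

// Hypothetical tag: one byte recording how (or whether) the text is normalized.
enum Form : ubyte { unknown, nfc, nfd }

// Hypothetical wrapper pairing the text with its normalization tag.
struct TaggedString
{
    string data;
    Form form = Form.unknown;
}

// Normalize once, up front, and remember that we did.
TaggedString tagNFC(string s)
{
    return TaggedString(normalize!NFC(s), Form.nfc);
}

// Compare two tagged strings for canonical equivalence.
bool sameText(TaggedString a, TaggedString b)
{
    // Same known form on both sides: a plain comparison is enough.
    if (a.form != Form.unknown && a.form == b.form)
        return a.data == b.data;
    // Otherwise fall back to normalizing both sides (the expensive path).
    return normalize!NFC(a.data) == normalize!NFC(b.data);
}

void main()
{
    auto a = tagNFC("noe\u0308l");        // decomposed input, normalized and tagged
    auto b = tagNFC("no\u00EBl");         // precomposed input, tagged
    auto c = TaggedString("noe\u0308l");  // untagged, still decomposed

    assert(sameText(a, b)); // fast path: both tagged NFC
    assert(sameText(c, b)); // slow path: c has to be normalized first
}
```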

-- 
Charles Hixson

November 27, 2013
On 11/27/2013 08:53 AM, Jakob Ovrum wrote:
> On Wednesday, 27 November 2013 at 16:18:34 UTC, Wyatt wrote:
>> I agree with the assertion that people SHOULD know how Unicode works if they want to work with it, but the way our docs are now is off-putting enough that most probably won't learn anything.  If they know, they know; if they don't, the wall of jargon is intimidating and hard to grasp (it needs more examples up front of the things you'd actually use std.uni for).  Even though I'm decently familiar with Unicode, I was having trouble following all that (e.g. Isn't "noe\u0308l" a grapheme cluster according to std.uni?).  On the flip side, std.utf has a serious dearth of examples and the relationship between the two isn't clear.
>
> I thought it was nice that std.uni had a proper terminology section, complete with links to Unicode documents to kick-start beginners to Unicode. It mentions its relationship with std.utf right at the top.
>
> Maybe the first paragraph is just too thin, and it's hard to see the big picture. Maybe it should include a small leading paragraph detailing the three levels of Unicode granularity that D/Phobos chooses; arrays of code units -> ranges of code points -> std.uni for graphemes and algorithms.
>
>> Yes, please.  While operations on single codepoints and characters seem pretty robust (i.e. you can do lots of things with and to them), it feels like it just falls apart when you try to work with strings.  It honestly surprised me how many things in std.uni don't seem to work on ranges.
>>
>> -Wyatt
>
> Most string code is Unicode-correct as long as it works on code points and all inputs are of the same normalization format; explicit grapheme-awareness is rarely a necessity. By that I mean the most common string operations, such as searching, getting a substring etc. will work without any special grapheme decoding (beyond normalization).
>
> The hiccups appear when code points are shuffled around, or the order is changed. Apart from these rare string manipulation cases, grapheme awareness is necessary for rendering code.
>
I would put things a bit more emphatically.  The codepoint is analogous to assembler, whereas the character is analogous to a high-level language (and the binary representation is analogous to a binary representation).  The desire is to make characters easy to use in a way that is cheap to do.  To me this means that the high-level language (i.e., D) should make it easy to deal with characters, possible to deal with codepoints, and possible to deal with binary representations if you really want to.  (Also note that the isomorphism between assembler code and binary is matched by an isomorphism between codepoints and binary representation.)  To do this cheaply, D needs to know what kind of normalization each string is in.  This is likely to cost one byte per string, unless there's some slack in the current representation.

But is this worthwhile?  This is the direction that things will eventually go, but that doesn't really mean that we need to push them in that direction today.  But if D had a default normalization that occurred during I/O operations, the cost of the normalization would probably be lost during the impedance matching between RAM and storage.  (Again, however, any default requires the ability to be overridden.)

Also, of course, none of this will be of any significance to ASCII.
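
For concreteness, a small sketch of the three levels (binary representation, codepoints, characters) as they show up for a D string, and of why ASCII is unaffected. It assumes a Phobos release where std.uni provides `byGrapheme`; `Counts` and `countLevels` are hypothetical names used only for this example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical record holding the size of a string at each level.
struct Counts
{
    size_t codeUnits;  // binary representation (UTF-8 bytes for string)
    size_t codePoints; // what iterating a string as a range yields
    size_t graphemes;  // user-perceived characters
}

// Hypothetical helper: measure the same string at all three levels.
Counts countLevels(string s)
{
    return Counts(s.length, s.walkLength, s.byGrapheme.walkLength);
}

void main()
{
    import std.stdio : writeln;

    writeln(countLevels("noe\u0308l")); // decomposed "noël": the levels disagree
    writeln(countLevels("noel"));       // plain ASCII: all three levels agree
}
```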

-- 
Charles Hixson

November 27, 2013
27-Nov-2013 18:45, David Nadlinger wrote:
> On Wednesday, 27 November 2013 at 12:46:38 UTC, bearophile wrote:
>> Through Reddit I have seen this small comparison of Unicode handling
>> between different programming languages:
>>
>> http://mortoray.com/2013/11/27/the-string-type-is-broken/
>>
>> D+Phobos seem to fail most things (it produces BAFFLE):
>> http://dpaste.dzfl.pl/a5268c435
>
> If you need to perform this kind of operation on Unicode strings in D,
> you can call normalize (std.uni) on the string first to make sure it is
> in one of the Normalization Forms. For example, just appending
> .normalize to your strings (which defaults to NFC) would make the code
> produce the "expected" results.
>
> As far as I'm aware, this behavior is the result of a deliberate
> decision, as normalizing strings on the fly isn't really cheap.

It's anything but cheap.
At a minimum, imagine crawling the string and issuing a table lookup per codepoint.
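
For reference, a short sketch of the explicit opt-in being discussed, using `normalize` from std.uni (NFC by default, other forms via the template argument). The helper name `equivalent` is made up for the example.

```d
import std.uni : normalize, NFC, NFD;

// Hypothetical helper: compare two strings for canonical equivalence.
// Both sides are normalized explicitly, so the cost is visible at the call site.
bool equivalent(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string decomposed  = "noe\u0308l"; // 'e' followed by a combining diaeresis
    string precomposed = "no\u00EBl";  // single precomposed 'ë'

    assert(decomposed != precomposed);                // raw comparison sees different code points
    assert(normalize(decomposed) == precomposed);     // NFC composes the pair
    assert(normalize!NFD(precomposed) == decomposed); // NFD splits it again
    assert(equivalent(decomposed, precomposed));
}
```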

>
> David


-- 
Dmitry Olshansky
November 27, 2013
On 27.11.2013 19:07, Andrei Alexandrescu wrote:
> On 11/27/13 7:43 AM, Jakob Ovrum wrote:
>> On that note, I tried to use std.uni to write a simple example of how to
>> correctly handle this in D, but it became apparent that std.uni should
>> expose something like `byGrapheme` which lazily transforms a range of
>> code points to a range of graphemes (probably needs a `byCodePoint` to
>> do the converse too). The two extant grapheme functions,
>> `decodeGrapheme` and `graphemeStride`, are *awful* for string
>> manipulation (granted, they are probably perfect for text rendering).
>
> Yah, byGrapheme would be a great addition.

It shouldn't be hard to make, either:

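import std.uni : Grapheme, decodeGrapheme;
import std.traits : isSomeString;
import std.array : empty;

// Lazy input range that yields one Grapheme at a time from a string.
struct ByGrapheme(T) if (isSomeString!T) {
    Grapheme _front; // the grapheme currently at the front
    bool _empty;     // true once the last grapheme has been popped
    T _range;        // the not-yet-decoded remainder of the input

    this(T value) {
        _range = value;
        popFront(); // prime _front with the first grapheme, if any
    }

    @property
    Grapheme front() {
        assert(!empty);
        return _front;
    }

    void popFront() {
        assert(!empty);
        // We are done only when nothing was left to decode *before* this call.
        _empty = _range.empty;
        if (!_empty) {
            // decodeGrapheme takes the string by ref and advances it.
            _front = decodeGrapheme(_range);
        }
    }

    @property
    bool empty() {
        return _empty;
    }
}

// Convenience function so the range can be used with UFCS: s.byGrapheme
auto byGrapheme(T)(T value) if (isSomeString!T) {
    return ByGrapheme!T(value);
}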

void main() {
    import std.stdio;
    string s = "তঃঅ৩৵பஂஅபூ௩ᐁᑦᕵᙧᚠᚳᛦᛰ¥¼Ññ";
    writeln(s.byGrapheme);
}


-- 
  Simen