April 18, 2015
> Another issue is that lower-case and upper-case letters might have different size requirements, or look different depending on where in the word they are located.
>
> For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases.
>
> --
> Paulo

While true, it does not affect wrap (the algorithm) as far as I can see.
April 18, 2015
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>> On 2015-04-18 12:27, Walter Bright wrote:
>>
>>> That doesn't make sense to me, because the umlauts and the accented e
>>> all have Unicode code point assignments.
>>
>> This code snippet demonstrates the problem:
>>
>> import std.stdio;
>>
>> void main ()
>> {
>>    dstring a = "e\u0301";
>>    dstring b = "é";
>>    assert(a != b);
>>    assert(a.length == 2);
>>    assert(b.length == 1);
>>    writeln(a, " ", b);
>> }
>>
>> If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
>>
>> \u0301 is the "combining acute accent" [1].
>>
>> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> Yep, this was the cause of some bugs I had in my program. The thing is, you never know whether a text is composed or decomposed, so you have to be prepared for "é" having a length of either 2 or 1. On OS X these characters are automatically decomposed by default, so if you pipe an "é" (length=1) through the system, it automatically becomes "e\u0301" (length=2). The same goes for file names on OS X. I've had to find a workaround for this more than once.

byGrapheme to the rescue:

http://dlang.org/phobos/std_uni.html#byGrapheme

Or is this unsuitable here?
April 18, 2015
On 2015-04-18 14:25, Gary Willoughby wrote:

> byGrapheme to the rescue:
>
> http://dlang.org/phobos/std_uni.html#byGrapheme
>
> Or is this unsuitable here?

How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:

foreach (e ; "e\u0301".byGrapheme)
    writeln(e);

-- 
/Jacob Carlborg
April 18, 2015
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:
> On 2015-04-18 14:25, Gary Willoughby wrote:
>
>> byGrapheme to the rescue:
>>
>> http://dlang.org/phobos/std_uni.html#byGrapheme
>>
>> Or is this unsuitable here?
>
> How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected:
>
> foreach (e ; "e\u0301".byGrapheme)
>     writeln(e);

void main()
{
    import std.stdio;
    import std.uni;

    foreach (e ; "e\u0301".byGrapheme)
        writeln(e[]); // e[] slices the Grapheme into its code points;
                      // writeln(e) would print the struct itself
}
April 18, 2015
On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
> >On 2015-04-18 12:27, Walter Bright wrote:
> >
> >>That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
> >
> >This code snippet demonstrates the problem:
> >
> >import std.stdio;
> >
> >void main ()
> >{
> >    dstring a = "e\u0301";
> >    dstring b = "é";
> >    assert(a != b);
> >    assert(a.length == 2);
> >    assert(b.length == 1);
> >    writeln(a, " ", b);
> >}
> >
> >If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
> >
> >\u0301 is the "combining acute accent" [1].
> >
> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> 
> Yep, this was the cause of some bugs I had in my program. The thing is, you never know whether a text is composed or decomposed, so you have to be prepared for "é" having a length of either 2 or 1. On OS X these characters are automatically decomposed by default, so if you pipe an "é" (length=1) through the system, it automatically becomes "e\u0301" (length=2). The same goes for file names on OS X. I've had to find a workaround for this more than once.

Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function.
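
For illustration, here is a minimal sketch of that approach using std.uni.normalize (NFC picks the composed form where one exists); after normalizing, the two representations from the snippet above should compare equal:

void main()
{
    import std.uni : normalize, NFC;

    dstring a = "e\u0301"; // decomposed: 'e' + combining acute accent
    dstring b = "é";       // precomposed: a single code point
    assert(a != b);                               // differ before normalization
    assert(normalize!NFC(a) == normalize!NFC(b)); // equal after NFC
}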

Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy.

Given the current state of Phobos, using byGrapheme is still the best bet for correctly counting the number of printed columns, as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of a "character"). Unfortunately, byGrapheme may allocate, which fails Walter's requirements.

Well, to be fair, byGrapheme only *occasionally* allocates (only for input with unusually long sequences of combining diacritics), so for normal use cases you'll pretty much never see an allocation. But the language can't express the idea of "occasionally allocates"; there is only "allocates" or "@nogc", which makes byGrapheme unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.
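
For example, something along these lines would count graphemes without ever materializing them; this is a sketch only, and it could be marked @nogc only once graphemeStride itself no longer allocates:

size_t countGraphemes(string s)
{
    import std.uni : graphemeStride;

    size_t count, i;
    while (i < s.length)
    {
        // graphemeStride returns the number of code units making up
        // the grapheme cluster that starts at index i
        i += graphemeStride(s, i);
        ++count;
    }
    return count;
}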


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!
April 18, 2015
> Wait, I thought the recommended approach is to normalize first, then do
> string processing later? Normalizing first will eliminate
> inconsistencies of this sort, and allow string-processing code to use a
> uniform approach to handling the string. I don't think it's a good idea
> to manually deal with composed/decomposed issues within every individual
> string function.


Problem 1: Normalization is not closed under most operations. E.g., concatenating two normalized strings does not guarantee that the result is in normalized form (see the sketch below).

Problem 2: Some Unicode algorithms, e.g. string comparison, require a normalization step. It doesn't matter which form you use, but you have to pick one.

Now we could say that all strings passed to Phobos have to be normalized to (say) NFC, and that Phobos functions can thus skip the normalization.
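
A small sketch of problem 1: each operand below is individually in NFC, but their concatenation is not, because NFC would compose the pair into a single code point:

void main()
{
    import std.uni : normalize, NFC;

    auto a = normalize!NFC("e"d);      // "e" is already NFC
    auto b = normalize!NFC("\u0301"d); // a lone combining accent is NFC too
    auto c = a ~ b;                    // "e" followed by a combining acute
    assert(c != normalize!NFC(c));     // NFC of the result is the single "é"
}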
April 18, 2015
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
> On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
>> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>> >On 2015-04-18 12:27, Walter Bright wrote:
>> >
>> >>That doesn't make sense to me, because the umlauts and the accented
>> >>e all have Unicode code point assignments.
>> >
>> >This code snippet demonstrates the problem:
>> >
>> >import std.stdio;
>> >
>> >void main ()
>> >{
>> >    dstring a = "e\u0301";
>> >    dstring b = "é";
>> >    assert(a != b);
>> >    assert(a.length == 2);
>> >    assert(b.length == 1);
>> >    writeln(a, " ", b);
>> >}
>> >
>> >If you run the above code all asserts should pass. If your system
>> >correctly supports Unicode (works on OS X 10.10) the two printed
>> >characters should look exactly the same.
>> >
>> >\u0301 is the "combining acute accent" [1].
>> >
>> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>> 
>> Yep, this was the cause of some bugs I had in my program. The thing
>> is, you never know whether a text is composed or decomposed, so you
>> have to be prepared for "é" having a length of either 2 or 1. On OS X
>> these characters are automatically decomposed by default, so if you
>> pipe an "é" (length=1) through the system, it automatically becomes
>> "e\u0301" (length=2). The same goes for file names on OS X. I've had
>> to find a workaround for this more than once.
>
> Wait, I thought the recommended approach is to normalize first, then do
> string processing later? Normalizing first will eliminate
> inconsistencies of this sort, and allow string-processing code to use a
> uniform approach to handling the string. I don't think it's a good idea
> to manually deal with composed/decomposed issues within every individual
> string function.
>
> Of course, even after normalization, you still have the issue of
> zero-width characters and combining diacritics, because not every
> language has precomposed characters handy.
>
> Given the current state of Phobos, using byGrapheme is still the best
> bet for correctly counting the number of printed columns, as opposed
> to the number of "characters" (which, in the Unicode definition, does
> not always match the layman's notion of a "character"). Unfortunately,
> byGrapheme may allocate, which fails Walter's requirements.
>
> Well, to be fair, byGrapheme only *occasionally* allocates (only for
> input with unusually long sequences of combining diacritics), so for
> normal use cases you'll pretty much never see an allocation. But the
> language can't express the idea of "occasionally allocates"; there is
> only "allocates" or "@nogc", which makes byGrapheme unusable in @nogc
> code.
>
> One possible solution would be to modify std.uni.graphemeStride to not
> allocate, since it shouldn't need to do so just to compute the length of
> the next grapheme.
>
>
> T

This is why on OS X I always normalized strings to composed form. However, there are always issues with Unicode because, as you said, the layman's notion of what a character is doesn't match Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It adds a bit of overhead, but I always get the correct length and character access (e.g. checking txt.startsWith("é")).
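
As a rough sketch of that idea (the actual utility wasn't posted here), the grapheme-counting part of such a helper might look like this:

size_t graphemeLength(string txt)
{
    import std.range : walkLength;
    import std.uni : byGrapheme;

    // one count per user-perceived character, regardless of whether
    // the input is composed or decomposed
    return txt.byGrapheme.walkLength;
}

unittest
{
    assert(graphemeLength("e\u0301") == 1); // decomposed "é" counts as one
}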
April 18, 2015
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
> On 2015-04-18 12:27, Walter Bright wrote:
>
>> That doesn't make sense to me, because the umlauts and the accented e
>> all have Unicode code point assignments.
>
> This code snippet demonstrates the problem:
>
> import std.stdio;
>
> void main ()
> {
>      dstring a = "e\u0301";
>      dstring b = "é";
>      assert(a != b);
>      assert(a.length == 2);
>      assert(b.length == 1);
>      writeln(a, " ", b);
> }
>
> If you run the above code all asserts should pass. If your system
> correctly supports Unicode (works on OS X 10.10) the two printed
> characters should look exactly the same.
>
> \u0301 is the "combining acute accent" [1].
>
> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm

Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
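
No such function exists in Phobos yet; a hypothetical normalizeUTF, sketched here as a thin eager wrapper over std.uni.normalize (a real version would presumably be a lazy range adapter), might slot into a pipeline like this:

import std.uni : normalize, NormalizationForm, NFC;

// hypothetical pipeline stage, mirroring std.uni.normalize's signature
auto normalizeUTF(NormalizationForm form = NFC, S)(S s)
{
    return normalize!form(s);
}

// usage: auto clusters = input.normalizeUTF.byGrapheme;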

April 18, 2015
> Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline.

Yes.

> Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei

I don't think so. The thing is, we still have to deal with combining characters even after normalization, because every normalization form leaves some combining characters in place.
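
For example, "x" plus a combining acute accent has no precomposed code point in Unicode, so even NFC has to leave the combining character in:

void main()
{
    import std.uni : normalize, NFC;

    // no precomposed "x with acute" exists, unlike "é"
    auto s = normalize!NFC("x\u0301"d);
    assert(s.length == 2); // still two code points after NFC
}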

April 18, 2015
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
> \u0301 is the "combining acute accent" [1].
>
> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm

I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two-code-point version.