Thread overview
Accented Characters and Counting Syllables
Dec 06, 2014
Nordlöw
Dec 06, 2014
H. S. Teoh
Dec 07, 2014
Nordlöw
Dec 07, 2014
H. S. Teoh
Dec 08, 2014
Nordlöw
Dec 08, 2014
Nordlöw
Dec 07, 2014
anonymous
Dec 07, 2014
Marc Schütz
Dec 07, 2014
John Colvin
Dec 07, 2014
Marc Schütz
December 06, 2014
Given the fact that

    static assert("é".length == 2);

I was surprised that

    static assert("é".byCodeUnit.length == 2);
    static assert("é".byCodePoint.length == 2);

Isn't there a way to iterate over accented characters (in my case UTF-8) in D? Or is this an inherent problem in Unicode? I need this in a syllable counting algorithm that needs to distinguish accented and non-accented variants of vowels. For example, café (2 syllables) compared to babe (1 syllable).
December 06, 2014
On Sat, Dec 06, 2014 at 10:37:17PM +0000, "Nordlöw" via Digitalmars-d-learn wrote:
> Given the fact that
> 
>     static assert("é".length == 2);
> 
> I was surprised that
> 
>     static assert("é".byCodeUnit.length == 2);
>     static assert("é".byCodePoint.length == 2);
> 
> Isn't there a way to iterate over accented characters (in my case UTF-8) in D? Or is this an inherent problem in Unicode? I need this in a syllable counting algorithm that needs to distinguish accented and non-accented variants of vowels. For example, café (2 syllables) compared to babe (1 syllable).

This is a Unicode issue. What you want is neither byCodeUnit nor byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of what lay people would call a "character". A Unicode character (or more precisely, a "code point") is not necessarily a complete grapheme, as your example above shows; it's just a numerical value that uniquely identifies an entry in the Unicode character database.
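
To make the three levels concrete, here is a minimal sketch that counts code units, code points and graphemes for two spellings of "é". The helper name graphemeCount is made up for illustration; byGrapheme and walkLength are the Phobos functions. The escapes \u00E9 and \u0301 are used so the result does not depend on how the source file happens to encode the accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string precomposed = "\u00E9";   // single code point U+00E9
    string decomposed  = "e\u0301";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(precomposed.length == 2);          // UTF-8 code units
    assert(precomposed.walkLength == 1);      // code points (string auto-decodes)
    assert(graphemeCount(precomposed) == 1);  // graphemes

    assert(decomposed.length == 3);
    assert(decomposed.walkLength == 2);
    assert(graphemeCount(decomposed) == 1);

    // "café" written with a combining accent is still four graphemes.
    assert(graphemeCount("cafe\u0301") == 4);
}
```

Only the decomposed spelling has a grapheme that spans more than one code point; the precomposed one is already a single code point, so the 2 in the original asserts is its UTF-8 code unit count.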


T

-- 
There are 10 kinds of people in the world: those who can count in binary, and those who can't.
December 07, 2014
On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
> Given the fact that
>
>     static assert("é".length == 2);
>
> I was surprised that
>
>     static assert("é".byCodeUnit.length == 2);
>     static assert("é".byCodePoint.length == 2);

string already iterates over code points. So byCodePoint doesn't
have to do anything on it, and it just returns the same string
again.

string's .length is the number of code units. It's not compatible
with the range primitives. That's why hasLength is false for
string (and wstring). Don't use .length on ranges without
checking hasLength.

So, while "é".byCodeUnit and "é".byCodePoint have equal
`.length`s, they have different range element counts.
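
A small sketch of the difference, assuming a Phobos where string is still auto-decoded. The helper names codeUnits and codePoints are made up for illustration; hasLength, walkLength and byCodeUnit are the library names.

```d
import std.range : hasLength, walkLength;
import std.utf : byCodeUnit;

// string has a .length property, but the range API does not treat it as a length.
static assert(!hasLength!string);
// The byCodeUnit wrapper does have a range-compatible length.
static assert(hasLength!(typeof("".byCodeUnit)));

// Hypothetical helpers: count range elements instead of trusting .length.
size_t codeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string s = "\u00E9";         // precomposed é, two UTF-8 code units
    assert(s.length == 2);       // array length, in code units
    assert(codeUnits(s) == 2);   // range element count by code unit
    assert(codePoints(s) == 1);  // range element count by code point
}
```

walkLength uses .length when hasLength is true and otherwise walks the range, so it gives the element count in both cases.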
December 07, 2014
On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
>     static assert("é".byCodePoint.length == 2);

Huh? Why is byCodePoint.length even defined?
December 07, 2014
On Sunday, 7 December 2014 at 13:24:28 UTC, Marc Schütz wrote:
> On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
>>    static assert("é".byCodePoint.length == 2);
>
> Huh? Why is byCodePoint.length even defined?

Because string has ElementType dchar (i.e. it already iterates by code point), byCodePoint is just the identity function.
December 07, 2014
On Sunday, 7 December 2014 at 13:24:28 UTC, Marc Schütz wrote:
> On Saturday, 6 December 2014 at 22:37:19 UTC, Nordlöw wrote:
>>    static assert("é".byCodePoint.length == 2);
>
> Huh? Why is byCodePoint.length even defined?

    import std.uni;
    pragma(msg, typeof("é".byCodePoint));
    // prints: string

Something's very broken...

It's this definition in std.uni:

    Range byCodePoint(Range)(Range range)
        if(isInputRange!Range && is(Unqual!(ElementType!Range) == dchar))
    {
        return range;
    }

`Unqual!(ElementType!string)` is indeed `dchar` because of auto-decoding.

Filed as bug:
https://issues.dlang.org/show_bug.cgi?id=13829
December 07, 2014
On Saturday, 6 December 2014 at 23:11:49 UTC, H. S. Teoh via Digitalmars-d-learn wrote:
> This is a Unicode issue. What you want is neither byCodeUnit nor
> byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of
> what lay people would call a "character". A Unicode character (or more
> precisely, a "code point") is not necessarily a complete grapheme, as
> your example above shows; it's just a numerical value that uniquely
> identifies an entry in the Unicode character database.
>
>
> T

Ok, thanks.

I just noticed that byGrapheme() lacks bidirectional access. There is also no graphemeStrideBack() to complement graphemeStride(), the way strideBack() complements stride(). Is this difficult to implement?
December 07, 2014
On Sun, Dec 07, 2014 at 02:30:13PM +0000, "Nordlöw" via Digitalmars-d-learn wrote:
> On Saturday, 6 December 2014 at 23:11:49 UTC, H. S. Teoh via Digitalmars-d-learn wrote:
> >This is a Unicode issue. What you want is neither byCodeUnit nor byCodePoint, but byGrapheme. A grapheme is the Unicode equivalent of what lay people would call a "character". A Unicode character (or more precisely, a "code point") is not necessarily a complete grapheme, as your example above shows; it's just a numerical value that uniquely identifies an entry in the Unicode character database.
> >
> >
> >T
> 
> Ok, thanks.
> 
> I just noticed that byGrapheme() lacks bidirectional access. Further
> it also lacks graphemeStrideBack() in complement to graphemeStride()?
> Similar to stride() and strideBack(). Is this difficult to implement?

Not sure, but I wouldn't be surprised if it is. Unicode algorithms are generally non-trivial.
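
In the meantime, one possible workaround is to buffer the graphemes in an array, which is random-access, and walk that backwards. This is only a sketch: it sidesteps the missing graphemeStrideBack() instead of implementing it, and it costs memory proportional to the text. The helper name graphemes is made up for illustration.

```d
import std.algorithm.comparison : equal;
import std.array : array;
import std.range : retro;
import std.uni : byCodePoint, byGrapheme, Grapheme;

// Hypothetical helper: collect graphemes so they can be visited in any order.
Grapheme[] graphemes(string s)
{
    return s.byGrapheme.array;
}

void main()
{
    auto gs = graphemes("cafe\u0301");  // "café" with a combining accent
    assert(gs.length == 4);

    // The last grapheme holds both 'e' and U+0301.
    assert(gs[$ - 1].length == 2);      // Grapheme.length counts code points
    assert(gs[$ - 1][0] == 'e');

    // Reverse grapheme by grapheme; the accent stays attached to its 'e'.
    assert(gs.retro.byCodePoint.equal("e\u0301fac"));
}
```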


T

-- 
Who told you to swim in Crocodile Lake without life insurance??
December 08, 2014
On Sunday, 7 December 2014 at 15:47:45 UTC, H. S. Teoh via Digitalmars-d-learn wrote:
>> Ok, thanks.
>> 
>> I just noticed that byGrapheme() lacks bidirectional access. Further
>> it also lacks graphemeStrideBack() in complement to graphemeStride()?
>> Similar to stride() and strideBack(). Is this difficult to implement?
>
> Not sure, but I wouldn't be surprised if it is. Unicode algorithms are
> generally non-trivial.
>
>
> T

What's the best source of information for these algorithms? Is it certain that graphemes can, by definition, be iterated backwards?
December 08, 2014
On Monday, 8 December 2014 at 14:57:06 UTC, Nordlöw wrote:
> What's the best source of information for these algorithms? Is it certain that graphemes iteration is backwards iteratable by definition?

I guess

https://en.wikipedia.org/wiki/Combining_character

could be a good start.