October 16, 2013
On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
> On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
>> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
>
> Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.

Most code might be buggy then.

An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
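
A minimal D sketch of the three ways to count "bär" in both spellings. It assumes a Phobos release whose std.uni provides byGrapheme; the helper names codeUnits, codePoints and graphemes are made up for this illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

/// UTF-8 code units (bytes); this is what the built-in .length reports.
size_t codeUnits(string s) { return s.length; }

/// Unicode code points, as counted by std.utf.count.
size_t codePoints(string s) { return count(s); }

/// Graphemes (user-perceived characters), as segmented by std.uni.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    import std.stdio : writefln;

    string composed   = "b\u00E4r";  // ä as the single code point U+00E4
    string decomposed = "ba\u0308r"; // a followed by combining diaeresis U+0308

    // composed:   4 code units, 3 code points, 3 graphemes
    // decomposed: 5 code units, 4 code points, 3 graphemes
    foreach (s; [composed, decomposed])
        writefln("%s code units, %s code points, %s graphemes",
                 codeUnits(s), codePoints(s), graphemes(s));
}
```

Only the grapheme count is the same for both spellings, which is why neither .length nor std.utf.count alone settles what "length" means for a file name.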
October 16, 2013
On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
> On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
>> On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
>>> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
>>
>> Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
>
> Most code might be buggy then.
>
> An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.

Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
October 16, 2013
On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
> On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
>> On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
>>> On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
>>>> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
>>>
>>> Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
>>
>> Most code might be buggy then.
>>
>> An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
>
> Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.

I'm not sure this is a "D" issue though: It's a fact of unicode
that there are two different ways to write ä.
October 16, 2013
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote:
> On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
>> On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
>>> On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
>>>> On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
>>>>> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
>>>>
>>>> Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
>>>
>>> Most code might be buggy then.
>>>
>>> An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
>>
>> Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
>
> I'm not sure this is a "D" issue though: It's a fact of unicode
> that there are two different ways to write ä.

My point was it would have been nice to have a native D function that can convert between the two types, especially because this is a well known issue. NSString (Cocoa / Objective-C) for example has things like precomposedStringWithCompatibilityMapping etc.
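
For reference, a hedged sketch of such a conversion using std.uni.normalize, which exists in Phobos releases that ship the rewritten std.uni. Whether it was available to the poster at the time is not stated in the thread. The wrapper names toComposed and toDecomposed are hypothetical.

```d
import std.uni : normalize, NFC, NFD;

/// Composed form (NFC): "a\u0308" becomes "\u00E4".
string toComposed(string s) { return normalize!NFC(s); }

/// Decomposed form (NFD): "\u00E4" becomes "a\u0308".
string toDecomposed(string s) { return normalize!NFD(s); }

void main()
{
    string decomposed = "ba\u0308r"; // the form described for OS X in this thread
    string composed   = "b\u00E4r";  // the form described for Linux in this thread

    // The raw strings differ, so a plain == comparison fails.
    assert(decomposed != composed);

    // After normalizing to a common form they compare equal.
    assert(toComposed(decomposed) == composed);
    assert(toDecomposed(composed) == decomposed);
}
```

Normalizing both sides to the same form before comparing or sending strings avoids depending on which form the operating system happened to produce.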
October 16, 2013
On Wednesday, 16 October 2013 at 09:00:01 UTC, monarch_dodra wrote:
> On Wednesday, 16 October 2013 at 08:48:30 UTC, Chris wrote:
>> On Wednesday, 16 October 2013 at 08:03:26 UTC, qznc wrote:
>>> On Tuesday, 15 October 2013 at 14:11:37 UTC, Kagamin wrote:
>>>> On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
>>>>> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?
>>>>
>>>> Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
>>>
>>> Most code might be buggy then.
>>>
>>> An issue that often comes up is file names. A file called "bär" will be normalized differently depending on the operating system. In both cases it is one grapheme. However, on Linux it is one code point, but on OS X it is two code points.
>>
>> Now that you mention it, I had a program that would send strings to a socket written in D. Before I could process the strings on OS X, I had to normalize the decomposed OS X version of the strings to the composed form that D could handle, else it wouldn't work. I used libutf8proc for it (only one tiny little function). It was no problem to interface to the C library, however, I thought it would have been nice, if D could've handled this on its own without depending on third party libraries.
>
> I'm not sure this is a "D" issue though: It's a fact of unicode
> that there are two different ways to write ä.

As I argued previously, it is an implementation issue: "bär" is treated as a sequence of objects that are not capable of representing its values (like int[] = [3.14]). By the way, it is a rare case of a type system hole. Usually in D you need a cast or a union to reinterpret a value; with "bär"[X] you do not.
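
A small sketch of what indexing does here, assuming the ä is stored as U+00E4 (UTF-8 bytes C3 A4). The helper names codeUnitAt and codePointAt are invented for this example.

```d
import std.utf : decode;

/// Plain indexing: returns one UTF-8 code unit, which may be only
/// a fragment of a character.
char codeUnitAt(string s, size_t i) { return s[i]; }

/// Decoding: returns the whole code point that starts at index i.
dchar codePointAt(string s, size_t i) { return decode(s, i); }

void main()
{
    string s = "b\u00E4r"; // UTF-8 bytes: 62 C3 A4 72

    // No cast is needed to obtain half of the ä.
    assert(codeUnitAt(s, 1) == 0xC3); // lead byte only
    assert(codeUnitAt(s, 2) == 0xA4); // continuation byte only

    // Decoding from the same position yields the full code point.
    assert(codePointAt(s, 1) == '\u00E4');
}
```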
October 16, 2013
On 2013-10-16 10:03, qznc wrote:

> Most code might be buggy then.
>
> An issue that often comes up is file names. A file called "bär" will be
> normalized differently depending on the operating system. In both cases
> it is one grapheme. However, on Linux it is one code point, but on OS X
> it is two code points.

Why would it require two code points?

-- 
/Jacob Carlborg
October 16, 2013
On Wednesday, 16 October 2013 at 12:18:40 UTC, Jacob Carlborg wrote:
> On 2013-10-16 10:03, qznc wrote:
>
>> Most code might be buggy then.
>>
>> An issue that often comes up is file names. A file called "bär" will be
>> normalized differently depending on the operating system. In both cases
>> it is one grapheme. However, on Linux it is one code point, but on OS X
>> it is two code points.
>
> Why would it require two code points?

It is either [U+00E4] as one code point or [a,U+0308] for two code points. The second is "combining diaeresis" [0]. Not required, but possible. Those combining characters [1] provide a nearly infinite number of combinations. You can go crazy with it: http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work

[0] http://www.fileformat.info/info/unicode/char/0308/index.htm
[1] http://en.wikipedia.org/wiki/Combining_character
October 16, 2013
On 2013-10-16 14:33, qznc wrote:

> It is either [U+00E4] as one code point or [a,U+0308] for two code
> points. The second is "combining diaeresis" [0]. Not required, but
> possible. Those combining characters [1] provide a nearly infinite
> number of combinations. You can go crazy with it:
> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>
> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
> [1] http://en.wikipedia.org/wiki/Combining_character

Aha, now I see.

-- 
/Jacob Carlborg
October 16, 2013
On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
> On 2013-10-16 14:33, qznc wrote:
>
>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>> points. The second is "combining diaeresis" [0]. Not required, but
>> possible. Those combining characters [1] provide a nearly infinite
>> number of combinations. You can go crazy with it:
>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>
>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>> [1] http://en.wikipedia.org/wiki/Combining_character
>
> Aha, now I see.

One of the interesting points is that with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then you'll get: "boär" vs "boör"

Which is the correct behavior? There is no correct answer.
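
The replace case can be reproduced with std.array.replace; this is a sketch, and the wrapper name replaceAWithO is made up for the example.

```d
import std.array : replace;

/// Replaces every code point 'a' with 'o'; knows nothing about graphemes.
string replaceAWithO(string s) { return s.replace("a", "o"); }

void main()
{
    string composed   = "ba\u00E4r";  // b, a, ä (U+00E4), r
    string decomposed = "baa\u0308r"; // b, a, a, U+0308, r

    // The precomposed ä contains no 'a' code point, so it is untouched.
    assert(replaceAWithO(composed) == "bo\u00E4r");    // "boär"

    // The decomposed ä starts with a plain 'a', which is replaced and
    // then still carries the combining diaeresis.
    assert(replaceAWithO(decomposed) == "boo\u0308r"); // "boör"
}
```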

So while a grapheme should never be separated from its "letter" (e.g., sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity.

Long story short: unicode is f***ing complicated.

And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand it, can explain it, and can work with it.

On the other hand, I don't know many C++ coders that understand unicode.
October 16, 2013
On Wednesday, 16 October 2013 at 18:13:37 UTC, monarch_dodra wrote:
> On Wednesday, 16 October 2013 at 13:57:01 UTC, Jacob Carlborg wrote:
>> On 2013-10-16 14:33, qznc wrote:
>>
>>> It is either [U+00E4] as one code point or [a,U+0308] for two code
>>> points. The second is "combining diaeresis" [0]. Not required, but
>>> possible. Those combining characters [1] provide a nearly infinite
>>> number of combinations. You can go crazy with it:
>>> http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work
>>>
>>> [0] http://www.fileformat.info/info/unicode/char/0308/index.htm
>>> [1] http://en.wikipedia.org/wiki/Combining_character
>>
>> Aha, now I see.
>
> One of the interesting points is that with "ba\u00E4r" vs "baa\u0308r", you can run a replace to replace 'a' with 'o'. Then you'll get: "boär" vs "boör"
>
> Which is the correct behavior? There is no correct answer.
>
> So while a grapheme should never be separated from its "letter" (e.g., sorting "oäa" should *not* generate "aaö". What it *should* generate is up to debate), you can't entirely consider that a letter+grapheme is a single entity.
>
> Long story short: unicode is f***ing complicated.
>
> And I think D does a *damn* fine job of supporting it. In particular, it does an awesome job of *teaching* the coder *what* unicode is. Virtually everyone here has solid knowledge of unicode (I feel). They understand it, can explain it, and can work with it.
>
> On the other hand, I don't know many C++ coders that understand unicode.

I agree with your point. Nevertheless, your understanding of grapheme is off. U+0308 is not a grapheme. "a\u0308" is one grapheme. U+00e4 is the same grapheme as "a\u0308".

http://en.wikipedia.org/wiki/Grapheme
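
A short sketch of that distinction using std.uni's Grapheme type, assuming a Phobos release that provides byGrapheme. The helper name firstGraphemeSize is hypothetical.

```d
import std.uni : byGrapheme;

/// Number of code points that make up the first grapheme of s.
size_t firstGraphemeSize(string s)
{
    auto r = s.byGrapheme;
    auto g = r.front; // a std.uni.Grapheme
    return g.length;
}

void main()
{
    // "a\u0308" is one grapheme built from two code points.
    auto r = "a\u0308".byGrapheme;
    auto g = r.front;
    assert(g.length == 2);
    assert(g[0] == 'a');
    assert(g[1] == '\u0308'); // the combining diaeresis is a part, not a grapheme

    // U+00E4 is one grapheme built from a single code point.
    assert(firstGraphemeSize("\u00E4") == 1);
}
```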