May 28, 2016
On Saturday, 28 May 2016 at 19:04:14 UTC, Walter Bright wrote:
> On 5/28/2016 5:04 AM, Andrei Alexandrescu wrote:
>> So it harkens back to the original mistake: strings should NOT be arrays with
>> the respective primitives.
>
> An array of code units provides consistency, predictability, flexibility, and performance. It's a solid base upon which the programmer can build what he needs as required.
>
> A string class does not do that (from the article: "I admit the correct answer is not always clear").

You're right. An "array of code units" is a very useful low-level primitive. I've dealt with a lot of code that uses these (more or less correctly) in various languages.

But when providing such a thing, I think it's very important to make it *look* like a low-level primitive, and use the type system to distinguish it from higher-level ones.


E.g. a string literal should not implicitly convert into an array of code units. What should it implicitly convert to? I'm not sure. Something close to how it looks in the source code, probably. A sequential range of graphemes? From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments. But one thing I'm sure of (and this is from other languages/APIs, not from D specifically): a function which converts from one representation to another, but doesn't keep track of the change (e.g. via a different compile-time type, or via state in a "string" class recording whether it is in normalized form), is a "bug farm".
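The normalization case of that "bug farm" is easy to demonstrate. A minimal sketch in Python (used here only because the behavior is quick to check interactively; the point is language-independent):

```python
import unicodedata

composed = "\u00e9"     # 'é' as one code point (NFC form)
decomposed = "e\u0301"  # 'e' plus combining acute accent (NFD form)

# Same grapheme on screen, but different code point sequences:
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# Nothing in the type records whether a string is normalized, so a
# function that normalizes "sometimes", without tracking it, silently
# changes equality semantics for its callers.
```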

May 28, 2016
On Saturday, 28 May 2016 at 12:04:20 UTC, Andrei Alexandrescu wrote:
> OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string

Yes!

> So it harkens back to the original mistake: strings should NOT be arrays with the respective primitives.

If you're proposing a library type, à la RCStr, as an alternative, then yeah.


May 29, 2016
On 05/28/2016 03:04 PM, Andrei Alexandrescu wrote:
> On 5/28/16 6:59 AM, Marc Schütz wrote:
>> The fundamental problem is choosing one of those possibilities over the others without knowing what the user actually wants, which is what both BEFORE and AFTER do.
> 
> OK, that's a fair argument, thanks. So it seems there should be no "default" way to iterate a string.

Ideally there should not be a way to iterate a (Unicode) string at all without explicitly stating the mode of operation, i.e.:

struct String
{
    private void[] data;

    CodeUnitRange byCodeUnit ( );
    CodePointRange byCodePoint ( );
    GraphemeRange byGrapheme ( );
    bool normalize ( );
}

(byGrapheme and normalize have rather expensive dependencies so probably better to provide those via UFCS on demand)
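For illustration, here's a rough Python analogue of those three views over one string. Note the grapheme step below is a deliberately simplified clustering rule (attach combining marks to the preceding base character), not full UAX #29 segmentation:

```python
import unicodedata

s = "cafe\u0301"  # "café" with a decomposed é

by_code_unit = list(s.encode("utf-8"))  # UTF-8 code units (bytes)
by_code_point = list(s)                 # Unicode code points

# Crude grapheme clustering: glue combining marks (combining class != 0)
# onto the preceding base character. Real segmentation follows UAX #29
# and handles many more cases.
clusters = []
for ch in s:
    if clusters and unicodedata.combining(ch):
        clusters[-1] += ch
    else:
        clusters.append(ch)

print(len(by_code_unit))   # 6: the combining accent is 2 bytes in UTF-8
print(len(by_code_point))  # 5: 'c', 'a', 'f', 'e', U+0301
print(len(clusters))       # 4: 'c', 'a', 'f', 'é'
```

The three counts differ on the same string, which is exactly why no single one of them deserves to be the "default" iteration.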
May 29, 2016
On Saturday, 28 May 2016 at 22:29:12 UTC, Andrew Godfrey wrote:
[snip]
>
>
> From all the detail in this thread, I wonder now if "a grapheme" is even an unambiguous concept across different environments.

Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, which represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters, <sch>, as in `Schaf` ("sheep"). A bit nit-picky, but we should make clear that we are talking about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages?
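The composed-vs-decomposed point can be made concrete. A small Python check (the rendering claim assumes a Unicode-aware terminal):

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "\u00e9")  # decompose 'é'
print(len(nfd))  # 2 code points, yet one grapheme on the written page

# Code-point-level operations break grapheme integrity: naively
# reversing the decomposed "née" detaches the accent from its base
# letter, so it ends up rendered over the wrong 'e'.
print("ne\u0301e"[::-1])  # 'e\u0301en'
```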

To avoid confusion and misunderstandings we should agree on the terminology first.
May 29, 2016
On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
> Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages?
>
> To avoid confusion and misunderstandings we should agree on the terminology first.

No, this is well-established terminology; you are confusing several things here:

- A grapheme is a "character" as written on the page
- A phoneme is a spoken "character"
- A codepoint is the fundamental "unit" of Unicode

Graphemes are built from one or more codepoints.
Phonemes are a different topic and not really covered by the Unicode standard AFAIK, except for IPA notation, but those are again graphemes that represent phonemes.
May 29, 2016
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
> On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
>> Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages?
>>
>> To avoid confusion and misunderstandings we should agree on the terminology first.
>
> No, this is well established terminology, you are confusing several things here:
>
> - A grapheme is a "character" as written on the page
> - A phoneme is a spoken "character"
> - A codepoint is the fundamental "unit" of unicode
>
> Graphemes are built from one or more codepoints.
> Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.

I am pretty sure that a single grapheme in Unicode does not correspond to your notion of "character". What you think of as a "character" is officially called a "grapheme cluster", not a "grapheme".

See here: http://www.unicode.org/glossary/#grapheme_cluster
May 29, 2016
On Sunday, 29 May 2016 at 11:47:30 UTC, Tobias Müller wrote:
> On Sunday, 29 May 2016 at 11:25:11 UTC, Chris wrote:
>> Unicode graphemes are not always the same as graphemes in natural (written) languages. If <é> is composed in Unicode, it is still one grapheme in a written language, not two distinct characters. However, in natural languages two characters can be one grapheme, as in English <sh>, it represents the sound in `shower, shop, fish`. In German the same sound is represented by three characters <sch> as in `Schaf` ("sheep"). A bit nit-picky but we should make clear that we talk about "Unicode graphemes" that map to single characters on the written page. But is that at all possible across all languages?
>>
>> To avoid confusion and misunderstandings we should agree on the terminology first.
>
> No, this is well established terminology, you are confusing several things here:
>
> - A grapheme is a "character" as written on the page
> - A phoneme is a spoken "character"
> - A codepoint is the fundamental "unit" of unicode
>
> Graphemes are built from one or more codepoints.
> Phonemes are a different topic and not really covered by the unicode standard AFAIK. Except for the IPA notation, but these are again graphemes that represent phonemes.

Ok, you have a point there; to be precise, <sh> is a multigraph (a digraph; cf. [1]). In French you can have multigraphs consisting of three or more characters, e.g. <eau> /o/; similarly Irish has <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character", as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th", as in `this` vs. `thorough`).

My point was that we have to be _very_ careful not to mix up our cultural experience of written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc.

[1] https://en.wikipedia.org/wiki/Grapheme
May 29, 2016
On Sunday, 29 May 2016 at 12:08:52 UTC, default0 wrote:
> I am pretty sure that a single grapheme in unicode does not correspond to your notion of "character". I am pretty sure that what you think of as a "character" is officially called "Grapheme Cluster" not "Grapheme".

Grapheme is a linguistic term. AFAIUI, a grapheme cluster is a cluster of code points representing a grapheme. It's called a "cluster" in the Unicode spec because there is no dedicated grapheme unit.

I put "character" into quotes, because the term is not really well defined. I just used it for a short and pregnant answer. I'm sure there's a better/more correct definition of graphem/phoneme, but it's probably also much longer and complicated.
May 29, 2016
On Sunday, 29 May 2016 at 12:41:50 UTC, Chris wrote:
> Ok, you have a point there, to be precise <sh> is a multigraph (a digraph)(cf. [1]). In French you can have multigraphs consisting of three or more characters <eau> /o/, as in Irish <aoi> => /i:/. However, a phoneme is not necessarily a spoken "character" as <sh> represents one phoneme but consists of two "characters" or graphemes. <th> can represent two different phonemes (voiced and unvoiced "th" as in `this` vs. `thorough`).

What I meant was, a phoneme is the "character" (smallest unit) in a spoken language, not that it corresponds to a character (whatever that means).

> My point was that we have to be _very_ careful not to mix our cultural experience with written text with machine representations. There's bound to be confusion. That's why we should always make clear what we refer to when we use the words grapheme, character, code point etc.

I used 'character' in quotes because it's not a well-defined term. Code point, grapheme and phoneme are well defined.
May 29, 2016
On Friday, 27 May 2016 at 19:43:16 UTC, H. S. Teoh wrote:
> On Fri, May 27, 2016 at 03:30:53PM -0400, Andrei Alexandrescu via Digitalmars-d wrote:
>> On 5/27/16 3:10 PM, ag0aep6g wrote:
>> > I don't think there is value in distinguishing by language. The point of Unicode is that you shouldn't need to do that.
>> 
>> It seems code points are kind of useless because they don't really mean anything, would that be accurate? -- Andrei
>
> That's what we've been trying to say all along! :-P  They're a kind of low-level Unicode construct used for building "real" characters, i.e., what a layperson would consider to be a "character".

Code points are *the fundamental unit* of Unicode. AFAIK most (all?) algorithms in the Unicode spec are defined in terms of code points. Sure, some algorithms also work on the code unit level; that can be used as an optimization, but they are still defined on code points.

Code points also abstract over the different representations (UTF-8, UTF-16, UTF-32), providing a uniform "interface".
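That uniform interface is easy to see. A quick Python sketch (Python is used only as a convenient checker; the same relationship holds between D's char, wchar and dchar strings):

```python
s = "a\U0001D11E"  # 'a' followed by MUSICAL SYMBOL G CLEF (U+1D11E)

# The code unit count depends on the encoding form...
print(len(s.encode("utf-8")))           # 5 bytes (1 + 4)
print(len(s.encode("utf-16-le")) // 2)  # 3 UTF-16 code units (surrogate pair)
print(len(s.encode("utf-32-le")) // 4)  # 2 UTF-32 code units

# ...but the code point sequence is identical in every representation:
print([hex(ord(c)) for c in s])  # ['0x61', '0x1d11e']
```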