October 13, 2013
Ok, I understand, that "length" is - obviously - used in analogy to any array's length value.

Still, this seems to be inconsistent. D elaborates on implementing strings as UTF-8, which means that a character in D can occupy anywhere between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't this (i.e. the character's length) then be the "unit of measurement" for "char"s - like, e.g., the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: in a consistent implementation, shouldn't

   writeln("säд"[2])

return "д" instead of the trailing surrogate of this cyrillic letter?
Btw., how do YOU implement this for "string"? (For "dstring" it works, logically; for "wstring" the same problem arises for code points above U+FFFF, which UTF-16 encodes with surrogates from the D800 range.)

Also, I understand that there is the std.utf.count() function, which returns the length I was searching for. However, why - if D is so UTF-8-centric - isn't this functionality implemented in the core, like ".length"?
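
To make the question concrete, here is a small sketch of what I mean; the outputs in the comments are only what I expect based on the documentation, and std.utf.count is the library function mentioned above:

import std.stdio;
import std.utf;

void main()
{
    string s = "säд";
    writeln(s.length);          // 5: UTF-8 code units, not characters
    writeln(count(s));          // 3: code points
    writeln(cast(ubyte) s[2]);  // 164: the trailing byte of "ä", not "д"
}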

October 13, 2013
> implementation, shouldn't
>
>    writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this cyrillic letter?

First index is zero, no?
October 13, 2013
On 13.10.2013 16:14, nickles wrote:
> Ok, I understand, that "length" is - obviously - used in analogy to any
> array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing
> "char"s as UTF-8 which means that a "char" in D can be of any length
> between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't
> then this (i.e. the character's length) be the "unit of measurement" for
> "char"s - like e.g. the size of the underlying struct in an array of
> "struct"s? The story continues with indexing "string"s: In a consistent
> implementation, shouldn't
>
>     writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this cyrillic letter?

This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4). However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented (a precomposed "ä" vs. an "a" followed by a combining diaeresis). If the string were in UTF-32, [2] could yield either the Cyrillic character or the umlaut diacritic. The .length of the UTF-32 string could be either 3 or 4.

There are multiple reasons why .length and index access are based on code units rather than code points or any higher-level representation, but one is that the complexity would suddenly be O(n) instead of O(1). In-place modifications of char[] arrays also wouldn't be possible anymore, as the size of the underlying array might have to change.
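
To illustrate the cost difference, here is a rough sketch of what code point access has to do under the hood. codePointAt is a made-up helper for illustration only (it is not a Phobos function); std.utf.stride and std.utf.decode are the real primitives:

import std.stdio;
import std.utf;

// codePointAt is a made-up helper, not a Phobos function: it shows the O(n)
// walk that is needed to reach the n-th code point of a UTF-8 string.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;
    foreach (_; 0 .. n)
        i += stride(s, i);  // width of one code point: 1 to 4 code units
    return decode(s, i);    // decode the code point starting at offset i
}

void main()
{
    string s = "säд";
    writeln(cast(ubyte) s[2]);  // 164: a lone continuation byte, fetched in O(1)
    writeln(codePointAt(s, 2)); // д: found only by walking from the start
}

Every such access has to start from the beginning of the string, so making [] do this silently would hide a linear cost behind constant-time syntax.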
October 13, 2013
On 13.10.2013 15:50, Dmitry Olshansky wrote:
> On 13-Oct-2013 17:25, nickles wrote:
>> Ok, if my understanding is wrong, how do YOU measure the length of a
>> string?
>> Do you always use count(), or is there an alternative?
>>
>>
> It's all there:
> http://www.unicode.org/glossary/
> http://www.unicode.org/versions/Unicode6.3.0/
>
> I measure string length in code units (as defined in the above
> standard). This bears no easy relation to the number of visible
> characters but I don't mind it.
>
> Measuring the number of visible characters isn't trivial, but it can be done
> by counting graphemes. For simple alphabets, counting code points will do the
> trick as well (which is what count does).
>

But you have to take care to normalize the string with respect to diacritics if that count is supposed to be meaningful. OS X, for example (if I remember right), always uses explicit combining characters, while Windows uses precomposed characters where possible.
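
A quick sketch of why that matters for counting, assuming std.uni.normalize behaves as documented; the counts in the comments are what I would expect:

import std.stdio;
import std.uni;
import std.utf;

void main()
{
    string precomposed = "\u00E4";  // "ä" as a single code point
    string decomposed  = "a\u0308"; // "a" followed by a combining diaeresis

    writeln(count(precomposed));                // 1
    writeln(count(decomposed));                 // 2
    writeln(count(normalize!NFC(decomposed)));  // 1: composed before counting
    writeln(count(normalize!NFD(precomposed))); // 2: decomposed before counting
}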
October 13, 2013
> This will _not_ return a trailing surrogate of a Cyrillic letter. It will return the second code unit of the "ä" character (U+00E4).

True. It's UTF-8, not UTF-16.

> However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.

This is not true for UTF-8, which is not subject to "endianism".

> If the string were in UTF-32, [2] could yield either the Cyrillic character, or the umlaut diacritic.
> The .length of the UTF-32 string could be either 3 or 4.

Neither is true for UTF-32. There is no interpretation needed for the code point (except for the "endianism", which could be taken care of in a library/the core).

> There are multiple reasons why .length and index access is based on code units rather than code points or any higher level representation, but one is that the complexity would suddenly be O(n) instead of O(1).

see my last statement below

> In-place modifications of char[] arrays also wouldn't be possible anymore

They would be, but

> as the size of the underlying array might have to change.

Well, that's a point. On the other hand, D is constantly creating and throwing away new strings, so this isn't much of an argument. The current solution puts the programmer in charge of dealing with UTF-x, whereas a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.

Also, implementing such semantics would not per se rule out byte-wise access, would it?

So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-converting them to "string"s? Btw, what would this mean in terms of speed?
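
For instance, is the intended idiom something like the following (just my guess at the pattern, using std.conv.to for the transcoding)?

import std.conv;
import std.stdio;

void main()
{
    string s = "säд";

    dstring d = to!dstring(s);  // decode once: O(n), allocates
    d = d[0 .. 2] ~ '!';        // now indexing and slicing work per code point
    string back = to!string(d); // re-encode as UTF-8: O(n), allocates again

    writeln(back);              // sä!
}

Both conversions allocate and have to touch every code unit, which is what makes me worry about speed.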

There is no irony in my questions. I'm really looking for solutions...
October 13, 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
> Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.

Ironically, the reason is consistency. `string` is just `immutable(char)[]`, and it conforms to the usual array behavior rules. Saying that an array element assignment may allocate is hardly a good option.

> So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-convert them to "string"s? Btw, what would this mean in terms of speed?

If single-element access is needed, `str.front` yields a decoded `dchar`. Or simply use `foreach (dchar d; str)` - at least it won't hide the fact that this is an O(n) operation. Because `str.front` yields a `dchar`, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings.

Slicing and .length are probably the only operations that do not respect the UTF-8 encoding (because they are exactly the same for all arrays).
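
A minimal sketch of that usage; the outputs in the comments assume the usual auto-decoding of narrow strings by the range primitives:

import std.algorithm;
import std.range;
import std.stdio;

void main()
{
    string s = "säд";

    writeln(s.front);           // s: the first code point, decoded to a dchar
    foreach (dchar d; s)        // decodes one code point per iteration
        writeln(d);

    writeln(s.walkLength);      // 3: counts code points, not bytes
    writeln(s.countUntil('д')); // 2: a position measured in code points
}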
October 13, 2013
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
>> This is simply wrong. All strings return the number of code units. And it's only UTF-32 where a code point (~ character) happens to fit into one code unit.
>
> I do not agree:
>
>    writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
>    writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
>    writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
>    writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)
>
> This is not consistent - from my point of view.

This is not the only inconsistency here.

First of all, typeof("säд") yileds string type (immutable char) while typeof(['s', 'ä', 'д']) yileds neither char[], nor wchar[], nor even dchar[] but int[]. In this case D is close to C which also treats character literals as integer type. Secondly, character arrays are only one who have two kinds of array literals - usual [item. item, item] and special "blah", as you see there is no correspondence between them.

If you try `char[] x = cast(char[])['s', 'ä', 'д']`, then the length would indeed be 3 (but don't use this - it is broken).

In D, a dynamic array is represented at the binary level as `struct { void *ptr; size_t length; }`. When you perform certain operations on dynamic arrays, the compiler implements them as calls to runtime functions. However, at runtime it is impossible to do anything useful with an array for which the only available information is the address of its beginning and the total number of elements (this is a source of other problems in D). To handle this, the compiler generates a "TypeInfo" and passes it to the runtime functions as a separate argument. TypeInfo contains some data; the most relevant piece here is the size of the element.

What happens is as follows. The compiler recognizes that "säд" should be a string literal encoded as UTF-8 (http://dlang.org/lex.html#DoubleQuotedString), so the element type should be char, which requires 5 elements in the array. So, at runtime the object "säд" is treated as an array of 5 elements of 1 byte each.
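
For illustration, a small sketch of that runtime view; the byte values in the comment are what I would expect for this particular literal:

import std.stdio;

void main()
{
    string s = "säд";

    // All the runtime sees: a pointer plus a length, counted in 1-byte chars.
    writeln(s.length);               // 5
    writeln(typeof(s[0]).sizeof);    // 1
    writeln(cast(const(ubyte)[]) s); // [115, 195, 164, 208, 180]
}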

Basically, string (and char[]) plays a dual role in the language: on the one hand, it is an array of elements that are strictly 1 byte in size by definition; on the other hand, D tries to use it as a 'generic' UTF type whose element size is not fixed. So there is a contradiction: in source code such strings are viewed by the programmer as some abstract UTF string, but druntime views them as a 5-byte array. In my view, the trouble begins when "säд" is internally cast to char (which is no better than `int[] x = [3.14, 5.6]`). And indeed, `char[] x = ['s', 'ä', 'д']` is refused by the language, so there is a great inconsistency here.

By the way, the UTF definition is irrelevant here; this is purely an implementation issue (I think it is a design fault).
October 13, 2013
On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
> Ok, I understand, that "length" is - obviously - used in analogy to any array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't
>
>    writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this cyrillic letter?

This is impossible given the current design. At runtime, "säд" is viewed as `struct { void *ptr; size_t length; }`, where ptr points to memory holding at least five bytes and length has the value 5, so [2] simply picks one of those bytes. Druntime hasn't taken a UTF course.

One option would be to add support in druntime so it can correctly handle such strings, or to implement a separate string type which does not default to char[]. But of course the easiest way is to convince everybody that everything is OK and advise using some library function which does the job correctly, essentially implying that the language does the job wrong (pardon me, some D skepticism - the deeper I am into it, the more critically I view it).
October 13, 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
>> However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.
>
> This is not true for UTF-8, which is not subject to "endianism".

This is not about endianness. It's "\u00E4" vs "a\u0308". The first is the single code point 'ä', the second is two code points, 'a' plus umlaut dots.

[...]
> Well that's a point; on the other hand, D is constantly creating and throwing away new strings, so this isn't quite an argument. The current solution puts the programmer in charge of dealing with UTF-x, where a more consistent implementation would put the burden on the implementors of the libraries/core, i.e. the ones who usually have a better understanding of Unicode than the average programmer.
>
> Also, implementing such a semantics would not per se abandon a byte-wise access, would it?
>
> So, how do you guys handle UTF-8 strings in D? What are your solutions to the problems described? Does it all come down to converting "string"s and "wstring"s to "dstring"s, manipulating them, and re-convert them to "string"s? Btw, what would this mean in terms of speed?
>
> There is no irony in my questions. I'm really looking for solutions...

I think std.uni and std.utf are supposed to supply everything Unicode-related.
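
For example, something along these lines covers all three levels, assuming your Phobos has std.uni.byGrapheme (otherwise std.uni.Grapheme/decodeGrapheme can be used); the outputs in the comments are my expectation:

import std.range;
import std.stdio;
import std.uni;
import std.utf;

void main()
{
    string s = "a\u0308д";            // 'a' + combining diaeresis + 'д'

    writeln(s.length);                // 5: code units
    writeln(count(s));                // 3: code points (std.utf)
    writeln(s.byGrapheme.walkLength); // 2: user-perceived characters (std.uni)
}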
October 13, 2013
On Sunday, 13 October 2013 at 16:31:58 UTC, nickles wrote:
>> However, it could also yield the first code unit of the umlaut diacritic, depending on how the string is represented.
>
> This is not true for UTF-8, which is not subject to "endianism".

You are correct in that UTF-8 is endian-agnostic, but I don't believe that was Sönke's point. The point is that ä can be produced in Unicode in more than one way. This program illustrates:

import std.stdio;
void main()
{
    string a = "ä";        // precomposed: the single code point U+00E4
    string b = "a\u0308";  // decomposed: 'a' plus a combining diaeresis
    writeln(a);
    writeln(b);
    writeln(cast(ubyte[])a);
    writeln(cast(ubyte[])b);
}

This prints:

ä
ä
[195, 164]
[97, 204, 136]

Notice that they are both the same "character" but have different representations. The first is just the 'ä' code point, which consists of two code units; the second is the 'a' code point followed by a Combining Diaeresis code point.

In short, the string "ä" could be either 2 or 3 code units, and either 1 or 2 code points.