January 20, 2014
On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:
> This is wrong. A string in D is de facto (by implementation; the spec may say whatever is convenient for advertising D) an array of single bytes which can hold UTF-8 code units. There is no way the string type in D is always a string in the sense of code points/characters. Sometimes the string type happens to behave like a 'string', but if you put UTF-16 or UTF-32 text into it, it reminds you what the string type really is.

By implementation they are also UTF strings. String literals use UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, foreach over narrow strings with `dchar` iterator variable type does UTF decoding etc.

I don't think you know what you're talking about; putting UTF-16 or UTF-32 in `string` is utter madness and not trivially possible. We have `wchar`/`wstring` and `dchar`/`dstring` for UTF-16 and UTF-32, respectively.
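The decoding behaviour is easy to observe directly; here is a quick sketch (element counts assume "säд" is three code points stored as five UTF-8 code units):

```d
import std.stdio;

void main()
{
    string s = "säд"; // 3 code points, 5 UTF-8 code units

    // Iterating with char yields the raw UTF-8 code units...
    size_t units;
    foreach (char c; s) ++units;
    assert(units == 5);

    // ...while iterating with dchar makes the compiler decode on the fly.
    size_t points;
    foreach (dchar d; s) ++points;
    assert(points == 3);

    writeln(units, " ", points); // prints "5 3"
}
```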

>> Operations on code units are rare, which is why the standard library instead treats strings as ranges of code points, for correctness by default. However, we must not prevent the user from being able to work on arrays of code units, as many string algorithms can be optimized by not doing full UTF decoding. The standard library does this on many occasions, and there are more to come.
>
> This is an attempt to explain a problematic design as a wise action.

No, it's not. Please leave crappy, unsubstantiated arguments like this out of these forums.

>> [1] http://dlang.org/type
>
> By the way, the link you provide says char is unsigned 8 bit type which can keep value of UTF-8 code unit.

Not *can*, but *does*. Otherwise it is an error in the program. The specification, compiler implementation (as shown above) and standard library all treat `char` as a UTF-8 code unit. Treat it otherwise at your own peril.
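For example, the standard library rejects char data that is not valid UTF-8; a small sketch using std.utf.validate, which throws a UTFException on invalid input:

```d
import std.utf : validate, UTFException;

void main()
{
    char[] bad = [cast(char) 0xFF]; // 0xFF never occurs in valid UTF-8
    bool caught;
    try
        validate(bad);
    catch (UTFException e)
        caught = true; // the decoding machinery rejects the stray byte
    assert(caught);
}
```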

> UTF is irrelevant because the problem is in D implementation. See http://forum.dlang.org/thread/hoopiiobddbapybbwwoc@forum.dlang.org (in particular 2nd page).
>
> The root of the issue is that D does not provide a 'utf' type which would handle strings and characters correctly irrespective of the format. Instead, the language pretends to support such a type by allowing both the literals "sad" and "säд" to convert to a single-byte char array. And ['s', 'ä', 'д'] is, by the way, neither char[] nor wchar[], not even dchar[], but a sequence of integers, which compounds the oddities in the character types.

The only problem in the implementation here that you illustrate is that `['s', 'ä', 'д']` is of type `int[]`, which is a bug. It should be `dchar[]`. The length of `char[]` works as intended.
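With an explicit element type, the literal behaves as expected; a quick check:

```d
void main()
{
    // Each element is one code point when the array is explicitly dchar[].
    dchar[] a = ['s', 'ä', 'д'];
    assert(a.length == 3);

    // The same text stored as UTF-8 occupies five code units.
    char[] b = "säд".dup;
    assert(b.length == 5);
}
```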

> Problems with the string type can be illustrated by an analogous situation in the domain of integer types. Assume that a user wants a 'number' type which accepts integers, floats and doubles and treats them properly. This would require either a library solution or a new special type in the language, supported by both the compiler and the runtime library, which performs operations at runtime on objects of the number type according to their effective type.
>
> The D designers want to support such a feature (to make the language better), but as happens in other situations, the support is only partial: the compiler allows one to write
>
> alias immutable(int)[] number;
> number my_number = [0, 3.14, 3.14l];

I don't understand this example. The compiler does *not* allow that code; try it for yourself.
January 20, 2014
On Monday, 20 January 2014 at 09:58:07 UTC, Jakob Ovrum wrote:
> On Thursday, 16 January 2014 at 06:59:43 UTC, Maxim Fomin wrote:
>> This is wrong. A string in D is de facto (by implementation; the spec may say whatever is convenient for advertising D) an array of single bytes which can hold UTF-8 code units. There is no way the string type in D is always a string in the sense of code points/characters. Sometimes the string type happens to behave like a 'string', but if you put UTF-16 or UTF-32 text into it, it reminds you what the string type really is.
>
> By implementation they are also UTF strings. String literals use UTF, `char.init` is 0xFF and `wchar.init` is 0xFFFF, foreach over narrow strings with `dchar` iterator variable type does UTF decoding etc.
>
> I don't think you know what you're talking about; putting UTF-16 or UTF-32 in `string` is utter madness and not trivially possible. We have `wchar`/`wstring` and `dchar`/`dstring` for UTF-16 and UTF-32, respectively.
>

import std.stdio;

void main()
{
	string s = "о";
	writeln(s.length);
}

This compiles and prints 2. This means the string type is broken. It is broken in the way I was attempting to explain.
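Decoding does give 1, of course, but that only underlines that .length counts code units; a quick check using std.range.walkLength:

```d
import std.range : walkLength;

void main()
{
    string s = "о"; // CYRILLIC SMALL LETTER O, U+043E
    assert(s.length == 2);     // counts UTF-8 code units
    assert(s.walkLength == 1); // counts code points, via decoding
}
```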

>> This is an attempt to explain a problematic design as a wise action.
>
> No, it's not. Please leave crappy, unsubstantiated arguments like this out of these forums.

Note that I provided examples of why the design is problematic. The argument isn't unsubstantiated.

>
>>> [1] http://dlang.org/type
>>
>> By the way, the link you provide says char is unsigned 8 bit type which can keep value of UTF-8 code unit.
>
> Not *can*, but *does*. Otherwise it is an error in the program. The specification, compiler implementation (as shown above) and standard library all treat `char` as a UTF-8 code unit. Treat it otherwise at your own peril.
>

But such treatment is nonsense. It is like treating an integer or floating-point number as a sequence of bytes. You are essentially saying that treating char as a UTF-8 code unit is OK because the language treats char as a UTF-8 code unit.

> The only problem in the implementation here that you illustrate is that `['s', 'ä', 'д']` is of type `int[]`, which is a bug. It should be `dchar[]`. The length of `char[]` works as intended.

You are saying that the length of char[] works as intended, which is true, but that only shows the design is broken.

>> Problems with the string type can be illustrated by an analogous situation in the domain of integer types. Assume that a user wants a 'number' type which accepts integers, floats and doubles and treats them properly. This would require either a library solution or a new special type in the language, supported by both the compiler and the runtime library, which performs operations at runtime on objects of the number type according to their effective type.
>>
>> The D designers want to support such a feature (to make the language better), but as happens in other situations, the support is only partial: the compiler allows one to write
>>
>> alias immutable(int)[] number;
>> number my_number = [0, 3.14, 3.14l];
>
> I don't understand this example. The compiler does *not* allow that code; try it for yourself.

It does not allow it because it is nonsense. However, it does allow equivalent nonsense in the character types.

alias immutable(int)[] number;
number my_number = [0, 3.14, 3.14l]; // does not compile

alias immutable(char)[] string;
string s = "säд"; // compiles, however "säд" should default to wstring or dstring

The same reasons that prevent a sane person from being OK with int[] number = [3.14l] should prevent him from being OK with string s = "säд".
January 20, 2014
> Same reasons which prevent sane person from being OK with int[] number = [3.14l] should prevent him from being OK with string s = "säд"

No, since this literal can be encoded as UTF-8 just fine. Keep in mind that literals are nothing more than values written directly into the source, and as it happens, your example is a perfectly fine value of type string.

(w|d)string.length returning anything other than the number of underlying code points would be inconsistent with other array types, and making (d|w)string arrays of code points was an (arguably good) design decision.

That said: nothing prevents you from writing another string type that abstracts from the actual string encoding.
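A minimal sketch of such a wrapper (a hypothetical type, built on std.utf.decode; it presents a char[] as a range of code points regardless of the underlying encoding width):

```d
import std.utf : decode;

// Forward-only range of code points over UTF-8 data,
// hiding the code-unit representation from the user.
struct CodePoints
{
    const(char)[] data;
    size_t i;

    bool empty() const { return i >= data.length; }
    dchar front() const { size_t j = i; return decode(data, j); }
    void popFront() { decode(data, i); }
}

void main()
{
    import std.algorithm : equal;
    // Yields the same code points as the explicitly decoded dstring literal.
    assert(CodePoints("säд").equal("säд"d));
}
```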


Phobos did it wrong, though, in handling char[] differently from T[].
January 20, 2014
On Monday, 20 January 2014 at 13:30:11 UTC, Tobias Pankrath wrote:
> (w|d)string.length returning anything other than the number of underlying code points would be inconsistent with other array types, and making (d|w)string arrays of code points was an (arguably good) design decision.

Code units, not code points.

Of course, a single UTF-32 code unit is also a single code point.

> That said: nothing prevents you from writing another string type that abstracts from the actual string encoding.

Such types tend to have absolutely awful performance. It is a minefield of disastrous algorithmic complexity, too (e.g. length).

> Phobos did it wrong, though, in handling char[] differently from T[].

It is only for ranges, and I think it's a good decision.
January 20, 2014
On Monday, 20 January 2014 at 16:53:32 UTC, Jakob Ovrum wrote:
> On Monday, 20 January 2014 at 13:30:11 UTC, Tobias Pankrath wrote:
>> (w|d)string.length returning anything other than the number of underlying code points would be inconsistent with other array types, and making (d|w)string arrays of code points was an (arguably good) design decision.
>
> Code units, not code points.
>
Arg! Of course.
January 23, 2014
On 01/16/2014 06:56 AM, Jakob Ovrum wrote:
>
> Note that the Unicode definition of an unqualified "character" is the
> translation of a code *point*, which is very different from a *glyph*,
> which is what people generally associate the word "character" with.
> Thus, `string` is not an array of characters (i.e. an array where each
> element is a character), but `dstring` can be said to be.

A character can be made of more than one dchar. (There are also more exotic examples, eg. IIRC there are cases where three dchars make approximately two characters.)
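For example, a base letter plus a combining mark is two dchars but one perceived character; std.uni.byGrapheme (available in current Phobos) counts graphemes:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    dstring s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(s.length == 2);                // two code points (dchars)...
    assert(s.byGrapheme.walkLength == 1); // ...but a single character
}
```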
January 23, 2014
On Thursday, 23 January 2014 at 01:17:19 UTC, Timon Gehr wrote:
> On 01/16/2014 06:56 AM, Jakob Ovrum wrote:
>>
>> Note that the Unicode definition of an unqualified "character" is the
>> translation of a code *point*, which is very different from a *glyph*,
>> which is what people generally associate the word "character" with.
>> Thus, `string` is not an array of characters (i.e. an array where each
>> element is a character), but `dstring` can be said to be.
>
> A character can be made of more than one dchar. (There are also more exotic examples, eg. IIRC there are cases where three dchars make approximately two characters.)

No, I believe you are thinking of graphemes.
January 23, 2014
On 01/23/2014 02:39 AM, Jakob Ovrum wrote:
> On Thursday, 23 January 2014 at 01:17:19 UTC, Timon Gehr wrote:
>> On 01/16/2014 06:56 AM, Jakob Ovrum wrote:
>>>
>>> Note that the Unicode definition of an unqualified "character" is the
>>> translation of a code *point*, which is very different from a *glyph*,
>>> which is what people generally associate the word "character" with.
>>> Thus, `string` is not an array of characters (i.e. an array where each
>>> element is a character), but `dstring` can be said to be.
>>
>> A character can be made of more than one dchar. (There are also more
>> exotic examples, eg. IIRC there are cases where three dchars make
>> approximately two characters.)
>
> No, I believe you are thinking of graphemes.

Sure. Their existence means it is in general wrong to think of a dchar as one character.
January 24, 2014
On Thursday, 23 January 2014 at 10:25:40 UTC, Timon Gehr wrote:
> Sure. Their existence means it is in general wrong to think of a dchar as one character.

As stated, I was specifically talking about the Unicode definition of a character, which is completely distinct from graphemes.