December 30, 2011
On 30/12/2011 20:55, Timon Gehr wrote:
> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>> On 29.12.2011 19:36, Andrei Alexandrescu wrote:
>>> On 12/29/11 12:28 PM, Don wrote:
>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>> Oh, one more thing - one good thing that could come out of this thread
>>>>> is abolition (through however slow a deprecation path) of s.length and
>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
>>>>> Then, people would access the decoding routines on the needed occasions,
>>>>> or would consciously use the representation.
>>>>>
>>>>> Yum.
>>>>
>>>>
>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>> just means, "I know what I'm doing", and there's no change to existing
>>>> semantics, purely a syntax change.
>>>
>>> Exactly!
>>>
>>>> If you change s[i] into s.rep[i], it does the same thing as now. There's
>>>> no loss of functionality -- it just stops you from accidentally doing
>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>> write:
>>>> ubyte [] u = s.rep;
>>>> and use u from then on.
>>>>
>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>> Apart from that, I think this would be perfect.
>>>
>>> Yes, I mean "rep" as a short for "representation" but upon first sight
>>> the connection is tenuous. "raw" sounds great.
>>>
>>> Now I'm twice sorry this will not happen...
>>>
>>
>> Maybe it could happen if we
>> 1. make dstring the default strings type --
>
> Inefficient.
>
>> code units and characters would be the same
>
> Wrong.
>
>> or 2. forward string.length to std.utf.count and opIndex to
>> std.utf.toUTFindex
>
> Inconsistent and inefficient (it blows up the algorithmic complexity).
>
>>
>> so programmers could use the slices/indexing/length (no laziness
>> problems), and if they really want code units use .raw/.rep (or better
>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
>>
>
> Anyone who intends to write efficient string processing code needs this.
> Anyone who does not want to write string processing code will not need
> to index into a string -- standard library functions will suffice.
>
>> But generally I liked the idea of just having an alias for strings...
>
> Me too. I think the way we have it now is optimal. The only reason we
> are discussing this is because of fear that uneducated users will write
> code that does not take into account Unicode characters above code point
> 0x80. But what is the worst thing that can happen?
>

Atos Origin was hacked because of bad handling of Unicode strings in some of their software.

The consequences can be more serious than you may think.

Additionally, you are making an assumption that is really wrong: that an educated programmer will not make mistakes. C programmers will tell you exactly the same thing when the discussion comes to pointers. But the fact is, we all make mistakes. Many of them! We should resort to unsafe behaviour that relies solely on the programmer's abilities only when needed.

I do understand pointers. I still make mistakes with them, and it sometimes has crazy consequences. And I do not trust anyone who tells me he/she doesn't.

The #1 quality of a programmer is to act as if he/she is a moron. Because sometimes we all are morons.
December 30, 2011
On 12/30/2011 11:55 AM, Timon Gehr wrote:
> Me too. I think the way we have it now is optimal.

Consider your X macro implementation. Strip out the utf.stride code and use plain indexing - it will not break the code in any way. The naive implementation still works correctly with ASCII and UTF-8.

That's not true for any other multibyte encoding, which is why UTF-8 is inspired genius.
December 30, 2011
On 12/30/2011 10:36 PM, deadalnix wrote:
> Atos Origin was hacked because of bad handling of Unicode strings in
> some of their software.

And cast(string)s.rep[i..j] would magically fix all those bugs?

>
> The consequences can be more serious than you may think.
>
> Additionally, you are making an assumption that is really wrong: that an
> educated programmer will not make mistakes.

I am not. I am just assuming that the proposed change does not help with that.

> C programmers will tell you
> exactly the same thing when the discussion comes to pointers. But the fact
> is, we all make mistakes. Many of them! We should resort to unsafe
> behaviour that relies solely on the programmer's abilities only when needed.
>
> I do understand pointers. I still make mistakes with them, and it sometimes
> has crazy consequences. And I do not trust anyone who tells me
> he/she doesn't.
>
> The #1 quality of a programmer is to act as if he/she is a moron.
> Because sometimes we all are morons.

The #1 quality of a programmer is to write correct code. If he/she acts as if he/she is a moron, he/she will write code that acts like a moron. Simple as that.
December 30, 2011
On 12/30/2011 11:01 PM, Walter Bright wrote:
> On 12/30/2011 11:55 AM, Timon Gehr wrote:
>> Me too. I think the way we have it now is optimal.
>
> Consider your X macro implementation. Strip out the utf.stride code and
> use plain indexing - it will not break the code in any way. The naive
> implementation still works correctly with ASCII and UTF-8.
>

You are right, that obviously needs fixing. ☺
Thanks!

> That's not true for any other multibyte encoding, which is why UTF-8 is
> inspired genius.
December 30, 2011
On 12/30/11 4:01 PM, Walter Bright wrote:
> On 12/30/2011 11:55 AM, Timon Gehr wrote:
>> Me too. I think the way we have it now is optimal.
>
> Consider your X macro implementation. Strip out the utf.stride code and
> use plain indexing - it will not break the code in any way. The naive
> implementation still works correctly with ASCII and UTF-8.
>
> That's not true for any other multibyte encoding, which is why UTF-8 is
> inspired genius.

It's true for any encoding with the prefix property, such as Huffman.

Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked.

So yeah, UTF-8 is great. But it is not miraculous. We need .raw.


Andrei
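
To make the '$' argument concrete, here is a minimal sketch. It assumes std.string.representation stands in for the proposed .raw, and the helper name countAsciiDelimiter is made up for illustration. The point it tries to show is that the byte-level scan is only valid because the delimiter is ASCII, and the code says so explicitly.

```d
import std.string : representation;

// Hypothetical helper, not part of Phobos: counts an ASCII delimiter by
// scanning UTF-8 code units directly, with no decoding step.
size_t countAsciiDelimiter(string s, char delim)
{
    // Assumption made explicit: bytes below 0x80 never occur inside a
    // multibyte UTF-8 sequence, so a byte match is a character match.
    assert(delim < 0x80, "delimiter must be ASCII");

    size_t n = 0;
    foreach (ubyte b; s.representation)
    {
        if (b == delim)
            ++n;
    }
    return n;
}
```

Calling it with a non-ASCII delimiter trips the assert, which is roughly the "state the assumption" behaviour argued for above.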
December 30, 2011
On 12/31/2011 12:00 AM, Andrei Alexandrescu wrote:
> So yeah, UTF-8 is great. But it is not miraculous. We need .raw.
>
>
> Andrei

auto raw(S)(S s) if(isNarrowString!S){
    static if(is(S==string)) return cast(ubyte[])s;
    else static if(is(S==wstring)) return cast(ushort[])s;
}
December 30, 2011
On 12/30/2011 3:00 PM, Andrei Alexandrescu wrote:
> It's true for any encoding with the prefix property, such as Huffman.

That is, any other multibyte character encoding I've seen standardized for use in C.
December 31, 2011
On 12/30/11 5:07 PM, Timon Gehr wrote:
> auto raw(S)(S s) if(isNarrowString!S){
>     static if(is(S==string)) return cast(ubyte[])s;
>     else static if(is(S==wstring)) return cast(ushort[])s;
> }

Almost there.

https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809


Andrei
December 31, 2011
On 12/31/2011 01:03 AM, Andrei Alexandrescu wrote:
> Almost there.
>
> https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L809
>
>
>
> Andrei

alias std.string.representation raw;
December 31, 2011
On 12/30/11 6:07 PM, Timon Gehr wrote:
> alias std.string.representation raw;

I meant your implementation is incomplete.
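
The thread does not spell out what is missing, so the following is only a guess at one gap. The quoted snippet matches exactly string and wstring, so a char[] argument passes the constraint but hits neither static if branch, and the cast on a string drops immutable. A sketch that keeps the qualifier, using hypothetical inout overloads rather than the Phobos implementation, might look like this:

```d
// Hypothetical overloads (named raw after the proposal in this thread).
// inout carries the caller's mutable/const/immutable qualifier through
// to the returned slice, so immutable data stays immutable.
inout(ubyte)[] raw(inout(char)[] s)
{
    return cast(inout(ubyte)[]) s;
}

inout(ushort)[] raw(inout(wchar)[] s)
{
    return cast(inout(ushort)[]) s;
}
```

With these, a string yields immutable(ubyte)[] and a char[] yields ubyte[] that aliases the original buffer.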

But the main point is that the presence of representation/raw is not the issue. The availability of the good-for-nothing .length and operator[] is the issue. Putting in place a convention of using .raw is hardly useful in that context.


Andrei
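
For readers following along, a small sketch of the .length complaint. The function names codeUnits and codePoints are invented for this example; std.utf.count is the existing Phobos routine that counts code points.

```d
import std.utf : count;

// Number of code units: what s.length reports for a narrow string.
size_t codeUnits(string s)
{
    return s.length;
}

// Number of code points, obtained by walking the UTF-8 sequence.
size_t codePoints(string s)
{
    return count(s);
}
```

For a string such as "h\u00E9llo" the two functions disagree, because the accented letter occupies two code units but is a single code point.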