January 01, 2012
On 01/01/2012 08:10 AM, Don wrote:
> On 31.12.2011 17:13, Timon Gehr wrote:
>> On 12/31/2011 01:15 PM, Don wrote:
>>> On 31.12.2011 01:56, Timon Gehr wrote:
>>>> On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
>>>>> On 12/30/11 6:07 PM, Timon Gehr wrote:
>>>>>> alias std.string.representation raw;
>>>>>
>>>>> I meant your implementation is incomplete.
>>>>
>>>> It was more a sketch than an implementation. It is not even type safe
>>>> :o).
>>>>
>>>>>
>>>>> But the main point is that presence of representation/raw is not the
>>>>> issue.
>>>>> The availability of good-for-nothing .length and operator[] are
>>>>> the issue. Putting in place the convention of using .raw is hardly
>>>>> useful within the context.
>>>>>
>>>>
>>>> D strings are arrays. An array without .length and operator[] is close
>>>> to being good for nothing. The language specification is quite clear
>>>> about the fact that e.g. char is not a character but a UTF-8 code
>>>> unit.
>>>> Therefore char[] is an array of code units.
>>>
>>> No, it isn't. That's the problem. char[] is not an array of char.
>>> It has an additional invariant: it is a UTF8 string. If you randomly
>>> change elements, the invariant is violated.
>>
>> char[] is an array of char and the additional invariant is not enforced
>> by the language.
>
> No, it isn't an ordinary array. Consider concatenation, for example:
> char[] ~ int will never create an invalid string.

Yes it will.

import std.stdio;

void main() {
    char[] x;
    writeln(x~255); // 255 becomes the code unit 0xFF, which is never valid in UTF-8
}

> You can end up with multiple chars being appended, even from a single append. foreach is different,
> too. They are a bit magical.

Fair enough, but type conversion rules are a bit magical in general.

void main() {
    auto a = cast(short[])[1,2,3]; // casts the literal: three shorts [1, 2, 3]
    auto b = [1,2,3];              // an int[]
    auto c = cast(short[])b;       // reinterprets the int[] memory: six shorts
    assert(a!=c);                  // the contents differ, so the arrays compare unequal
}
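
(And for completeness, a minimal sketch of the append behaviour you describe: a single ~= of a dchar is UTF-8-encoded, so one append can add several code units.)

import std.stdio;

void main() {
    char[] s;
    dchar d = '\u00e4';  // 'ä', outside the ASCII range
    s ~= d;              // one append; the dchar is encoded as UTF-8
    writeln(s.length);   // prints 2: two code units were appended
}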

> There's quite a lot of code in the compiler to make sure that strings
> remain valid.
>

At the same time, there are many language features that allow creating invalid strings.

auto a = "\377\252\314";   // octal escapes: the bytes FF AA CC, not valid UTF-8
auto b = x"FF AA CC";      // hex string literal with the same raw bytes
auto c = import("binary"); // string import of an arbitrary, possibly non-UTF-8 file

> The additional invariant is not enforced in the case of slicing; that's
> the point.

January 01, 2012
On 01/01/2012 07:59 AM, Timon Gehr wrote:
> On 01/01/2012 05:53 AM, Chad J wrote:
>>
>> If you haven't been educated about unicode or how D handles it, you might write this:
>>
>> char[] str;
>> ... load str ...
>> for ( int i = 0; i < str.length; i++ )
>> {
>>      font.render(str[i]); // Ewww.
>>      ...
>> }
>>
> 
> That actually looks like a bug that might happen in real world code. What is the signature of font.render?

In my mind it's defined something like this:

class Font
{
 ...

    /** Render the given code point at
        the current (x,y) cursor position. */
    void render( dchar c )
    {
        ...
    }
}

(Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.)

I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
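
For contrast, here is a minimal sketch of the loop done by code point rather than by code unit; render below is a hypothetical stand-in for Font.render:

import std.stdio;

// Hypothetical stand-in for Font.render(dchar).
void render(dchar c)
{
    writefln("render U+%04X", c);
}

void main()
{
    char[] str = "héllo".dup;
    foreach (dchar c; str)  // decodes UTF-8 on the fly
        render(c);          // each call receives a whole code point
}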
January 01, 2012
On 01/01/2012 04:13 PM, Chad J wrote:
> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>
>>> If you haven't been educated about unicode or how D handles it, you
>>> might write this:
>>>
>>> char[] str;
>>> ... load str ...
>>> for ( int i = 0; i < str.length; i++ )
>>> {
>>>       font.render(str[i]); // Ewww.
>>>       ...
>>> }
>>>
>>
>> That actually looks like a bug that might happen in real world code.
>> What is the signature of font.render?
>
> In my mind it's defined something like this:
>
> class Font
> {
>   ...
>
>      /** Render the given code point at
>          the current (x,y) cursor position. */
>      void render( dchar c )
>      {
>          ...
>      }
> }
>
> (Of course I don't know minute details like where the "cursor position"
> comes from, but I figure it doesn't matter.)
>
> I probably wrote some code like that loop a very long time ago, but I
> probably don't have that code around anymore, or at least not easily
> findable.

I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.
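
A minimal sketch of what that implicit conversion does to a non-ASCII code unit:

void main() {
    string s = "é";        // UTF-8: the two code units 0xC3 0xA9
    dchar d = s[0];        // implicit char -> dchar, no decoding
    assert(d == '\u00C3'); // U+00C3 'Ã', not U+00E9 'é'
}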
January 01, 2012
On 01/01/2012 10:39 AM, Timon Gehr wrote:
> On 01/01/2012 04:13 PM, Chad J wrote:
>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>
>>>> If you haven't been educated about unicode or how D handles it, you might write this:
>>>>
>>>> char[] str;
>>>> ... load str ...
>>>> for ( int i = 0; i < str.length; i++ )
>>>> {
>>>>       font.render(str[i]); // Ewww.
>>>>       ...
>>>> }
>>>>
>>>
>>> That actually looks like a bug that might happen in real world code. What is the signature of font.render?
>>
>> In my mind it's defined something like this:
>>
>> class Font
>> {
>>   ...
>>
>>      /** Render the given code point at
>>          the current (x,y) cursor position. */
>>      void render( dchar c )
>>      {
>>          ...
>>      }
>> }
>>
>> (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.)
>>
>> I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
> 
> I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.

I agree.

Perhaps the compiler should insert a check on the 8th bit in cases like these?

I suppose it's possible someone could declare a bunch of individual chars and then start manipulating code units that way, and such an 8th-bit check could thwart those manipulations, but I would counter that such low-level manipulations should be done on ubytes instead.

I don't know how much this would help though.  Seems like too little, too late.

The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.
January 01, 2012
On 01/01/2012 08:01 PM, Chad J wrote:
> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>> On 01/01/2012 04:13 PM, Chad J wrote:
>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>
>>>>> If you haven't been educated about unicode or how D handles it, you
>>>>> might write this:
>>>>>
>>>>> char[] str;
>>>>> ... load str ...
>>>>> for ( int i = 0; i < str.length; i++ )
>>>>> {
>>>>>        font.render(str[i]); // Ewww.
>>>>>        ...
>>>>> }
>>>>>
>>>>
>>>> That actually looks like a bug that might happen in real world code.
>>>> What is the signature of font.render?
>>>
>>> In my mind it's defined something like this:
>>>
>>> class Font
>>> {
>>>    ...
>>>
>>>       /** Render the given code point at
>>>           the current (x,y) cursor position. */
>>>       void render( dchar c )
>>>       {
>>>           ...
>>>       }
>>> }
>>>
>>> (Of course I don't know minute details like where the "cursor position"
>>> comes from, but I figure it doesn't matter.)
>>>
>>> I probably wrote some code like that loop a very long time ago, but I
>>> probably don't have that code around anymore, or at least not easily
>>> findable.
>>
>> I think the main issue here is that char implicitly converts to dchar:
>> This is an implicit reinterpret-cast that is nonsensical if the
>> character is outside the ASCII range.
>
> I agree.
>
> Perhaps the compiler should insert a check on the 8th bit in cases like
> these?
>
> I suppose it's possible someone could declare a bunch of individual
> chars and then start manipulating code units that way, and such an
> 8th-bit check could thwart those manipulations, but I would counter
> that such low-level manipulations should be done on ubytes instead.
>
> I don't know how much this would help though.  Seems like too little,
> too late.

I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to.
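
A sketch of that discipline, where the widening is explicit and the run-time check is separate (whether std.conv.to performs exactly this check is an assumption on my part):

dchar widen(char c) {
    assert(c < 0x80, "not a self-contained ASCII code unit");
    return cast(dchar) c; // explicit, as proposed
}

void main() {
    assert(widen('a') == 'a');
}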

>
> The bigger problem is that a char is being taken from a char[] and
> thereby loses its context as (potentially) being part of a larger
> codepoint.

If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
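
The test itself is a one-liner; a minimal sketch:

import std.stdio;

void main() {
    foreach (char c; "aé") { // 'a' is 0x61; 'é' is 0xC3 0xA9
        if (c & 0x80)
            writeln("part of a multi-byte code point");
        else
            writeln("self-contained ASCII character: ", c);
    }
}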
January 01, 2012
On 31/12/2011 19:13, Timon Gehr wrote:
> On 12/31/2011 06:32 PM, Chad J wrote:
>> On 12/30/2011 05:27 PM, Timon Gehr wrote:
>>> On 12/30/2011 10:36 PM, deadalnix wrote:
>>>>
>>>>> The #1 quality of a programmer is to act like he/she is a moron.
>>>>> Because sometimes we all are morons.
>>>
>>> The #1 quality of a programmer is to write correct code. If he/she acts
>>> as if he/she is a moron, he/she will write code that acts like a moron.
>>> Simple as that.
>>
>> Programs worth writing are complex enough that there is no way any of us
>> can write perfectly correct code for them on the first draft. There is
>> always going to be some polishing, and maybe even /a lot/ of polishing,
>> and perhaps some complete tear-downs and rebuilds from time to time.
>> "Build one to throw away; you will anyways." If you tell me that you can
>> always write correct code the first time and you never need to go back
>> and fix anything when you do testing (you do test, right?) then I will
>> have a hard time taking you seriously.
>
> Testing is the main part of my development. Furthermore, I use
> assertions all over the place.
>

Well, if you write correct code, you don't need assertions. They will always be true because your code is correct. Stop wasting your time with them. Remember the #1 quality of a programmer: write correct code.

See how stupid this becomes?
January 01, 2012
On 01/01/2012 11:36 PM, deadalnix wrote:
> On 31/12/2011 19:13, Timon Gehr wrote:
>> On 12/31/2011 06:32 PM, Chad J wrote:
>>> On 12/30/2011 05:27 PM, Timon Gehr wrote:
>>>> On 12/30/2011 10:36 PM, deadalnix wrote:
>>>>>
>>>>> The #1 quality of a programmer is to act like he/she is a moron.
>>>>> Because sometimes we all are morons.
>>>>
>>>> The #1 quality of a programmer is to write correct code. If he/she acts
>>>> as if he/she is a moron, he/she will write code that acts like a moron.
>>>> Simple as that.
>>>
>>> Programs worth writing are complex enough that there is no way any of us
>>> can write perfectly correct code for them on the first draft. There is
>>> always going to be some polishing, and maybe even /a lot/ of polishing,
>>> and perhaps some complete tear-downs and rebuilds from time to time.
>>> "Build one to throw away; you will anyways." If you tell me that you can
>>> always write correct code the first time and you never need to go back
>>> and fix anything when you do testing (you do test, right?) then I will
>>> have a hard time taking you seriously.
>>
>> Testing is the main part of my development. Furthermore, I use
>> assertions all over the place.
>>
>
> Well, if you write correct code, you don't need assertions. They will
> always be true because your code is correct. Stop wasting your time with
> them. Remember the #1 quality of a programmer: write correct code.
>
> See how stupid this becomes?

You miss the point. Testing and assertions are part of how I write correct code.
January 01, 2012
On 01/01/2012 02:25 PM, Timon Gehr wrote:
> On 01/01/2012 08:01 PM, Chad J wrote:
>> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>>> On 01/01/2012 04:13 PM, Chad J wrote:
>>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>>
>>>>>> If you haven't been educated about unicode or how D handles it, you might write this:
>>>>>>
>>>>>> char[] str;
>>>>>> ... load str ...
>>>>>> for ( int i = 0; i < str.length; i++ )
>>>>>> {
>>>>>>        font.render(str[i]); // Ewww.
>>>>>>        ...
>>>>>> }
>>>>>>
>>>>>
>>>>> That actually looks like a bug that might happen in real world code. What is the signature of font.render?
>>>>
>>>> In my mind it's defined something like this:
>>>>
>>>> class Font
>>>> {
>>>>    ...
>>>>
>>>>       /** Render the given code point at
>>>>           the current (x,y) cursor position. */
>>>>       void render( dchar c )
>>>>       {
>>>>           ...
>>>>       }
>>>> }
>>>>
>>>> (Of course I don't know minute details like where the "cursor position" comes from, but I figure it doesn't matter.)
>>>>
>>>> I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
>>>
>>> I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.
>>
>> I agree.
>>
>> Perhaps the compiler should insert a check on the 8th bit in cases like these?
>>
>> I suppose it's possible someone could declare a bunch of individual char's and then start manipulating code units that way, and such an 8th bit check could thwart those manipulations, but I would also counter that such low manipulations should be done on ubyte's instead.
>>
>> I don't know how much this would help though.  Seems like too little, too late.
> 
> I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to.
> 

What of valid transfers of ASCII characters into dchar?

Normally this is a widening operation, so I can see how it is permissible.

>>
>> The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.
> 
> If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.

See above.


I think that assigning from a char[i] to another char[j] is probably
safe.  Similarly for slicing.  These calculations tend to occur, I
suspect, when the text is well-anchored.  I believe your balanced
parentheses example falls into this category:
(repasted for reader convenience)

import std.stdio;

void main(){
    string s = readln();
    int nest = 0;
    foreach(x;s){ // iterates by code unit
        if(x=='(') nest++;
        else if(x==')' && --nest<0) goto unbalanced;
    }
    if(!nest){
        writeln("balanced parentheses");
        return;
    }
unbalanced:
    writeln("unbalanced parentheses");
}

With these observations in hand, I would consider the safety of operations to go like this:

char[i] = char[j];           // (Reasonably) Safe
char[i1..i2] = char[j1..j2]; // (Reasonably) Safe
char = char;                 // Safe
dchar = char                 // Safe.  Widening.
char = char[i];              // Not safe.  Should error.
dchar = char[i];             // Not safe.  Should error. (Corollary)
dchar = dchar[i];            // Safe.
char = char[i1..i2];         // Nonsensical; already an error.
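
For reference, a sketch of where the compiler stands today on those rows: everything below compiles, and the proposal is that the two "should error" assignments stop compiling.

void main() {
    char[] s = "abc".dup;
    char[] t = "xyz".dup;
    s[0] = t[2];       // char[i] = char[j]: fine
    s[0..2] = t[1..3]; // char[i1..i2] = char[j1..j2]: fine
    char c = s[1];     // char = char[i]: compiles today, proposed error
    dchar d = c;       // dchar = char: compiles today; the stricter rule
                       // upthread would require a cast
}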
January 01, 2012
On 01/02/2012 12:16 AM, Chad J wrote:
> On 01/01/2012 02:25 PM, Timon Gehr wrote:
>> On 01/01/2012 08:01 PM, Chad J wrote:
>>> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>>>> On 01/01/2012 04:13 PM, Chad J wrote:
>>>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>>>
>>>>>>> If you haven't been educated about unicode or how D handles it, you
>>>>>>> might write this:
>>>>>>>
>>>>>>> char[] str;
>>>>>>> ... load str ...
>>>>>>> for ( int i = 0; i < str.length; i++ )
>>>>>>> {
>>>>>>>         font.render(str[i]); // Ewww.
>>>>>>>         ...
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> That actually looks like a bug that might happen in real world code.
>>>>>> What is the signature of font.render?
>>>>>
>>>>> In my mind it's defined something like this:
>>>>>
>>>>> class Font
>>>>> {
>>>>>     ...
>>>>>
>>>>>        /** Render the given code point at
>>>>>            the current (x,y) cursor position. */
>>>>>        void render( dchar c )
>>>>>        {
>>>>>            ...
>>>>>        }
>>>>> }
>>>>>
>>>>> (Of course I don't know minute details like where the "cursor position"
>>>>> comes from, but I figure it doesn't matter.)
>>>>>
>>>>> I probably wrote some code like that loop a very long time ago, but I
>>>>> probably don't have that code around anymore, or at least not easily
>>>>> findable.
>>>>
>>>> I think the main issue here is that char implicitly converts to dchar:
>>>> This is an implicit reinterpret-cast that is nonsensical if the
>>>> character is outside the ASCII range.
>>>
>>> I agree.
>>>
>>> Perhaps the compiler should insert a check on the 8th bit in cases like
>>> these?
>>>
>>> I suppose it's possible someone could declare a bunch of individual
>>> chars and then start manipulating code units that way, and such an
>>> 8th-bit check could thwart those manipulations, but I would counter
>>> that such low-level manipulations should be done on ubytes instead.
>>>
>>> I don't know how much this would help though.  Seems like too little,
>>> too late.
>>
>> I think the conversion char -> dchar should just require an explicit
>> cast. The runtime check is better left to std.conv.to.
>>
>
> What of valid transfers of ASCII characters into dchar?
>
> Normally this is a widening operation, so I can see how it is permissible.
>
>>>
>>> The bigger problem is that a char is being taken from a char[] and
>>> thereby loses its context as (potentially) being part of a larger
>>> codepoint.
>>
>> If it is part of a larger code point, then it has its highest bit set.
>> Any individual char that has its highest bit set does not carry a
>> character on its own. If it is not set, then it is a single ASCII
>> character.
>
> See above.
>
>
> I think that assigning from a char[i] to another char[j] is probably
> safe.  Similarly for slicing.  These calculations tend to occur, I
> suspect, when the text is well-anchored.  I believe your balanced
> parentheses example falls into this category:
> (repasted for reader convenience)
>
> void main(){
>      string s = readln();
>      int nest = 0;
>      foreach(x;s){ // iterates by code unit
>          if(x=='(') nest++;
>          else if(x==')' && --nest<0) goto unbalanced;
>      }
>      if(!nest){
>          writeln("balanced parentheses");
>          return;
>      }
> unbalanced:
>      writeln("unbalanced parentheses");
> }
>
> With these observations in hand, I would consider the safety of
> operations to go like this:
>
> char[i] = char[j];           // (Reasonably) Safe
> char[i1..i2] = char[j1..j2]; // (Reasonably) Safe
> char = char;                 // Safe
> dchar = char                 // Safe.  Widening.
> char = char[i];              // Not safe.  Should error.
> dchar = char[i];             // Not safe.  Should error. (Corollary)
> dchar = dchar[i];            // Safe.
> char = char[i1..i2];         // Nonsensical; already an error.

That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range unless it is embedded in an array? It would break the balanced parentheses example.
January 02, 2012
On 01/01/2012 06:36 PM, Timon Gehr wrote:
> On 01/02/2012 12:16 AM, Chad J wrote:
>> On 01/01/2012 02:25 PM, Timon Gehr wrote:
>>> On 01/01/2012 08:01 PM, Chad J wrote:
>>>> On 01/01/2012 10:39 AM, Timon Gehr wrote:
>>>>> On 01/01/2012 04:13 PM, Chad J wrote:
>>>>>> On 01/01/2012 07:59 AM, Timon Gehr wrote:
>>>>>>> On 01/01/2012 05:53 AM, Chad J wrote:
>>>>>>>>
>>>>>>>> If you haven't been educated about unicode or how D handles it, you might write this:
>>>>>>>>
>>>>>>>> char[] str;
>>>>>>>> ... load str ...
>>>>>>>> for ( int i = 0; i < str.length; i++ )
>>>>>>>> {
>>>>>>>>         font.render(str[i]); // Ewww.
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>
>>>>>>> That actually looks like a bug that might happen in real world code. What is the signature of font.render?
>>>>>>
>>>>>> In my mind it's defined something like this:
>>>>>>
>>>>>> class Font
>>>>>> {
>>>>>>     ...
>>>>>>
>>>>>>        /** Render the given code point at
>>>>>>            the current (x,y) cursor position. */
>>>>>>        void render( dchar c )
>>>>>>        {
>>>>>>            ...
>>>>>>        }
>>>>>> }
>>>>>>
>>>>>> (Of course I don't know minute details like where the "cursor
>>>>>> position"
>>>>>> comes from, but I figure it doesn't matter.)
>>>>>>
>>>>>> I probably wrote some code like that loop a very long time ago, but I probably don't have that code around anymore, or at least not easily findable.
>>>>>
>>>>> I think the main issue here is that char implicitly converts to dchar: This is an implicit reinterpret-cast that is nonsensical if the character is outside the ASCII range.
>>>>
>>>> I agree.
>>>>
>>>> Perhaps the compiler should insert a check on the 8th bit in cases like these?
>>>>
>>>> I suppose it's possible someone could declare a bunch of individual chars and then start manipulating code units that way, and such an 8th-bit check could thwart those manipulations, but I would counter that such low-level manipulations should be done on ubytes instead.
>>>>
>>>> I don't know how much this would help though.  Seems like too little, too late.
>>>
>>> I think the conversion char -> dchar should just require an explicit cast. The runtime check is better left to std.conv.to.
>>>
>>
>> What of valid transfers of ASCII characters into dchar?
>>
>> Normally this is a widening operation, so I can see how it is permissible.
>>
>>>>
>>>> The bigger problem is that a char is being taken from a char[] and thereby loses its context as (potentially) being part of a larger codepoint.
>>>
>>> If it is part of a larger code point, then it has its highest bit set. Any individual char that has its highest bit set does not carry a character on its own. If it is not set, then it is a single ASCII character.
>>
>> See above.
>>
>>
>> I think that assigning from a char[i] to another char[j] is probably
>> safe.  Similarly for slicing.  These calculations tend to occur, I
>> suspect, when the text is well-anchored.  I believe your balanced
>> parentheses example falls into this category:
>> (repasted for reader convenience)
>>
>> void main(){
>>      string s = readln();
>>      int nest = 0;
>>      foreach(x;s){ // iterates by code unit
>>          if(x=='(') nest++;
>>          else if(x==')' && --nest<0) goto unbalanced;
>>      }
>>      if(!nest){
>>          writeln("balanced parentheses");
>>          return;
>>      }
>> unbalanced:
>>      writeln("unbalanced parentheses");
>> }
>>
>> With these observations in hand, I would consider the safety of operations to go like this:
>>
>> char[i] = char[j];           // (Reasonably) Safe
>> char[i1..i2] = char[j1..j2]; // (Reasonably) Safe
>> char = char;                 // Safe
>> dchar = char                 // Safe.  Widening.
>> char = char[i];              // Not safe.  Should error.
>> dchar = char[i];             // Not safe.  Should error. (Corollary)
>> dchar = dchar[i];            // Safe.
>> char = char[i1..i2];         // Nonsensical; already an error.
> 
> That is an interesting point of view. Your proposal would therefore be to constrain char to the ASCII range unless it is embedded in an array? It would break the balanced parentheses example.

I just ran the example and wow, x didn't type-infer to dchar like I expected it to.  I thought the comment might be wrong, but no, it is correct: x type-infers to char.

I expected it to behave more like the old days before type inference showed up everywhere:

import std.stdio;

void main(){
     string s = readln();
     int nest = 0;
     foreach(dchar x;s){ // iterates by code POINT; notice the dchar.
         if(x=='(') nest++;
         else if(x==')' && --nest<0) goto unbalanced;
     }
     if(!nest){
         writeln("balanced parentheses");
         return;
     }
unbalanced:
     writeln("unbalanced parentheses");
}

This version wouldn't be broken.  If the type inference changed, the other version wouldn't be broken either.  This could break other things though.  Bummer.
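
A quick check of the difference, counting the same string both ways:

import std.stdio;

void main() {
    string s = "é(";               // 3 code units, 2 code points
    size_t units, points;
    foreach (x; s)       units++;  // x infers as immutable(char)
    foreach (dchar x; s) points++; // decoded per code point
    writeln(units, " ", points);   // prints: 3 2
}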