April 07, 2015
On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
> On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
>> On 4/7/2015 1:19 AM, Dicebot wrote:
>>> I have doubts about it similar to Vladimir. Main problem is that I have no idea
>>> what actually happens if replacement characters appear in some unicode text my
>>> program processes.
>>
>> It's much like floating point NaN values, which are 'sticky'.
>
> Yes, but std.conv doesn't return NaN if you try to convert "banana" to a double.

Maybe it should :-)
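
For reference, a minimal sketch of the two behaviors being compared here. `to!double` and `ConvException` are the existing std.conv API; `toDoubleOrNaN` is a hypothetical helper written only for this illustration and is not part of Phobos.

```d
import std.conv : to, ConvException;
import std.math : isNaN;

// Hypothetical helper (NOT in Phobos): returns NaN instead of throwing.
double toDoubleOrNaN(string s)
{
    try
        return to!double(s);
    catch (Exception)
        return double.nan;
}

void main()
{
    // Current std.conv behavior: invalid input throws.
    bool threw = false;
    try
        to!double("banana");
    catch (ConvException)
        threw = true;
    assert(threw);

    // "Sticky" alternative: the NaN propagates through later arithmetic.
    double x = toDoubleOrNaN("banana");
    assert(isNaN(x));
    assert(isNaN(x * 2 + 1));

    // Valid input is converted as usual.
    assert(toDoubleOrNaN("1.5") == 1.5);
}
```

With the sticky variant the failure is only noticed wherever the result is eventually inspected, which is the trade-off being argued about.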


>> With UTF strings, if you care about invalid UTF (a surprisingly large number
>> of operations done on strings simply don't care about invalid UTF) the
>> validation can be done as a separate step.
>
> So can converting invalid UTF to replacement characters.

I know, I read your post. The machinery to allocate, throw, catch, and replace is still there.


> I think the correct solution to that is to kill auto-decoding :) Then all
> decoding is explicit, and since it is explicit, it is trivial to allow
> specifying the desired behavior upon encountering invalid UTF-8.

I agree autodecoding is a mistake, but we're stuck with it.
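
A rough sketch of what "explicit" handling can look like with what std.utf already offers, assuming a reasonably recent Phobos: `validate` as a separate throwing step, `byCodeUnit` for no decoding at all, and `byDchar` for decoding that substitutes U+FFFD. `decodeLossy` is a hypothetical name used only here.

```d
import std.array : array;
import std.exception : assertThrown;
import std.utf : byCodeUnit, byDchar, validate, UTFException;

// Hypothetical helper: explicit, non-throwing decode that replaces
// invalid sequences with U+FFFD.
dstring decodeLossy(string s)
{
    return s.byCodeUnit.byDchar.array.idup;
}

void main()
{
    // "a", an invalid byte 0xFF, then "b".
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string bad = cast(string) raw;

    // Option 1: validate as a separate step; throws on invalid input.
    assertThrown!UTFException(validate(bad));

    // Option 2: no decoding at all; operate on code units.
    assert(bad.byCodeUnit.length == 3);

    // Option 3: decode explicitly, substituting the replacement character.
    assert(decodeLossy(bad) == "a\uFFFDb"d);
}
```

In each case the caller picks the behavior at the call site instead of inheriting whatever the range primitives do by default.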
April 07, 2015
On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
> On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
>> On Tuesday, 7 April 2015 at 09:04:09 UTC, Walter Bright wrote:
>>> On 4/7/2015 1:19 AM, Dicebot wrote:
>>>> I have doubts about it similar to Vladimir. Main problem is that I have no idea
>>>> what actually happens if replacement characters appear in some unicode text my
>>>> program processes.
>>>
>>> It's much like floating point NaN values, which are 'sticky'.
>>
>> Yes, but std.conv doesn't return NaN if you try to convert "banana" to a double.
>
> Maybe it should :-)

There was a time when operations on NaNs were painfully slow. Also, since NaNs tend to spread, once a NaN appears there usually is not much of a result left. Debugging used to be painfully hard when NaNs were enabled. We used to rely on floating point exceptions instead.

This might or might not be relevant.
April 07, 2015
On Tuesday, 7 April 2015 at 07:50:40 UTC, Vladimir Panteleev wrote:
> On Tuesday, 7 April 2015 at 07:42:02 UTC, w0rp wrote:
>> Maybe autodecoding could throw an Error (No 'new' allowed) when debug mode is on, and use replacement characters in release mode. I haven't thought it through, but that's an idea.
>
> No no no, terrible idea. This means your program will pass your test suite in debug mode (which, of course, is never going to test behavior with bad UTF in all the relevant places), but silently corrupt real-world data in release mode. Errors and asserts are for logic errors, not for validating user input!

I'd say that invalid UTF-8 in `string`s _is_ a logic error, because these are defined to be valid UTF-8. If they aren't, someone didn't validate their inputs correctly.

Unfortunately, not even the runtime cares about UTF correctness:

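    void main(string[] args) {
        import std.utf : validate;
        // The runtime passes the arguments through unchecked, so this
        // throws a UTFException when args[1] is not valid UTF-8.
        args[1].validate;
    }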

    # ./testutf8 `echo 'äöü' | recode utf8..latin1`
April 07, 2015
On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
> http://wiki.dlang.org/DIP76

The DIP lists the benefits but does not mention any cons.

A con that I can see is that it violates the 'fail fast' principle. By silently replacing data, the developer will be presented with a probably-hard-to-debug problem later down the application lifecycle (probably in an unrelated area), wasting developer time.

April 07, 2015
On Tue, Apr 07, 2015 at 09:10:32AM +0000, Vladimir Panteleev via Digitalmars-d wrote: [...]
> I think the correct solution to that is to kill auto-decoding :) Then all decoding is explicit, and since it is explicit, it is trivial to allow specifying the desired behavior upon encountering invalid UTF-8.

I used to be pro-autodecoding... nowadays, I'm starting to lean towards killing it. This is another nail in the coffin.


T

-- 
He who sacrifices functionality for ease of use, loses both and deserves neither. -- Slashdotter
April 07, 2015
On Tue, Apr 07, 2015 at 02:21:50AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
[...]
> >I think the correct solution to that is to kill auto-decoding :) Then all decoding is explicit, and since it is explicit, it is trivial to allow specifying the desired behavior upon encountering invalid UTF-8.
> 
> I agree autodecoding is a mistake, but we're stuck with it.

How so? There *are* possible options we can consider to migrate away from autodecoding. AFAICT the real roadblock here is that some people strongly disagree with this, so it's more a community barrier than a technical one.


T

-- 
Unix is my IDE. -- Justin Whear
April 07, 2015
On 4/7/2015 5:04 AM, Abdulhaq wrote:
> On Tuesday, 7 April 2015 at 03:17:26 UTC, Walter Bright wrote:
>> http://wiki.dlang.org/DIP76
>
> The DIP lists the benefits but does not mention any cons.
>
> A con that I can see is that it is violating the 'fail fast' principle. By
> silently replacing data the developer will be presented with a
> probably-hard-to-debug problem later down the application lifecycle (probably in
> an unrelated area), wasting developer time.


On the other hand, if there's any place where people demand the highest performance, it's string processing.
April 07, 2015
On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
> On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
>> I think the correct solution to that is to kill auto-decoding :) Then all
>> decoding is explicit, and since it is explicit, it is trivial to allow
>> specifying the desired behavior upon encountering invalid UTF-8.
>
> I agree autodecoding is a mistake, but we're stuck with it.

I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one.

I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.
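
To make the comparison point concrete, here is a small sketch using `std.uni.normalize`. `sameText` is a hypothetical helper name, and NFC is just one possible choice of normalization form.

```d
import std.algorithm.comparison : equal;
import std.uni : normalize, NFC;

// Hypothetical helper: compare two strings after normalizing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";  // e-acute as a single code point
    string decomposed = "e\u0301";  // 'e' followed by a combining acute accent

    // Comparing code point by code point (what decoding to dchar gives you)
    // still reports the two spellings as different.
    assert(!equal(precomposed, decomposed));

    // Only after normalization do they compare equal.
    assert(sameText(precomposed, decomposed));
}
```

So decoding to code points on its own doesn't buy a correct comparison for non-ASCII text; an explicit normalization step is still needed.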
April 07, 2015
On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
> On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
> >On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
> >>I think the correct solution to that is to kill auto-decoding :) Then all decoding is explicit, and since it is explicit, it is trivial to allow specifying the desired behavior upon encountering invalid UTF-8.
> >
> >I agree autodecoding is a mistake, but we're stuck with it.
> 
> I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one.
> 
> I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.

If somebody were to write a DIP for killing autodecoding, I'd vote in favor.

Getting it past Andrei, OTOH, is a different story. ;-)


T

-- 
Never trust an operating system you don't have source for! -- Martin Schulze
April 07, 2015
On Tue, 7 Apr 2015 11:16:16 -0700
"H. S. Teoh via Digitalmars-d" <digitalmars-d@puremagic.com> wrote:

> On Tue, Apr 07, 2015 at 06:00:12PM +0000, w0rp via Digitalmars-d wrote:
> > On Tuesday, 7 April 2015 at 09:21:52 UTC, Walter Bright wrote:
> > >On 4/7/2015 2:10 AM, Vladimir Panteleev wrote:
> > >>I think the correct solution to that is to kill auto-decoding :) Then all decoding is explicit, and since it is explicit, it is trivial to allow specifying the desired behavior upon encountering invalid UTF-8.
> > >
> > >I agree autodecoding is a mistake, but we're stuck with it.
> > 
> > I don't think we are stuck with it. I think we can change it. I think a lot of the automatic decoding happens inside of Phobos, while people care mostly about the boundaries of the API. If we do get rid of it, then as Vladimir says, you can opt in to whether or not you want a non-throwing conversion, or a throwing one.
> > 
> > I was going to write about how the auto decoding doesn't solve the problem of comparing strings, given that you need to look at ranges of characters, subject to normalisation, unless you're dealing with just ASCII. I think all of that has been said to death, though. I think it's possible for us to get rid of automatic decoding.
> 
> If somebody were to write a DIP for killing autodecoding, I'd vote in favor.
> 
me too