The Case For Autodecode
June 03, 2016
This is mostly me trying to make sense of the discussion.

So everyone hates autodecoding. But Andrei seems to hate it a good bit less than everyone else. As far as I could follow, he has one reason for that, which might not be clear to everyone:

char converts implicitly to dchar, so the compiler lets you search for a dchar in a range of chars. But that gives nonsensical results. For example, you won't find 'ö' in "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8).
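
To make that concrete, here's a minimal sketch of both failure modes (untested, using std.algorithm's canFind):

----
import std.algorithm : canFind;
import std.utf : byChar;

void main()
{
    // False positive: '¶' (U+00B6) matches the second UTF-8 code
    // unit of "ö" (0xC3 0xB6) by numeric value.
    assert("ö".byChar.canFind('¶'));

    // False negative: 'ö' (U+00F6) equals neither code unit.
    assert(!"ö".byChar.canFind('ö'));
}
----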

The same does not happen when searching for a grapheme in a range of code points, because you just can't do that accidentally. dchar does not implicitly convert to std.uni.Grapheme.
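
A quick sketch (untested) of that compile-time rejection:

----
import std.uni : Grapheme, byGrapheme;

void main()
{
    // A code point does not implicitly convert to a Grapheme...
    static assert(!is(dchar : Grapheme));
    // ...so the accidental comparison doesn't even typecheck.
    auto g = "ö".byGrapheme.front;
    static assert(!__traits(compiles, g == '¶'));
}
----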

So autodecoding shields the user from one surprising aspect of narrow strings, and indeed this one kind of problem does not exist with code points.

So:
code units - a lot of surprises
code points - a lot of surprises minus one

I don't think this makes autodecoding actually desirable, but I do think it prevents a mistake that could otherwise be common.

The issue could also be avoided by making char not convert implicitly to dchar. I would like that, but it would of course be another substantial breaking change.

At Andrei: Apologies if I'm misrepresenting your position. If you have other arguments in favor of autodecoding, they haven't gotten through to me.

At everyone: Apologies if I'm just stating the obvious here. I needed this pointed out, and it happened in the depths of the other thread. So maybe this is an aspect others haven't considered either.

Finally, this is not the only argument in favor of *keeping* autodecoding, of course. Not wanting to break user code is the big one there, I guess.
June 03, 2016
On 6/3/16 7:24 AM, ag0aep6g wrote:
> This is mostly me trying to make sense of the discussion.
>
> So everyone hates autodecoding. But Andrei seems to hate it a good bit
> less than everyone else. As far as I could follow, he has one reason for
> that, which might not be clear to everyone:

I don't hate autodecoding. What I hate is that char[] autodecodes.

If strings were some auto-decoding type that wasn't immutable(char)[], that would be absolutely fine with me. In fact, I see this as the only way to fix this, since it shouldn't break any code.
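
Something like this, say (a hypothetical sketch, name and details made up): a wrapper that keeps the immutable(char)[] storage but iterates by decoded code point.

----
import std.utf : decode, stride;

struct DecodingString
{
    immutable(char)[] data;

    @property bool empty() const { return data.length == 0; }

    @property dchar front() const
    {
        size_t i = 0;
        return decode(data, i); // decode one code point, don't advance
    }

    void popFront()
    {
        data = data[stride(data) .. $]; // skip one code unit sequence
    }
}
----

Range algorithms would autodecode through a type like that, while char[] itself stays a plain array of code units.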

> char converts implicitly to dchar, so the compiler lets you search for a
> dchar in a range of chars. But that gives nonsensical results. For
> example, you won't find 'ö' in "ö".byChar, but you will find '¶' in
> there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in
> UTF-8).

Question: why couldn't the compiler emit (in non-release builds) a runtime check to make sure you aren't converting non-ASCII characters to dchars? That is, like out of bounds checking, but for char -> dchar conversions, or any other invalid mechanism?

Yep, it's going to kill a lot of performance. But it's going to catch a lot of problems.
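
Such a check could be pretty simple. As a plain function (just a sketch; the compiler would insert the equivalent at conversion sites):

----
dchar checkedWiden(char c)
{
    // Like an array bounds check: only ASCII code units stand alone
    // in UTF-8, so widening anything else is almost certainly a bug.
    assert(c <= 0x7F, "converting non-ASCII char to dchar");
    return c;
}
----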

One thing to point out here is that autodecoding only happens on arrays, and even then, only in certain cases.

-Steve
June 03, 2016
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
> Finally, this is not the only argument in favor of *keeping* autodecoding, of course. Not wanting to break user code is the big one there, I guess.

A lot of discussion is disagreement on understanding of correctness of unicode support. I see 4 possible meanings here:
1. Implemented according to spec.
2. Provides level 1 unicode support.
3. Provides level 2 unicode support.
4. Achieves the goal of unicode, i.e. text processing according to natural language rules.
June 03, 2016
On 06/03/2016 03:56 PM, Kagamin wrote:
> A lot of discussion is disagreement on understanding of correctness of
> unicode support. I see 4 possible meanings here:
> 1. Implemented according to spec.
> 2. Provides level 1 unicode support.
> 3. Provides level 2 unicode support.
> 4. Achieves the goal of unicode, i.e. text processing according to
> natural language rules.

Speaking of that, the document that Walter dug up [1], which talks about support levels, is about regular expression engines in particular. It's not about general language support.

The version he linked to is also pretty old. A more recent revision [2] calls level 1 (code points) the "minimally useful level of support", speaks warmly about level 2 (graphemes), and says that level 3 (locale dependent behavior) is "only useful for specific applications".


[1] http://unicode.org/reports/tr18/tr18-5.1.html
[2] http://www.unicode.org/reports/tr18/tr18-17.html
June 03, 2016
On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
> This is mostly me trying to make sense of the discussion.
>
> So everyone hates autodecoding. But Andrei seems to hate it a good bit less than everyone else. As far as I could follow, he has one reason for that, which might not be clear to everyone:
>
> char converts implicitly to dchar, so the compiler lets you search for a dchar in a range of chars. But that gives nonsensical results. For example, you won't find 'ö' in "ö".byChar, but you will find '¶' in there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6 in UTF-8).

You mean that '¶' is represented internally as 1 byte 0xB6 and that it can be handled as such without error? This would mean that char literals are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
Sorry if I misunderstood, I'm only starting to learn D.


June 03, 2016
On 6/3/16 1:51 PM, Patrick Schluter wrote:
> On Friday, 3 June 2016 at 11:24:40 UTC, ag0aep6g wrote:
>> This is mostly me trying to make sense of the discussion.
>>
>> So everyone hates autodecoding. But Andrei seems to hate it a good bit
>> less than everyone else. As far as I could follow, he has one reason
>> for that, which might not be clear to everyone:
>>
>> char converts implicitly to dchar, so the compiler lets you search for
>> a dchar in a range of chars. But that gives nonsensical results. For
>> example, you won't find 'ö' in "ö".byChar, but you will find '¶' in
>> there ('¶' is U+00B6, 'ö' is U+00F6, and 'ö' is encoded as 0xC3 0xB6
>> in UTF-8).
>
> You mean that '¶' is represented internally as 1 byte 0xB6 and that it
> can be handled as such without error? This would mean that char literals
> are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
> Sorry if I misunderstood, I'm only starting to learn D.

Not if '¶' is a dchar.

What is happening in the example is that find is looking at the "ö".byChar range and saying "hm... can I compare dchar('¶') to char? Well, char implicitly casts to dchar, so I'm good!", but a direct cast of the bits from char does NOT mean the same thing as a dchar. It has to go through a decoding first.

The real problem here is that char implicitly casts to dchar. That should not be allowed.

-Steve
June 03, 2016
On 06/03/2016 07:51 PM, Patrick Schluter wrote:
> You mean that '¶' is represented internally as 1 byte 0xB6 and that it
> can be handled as such without error? This would mean that char literals
> are broken. The only valid way to represent '¶' in memory is 0xC2 0xB6.
> Sorry if I misunderstood, I'm only starting to learn D.

There is no single char for '¶', that's right, and D gets that right. That's not what happens.

But there is a single wchar for it. wchar is a UTF-16 code unit, 2 bytes. UTF-16 encodes '¶' as a single code unit, so that's correct.

The problem is that you can accidentally search for a wchar in a range of chars. Every char is compared to the wchar by numeric value. But the numeric values of a char don't mean the same as those of a wchar, so you get nonsensical results.

A similar implicit conversion lets you search for a large number in a byte[]:

----
import std.stdio : writeln;

byte[] arr = [1, 2, 3];
foreach(x; arr) if (x == 1000) writeln("found it!");
----

You won't ever find 1000 in a byte[], of course. The byte type simply can't store the value. But you can compare a byte with an int. And that comparison is meaningful, unlike the comparison of a char with a wchar.

You can also produce false positives with numeric types, by mixing signed and unsigned types:

----
import std.stdio : writeln;

int[] arr = [1, -1, 3];
foreach(x; arr) if (x == uint.max) writeln("found it!");
----

uint.max is a large number, -1 is a small number. They're considered equal here because of an implicit conversion that messes with the meaning of the bits.

False negatives are not possible with numeric types. At least not in the same way as with differently sized Unicode code units.
June 03, 2016
On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
> but a direct cast
> of the bits from char does NOT mean the same thing as a dchar.

That gives me an idea. A bitwise reinterpretation of int to float is nonsensical, too. Yet int implicitly converts to float and (for small values) preserves the meaning. I mean, implicit conversion doesn't have to mean bitwise reinterpretation.

How about replacing non-standalone code units with the replacement character (U+FFFD) in implicit widening conversions?

For example:

----
char c = "ö"[0]; // 0xC3, the first UTF-8 code unit of 'ö'
wchar w = c;     // proposed: not a standalone code unit -> U+FFFD
assert(w == '\uFFFD');
----

Would probably just be band-aid, though.
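
As a standalone function the rule would look something like this (hypothetical, of course, since the point would be to change the implicit conversion itself):

----
wchar widen(char c)
{
    // In UTF-8, only code units below 0x80 stand alone; lead and
    // continuation bytes map to the replacement character instead.
    return c < 0x80 ? cast(wchar) c : '\uFFFD';
}
----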
June 03, 2016
On Friday, 3 June 2016 at 18:36:45 UTC, Steven Schveighoffer wrote:
>
> The real problem here is that char implicitly casts to dchar. That should not be allowed.
>
Indeed.


June 03, 2016
On 6/3/16 2:55 PM, ag0aep6g wrote:
> On 06/03/2016 08:36 PM, Steven Schveighoffer wrote:
>> but a direct cast
>> of the bits from char does NOT mean the same thing as a dchar.
>
> That gives me an idea. A bitwise reinterpretation of int to float is
> nonsensical, too. Yet int implicitly converts to float and (for small
> values) preserves the meaning. I mean, implicit conversion doesn't have
> to mean bitwise reinterpretation.

I'm pretty sure the CPU handles this, though.

> How about replacing non-standalone code units with the replacement
> character (U+FFFD) in implicit widening conversions?
>
> For example:
>
> ----
> char c = "ö"[0];
> wchar w = c;
> assert(w == '\uFFFD');
> ----
>
> Would probably just be band-aid, though.

Except many chars *do* properly convert. This should work:

char c = 'a';
dchar d = c;
assert(d == 'a');

As I mentioned in my earlier reply, some kind of "bounds checking" for the conversion could be a possibility.

Hm... an interesting possibility:

dchar _dchar_convert(char c)
{
    // sign extension: code units >= 0x80 become negative ints, which
    // reinterpret as dchar values far above U+10FFFF and so never
    // compare equal to any valid character
    return cast(int)cast(byte)c;
}
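
If I have the casts right (untested), ASCII still converts cleanly, while non-ASCII code units land outside the valid code point range and can never compare equal to a real character:

----
assert(_dchar_convert('a') == 'a');        // ASCII unchanged
assert(_dchar_convert("ö"[0]) != 'ö');     // 0xC3 -> 0xFFFFFFC3, no match
assert(_dchar_convert("ö"[0]) > 0x10FFFF); // outside Unicode's range
----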

-Steve