May 13, 2016
On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
> PS Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem:
>
> "StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."

https://twitter.com/StopForumSpam
May 13, 2016
On Friday, 13 May 2016 at 14:06:28 UTC, Vladimir Panteleev wrote:
> On Friday, 13 May 2016 at 13:41:30 UTC, Chris wrote:
>> PS Why do I get a "StopForumSpam error" every time I post today? Has anyone else experienced the same problem:
>>
>> "StopForumSpam error: Socket error: Lookup error: getaddrinfo error: Name or service not known. Please solve a CAPTCHA to continue."
>
> https://twitter.com/StopForumSpam

I don't understand. Does that mean we have to solve CAPTCHAs every time we post? Annoying CAPTCHAs at that.
May 13, 2016
On 5/12/16 4:15 PM, Walter Bright wrote:

> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
> benefit of being arrays in the first place.

I'll repeat what I said in the other thread.

The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays.

If you think this code makes sense, then my definition of sane varies slightly from yours:

static assert(!hasLength!R && is(typeof(R.init.length)));
static assert(!is(ElementType!R == typeof(R.init[0])));
static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $])));

I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodeUnit, or .representation, etc. and it's very unwieldy.
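For reference, a self-contained version of those asserts (assuming R is plain string), plus the opt-outs I mean - .representation and byCodeUnit are what you have to reach for today:

import std.range.primitives : ElementType, hasLength, isRandomAccessRange;
import std.string : representation;
import std.utf : byCodeUnit;

alias R = string;   // i.e. immutable(char)[]

// All of these pass today: the type has .length, indexing and slicing,
// yet the range primitives pretend it's a non-random-access range of dchar.
static assert(!hasLength!R && is(typeof(R.init.length)));
static assert(!is(ElementType!R == typeof(R.init[0])));   // dchar vs. immutable(char)
static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $])));

// Opting out of auto-decoding restores sane array/range behavior:
static assert(isRandomAccessRange!(typeof(R.init.representation)));
static assert(isRandomAccessRange!(typeof(R.init.byCodeUnit)));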

If I ran D, that's what I would do.

-Steve
May 13, 2016
On Fri, May 13, 2016 at 12:16:30PM +0000, Nick Treleaven via Digitalmars-d wrote:
> On Friday, 13 May 2016 at 00:47:04 UTC, Jack Stouffer wrote:
> >If you're serious about removing auto-decoding, which I think you and others have shown has merits, you have to have THE SIMPLEST migration path ever, or you will kill D. I'm talking a simple press of a button.
> 
> char[] is always going to be unsafe for UTF-8. I don't think we can remove it or auto-decoding, only discourage use of it. We need a String struct IMO, without length or indexing. Its front can do autodecoding, and it has a ubyte[] raw() property too. (Possibly the byte length of front could be cached for use in popFront, assuming that's faster.) This would be a gradual transition.

import std.uni;
alias String = typeof(std.uni.byGrapheme(immutable(char)[].init));

:-)

Well, OK, perhaps you could wrap this in a struct that allows extraction of .raw, etc. But basically this isn't hard to implement today. We already have all of the tools necessary.
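Something along these lines, say (only a rough sketch; the name String and the exact primitives are placeholders for what Nick described - front by grapheme, no length or indexing, and a .raw escape hatch):

import std.string : representation;
import std.uni : Grapheme, decodeGrapheme, graphemeStride;

struct String
{
    private immutable(char)[] payload;

    @property bool empty() { return payload.length == 0; }

    @property Grapheme front()
    {
        auto tmp = payload;
        return decodeGrapheme(tmp);   // reads one full grapheme cluster
    }

    void popFront()
    {
        // graphemeStride gives the grapheme's length in code units
        payload = payload[graphemeStride(payload, 0) .. $];
    }

    @property immutable(ubyte)[] raw() { return payload.representation; }
}

unittest
{
    import std.algorithm.comparison : equal;
    import std.range : walkLength;

    auto s = String("o\u0308k");        // NFD "ök": 'o' + combining diaeresis + 'k'
    assert(s.walkLength == 2);          // 2 graphemes, despite 3 code points
    assert(s.front[].equal("o\u0308")); // the grapheme keeps both code points
    assert(s.raw.length == 4);          // 4 UTF-8 code units, no decoding
}

Caching the stride computed in front for reuse in popFront, as Nick suggests, would be a straightforward addition.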


T

-- 
Dogs have owners ... cats have staff. -- Krista Casada
May 13, 2016
On Fri, 13 May 2016 10:49:24 +0000, Marc Schütz <schuetzm@gmx.net> wrote:

> In fact, even most European languages are affected if NFD normalization is used, which is the default on MacOS X.
> 
> And this is actually the main problem with it: It was introduced to make unicode string handling correct. Well, it doesn't, therefore it has no justification.

+1 for leaning back and contemplating exactly what auto-decode was aiming for and how it missed that goal.

You'll see that an ö may still be cut between the o and the ¨. Hangul symbols are composed of pieces that go in different corners. Those would also be split up by auto-decode.

Can we handle real-world text AT ALL? Are graphemes good enough to find the column in a fixed-width display of some string (e.g. line+column of an error)? No, there may still be full-width characters in there that take 2 columns. :p
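For instance (a quick sketch, nothing beyond what std.uni already offers):

import std.algorithm.comparison : equal;
import std.range : take, walkLength;
import std.uni : byGrapheme;

void main()
{
    // NFD, as OS X likes to hand it out: 'o' followed by U+0308 COMBINING DIAERESIS
    string s = "scho\u0308n";

    // Code-point slicing - what auto-decode gives you - cuts between the o and the ¨:
    assert(s.take(4).equal("scho"));       // the ¨ is left behind
    assert(s.walkLength == 6);             // 6 code points

    // Grapheme iteration keeps the pieces together:
    assert(s.byGrapheme.walkLength == 5);  // s c h ö n
}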

-- 
Marco

May 13, 2016
On Friday, 13 May 2016 at 01:00:54 UTC, Walter Bright wrote:
> On 5/12/2016 5:47 PM, Jack Stouffer wrote:
>> D is much less popular now than was Python at the time, and Python 2 problems
>> were more straight forward than the auto-decoding problem.  You'll need a very
>> clear migration path, years long deprecations, and automatic tools in order to
>> make the transition work, or else D's usage will be permanently damaged.
>
> I agree, if it is possible at all.

A plan:
1. Mark as deprecated the places where auto-decoding is used. I think that's all the "range" functions for strings (front, popFront, back, ...). Force the use of byChar & co. (a rough sketch of what this could look like follows below).

2. Introduce a new String type in Phobos.

3. After ages, make immutable(char)[] an ordinary array.
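
For step 1, roughly something like this (just a sketch of the idea - the real Phobos signatures differ in details):

import std.traits : isNarrowString;
import std.utf : decode;

// Hypothetical: the narrow-string front() in std.range.primitives grows a
// deprecation message pointing at the explicit alternatives.
deprecated("auto-decoding; use .byChar/.byCodeUnit or .representation instead")
@property dchar front(T)(const(T)[] s)
if (isNarrowString!(T[]))
{
    size_t i = 0;
    return decode(s, i);   // same decoding as today, just deprecated
}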

Is it OK? Profit?
May 13, 2016
On Friday, May 13, 2016 11:00:19 Marc Schütz via Digitalmars-d wrote:
> On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> > Ideally, algorithms would be Unicode aware as appropriate, but the default would be to operate on code units with wrappers to handle decoding by code point or grapheme. Then it's easy to write fast code while still allowing for full correctness. Granted, it's not necessarily easy to get correct code that way, but anyone who wants full correctness without caring about efficiency can just use ranges of graphemes. Ranges of code points are rare regardless.
>
> char[], wchar[] etc. can simply be made non-ranges, so that the user has to choose between .byCodePoint, .byCodeUnit (or .representation as it already exists), .byGrapheme, or even higher-level units like .byLine or .byWord. Ranges of char, wchar however stay as they are today. That way it's harder to accidentally get it wrong.

It also means yet more special cases. You have arrays which aren't treated as ranges when every other type of array out there is treated as a range. And even if that's what we want to do, there isn't really a clean deprecation path.

> There is a simple deprecation path that's already been suggested. `isInputRange` and friends can output a helpful deprecation warning when they're called with a range that currently triggers auto-decoding.

How would you put a deprecation message inside of an eponymous template like isInputRange?  Deprecation messages are triggered when a symbol is used, not when it passes or fails a static if inside of a template. And even if we did something like put a pragma in isInputRange, you'd get a _flood_ of messages in any program that does much of anything with ranges and strings. It's a possible path, but it sure isn't a pretty one. Honestly, I'd have to wonder whether just outright breaking code would be better.
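To illustrate: about the closest you can get is a pragma inside the template, and that fires when the template is instantiated, not when a deprecated symbol is used (isInputRangeWarn below is purely hypothetical):

import std.range.primitives : isInputRange;
import std.traits : isNarrowString;

template isInputRangeWarn(R)
{
    // There is no separate symbol here to mark `deprecated`, so the only
    // hook is pragma(msg) - and it prints in every module that instantiates
    // this with a narrow string type.
    static if (isNarrowString!R)
        pragma(msg, "note: " ~ R.stringof ~ " is an auto-decoding range");
    enum bool isInputRangeWarn = isInputRange!R;
}

static assert(isInputRangeWarn!string);   // prints the note at compile time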

- Jonathan M Davis


May 13, 2016
On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:
> On 5/12/16 4:15 PM, Walter Bright wrote:
>
>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>> benefit of being arrays in the first place.
>
> I'll repeat what I said in the other thread.
>
> The problem isn't auto-decoding. The problem is hijacking the char[] and wchar[] (and variants) array type to mean autodecoding non-arrays.
>
> If you think this code makes sense, then my definition of sane varies slightly from yours:
>
> static assert(!hasLength!R && is(typeof(R.init.length)));
> static assert(!is(ElementType!R == typeof(R.init[0])));
> static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) && is(typeof(R.init[0 .. $])));
>
> I think D would be fine if string meant some auto-decoding struct with an immutable(char)[] array backing. I can accept and work with that. I can transform that into a char[] that makes sense if I have no use for auto-decoding. As of today, I have to use byCodeUnit, or .representation, etc. and it's very unwieldy.
>
> If I ran D, that's what I would do.
>
> -Steve

Well, the "auto" part of autodecoding means "automatically doing it for plain strings", right? If you explicitly do decoding, I think it would just be "decoding"; there's no "auto" part.

I doubt anyone is going to complain if you add in a struct wrapper around a string that iterates over code units or graphemes. The issue most people have, as you say, is the fact that the default for strings is to decode.

May 13, 2016
On 5/13/16 5:25 PM, Alex Parrill wrote:
> On Friday, 13 May 2016 at 16:05:21 UTC, Steven Schveighoffer wrote:
>> On 5/12/16 4:15 PM, Walter Bright wrote:
>>
>>> 10. Autodecoded arrays cannot be RandomAccessRanges, losing a key
>>> benefit of being arrays in the first place.
>>
>> I'll repeat what I said in the other thread.
>>
>> The problem isn't auto-decoding. The problem is hijacking the char[]
>> and wchar[] (and variants) array type to mean autodecoding non-arrays.
>>
>> If you think this code makes sense, then my definition of sane varies
>> slightly from yours:
>>
>> static assert(!hasLength!R && is(typeof(R.init.length)));
>> static assert(!is(ElementType!R == typeof(R.init[0])));
>> static assert(!isRandomAccessRange!R && is(typeof(R.init[0])) &&
>> is(typeof(R.init[0 .. $])));
>>
>> I think D would be fine if string meant some auto-decoding struct with
>> an immutable(char)[] array backing. I can accept and work with that. I
>> can transform that into a char[] that makes sense if I have no use for
>> auto-decoding. As of today, I have to use byCodeUnit, or
>> .representation, etc. and it's very unwieldy.
>>
>> If I ran D, that's what I would do.
>>
>
> Well, the "auto" part of autodecoding means "automatically doing it for
> plain strings", right? If you explicitly do decoding, I think it would
> just be "decoding"; there's no "auto" part.

No, the problem isn't the auto-decoding. The problem is having *arrays* do that. Sometimes.

I would be perfectly fine with a custom string type that all string literals were typed as, as long as I can get a sanely behaving array out of it.

> I doubt anyone is going to complain if you add in a struct wrapper
> around a string that iterates over code units or graphemes. The issue
> most people have, as you say, is the fact that the default for strings
> is to decode.

I want to clarify that I don't really care if strings by default auto-decode. I think that's fine. What I dislike is that immutable(char)[] auto-decodes.

-Steve
May 13, 2016
On Friday, May 13, 2016 12:52:13 Kagamin via Digitalmars-d wrote:
> On Friday, 13 May 2016 at 10:38:09 UTC, Jonathan M Davis wrote:
> > IIRC, Andrei talked in TDPL about how Java's choice to go with UTF-16 was worse than the choice to go with UTF-8, because it was correct in many more cases
>
> UTF-16 was a migration from UCS-2, and UCS-2 was superior at the time.

The history of why UTF-16 was chosen isn't really relevant to my point (Win32 has the same problem as Java and for similar reasons).

My point was that if you use UTF-8, then it's obvious _really_ fast when you screwed up Unicode-handling by treating a code unit as a character, because anything beyond ASCII is going to fall flat on its face. But with UTF-16, a _lot_ more code points - and even graphemes - fit in a single code unit, so it's far easier to write code that treats a code unit as if it were a full character without realizing that you're screwing it up. UTF-8 is fail-fast in this regard, whereas UTF-16 is not.

UTF-32 takes that problem to a new level, because now you'll only notice problems when you're dealing with a grapheme constructed of multiple code points. So, odds are that even if you test with Unicode strings, you won't catch the bugs. It'll work 99% of the time, and you'll get subtle bugs the rest of the time.

There are reasons to operate at the code point level, but in general, you either want to be operating at the code unit level or the grapheme level, not the code point level, and if you don't know what you're doing, then anything other than the grapheme level is likely going to be wrong if you're manipulating individual characters. Fortunately, a lot of string processing doesn't need to operate on individual characters and as long as the standard library functions get it right, you'll tend to be okay, but still, operating at the code point level is almost always wrong, and it's even harder to catch when it's wrong than when treating UTF-16 code units as characters.
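A quick way to see the difference in D, counting code units via .length on string/wstring/dstring:

void main()
{
    // UTF-8: anything outside ASCII needs several code units, so treating
    // a code unit as a character breaks as soon as an "é" shows up.
    string s8 = "\u00E9";           // é
    assert(s8.length == 2);         // 2 code units, 1 code point

    // UTF-16: "é" fits in one code unit; the same bug hides until a
    // character outside the BMP arrives as a surrogate pair.
    wstring s16  = "\u00E9";
    wstring s16b = "\U0001F604";    // emoji, encoded as a surrogate pair
    assert(s16.length == 1);
    assert(s16b.length == 2);

    // UTF-32: one code unit per code point, so the bug only surfaces with
    // graphemes built from several code points, e.g. NFD "é".
    dstring s32 = "e\u0301";        // 'e' + combining acute accent
    assert(s32.length == 2);        // 2 code points, but 1 visible character
}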

- Jonathan M Davis