June 02, 2016
On 06/02/2016 11:14 AM, deadalnix wrote:
> On Thursday, 2 June 2016 at 15:03:34 UTC, Andrei Alexandrescu wrote:
>>> You start to sound like a car salesman.
>>
>> I assume that means overselling or false advertising. Where do either
>> of these happen? -- Andrei
>
> For SDC, for instance, autodecode is a problem (in fact, it is the very
> reason I abandoned making the lexer usable as a standalone) while RCStr
> would not help one bit, as strings are pretty much never manipulated
> directly anywhere.

Well, I'm not sure how SDC works.

> More generally, using RCStr is at best sidestepping the issue rather
> than solving it.

What is the issue?

> On the GC side of the issue, I think there are also overstatements.

What are those?


Andrei

June 02, 2016
On 06/02/2016 11:26 AM, Jonathan M Davis via Digitalmars-d wrote:
> Unless we're outright getting rid of string, char[], wstring, etc., RCStr
> clearly doesn't solve the auto-decoding problem.

It does if you use it. If you don't, it doesn't. -- Andrei
June 07, 2016
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it is too embedded into things. What we can do, however, is stop using it ourselves and stop relying on it in the documentation, much like [] is eschewed in favor of std::vector in C++.
>
Hopefully my perspective on the auto-decoding topic is useful rather than disruptive. I work on search applications, both run-time engines and data science. Processing multi-lingual text is an important aspect of these applications. There are a couple of issues with D's current auto-decoding implementation for these applications.

One is the lack of control over error handling when encountering corrupt UTF-8 text. Real-world data contains corrupt UTF-8 sequences, and robust applications need to handle them. Proper handling is generally application specific. Both substituting a replacement character and throwing an exception are useful behaviors, but the ability to choose between them is often necessary. At present, the behavior is built into the low-level primitives, outside application control. Notably, 'front' and 'popFront' behave differently from one another. This is also a consideration for explicitly invoked decoding facilities like 'byUTF'.
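To illustrate the two behaviors with what Phobos offers today, here is a minimal sketch, assuming std.utf.decode (which throws UTFException) and std.utf.byDchar (which substitutes U+FFFD):

    import std.stdio : write, writeln;
    import std.utf : byDchar, decode, UTFException;

    void main()
    {
        // 0xFF can never appear in well-formed UTF-8.
        string corrupt = "abc\xFFdef";

        // decode() throws UTFException on a bad sequence.
        size_t i = 0;
        try
        {
            while (i < corrupt.length)
                write(decode(corrupt, i));
        }
        catch (UTFException e)
        {
            writeln("\ninvalid UTF-8 near index ", i);
        }

        // byDchar substitutes U+FFFD (the replacement character)
        // instead of throwing.
        foreach (dchar c; corrupt.byDchar)
            write(c);
        writeln();
    }

An application gets to pick only by choosing which facility to call; the iteration primitives themselves offer no such knob.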

Another is performance. Iteration that triggers auto-decoding is apparently an order of magnitude more costly than iteration without decoding. That is too large a delta when the algorithm doesn't require decoding, and such algorithms are common. Frankly, I'm surprised the cost is so large. It wouldn't surprise me to find out it's partly a compiler artifact, but it doesn't matter.
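For what it's worth, here is a rough way to measure the delta yourself; a sketch assuming std.datetime.stopwatch's benchmark and std.utf.byCodeUnit, with numbers that will of course vary by machine and compiler:

    import std.algorithm.searching : count;
    import std.datetime.stopwatch : benchmark;
    import std.stdio : writeln;
    import std.utf : byCodeUnit;

    void main()
    {
        // Pure-ASCII input, so decoding is pure overhead here.
        auto buf = new char[1_000_000];
        buf[] = 'a';
        auto s = cast(string) buf;

        size_t hits;
        auto r = benchmark!(
            // Auto-decoding path: front yields dchar, every step decodes.
            () { hits = s.count!(c => c == 'a'); },
            // Non-decoding path: byCodeUnit iterates raw code units.
            () { hits = s.byCodeUnit.count!(c => c == 'a'); }
        )(50);

        writeln("auto-decoded: ", r[0]);
        writeln("code units:   ", r[1]);
    }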

As for what to do about it: if changing the currently built-in auto-decoding is not an option, then perhaps providing parallel facilities that don't auto-decode would do the trick. RCStr seems a real opportunity here. Perhaps also a raw array of UTF-8 code units, à la ubyte[], that doesn't get auto-decoded? With either, explicit decoding would be needed to invoke standard library routines that operate on Unicode code points or graphemes. (It sounds like interaction with character literals could still be an issue, as the actual representation is not obvious.) Having a consistent set of error-handling options for the explicit decoding facilities would be helpful as well.
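As a sketch of the raw code-unit idea: std.string.representation already reinterprets a string as immutable(ubyte)[], which the range primitives treat as a plain array rather than decoding:

    import std.algorithm.searching : find;
    import std.range.primitives : ElementType;
    import std.string : representation;

    void main()
    {
        string s = "résumé";

        // As a string, range iteration auto-decodes to dchar...
        static assert(is(ElementType!string == dchar));

        // ...but reinterpreted as immutable(ubyte)[] it is just an
        // array of code units, and Phobos never decodes it.
        immutable(ubyte)[] raw = s.representation;

        auto tail = raw.find(cast(ubyte) 's');
        assert(tail.length == 5); // "sumé": 3 ASCII bytes + 2 for 'é'
    }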

Another possibility would be support for detecting inadvertent auto-decoding. D has very nice support for ensuring or detecting code properties (e.g. '@nogc', the '-vgc' compiler option). If there were a way to identify code that triggers auto-decoding, that would be useful.
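Nothing like '-vgc' exists for auto-decoding as far as I know, but a crude compile-time guard can be written in library code today. A hypothetical sketch (the name noAutoDecode is invented here):

    import std.range.primitives : ElementEncodingType, ElementType;
    import std.utf : byCodeUnit;

    // Hypothetical guard: true when iterating R yields the same type the
    // range actually stores, i.e. no auto-decoding takes place.
    enum noAutoDecode(R) = is(ElementType!R == ElementEncodingType!R);

    static assert(!noAutoDecode!string);                   // auto-decodes
    static assert(noAutoDecode!(immutable(ubyte)[]));      // raw bytes don't
    static assert(noAutoDecode!(typeof("".byCodeUnit()))); // nor does byCodeUnit

A static assert like this in hot code paths would at least catch the inadvertent cases at compile time, though it's no substitute for compiler-level reporting.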

June 07, 2016
It's possible to add a new alias: alias bstring = immutable(ubyte)[];

a new literal postfix (bstring s = "test string"b) or UFCS (bstring s = "test string".b),

add UFCS byCodePoint and byGrapheme,

and add overloaded functions in Phobos where necessary,

so we can have an autodecode-free string.
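A sketch of how those pieces could fit together today: the literal postfix would need a language change, so a UFCS function stands in for it, and the names and semantics here are assumptions rather than a worked-out proposal:

    import std.string : representation;
    import std.uni : uniByGrapheme = byGrapheme;
    import std.utf : byUTF;

    // The proposed alias: raw UTF-8 code units that Phobos treats as a
    // plain array and never auto-decodes.
    alias bstring = immutable(ubyte)[];

    // UFCS stand-in for the proposed "test string".b postfix (a real
    // `b` suffix would need a language change).
    bstring b(string s)
    {
        return s.representation;
    }

    // The proposed explicit, opt-in decoders.
    auto byCodePoint(bstring s)
    {
        return (cast(string) s).byUTF!dchar; // substitutes U+FFFD on errors
    }

    auto byGrapheme(bstring s)
    {
        return (cast(string) s).uniByGrapheme;
    }

    unittest
    {
        bstring s = "test string".b;
        foreach (dchar c; s.byCodePoint) {} // decode only when asked
    }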

