June 02, 2016
On 06/02/2016 11:14 AM, deadalnix wrote:
> On Thursday, 2 June 2016 at 15:03:34 UTC, Andrei Alexandrescu wrote:
>>> You start to sound like a car salesman.
>>
>> I assume that means overselling or false advertising. Where do either
>> of these happen? -- Andrei
>
> For SDC, for instance, autodecode is a problem (in fact, it is the very
> reason I abandoned making the lexer usable as a standalone) while RCStr
> would not help one bit, as strings are pretty much never manipulated
> directly anywhere.

Well, I'm not sure how SDC works.

> More generally, using RCStr is at best sidestepping the issue rather
> than solving it.

What is the issue?

> On the GC side of the issue, I think there are also overstatements.

What are those?


Andrei

June 02, 2016
On 06/02/2016 11:26 AM, Jonathan M Davis via Digitalmars-d wrote:
> Unless we're outright getting rid of string, char[], wstring, etc., RCStr
> clearly doesn't solve the auto-decoding problem.

It does if you use it. If you don't, it doesn't. -- Andrei
June 07, 2016
On Wednesday, 1 June 2016 at 00:46:04 UTC, Walter Bright wrote:
> It is not practical to just delete or deprecate autodecode - it is too embedded into things. What we can do, however, is stop using it ourselves and stop relying on it in the documentation, much like [] is eschewed in favor of std::vector in C++.
>
Hopefully my perspective on the auto-decoding topic is useful rather than disruptive. I work on search applications, both run-time engines and data science. Processing multi-lingual text is an important aspect of these applications. There are a couple of issues with D's current auto-decoding implementation for these applications.

One is the lack of control over error handling when encountering corrupt UTF-8 text. Real-world data contains corrupt UTF-8 sequences, and robust applications need to handle them. Proper handling is generally application specific. Both substituting a replacement character and throwing an exception are useful behaviors, but the ability to choose between them is often necessary. At present, the behavior is built into the low-level primitives, outside application control. Notably, 'front' and 'popFront' behave differently from one another. This is also a consideration for explicitly invoked decoding facilities like 'byUTF'.
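To illustrate the two behaviors with what Phobos offers today, here is a minimal sketch, assuming std.utf.decode (which throws UTFException) and std.utf.byDchar (which substitutes U+FFFD):

    import std.stdio : write, writeln;
    import std.utf : byDchar, decode, UTFException;

    void main()
    {
        // 0xFF can never appear in well-formed UTF-8.
        string corrupt = "abc\xFFdef";

        // decode() throws UTFException on a bad sequence.
        size_t i = 0;
        try
        {
            while (i < corrupt.length)
                write(decode(corrupt, i));
        }
        catch (UTFException e)
        {
            writeln("\ninvalid UTF-8 near index ", i);
        }

        // byDchar substitutes U+FFFD (the replacement character)
        // instead of throwing.
        foreach (dchar c; corrupt.byDchar)
            write(c);
        writeln();
    }

An application gets to pick only by choosing which facility to call; the iteration primitives themselves offer no such knob.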

Another is performance. Iteration that triggers auto-decoding is apparently an order of magnitude more costly than iteration without decoding. That is too large a delta when the algorithm doesn't require decoding, and such algorithms are common. Frankly, I'm surprised the cost is so large. It wouldn't surprise me to find out it's partly a compiler artifact, but it doesn't matter.
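For what it's worth, here is a rough way to measure the delta yourself; a sketch assuming std.datetime.stopwatch's benchmark and std.utf.byCodeUnit, with numbers that will of course vary by machine and compiler:

    import std.algorithm.searching : count;
    import std.datetime.stopwatch : benchmark;
    import std.stdio : writeln;
    import std.utf : byCodeUnit;

    void main()
    {
        // Pure-ASCII input, so decoding is pure overhead here.
        auto buf = new char[1_000_000];
        buf[] = 'a';
        auto s = cast(string) buf;

        size_t hits;
        auto r = benchmark!(
            // Auto-decoding path: front yields dchar, every step decodes.
            () { hits = s.count!(c => c == 'a'); },
            // Non-decoding path: byCodeUnit iterates raw code units.
            () { hits = s.byCodeUnit.count!(c => c == 'a'); }
        )(50);

        writeln("auto-decoded: ", r[0]);
        writeln("code units:   ", r[1]);
    }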

As for what to do about it: if changing the currently built-in auto-decoding is not an option, then perhaps providing parallel facilities that don't auto-decode would do the trick. RCStr seems a real opportunity here. Perhaps also a raw array of UTF-8 code units, à la ubyte[], that doesn't get auto-decoded? With either, explicit decoding would be needed to invoke standard library routines that operate on Unicode code points or graphemes. (It sounds like interaction with character literals could still be an issue, as the actual representation is not obvious.) Having a consistent set of error-handling options for the explicit decoding facilities would be helpful as well.
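As a sketch of the raw code-unit idea: std.string.representation already reinterprets a string as immutable(ubyte)[], which the range primitives treat as a plain array rather than decoding:

    import std.algorithm.searching : find;
    import std.range.primitives : ElementType;
    import std.string : representation;

    void main()
    {
        string s = "résumé";

        // As a string, range iteration auto-decodes to dchar...
        static assert(is(ElementType!string == dchar));

        // ...but reinterpreted as immutable(ubyte)[] it is just an
        // array of code units, and Phobos never decodes it.
        immutable(ubyte)[] raw = s.representation;

        auto tail = raw.find(cast(ubyte) 's');
        assert(tail.length == 5); // "sumé": 3 ASCII bytes + 2 for 'é'
    }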

Another possibility would be support for detecting inadvertent auto-decoding. D has very nice support for ensuring or detecting code properties (e.g. '@nogc', the '-vgc' compiler option). If there were a way to identify code that triggers auto-decoding, that would be useful.
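Nothing like '-vgc' exists for auto-decoding as far as I know, but a crude compile-time guard can be written in library code today. A hypothetical sketch (the name noAutoDecode is invented here):

    import std.range.primitives : ElementEncodingType, ElementType;
    import std.utf : byCodeUnit;

    // Hypothetical guard: true when iterating R yields the same type the
    // range actually stores, i.e. no auto-decoding takes place.
    enum noAutoDecode(R) = is(ElementType!R == ElementEncodingType!R);

    static assert(!noAutoDecode!string);                   // auto-decodes
    static assert(noAutoDecode!(immutable(ubyte)[]));      // raw bytes don't
    static assert(noAutoDecode!(typeof("".byCodeUnit()))); // nor does byCodeUnit

A static assert like this in hot code paths would at least catch the inadvertent cases at compile time, though it's no substitute for compiler-level reporting.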

June 07, 2016
It's possible to add a new alias: alias bstring = immutable(ubyte)[];

a new literal postfix (bstring s = "test string"b) or UFCS (bstring s = "test string".b),

add UFCS byCodePoint and byGrapheme,

and add overloaded functions in Phobos where necessary,

so we can have an autodecode-free string.
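A sketch of how those pieces could fit together today: the literal postfix would need a language change, so a UFCS function stands in for it, and the names and semantics here are assumptions rather than a worked-out proposal:

    import std.string : representation;
    import std.uni : uniByGrapheme = byGrapheme;
    import std.utf : byUTF;

    // The proposed alias: raw UTF-8 code units that Phobos treats as a
    // plain array and never auto-decodes.
    alias bstring = immutable(ubyte)[];

    // UFCS stand-in for the proposed "test string".b postfix (a real
    // `b` suffix would need a language change).
    bstring b(string s)
    {
        return s.representation;
    }

    // The proposed explicit, opt-in decoders.
    auto byCodePoint(bstring s)
    {
        return (cast(string) s).byUTF!dchar; // substitutes U+FFFD on errors
    }

    auto byGrapheme(bstring s)
    {
        return (cast(string) s).uniByGrapheme;
    }

    unittest
    {
        bstring s = "test string".b;
        foreach (dchar c; s.byCodePoint) {} // decode only when asked
    }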

