September 28, 2014
On Sunday, 28 September 2014 at 14:38:57 UTC, H. S. Teoh via Digitalmars-d wrote:
> On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via Digitalmars-d wrote:
>> On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
>> >On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>> >>If we can get Andrei on board, I'm all for killing off autodecoding.
>> >
>> >That's rather vague; it's unclear what would replace it. -- Andrei
>> 
>> I believe that removing autodecoding will make things even worse. As
>> far as I understand, if we remove it from the front() function that
>> operates on narrow strings, then front() will return just a single
>> char (one code unit). I believe that processing narrow strings by
>> `user-perceived characters` (graphemes) is the more common use case.
> [...]
>
> Unfortunately this is not what autodecoding does today. Today's
> autodecoding only segments strings into code *points*, which are not the
> same thing as graphemes. For example, combining diacritics are normally
> not considered separate characters from the user's POV, but they *are*
> separate codepoints from their base character. The only reason today's
> autodecoding is even remotely considered "correct" from an intuitive POV
> is because most Western character sets happen to use only precomposed
> characters rather than combining diacritic sequences. If you were
> processing, say, Korean text, the present autodecoding .front would
> *not* give you what you might imagine is a "single character"; it would
> only be halves of Korean graphemes. Which, from a user's POV, would
> suffer from the same issues as dealing with individual bytes in a UTF-8
> stream -- any mistake on the program's part in handling these half-units
> will cause "corruption" of the text (not corruption in the same sense as
> an improperly segmented UTF-8 byte stream, but in the sense that the
> wrong glyphs will be displayed on the screen -- from the user's POV
> these two are basically the same thing).
>
> You might then be tempted to say, well let's make .front return
> graphemes instead. That will solve the "single intuitive character"
> issue, but the performance will be FAR worse than what it is today.
>
> So basically, what we have today is neither efficient nor complete, but
> a halfway solution that mostly works for Western character sets but
> is incomplete for others. We're paying efficiency for only a partial
> benefit. Is it worth the cost?
>
> I think the correct solution is not for Phobos to decide for the
> application at what level of abstraction a string ought to be processed.
> Rather, let the user decide. If they're just dealing with opaque blocks
> of text, decoding or segmenting by grapheme is completely unnecessary --
> they should just operate on byte ranges as opaque data. They should use
> byCodeUnit. If they need to work with Unicode codepoints, let them use
> byCodePoint. If they need to work with individual user-perceived
> characters (i.e., graphemes), let them use byGrapheme.
>
> This is why I proposed the deprecation path of making it illegal to pass
> raw strings to Phobos algorithms -- the caller should specify what level
> of abstraction they want to work with -- byCodeUnit, byCodePoint, or
> byGrapheme. The standard library's job is to empower the D programmer by
> giving him the choice, not to shove a predetermined solution down his
> throat.
>
>
> T

I totally agree with all of that.

It's one of those cases where correct by default is far too slow (that would have to be graphemes) but fast by default is far too broken. Better to force an explicit choice.

There is no magic bullet for unicode in a systems language such as D. The programmer must be aware of it and make choices about how to treat it.
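To make the three levels concrete, here is a small sketch (not from the thread; it assumes a Phobos recent enough to have std.utf.byCodeUnit and std.uni.byGrapheme). The same two-code-point string has three different lengths depending on which level you iterate at:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3);  // UTF-8 code units, no decoding
    assert(s.walkLength == 2);             // code points -- today's autodecoding
    assert(s.byGrapheme.walkLength == 1);  // user-perceived characters
}
```

Today's autodecoded .front would hand an algorithm the 'e' and the combining accent separately, which is exactly the "half a character" problem described above.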
September 28, 2014
On 9/28/2014 4:46 AM, Dmitry Olshansky wrote:
> In all honesty - 2 RAII structs w/o inlining + setting up an exception frame +
> creating and allocating an exception + idup-ing a string does add up to about
> this much.

Twice as much generated code as actually necessary, and this is just for 3 lines of source code.

September 28, 2014
On 9/28/2014 5:06 AM, Uranuz wrote:
> A question: can you list some languages that represent UTF-8 narrow strings as
> arrays of single bytes?

C and C++.
September 28, 2014
On 9/28/2014 10:03 AM, John Colvin wrote:
> There is no magic bullet for unicode in a systems language such as D. The
> programmer must be aware of it and make choices about how to treat it.

That's really the bottom line.

The trouble with autodecode is it is done at the lowest level, meaning it is very hard to bypass. By moving the decision up a level (by using .byDchar or .byCodeUnit adapters) the caller makes the decision.
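As a sketch of what "moving the decision up a level" looks like in caller code (assuming the newly added std.utf adapters), the caller names the level once and the algorithm simply operates on whatever range it is handed:

```d
import std.algorithm : canFind;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string path = "résumé.txt";

    // Code-unit level: no decoding at all -- fine for ASCII needles
    assert(path.byCodeUnit.canFind('.'));

    // Code-point level: decoding still happens, but now by explicit request
    assert(path.byDchar.canFind('é'));
}
```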
September 28, 2014
On 9/28/2014 3:14 AM, bearophile wrote:
> I get refusals if I propose tiny breaking changes that require changes in a
> small amount of user code.  In comparison the user code changes you are
> suggesting are very large.

I'm painfully aware of what a large change removing autodecoding is. That means it'll take a long time to do it. In the meantime, we can stop adding new code to Phobos that does autodecoding. We have taken the first step by adding the .byDchar and .byCodeUnit adapters.

September 28, 2014
On 9/28/2014 5:09 AM, Andrei Alexandrescu wrote:
> Stuff that's missing:
>
> * Reasonable effort to improve performance of auto-decoding;
>
> * A study of the matter revealing either new artifacts and idioms, or the
> insufficiency of such;
>
> * An assessment of the impact on compilability of existing code
>
> * An assessment of the impact on correctness of existing code (that compiles and
> runs in both cases)
>
> * An assessment of the improvement in speed of eliminating auto-decoding
>
> I think there's a very strong need for this stuff, because claims that current
> alternatives to selectively avoid auto-decoding use the throwing of hands (and
> occasional chairs out windows) without any real investigation into how library
> artifacts may help. This approach to justifying risky moves is frustratingly
> unprincipled.

I know I have to go a ways further to convince you :-) This is definitely a longer term issue, not a stop-the-world-we-must-fix-it-now thing.


> Also I submit that diverting into this is a huge distraction at probably the
> worst moment in the history of the D programming language.

I don't plan to work on this particular issue for the time being, but do want to stop adding more autodecoding functions like the proposed std.path.withExtension().


> C++ and GC. C++ and GC...

Currently, the autodecoding functions allocate with the GC and throw as well. (They'll GC allocate an exception and throw it if they encounter an invalid UTF sequence. The adapters use the more common method of inserting a substitution character and continuing on.) This makes it harder to make GC-free Phobos code.
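A sketch of that difference (assuming std.utf.byDchar substitutes U+FFFD on bad input, as the adapters are documented to do, and using \xFF to forge an invalid UTF-8 sequence):

```d
import std.algorithm : canFind;
import std.utf : byDchar, decode, replacementDchar, UTFException;

void main()
{
    // 0xFF can never appear in well-formed UTF-8
    string bad = "a\xFFb";

    // The autodecoding primitive GC-allocates an exception and throws
    bool threw = false;
    try
    {
        size_t i = 0;
        while (i < bad.length)
            decode(bad, i);
    }
    catch (UTFException)
    {
        threw = true;
    }
    assert(threw);

    // The adapter substitutes U+FFFD and keeps going -- no throw, no allocation
    assert(bad.byDchar.canFind(replacementDchar));
}
```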

September 28, 2014
Walter Bright:

> I'm painfully aware of what a large change removing autodecoding is. That means it'll take a long time to do it. In the meantime, we can stop adding new code to Phobos that does autodecoding. We have taken the first step by adding the .byDchar and .byCodeUnit adapters.

We have .representation and .assumeUTF; I am using them to avoid most autodecoding problems. Have you tried using them in your D code?

The changes you propose seem able to break almost every D program I have written (most or all code that uses strings with Phobos ranges/algorithms, and I use them everywhere). Compared to this change, disallowing the comma operator to implement nice built-in tuples would cause nearly no breakage in my code (I have done a small analysis of the damage that disallowing the comma operator would cause in my code). It sounds like a change fit for a D3 language, even more than the introduction of reference counting. I think this change will cause some people to permanently stop using D.

In the end you are the designer and the benevolent dictator of D, and I am not qualified to refuse or oppose such changes. But before making this change I suggest studying how many changes it causes in an average small D program that uses strings and ranges/algorithms.

Bye,
bearophile
September 28, 2014
Walter Bright:

> but do want to stop adding more autodecoding functions like the proposed std.path.withExtension().

I am not sure that can work. Perhaps you need to create range2 and algorithm2 modules, and keep adding some autodecoding functions to the old modules.

Bye,
bearophile
September 28, 2014
On 9/28/2014 11:39 AM, bearophile wrote:
> Walter Bright:
>
>> I'm painfully aware of what a large change removing autodecoding is. That
>> means it'll take a long time to do it. In the meantime, we can stop adding new
>> code to Phobos that does autodecoding. We have taken the first step by adding
>> the .byDchar and .byCodeUnit adapters.
>
> We have .representation and .assumeUTF; I am using them to avoid most
> autodecoding problems. Have you tried using them in your D code?

Yes. They don't work. Well, technically they do "work", but your code gets filled with explicit casts, which is awful.

The problem is the "representation" of char[] is type char, not type ubyte.
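For reference, a minimal sketch of the .representation/.assumeUTF route under discussion; the round trip through ubyte[] is exactly the conversion noise being objected to:

```d
import std.algorithm : find;
import std.string : assumeUTF, representation;

void main()
{
    string s = "hello.d";

    // .representation reinterprets the string as immutable(ubyte)[],
    // so no algorithm will autodecode it...
    immutable(ubyte)[] bytes = s.representation;
    auto tail = bytes.find('.');

    // ...but to get a string back out, you must assumeUTF (or cast)
    string ext = tail.assumeUTF;
    assert(ext == ".d");
}
```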


> The changes you propose seem able to break almost every D program I have written
> (most or all code that uses strings with Phobos ranges/algorithms, and I use
> them everywhere). Compared to this change, disallowing comma operator to
> implement nice built-in tuples will cause nearly no breakage in my code (I have
> done a small analysis of the damages caused by disallowing the tuple operator in
> my code). It sounds like a change fit for a D3 language, even more than the
> introduction of reference counting. I think this change will cause some people
> to permanently stop using D.

It's quite possible we will be unable to make this change. But the question that started all this was: what would I change if breaking code were allowed?

I suggest that, in the future, you write code that is explicit about its intent - code units or decoded characters - by using the .byChar or .byDchar adapters.

September 28, 2014
> I totally agree with all of that.
>
> It's one of those cases where correct by default is far too slow (that would have to be graphemes) but fast by default is far too broken. Better to force an explicit choice.
>
> There is no magic bullet for unicode in a systems language such as D. The programmer must be aware of it and make choices about how to treat it.

I didn't know about the difference between byCodeUnit and
byGrapheme, because I speak Russian, which is like English in
that it doesn't have diacritics. As far as I remember, German,
which I learned at school, does have diacritics. So you have
opened my eyes on this question. My position as an ordinary
programmer is that I speak a language whose graphemes are
encoded in 2 bytes, so I always need to do the decoding,
otherwise my program will be broken. The other possibility is
to use wstring or dstring, but those are less memory efficient.
Also, UTF-8 is more commonly used on the Internet, so I don't
want to do conversions to UTF-32, for example.

Where can I read about byGrapheme? Isn't this approach
overcomplicated? I don't want to have to write something as
long as "War and Peace" just to write a parser for a simple
DSL.
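byGrapheme is documented in the std.uni module. For a Russian-language sketch (each Cyrillic letter here is two UTF-8 code units, but a single code point and a single grapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "Привет";

    assert(s.length == 12);                // UTF-8 code units
    assert(s.walkLength == 6);             // code points
    assert(s.byGrapheme.walkLength == 6);  // graphemes
}
```

For text like this, decoding to code points already yields one unit per letter, so byDchar would suffice; byGrapheme only starts to matter once combining marks appear.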