September 28, 2014
On Saturday, 27 September 2014 at 22:11:39 UTC, H. S. Teoh via Digitalmars-d wrote:
> I vaguely recall somebody mentioning a while back that range-based code
> is poorly optimized because compilers weren't designed to recognize
> such patterns before. I wonder if there are ways for the compiler to
> recognize range primitives and apply special optimizations to them.
>
> I do find, though, that gdc -O3 generally tends to do a pretty good job
> of reducing range-based code to near-minimal assembly. Sadly, dmd is
> changing too fast for gdc releases to catch up with the latest and
> greatest, so I haven't been using gdc very much recently. :-(
>

That was me, specifically for LLVM (I don't know much about GCC's innards). Hopefully, this is being worked on (as it also impacts C++'s stdlib).
September 28, 2014
On Saturday, 27 September 2014 at 23:33:14 UTC, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Sep 27, 2014 at 11:00:16PM +0000, bearophile via Digitalmars-d wrote:
>> H. S. Teoh:
>> 
>> >If we can get Andrei on board, I'm all for killing off autodecoding.
>> 
>> Killing auto-decoding for std.algorithm functions will break most of
>> my D2 code... perhaps we can do that in a D3 language.
> [...]
>
> Well, obviously it's not going to be done in a careless, drastic way!
>
> There will be a proper migration path and deprecation cycle. We already
> have byCodeUnit and byCodePoint, and the first step is probably to
> migrate towards requiring usage of one or the other for iterating over
> strings, and only once all code is using them, we will get rid of
> autodecoding (the job now being done by byCodePoint). Then, the final
> step would be to allow the direct use of strings in iteration constructs
> again, but this time without autodecoding by default. Of course,
> .byCodePoint will still be available for code that needs to use it.

The final step would almost inevitably lead to Unicode incorrectness, which was the reason why autodecoding was introduced in the first place. Just require byCodePoint/byCodeUnit, always. It might be a bit inconvenient, but that's a consequence of the fact that we're dealing with Unicode strings.
September 28, 2014
H. S. Teoh:

> There will be a proper migration path and deprecation cycle.

I get refusals when I propose tiny breaking changes that require only a small amount of user code to change. The user code changes you are suggesting are, by comparison, very large.

Bye,
bearophile
September 28, 2014
On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
> On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>> If we can get Andrei on board, I'm all for killing off autodecoding.
>
> That's rather vague; it's unclear what would replace it. -- Andrei

No autodecoding ;-)

Specifically:

1. ref T front(T[] r) always returns r[0]
2. popFront(ref T[] r) always does { ++r.ptr; --r.length; }
3. Narrow strings will be hasLength, hasSlicing, and isRandomAccessRange (i.e. they are just like any other array).

Also:

4. Disallow implicit conversions, comparisons, or any operation among char, wchar, dchar. This makes things like "foo".find('π') compile-time errors (or better, errors until we specialize it to do "foo".find("π"), as it should)
5. Provide byCodePoint for narrow strings (although I suspect this will be rarely used).
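Points 1 and 2 amount to treating char[] like any other array. A minimal sketch of what those primitives would look like (the template names mirror the proposal; this is illustrative, not Phobos code):

```d
// A sketch of points 1 and 2: front/popFront treating char[] like any
// other T[] -- one code unit at a time, no decoding.
ref inout(T) front(T)(inout(T)[] r) { return r[0]; }
void popFront(T)(ref T[] r) { r = r[1 .. $]; }

void main()
{
    import std.stdio : writeln;
    auto s = "πr²".dup;       // 5 UTF-8 code units
    size_t n;
    while (s.length)
    {
        auto c = front(s);    // a single char, never a decoded dchar
        popFront(s);
        ++n;
    }
    writeln(n);               // 5
}
```

Under this scheme the loop visits five code units; today's autodecoding front would instead yield three dchars.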

The argument is as follows:
* First, this is a hell of a lot simpler for the implementation.
* People rarely ever search for single, non-ASCII characters in strings, and #4 makes it an error if they do (until we specialize to make it work).
* Searching, comparison, joining, and splitting functions will be fast and correct by default.

One possible counter-argument is that this makes it easier to corrupt strings (since you could, e.g., insert a substring into the middle of a multi-byte code point). To that I say it's unlikely. When inserting into a string, you're either doing it at the front or back (which is safe), or at some point you've found by other means (e.g. using find). I can't imagine a scenario where you would find a position in a string that falls in the middle of a code point.

Of course, I'd probably say this change isn't practical right now, but this is how I'd do things if I were to start over.
September 28, 2014
Am Sun, 28 Sep 2014 10:04:21 +0000
schrieb "Marc Schütz" <schuetzm@gmx.net>:

> On Saturday, 27 September 2014 at 23:33:14 UTC, H. S. Teoh via Digitalmars-d wrote:
> > On Sat, Sep 27, 2014 at 11:00:16PM +0000, bearophile via Digitalmars-d wrote:
> >> H. S. Teoh:
> >> 
> >> >If we can get Andrei on board, I'm all for killing off autodecoding.
> >> 
> >> Killing auto-decoding for std.algorithm functions will break most of
> >> my D2 code... perhaps we can do that in a D3 language.
> > [...]
> >
> > Well, obviously it's not going to be done in a careless, drastic way!
> >
> > There will be a proper migration path and deprecation cycle. We already
> > have byCodeUnit and byCodePoint, and the first step is probably to
> > migrate towards requiring usage of one or the other for iterating over
> > strings, and only once all code is using them, we will get rid of
> > autodecoding (the job now being done by byCodePoint). Then, the final
> > step would be to allow the direct use of strings in iteration constructs
> > again, but this time without autodecoding by default. Of course,
> > .byCodePoint will still be available for code that needs to use it.
> 
> The final step would almost inevitably lead to Unicode incorrectness, which was the reason why autodecoding was introduced in the first place. Just require byCodePoint/byCodeUnit, always. It might be a bit inconvenient, but that's a consequence of the fact that we're dealing with Unicode strings.

And I would go so far as to say that you have to make an informed decision between code units, code points and graphemes. They are all useful. Graphemes are the most generally useful, hiding away normalization and allowing cutting by "user-perceived character".
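To make the three-way distinction concrete, here is a small sketch using the Phobos ranges mentioned in this thread (byCodeUnit from std.utf, byGrapheme from std.uni; the plain string view shows today's autodecoded code points):

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by a combining acute accent: one user-perceived
    // character, two code points, three UTF-8 code units.
    string s = "e\u0301";
    writeln(s.byCodeUnit.walkLength);  // 3 -- code units
    writeln(s.walkLength);             // 2 -- code points (autodecoding)
    writeln(s.byGrapheme.walkLength);  // 1 -- grapheme
}
```

The same string yields three different "lengths" depending on the level of abstraction chosen, which is exactly why the choice has to be an informed one.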

-- 
Marco

September 28, 2014
On 28-Sep-2014 00:57, Walter Bright wrote:
>  From time to time, I take a break from bugs and enhancements and just
> look at what some piece of code is actually doing. Sometimes, I'm
> appalled. Phobos, for example, should be a lean and mean fighting machine:
>
>
> http://www.nbcnews.com/id/38545625/ns/technology_and_science-science/t/king-tuts-chariots-were-formula-one-cars/#.VCceNmd0xjs
>
>
> Instead, we have something more akin to:
>
>
> http://untappedcities.com/2012/10/31/roulez-carrosses-carriages-of-versailles-arrive-in-arras/
>
>
> More specifically, I looked at std.file.copy():
>
>    https://github.com/D-Programming-Language/phobos/blob/master/std/file.d
>
> Which is 3 lines of code:
>
>    void copy(in char[] from, in char[] to) {
>          immutable result = CopyFileW(from.tempCStringW(),
> to.tempCStringW(), false);
>          if (!result)
>              throw new FileException(to.idup);
>    }
>
> Compiling this code for Windows produces the rather awful:

In all honesty, 2 RAII structs w/o inlining + setting up an exception frame + creating and allocating an exception + idup-ing a string does account for about this much.


>
> _D3std4file4copyFxAaxAaZv       comdat
>          assume  CS:_D3std4file4copyFxAaxAaZv
> L0:             push    EBP
>                  mov     EBP,ESP
>                  mov     EDX,FS:__except_list
>                  push    0FFFFFFFFh
>                  lea     EAX,-0220h[EBP]
>                  push    offset _D3std4file4copyFxAaxAaZv[0106h]
>                  push    EDX
>                  mov     FS:__except_list,ESP
>                  sub     ESP,8
>                  sub     ESP,041Ch
>                  push    0
>                  push    dword ptr 0Ch[EBP]
>                  push    dword ptr 8[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCSÇàÆTuTaZÇìÆFNbNixAaZSÇ┬├3Res
>                  mov     dword ptr -4[EBP],0
>                  lea     EAX,-0220h[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res3ptrMxFNaNbNdNiNfZPxu
>
>                  push    EAX
>                  lea     EAX,-0430h[EBP]
>                  push    dword ptr 014h[EBP]
>                  push    dword ptr 010h[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCSÇàÆTuTaZÇìÆFNbNixAaZSÇ┬├3Res
>                  mov     dword ptr -4[EBP],1
>                  lea     EAX,-0430h[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res3ptrMxFNaNbNdNiNfZPxu
>
>                  push    EAX
>                  call    dword ptr __imp__CopyFileW@12
>                  mov     -01Ch[EBP],EAX
>                  mov     dword ptr -4[EBP],0
>                  call    near ptr L83
>                  jmp short       L8F
> L83:            lea     EAX,-0220h[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res6__dtorMFNbNiZv
>
>                  ret
> L8F:            mov     dword ptr -4[EBP],0FFFFFFFFh
>                  call    near ptr L9D
>                  jmp short       LA9
> L9D:            lea     EAX,-0430h[EBP]
>                  call    near ptr
> _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res6__dtorMFNbNiZv
>
>                  ret
> LA9:            cmp     dword ptr -01Ch[EBP],0
>                  jne     LF3
>                  mov     ECX,offset
> FLAT:_D3std4file13FileException7__ClassZ
>                  push    ECX
>                  call    near ptr __d_newclass
>                  add     ESP,4
>                  push    dword ptr 0Ch[EBP]
>                  mov     -018h[EBP],EAX
>                  push    dword ptr 8[EBP]
>                  call    near ptr
> _D6object12__T4idupTxaZ4idupFNaNbNdNfAxaZAya
>                  push    EDX
>                  push    EAX
>                  call    dword ptr __imp__GetLastError@0
>                  push    EAX
>                  push    dword ptr _D3std4file13FileException6__vtblZ[02Ch]
>                  push    dword ptr _D3std4file13FileException6__vtblZ[028h]
>                  push    095Dh
>                  mov     EAX,-018h[EBP]
>                  call    near ptr
> _D3std4file13FileException6__ctorMFNfxAakAyakZC3std4file13FileException
>                  push    EAX
>                  call    near ptr __d_throwc
> LF3:            mov     ECX,-0Ch[EBP]
>                  mov     FS:__except_list,ECX
>                  mov     ESP,EBP
>                  pop     EBP
>                  ret     010h
>                  mov     EAX,offset
> FLAT:_D3std4file13FileException6__vtblZ[0310h]
>                  jmp     near ptr __d_framehandler
>
> which is TWICE as much generated code as for D1's copy(), which does the
> same thing. No, it is not because D2's compiler sux. It's because it has
> become encrustified with gee-gaws, jewels, decorations, and other crap.
>
> To scrape the barnacles off, I've filed:
>
> https://issues.dlang.org/show_bug.cgi?id=13541
> https://issues.dlang.org/show_bug.cgi?id=13542
> https://issues.dlang.org/show_bug.cgi?id=13543
> https://issues.dlang.org/show_bug.cgi?id=13544
>
> I'm sure there's much more in std.file (and elsewhere) that can be done.
> Guys, when developing Phobos/Druntime code, please look at the assembler
> once in a while and see what is being wrought. You may be appalled, too.
>
>
>


-- 
Dmitry Olshansky
September 28, 2014
On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
> On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
>> If we can get Andrei on board, I'm all for killing off autodecoding.
>
> That's rather vague; it's unclear what would replace it. -- Andrei

I believe that removing autodecoding will make things even worse. As far as I understand, if we remove it from the front() function that operates on narrow strings, then front() will return just a single byte of a char. I believe that processing narrow strings by `user-perceived characters` (graphemes) is the more common use case. Operating on the single bytes of a multibyte character is an uncommon task, and you can do that via direct indexing of the char[] array. I believe the number of bytes in a *user-perceived character* is an internal detail of the UTF-8 encoding, and it should not be a concern in common tasks such as parsing, searching, replacing text, etc. If you need the byte representation of a string, you should cast it to ubyte[] and work with that using the same range functions, without autodecoding.

The main problem I see is that a programmer inexperienced in D can be confused about whether he is operating on bytes or on graphemes. This could especially happen when migrating from C# or Python, where a string is not treated as an array of its bytes. A *char* in D is not a character; it is only part of a character, not an entire one. That is the main inconsistency.

A possible solution is to include a class or struct implementation of a string type that hides the internal representation of narrow strings from those users who don't need to operate on the single bytes of UTF-8 characters. I believe it's the best way to kill all the rabbits)) We could provide this String class with a method returning ubyte[] (the better way) or char[] that exposes the internal representation for those who need it.
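A minimal sketch of such a wrapper might look like this (String, byChar and bytes are invented names for illustration, not Phobos APIs):

```d
import std.uni : byGrapheme;

// Hypothetical wrapper that hides the UTF-8 representation and
// iterates by user-perceived character by default.
struct String
{
    private immutable(char)[] data;

    // Iterate by grapheme ("user-perceived character").
    auto byChar() { return data.byGrapheme; }

    // Escape hatch exposing the raw encoding for those who need it.
    immutable(ubyte)[] bytes() { return cast(immutable(ubyte)[]) data; }
}

void main()
{
    import std.range : walkLength;
    import std.stdio : writeln;

    auto s = String("über");
    writeln(s.byChar.walkLength);  // 4 -- graphemes
    writeln(s.bytes.length);       // 5 -- bytes ('ü' is 2 in UTF-8)
}
```

The default view is the "safe" one; byte-level access requires an explicit call, so the programmer cannot stumble into it by accident.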

A question: can you list some languages that represent UTF-8 narrow strings as arrays of single bytes?
September 28, 2014
On 9/27/14, 4:31 PM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Sep 27, 2014 at 11:00:16PM +0000, bearophile via Digitalmars-d wrote:
>> H. S. Teoh:
>>
>>> If we can get Andrei on board, I'm all for killing off autodecoding.
>>
>> Killing auto-decoding for std.algorithm functions will break most of
>> my D2 code... perhaps we can do that in a D3 language.
> [...]
>
> Well, obviously it's not going to be done in a careless, drastic way!

Stuff that's missing:

* Reasonable effort to improve performance of auto-decoding;

* A study of the matter revealing either new artifacts and idioms, or the insufficiency of such;

* An assessment of the impact on compilability of existing code

* An assessment of the impact on correctness of existing code (that compiles and runs in both cases)

* An assessment of the improvement in speed of eliminating auto-decoding

I think there's a very strong need for this stuff, because the claims that the current alternatives for selectively avoiding auto-decoding are insufficient rest on throwing up of hands (and the occasional chair out a window), without any real investigation into how library artifacts may help. This approach to justifying risky moves is frustratingly unprincipled.

Also I submit that diverting into this is a huge distraction at probably the worst moment in the history of the D programming language.

C++ and GC. C++ and GC...



Andrei

September 28, 2014
On 9/28/14, 3:04 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
> The final step would almost inevitably lead to Unicode incorrectness,
> which was the reason why autodecoding was introduced in the first place.
> Just require byCodePoint/byCodeUnit, always. It might be a bit
> inconvenient, but that's a consequence of the fact that we're dealing
> with Unicode strings.

Also let's not forget how well it's worked for C++ to conflate arrays of char with Unicode strings. -- Andrei

September 28, 2014
On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via Digitalmars-d wrote:
> On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote:
> >On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote:
> >>If we can get Andrei on board, I'm all for killing off autodecoding.
> >
> >That's rather vague; it's unclear what would replace it. -- Andrei
> 
> I believe that removing autodecoding will make things even worse. As far as I understand, if we remove it from the front() function that operates on narrow strings, then front() will return just a single byte of a char. I believe that processing narrow strings by `user-perceived characters` (graphemes) is the more common use case.
[...]

Unfortunately this is not what autodecoding does today. Today's autodecoding only segments strings into code *points*, which are not the same thing as graphemes. For example, combining diacritics are normally not considered separate characters from the user's POV, but they *are* separate codepoints from their base character. The only reason today's autodecoding is even remotely considered "correct" from an intuitive POV is because most Western character sets happen to use only precomposed characters rather than combining diacritic sequences. If you were processing, say, Korean text, the present autodecoding .front would *not* give you what you might imagine is a "single character"; it would only be halves of Korean graphemes. Which, from a user's POV, would suffer from the same issues as dealing with individual bytes in a UTF-8 stream -- any mistake on the program's part in handling these half-units will cause "corruption" of the text (not corruption in the same sense as an improperly segmented UTF-8 byte stream, but in the sense that the wrong glyphs will be displayed on the screen -- from the user's POV these two are basically the same thing).

You might then be tempted to say, well let's make .front return graphemes instead. That will solve the "single intuitive character" issue, but the performance will be FAR worse than what it is today.

So basically, what we have today is neither efficient nor complete, but
a halfway solution that mostly works for Western character sets but
is incomplete for others. We're paying efficiency for only a partial
benefit. Is it worth the cost?

I think the correct solution is not for Phobos to decide for the application at what level of abstraction a string ought to be processed. Rather, let the user decide. If they're just dealing with opaque blocks of text, decoding or segmenting by grapheme is completely unnecessary -- they should just operate on byte ranges as opaque data. They should use byCodeUnit. If they need to work with Unicode codepoints, let them use byCodePoint. If they need to work with individual user-perceived characters (i.e., graphemes), let them use byGrapheme.
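The Korean example above can be made concrete: a decomposed Hangul syllable is three conjoining jamo code points that render as one user-perceived character, so today's code-point front hands the program "half-graphemes". A sketch, assuming std.uni's UAX #29 grapheme segmentation:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // Decomposed Hangul syllable "gak": choseong kiyeok + jungseong a
    // + jongseong kiyeok -- three jamo, one user-perceived character.
    string s = "\u1100\u1161\u11A8";
    writeln(s.byCodeUnit.walkLength);  // 9 -- code units
    writeln(s.walkLength);             // 3 -- code points (today's front)
    writeln(s.byGrapheme.walkLength);  // 1 -- grapheme
}
```

Today's autodecoding sits at the middle level: too slow to be free, yet still not the "single intuitive character" level, which is the crux of the argument for letting the caller pick the abstraction explicitly.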

This is why I proposed the deprecation path of making it illegal to pass raw strings to Phobos algorithms -- the caller should specify what level of abstraction they want to work with -- byCodeUnit, byCodePoint, or byGrapheme. The standard library's job is to empower the D programmer by giving him the choice, not to shove a predetermined solution down his throat.


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.