September 27, 2014 Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
From time to time, I take a break from bugs and enhancements and just look at what some piece of code is actually doing. Sometimes, I'm appalled. Phobos, for example, should be a lean and mean fighting machine: http://www.nbcnews.com/id/38545625/ns/technology_and_science-science/t/king-tuts-chariots-were-formula-one-cars/#.VCceNmd0xjs Instead, we have something more akin to: http://untappedcities.com/2012/10/31/roulez-carrosses-carriages-of-versailles-arrive-in-arras/ More specifically, I looked at std.file.copy(): https://github.com/D-Programming-Language/phobos/blob/master/std/file.d Which is 3 lines of code: void copy(in char[] from, in char[] to) { immutable result = CopyFileW(from.tempCStringW(), to.tempCStringW(), false); if (!result) throw new FileException(to.idup); } Compiling this code for Windows produces the rather awful: _D3std4file4copyFxAaxAaZv comdat assume CS:_D3std4file4copyFxAaxAaZv L0: push EBP mov EBP,ESP mov EDX,FS:__except_list push 0FFFFFFFFh lea EAX,-0220h[EBP] push offset _D3std4file4copyFxAaxAaZv[0106h] push EDX mov FS:__except_list,ESP sub ESP,8 sub ESP,041Ch push 0 push dword ptr 0Ch[EBP] push dword ptr 8[EBP] call near ptr _D3std8internal7cstring21__T11tempCSÇàÆTuTaZÇìÆFNbNixAaZSÇ┬├3Res mov dword ptr -4[EBP],0 lea EAX,-0220h[EBP] call near ptr _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res3ptrMxFNaNbNdNiNfZPxu push EAX lea EAX,-0430h[EBP] push dword ptr 014h[EBP] push dword ptr 010h[EBP] call near ptr _D3std8internal7cstring21__T11tempCSÇàÆTuTaZÇìÆFNbNixAaZSÇ┬├3Res mov dword ptr -4[EBP],1 lea EAX,-0430h[EBP] call near ptr _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res3ptrMxFNaNbNdNiNfZPxu push EAX call dword ptr __imp__CopyFileW@12 mov -01Ch[EBP],EAX mov dword ptr -4[EBP],0 call near ptr L83 jmp short L8F L83: lea EAX,-0220h[EBP] call near ptr _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res6__dtorMFNbNiZv ret L8F: mov dword ptr -4[EBP],0FFFFFFFFh call near ptr L9D jmp short LA9 L9D: lea EAX,-0430h[EBP] call near ptr _D3std8internal7cstring21__T11tempCStringTuTaZ11tempCStringFNbNixAaZ3Res6__dtorMFNbNiZv ret LA9: cmp dword ptr -01Ch[EBP],0 jne LF3 mov ECX,offset FLAT:_D3std4file13FileException7__ClassZ push ECX call near ptr __d_newclass add ESP,4 push dword ptr 0Ch[EBP] mov -018h[EBP],EAX push dword ptr 8[EBP] call near ptr _D6object12__T4idupTxaZ4idupFNaNbNdNfAxaZAya push EDX push EAX call dword ptr __imp__GetLastError@0 push EAX push dword ptr _D3std4file13FileException6__vtblZ[02Ch] push dword ptr _D3std4file13FileException6__vtblZ[028h] push 095Dh mov EAX,-018h[EBP] call near ptr _D3std4file13FileException6__ctorMFNfxAakAyakZC3std4file13FileException push EAX call near ptr __d_throwc LF3: mov ECX,-0Ch[EBP] mov FS:__except_list,ECX mov ESP,EBP pop EBP ret 010h mov EAX,offset FLAT:_D3std4file13FileException6__vtblZ[0310h] jmp near ptr __d_framehandler which is TWICE as much generated code as for D1's copy(), which does the same thing. No, it is not because D2's compiler sux. It's because it has become encrustified with gee-gaws, jewels, decorations, and other crap. To scrape the barnacles off, I've filed: https://issues.dlang.org/show_bug.cgi?id=13541 https://issues.dlang.org/show_bug.cgi?id=13542 https://issues.dlang.org/show_bug.cgi?id=13543 https://issues.dlang.org/show_bug.cgi?id=13544 I'm sure there's much more in std.file (and elsewhere) that can be done. Guys, when developing Phobos/Druntime code, please look at the assembler once in a while and see what is being wrought. You may be appalled, too. |
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Saturday, 27 September 2014 at 20:57:53 UTC, Walter Bright wrote: > From time to time, I take a break from bugs and enhancements and just look at what some piece of code is actually doing. Sometimes, I'm appalled. Me too, and yes it can be appalling. It's pretty bad for even simple range chains, e.g. import std.algorithm, std.stdio; int main(string[] args) { return cast(int)args.map!("a.length").reduce!"a+b"(); } Here's what LDC produces (with -O -inline -release -noboundscheck) __Dmain: 0000000100001480 pushq %r15 0000000100001482 pushq %r14 0000000100001484 pushq %rbx 0000000100001485 movq %rsi, %rbx 0000000100001488 movq %rdi, %r14 000000010000148b callq 0x10006df10 ## symbol stub for: __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb 0000000100001490 xorb $0x1, %al 0000000100001492 movzbl %al, %r9d 0000000100001496 leaq _.str12(%rip), %rdx ## literal pool for: "/Users/pja/ldc2-0.14.0-osx-x86_64/bin/../import/std/algorithm.d" 000000010000149d movq 0xcbd2c(%rip), %r8 ## literal pool symbol address: __D3std9algorithm24__T6reduceVAyaa3_612b62Z124__T6reduceTS3std9algorithm85__T9MapResultS633std10functional36__T8unaryFunVAyaa8_612e6c656e677468Z8unaryFunTAAyaZ9MapResultZ6reduceFNaNfS3std9algorithm85__T 00000001000014a4 movl $0x2dd, %edi 00000001000014a9 movl $0x3f, %esi 00000001000014ae xorl %ecx, %ecx 00000001000014b0 callq 0x10006e0a2 ## symbol stub for: __D3std9exception14__T7enforceTbZ7enforceFNaNfbLAxaAyamZb 00000001000014b5 movq (%rbx), %r15 00000001000014b8 leaq 0x10(%rbx), %rsi 00000001000014bc leaq -0x1(%r14), %rdi 00000001000014c0 callq 0x10006df10 ## symbol stub for: __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb 00000001000014c5 testb $0x1, %al 00000001000014c7 jne 0x1000014fa 00000001000014c9 addq $-0x2, %r14 00000001000014cd addq $0x20, %rbx 00000001000014d1 nopw %cs:(%rax,%rax) 00000001000014e0 addq -0x10(%rbx), %r15 00000001000014e4 movq %r14, %rdi 00000001000014e7 movq %rbx, %rsi 00000001000014ea callq 0x10006df10 ## symbol stub for: __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb 00000001000014ef decq %r14 00000001000014f2 addq $0x10, %rbx 00000001000014f6 testb $0x1, %al 00000001000014f8 je 0x1000014e0 00000001000014fa movl %r15d, %eax 00000001000014fd popq %rbx 00000001000014fe popq %r14 0000000100001500 popq %r15 0000000100001502 ret and for: import std.algorithm, std.stdio; int main(string[] args) { int r = 0; foreach (i; 0..args.length) r += args[i].length; return r; } __Dmain: 00000001000015c0 xorl %eax, %eax 00000001000015c2 testq %rdi, %rdi 00000001000015c5 je 0x1000015de 00000001000015c7 nopw (%rax,%rax) 00000001000015d0 movl %eax, %eax 00000001000015d2 addq (%rsi), %rax 00000001000015d5 addq $0x10, %rsi 00000001000015d9 decq %rdi 00000001000015dc jne 0x1000015d0 00000001000015de ret (and sorry, don't even bother looking at what dmd does...) I'm not complaining about LDC here (although I'm surprised array.empty isn't inlined). The way ranges are formulated make them difficult to optimize. I think there's things we can do here in the library. Maybe I'll write up something about that at some point. I think the takeaway here is that people should be aware of (a) what kind of instructions their code is generating, (b) what kind of instructions their code SHOULD be generating, and (c) what is practically possible for present-day compilers. Like you say, it helps to look at the assembled code once in a while to get a feel for this kind of thing. Modern compilers are good, but they aren't magic. |
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Peter Alexander | On Sat, Sep 27, 2014 at 09:59:17PM +0000, Peter Alexander via Digitalmars-d wrote: > On Saturday, 27 September 2014 at 20:57:53 UTC, Walter Bright wrote: > >From time to time, I take a break from bugs and enhancements and just look at what some piece of code is actually doing. Sometimes, I'm appalled. > > Me too, and yes it can be appalling. It's pretty bad for even simple range chains, e.g. > > import std.algorithm, std.stdio; > int main(string[] args) { > return cast(int)args.map!("a.length").reduce!"a+b"(); > } I vaguely recall somebody mentioning a while back that range-based code is poorly optimized because compilers weren't designed to recognize such patterns before. I wonder if there are ways for the compiler to recognize range primitives and apply special optimizations to them. I do find, though, that gdc -O3 generally tends to do a pretty good job of reducing range-based code to near-minimal assembly. Sadly, dmd is changing too fast for gdc releases to catch up with the latest and greatest, so I haven't been using gdc very much recently. :-( T -- If Java had true garbage collection, most programs would delete themselves upon execution. -- Robert Sewell |
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Peter Alexander | What we're seeing here is pretty much the same problem that early c++ suffered from: abstraction penalty. It took years of work to help overcome it, both from the compiler and the library. Not having trivial functions inlined and optimized down through standard techniques like dead store elimination, value range propagation, various loop restructurings, etc means that code will look like what Walter and you have shown. Given DMD's relatively weak inliner, I'm not shocked by Walter's example. I am curious why ldc failed to inline those functions.
On 9/27/2014 2:59 PM, Peter Alexander via Digitalmars-d wrote:
> On Saturday, 27 September 2014 at 20:57:53 UTC, Walter Bright wrote:
>> From time to time, I take a break from bugs and enhancements and just
>> look at what some piece of code is actually doing. Sometimes, I'm
>> appalled.
>
> Me too, and yes it can be appalling. It's pretty bad for even simple
> range chains, e.g.
>
> import std.algorithm, std.stdio;
> int main(string[] args) {
> return cast(int)args.map!("a.length").reduce!"a+b"();
> }
>
> Here's what LDC produces (with -O -inline -release -noboundscheck)
>
> __Dmain:
> 0000000100001480 pushq %r15
> 0000000100001482 pushq %r14
> 0000000100001484 pushq %rbx
> 0000000100001485 movq %rsi, %rbx
> 0000000100001488 movq %rdi, %r14
> 000000010000148b callq 0x10006df10 ## symbol stub for:
> __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb
> 0000000100001490 xorb $0x1, %al
> 0000000100001492 movzbl %al, %r9d
> 0000000100001496 leaq _.str12(%rip), %rdx ## literal pool for:
> "/Users/pja/ldc2-0.14.0-osx-x86_64/bin/../import/std/algorithm.d"
> 000000010000149d movq 0xcbd2c(%rip), %r8 ## literal pool symbol
> address:
> __D3std9algorithm24__T6reduceVAyaa3_612b62Z124__T6reduceTS3std9algorithm85__T9MapResultS633std10functional36__T8unaryFunVAyaa8_612e6c656e677468Z8unaryFunTAAyaZ9MapResultZ6reduceFNaNfS3std9algorithm85__T
>
> 00000001000014a4 movl $0x2dd, %edi
> 00000001000014a9 movl $0x3f, %esi
> 00000001000014ae xorl %ecx, %ecx
> 00000001000014b0 callq 0x10006e0a2 ## symbol stub for:
> __D3std9exception14__T7enforceTbZ7enforceFNaNfbLAxaAyamZb
> 00000001000014b5 movq (%rbx), %r15
> 00000001000014b8 leaq 0x10(%rbx), %rsi
> 00000001000014bc leaq -0x1(%r14), %rdi
> 00000001000014c0 callq 0x10006df10 ## symbol stub for:
> __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb
> 00000001000014c5 testb $0x1, %al
> 00000001000014c7 jne 0x1000014fa
> 00000001000014c9 addq $-0x2, %r14
> 00000001000014cd addq $0x20, %rbx
> 00000001000014d1 nopw %cs:(%rax,%rax)
> 00000001000014e0 addq -0x10(%rbx), %r15
> 00000001000014e4 movq %r14, %rdi
> 00000001000014e7 movq %rbx, %rsi
> 00000001000014ea callq 0x10006df10 ## symbol stub for:
> __D3std5array14__T5emptyTAyaZ5emptyFNaNbNdNfxAAyaZb
> 00000001000014ef decq %r14
> 00000001000014f2 addq $0x10, %rbx
> 00000001000014f6 testb $0x1, %al
> 00000001000014f8 je 0x1000014e0
> 00000001000014fa movl %r15d, %eax
> 00000001000014fd popq %rbx
> 00000001000014fe popq %r14
> 0000000100001500 popq %r15
> 0000000100001502 ret
>
> and for:
>
> import std.algorithm, std.stdio;
> int main(string[] args) {
> int r = 0;
> foreach (i; 0..args.length)
> r += args[i].length;
> return r;
> }
>
> __Dmain:
> 00000001000015c0 xorl %eax, %eax
> 00000001000015c2 testq %rdi, %rdi
> 00000001000015c5 je 0x1000015de
> 00000001000015c7 nopw (%rax,%rax)
> 00000001000015d0 movl %eax, %eax
> 00000001000015d2 addq (%rsi), %rax
> 00000001000015d5 addq $0x10, %rsi
> 00000001000015d9 decq %rdi
> 00000001000015dc jne 0x1000015d0
> 00000001000015de ret
>
> (and sorry, don't even bother looking at what dmd does...)
>
> I'm not complaining about LDC here (although I'm surprised array.empty
> isn't inlined). The way ranges are formulated make them difficult to
> optimize. I think there's things we can do here in the library. Maybe
> I'll write up something about that at some point.
>
> I think the takeaway here is that people should be aware of (a) what
> kind of instructions their code is generating, (b) what kind of
> instructions their code SHOULD be generating, and (c) what is
> practically possible for present-day compilers. Like you say, it helps
> to look at the assembled code once in a while to get a feel for this
> kind of thing. Modern compilers are good, but they aren't magic.
|
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Peter Alexander | On 9/27/2014 2:59 PM, Peter Alexander wrote:
> On Saturday, 27 September 2014 at 20:57:53 UTC, Walter Bright wrote:
>> From time to time, I take a break from bugs and enhancements and just look at
>> what some piece of code is actually doing. Sometimes, I'm appalled.
>
> Me too, and yes it can be appalling. It's pretty bad for even simple range
> chains, e.g.
>
> import std.algorithm, std.stdio;
> int main(string[] args) {
> return cast(int)args.map!("a.length").reduce!"a+b"();
> }
>
> Here's what LDC produces (with -O -inline -release -noboundscheck)
Part of this particular case problem is not a compiler optimizer weakness, but that autodecode problem I've been throwing (!) chairs through windows on.
|
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On Sat, Sep 27, 2014 at 03:26:35PM -0700, Walter Bright via Digitalmars-d wrote: > On 9/27/2014 2:59 PM, Peter Alexander wrote: > >On Saturday, 27 September 2014 at 20:57:53 UTC, Walter Bright wrote: > >>From time to time, I take a break from bugs and enhancements and just look at what some piece of code is actually doing. Sometimes, I'm appalled. > > > >Me too, and yes it can be appalling. It's pretty bad for even simple range chains, e.g. > > > >import std.algorithm, std.stdio; > >int main(string[] args) { > > return cast(int)args.map!("a.length").reduce!"a+b"(); > >} > > > >Here's what LDC produces (with -O -inline -release -noboundscheck) > > Part of this particular case problem is not a compiler optimizer weakness, but that autodecode problem I've been throwing (!) chairs through windows on. If we can get Andrei on board, I'm all for killing off autodecoding. T -- MAS = Mana Ada Sistem? |
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright:
>> import std.algorithm, std.stdio;
>> int main(string[] args) {
>> return cast(int)args.map!("a.length").reduce!"a+b"();
>> }
>>
>> Here's what LDC produces (with -O -inline -release -noboundscheck)
>
> Part of this particular case problem is not a compiler optimizer weakness, but that autodecode problem I've been throwing (!) chairs through windows on.
There is no char auto decoding in this program, right?
Note: in Phobos now we have std.algorithm.sum, that is better than reduce!"a+b"().
Bye,
bearophile
|
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Brad Roberts | On 9/27/2014 3:26 PM, Brad Roberts via Digitalmars-d wrote:
> What we're seeing here is pretty much the same problem that early c++ suffered
> from: abstraction penalty. It took years of work to help overcome it, both from
> the compiler and the library. Not having trivial functions inlined and
> optimized down through standard techniques like dead store elimination, value
> range propagation, various loop restructurings, etc means that code will look
> like what Walter and you have shown. Given DMD's relatively weak inliner, I'm
> not shocked by Walter's example. I am curious why ldc failed to inline those
> functions.
Again, this accumulation of barnacles is not a failure of the optimizer. It's a failure of adding gee-gaws to the source code without checking their effect.
|
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | H. S. Teoh:
> If we can get Andrei on board, I'm all for killing off autodecoding.
Killing auto-decoding for std.algorithm functions will break most of my D2 code... perhaps we can do that in a D3 language.
Bye,
bearophile
|
September 27, 2014 Re: Creeping Bloat in Phobos | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | On 9/27/2014 3:54 PM, Walter Bright via Digitalmars-d wrote:
> On 9/27/2014 3:26 PM, Brad Roberts via Digitalmars-d wrote:
>> What we're seeing here is pretty much the same problem that early c++
>> suffered
>> from: abstraction penalty. It took years of work to help overcome it,
>> both from
>> the compiler and the library. Not having trivial functions inlined and
>> optimized down through standard techniques like dead store
>> elimination, value
>> range propagation, various loop restructurings, etc means that code
>> will look
>> like what Walter and you have shown. Given DMD's relatively weak
>> inliner, I'm
>> not shocked by Walter's example. I am curious why ldc failed to
>> inline those
>> functions.
>
> Again, this accumulation of barnacles is not a failure of the optimizer.
> It's a failure of adding gee-gaws to the source code without checking
> their effect.
Look at Peter's example, it's better for this, I believe. Why isn't empty being inlined? That's a tiny little function with a lot of impact.
Of course there's more than just optimization, but it's a big player in the game too.
|
Copyright © 1999-2021 by the D Language Foundation