March 27, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rubn | On Tue, Mar 27, 2018 at 08:25:36PM +0000, Rubn via Digitalmars-d wrote:
[...]
> _D7example__T3fooTSQr3FooZQnFNbNiNfQrZv:
> push rbp
> mov rbp, rsp
> sub rsp, 3104
> lea rax, [rbp + 16]
> lea rdi, [rbp - 2048]
> lea rcx, [rbp - 1024]
> mov edx, 1024
> mov rsi, rcx
> mov qword ptr [rbp - 2056], rdi
> mov rdi, rsi
> mov rsi, rax
> mov qword ptr [rbp - 2064], rcx
> call memcpy@PLT <--------------------- hidden copy
[...]

Is this generated by dmd, or gdc/ldc?

Generally, when it comes to performance issues, I don't even bother looking at dmd-generated code anymore. If the extra copying is still happening with gdc -O2 / ldc -O, then you have a point. Otherwise, it doesn't really say very much.


T

-- 
People tell me that I'm skeptical, but I don't believe them.
|
March 27, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On Tuesday, 27 March 2018 at 20:38:35 UTC, H. S. Teoh wrote:
> On Tue, Mar 27, 2018 at 08:25:36PM +0000, Rubn via Digitalmars-d wrote: [...]
>> _D7example__T3fooTSQr3FooZQnFNbNiNfQrZv:
>> push rbp
>> mov rbp, rsp
>> sub rsp, 3104
>> lea rax, [rbp + 16]
>> lea rdi, [rbp - 2048]
>> lea rcx, [rbp - 1024]
>> mov edx, 1024
>> mov rsi, rcx
>> mov qword ptr [rbp - 2056], rdi
>> mov rdi, rsi
>> mov rsi, rax
>> mov qword ptr [rbp - 2064], rcx
>> call memcpy@PLT <--------------------- hidden copy
> [...]
>
> Is this generated by dmd, or gdc/ldc?
>
> Generally, when it comes to performance issues, I don't even bother looking at dmd-generated code anymore. If the extra copying is still happening with gdc -O2 / ldc -O, then you have a point. Otherwise, it doesn't really say very much.
>
>
> T
It happens with LDC too. I'm not sure how it would be able to know to do any kind of optimization like that unless it was able to inline every single function called into one function and optimize it from there. I don't imagine that'll be likely though.
|
March 27, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rubn | On Tue, Mar 27, 2018 at 09:52:25PM +0000, Rubn via Digitalmars-d wrote:
> On Tuesday, 27 March 2018 at 20:38:35 UTC, H. S. Teoh wrote:
> > On Tue, Mar 27, 2018 at 08:25:36PM +0000, Rubn via Digitalmars-d wrote: [...]
> > > _D7example__T3fooTSQr3FooZQnFNbNiNfQrZv:
> > > push rbp
> > > mov rbp, rsp
> > > sub rsp, 3104
> > > lea rax, [rbp + 16]
> > > lea rdi, [rbp - 2048]
> > > lea rcx, [rbp - 1024]
> > > mov edx, 1024
> > > mov rsi, rcx
> > > mov qword ptr [rbp - 2056], rdi
> > > mov rdi, rsi
> > > mov rsi, rax
> > > mov qword ptr [rbp - 2064], rcx
> > > call memcpy@PLT <--------------------- hidden copy
> > [...]
> >
> > Is this generated by dmd, or gdc/ldc?
> >
> > Generally, when it comes to performance issues, I don't even bother looking at dmd-generated code anymore. If the extra copying is still happening with gdc -O2 / ldc -O, then you have a point. Otherwise, it doesn't really say very much.
> >
> >
> > T
>
> It happens with LDC too, not sure how it would be able to know to do any kind of optimization like that unless it was able to inline every single function called into one function and be able to optimize it from there. I don't imagine that'll be likely though.

You'll be surprised. Don't underestimate the power of modern optimizers. I've seen LDC do inlining that's so aggressive that it essentially evaluated an entire series of function calls at compile-time (likely on the IR) and generated a single instruction to load the answer into the return register at runtime. :-D Of course, it still generated the individual functions, but those are never actually called at runtime. (On one occasion, this produced odd-looking "benchmark" results where the ldc executable computed the answer in exactly 0ms, whereas everyone else took a lot longer than that. :-D (Well, it was probably a few nanosecs while the CPU decoded and ran the instruction, but I don't think any benchmark could measure that!))

For your code example, you might want to look at the code generated for callers of the function, since when compiling individual functions in isolation, LDC is obligated to follow the ABI, which could include redundant copying. But if inlining was possible, it could generate very different code.


T

-- 
Dogs have owners ... cats have staff. -- Krista Casada
|
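To make the "hidden copy" being discussed a bit more concrete, here is a minimal sketch. The names `Big`, `byValue` and `byRef` are made up for illustration and are not from Rubn's example. It counts postblit calls to show that passing an rvalue by value is a move at the language level (no postblit runs), while passing an lvalue by value is a real copy. Whether the move still shows up as a `memcpy` in the generated code is a separate, compiler- and ABI-dependent question, which is what the assembly above is about.

```d
// Hypothetical 1 KiB struct, loosely modelled on the Foo in this thread.
struct Big
{
    ubyte[1024] data;
    static int copies;          // thread-local counter of postblit calls
    this(this) { ++copies; }    // postblit: runs on language-level copies only
}

void byValue(Big b) { }         // parameter is a fresh object in the callee
void byRef(ref Big b) { }       // parameter aliases the caller's object

void main()
{
    Big lv;

    byValue(lv);                // lvalue passed by value: one real copy
    assert(Big.copies == 1);

    byRef(lv);                  // by ref: no copy at all
    assert(Big.copies == 1);

    byValue(Big());             // rvalue passed by value: moved, no postblit
    assert(Big.copies == 1);
}
```

So a postblit counter alone will not reveal the bitwise move of an rvalue argument; for that you have to look at the optimized output of the caller, as suggested above.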
March 27, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rubn | On Tuesday, 27 March 2018 at 21:52:25 UTC, Rubn wrote:
> It happens with LDC too, not sure how it would be able to know to do any kind of optimization like that unless it was able to inline every single function called into one function and be able to optimize it from there. I don't imagine that'll be likely though.
It does it in your code sample with `-O`, there's no call to bar and the foo() by-value arg is memcpy'd to the global.
If you compile everything with LTO, your code and all 3rd-party libs as well as druntime/Phobos, LLVM is able to optimize the whole program as if it were inside a single gigantic 'object' file in LLVM bitcode IR, and is thus indeed theoretically able to inline *all* functions.
|
March 27, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to kinke | On Tuesday, 27 March 2018 at 23:35:44 UTC, kinke wrote:
> On Tuesday, 27 March 2018 at 21:52:25 UTC, Rubn wrote:
>> It happens with LDC too, not sure how it would be able to know to do any kind of optimization like that unless it was able to inline every single function called into one function and be able to optimize it from there. I don't imagine that'll be likely though.
>
> It does it in your code sample with `-O`, there's no call to bar and the foo() by-value arg is memcpy'd to the global.
>
> If you compile everything with LTO, your code and all 3rd-party libs as well as druntime/Phobos, LLVM is able to optimize the whole program as if it were inside a single gigantic 'object' file in LLVM bitcode IR, and is thus indeed theoretically able to inline *all* functions.

A bit off topic now but anyways:

Well that example I posted didn't do anything, so it would optimize it out quite easily. The entire function was excluded essentially. Just adding a few writeln calls, it isn't able to remove the function entirely anymore and can't optimize it out. Idk if you want to try some different options but flto didn't do anything for it.

https://godbolt.org/g/bLdpnm

import std.stdio : writeln;

struct Foo
{
    ubyte[1024] data;

    this(int a)
    {
        data[0] = cast(ubyte)a;
    }
}

void foo(T)(auto ref T t)
{
    import std.functional: forward;
    writeln(gfoo.data[0]);
    bar(forward!t);
    writeln(gfoo.data[0]);
}

__gshared Foo gfoo;

void bar(T)(auto ref T t)
{
    import std.algorithm.mutation : move;
    writeln(gfoo.data[0]);
    move(t, gfoo);
}

void main()
{
    foo(Foo(10));
}
|
March 28, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to kinke | On Tuesday, 27 March 2018 at 23:35:44 UTC, kinke wrote:
> On Tuesday, 27 March 2018 at 21:52:25 UTC, Rubn wrote:
>> It happens with LDC too, not sure how it would be able to know to do any kind of optimization like that unless it was able to inline every single function called into one function and be able to optimize it from there. I don't imagine that'll be likely though.
>
> It does it in your code sample with `-O`, there's no call to bar and the foo() by-value arg is memcpy'd to the global.

For reference: https://run.dlang.io/is/2vDEXP

Note that main() boils down to a `memset(&gfoo, 10, 1024); return 0;`:

_Dmain:
    .cfi_startproc
    pushq %rax
.Lcfi0:
    .cfi_def_cfa_offset 16
    data16
    leaq onlineapp.Foo onlineapp.gfoo@TLSGD(%rip), %rdi
    data16
    data16
    rex64
    callq __tls_get_addr@PLT
    movl $10, %esi
    movl $1024, %edx
    movq %rax, %rdi
    callq memset@PLT
    xorl %eax, %eax
    popq %rcx
    retq
|
March 28, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rubn | On Tuesday, 27 March 2018 at 23:59:09 UTC, Rubn wrote:
> Just adding a few writeln it isn't able to remove the function entirely anymore and can't optimize it out.

Well writeln() here involves number -> string formatting, GC, I/O, template bloat... There are indeed superfluous memcpy's in your foo() there (although the forward and bar calls are still inlined), which after a quick glance seem to be LLVM optimizer shortcomings; the IR emitted by LDC looks fine.
For an arbitrary external function, it's all fine as it should be, boiling down to a single memcpy in foo() and a direct memset in main(): https://run.dlang.io/is/O1aeLK
|
March 28, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to kinke | On Wednesday, 28 March 2018 at 00:56:29 UTC, kinke wrote:
> On Tuesday, 27 March 2018 at 23:59:09 UTC, Rubn wrote:
>> Just adding a few writeln it isn't able to remove the function entirely anymore and can't optimize it out.
>
> Well writeln() here involves number -> string formatting, GC, I/O, template bloat... There are indeed superfluous memcpy's in your foo() there (although the forward and bar calls are still inlined), which after a quick glance seem to be LLVM optimizer shortcomings, the IR emitted by LDC looks fine.
> For an arbitrary external function, it's all fine as it should be, boiling down to a single memcpy in foo() and a direct memset in main(): https://run.dlang.io/is/O1aeLK
Well, something's wrong if writeln causes the optimization to not occur; if that is the case then it'd be best to just use printf() instead. Anyways, what small examples show about optimization is usually not what's going to happen in actual code. Functions are rarely that simple, and if adding a single writeln() to a call is enough to eliminate that optimization, I can only imagine what other little things do as well.
|
March 28, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | On Friday, 23 March 2018 at 22:01:44 UTC, Manu wrote:
> By contrast, people will NOT forgive the fact that they have to change:
>
> func(f(x), f(y), f(z));
>
> to:
>
> T temp = f(x);
> T temp2 = f(y);
> T temp3 = f(z);
> func(temp, temp2, temp3);
>
> That's just hideous and indefensible.
>
> A better story would be:
>
> func(f(x), f(y), f(z));
> =>
> func(x.f, y.f, z.f);
Another workaround:
auto r(T)(T a)
{
    struct R { T val; }
    return R(a);
}

void f(in ref int p);

int main()
{
    f(1.r.val);
    return 0;
}
|
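For comparison, here is a small sketch of the two options that come up elsewhere in this thread: the named-temporary pattern Manu objects to, and an `auto ref` template as used in Rubn's example. `Vec`, `make`, `sum` and `sumAuto` are hypothetical names used only for illustration, and the behaviour described assumes a compiler with default flags (no preview switches).

```d
// Hypothetical small value type and factory function.
struct Vec { float x, y, z; }

Vec make(float s) { return Vec(s, s, s); }

// Takes its argument by ref: only lvalues are accepted.
float sum(ref const Vec v) { return v.x + v.y + v.z; }

// auto ref template: lvalues are taken by reference, rvalues by value.
float sumAuto()(auto ref const Vec v) { return v.x + v.y + v.z; }

void main()
{
    // sum(make(1));  // rejected with default flags: an rvalue does not bind to ref

    Vec tmp = make(1);               // the named temporary from the quoted post
    assert(sum(tmp) == 3);

    assert(sumAuto(make(2)) == 6);   // rvalue accepted directly
    assert(sumAuto(tmp) == 3);       // lvalue still passed by reference
}
```

The `auto ref` form only works for templates, so it does not help with non-template functions or `extern` declarations like the `f` above.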
March 28, 2018 Re: rvalues -> ref (yup... again!) | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | On 27.03.2018 20:14, Manu wrote:
> That's exactly what I've been saying. For like, 9 years..
> It looks like this:
> https://github.com/TurkeyMan/DIPs/blob/ref_args/DIPs/DIP1xxx-rval_to_ref.md
> (contribution appreciated)
>
> As far as I can tell, it's completely benign, it just eliminates the
> annoying edge cases when interacting with functions that take
> arguments by ref. There's no spill-over effect anywhere that I'm aware
> of, and if you can find a single wart, I definitely want to know about
> it.

???

> I've asked so many times for a technical destruction, nobody will
> present any opposition that is anything other than a rejection *in
> principle*. This is a holy war, not a technical one.

That's extremely unfair.

It is just a bad idea to overload D const for this purpose. Remove the "const" requirement and I'm on board.
|