January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johan Engelen | On 1/24/20 4:52 AM, Johan Engelen wrote:
> On Thursday, 23 January 2020 at 00:20:00 UTC, IGotD- wrote:
>> On Wednesday, 22 January 2020 at 23:51:23 UTC, FogD wrote:
>>>
>>> A recent comparison of languages from this perspective.
>>>
>>> https://drewdevault.com/2020/01/04/Slow.html
>>
>> It would be interesting to know what that huge number of system calls really do, especially when it comes to D which has around 150.
>
> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary.
> Much appreciated if someone could repeat the test and post a list of all syscalls being made.
Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do!
A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).
-Steve
|
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Friday, 24 January 2020 at 13:06:33 UTC, Steven Schveighoffer wrote:
>
> Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do!
>
> A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).
>
> -Steve
That makes perfectly sense, I didn't think about sbrk which needs to bump the heap quite a bit during the startup.
|
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Friday, 24 January 2020 at 13:06:33 UTC, Steven Schveighoffer wrote:
>
> A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).
>
Indeed. I wrote an email to the author about that yesterday, though I haven't heard from him since.
|
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Fri, Jan 24, 2020 at 08:06:33AM -0500, Steven Schveighoffer via Digitalmars-d wrote: > On 1/24/20 4:52 AM, Johan Engelen wrote: [...] > > Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary. [...] > Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do! [...] It makes me wonder how much we can make all this startup stuff pay-as-you-go. I mean, IIRC, isn't the GC lazily initialized now? I vaguely remember some PR along that direction. Or was it the pool allocations? I suppose thread startup would be hard to elide, unless there was a way to initialize the thread stuff only on demand. Ditto for mutex inits. But it might not be worth the effort for such minimal benefits in such a marginal test case. T -- Those who've learned LaTeX swear by it. Those who are learning LaTeX swear at it. -- Pete Bleackley |
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to H. S. Teoh | On 1/24/20 1:24 PM, H. S. Teoh wrote: > On Fri, Jan 24, 2020 at 08:06:33AM -0500, Steven Schveighoffer via Digitalmars-d wrote: >> On 1/24/20 4:52 AM, Johan Engelen wrote: > [...] >>> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's >>> binary. > [...] >> Most likely it's the runtime startup. Obviously sbrk quite a bit, but >> any runtime initialization (thread startup, mutex initialization, etc) >> are all going to go in there. Think of what the GC has to do! > [...] > > It makes me wonder how much we can make all this startup stuff > pay-as-you-go. I mean, IIRC, isn't the GC lazily initialized now? I > vaguely remember some PR along that direction. Or was it the pool > allocations? Yes, it is lazily initialized. It's kind of a cool mechanism too -- the "default" GC is a class that when used in a way where a "real" GC is needed (e.g. allocate some memory), figures out which one to create, creates it, and then replaces itself as the global handler with that new one. But the GC is going to be initialized in a writeln call I think. There's a few other things that are going to cause a lot of system calls too -- the static constructors and the cycle detection. At least the cycle detection we could rid ourselves of if we could make a post-compile step that runs the cycle detection algorithm and sets up the final ordering in the binary. > I suppose thread startup would be hard to elide, unless there was a way > to initialize the thread stuff only on demand. Ditto for mutex inits. > But it might not be worth the effort for such minimal benefits in such a > marginal test case. I'm not sure why we need to exactly minimize the system calls, we should just be able to explain them. 150 calls isn't horrific, and trying to reduce an "artificial" metric like that really shouldn't be the goal. I know this is exactly what the author is complaining about, but there is a world of difference between a 50MB web site that can't scroll and 150 system calls to do runtime startup + print hello world. However, there could easily be an obvious candidate for removal if something looks like it's being called way too often. So explanation is still a good goal. -Steve |
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johan Engelen | On 2020-01-24 10:52, Johan Engelen wrote: > Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary. > Much appreciated if someone could repeat the test and post a list of all syscalls being made. Not exactly the same as the original post, but here's some data I pulled out for macOS 10.14.6 with a Hello World compiled with DMD 2.088.0. This should give somewhat of an idea what's going on in the application. I've included the stacktrace for all syscalls NOT made by the system. As you can see below, 10 calls are made by the application, all of the remaining calls are made by the system itself. Most of the calls made by the system are made by the dynamic loader. Only three calls originate from the D main function. Only one call into the C standard library is made from the D main function, which is the call to `fwrite`. It's not our fault that the system does so many calls :). total: 120, system: 110, app: 10 stat64, total: 40, system: 40, app: 0 ----------------------------------------------- mach_vm_map_trap, total: 8, system: 6, app: 2 6 libsystem_malloc.dylib malloc 7 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFAyaZ6doSortMFmKAPyS6object10ModuleInfoZb 8 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFAyaZv 9 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFZv 10 foobar _D2rt5minfo13rt_moduleCtorUZ14__foreachbody1MFKSQBu19sections_osx_x86_6412SectionGroupZi 11 foobar rt_moduleCtor 12 foobar rt_init 13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 14 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 15 foobar _d_run_main2 16 foobar _d_run_main 17 foobar main 18 libdyld.dylib start 10 libsystem_c.dylib fwrite 11 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322 12 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930 13 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855 14 foobar _Dmain ~/development/d/main.d:15 15 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv 16 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 17 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 18 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 19 foobar _d_run_main2 20 foobar _d_run_main 21 foobar main 22 libdyld.dylib start ----------------------------------------------- mprotect, total: 8, system: 8 mach_port_deallocate_trap, total: 4, system: 4 ----------------------------------------------- sigaction, total: 4, system: 0, app: 4 0 libsystem_kernel.dylib __sigaction 1 libsystem_platform.dylib __platform_sigaction 2 foobar runModuleUnitTests 3 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 4 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 5 foobar _d_run_main2 6 foobar _d_run_main 7 foobar main 8 libdyld.dylib start ----------------------------------------------- mach_reply_port, total: 4, system: 4, app: 0 host_self_trap, total: 3, system: 3, app: 0 mach_port_mod_refs_trap, total: 3, system: 3, app: 0 ioctl, total: 3, system: 3, app: 0 mach_port_construct_trap, total: 3, system: 3, app: 0 kdebug_typefilter, total: 2, system: 2, app: 0 thread_self_trap, total: 2, system: 2, app: 0 proc_info, total: 2, system: 2, app: 0 mach_port_destruct_trap, total: 2, system: 2, app: 0 csops, total: 2, system: 2, app: 0 getpid, total: 2, system: 2, app: 0 task_self_trap, total: 2, system: 2, app: 0 access, total: 1, system: 1, app: 0 close, total: 1, system: 1, app: 0 shared_region_check_np, total: 1, system: 1, app: 0 ------------------------------------------------------- fstat64, total: 1, system: 0, app: 1 5 libsystem_c.dylib fwrite 6 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322 7 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930 8 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855 9 foobar _Dmain ~/development/d/main.d:15 10 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv 11 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 12 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 14 foobar _d_run_main2 15 foobar _d_run_main 16 foobar main 17 libdyld.dylib start ----------------------------------------------- mac_vm_allocate_trap, total: 1, system: 1, app: 0 open, total: 1, system: 1, app: 0 mac_syscall, total: 1, system: 1, app: 0 ----------------------------------------------- sysctl, total: 1, system: 0, app: 1 3 libsystem_c.dylib sysconf 4 foobar _D4core6thread26_sharedStaticCtor_L3685_C1FZv 5 foobar _D4core6thread15__modsharedctorFZv 6 foobar _D2rt5minfo__T14runModuleFuncsSQBdQBd11ModuleGroup8runCtorsMFZ9__lambda2ZQChMFAxPyS6object10ModuleInfoZv 7 foobar _D2rt5minfo11ModuleGroup8runCtorsMFZv 8 foobar _D2rt5minfo13rt_moduleCtorUZ14__foreachbody1MFKSQBu19sections_osx_x86_6412SectionGroupZi 9 foobar rt_moduleCtor 10 foobar rt_init 11 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 12 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 13 foobar _d_run_main2 14 foobar _d_run_main 15 foobar main 16 libdyld.dylib start ----------------------------------------------- getentropy, total: 1, system: 1, app: 0 issetugid, total: 1, system: 1, app: 0 bsdthread_register, total: 1, system: 1, app: 0 ----------------------------------------------- write_nocancel, total: 1, system: 0, app: 1 0 libsystem_kernel.dylib __write_nocancel 1 libsystem_c.dylib _swrite 2 libsystem_c.dylib __sflush 3 libsystem_c.dylib fflush 4 foobar _d_run_main2 5 foobar _d_run_main 6 foobar main 7 libdyld.dylib start ----------------------------------------------- exit, total: 1, system: 1, app: 0 ----------------------------------------------- getrlimit, total: 1, system: 0, app: 1 8 libsystem_c.dylib fwrite 9 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322 10 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930 11 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855 12 foobar _Dmain ~/development/d/main.d:15 13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv 14 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 15 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv 16 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv 17 foobar _d_run_main2 18 foobar _d_run_main 19 foobar main 20 libdyld.dylib start ----------------------------------------------- mach_timebase_info, total: 1, system: 1, app: 0 sysctlbyname, total: 1, system: 1, app: 0 csrctl, total: 1, system: 1, app: 0 thread_selfid, total: 1, system: 1, app: 0 -- /Jacob Carlborg |
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jacob Carlborg | On 2020-01-24 21:29, Jacob Carlborg wrote: > total: 120, system: 110, app: 10 I would like to add that the number of calls varies between runs. -- /Jacob Carlborg |
January 24, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Johan Engelen | On Friday, 24 January 2020 at 09:52:13 UTC, Johan Engelen wrote: > Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary. > Much appreciated if someone could repeat the test and post a list of all syscalls being made. > > -Johan Here is a run of a statically linked hello world binary on Linux: execve("./hello", ["./hello"], 0x7fffd9d9d0e0 /* 15 vars */) = 0 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffff1a05c70) = -1 EINVAL (Invalid argument) brk(NULL) = 0x1364000 brk(0x1365340) = 0x1365340 arch_prctl(ARCH_SET_FS, 0x1364a00) = 0 uname({sysname="Linux", nodename="kangoroo", ...}) = 0 set_tid_address(0x1364cd0) = 111 set_robust_list(0x1364ce0, 24) = 0 rt_sigaction(SIGRTMIN, {sa_handler=0x453cb0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x4531b0}, NULL, 8) = 0 rt_sigaction(SIGRT_1, {sa_handler=0x453d50, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x4531b0}, NULL, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0 prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0 readlink("/proc/self/exe", "/root/hello", 4096) = 11 brk(0x1386340) = 0x1386340 brk(0x1387000) = 0x1387000 clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0 clock_getres(CLOCK_BOOTTIME, {tv_sec=0, tv_nsec=100}) = 0 clock_getres(CLOCK_MONOTONIC_COARSE, {tv_sec=0, tv_nsec=100}) = 0 clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0 clock_getres(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=15625000}) = 0 clock_getres(CLOCK_MONOTONIC_RAW, {tv_sec=0, tv_nsec=100}) = 0 clock_getres(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=15625000}) = 0 rt_sigaction(SIGUSR1, {sa_handler=0x42ac20, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x4531b0}, NULL, 8) = 0 rt_sigaction(SIGUSR2, {sa_handler=0x42ad20, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0 openat(AT_FDCWD, "/proc/self/maps", O_RDONLY|O_CLOEXEC) = 3 prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0 fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 read(3, "00400000-00561000 r-xp 00000000 "..., 4096) = 488 close(3) = 0 brk(0x1386000) = 0x1386000 sched_getaffinity(111, 32, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) = 32 clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0 clock_gettime(CLOCK_MONOTONIC, {tv_sec=469, tv_nsec=741732400}) = 0 rt_sigaction(SIGSEGV, {sa_handler=0x4402b0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x4531b0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fa7e7f94fb0}, 8) = 0 rt_sigaction(SIGBUS, {sa_handler=0x4402b0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x4531b0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fa7e7f94fb0}, 8) = 0 rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0 rt_sigaction(SIGBUS, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0 fstat(1, {st_mode=S_IFCHR|0660, st_rdev=makedev(0x4, 0x1), ...}) = 0 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0 write(1, "hello world\n", 12) = 12 write(1, "\n", 1) = 1 exit_group(0) = ? +++ exited with 0 +++ I linked it statically to exclude the huge number of system calls made by ld.so. There's little that could be done to improve on those, I think. If I let strace dump stack traces for each call, the first mention of druntime is after the two consecutive brk calls. But I'm not sure if the traces are trustworthy. They seem cut short in a couple of instances. So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though? |
January 25, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Gregor Mückl | On 1/24/20 5:20 PM, Gregor Mückl wrote:
> So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though?
This is great stuff, thanks!
Note that /proc/self/maps appears nowhere in phobos or druntime. So I'm assuming that's some other function that's doing it (maybe in libc?)
The clock_getres calls are so you can use core.time.MonoTime. rt_init initializes those so they can be used early in the process. I would assume those are really fast, as they are constants in the kernel.
-Steve
|
January 25, 2020 Re: Pay as you go is really going to make a difference | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Saturday, 25 January 2020 at 16:07:24 UTC, Steven Schveighoffer wrote: > On 1/24/20 5:20 PM, Gregor Mückl wrote: > >> So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though? > > This is great stuff, thanks! > > Note that /proc/self/maps appears nowhere in phobos or druntime. So I'm assuming that's some other function that's doing it (maybe in libc?) > > The clock_getres calls are so you can use core.time.MonoTime. rt_init initializes those so they can be used early in the process. I would assume those are really fast, as they are constants in the kernel. > > -Steve I was curious so I ran a quick strace hello world experiment using only printf, not writeln, compiled with D (DMD), C++ and C (gcc and clang). Only the D binary opens /proc/self/maps. Running `strace dmd --version` also opens /proc/self/maps, but I guess that makes sense since the compiler itself is now written in D. === // D hello world import core.stdc.stdio; int main() { printf("Hello world\n"); return 0; } // CC and C hello world #include <stdio.h> int main() { printf("Hello world\n"); return 0; } === Cheers, Norm |
Copyright © 1999-2021 by the D Language Foundation