January 24, 2020
On 1/24/20 4:52 AM, Johan Engelen wrote:
> On Thursday, 23 January 2020 at 00:20:00 UTC, IGotD- wrote:
>> On Wednesday, 22 January 2020 at 23:51:23 UTC, FogD wrote:
>>>
>>> A recent comparison of languages from this perspective.
>>>
>>> https://drewdevault.com/2020/01/04/Slow.html
>>
>> It would be interesting to know what that huge number of system calls really do, especially when it comes to D which has around 150.
> 
> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary.
> Much appreciated if someone could repeat the test and post a list of all syscalls being made.

Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do!

A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).

-Steve
January 24, 2020
On Friday, 24 January 2020 at 13:06:33 UTC, Steven Schveighoffer wrote:
>
> Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do!
>
> A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).
>
> -Steve

That makes perfectly sense, I didn't think about sbrk which needs to bump the heap quite a bit during the startup.

January 24, 2020
On Friday, 24 January 2020 at 13:06:33 UTC, Steven Schveighoffer wrote:
>
> A good test would be to do a betterC version with printf and see what the difference is (technically it should be the same as the C version).
>

Indeed. I wrote an email to the author about that yesterday, though I haven't heard from him since.

January 24, 2020
On Fri, Jan 24, 2020 at 08:06:33AM -0500, Steven Schveighoffer via Digitalmars-d wrote:
> On 1/24/20 4:52 AM, Johan Engelen wrote:
[...]
> > Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary.
[...]
> Most likely it's the runtime startup. Obviously sbrk quite a bit, but any runtime initialization (thread startup, mutex initialization, etc) are all going to go in there. Think of what the GC has to do!
[...]

It makes me wonder how much we can make all this startup stuff pay-as-you-go.  I mean, IIRC, isn't the GC lazily initialized now? I vaguely remember some PR along that direction. Or was it the pool allocations?

I suppose thread startup would be hard to elide, unless there was a way to initialize the thread stuff only on demand. Ditto for mutex inits. But it might not be worth the effort for such minimal benefits in such a marginal test case.


T

-- 
Those who've learned LaTeX swear by it. Those who are learning LaTeX swear at it. -- Pete Bleackley
January 24, 2020
On 1/24/20 1:24 PM, H. S. Teoh wrote:
> On Fri, Jan 24, 2020 at 08:06:33AM -0500, Steven Schveighoffer via Digitalmars-d wrote:
>> On 1/24/20 4:52 AM, Johan Engelen wrote:
> [...]
>>> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's
>>> binary.
> [...]
>> Most likely it's the runtime startup. Obviously sbrk quite a bit, but
>> any runtime initialization (thread startup, mutex initialization, etc)
>> are all going to go in there. Think of what the GC has to do!
> [...]
> 
> It makes me wonder how much we can make all this startup stuff
> pay-as-you-go.  I mean, IIRC, isn't the GC lazily initialized now? I
> vaguely remember some PR along that direction. Or was it the pool
> allocations?

Yes, it is lazily initialized. It's kind of a cool mechanism too -- the "default" GC is a class that when used in a way where a "real" GC is needed (e.g. allocate some memory), figures out which one to create, creates it, and then replaces itself as the global handler with that new one.

But the GC is going to be initialized in a writeln call I think.

There's a few other things that are going to cause a lot of system calls too -- the static constructors and the cycle detection. At least the cycle detection we could rid ourselves of if we could make a post-compile step that runs the cycle detection algorithm and sets up the final ordering in the binary.

> I suppose thread startup would be hard to elide, unless there was a way
> to initialize the thread stuff only on demand. Ditto for mutex inits.
> But it might not be worth the effort for such minimal benefits in such a
> marginal test case.

I'm not sure why we need to exactly minimize the system calls, we should just be able to explain them. 150 calls isn't horrific, and trying to reduce an "artificial" metric like that really shouldn't be the goal. I know this is exactly what the author is complaining about, but there is a world of difference between a 50MB web site that can't scroll and 150 system calls to do runtime startup + print hello world.

However, there could easily be an obvious candidate for removal if something looks like it's being called way too often. So explanation is still a good goal.

-Steve
January 24, 2020
On 2020-01-24 10:52, Johan Engelen wrote:

> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary.
> Much appreciated if someone could repeat the test and post a list of all syscalls being made.

Not exactly the same as the original post, but here's some data I pulled out for macOS 10.14.6 with a Hello World compiled with DMD 2.088.0. This should give somewhat of an idea what's going on in the application. I've included the stacktrace for all syscalls NOT made by the system.

As you can see below, 10 calls are made by the application, all of the remaining calls are made by the system itself. Most of the calls made by the system are made by the dynamic loader. Only three calls originate from the D main function. Only one call into the C standard library is made from the D main function, which is the call to `fwrite`.

It's not our fault that the system does so many calls :).




total: 120, system: 110, app: 10

stat64, total: 40, system: 40, app: 0

-----------------------------------------------

mach_vm_map_trap, total: 8, system: 6, app: 2

   6 libsystem_malloc.dylib malloc
   7 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFAyaZ6doSortMFmKAPyS6object10ModuleInfoZb
   8 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFAyaZv
   9 foobar _D2rt5minfo11ModuleGroup9sortCtorsMFZv
  10 foobar _D2rt5minfo13rt_moduleCtorUZ14__foreachbody1MFKSQBu19sections_osx_x86_6412SectionGroupZi
  11 foobar rt_moduleCtor
  12 foobar rt_init
  13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
  14 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  15 foobar _d_run_main2
  16 foobar _d_run_main
  17 foobar main
  18 libdyld.dylib start



  10 libsystem_c.dylib fwrite
  11 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322
  12 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930
  13 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855
  14 foobar _Dmain ~/development/d/main.d:15
  15 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv
  16 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  17 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
  18 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  19 foobar _d_run_main2
  20 foobar _d_run_main
  21 foobar main
  22 libdyld.dylib start


-----------------------------------------------
mprotect, total: 8, system: 8
mach_port_deallocate_trap, total: 4, system: 4
-----------------------------------------------
sigaction, total: 4, system: 0, app: 4

   0 libsystem_kernel.dylib __sigaction
   1 libsystem_platform.dylib __platform_sigaction
   2 foobar runModuleUnitTests
   3 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
   4 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
   5 foobar _d_run_main2
   6 foobar _d_run_main
   7 foobar main
   8 libdyld.dylib start

-----------------------------------------------
mach_reply_port, total: 4, system: 4, app: 0
host_self_trap, total: 3, system: 3, app: 0
mach_port_mod_refs_trap, total: 3, system: 3, app: 0
ioctl, total: 3, system: 3, app: 0
mach_port_construct_trap, total: 3, system: 3, app: 0
kdebug_typefilter, total: 2, system: 2, app: 0
thread_self_trap, total: 2, system: 2, app: 0
proc_info, total: 2, system: 2, app: 0
mach_port_destruct_trap, total: 2, system: 2, app: 0
csops, total: 2, system: 2, app: 0
getpid, total: 2, system: 2, app: 0
task_self_trap, total: 2, system: 2, app: 0
access, total: 1, system: 1, app: 0
close, total: 1, system: 1, app: 0
shared_region_check_np, total: 1, system: 1, app: 0
-------------------------------------------------------
fstat64, total: 1, system: 0, app: 1

   5 libsystem_c.dylib fwrite
   6 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322
   7 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930
   8 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855
   9 foobar _Dmain ~/development/d/main.d:15
  10 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv
  11 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  12 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
  13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  14 foobar _d_run_main2
  15 foobar _d_run_main
  16 foobar main
  17 libdyld.dylib start

-----------------------------------------------
mac_vm_allocate_trap, total: 1, system: 1, app: 0
open, total: 1, system: 1, app: 0
mac_syscall, total: 1, system: 1, app: 0
-----------------------------------------------
sysctl, total: 1, system: 0, app: 1

   3 libsystem_c.dylib sysconf
   4 foobar _D4core6thread26_sharedStaticCtor_L3685_C1FZv
   5 foobar _D4core6thread15__modsharedctorFZv
   6 foobar _D2rt5minfo__T14runModuleFuncsSQBdQBd11ModuleGroup8runCtorsMFZ9__lambda2ZQChMFAxPyS6object10ModuleInfoZv
   7 foobar _D2rt5minfo11ModuleGroup8runCtorsMFZv
   8 foobar _D2rt5minfo13rt_moduleCtorUZ14__foreachbody1MFKSQBu19sections_osx_x86_6412SectionGroupZi
   9 foobar rt_moduleCtor
  10 foobar rt_init
  11 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
  12 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  13 foobar _d_run_main2
  14 foobar _d_run_main
  15 foobar main
  16 libdyld.dylib start


-----------------------------------------------
getentropy, total: 1, system: 1, app: 0
issetugid, total: 1, system: 1, app: 0
bsdthread_register, total: 1, system: 1, app: 0

-----------------------------------------------
write_nocancel, total: 1, system: 0, app: 1

   0 libsystem_kernel.dylib __write_nocancel
   1 libsystem_c.dylib _swrite
   2 libsystem_c.dylib __sflush
   3 libsystem_c.dylib fflush
   4 foobar _d_run_main2
   5 foobar _d_run_main
   6 foobar main
   7 libdyld.dylib start

-----------------------------------------------
exit, total: 1, system: 1, app: 0

-----------------------------------------------
getrlimit, total: 1, system: 0, app: 1

   8 libsystem_c.dylib fwrite
   9 foobar _D3std5stdio__T13trustedFwriteTaZQsFNbNiNePOS4core4stdcQBx7__sFILExAaZm ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:4322
  10 foobar _D3std5stdio4File17LockingTextWriter__T3putTAyaZQjMFNfMQlZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:2930
  11 foobar _D3std5stdio__T7writelnTAyaZQnFNfQjZv ~/.dvm/compilers/dmd-2.088.0/osx/bin/../../src/phobos/std/stdio.d:3855
  12 foobar _Dmain ~/development/d/main.d:15
  13 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZ9__lambda1MFZv
  14 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  15 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv
  16 foobar _D2rt6dmain212_d_run_main2UAAamPUQgZiZ7tryExecMFMDFZvZv
  17 foobar _d_run_main2
  18 foobar _d_run_main
  19 foobar main
  20 libdyld.dylib start

-----------------------------------------------
mach_timebase_info, total: 1, system: 1, app: 0
sysctlbyname, total: 1, system: 1, app: 0
csrctl, total: 1, system: 1, app: 0
thread_selfid, total: 1, system: 1, app: 0




-- 
/Jacob Carlborg
January 24, 2020
On 2020-01-24 21:29, Jacob Carlborg wrote:

> total: 120, system: 110, app: 10

I would like to add that the number of calls varies between runs.

-- 
/Jacob Carlborg
January 24, 2020
On Friday, 24 January 2020 at 09:52:13 UTC, Johan Engelen wrote:
> Indeed. Also to figure out why LDC's binary calls 31 more than DMD's binary.
> Much appreciated if someone could repeat the test and post a list of all syscalls being made.
>
> -Johan

Here is a run of a statically linked hello world binary on Linux:

execve("./hello", ["./hello"], 0x7fffd9d9d0e0 /* 15 vars */) = 0
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffff1a05c70) = -1 EINVAL (Invalid argument)
brk(NULL)                               = 0x1364000
brk(0x1365340)                          = 0x1365340
arch_prctl(ARCH_SET_FS, 0x1364a00)      = 0
uname({sysname="Linux", nodename="kangoroo", ...}) = 0
set_tid_address(0x1364cd0)              = 111
set_robust_list(0x1364ce0, 24)          = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x453cb0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x4531b0}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x453d50, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x4531b0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0
readlink("/proc/self/exe", "/root/hello", 4096) = 11
brk(0x1386340)                          = 0x1386340
brk(0x1387000)                          = 0x1387000
clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0
clock_getres(CLOCK_BOOTTIME, {tv_sec=0, tv_nsec=100}) = 0
clock_getres(CLOCK_MONOTONIC_COARSE, {tv_sec=0, tv_nsec=100}) = 0
clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0
clock_getres(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=15625000}) = 0
clock_getres(CLOCK_MONOTONIC_RAW, {tv_sec=0, tv_nsec=100}) = 0
clock_getres(CLOCK_THREAD_CPUTIME_ID, {tv_sec=0, tv_nsec=15625000}) = 0
rt_sigaction(SIGUSR1, {sa_handler=0x42ac20, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x4531b0}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {sa_handler=0x42ad20, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY|O_CLOEXEC) = 3
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=8192*1024}) = 0
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "00400000-00561000 r-xp 00000000 "..., 4096) = 488
close(3)                                = 0
brk(0x1386000)                          = 0x1386000
sched_getaffinity(111, 32, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) = 32
clock_getres(CLOCK_MONOTONIC, {tv_sec=0, tv_nsec=100}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=469, tv_nsec=741732400}) = 0
rt_sigaction(SIGSEGV, {sa_handler=0x4402b0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x4531b0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fa7e7f94fb0}, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=0x4402b0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER|SA_RESETHAND|SA_SIGINFO, sa_restorer=0x4531b0}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fa7e7f94fb0}, 8) = 0
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0
rt_sigaction(SIGBUS, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x4531b0}, NULL, 8) = 0
fstat(1, {st_mode=S_IFCHR|0660, st_rdev=makedev(0x4, 0x1), ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "hello world\n", 12)           = 12
write(1, "\n", 1)                       = 1
exit_group(0)                           = ?
+++ exited with 0 +++

I linked it statically to exclude the huge number of system calls made by ld.so. There's little that could be done to improve on those, I think.

If I let strace dump stack traces for each call, the first mention of druntime is after the two consecutive brk calls. But I'm not sure if the traces are trustworthy. They seem cut short in a couple of instances.

So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though?
January 25, 2020
On 1/24/20 5:20 PM, Gregor Mückl wrote:

> So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though?

This is great stuff, thanks!

Note that /proc/self/maps appears nowhere in phobos or druntime. So I'm assuming that's some other function that's doing it (maybe in libc?)

The clock_getres calls are so you can use core.time.MonoTime. rt_init initializes those so they can be used early in the process. I would assume those are really fast, as they are constants in the kernel.

-Steve
January 25, 2020
On Saturday, 25 January 2020 at 16:07:24 UTC, Steven Schveighoffer wrote:
> On 1/24/20 5:20 PM, Gregor Mückl wrote:
>
>> So there's a couple of calls that should be fairly fast (clock_getres etc.). Why does the runtime need to read /proc/self/maps, though?
>
> This is great stuff, thanks!
>
> Note that /proc/self/maps appears nowhere in phobos or druntime. So I'm assuming that's some other function that's doing it (maybe in libc?)
>
> The clock_getres calls are so you can use core.time.MonoTime. rt_init initializes those so they can be used early in the process. I would assume those are really fast, as they are constants in the kernel.
>
> -Steve

I was curious so I ran a quick strace hello world experiment using only printf, not writeln, compiled with D (DMD), C++ and C (gcc and clang). Only the D binary opens /proc/self/maps. Running `strace dmd --version` also opens /proc/self/maps, but I guess that makes sense since the compiler itself is now written in D.

===
// D hello world
import core.stdc.stdio;

int main() {
  printf("Hello world\n");
  return 0;
}

// CC and C hello world
#include <stdio.h>

int main() {
  printf("Hello world\n");
  return 0;
}
===


Cheers,
Norm