Skynet 1M Fiber microbenchmark in D
October 18, 2017
I'm curious about Fiber/coroutine performance in D compared to other languages such as Rust.

How should the following test be implemented in D?

Creates an actor (goroutine, whatever), which spawns 10 new actors, each of them spawns 10 more actors, etc. until one million actors are created on the final level. Then, each of them returns back its ordinal number (from 0 to 999999), which are summed on the previous level and sent back upstream, until reaching the root actor. (The answer should be 499999500000).
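(That is, 0 + 1 + ... + 999999 = 999999 * 1000000 / 2 = 499999500000.)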

See also: https://github.com/atemerev/skynet
October 18, 2017
On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw wrote:
> Creates an actor (goroutine, whatever), which spawns 10 new actors, each of them spawns 10 more actors, etc. until one million actors are created on the final level. Then, each of them returns back its ordinal number (from 0 to 999999), which are summed on the previous level and sent back upstream, until reaching the root actor. (The answer should be 499999500000).
>
> See also: https://github.com/atemerev/skynet

If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
October 18, 2017
On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
> On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw wrote:
>> Creates an actor (goroutine, whatever), which spawns 10 new actors, each of them spawns 10 more actors, etc. until one million actors are created on the final level. Then, each of them returns back its ordinal number (from 0 to 999999), which are summed on the previous level and sent back upstream, until reaching the root actor. (The answer should be 499999500000).
>>
>> See also: https://github.com/atemerev/skynet
>
> If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?

import core.thread : Fiber;

class MyFiber : Fiber {
    this(int arguments, string go, float here) {
        super(&run);
        // Save arguments somewhere
    }
    void run() {
        // Use arguments here.
    }
}
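
Usage is then just construction followed by a call (the argument values below are placeholders):

auto fib = new MyFiber(42, "hello", 3.14f);
fib.call(); // runs run() on the calling thread until it returns or yields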

--
  Biotronic
October 18, 2017
On Wednesday, 18 October 2017 at 11:04:10 UTC, Biotronic wrote:
> On Wednesday, 18 October 2017 at 11:01:56 UTC, Per Nordlöw wrote:
>> On Wednesday, 18 October 2017 at 09:01:30 UTC, Per Nordlöw wrote:
>>> Creates an actor (goroutine, whatever), which spawns 10 new actors, each of them spawns 10 more actors, etc. until one million actors are created on the final level. Then, each of them returns back its ordinal number (from 0 to 999999), which are summed on the previous level and sent back upstream, until reaching the root actor. (The answer should be 499999500000).
>>>
>>> See also: https://github.com/atemerev/skynet
>>
>> If Fibers aren't supposed to take any parameters, how are we supposed to pass values to them during creation?
>
> import core.thread : Fiber;
>
> class MyFiber : Fiber {
>     this(int arguments, string go, float here) {
>         super(&run);
>         // Save arguments somewhere
>     }
>     void run() {
>         // Use arguments here.
>     }
> }
>
> --
>   Biotronic

Thanks, I figured that out myself after a while ;)

Another thing... how should the synchronization between the fibers figure out when the total number of fibers has reached one million? Via an atomic counter fed by reference to the constructor, or are there better ways? Because I do need an atomic counter here, right?

And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?
October 18, 2017
On 18.10.2017 14:34, Nordlöw wrote:
> And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?
Probably it will work: every fiber subtracts 1 from its argument, then divides the remainder by the number of child fibers and spawns fibers with that value. So the last fibers get 1 as their argument, subtract 1 from it, get zero, and terminate.
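
A minimal sketch of that counting scheme (the names here are made up, and it only demonstrates the spawn/terminate logic, not the index summing; the root would be started with 1_111_111, the total number of fibers across all seven levels):

import core.thread : Fiber;

class CountingFiber : Fiber {
    size_t _count; // how many fibers this subtree still has to create

    this(size_t count) {
        super(&run);
        _count = count;
    }

    void run() {
        --_count;                 // this fiber accounts for itself
        if (_count == 0) return;  // received 1: a leaf, so terminate
        immutable perChild = _count / 10; // split the remainder among 10 children
        foreach (i; 0 .. 10) {
            auto child = new CountingFiber(perChild);
            child.call();
        }
    }
}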
October 18, 2017
On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
> Another thing... how should the synchronization between the fibers figure out when the total number of fibers has reached one million? Via an atomic counter fed by reference to the constructor, or are there better ways? Because I do need an atomic counter here, right?

This is how I did it:
import core.thread : Fiber;

class MyFiber : Fiber {
    int _depth;
    ulong _index;
    ulong _value;

    this(int depth, ulong index) {
        super(&run);
        _depth = depth;
        _index = index;
    }

    void run() {
        if (_depth == 6) { // 10^6 == 1 million, so stop here.
            _value = _index;
            return;
        }

        _value = 0;
        foreach (i; 0..10) { // Line 23
            auto e = new MyFiber(_depth+1, _index * 10 + i);
            e.call();
            _value += e._value;
        }
    }
}

unittest {
    import std.stdio : writeln;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    auto sw = StopWatch(AutoStart.yes);
    auto a = new MyFiber(0, 0);
    a.call();
    sw.stop();
    assert(a._value == 499999500000);
    writeln(a._value, " after ", sw.peek);
}


> And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?

True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and are run on whichever thread happens to invoke their call() method.
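
A tiny self-contained example of the point (hypothetical names; the fiber is created by the main thread but executes on whichever thread calls it):

import core.thread : Fiber, Thread;
import std.stdio : writeln;

void main() {
    auto fib = new Fiber({ writeln("fiber body executes here"); });
    auto worker = new Thread({ fib.call(); }); // the fiber runs on this thread
    worker.start();
    worker.join();
}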

I experimented a little bit with parallelism, and the easiest definitely is to replace line 23 with this:

foreach (i; taskPool.parallel(10.iota, 1)) {

It seems to make very little difference in terms of run time, though. I tried using a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.
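
For reference, here is that loop spelled out with the imports it needs. Note the atomic add: the serial version's _value += e._value would be a data race once iterations run on pool threads, so this sketch accumulates into a shared temporary instead:

import core.atomic : atomicOp;
import std.parallelism : taskPool;
import std.range : iota;

// inside run(), replacing the serial loop:
shared ulong sum = 0;
foreach (i; taskPool.parallel(10.iota, 1)) { // work unit size of 1
    auto e = new MyFiber(_depth + 1, _index * 10 + i);
    e.call();
    atomicOp!"+="(sum, e._value); // safe concurrent accumulation
}
_value = sum;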

--
  Biotronic
October 18, 2017
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
> It seems to make very little difference in terms of run time, though. I tried using a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.
>
> --
>   Biotronic

Thanks!

Further, are we forced to use the GC for Fiber allocation, or can a subclass of Fiber implement its own allocation strategy?
October 18, 2017
On Wednesday, 18 October 2017 at 12:32:31 UTC, Nordlöw wrote:
> Further, are we forced to use the GC for Fiber allocation, or can a subclass of Fiber implement its own allocation strategy?

Afraid it's set in stone. Now, it doesn't actually use the GC for allocating the stack memory, instead opting for VirtualAlloc on Windows and mmap, valloc or malloc on other platforms. Of course, it's possible to change the implementation in core.thread, but it's probably more work than you want. Copying the entire Fiber class and changing it also runs into some problems, since it depends on private stuff in core.thread.
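
One allocation-related knob that is exposed, though, is the stack size, via a constructor parameter - a sketch (the exact default, and whether a guard-page-size parameter is available, depends on the druntime version):

import core.thread : Fiber;

class SmallFiber : Fiber {
    this() {
        super(&run, 4096); // request a 4 KiB stack instead of the default
    }
    void run() { /* ... */ }
}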

--
  Biotronic
October 18, 2017
On Wednesday, 18 October 2017 at 11:52:08 UTC, Biotronic wrote:
> On Wednesday, 18 October 2017 at 11:34:57 UTC, Nordlöw wrote:
>> Another thing... how should the synchronization between the fibers figure out when the total number of fibers has reached one million? Via an atomic counter fed by reference to the constructor, or are there better ways? Because I do need an atomic counter here, right?
>
> This is how I did it:
> import core.thread : Fiber;
>
> class MyFiber : Fiber {
>     int _depth;
>     ulong _index;
>     ulong _value;
>
>     this(int depth, ulong index) {
>         super(&run);
>         _depth = depth;
>         _index = index;
>     }
>
>     void run() {
>         if (_depth == 6) { // 10^6 == 1 million, so stop here.
>             _value = _index;
>             return;
>         }
>
>         _value = 0;
>         foreach (i; 0..10) { // Line 23
>             auto e = new MyFiber(_depth+1, _index * 10 + i);
>             e.call();
>             _value += e._value;
>         }
>     }
> }
>
> unittest {
>     import std.stdio : writeln;
>     import std.datetime.stopwatch : StopWatch, AutoStart;
>     auto sw = StopWatch(AutoStart.yes);
>     auto a = new MyFiber(0, 0);
>     a.call();
>     sw.stop();
>     assert(a._value == 499999500000);
>     writeln(a._value, " after ", sw.peek);
> }
>
>
>> And how do I parallelize this over multiple worker threads? AFAICT fibers are by default all spawned in the same main thread, right?
>
> True. Well, they're not really spawned on any thread - they're allocated on the heap, have their own stack, and are run on whichever thread happens to invoke their call() method.
>
> I experimented a little bit with parallelism, and the easiest definitely is to replace line 23 with this:
>
> foreach (i; taskPool.parallel(10.iota, 1)) {
>
> It seems to make very little difference in terms of run time, though. I tried using a mix of these approaches - parallel at low depth, basically just to fill up the cores, and serial closer to the leaves. The difference is still negligible, so I assume the losses are elsewhere.
>
> --
>   Biotronic

I ran this under Linux perf, and here is the top of 'perf report':

# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ...........................................................................................
#
     7.34%  t        [kernel.kallsyms]   [k] clear_page
     6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
     5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
     3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
     3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
     3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
     2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
     2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
     2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()

So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).
October 18, 2017
On 18.10.2017 16:37, ikod wrote:
> 
> I ran this under Linux perf, and here is the top of 'perf report':
> 
> # Overhead  Command  Shared Object       Symbol
> # ........  .......  ..................  ...........................................................................................
> #
>       7.34%  t        [kernel.kallsyms]   [k] clear_page
>       6.80%  t        [kernel.kallsyms]   [k] __do_page_fault
>       5.39%  t        [kernel.kallsyms]   [k] __lock_text_start
>       3.90%  t        t                   [.] nothrow core.thread.Fiber core.thread.Fiber.__ctor(void delegate(), ulong)
>       3.73%  t        [kernel.kallsyms]   [k] unmap_page_range
>       3.32%  t        [kernel.kallsyms]   [k] flush_tlb_mm_range
>       2.70%  t        [kernel.kallsyms]   [k] _raw_spin_lock
>       2.57%  t        libpthread-2.23.so  [.] pthread_mutex_unlock
>       2.53%  t        t                   [.] nothrow void core.thread.Fiber.__dtor()
> 
> So it looks like memory management, even though it's not the GC, takes most of the time (if I interpret these numbers correctly).

It depends on how long the test took. If it was short, then of course creating the fibers accounts for a large percentage. Ideally we need to find a test duration at which the memory-management percentages stop changing - that will give more reliable numbers.