Thread overview
Using tasks without GC?
Jan 04, 2020
Chris Katko
Jan 04, 2020
H. S. Teoh
Jan 04, 2020
Elronnd
Jan 04, 2020
dwdv
Jan 06, 2020
Chris Katko
January 04, 2020
When I program, it's usually videogame ideas. That implies a soft real-time requirement, which in general means the mantra "allocations are evil; use object pools whenever possible." [Store data in static arrays; "deleting" usually just means marking an entry as is_deleted=true and re-using "dead" ones.]
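To illustrate the pattern in D (a rough sketch; Bullet and the pool size are made up):

struct Bullet { float x, y; bool is_deleted = true; }

Bullet[4096] pool;  // lives in static storage, sized once up front

Bullet* spawn() {
    foreach (ref b; pool)
        if (b.is_deleted) { b.is_deleted = false; return &b; }
    return null;  // pool exhausted; caller decides what to do
}

void despawn(Bullet* b) { b.is_deleted = true; }  // just mark it dead for reuse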

I'm looking through D's parallelism module and the docs state, up-front:

 >Creates a Task on the GC heap that calls an alias.

The modern, scalable way to make a parallel game engine uses tasks (as opposed to functional decomposition, where one thread does networking, one thread does physics, etc.). That requires making LOTS of tasks (_per frame_!) and dispatching them. And a 60 FPS frame time is... 16 ms or less.

So my question is: Has anyone done any analysis of how "dangerous" it is to use GC'd tasks for _small_ tasks (in terms of milliseconds)? Is the GC going to fire off all the time and send jitter off the charts? Because while a freeze-the-world pause of 20 milliseconds would probably be unnoticeable in many business apps, it would completely break a videogame.

I wonder how difficult it would be to modify the existing parallel task codebase (or write my own?) to use static pools instead: allocate an array of MAX_NUM_TASKS tasks once (eating the minor memory hit) so no allocation is ever touched, roughly as sketched below. [Even without a GC, allocating every frame in, say, C++ is dangerous: malloc/new is slow and subject to fragmentation and permission checks.]
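Something like this is what I have in mind (hypothetical sketch; Job, MAX_NUM_TASKS, and dispatchFrame are made up, and I haven't checked how much parallel foreach itself allocates internally):

import std.parallelism;

enum MAX_NUM_TASKS = 4096;

struct Job { void delegate() run; }  // hypothetical per-frame work item

Job[MAX_NUM_TASKS] jobs;  // allocated once, reused every frame
size_t jobCount;

void dispatchFrame() {
    // parallel foreach over the filled slice instead of one GC'd Task per job
    foreach (ref job; taskPool.parallel(jobs[0 .. jobCount]))
        job.run();
    jobCount = 0;  // "free" everything by resetting the counter
}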

Any advice, thoughts? Thanks,
--Chris Katko

January 04, 2020
On Sat, Jan 04, 2020 at 12:51:15AM +0000, Chris Katko via Digitalmars-d-learn wrote: [...]
> I'm looking through D's parallelism module and the docs state, up-front:
> 
>  >Creates a Task on the GC heap that calls an alias.
> 
> The modern, scalable way to make a parallel game engine uses tasks (as opposed to functional decomposition, where one thread does networking, one thread does physics, etc.). That requires making LOTS of tasks (_per frame_!) and dispatching them. And a 60 FPS frame time is... 16 ms or less.
[...]

I don't know the inner workings of std.parallelism well enough to answer your question directly, but couldn't you just spawn a bunch of workers that can be assigned tasks? You allocate up-front when first spawning the workers, and during the main loop you have a fixed number of workers that receive and execute tasks, so you never allocate at all after that initial up-front allocation.

Generally, doing a lot of small GC allocations is bad for performance. I would avoid that and instead use a fixed number of workers that you create, say, at the beginning of every level or something, and thereafter never deallocate: just leave them idle if there are no tasks, and otherwise have some dispatcher assign tasks to them.
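Something along these lines, as a rough sketch with std.concurrency (the int/string message types are just for illustration):

import std.concurrency;
import std.parallelism : totalCPUs;

void workerLoop() {
    bool running = true;
    while (running) {
        receive(
            (int jobId)  { /* execute the work for jobId */ },
            (string cmd) { if (cmd == "quit") running = false; }
        );
    }
}

void main() {
    Tid[] workers;
    foreach (i; 0 .. totalCPUs)  // spawn workers once, up front
        workers ~= spawn(&workerLoop);

    foreach (int job; 0 .. 100)  // dispatch without further allocation
        workers[job % workers.length].send(job);

    foreach (w; workers)
        w.send("quit");
}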


T

-- 
Latin's a dead language, as dead as can be; it killed off all the Romans, and now it's killing me! -- Schoolboy
January 04, 2020
You can control when the GC runs. So if you know the allocations are small enough that they won't OOM, you can say GC.disable, and it straight up won't run at all. But you can manually run a collection cycle (during a loading screen or whatever) with GC.collect. See http://dpldocs.info/experimental-docs/core.memory.GC.html
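For example (a minimal sketch; where you put the manual cycle is up to you):

import core.memory : GC;

void main() {
    GC.disable();   // no automatic collection cycles from here on
    // ... run the game loop; allocations still happen, nothing gets collected ...
    GC.collect();   // manual cycle at a safe point, e.g. a loading screen
    GC.minimize();  // optionally return unused pages to the OS
}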
January 04, 2020
> Creates a Task on the GC heap that calls an alias.

If possible, there's also scopedTask, which allocates on the stack: https://dlang.org/phobos/std_parallelism.html#.scopedTask

> So my question is: Has anyone done any analysis of how "dangerous" it is to use GC'd tasks for _small_ tasks (in terms of milliseconds)?

Nothing major, but https://github.com/mratsim/weave/tree/master/benchmarks/fibonacci puts quite a bit of pressure on various implementations. You might want to profile ./pfib 40:

import std;

ulong fib(uint n) {
    if (n < 2) return n;

    auto x = scopedTask!fib(n-1); // stack-allocated; vs. task!fib(n-1) on the GC heap
    taskPool.put(x);              // submit to the default worker pool
    auto y = fib(n-2);            // compute the other half on this thread
    return x.yieldForce + y;      // {yield,spin,work}Force
}

void main(string[] args) {
    enforce(args.length == 2, "Usage: fib <n-th fibonacci number requested>");
    auto n = args[1].to!uint;
    // defaultPoolThreads(totalCPUs);
    writefln!"fib(%d) = %d"(n, fib(n));
}

At least D isn't locking up beyond 12 tasks; looking at you, stdlib-nim. :)
January 06, 2020
Thanks everyone, looks like I'll have to benchmark it myself (which is fine), but I'm always afraid because I know "proper benchmarking is hard (TM)."

Feel free to throw any other side advice in. I'm looking to get a broad perspective on this.

Straight up shutting off the garbage collector in exchange for memory is an interesting concept, but I wonder how quickly that memory will get eaten up. ALSO, if I do shut it off, is collection going to take much longer when I turn it back on? Does the GC take linearly / quadratically more time based on the number of items that need to be freed?
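I suppose I could at least time it directly, something like this (a sketch; I'm assuming core.memory's GC.stats reports heap usage the way I think it does):

import core.memory : GC;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

void timedCollect() {
    auto before = GC.stats.usedSize;  // bytes in use before the cycle
    auto sw = StopWatch(AutoStart.yes);
    GC.collect();
    writeln("collect: ", sw.peek.total!"msecs", " ms, used ",
            before, " -> ", GC.stats.usedSize, " bytes");
}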

The thing is, I'm looking to parallelize a dedicated server running MASSIVE numbers of objects (10,000s or 100,000s) at a high tick rate. Worse, for my particular "game mode", the "rounds" can last hours. So while a slow crawl of RAM is okay, it won't be okay if it hits 30 GB in an hour. So that's going to be another "it works IF it works [in our particular game/application]" as opposed to "this definitely will/won't work". That's more "benchmark and see", which I'm trying to avoid as much as possible. You normally don't want to willfully START a project with a design where the only way to know if it works is to bench.

There's also the option of periodically firing off a collection (without ending the game), say once every hour, and telling everyone to just "live with it", because it's (I HOPE!) not going to be a 15/30/60-second delay. In my particular application, that still wouldn't be that disruptive. (Unless turning the GC off hits max RAM every couple of minutes. Then again, nobody is going to be okay with that.)
