Thread overview
October 06
https://issues.dlang.org/show_bug.cgi?id=17881

Andrei Alexandrescu <andrei@erdani.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrei@erdani.com

--- Comment #1 from Andrei Alexandrescu <andrei@erdani.com> ---
This seems like a tall order. It's essentially pushing a specific management need on to the GC, complicating everyone's life in the process.

If an application needs to preallocate, there's an API for that already - it's called GC.allocate() :o). Then the application needs to do its own bookkeeping.

--
October 06
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #2 from Steven Schveighoffer <schveiguy@yahoo.com> ---
The use case is different than for GC.allocate.

I want to fill in a structure of nodes, I know I'm going to fill it in with 10,000 elements, so I'm going to allocate them all in a loop. But I don't want the GC to keep the ENTIRE thing in memory if just one element is still pointed at. And I don't want to run the GC constantly in my tight loop allocating each node.

Think of a tree constructor, or an AA constructor.

Essentially it's the same as array.reserve, but for individual blocks instead of a contiguous single block.

--
October 06
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #3 from Andrei Alexandrescu <andrei@erdani.com> ---
>The use case is different than for GC.allocate.

The semi-joke was this is all needed.

>I want to fill in a structure of nodes, I know I'm going to fill it in with 10,000 elements, so I'm going to allocate them all in a loop. But I don't want the GC to keep the ENTIRE thing in memory if just one element is still pointed at. And I don't want to run the GC constantly in my tight loop allocating each node.

The solution is to prefill a freelist, then let go of the extra elements remaining.

>Think of a tree constructor, or an AA constructor.

>Essentially it's the same as array.reserve, but for individual blocks instead of a contiguous single block.

That's not the job of the allocator. It's the job of the user of the allocator. The allocator gives you memory. You organize it as you find fit.

--
October 06
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #4 from Steven Schveighoffer <schveiguy@yahoo.com> ---
Stevens-MacBook-Pro:testd steves$ cat testpreallocate.d
struct S
{
    S *next;
}

void main()
{
    version(freelist)
    {
        S* head = null;
        foreach(i; 0 .. 20_000_000)
            head = new S(head);
    }
    version(preallocate)
    {
        S* head = null;
        auto x = new S[20_000_000];
        foreach(i; 0 .. 20_000_000)
        {
            x[i] = S(head);
            head = &(x[i]);
        }
    }
}
Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d
-version=freelist
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate

real    0m1.869s
user    0m1.750s
sys    0m0.114s
Stevens-MacBook-Pro:testd steves$ dmd -O -release testpreallocate.d
-version=preallocate
Stevens-MacBook-Pro:testd steves$ time ./testpreallocate

real    0m0.111s
user    0m0.045s
sys    0m0.062s


The point is that the GC is not well-equipped to handle the tight allocation loop.

The second version has the drawback that all 20 million elements will remain in memory as long as there is one element still alive.

What I'm looking for is something that has the performance (or similar, I realize it won't be as good) of the second, but can collect each block individually like the first.

--
October 07
https://issues.dlang.org/show_bug.cgi?id=17881

Rainer Schuetze <r.sagitario@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |r.sagitario@gmx.de

--- Comment #5 from Rainer Schuetze <r.sagitario@gmx.de> ---
A slightly simpler API could be to add allocating N same-sized chunks from the GC that returns them in a free-list-like chain.

I agree with Andrei that we should not complicate the GC interface for every possible allocation pattern a user might want to optimize for, though.

If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before inserting
elements, there should be no collection while filling the AA.

If we add thread local free-lists to the GC, the overhead of allocating these from the GC instead of caching them in the AA would be rather small.

--
October 07
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #6 from Steven Schveighoffer <schveiguy@yahoo.com> ---
(In reply to Rainer Schuetze from comment #5)
> A slightly simpler API could be to add allocating N same-sized chunks from the GC that returns them in a free-list-like chain.

I think an array is best, since the data already is an array, and a free-list requires building a linked list before-hand.

> I agree with Andrei that we should not complicate the GC interface for every possible allocation pattern a user might want to optimize for, though.

Hm... I'm trying to find the most generic interface for this. It's not necessarily limited to AA, as allocating a bunch of blocks in a loop isn't an uncommon thing.

If there was a way to "re-flavor" the blocks from one giant block into individual ones, then we could do this outside the GC.

> If you call GC.reserve(20_000_000*(Key.sizeof+Value.sizeof)) before
> inserting elements, there should be no collection while filling the AA.

Limiting the collection in the example I posted shaves off about 300-400msec, but it still is 1.5 seconds vs. 0.1 for the local array version.

> If we add thread local free-lists to the GC, the overhead of allocating these from the GC instead of caching them in the AA would be rather small.

Agreed, I think keeping the lists inside the GC is the most useful, and does not expose any implementation details to the user.

--
October 09
https://issues.dlang.org/show_bug.cgi?id=17881

Alexandru Razvan Caciulescu <alexandru.razvan.c@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexandru.razvan.c@gmail.co
                   |                            |m
           Assignee|nobody@puremagic.com        |alexandru.razvan.c@gmail.co
                   |                            |m

--
October 11
https://issues.dlang.org/show_bug.cgi?id=17881

safety0ff.bugz <safety0ff.bugz@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |safety0ff.bugz@gmail.com

--- Comment #7 from safety0ff.bugz <safety0ff.bugz@gmail.com> ---
(In reply to Steven Schveighoffer from comment #6)
> > If we add thread local free-lists to the GC, the overhead of allocating these from the GC instead of caching them in the AA would be rather small.
> 
> Agreed, I think keeping the lists inside the GC is the most useful, and does not expose any implementation details to the user.

I think this is the only sane way of doing it. i.e. without over constraining the GC implementation or exposing the user to implementation details.

I think the API should obviously should have the size of the elements as a separate arguments and provide as much information as malloc e.g.: GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)

Otherwise it relies on hope that the GC will pull from the same pool.

--
October 11
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #8 from Steven Schveighoffer <schveiguy@yahoo.com> ---
(In reply to safety0ff.bugz from comment #7)
> I think the API should obviously should have the size of the elements as a separate arguments and provide as much information as malloc e.g.: GC.preload(size_t num, size_t size, uint ba = 0, const TypeInfo ti = null)

Internally, I think something like this is needed. What I am looking for, though, is the high-level API of "I have 20 million T's I want to allocate, give me them properly allocated and typed one at a time". I don't necessarily even think you need the TypeInfo for the low level API. Perhaps the bit attributes can even be done as you allocate each block.

--
October 11
https://issues.dlang.org/show_bug.cgi?id=17881

--- Comment #9 from safety0ff.bugz <safety0ff.bugz@gmail.com> ---
(In reply to Steven Schveighoffer from comment #8)
> Internally, I think something like this is needed. What I am looking for, though, is the high-level API of "I have 20 million T's I want to allocate, give me them properly allocated and typed one at a time". I don't necessarily even think you need the TypeInfo for the low level API. Perhaps the bit attributes can even be done as you allocate each block.

Here's the rough sketch of one idea:

void[] gc_nmalloc(size_t num, size_t size, void[] prev = [], uint ba = 0, const
TypeInfo ti = null)
// prev is used for temporary pinning/unpinning a memory range

struct GC // in core.memory
{
  auto NMalloc(T)(size_t num) { // convenience function
     int loop(int delegate(T*) dg) {
       void[] prev; size_t remain = num;
       scope(exit) gc_nmalloc(0,T.sizeof,prev); // unpin last chunk
       int result;
       while (remain) {
         prev = gc_nmalloc(remain, T.sizeof, prev, ...);
         auto chunk = cast(T*)prev.ptr[0 .. prev.length/T.sizeof];
         foreach (ref e; chunk)
           if ((result = dg(&e))) return result;
         remain -= chunk.length;
       }
       return result;
     }
     return &loop;
  }
}

So user code could look like:
foreach (e; GC.NMalloc(T)(20_000_000))
    // do something with allocated e

--