January 08
On Tuesday, 8 January 2019 at 10:56:43 UTC, Ola Fosheim Grøstad wrote:
> Let's say you obtain ownership of every other object in an array, then you have to prove that  you also release every other object in that array before you return from the function.

Or rather, that every other object has been acquired (locked to memory) by the context (the calling function).

So, it is quite doable, but you have to use ideas used for proving program correctness, which is a difficult topic.

January 08
On Sunday, 6 January 2019 at 07:48:03 UTC, Manu wrote:
> Ref counting in C++ is a sloppy hack. I really just want the compiler
> to emit opInc and opDec appropriately on either sides of a reference
> assignment, and let me take it from there.

I believe C++ compilers are allowed to do this for shared_ptr, but you would need global optimization for it to make much sense.

But shared_ptr is not frequently used in performance sensitive loops, because it is slow, so it doesn't make a lot of sense for compiler implementors to spend  that much time on it...


January 08
On Tuesday, 8 January 2019 at 09:31:09 UTC, Márcio Martins wrote:
> Is the current implementation of simple MS not improvable in any way?

I'd say it is. I've written up a small benchmark here:

https://gist.github.com/rbehrends/3d4ab87d34e79722272189abc68eee00

Options:

* D with its native GC, compiled with ldc 1.13.0.
* jemalloc, version 5.1.0.
* The Boehm GC, version 8.0.2 with sequential marking.
* The Boehm GC with parallel marking.
* The Boehm GC using incremental GC.
* The Boehm GC with explicit deallocation instead of GC.
* System malloc on macOS 10.12.6.

This is a version of the "binary trees" benchmark [1], as it was the one allocation-oriented benchmark readily available and easily portable between languages. Because it is a very specialized benchmark (it allocates a *lot* of small fixed-size objects), I'll caution against generalizing its results, but it can give us some insights all the same.

Here's a sample run with several GC options enabled, run on a 2.6GHz Core i7 with four cores; code that doesn't do allocation/collection (i.e. checksum()) accounts for less than two seconds of runtime. Most actual tuned non-functional programs will spend only 10%-20% doing allocation and collection, so this is not terribly representative of real applications and exaggerates differences in performance.

$ make benchmark DEPTH=21
# D garbage collector
/usr/bin/time ./btree-d 21 >/dev/null
       34.24 real        33.96 user         0.21 sys
# jemalloc explicit malloc()/free()
/usr/bin/time ./btree-jemalloc 21 >/dev/null
       21.62 real        21.35 user         0.22 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 /usr/bin/time ./btree-gc 21 >/dev/null
        9.39 real        12.02 user         0.18 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 /usr/bin/time ./btree-gc 21 >/dev/null
       11.42 real        11.34 user         0.06 sys
# Boehm GC with explicit deallocation
GC_MARKERS=1 /usr/bin/time ./btree-gc-free 21 >/dev/null
       13.12 real        13.05 user         0.05 sys
# Boehm GC with incremental collection (single-threaded)
/usr/bin/time ./btree-gc-inc 21 >/dev/null
       20.74 real        17.21 user         6.29 sys
# System malloc()/free()
/usr/bin/time ./btree-sysmalloc 21 >/dev/null
       67.73 real        66.97 user         0.74 sys

The fastest version is the parallel-mark version of the Boehm-Demers-Weiser GC, which beats out all the competitors. The rear is brought up by system malloc() (to nobody's surprise, standard malloc()/free() on a Mac is dog slow).

The Boehm GC benchmarks are run with interior pointers disabled in order to prevent extra padding for the small objects that would blow up the heap size, but even with interior pointers enabled, the relative performance doesn't change all that much; the single-threaded run clocks in at around 16 seconds instead, still twice as fast as the D GC, and the version with parallel marking finishes in around 13 seconds. It is worth noting that the Boehm GC still does go through all the conservative marking machinery and does check interior pointers from the stack regardless, so a precise or semi-precise collector may still be able to gain additional performance.

Parallel marking gives us a noticeable speedup in wall clock time and a really big one in GC pauses (not displayed here) at the expense of using more CPU time.

Incremental collection is super expensive in comparison, due to the cost of mprotect(), for code that technically would not require write barriers. It still is on par with jemalloc() for throughput.

In terms of GC pause times, we get the following for the Boehm GC:

$ time GC_MARKERS=4 ./btree-gc 23 | tail -5
      88 collections
   0.056 seconds/collection
   0.080 max seconds/collection
     818 MB heap size
     306 MB free bytes

This is again with parallel marking (four marker threads) and using a larger heap. The collection times are wall clock times, based on gettimeofday(). Note that most heaps will contain at least some non-pointer data, ours is almost 100% pointers. With single-threaded marking, the maximum collection time more than triples to nearly .3 seconds.

As an interesting aside, the maximum pause time for explicitly deleting trees in the benchmark is much larger than the GC pause time (nearly a second for jemalloc()) and would remain larger even if the delete tree operation were parallelized. This is a good reminder that cascading deletes in RAII can also be problematic for pause times and generally require tradeoffs (deferred/concurrent deletion, sacrificing timely destructor calls) in order to work in real time systems.


[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html
January 08
On 2019-01-07 20:46, H. S. Teoh wrote:

> I think the idea is probably to use struct wrappers to implement RC,
> i.e., you'd have a struct wrapping a class reference of the object it's
> managing.  As long as the struct can have semantics compatible with RC,
> it can be used to implement RC. At least in theory.

That will be quite ugly. Will these struct wrappers be covariant if the types they wrap are covariant? I mean, is that possible to implement?

-- 
/Jacob Carlborg
January 08
On Sunday, 6 January 2019 at 03:42:05 UTC, Walter Bright wrote:
> On 1/5/2019 2:05 PM, Manu wrote:
>> How much truth is in here?
>
> There's a lot of triggering on the word "GC". At some level, it doesn't matter how good or bad the GC is, it's "GC is bad". Even if your code never calls the GC.
>
>
>> What is this business about write barriers making the GC fast? Is that
>> actually fact, or are there other hurdles?
>
> Here's an explanation:
>
> https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works
>
> There are many more if you google "gc write barrier".
>
> The penalty, of course, is extra code gets run for every write through a pointer.
>
> The benefits exceed the penalties for a language where all dynamic allocation is done via the GC. But for D, where quite a lot of it is not, it isn't so clear. What is clear is that if you're going to compare speed with C++, having those write barriers in there is going to be a serious penalty, because they'll be there even if you don't use the GC.
>
> > Where's the ARC stuff? What happened to opAddRef/opRelease?
>
> Andrei, Razvan, and I have decided that the most pragmatic way forward to support reference counting is to support copy constructors ala C++. C++ is clearly successful with this approach, and Andrei/Razvan concluded that D's postblit just doesn't cut it. Razvan has a proposal for copy constructors, and an implementation PR that has fallen a bit behind his latest proposal.

Topics: get a better understanding of D's application bottlenecks
It could be interesting to be able to analyze trace execution and gc use with tools designed for. As example the process minning (a scientific field) seem to be interesting in order to understand what happen and where are the bottllenecks.
With such tools people will blame the GC only when he is responsible. So a tool which would be able to generate an event trace to the file format xes (http://www.processmining.org/logs/xes) should be awesome.
Indeed they are somme tools such as Prom (http://www.promtools.org/doku.php) which are able to analyse and represent it. The developpers will get a better understanding with the use with such (animated) representation: http://fluxicon.com/blog/wp-content/uploads/2013/11/Disco-Process-Mining-Animation.gif .

Best regards
January 08
On 1/7/19 4:57 PM, Neia Neutuladh wrote:
> On Mon, 07 Jan 2019 13:06:22 -0800, H. S. Teoh wrote:
>> Hmph, good point.  That sux. :-(  I suppose that's where copying/moving
>> the object comes in -- migrate it to a different pool somehow so that we
>> can avoid stop-the-world.
> 
> class Node
> {
>    enum Type { leaf, intermediate }
>    Type type;
>    union { Node child; ulong data; }
> }
> auto node = new Node;
> node.type = Node.Type.leaf;
> node.data = cast(ulong)cast(void*)node;
> 
> How do you copy this?

You copy the memory, then update the references to that memory.

Or you make the user explicitly copy, and disallow the cast.

> Pinning avoids those problems, but it still involves scanning memory as if
> doing a collection.

Even if you pin, if you cast the above from shared to unshared, you then have made it so the thread-local pool cannot be scanned independently.

If we get thread-local pools, and all scans have to be done on all thread local pools along with the shared pool, we haven't gained anything.

-Steve
January 08
On Tue, 08 Jan 2019 09:21:23 -0500, Steven Schveighoffer wrote:
> You copy the memory, then update the references to that memory.

Again, in the case of unions and void[], how do you determine if something is a reference to that memory instead of a false pointer?

Moving collectors *have* to be fully precise, and D doesn't allow a fully precise collector.

> Even if you pin, if you cast the above from shared to unshared, you then have made it so the thread-local pool cannot be scanned independently.

Because the reference to an object in the current thread might be stored inside an object allocated by a different thread's GC, and that might have happened without a cast. Fixing that would require redesigning shared.

> If we get thread-local pools, and all scans have to be done on all thread local pools along with the shared pool, we haven't gained anything.

We can shorten the stop-the-world phase to the size of shared memory. If you're only sharing a little data and have a decent number of threads, this would be a big improvement.
January 09
On Tuesday, 8 January 2019 at 18:04:26 UTC, Neia Neutuladh wrote:
> We can shorten the stop-the-world phase to the size of shared memory. If you're only sharing a little data and have a decent number of threads, this would be a big improvement.

Pointer storage isn't limited to shared memory?


January 09
On Wed, 09 Jan 2019 00:24:14 +0000, Ola Fosheim Grøstad wrote:
> On Tuesday, 8 January 2019 at 18:04:26 UTC, Neia Neutuladh wrote:
>> We can shorten the stop-the-world phase to the size of shared memory. If you're only sharing a little data and have a decent number of threads, this would be a big improvement.
> 
> Pointer storage isn't limited to shared memory?

1. Stop all the threads.
2. Scan shared memory: everything that's been allocated as shared,
everything that's been explicitly casted to shared, and everything
reachable from those objects. Use that to mark things in the current
thread's local GC heap as reachable.
3. Resume every thread but this one.
4. Run mark/sweep on the current thread's heap.
5. Resume this thread.

If you have a small amount of shared memory, the stop-the-world phase would be pretty short.

As an alternative, this could hijack all the threads to run a parallel mark/sweep, where each individual thread can resume as soon as its own local mark/sweep finishes. If you have a worker thread that deals with very few pointers, it can resume very quickly after the shared mark/sweep phase.
January 09
On Tuesday, 8 January 2019 at 09:39:07 UTC, Francesco Mecca wrote:
> As you can see the pause times are way lower.
> This is true for many other small programs used for benchmarking.
> Back in the old TangoRT times the concurrent GC even improved the program execution time but this is no longer true mainly because Martin Nowak implemented a similar strategy for the eager allocation of pools.
>
> I was already contacted by Mike in order to report the status of the project for the SAOC so I hope we can disclose more in the upcoming weeks.

That is just plain awesome. Good to see work is being done on that front.
Next ›   Last »
1 2 3 4 5 6 7 8