December 12, 2011
On 12/12/2011 04:17 PM, Robert Jacques wrote:
> On Mon, 12 Dec 2011 01:06:14 -0500, Brad Anderson <eco@gnuk.net> wrote:
>
>> On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford@jhu.edu>
>> wrote:
>>> Second, being a systems language means that D can not implement a lot of
>> GC algorithms including copying, generational and the good concurrent
>> collectors.
>>
>> What about being a systems language prevents generational? The page on
>> garbage collection on the D website says while it doesn't use a
>> generational GC it will some day and gives tips on what to avoid so you
>> don't fall victim to the behavior of a moving GC.
>
> Regarding moving collectors.
> D supports semi-precise collection. Therefore D can support some types
> of moving collectors, i.e. compactors, which is what the website is
> talking about. Copying collectors, on the other hand, require full
> precision to work; you must be able to fully evacuate a region of
> memory. D doesn't support full precision, for a couple of performance
> (unions,

The compiler could insert small code fragments that track whether or not a union contains a pointer.

> the call stack,

What is the problem with the call stack? Can't the compiler just generate reference offset information for all the function frames and then the GC generates a backtrace to identify the references?

> C/C++ interop)

There should be multiple options for GC. If C/C++ interop is unimportant, a better GC that does not support it well is still handy.

> and technical reasons (the inline
> assembler).

Inline assembler does not always move GC heap references around. I think that in the cases where it does, reference stores could be annotated manually.

>
> Regarding generational collectors.
> Both generational and concurrent collectors require that every pointer
> assignment is known to the compiler, which then instruments the
> assignment to flag mark bits, etc. For generational collectors, you need
> this information to know which objects/memory pages to search for roots
> into the young generation. Without this information, you have to search
> the entire heap, i.e. do a full collection. Again, both performance and
> technical reasons come into play here. Instrumentation represents a
> performance cost, which, even if it pays for itself, looks bad in
> newsgroup postings. Indeed, concurrent collectors are mostly about
> trading throughput for latency. So, like Java, you'd want to use version
> statements to select your GC style, but you'd also have to make sure
> your entire codebase was compiled with the same flags; with 3rd party
> DLLs and objects, this can become non-trivial. From a technical
> perspective, complete pointer assignment instrumentation is a
> non-starter because the D compiler doesn't have complete access to all
> the code; both C/C++ and assembler code can modify pointers and are not
> subject to instrumentation. Now, if we supported C/C++ through
> marshaling, like Java and C# do, and made the assembler a bit smarter
> or required manual pointer instrumentation of asm code, we could use
> these types of collectors.
>
> * Note that the above doesn't take into account the types of virtual
> memory tricks C4 can do, which may open these algorithms up to D and
> other system programming languages.

I think we'll definitely need a generational/concurrent collector eventually. Could some of the problems be worked around by having more than one GC implementation in the same executable?


December 12, 2011
On 12 December 2011 17:44, Timon Gehr <timon.gehr@gmx.ch> wrote:

> [snip - full quote of the preceding exchange]

Totally off topic... I have a question.

If I pass a pointer to a C library... how does the GC know when it's safe to collect?


December 12, 2011
On Monday, 12 December 2011 at 16:00:04 UTC, Manu wrote:
> Totally off topic... I have a question.
>
> If I pass a pointer to a C library... how does the GC know when it's safe
> to collect?

It doesn't. You should add it to the GC with GC.addRoot or GC.addRange unless you're absolutely sure the pointer is never taken off the stack (the call stack is shared between the C and D code). Once you're sure the memory is no longer referenced from the C side, you can remove the root or range with GC.removeRoot or GC.removeRange.
December 12, 2011
On Monday, December 12, 2011 17:59:54 Manu wrote:
> Totally off topic... I have a question.
> 
> If I pass a pointer to a C library... how does the GC know when it's safe to collect?

It assumes that it's safe to collect it. If you pass a pointer to a C library and it keeps it, then you need to make sure that your D code maintains a reference to that data so that the GC doesn't collect it.

- Jonathan M Davis
December 13, 2011
On Mon, 12 Dec 2011 10:44:27 -0500, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 12/12/2011 04:17 PM, Robert Jacques wrote:
>> On Mon, 12 Dec 2011 01:06:14 -0500, Brad Anderson <eco@gnuk.net> wrote:
>>
>>> On Sun, Dec 11, 2011 at 10:55 PM, Robert Jacques <sandford@jhu.edu>
>>> wrote:
>>>> Second, being a systems language means that D can not implement a lot of
>>> GC algorithms including copying, generational and the good concurrent
>>> collectors.
>>>
>>> What about being a systems language prevents generational? The page on
>>> garbage collection on the D website says while it doesn't use a
>>> generational GC it will some day and gives tips on what to avoid so you
>>> don't fall victim to the behavior of a moving GC.
>>
>> Regarding moving collectors.
>> D supports semi-precise collection. Therefore D can support some types
>> of moving collectors, i.e. compactors, which is what the website is
>> talking about. Copying collectors, on the other hand, require full
>> precision to work; you must be able to fully evacuate a region of
>> memory. D doesn't support full precision, for a couple of performance
>> (unions,
>
> The compiler could insert small code fragments that track whether or not
> a union contains a pointer.
>
>> the call stack,
>
> What is the problem with the call stack? Can't the compiler just
> generate reference offset information for all the function frames and
> then the GC generates a backtrace to identify the references?
>
>> C/C++ interop)
>
> There should be multiple options for GC. If C/C++ interop is
> unimportant, a better GC that does not support it well is still handy.
>
>> and technical reasons (the inline
>> assembler).
>
> Inline assembler does not always move GC heap references around. I think
> that in the cases where it does, reference stores could be annotated manually.

Solving these problems is a little more complicated than this, but the existence of solutions wasn't the problem; the performance of those solutions was.

Unions can be instrumented, but the instrumentation must be harmonized with the GC, as pointer bitmasks must be changed. And changeable bitmasks present their own overhead problem (i.e. a 1/32 or 1/64 memory overhead). Personally, I'm in favor of a precise heap, so I'd like this, but it does require a compiler change, not a runtime change.

The call stack is more complicated, because you have to view the function frame as a bunch of unions: space in the frame is heavily reused to minimize cache effects and the chance of a stack overflow. So, before every function call, you'd have to flush the current reference offsets of the frame. *oops* I was remembering why dual stacks don't work, and the same logic applies to function frames. A dual-stack approach would use one stack for references and one for values. However, pointers to unions/structs/classes on the dual stack don't work, because the values and pointers in those objects are no longer adjacent. Similarly, pointers to unions on a precise/framed stack are troublesome, because the assignment wouldn't know where the metadata was. Hmm... The GC could keep a pointer bitmask for each stack, which each stack frame flushes prior to every function call.

C/C++ interop is never unimportant; at a minimum, all OS calls work through C/C++ interop. So while it may not be used directly by most D programs, under the hood the libraries we use all depend on it. We'd have to be able to annotate C functions with their marshaling requirements and change how we handle them based on the GC we're using. Things get even more icky when one considers registering D callbacks or C functions calling into D DLLs.

As for the inline assembler, we'd have to annotate changes to both the GC heap and the call stack, and since this would have to be done manually, it would become a source of nasty memory bugs.

Technically, the performance of certain fundamental operations is an implementation detail, but there's a certain expectation that 'systems' languages be lean and mean, either optionally or by default, and pointer tracking eliminates that option. Furthermore, pointer tracking interferes with language interop in rather nasty ways and may even necessitate changes to the ABI. But that doesn't mean a D dialect couldn't use a nice modern GC; indeed, the .NET D compiler already did, and IIRC LDC/LLVM has a JIT backend.

>> Regarding generational collectors.
>> Both generational and concurrent collectors require that every pointer
>> assignment is known to the compiler, which then instruments the
>> assignment to flag mark bits, etc. For generational collectors, you need
>> this information to know which objects/memory pages to search for roots
>> into the young generation. Without this information, you have to search
>> the entire heap, i.e. do a full collection. Again, both performance and
>> technical reasons come into play here. Instrumentation represents a
> performance cost, which, even if it pays for itself, looks bad in
> newsgroup postings. Indeed, concurrent collectors are mostly about
> trading throughput for latency. So, like Java, you'd want to use version
>> statements to select your GC style, but you'd also have to make sure
>> your entire codebase was compiled with the same flags; with 3rd party
>> DLLs and objects, this can become non-trivial. From a technical
>> perspective, complete pointer assignment instrumentation is a
>> non-starter because the D compiler doesn't have complete access to all
>> the code; both C/C++ and assembler code can modify pointers and are not
> subject to instrumentation. Now, if we supported C/C++ through
> marshaling, like Java and C# do, and made the assembler a bit smarter
> or required manual pointer instrumentation of asm code, we could use
>> these types of collectors.
>>
>> * Note that the above doesn't take into account the types of virtual
>> memory tricks C4 can do, which may open these algorithms up to D and
>> other system programming languages.
>
> I think we'll definitely need a generational/concurrent collector
> eventually.

Thread-local collectors are easy to do in D, much easier in fact than in the majority of other languages, and have performance similar to the current set of generational/concurrent collectors. And generational collectors mainly exist to account for the fact that Java doesn't have structs. So while I do think D needs a modern collector, I don't think a generational/concurrent collector is absolutely required.

> Could some of the problems be worked around by having more
> than one GC implementation in the same executable?

Yes and no. GCs that require instrumentation to function would need entirely separate codebases, so supporting two different GCs would require fat objects of some kind (i.e. N different executables).
December 13, 2011
On 11.12.2011 17:33, Andrei Alexandrescu wrote:
> On 12/11/11 9:23 AM, dsimcha wrote:
>> My optimizations make very little difference on this benchmark, but for
>> good reason: It's not a very good GC benchmark. I ran it with my GC
>> profiling code enabled and it only spends around 10% of its execution
>> time in GC. We need to figure out why else this benchmark may be so slow.
>
> I'll venture an opinion (without having profiled the code).
>
> The code uses BigInt, which makes eager allocations and custom work for
> each operation. But a good portion of the loop is spent using small
> numbers, so using the small integer optimization and native operations
> helps a lot.
>
> I think we need to add reference counting and small int optimization to
> BigInt to make it competitive.
>
>
> Andrei

Yeah. Also important to note that BigInt was written assuming that something like TempAlloc would become available; its GC usage is terrible without it.
OTOH, it looks like this is a case where it'd be much faster to use fixed-length integers rather than BigInt (I think that's true of nearly everything on RosettaCode) -- strict upper bounds are known on the length of the integers; they aren't arbitrary precision.
I think these are toy examples, though; they're completely unrepresentative of real-world code.
December 13, 2011
Don:

> OTOH looks like this is a case where it'd be much faster to use fixed length integers rather than BigInt

There's a fixed length integers version too:
http://rosettacode.org/wiki/Count_the_coins#128-bit_version
But it's still slower than the Java code, maybe because of the DMD back-end :-)


> I think these are toy examples, though, they're completely unrepresentative of real-world code.

It's a toy example, but I have seen the low performance of D BigInts compared to Python's flexible ints several times, so I think this toy example shows a more general pattern.

Bye,
bearophile
December 13, 2011
Robert Jacques Wrote:
> >>>> Second, being a systems language means that D can not implement a lot of
> >>> GC algorithms including copying, generational and the good concurrent collectors.

I disagree, as there are other systems languages with better GC algorithms than D, because they offer safer feature sets than D; the lack of an inline assembler is one example.

And regardless of what you may think about these languages' suitability for systems programming, there are research operating systems written in them, with lots of papers to read. That is something I have yet to see from D.

Yet when reading how Singularity was implemented, there are lots of parallels between what Sing# offers and what D does. So I really think there is still quite some room to improve D's GC.
December 13, 2011
Don:

> OTOH looks like this is a case where it'd be much faster to use fixed length integers rather than BigInt (I think that's true of nearly everything in RosettaCode) -- strict upper bounds are known on the length of the integers, they aren't arbitrary precision.

Two problems with this idea:
1) The Python version of this coins program found a bug in the C version that uses 128-bit integers. When you use fixed-size numbers you risk overflows. If the overflows are silent (like the ones in D's built-in integral numbers), you don't know whether your code is giving bogus results, and you need to work and think to be sure there are no overflows. That work and thinking take time I'd often rather spend on something else. This is why multiprecision numbers (or non-silent overflows) are handy. In this program the upper bounds are known only if you first compute them with Python or with multiprecision numbers, like C with GMP :-)
2) What if I want coins for a much larger number of euros, like 200_000? Maybe the result or some intermediate value doesn't fit in 128 bits, so the C program becomes useless again, while the Python code is still useful. Multiprecision numbers are sometimes more useful.

Bye,
bearophile
December 14, 2011
On Tue, 13 Dec 2011 02:22:00 -0500, Paulo Pinto <pjmlp@progtools.org> wrote:

> Robert Jacques Wrote:
>> >>>> Second, being a systems language means that D can not implement a lot of
>> >>> GC algorithms including copying, generational and the good concurrent
>> >>> collectors.
>
> I disagree, as there are other systems languages with better GC algorithms than D, because they offer safer feature sets than D; the lack of an inline assembler is one example.
>
> And regardless of what you may think about these languages' suitability for systems programming, there are research operating systems written in them, with lots of papers to read. That is something I have yet to see from D.
>
> Yet when reading how Singularity was implemented, there are lots of parallels between what Sing# offers and what D does. So I really think there is still quite some room to improve D's GC.
>

From the Singularity Wikipedia article:
The lowest-level x86 interrupt dispatch code is written in assembly language and C. Once this code has done its job, it invokes the kernel, whose runtime and garbage collector are written in Sing# (an extended version of Spec#, itself an extension of C#) and runs in unprotected mode. The hardware abstraction layer is written in C++ and runs in protected mode. There is also some C code to handle debugging. The computer's BIOS is invoked during the 16-bit real mode bootstrap stage; once in 32-bit mode, Singularity never invokes the BIOS again, but invokes device drivers written in Sing#. During installation, Common Intermediate Language (CIL) opcodes are compiled into x86 opcodes using the Bartok compiler.

From the BitC website:
A less obvious issue is the absence of first-class union value types in the managed subset of the Common Language Runtime (CLR) or the corresponding parts of the Java Virtual Machine (JVM). These are absolutely necessary for low-level systems programming, so one must either abandon Java/C#/Spec# to implement these low-level objects (thereby abandoning the foundation for checking), or one must find a more appropriate language.

In addition to the problems of expressiveness, neither Java nor C# was designed with formal property checking in mind. Spec# [3], a language developed by Microsoft Research to retrofit formal property checking to C#, has been forced to introduce some fairly severe language warts to support precondition and postcondition checking, but the language does not attempt to address the underlying performance issues of C#.

So, no, Singularity isn't written purely in Sing#; all its low-level systems access is written in ASM/C/C++, like pretty much every other operating system. (It's still an impressive microkernel.)

Now BitC and Coyotos are another interesting language/OS pair, though they currently use a C/C++ conservative garbage collector.

At the end of the day, I'm really excited about the growth in the systems programming arena and I'd love to see the combination of the ideals in C4, L4, Singularity and Coyotos into some new OS and/or language. But that doesn't really change the limitations of running on top of Windows or iOS.