Thread overview
Some GC and emulated TLS questions (GDC related)
Jul 14, 2017
Johannes Pfau
Jul 14, 2017
Kagamin
Jul 16, 2017
Johannes Pfau
Jul 15, 2017
Joakim
Jul 16, 2017
Johannes Pfau
Jul 16, 2017
Iain Buclaw
Jul 16, 2017
Johannes Pfau
Jul 23, 2017
Joakim
July 14, 2017
As you might know, GDC currently doesn't properly hook up the GC to the GCC emulated TLS support in libgcc. Because of that, TLS memory is not scanned on target systems with emulated TLS. For GCC this includes MinGW, Android (although Google switched to LLVM anyway) and some more architectures. Proper integration likely needs some modifications in the libgcc emutls code so I need some more information about the GC to really propose a good solution.


The main problem is that GCC emutls does not use contiguous memory
blocks. So instead of scanning one range containing N variables we'll
have one range for every single TLS variable per thread.
So assuming we could iterate over all these variables (this would be
an extension required in libgcc), would scanTLSRanges in rt.sections
produce acceptable performance in these cases? Depending on the
number of TLS variables and threads there may be thousands of ranges
to scan.

Another solution could be to enhance libgcc emutls to allow custom allocators, then have a special allocation function in druntime for all D emutls variables. As far as I know there is no GC heap that is scanned, but not automatically collected? I'd need a way to completely manually manage GC.malloc/GC.free memory without the GC collecting this memory, but still scanning this memory for pointers. Does something like this exist?

Another option is simply using the DMD-style emutls. But as far as I can see the DMD implementation never supported dynamic loading of shared libraries? This is something the GCC emutls support is quite good at: It doesn't have any platform dependencies (apart from mutexes and some way to store one thread specific pointer+destructor) and should work with all kinds of shared library combinations. DMD style emutls also does not allow sharing TLS variables between D and other languages.

So I was thinking, if DMD style emutls really doesn't support shared libraries, maybe we should just clone a GCC-style, compiler and OS agnostic emutls implementation into druntime? A D implementation could simply allocate all internal arrays using the GC. This should be just as efficient as the C implementation for variable access, and interfacing to the GC is trivial. It gets somewhat more complicated if we want to use this in betterC, though. We also lose C/C++ compatibility by using such a custom implementation.




The rest of this post is a description of the GCC emutls code. Someone
can use this specification to implement a clean-room design D emutls
clone.
Source code can be found here, but beware of the GPL license:
https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

Unlike DMD TLS, the GCC TLS code does not put all initialization memory
into one section. In fact, the code is completely runtime and
compile time linker agnostic so it can't use section start/stop
markers. Instead, every TLS variable is handled individually. For every
variable, an instance of __emutls_object is created in the (writeable)
data segment. __emutls_object is defined as:

struct __emutls_object
{
    word size;
    word align;
    union {pointer offset; void* ptr};
    void* templ;
}

The void* ptr is only used as an optimization for single threaded programs, so I'll ignore this for now in the further description.
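
As a rough illustration, the control object could be written in C as below. This is only a sketch based on the description above: the name emutls_object (without the reserved leading underscores), the uintptr_t field types, the union member name loc and the example variable counter are all assumptions made here, not the libgcc definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One control object per TLS variable, placed in the writeable data segment.
   Field names follow the description; the concrete types are assumed. */
struct emutls_object {
    uintptr_t size;        /* size of the variable in bytes */
    uintptr_t align;       /* required alignment of the variable */
    union {
        uintptr_t offset;  /* 1-based index into each thread's array, 0 = not assigned yet */
        void *ptr;         /* single-threaded optimization, ignored in this sketch */
    } loc;
    const void *templ;     /* initial image copied into every new per-thread instance */
};

/* Hypothetical example: roughly what a compiler could emit for a
   thread-local `int counter = 42;`. */
static const int counter_init = 42;
static struct emutls_object counter_ctl = {
    sizeof(int), _Alignof(int), { 0 }, &counter_init
};
```

Every access to counter would then be compiled into a call that receives &counter_ctl and returns the address of the calling thread's copy.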

Whenever such a variable is accessed, the compiler calls __emutls_get_address(&(__emutls_object in data segment)). This function first does an atomic load of the __emutls_object.offset variable. If it is zero, this particular TLS variable has not been accessed in any thread before.

If this is the case, first check whether the global emutls
initialization function (emutls_init) has already been run; if not, run
it (__gthread_once). The initialization function initializes the mutex
variable and creates a thread-specific key using __gthread_key_create
with the destructor function set to emutls_destroy.

Back to __emutls_get_address: If offset was zero and we ran the emutls_init if required, we now lock the mutex. We have a global variable emutls_size to count the number of total variables. We now increase the emutls_size counter and atomically set __emutls_object.offset = emutls_size.
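
The index assignment described so far might look roughly like this in C. It is a sketch, not the libgcc code: it uses POSIX threads directly in place of the __gthread wrappers, a statically initialized mutex instead of one set up in emutls_init, and a placeholder destructor. The function name emutls_index is made up for this example. The second check of offset under the lock is an addition of this sketch, so that two threads racing on the same variable end up with the same index.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static pthread_once_t  emutls_once  = PTHREAD_ONCE_INIT;
static pthread_mutex_t emutls_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t   emutls_key;   /* holds the per-thread array pointer */
static uintptr_t       emutls_size;  /* indices handed out so far, guarded by emutls_mutex */

/* Placeholder for the real destructor, which would free the thread's blocks. */
static void emutls_destroy_stub(void *array) { (void)array; }

/* Runs exactly once per process: creates the thread-specific key. */
static void emutls_init(void)
{
    int rc = pthread_key_create(&emutls_key, emutls_destroy_stub);
    assert(rc == 0);
    (void)rc;
}

/* Return the 1-based index of a TLS variable, assigning one on first use. */
static uintptr_t emutls_index(_Atomic uintptr_t *offset)
{
    uintptr_t idx = atomic_load_explicit(offset, memory_order_acquire);
    if (idx == 0) {
        pthread_once(&emutls_once, emutls_init);
        pthread_mutex_lock(&emutls_mutex);
        /* Another thread may have assigned the index while we waited. */
        idx = atomic_load_explicit(offset, memory_order_relaxed);
        if (idx == 0) {
            idx = ++emutls_size;
            atomic_store_explicit(offset, idx, memory_order_release);
        }
        pthread_mutex_unlock(&emutls_mutex);
    }
    return idx;
}
```

Indices are never reused or released, so emutls_size only grows over the lifetime of the process.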

We now have an __emutls_object.offset index assigned, either through the
procedure described above or because we were called again at a later
stage and offset was already nonzero. Now we get a per-thread pointer
using __gthread_getspecific. This is a pointer to an __emutls_array,
which is simply a size value followed by size void* entries. If
__gthread_getspecific returns null, this is the first time we access a
TLS variable in this thread. In that case, allocate a new
__emutls_array (size = emutls_size + 32 + 1 (for the size field)) and
save it using __gthread_setspecific. If we already had an array for this
thread, check whether the __emutls_object.offset index is larger than
the array. If so, reallocate the array (double the size, if still too
small add +32, then either way add +1) and update the pointer using
__gthread_setspecific.

Now we have enough space in the thread-specific array in either case, so look at array[offset-1]. If this is null, allocate a new object (emutls_alloc) and set the array value. Return the array value at index offset-1.
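
The per-thread array handling could be sketched in C as follows. To keep the example independent of any thread API, the array pointer is passed in and returned instead of being fetched and stored with __gthread_getspecific/__gthread_setspecific. The names emutls_array_reserve and emutls_slot are hypothetical. The "+32" growth step is read here as "index + 32", which is an assumption about the intended meaning.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Per-thread table: a size field followed by one pointer per TLS variable. */
struct emutls_array {
    uintptr_t size;   /* number of usable slots */
    void *slot[];     /* slot[i] is NULL until variable i+1 is first touched */
};

/* Make sure the array can hold 1-based index `idx`. `total` is the current
   value of the global variable counter. Pass NULL for a thread's first access. */
static struct emutls_array *emutls_array_reserve(struct emutls_array *arr,
                                                 uintptr_t idx, uintptr_t total)
{
    if (arr == NULL) {
        uintptr_t size = total + 32;
        arr = calloc(1, sizeof *arr + size * sizeof(void *));
        if (arr == NULL)
            abort();
        arr->size = size;
    } else if (idx > arr->size) {
        uintptr_t old = arr->size;
        uintptr_t size = old * 2;
        if (size < idx)
            size = idx + 32;   /* assumed reading of "add +32" */
        arr = realloc(arr, sizeof *arr + size * sizeof(void *));
        if (arr == NULL)
            abort();
        arr->size = size;
        /* New slots must start out empty. */
        memset(&arr->slot[old], 0, (size - old) * sizeof(void *));
    }
    return arr;
}

/* Address of the slot for 1-based index `idx`; the caller fills it on first use. */
static void **emutls_slot(struct emutls_array *arr, uintptr_t idx)
{
    assert(idx >= 1 && idx <= arr->size);
    return &arr->slot[idx - 1];
}
```

Because the array is reallocated on growth, a caller would have to store the returned pointer back into the thread-specific key every time.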

The emutls_alloc function is simple: Allocate __emutls_object.size bytes with __emutls_object.align alignment. In order to ensure alignment, the libgcc implementation uses malloc, then manually adjusts the pointer. As the original pointer is needed for free, the implementation allocates void*.sizeof more bytes and stores the original malloc pointer at the start of the allocated data block. The returned value is offset by void*.sizeof into the data block. Finally copy __emutls_object.templ into the newly allocated data block (initialization).
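
A minimal C sketch of such an allocation routine is shown below, assuming power-of-two alignments. It is not the libgcc code: the signature is simplified to take size, alignment and template directly, the original malloc pointer is kept in the word immediately before the returned address, and zero-filling when no template is given is an assumption of this sketch. emutls_free is a hypothetical helper showing how the saved pointer is recovered.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate one per-thread instance of a TLS variable. */
static void *emutls_alloc(size_t size, size_t align, const void *templ)
{
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    if (align < sizeof(void *))
        align = sizeof(void *);  /* keeps the saved pointer itself aligned */

    /* Room for the saved pointer, worst-case padding, and the object. */
    void *raw = malloc(sizeof(void *) + (align - 1) + size);
    if (raw == NULL)
        abort();

    /* Skip the saved-pointer word, then round down to the alignment. */
    uintptr_t addr = (uintptr_t)raw + sizeof(void *) + (align - 1);
    addr &= ~(uintptr_t)(align - 1);
    void *obj = (void *)addr;

    ((void **)obj)[-1] = raw;    /* remember what to pass to free() */

    if (templ != NULL)
        memcpy(obj, templ, size);  /* copy the initial image */
    else
        memset(obj, 0, size);      /* assumed: no template means zero-fill */
    return obj;
}

/* Release a block obtained from emutls_alloc. */
static void emutls_free(void *obj)
{
    free(((void **)obj)[-1]);
}
```

Every variable in every thread gets its own malloc block this way, which is exactly why the GC would see one range per variable and thread rather than one contiguous range.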

The last missing function is emutls_destroy. It is called by __gthread once a thread key gets destroyed and receives a void* argument pointing to the per-thread array. The code simply iterates over the array, gets the original pointers (stored at offset -1 in the allocated blocks) and frees the data.
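
The cleanup step could be sketched in C like this. It assumes the block layout described above (original malloc pointer stored one word before each object). Freeing the array itself at the end and the emutls_freed counter are additions made for this example only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same layout as the per-thread table: size, then `size` pointers. */
struct emutls_array {
    uintptr_t size;
    void *slot[];
};

static unsigned long emutls_freed;  /* hypothetical counter, demonstration only */

/* Thread-exit destructor: receives the exiting thread's array. */
static void emutls_destroy(void *ptr)
{
    struct emutls_array *arr = ptr;
    for (uintptr_t i = 0; i < arr->size; ++i) {
        void *obj = arr->slot[i];
        if (obj != NULL) {
            /* The original malloc pointer sits one word before the object. */
            free(((void **)obj)[-1]);
            ++emutls_freed;
        }
    }
    free(arr);  /* assumed: the array is released along with its blocks */
}
```

With POSIX threads, a function of this shape would be what gets registered as the destructor argument of pthread_key_create.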

-- Johannes

July 14, 2017
Just allocate emutls array in managed heap and pin it somewhere, then everything referenced by it will be preserved.
July 15, 2017
On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
> Another solution could be to enhance libgcc emutls to allow custom allocators, then have a special allocation function in druntime for all D emutls variables. As far as I know there is no GC heap that is scanned, but not automatically collected?

I believe that's what's done with the TLS ranges now, they're scanned but not collected, though they're not part of the GC heap.

> I'd need a way to completely manually manage GC.malloc/GC.free memory without the GC collecting this memory, but still scanning this memory for pointers. Does something like this exist?

It doesn't have to be GC.malloc/GC.free, right?  The current DMD-style emutls simply mallocs and frees the TLS data itself and only expects the GC to scan it.

> Another option is simply using the DMD-style emutls. But as far as I can see the DMD implementation never supported dynamic loading of shared libraries? This is something the GCC emutls support is quite good at: It doesn't have any platform dependencies (apart from mutexes and some way to store one thread specific pointer+destructor) and should work with all kinds of shared library combinations. DMD style emutls also does not allow sharing TLS variables between D and other languages.

Yes, DMD's emutls was never made to work with loading multiple shared libraries.  As for sharing with other languages without copying the TLS data over, that seems a rare scenario.

> So I was thinking, if DMD style emutls really doesn't support shared libraries, maybe we should just clone a GCC-style, compiler and OS agnostic emutls implementation into druntime? A D implementation could simply allocate all internal arrays using the GC. This should be just as efficient as the C implementation for variable access and interfacing to the GC is trivial. It gets somewhat more complicated if we want to use this in betterC though. We also lose C/C++ compatibility though by using such a custom implementation.

It would be a good alternative to have, and you're not going to care in betterC mode, since there's no druntime or GC.  You'd have to be careful how you called TLS data from C/C++, but it could still be done.

> The rest of this post is a description of the GCC emutls code. Someone
> can use this specification to implement a clean-room design D emutls
> clone.
> Source code can be found here, but beware of the GPL license:
> https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c
>
> [...]

There is also this llvm implementation, available under permissive licenses and actually documented somewhat:

https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c
July 16, 2017
On Fri, 14 Jul 2017 12:47:55 +0000,
Kagamin <spam@here.lot> wrote:

> Just allocate emutls array in managed heap and pin it somewhere, then everything referenced by it will be preserved.

This is basically the option of replicating GCC-style emutls in druntime. This is quite simple to implement and you don't even need special pinning, as the Thread instance object in core.thread can refer to the TLS array.

This solution can't be implemented in libgcc though, as obviously the GC is not always available to allocate the arrays in pure C programs ;-)


-- Johannes

July 16, 2017
On Sat, 15 Jul 2017 10:49:39 +0000,
Joakim <dlang@joakim.fea.st> wrote:

> On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
> > Another solution could be to enhance libgcc emutls to allow custom allocators, then have a special allocation function in druntime for all D emutls variables. As far as I know there is no GC heap that is scanned, but not automatically collected?
> 
> I believe that's what's done with the TLS ranges now, they're scanned but not collected, though they're not part of the GC heap.

Indeed. We used to use GC.addRange for this and this was said to be
slow when using many ranges. So I'm basically asking whether the scan
delegate has got the same problem or whether it can cope with thousands
of small ranges.
A scanned but not collected heap is slightly different, as the GC can
internally treat the allocator memory as one huge memory range. When
allocating using C malloc, every single allocation needs to be scanned
individually. A scan-but-do-not-collect allocator can probably be built
using the std.experimental.allocator primitives but that code is not in
druntime.

> 
> > I'd need a way to completely manually manage GC.malloc/GC.free memory without the GC collecting this memory, but still scanning this memory for pointers. Does something like this exist?
> 
> It doesn't have to be GC.malloc/GC.free, right?  The current DMD-style emutls simply mallocs and frees the TLS data itself and only expects the GC to scan it.

The problem here again is whether this scales properly when using thousands of non-contiguous memory ranges. DMD-style TLS can allocate one memory block per thread for all variables. GCC-style TLS will allocate one block per thread and variable.

> 
> > Another option is simply using the DMD-style emutls. But as far as I can see the DMD implementation never supported dynamic loading of shared libraries? This is something the GCC emutls support is quite good at: It doesn't have any platform dependencies (apart from mutexes and some way to store one thread specific pointer+destructor) and should work with all kinds of shared library combinations. DMD style emutls also does not allow sharing TLS variables between D and other languages.
> 
> Yes, DMD's emutls was never made to work with loading multiple shared libraries.  As for sharing with other languages without copying the TLS data over, that seems a rare scenario.

Yes, probably the best solution for now is to reimplement GCC style emutls with shared library support in druntime for all compilers and forget about C/C++ TLS compatibility. Even if we could get patches into libgcc it'd take years till all relevant systems have been updated to new libgcc versions.

> 
> > So I was thinking, if DMD style emutls really doesn't support shared libraries, maybe we should just clone a GCC-style, compiler and OS agnostic emutls implementation into druntime? A D implementation could simply allocate all internal arrays using the GC. This should be just as efficient as the C implementation for variable access and interfacing to the GC is trivial. It gets somewhat more complicated if we want to use this in betterC though. We also lose C/C++ compatibility though by using such a custom implementation.
> 
> It would be a good alternative to have, and you're not going to care in betterC mode, since there's no druntime or GC.  You'd have to be careful how you called TLS data from C/C++, but it could still be done.
> 
> > The rest of this post is a description of the GCC emutls code.
> > Someone
> > can use this specification to implement a clean-room design D
> > emutls
> > clone.
> > Source code can be found here, but beware of the GPL license:
> > https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c
> >
> > [...]
> 
> There is also this llvm implementation, available under permissive licenses and actually documented somewhat:
> 
> https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c

Unfortunately also not boost compatible, so we can't simply port that code either, as far as I can see?


-- Johannes

July 16, 2017
On 16 July 2017 at 14:37, Johannes Pfau via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
> On Sat, 15 Jul 2017 10:49:39 +0000,
> Joakim <dlang@joakim.fea.st> wrote:
>
>> On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
>> > Another solution could be to enhance libgcc emutls to allow custom allocators, then have a special allocation function in druntime for all D emutls variables. As far as I know there is no GC heap that is scanned, but not automatically collected?
>>
>> I believe that's what's done with the TLS ranges now, they're scanned but not collected, though they're not part of the GC heap.
>
> Indeed. We used to use GC.addRange for this and this was said to be
> slow when using many ranges. So I'm basically asking whether the scan
> delegate has got the same problem or whether it can cope with thousands
> of small ranges.
> A scanned but not collected heap is slightly different, as the GC can
> internally treat the allocator memory as one huge memory range. When
> allocating using C malloc, every single allocation needs to be scanned
> individually. A scan+/do not collect allocator can probably be built
> using the std.experimental.allocator primitives but that code is not in
> druntime.
>
>>
>> > I'd need a way to completely manually manage GC.malloc/GC.free memory without the GC collecting this memory, but still scanning this memory for pointers. Does something like this exist?
>>
>> It doesn't have to be GC.malloc/GC.free, right?  The current DMD-style emutls simply mallocs and frees the TLS data itself and only expects the GC to scan it.
>
> The problem here again is whether this scales properly when using thousands of non contiguous memory ranges. DMD style TLS can allocate one memory block per thread for all variables. GCC style will allocate one block per thread and variable.
>
>>
>> > Another option is simply using the DMD-style emutls. But as far as I can see the DMD implementation never supported dynamic loading of shared libraries? This is something the GCC emutls support is quite good at: It doesn't have any platform dependencies (apart from mutexes and some way to store one thread specific pointer+destructor) and should work with all kinds of shared library combinations. DMD style emutls also does not allow sharing TLS variables between D and other languages.
>>
>> Yes, DMD's emutls was never made to work with loading multiple shared libraries.  As for sharing with other languages without copying the TLS data over, that seems a rare scenario.
>
> Yes, probably the best solution for now is to reimplement GCC style emutls with shared library support in druntime for all compilers and forget about C/C++ TLS compatibility. Even if we could get patches into libgcc it'd take years till all relevant systems have been updated to new libgcc versions.
>

I sense a revert coming on...

https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c

Iain.
July 16, 2017
On Sun, 16 Jul 2017 14:48:04 +0200,
Iain Buclaw via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> 
> I sense a revert coming on...
> 
> https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c
> 
> Iain.

Correct, though more in a metaphorical sense ;-)

Ideally, I'd want a boost licensed, high level D implementation in core.thread. Instead of using __gthread get/setspecific, we simply add a GC managed (i.e. plain stupid) void[][] _tlsVars array to core.thread.Thread, use core.sync for locking and core.atomic to manage array indices. With all the high-level stuff we can reuse from druntime (resizing/reserving arrays) such an implementation is probably < 100 LOC. Most importantly, as we can't overwrite the functions in libgcc we'd also use custom function names (__d_emutls_get_address).

The one thing stopping me though is that I don't think I can implement this and boost-license it now that I almost know the libgcc implementation by heart...

-- Johannes

July 23, 2017
On Sunday, 16 July 2017 at 12:37:26 UTC, Johannes Pfau wrote:
> Yes, probably the best solution for now is to reimplement GCC style emutls with shared library support in druntime for all compilers and forget about C/C++ TLS compatibility. Even if we could get patches into libgcc it'd take years till all relevant systems have been updated to new libgcc versions.

It might be worth doing anyway, considering the rise of GC languages like D and Go.

>> There is also this llvm implementation, available under permissive licenses and actually documented somewhat:
>> 
>> https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c
>
> Unfortunately also not boost compatible, so we can't simply port that code either, as far as I can see?

Yes, it can't simply be relicensed as Boost. The UIUC/MIT dual license it's under is very permissive, but each license has advertising and license text inclusion clauses that are not compatible with the Boost license.

On Sunday, 16 July 2017 at 14:10:45 UTC, Johannes Pfau wrote:
> On Sun, 16 Jul 2017 14:48:04 +0200,
> Iain Buclaw via Digitalmars-d <digitalmars-d@puremagic.com> wrote:
>
>> 
>> I sense a revert coming on...
>> 
>> https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c
>> 
>> Iain.
>
> Correct, though more in a metaphorical sense ;-)
>
> Ideally, I'd want a boost licensed, high level D implementation in core.thread. Instead of using __gthread get/setspecific, we simply add a GC managed (i.e. plain stupid) void[][] _tlsVars array to core.thread.Thread, use core.sync for locking and core.atomic to manage array indices. With all the high-level stuff we can reuse from druntime (resizing/reserving arrays) such an implementation is probably < 100 LOC. Most importantly, as we can't overwrite the functions in libgcc we'd also use custom function names (__d_emutls_get_address).
>
> The one thing stopping me though is that I don't think I can implement this and boost-license it now that I almost know the libgcc implementation by heart...

Sounds like a worthwhile effort.  If it requires someone who's never looked at the libgcc implementation, you could try asking in the LDC forum, or asking someone who's contributed to the GC.  Maybe Dmitry could whip this up for us?