March 18, 2012
I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:

* Try to fix GCC's emutls to allocate all tls memory for a module
  (application/shared object) at once. That's the best solution
  and native TLS works this way, but I'm not sure if we can extract
  enough information from the runtime linker to make this work (we
  need at least the combined size of all tls variables).

* Provide a callback in GCC's emutls which is called after every
  allocation. This could call GC.addRange for every variable, but I
  guess adding huge amounts of ranges is slow.

* Make it possible to register a custom allocator for GCC's emutls (not
  sure if possible, as this would have to be set up very early in
  application startup). Then allocate the memory directly from the GC
  (but this memory should only be scanned, not collected)

* Replace the calls to malloc in emutls.c with a custom, region-based
  memory allocator. (This is not a perfect solution though, it can
  always happen that we'll need more memory)



* Do not use GCC's emutls at all, roll a custom solution. This could be
  compatible with / based on dmd's tls emulation for OSX. Most of the
  implementation is in core.thread, all that's necessary is to group
  the tls data into a _tls_data_array and call ___tls_get_addr for
  every tls access. I'm not sure if this can be done in the
  'middle-end' though and it doesn't support shared libraries yet.
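
The dmd-style scheme from the last bullet could look roughly like the following C sketch. It is only an illustration under assumptions, not druntime's actual code: `tls_image`, `thread_ctx` and `tls_get_addr` are hypothetical names standing in for `_tls_data_array`, the per-thread state kept in core.thread, and `___tls_get_addr`. The per-thread context is passed in explicitly here; a real implementation would look it up from the running thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for _tls_data_array: the compiler would group the
   initial values of all TLS variables of a module into one image. */
struct tls_image_layout {
    int  counter;   /* initial value 42 */
    char name[8];   /* initial value "main" */
};
static struct tls_image_layout tls_image = { 42, "main" };

/* Hypothetical per-thread state: one contiguous copy of the image. */
struct thread_ctx {
    unsigned char *block;   /* sizeof(tls_image) bytes, or NULL */
};

/* Create the thread's private copy. This single block is what would be
   handed to the GC as one range. Returns 0 on success. */
static int thread_tls_init(struct thread_ctx *t)
{
    t->block = malloc(sizeof tls_image);
    if (t->block == NULL)
        return -1;
    memcpy(t->block, &tls_image, sizeof tls_image);
    return 0;
}

static void thread_tls_free(struct thread_ctx *t)
{
    free(t->block);
    t->block = NULL;
}

/* Analogue of ___tls_get_addr: p points into the static image, the result
   points at the same offset inside the thread's private copy. */
static void *tls_get_addr(struct thread_ctx *t, void *p)
{
    ptrdiff_t off = (unsigned char *)p - (unsigned char *)&tls_image;
    assert(off >= 0 && (size_t)off < sizeof tls_image);
    return t->block + off;
}
```

Every TLS access then becomes a call such as `tls_get_addr(ctx, &tls_image.counter)`, which is the part that would have to be generated by the compiler.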

March 18, 2012
On 18 March 2012 11:32, Johannes Pfau <nospam@example.com> wrote:
> I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:
>
> * Try to fix GCC's emutls to allocate all tls memory for a module (application/shared object) at once. That's the best solution and native TLS works this way, but I'm not sure if we can extract enough information from the runtime linker to make this work (we need at least the combined size of all tls variables).
>
> * Provide a callback in GCC's emutls which is called after every  allocation. This could call GC.addRange for every variable, but I  guess adding huge amounts of ranges is slow.
>

Painfully slow.


> * Make it possible to register a custom allocator for GCC's emutls (not  sure if possible, as this would have to be set up very early in  application startup). Then allocate the memory directly from the GC  (but this memory should only be scanned, not collected)
>
> * Replace the calls to malloc in emutls.c with a custom, region-based memory allocator. (This is not a perfect solution though, it can always happen that we'll need more memory)
>
>
>
> * Do not use GCC's emutls at all, roll a custom solution. This could be  compatible with / based on dmd's tls emulation for OSX. Most of the  implementation is in core.thread, all that's necessary is to group  the tls data into a _tls_data_array and call ___tls_get_addr for  every tls access. I'm not sure if this can be done in the  'middle-end' though and it doesn't support shared libraries yet.
>

If we are going to fix TLS, I'd rather it be in the most platform agnostic way possible, if it could be helped. That would mean also scrapping the current implementation on Linux (just tries to mimic what dmd does, and has corner cases where it doesn't always get it right).




-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
March 18, 2012
On 18-03-2012 12:32, Johannes Pfau wrote:
> I thought about supporting emulated tls a little. The GCC emutls.c
> implementation currently can't work with the gc, as every TLS variable
> is allocated individually and therefore we don't have a contiguous
> memory region for the gc. I think these are the possible solutions:
>
> * Try to fix GCC's emutls to allocate all tls memory for a module
>    (application/shared object) at once. That's the best solution
>    and native TLS works this way, but I'm not sure if we can extract
>    enough information from the runtime linker to make this work (we
>    need at least the combined size of all tls variables).
>
> * Provide a callback in GCC's emutls which is called after every
>    allocation. This could call GC.addRange for every variable, but I
>    guess adding huge amounts of ranges is slow.

We should avoid this if possible, yes. A small root set is desirable.

>
> * Make it possible to register a custom allocator for GCC's emutls (not
>    sure if possible, as this would have to be set up very early in
>    application startup). Then allocate the memory directly from the GC
>    (but this memory should only be scanned, not collected)

Such an allocator would probably just allocate a decently-sized memory block from libc and add it as a root range (rather than individual word-sized roots). The memory doesn't necessarily have to be allocated with the GC.
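
A minimal sketch of such an allocator, assuming a hypothetical `gc_add_range_stub` in place of the real GC registration call and an arbitrary chunk size. It is not based on any existing emutls hook; it only shows that the number of registered ranges follows the number of chunks rather than the number of variables:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096   /* arbitrary size for this sketch */
#define MAX_RANGES 16

/* Stub standing in for the GC's range registration (GC.addRange in D).
   It only records the ranges so the effect can be inspected. */
struct range { void *base; size_t size; };
static struct range gc_ranges[MAX_RANGES];
static size_t gc_range_count;

static int gc_add_range_stub(void *base, size_t size)
{
    if (gc_range_count == MAX_RANGES)
        return -1;
    gc_ranges[gc_range_count].base = base;
    gc_ranges[gc_range_count].size = size;
    gc_range_count++;
    return 0;
}

/* Bump allocator: hands out pieces of the current chunk and only asks
   libc for a new chunk (and registers a new range) when it runs out. */
struct region {
    unsigned char *chunk;   /* current chunk, NULL before first use */
    size_t used;            /* bytes consumed in the current chunk */
};

static void *region_alloc(struct region *r, size_t size, size_t align)
{
    /* align must be a power of two no larger than malloc's alignment;
       requests bigger than one chunk are not handled in this sketch. */
    if (size == 0 || size > CHUNK_SIZE || align == 0
        || align > _Alignof(max_align_t) || (align & (align - 1)) != 0)
        return NULL;

    size_t start = (r->used + align - 1) & ~(align - 1);
    if (r->chunk == NULL || start + size > CHUNK_SIZE) {
        unsigned char *c = calloc(1, CHUNK_SIZE);
        if (c == NULL)
            return NULL;
        if (gc_add_range_stub(c, CHUNK_SIZE) != 0) {
            free(c);
            return NULL;
        }
        r->chunk = c;   /* older chunks stay allocated and registered */
        start = 0;
    }
    r->used = start + size;
    assert(r->used <= CHUNK_SIZE);
    return r->chunk + start;
}
```

Chunks are never freed here, on the assumption that TLS variables live as long as their thread; a real allocator would also have to release them at thread exit.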

>
> * Replace the calls to malloc in emutls.c with a custom, region-based
>    memory allocator. (This is not a perfect solution though, it can
>    always happen that we'll need more memory)
>
>
>
> * Do not use GCC's emutls at all, roll a custom solution. This could be
>    compatible with / based on dmd's tls emulation for OSX. Most of the
>    implementation is in core.thread, all that's necessary is to group
>    the tls data into a _tls_data_array and call ___tls_get_addr for
>    every tls access. I'm not sure if this can be done in the
>    'middle-end' though and it doesn't support shared libraries yet.
>


-- 
- Alex
March 18, 2012
On Sun, 18 Mar 2012 12:21:51 +0000
Iain Buclaw <ibuclaw@ubuntu.com> wrote:

> On 18 March 2012 11:32, Johannes Pfau <nospam@example.com> wrote:
> > I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:
> >
> > * Try to fix GCC's emutls to allocate all tls memory for a module (application/shared object) at once. That's the best solution and native TLS works this way, but I'm not sure if we can extract enough information from the runtime linker to make this work (we need at least the combined size of all tls variables).
> >
> > * Provide a callback in GCC's emutls which is called after every  allocation. This could call GC.addRange for every variable, but I  guess adding huge amounts of ranges is slow.
> >
> 
> Painfully slow.
> 
> 
> > * Make it possible to register a custom allocator for GCC's emutls (not sure if possible, as this would have to be set up very early in  application startup). Then allocate the memory directly from the GC  (but this memory should only be scanned, not collected)
> >
> > * Replace the calls to malloc in emutls.c with a custom, region-based memory allocator. (This is not a perfect solution though, it can always happen that we'll need more memory)
> >
> >
> >
> > * Do not use GCC's emutls at all, roll a custom solution. This could be compatible with / based on dmd's tls emulation for OSX. Most of the implementation is in core.thread, all that's necessary is to group the tls data into a _tls_data_array and call ___tls_get_addr for every tls access. I'm not sure if this can be done in the 'middle-end' though and it doesn't support shared libraries yet.
> >
> 
> If we are going to fix TLS, I'd rather it be in the most platform agnostic way possible, if it could be helped. That would mean also scrapping the current implementation on Linux (just tries to mimic what dmd does, and has corner cases where it doesn't always get it right).

You mean getting rid of __tls_beg and __tls_end? I'd also like to remove those, but:

TLS is mostly object-format specific (not as much OS specific). The ELF
implementation lays out the TLS data for a module (module = shared
library or the application) in a contiguous way. The details are
described in "ELF Handling For Thread-Local
Storage" (www.akkadia.org/drepper/tls.pdf).

The GC requires the TLS blocks to be contiguous, this is not the case for GCC's emulated TLS and this causes issues there.

For native TLS/ELF this requirement is met, but the GC also has to know the start and the size of the TLS sections. Although the runtime linker has this information, there's no standard way to access it. So we could:

* Add a custom extension API to the C libraries. We'd need at least: A
  'tls_range dl_get_tls_range(void *handle)' function related to the
  dl* set of functions in the runtime linker, and a 'tls_range
  dl_get_tls_range2(struct dl_phdr_info *info)' to be used with
  dl_iterate_phdr. We also need some way to get the tls range for the
  application, 'get_app_tls_range' (although some libcs also return
  the application module in dl_iterate_phdr).

This seems to be the best way, but we'd have to patch every C library and it would take some time till those updated C libraries are widely deployed.

The other solution is to hook directly into each C library's non-public (and maybe non-stable!) API. For example, the structure returned by BSD libc's dl_iterate_phdr and dlopen has these fields:

 int tlsindex;		/* Index in DTV for this module */
 void *tlsinit;		/* Base address of TLS init block */
 size_t tlsinitsize;	/* Size of TLS init block for this module */
 size_t tlssize;	/* Size of TLS block for this module */
 size_t tlsoffset;	/* Offset of static TLS block for this module */
 size_t tlsalign;	/* Alignment of static TLS block */

tlsindex gives us the start address of the TLS block for every thread, as long as we know how to compute the TLS address from the TP (thread pointer) and the DTV index (there are basically two methods, both described in "ELF Handling For Thread-Local Storage"), and tlssize gives us the size.
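
For the size half of the problem, a sketch that stays within the (nonstandard, but documented) dl_iterate_phdr interface might look like this. It assumes an ELF system with glibc-style `<link.h>`; `tls_summary`, `visit_module`, `collect_tls_sizes` and `example_tls_var` are names made up for the example. It reads the size of each module's TLS block from the PT_TLS program header and does not touch any private libc structure:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc only declares dl_iterate_phdr with this set */
#endif
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* A thread-local variable, so that this program itself has a PT_TLS
   segment to find. */
__thread int example_tls_var = 1;

struct tls_summary {
    size_t modules;       /* loaded objects visited */
    size_t tls_modules;   /* objects that carry a PT_TLS segment */
    size_t tls_bytes;     /* sum of p_memsz over all PT_TLS segments */
};

static int visit_module(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_summary *s = data;
    (void)size;
    s->modules++;
    for (size_t i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            /* p_memsz is the size of the TLS block each thread gets
               for this module (initialised data plus zeroed part). */
            s->tls_modules++;
            s->tls_bytes += ph->p_memsz;
        }
    }
    return 0;   /* a non-zero return would stop the iteration */
}

static struct tls_summary collect_tls_sizes(void)
{
    struct tls_summary s = { 0, 0, 0 };
    dl_iterate_phdr(visit_module, &s);
    return s;
}
```

This only yields sizes. Finding where each thread's copy of a block starts still needs the thread pointer and DTV arithmetic mentioned above, which is the part without a public API.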


However, there doesn't seem to be a painless way to do this...

March 18, 2012
On 2012-03-18 12:32, Johannes Pfau wrote:
> I thought about supporting emulated tls a little. The GCC emutls.c
> implementation currently can't work with the gc, as every TLS variable
> is allocated individually and therefore we don't have a contiguous
> memory region for the gc. I think these are the possible solutions:

Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

BTW, I think it would be possible to emulate TLS in a very similar way to how it's implemented natively for ELF.

-- 
/Jacob Carlborg
March 18, 2012
On 2012-03-18 19:39, Johannes Pfau wrote:

> You mean getting rid of __tls_beg and __tls_end? I'd also like to
> remove those, but:

__tls_beg and __tls_end are not used by Mac OS X any more:

https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6

https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3

> TLS is mostly object-format specific (not as much OS specific). The ELF
> implementation lays out the TLS data for a module (module = shared
> library or the application) in a contiguous way. The details are
> described in "ELF Handling For Thread-Local
> Storage" (www.akkadia.org/drepper/tls.pdf).
>

Mac OS X 10.7+ supports TLS natively, but I don't know where to find documentation about it. It's always possible to look at the source code.

-- 
/Jacob Carlborg
March 19, 2012
On Sun, 18 Mar 2012 21:57:57 +0100
Jacob Carlborg <doob@me.com> wrote:

> On 2012-03-18 12:32, Johannes Pfau wrote:
> > I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:
> 
> Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?

That's what we (mostly) do right now. We have 2 issues:

* Our own, emulated TLS support is implemented in GCC. This means it's
  also used in C, which is great. Also GCC's emulated tls needs
  absolutely no special features in the runtime linker, compile time
  linker or language frontends. It's very portable and works with all
  weird combinations of dynamic libraries, dlopen, etc.
  But it has one quirk: It doesn't allocate TLS memory in a contiguous
  way, every tls variable is allocated using malloc. This means we
  can't pass a range to the GC for the tls variables. So we can't
  support this emutls in the GC.

* The other issue with native TLS is that using bracketing with
  __tls_beg and __tls_end has corner cases where it doesn't work. We'd
  need an alternative to locate the TLS memory addresses and TLS sizes.
  But there's no standard or public API to do that.
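
To make the first issue concrete, here is a much simplified model of a per-variable scheme. It is only loosely inspired by the idea behind GCC's emutls and is not its actual code: `emu_object`, `emu_thread` and `emu_get_address` are hypothetical, slot indices are assigned by hand, and the per-thread table is passed explicitly instead of being looked up from the running thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_VARS 32   /* fixed table size, only for this sketch */

/* One control object per TLS variable (hypothetical, simplified). */
struct emu_object {
    size_t size;         /* size of the variable */
    size_t index;        /* slot in the per-thread table */
    const void *templ;   /* initial value, or NULL for zero-init */
};

/* Per-thread table of lazily created variable instances. */
struct emu_thread {
    void *slot[MAX_VARS];
};

/* Return this thread's instance of the variable, creating it on first
   access. Every variable gets its own malloc block, so the instances
   belonging to one thread end up scattered across the heap. */
static void *emu_get_address(struct emu_thread *t, const struct emu_object *o)
{
    assert(o->index < MAX_VARS);
    if (t->slot[o->index] == NULL) {
        void *p = malloc(o->size);
        if (p == NULL)
            return NULL;
        if (o->templ != NULL)
            memcpy(p, o->templ, o->size);
        else
            memset(p, 0, o->size);
        t->slot[o->index] = p;
    }
    return t->slot[o->index];
}

static void emu_thread_free(struct emu_thread *t)
{
    for (size_t i = 0; i < MAX_VARS; i++) {
        free(t->slot[i]);
        t->slot[i] = NULL;
    }
}
```

Nothing ties the individual blocks together, so the only way to show them to a collector would be one range per variable per thread.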

> BTW, I think it would be possible to emulate TLS in a very similar way to how it's implemented natively for ELF.
> 

I don't think it's that easy. For example, how would you assign module ids? For native TLS this is partially done by the compile time linker (for the main application and libraries that are always loaded), but if no native TLS is available, we can't rely on the linker to do that. We also need some way to get the current module id in running code.

And how do we get the TLS initialization data? If we placed it into an array, like DMD does on OSX, we could use dlsym for dlopened libraries, but what about initially loaded libraries?

Say you have application 'app', which depends on 'liba' and 'libb'. All of these have TLS data. Maybe we could implement something using dl_iterate_phdr, but that's a nonstandard extension.

Compare that to GCC's emulation, which is probably slow, but 'just
works' everywhere (except for the GC :-( ).
March 19, 2012
On Sun, 18 Mar 2012 22:06:41 +0100
Jacob Carlborg <doob@me.com> wrote:

> On 2012-03-18 19:39, Johannes Pfau wrote:
> 
> > You mean getting rid of __tls_beg and __tls_end? I'd also like to remove those, but:
> 
> __tls_beg and __tls_end are not used by Mac OS X any more:
> 
> https://github.com/D-Programming-Language/druntime/commit/73cf2c150665cb17d9365a6e3d6cf144d76312d6
> 
> https://github.com/D-Programming-Language/dmd/commit/054c525edba048ad7829dd5ec2d8d9261a6517c3

Yes, but OSX still uses emulated TLS. With the way dmd emulates TLS it's possible to remove __tls_beg and __tls_end, but for native TLS those symbols are still needed. However, as the runtime linker (ld.so, or dyld on OSX) has the necessary information, it's possible that OSX even offers an API to access it. It's just that most C libraries don't provide a way to get the TLS segment sizes and the (per-thread) addresses of the TLS blocks.

> > TLS is mostly object-format specific (not as much OS specific). The
> > ELF implementation lays out the TLS data for a module (module =
> > shared library or the application) in a contiguous way. The details
> > are described in "ELF Handling For Thread-Local
> > Storage" (www.akkadia.org/drepper/tls.pdf).
> >
> 
> Mac OS X 10.7+ supports TLS natively, but I don't know where to find documentation about it. It's always possible to look at the source code.
> 

Then it's probably already supported by GCC/GDC. But having working emulated TLS would be nice for many other architectures. Native TLS is not that widespread.

March 19, 2012
On 19 March 2012 08:15, Johannes Pfau <nospam@example.com> wrote:
> On Sun, 18 Mar 2012 21:57:57 +0100
> Jacob Carlborg <doob@me.com> wrote:
>
>> On 2012-03-18 12:32, Johannes Pfau wrote:
>> > I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:
>>
>> Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?
>
> That's what we (mostly) do right now. We have 2 issues:
>
> * Our own, emulated TLS support is implemented in GCC. This means it's
>  also used in C, which is great. Also GCC's emulated tls needs
>  absolutely no special features in the runtime linker, compile time
>  linker or language frontends. It's very portable and works with all
>  weird combinations of dynamic libraries, dlopen, etc.
>  But it has one quirk: It doesn't allocate TLS memory in a contiguous
>  way, every tls variable is allocated using malloc. This means we
>  can't pass a range to the GC for the tls variables. So we can't
>  support this emutls in the GC.
>

As far as my thought process goes, the only way (implementable in the GDC frontend) to force a contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable. Then, in the .ctor section, the module adds this range to the GC. This should be enough for it to work for shared libraries too; however, I'm sure there are quite a few details I am missing here that would block this from working. :)
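
Expressed in C, the packing idea might look like the sketch below. All names are invented for the example, `gc_add_range_stub` merely records what a real GC registration call would receive, and `__thread` and `__attribute__((constructor))` are GCC/Clang extensions used here to play the role of the TLS symbol and the .ctor entry.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 8

/* Stub standing in for the GC's range registration; it only records. */
struct gc_range { const void *base; size_t size; };
static struct gc_range gc_ranges[MAX_RANGES];
static size_t gc_range_count;

static void gc_add_range_stub(const void *base, size_t size)
{
    assert(gc_range_count < MAX_RANGES);
    gc_ranges[gc_range_count].base = base;
    gc_ranges[gc_range_count].size = size;
    gc_range_count++;
}

/* What the frontend would generate: every TLS variable of the module
   becomes a field, so one thread's copies are contiguous by construction. */
struct module_tls {
    int    error_count;
    double scale;
    void  *last_object;   /* a pointer the collector must be able to see */
};

/* The single thread-local symbol of the module. */
static __thread struct module_tls tls_block = { 0, 1.0, NULL };

/* Registers the calling thread's copy of the block as one range. */
static void module_tls_register(void)
{
    gc_add_range_stub(&tls_block, sizeof tls_block);
}

/* Runs once when the module is loaded, in the loading thread only. */
__attribute__((constructor))
static void module_ctor(void)
{
    module_tls_register();
}
```

One of the missing details shows up immediately: the constructor only covers the thread that loads the module, so every other thread would have to call something like `module_tls_register` when it starts and remove its range again when it exits.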


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
March 19, 2012
On 2012-03-19 09:15, Johannes Pfau wrote:
> On Sun, 18 Mar 2012 21:57:57 +0100
> Jacob Carlborg <doob@me.com> wrote:
>
>> On 2012-03-18 12:32, Johannes Pfau wrote:
>>> I thought about supporting emulated tls a little. The GCC emutls.c
>>> implementation currently can't work with the gc, as every TLS
>>> variable is allocated individually and therefore we don't have a
>>> contiguous memory region for the gc. I think these are the possible
>>> solutions:
>>
>> Why not use the native TLS implementation when available and roll our
>> own, like DMD on Mac OS X, when none exists?
>
> That's what we (mostly) do right now. We have 2 issues:
>
> * Our own, emulated TLS support is implemented in GCC. This means it's
>    also used in C, which is great. Also GCC's emulated tls needs
>    absolutely no special features in the runtime linker, compile time
>    linker or language frontends. It's very portable and works with all
>    weird combinations of dynamic libraries, dlopen, etc.
>    But it has one quirk: It doesn't allocate TLS memory in a contiguous
>    way, every tls variable is allocated using malloc. This means we
>    can't pass a range to the GC for the tls variables. So we can't
>    support this emutls in the GC.

Ok, I see.

> * The other issue with native TLS is that using bracketing with
>    __tls_beg and __tls_end has corner cases where it doesn't work. We'd
>    need an alternative to locate the TLS memory addresses and TLS sizes.
>    But there's no standard or public API to do that.

On Mac OS X they are actually not needed. Don't know about other platforms.

>> BTW, I think it would be possible to emulate TLS in a very similar
>> way to how it's implemented natively for ELF.
>>
>
> I don't think it's that easy. For example, how would you assign module
> ids? For native TLS this is partially done by the compile time linker
> (for the main application and libraries that are always loaded), but if
> no native TLS is available, we can't rely on the linker to do that. We
> also need some way to get the current module id in running code.

As I understand it, the native ELF implementation uses assembly to access the current module id. This is for FreeBSD:

http://people.freebsd.org/~marcel/tls.html

This is how ___tls_get_addr is implemented on FreeBSD ELF i386:

https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

> And how do we get the TLS initialization data? If we placed it into an
> array, like DMD does on OSX, we could use dlsym for dlopened libraries,
> but what about initially loaded libraries?

In the same way it's done in the native implementation. Isn't it possible to access all loaded libraries?

> Say you have application 'app', which depends on 'liba' and 'libb'. All
> of these have TLS data. Maybe we could implement something using
> dl_iterate_phdr, but that's a nonstandard extension.

Ok. Mac OS X has a function called "_dyld_register_func_for_add_image"; I guess other OSes don't have a corresponding function? In general, all this stuff is very low level and nonstandard.

https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53

> Compare that to GCC's emulation, which is probably slow, but 'just
> works' everywhere (except for the GC :-( ).

Yeah, that's a big advantage.

In general I was hoping that the work done by the dynamic loader to set up TLS could be moved to druntime.

-- 
/Jacob Carlborg