March 19, 2012
On 2012-03-19 09:17, Johannes Pfau wrote:
> Am Sun, 18 Mar 2012 22:06:41 +0100
> schrieb Jacob Carlborg<doob@me.com>:

> Yes, but OSX still uses emulated tls. With the way dmd emulates TLS
> it's possible to remove __tls_beg and __tls_end, but for native TLS
> those symbols are still needed. However, as the runtime linker (ld.so)
> has got the necessary information, it's possible that OSX even offers an
> API to access it. It's just that most C libraries don't provide a way to
> get the TLS segment sizes and the (per thread) addresses of the TLS
> blocks.

The dyld library on Mac OS X provides access to segments and sections. And since the dynamic loader can get this information, shouldn't it be possible for other applications to get it as well?

Just walk through the object file and find the necessary segments?

>> Mac OS X 10.7 + supports TLS natively. But I don't know where to find
>> documentation about it. It's always possible to look at the source code.
>>
>
> Then it's probably already supported by GCC/GDC. But having working
> emulated TLS would be nice for many other architectures. Native TLS is
> not that widespread.

Yeah, I don't know about GCC though. Apple cares less and less about GCC and is putting all its effort into LLVM and Clang. OK, I didn't know how widespread TLS was.

-- 
/Jacob Carlborg
March 19, 2012
Am Mon, 19 Mar 2012 09:22:01 +0000
schrieb Iain Buclaw <ibuclaw@ubuntu.com>:

> On 19 March 2012 08:15, Johannes Pfau <nospam@example.com> wrote:
> >
> > * Our own, emulated TLS support is implemented in GCC. This means
> > it's also used in C, which is great. Also GCC's emulated tls needs
> >  absolutely no special features in the runtime linker, compile time
> >  linker or language frontends. It's very portable and works with all
> >  weird combinations of dynamic libraries, dlopen, etc.
> >  But it has one quirk: It doesn't allocate TLS memory in a
> > contiguous way, every tls variable is allocated using malloc. This
> > means we can't pass a range to the GC for the tls variables. So we
> > can't support this emutls in the GC.
> >
> 
> As far as my thought process goes, the only (implementable in the GDC frontend) way to force contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable.  And in the .ctor section, the module adds this range to the GC.  This should be enough for it to work for shared libraries too; however, I'm sure there are quite a few details I am missing here that would block this from working. :)
> 

Good idea, I should have thought about that. I can't think of a reason why it wouldn't work and it should be quite fast as well.

Just to clarify: 'module-level' as in D module(/object file) or as in one variable per shared library/application? If we can support one variable per shared library/application that'd be great, as we will then only have a few tls ranges for the gc.

March 19, 2012
Am Mon, 19 Mar 2012 10:40:25 +0100
schrieb Jacob Carlborg <doob@me.com>:

> On 2012-03-19 09:15, Johannes Pfau wrote:
> > Am Sun, 18 Mar 2012 21:57:57 +0100
> > schrieb Jacob Carlborg<doob@me.com>:
> >
> >> On 2012-03-18 12:32, Johannes Pfau wrote:
> >>> I thought about supporting emulated tls a little. The GCC emutls.c implementation currently can't work with the gc, as every TLS variable is allocated individually and therefore we don't have a contiguous memory region for the gc. I think these are the possible solutions:
> >>
> >> Why not use the native TLS implementation when available and roll our own, like DMD on Mac OS X, when none exists?
> >
> > That's what we (mostly) do right now. We have 2 issues:
> >
> > * Our own, emulated TLS support is implemented in GCC. This means
> > it's also used in C, which is great. Also GCC's emulated tls needs
> >    absolutely no special features in the runtime linker, compile
> > time linker or language frontends. It's very portable and works
> > with all weird combinations of dynamic libraries, dlopen, etc.
> >    But it has one quirk: It doesn't allocate TLS memory in a
> > contiguous way, every tls variable is allocated using malloc. This
> > means we can't pass a range to the GC for the tls variables. So we
> > can't support this emutls in the GC.
> 
> Ok, I see.
> 
> > * The other issue with native TLS is that using bracketing with
> >    __tls_beg and __tls_end has corner cases where it doesn't work.
> > We'd need an alternative to locate the TLS memory addresses and TLS
> > sizes. But there's no standard or public API to do that.
> 
> On Mac OS X they are actually not needed. Don't know about other platforms.
> 
> >> BTW, I think it would be possible to emulate TLS in a very similar way to how it's implemented natively for ELF.
> >>
> >
> > I don't think it's that easy. For example, how would you assign module ids? For native TLS this is partially done by the compile time linker (for the main application and libraries that are always loaded), but if no native TLS is available, we can't rely on the linker to do that. We also need some way to get the current module id in running code.
> 
> As I understand it, in the native ELF implementation, assembly is used to access the current module id, this is for FreeBSD:
> 
> http://people.freebsd.org/~marcel/tls.html
> 
> This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
> 
> https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Yep, and the module id is part of the tls_index parameter. That pointer is a pointer into the GOT. The initial values of that GOT entry are undefined; they are filled in by the runtime linker. We could probably emulate all that, but it seems a little complicated to me. If we can get the current emutls to work, that'd be awesome even if it's slow. Proper native TLS support is easier to implement in the runtime linker anyway.

> > And how do we get the TLS initialization data? If we placed it into an array, like DMD does on OSX, we could use dlsym for dlopened libraries, but what about initially loaded libraries?
> 
> In the same way it's done in the native implementation. Isn't it possible to access all loaded libraries?

The only way to access all loaded libraries is dl_iterate_phdr. But I'm not sure if it provides all necessary information.
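For reference, a minimal C sketch of what walking the loaded objects with dl_iterate_phdr looks like on glibc/Linux. It only counts the PT_TLS program headers and sums their in-memory sizes; it does not locate the per-thread blocks. The names tls_summary, collect_tls and summarize_tls are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>

/* Totals gathered while walking the loaded objects (hypothetical type). */
struct tls_summary {
    size_t objects;      /* number of loaded objects visited */
    size_t tls_objects;  /* objects that carry a PT_TLS segment */
    size_t tls_bytes;    /* sum of p_memsz over all PT_TLS segments */
};

/* Called once per loaded object (main program and shared libraries). */
static int collect_tls(struct dl_phdr_info *info, size_t size, void *data)
{
    struct tls_summary *sum = data;
    (void)size;
    assert(sum != NULL);

    sum->objects++;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type == PT_TLS) {
            sum->tls_objects++;
            sum->tls_bytes += ph->p_memsz;
        }
    }
    return 0; /* returning 0 continues the iteration */
}

/* Walk every loaded object and return the collected totals. */
struct tls_summary summarize_tls(void)
{
    struct tls_summary sum = {0, 0, 0};
    dl_iterate_phdr(collect_tls, &sum);
    return sum;
}
```

Calling summarize_tls() from the main program gives the size of each object's TLS template, which is the "combined size" part of the problem; finding the address of each thread's copy is a separate question.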

> 
> > Say you have application 'app', which depends on 'liba' and 'libb'. All of these have TLS data. Maybe we could implement something using dl_iterate_phdr, but that's a nonstandard extension.
> 
> Ok. Mac OS X has a function called "_dyld_register_func_for_add_image"; I guess other OSes don't have a corresponding function? In general, all this stuff is very low level and nonstandard.

Some C libraries might provide a similar API, but there's no guarantee such an API is available.

> 
> https://developer.apple.com/library/mac/#documentation/developertools/Reference/MachOReference/Reference/reference.html#jumpTo_53
> 
> > Compare that to GCC's emulation, which is probably slow, but 'just
> > works' everywhere (except for the GC :-( ).
> 
> Yeah, that's a big advantage.
> 
> In general I was hoping that the work done by the dynamic loader to setup TLS could be moved to druntime.
> 

That'd be nice, but I think the runtime linker doesn't export all necessary information.

March 19, 2012
On 19 March 2012 15:25, Johannes Pfau <nospam@example.com> wrote:
> Am Mon, 19 Mar 2012 09:22:01 +0000
> schrieb Iain Buclaw <ibuclaw@ubuntu.com>:
>
>> On 19 March 2012 08:15, Johannes Pfau <nospam@example.com> wrote:
>> >
>> > * Our own, emulated TLS support is implemented in GCC. This means
>> > it's also used in C, which is great. Also GCC's emulated tls needs
>> >  absolutely no special features in the runtime linker, compile time
>> >  linker or language frontends. It's very portable and works with all
>> >  weird combinations of dynamic libraries, dlopen, etc.
>> >  But it has one quirk: It doesn't allocate TLS memory in a
>> > contiguous way, every tls variable is allocated using malloc. This
>> > means we can't pass a range to the GC for the tls variables. So we
>> > can't support this emutls in the GC.
>> >
>>
>> As far as my thought process goes, the only (implementable in the GDC frontend) way to force contiguous layout of all TLS symbols is to pack them up ourselves into a struct that is accessible via a single global module-level variable.  And in the .ctor section, the module adds this range to the GC.  This should be enough for it to work for shared libraries too; however, I'm sure there are quite a few details I am missing here that would block this from working. :)
>>
>
> Good idea, I should have thought about that. I can't think of a reason why it wouldn't work and it should be quite fast as well.
>

Initial things to think about off the top of my head:

* Speed to access symbols.
* Accessing thread local symbols across modules.


> Just to clarify: 'module-level' as in D module(/object file) or as in one variable per shared library/application? If we can support one variable per shared library/application that'd be great, as we will then only have a few tls ranges for the gc.
>

Per module - see the code that initialises _Dmodule_ref.  We're really just adding two extra fields to that: the starting address and the size.
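A rough C sketch of the idea, under stated assumptions: every thread-local variable of a module lives in one struct, and a per-module record carries the block's start and size so a runtime can register one range per module and thread. All names here (module_tls, module_ref, gc_add_range, register_thread_tls) are invented for the illustration and are not the actual _Dmodule_ref layout. Because the start address differs per thread, the sketch stores a getter function rather than a fixed address; that is an assumption of this sketch, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* All thread-local variables of one module packed into a single struct,
 * so each thread sees them as one contiguous block. */
struct module_tls {
    int   counter;
    void *cache;
    char  name[16];
};

static __thread struct module_tls tls_block;

/* Hypothetical per-module record with the two extra fields. */
struct module_ref {
    struct module_ref *next;
    const char *name;
    void *(*tls_start)(void);  /* returns the calling thread's block */
    size_t tls_size;           /* size of that block in bytes */
};

static struct module_ref *module_list;

static void *this_module_tls_start(void) { return &tls_block; }

static struct module_ref this_module = {
    NULL, "example", this_module_tls_start, sizeof(struct module_tls)
};

/* Runs before main, playing the role of the module's .ctor entry. */
__attribute__((constructor))
static void register_module(void)
{
    this_module.next = module_list;
    module_list = &this_module;
}

/* Stand-in for a GC hook such as GC.addRange; it only records totals. */
static size_t ranges_added, bytes_added;

static void gc_add_range(void *start, size_t size)
{
    assert(start != NULL);
    ranges_added++;
    bytes_added += size;
}

/* Would be called once by every new thread: one range per module. */
size_t register_thread_tls(void)
{
    size_t n = 0;
    for (struct module_ref *m = module_list; m != NULL; m = m->next) {
        gc_add_range(m->tls_start(), m->tls_size);
        n++;
    }
    return n;
}
```

With this layout the number of GC ranges grows with modules times threads, not with the number of TLS variables.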


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
March 19, 2012
On 2012-03-19 16:57, Johannes Pfau wrote:
> Am Mon, 19 Mar 2012 10:40:25 +0100
> schrieb Jacob Carlborg<doob@me.com>:

>> In general I was hoping that the work done by the dynamic loader to
>> setup TLS could be moved to druntime.
>>
>
> That'd be nice, but I think the runtime linker doesn't export all
> necessary information.

I think this would require investigating each individual platform to see what's possible.

-- 
/Jacob Carlborg
March 19, 2012
On 3/18/2012 12:32 PM, Johannes Pfau wrote:
> I thought about supporting emulated tls a little. The GCC emutls.c
> implementation currently can't work with the gc, as every TLS variable
> is allocated individually and therefore we don't have a contiguous
> memory region for the gc. I think these are the possible solutions:
>
> * Try to fix GCCs emutls to allocate all tls memory for a module
>    (application/shared object) at once. That's the best solution
>    and native TLS works this way, but I'm not sure if we can extract
>    enough information from the runtime linker to make this work (we
>    need at least the combined size of all tls variables).
>
> * Provide a callback in GCC's emutls which is called after every
>    allocation. This could call GC.addRange for every variable, but I
>    guess adding huge amounts of ranges is slow.
>
> * Make it possible to register a custom allocator for GCC's emutls (not
>    sure if possible, as this would have to be set up very early in
>    application startup). Then allocate the memory directly from the GC
>    (but this memory should only be scanned, not collected)
>
> * Replace the calls to malloc in emutls.c with a custom, region based
>    memory allocator. (This is not a perfect solution though, it can
>    always happen that we'll need more memory)

Check the implementation of ranges in gcx.d. Adding a range is rather fast (vector-like appending to exponentially growing storage). During a collection the ranges are visited in a simple loop, so scanning many ranges should not perform much differently from scanning one memory chunk: it's the marking of references in the scanned data that is expensive.

I would be more concerned about removal of ranges, though. It scans existing ranges linearly to find the one to remove and moves the remaining entries in memory. Some optimizations might be helpful here.
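To make the cost difference concrete, here is a small C sketch of a range table with the behaviour described above: appending into a doubling array, and removal by linear search followed by shifting the tail. It is an illustration only, not the gcx.d code; range_table, range_add and range_remove are made-up names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct range { void *start; void *end; };

struct range_table {
    struct range *items;
    size_t len, cap;
};

/* Amortized O(1): capacity doubles when full. Returns 0 on success. */
int range_add(struct range_table *t, void *start, size_t size)
{
    assert(t != NULL);
    if (t->len == t->cap) {
        size_t cap = t->cap ? t->cap * 2 : 16;
        struct range *p = realloc(t->items, cap * sizeof *p);
        if (p == NULL)
            return -1;
        t->items = p;
        t->cap = cap;
    }
    t->items[t->len].start = start;
    t->items[t->len].end = (char *)start + size;
    t->len++;
    return 0;
}

/* O(n): linear search, then move the remaining entries down one slot.
 * Returns 0 if the range was found and removed, -1 otherwise. */
int range_remove(struct range_table *t, void *start)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->len; i++) {
        if (t->items[i].start == start) {
            memmove(&t->items[i], &t->items[i + 1],
                    (t->len - i - 1) * sizeof t->items[0]);
            t->len--;
            return 0;
        }
    }
    return -1;
}
```

With one range per TLS variable, every thread exit would pay the linear removal once per variable, which is where an optimization (for example a sorted table or swap-with-last removal) might help.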

>
>
>
> * Do not use GCC's emutls at all, roll a custom solution. This could be
>    compatible with / based on dmd's tls emulation for OSX. Most of the
>    implementation is in core.thread, all that's necessary is to group
>    the tls data into a _tls_data_array and call ___tls_get_addr for
>    every tls access. I'm not sure if this can be done in the
>    'middle-end' though and it doesn't support shared libraries yet.
>
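The region-based allocator option from the quoted list can be sketched in C as a simple bump allocator over one contiguous buffer, so that a single range covers everything handed out. The names region, region_init and region_alloc are hypothetical, and this is not how emutls.c is written; the NULL return when the region is exhausted is exactly the "we'll need more memory" caveat mentioned in the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous region; everything allocated from it can be handed
 * to the GC as a single [base, base + used) range. */
struct region {
    unsigned char *base;
    size_t size;   /* total capacity in bytes */
    size_t used;   /* bytes handed out so far, including padding */
};

void region_init(struct region *r, void *buf, size_t size)
{
    assert(r != NULL && buf != NULL);
    r->base = buf;
    r->size = size;
    r->used = 0;
}

/* Bump allocation; align must be a power of two.
 * Returns NULL when the region cannot satisfy the request. */
void *region_alloc(struct region *r, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t cur = (uintptr_t)r->base + r->used;
    uintptr_t aligned = (cur + align - 1) & ~(uintptr_t)(align - 1);
    size_t offset = (size_t)(aligned - (uintptr_t)r->base);

    if (offset > r->size || size > r->size - offset)
        return NULL;

    r->used = offset + size;
    return r->base + offset;
}
```

A real replacement would also need one region per thread and a policy for growing or chaining regions, which this sketch leaves out.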
March 21, 2012
Am Mon, 19 Mar 2012 16:14:36 +0000
schrieb Iain Buclaw <ibuclaw@ubuntu.com>:


> 
> Initial things to think about off the top of my head:
> 
> * Speed to access symbols.

Access needs the normal code to get the address of the TLS struct, plus one add instruction for the offset of the specific variable. So it should be fast enough.

> * Accessing thread local symbols across modules.

Do we have to use module-local symbols? If we could use symbols with unique, mangled names, we could just access that symbol+offset from every module. This assumes the d/di files provide enough information to calculate the offset.

> 
> > Just to clarify: 'module-level' as in D module(/object file) or as in one variable per shared library/application? If we can support one variable per shared library/application that'd be great, as we will then only have a few tls ranges for the gc.
> >
> 
> Per module - see the code that initialises _Dmodule_ref.  We're really just adding two extra fields to that: the starting address and the size.
> 
March 21, 2012
On 21 March 2012 13:17, Johannes Pfau <nospam@example.com> wrote:
> Am Mon, 19 Mar 2012 16:14:36 +0000
> schrieb Iain Buclaw <ibuclaw@ubuntu.com>:
>
>
>>
>> Initial things to think about off the top of my head:
>>
>> * Speed to access symbols.
>
> Access needs the normal code to get the address of the TLS struct, plus one add instruction for the offset of the specific variable. So it should be fast enough.
>
>> * Accessing thread local symbols across modules.
>
> Do we have to use module-local symbols? If we could use symbols with unique, mangled names, we could just access that symbol+offset from every module. This assumes the d/di files provide enough information to calculate the offset.
>

Oh yeah, that's it.  Perhaps the externally visible mangled names could just be references to the actual location?

I don't think there would be enough information to access via main entry point symbol+offset.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
March 23, 2012
On Mon, 19 Mar 2012 16:57:29 +0100, Johannes Pfau <nospam@example.com> wrote:

> The only way to access all loaded libraries is dl_iterate_phdr. But I'm
> not sure if it provides all necessary information.

Yes it does.

The drawback is that it eagerly allocates the TLS block.
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L408
https://github.com/dawgfoto/druntime/blob/SharedRuntime/src/rt/dso.d#L459
March 23, 2012
On Mon, 19 Mar 2012 10:40:25 +0100, Jacob Carlborg <doob@me.com> wrote:

> As I understand it, in the native ELF implementation, assembly is used to access the current module id, this is for FreeBSD:
>  http://people.freebsd.org/~marcel/tls.html
>  This is how ___tls_get_addr is implemented on FreeBSD ELF i386:
>  https://bitbucket.org/freebsd/freebsd-head/src/4e8f50fe2f05/libexec/rtld-elf/i386/reloc.c#cl-355

Not quite.
Access to the static image is done through %fs relative addressing
which is super-fast and requires no runtime linking.
The general dynamic addressing needs one tls_index
struct in the GOT for every variable and a call to __tls_get_addr(tls_index*).
The module index and the offset are filled by the runtime linker.
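A simplified C sketch of the general dynamic lookup described above, for illustration only. A tls_index holds a module id and an offset, and the lookup finds (or lazily creates) the calling thread's block for that module and adds the offset. In a real implementation the per-thread vector is found through the thread pointer and the index is filled in by the runtime linker; here the vector is passed explicitly and everything is set up by hand. The names thread_dtv, tls_module, modules and emu_tls_get_addr are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MODULES 8   /* arbitrary limit for the sketch */

/* What the runtime linker would fill in, one per variable. */
struct tls_index {
    size_t module;   /* module id, starting at 1 */
    size_t offset;   /* offset of the variable inside the module's block */
};

/* Per-module template: initialization image and total block size. */
struct tls_module {
    const void *init_image;
    size_t init_size;    /* bytes copied from init_image */
    size_t block_size;   /* total size; the remainder is zero-filled */
};

/* Per-thread vector of block pointers, indexed by module id. */
struct thread_dtv {
    void *blocks[MAX_MODULES + 1];
};

struct tls_module modules[MAX_MODULES + 1];

/* Returns the address of the variable for the thread owning dtv,
 * allocating that thread's block for the module on first use. */
void *emu_tls_get_addr(struct thread_dtv *dtv, const struct tls_index *ti)
{
    assert(dtv != NULL && ti != NULL);
    assert(ti->module >= 1 && ti->module <= MAX_MODULES);

    if (dtv->blocks[ti->module] == NULL) {
        const struct tls_module *m = &modules[ti->module];
        assert(m->block_size > 0 && m->init_size <= m->block_size);

        char *block = calloc(1, m->block_size);
        if (block == NULL)
            return NULL;
        if (m->init_size > 0)
            memcpy(block, m->init_image, m->init_size);
        dtv->blocks[ti->module] = block;
    }
    return (char *)dtv->blocks[ti->module] + ti->offset;
}
```

Each module's block is one allocation per thread, so it is contiguous and its start and size are known at the point of allocation.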