Thread overview
Emulated TLS for Android
Jul 06, 2017
Joakim
Jul 06, 2017
Johannes Pfau
Jul 07, 2017
Joakim
July 06, 2017
So I've finally spent some time looking at this, i.e. what work Google did to get a gcc-like emulated TLS into llvm, since they've ditched the gcc compiler from their Native Development Kit (NDK):

https://reviews.llvm.org/D10522
https://bugs.llvm.org/show_bug.cgi?id=23566
https://android.googlesource.com/platform/ndk/+/ndk-r15-release/CHANGELOG.md

It simply modifies llvm to call __emutls_get_address from libgcc (a library the NDK still supplies) anytime your code accesses a thread-local variable, but there's no hook in __emutls_get_address that I can use to register that data with the D GC:

https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c#L127
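
To make the mechanism concrete, here is a minimal C sketch of the general technique: every thread-local access goes through a lookup function that lazily allocates a per-thread copy of the variable. It is not the libgcc code. The names tls_control, sketch_get_address and sketch_demo, the fixed-size table, and the hand-assigned slot are all assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical control object for one thread-local variable.
   The real libgcc structure is laid out differently. */
typedef struct {
    size_t      size;   /* size of the variable in bytes */
    const void *init;   /* initial value, or NULL for zero-initialised data */
    size_t      slot;   /* index into each thread's table (fixed by hand here) */
} tls_control;

enum { MAX_TLS_VARS = 16 };   /* fixed table size: an assumption of this sketch */

static pthread_key_t  tls_key;
static pthread_once_t tls_once = PTHREAD_ONCE_INIT;

/* Thread-exit destructor: free every variable copy, then the table itself. */
static void tls_free_table(void *p) {
    void **table = p;
    for (size_t i = 0; i < MAX_TLS_VARS; i++)
        free(table[i]);
    free(table);
}

static void tls_make_key(void) {
    int rc = pthread_key_create(&tls_key, tls_free_table);
    assert(rc == 0);
    (void)rc;
}

/* Conceptually what every thread-local access is lowered to: return this
   thread's copy of the variable, allocating it lazily on first use. */
void *sketch_get_address(tls_control *c) {
    pthread_once(&tls_once, tls_make_key);
    void **table = pthread_getspecific(tls_key);
    if (table == NULL) {
        table = calloc(MAX_TLS_VARS, sizeof *table);
        assert(table != NULL);
        pthread_setspecific(tls_key, table);
    }
    assert(c->slot < MAX_TLS_VARS);
    if (table[c->slot] == NULL) {
        /* Each variable gets its own heap block; nothing here tells a
           garbage collector that the block exists. */
        void *p = calloc(1, c->size);
        assert(p != NULL);
        if (c->init != NULL)
            memcpy(p, c->init, c->size);
        table[c->slot] = p;
    }
    return table[c->slot];
}

static const int   counter_init = 7;
static tls_control counter_ctl  = { sizeof(int), &counter_init, 0 };

static void *read_counter(void *out) {
    *(int *)out = *(int *)sketch_get_address(&counter_ctl);
    return NULL;
}

/* Change the caller's copy, then report what a brand-new thread sees. */
int sketch_demo(void) {
    *(int *)sketch_get_address(&counter_ctl) = 42;
    int seen = -1;
    pthread_t t;
    int rc = pthread_create(&t, NULL, read_counter, &seen);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);
    return seen;
}
```

The allocations inside sketch_get_address are the crux: they are scattered heap blocks that no collector has been told about, which is the kind of gap described above.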

I can compile and run much D code fine with llvm's emulated TLS (all of druntime's tests pass), but because the thread-local data is not registered with the GC, a dozen or so Phobos modules' tests fail or segfault in somewhat random ways.

Three ways to fix this come to mind:

1. Intercept all accesses to thread-local variables at runtime and make sure they're registered with the GC, i.e. by inserting some registering function after __emutls_get_address.  This would require deeper knowledge of ldc and llvm than I have, so one of you would probably have to do it.  Also, we'd now be depending on libgcc for its emulated TLS, i.e. another dependency.

2. Intercept __emutls_get_address when linking and replace it with our own implementation, rather than depending on libgcc's version.  This can be done; I tried it with an empty function.  A drawback is that you apparently don't know up front how large the emulated thread-local data section will be, which I think is why the libgcc implementation linked above keeps extending it.

3. Modify my llvm patch so that it is enabled for Android behind a runtime flag, and apply it to our llvm source before building it.  The patch keeps TLS data where it is on other platforms, i.e. in .tdata/.tbss, but removes the SHF_TLS/STT_TLS ELF flags, adds the section-delimiting symbols _tlsstart and _tlsend, and replaces the TLS relocation with a normal one (pretty much Walter's emulated TLS for OS X: https://gist.github.com/joakim-noah/1fb23fba1ba5b7e87e1a).

Initially the patch would only have to be applied for releases, but if we get an Android CI going, it'd have to be applied for that too.  Because _tlsstart is defined in the module with main, that object has to be linked first.
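
As a rough illustration of why a delimited, contiguous TLS section is convenient, the C sketch below copies one image per thread and translates addresses by offset. It is not the patch or the druntime code. The struct standing in for the region between _tlsstart and _tlsend, and the names thread_tls, thread_tls_create, thread_tls_address and contiguous_demo, are assumptions for illustration. Per-thread blocks are passed explicitly, whereas a real runtime would look them up per thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the TLS initialisation image. In the scheme described above
   the linker would place the variables between _tlsstart and _tlsend; a
   struct is used here so the sketch compiles without a patched toolchain. */
static struct {
    int   counter;
    void *gc_pointer;
    char  name[8];
} tls_image = { 7, NULL, "init" };

static const char *image_start(void) { return (const char *)&tls_image; }
static size_t      image_size(void)  { return sizeof tls_image; }

/* One thread's private copy of the whole image. */
typedef struct {
    char  *base;
    size_t size;
} thread_tls;

/* Run once per thread. The single range [base, base + size) is all that
   would need to be registered with the GC for this thread. */
thread_tls thread_tls_create(void) {
    thread_tls t;
    t.size = image_size();
    t.base = malloc(t.size);
    assert(t.base != NULL);
    memcpy(t.base, image_start(), t.size);
    return t;
}

void thread_tls_destroy(thread_tls *t) {
    free(t->base);
    t->base = NULL;
    t->size = 0;
}

/* Translate the address of a variable in the image into the address of the
   same variable in this thread's copy: same offset, different base. */
void *thread_tls_address(const thread_tls *t, const void *image_addr) {
    size_t offset = (size_t)((const char *)image_addr - image_start());
    assert(offset < t->size);
    return t->base + offset;
}

/* Two "threads" (explicit blocks, to keep the sketch free of pthreads):
   a write through one block must not be visible through the other. */
int contiguous_demo(void) {
    thread_tls a = thread_tls_create();
    thread_tls b = thread_tls_create();
    *(int *)thread_tls_address(&a, &tls_image.counter) = 42;
    int other = *(int *)thread_tls_address(&b, &tls_image.counter);
    thread_tls_destroy(&a);
    thread_tls_destroy(&b);
    return other;
}
```

The trade-off is visible in thread_tls_create: the image size must be known when the thread's block is allocated, which is simple for one statically linked binary but harder once shared libraries are loaded later.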

I'm leaning towards option 3: it's the easiest, and I'm not too keen on how the libgcc version works.

Since 3. will require patching our llvm, let me know what you think we should do.

With this last change, ldc will have Android cross-compilation support from every platform that's part of the official release.  Until we get some way to generate a cross-compiled stdlib from the compiler and accompanying source, I can put up a tarfile with the cross-compiled stdlib for Android, though users will still need the NDK for their platform for its native Android libraries and linker.
July 06, 2017
On Thu, 06 Jul 2017 11:13:20 +0000,
Joakim <dlang@joakim.fea.st> wrote:

> So I've finally spent some time looking at this, i.e. what work Google did to get a gcc-like emulated TLS into llvm, since they've ditched the gcc compiler from their Native Development Kit (NDK):
> 
> [...]
>
> With this last change, ldc will have Android cross-compilation support from every platform that's part of the official release. Until we get some way to generate a cross-compiled stdlib from the compiler and accompanying source, I can put up a tarfile with the cross-compiled stdlib for Android, though users will still need the NDK for their platform for its native Android libraries and linker.

Interesting, I've had the exact same problem with GDC which is no surprise as it's using the same mechanism ;-)

1) Does not work if you want to support mixing C and D code. You can intercept calls from D code, but if the variable is only accessed through C code your custom function is not run. (Unless you intercept the function at runtime, but I'm not sure if this leads to a stable solution...)

2) The GCC implementation has the advantage of working with dynamically
loaded shared libraries, static libraries, any number of threads and
it's runtime-linker agnostic. You have to sacrifice one of these
features to know the per-thread memory size. So the GCC solution
is quite elegant, but it does not work with the GC too well...

3) Then you lose C/D compatibility for thread-local variables, and I'm not sure whether the DMD approach fully supports dynamic shared library loading. Do you have some more information about this implementation? I'm wondering whether C compatibility is that important. But TLS for shared library loading etc. should work.


The main problem with the GCC implementation is that the memory for TLS is not contiguous. So even if you end up with a solution, you'll have to add a GC range for every single variable and thread. This is not exactly going to be fast...

The solution we came up with for GDC was to generate a __scan_emutls(cb) function per module. The function then calls cb(&var, var.sizeof) for every TLS variable in the module. Add a pointer to __scan_emutls to ModuleInfo and all modules can be scanned. But the __scan_emutls functions have to be called for every thread, and as the GC runs in only one thread, you'll have to do this at thread startup (or whenever a thread loads a new shared library) and store a list of every variable's location and size... I never updated this code for the new rt.sections mechanism though, so this is currently broken.
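
The per-module scan idea can be sketched in C roughly as follows. This is not the GDC implementation. The module names, the scanners table standing in for ModuleInfo, and the fixed ranges array standing in for the GC's range list are all hypothetical, and native _Thread_local variables stand in for emulated ones just so there is something to take the address of.

```c
#include <assert.h>
#include <stddef.h>

/* Callback type matching the cb(&var, var.sizeof) shape described above. */
typedef void (*scan_cb)(void *addr, size_t size);

/* Module A: two thread-local variables and its hand-written scan function.
   In the scheme described above the compiler would generate this function. */
static _Thread_local int   a_count;
static _Thread_local void *a_cache[4];

static void module_a_scan_emutls(scan_cb cb) {
    cb(&a_count, sizeof a_count);
    cb(a_cache, sizeof a_cache);
}

/* Module B: one thread-local variable. */
static _Thread_local double b_scale;

static void module_b_scan_emutls(scan_cb cb) {
    cb(&b_scale, sizeof b_scale);
}

/* Stand-in for walking ModuleInfo: a table of per-module scan functions. */
typedef void (*module_scanner)(scan_cb);
static const module_scanner scanners[] = {
    module_a_scan_emutls,
    module_b_scan_emutls,
};

/* Stand-in for the GC's range list. A real runtime would need locking and a
   growable container; a fixed array keeps the sketch short. */
typedef struct { void *addr; size_t size; } gc_range;
enum { MAX_RANGES = 32 };
static gc_range ranges[MAX_RANGES];
static size_t   range_count;

static void record_range(void *addr, size_t size) {
    assert(range_count < MAX_RANGES);
    ranges[range_count].addr = addr;
    ranges[range_count].size = size;
    range_count++;
}

/* To be run in each new thread at startup: one range per variable.
   Returns how many ranges this call added. */
size_t register_thread_tls(void) {
    size_t before = range_count;
    for (size_t i = 0; i < sizeof scanners / sizeof scanners[0]; i++)
        scanners[i](record_range);
    return range_count - before;
}

/* Total number of bytes currently covered by recorded ranges. */
size_t registered_bytes(void) {
    size_t total = 0;
    for (size_t i = 0; i < range_count; i++)
        total += ranges[i].size;
    return total;
}
```

Because register_thread_tls adds one range per variable on every call, the range list grows with the number of variables times the number of threads, which is the cost noted above.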

We could probably do better by patching the libgcc functions, but it'll take a very long time until these updated libgcc versions have been rolled out on all interesting targets. Optimally libgcc would just provide a callback __emutls_iterate_variables(cb) to iterate over all variables in all threads. We can't really do that externally, as we can't access emutls_mutex and emutls_key, and as __emutls_get_address updates the pthread_setspecific value anyway, __emutls_get_address needs to be patched.

The emutls source code is here (GPL3 with GCC Runtime Library Exception!!!) https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

-- Johannes

July 07, 2017
On Thursday, 6 July 2017 at 12:10:40 UTC, Johannes Pfau wrote:
> On Thu, 06 Jul 2017 11:13:20 +0000,
> Joakim <dlang@joakim.fea.st> wrote:
>
>> So I've finally spent some time looking at this, i.e. what work Google did to get a gcc-like emulated TLS into llvm, since they've ditched the gcc compiler from their Native Development Kit (NDK):
>> 
>> [...]
>>
>> With this last change, ldc will have Android cross-compilation support from every platform that's part of the official release. Until we get some way to generate a cross-compiled stdlib from the compiler and accompanying source, I can put up a tarfile with the cross-compiled stdlib for Android, though users will still need the NDK for their platform for its native Android libraries and linker.
>
> Interesting, I've had the exact same problem with GDC which is no surprise as it's using the same mechanism ;-)
>
> 1) Does not work if you want to support mixing C and D code. You can intercept calls from D code, but if the variable is only accessed through C code your custom function is not run. (Unless you intercept the function at runtime, but I'm not sure if this leads to a stable solution...)

I suppose it's possible that you have some extern(C) TLS variable in a D module that's accessed first or only from the C code, but that seems unlikely.

> 2) The GCC implementation has the advantage of working with dynamically
> loaded shared libraries, static libraries, any number of threads and
> it's runtime-linker agnostic. You have to sacrifice one of these
> features to know the per-thread memory size. So the GCC solution
> is quite elegant, but it does not work with the GC too well...

Only shared libraries have not been made to work with the other emulated TLS approaches on Android, largely because I have not looked into switching Android to the massive rt.sections_elf_shared, which ldc already uses for non-Android-linux/Darwin/BSD.

> 3) Then you lose C/D compatibility for thread-local variables, and I'm not sure whether the DMD approach fully supports dynamic shared library loading. Do you have some more information about this implementation? I'm wondering whether C compatibility is that important. But TLS for shared library loading etc. should work.

C compatibility only breaks down in the extreme scenario you alluded to, and I doubt there is much intermingling of TLS variables with C code, even in cases where they are properly registered with the D GC first so that it works fine.  Yeah, no additional D shared libraries work on Android yet, as mentioned above, only a single D shared library that statically links against the D runtime.

As for more info, Walter wrote an article about it, dmd on OS X used it with Mach-O for years afterwards (still in the defunct x86 version), and I simply copied it over onto Android with ELF:

http://www.drdobbs.com/architecture-and-design/implementing-thread-local-storage-on-os/228701185
https://github.com/dlang/druntime/commit/73cf2c150
https://github.com/dlang/druntime/pull/784

> The main problem with the GCC implementation is that the memory for TLS is not contiguous. So even if you end up with a solution, you'll have to add a GC range for every single variable and thread. This is not exactly going to be fast...
>
> The solution we came up with for GDC was to generate a __scan_emutls(cb) function per module. The function then calls cb(&var, var.sizeof) for every TLS variable in the module. Add a pointer to __scan_emutls to ModuleInfo and all modules can be scanned. But the __scan_emutls functions have to be called for every thread, and as the GC runs in only one thread, you'll have to do this at thread startup (or whenever a thread loads a new shared library) and store a list of every variable's location and size... I never updated this code for the new rt.sections mechanism though, so this is currently broken.

Interesting, you initialize and GC-register every thread-local variable at every thread startup and add to the list when a shared library is loaded, rather than lazily allocating like other implementations.  I guess this is the cost of making sure the GC always knows what's going on.

> We could probably do better by patching the libgcc functions, but it'll take a very long time until these updated libgcc versions have been rolled out on all interesting targets. Optimally libgcc would just provide a callback __emutls_iterate_variables(cb) to iterate over all variables in all threads. We can't really do that externally, as we can't access emutls_mutex and emutls_key, and as __emutls_get_address updates the pthread_setspecific value anyway, __emutls_get_address needs to be patched.

Yeah, I was initially thinking of a hook like __emutls_iterate_variables too, but after seeing that this implementation may extend the thread-local data at any time, I guess that would still be problematic.

> The emutls source code is here (GPL3 with GCC Runtime Library Exception!!!) https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

Yeah, I linked to it above.  I've since also found this llvm compiler-rt implementation under permissive licenses, written and merged by the same Google engineer who got the emulated TLS hooks into llvm, and which helpfully also has some doc comments (not to mention a Windows version):

https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c