Thread overview
LLVM and TLS
Feb 17, 2015
Jonathan Marler
Feb 17, 2015
Dan Olson
Feb 17, 2015
Martin Nowak
Feb 17, 2015
Jacob Carlborg
Feb 18, 2015
Jonathan Marler
Feb 18, 2015
Jacob Carlborg
Feb 18, 2015
Dan Olson
Feb 18, 2015
Jonathan Marler
Feb 22, 2015
Dan Olson
Feb 23, 2015
Jonathan Marler
Feb 23, 2015
Dan Olson
Feb 23, 2015
Jacob Carlborg
Feb 17, 2015
Joakim
Feb 17, 2015
Kai Nacke
Feb 19, 2015
deadalnix
Feb 20, 2015
Jacob Carlborg
February 17, 2015
I've noticed that on my windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD).  I haven't tried LDC yet, however, on a macbook pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much much slower).  Does anyone know if this is because of the way LLVM handles TLS storage?  I'll have to try using LDC on my windows machine but maybe one of you know off hand whether or not LLVM has some performance problems with TLS storage. Thanks!
February 17, 2015
"Jonathan Marler" <johnnymarler@gmail.com> writes:

> I've noticed that on my windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD).  I haven't tried LDC yet, however, on a macbook pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much much slower).  Does anyone know if this is because of the way LLVM handles TLS storage?  I'll have to try using LDC on my windows machine but maybe one of you know off hand whether or not LLVM has some performance problems with TLS storage. Thanks!

Last time I checked, DMD still did not use OS X native TLS support, but has its own solution.  Try LDC and see if the performance improves because LDC uses OS X native TLS.
--
Dan
February 17, 2015
On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
> Try LDC and see if the performance improves because LDC uses OS X native TLS.

Is there more information available about OS X's TLS support and how it is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.

February 17, 2015
On 2015-02-17 07:47, Martin Nowak wrote:

> Is there more information available about OS X's TLS support and how it
> is implemented in LDC? What version of OS X is required? I'd very much
> like to use that for DMD/druntime too, so that we can go on with the
> shared library support.

I've created an issue for this, there is some information about the implementation in the issue [1].

OS X 10.7 or later is required. But I'm pretty sure we can backport it to 10.6 if we really want or need to.

[1] https://issues.dlang.org/show_bug.cgi?id=9476#c2

-- 
/Jacob Carlborg
February 17, 2015
On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler wrote:
> I've noticed that on my windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD).  I haven't tried LDC yet, however, on a macbook pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much much slower).  Does anyone know if this is because of the way LLVM handles TLS storage?  I'll have to try using LDC on my windows machine but maybe one of you know off hand whether or not LLVM has some performance problems with TLS storage. Thanks!

It has little to do with the linker or llvm.  dmd doesn't use the native TLS APIs on OS X, as Dan says, because OS X didn't have native TLS back then:

http://www.drdobbs.com/architecture-and-design/implementing-thread-local-storage-on-os/228701185

Druntime has since been updated to call pthread_setspecific and pthread_getspecific, but maybe that's still slower than non-TLS on OS X:

https://github.com/D-Programming-Language/druntime/blob/master/src/rt/sections_osx.d#L151
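
To illustrate why key-based TLS can cost more than a plain global, here is a minimal C sketch of the pthread_getspecific/pthread_setspecific pattern. It is not druntime's code: the names tls_slot, tls_store and shared_store are hypothetical, and the per-thread slot is a single size_t for simplicity.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names for illustration; this is not druntime's code. */
static pthread_key_t tls_key;
static pthread_once_t tls_key_once = PTHREAD_ONCE_INIT;
static int tls_key_ok;

static void make_tls_key(void)
{
    /* free() is run on each thread's slot when that thread exits. */
    tls_key_ok = (pthread_key_create(&tls_key, free) == 0);
}

/* Return this thread's private size_t slot, allocating it on first use.
   Every access pays for pthread_once plus pthread_getspecific. */
size_t *tls_slot(void)
{
    pthread_once(&tls_key_once, make_tls_key);
    if (!tls_key_ok)
        return NULL;

    size_t *slot = pthread_getspecific(tls_key);
    if (slot == NULL) {
        slot = calloc(1, sizeof *slot);
        if (slot != NULL && pthread_setspecific(tls_key, slot) != 0) {
            free(slot);
            slot = NULL;
        }
    }
    return slot;
}

/* Emulated-TLS write: a function call and a key lookup per store. */
int tls_store(size_t value)
{
    size_t *slot = tls_slot();
    if (slot == NULL)
        return -1;
    *slot = value;
    return 0;
}

/* Plain global write for comparison: a single store. */
size_t shared_global;

void shared_store(size_t value)
{
    shared_global = value;
}
```

Each tls_store call goes through a library lookup before it can write, while shared_store compiles to one store instruction, which is consistent with the gap reported in this thread.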

As Dan noted, David got ldc working with the since-added undocumented TLS, i.e. TLV, functions on OS X:

https://github.com/ldc-developers/druntime/blob/ldc/src/ldc/osx_tls.c

On Tuesday, 17 February 2015 at 06:47:12 UTC, Martin Nowak wrote:
> On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
>> Try LDC and see if the performance improves because LDC uses OS X native TLS.
>
> Is there more information available about OS X's TLS support and how it is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.

The functions David used were added in 10.7.
February 17, 2015
Hi Jonathan!

On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler wrote:
> I've noticed that on my windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD).  I haven't tried LDC yet, however, on a macbook pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much much slower).  Does anyone know if this is because of the way LLVM handles TLS storage?  I'll have to try using LDC on my windows machine but maybe one of you know off hand whether or not LLVM has some performance problems with TLS storage. Thanks!

On Windows, LLVM uses the segment registers for TLS storage (fs: for 32-bit and gs: for 64-bit). There is no other impact.

Regards,
Kai
February 18, 2015
I've created a simple program to demonstrate the issue.  The performance cost of TLS vs __gshared is over one and a half orders of magnitude!

import std.stdio;
import std.datetime;

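size_t tlsGlobal;               // module-level variables are thread-local by default in D
__gshared size_t sharedGlobal;  // __gshared puts the variable in ordinary global storage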

void main(string[] args)
{
  runTest(3, 10_000_000);
}

void runTest(size_t runCount, size_t loopCount)
{
  writeln("--------------------------------------------------");
  StopWatch sw;
  for(auto runIndex = 0; runIndex < runCount; runIndex++) {

    writefln("run %s (loopcount %s)", runIndex + 1, loopCount);

    sw.reset();
    sw.start();
    for(size_t i = 0; i < loopCount; i++) {
      tlsGlobal = i;
    }
    sw.stop();
    writefln("  TLS   : %s milliseconds", sw.peek.msecs);

    sw.reset();
    sw.start();
    for(size_t i = 0; i < loopCount; i++) {
      sharedGlobal = i;
    }
    sw.stop();
    writefln("  Shared: %s milliseconds", sw.peek.msecs);
  }
}

--------------------------------------------------
Output:
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 104 milliseconds
  Shared: 3 milliseconds
run 2 (loopcount 10000000)
  TLS   : 97 milliseconds
  Shared: 4 milliseconds
run 3 (loopcount 10000000)
  TLS   : 99 milliseconds
  Shared: 3 milliseconds
February 18, 2015
On 2015-02-18 02:41, Jonathan Marler wrote:
> I've created a simple program to demonstrate the issue.  The performance
> cost of TLS vs __gshared is over one and a half orders of magnitude!

It would be nice to have a comparison in C as well, which does use the native TLS implementation.
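
A rough sketch of what such a C comparison might look like, assuming a C11 compiler with _Thread_local support. The function names are made up for illustration, and volatile is used here only to keep the optimizer from dropping the stores; no timings are claimed.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* C11 thread-local storage; older compilers spell it __thread. */
static _Thread_local volatile size_t tls_global;
static volatile size_t shared_global;

/* volatile forces every store to be emitted, so the optimizer
   cannot collapse either loop to its final write. */
size_t fill_tls(size_t loop_count)
{
    for (size_t i = 0; i < loop_count; i++)
        tls_global = i;
    return tls_global;
}

size_t fill_shared(size_t loop_count)
{
    for (size_t i = 0; i < loop_count; i++)
        shared_global = i;
    return shared_global;
}

/* CPU time taken by one call of fn(loop_count), in milliseconds. */
double time_ms(size_t (*fn)(size_t), size_t loop_count)
{
    clock_t start = clock();
    fn(loop_count);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Mirrors the D program: print both timings for each run. */
void run_test(size_t run_count, size_t loop_count)
{
    for (size_t run = 0; run < run_count; run++) {
        printf("run %zu (loopcount %zu)\n", run + 1, loop_count);
        printf("  TLS   : %.0f milliseconds\n",
               time_ms(fill_tls, loop_count));
        printf("  Shared: %.0f milliseconds\n",
               time_ms(fill_shared, loop_count));
    }
}
```

Calling run_test(3, 10000000) from main would produce output in the same shape as the D program above.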

-- 
/Jacob Carlborg
February 18, 2015
"Jonathan Marler" <johnnymarler@gmail.com> writes:

> I've created a simple program to demonstrate the issue.  The performance cost of TLS vs __gshared is over one and a half orders of magnitude!
>
--snip--

I ran on my MacBook to compare DMD and LDC 2.066.1 versions.  With LDC, I had to put an empty asm instruction in the for loops, otherwise the optimizer removed all but the last write and timing looked really good (0 milliseconds)!
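
For readers unfamiliar with the trick, here is the same idea sketched in C rather than D. It assumes a GCC- or Clang-compatible compiler; the exact statement used in the D loops is not shown in this thread, and the names sink and store_each_iteration are made up.

```c
#include <stddef.h>

size_t sink;

/* Without the barrier, an optimizer may keep only the final store. */
size_t store_each_iteration(size_t loop_count)
{
    for (size_t i = 0; i < loop_count; i++) {
        sink = i;
        /* Empty asm with a "memory" clobber (GCC/Clang extension):
           it emits no instructions but tells the compiler that memory
           may be read here, so the store above must be performed. */
        __asm__ volatile("" ::: "memory");
    }
    return sink;
}
```

The barrier changes no values. It only stops the compiler from proving that the intermediate writes are dead.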

LDC  __gshared versus TLS time is a bit better than DMD.

$ dmd -O timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 93 milliseconds
  Shared: 6 milliseconds
run 2 (loopcount 10000000)
  TLS   : 91 milliseconds
  Shared: 6 milliseconds
run 3 (loopcount 10000000)
  TLS   : 92 milliseconds
  Shared: 4 milliseconds

$ ldmd2 -O3 timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 21 milliseconds
  Shared: 3 milliseconds
run 2 (loopcount 10000000)
  TLS   : 22 milliseconds
  Shared: 5 milliseconds
run 3 (loopcount 10000000)
  TLS   : 20 milliseconds
  Shared: 3 milliseconds
February 18, 2015
On Wednesday, 18 February 2015 at 17:03:38 UTC, Dan Olson wrote:
> LDC  __gshared versus TLS time is a bit better than DMD.
>
> $ dmd -O timetls.d
> $ ./timetls
> --------------------------------------------------
> run 1 (loopcount 10000000)
>   TLS   : 93 milliseconds
>   Shared: 6 milliseconds
> run 2 (loopcount 10000000)
>   TLS   : 91 milliseconds
>   Shared: 6 milliseconds
> run 3 (loopcount 10000000)
>   TLS   : 92 milliseconds
>   Shared: 4 milliseconds
>
> $ ldmd2 -O3 timetls.d
> $ ./timetls
> --------------------------------------------------
> run 1 (loopcount 10000000)
>   TLS   : 21 milliseconds
>   Shared: 3 milliseconds
> run 2 (loopcount 10000000)
>   TLS   : 22 milliseconds
>   Shared: 5 milliseconds
> run 3 (loopcount 10000000)
>   TLS   : 20 milliseconds
>   Shared: 3 milliseconds

That's quite a bit better.  If I run this using DMD on Windows, TLS and __gshared perform almost the same:

dmd test.d
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 28 milliseconds
  Shared: 25 milliseconds
run 2 (loopcount 10000000)
  TLS   : 28 milliseconds
  Shared: 25 milliseconds
run 3 (loopcount 10000000)
  TLS   : 27 milliseconds
  Shared: 25 milliseconds

If I turn on optimization they both take 7 milliseconds.
