April 23, 2017
On Saturday, 22 April 2017 at 10:38:45 UTC, Stefan Koch wrote:
> On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
>>
>> Does this apply to templates too? I recently tried some code, and a templated version with about 10 instantiations for 4-5 types increased compile time from about 1 sec up to 4! The template itself was straightforward, just a bunch of static if-else-else for type special cases.
>
> If you could share the code it would be appreciated.
> If you cannot share it publicly come in irc sometime.
> I am Uplink|DMD there.

Sorry, my mistake: that was actually caused by the build system and added dependencies (which are compiled every time no matter what, hence the slowdown). Testing overloaded functions vs. a template shows no significant difference in build times.
April 23, 2017
On Sunday, 23 April 2017 at 02:45:09 UTC, evilrat wrote:
> On Saturday, 22 April 2017 at 10:38:45 UTC, Stefan Koch wrote:
>> On Saturday, 22 April 2017 at 03:03:32 UTC, evilrat wrote:
>>> [...]
>>
>> If you could share the code it would be appreciated.
>> If you cannot share it publicly come in irc sometime.
>> I am Uplink|DMD there.
>
> Sorry, my mistake: that was actually caused by the build system and added dependencies (which are compiled every time no matter what, hence the slowdown). Testing overloaded functions vs. a template shows no significant difference in build times.

Ah I see.
4x slowdown for 10 instances seemed rather unusual.
Though doubtlessly possible.
April 24, 2017
On Saturday, 22 April 2017 at 14:29:22 UTC, Stefan Koch wrote:
> And for that reason I am looking to extend the interface to support for example scaled loads and the like.
> Otherwise you end up with 1000 temporaries that add offsets to pointers.

What are scaled loads?

> Also and perhaps more importantly I am sick and tired of hearing "why don't you use ldc/llvm?" all the time...

Yes, that's not fair.
April 24, 2017
On Thursday, 20 April 2017 at 12:56:11 UTC, Stefan Koch wrote:
> Hi Guys,
>
> I have just begun work on the x86 jit backend.
>
> Because right now I am at a stage where further design decisions need to be made and those decisions need to be informed by how a _fast_ jit-compatible x86-codegen is structured.
>
> Since I do believe that this is an interesting topic,
> I will give you the over-the-shoulder perspective on this.
>
> At the time of posting the video is still uploading, but you should be able to see it soon.
>
> https://www.youtube.com/watch?v=pKorjPAvhQY
>
> Cheers,
> Stefan

Have you considered using the LLVM jit compiler for CTFE? We already have an LLVM front end. This would mean that CTFE would depend on LLVM, which is a large dependency, but it would create very fast, optimized code for CTFE on any platform.

Keep in mind that I'm not as familiar with the technical details of CTFE, so you may see a lot of negative ramifications that I'm not aware of. I just want to make sure it's being considered and to hear what your and others' thoughts are.
April 24, 2017
On Monday, 24 April 2017 at 12:59:55 UTC, Jonathan Marler wrote:
>
> Have you considered using the LLVM jit compiler for CTFE? We already have an LLVM front end. This would mean that CTFE would depend on LLVM, which is a large dependency, but it would create very fast, optimized code for CTFE on any platform.
>

I can't help but laugh at this after the above posts...
April 24, 2017
On Monday, 24 April 2017 at 14:41:44 UTC, jmh530 wrote:
> On Monday, 24 April 2017 at 12:59:55 UTC, Jonathan Marler wrote:
>>
>> Have you considered using the LLVM jit compiler for CTFE? We already have an LLVM front end. This would mean that CTFE would depend on LLVM, which is a large dependency, but it would create very fast, optimized code for CTFE on any platform.
>>
>
> I can't help but laugh at this after the above posts...

I totally missed when Stefan said:

> Also and perhaps more importantly I am sick and tired of hearing "why don't you use ldc/llvm?" all the time...

That is pretty hilarious :) I suppose I just demonstrated the reason he is attempting to create an x86 jitter: so he will have an interface that could be extended to something like LLVM. Wow.

April 24, 2017
On Monday, 24 April 2017 at 11:29:01 UTC, Ola Fosheim Grøstad wrote:
>
> What are scaled loads?

x86 has addressing modes which allow you to multiply an index by one of a fixed set of scale factors and add it as an offset to the pointer you want to load from.
This makes memory access patterns more transparent to the caching and prefetch systems.
It also reduces the overall code size.
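
To make that concrete, here is a minimal C sketch. It is not taken from the JIT under discussion, and the function names `load_scaled` and `load_unscaled` are made up for illustration. The scale factors x86 allows are 1, 2, 4 and 8. For the first function an optimizing x86-64 compiler will typically emit a single scaled load, something like `mov eax, [rdi + rsi*4]`; the second spells out the temporaries that a code generator without scaled loads would have to create.

```c
#include <stddef.h>
#include <stdint.h>

/* Load element `index` of a 32-bit array. The element size (4) is one of
   the scale factors x86 supports, so base + index*4 fits one address mode. */
int32_t load_scaled(const int32_t *base, size_t index)
{
    return base[index];
}

/* The same load written out step by step, the way a naive code generator
   might emit it: one temporary for the byte offset, one for the address. */
int32_t load_unscaled(const int32_t *base, size_t index)
{
    size_t byte_offset = index * sizeof(int32_t);        /* temporary 1 */
    const unsigned char *addr =
        (const unsigned char *)base + byte_offset;       /* temporary 2 */
    return *(const int32_t *)addr;                       /* the load itself */
}
```

Both functions return the same value; the difference only shows up in how many instructions and temporaries the backend needs when it cannot express the first form directly.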
April 25, 2017
On Monday, 24 April 2017 at 17:48:50 UTC, Stefan Koch wrote:
> On Monday, 24 April 2017 at 11:29:01 UTC, Ola Fosheim Grøstad wrote:
>>
>> What are scaled loads?
>
> x86 has addressing modes which allow you to multiply an index by one of a fixed set of scale factors and add it as an offset to the pointer you want to load from.
> This makes memory access patterns more transparent to the caching and prefetch systems.
> It also reduces the overall code size.

Oh, ok. AFAIK the decoding of indexing modes into micro-ops (the real instructions used inside the CPU, not the actual op-codes) has no effect on the caching system. It may, however, compress the generated code so you don't flush the instruction cache, and speed up the decoding of op-codes into micro-ops.

If you want to improve cache loads you have to consider when to use the "prefetch" instructions, but the effect (positive or negative) varies greatly between CPU generations so you will basically need to target each CPU-generation individually.

Probably too much work to be worthwhile as it usually doesn't pay off until you work on large datasets and then you usually have to be careful with partitioning the data into cache-friendly working-sets. Probably not so easy to do for a JIT.

You'll probably get a decent performance boost without worrying about caching too much in the first implementation anyway. Any gains in that area could be obliterated in the next CPU generation... :-/





April 25, 2017
On Tuesday, 25 April 2017 at 09:09:00 UTC, Ola Fosheim Grøstad wrote:
> On Monday, 24 April 2017 at 17:48:50 UTC, Stefan Koch wrote:
>> [...]
>
> Oh, ok. AFAIK the decoding of indexing modes into micro-ops (the real instructions used inside the CPU, not the actual op-codes) has no effect on the caching system. It may, however, compress the generated code so you don't flush the instruction cache, and speed up the decoding of op-codes into micro-ops.
>
> If you want to improve cache loads you have to consider when to use the "prefetch" instructions, but the effect (positive or negative) varies greatly between CPU generations so you will basically need to target each CPU-generation individually.
>
> Probably too much work to be worthwhile as it usually doesn't pay off until you work on large datasets and then you usually have to be careful with partitioning the data into cache-friendly working-sets. Probably not so easy to do for a JIT.
>
> You'll probably get a decent performance boost without worrying about caching too much in the first implementation anyway. Any gains in that area could be obliterated in the next CPU generation... :-/

It's already the case. Intel and AMD (especially with Ryzen) have strongly discouraged the use of prefetch instructions since at least Core2 and Athlon64. The icache cost rarely pays off, and the extra prefetches very often break the auto-prefetcher algorithms by wasting memory bandwidth.
April 25, 2017
On Tuesday, 25 April 2017 at 16:16:43 UTC, Patrick Schluter wrote:
> It's already the case. Intel and AMD (especially with Ryzen) have strongly discouraged the use of prefetch instructions since at least Core2 and Athlon64. The icache cost rarely pays off, and the extra prefetches very often break the auto-prefetcher algorithms by wasting memory bandwidth.

I think it just has to be done on a case-by-case basis. But if one doesn't target a specific set of CPUs and a specific predictable access pattern (like visiting every 4th cacheline) then one probably shouldn't do it.

There are also so many different types to choose from: prefetch-for-write, prefetch-for-one-time-use, prefetch-to-cache-level2, etc... Hard to get that right for a small-scale JIT without knowledge of the algorithm or the dataset.
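
For illustration, here is a rough C sketch of what such hints look like in source code. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin takes a read/write flag and a temporal-locality hint from 0 to 3. The function names and `PREFETCH_DISTANCE` are hypothetical, and the distance is an untuned guess rather than a recommendation.

```c
#include <stddef.h>

/* How many elements ahead to prefetch. Illustrative guess only; a real
   value would have to be tuned per CPU generation and access pattern. */
enum { PREFETCH_DISTANCE = 64 };

/* Sum an array while hinting that data further ahead will be read once. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0: prefetch for read; locality = 0: no reuse expected */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
        }
        total += data[i];
    }
    return total;
}

/* Fill an array while hinting that data further ahead will be written. */
void fill_with_prefetch(int *data, size_t n, int value)
{
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 1: prefetch for write; locality = 3: keep in all levels */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 1, 3);
        }
        data[i] = value;
    }
}
```

The hints never change the result, only (possibly) the timing, and for a simple linear walk like this the hardware prefetcher will usually do just as well on its own, which is the point made above.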

