Thread overview
Vectorization examples
Apr 20, 2015
bearophile
Apr 20, 2015
Panke
Apr 20, 2015
finalpatch
Apr 20, 2015
Panke
Apr 24, 2015
Luc Bourhis
Apr 20, 2015
Walter Bright
Apr 20, 2015
bearophile
Apr 20, 2015
Walter Bright
April 20, 2015
"Utilizing the other 80% of your system's performance: Starting with Vectorization" by Ulrich Drepper:

https://www.youtube.com/watch?v=DXPfE2jGqg0

It shows two still missing parts of the D type system: a way to define strongly typed byte alignments for arrays (something better than the aligned() shown here, because I prefer the alignment to be part of the type), and a way to tell the type system that some array slices are fully distinct (the __restrict seen here, I think this information doesn't need to be part of a type).

Bye,
bearophile
April 20, 2015
On Monday, 20 April 2015 at 09:41:09 UTC, bearophile wrote:
> "Utilizing the other 80% of your system's performance: Starting with Vectorization" by Ulrich Drepper:
>
> https://www.youtube.com/watch?v=DXPfE2jGqg0
>
> It shows two still missing parts of the D type system: a way to define strongly typed byte alignments for arrays (something better than the aligned() shown here, because I prefer the alignment to be part of the type), and a way to tell the type system that some array slices are fully distinct (the __restrict seen here, I think this information doesn't need to be part of a type).
>
> Bye,
> bearophile

Aren't unaligned loads as fast as aligned loads on modern x86?
April 20, 2015
On Monday, 20 April 2015 at 11:01:28 UTC, Panke wrote:
> On Monday, 20 April 2015 at 09:41:09 UTC, bearophile wrote:
>> "Utilizing the other 80% of your system's performance: Starting with Vectorization" by Ulrich Drepper:
>>
>> https://www.youtube.com/watch?v=DXPfE2jGqg0
>>
>> It shows two still missing parts of the D type system: a way to define strongly typed byte alignments for arrays (something better than the aligned() shown here, because I prefer the alignment to be part of the type), and a way to tell the type system that some array slices are fully distinct (the __restrict seen here, I think this information doesn't need to be part of a type).
>>
>> Bye,
>> bearophile
>
> Aren't unaligned loads as fast as aligned loads on modern x86?

No, that's not true. On modern x86 processors, using unaligned load instructions on aligned data incurs no additional overhead, so you can always use unaligned loads for everything. But loading data that is actually unaligned is still slower than loading aligned data.
April 20, 2015
On 4/20/2015 2:41 AM, bearophile wrote:
> "Utilizing the other 80% of your system's performance: Starting with
> Vectorization" by Ulrich Drepper:
>
> https://www.youtube.com/watch?v=DXPfE2jGqg0
>
> It shows two still missing parts of the D type system: a way to define strongly
> typed byte alignments for arrays (something better than the aligned() shown
> here, because I prefer the alignment to be part of the type),

Use arrays of double2, float4, int4, etc., declared in core.simd. Those will be aligned appropriately.


> and a way to tell
> the type system that some array slices are fully distinct (the __restrict seen
> here, I think this information doesn't need to be part of a type).

A runtime test is sufficient.
April 20, 2015
Walter Bright:

> Use arrays of double2, float4, int4, etc., declared in core.simd. Those will be aligned appropriately.

Is the GC able to give memory aligned to 32 bytes for new architectures with 512-bit-wide SIMD?


>> and a way to tell
>> the type system that some array slices are fully distinct (the __restrict seen
>> here, I think this information doesn't need to be part of a type).
>
> A runtime test is sufficient.

One of the points of having a type system is to rule out certain classes of bugs caused by programmers. The compiler could use the type system to add those runtime tests where needed. Sometimes it is even better to avoid the cost of runtime tests altogether by using static information inserted in the code, as shown in that video (he shows assembly code that contains runtime tests).

Another example of missing static information in D is shown near the end of the video, where he shows an annotation to compile functions for different CPUs, where the compiler updates function pointers inside the binary according to the CPU you are using, making the code safe and efficient.

Bye,
bearophile
April 20, 2015
On 4/20/2015 1:09 PM, bearophile wrote:
> Walter Bright:
>
>> Use arrays of double2, float4, int4, etc., declared in core.simd. Those will
>> be aligned appropriately.
>
> Is the GC able to give memory aligned to 32 bytes for new architectures with 512
> bits wide SIMD?

When the CPU requires 32 byte alignment, the compiler/GC will support it.

And even if it doesn't, it is trivial to manually align things.
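A minimal sketch of what "manually align things" could look like, written in C rather than D and not taken from the thread. The helper names `align_up` and `alloc_floats_aligned` are hypothetical. It over-allocates by `align - 1` bytes and rounds the address up to the next multiple of `align`, which is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: round p up to the next multiple of align.
   align must be a non-zero power of two. */
void *align_up(void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + (align - 1)) & ~(uintptr_t)(align - 1));
}

/* Hypothetical helper: allocate room for n floats plus align - 1 spare
   bytes and return a pointer aligned to `align` inside that block.
   The unaligned pointer is stored in *raw; pass that one to free(). */
float *alloc_floats_aligned(size_t n, size_t align, void **raw)
{
    *raw = malloc(n * sizeof(float) + align - 1);
    return *raw ? (float *)align_up(*raw, align) : NULL;
}
```

A caller would keep both pointers, use the aligned one for the SIMD loop, and release the block with `free(raw)`. C11 also offers `aligned_alloc` in `<stdlib.h>`, which avoids tracking the second pointer.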

>>> and a way to tell
>>> the type system that some array slices are fully distinct (the __restrict seen
>>> here, I think this information doesn't need to be part of a type).
>>
>> A runtime test is sufficient.
>
> One of the points of having a type system is to rule out certain classes of bugs
> caused by programmers. The compiler could use the type system to add those
> runtime tests where needed. And even better sometimes is to avoid the time used
> by run time tests, as shown in that video, using the static information inserted
> in the code (he shows assembly code that contains run time tests).

"this information doesn't need to be part of a type"

Besides, you can create a 'restrict' template that checks for overlap at runtime, with checking that can be turned on and off at compile time (i.e. assert). The runtime check overhead should be insignificant if using large arrays.
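The same idea can be sketched in C rather than as a D template. This is only an illustration under assumptions: `ranges_disjoint` and `add_arrays` are hypothetical names, and the check is wrapped in `assert` so that building with `-DNDEBUG` removes it, which mirrors the "turned on and off at compile time" point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the two float ranges share no bytes.
   Addresses are compared as integers because relational operators on
   pointers into different objects are undefined in C. */
int ranges_disjoint(const float *a, size_t na, const float *b, size_t nb)
{
    if (na == 0 || nb == 0)
        return 1; /* an empty range overlaps nothing */
    uintptr_t a0 = (uintptr_t)a, a1 = a0 + na * sizeof *a;
    uintptr_t b0 = (uintptr_t)b, b1 = b0 + nb * sizeof *b;
    return a1 <= b0 || b1 <= a0;
}

/* dst[i] = x[i] + y[i]. The restrict qualifiers promise the compiler that
   dst does not alias the inputs; the asserts verify that promise once per
   call and vanish when compiled with -DNDEBUG. */
void add_arrays(float *restrict dst, const float *restrict x,
                const float *restrict y, size_t n)
{
    assert(ranges_disjoint(dst, n, x, n));
    assert(ranges_disjoint(dst, n, y, n));
    for (size_t i = 0; i < n; ++i)
        dst[i] = x[i] + y[i];
}
```

The check costs a handful of integer comparisons per call regardless of `n`, which is why its overhead fades for large arrays.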


> Another example of missing static information in D is shown near the end of the
> video, where he shows an annotation to compile functions for different CPUs,
> where the compiler updates function pointers inside the binary according to the
> CPU you are using, making the code safe and efficient.

Come on, bearophile. I've done that stuff in C based on the runtime CPU. No compiler support is needed.
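A rough sketch of that kind of hand-written dispatch in C, offered as an illustration and not as the code referred to above. All function names are hypothetical. `__builtin_cpu_supports` is a GCC/Clang builtin for x86; on other compilers or targets this sketch assumes the feature is absent. `sum_unrolled4` is only a stand-in for a build of the loop that uses wider SIMD.

```c
#include <stddef.h>

/* Portable baseline implementation. */
float sum_scalar(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Stand-in for a version tuned for a newer CPU: four independent
   accumulators, then a scalar tail. */
float sum_unrolled4(const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        s += v[i];
    return s;
}

typedef float (*sum_fn)(const float *, size_t);

/* Runtime feature test; returns 0 where the builtin is unavailable. */
int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Pick an implementation based on the CPU the program is running on. */
sum_fn select_sum(void)
{
    return cpu_has_avx2() ? sum_unrolled4 : sum_scalar;
}

/* Public entry point: resolves the function pointer on first use.
   Not thread-safe as written. */
float sum(const float *v, size_t n)
{
    static sum_fn impl;
    if (!impl)
        impl = select_sum();
    return impl(v, n);
}
```

Callers only ever see `sum`; the choice of implementation is made once and then costs one indirect call. The two variants add the elements in a different order, so results can differ in the last bits for general floating-point input.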

April 20, 2015
> No, that's not true. On modern x86 processors, using unaligned load instructions on aligned data incurs no additional overhead, so you can always use unaligned loads for everything. But loading data that is actually unaligned is still slower than loading aligned data.

Thanks for clarifying.
April 24, 2015
On Monday, 20 April 2015 at 11:15:48 UTC, finalpatch wrote:
> On Monday, 20 April 2015 at 11:01:28 UTC, Panke wrote:
>> Aren't unaligned loads as fast as aligned loads on modern x86?
>
> No, that's not true. On modern x86 processors, using unaligned load instructions on aligned data incurs no additional overhead, so you can always use unaligned loads for everything. But loading data that is actually unaligned is still slower than loading aligned data.

According to [1, sections 7.13 and 8.13], the overhead was particularly bad for Core2, but it is no longer a major issue for either Nehalem or Sandy Bridge. Do you have data contradicting him?

[1] Agner Fog, 3. The microarchitecture of Intel, AMD and VIA CPUs, Tech. report, Copenhagen University College of Engineering, February 2012. http://www.agner.org/optimize/