January 07, 2012
On 1/6/2012 4:12 PM, Manu wrote:
> Come on IRC? This requires involved conversation.

I'm on skype.

> I'm sure you realise how much more work this is...

Actually, not that much. Surprising, no? <g> I think I already did the hard stuff by supporting SIMD for float/double.


> Why would you commit to this right off the bat? Why not produce the simple
> primitive type, and allow me the opportunity to try it with the libraries before
> polluting the language itself with a massive volume of stuff...
> I'm genuinely concerned that once you add this to the language, it's done, and
> it'll be stuck there like lots of other debatable features... we can tweak the
> library implementation as we gain experience with usage of the feature.

If it doesn't work, we can back it out. I'm willing to add it as an experimental feature because I don't see it breaking any existing code.


> MS also agree that the primitive __m128 is the right approach. I'm not basing my
> opinion on their judgement at all, I independently conclude it is the right
> approach, but it's encouraging that they agree... and perhaps they're a more
> respectable authority than me and my opinion :)

Can you show me a typical example of how it looks in action in source code?

> What I proposed in the OP is the simplest, most non-destructive initial
> implementation in the language. I think there is the least opportunity for making
> a mistake/wrong decision in my initial proposal, and it can be extended with
> what you're suggesting in time after we have the opportunity to prove that it's
> correct. We can test and prove the rest with libraries before committing to
> implement it in the language...

I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.
January 07, 2012
On 7 January 2012 02:52, Walter Bright <newshound2@digitalmars.com> wrote:

>> MS also agree that the primitive __m128 is the right approach. I'm not
>> basing my opinion on their judgement at all, I independently conclude it
>> is the right approach, but it's encouraging that they agree... and
>> perhaps they're a more respectable authority than me and my opinion :)
>
> Can you show me a typical example of how it looks in action in source code?


Not without breaking NDAs... but maybe I will anyway; I'll dig some stuff up...


>> What I proposed in the OP is the simplest, most non-destructive initial
>> implementation in the language. I think there is the least opportunity
>> for making a mistake/wrong decision in my initial proposal, and it can
>> be extended with what you're suggesting in time after we have the
>> opportunity to prove that it's correct. We can test and prove the rest
>> with libraries before committing to implement it in the language...
>
> I don't think the typeless approach will wind up being any easier, and it'll certainly suck when it comes to optimization, error messages, symbolic debugger support, etc.

Symbolic debugger support eh... now that is a compelling argument! :)

Okay, I'm prepared to reconsider... but I'm still apprehensive. I'm manuevans on skype, on there now if you want to add me...


January 07, 2012
On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:
> On 1/6/2012 11:16 AM, Brad Roberts wrote:
>> However, a counter example, it'd be a lot easier to write a memcpy routine that uses them
>> without having to resort to asm code under this theoretical model.
>
> I would seriously argue that individuals not attempt to write their own memcpy.

Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions; however, they are GPL.

> Why? Because the C one has had probably thousands of programmers looking at it for the last 30 years. You're not going to spend 5 minutes, or even 5 days, and make it faster.

This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.
January 07, 2012
On 7 January 2012 03:46, Vladimir Panteleev <vladimir@thecybershadow.net> wrote:

> On Friday, 6 January 2012 at 20:26:37 UTC, Walter Bright wrote:
>> On 1/6/2012 11:16 AM, Brad Roberts wrote:
>>> However, a counter example, it'd be a lot easier to write a memcpy
>>> routine that uses them without having to resort to asm code under this
>>> theoretical model.
>>
>> I would seriously argue that individuals not attempt to write their own memcpy.
>
> Agner Fog states in his optimization manuals that the glibc routines are fairly unoptimized. He provides his own versions; however, they are GPL.
>
>> Why? Because the C one has had probably thousands of programmers looking
>> at it for the last 30 years. You're not going to spend 5 minutes, or even
>> 5 days, and make it faster.
>
> This assumes that hardware never changes. New memcpy implementations can take advantage of large registers in newer CPUs for higher speeds.

I've never seen a memcpy on any console system I've ever worked on that takes advantage of its large registers... writing a fast memcpy is usually one of the first things we do when we get a new platform ;)


January 07, 2012
On 1/6/12 5:52 PM, Walter Bright wrote:
> Support the 10 vector types as basic types, support them with the
> arithmetic infix operators, and use intrinsics for the rest of the
> operations. I believe this scheme:
>
> 1. will look better in code, and will be easier to use
> 2. will allow for better error detection and more comprehensible error
> messages when things are misused
> 3. will generate better code
> 4. shouldn't be hard to implement, as I already did most of the work
> when I did the SIMD support for float and double.

I think it would be great to try avoiding the barbarism of adding 10 built-in types and a bunch of built-ins.

Historically, D has erred heavily on the side of building in the compiler. Consider the complex numbers affair, in tow with crackpot science arguments on why they're a must. It's great that embarrassment is behind us. Also consider how the hard-coding of associative arrays in an awkward interface inside the runtime has stifled efficient implementations, progress, and innovation in that area. A lot of work is still needed there, too, to essentially undo a bad decision.

Let's not repeat history. Months later we'll look at the hecatomb of types and builtins we dumped into the language and we'll be like, what were we /thinking/?

Adding built-in types and functions is giving up good design, judgment, and using what we have creatively. It may mean failure to understand and use the expressive power of the language, and worse, compensate by adding even more poorly designed artifacts to it.

I would very strongly suggest we reconsider the tactics of it all. Yes, it's great to have SIMD support in the language. No, I don't think adding a wheelbarrow to the language is the right way.


Thanks,

Andrei
January 07, 2012
On Sat, 07 Jan 2012 01:06:21 +0100, Walter Bright <newshound2@digitalmars.com> wrote:

> On 1/6/2012 1:43 PM, Manu wrote:
>> There is actually. To the compiler, the intrinsic is a normal function, with
>> some hook in the code generator to produce the appropriate opcode when it's
>> performing actual code generation.
>> On most compilers, the inline asm on the other hand, is unknown to the compiler,
>> the optimiser can't do much anymore, because it doesn't know what the inline asm
>> has done, and the code generator just goes and pastes your asm code inline where
>> you told it to. It doesn't know if you've written to aliased variables, called
>> functions, etc.. it can no longer safely rearrange code around the inline asm
>> block.. which means it's not free to pipeline the code efficiently.
>
> And, in fact, the compiler should not try to optimize inline assembler. The IA is there so that the programmer can hand tweak things without the compiler defeating his attempts.
>
> For example, suppose the compiler schedules instructions for processor X. The programmer writes inline asm to schedule for Y, because the compiler doesn't specifically support Y. The compiler goes ahead and reschedules it for X.
>
> Arggh!

Yes, but that's not what I meant.

Consider

__v128 a = load(1), b = load(2);
__v128 c = add(a, b);
__v128 d = add(a, b);

A valid optimization could be:

__v128 b = load(2);
__v128 a = load(1);
__v128 tmp = add(a, b);
__v128 d = tmp;
__v128 c = tmp;

__v128 load(int v) pure
{
    __v128 res;
    asm (res, v)
    {
        MOVD res, v;
        SHUF res, 0x0000;
    }
    return res;
}

__v128 add(__v128 a, __v128 b) pure
{
    __v128 res = a;
    asm (res, b)
    {
        ADD res, b;
    }
    return res;
}

The compiler might drop the evaluation of d and just reuse the common
subexpression (comsub) of c. It might also evaluate d before c.
The important point is to mark those functions as side-effect free,
which can be checked if the instructions are classified.
The compiler can then do all kinds of optimizations at the expression level.

After inlining it would look like this.

__v128 b;
asm (b) { MOV b, 2; }
__v128 a;
asm (a) { MOV a, 1; }
__v128 tmp;
asm (a, b, tmp) { MOV tmp, a; ADD tmp, b; }
__v128 c;
asm (c, tmp) { MOV c, tmp; }
__v128 d;
asm (d, tmp) { MOV d, tmp; }

The compiler will then do the usual register assignment, except that
variables must be assigned a register for the asm blocks they
are used in.

This effectively achieves the same as writing it with intrinsics.
It also greatly improves the composition of inline asm.

>
> What dmd does do with the inline assembler is it keeps track of which registers are read/written, so that effective register allocation can be done for the non-asm code.

Which is why the compiler should be the one to allocate pseudo-registers.
January 07, 2012
On 01/07/12 04:27, Martin Nowak wrote:
> __v128 add(__v128 a, __v128 b) pure
> {
>     __v128 res = a;
>     asm (res, b)
>     {
>         ADD res, b;
>     }
>     return res;
> }


> This effectively achieves the same as writing it with intrinsics. It also greatly improves the composition of inline asm.

What it also does is allow mixing "ordinary" asm with the SIMD instructions. People will do that, because it's easier this way (less typing), and then the result is practically unportable, because every compiler would now have to fully understand and support that one asm variant.

If you do "__v128 __simd_add(__v128 a, __v128 b)" instead, you don't lose anything; in fact, it could be internally implemented with your asm(). But now the "real" asm code is separate from the more generic (and sometimes even portable) simd ops -- the compiler does not need to understand asm() to be able to use it. It can still do every optimization as with the raw asm, and possibly more, as it knows exactly what's going on. The explicit pure annotations are not needed. It has more freedom to choose better scheduling, ordering, sometimes instruction selection (if there's more than one alternative), and even various code transformations. Even CTFE works.
Consider the case when a lot of your above add()-like functions are inlined into another one, which will be a common pattern -- you don't want any false dependencies. (If you do care about exact instruction scheduling you're writing asm, not D, so for that case asm() is a better choice)

I wrote "__v128 __simd_add(__v128 a, __v128 b)" above, but that was just to keep things simple. What you actually want is "vfloat4 __simd_add(vfloat4 a, vfloat4 b)" etc. I.e., strongly typed.

Whether this needs to go into the compiler itself depends on only one thing - if it can be done efficiently in a library. Efficiently in this case means "zero-cost" or "free".

Having different static types (in addition to the untyped __v(64|128|256) ones) gives you not only security (you don't accidentally end up operating on the wrong data/format because you forgot about some version() combination etc), but also allows things like overloading. Then you can write more generic code, which works with all available formats. And eg changing the precision used by some app module involves only changing a few declarations plus data entry/exit points, not modifying every single SIMD instruction.
Untyped __v128 only really works for memcpy()-type functions; other than that it is mainly useful for conversions and passing data etc - the cases where you don't care about the content in transit.

>> What dmd does do with the inline assembler is it keeps track of which registers are read/written, so that effective register allocation can be done for the non-asm code.
> 
> Which is why the compiler should be the one to allocate pseudo-registers.

Yep.

artur
January 07, 2012
On 01/06/12 21:16, Walter Bright wrote:
> Aligning to non-powers of 2 will never work. As for other alignments, they only will work if the underlying storage is aligned to that or greater. Otherwise, you'll have to resort to the method outlined above.
> 
> 
>> What about GCC? Will/does it support arbitrary alignment?
> 
> Don't know about gcc.

GCC keeps the stack 16-byte aligned by default.
January 07, 2012
On 07.01.2012 04:18, Andrei Alexandrescu wrote:
> On 1/6/12 5:52 PM, Walter Bright wrote:
>> Support the 10 vector types as basic types, support them with the
>> arithmetic infix operators, and use intrinsics for the rest of the
>> operations. I believe this scheme:
>>
>> 1. will look better in code, and will be easier to use
>> 2. will allow for better error detection and more comprehensible error
>> messages when things are misused
>> 3. will generate better code
>> 4. shouldn't be hard to implement, as I already did most of the work
>> when I did the SIMD support for float and double.
>
> I think it would be great to try avoiding the barbarism of adding 10
> built-in types and a bunch of built-ins.
>
> Historically, D has erred heavily on the side of building in the
> compiler. Consider the the complex numbers affair, in tow with crackpot
> science arguments on why they're a must. It's great that embarrassment
> is behind us.


> Also consider how the hard-coding of associative arrays in
> an awkward interface inside the runtime has stifled efficient
> implementations, progress, and innovation in that area. Still a lot of
> work needed there, too, to essentially undo a bad decision.

Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction!

Moving AAs from a built-in to a library type has been an unmitigated disaster from the implementation side. And it has so far brought us *nothing* in return. Not "hardly anything", but *NOTHING*. I don't even have any idea of what good could possibly come from it. Note that you CANNOT have multiple implementations on a given platform, or you'll get linker errors! So I think there is more pain to come from it.
It seems to have been motivated by religious reasons and nothing more.
Why should anyone believe the same argument again?
January 07, 2012
On Saturday, 7 January 2012 at 16:10:32 UTC, Don wrote:
> Sorry Andrei, I have to disagree with that in the strongest possible terms. I would have mentioned AAs as a very strong argument in the opposite direction!

Amen. AAs are *still* broken from this change. If you
take a look at my cgi.d, you'll find this:

// Referencing this gigantic typeid seems to remind the compiler
// to actually put the symbol in the object file. I guess the immutable
// assoc array array isn't actually included in druntime
void hackAroundLinkerError() {
     writeln(typeid(const(immutable(char)[][])[immutable(char)[]]));
     writeln(typeid(immutable(char)[][][immutable(char)[]]));
     writeln(typeid(Cgi.UploadedFile[immutable(char)[]]));
     writeln(typeid(immutable(Cgi.UploadedFile)[immutable(char)[]]));
     writeln(typeid(immutable(char[])[immutable(char)[]]));
     // this is getting kinda ridiculous btw. Moving assoc arrays
     // to the library is the pain that keeps on coming.

     // eh this broke the build on the work server
     // writeln(typeid(immutable(char)[][immutable(string[])]));
     writeln(typeid(immutable(string[])[immutable(char)[]]));
}


It was never a problem before... but if I take that otherwise
useless function out, it still randomly breaks my builds to
this day.