June 20, 2012
On 20 June 2012 14:19, Tobias Pankrath <tobias@pankrath.net> wrote:

> It's because the compiler doesn't understand assembly code. It has no
>> knowledge of what it actually does, and as a result, just treats it as a black box.
>>
>
> But this is not set in stone. If I teach a compiler how to optimize intrinsics, can't I teach it to understand and optimize a (maybe small) subset of assembler, too? This must happen in the backend anyway, since intrinsics are platform-dependent, no?
>

It's MUCH easier with intrinsics. Teaching the compiler to understand assembly means teaching it a foreign language, and also requires the _compiler_ to understand and predict the output of the codegen step. The compiler is usually completely separated from the codegen; it doesn't understand the architecture it targets. But using the knowledge supplied by the intrinsic API, it can do the optimisations it needs to in the usual way.

Declare the intrinsic as pure/nothrow, declare its arguments as in/scope/const/etc, and the compiler now knows a lot more about what to expect from the magic code beneath the intrinsic, so it can safely perform regular optimisation around it. Also, the codegen can use standard register assignment, which is important, and integrate nicely with regular program control structures.
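
To sketch what I mean (the intrinsic name below is made up, not an actual druntime/core.simd symbol; the attributes are the point):

import core.simd;

// Hypothetical fused multiply-add intrinsic: madd(a, b, c) = a*b + c.
// The declaration is what matters: pure/nothrow and the 'in' parameters
// tell the optimiser everything it needs to know, without it having to
// understand the instruction that ends up underneath.
pure nothrow float4 madd(in float4 a, in float4 b, in float4 c);

pure nothrow float4 lerp(in float4 a, in float4 b, in float4 t)
{
    // The compiler is free to reorder this call, CSE it, and
    // register-allocate around it like any other expression, because the
    // declaration guarantees there are no side effects.
    return madd(b - a, t, a);   // a + (b - a) * t
}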


June 20, 2012
On 20/06/12 13:04, Manu wrote:
> On 20 June 2012 13:51, Don Clugston <dac@nospam.com> wrote:
>
>     On 19/06/12 20:19, Iain Buclaw wrote:
>
>         Hi,
>
>         Had round one of the code review process, so I'm going to post
>         the main
>         issues here that most affect D users / the platforms they want
>         to run on
>         / the compiler version they want to use.
>
>
>
>         1) D Inline Asm and naked function support is raising far too
>         many alarm
>         bells. So would just be easier to remove it and avoid all the other
>         comments on why we need middle-end and backend headers in gdc.
>
>
>     You seem to be conflating a couple of unrelated issues here.
>     One is the calling convention. The other is inline asm.
>
>     Comments in the thread about "asm is mostly used for short things
>     which get inlined" leave me completely baffled, as it is completely
>     wrong.
>
>     There are two uses for asm, and they are very different:
>     (1) Functionality. This happens when there are gaps in the language,
>     and you get an abstraction inversion. You can address these with
>     intrinsics.
>     (2) Speed. High-speed, all-asm functions. These _always_ include a loop.
>
>
>     You seem to be focusing on (1), but case (2) is completely different.
>
>     Case (2) cannot be replaced with intrinsics. For example, you can't
>     write asm code using MSVC intrinsics (because the compiler rewrites
>     your code).
>     Currently, D is the best way to write (2). It is much, much better
>     than an external assembler.
>
>
> Case 1 has no alternative to inline asm. I've thrown out some crazy
> ideas to think about (but nobody seems to like them). I still think it
> could be addressed though.
>
> Case 2; I'm not convinced. Such long functions are the type I'm
> generally interested in as well, and have the most experience with. But
> in my experience, they're almost always best written with intrinsics.
> If they're small enough to be inlined, then you can't afford not to use
> intrinsics. If they are truly big functions, then you begin to sacrifice
> readability and maintainability, and certainly limit the number of
> programmers that can maintain the code.

I don't agree with that. In the situations I'm used to, using intrinsics would not make it easier to read, and would definitely not make it easier to maintain. I find it inconceivable that somebody could understand the processor well enough to maintain the code, and yet not understand asm.

> I rarely fail to produce identical code with intrinsics to that which I
> would write with hand written asm. The flags are always the biggest
> challenge, as discussed prior in this thread. I think that could be
> addressed with better intrinsics.

Again, look at std.internal.math.BiguintX86. There are many cases there where you can swap two instructions, and the code will still produce the correct result, but it will be 30% slower.

I think that the SIMD case gives you a misleading impression, because on x86 they are very easy to schedule (they nearly all take the same number of cycles, etc). So it's not hard for the compiler to do a good job of it.
June 20, 2012
Le 20/06/2012 13:04, Manu a écrit :
> Case 1 has no alternative to inline asm. I've thrown out some crazy
> ideas to think about (but nobody seems to like them). I still think it
> could be addressed though.
>
> Case 2; I'm not convinced. Such long functions are the type I'm
> generally interested in as well, and have the most experience with. But
> in my experience, they're almost always best written with intrinsics.
> If they're small enough to be inlined, then you can't afford not to use
> intrinsics. If they are truly big functions, then you begin to sacrifice
> readability and maintainability, and certainly limit the number of
> programmers that can maintain the code.
> I rarely fail to produce identical code with intrinsics to that which I
> would write with hand written asm. The flags are always the biggest
> challenge, as discussed prior in this thread. I think that could be
> addressed with better intrinsics.

I'm sorry, but what you say is rather ignorant.

Not that it is wrong, but it only covers YOUR usage of inline asm. You are talking about performance, but many other uses of assembly code are very useful and valid, and cannot be replaced by intrinsics. druntime is full of such code; Walter and I presented you with specific pieces of code. None of that could have been done without 100% asm functions.

It is clear, however, that the compiler should get a better understanding of asm.
June 20, 2012
On 20 June 2012 14:44, Don Clugston <dac@nospam.com> wrote:

> On 20/06/12 13:04, Manu wrote:
>
>> On 20 June 2012 13:51, Don Clugston <dac@nospam.com> wrote:
>>
>>    On 19/06/12 20:19, Iain Buclaw wrote:
>>
>>        Hi,
>>
>>        Had round one of the code review process, so I'm going to post
>>        the main
>>        issues here that most affect D users / the platforms they want
>>        to run on
>>        / the compiler version they want to use.
>>
>>
>>
>>        1) D Inline Asm and naked function support is raising far too
>>        many alarm
>>        bells. So would just be easier to remove it and avoid all the other
>>        comments on why we need middle-end and backend headers in gdc.
>>
>>
>>    You seem to be conflating a couple of unrelated issues here.
>>    One is the calling convention. The other is inline asm.
>>
>>    Comments in the thread about "asm is mostly used for short things
>>    which get inlined" leave me completely baffled, as it is completely
>>    wrong.
>>
>>    There are two uses for asm, and they are very different:
>>    (1) Functionality. This happens when there are gaps in the language,
>>    and you get an abstraction inversion. You can address these with
>>    intrinsics.
>>    (2) Speed. High-speed, all-asm functions. These _always_ include a
>> loop.
>>
>>
>>    You seem to be focusing on (1), but case (2) is completely different.
>>
>>    Case (2) cannot be replaced with intrinsics. For example, you can't
>>    write asm code using MSVC intrinsics (because the compiler rewrites
>>    your code).
>>    Currently, D is the best way to write (2). It is much, much better
>>    than an external assembler.
>>
>>
>> Case 1 has no alternative to inline asm. I've thrown out some crazy ideas to think about (but nobody seems to like them). I still think it could be addressed though.
>>
>> Case 2; I'm not convinced. Such long functions are the type I'm generally interested in as well, and have the most experience with. But in my experience, they're almost always best written with intrinsics. If they're small enough to be inlined, then you can't afford not to use intrinsics. If they are truly big functions, then you begin to sacrifice readability and maintainability, and certainly limit the number of programmers that can maintain the code.
>>
>
> I don't agree with that. In the situations I'm used to, using intrinsics would not make it easier to read, and would definitely not make it easier to maintain. I find it inconceivable that somebody could understand the processor well enough to maintain the code, and yet not understand asm.


These functions of yours are 100% asm; that's not really what I would
usually call 'inline asm'. That's really just 'asm' :)
I think you've just illustrated one of my key points, actually: you can't
just insert small inline asm blocks within regular code, because the
optimiser can't deal with them in most cases, so inevitably the entire
function becomes asm from start to end.

I find I can typically produce equivalent code using carefully crafted intrinsics within regular C language structures. Also, often enough, the code outside the hot loop can be written in normal C for readability, since it barely affects performance, and trivial setup code will usually optimise perfectly anyway.
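
As a rough sketch of the shape I mean (using D's core.simd types; the function itself is contrived, obviously not lifted from the proprietary code):

import core.simd;

// The hot loop uses SIMD types inside ordinary language structures, so the
// optimiser can still schedule, unroll and register-allocate it; the setup
// and tail handling stay as plain code.
void scaleBuffer(float4[] dst, const(float4)[] src, float4 factor)
{
    foreach (i, ref d; dst)
    {
        d = src[i] * factor;    // element-wise multiply, e.g. mulps on x86
    }
}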

You're correct that a person 'maintaining' such code, who doesn't have such
a thorough understanding of the codegen, may ruin its perfectly tuned
efficiency. That may be the case, but in a commercial coding environment,
where a build MUST be delivered yesterday, the guy who understands it is
on holiday, and you need to tweak the behaviour immediately, this is a much
safer position to be in.
This is a very real scenario; I can't afford to ignore this practical
reality.

I might have a go at compiling the regular D code tonight, and seeing if I can produce identical assembly. I haven't tried this so much with x86 as I have with RISC architectures, which have much more predictable codegen.


>> I rarely fail to produce identical code with intrinsics to that which I
>> would write with hand written asm. The flags are always the biggest challenge, as discussed prior in this thread. I think that could be addressed with better intrinsics.
>>
>
> Again, look at std.internal.math.BiguintX86. There are many cases there where you can swap two instructions, and the code will still produce the correct result, but it will be 30% slower.
>

But that's precisely the sort of thing optimisers/schedulers are best at.
Can you point at a particular example where the scheduler would get it
wrong if left to its own ordering algorithm?
The opcode tables should have thorough information about the opcode timings
and latencies. The only thing that I find usually trips it up is not having
knowledge of the probability of the data being in nearby cache. If it has 2
loads, and one is less likely to be in cache, it should be scheduled
earlier.

As a side question, x86 architectures perform wildly differently from each
other. How do you reliably say some block of hand written x86 code is the
best possible code on all available processors?
Do you just benchmark on a suite of common processors available at the
time? I can imagine the opcode timing tables, which are presumably rather
different for every cpu, could easily feed wrong data to the codegen...


> I think that the SIMD case gives you a misleading impression, because on
> x86 they are very easy to schedule (they nearly all take the same number of cycles, etc). So it's not hard for the compiler to do a good job of it.
>

True, but it's one of the most common usage scenarios, so it can't be ignored. Some other case studies I feel close to are hardware emulation, software rasterisation, particles, fluid dynamics, rigid body dynamics, FFTs, and audio signal processing. In each of those, I rarely need inline asm; the only time I do is when there is a hole in the high level language, as you said earlier. I find this typically surfaces when needing to interact with the flags registers directly.
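
The classic case is multi-word arithmetic; something like this (untested, 32-bit DMD-style x86 inline asm, purely to illustrate the hole) has no intrinsic equivalent, because there's no way to name the carry flag at the language level:

// 64-bit add built from 32-bit halves: the ADC needs the carry flag set by
// the preceding ADD, and CF simply doesn't exist in high level code.
uint addWide(uint aLo, uint aHi, uint bLo, uint bHi, uint* hiOut)
{
    asm
    {
        mov EAX, aLo;
        add EAX, bLo;       // low halves; sets CF
        mov EDX, aHi;
        adc EDX, bHi;       // high halves; consumes CF
        mov ECX, hiOut;
        mov [ECX], EDX;     // store the high word of the result
        // EAX holds the low word, which becomes the return value
    }
}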


June 20, 2012
On 20 June 2012 15:30, deadalnix <deadalnix@gmail.com> wrote:

> Le 20/06/2012 13:04, Manu a écrit :
>
>> Case 1 has no alternative to inline asm. I've thrown out some crazy ideas to think about (but nobody seems to like them). I still think it could be addressed though.
>>
>> Case 2; I'm not convinced. Such long functions are the type I'm
>> generally interested in as well, and have the most experience with. But
>> in my experience, they're almost always best written with intrinsics.
>> If they're small enough to be inlined, then you can't afford not to use
>> intrinsics. If they are truly big functions, then you begin to sacrifice
>> readability and maintainability, and certainly limit the number of
>> programmers that can maintain the code.
>> I rarely fail to produce identical code with intrinsics to that which I
>> would write with hand written asm. The flags are always the biggest
>> challenge, as discussed prior in this thread. I think that could be
>> addressed with better intrinsics.
>>
>
> I'm sorry, but what you say is rather ignorant.
>
> Not that it is wrong, but it only covers YOUR usage of inline asm. You are talking about performance, but many other uses of assembly code are very useful and valid, and cannot be replaced by intrinsics. druntime is full of such code; Walter and I presented you with specific pieces of code. None of that could have been done without 100% asm functions.
>

I wasn't talking strictly about performance; I'm talking about pure functionality, but with an intent not to inhibit optimisation. The high level language can't interact with registers directly; there's no mechanism to do so.

I offered trivial solutions. You never suggested any reason why they
couldn't work. In your code, you only need push/pop intrinsics, and a
register alias to produce identical code.
In Walter's example, I offered a number of options (neat handling of JC
being the key issue). I'm not saying what it SHOULD be, just some
possibilities to think about/explore.


> It is clear, however, that the compiler should get a better understanding
> of asm.
>

Such an understanding of asm is more easily implemented via intrinsics; that's the basis of my suggestion: extend the high level language so that it is capable of that understanding within conventional expressions. Intrinsics are already mechanically present in the language; adding more as they are needed is no problem. The only missing component I can identify is the ability to directly address specific registers in high level code.
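
To sketch the shape of the idea (every name and attribute below is made up; none of it exists in D today):

// Hypothetical register aliases plus push/pop intrinsics. The compiler
// knows exactly which registers and stack slots are touched, so it can
// still optimise around them like ordinary code.
@register("ESP") extern size_t __esp;   // made-up register alias
void __push(size_t value);              // made-up intrinsic: push to the stack
size_t __pop();                         // made-up intrinsic: pop from the stack

void example()
{
    __push(0xDEADBEEF);     // emits a real push, but stays visible to the optimiser
    size_t x = __pop();     // pairs with the push; could even be elided
    // __esp reads/writes like a normal lvalue where that is really needed
}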


June 20, 2012
On 19/06/12 19:19, Iain Buclaw wrote:
> Had round one of the code review process, so I'm going to post the main issues
> here that most affect D users / the platforms they want to run on / the compiler
> version they want to use.

A somewhat different take on these issues -- we've several times now had discussions over the backend of DMD, and the whole "reference implementation not entirely free/open source" issue.  One of the points made in those discussions is that the issue is somewhat moot given that the real "reference implementation" is the frontend, and this already has at least 2 free backends (GCC and LLVM).

However, that point stops being moot the moment there are compiler-specific constraints that mean that code that will compile with DMD won't compile with GDC, or vice-versa.  If I can't use inline asm with GDC, or I have to go about it in a different way to DMD, then we can hardly say that GDC reflects the reference implementation.

It seems to me that guaranteeing equal capabilities between DMD and GDC should be a "red line" in determining what changes or deletions are acceptable or not.
June 20, 2012
Le 20/06/2012 14:51, Manu a écrit :
> On 20 June 2012 14:44, Don Clugston <dac@nospam.com> wrote:
>
>     On 20/06/12 13:04, Manu wrote:
>
>         On 20 June 2012 13:51, Don Clugston <dac@nospam.com> wrote:
>
>             On 19/06/12 20:19, Iain Buclaw wrote:
>
>                 Hi,
>
>                 Had round one of the code review process, so I'm going
>         to post
>                 the main
>                 issues here that most affect D users / the platforms
>         they want
>                 to run on
>                 / the compiler version they want to use.
>
>
>
>                 1) D Inline Asm and naked function support is raising
>         far too
>                 many alarm
>                 bells. So would just be easier to remove it and avoid
>         all the other
>                 comments on why we need middle-end and backend headers
>         in gdc.
>
>
>             You seem to be conflating a couple of unrelated issues here.
>             One is the calling convention. The other is inline asm.
>
>             Comments in the thread about "asm is mostly used for short
>         things
>             which get inlined" leave me completely baffled, as it is
>         completely
>             wrong.
>
>             There are two uses for asm, and they are very different:
>             (1) Functionality. This happens when there are gaps in the
>         language,
>             and you get an abstraction inversion. You can address these with
>             intrinsics.
>             (2) Speed. High-speed, all-asm functions. These _always_
>         include a loop.
>
>
>             You seem to be focusing on (1), but case (2) is completely
>         different.
>
>             Case (2) cannot be replaced with intrinsics. For example,
>         you can't
>             write asm code using MSVC intrinsics (because the compiler
>         rewrites
>             your code).
>             Currently, D is the best way to write (2). It is much, much
>         better
>             than an external assembler.
>
>
>         Case 1 has no alternative to inline asm. I've thrown out some crazy
>         ideas to think about (but nobody seems to like them). I still
>         think it
>         could be addressed though.
>
>         Case 2; I'm not convinced. Such long functions are the
>         type I'm
>         generally interested in as well, and have the most experience
>         with. But
>         in my experience, they're almost always best written with
>         intrinsics.
>         If they're small enough to be inlined, then you can't afford not
>         to use
>         intrinsics. If they are truly big functions, then you begin to
>         sacrifice
>         readability and maintainability, and certainly limit the number of
>         programmers that can maintain the code.
>
>
>     I don't agree with that. In the situations I'm used to, using
>     intrinsics would not make it easier to read, and would definitely
>     not make it easier to maintain. I find it inconceivable that
>     somebody could understand the processor well enough to maintain the
>     code, and yet not understand asm.
>
>
> These functions of yours are 100% asm, that's not really what I would
> usually call 'inline asm'. That's really just 'asm' :)

You are being picky here.

Yes, this is 100% asm. But still, 100% asm is inline asm. It is asm within D code.
June 20, 2012
On 20 June 2012 14:01, Joseph Rushton Wakeling <joseph.wakeling@webdrake.net> wrote:
> On 19/06/12 19:19, Iain Buclaw wrote:
>>
>> Had round one of the code review process, so I'm going to post the main
>> issues
>> here that most affect D users / the platforms they want to run on / the
>> compiler
>> version they want to use.
>
>
> A somewhat different take on these issues -- we've several times now had discussions over the backend of DMD, and the whole "reference implementation not entirely free/open source" issue.  One of the points made in those discussions is that the issue is somewhat moot given that the real "reference implementation" is the frontend, and this already has at least 2 free backends (GCC and LLVM).
>
> However, that point stops being moot the moment there are compiler-specific constraints that mean that code that will compile with DMD won't compile with GDC, or vice-versa.  If I can't use inline asm with GDC, or I have to go about it in a different way to DMD, then we can hardly say that GDC reflects the reference implementation.
>
> It seems to me that guaranteeing equal capabilities between DMD and GDC should be a "red line" in determining what changes or deletions are acceptable or not.

Unfortunately this is a red line I am going to cross.  I haven't pushed anything yet, but feel free to visualise:

I have made the following alterations to the gdc build for all gdc-specific sources (which includes the D inline assembler implementation):

- GCC system headers are included first and foremost, before all other headers.
- Now compiles with the macro -DIN_GCC_FRONTEND turned on.


Result:  GDC now fails to compile, as we pull in many middle-end and backend headers that have been POISONED against use by GCC frontends. Apparently I had somehow bypassed this until now.  :o)

Fix: Remove all included headers that are poisoned - but wait! - now the D inline assembler is missing crucial elements of what made it just about work in GDC.


Hands are tied, sorry.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
June 20, 2012
On 20/06/12 14:31, Iain Buclaw wrote:
> Hands are tied, sorry.

Is this planned as a short-term change for which a long-term solution will be developed, or is it likely to be a permanent split with DMD?

June 20, 2012
On 20/06/12 13:22, Manu wrote:
> On 20 June 2012 13:59, Don Clugston <dac@nospam.com> wrote:
>
>     You and I seem to be from different planets. I have almost never
>     written as asm function which was suitable for inlining.
>
>     Take a look at std.internal.math.biguintX86.d
>
>     I do not know how to write that code without inline asm.
>
>
> Interesting.
> I wish I could paste some counter-examples, but they're all proprietary >_<
>
> I think the key detail here is where you stated that they _always_ include
> a loop. Is this because it's hard to manipulate the compiler into the
> correct interaction with the flags register?

No. It's just because speed doesn't matter outside loops. A consequence of having the loop inside the asm code is that the parameter passing is much less significant for speed, and the calling convention is the big issue.

> I'd be interested to compare the compiled D code, and your hand written
> asm code, to see where exactly the optimiser goes wrong. It doesn't look
> like you're exploiting too many tricks (at a brief glance), it's just
> nice tight hand written code, which the optimiser should theoretically
> be able to get right...

Theoretically, yes. In practice, DMD doesn't get anywhere near, and gcc isn't much better. I don't think there's any reason why they couldn't, but I don't have much hope that they will.

As you say, the code looks fairly straightforward, but actually there are very many similar ways of writing the code, most of which are much slower. There are many bottlenecks you need to avoid. I was only able to get it to that speed by using the processor profiling registers.

So, my original two uses for asm are actually:
(1) when the language prevents you from accessing low-level functionality; and
(2) when the optimizer isn't good enough.

> I find optimisers are very good at code simplification, assuming that
> you massage the code/expressions to neatly match any architectural quirks.
> I also appreciate that good x86 code is possibly the hardest
> architecture for an optimiser to get right...

Optimizers improved enormously during the 80's and 90's, but the rate of improvement seems to have slowed.

With x86, out-of-order execution has made it very easy to get reasonably good code, and much harder to achieve perfection. Still, Core i7 is much easier than Core2, since Intel removed one of the most complicated bottlenecks (on Core2 and earlier there is a maximum of 3 reads per cycle of registers you haven't written in the previous 3 cycles).