Thread overview
Multi-architecture binaries
May 01, 2007  Jascha Wetzel
May 01, 2007  Chad J
May 01, 2007  Lutger
May 02, 2007  Jascha Wetzel
May 02, 2007  janderson
May 02, 2007  Jascha Wetzel
May 02, 2007  janderson
May 02, 2007  Jascha Wetzel
May 02, 2007  Don Clugston
May 02, 2007  janderson
May 02, 2007  Pragma
May 02, 2007  Jascha Wetzel
May 01, 2007
A thought that came up in the VM discussion...

Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code.

Ideally we wouldn't have to write additional code either. The compiler
could emit code for multiple targets on a per-function basis (e.g. with
the target architecture mangled into the function name). The runtime
would check at startup which version to use and "link" the
appropriate function.
Here is a small proof-of-concept implementation of this detection and
linking mechanism.

Comments?


May 01, 2007
I've thought about this myself, and really like the idea.  In the VM discussion, Don mentioned benchmarking different codepaths to find which one works best on the current CPU, then linking the best one in.  This makes a lot of sense to me, since CPUs seem to have different performance characteristics, even apart from instruction set differences.

I was once benchmarking an algorithm on my notebook computer, with a more modern processor, and on my desktop computer, with an older one.  The algorithm ran faster on the notebook, of course, but branching had an especially reduced cost.  That is, branching on the more modern processor was less expensive relative to other instructions than it was on the older processor.  This was with the same D binary on both of them.

That is the sort of thing that I think JITCs want to leverage, but I have to wonder if using a strategy like this and covering enough permutations of costly algorithms would give exactly the same benefit, with a massively reduced startup time for applications.  Of course, it would also be nice to be able to turn it off, because it will cost SOME startup time as well as executable size, costs that are not worthwhile for apps like simple command-line tools that need to be snappy and small.  It would rock for games though ;)

I really can't wait to see D's performance some day when/if it gets cool tricks like this, low-d vector primitives, array operations, etc.
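
Just to illustrate the idea, here's a minimal sketch of what such startup benchmarking could look like -- pickFastest and Variant are just made-up names, not from any existing library, and getUTCtime's millisecond resolution means the iteration count has to be large:

import std.date;

alias char[] function() Variant;    // all variants must share one signature

// Hypothetical helper: run each candidate many times and keep the fastest.
Variant pickFastest(Variant[] candidates, uint iters = 1_000_000)
{
    Variant best;
    long bestTime = long.max;
    foreach ( candidate; candidates )
    {
        long start = getUTCtime();
        for ( uint i = 0; i < iters; i++ )
            candidate();
        long elapsed = getUTCtime() - start;
        if ( elapsed < bestTime )
        {
            bestTime = elapsed;
            best = candidate;
        }
    }
    return best;
}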

Jascha Wetzel wrote:
> A thought that came up in the VM discussion...
> 
> Suppose someday we have language support for vector operations. We want
> to ship binaries that support but do not require extensions like SSE. We
> do not want to ship multiple binaries and wrappers that switch between
> them or installers that decide which one to use, because it's more work
> and we'd be shipping a lot of redundant code.
> 
> Ideally we wouldn't have to write additional code either. The compiler
> could emit code for multiple targets on a per-function basis (e.g. with
> the target architecture mangled into the function name). The runtime
> would check at startup which version to use and "link" the
> appropriate function.
> Here is a small proof-of-concept implementation of this detection and
> linking mechanism.
> 
> Comments?
> 
> 
> ------------------------------------------------------------------------
> 
> import std.cpuid;
> import std.stdio;
> 
> //-----------------------------------------------------------------------------
> //  This code goes into the runtime library
> 
> const uint  CPU_NO_EXTENSION    = 0,
>             CPU_MMX             = 1,
>             CPU_SSE             = 2,
>             CPU_SSE2            = 4,
>             CPU_SSE3            = 8;
> 
> /******************************************************************************
>     A function pointer with a bitmask for its required extensions
> ******************************************************************************/
> struct MultiTargetVariant
> {
>     static MultiTargetVariant opCall(uint ext, void* func)
>     {
>         MultiTargetVariant mtv;
>         mtv.ext = ext;
>         mtv.func = func;
>         return mtv;
>     }
> 
>     uint    ext;
>     void*   func;
> }
> 
> /******************************************************************************
>     Chooses the first matching MTV
>     and saves its function pointer to the dummy entry in the VTBL
> ******************************************************************************/
> void LinkMultiTarget(ClassInfo ci, void* dummy_ptr, MultiTargetVariant[] multi_target_variants)
> {
>     uint extensions;
>     if ( mmx )  extensions |= CPU_MMX;
>     if ( sse )  extensions |= CPU_SSE;
>     if ( sse2 ) extensions |= CPU_SSE2;
>     if ( sse3 ) extensions |= CPU_SSE3;
> 
>     foreach ( i, inout vp; ci.vtbl )
>     {
>         if ( vp is dummy_ptr )
>         {
>             foreach ( variant; multi_target_variants )
>             {
>                 if ( (variant.ext & extensions) == variant.ext )
>                 {
>                     vp = variant.func;
>                     break;
>                 }
>             }
>             assert(vp !is dummy_ptr);
>             break;
>         }
>     }
> }
> 
> 
> //-----------------------------------------------------------------------------
> //  This is application code
> 
> /******************************************************************************
>     A class with a multi-target function
> ******************************************************************************/
> class MyMultiTargetClass
> {
>     // The following 3 functions could be generated automatically by the compiler
>     // with different targets enabled. For example, when we have language support for
>     // vector operations, the compiler could generate multiple versions for different
>     // SIMD extensions. Then there would be only one extension independent implementation.
> 
>     char[] multi_target_sse2()
>     {
>         return "using SSE2";
>     }
> 
>     char[] multi_target_sse_mmx()
>     {
>         return "using SSE and MMX";
>     }
> 
>     char[] multi_target_noext()
>     {
>         return "using no extension";
>     }
> 
>     // The following code could be generated by the compiler if there are multi-target
>     // functions 
> 
>     char[] multi_target() { return null; }
>     static this()
>     {
>         MultiTargetVariant[] variants = [
>             MultiTargetVariant(CPU_SSE2, &multi_target_sse2),
>             MultiTargetVariant(CPU_SSE|CPU_MMX, &multi_target_sse_mmx),
>             MultiTargetVariant(CPU_NO_EXTENSION, &multi_target_noext)
>         ];
>         LinkMultiTarget(this.classinfo, &multi_target, variants);
>     }
> }
> 
> /******************************************************************************
>     Finally, the usage is completely transparent and there is no runtime overhead
>     besides the detection at startup.
> ******************************************************************************/
> void main()
> {
>     MyMultiTargetClass t = new MyMultiTargetClass;
>     writefln("%s", t.multi_target);
> }
May 01, 2007
Jascha Wetzel wrote:
> A thought that came up in the VM discussion...
> 
> Suppose someday we have language support for vector operations. We want
> to ship binaries that support but do not require extensions like SSE. We
> do not want to ship multiple binaries and wrappers that switch between
> them or installers that decide which one to use, because it's more work
> and we'd be shipping a lot of redundant code.

On a totally unrelated note, we are using GDC to build Universal Binaries
for Mac OS X, that is, objects with both i386 (=i686) and ppc (=powerpc)
code. They are, however, twice as big as when building for just one arch.

$ file hello
hello: Mach-O universal binary with 2 architectures
hello (for architecture ppc):   Mach-O executable ppc
hello (for architecture i386):  Mach-O executable i386

The GCC driver automatically runs two compilation steps and lipos the
results together, so it's pretty straightforward to use (unrelated to
vector ops, though):
gdc -isysroot /Developer/SDKs/MacOSX10.4u.sdk -arch ppc -arch i386 ...

It only does one variant for each architecture, so no help for a "JIT".

--anders
May 01, 2007
I've seen some games with multiple executables compiled for different architectures. Since the executable size is dwarfed by resources, this is no problem for that kind of application.

How much of a negative impact would your suggested approach have on compiler optimizations? (inlining and that sort of thing)

Another thing: what are the benefits of the compiler doing this over libraries?

On a related note, it may be worth mentioning liboil, which implements exactly this in a library: http://liboil.freedesktop.org/wiki/
May 02, 2007
Jascha Wetzel wrote:
> A thought that came up in the VM discussion...
> 
> Suppose someday we have language support for vector operations. We want
> to ship binaries that support but do not require extensions like SSE. We
> do not want to ship multiple binaries and wrappers that switch between
> them or installers that decide which one to use, because it's more work
> and we'd be shipping a lot of redundant code.
> 
> Ideally we wouldn't have to write additional code either. The compiler
> could emit code for multiple targets on a per-function basis (e.g. with
> the target architecure mangled into the function name). The runtime
> would check at startup, which version will be used and "link" the
> appropriate function.
> Here is a small proof-of-concept implementation of this detection and
> linking mechanism.
> 
> Comments?
> 

This is fine when you have a small subset of target architectures; however, if you want to be really optimal, the code needs to be tuned for each target architecture.  Michael Abrash tried this for Pixomatic, but the size of the executable grew too large (it's an exponential thing: because you want to avoid branching, things must be inlined).

http://www.ddj.com/184405765
http://www.ddj.com/184405807
http://www.ddj.com/184405848

I'm not saying it's not a good start; however, I think the compiler would need to perform some sort of compression and optimize the function for the architecture at startup (even the order of instructions can make a huge difference to efficiency).  I guess that's kind of a JITC, though.

I guess it could be a load of tiny code segments that are pre-built, then rearranged and added together just before execution (kind of like Pixomatic).
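
To illustrate what I mean, here's a minimal sketch of that kind of welding on 32-bit x86 Linux -- all names, byte sequences, and constants are made-up examples, not Pixomatic's actual scheme, and error handling is omitted:

extern (C) void* mmap(void* addr, size_t len, int prot, int flags, int fd, int offset);

const int PROT_READ   = 1, PROT_WRITE = 2, PROT_EXEC = 4;
const int MAP_PRIVATE = 0x02, MAP_ANON = 0x20;  // Linux values

alias int function() WeldedFunc;

// Copy prebuilt x86 fragments into one executable buffer and cap it with a ret.
WeldedFunc weld(ubyte[][] fragments)
{
    size_t len = 1;                  // room for the final ret
    foreach ( frag; fragments )
        len += frag.length;

    ubyte* buf = cast(ubyte*)mmap(null, len,
        PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANON, -1, 0);

    size_t o = 0;
    foreach ( frag; fragments )
    {
        buf[o .. o + frag.length] = frag[];
        o += frag.length;
    }
    buf[o] = 0xC3;                   // ret
    return cast(WeldedFunc)buf;
}

void main()
{
    ubyte[] loadConst = [cast(ubyte)0xB8, 0x2A, 0x00, 0x00, 0x00];  // mov eax, 42
    ubyte[] addOne    = [cast(ubyte)0x83, 0xC0, 0x01];              // add eax, 1
    WeldedFunc f = weld([loadConst, addOne]);
    assert(f() == 43);
}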

-Joel
May 02, 2007
Lutger wrote:
> I've seen some games with multiple executables compiled for different architectures. Since the executable size is dwarfed by resources, this is no problem for that kind of application.

Yeah, the size issue isn't that important. It's actually more about getting multiple versions without doing anything but adding a compiler switch.

> How much of a negative impact would your suggested approach have on compiler optimizations? (inlining and that sort of thing)

The smallest unit for this approach would be a non-inlined function. Any
function that gets inlined within the multi-target function would be
compiled for the appropriate target as well.
All intraprocedural optimizations work as usual. Only optimizations that
change the calling convention are affected: the calling convention has to
be identical for all versions of the function, because the caller never
knows which version it calls.
In the example, where only virtual functions are supported, this is not an
issue, since virtual functions have that requirement anyway. For static
functions it has to be ensured explicitly.
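
To illustrate the constraint (the vecAdd names are hypothetical, not from the proof of concept): every version must be reachable through one and the same pointer type, so all versions must share a signature and calling convention.

import std.cpuid;

alias void function(float[] dst, float[] a, float[] b) VecAddFunc;

// Two versions; only the bodies may differ, never the signature.
void vecAddSSE2(float[] dst, float[] a, float[] b)
{
    // would contain SSE2 code; plain loop as a placeholder
    foreach ( i, x; a ) dst[i] = x + b[i];
}

void vecAddPlain(float[] dst, float[] a, float[] b)
{
    foreach ( i, x; a ) dst[i] = x + b[i];
}

VecAddFunc vecAdd;  // linked once at startup; callers never know which version they get

static this()
{
    vecAdd = sse2() ? &vecAddSSE2 : &vecAddPlain;
}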

> Another thing: what are the benefits of the compiler doing this over libraries?

Using libraries means that you at least have to compile multiple
versions of each library and have code that loads the appropriate version.
With compiler support it's a lot more convenient and less error-prone,
since you do not have to write any additional code or maintain more
complex build scripts.

> On a related note, it may be worth mentioning liboil, which implements exactly this in a library: http://liboil.freedesktop.org/wiki/

Yep, it has the same goal, and looking at the source shows how much work it is. Of course, that's also because everything is manually optimized.
May 02, 2007
The granularity isn't as fine as it could be, of course. But the effort
to make it happen is pretty small, and it's better than compiling the
whole program multiple times and switching manually.
It's not a replacement for a JITC or for methods like Abrash's welding.

janderson wrote:
> Jascha Wetzel wrote:
>> A thought that came up in the VM discussion...
>>
>> Suppose someday we have language support for vector operations. We want to ship binaries that support but do not require extensions like SSE. We do not want to ship multiple binaries and wrappers that switch between them or installers that decide which one to use, because it's more work and we'd be shipping a lot of redundant code.
>>
>> Ideally we wouldn't have to write additional code either. The compiler
>> could emit code for multiple targets on a per-function basis (e.g. with
>> the target architecture mangled into the function name). The runtime
>> would check at startup which version to use and "link" the
>> appropriate function.
>> Here is a small proof-of-concept implementation of this detection and
>> linking mechanism.
>>
>> Comments?
>>
> 
> This is fine when you have a small subset of target architectures; however, if you want to be really optimal, the code needs to be tuned for each target architecture.  Michael Abrash tried this for Pixomatic, but the size of the executable grew too large (it's an exponential thing: because you want to avoid branching, things must be inlined).
> 
> http://www.ddj.com/184405765
> http://www.ddj.com/184405807
> http://www.ddj.com/184405848
> 
> I'm not saying it's not a good start; however, I think the compiler would need to perform some sort of compression and optimize the function for the architecture at startup (even the order of instructions can make a huge difference to efficiency).  I guess that's kind of a JITC, though.
> 
> I guess it could be a load of tiny code segments that are pre-built, then rearranged and added together just before execution (kind of like Pixomatic).
> 
> -Joel
May 02, 2007
Here is a much simpler version that works with templates. What it boils down to is choosing one template instance at startup that will replace a function pointer.

Now the only compiler support required would be a pragma or similar to
select the target architecture.
This could also be used to manage multiple versions of BLADE code.
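
Roughly, the mechanism looks like this (a hypothetical sketch reusing the CPU_* constants from the first post; the pragma that would select the target per instantiation is exactly the part that doesn't exist yet):

import std.cpuid;

const uint CPU_NO_EXTENSION = 0,
           CPU_SSE2         = 4;

// One template, instantiated once per target. A (not yet existing)
// pragma would tell the compiler which instruction set to emit for
// each instantiation.
char[] multiTarget(uint target)()
{
    // pragma(target_arch, target);  // hypothetical
    static if ( target == CPU_SSE2 )
        return "using SSE2";
    else
        return "using no extension";
}

char[] function() multi_target;

// Choose one instance at startup and store it in the function pointer.
static this()
{
    multi_target = sse2() ? &multiTarget!(CPU_SSE2)
                          : &multiTarget!(CPU_NO_EXTENSION);
}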


May 02, 2007
Jascha Wetzel wrote:
> Here is a much simpler version that works with templates. What it boils
> down to is choosing one template instance at startup that will replace a
> function pointer.
> 
> Now the only compiler support required would be a pragma or similar to
> select the target architecture.

A pragma would only be required as a size optimisation. Probably not worth worrying about (we have enough version information already).

> This could also be used to manage multiple versions of BLADE code.

It's a nice idea, but I don't know how it could generate the class to put the 'this()' function into (we don't want a memory alloc every time we enter that function!)

Interestingly, DDL could be fantastic for this. At startup, walk through the symbol fixup table, and look for any import symbols marked __cpu_fixup_XXX.
When you find them, look for an export symbol called __cpu_SSE2_XXX, and patch them into everything in the fixup list. That way, you even get a direct function call, instead of an indirect one.

I wonder if it's possible to read the return address off the stack, and write back into the code that called you, without the operating system triggering a security alert -- in that case, the function you call could be a little thunk, something like:

asm {
  naked;
  mov eax, CPU_TYPE;             // index of the detected CPU type
  mov eax, FUNCPOINTERS[eax*4];  // fetch the real function's address
  mov ecx, [esp];                // return address (naked: nothing else is on the stack)
  sub eax, ecx;                  // a direct call stores a relative offset,
  mov [ecx-4], eax;              // so patch rel32 = target - return address;
  add eax, ecx;                  // this thunk never gets called again
  jmp eax;                       // jump to the real function this time
}

But I think a modern OS would go nuts if you try this?
(It's been a long time since I wrote self-modifying code.)
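
For what it's worth, on POSIX systems the write would at least have to be preceded by an mprotect() call making the page writable -- a minimal sketch, assuming Linux protection constants and a 4 KB page size:

extern (C) int mprotect(void* addr, size_t len, int prot);

const int PROT_READ  = 1,
          PROT_WRITE = 2,
          PROT_EXEC  = 4;   // Linux values

// Make the page containing codeAddr writable so a call site can be
// patched; returns false if the OS refuses (e.g. a strict W^X policy).
bool unprotectCode(void* codeAddr)
{
    const size_t PAGE_SIZE = 4096;
    void* page = cast(void*)(cast(size_t)codeAddr & ~(PAGE_SIZE - 1));
    return mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC) == 0;
}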
May 02, 2007
Jascha Wetzel wrote:
> The granularity isn't as fine as it could be, of course. But the effort
> to make it happen is pretty small, and it's better than compiling the
> whole program multiple times and switching manually.
> It's not a replacement for a JITC or for methods like Abrash's welding.
> 

I agree, it's a good start.


> janderson wrote:
>> Jascha Wetzel wrote:
>>> A thought that came up in the VM discussion...
>>>
>>> Suppose someday we have language support for vector operations. We want
>>> to ship binaries that support but do not require extensions like SSE. We
>>> do not want to ship multiple binaries and wrappers that switch between
>>> them or installers that decide which one to use, because it's more work
>>> and we'd be shipping a lot of redundant code.
>>>
>>> Ideally we wouldn't have to write additional code either. The compiler
>>> could emit code for multiple targets on a per-function basis (e.g. with
>>> the target architecture mangled into the function name). The runtime
>>> would check at startup which version to use and "link" the
>>> appropriate function.
>>> Here is a small proof-of-concept implementation of this detection and
>>> linking mechanism.
>>>
>>> Comments?
>>>
>> This is fine when you have a small subset of target architectures;
>> however, if you want to be really optimal, the code needs to be tuned
>> for each target architecture.  Michael Abrash tried this for Pixomatic,
>> but the size of the executable grew too large (it's an exponential
>> thing: because you want to avoid branching, things must be inlined).
>>
>> http://www.ddj.com/184405765
>> http://www.ddj.com/184405807
>> http://www.ddj.com/184405848
>>
>> I'm not saying it's not a good start; however, I think the compiler
>> would need to perform some sort of compression and optimize the function
>> for the architecture at startup (even the order of instructions can make
>> a huge difference to efficiency).  I guess that's kind of a JITC, though.
>>
>> I guess it could be a load of tiny code segments that are pre-built,
>> then rearranged and added together just before execution (kind of like
>> Pixomatic).
>>
>> -Joel