December 29, 2011
On 12/29/11 2:29 PM, Walter Bright wrote:
> On 12/29/2011 11:47 AM, Walter Bright wrote:
>> On 12/29/2011 3:19 AM, Vladimir Panteleev wrote:
>>> I'd like to invite you to translate Daniel Vik's C memcpy
>>> implementation to D:
>>> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
>>
>> Challenge accepted.
>
> Here's another version that uses string mixins to ensure inlining of the
> COPY functions. There are no call instructions in the generated code.
> This should be as good as the C version using the same code generator.
[snip]

In other news, TAB has died with Kim Jong-il. Please stop using it.

Andrei

December 29, 2011
David Nadlinger Wrote:

> On 12/29/11 2:13 PM, a wrote:
> > void test(ref V a, ref V b)
> > {
> >      asm
> >      {
> >          movaps XMM0, a;
> >          addps  XMM0, b;
> >          movaps a, XMM0;
> >      }
> >      asm
> >      {
> >          movaps XMM0, a;
> >          addps  XMM0, b;
> >          movaps a, XMM0;
> >      }
> > }
> >
> > […]
> >
> > The needless loads and stores would make it impossible to write an efficient SIMD add function even if the functions containing asm blocks could be inlined.
> 
> Yes, this is indeed a problem, and as far as I'm aware, it is usually solved in the gamedev world by using the (SSE) intrinsics your favorite C++ compiler provides, instead of resorting to inline asm.
> 
> David

IIRC Walter doesn't want to add vector intrinsics, so it would be nice if functions doing vector operations could be written efficiently using inline assembly. It would also be a more general solution than intrinsics. Something like that is possible with gcc extended inline assembly. For example, this:

typedef float v4sf __attribute__((vector_size(16)));

void vadd(v4sf *a, v4sf *b)
{
    asm(
        "addps %1, %0"
        : "=x" (*a)
        : "x" (*b), "0" (*a)
        : );
}

void test(float * __restrict__ a, float * __restrict__ b)
{
    v4sf * va = (v4sf*) a;
    v4sf * vb = (v4sf*) b;
    vadd(va,vb);
    vadd(va,vb);
    vadd(va,vb);
    vadd(va,vb);
}

compiles to:

00000000004004c0 <test>:
  4004c0:       0f 28 0e                movaps (%rsi),%xmm1
  4004c3:       0f 28 07                movaps (%rdi),%xmm0
  4004c6:       0f 58 c1                addps  %xmm1,%xmm0
  4004c9:       0f 58 c1                addps  %xmm1,%xmm0
  4004cc:       0f 58 c1                addps  %xmm1,%xmm0
  4004cf:       0f 58 c1                addps  %xmm1,%xmm0
  4004d2:       0f 29 07                movaps %xmm0,(%rdi)

This should also be possible with GDC, but I couldn't figure out how to get something like __restrict__ (if you want to use vector types and gcc extended inline assembly with GDC, see http://www.digitalmars.com/d/archives/D/gnu/Support_for_gcc_vector_attributes_SIMD_builtins_3778.html and https://bitbucket.org/goshawk/gdc/wiki/UserDocumentation).
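
A rough, untested sketch of what the GDC version might look like, assuming GDC accepts GCC-style extended asm inside D functions and that a 16-byte float vector type is available (float4 from core.simd, or the gcc attribute/builtin route described on the pages linked above):

import core.simd : float4;

void vadd(float4* a, float4* b)
{
    // Same constraints as the C version: "x" = any SSE register,
    // "0" ties this input to output operand 0.
    asm
    {
        "addps %1, %0"
        : "=x" (*a)
        : "x" (*b), "0" (*a);
    }
}

void test(float4* a, float4* b)
{
    // Without an equivalent of __restrict__, the compiler may not be able
    // to remove the redundant loads and stores between these calls.
    vadd(a, b);
    vadd(a, b);
}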
December 29, 2011
Walter Bright Wrote:

> On 12/29/2011 5:13 AM, a wrote:
> > The needless loads and stores would make it impossible to write an efficient SIMD add function even if the functions containing asm blocks could be inlined.
> 
> This does what you're asking for:
> 
> void test(ref float a, ref float b)
> {
>      asm
>      {
>          naked;
>          movaps  XMM0,[RSI];
>          addps   XMM0,[RDI];
>          movaps  [RSI],XMM0;
>          movaps  XMM0,[RSI];
>          addps   XMM0,[RDI];
>          movaps  [RSI],XMM0;
>          ret;
>      }
> }

What I want is to be able to write short functions using inline assembly and have them inlined and compiled even to a single instruction where possible. This can be done with gcc. See my post here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=153879
December 29, 2011
On 12/29/2011 2:52 PM, a wrote:
> What I want is to be able to write short functions using inline assembly and
> have them inlined and compiled even to a single instruction where possible.
> This can be done with gcc. See my post here:
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=153879

I understand. I just wanted to make sure you knew about 'naked' and what it's good for.
December 29, 2011
On 12/29/2011 12:19 PM, Vladimir Panteleev wrote:
> Before you disagree with any of the above, first
> (for starters) I'd like to invite you to translate Daniel Vik's C memcpy
> implementation to D:
> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html . It doesn't even
> use inline assembler or compiler intrinsics.

Ok, I have performed a direct translation (with all the preprocessor stuff replaced by string mixins). However, I think I could do a lot better starting from scratch in D. I have performed some basic testing with all the configuration options, and it seems to work correctly.

// File: memcpy.d direct translation of memcpy.c

/********************************************************************
 ** File:     memcpy.c
 **
 ** Copyright (C) 1999-2010 Daniel Vik
 **
 ** This software is provided 'as-is', without any express or implied
 ** warranty. In no event will the authors be held liable for any
 ** damages arising from the use of this software.
 ** Permission is granted to anyone to use this software for any
 ** purpose, including commercial applications, and to alter it and
 ** redistribute it freely, subject to the following restrictions:
 **
 ** 1. The origin of this software must not be misrepresented; you
 **    must not claim that you wrote the original software. If you
 **    use this software in a product, an acknowledgment in the
 **    product documentation would be appreciated but is not
 **    required.
 **
 ** 2. Altered source versions must be plainly marked as such, and
 **    must not be misrepresented as being the original software.
 **
 ** 3. This notice may not be removed or altered from any source
 **    distribution.
 **
 **
 ** Description: Implementation of the standard library function memcpy.
 **             This implementation of memcpy() is ANSI-C89 compatible.
 **
 **             The following configuration options can be set:
 **
 **           LITTLE_ENDIAN   - Uses processor with little endian
 **                             addressing. Default is big endian.
 **
 **           PRE_INC_PTRS    - Use pre increment of pointers.
 **                             Default is post increment of
 **                             pointers.
 **
 **           INDEXED_COPY    - Copying data using array indexing.
 **                             Using this option, disables the
 **                             PRE_INC_PTRS option.
 **
 **           MEMCPY_64BIT    - Compiles memcpy for 64 bit
 **                             architectures
 **
 **
 ** Best Settings:
 **
 ** Intel x86:  LITTLE_ENDIAN and INDEXED_COPY
 **
 *******************************************************************/


/********************************************************************
 ** Configuration definitions.
 *******************************************************************/

version = LITTLE_ENDIAN;
version = INDEXED_COPY;


/********************************************************************
 ** Includes for size_t definition
 *******************************************************************/

/********************************************************************
 ** Typedefs
 *******************************************************************/

// D_LP32 is not a predefined version identifier; check for the absence of D_LP64 instead.
version(MEMCPY_64BIT){
    version(D_LP64){} else static assert(0, "not a 64 bit compile");
}
version(D_LP64){
    alias ulong              UIntN;
    enum TYPE_WIDTH =        8;
}else{
    alias uint               UIntN;
    enum TYPE_WIDTH =        4;
}


/********************************************************************
 ** Remove definitions when INDEXED_COPY is defined.
 *******************************************************************/

version(INDEXED_COPY){
    version(PRE_INC_PTRS)
        static assert(0, "cannot use INDEXED_COPY together with PRE_INC_PTRS!");
}

/********************************************************************
 ** The X template
 *******************************************************************/

string Ximpl(string x){
    import utf = std.utf;
    string r=`"`;
    for(typeof(x.length) i=0;i<x.length;r~=x[i..i+utf.stride(x,i)],i+=utf.stride(x,i)){
        if(x[i]=='@'&&x[i+1]=='('){
            auto start = ++i; int nest=1;
            while(nest){
                i+=utf.stride(x,i);
                if(x[i]=='(') nest++;
                else if(x[i]==')') nest--;
            }
            i++;
            r~=`"~`~x[start..i]~`~"`;
            if(i==x.length) break;
        }
        if(x[i]=='"'||x[i]=='\\'){r~="\\"; continue;}
    }
    return r~`"`;
}

template X(string x){
    enum X = Ximpl(x);
}


/********************************************************************
 ** Definitions for pre and post increment of pointers.
 *******************************************************************/

// uses *(*&x)++ and similar to work around a bug in the parser
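// For illustration, with the default post-increment configuration,
//     mixin(INC_VAL(q{dst8})) = mixin(INC_VAL(q{src8}));
// expands to
//     *(*&dst8)++ = *(*&src8)++;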

version(PRE_INC_PTRS){
    string START_VAL(string x)           {return mixin(X!q{(*&@(x))--;});}
    string INC_VAL(string x)             {return mixin(X!q{*++(*&@(x))});}
    string CAST_TO_U8(string p, string o){
        return mixin(X!q{(cast(ubyte*)@(p) + @(o) + TYPE_WIDTH)});
    }
    enum WHILE_DEST_BREAK  =                     (TYPE_WIDTH - 1);
    enum PRE_LOOP_ADJUST   =                     q{- (TYPE_WIDTH - 1)};
    enum PRE_SWITCH_ADJUST =                     q{+ 1};
}else{
    string START_VAL(string x)           {return q{};}
    string INC_VAL(string x)             {return mixin(X!q{*(*&@(x))++});}
    string CAST_TO_U8(string p, string o){
        return mixin(X!q{(cast(ubyte*)@(p) + @(o))});
    }
    enum WHILE_DEST_BREAK  =                     0;
    enum PRE_LOOP_ADJUST   =                     q{};
    enum PRE_SWITCH_ADJUST =                     q{};
}




/********************************************************************
 ** Definitions for endians
 *******************************************************************/

version(LITTLE_ENDIAN){
    enum SHL = q{>>};
    enum SHR = q{<<};
}else{
    enum SHL = q{<<};
    enum SHR = q{>>};
}

/********************************************************************
 ** Macros for copying words of different alignment.
 ** Uses incrementing pointers.
 *******************************************************************/

string CP_INCR() {
    return mixin(X!q{
        @(INC_VAL(q{dstN})) = @(INC_VAL(q{srcN}));
    });
}

string CP_INCR_SH(string shl, string shr) {
    return mixin(X!q{
        dstWord   = srcWord @(SHL) @(shl);
        srcWord   = @(INC_VAL(q{srcN}));
        dstWord  |= srcWord @(SHR) @(shr);
        @(INC_VAL(q{dstN})) = dstWord;
    });
}



/********************************************************************
 ** Macros for copying words of different alignment.
 ** Uses array indexes.
 *******************************************************************/

string CP_INDEX(string idx) {
    return mixin(X!q{
        dstN[@(idx)] = srcN[@(idx)];
    });
}

string CP_INDEX_SH(string x, string shl, string shr) {
    return mixin(X!q{
        dstWord   = srcWord @(SHL) @(shl);
        srcWord   = srcN[@(x)];
        dstWord  |= srcWord @(SHR) @(shr);
        dstN[@(x)]= dstWord;
    });
}



/********************************************************************
 ** Macros for copying words of different alignment.
 ** Uses incrementing pointers or array indexes depending on
 ** configuration.
 *******************************************************************/

version(INDEXED_COPY){
    alias CP_INDEX CP;
    alias CP_INDEX_SH CP_SH;
    string INC_INDEX(string p, string o){
        return mixin(X!q{
            ((@(p)) += (@(o)));
        });
    }
}else{
    string CP(string idx) {return mixin(X!q{@(CP_INCR())});}
    string CP_SH(string idx, string shl, string shr){
        return mixin(X!q{
            @(CP_INCR_SH(mixin(X!q{@(shl)}), mixin(X!q{@(shr)})));
        });
    }
    string INC_INDEX(string p, string o){return q{};}
}


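// Byte-copies the remaining 0 to 7 tail bytes; the switch cases fall
// through intentionally.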
string COPY_REMAINING(string count) {
    return mixin(X!q{
        @(START_VAL(q{dst8}));
        @(START_VAL(q{src8}));

        switch (@(count)) {
        case 7: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 6: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 5: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 4: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 3: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 2: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 1: @(INC_VAL(q{dst8})) = @(INC_VAL(q{src8}));
        case 0:
        default: break;
        }
    });
}

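// Word-at-a-time copy for the case where source and destination have the
// same alignment within a word: copies 8 words per loop iteration, then
// hands the tail bytes to COPY_REMAINING.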
string COPY_NO_SHIFT() {
    return mixin(X!q{
        UIntN* dstN = cast(UIntN*)(dst8 @(PRE_LOOP_ADJUST));
        UIntN* srcN = cast(UIntN*)(src8 @(PRE_LOOP_ADJUST));
        size_t length = count / TYPE_WIDTH;

        while (length & 7) {
            @(CP_INCR());
            length--;
        }

        length /= 8;

        while (length--) {
            @(CP(q{0}));
            @(CP(q{1}));
            @(CP(q{2}));
            @(CP(q{3}));
            @(CP(q{4}));
            @(CP(q{5}));
            @(CP(q{6}));
            @(CP(q{7}));

            @(INC_INDEX(q{dstN}, q{8}));
            @(INC_INDEX(q{srcN}, q{8}));
        }

        src8 = @(CAST_TO_U8(q{srcN}, q{0}));
        dst8 = @(CAST_TO_U8(q{dstN}, q{0}));

        @(COPY_REMAINING(q{count & (TYPE_WIDTH - 1)}));

        return dest;
    });
}



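// Copy for the case where the source is misaligned by `shift` bytes
// relative to the word-aligned destination: reads aligned source words and
// merges each adjacent pair with shifts to assemble every destination word.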
string COPY_SHIFT(string shift) {
    return mixin(X!q{
        UIntN* dstN  = cast(UIntN*)(((cast(UIntN)dst8) @(PRE_LOOP_ADJUST)) &
                                    ~(TYPE_WIDTH - 1));
        UIntN* srcN  = cast(UIntN*)(((cast(UIntN)src8) @(PRE_LOOP_ADJUST)) &
                                    ~(TYPE_WIDTH - 1));
        size_t length  = count / TYPE_WIDTH;
        UIntN srcWord = @(INC_VAL(q{srcN}));
        UIntN dstWord;

        while (length & 7) {
            @(CP_INCR_SH(mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            length--;
        }

        length /= 8;

        while (length--) {
            @(CP_SH(q{0}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{1}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{2}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{3}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{4}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{5}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{6}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));
            @(CP_SH(q{7}, mixin(X!q{8 * @(shift)}), mixin(X!q{8 * (TYPE_WIDTH - @(shift))})));

            @(INC_INDEX(q{dstN}, q{8}));
            @(INC_INDEX(q{srcN}, q{8}));
        }

        src8 = @(CAST_TO_U8(q{srcN}, mixin(X!q{(@(shift) - TYPE_WIDTH)})));
        dst8 = @(CAST_TO_U8(q{dstN}, q{0}));

        @(COPY_REMAINING(q{count & (TYPE_WIDTH - 1)}));

        return dest;
    });
}


/********************************************************************
 **
 ** void *memcpy(void *dest, const void *src, size_t count)
 **
 ** Args:     dest        - pointer to destination buffer
 **           src         - pointer to source buffer
 **           count       - number of bytes to copy
 **
 ** Return:   A pointer to destination buffer
 **
 ** Purpose:  Copies count bytes from src to dest.
 **           No overlap check is performed.
 **
 *******************************************************************/

void *memcpy(void *dest, const void *src, size_t count)
{
    ubyte* dst8 = cast(ubyte*)dest;
    ubyte* src8 = cast(ubyte*)src;
    if (count < 8) {
        mixin(COPY_REMAINING(q{count}));
        return dest;
    }

    mixin(START_VAL(q{dst8}));
    mixin(START_VAL(q{src8}));

    while ((cast(UIntN)dst8 & (TYPE_WIDTH - 1)) != WHILE_DEST_BREAK) {
        mixin(INC_VAL(q{dst8})) = mixin(INC_VAL(q{src8}));
        count--;
    }
    switch ((mixin(`(cast(UIntN)src8)`~ PRE_SWITCH_ADJUST)) & (TYPE_WIDTH - 1)) {
    // { } required to work around DMD bug
    case 0: {mixin(COPY_NO_SHIFT());} break;
    case 1: {mixin(COPY_SHIFT(q{1}));}   break;
    case 2: {mixin(COPY_SHIFT(q{2}));}   break;
    case 3: {mixin(COPY_SHIFT(q{3}));}   break;
static if(TYPE_WIDTH > 4){ // was TYPE_WIDTH >= 4. bug in original code.
    case 4: {mixin(COPY_SHIFT(q{4}));}   break;
    case 5: {mixin(COPY_SHIFT(q{5}));}   break;
    case 6: {mixin(COPY_SHIFT(q{6}));}   break;
    case 7: {mixin(COPY_SHIFT(q{7}));}   break;
}
    default: assert(0);
    }
}


void main(){
    int[13] x = [1,2,3,4,5,6,7,8,9,0,1,2,3];
    int[13] y;
    memcpy(y.ptr, x.ptr, x.sizeof);
    import std.stdio;   writeln(y);
}
December 30, 2011
On Thursday, 29 December 2011 at 19:47:39 UTC, Walter Bright wrote:
> On 12/29/2011 3:19 AM, Vladimir Panteleev wrote:
>> I'd like to invite you to translate Daniel Vik's C memcpy implementation to D:
>> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
>
> Challenge accepted.

Ah, a direct translation using functions! This is probably the most elegant approach, however - as I'm sure you've noticed - the programmer has no control over what gets inlined.

> Examining the assembler output, it inlines everything except COPY_SHIFT, COPY_NO_SHIFT, and COPY_REMAINING. The inliner in dmd could definitely be improved, but that is not a problem with the language, but the implementation.

This is the problem with heuristic inlining: while great by itself, in a situation such as this the programmer is left with no choice but to examine the assembler output to make sure the compiler does what the programmer wants it to do. Such behavior can change from one implementation to another, and even from one compiler version to another. (After all, I don't think we can guarantee that what's inlined today will be inlined tomorrow.)

> Continuing in that vein, please note that neither C nor C++ require inlining of any sort. The "inline" keyword is merely a hint to the compiler. What inlining takes place is completely implementation defined, not language defined.

I think we can agree that the C inline hint is of limited use. However, major C compiler vendors implement an extension to force inlining. Generally, I would say that common vendor extensions seen in other languages are an opportunity for D to avoid a similar mess: such extensions would not have to be mandatory to implement, but when they are implemented, they would use the same syntax across implementations.

> I wish to note that the D version semantically accomplishes the same thing as the C version without using mixins or CTFE - it's all straightforward code, without the abusive preprocessor tricks.

I don't think there's much value in that statement. After all, except for a few occasional templates (which weren't strictly necessary), your translation uses few D-specific features. If you were to leave yourself at the mercy of a C compiler's optimizer, your rewrite would merely be a testament against C macros, not the power of D.

However, the most important part is: this translation is incorrect. C macros in the original code provide a guarantee that the code is inlined. D cannot make such guarantees - even your amended version is tuned to one specific implementation (and possibly, only a specific range of versions of it).
December 30, 2011
On Thursday, 29 December 2011 at 20:58:59 UTC, Timon Gehr wrote:
> I don't think you should use DMD to benchmark the D language.

You're missing my point. We can't count on the optimizers in all implementations being perfect. I am suggesting language features which could provide guarantees to the programmer regarding how the code will be compiled. If an implementation cannot satisfy them, the programmer should be told so, so that he can try something else - rather than having to sift through disassembler listings or use a profiler.

December 30, 2011
On Thursday, 29 December 2011 at 23:47:08 UTC, Timon Gehr wrote:
> ** The X template

Good work, but I'm not sure that inventing a DSL to make up for problems that D string mixins have and C macros don't qualifies as "doing it right".
December 30, 2011
On 12/29/2011 9:51 PM, Vladimir Panteleev wrote:
> Ah, a direct translation using functions! This is probably the most elegant
> approach, however - as I'm sure you've noticed - the programmer has no control
> over what gets inlined.

The programmer also has no control over which variables go into which registers. (Early C compilers did provide this.)


> I think we can agree that the C inline hint is of limited use. However, major C
> compiler vendors implement an extension to force inlining.

I know.


> I don't think there's much value in that statement. After all, except for a few
> occasional templates (which weren't strictly necessary), your translation uses
> few D-specific features. If you were to leave yourself at the mercy of a C
> compiler's optimizer, your rewrite would merely be a testament against C macros,
> not the power of D.

I think this criticism is off target, because the C example was almost entirely macros - and macros that were used in the service of evading C language limitations. The point wasn't to use clever D features, the challenge was to demonstrate you can get the same results in D as in C.


> However, the most important part is: this translation is incorrect. C macros in
> the original code provide a guarantee that the code is inlined. D cannot make
> such guarantees - even your amended version is tuned to one specific
> implementation (and possibly, only a specific range of versions of it).

I also think this is off target, because a C compiler really doesn't guarantee **** about efficiency, it only guarantees that it will work "as if" it was executed on some idealized abstract machine. Even dividing code up into functions is completely arbitrary, and open to wildly different strategies that are perfectly legal to any C compiler. A C compiler doesn't have to enregister anything in variables, either, and that has far more of a performance impact than inlining.

There are a very wide range of code generation techniques that compilers employ. All of them, to verify that they are being applied, require inspection of the assembler output. Many argue that the compiler should tell you about inlining - but what about all those others? I think the focus on inlining (as opposed to other possible optimizations) is out of proportion, likely exacerbated by dmd needing to do a better job of it.

I completely agree that DMD's inliner is underpowered and needs improvement. I am less sure that this demonstrates that the language needs changes.

Functions below a certain size should be inlined if possible. Those above that size do not benefit perceptibly from inlining. Where that certain size exactly is, who knows, but I doubt that functions near that size will benefit much from user intervention.

December 30, 2011
On Friday, 30 December 2011 at 06:53:06 UTC, Walter Bright wrote:
> I think this criticism is off target, because the C example was almost entirely macros - and macros that were used in the service of evading C language limitations. The point wasn't to use clever D features, the challenge was to demonstrate you can get the same results in D as in C.

...

> I also think this is off target, because a C compiler really doesn't guarantee **** about efficiency, it only guarantees that it will work "as if" it was executed on some idealized abstract machine. Even dividing code up into functions is completely arbitrary, and open to wildly different strategies that are perfectly legal to any C compiler. A C compiler doesn't have to enregister anything in variables, either, and that has far more of a performance impact than inlining.

Even though the core languages (of C and D) are not specific to any one platform, writing fast code has never been about targeting abstract, idealized virtual machines. Some assumptions need to be made. Most assumptions that the C memcpy code makes can be expected to generally hold across major C compilers (e.g. macros are at least as fast as regular functions). However, your D port makes some rather fragile assumptions regarding the compiler implementation.

Let's eliminate the language distinction and consider two memcpy versions - one using macros, the other using functions (not even with "inline"). Would you say that the second is generally as fast as the first? I'm being intentionally vague: saying that their performance is "about the same" rests on MUCH more fragile assumptions.

The fact that major compiler vendors implement language extensions to facilitate writing optimized code shows that there is a demand for it. Even compilers that are great at optimization (GCC, LLVM) have such intrinsics.

I'm not necessarily advocating changing the core language (e.g. new @attributes, things that would need to go into TDPLv2). However, what I think would greatly improve the situation is to have DigitalMars provide recommendations for implementation-specific extensions that provide more control with regard to how the code is compiled (pragma names, keywords starting with __, etc.). Once they're defined, pull requests to add them to DMD will follow.

> Functions below a certain size should be inlined if possible. Those above that size do not benefit perceptibly from inlining. Where that certain size exactly is, who knows, but I doubt that functions near that size will benefit much from user intervention.

I agree, but this wasn't so much about heuristics as about compiler capabilities (e.g. inlining functions that contain inline assembler).