December 29, 2011
On Thu, 29 Dec 2011 20:27:43 +0200, Walter Bright <newshound2@digitalmars.com> wrote:

> On 12/29/2011 2:15 AM, Caligo wrote:
>> If there is a God (I'm not saying there
>> isn't, and I'm not saying there is), what language would he choose to create the
>> universe?
>
> Mathematics.

In essence and in spirit, math is THE answer,
but if you mean the implementation we have now, it is verbose nonsense.

Then again, we are talking about God's language, so you have a point!
Only an immortal could comprehend math fully.
December 29, 2011
On 12/29/2011 11:47 AM, Walter Bright wrote:
> On 12/29/2011 3:19 AM, Vladimir Panteleev wrote:
>> I'd like to invite you to translate Daniel Vik's C memcpy implementation to D:
>> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
>
> Challenge accepted.


This does compile, though I did not test or benchmark it.

The assembler output shows that it inlines everything except COPY_SHIFT, COPY_NO_SHIFT, and COPY_REMAINING. The inliner in dmd could definitely be improved, but that is a problem with the implementation, not the language.

Continuing in that vein, please note that neither C nor C++ requires inlining of any sort. The "inline" keyword is merely a hint to the compiler. What inlining takes place is completely implementation-defined, not language-defined.

The same goes for all those language extensions you mentioned. Those are not part of Standard C. They are vendor extensions. Does that mean that C is not actually a systems language? No.

I wish to note that the D version semantically accomplishes the same thing as the C version without using mixins or CTFE - it's all straightforward code, without the abusive preprocessor tricks.
December 29, 2011
On 12/29/2011 9:15 AM, so wrote:
> The legitimate "D performs so badly in my example" posts that appeared in this
> forum almost always ended with the conclusion that D lacks a controlled inline
> mechanism.

Standard C doesn't have one either. C vendors often implement vendor-specific extensions for this.
December 29, 2011
On 12/29/11 19:27, Walter Bright wrote:
> On 12/29/2011 2:15 AM, Caligo wrote:
>> If there is a God (I'm not saying there
>> isn't, and I'm not saying there is), what language would he choose to create the
>> universe?
> 
> Mathematics.

Fan of Tegmark¹, eh? :)

--

¹http://en.wikipedia.org/wiki/Mathematical_universe_hypothesis
December 29, 2011
On 12/29/2011 5:13 AM, a wrote:
> The needless loads and stores would make it impossible to write an efficient
> simd add function even if the functions containing asm blocks could be
> inlined.

This does what you're asking for:

void test(ref float a, ref float b)
{
    asm
    {
        naked;
        movaps  XMM0,[RSI];
        addps   XMM0,[RDI];
        movaps  [RSI],XMM0;
        movaps  XMM0,[RSI];
        addps   XMM0,[RDI];
        movaps  [RSI],XMM0;
        ret;
    }
}
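
For comparison, the same four-lane add can be written without asm using D's array operations. This is only a sketch: add4 is a made-up name, and whether the compiler turns it into a single addps is an assumption that depends on the compiler and flags, not something measured in this thread.

```d
// Adds the four floats in src to the four floats in dst, element by element.
void add4(ref float[4] dst, ref const float[4] src)
{
    dst[] += src[];   // array operation over all four elements
}

void main()
{
    float[4] a = [1, 2, 3, 4];
    float[4] b = [10, 20, 30, 40];
    add4(a, b);
    assert(a[] == [11f, 22f, 33f, 44f]);
}
```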
December 29, 2011
On Thu, 29 Dec 2011 22:00:12 +0200, Walter Bright <newshound2@digitalmars.com> wrote:

> The assembler output shows that it inlines everything except COPY_SHIFT, COPY_NO_SHIFT, and COPY_REMAINING. The inliner in dmd could definitely be improved, but that is a problem with the implementation, not the language.
>
> Continuing in that vein, please note that neither C nor C++ requires inlining of any sort. The "inline" keyword is merely a hint to the compiler. What inlining takes place is completely implementation-defined, not language-defined.
>
> The same goes for all those language extensions you mentioned. Those are not part of Standard C. They are vendor extensions. Does that mean that C is not actually a systems language? No.
>
> I wish to note that the D version semantically accomplishes the same thing as the C version without using mixins or CTFE - it's all straightforward code, without the abusive preprocessor tricks.

Yet every big C/C++ compiler has to support it, no?
Let's forget D for a second.
Will you, as a compiler vendor, support controlled inlining in DMD with an extension?
Or, to put it another way, will you "let" the community do it?
December 29, 2011
On 12/29/2011 11:47 AM, Walter Bright wrote:
> On 12/29/2011 3:19 AM, Vladimir Panteleev wrote:
>> I'd like to invite you to translate Daniel Vik's C memcpy implementation to D:
>> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
>
> Challenge accepted.

Here's another version that uses string mixins to ensure inlining of the COPY functions. There are no call instructions in the generated code. This should be as good as the C version using the same code generator.
----------------
/********************************************************************
 ** File:     memcpy.c
 **
 ** Copyright (C) 1999-2010 Daniel Vik
 **
 ** This software is provided 'as-is', without any express or implied
 ** warranty. In no event will the authors be held liable for any
 ** damages arising from the use of this software.
 ** Permission is granted to anyone to use this software for any
 ** purpose, including commercial applications, and to alter it and
 ** redistribute it freely, subject to the following restrictions:
 **
 ** 1. The origin of this software must not be misrepresented; you
 **    must not claim that you wrote the original software. If you
 **    use this software in a product, an acknowledgment in the
 **    product documentation would be appreciated but is not
 **    required.
 **
 ** 2. Altered source versions must be plainly marked as such, and
 **    must not be misrepresented as being the original software.
 **
 ** 3. This notice may not be removed or altered from any source
 **    distribution.
 **
 **
 ** Description: Implementation of the standard library function memcpy.
 **             This implementation of memcpy() is ANSI-C89 compatible.
 **
 **             The following configuration options can be set:
 **
 **           LITTLE_ENDIAN   - Uses processor with little endian
 **                             addressing. Default is big endian.
 **
 **           PRE_INC_PTRS    - Use pre increment of pointers.
 **                             Default is post increment of
 **                             pointers.
 **
 **           INDEXED_COPY    - Copying data using array indexing.
 **                             Using this option, disables the
 **                             PRE_INC_PTRS option.
 **
 **           MEMCPY_64BIT    - Compiles memcpy for 64 bit
 **                             architectures
 **
 **
 ** Best Settings:
 **
 ** Intel x86:  LITTLE_ENDIAN and INDEXED_COPY
 **
 *******************************************************************/

module memcpy;


/********************************************************************
 ** Configuration definitions.
 *******************************************************************/

version = LITTLE_ENDIAN;
version = INDEXED_COPY;


/********************************************************************
 ** Includes for size_t definition
 *******************************************************************/



/********************************************************************
 ** Typedefs
 *******************************************************************/

alias ubyte       UInt8;
alias ushort      UInt16;
alias uint        UInt32;
alias ulong       UInt64;

version (D_LP64)
{
    alias UInt64   UIntN;
    enum TYPE_WIDTH = 8;
}
else
{
    alias UInt32 UIntN;
    enum TYPE_WIDTH = 4;
}


/********************************************************************
 ** Remove definitions when INDEXED_COPY is defined.
 *******************************************************************/

//#if defined (INDEXED_COPY)
//#if defined (PRE_INC_PTRS)
//#undef PRE_INC_PTRS
//#endif /*PRE_INC_PTRS*/
//#endif /*INDEXED_COPY*/



/********************************************************************
 ** Definitions for pre and post increment of pointers.
 *******************************************************************/

version (PRE_INC_PTRS)
{
    void START_VAL(ref UInt8* x)      { x--; }
    ref T INC_VAL(T)(ref T* x)        { return *++x; }
    UInt8* CAST_TO_U8(void* p, int o) { return cast(UInt8*)p + o + TYPE_WIDTH; }
    enum WHILE_DEST_BREAK  = (TYPE_WIDTH - 1);
    enum PRE_LOOP_ADJUST   = -(TYPE_WIDTH - 1);
    enum PRE_SWITCH_ADJUST = 1;
}
else
{
    void START_VAL(UInt8* x)	      { }
    ref T INC_VAL(T)(ref T* x)        { return *x++; }
    UInt8* CAST_TO_U8(void* p, int o) { return cast(UInt8*)p + o; }
    enum WHILE_DEST_BREAK  = 0;
    enum PRE_LOOP_ADJUST   = 0;
    enum PRE_SWITCH_ADJUST = 0;
}







/********************************************************************
 **
 ** void *memcpy(void *dest, const void *src, size_t count)
 **
 ** Args:     dest        - pointer to destination buffer
 **           src         - pointer to source buffer
 **           count       - number of bytes to copy
 **
 ** Return:   A pointer to destination buffer
 **
 ** Purpose:  Copies count bytes from src to dest.
 **           No overlap check is performed.
 **
 *******************************************************************/

void *memcpy(void *dest, const void *src, size_t count)
{
    auto dst8 = cast(UInt8*)dest;
    auto src8 = cast(UInt8*)src;

    UIntN* dstN;
    UIntN* srcN;
    UIntN dstWord;
    UIntN srcWord;

    /********************************************************************
     ** Macros for copying words of different alignment.
     ** Uses incrementing pointers.
     *******************************************************************/

    void CP_INCR() {
	INC_VAL(dstN) = INC_VAL(srcN);
    }

    void CP_INCR_SH(int shl, int shr) {
	version (LITTLE_ENDIAN)
	{
	    dstWord   = srcWord >> shl;
	    srcWord   = INC_VAL(srcN);
	    dstWord  |= srcWord << shr;
	    INC_VAL(dstN) = dstWord;
	}
	else
	{
	    dstWord   = srcWord << shl;
	    srcWord   = INC_VAL(srcN);
	    dstWord  |= srcWord >> shr;
	    INC_VAL(dstN) = dstWord;
	}
    }



    /********************************************************************
     ** Macros for copying words of different alignment.
     ** Uses array indexes.
     *******************************************************************/

    void CP_INDEX(size_t idx) {
	dstN[idx] = srcN[idx];
    }

    void CP_INDEX_SH(size_t x, int shl, int shr) {
	version (LITTLE_ENDIAN)
	{
	    dstWord   = srcWord >> shl;
	    srcWord   = srcN[x];
	    dstWord  |= srcWord << shr;
	    dstN[x]  = dstWord;
	}
	else
	{
	    dstWord   = srcWord << shl;
	    srcWord   = srcN[x];
	    dstWord  |= srcWord >> shr;
	    dstN[x]  = dstWord;
	}
    }


    /********************************************************************
     ** Macros for copying words of different alignment.
     ** Uses incrementing pointers or array indexes depending on
     ** configuration.
     *******************************************************************/

    version (INDEXED_COPY)
    {
	void CP(size_t idx) { CP_INDEX(idx); }
	void CP_SH(size_t idx, int shl, int shr) { CP_INDEX_SH(idx, shl, shr); }

	void INC_INDEX(T)(ref T* p, size_t o) { p += o; }
    }
    else
    {
	void CP(size_t idx) { CP_INCR(); }
	void CP_SH(size_t idx, int shl, int shr) { CP_INCR_SH(shl, shr); }

	void INC_INDEX(T)(T* p, size_t o) { }
    }

    static immutable string COPY_REMAINING = q{
	START_VAL(dst8);
	START_VAL(src8);

	switch (cnt) {
	case 7: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 6: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 5: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 4: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 3: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 2: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 1: INC_VAL(dst8) = INC_VAL(src8); goto case;
	case 0:
	default: break;
	}
    };

    static immutable string COPY_NO_SHIFT = q{
	dstN = cast(UIntN*)(dst8 + PRE_LOOP_ADJUST);
	srcN = cast(UIntN*)(src8 + PRE_LOOP_ADJUST);
	size_t length = count / TYPE_WIDTH;

	while (length & 7) {
	    CP_INCR();
	    length--;
	}

	length /= 8;

	while (length--) {
	    CP(0);
	    CP(1);
	    CP(2);
	    CP(3);
	    CP(4);
	    CP(5);
	    CP(6);
	    CP(7);

	    INC_INDEX(dstN, 8);
	    INC_INDEX(srcN, 8);
	}

	src8 = CAST_TO_U8(srcN, 0);
	dst8 = CAST_TO_U8(dstN, 0);

	{ const cnt = (count & (TYPE_WIDTH - 1)); mixin(COPY_REMAINING); }
    };


    static immutable string COPY_SHIFT = q{
	dstN  = cast(UIntN*)(((cast(UIntN)dst8) + PRE_LOOP_ADJUST) &
				 ~(TYPE_WIDTH - 1));
	srcN  = cast(UIntN*)(((cast(UIntN)src8) + PRE_LOOP_ADJUST) &
				 ~(TYPE_WIDTH - 1));
	size_t length  = count / TYPE_WIDTH;
	srcWord = INC_VAL(srcN);

	while (length & 7) {
	    CP_INCR_SH(8 * shift, 8 * (TYPE_WIDTH - shift));
	    length--;
	}

	length /= 8;

	while (length--) {
	    CP_SH(0, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(1, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(2, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(3, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(4, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(5, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(6, 8 * shift, 8 * (TYPE_WIDTH - shift));
	    CP_SH(7, 8 * shift, 8 * (TYPE_WIDTH - shift));

	    INC_INDEX(dstN, 8);
	    INC_INDEX(srcN, 8);
	}

	src8 = CAST_TO_U8(srcN, (shift - TYPE_WIDTH));
	dst8 = CAST_TO_U8(dstN, 0);

	{ const cnt = (count & (TYPE_WIDTH - 1)); mixin(COPY_REMAINING); }
    };


    if (count < 8) {
	const cnt = count;
        mixin(COPY_REMAINING);
        return dest;
    }

    START_VAL(dst8);
    START_VAL(src8);

    while ((cast(UIntN)dst8 & (TYPE_WIDTH - 1)) != WHILE_DEST_BREAK) {
        INC_VAL(dst8) = INC_VAL(src8);
        count--;
    }

    final switch (((cast(UIntN)src8) + PRE_SWITCH_ADJUST) & (TYPE_WIDTH - 1)) {
    case 0: mixin(COPY_NO_SHIFT); break;
    case 1: { const shift = 1; mixin(COPY_SHIFT); }   break;
    case 2: { const shift = 2; mixin(COPY_SHIFT); }   break;
    case 3: { const shift = 3; mixin(COPY_SHIFT); }   break;
    static if (TYPE_WIDTH >= 4)
    {
	case 4: { const shift = 4; mixin(COPY_SHIFT); }   break;
	case 5: { const shift = 5; mixin(COPY_SHIFT); }   break;
	case 6: { const shift = 6; mixin(COPY_SHIFT); }   break;
	case 7: { const shift = 7; mixin(COPY_SHIFT); }   break;
    }
    }

    return dest;
}
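
The shift-and-merge step inside CP_INDEX_SH is easier to check in isolation. The sketch below is not part of the posted code: mergeLE is a hypothetical helper, fixed at 32-bit words and little-endian byte order, that shows how two aligned source words are combined into one word of an unaligned copy.

```d
// Combine the tail of the previous aligned word with the head of the next
// one, as the little-endian branch of CP_INDEX_SH does.
// shift is the source misalignment in bytes and must be 1, 2 or 3 here.
uint mergeLE(uint prev, uint next, int shift)
{
    return (prev >> (8 * shift)) | (next << (8 * (4 - shift)));
}

void main()
{
    // Source bytes in memory: 00 11 22 33 | 44 55 66 77
    uint w0 = 0x33221100;   // first aligned word, little-endian
    uint w1 = 0x77665544;   // second aligned word, little-endian

    // A read starting at byte offset 1 covers 11 22 33 44.
    assert(mergeLE(w0, w1, 1) == 0x44332211);
    // A read starting at byte offset 3 covers 33 44 55 66.
    assert(mergeLE(w0, w1, 3) == 0x66554433);
}
```

On a big-endian machine the two shift directions swap, which is what the else branch of the version (LITTLE_ENDIAN) block does.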
December 29, 2011
On 12/29/2011 12:23 PM, so wrote:
> Yet every big C/C++ compiler has to support it, no?
> Let's forget D for a second.
> Will you, as a compiler vendor, support controlled inlining in DMD with an extension?
> Or, to put it another way, will you "let" the community do it?

You can do a pull request for it, and we can evaluate it.
December 29, 2011
On 12/29/2011 12:19 PM, Vladimir Panteleev wrote:
> On Thursday, 29 December 2011 at 09:16:23 UTC, Walter Bright wrote:
>> Are you a ridiculous hacker? Inline x86 assembly that the compiler
>> actually understands in 32 AND 64 bit code, hex string literals like
>> x"DE ADB EEF" where spacing doesn't matter, the ability to set data
>> alignment cross-platform with type.alignof = 16, load your shellcode
>> verbatim into a string like so: auto str = import("shellcode.txt");
>
> I would like to talk about this for a bit. Personally, I think D's
> system programming abilities are only half-way there. Note that I am not
> talking about use cases in high-level application code, but rather
> low-level, widely-used framework code, where every bit of performance
> matters (for example: memory copy routines, string builders, garbage
> collectors).
>
> In-line assembler as part of the language is certainly neat, and in fact
> coming from Delphi to C++ I was surprised to learn that C++
> implementations adopted different syntax for asm blocks. However,
> compared to some C++ compilers, it has severe limitations and is D's
> only trick in this alley.
>
> For one thing, there is no way to force the compiler to inline a
> function (like __forceinline / __attribute((always_inline)) ). This is
> fine for high-level code (where users are best left with PGO and "the
> compiler knows best"), but sucks if you need a guarantee that the
> function must be inlined. The guarantee isn't just about inlining
> heuristics, but also implementation capabilities. For example, some
> implementations might not be able to inline functions that use certain
> language features, and your code's performance could demand that such a
> short function must be inlined. One example of this is inlining
> functions containing asm blocks - IIRC DMD does not support this.

That does not mean the language does not support it; ldc and gdc can probably do it.

> The
> compiler should fail the build if it can't inline a function tagged with
> @forceinline, instead of shrugging it off and failing silently, forcing
> users to check the disassembly every time.

+1. I think we should extend the 'enum' storage class to functions, and introduce cast(enum) to force instant evaluation.

void foo() enum{...} // always inlined or compile error. declaration alone does not contribute code to the object file
void goo(){...}      // inlined at compiler's discretion

void main(){
    (cast(enum)goo)(); // inlined or compile error
}
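
For comparison with the proposal above: later D compilers accept pragma(inline, true) on a function, which requests that it always be inlined. A minimal sketch follows; rotl8 and mix are made-up names, and how a failure to inline is reported is left to the compiler.

```d
// Ask the compiler to always inline this function at its call sites.
pragma(inline, true)
uint rotl8(uint x)
{
    return (x << 8) | (x >> 24);   // rotate left by one byte
}

uint mix(uint a, uint b)
{
    return rotl8(a) ^ b;   // rotl8 is expected to be expanded here, not called
}

void main()
{
    assert(mix(0x11223344, 0) == 0x22334411);
}
```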

>
> You may have noticed that GCC has some ridiculously complicated
> assembler facilities. However, they also open the way to the
> possibilities of writing optimal code - for example, creating custom
> calling conventions, or inlining assembler functions without restricting
> the caller's register allocation with a predetermined calling
> convention. In contrast, DMD is very conservative when it comes to
> mixing D and assembler. One time I found that putting an asm block in a
> function turned what were single instructions into blocks of 6
> instructions each.
>
> D's shortcomings in this area make it impossible to create language features
> that are on the level of D's compiler built-ins. For example, I have
> tested three memcpy implementations recently, but none of them could
> beat DMD's standard array slice copy (even though in release mode it
> compiles to a simple memcpy call). Why? Because the overhead of using a
> custom memcpy routine negated its performance gains.

I don't think you should use DMD to benchmark the D language.

>
> This might have been alleviated with the presence of sane macros, but no
> such luck. String mixins are not the answer: trying to translate
> macro-heavy C code to D using string mixins is string escape hell, and
> we're back to the level of shell scripts.

No string escape hell if you do it right.
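
A minimal sketch of what "doing it right" could look like, assuming token strings (q{...}) are what is meant. COPY_ONE and copyThree are made-up names for illustration.

```d
// Token string: the body is lexed as ordinary D tokens, so the string
// literal inside it needs no backslash escapes.
enum COPY_ONE = q{
    dst[i] = src[i];
    trace ~= "copied";
    i++;
};

ubyte[3] copyThree(const ubyte[3] src, ref string[] trace)
{
    ubyte[3] dst;
    size_t i = 0;
    mixin(COPY_ONE);   // pasted in place, so there is no call to inline
    mixin(COPY_ONE);
    mixin(COPY_ONE);
    return dst;
}

void main()
{
    string[] trace;
    ubyte[3] src = [1, 2, 3];
    assert(copyThree(src, trace) == src);
    assert(trace.length == 3);
}
```

The mixed-in text sees the local names dst, src, i and trace directly, much like a C macro body would.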

>
> We've discussed this topic on IRC recently. From what I understood,
> Andrei thinks improvements in this area are not "impactful" enough,
> which I find worrisome.

Me too.

>
> Personally, I don't think D qualifies as a true "system programming
> language" in light of the above.

Neither do C or C++ without compiler specific extensions. We should definitely standardise such features in D.

> It's more of a compiled language with
> pointers and assembler. Before you disagree with any of the above, first
> (for starters) I'd like to invite you to translate Daniel Vik's C memcpy
> implementation to D:
> http://www.danielvik.com/2010/02/fast-memcpy-in-c.html . It doesn't even
> use inline assembler or compiler intrinsics.

OK, will do.



December 29, 2011
On 12/29/11 9:58 PM, Timon Gehr wrote:
> On 12/29/2011 12:19 PM, Vladimir Panteleev wrote:
>> […]One example of this is inlining
>> functions containing asm blocks - IIRC DMD does not support this.
> That does not mean the language does not support it; ldc and
> gdc can probably do it.

LDC has pragma(allow_inline), which allows you to mark a function containing inline asm as safe to inline.

David