November 26, 2008
On Wed, 26 Nov 2008 12:18:27 -0800, Robert Fraser wrote:

> bearophile wrote:
>> To test the new scoping features of D 2.021 I have used a small stressing program, coming from this page, originally by Knuth:
>> 
>> http://www.rosettacode.org/wiki/Man_or_boy_test
>> 
>> More info:
>> http://en.wikipedia.org/wiki/Man_or_boy_test
>> 
>> My purpose is to see the D2 compiler being able to compute results up to N=25 on my PC (that has 2 GB RAM and a 32 bit operating system).
>> 
>> ---------------------------
>> 
>> This is the code I have used for DMD 1.037:
>> 
>> import std.c.stdio: printf;
>> 
>> int a(int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4, lazy
>> int x5) {
>>     int delegate() b;
>>     b = {
>>         k -= 1;
>>         return a(k, b(), x1, x2, x3, x4);
>>     };
>>     return k <= 0 ? x4 + x5 : b();
>> }
>> 
>> void main() {
>>     printf("%d\n", a(15, 1, -1, -1, 1, 0)); // N is 15
>> }
>> 
>> ---------------------------
>> 
>> This is the code I have used for DMD 2.021. I was not sure how to use the scope, so if you have improvements please tell me:
>> 
>> import std.c.stdio: printf;
>> 
>> int a(scope int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4,
>> lazy int x5) {
>>     scope int delegate() b;
>>     b = {
>>         k -= 1;
>>         return a(k, b(), x1, x2, x3, x4);
>>     };
>>     return k <= 0 ? x4 + x5 : b();
>> }
>> 
>> void main() {
>>     printf("%d\n", a(15, 1, -1, -1, 1, 0)); // N is 15
>> }
>> 
>> ---------------------------
>> 
>> The results for higher values of N are: k   21           22 23         24          25 A   -389695     -865609     -1922362 -4268854    -9479595
>> 
>> In both cases I have compiled the code with: dmd -O -release -inline -L/STACK:1700000000 man_or_boy.d
>> 
>> The results:
>> 
>> DMD V.1.037 (exe: 168_476: bytes):
>>   N=24:  788 MB RAM, 3 seconds
>>   N=25: 1.57 GB RAM, 6.42 seconds
>> 
>> DMD V.2.021 (exe: 99_356 bytes):
>> 
>> The code was too much slow in D2, so I have stopped it. Do you know if the D2 code can be improved to have performance similar to D1 ones?
>> 
>> Bye,
>> bearophile
> 
> Try marking all the "lazy" parameters as "scope" ("lazy" creates
> delegates).

From change log: "The lazy storage class now implies scope so that lazy arguments won't trigger a heap allocated closure."
November 26, 2008
Jarrett Billingsley wrote:
> The reason I wonder is because I would expect that the compiler is
> still allocating the delegate on the heap if you use the first syntax.
>  (the second is also shorter and clearer.)

There's no reason to suspect. Just obj2asm the output and see. Furthermore, better results will come from:

    int b() {
        k -= 1;
        return a(k, b(), x1, x2, x3, x4);
    };

instead of using the delegate.
November 26, 2008
On Wed, Nov 26, 2008 at 4:13 PM, Walter Bright <newshound1@digitalmars.com> wrote:
> Jarrett Billingsley wrote:
>>
>> The reason I wonder is because I would expect that the compiler is
>> still allocating the delegate on the heap if you use the first syntax.
>>  (the second is also shorter and clearer.)
>
> There's no reason to suspect. Just obj2asm the output and see.

That'd be great if I had it.  I don't want to get into that, though.

> Furthermore,
> better results will come from:
>
>    int b() {
>        k -= 1;
>        return a(k, b(), x1, x2, x3, x4);
>    };
>
> instead of using the delegate.

So my suspicion is correct, then?  That is:

scope int delegate() b;
b = { ... };

Will allocate on the heap?
November 26, 2008
Walter Bright:
> There's no reason to suspect. Just obj2asm the output and see.
> Furthermore, better results will come from:
>      int b() {
>          k -= 1;
>          return a(k, b(), x1, x2, x3, x4);
>      };
> instead of using the delegate.

I don't fully understand what's happening, maybe I am doing something wrong, so I just report what I have seen.

--------------------

With DMD 1.036 this code works up to N=25, usign almost 1.6 GB of RAM (and I think a running time of just 6.4 seconds is very little, try to do this with any dynamic language, so I was impressed by D1):

// code #1
import std.c.stdio: printf;

int a(int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4, lazy int x5) {
    int delegate() b;
    b = {
        k -= 1;
        return a(k, b(), x1, x2, x3, x4);
    };
    return k <= 0 ? x4 + x5 : b();
}

void main() {
    printf("%d\n", a(24, 1, -1, -1, 1, 0));
}

--------------------

While on DMD 1.036 the following code:

// code #2
import std.c.stdio: printf;

int a(int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4, lazy int x5) {
    int b() {
        k -= 1;
        return a(k, b(), x1, x2, x3, x4);
    }
    return k <= 0 ? x4 + x5 : b();
}

void main() {
    printf("%d\n", a(24, 1, -1, -1, 1, 0));
}

Outputs at compile time:

man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral1 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral2 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral3 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral4 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral5 is a nested function and cannot be accessed from a


But this code works on the codepad site, that uses DMD v.1.026 (I think with Tangobos), up to N=21 (then the site goes into timeout for safety):
http://codepad.org/8GtKswm8

------------------------------

DMD v2.021 outputs the same errors with code #2.

------------------------------

DMD is able to run code #1 and the following #3:

// code #3
import std.c.stdio: printf;

int a(scope int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4, lazy int x5) {
    scope int delegate() b;
    b = {
        k -= 1;
        return a(k, b(), x1, x2, x3, x4);
    };
    return k <= 0 ? x4 + x5 : b();
}

void main() {
    printf("%d\n", a(22, 1, -1, -1, 1, 0));
}


All the following compilations are done with and N=20 and: dmd -O -release -inline -L/STACK:1700000000 man_or_boy.d dmd -O -release -inline -L/STACK:1700000000 man_or_boy2.d

Note: in D2 the code #1 and code #3 seem to use the same amount of RAM and time to run, 2.46 s and about 94MB RAM (N=20).

The same code #1 in DMD 1.036 runs in about 0.16 s and I am not able to know how much RAM it uses (N=20).

------------------------------
ASM of code #1 compiled with DMD 1.036:

_D10man_or_boy1aFiLiLiLiLiLiZi	comdat
	assume	CS:_D10man_or_boy1aFiLiLiLiLiLiZi
		sub	ESP,0Ch
		push	EBX
		mov	dword ptr 8[ESP],0
		mov	dword ptr 0Ch[ESP],0
		lea	EAX,0Ch[ESP]
		mov	ECX,offset FLAT:_D10man_or_boy1aFiLiLiLiLiLiZi12__dgliteral1MFZi
		mov	8[ESP],EAX
		mov	0Ch[ESP],ECX
		cmp	dword ptr 03Ch[ESP],0
		jg	L56
		mov	EAX,01Ch[ESP]
		mov	EDX,020h[ESP]
		mov	EBX,01Ch[ESP]
		call	EDX
		push	EAX
		sub	ESP,4
		mov	EAX,01Ch[ESP]
		mov	EDX,020h[ESP]
		mov	EBX,01Ch[ESP]
		call	EDX
		mov	ECX,EAX
		add	ESP,4
		pop	EAX
		add	EAX,ECX
		jmp short	L62
L56:		mov	EAX,8[ESP]
		mov	EDX,ECX
		mov	EBX,8[ESP]
		call	EDX
L62:		pop	EBX
		add	ESP,0Ch
		ret	02Ch

------------------------------
ASM of code #1 compiled with DMD 2.021:

	assume	CS:_D10man_or_boy1aFiLiLiLiLiLiZi
L0:		sub	ESP,0Ch
		push	EBX
		push	ESI
		push	EDI
		push	030h
		call	near ptr __d_allocmemory
		mov	ESI,EAX
		mov	dword ptr [ESI],0
		mov	EAX,048h[ESP]
		mov	4[ESI],EAX
		mov	EDX,044h[ESP]
		mov	EAX,040h[ESP]
		mov	010h[ESI],EAX
		mov	014h[ESI],EDX
		mov	EDX,03Ch[ESP]
		mov	EAX,038h[ESP]
		mov	018h[ESI],EAX
		mov	01Ch[ESI],EDX
		mov	EDX,034h[ESP]
		mov	EAX,030h[ESP]
		mov	020h[ESI],EAX
		mov	024h[ESI],EDX
		mov	EDX,02Ch[ESP]
		mov	EAX,028h[ESP]
		mov	028h[ESI],EAX
		mov	02Ch[ESI],EDX
		mov	dword ptr 8[ESI],0
		mov	dword ptr 0Ch[ESI],0
		mov	EBX,ESI
		mov	ECX,offset FLAT:_D10man_or_boy1aFiLiLiLiLiLiZi12__dgliteral1MFZi
		mov	8[ESI],EBX
		mov	0Ch[ESI],ECX
		add	ESP,4
		cmp	dword ptr 4[ESI],0
		jg	L9F
		mov	EAX,028h[ESI]
		mov	EDX,02Ch[ESI]
		mov	EDI,028h[ESI]
		call	EDX
		push	EAX
		sub	ESP,4
		mov	EAX,024h[ESP]
		mov	EDX,028h[ESP]
		mov	EDI,024h[ESP]
		call	EDX
		mov	ECX,EAX
		add	ESP,4
		pop	EAX
		add	EAX,ECX
		jmp short	LA4
L9F:		mov	EAX,8[ESI]
		call	ECX
LA4:		pop	EDI
		pop	ESI
		pop	EBX
		add	ESP,0Ch
		ret	02Ch
_D10man_or_boy1aFiLiLiLiLiLiZi	ends

------------------------------
ASM of code #3 compiled with DMD 2.021:

_D11man_or_boy31aFMiLiLiLiLiLiZi	comdat
	assume	CS:_D11man_or_boy31aFMiLiLiLiLiLiZi
L0:		sub	ESP,0Ch
		push	EBX
		push	ESI
		push	EDI
		push	030h
		call	near ptr __d_allocmemory
		mov	ESI,EAX
		mov	dword ptr [ESI],0
		mov	EAX,048h[ESP]
		mov	4[ESI],EAX
		mov	EDX,044h[ESP]
		mov	EAX,040h[ESP]
		mov	010h[ESI],EAX
		mov	014h[ESI],EDX
		mov	EDX,03Ch[ESP]
		mov	EAX,038h[ESP]
		mov	018h[ESI],EAX
		mov	01Ch[ESI],EDX
		mov	EDX,034h[ESP]
		mov	EAX,030h[ESP]
		mov	020h[ESI],EAX
		mov	024h[ESI],EDX
		mov	EDX,02Ch[ESP]
		mov	EAX,028h[ESP]
		mov	028h[ESI],EAX
		mov	02Ch[ESI],EDX
		mov	dword ptr 8[ESI],0
		mov	dword ptr 0Ch[ESI],0
		mov	EBX,ESI
		mov	ECX,offset FLAT:_D11man_or_boy31aFMiLiLiLiLiLiZi12__dgliteral1MFZi
		mov	8[ESI],EBX
		mov	0Ch[ESI],ECX
		add	ESP,4
		cmp	dword ptr 4[ESI],0
		jg	L9F
		mov	EAX,028h[ESI]
		mov	EDX,02Ch[ESI]
		mov	EDI,028h[ESI]
		call	EDX
		push	EAX
		sub	ESP,4
		mov	EAX,024h[ESP]
		mov	EDX,028h[ESP]
		mov	EDI,024h[ESP]
		call	EDX
		mov	ECX,EAX
		add	ESP,4
		pop	EAX
		add	EAX,ECX
		jmp short	LA4
L9F:		mov	EAX,8[ESI]
		call	ECX
LA4:		pop	EDI
		pop	ESI
		pop	EBX
		add	ESP,0Ch
		ret	02Ch
_D11man_or_boy31aFMiLiLiLiLiLiZi	ends

------------------------------

Bye,
bearophile
November 26, 2008
On Wed, 26 Nov 2008 13:24:57 -0500, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
> scope int b() { .. }
>
> The reason I wonder is because I would expect that the compiler is
> still allocating the delegate on the heap if you use the first syntax.
>  (the second is also shorter and clearer.)

Just as a point of reference (in D1)
scope Object a = new Object(); // Stack Allocated
scope Object b;
b = new Object(); // Heap Allocated

So there may be some merit to scope int b() { .. } vs scope int b(); b = {...}
November 26, 2008
Wed, 26 Nov 2008 17:02:54 -0500, bearophile wrote:

> While on DMD 1.036 the following code:
> 
> // code #2
> import std.c.stdio: printf;
> 
> int a(int k, lazy int x1, lazy int x2, lazy int x3, lazy int x4, lazy int x5) {
>     int b() {
>         k -= 1;
>         return a(k, b(), x1, x2, x3, x4);
>     }
>     return k <= 0 ? x4 + x5 : b();
> }
> 
> void main() {
>     printf("%d\n", a(24, 1, -1, -1, 1, 0));
> }
> 
> Outputs at compile time:
> 
> man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral1 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral2 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral3 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral4 is a nested function and cannot be accessed from a man_or_boy2.d(6): delegate man_or_boy2.a.b.__dgliteral5 is a nested function and cannot be accessed from a
> 
> But this code works on the codepad site, that uses DMD v.1.026 (I think with Tangobos), up to N=21 (then the site goes into timeout for safety):
> http://codepad.org/8GtKswm8

[snip]

> All the following compilations are done with and N=20 and: dmd -O -release -inline -L/STACK:1700000000 man_or_boy.d dmd -O -release -inline -L/STACK:1700000000 man_or_boy2.d

Remove -inline from your compiler options, and #2 compiles and runs faster in both D1 and D2 than #1.

lazy seems to do something funny when -inline is in effect.
November 26, 2008
Wed, 26 Nov 2008 20:33:35 +0000 (UTC), Jesse Phillips wrote:

> On Wed, 26 Nov 2008 12:18:27 -0800, Robert Fraser wrote:
>> Try marking all the "lazy" parameters as "scope" ("lazy" creates
>> delegates).
> 
> From change log: "The lazy storage class now implies scope so that lazy arguments won't trigger a heap allocated closure."

Also, the whole point of this test is delegate propagation.  That's what makes it interesting.
November 26, 2008
Sergey Gromov:
> Remove -inline from your compiler options, and #2 compiles and runs
> faster in both D1 and D2 than #1.
> lazy seems to do something funny when -inline is in effect.

You are right, I have tested it on D1.
I think the codepad doesn't use -inline, that's why #2 works there.
#2 also uses less RAM on D1, for example N=24 requires about 707 MB instead of 788 MB, and about 2.8 s instead of about 3 s.

Sergey Gromov:
> Remove -inline from your compiler options, and #2 compiles and runs
> faster in both D1 and D2 than #1.
> lazy seems to do something funny when -inline is in effect.

You are right, I have tested it on D1.
I think the codepad doesn't use -inline, that's why #2 works there.

#2 also uses less RAM on D1, for example N=24 requires about 710 MB instead of 788 MB, and about 2.8 s instead of about 3 s.

This is the Asm of the #2 compiled with D1 with -O -release, it's shorter still (but note there are some other parts that I don't show here):

_D11man_or_boy21aFiLiLiLiLiLiZi	comdat
	assume	CS:_D11man_or_boy21aFiLiLiLiLiLiZi
L0:		push	EAX
		push	EBX
		cmp	dword ptr 034h[ESP],0
		jg	L33
		mov	EAX,014h[ESP]
		mov	EDX,018h[ESP]
		mov	EBX,014h[ESP]
		call	EDX
		push	EAX
		sub	ESP,4
		mov	EAX,014h[ESP]
		mov	EDX,018h[ESP]
		mov	EBX,014h[ESP]
		call	EDX
		mov	ECX,EAX
		add	ESP,4
		pop	EAX
		add	EAX,ECX
		jmp short	L3C
L33:		lea	EAX,4[ESP]
		call	near ptr _D11man_or_boy21aFiLiLiLiLiLiZi1bMFZi
L3C:		pop	EBX
		pop	ECX
		ret	02Ch
_D11man_or_boy21aFiLiLiLiLiLiZi	ends

--------------------------

I have then tested #1 and #2 without -inline on D2, and the results are very different from each other: #1 is very slow and uses lot of memory, while #2 (that contains no scope) acts as D1, using "only" 707 MB with N=24 and working with n=25 too. The asm code is similar the one I have just shown here.
So compiling #2 witout -inline in D2 fulfulls my original desire of computing up to N=25 with D2 :-)

I presume the -inline uncovers a small bug of DMD, that will be fixed. But what interests me more now is to understand how to write such fast code in general in D2.

Bye,
bearophile
November 26, 2008
Wed, 26 Nov 2008 18:19:38 -0500, bearophile wrote:

> So compiling #2 witout -inline in D2 fulfulls my original desire of computing up to N=25 with D2 :-)
> 
> I presume the -inline uncovers a small bug of DMD, that will be fixed.

I think so, too.

> But what interests me more now is to understand how to write such fast code in general in D2.

Seems like right now delegates are stack allocated only if they're passed directly as scope parameters.  Assigning a delegate to a variable is always considered escaping, even if it's a local scope variable. This is probably because scope for declarations is something completely different than scope for arguments.
November 27, 2008
27.11.08 в 01:12 Robert Jacques в своём письме писал(а):

> On Wed, 26 Nov 2008 13:24:57 -0500, Jarrett Billingsley <jarrett.billingsley@gmail.com> wrote:
>> scope int b() { .. }
>>
>> The reason I wonder is because I would expect that the compiler is
>> still allocating the delegate on the heap if you use the first syntax.
>>  (the second is also shorter and clearer.)
>
> Just as a point of reference (in D1)
> scope Object a = new Object(); // Stack Allocated
> scope Object b;
> b = new Object(); // Heap Allocated
>
> So there may be some merit to scope int b() { .. } vs scope int b(); b = {...}

If so, then why all the three usages

1) scope Object a;
2) scope Object a = b;
3) scope Object a = new Object();

are allowed when only 3rd one stack-allocates? I believe only third one should be allowed unless scope analisys is implemented (in its very basic form, at least):

scope Object a1 = new Object(); // ok
Object a2 = a1; // not ok
scope Object a3 = a1; // ok;
return a1; // not ok
return a2; // ok