March 26, 2014
On Wednesday, 26 March 2014 at 04:47:48 UTC, Jay Norwood wrote:
> This is a first attempt at using parallel, but no improvement

oops.  scratch that one. I tested a pointer to the wrong function.
March 26, 2014
This corrects the parallel example range in the second foreach.  Still slow.

void printDiamonde2cpa(in uint N)
{
    size_t N2 = N/2;
    char p[] = uninitializedArray!(char[])(N2+N);
    p[0..N2] = ' ';
    p[N2..$] = '*';
    char nl[] = uninitializedArray!(char[])(1);
    nl[] = '\n';

    char[][] wc = minimallyInitializedArray!(char[][])(N);

    auto w = appender!(char[])();

    foreach(n, ref elem; taskPool.parallel(wc[0..N2+1],100)){
        elem = p[n .. N2+2*n+1];
    }

    foreach (rn, ref elem ; taskPool.parallel(wc[N2+1..N],100)){
        int n = N2 - rn - 1;
        elem = p[n .. N2+2*n+1];
    }
    auto wj = join(wc,nl);
    w.put(wj);

    writeln(w.data);
}
March 28, 2014
> void main(){
> import std.stdio,std.range,std.algorithm,std.conv;
> auto m=10.iota.map!(_=>readln.split.to!(int[]));
> m.map!sum.chain(m.transposed.map!sum).reduce!max.write;
> }

It used to work, but with the latest changes I think I have broken it.

Bye,
bearophile
April 21, 2014
On Tuesday, 25 March 2014 at 08:42:30 UTC, monarch_dodra wrote:
>
> Interesting. I'd have thought the "extra copy" would be an overall slowdown, but I guess that's not the case.
>

I installed ubuntu 14.04 64 bit, and measured some of these examples using gdc, ldc and dmd on a corei3 box.  The examples that wouldn't build had something to do with use of array.replicate and range.replicate conflicting in the libraries for gdc and ldc builds, which were based on 2.064.2.


This is the ldc2 (0.13.0 alpha)(2.064.2) result:
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./main 1>/dev/null
brad: time: 2107[ms]
sergei: time: 2441[ms]
jay2: time: 26[ms]
diamondShape: time: 679[ms]
printDiamond: time: 19[ms]
printDiamonde2a: time: 9[ms]
printDiamonde2b: time: 8[ms]
printDiamond3: time: 14[ms]

This is the gdc(2.064.2) result:
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./a.out 1>/dev/null
brad: time: 3216[ms]
sergei: time: 2828[ms]
jay2: time: 26[ms]
diamondShape: time: 776[ms]
printDiamond: time: 19[ms]
printDiamonde2a: time: 13[ms]
printDiamonde2b: time: 13[ms]
printDiamond3: time: 51[ms]

This is the dmd(2.065) result:
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./main 1>/dev/null
brad: time: 10830[ms]
sergei: time: 3480[ms]
jay2: time: 29[ms]
diamondShape: time: 2462[ms]
printDiamond: time: 23[ms]
printDiamonde2a: time: 13[ms]
printDiamonde2b: time: 10[ms]
printDiamond3: time: 23[ms]


So this printDiamonde2b example had the fastest time of the solutions, and had similar times on all three builds. The ldc2 compiler build is performing best in most examples on ubuntu.

void printDiamonde2b(in uint N)
{
    uint N2 = N/2;
    char pSpace[] = uninitializedArray!(char[])(N2);
    pSpace[] = ' ';

    char pStars[] = uninitializedArray!(char[])(N+1);
    pStars[] = '*';

    pStars[$-1] = '\n';

    auto w = appender!(char[])();
    w.reserve(N*3);

    foreach (n ; 0 .. N2 + 1){
        w.put(pSpace[0 .. N2 - n]);
	w.put(pStars[$-2*n-2 .. $]);
	}

    foreach_reverse (n ; 0 .. N2){
        w.put(pSpace[0 .. N2 - n]);
	w.put(pStars[$-2*n-2 .. $]);
	}

    write(w.data);
}




April 21, 2014
On Monday, 21 April 2014 at 00:11:14 UTC, Jay Norwood wrote:
> So this printDiamonde2b example had the fastest time of the solutions, and had similar times on all three builds. The ldc2 compiler build is performing best in most examples on ubuntu.
>
> void printDiamonde2b(in uint N)
> {
>     uint N2 = N/2;
>     char pSpace[] = uninitializedArray!(char[])(N2);
>     pSpace[] = ' ';
>
>     char pStars[] = uninitializedArray!(char[])(N+1);
>     pStars[] = '*';
>
>     pStars[$-1] = '\n';
>
>     auto w = appender!(char[])();
>     w.reserve(N*3);
>
>     foreach (n ; 0 .. N2 + 1){
>         w.put(pSpace[0 .. N2 - n]);
> 	w.put(pStars[$-2*n-2 .. $]);
> 	}
>
>     foreach_reverse (n ; 0 .. N2){
>         w.put(pSpace[0 .. N2 - n]);
> 	w.put(pStars[$-2*n-2 .. $]);
> 	}
>
>     write(w.data);
> }

With this slightly tweaked solution, I can get times of roughly 50% to 100% faster, on my dmd-linux box:

//----
void printDiamonde2monarch(in uint N)
{
    uint N2 = N/2;

    char[] pBuf = uninitializedArray!(char[])(N + N2);
    pBuf[ 0 .. N2] = ' ';
    pBuf[N2 ..  $] = '*';

    auto slice = uninitializedArray!(char[])(3*N2*N2 + 4*N);

    size_t i;
    foreach (n ; 0 .. N2 + 1){
        auto w = 1 + N2 + n;
        slice[i .. i + w] = pBuf[n .. w + n];
        slice[(i+=w)++]='\n';
    }

    foreach_reverse (n ; 0 .. N2){
        auto w = 1 + N2 + n;
        slice[i .. i + w] = pBuf[n .. w + n];
        slice[(i+=w)++]='\n';
    }

    write(slice[0 .. i]);
}
//----

The two "key" points here, first, is to avoid using appender. Second, instead of having two buffer: "    " and "******\n", and two do two "slice copies", to only have 1 buffer "    *****", and to do 1 slice copy, and a single '\n' write. At this point, I'm not sure how we could be going any faster, short of using alloca...

How does this hold up on your environment?
April 22, 2014
On Monday, 21 April 2014 at 08:26:49 UTC, monarch_dodra wrote:
> The two "key" points here, first, is to avoid using appender. Second, instead of having two buffer: "    " and "******\n", and two do two "slice copies", to only have 1 buffer "    *****", and to do 1 slice copy, and a single '\n' write. At this point, I'm not sure how we could be going any faster, short of using alloca...
>
> How does this hold up on your environment?

Yes your solution is the fastest yet.  Also, its times are similar for all three compilers.   The range of execution times varied for different solutions from over 108 seconds down to 64 msec.

I see that RefAppender's data() returns the managed array.  Can write() handle that?  It seems that would be more efficient than duplicating the  character buffer ... or perhaps writing directly to an OutBuffer, and then sending that to write() would avoid the duplication?

jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ gdc -O2 main.d
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./a.out 1>/dev/null
brad: time: 31865[ms]
sergei: time: 28596[ms]
jay2: time: 258[ms]
diamondShape: time: 7512[ms]
printDiamond: time: 200[ms]
printDiamonde2a: time: 140[ms]
printDiamonde2b: time: 137[ms]
printDiamond3: time: 503[ms]
printDiamonde2monarch: time: 86[ms]
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ dmd -release main.d
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./main 1>/dev/null
brad: time: 108111[ms]
sergei: time: 33949[ms]
jay2: time: 282[ms]
diamondShape: time: 24567[ms]
printDiamond: time: 230[ms]
printDiamonde2a: time: 132[ms]
printDiamonde2b: time: 106[ms]
printDiamond3: time: 222[ms]
printDiamonde2monarch: time: 66[ms]
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ~/ldc/bin/ldc2 -O2 main.d
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./main 1>/dev/null
brad: time: 20996[ms]
sergei: time: 24841[ms]
jay2: time: 259[ms]
diamondShape: time: 6797[ms]
printDiamond: time: 194[ms]
printDiamonde2a: time: 91[ms]
printDiamonde2b: time: 87[ms]
printDiamond3: time: 145[ms]
printDiamonde2monarch: time: 64[ms]
jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$

April 22, 2014
Wow,  joiner is much slower than join.  Such a small choice can make this big of a difference.  Not at all expected, since the lazy calls, I thought, were considered to be more efficient.  This is with ldc2 -O2.

jay@jay-ubuntu:~/ec_ddt/workspace/diamond/source$ ./main 1>/dev/null
brad: time: 21958[ms]
sergei: time: 24629[ms]
jay2: time: 259[ms]
diamondShape: time: 6701[ms]
printDiamond: time: 194[ms]
printDiamonde2a: time: 95[ms]
printDiamonde2b: time: 92[ms]
printDiamond3: time: 144[ms]
printDiamonde2monarch: time: 67[ms]
printDiamonde2cJoin: time: 96[ms]
printDiamonde2cJoiner: time: 16115[ms]


void printDiamonde2cJoin(in uint N)
{
	int n,l;
    size_t N2 = N/2;
    size_t NM1 = N-1;
    char p[] = uninitializedArray!(char[])(N2+N);
    p[0..N2] = ' ';
    p[N2..$] = '*';
    char nl[] = uninitializedArray!(char[])(1);
    nl[] = '\n';

    char wc[][] = minimallyInitializedArray!(char[][])(N);

    for(n=0,l=0; n<N2; n++,l+=2){
    	wc[n] = wc[NM1-n] = p[n .. N2+l+1];
    }

    wc[N2] = p[N2..$];
    auto wj = join(wc,nl);
    write(wj);
    write('\n');
}

void printDiamonde2cJoiner(in uint N)
{
	int n,l;
    size_t N2 = N/2;
    size_t NM1 = N-1;
    char p[] = uninitializedArray!(char[])(N2+N);
    p[0..N2] = ' ';
    p[N2..$] = '*';

    char wc[][] = minimallyInitializedArray!(char[][])(N);

    for(n=0,l=0; n<N2; n++,l+=2){
    	wc[n] = wc[NM1-n] = p[n .. N2+l+1];
    }

    wc[N2] = p[N2..$];
    write(joiner(wc,"\n"));
    write('\n');
}


April 22, 2014
On Tuesday, 22 April 2014 at 05:05:30 UTC, Jay Norwood wrote:
> On Monday, 21 April 2014 at 08:26:49 UTC, monarch_dodra wrote:
>> The two "key" points here, first, is to avoid using appender. Second, instead of having two buffer: "    " and "******\n", and two do two "slice copies", to only have 1 buffer "    *****", and to do 1 slice copy, and a single '\n' write. At this point, I'm not sure how we could be going any faster, short of using alloca...
>>
>> How does this hold up on your environment?
>
> Yes your solution is the fastest yet.  Also, its times are similar for all three compilers.   The range of execution times varied for different solutions from over 108 seconds down to 64 msec.
>
> I see that RefAppender's data() returns the managed array.  Can write() handle that?  It seems that would be more efficient than duplicating the  character buffer ...

I'm not sure what you mean? "data" returns the managed array, but no duplication ever actually happens. It's allocated on the GC. the only thing that is copied is the slice itself.

> or perhaps writing directly to an OutBuffer, and then sending that to write() would avoid the duplication?

appender *is* the outbuffer :)
April 22, 2014
On Tuesday, 22 April 2014 at 11:41:41 UTC, Jay Norwood wrote:
> Wow,  joiner is much slower than join.  Such a small choice can make this big of a difference.  Not at all expected, since the lazy calls, I thought, were considered to be more efficient.  This is with ldc2 -O2.

Yeah, that's because join actually works on "RoR, R", rather than "R, E". This means if you feed it a "string[], string", then it will actually iterate over individual *characters*. Not only that, but since you are using char[], it will decode them too.

"join" is faster for 2 reasons:
1) It detects you want to joins arrays, so it doesn't have to iterate over them: It just glues them "slice at once"
2) No UTF decoding.

I kind of wish we had a faster joiner, but I think it would have made the call ambiguous.
April 23, 2014
On Tuesday, 22 April 2014 at 15:25:04 UTC, monarch_dodra wrote:
> Yeah, that's because join actually works on "RoR, R", rather than "R, E". This means if you feed it a "string[], string", then it will actually iterate over individual *characters*. Not only that, but since you are using char[], it will decode them too.
>
> "join" is faster for 2 reasons:
> 1) It detects you want to joins arrays, so it doesn't have to iterate over them: It just glues them "slice at once"
> 2) No UTF decoding.
>
> I kind of wish we had a faster joiner, but I think it would have made the call ambiguous.

Ok, thanks.  I re-tried joiner with both parameters being ranges, but there was no improvement in execution speed.  I thought perhaps from your comments that it might work.

    char nl[] = uninitializedArray!(char[])(1);
    nl[] = '\n';

    write(joiner(wc,nl));
1 2 3 4 5
Next ›   Last »