Jump to page: 1 2
Thread overview
[Issue 2313] New: Poor array ~= append performance
Aug 25, 2008
d-bugmail
Aug 26, 2008
d-bugmail
Aug 26, 2008
d-bugmail
Aug 26, 2008
d-bugmail
Aug 27, 2008
d-bugmail
Aug 27, 2008
Don
Aug 27, 2008
d-bugmail
Aug 27, 2008
d-bugmail
Aug 27, 2008
d-bugmail
Aug 27, 2008
d-bugmail
Aug 27, 2008
Don
Aug 27, 2008
Lionello Lunesu
August 25, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313

           Summary: Poor array ~= append performance
           Product: D
           Version: unspecified
          Platform: PC
               URL: http://www.digitalmars.com/webnews/newsgroups.php?art_gr
                    oup=digitalmars.D&article_id=75410
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla@digitalmars.com
        ReportedBy: lio+bugzilla@lunesu.com


See original thread.

Appending an element to an array is very slow. There are three reasons:

1) __d_arrayappendcT is used for all types of arrays, resulting in less than
optimal performance when sizeof(item) > 1;

2) std.gc.capacity calls sizeOfNoSync which in turn calls findPool. Complexity
of this call is O(m+n), n = number of pools, m = size of block;

3) sizeOfNoSync records the last result in a (global) cache to improve the
performance of the case "for() ar~=item;" When appending to two arrays, this
cache is useless, resulting in the O(m+n) code path described above.

A possible solution to 1) might be to create custom append routines for each array type (similar to the custom routines for the array operations, comparison, hashing, etc.) This way, an array of int[] can simply add an int. Or, the __d_arrayappendcT code should check the size of the item and invoke different code (possibly using mmx/sse when applicable.)

2) might be solved by using the fact that pooltable is always sorted; this
would bring the complexity down to O(m + log n). Ideally the size for each
allocation is recorded, either in a separate array (per pool) or right before
the allocation itself. This would result in a complexity of O(log n) resp.
O(1), minimizing the impact of the cache miss as mentioned in 3).


-- 

August 26, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #1 from lio+bugzilla@lunesu.com  2008-08-26 09:41 -------
Some stats. Using bearophile's test program from the original post in newsgroup, with n = 100_000_000:

dmd v2.018 -O -inline -release

Default Phobos: 10,72 seconds
Commented gc.d line 915: 4,26 seconds
Replaced line 915 with memcpy: 5,63 seconds

Line 915 is:
#    x.ptr[length * sizeelem .. newsize] = argp[0 .. sizeelem];
where both x and argp are byte[]

Why is byte[] = byte[] slower than memcpy? Perhaps that array assignment should also be part of the run-time library, perhaps just using memcpy?


-- 

August 26, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #2 from lio+bugzilla@lunesu.com  2008-08-26 10:58 -------
For the record, when changing the loop to..

#int count = 0;
#for(int i; i < n; i++) a[count++] = i;

..it takes 0,43 seconds. (Same flags, n as before.)
Adding std.gc.capacity(a.ptr) to the loop: 2,73 seconds.


-- 

August 26, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #3 from 2korden@gmail.com  2008-08-26 11:08 -------
Original thread is here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=75410


-- 

August 27, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #4 from bugzilla@digitalmars.com  2008-08-27 01:05 -------
Why is byte[] = byte[] slower than memcpy? The answer isn't very simple. Consider this program:

import std.c.string;

long timer()
{
    asm
    {   naked                   ;
        rdtsc                   ;
        ret                     ;
    }
}

void test1(byte[] a, byte[] b)
{
    a[] = b[];
}

void test2(byte[] a, byte[] b)
{
    memcpy(a.ptr, b.ptr, a.length);
}

void main()
{
    for (int i = 4; i < 100_000_000; i *= 2)
    {
        auto a = new byte[i];
        auto b = new byte[i];

        auto start = timer();
        test1(a, b);
        auto end = timer();
        auto r1 = end - start;

        start = timer();
        test2(a, b);
        end = timer();
        auto r2 = end - start;

        printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
    }
}

Running this program produces:

i:        4,    []=[]:      144,        memcpy:      568
i:        8,    []=[]:      144,        memcpy:      300
i:       16,    []=[]:      172,        memcpy:      324
i:       32,    []=[]:      204,        memcpy:      344
i:       64,    []=[]:      288,        memcpy:      276
i:      128,    []=[]:      288,        memcpy:      272
i:      256,    []=[]:      352,        memcpy:      364
i:      512,    []=[]:      372,        memcpy:      424
i:     1024,    []=[]:      552,        memcpy:      564
i:     2048,    []=[]:      684,        memcpy:     1384
i:     4096,    []=[]:     1344,        memcpy:     1772
i:     8192,    []=[]:     2900,        memcpy:     3216
i:    16384,    []=[]:     5292,        memcpy:     5472
i:    32768,    []=[]:    11496,        memcpy:    10388
i:    65536,    []=[]:    29484,        memcpy:    27480
i:   131072,    []=[]:   110464,        memcpy:    67784
i:   262144,    []=[]:   655580,        memcpy:   562400
i:   524288,    []=[]:  1204124,        memcpy:  1107256
i:  1048576,    []=[]:  2364588,        memcpy:  2272552
i:  2097152,    []=[]:  4516440,        memcpy:  4417764
i:  4194304,    []=[]:  8996992,        memcpy:  8817176
i:  8388608,    []=[]: 20223908,        memcpy: 17717748
i: 16777216,    []=[]: 35774952,        memcpy: 36094652
i: 33554432,    []=[]: 71008068,        memcpy: 71246896
i: 67108864,    []=[]: 142982284,       memcpy: 145473300

There's not much of a consistent conclusion to be drawn from that.


-- 

August 27, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #5 from lio+bugzilla@lunesu.com  2008-08-27 01:15 -------
You're right. I'll double check my own results tonight.


-- 

August 27, 2008
d-bugmail@puremagic.com wrote:
> http://d.puremagic.com/issues/show_bug.cgi?id=2313
> 
> 
> 
> 
> 
> ------- Comment #4 from bugzilla@digitalmars.com  2008-08-27 01:05 -------
> Why is byte[] = byte[] slower than memcpy? The answer isn't very simple.
> Consider this program:
> 
> import std.c.string;
> 
> long timer()
> {
>     asm
>     {   naked                   ;
>         rdtsc                   ;
>         ret                     ;
>     }
> }
> 
> void test1(byte[] a, byte[] b)
> {
>     a[] = b[];
> }
> 
> void test2(byte[] a, byte[] b)
> {
>     memcpy(a.ptr, b.ptr, a.length);
> }
> 
> void main()
> {
>     for (int i = 4; i < 100_000_000; i *= 2)
>     {
>         auto a = new byte[i];
>         auto b = new byte[i];
> 
>         auto start = timer();
>         test1(a, b);
>         auto end = timer();
>         auto r1 = end - start;
> 
>         start = timer();
>         test2(a, b);
>         end = timer();
>         auto r2 = end - start;
> 
>         printf("i: %8d,\t[]=[]: %8lld,\tmemcpy: %8lld\n", i, r1, r2);
>     }
> }
> 
> Running this program produces:
> 
> i:        4,    []=[]:      144,        memcpy:      568
> i:        8,    []=[]:      144,        memcpy:      300
> i:       16,    []=[]:      172,        memcpy:      324
> i:       32,    []=[]:      204,        memcpy:      344
> i:       64,    []=[]:      288,        memcpy:      276
> i:      128,    []=[]:      288,        memcpy:      272
> i:      256,    []=[]:      352,        memcpy:      364
> i:      512,    []=[]:      372,        memcpy:      424
> i:     1024,    []=[]:      552,        memcpy:      564
> i:     2048,    []=[]:      684,        memcpy:     1384
> i:     4096,    []=[]:     1344,        memcpy:     1772
> i:     8192,    []=[]:     2900,        memcpy:     3216
> i:    16384,    []=[]:     5292,        memcpy:     5472
> i:    32768,    []=[]:    11496,        memcpy:    10388
> i:    65536,    []=[]:    29484,        memcpy:    27480
> i:   131072,    []=[]:   110464,        memcpy:    67784
> i:   262144,    []=[]:   655580,        memcpy:   562400
> i:   524288,    []=[]:  1204124,        memcpy:  1107256
> i:  1048576,    []=[]:  2364588,        memcpy:  2272552
> i:  2097152,    []=[]:  4516440,        memcpy:  4417764
> i:  4194304,    []=[]:  8996992,        memcpy:  8817176
> i:  8388608,    []=[]: 20223908,        memcpy: 17717748
> i: 16777216,    []=[]: 35774952,        memcpy: 36094652
> i: 33554432,    []=[]: 71008068,        memcpy: 71246896
> i: 67108864,    []=[]: 142982284,       memcpy: 145473300
> 
> There's not much of a consistent conclusion to be drawn from that.
Except, we can conclude that
(1) Walter's machine has a 64Kb L1 data cache. The penalty for a cache miss is 1.5 clocks. It's probably an AMD CPU. Judging by the timing, it looks like a K8 (Hammer) <g>
(2) neither a[] = b[], nor memcpy(), attempt to optimise for cache misses. Both look like rep movsd; to me.

BTW,
(3) rtdsc doesn't serialise, so the counts for low numbers are pretty much garbage. You need to stick a mov EAX, 0; cpuid; in there.
(4) cache effects are giving memcpy a big advantage. If you swap the order of test1 and test2, you'll probably find the order reverses.

There's potential to do something about (2). Not easy though.
August 27, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #6 from lio+bugzilla@lunesu.com  2008-08-27 06:37 -------
I've checked my results, and memcpy still beats []=[] by a landslide. Here are the results:

Gold (using 'prior knowledge'):
#    (cast(int*)x.ptr)[length] = *cast(int*)argp;
4193ms.

Silver:
#    memcpy(x.ptr + length * sizeelem, argp, sizeelem);
5450ms.

DNF:
#    x.ptr[length * sizeelem .. newsize] = argp[0 .. sizeelem];
10270ms.

I'll attach the .asm files.


-- 

August 27, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #7 from lio+bugzilla@lunesu.com  2008-08-27 06:39 -------
Created an attachment (id=272)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=272&action=view)
assembly for _d_arrayappendcT using memcpy


-- 

August 27, 2008
http://d.puremagic.com/issues/show_bug.cgi?id=2313





------- Comment #8 from lio+bugzilla@lunesu.com  2008-08-27 06:39 -------
Created an attachment (id=273)
 --> (http://d.puremagic.com/issues/attachment.cgi?id=273&action=view)
assembly for _d_arrayappendcT using the original byte[] copy


-- 

« First   ‹ Prev
1 2