Jump to page: 1 2
Thread overview
[Issue 4412] New: Array capacity growth spikey and the ratio approaches 1.0
Jul 01, 2010
Ali Cehreli
Jul 01, 2010
Ali Cehreli
Jul 02, 2010
Ali Cehreli
Jul 14, 2010
Fawzi Mohamed
Jul 14, 2010
Ali Cehreli
July 01, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412

           Summary: Array capacity growth spikey and the ratio approaches
                    1.0
           Product: D
           Version: 2.032
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: druntime
        AssignedTo: sean@invisibleduck.org
        ReportedBy: acehreli@yahoo.com


--- Comment #0 from Ali Cehreli <acehreli@yahoo.com> 2010-07-01 10:33:06 PDT ---
There are spikes in the array capacity growth. Additionally, the capacity/length ratio approaches 1.0; as a result, the complexity of the append operation would not be 'amortized constant' for large length.

import std.stdio;
import std.array;

void main()
{
    int[] a;
    size_t oldCapacity;

    foreach (i; 0 .. 100_000_000) {
        a ~= i;

        if (a.capacity != oldCapacity) {
            writeln("length: ", a.length,
                    " capacity: ", capacity(a),
                    " ratio: ", cast(double)capacity(a) / oldCapacity);
            oldCapacity = capacity(a);
        }
    }
}

Produces:

length: 1 capacity: 3 ratio: inf        please ignore this one ;)
length: 4 capacity: 7 ratio: 2.33333
length: 8 capacity: 15 ratio: 2.14286
length: 16 capacity: 31 ratio: 2.06667
length: 32 capacity: 63 ratio: 2.03226
length: 64 capacity: 127 ratio: 2.01587
length: 128 capacity: 255 ratio: 2.00787
length: 256 capacity: 509 ratio: 1.99608
length: 512 capacity: 1021 ratio: 2.00589
length: 1022 capacity: 2045 ratio: 2.00294
length: 2046 capacity: 3069 ratio: 1.50073
length: 3070 capacity: 4093 ratio: 1.33366
length: 4094 capacity: 5117 ratio: 1.25018
...
length: 1153022 capacity: 1154045 ratio: 1.00089
length: 1154046 capacity: 1155069 ratio: 1.00089
length: 1155070 capacity: 3153917 ratio: 2.7305    <-- spike
length: 3153918 capacity: 3154941 ratio: 1.00032
length: 3154942 capacity: 3155965 ratio: 1.00032
...
length: 4741118 capacity: 4742141 ratio: 1.00022
length: 4742142 capacity: 4743165 ratio: 1.00022
length: 4743166 capacity: 12333053 ratio: 2.60017  <-- spike
length: 12333054 capacity: 12334077 ratio: 1.00008
length: 12334078 capacity: 12335101 ratio: 1.00008
... (there are more spikes later on)

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 01, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Ali Cehreli <acehreli@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |acehreli@yahoo.com


--- Comment #1 from Ali Cehreli <acehreli@yahoo.com> 2010-07-01 10:34:39 PDT ---
Forgot to mention that the compiler is dmd 2.047

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 01, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Steven Schveighoffer <schveiguy@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |sean@invisibleduck.org
            Version|2.032                       |D2
         AssignedTo|sean@invisibleduck.org      |schveiguy@yahoo.com


--- Comment #2 from Steven Schveighoffer <schveiguy@yahoo.com> 2010-07-01 11:25:43 PDT ---
I'll look into it.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 02, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Ali Cehreli <acehreli@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |http://www.mail-archive.com
                   |                            |/digitalmars-d@puremagic.co
                   |                            |m/msg33113.html


--- Comment #3 from Ali Cehreli <acehreli@yahoo.com> 2010-07-02 09:16:11 PDT ---
The following is an excerpt from the newsgroup thread that's in the URL field.

> Steven Schveighoffer wrote:
>  > newCapacity isn't called when setting the length explicitly, because
>  > it's not considered to be an append.
>
> You are right.
>
> Looks like I was successful in locating the culprit. :) I took liberty to write ++a.length and also outputted the number of elements that are in reserve at each allocation:
>
> import std.stdio;
> import std.array;
>
> void main()
> {
>      int[] a;
>      size_t oldCapacity;
>
>      foreach (i; 0 .. 100_000_000) {
>          ++a.length;
>          a[$-1] = i;
>
>          if (capacity(a) != oldCapacity) {
>              writeln("length: ", a.length,
>                      " capacity: ", capacity(a),
>                      " ratio: ", cast(double)capacity(a) / oldCapacity,
>                      " reserve: ", capacity(a) - a.length);
>              oldCapacity = capacity(a);
>          }
>      }
> }
>
> Now the spikes are gone, and the allocations asymptote at once-per-1024 elements for an int array:
>
> length: 1 capacity: 3 ratio: inf reserve: 2
> length: 4 capacity: 7 ratio: 2.33333 reserve: 3
> length: 8 capacity: 15 ratio: 2.14286 reserve: 7
> length: 16 capacity: 31 ratio: 2.06667 reserve: 15
> length: 32 capacity: 63 ratio: 2.03226 reserve: 31
> length: 64 capacity: 127 ratio: 2.01587 reserve: 63
> length: 128 capacity: 255 ratio: 2.00787 reserve: 127
> length: 256 capacity: 509 ratio: 1.99608 reserve: 253
> length: 512 capacity: 1021 ratio: 2.00589 reserve: 509
> length: 1022 capacity: 2045 ratio: 2.00294 reserve: 1023
> length: 2046 capacity: 3069 ratio: 1.50073 reserve: 1023
> length: 3070 capacity: 4093 ratio: 1.33366 reserve: 1023
> ...
>
> In the long run, there is one allocation per 1024 elements. That still feels like amortized constant, but considering that relocation times would be O(N), I think reserve should still grow with N, but maybe not too much. (I am not sure about this last point at all.)
>

4x1024 = 4096 == page size.

The newCapacity function is supposed to reduce the effect of this phenomenon. i.e., when appending, it's assumed you are going to continue appending, so to reduce the overhead, the amount added to the memory block is supposed to be related to the total number of bytes already allocated.  It doesn't seem like the function is working very well...  It could have been something I did, or it could have always been broken :)  I honestly did not examine the code at all, I just only ever read the comments, assuming it works.

Please add these comments to the bug report, it will help when I look at it.  I don't have a lot of time to look at it right now, so I'm afraid I'll forget what we were doing later :)

-Steve

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 02, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


bearophile_hugs@eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs@eml.cc


--- Comment #4 from bearophile_hugs@eml.cc 2010-07-02 09:31:02 PDT ---
This is a simplified version of that code:


import std.stdio: writeln;
void main() {
    int[] array;
    size_t oldCapacity;

    foreach (i; 0 .. 20_000) {
        array ~= i;
        size_t newCapacity = array.capacity();
        if (newCapacity != oldCapacity) {
            writeln(cast(double)newCapacity / oldCapacity);
            oldCapacity = newCapacity;
        }
    }
}


It prints:

inf
2.33333
2.14286
2.06667
2.03226
2.01587
2.00787
1.99608
2.00589
2.00294
1.50073
1.33366
1.25018
1.20012
1.16675
1.14292
1.12505
1.11115
1.10003
1.09093
1.08335
1.07694
1.07144
1.06668
1.06251
1.05883
1.05556
1.05264

This output shows that there is a bug: to allow an O(1) amortized append the size of the array has to grow geometrically.

So I suggest to use a geometric growth factor of 2 until the memory block is "large enough", something like 200MB or 300MB. And after such absolute memory limit then use a smaller geometric growth factor, like 1.3, to avoid wasting too much memory.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 02, 2010
d-bugmail@puremagic.com wrote:
> http://d.puremagic.com/issues/show_bug.cgi?id=4412
> 
> 
> bearophile_hugs@eml.cc changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |bearophile_hugs@eml.cc
> 
> 
> --- Comment #4 from bearophile_hugs@eml.cc 2010-07-02 09:31:02 PDT ---
> This is a simplified version of that code:
> 
> 
> import std.stdio: writeln;
> void main() {
>     int[] array;
>     size_t oldCapacity;
> 
>     foreach (i; 0 .. 20_000) {
>         array ~= i;
>         size_t newCapacity = array.capacity();
>         if (newCapacity != oldCapacity) {
>             writeln(cast(double)newCapacity / oldCapacity);
>             oldCapacity = newCapacity;
>         }
>     }
> }
> 
> 
> It prints:
> 
> inf
> 2.33333
> 2.14286
> 2.06667
> 2.03226
> 2.01587
> 2.00787
> 1.99608
> 2.00589
> 2.00294
> 1.50073
> 1.33366
> 1.25018
> 1.20012
> 1.16675
> 1.14292
> 1.12505
> 1.11115
> 1.10003
> 1.09093
> 1.08335
> 1.07694
> 1.07144
> 1.06668
> 1.06251
> 1.05883
> 1.05556
> 1.05264
> 
> This output shows that there is a bug: to allow an O(1) amortized append the
> size of the array has to grow geometrically.

...when it is actually being move. Per my previous message - you may want to evaluate and print the ratio only when the array is effectively relocated.

Andrei
July 02, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #5 from bearophile_hugs@eml.cc 2010-07-02 09:48:24 PDT ---
This is a modified program, after a suggestion by Andrei:


import std.stdio: writeln;
void main() {
    int[] array;
    int* old_ptr;

    foreach (i; 0 .. 100_000_000) {
        array ~= i;
        int* new_ptr = array.ptr;
        if (array.ptr != old_ptr) {
            old_ptr = array.ptr;
            writeln(cast(double)array.capacity() / array.length);
        }
    }
}


It outputs:
3
1.75
1.875
1.9375
1.96875
1.98438
1.99219
1.98828
1.99414
3.00001
2.7305
2.60017
2.48002
2.37

It shows that when relocation happens the capacity() grows significantly. But that grow factor looks quite large.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 02, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412



--- Comment #6 from Steven Schveighoffer <schveiguy@yahoo.com> 2010-07-02 10:30:19 PDT ---
OK, here's the deal:

When the current block can be extended into the next page, it is.  This does not take into account any scaling factors.  That is, if you ask for one more byte, and a page can be lumped onto the end, then only 1 page is added.  This explains the behavior most of the time.

If a page cannot be added, then it uses the newCapacity function.  However, I think this function may return too high a value.

Essentially, the function looks like this:

100 + (1000L * size) / log2plus1(newcap);

Where size is the size of a single element in bytes, newcap is the size of the *total* memory being requested (old size + appended size) in bytes, and log2plus1 essentially gets 1 + highest bit set.  The result is a percentage of the total size requested to use.

Where I think the problem is using size in this equation.  I don't really understand why size has to be taken into account since newcap already takes size into account.

What happens is, let's say you are doing an array of bytes, and you are requesting 100 bytes.  size is 1, newcap is 100.  log2plus1(100) is 7.  So the resulting formula looks like:

100 + (1000L * 1) / 7 => 242, meaning 242% of the original size requested.

Now, if we use the integer, which is 4 bytes, as our array type, the formula now has 400 as newcap and size is 4.  Resulting in:

100 + (1000L * 4) / 9 => 544, meaning 544% of the original size requested.

I'm not sure why larger array element types should need more relative space, it doesn't make a lot of sense, but not much is explained in the comment on why the formula is the way it is.

I'm unsure how to change this, any ideas?

For one, I think we can use the newCapacity function always, even when appending pages (I think it will just append as many pages as it can up to the requested size).

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 14, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Fawzi Mohamed <fawzi@gmx.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fawzi@gmx.ch


--- Comment #7 from Fawzi Mohamed <fawzi@gmx.ch> 2010-07-14 05:29:31 PDT ---
When this issue was discovered in tango I invested some time thinking about it.
What I came out with is
  http://github.com/fawzi/blip/blob/master/blip/util/Grow.d
a version of which I had also used in tango.

Looking again at it I think that I decided to grow based on memory usage, and give "extra" space to arrays with large elements just using rounding up. Particularly with 32-bit one should be aware that some calculation in extreme cases could overflow.

I release the growing code with whatever license needed.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
July 14, 2010
http://d.puremagic.com/issues/show_bug.cgi?id=4412


Steven Schveighoffer <schveiguy@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


--- Comment #8 from Steven Schveighoffer <schveiguy@yahoo.com> 2010-07-14 13:36:01 PDT ---
Fixed in changeset http://www.dsource.org/projects/druntime/changeset/332

This should grow at a more normal rate (logarigthmic in relation to the size of the array)  There will be short growths, that's when just appending free pages can satisfy the append vs. having to commit more pages from the OS.

More tweaking may get this to be a better curve, but I'm pretty happy with it now.

Your test program now looks like this:

length: 1 capacity: 3 ratio: inf
length: 4 capacity: 7 ratio: 2.33333
length: 8 capacity: 15 ratio: 2.14286
length: 16 capacity: 31 ratio: 2.06667
length: 32 capacity: 63 ratio: 2.03226
length: 64 capacity: 127 ratio: 2.01587
length: 128 capacity: 255 ratio: 2.00787
length: 256 capacity: 511 ratio: 2.00392
length: 512 capacity: 1019 ratio: 1.99413
length: 1020 capacity: 2043 ratio: 2.00491
length: 2044 capacity: 4091 ratio: 2.00245
length: 4092 capacity: 7163 ratio: 1.75092
length: 7164 capacity: 8187 ratio: 1.14296
length: 8188 capacity: 9211 ratio: 1.12508
length: 9212 capacity: 15355 ratio: 1.66703
length: 15356 capacity: 24571 ratio: 1.6002
length: 24572 capacity: 25595 ratio: 1.04168
length: 25596 capacity: 40955 ratio: 1.60012
length: 40956 capacity: 41979 ratio: 1.025
length: 41980 capacity: 65531 ratio: 1.56104
length: 65532 capacity: 73723 ratio: 1.12501
length: 73724 capacity: 74747 ratio: 1.01389
length: 74748 capacity: 113659 ratio: 1.52058
length: 113660 capacity: 122875 ratio: 1.08108
length: 122876 capacity: 123899 ratio: 1.00833
length: 123900 capacity: 188411 ratio: 1.52068
length: 188412 capacity: 282619 ratio: 1.50001
length: 282620 capacity: 294907 ratio: 1.04348
length: 294908 capacity: 295931 ratio: 1.00347
length: 295932 capacity: 435195 ratio: 1.4706
length: 435196 capacity: 442363 ratio: 1.01647
length: 442364 capacity: 651259 ratio: 1.47223
length: 651260 capacity: 655355 ratio: 1.00629
length: 655356 capacity: 950267 ratio: 1.45
length: 950268 capacity: 951291 ratio: 1.00108
length: 951292 capacity: 1380347 ratio: 1.45102
length: 1380348 capacity: 1392635 ratio: 1.0089
...

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
« First   ‹ Prev
1 2