September 10, 2008
Walter Bright wrote:
> Walter Bright wrote:
>> superdan wrote:
>>> u could've stopped readin'. general comments without experience are
>>> oxpoop. my mailman could give'em.
>>
>> The thing about iterators and collections is that they look so simple, but getting the right design is fiendishly difficult.
> 
> And when a correct design is devised, the mark of genius is everyone will think it in retrospect to be simple and obvious!

Also: everyone will measure the quality of that design using a different yardstick.

For my money, the best collection design I've ever worked with is the C5 library for .NET, developed at the IT University of Copenhagen:

http://www.itu.dk/research/c5/
http://www.ddj.com/windows/199902700

--benji
September 10, 2008
Benji Smith wrote:
> bearophile wrote:
>> 5) The source code of the current algorithm module of D2 is already very complex to follow, it smells of over-generalization here and there. Sometimes it's better to reduce the generality of things, even if that reduces their power a little, to reduce complexity, etc. Tango code too isn't perfect, but it often looks more human. While you have created the algorithm module I too have created something similar, but based on different grounds.
> 
> Along these same lines, while D is still young, the documentation is often thin, and code examples are scarce.
> 
> I've been doing a lot of programming lately with Tango, and despite the growing documentation, I've had to refer directly to the Tango source code on more occasions than I can count. Luckily, the Tango sources are well-written and pretty easy to read and understand, and I've had very few problems figuring out how to use the library.
> 
> I hope the range implementation makes readability a high priority.

Speaking of examples and readability, and to tie this in with the discussion on array reallocation, I was curious about an array appender built on the output range interface. I've seen a quite complicated array builder in digitalmars.d. I wanted a simpler appender that should do no worse than the built-in ~= and that works with algorithm2 whenever data is written out.

It turned out quite simple and it improved performance of a large-scale data preprocessing task of mine (which involved reading and parsing about 1M lines of integers) by 15%. I'd be curious how it fares with other tests that you guys may have.

The idea is very simple: just use D's native append operation, but cache the capacity to avoid too many lookups (I understand that that's the bottleneck).

I've pasted the code below; I'd be indebted if you guys grabbed it and tested it.

Andrei

struct ArrayAppender(T)
{
    private T* _buffer;
    private size_t _length;
    private size_t _capacity;

    this(T[] array)
    {
        _buffer = array.ptr;
        _length = array.length;
        if (_buffer) _capacity = .capacity(array.ptr) / T.sizeof;
    }

    size_t capacity() const { return _capacity; }

    void putNext(T item)
    {
        invariant desiredLength = _length + 1;
        if (desiredLength <= _capacity)
        {
            // Should do in-place construction here
            _buffer[_length] = item;
        }
        else
        {
            // Time to reallocate, do it and cache capacity
            auto tmp = _buffer[0 .. _length];
            tmp ~= item;
            _buffer = tmp.ptr;
            _capacity = .capacity(_buffer) / T.sizeof;
        }
        _length = desiredLength;
    }

    T[] release()
    {
        // Shrink-to-fit
        auto result = (cast(T*) realloc(_buffer, _length * T.sizeof))[0 .. _length];
        // Relinquish state
        _buffer = null;
        _length = _capacity = 0;
        return result;
    }
}

// Convenience wrapper used by the unittest below to deduce T
ArrayAppender!(T) arrayAppender(T)(T[] array)
{
    return ArrayAppender!(T)(array);
}

unittest
{
    auto app = arrayAppender(new char[0]);
    string b = "abcdefg";
    foreach (char c; b) app.putNext(c);
    assert(app.release == "abcdefg");
}
September 10, 2008
Andrei Alexandrescu:

> Speaking of examples and readability, and to tie this with the discussion on array reallocation, I was curious on an array appender built on the output range interface. I've seen a quite complicated array builder in digitalmars.d. I wanted a simpler appender that should not do worse than the built-in ~= and that works with algorithm2 whenever data is written out.

That builder was probably mine, designed for D1. For the latest version, take a look at the 'builder.d' module in the libs I have shown you. It's not overly complex: its API is simple, and its internals are only as complex as they need to be to be efficient. (I'm still slowly improving it; I'm now trying to make it more flexible, making its extending functionality work with other kinds of iterables too.)


> I'd be curious how it fares with other tests that you guys may have.

That module has about 380 lines of benchmark code (after a few hundred lines of unit tests); I think you can add a few lines to them to benchmark your implementation, but I presume mine is more efficient :-) I may do a few benchmarks later...

Bye,
bearophile
September 10, 2008
A few benchmarks, appending ints. Note this is a worst-case situation (other benchmarks are less dramatic): just many appends, followed by the "release":

benchmark 10, N=10_000_000:
  Array append:  0.813 s
  ArrayBuilder:  0.159 s
  ArrayAppender: 1.056 s

benchmark 10, N=100_000_000:
  Array append:  10.887 s
  ArrayBuilder:   1.477 s
  ArrayAppender: 13.305 s

-----------------

Chunk of the code I have used:

struct ArrayAppender(T)
{
     private T* _buffer;
     private size_t _length;
     private size_t _capacity;

     void build(T[] array)
     {
         _buffer = array.ptr;
         _length = array.length;
         //if (_buffer) _capacity = .capacity(array.ptr) / T.sizeof;
     }

     size_t capacity() /* const */ { return _capacity; }

     void putNext(T item)
     {
         /* invariant */ int desiredLength = _length + 1;
         if (desiredLength <= _capacity)
         {
             // Should do in-place construction here
             _buffer[_length] = item;
         }
         else
         {
             // Time to reallocate, do it and cache capacity
             auto tmp = _buffer[0 .. _length];
             tmp ~= item;
             _buffer = tmp.ptr;
             _capacity = this.capacity() / T.sizeof;
         }
         _length = desiredLength;
     }

     T[] release()
     {
         // Shrink-to-fit
         //auto result = cast(T[]) realloc(_buffer, _length * T.sizeof);
         auto result = cast(T[]) gcrealloc(_buffer, _length * T.sizeof);
         // Relinquish state
         _buffer = null;
         _length = _capacity = 0;
         return result;
     }
}

unittest
{
     ArrayAppender!(char) app;
     app.build(new char[0]);
     string b = "abcdefg";
     foreach (char c; b) app.putNext(c);
     assert(app.release == "abcdefg");
}


    void benchmark10(int ntest, int N) {
        putr("\nbenchmark 10, N=", thousands(N), ":");

        if (ntest == 0) {
            auto t0 = clock();
            int[] a1;
            for (int i; i < N; i++)
                a1 ~= i;
            auto last1 = a1[$ - 1];
            auto t2 = clock() - t0;
            putr("  Array append: %.3f", t2, " s   ", last1);
        } else if (ntest == 1) {
            auto t0 = clock();
            ArrayBuilder!(int) a2;
            for (int i; i < N; i++)
                a2 ~= i;
            auto last2 = a2.toarray[$ - 1];
            auto t2 = clock() - t0;
            putr("  ArrayBuilder: %.3f", t2, " s   ", last2);
        } else {
            auto t0 = clock();
            ArrayAppender!(int) a3;
            a3.build(null);
            for (int i; i < N; i++)
                a3.putNext(i);
            auto last3 = a3.release()[$ - 1];
            auto t2 = clock() - t0;
            putr("  ArrayAppender: %.3f", t2, " s   ", last3);
        }
    }

Bye,
bearophile
September 10, 2008
bearophile wrote:
> Few benchmarks, appending ints, note this is a worst-case situation (other benchmarks are less dramatic). Just many appends, followed by the "release":
> 
> benchmark 10, N=10_000_000:
>   Array append:  0.813 s
>   ArrayBuilder:  0.159 s
>   ArrayAppender: 1.056 s
> 
> benchmark 10, N=100_000_000:
>   Array append:  10.887 s
>   ArrayBuilder:   1.477 s
>   ArrayAppender: 13.305 s

That's odd. The array appender should never, by definition, do significantly worse than a straight array append. I think some other factor intervened (e.g. swapping). Also, don't forget to compile with -O -release -inline and to test several times after a warmup run. I adapted your code, obtaining the numbers below:

benchmark 10, N=10000000:
  Array append: 0.69 s
  ArrayAppender: 0.19 s

benchmark 10, N=25000000:
  Array append: 2.06 s
  ArrayAppender: 0.82 s

benchmark 10, N=50000000:
  Array append: 4.28 s
  ArrayAppender: 1.75 s

benchmark 10, N=75000000:
  Array append: 9.62 s
  ArrayAppender: 5.8 s

benchmark 10, N=100000000:
  Array append: 11.35 s
  ArrayAppender: 6.20 s


Andrei
September 10, 2008
Andrei Alexandrescu:
> That's odd. The array appender should never, by definition, do significantly worse than the straight array append.

But the reality of your code and benchmarks may differ from the abstract definition :-)


> I think some other factor intervened (e.g. swapping). Also don't forget to compile with -O -release -inline and to test several times after a warmup run.

I have kept an eye on such things too. Note that benchmarks are generally tricky.

I am using DMD v1.035, on a Core Duo 2 GHz, 2 GB RAM, on Windows; the code doesn't cause HD swapping, and the timings are warm. My timings are repeatable to within 0.02-0.03 seconds on my PC.
If you want, I can give you the whole testing code, of course. But it's just the builders.d module of my libs plus the added testing code I have shown you.

Anyway, in the end it doesn't matter much, I think.

Bye,
bearophile
September 10, 2008
bearophile wrote:
> Andrei Alexandrescu:
>> That's odd. The array appender should never, by definition, do significantly worse than the straight array append.
> 
> But the reality of your code and benchmarks may differ from the
> abstract definition :-)

But it's not abstract, and besides, my benchmarks do support my hypothesis. On the common path my code does the minimum amount of work that any append would do: test, assign at index, bump. On the less common path (invoked an amortized constant number of times) my code does an actual built-in append plus a call to capacity to cache the result. Assuming the built-in array does a screaming-fast append, my code should be just about as fast, probably insignificantly slower because of the extra straggler operations.
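
The amortized claim is easy to check with a small counting sketch (the factor-of-2 growth below is an assumption chosen for illustration, not the runtime's documented policy):

```d
import std.stdio;

void main()
{
    // Count how often the slow path (reallocation) would run over n appends,
    // assuming capacity grows geometrically by a factor of 2.
    size_t n = 10_000_000;
    size_t cap = 0, slowPath = 0;
    foreach (i; 0 .. n)
    {
        if (i >= cap)              // item would not fit: slow path
        {
            cap = cap ? cap * 2 : 1;
            ++slowPath;
        }
        // otherwise: fast path -- test, assign at index, bump
    }
    // 25 reallocations for 10 million appends
    writefln("%s appends, %s slow-path hits", n, slowPath);
}
```

With geometric growth the slow path runs only O(log n) times, so its cost per append vanishes as n grows.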

>> I think some other factor intervened (e.g. swapping). Also don't
>> forget to compile with -O -release -inline and to test several
>> times after a warmup run.
> 
> I have kept an eye on such things too. Note that benchmarks are
> generally tricky.
> 
> I am using DMD v1.035, on a Core Duo 2 GHz, 2 GB RAM, on Win, the
> code doesn't make HD swap, and timings are warm. My timings are
> repeatable within 0.02-0.03 seconds on my PC. If you want I can give
> you the whole testing code, of course. But it's just the builders.d
> module of my libs plus the added testing code I have shown you.

But I copied your test code from your post. Then I adapted it to Phobos (replaced putr with writefln, clock with ctime, and such, which shouldn't matter) and ran it. For the built-in array append we get comparable numbers, so there must be something that skews your numbers for ArrayAppender. I'm saying yours and not mine because mine are in line with expectations and yours aren't.


Andrei

September 10, 2008
Andrei Alexandrescu Wrote:

> bearophile wrote:
> > Few benchmarks, appending ints, note this is a worst-case situation (other benchmarks are less dramatic). Just many appends, followed by the "release":
> > 
> > benchmark 10, N=10_000_000:
> >   Array append:  0.813 s
> >   ArrayBuilder:  0.159 s
> >   ArrayAppender: 1.056 s
> > 
> > benchmark 10, N=100_000_000:
> >   Array append:  10.887 s
> >   ArrayBuilder:   1.477 s
> >   ArrayAppender: 13.305 s
> 
> That's odd. The array appender should never, by definition, do significantly worse than the straight array append. I think some other factor intervened (e.g. swapping). Also don't forget to compile with -O -release -inline and to test several times after a warmup run. I adapted your code obtaining the numbers below:
> 
> benchmark 10, N=10000000:
>    Array append: 0.69 s
>    ArrayAppender: 0.19 s
> 
> benchmark 10, N=25000000:
>    Array append: 2.06 s
>    ArrayAppender: 0.82 s
> 
> benchmark 10, N=50000000:
>    Array append: 4.28 s
>    ArrayAppender: 1.75 s
> 
> benchmark 10, N=75000000:
>    Array append: 9.62 s
>    ArrayAppender: 5.8 s
> 
> benchmark 10, N=100000000:
>    Array append: 11.35 s
>    ArrayAppender: 6.20 s
> 
> 
> Andrei

arrayappender is simple as dumb. compare and contrast with the ginormous arraybuilder. yet works great. why? because it is on the right thing. the common case. on the uncommon case it does whatever to get the job done. don't matter if it's rare.

not sure anybody saw the irony. it was bearophile advocating simplicity eh. allegedly andre's the complexity guy. code seems to tell nother story.

btw doods i figured indexed access is slower than access via pointer. so i recoded arrayappender to only use pointers. there's some half a second savings for the large case. small arrays don't feel it.

struct ArrayAppender(T)
{
    private T* _begin;
    private T* _end;
    private T* _eos;

    this(T[] array)
    {
        _begin = array.ptr;
        _end = _begin + array.length;
        if (_begin) _eos = _begin + .capacity(_begin) / T.sizeof;
    }

    size_t capacity() const { return _eos - _begin; }

    void putNext(T item)
    {
        if (_end < _eos)
        {
            *_end++ = item;
        }
        else
        {
            auto tmp = _begin[0 .. _end - _begin];
            tmp ~= item;
            _begin = tmp.ptr;
            _end = _begin + tmp.length;
            _eos = _begin + .capacity(_begin) / T.sizeof;
        }
    }

    T[] releaseArray()
    {
        auto result = _begin[0 .. _end - _begin];
        _begin = _end = _eos = null;
        return result;
    }
}

September 10, 2008
superdan wrote:
> Andrei Alexandrescu Wrote:
> 
>> bearophile wrote:
>>> Few benchmarks, appending ints, note this is a worst-case situation (other benchmarks are less dramatic). Just many appends, followed by the "release":
>>>
>>> benchmark 10, N=10_000_000:
>>>   Array append:  0.813 s
>>>   ArrayBuilder:  0.159 s
>>>   ArrayAppender: 1.056 s
>>>
>>> benchmark 10, N=100_000_000:
>>>   Array append:  10.887 s
>>>   ArrayBuilder:   1.477 s
>>>   ArrayAppender: 13.305 s
>> That's odd. The array appender should never, by definition, do significantly worse than the straight array append. I think some other factor intervened (e.g. swapping). Also don't forget to compile with -O -release -inline and to test several times after a warmup run. I adapted your code obtaining the numbers below:
>>
>> benchmark 10, N=10000000:
>>    Array append: 0.69 s
>>    ArrayAppender: 0.19 s
>>
>> benchmark 10, N=25000000:
>>    Array append: 2.06 s
>>    ArrayAppender: 0.82 s
>>
>> benchmark 10, N=50000000:
>>    Array append: 4.28 s
>>    ArrayAppender: 1.75 s
>>
>> benchmark 10, N=75000000:
>>    Array append: 9.62 s
>>    ArrayAppender: 5.8 s
>>
>> benchmark 10, N=100000000:
>>    Array append: 11.35 s
>>    ArrayAppender: 6.20 s
>>
>>
>> Andrei
> 
> arrayappender is simple as dumb. compare and contrast with the ginormous arraybuilder. yet works great. why? because it is on the right thing. the common case. on the uncommon case it does whatever to get the job done. don't matter if it's rare.
> 
> not sure anybody saw the irony. it was bearophile advocating simplicity eh. allegedly andre's the complexity guy. code seems to tell nother story.
> 
> btw doods i figured indexed access is slower than access via pointer. so i recoded arrayappender to only use pointers. there's some half a second savings for the large case. small arrays don't feel it.
> 
> struct ArrayAppender(T)
> {
>     private T* _begin;
>     private T* _end;
>     private T* _eos;
> 
>     this(T[] array)
>     {
>         _begin = array.ptr;
>         _end = _begin + array.length;
>         if (_begin) _eos = _begin + .capacity(_begin) / T.sizeof;
>     }
> 
>     size_t capacity() const { return _eos - _begin; }
> 
>     void putNext(T item)
>     {
>         if (_end < _eos)
>         {
>             *_end++ = item;
>         }
>         else
>         {
>             auto tmp = _begin[0 .. _end - _begin];
>             tmp ~= item;
>             _begin = tmp.ptr;
>             _end = _begin + tmp.length;
>             _eos = _begin + .capacity(_begin) / T.sizeof;
>         }
>     }
> 
>     T[] releaseArray()
>     {
>         auto result = _begin[0 .. _end - _begin];
>         _begin = _end = _eos = null;
>         return result;
>     }
> }

Thanks! Can't hurt. Guess I'll integrate your code if you don't mind.

I gave it some more thought and I have a theory about the root of the issue. My implementation assumes there is an exponential (multiplicative) increase of capacity in the built-in ~=. I hope Walter wouldn't do anything else. If there are differences in growth strategies between D1 and D2, that could explain the difference between bearophile's benchmarks and mine.
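
For what it's worth, the dependence on the growth strategy is easy to quantify with a sketch; the constants below (factor 2, 1024-element chunks) are illustrative assumptions, not the actual policies of either runtime:

```d
import std.stdio;

void main()
{
    // Total elements copied across all reallocations for n appends, under
    // geometric growth (O(n) total copying) vs. fixed-increment growth
    // (O(n^2) total copying).
    size_t n = 1_000_000;

    size_t cap = 1, copied = 0;
    foreach (len; 0 .. n)
        if (len == cap) { copied += len; cap *= 2; }
    writefln("geometric (x2):     %s elements copied", copied);

    cap = 1024;
    copied = 0;
    foreach (len; 0 .. n)
        if (len == cap) { copied += len; cap += 1024; }
    writefln("arithmetic (+1024): %s elements copied", copied);
}
```

If ~= grew capacity by a fixed increment, the appender would inherit that quadratic copying cost, which would square with benchmarks that degrade sharply at large N.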


Andrei
September 10, 2008
Andrei Alexandrescu:
> I hope Walter wouldn't do anything else. If there are differences in growth strategies between D1 and D2, that could explain the difference between bearophile's benchmarks and mine.

You can't compare benchmarks from two different compilers.

(Your code doesn't work (on D1) if T is a static array, and ~= looks like a better interface for appending. Also, your code doesn't append whole arrays, so you have to call it many times if you want to append strings; in D1 that's a very common case.)
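
A sketch of the ~= interface suggested here, with overloads for both single items and whole arrays. This hypothetical SimpleAppender wraps a plain built-in array (no capacity caching) just to keep the example short:

```d
import std.stdio;

// Hypothetical appender exposing ~= via opCatAssign overloads, so both
// single elements and whole arrays (the common string case) can be appended.
struct SimpleAppender(T)
{
    private T[] _data;

    void opCatAssign(T item)    { _data ~= item; }
    void opCatAssign(T[] items) { _data ~= items; }

    T[] release()
    {
        auto result = _data;
        _data = null;
        return result;
    }
}

void main()
{
    SimpleAppender!(char) app;
    app ~= 'a';
    app ~= "bcdefg".dup;   // one call appends the whole string
    assert(app.release() == "abcdefg");
    writefln("ok");
}
```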

Bye,
bearophile