September 19, 2011
On 09/19/2011 05:52 PM, Steven Schveighoffer wrote:
> On Mon, 19 Sep 2011 11:03:15 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 09/19/2011 04:43 PM, Steven Schveighoffer wrote:
>>> On Mon, 19 Sep 2011 10:24:33 -0400, Timon Gehr <timon.gehr@gmx.ch>
>>> wrote:
>>>
>>>> On 09/19/2011 04:02 PM, Steven Schveighoffer wrote:
>>>>>
>>>>> So I think it's not only limiting to require x.length to be $, it's
>>>>> very
>>>>> wrong in some cases.
>>>>>
>>>>> Also, think of a string. It has no length (well technically, it does,
>>>>> but it's not the number of elements), but it has a distinct end
>>>>> point. A
>>>>> properly written string type would fail to compile if $ was s.length.
>>>>>
>>>>
>>>> But you'd have to compute the length anyways in the general case:
>>>>
>>>> str[0..$/2];
>>>>
>>>> Or am I misunderstanding something?
>>>>
>>>
>>> That's half the string in code units, not code points.
>>>
>>> If string was properly implemented, this would fail to compile. $ is not
>>> the length of the string range (meaning the number of code points). The
>>> given slice operation might actually create an invalid string.
>>
>> Programmers have to be aware of that if they want efficient code that
>> deals with unicode. I think having random access to the code units and
>> being able to iterate per code point is fine, because it gives you the
>> best of both worlds. Manually decoding a string and slicing it at
>> positions that were remembered to be safe has been good enough for me,
>> at least it is efficient.
>
> I find the same. I don't think I've ever dealt with arbitrary math
> operations to do slices of strings like the above. I only slice a string
> when I know the bounds are sane.
>
> Like I said, it's a compromise. The "right" thing to do is probably not
> even allow code-unit access via index (some have even argued that
> code-point slicing is too dangerous, because you can split a grapheme,
> leaving a valid, but incorrect slice of the original).
>
>>> It's tricky, because you want fast slicing, but only certain slices are
>>> valid. I once created a string type that used a char[] as its backing,
>>> but actually implemented the limitations that std.range tries to enforce
>>> (but cannot). It's somewhat of a compromise. If $ was mapped to
>>> s.length, it would fail to compile, but I'm not sure what I *would* use
>>> for $. It actually might be the code units, which would not make the
>>> above line invalid.
>>>
>>> -Steve
>>
>> Well it would have to be consistent for a string type that "does it
>> right" . Either the string is indexed with units or it is indexed with
>> code points, and the other option should be provided. Dollar should
>> just be the length of what is used for indexing/slicing here, and
>> having that be different from length makes for a somewhat awkward
>> interface imho.
>
> Except we are defining a string as a *range* and a range's length is
> defined as the number of elements.
>
> Note that hasLength!string evaluates to false in std.range.'

Ok. I feel the way narrow strings are handled in Phobos are a reasonable trade-off.

>
> $ should denote the end point of the aggregate, but it does not have to
> be equivalent to length, or even an integer/uint. It should just mean
> "end".

Point taken. What is the solution for infinite ranges? Should any arithmetics on $ just be disallowed?

>
> I also proposed a while back to have ^ denote the beginning (similar to
> regex) of an aggregate for aggregates that don't use 0 as the beginning,
> but people didn't like it :)
>
> -Steve

=D, well, it is grammatically unambiguous!
September 19, 2011
On 9/19/11 10:46 AM, Robert Jacques wrote:
> So, on balance, I'd say the two pointers representation is categorically
> worse than the fat pointer representation.

Benchmark. A few of your assumptions don't hold.

Andrei
September 19, 2011
On 9/19/11 11:12 AM, Andrei Alexandrescu wrote:
> On 9/19/11 10:46 AM, Robert Jacques wrote:
>> So, on balance, I'd say the two pointers representation is categorically
>> worse than the fat pointer representation.
>
> Benchmark. A few of your assumptions don't hold.
>
> Andrei

s/don't/may not/

:o)

Andrei
September 19, 2011
On Mon, 19 Sep 2011 12:00:46 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 09/19/2011 05:52 PM, Steven Schveighoffer wrote:

>>
>> $ should denote the end point of the aggregate, but it does not have to
>> be equivalent to length, or even an integer/uint. It should just mean
>> "end".
>
> Point taken. What is the solution for infinite ranges? Should any arithmetics on $ just be disallowed?

I think $ should not pose any type restrictions, and whatever type you return can have whatever semantics you want.  The only semantic restriction is it should mean "the end of the container".

In other words, generic code that expects to be able to do:

x[0..$/2]

should only expect this to work when $ is a numeric value, or defines division by numbers.

Technically, half of infinity is still infinity, so the type of $ could be defined so that any arithmetic operations on it result in itself ;)

-Steve
September 20, 2011
On Mon, 19 Sep 2011 12:00:17 -0400, Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On Mon, 19 Sep 2011 11:46:44 -0400, Robert Jacques <sandford@jhu.edu>
> wrote:
>
>> On Mon, 19 Sep 2011 10:08:32 -0400, Andrei Alexandrescu
>> <SeeWebsiteForEmail@erdani.org> wrote:
>>> On 9/19/11 6:25 AM, Steven Schveighoffer wrote:
>>>> On Sun, 18 Sep 2011 15:34:16 -0400, Timon Gehr <timon.gehr@gmx.ch>
>>>> wrote:
>>>>
>>>>> On 09/18/2011 08:28 PM, Andrei Alexandrescu wrote:
>>>>>> That would allow us to e.g. switch from the
>>>>>> pointer+length representation to the arguably better pointer+pointer
>>>>>> representation with ease.
>>>>>
>>>>> In what way is that representation better?
>>>>
>>>> I agree, I don't see why the representation is inherently better. Some
>>>> operations become higher performance (i.e. popFront), and some become
>>>> worse (i.e. length). Most of the others are a wash.
>>>
>>> That's where frequency of use comes into play. I'm thinking popFront
>>> would be used most often, and it touches two words.
>>>
>>> Andrei
>>>
>>
>> The elephant in the room, of course, is that length now requires a
>> division and that popFront is actually implemented using slicing:
>>
>> a = a[i .. $];
>>
>> which translates into:
>>
>> auto begin = i;
>> auto end   = length;
>> if(end - begin >= 0  && length - end >= 0) {
>>      ptr     = ptr + T.sizeof * begin;
>>      length  = end - begin;
>> }
>>
>> vs
>>
>> auto length = (ptrBack - ptrFront) / T.sizeof;
>> auto begin  = ptrFront + T.sizeof * i;
>> auto end    = ptrFront + T.sizeof * length;
>> if(end - begin >= 0  && ptrBack - end >= 0) {
>>      ptrFront = begin;
>>      ptrBack  = end;
>> }
>
> I would hope something like this would be optimized by the compiler:
>
> auto begin = ptrFront + T.sizeof * i;
> if(ptrBack - begin >= 0)
>     ptrFront = begin;
>
> If not, popFront could optimize it.
>
> Certainly, to say popFront is going to perform *worse* using a
> dual-pointer representation is false.  Only one calculation is needed.
>
> -Steve
>

Unfortunately, the compiler isn't going to auto-magically optimize the length computation away. (Remember that end is not necessarily equal to ptrBack, for an arbitrary set of ptrFront and ptrBack; it is the high level invariants associated with arrays which make it true) The compiler could optimize the slice operator, but it has to do so at a very high level; i.e. recognizing and special casing statements like a[1..$]. (And such optimizations are notoriously brittle) And yes, popFront could optimize the slicing operator. But having to do so is a pretty good indication that something's wrong with the array design.

Also, I didn't say that popFront, by itself, is going to perform *worse*. I said that popFront + empty require 1 extra subtraction and 1 less memory write. On today's out of order desktop processors that probably amounts to a wash. (And the benchmarks support this, see my other post). And given that the change makes computations like empty, length and slicing much worse, I stated that dual-pointers would, on the whole, be worse than ptr+length.
September 20, 2011
On Mon, 19 Sep 2011 12:12:04 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 9/19/11 10:46 AM, Robert Jacques wrote:
>> So, on balance, I'd say the two pointers representation is categorically
>> worse than the fat pointer representation.
>
> Benchmark. A few of your assumptions don't hold.
>
> Andrei
>

So I ran Timon Gehr's benchmark on my system 10 times, and although the Old method performed the best, the times showed a scheduling jitter of about 0.2 s which is an order of magnitude larger than the differences between the runs. I've had some experience with removing jitter from benchmarks, so I redid it. For those who don't want to read through all the results and code snippets below, here's the executive summary: there's no measurable difference between the optimum popFront + empty implementations as laid out by Timon Gehr and myself. However, when you consider the current slice based implementation of popFront (i.e. a[1..$]), things change drastically. For an int[], ptr+ptr was 3x slower and for an int[3][] it was 8x slower.


Timon Gehr's benchmark:

Old     New     Diff
4.64242 4.58722 0.0551916
4.60659 4.5482  0.058386
4.53478 4.51753 0.0172519
4.54561 4.51867 0.026935
4.46662 4.4733  -0.00668139
4.47204 4.46893 0.00310762
4.52798 4.54021 -0.0122326
4.5685  4.57667 -0.00816801
4.50138 4.49186 0.00951325
4.49764 4.49422 0.00342912

--------------------------

import std.datetime, std.stdio, std.algorithm;

int[1_000_000] a;

void f1(){
    for(auto ptr=a.ptr, len=a.length+1; len; ++ptr, --len) {}
}
void f2(){
    for(auto b=a.ptr, e=a.ptr+a.length; b<e; b++){}
}

void main(){
    writeln("Always remember to run benchmarks on a single core. Pause");
    readln();
    double min_f1 = int.max;
    double min_f2 = int.max;
    foreach(i;0..10_000) {
        auto b=benchmark!(f1,f2)(3);
        min_f1 = min(min_f1, b[0].to!("seconds",double));
        min_f2 = min(min_f2, b[1].to!("seconds",double));
    }
    writeln("Old\tNew\tDiff");
    writeln(min_f1,"\t",min_f2,"\t",(min_f2-min_f1)/(min_f1)*100.0,"%");
}

which gives

Old     New     Diff
0.00127244      0.00127244      0%

On my system. Which is pretty much as expected. Now onto the more critical test.

import std.datetime, std.stdio, std.algorithm, std.range, std.array;

alias int[3] T;
T[1_000_000] a;

void f1(){
    auto a2 = a[];
    while(!a2.empty) { a2.popFront(); }
}
void f2(){
    auto ptrFront = a.ptr;
    auto ptrBack  = a.ptr + a.length;
    auto i = 1;
    while(!(ptrFront >= ptrBack) ) {
        auto length = (ptrBack - ptrFront) / T.sizeof;
        auto begin  = ptrFront + T.sizeof * i;
        auto end    = ptrFront + T.sizeof * length;
        if(end - begin >= 0  && ptrBack - end >= 0) {
            ptrFront = begin;
            ptrBack  = end;
        }
    }
}

void main(){
    writeln("Always remember to run benchmarks on a single core. Pause");
    readln;
    double min_f1 = int.max;
    double min_f2 = int.max;
    foreach(i;0..10_000) {
        auto b=benchmark!(f1,f2)(3);
        min_f1 = min(min_f1, b[0].to!("seconds",double));
        min_f2 = min(min_f2, b[1].to!("seconds",double));
    }
    writeln("Old\tNew\tDiff");
    writeln(min_f1,"\t",min_f2,"\t",(min_f2-min_f1)/(min_f1)*100.0,"%");
}

Old     New     Diff
0.00869062      0.0118318       36.145%

Which makes sense given the division. Switching T to an int gives

Old     New     Diff
0.00868968      0.00502586      -42.1629%

Ah.. Perhaps I should also inline f1:

void f1(){
//    auto a2 = a[];
//    while(!a2.empty) { a2.popFront(); }

    auto i = 1;
    auto length = a.length;
    auto ptr    = a.ptr;
    while(!(length == 0)) {
        auto end   = length;
        auto begin = i;
        if(end - begin >= 0  && length - end >= 0) {
            ptr     = ptr + T.sizeof * begin;
            length  = end - begin;
        }
    }
}

which results in:

Old     New     Diff
0.00127244      0.00502912      295.233%

And Switching T back to int[3] gives:

Old     New     Diff
0.00127244      0.01192 836.78%



September 20, 2011
On Tue, 20 Sep 2011 02:18:12 -0400, Robert Jacques <sandford@jhu.edu> wrote:

> On Mon, 19 Sep 2011 12:12:04 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 9/19/11 10:46 AM, Robert Jacques wrote:
>>> So, on balance, I'd say the two pointers representation is categorically
>>> worse than the fat pointer representation.
>>
>> Benchmark. A few of your assumptions don't hold.
>>
>> Andrei
>>
>
> So I ran Timon Gehr's benchmark on my system 10 times...

P.S. All results are compiled using -O -release -inline and run on a single core of a Core 2 Duo T7500 2.2 GHz
September 20, 2011
Am 19.09.2011, 19:08 Uhr, schrieb Steven Schveighoffer <schveiguy@yahoo.com>:

> On Mon, 19 Sep 2011 12:00:46 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
>
>> On 09/19/2011 05:52 PM, Steven Schveighoffer wrote:
>
>>>
>>> $ should denote the end point of the aggregate, but it does not have to
>>> be equivalent to length, or even an integer/uint. It should just mean
>>> "end".
>>
>> Point taken. What is the solution for infinite ranges? Should any arithmetics on $ just be disallowed?
>
> I think $ should not pose any type restrictions, and whatever type you return can have whatever semantics you want.  The only semantic restriction is it should mean "the end of the container".
>
> In other words, generic code that expects to be able to do:
>
> x[0..$/2]
>
> should only expect this to work when $ is a numeric value, or defines division by numbers.
>
> Technically, half of infinity is still infinity, so the type of $ could be defined so that any arithmetic operations on it result in itself ;)
>
> -Steve

The documentation of such generic code would make it obvious, that it cannot work with infinite ranges.
sort() on a list of all primes for example is obviously not working.
Btw, I don't mind [^..$]. I think it is clear for everyone who ever wrote a regex and easy for anyone else.
For natrually 0 based ranges, you'd still use 0. ^ would just be obfuscating there.
September 20, 2011
On 09/20/2011 08:18 AM, Robert Jacques wrote:
> On Mon, 19 Sep 2011 12:12:04 -0400, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>
>> On 9/19/11 10:46 AM, Robert Jacques wrote:
>>> So, on balance, I'd say the two pointers representation is categorically
>>> worse than the fat pointer representation.
>>
>> Benchmark. A few of your assumptions don't hold.
>>
>> Andrei
>>
>
> So I ran Timon Gehr's benchmark on my system 10 times, and although the
> Old method performed the best, the times showed a scheduling jitter of
> about 0.2 s which is an order of magnitude larger than the differences
> between the runs. I've had some experience with removing jitter from
> benchmarks, so I redid it. For those who don't want to read through all
> the results and code snippets below, here's the executive summary:
> there's no measurable difference between the optimum popFront + empty
> implementations as laid out by Timon Gehr and myself. However, when you
> consider the current slice based implementation of popFront (i.e.
> a[1..$]), things change drastically. For an int[], ptr+ptr was 3x slower
> and for an int[3][] it was 8x slower.
>
>
> Timon Gehr's benchmark:
>
> Old New Diff
> 4.64242 4.58722 0.0551916
> 4.60659 4.5482 0.058386
> 4.53478 4.51753 0.0172519
> 4.54561 4.51867 0.026935
> 4.46662 4.4733 -0.00668139
> 4.47204 4.46893 0.00310762
> 4.52798 4.54021 -0.0122326
> 4.5685 4.57667 -0.00816801
> 4.50138 4.49186 0.00951325
> 4.49764 4.49422 0.00342912
>
> --------------------------
>
> import std.datetime, std.stdio, std.algorithm;
>
> int[1_000_000] a;
>
> void f1(){
> for(auto ptr=a.ptr, len=a.length+1; len; ++ptr, --len) {}
> }
> void f2(){
> for(auto b=a.ptr, e=a.ptr+a.length; b<e; b++){}
> }
>
> void main(){
> writeln("Always remember to run benchmarks on a single core. Pause");
> readln();
> double min_f1 = int.max;
> double min_f2 = int.max;
> foreach(i;0..10_000) {
> auto b=benchmark!(f1,f2)(3);
> min_f1 = min(min_f1, b[0].to!("seconds",double));
> min_f2 = min(min_f2, b[1].to!("seconds",double));
> }
> writeln("Old\tNew\tDiff");
> writeln(min_f1,"\t",min_f2,"\t",(min_f2-min_f1)/(min_f1)*100.0,"%");
> }
>
> which gives
>
> Old New Diff
> 0.00127244 0.00127244 0%
>
> On my system. Which is pretty much as expected. Now onto the more
> critical test.
>
> import std.datetime, std.stdio, std.algorithm, std.range, std.array;
>
> alias int[3] T;
> T[1_000_000] a;
>
> void f1(){
> auto a2 = a[];
> while(!a2.empty) { a2.popFront(); }
> }
> void f2(){
> auto ptrFront = a.ptr;
> auto ptrBack = a.ptr + a.length;
> auto i = 1;
> while(!(ptrFront >= ptrBack) ) {
> auto length = (ptrBack - ptrFront) / T.sizeof;
> auto begin = ptrFront + T.sizeof * i;
> auto end = ptrFront + T.sizeof * length;
> if(end - begin >= 0 && ptrBack - end >= 0) {
> ptrFront = begin;
> ptrBack = end;
> }
> }
> }
>
> void main(){
> writeln("Always remember to run benchmarks on a single core. Pause");
> readln;
> double min_f1 = int.max;
> double min_f2 = int.max;
> foreach(i;0..10_000) {
> auto b=benchmark!(f1,f2)(3);
> min_f1 = min(min_f1, b[0].to!("seconds",double));
> min_f2 = min(min_f2, b[1].to!("seconds",double));
> }
> writeln("Old\tNew\tDiff");
> writeln(min_f1,"\t",min_f2,"\t",(min_f2-min_f1)/(min_f1)*100.0,"%");
> }
>
> Old New Diff
> 0.00869062 0.0118318 36.145%
>
> Which makes sense given the division. Switching T to an int gives
>
> Old New Diff
> 0.00868968 0.00502586 -42.1629%
>
> Ah.. Perhaps I should also inline f1:
>
> void f1(){
> // auto a2 = a[];
> // while(!a2.empty) { a2.popFront(); }
>
> auto i = 1;
> auto length = a.length;
> auto ptr = a.ptr;
> while(!(length == 0)) {
> auto end = length;
> auto begin = i;
> if(end - begin >= 0 && length - end >= 0) {
> ptr = ptr + T.sizeof * begin;
> length = end - begin;
> }
> }
> }
>
> which results in:
>
> Old New Diff
> 0.00127244 0.00502912 295.233%
>
> And Switching T back to int[3] gives:
>
> Old New Diff
> 0.00127244 0.01192 836.78%
>
>
>

Thank you for making this more meaningful! I assumed the standard library benchmark function would take care of those things. Should it?




September 21, 2011
On Tue, 20 Sep 2011 14:01:05 -0400, Timon Gehr <timon.gehr@gmx.ch> wrote:
[snip]

> Thank you for making this more meaningful! I assumed the standard
> library benchmark function would take care of those things. Should it?

Yes and no. Benchmark provides a good way to make a single measurement of a function, as for really short functions you do have to loop many times to be able to get a reliable reading. However, actual benchmarking requires a) tuning the benchmark() call time to about 10-20 ms and b) running benchmark() many times, taking the minimum. The idea is that on any given run you could hit a context switch, etc. so if you make multiple run, one will get lucky and not be interrupted. Worse, if a core switch happens between StopWatch start and end, the number of ticks returned is random. Hence, the comment to manually limit execution to a single core. So, it might be nice if benchmark() also took a second parameter, denoting the number of times to benchmark the function and had some better documentation on how to do good benchmarking.