Thread overview
sort API design
Nov 20, 2019
Johannes Riecken
Nov 20, 2019
mipri
Nov 20, 2019
Dennis
Nov 20, 2019
Dukc
Nov 20, 2019
Jonathan M Davis
Nov 21, 2019
Johannes Riecken
Nov 21, 2019
Jonathan M Davis
Nov 22, 2019
mipri
November 20, 2019
I had looked at how to do case-insensitive string sorting with Phobos and without looking at the documentation much, I went with the following code and then wrongly concluded that `<` does case-insensitive comparisons on strings by default:

```
import std.algorithm;
import std.stdio;
import std.string;

void main() {
    auto arr0 = ["foo", "bar", "Baz"];
    auto case_sensitive   = sort!((a, b) => a           < b          )(arr0);
    auto case_insensitive = sort!((a, b) => a.toLower() < b.toLower())(arr0);
    writeln(case_sensitive);
    writeln(case_insensitive);
}
```
Output:
```
["bar", "Baz", "foo"]
["bar", "Baz", "foo"]
```

Later I learned that I was trapped by `sort` being in-place but still returning a value. If this API were to be designed from scratch, I wonder if it would be possible to rule out this incorrect usage at compile-time while still retaining the other nice features of the `sort` function like being in-place and giving strong types which can be used to choose faster algorithms to further transform the result, etc.? Or are there any ideas in programming language research that would help with such a case?
November 20, 2019
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken wrote:
> Or are there
> any ideas in programming language research that would help with such a case?

I remember reading that Ada/SPARK did some checking that could've
helped here, but I can't find a reference now, and I'm not sure of
the terminology. It might be one of the implications of the data
flow analysis that it does.

Suppose you had this in your code:

  int x;
  x = 2;
  x = 3;
  do_something(x);

What is the point of the first assignment? Waste heat from the CPU?
Of course there's no point, and SPARK makes it an error to do that:
your program is doing some *work*, it is generating some outputs,
so it must go on to use those outputs, and not throw them away.

And in your program that led to a confusing result, you're also
performing work that is lost:

>     auto case_sensitive   = sort!((a, b) => a           < b
>    )(arr0);
>     auto case_insensitive = sort!((a, b) => a.toLower() <
> b.toLower())(arr0);

That first sort() has a return value, but a conceptual output is
the order that it provides to arr0, and the second sort() can be
statically known to destroy order in an arr0 provided to it. So:

  1. sort() discards arr0's order!0 and then provides a new order!1
  2. sort() discards arr0's order!1 and then provides a new order!2
  3. writeln reads order!2
  4. writeln reads order!2

Or, written like the first example above:

  Order o;
  o = order!1;
  o = order!2;
  do_something(o);

What is the point of the first assignment? Waste heat from the CPU?

November 20, 2019
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken wrote:
> Or are there any ideas in programming language research that would help with such a case?

Try to make it a habit to use `const` instead of `auto` if you don't expect the variable to be mutated.

```
const arr0 = ["foo", "bar", "Baz"];
```

This is obviously not a fancy beginner-friendly solution like the thing mipri describes, but it is something that you can start doing today that might help.
November 20, 2019
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken wrote:
> Later I learned that I was trapped by `sort` being in-place but still returning a value. If this API were to be designed from scratch, I wonder if it would be possible to rule out this incorrect usage at compile-time while still retaining the other nice features of the `sort` function like being in-place and giving strong types which can be used to choose faster algorithms to further transform the result, etc.?

Theoretically it could do the same thing as `std.algortihm.move` - assign the original range to it's init value, so if it's accidently reused it'd be more obvious. But of course it won't be worth doing anymore because of the breakage. Perhaps in the std v2 Wilzbach has talked about, if the authors so wish.
November 20, 2019
On Wednesday, November 20, 2019 6:15:12 AM MST Dukc via Digitalmars-d wrote:
> On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken
>
> wrote:
> > Later I learned that I was trapped by `sort` being in-place but still returning a value. If this API were to be designed from scratch, I wonder if it would be possible to rule out this incorrect usage at compile-time while still retaining the other nice features of the `sort` function like being in-place and giving strong types which can be used to choose faster algorithms to further transform the result, etc.?
>
> Theoretically it could do the same thing as `std.algortihm.move` - assign the original range to it's init value, so if it's accidently reused it'd be more obvious. But of course it won't be worth doing anymore because of the breakage. Perhaps in the std v2 Wilzbach has talked about, if the authors so wish.

The fact that sort returns a SortedRange is useful for some algorithms (since they can then statically detect that the range is sorted), but it's also useful to be able to use sort the range in-place and continue to use it with the exact same type. The current API is only a problem if you don't read the documentation.

- Jonathan M Davis



November 21, 2019
On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken wrote:
> I had looked at how to do case-insensitive string sorting with Phobos and without looking at the documentation much, I went with the following code and then wrongly concluded that `<` does case-insensitive comparisons on strings by default:
>
> ```
> import std.algorithm;
> import std.stdio;
> import std.string;
>
> void main() {
>     auto arr0 = ["foo", "bar", "Baz"];
>     auto case_sensitive   = sort!((a, b) => a           < b
>    )(arr0);
>     auto case_insensitive = sort!((a, b) => a.toLower() < b.toLower())(arr0);
>     writeln(case_sensitive);
>     writeln(case_insensitive);
> }
> ```
> Output:
> ```
> ["bar", "Baz", "foo"]
> ["bar", "Baz", "foo"]
> ```
>
> Later I learned that I was trapped by `sort` being in-place but still returning a value. If this API were to be designed from scratch, I wonder if it would be possible to rule out this incorrect usage at compile-time while still retaining the other nice features of the `sort` function like being in-place and giving strong types which can be used to choose faster algorithms to further transform the result, etc.? Or are there any ideas in programming language research that would help with such a case?

Thanks, that's some nice ideas in this thread! I should definitely get into the habit of using const wherever possible.
Jonathan, I think a major selling point of Phobos and D is that in lots of cases it can catch erroneous API usage at compile time.
November 21, 2019
On Thursday, November 21, 2019 3:19:45 PM MST Johannes Riecken via Digitalmars-d wrote:
> On Wednesday, 20 November 2019 at 09:33:25 UTC, Johannes Riecken
>
> wrote:
> > I had looked at how to do case-insensitive string sorting with Phobos and without looking at the documentation much, I went with the following code and then wrongly concluded that `<` does case-insensitive comparisons on strings by default:
> >
> > ```
> > import std.algorithm;
> > import std.stdio;
> > import std.string;
> >
> > void main() {
> >
> >     auto arr0 = ["foo", "bar", "Baz"];
> >     auto case_sensitive   = sort!((a, b) => a           < b
> >
> >    )(arr0);
> >
> >     auto case_insensitive = sort!((a, b) => a.toLower() <
> >
> > b.toLower())(arr0);
> >
> >     writeln(case_sensitive);
> >     writeln(case_insensitive);
> >
> > }
> > ```
> > Output:
> > ```
> > ["bar", "Baz", "foo"]
> > ["bar", "Baz", "foo"]
> > ```
> >
> > Later I learned that I was trapped by `sort` being in-place but still returning a value. If this API were to be designed from scratch, I wonder if it would be possible to rule out this incorrect usage at compile-time while still retaining the other nice features of the `sort` function like being in-place and giving strong types which can be used to choose faster algorithms to further transform the result, etc.? Or are there any ideas in programming language research that would help with such a case?
>
> Thanks, that's some nice ideas in this thread! I should definitely get into the habit of using const wherever possible. Jonathan, I think a major selling point of Phobos and D is that in lots of cases it can catch erroneous API usage at compile time.

There's nothing erroneous about sorting a range in two different ways (though it obviously wouldn't make sense to do so twice in row without doing anything with the range in between). And having sort return a new range that was sorted while not affecting the one that was passed in would require either sorting lazily (which really doesn't work) or allocating an entirely new range in order to sort into - neither of which makes much sense. So, I'm honestly very surprised that anyone would think that calling sort would result in a new range without affecting the one that was passed in.

Regardless, the documentation makes the behavior of sort clear, and anyone using a function needs to read its documentation if they don't want to run into problems. If you don't read documentation, you're going to make bad assumptions about how a function behaves eventually - if not frequently - and end up with bugs in your code.

It is of course nice when the language is able to catch incorrect behavior for you at compile time, but I don't see how that could be done here. Sure, you used the function incorrectly, and if sort allocated a new range to sort into and returned that, you wouldn't have been bitten. However, pretty much every sort function I have ever seen sorts in-place, so I would expect most folks to assume that sort sorted the range in-place. And if sort returned a newly allocated range, anyone making that assumption and used sort without reading the documentation would end up with buggy code and could have complaints similiar to yours. Whether sort returns a newly allocated range or sorts in-place, _someone_ is going to make the wrong assumption about how it works and get bitten without the compiler being able to help them out. And the current implementation is by far more efficient than the version that returned a newly allocated range would be. It's also more flexible, since it allows you to sort the same range multiple times, multiple ways if that makes sense for the code. Anyone who wants to get a new range can simply allocate a new array to hold the data - either via .dup if the range to be sorted is an array or by calling save on the range and passing it to std.array.array, making your code something like

    auto case_sensitive   = sort!((a, b) => a < b)(arr0.dup);
    auto case_insensitive =
        sort!((a, b) => a.toLower() < b.toLower())(arr0.dup);

So, I'm sorry that you gotten bitten by sort not working the way you expected, but I don't see how that could have been prevented without implementing the range in the way that you expected, which would then screw over anyone who assumed that the range would be sorted in-place. The compiler can't fix this for you. Ultimately, you have to read the documentation. And given the inefficiency of returning a sorted range without affecting the one that was passed in, the current implementation makes far more sense than how you assumed that sort worked.

Even if D is able to catch more problems at compile time than another language, there's still a limit to how many bugs it can catch. There's a reason that unit tests are a feature that's built into D.

- Jonathan M Davis



November 22, 2019
On Thursday, 21 November 2019 at 23:28:48 UTC, Jonathan M Davis wrote:
> So, I'm sorry that you gotten bitten by sort not working the way you expected, but I don't see how that could have been prevented without implementing the range in the way that you expected, which would then screw over anyone who assumed that the range would be sorted in-place.

This has come up several times before, here are two of them:

https://forum.dlang.org/search?q=Sorted+past+tense&search=Search

It can be fixed with careful naming conventions.

(lazy sorting make sense actually... Heap... Bucket...)

November 22, 2019
On Thursday, 21 November 2019 at 23:28:48 UTC, Jonathan M Davis wrote:
> So, I'm sorry that you gotten bitten by sort not working the
> way you expected, but I don't see how that could have been
> prevented without implementing the range in the way that you
> expected, which would then screw over anyone who assumed that
> the range would be sorted in-place.

I don't disagree, but the problem also wasn't presented as a defect
of Phobos, but as "I did wrote this code. I learned it was wrong.
How might my error have been averted?"

From some other languages,

| Language | Sort        | Mutates? | Returns? | Both? |
|----------+-------------+----------+----------+-------|
| Perl     | sort        | No       | Yes      | No    |
| Python   | list.sort   | Yes      | No       | No    |
| Python   | list.sorted | No       | Yes      | No    |
| C        | qsort       | Yes      | No       | No    |
| C++      | std::sort   | Yes      | No       | No    |
| Rust     | vec::sort   | Yes      | No       | No    |

The norm is to do one or the other, and not both.

Why does D do both? Because it has an efficiency focus (avoid
unnecessary copying) and also an expressiveness focus (go ahead and
do things with long pipelines of transformations), and this
combination is uncommon.

Actually I'm surprised Rust doesn't do both... and come to think of
it, I *was* surprised by that, when I rewrote a benchmark in it: I
tried putting the .sort_by() in the middle of a pipeline of
function calls and was confused by the error. So I had the exact
opposite "I did a thing. I learned it was wrong." problem with
sorting in Rust.

There's another reason D does both: thing.change().change().change()
is how you build pipelines in D, and each of these change()
functions needs to return the input to the next change(). But note,
this isn't a deliberate language feature, it's not taking advantage
of "pipeline syntax", it's just a consequence of how methods work
in OOP languages.

So it's more like a pattern :) One of those awful things that
highlights a failure of language design, where programmers, instead
of directly expressing their intentions with code, must laboriously
work around a void in the language. Or so I hear.

https://www.martinfowler.com/articles/collection-pipeline/
https://clojure.org/guides/threading_macros

Suppose, instead of expecting D programmers to make use of the
"collection pipelien" pattern, D just had a collection pipeline
feature? Like:

  thing |> change1() |> change2() |> change3()

If it's a language feature, D can, since it knows the types of all
these functions, apply a rule like

  if a function is void, reuse the last input for the next function
  otherwise, behave like . chaining

With this, Phobos could have a void sort that could still appear in
the middle of a long pipeline of transformations, and confusions
like the one in this thread about Phobos, and *also* the one I had
about Rust's sort, would both not be possible.

Instead you'd have the new potential confusion of thinking that
sort doesn't mutate because you saw it in a pipeline :)


To be clear, I don't recommend changing anything. I just think it's
not true that D has nothing to do with this.