August 13, 2019
Am 13.08.19 um 11:48 schrieb Mirjam Akkersdijk:
> Hello there,
> If I had a DLL, how would I sort it judging by the node contents, the D
> way?
> 
> In C if I were to sort a piece of malloc'd memory pointing to node pointers, I would write my compare function and let qsort sort it out. In D, I tried to use std.algorithm's sort functionality to no avail, because my guess would be it only operates on arrays. This opens up another can of worms as I am not fond of VLAs and D desperately wants to know what the size is at compile time.
> 
> Let's say we this trivial implementation:
> 
> Node* get_node(DLL* list, uint index);
> 
> struct Node {
>   int x;
>   Node* prev, next;
> };
> 
> struct DLL {
>   Node* head;
>   uint size;
> };
> 
> Node** nodes = cast(Node**) malloc(DLL.size * (Node*).sizeof);
> 
> and I would like to sort based on Node.t, how should I tackle it, preferably without resorting to C libraries?

I'd do something along the lines of this:

```
import std;

struct Node(T)
{
private:
    typeof(this)* prev, next;
public:
    T data;
}

struct DoublyLinkedList(T)
{
private:
    Node!T* head = null;
    Node!T* tail = null;

public:
    void pushFront(T t)
    {
        auto n = new Node!T;
        if (head != null)
        {
            head.prev = n;
        }
        else
        {
            tail = n;
        }
        n.data = t;
        n.next = head;
        head = n;
    }

    T front() const @property
    {
        return head.data;
    }

    void popFront()
    {
        head = head.next;
        if (head != null)
        {
            head.prev = null;
        }
        else
        {
            tail = null;
        }
    }

    void pushBack(T t)
    {
        auto n = new Node!T;
        if (tail != null)
        {
            tail.next = n;
        }
        else
        {
            head = n;
        }
        n.data = t;
        n.prev = tail;
        tail = n;
    }

    T back() const @property
    {
        return tail.data;
    }

    void popBack()
    {
        tail = tail.prev;
        if (tail != null)
        {
            tail.next = null;
        }
        else
        {
            head = null;
        }
    }

    size_t empty() const @property
    {
        return head == null;
    }

    auto nodes() @property
    {
        static struct NodePointerRange
        {
        private:
            Node!T* head;

        public:
            Node!T* front() @property
            {
                return head;
            }

            void popFront()
            {
                head = head.next;
            }

            bool empty() const @property
            {
                return head == null;
            }
        }

        return NodePointerRange(head);
    }
}

auto doublyLinkedList(R)(R r) if (isInputRange!R)
{
    DoublyLinkedList!(ElementType!R) list;

    foreach (e; r)
    {
        list.pushBack(e);
    }
    return list;
}

auto doublyLinkedListFromNodePointerRange(R)(R r)
        if (isInputRange!R && isPointer!(ElementType!R)
            && isInstanceOf!(Node, PointerTarget!(ElementType!R)))
{
    DoublyLinkedList!(TemplateArgsOf!(PointerTarget!(ElementType!R))) list;
    foreach (n; r)
    {
        n.prev = list.tail;
        if (list.tail != null)
        {
            list.tail.next = n;
        }
        else
        {
            list.head = n;
        }
        list.tail = n;
    }
    list.tail.next = null;
    return list;
}

void main()
{
    auto list = doublyLinkedList([5, 3, 7, 24, 1]);
    // looks natural but allocates new nodes
    auto sortedList = list.array.sort.doublyLinkedList;
    sortedList.each!writeln;

    writeln;

    auto anotherList = doublyLinkedList([54, 23, 8]);
    // looks a bit uglier but reuses the already allocated nodes
    auto anotherSortedList = anotherList.nodes
        .array
        .sort!((n1, n2) => n1.data < n2.data)
        .doublyLinkedListFromNodePointerRange;
    anotherSortedList.each!writeln;
}
```

Modulo bugs of course ;)

This uses the GC. If you want to avoid it in favor of malloc (or
std.experimental.allocator), you need to add `free`s (and possibly
referece counting) accordingly.


August 13, 2019
On Tue, Aug 13, 2019 at 12:12:28PM -0600, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Tuesday, August 13, 2019 11:33:10 AM MDT Mirjam Akkersdijk via Digitalmars-d-learn wrote:
[...]
> > Thank you, this is what I eventually chose to do. It's also fairly fast, though doesn't the nature of the dynamic array slow it down a bit? Since I already know the size of the array (but just not at compile time), can't I define an array of fixed size, which is dependent on the input of the function?
> 
> In D, static arrays must know their length at compile time. Dynamic arrays can have their length determined at runtime. I don't know why a dynamic array would slow anything down here though - especially if you just allocate it with the correct length.
[...]

When it comes down to fine points concerning performance like here, my default response is: are you sure this is an *actual* performance hit? Did you run a profiler to measure the actual runtimes?

Static arrays have the advantage that the size is known at compile-time, so the compiler can generate better code (by hard-coding the length, optimizing the memory layout, etc.).  But when you pass static arrays around, you have to either slice them, in which case they become identical to dynamic arrays anyway, or pass them on the stack, which in some cases can actually slow things down (you have to copy a potentially large amount of data to/from the stack when passing arguments to a function). Passing a dynamic array (or a slice of a static array) is very fast: it's just a pointer + size pair.

Dynamic arrays could potentially incur performance hits if you're appending to them (may need a GC allocation).  Static arrays don't have this problem -- but then you can't append to a static array at all. :-P

Anyway, splitting hairs over whether to use static arrays or dynamic arrays sounds like premature optimization to me.  Before worrying about potential performance problems with using dynamic arrays, always use a profiler to see if the bottleneck is actually in that part of the code. Otherwise it's just a waste of time and energy to "optimize" the code when it isn't even anywhere near the real bottleneck(s) in the program.

Profile, profile, profile. Until then, don't sweat it over optimization.


T

-- 
Mediocrity has been pushed to extremes.
August 13, 2019
On 08/13/2019 10:33 AM, Mirjam Akkersdijk wrote:
> On Tuesday, 13 August 2019 at 14:04:45 UTC, Sebastiaan Koppe wrote:

>> Convert the nodes into an D array, sort the array with nodes.sort!"a.x
>> < b.x" and then iterate the array and repair the next/prev pointers.

If possible, I would go further and ditch the linked list altogether: Just append the nodes to an array and then sort the array. It has been shown in research, conference presentations, and in personal code to be the fasted option is most (or all) cases.

> doesn't the nature of the dynamic array slow it down a bit?

Default bounds checking is going to cost a tiny bit, which you can turn off after development with a compiler flag. (I still wouldn't.)

The only other option that would be faster is an array that's sitting on the stack, created with alloca. But it's only for cases where the thread will not run out of stack space and the result of the array is not going to be used.

> can't I define an array of fixed size, which is dependent on the input
> of the function?

    arr.length = number_of_elements;

All elements will be initialized to the element's default value, which happens to be null for pointers. (If we are back to linked list Node pointers.)

However, I wouldn't bother with setting length either as the cost of automatic array resizing is amortized, meaning that it won't hurt the O(1) algorithmic complexity in the general case. In the GC case that D uses, it will be even better: because if the GC knowns that the neighboring memory block is free, it will just add that to the dynamic array's capacity without moving elements to the new location.

Summary: Ditch the linked list and put the elements into an array. :)

Ali

August 13, 2019
On Tue, Aug 13, 2019 at 11:28:35AM -0700, Ali Çehreli via Digitalmars-d-learn wrote: [...]
> Summary: Ditch the linked list and put the elements into an array. :)
[...]

+1.  The linked list may have been faster 20 years ago, before the advent of modern CPUs with caching hierarchies and memory access predictors.

These days, with CPU multi-level caches and memory access predictors, in-place arrays are often the best option for performance, up to a certain size.  In general, avoid indirection where possible, in order to avoid defeating the memory access predictor and reduce the number of cache misses.


T

-- 
"If you're arguing, you're losing." -- Mike Thomas
August 13, 2019
On Tuesday, 13 August 2019 at 18:28:35 UTC, Ali Çehreli wrote:
> On Tuesday, 13 August 2019 at 14:04:45 UTC, Sebastiaan Koppe
> > wrote:
> > Convert the nodes into an D array, sort the array with nodes.sort!"a.x < b.x" and then iterate the array and repair the next/prev pointers.
>
> If possible, I would go further and ditch the linked list altogether: Just append the nodes to an array and then sort the array. It has been shown in research, conference presentations, and in personal code to be the fasted option is most (or all) cases.
>

Yeah, very likely. Although in this case there already is an array; it's going to be close :)
August 13, 2019
On Tuesday, 13 August 2019 at 18:54:58 UTC, H. S. Teoh wrote:
> On Tue, Aug 13, 2019 at 11:28:35AM -0700, Ali Çehreli via Digitalmars-d-learn wrote: [...]
>> Summary: Ditch the linked list and put the elements into an array. :)
> [...]
>
> +1.  The linked list may have been faster 20 years ago, before the advent of modern CPUs with caching hierarchies and memory access predictors.
>
> These days, with CPU multi-level caches and memory access predictors, in-place arrays are often the best option for performance, up to a certain size.  In general, avoid indirection where possible, in order to avoid defeating the memory access predictor and reduce the number of cache misses.
>
>
> T

I saw a Bjarne Stroustrup talk where he benchmarked that the for n > 1, std::vector was a lot faster than a linked list for all supported operations. I don't know how clever the caching strategies are on a modern processor (Pointer chasing), but to my knowledge the only way of getting a cache efficient linked list would be to effectively have a very contiguous allocator (Which obviously defeats the purpose of using a list in the first place)

Found it: https://www.youtube.com/watch?v=YQs6IC-vgmo
August 13, 2019
On Tue, Aug 13, 2019 at 08:42:37PM +0000, Max Haughton via Digitalmars-d-learn wrote:
> On Tuesday, 13 August 2019 at 18:54:58 UTC, H. S. Teoh wrote:
[...]
> > These days, with CPU multi-level caches and memory access predictors, in-place arrays are often the best option for performance, up to a certain size.  In general, avoid indirection where possible, in order to avoid defeating the memory access predictor and reduce the number of cache misses.
[...]
> I saw a Bjarne Stroustrup talk where he benchmarked that the for n > 1, std::vector was a lot faster than a linked list for all supported operations. I don't know how clever the caching strategies are on a modern processor (Pointer chasing), but to my knowledge the only way of getting a cache efficient linked list would be to effectively have a very contiguous allocator (Which obviously defeats the purpose of using a list in the first place)
> 
> Found it: https://www.youtube.com/watch?v=YQs6IC-vgmo

A contiguous allocator doesn't help after the list has undergone a large number of insertions/deletions, because of fragmentation.

You can still get a cache-efficient linked list by chunking. I.e., instead of a linked-list of individual elements, group the elements into blocks (arrays) that are then linked into a linked list.  Then you can still insert/delete elements with sub-O(1) complexity by updating only the affected block, and only occasionally you need to delete a block or allocate a new block.

As long as the block size is suitably-sized, it will be cache-coherent most of the time and enjoy the associated performance boosts from the cache hierarchy.  For small lists, it will be equivalent to using a single array. For very large lists, you'll also reap the cheap insert/delete flexibility of a linked list.

To minimize space wastage from unused slots in the blocks, you could have a list rebalancing based on some desired utilization factor (e.g., if <50% of a block is used, redistribute elements among adjacent blocks). Redistribution should be pretty cache-coherent (except perhaps for some corner cases), because it mostly involves linear copying of elements from one block to another.


T

-- 
I am not young enough to know everything. -- Oscar Wilde
August 13, 2019
On Tuesday, 13 August 2019 at 18:28:35 UTC, Ali Çehreli wrote:
> > can't I define an array of fixed size, which is dependent on
> the input
> > of the function?
>
>     arr.length = number_of_elements;
>
> All elements will be initialized to the element's default value, which happens to be null for pointers. (If we are back to linked list Node pointers.)
>
> However, I wouldn't bother with setting length either as the cost of automatic array resizing is amortized, meaning that it won't hurt the O(1) algorithmic complexity in the general case. In the GC case that D uses, it will be even better: because if the GC knowns that the neighboring memory block is free, it will just add that to the dynamic array's capacity without moving elements to the new location.

For the application I am writing, it makes very much sense to use a DLL. The approach I am using right now is to: (as stated in the first post)

void bsort(DLL* list) {
  Node*[] nodes;

  foreach (i; 0 .. list.size)
    nodes ~= get_node(list, i);

  nodes.sort!("a.x < b.x");
  // (unrelated data structure cleaning up goes here)
}

My concern was that using a dynamic array, continuously appending to it despite known size, would greatly impact overall performance. My goal is to make this operation as fast as possible, the D way.

I tried setting the length first and iterating through it with `nodes[i] = ...` and the implementation did not run noticeably faster (10 trials), per Teoh's advice to test out the difference.
For now, I consider this a success and would like to thank everyone in the thread for assistance and valuable insights.

Though, it left me with some semi-offtopic questions unanswered:

(1) Ali, do you mean that from an optimization viewpoint, it's better to keep appending like `nodes ~= ...` instead of setting the length first? I would like to have some more background on that claim.

(2) I watched the talk and read the paper [1] to be sure. Many contributors to this thread claim (C++) vectors are nearly always the preferred choice compared to DLLs, but how come they are much faster compared to DLLs, given they have nearly the same functionality?

(3) And I sense this is very closely related to what Ali is on?

[1] http://www.stroustrup.com/Software-for-infrastructure.pdf
August 13, 2019
On Tuesday, 13 August 2019 at 21:11:41 UTC, H. S. Teoh wrote:
> A contiguous allocator doesn't help after the list has undergone a large number of insertions/deletions, because of fragmentation.
Fragmentation should not be an issue, I insist on keep using a DLL for the base structure in my application and create an array just for sorting operations. No new nodes are added and no nodes are permanently removed.

> You can still get a cache-efficient linked list by chunking. [...]
I very much appreciate this thought, as I was thinking of adding such kind of cache to my implementation. It's still in the early stages and I have yet to see when this would make a considerable impact. I will remember this.
August 13, 2019
On Tuesday, 13 August 2019 at 22:12:23 UTC, Mirjam Akkersdijk wrote:

> Though, it left me with some semi-offtopic questions unanswered:
>
> (1) Ali, do you mean that from an optimization viewpoint, it's better to keep appending like `nodes ~= ...` instead of setting the length first? I would like to have some more background on that claim.

He's not saying it will be better, but that the cost will be negligible. On every append, the GC allocates new memory only if the current capacity == 0. When it does allocate, it allocates more space than it actually needs based on the current length of the array, so you don't actually have an allocation on every append. Also, when the adjacent memory block is free and can hold the additional elements, the allocation consists of extending the array's memory block and no copying is performed, making the cost of the allocation even cheaper than it could be.

Anyway, you can use `reserve` when appending rather than setting the length. This will allocate enough memory without default-initializing anything and you don't have to worry about bounds checking.

int[] ints;
ints.reserve(100);
foreach(i; 0..100) {
    ints ~= i;
}


See Steve Schveighoffer's array article for details:

https://dlang.org/articles/d-array-article.html