Thread overview
Why ranges don't return vectors?
Mar 01, 2013 - Piotr Szturmaj
Mar 01, 2013 - bearophile
Mar 01, 2013 - Piotr Szturmaj
Mar 01, 2013 - Chris Cain
Mar 01, 2013 - H. S. Teoh
Mar 01, 2013 - Brad Anderson
Mar 01, 2013 - Piotr Szturmaj
Mar 01, 2013 - monarch_dodra
Mar 01, 2013 - Chris Cain
Mar 04, 2013 - deadalnix
Mar 01, 2013 - Marco Leise
Mar 04, 2013 - renoX
March 01, 2013
Seriously, nobody will ever get performance from the single-element iterator/range pattern - it makes the CPU stall!

Anytime you read one byte from a typical hard disk, the system reads a full sector - typically 512 bytes, no more, no less. Anytime you read one byte from memory, the CPU loads an entire cache line from RAM into the cache (64 bytes on all modern Intel CPUs). Why not exploit that with ranges?

Ranges could potentially return arrays from front() (or from frontVector/whatever), so basically they would become ranges of ranges where the deepest range is always a RandomAccessRange. This has obvious performance benefits. Everyone knows that traversing memory is faster than iterating by popFront(). On the other hand, memory lacks the flexibility of ranges - so let's make a hybrid range!

Advantages:
* performance (!)
* limited lookahead/backtracking

How?

1. instruction-level parallelism: CPUs can execute a few instructions in parallel, given that they operate on different data (different vector elements)
2. SIMD: load an entire vector into an SSE/AVX register, then run the operation on all elements at once
3. use of the L1 cache: traversing a vector in memory is way faster than calling popFront() for each vector element - likely even if it's inlined

Disadvantages:
* potential inconvenience

I know it can't be an easy drop-in addition to the current range implementation, but who knows...
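
Roughly, the idea looks something like this (just a sketch using std.range.chunks to stand in for the hybrid range; none of the names here are a concrete API proposal):

//----
import std.array, std.range, std.stdio;

void main()
{
    auto data = iota(0, 1000).array();

    long sum = 0;
    // the outer range yields slices; the deepest level is plain contiguous memory
    foreach (block; chunks(data, 64))
        foreach (x; block)     // tight loop over memory, no per-element popFront()
            sum += x;
    writeln(sum);              // 499500
}
//----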
March 01, 2013
Piotr Szturmaj:

> I know it can't be an easy drop-in addition to the current range implementation, but who knows...

Some time ago I suggested that idea in this newsgroup, naming it "vectorized laziness", but it was mostly ignored:

http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS

It's also related to the idea of Segmented Iterators, which I discussed later in this same newsgroup (I think I received no answers):

http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf

I think that so far there is only a small amount of vectorized laziness implemented in Phobos; search for "semi-lazy parallel map" here:
http://dlang.org/phobos/std_parallelism.html
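
For reference, a minimal example of that semi-lazy parallel map (adapted loosely from the std.parallelism docs; the numbers are arbitrary):

//----
import std.math, std.parallelism, std.range, std.stdio;

void main()
{
    // taskPool.map evaluates its input semi-lazily, one buffer
    // (here 100 elements) at a time, doing the work in parallel.
    foreach (x; taskPool.map!log10(iota(1.0, 1_000.0), 100))
        writeln(x);
}
//----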

Bye,
bearophile
March 01, 2013
On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
> Seriously, nobody will ever get performance from the single-element iterator/range pattern - it makes the CPU stall!

There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
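
For instance (a tiny illustration; "input.dat" is just a placeholder file name):

//----
import std.stdio;

void main()
{
    // byChunk's front() is a whole ubyte[] buffer, not a single byte
    foreach (ubyte[] chunk; File("input.dat").byChunk(4096))
        writefln("read %s bytes", chunk.length);
}
//----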
March 01, 2013
On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
> On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
> >Seriously, nobody will ever get performance from the single-element iterator/range pattern - it makes the CPU stall!
> 
> There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.

You can make a buffered range that reads in a large chunk of data at a time, and .front and .popFront merely move an internal pointer until it reaches the end of the buffer, then the next buffer page is loaded in. In this case, .front is very simple (just a pointer lookup) and will probably be inlined by an optimizing compiler.

Just because the abstraction is reading per-element, doesn't mean the implementation has to literally do that!
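
Something along these lines (just a sketch - BufferedByteRange is a made-up name, not anything that exists in Phobos):

//----
import std.stdio;

// Reads the file one chunk at a time, but exposes an ordinary
// element-wise input range of ubyte.
struct BufferedByteRange
{
    private File file;
    private ubyte[] storage;   // the underlying buffer page
    private ubyte[] buf;       // the part of storage actually filled
    private size_t i;

    this(File f, size_t chunkSize)
    {
        file = f;
        storage = new ubyte[chunkSize];
        buf = file.rawRead(storage);
    }

    @property bool empty() const { return i >= buf.length; }
    @property ubyte front() const { return buf[i]; }   // just an array lookup

    void popFront()
    {
        if (++i == buf.length)   // page exhausted: load the next one
        {
            buf = file.rawRead(storage);
            i = 0;
        }
    }
}
//----

Usage would be e.g. foreach (b; BufferedByteRange(File("input.dat"), 4096)) { ... }, with the per-byte cost being an index bump most of the time.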


T

-- 
"Maybe" is a strange word.  When mom or dad says it it means "yes", but when my big brothers say it it means "no"! -- PJ jr.
March 01, 2013
On Friday, 1 March 2013 at 03:44:30 UTC, H. S. Teoh wrote:
>
> You can make a buffered range that reads in a large chunk of data at a
> time, and .front and .popFront merely move an internal pointer until it
> reaches the end of the buffer, then the next buffer page is loaded in.
> In this case, .front is very simple (just a pointer lookup) and will
> probably be inlined by an optimizing compiler.
>
> Just because the abstraction is reading per-element, doesn't mean the
> implementation has to literally do that!
>
>
> T

Isn't this essentially what is going on in stdio when a range uses fread?
March 01, 2013
On Fri, 01 Mar 2013 02:02:16 +0100,
Piotr Szturmaj <bncrbme@jadamspam.pl> wrote:

> Anytime you read one byte from a typical hard disk, the system reads a full sector - typically 512 bytes, no more, no less.

Side note: HDD manufacturers recently switched to 4 KiB
sectors (8x size). Everyone who bypasses operating system
buffers to do raw disk access now needs to query the
system for optimal buffer alignment.

-- 
Marco

March 01, 2013
H. S. Teoh wrote:
> On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
>> On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
>>> Seriously, nobody will ever get performance from the single-element
>>> iterator/range pattern - it makes the CPU stall!
>>
>> There's no reason you can't do that. In fact, some ranges already do
>> this. Take a look at std.stdio.File.byChunk and byLine. This isn't
>> appropriate for all ranges, though. Like arrays, for instance. You
>> want to be able to access single elements and do things with them,
>> otherwise the abstraction is pointless.

Yes, I've written some buffered ranges myself, but my original post was not about buffering.

> You can make a buffered range that reads in a large chunk of data at a
> time, and .front and .popFront merely move an internal pointer until it
> reaches the end of the buffer, then the next buffer page is loaded in.
> In this case, .front is very simple (just a pointer lookup) and will
> probably be inlined by an optimizing compiler.

Inlining doesn't always happen, though.

> Just because the abstraction is reading per-element, doesn't mean the
> implementation has to literally do that!

The range abstraction limits some useful optimizations, mainly vectorization. Even if a range internally uses a buffer and does some SIMD operations on it, it exposes only one element from that buffer at a time, so the result can't feed the next SIMD operation. In other words, it limits pipelining.

I'm arguing that (vector) ranges of primitive types should always expose at least a 16-byte vector, even for a one-element range - then the abstraction guarantees that it can be used for SSE ops. They could return slices pointing to fixed-size buffers (at least 16 bytes), and the slice for the last iteration could be shortened to specify the actual number of elements in the vector. The SIMD op would still be done on the full vector (the underlying buffer) - the CPU doesn't care whether an element in a vector is garbage or not. So, in short: ranges pass these vectors between themselves, and each subsequent range mutates the same vector - possibly using SIMD.
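
Something like this, perhaps (purely an illustrative sketch - VectorRange and its layout are made up, and whether the compiler actually emits SIMD for the array ops is a separate question):

//----
// A range over float[] whose front() is always a full 4-float lane.
// The tail lane is padded, so downstream stages can always run
// whole-vector operations; valid() says how many elements are real.
struct VectorRange
{
    private float[] data;
    private float[4] pad;   // scratch lane for the final, partial chunk

    this(float[] d) { data = d; }

    @property bool empty() const { return data.length == 0; }

    @property float[] front()
    {
        if (data.length >= 4)
            return data[0 .. 4];
        pad[0 .. data.length] = data[];   // rest of the lane stays garbage
        return pad[];
    }

    // number of meaningful elements in the current lane
    @property size_t valid() const { return data.length >= 4 ? 4 : data.length; }

    void popFront() { data = data[valid .. $]; }
}
//----

Each downstream stage could then run, say, lane[] *= 2; on the whole 4-element slice and pass it on.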
March 01, 2013
bearophile wrote:
> Piotr Szturmaj:
>
>> I know it can't be an easy drop-in addition to the current range
>> implementation, but who knows...
>
> Some time ago I suggested that idea in this newsgroup, naming it
> "vectorized laziness", but it was mostly ignored:
>
> http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS
>
> It's also related to the idea of Segmented Iterators, which I discussed
> later in this same newsgroup (I think I received no answers):
>
> http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf
>
>
> I think that so far there is only a small amount of vectorized laziness
> implemented in Phobos; search for "semi-lazy parallel map" here:
> http://dlang.org/phobos/std_parallelism.html

Thanks for the references. They look interesting indeed.

March 01, 2013
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
> The range abstraction limits some useful optimizations, mainly vectorization. Even if a range internally uses a buffer and does some SIMD operations on it, it exposes only one element from that buffer at a time, so the result can't feed the next SIMD operation. In other words, it limits pipelining.

Well, why don't you write a range whose "elements" are vectors? Or one that operates on vectors.

The range abstraction only defines how you iterate and view a collection. YOU are the one that defines what the elements in that collection are.

For example, "chunks" does exactly that: given a range, it transforms it into a range of sub-slices:

//----
import std.array, std.range, std.stdio;

void main()
{
    int[] arr = iota(0, 16).array(); // [0, 1, 2, ..., 15]
    auto chunks = std.range.chunks(arr, 4);
    foreach (int[] chunk; chunks)
    {
        chunk[] *= 2;     // array op on the whole sub-slice
        writeln(chunk);
    }
}
//----

What else are you asking for? Nothing is stopping you from piping the chunk elements through another range, btw.
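
For instance (illustrative only):

//----
import std.algorithm, std.array, std.range, std.stdio;

void main()
{
    // one reduced value per 4-element chunk
    auto sums = iota(0, 16).array()
                           .chunks(4)
                           .map!(c => reduce!"a + b"(0, c));
    writeln(sums); // [6, 22, 38, 54]
}
//----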
March 01, 2013
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
> Yes, I've written some buffered ranges myself, but my original post was not about buffering.

My point wasn't about things being buffered. I was just suggesting that many ranges do return arrays, when it's appropriate. If you want to conceptualize a "single item" as a grouping of items from a range (and such a grouping may be a vector of 16 bytes if you so choose), then you are free to do so. The concept is flexible enough to allow this.

However, not all ranges should return arrays. As an answer to the topic's name, the reason why ranges don't (always) return arrays is that the concept of a range is to _abstract_ operations on _any_ type of container (and even things that are not containers, such as RNGs), as long as it has certain capabilities. Returning arrays isn't helpful when you're trying to process arrays, for instance, because if you didn't want an abstract look at an array, you would have worked on the original array without the abstraction in the first place.

> The range abstraction limits some useful optimizations, mainly vectorization. Even if a range internally uses a buffer and does some SIMD operations on it, it exposes only one element from that buffer at a time, so the result can't feed the next SIMD operation. In other words, it limits pipelining.
>
> I'm arguing that (vector) ranges of primitive types should always expose at least a 16-byte vector, even for a one-element range - then the abstraction guarantees that it can be used for SSE ops. They could return slices pointing to fixed-size buffers (at least 16 bytes), and the slice for the last iteration could be shortened to specify the actual number of elements in the vector. The SIMD op would still be done on the full vector (the underlying buffer) - the CPU doesn't care whether an element in a vector is garbage or not. So, in short: ranges pass these vectors between themselves, and each subsequent range mutates the same vector - possibly using SIMD.

Okay, sure, performance. But if you want to code for performance, there's nothing stopping you now. The idea behind ranges is to unify lots of concepts. If you need lower-level work and require every bit of performance, but still want to use ranges, then you can, but it should be done on a specialized range type that meets your precise specification.

I hope I'm explaining it clearly: ranges are more of a universal specification, and it sounds like you have a very particular set of requirements which is a subset of what ranges are supposed to cover. Creating a specialized range type that meets your exact specifications would give you better performance than a generalized range concept (even optimized) and an optimized generalized range concept would not be appropriate for all uses.

Perhaps you could code up a vectorRange that does things the way you want and submit it as an enhancement request. Then you can show off how awesome it is so that everyone will want to use it when it's appropriate. But certainly not all ranges should be vectorRanges, because most of the time we're just trying to work on single units at a time.