April 08, 2015
On Tuesday, 7 April 2015 at 17:25:00 UTC, Walter Bright wrote:
> The current D associative array algorithm
>
>     https://github.com/D-Programming-Language/druntime/blob/master/src/rt/aaA.d
>
> uses an array of buckets with a linked list attached to the buckets to resolve collisions.
>
> Linked lists are perhaps the worst (i.e. slowest) data structure possible on modern CPUs with memory caches.
>
> A possible cache-friendly replacement would be an array of buckets with local probing to resolve collisions. D programs are often heavily dependent on associative arrays, so this could be a big win for us.
>
> And it's a fun project, too. Any takers?
>
> Interestingly,
>
> https://github.com/D-Programming-Language/dmd/blob/master/src/root/stringtable.c
>
> uses quadratic probing instead of linked lists, but this is not cache friendly, either.
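
For concreteness, the open addressing with local probing that Walter describes could look roughly like the sketch below. The names are made up and growth, deletion and hash mixing are all simplified; note that entries move on rehash, so pointers into the table are invalidated, which is the caveat raised against arrays later in this thread:

  struct OpenAA(K, V)
  {
      private struct Slot { K key; V value; bool used; }
      private Slot[] slots;   // length is always a power of two
      private size_t len;

      void opIndexAssign(V value, K key)
      {
          if (len * 2 >= slots.length)
              grow();
          auto i = hashOf(key) & (slots.length - 1);
          // Linear probing: the next candidate slot is usually in the
          // same cache line, unlike chasing a linked-list node.
          while (slots[i].used && slots[i].key != key)
              i = (i + 1) & (slots.length - 1);
          if (!slots[i].used)
          {
              slots[i].key = key;
              slots[i].used = true;
              ++len;
          }
          slots[i].value = value;
      }

      V* opBinaryRight(string op : "in")(K key)
      {
          if (len == 0)
              return null;
          auto i = hashOf(key) & (slots.length - 1);
          while (slots[i].used)
          {
              if (slots[i].key == key)
                  return &slots[i].value;
              i = (i + 1) & (slots.length - 1);
          }
          return null;
      }

      private void grow()
      {
          auto old = slots;
          slots = new Slot[old.length ? old.length * 2 : 8];
          len = 0;
          foreach (ref s; old)     // rehash: entries move to new slots
              if (s.used)
                  this[s.key] = s.value;
      }
  }

Keeping the load factor at or below one half guarantees the probe loops terminate and keeps probe sequences short.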

Every now and then I think how nice it would be for Phobos to be just a bunch of "module interfaces" along with default implementations of them. Unfortunately, or fortunately, D does not have module interfaces like, say, Modula-3. It would make things much easier for someone who wants to work on a different implementation of a certain module (or even a whole package).

I wonder what other people think about this.
April 08, 2015
On Tue, 07 Apr 2015 19:09:15 +0000, "Jens Bauer" <doctor@who.no> wrote:

> On Tuesday, 7 April 2015 at 17:25:00 UTC, Walter Bright wrote:
> > The current D associative array algorithm
> 
> I did a quick scan of the source and didn't see any Bloom filter
> there.
> I wonder if it would be best to have the Bloom filter completely
> external or if it should be included in the associative array?
> E.g. the Bloom filter does require extra memory, which is not
> always desirable.
> On the other hand, a Bloom filter could be opted into where high
> speed is required.

More precisely, it adds memory and time overhead but reduces the cost of most lookups of non-existent keys.

When you just want an associative array for known-to-exist keys, it would be a bad idea to use a Bloom filter. Opt-in is not possible with the current AA syntax as far as I can see, unless you mean some wrapper struct as part of Phobos/druntime.
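
For what it's worth, the wrapper-struct variant could look roughly like this. A minimal sketch with made-up names, not anything in Phobos/druntime: it uses a single hash function where a real Bloom filter would use several, and it supports no removal, since a plain Bloom filter cannot unset bits:

  struct BloomAA(K, V, size_t nbits = 4096)
  {
      private V[K] impl;                // the real associative array
      private ulong[nbits / 64] bits;   // the Bloom filter's bit array

      void opIndexAssign(V value, K key)
      {
          immutable h = hashOf(key) % nbits;
          bits[h / 64] |= 1UL << (h % 64);   // mark key as "maybe present"
          impl[key] = value;
      }

      V* opBinaryRight(string op : "in")(K key)
      {
          immutable h = hashOf(key) % nbits;
          if (!(bits[h / 64] & (1UL << (h % 64))))
              return null;       // definitely absent: skip the AA lookup
          return key in impl;    // maybe present: do the real lookup
      }
  }

  unittest
  {
      BloomAA!(string, int) counts;
      counts["abc"] = 1;
      assert("abc" in counts);
      assert("xyz" !in counts);   // usually rejected by the filter alone
  }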

-- 
Marco

April 08, 2015
On Tue, 07 Apr 2015 15:26:45 -0700, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 4/7/15 3:18 PM, bearophile wrote:
> > Andrei Alexandrescu:
> >
> >>> A possible cache-friendly replacement would be an array of buckets with local probing to resolve collisions.
> >>
> >> Arrays would need to move data. Current hashtables rely on values staying put. -- Andrei
> >
> > The efficiency behaviour of the modern CPU + memory pyramid is rather non-linear and unintuitive. As Walter said at the start of this thread, arrays come out as more efficient in a large number of cases...
> 
> You must have replied to the wrong post. -- Andrei

No, that's really what Walter said:
 "A possible cache-friendly replacement would be an array
[...]."

Especially when iterating over arrays, multiple elements may sit in the same cache line, and the CPU's prefetcher will load the next array elements while the current ones are being processed.
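
A quick way to observe this (a sketch; absolute numbers vary with machine and allocator) is to time summing the same values stored contiguously versus behind a chain of heap-allocated nodes:

  import std.datetime.stopwatch : benchmark;
  import std.stdio : writeln;

  struct Node { int value; Node* next; }

  void main()
  {
      enum N = 1_000_000;
      auto arr = new int[N];
      foreach (i, ref x; arr)
          x = cast(int) i;

      // Node allocations need not be adjacent in memory, so following
      // .next defeats the hardware prefetcher.
      Node* head = null;
      foreach (int i; 0 .. N)
          head = new Node(i, head);

      long s1, s2;
      auto times = benchmark!(
          { s1 = 0; foreach (x; arr) s1 += x; },
          { s2 = 0; for (auto p = head; p !is null; p = p.next) s2 += p.value; }
      )(10);
      writeln("array traversal: ", times[0]);
      writeln("list traversal:  ", times[1]);
  }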

Unexpected memory fetches from RAM can easily be >100 times
slower.
https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf (pg. 22)

https://software.intel.com/en-us/forums/topic/287236
"
Core i7 Xeon 5500 Series

Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local Dram ~60 ns
Remote Dram ~100 ns

[...]

a) Structure data such that computationally related information resides within same cache line.
b) Write your algorithms such that the higher frequency of access occurs on (near) adjacent memory. This reduces the TLB pressure.
c) Reduce the number of writes to RAM through use of temporary variables that can be registerized (optimized into registers)
d) Structure data such that you can manipulate using SSE (when possible)
e) For parallel programming, coordinate activities amongst threads sharing cache levels. (HT siblings for L1 and L2, same die or half die for L3, same NUMA node for multiple nodes).

-- Jim Dempsey"

-- 
Marco

April 08, 2015
What would a faster algorithm look like? There are some that deal with collisions differently but might be slower with good hash distributions. You can improve iteration speed by storing only an index into a dense array of values, but that eats more memory.
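
The index-into-a-dense-array layout could look like this. A sketch only, with made-up names; the sparse index is simply a built-in AA here to keep it short:

  struct DenseAA(K, V)
  {
      private struct Entry { K key; V value; }
      private Entry[] entries;    // values stay packed: iteration walks
                                  // contiguous memory
      private size_t[K] index;    // sparse structure holds only indices

      void opIndexAssign(V value, K key)
      {
          if (auto p = key in index)
          {
              entries[*p].value = value;
              return;
          }
          index[key] = entries.length;
          entries ~= Entry(key, value);
      }

      ref V opIndex(K key)
      {
          return entries[index[key]].value;
      }

      int opApply(scope int delegate(ref K, ref V) dg)
      {
          foreach (ref e; entries)
              if (auto r = dg(e.key, e.value))
                  return r;
          return 0;
      }
  }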

As far as I can see from my own experiments with HAMT (http://infoscience.epfl.ch/record/64398/files/idealhashtrees.pdf), the time it takes to compute hashes and do binary comparisons for e.g. strings is so large that the actual AA algorithm doesn't really matter. If anyone managed to make string AAs faster by swapping out the AA implementation, I'd be surprised.

Instead I focused on using malloc over the GC and borrowed w0rp's idea of not hashing integral data types such as int, char or bool. They are now their own hash, which is dangerous in terms of hash distribution, but acts as a real turbo boost over the built-in AA (+700% faster for lookups in my tests with a 100% hit rate).
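
In sketch form, the idea looks like this (fastHash is my name for it, not the actual patch; the fallback is druntime's generic hash):

  import std.traits : isBoolean, isIntegral, isSomeChar;

  // Integral keys are their own hash: zero hashing cost, at the risk
  // of clustering when key bit patterns are correlated.
  size_t fastHash(T)(in T key)
  {
      static if (isIntegral!T || isSomeChar!T || isBoolean!T)
          return cast(size_t) key;
      else
          return hashOf(key);   // generic fallback
  }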

May be important:

I found that Tuple generates a hash by adding up the hashes of its fields. This means that a coordinate tuple like Tuple!(int, int) will collide with its mirrored coordinates, e.g. tuple(2,5) has the same hash as tuple(5,2), and so on. Furthermore, the hash function for each field is a virtual call through a TypeInfo object, which contains further branching to determine whether the type has an overridden toHash or whether the default implementation (SuperFastHash over the binary representation) is used.
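
The mirror collision is easy to demonstrate with a plain additive combiner, and any order-sensitive combiner avoids it. A sketch (the multiplier is arbitrary and the function names are mine; this is not what druntime does):

  // Plain addition is symmetric in its arguments, so mirrored field
  // hashes collide:
  size_t addCombine(size_t a, size_t b) { return a + b; }

  // Multiplying the running seed makes the combination order-sensitive:
  size_t mixCombine(size_t seed, size_t h) { return seed * 31 + h; }

  unittest
  {
      immutable h2 = hashOf(2), h5 = hashOf(5);
      assert(addCombine(h2, h5) == addCombine(h5, h2)); // tuple(2,5) vs tuple(5,2)
      assert(mixCombine(h2, h5) != mixCombine(h5, h2));
  }
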
I found that:

  // cache the getHash delegate for T once, at compile time
  enum hasher = &typeid(T).getHash;
  size_t h = hasher(&t);

already improved hashing speed for gdc a bit. Without looking at the generated assembly, my wild guess is that devirtualization failed. TypeInfo objects should be known at compile time, and getHash and other calls on them should be inlined. Even the dynamic branch that checks for an overridden toHash is known at compile time.

-- 
Marco

April 09, 2015
On 4/8/15 5:11 AM, Martin Nowak wrote:
> On Tuesday, 7 April 2015 at 22:33:52 UTC, Steven Schveighoffer wrote:
>> Until that point, all these discussions on AA updates are useless.

>
> I really don't like that all-or-nothing attitude; it condemns an important
> step just because something better might be possible. It's also very
> comfortable, because this way nobody will ever have to do anything.

I'm not suggesting nobody ever do anything out of comfort. What I'm suggesting is that it's not worth improving something that will be discarded anyway. Let's fix the problem hampering people from understanding how the AA works so they can work on improving it: namely, the discarding of all compile-time type information when AA operations are lowered to runtime calls.

> Improving the AA implementation has a big immediate effect, and will
> also help anyone writing a library implementation, because any candidate
> up to this date was still based on that crappy bucket list implementation.

The "crappy bucket list" implementation is also the simplest. When the focus should be on fixing the AA interface with the compiler and library, I think having a simple understandable implementation is fine.

Let me remind you of your attitude when I brought up the fact that the AA's load factor was 4. Just fixing this one number could improve performance, even for the bucket-list implementation, since a load factor of 4 means an average of four chain entries to walk per bucket on every lookup:

> We'll throw the current implementation away and implement an open addressing AA once we get to it. So don't worry about the load factor right now.

In other words, don't bother fixing the constant "4" in the code, it's going to be thrown away anyway.

> The biggest problems in writing an AA library implementation sorted by
> difficulty are:
> - deprecation of all magic AA behaviors (attributes, as[i][j]++)
> - get the lowering right

I really am surprised that these would be difficult at all. The lowering is pretty easy:
a) T[U] => AssociativeArray!(U,T)
b) [a:b, c:d] => AssociativeArray!(typeof(true ? a : c), typeof(true ? b : d)).literal(a, b, c, d);
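
To illustrate, the library side of that lowering might look like this (AssociativeArray and literal are the names from the lowering above; the body is just a placeholder):

  struct AssociativeArray(K, V)
  {
      private V[K] impl;   // placeholder storage for the sketch

      static AssociativeArray literal(Args...)(Args args)
          if (Args.length % 2 == 0)   // alternating keys and values
      {
          AssociativeArray aa;
          static foreach (i; 0 .. Args.length / 2)
              aa.impl[args[2 * i]] = args[2 * i + 1];
          return aa;
      }
  }

  // so that, e.g., the literal [1: "a", 2: "b"] lowers to:
  //   AssociativeArray!(int, string).literal(1, "a", 2, "b");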

Basically, bikeshedding the name "literal" is the difficult part.

And deprecating the magic AA behaviors: isn't this just removing a whole lot of code from the compiler? Once the lowering is working, I don't see how removing special cases would be difficult. The part where I see we may have issues is CTFE support. But I would defer to those who have already tried this exercise; perhaps they can share what broke their solutions.

-Steve
April 09, 2015
"Steven Schveighoffer"  wrote in message news:mg5pf0$1g51$1@digitalmars.com...

> I really am surprised that these would be difficult at all. The lowering is pretty easy:
> a) T[U] => AssociativeArray!(U,T)
> b) [a:b, c:d] => AssociativeArray!(typeof(true ? a : c), typeof(true ? b : d)).literal(a, b, c, d);

Working out what to lower to is easy; working out when to lower it is hard.

April 09, 2015
On 4/9/15 8:07 AM, Daniel Murphy wrote:
> "Steven Schveighoffer"  wrote in message
> news:mg5pf0$1g51$1@digitalmars.com...
>
>> I really am surprised that these would be difficult at all. The
>> lowering is pretty easy:
>> a) T[U] => AssociativeArray!(U,T)
>> b) [a:b, c:d] => AssociativeArray!(typeof(true ? a : c), typeof(true ?
>> b : d)).literal(a, b, c, d);
>
> Working out what to lower to is easy, working out when to lower it is hard.

Wouldn't it be whenever AA was previously invoked? I'm surprised there are any unknowns here.

-Steve
April 09, 2015
"Steven Schveighoffer"  wrote in message news:mg61gn$1o09$1@digitalmars.com...

> Wouldn't it be whenever AA was previously invoked? I'm surprised there are any unknowns here.

It has to be done before some parts of semantic analysis, but after others. E.g. error messages and template matching still need to be done on the AA type, but other parts need the actual template type.

April 09, 2015
On 4/9/15 11:25 AM, Daniel Murphy wrote:
> "Steven Schveighoffer"  wrote in message
> news:mg61gn$1o09$1@digitalmars.com...
>
>> Wouldn't it be whenever AA was previously invoked? I'm surprised there
>> are any unknowns here.
>
> It has to be done before some parts of semantic, but after others.  Eg
> error messages and template matching still needs to be done on the AA
> type, but other parts need the actual template type.

I think a reasonable path is to lower as early as possible, have everything show up as the template type in error messages, and then work on updating the messages over time to reflect the builtin syntax.

In any case, I didn't think about template matching; that is handled specially for AAs. Perhaps we can make it more generic.

-Steve