September 15, 2015
On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
> http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf

That's a pretty convincing case. Who does it :)?


September 15, 2015
On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
> On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
>> http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf
>
> That's a pretty convincing case. Who does it :)?

You :)

https://github.com/jrmuizel/pyunum/blob/master/unum.py

September 16, 2015
On Tuesday, 15 September 2015 at 11:13:59 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
>> On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
>>> http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf
>>
>> That's a pretty convincing case. Who does it :)?

I'm not convinced. I think they are downplaying the hardware difficulties. Slide 34:

Disadvantages of the Unum Format
* Non-power-of-two alignment. Needs packing and unpacking, garbage collection.

I think that disadvantage is so enormous that it negates most of the advantages. Note that in the x86 world, unaligned memory loads of SSE values still take longer than aligned loads. And that's a trivial case!
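To make the alignment point concrete, here is a minimal sketch (toy Python, assuming the unum-1 layout from the book: sign, exponent, fraction, ubit, plus a utag giving this value's exponent and fraction sizes). The widths are data-dependent and rarely land on a power of two:

```python
# Hedged sketch: a unum's total width depends on the value stored in it,
# because the utag carries the exponent size and fraction size of that
# particular value. Neighbouring unums in memory can therefore have
# different, non-power-of-two widths, which is what makes packing painful.
def unum_bits(es: int, fs: int, ess: int = 3, fss: int = 4) -> int:
    """Total width: sign + exponent + fraction + ubit + utag (ess + fss bits)."""
    assert 1 <= es <= (1 << ess) and 1 <= fs <= (1 << fss)
    return 1 + es + fs + 1 + ess + fss

# In a {3, 4} environment the same stream can mix 11-bit and 33-bit values:
print(unum_bits(es=1, fs=1))    # 11 bits for a small, exact value
print(unum_bits(es=8, fs=16))   # 33 bits for the widest one
```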

The energy savings are achieved by using a primitive form of compression. Sure, you can reduce the memory bandwidth required by compressing the data. You could do that for *any* form of data, not just floating point. But I don't think anyone thinks that's worthwhile.
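As a toy illustration of that (plain Python; the array is deliberately repetitive, real data won't shrink anywhere near as much):

```python
# Compressing a block of doubles does cut the bytes that cross the memory bus,
# but the gain depends entirely on how redundant the data is.
import struct
import zlib

values = [0.0, 1.0, 2.0, 3.0] * 2500           # 10,000 highly redundant doubles
raw = struct.pack(f"{len(values)}d", *values)  # 80,000 bytes in IEEE-754 form
packed = zlib.compress(raw)

print(len(raw), len(packed))                   # repetitive data shrinks dramatically
```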

The energy comparisons are plain dishonest. The power required for accessing from DRAM is the energy consumption of a *cache miss* !! What's the energy consumption of a load from cache? That would show you what the real gains are, and my guess is they are tiny.

So:
* I don't believe the energy savings are real.
* There is no guarantee that it would be possible to implement it in hardware without a speed penalty, regardless of how many transistors you throw at it (hardware analogue of Amdahl's Law)
* but the error bound stuff is cool.


September 16, 2015
On Wednesday, 16 September 2015 at 08:17:59 UTC, Don wrote:
> On Tuesday, 15 September 2015 at 11:13:59 UTC, Ola Fosheim Grøstad wrote:
>> On Tuesday, 15 September 2015 at 10:38:23 UTC, ponce wrote:
>>> On Tuesday, 15 September 2015 at 09:35:36 UTC, Ola Fosheim Grøstad wrote:
>>>> http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf
>>>
>>> That's a pretty convincing case. Who does it :)?
>
> I'm not convinced. I think they are downplaying the hardware difficulties. Slide 34:
>
> Disadvantages of the Unum Format
> * Non-power-of-two alignment. Needs packing and unpacking, garbage collection.
>
> I think that disadvantage is so enormous that it negates most of the advantages. Note that in the x86 world, unaligned memory loads of SSE values still take longer than aligned loads. And that's a trivial case!
>
> The energy savings are achieved by using a primitive form of compression. Sure, you can reduce the memory bandwidth required by compressing the data. You could do that for *any* form of data, not just floating point. But I don't think anyone thinks that's worthwhile.
>

GPUs do it a lot, especially (but not exclusively) on mobile. Not to reduce misses: a miss is pretty much guaranteed, since you run 32 threads at once in a shader core and each of them needs at least 8 texels for a bilinear texture fetch with mipmapping, and that's the bare minimum. That means 256 memory accesses at once; one of these texels WILL miss, and it will stall all 32 threads. So it is not a latency issue, but a bandwidth and energy one.

But yeah, in the general case random access is preferable; memory alignment, and the fact that you don't need to do as much bookkeeping, are very significant.

Also, predictable sizes mean you can split your dataset and process it in parallel, which is impossible if sizes are variable.
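A rough sketch of what I mean (toy Python; the length-prefixed format below is just a stand-in for any variable-width encoding):

```python
# With fixed-size elements, chunk boundaries are pure arithmetic and each
# worker can start immediately; with variable-size elements you have to walk
# the stream serially just to learn where the Nth element starts.
import struct

FIXED = 8  # bytes per IEEE double

def fixed_chunks(buf: bytes, workers: int):
    """Byte ranges each worker can process independently, computed in O(1)."""
    n = len(buf) // FIXED
    per = (n + workers - 1) // workers
    return [(i * per * FIXED, min((i + 1) * per, n) * FIXED) for i in range(workers)]

def variable_offsets(buf: bytes):
    """Length-prefixed records: finding element boundaries is inherently serial."""
    offsets, pos = [], 0
    while pos < len(buf):
        offsets.append(pos)
        size = buf[pos]        # the width is only known after reading the header
        pos += 1 + size
    return offsets

doubles = struct.pack("1000d", *([1.5] * 1000))
print(fixed_chunks(doubles, 4))                # [(0, 2000), (2000, 4000), ...]

records = b"".join(bytes([len(p)]) + p for p in (b"ab", b"cdef", b"g"))
print(variable_offsets(records))               # [0, 3, 8] after a full scan
```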

> The energy comparisons are plain dishonest. The power required for accessing from DRAM is the energy consumption of a *cache miss* !! What's the energy consumption of a load from cache? That would show you what the real gains are, and my guess is they are tiny.
>

The energy comparison is bullshit. Until you have loaded the data, you don't know how wide it is, meaning you either have to be pessimistic and load for the worst-case scenario, or do two round trips to memory.
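To put rough numbers on the pessimistic option (toy Python, made-up payload sizes):

```python
# If you pad every variable-width value out to the worst case so that loads
# become addressable again, you give back exactly the space and bandwidth
# savings the variable encoding was supposed to buy.
MAX_BYTES = 8  # assume the worst-case unum is as wide as a double

values = [b"\x01", b"\x03\xff\x42", b"\x07" + b"\x00" * 6]  # variable-width payloads

packed_size = sum(len(v) for v in values)   # bytes on the bus if tightly packed
padded_size = len(values) * MAX_BYTES       # bytes you must budget for fixed slots

print(packed_size, padded_size)             # 11 vs 24: padding erases the savings
```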

The author also leans a lot on the wire vs. transistor cost and how it has evolved. He is right, except that you won't cram more wires into the CPU at runtime: the CPU always needs the wiring for the worst-case scenario.

The hardware is likely to be slower, as you'll need far more wiring than for regular floats, and wire is not only cost but also time.

That being said, even a hit in L1 is very energy hungry. Think about it: you need to do an 8-way fetch (so you'll end up pulling 4k of data out of the cache), in parallel with address translation (usually 16-way), in parallel with snooping into the load and store buffers.

If the load is not aligned and crosses a cache-line boundary, you pretty much have to double all of that.

I'm not sure what his numbers represent, but hitting L1 is quite power hungry. He is right on that one.

> So:
> * I don't believe the energy savings are real.
> * There is no guarantee that it would be possible to implement it in hardware without a speed penalty, regardless of how many transistors you throw at it (hardware analogue of Amdahl's Law)
> * but the error bound stuff is cool.

Yup, that's pretty much what I get out of it as well.

September 16, 2015
On Saturday, 11 July 2015 at 18:16:22 UTC, Timon Gehr wrote:
> On 07/11/2015 05:07 PM, Andrei Alexandrescu wrote:
>> On 7/10/15 11:02 PM, Nick B wrote:
>>> John Gustafson book is now out:
>>>
>>> It can be found here:
>>>
>>> http://www.amazon.com/End-Error-Computing-Chapman-Computational/dp/1482239868/ref=sr_1_1?s=books&ie=UTF8&qid=1436582956&sr=1-1&keywords=John+Gustafson&pebp=1436583212284&perid=093TDC82KFP9Y4S5PXPY
>>>
>>
>> Very interesting, I'll read it. Thanks! -- Andrei
>>
>
> I think Walter should read chapter 5.

What is this chapter about?
September 16, 2015
On Wednesday, 16 September 2015 at 08:17:59 UTC, Don wrote:
> I'm not convinced. I think they are downplaying the hardware difficulties. Slide 34:

I don't think he is downplaying it. He has said that it will probably take at least 10 years before it is available in hardware. There is also a company called Rex Computing that is looking at unums:

http://www.theplatform.net/2015/07/22/supercomputer-chip-startup-scores-funding-darpa-contract/

He assumes that you use a scratchpad (a big register file), not caching, for intermediate calculations.

His basic reasoning is that brute-force ubox methods make for highly parallel calculations. It might be possible to design ALUs that can work efficiently with various unum bit widths (many small or a few large)... who knows. You'll have to try first.
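As a toy illustration of the ubox idea (my own Python sketch, with f(x) = x*x - 2 picked just as an example): chop the domain into small boxes, bound f on each box, and throw away the boxes that cannot contain a root. Every box is independent, so the work parallelises trivially:

```python
# Brute-force "ubox"-style search: bound f(x) = x*x - 2 on each tiny interval
# with interval arithmetic and keep only the boxes where f might be zero.
from concurrent.futures import ProcessPoolExecutor

def f_bounds(lo: float, hi: float):
    """Interval extension of f(x) = x*x - 2 on [lo, hi]."""
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    sq_hi = max(lo * lo, hi * hi)
    return sq_lo - 2.0, sq_hi - 2.0

def keeps_root(box):
    f_lo, f_hi = f_bounds(*box)
    return f_lo <= 0.0 <= f_hi          # zero is inside the bound: the box survives

if __name__ == "__main__":
    n = 1 << 12
    width = 4.0 / n
    boxes = [(-2.0 + i * width, -2.0 + (i + 1) * width) for i in range(n)]
    with ProcessPoolExecutor() as pool:  # every box is independent work
        keep = list(pool.map(keeps_root, boxes, chunksize=256))
    survivors = [b for b, k in zip(boxes, keep) if k]
    print(len(survivors), survivors[0], survivors[-1])   # boxes hugging +/- sqrt(2)
```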

Let's not forget that there are a _lot_ of legacy constraints and architectural assumptions baked into the x86 architecture.

> The energy comparisons are plain dishonest. The power required for accessing from DRAM is the energy consumption of a *cache miss* !! What's the energy consumption of a load from cache?

I think this argument is aimed at HPC, where you can find funding for ASICs. They push a lot of data over the memory bus.

September 16, 2015
On Wednesday, 16 September 2015 at 08:38:25 UTC, deadalnix wrote:
> The energy comparison is bullshit. Until you have loaded the data, you don't know how wide it is, meaning you either have to be pessimistic and load for the worst-case scenario, or do two round trips to memory.

That really depends on the memory layout and the algorithm. A likely implementation would be a co-processor that takes a unum stream and pipes it through a network of cores (a tile-based co-processor). The internal buses between cores are very fast, and with 256+ cores you get tremendous throughput. But you need a good compiler, libraries, and software support.

> The hardware is likely to be slower as you'll need way more wiring than for regular floats, and wire is not only cost, but also time.

You need more transistors per ALU, but slower does not matter if the algorithm needs bounded accuracy or if it converges more quickly with unums. The key challenge for him is to create a market, meaning getting the semantics into scientific software and getting initial workable implementations out to scientists.

If there is market demand, there will be products. But you need to create the market first. Hence he wrote an easy-to-read book on the topic and supports people who want to implement it.

September 16, 2015
On Wednesday, 16 September 2015 at 08:38:25 UTC, deadalnix wrote:
>
> Also, predictable sizes mean you can split your dataset and process it in parallel, which is impossible if sizes are variable.

I don't recall how he would deal with something similar to cache misses when you have to promote or demote a unum. However, my recollection of the book is that there was quite a bit of focus on a unum representation that has the same size as a double. If you only did the computations with this format, I would expect the sizes would be more-or-less fixed. Promotion would be pretty rare, but still possible, I would think.
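Roughly, the mechanism I have in mind is something like this (toy Python, invented field sizes, not the book's actual encoding): when a result needs more fraction bits than the environment allows, the number doesn't grow; the extra bits are dropped and the ubit records that the stored value is now inexact:

```python
# Toy sketch: in a capped environment, results that don't fit exactly are
# truncated and flagged inexact via the ubit instead of growing the storage,
# so the size stays fixed.
from dataclasses import dataclass
from fractions import Fraction

MAX_FRACTION_BITS = 16   # the cap imposed by the chosen unum environment

@dataclass
class ToyUnum:
    value: Fraction      # the stored (possibly truncated) value
    ubit: bool           # True means the true result lies beyond the stored bits

def store(exact: Fraction) -> ToyUnum:
    """Round `exact` into the environment, setting the ubit if bits were lost."""
    scaled = exact * (1 << MAX_FRACTION_BITS)
    truncated = Fraction(scaled.numerator // scaled.denominator, 1 << MAX_FRACTION_BITS)
    return ToyUnum(truncated, ubit=(truncated != exact))

print(store(Fraction(3, 4)))   # fits exactly: ubit stays False
print(store(Fraction(1, 3)))   # 1/3 needs infinitely many bits: truncated, ubit True
```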

Compared to calculations with doubles, there might not be a strong case for energy efficiency (but I don't really know for sure). My understanding was that the energy-efficiency benefit only appears when you use a smaller-sized unum instead of a float. I don't recall how he would resolve your point about cache misses.

Anyway, while I can see benefits from using unum numbers rather than floating point numbers (accuracy, avoiding overflow, etc.), I think performance and energy efficiency would have to be in the same range as floating point for unums to see any meaningful adoption.
September 16, 2015
On 09/16/2015 10:46 AM, deadalnix wrote:
> On Saturday, 11 July 2015 at 18:16:22 UTC, Timon Gehr wrote:
>> On 07/11/2015 05:07 PM, Andrei Alexandrescu wrote:
>>> On 7/10/15 11:02 PM, Nick B wrote:
>>>> John Gustafson book is now out:
>>>>
>>>> It can be found here:
>>>>
>>>> http://www.amazon.com/End-Error-Computing-Chapman-Computational/dp/1482239868/ref=sr_1_1?s=books&ie=UTF8&qid=1436582956&sr=1-1&keywords=John+Gustafson&pebp=1436583212284&perid=093TDC82KFP9Y4S5PXPY
>>>>
>>>>
>>>
>>> Very interesting, I'll read it. Thanks! -- Andrei
>>>
>>
>> I think Walter should read chapter 5.
>
> What is this chapter about ?

Relevant quote: "Programmers and users were never given visibility or control of when a value was promoted to “double extended precision” (80-bit or higher) format, unless they wrote assembly language; it just happened automatically, opportunistically, and unpredictably. Confusion caused by different results outweighed the advantage of reduced rounding-overflow-underflow problems, and now coprocessors must dumb down their results to mimic systems that have no such extra scratchpad capability."
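A minimal way to see the effect he is describing, using plain Python rather than x87 hardware (the Fraction plays the role of the wider scratchpad):

```python
# The same expression gives different answers depending on how much precision
# the intermediates carry: in 64-bit doubles the +1 is rounded away, while a
# wider scratchpad keeps it.
from fractions import Fraction

double_result = (1e16 + 1.0) - 1e16                        # rounded step by step: 0.0
exact_result = (Fraction(10**16) + 1) - Fraction(10**16)   # wider intermediates: 1

print(double_result, exact_result)
```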


September 16, 2015
On 09/16/2015 10:17 AM, Don wrote:
>
> So:
> ...
> * There is no guarantee that it would be possible to implement it in
> hardware without a speed penalty, regardless of how many transistors you
> throw at it (hardware analogue of Amdahl's Law)

https://en.wikipedia.org/wiki/Gustafson's_law :o)