September 16, 2015
On Wednesday, 16 September 2015 at 14:11:04 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 16 September 2015 at 08:38:25 UTC, deadalnix wrote:
>> The energy comparison is bullshit. As long as you haven't loaded the data, you don't know how wide the values are. Meaning you either need to go pessimistic and load for the worst-case scenario, or do two round trips to memory.
>
> That really depends on memory layout and algorithm. A likely implementation would be a co-processor that takes a unum stream and pipes it through a network of cores (a tile-based co-processor). The internal buses between cores are very fast, and with 256+ cores you get tremendous throughput. But you need a good compiler/libraries and software support.
>

No, you don't. The streamer still needs to load the unums one by one. Maybe two by two with a fair amount of hardware speculation (which means you are already trading energy for performance, so the energy argument is weak). There is no way you can feed 256+ cores that way.

To give you a similar example, instruction decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 designs has decreased rather than increased over the past decade, because you simply can't decode fast enough to feed them. Yet x86 CPUs have 64-way speculative decoding as a first stage.

>> The hardware is likely to be slower as you'll need way more wiring than for regular floats, and wire is not only cost, but also time.
>
> You need more transistors per ALU, but being slower does not matter if the algorithm needs bounded accuracy or if it converges more quickly with unums. The key challenge for him is to create a market, meaning getting the semantics into scientific software and getting initial workable implementations out to scientists.
>
> If there is market demand, then there will be products. But you need to create the market first. Hence he wrote an easy-to-read book on the topic and supports people who want to implement it.

The problem is not transistors, it is wire. Because the damn thing is variadic in every way, pretty much every input bit can end up anywhere in the functional unit. That is a LOT of wire.

September 16, 2015
On Wednesday, 16 September 2015 at 19:21:59 UTC, deadalnix wrote:
> No, you don't. The streamer still needs to load the unums one by one. Maybe two by two with a fair amount of hardware speculation (which means you are already trading energy for performance, so the energy argument is weak). There is no way you can feed 256+ cores that way.

You can continuously load 64 bytes at a time in a stream, decode them to your internal format, and push them into the scratchpads of the other cores. You could even do this in hardware.

If you look at the ubox brute-forcing method, you perform many calculations over the same data, because you solve spatially, not by timesteps. So you can run many, many parallel computations over the same data.

> To give you a similar example, instruction decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 designs has decreased rather than increased over the past decade, because you simply can't decode fast enough to feed them. Yet x86 CPUs have 64-way speculative decoding as a first stage.

That's because we use a dumb compiler that does not prefetch intelligently. If you are writing for a tile-based VLIW CPU, you preload. These calculations are highly iterative, so I'd rather think of it as a co-processor solving a single equation repeatedly than as something running the whole program. You can run the larger program on a regular CPU or a few of the cores.

> The problem is not transistors, it is wire. Because the damn thing is variadic in every way, pretty much every input bit can end up anywhere in the functional unit. That is a LOT of wire.

I haven't seen a design, so I cannot comment. But keep in mind that the CPU does not have to work with the format, it can use a different format internally.

We'll probably see FPGA implementations that can be run on FPGA cards for PCs within a few years. I read somewhere that a group in Singapore was working on it.

September 16, 2015
On Wednesday, 16 September 2015 at 19:40:49 UTC, Ola Fosheim Grøstad wrote:
> You can continuously load 64 bytes at a time in a stream, decode them to your internal format, and push them into the scratchpads of the other cores. You could even do this in hardware.
>

1/ If you load for the worst-case scenario, then your power advantage is gone.
2/ If you load them one by one, how do you expect to feed 256+ cores?

Obviously you can build this in hardware. And obviously it is not going to be able to feed 256+ cores. Even with a chip at a low frequency, let's say 800 MHz or so, you have about 80 cycles to access memory. That means each core needs to have 20,000+ cycles of work to do per unum.
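Spelled out (assuming one unum delivered per ~80-cycle memory round trip, handed out round-robin to 256 cores):

    256 cores * 80 cycles = 20,480 cycles between unums arriving at any given core

So each unum a core receives has to be worth roughly 20,000 cycles of computation, or the core sits idle.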

That's a simple back-of-the-envelope calculation. Your proposal is simply ludicrous. It's a complete non-starter.

You can build this in hardware. Sure you can, no problem. But you won't, because it is a stupid idea.

>> To give you a similar example, instruction decoding is often the bottleneck on an x86 CPU. The number of ALUs in x86 designs has decreased rather than increased over the past decade, because you simply can't decode fast enough to feed them. Yet x86 CPUs have 64-way speculative decoding as a first stage.
>
> That's because we use a dumb compiler that does not prefetch intelligently.

You know, when you have no idea what you are talking about, you can just move on to something you understand.

Prefetching would not change anything here. The problem comes from variable-size encoding and the challenges it causes for hardware. You can have a 100% L1 hit rate and still have the same problem.

No sufficiently smart compiler can fix that.

> If you are writing for a tile-based VLIW CPU, you preload. These calculations are highly iterative, so I'd rather think of it as a co-processor solving a single equation repeatedly than as something running the whole program. You can run the larger program on a regular CPU or a few of the cores.
>

That's irrelevant. The problem is not the kind of CPU; it is how you feed it at a fast enough rate.

>> The problem is not transistors, it is wire. Because the damn thing is variadic in every way, pretty much every input bit can end up anywhere in the functional unit. That is a LOT of wire.
>
> I haven't seen a design, so I cannot comment. But keep in mind that the CPU does not have to work with the format, it can use a different format internally.
>
> We'll probably see FPGA implementations that can be run on FPGA cards for PCs within a few years. I read somewhere that a group in Singapore was working on it.

That's hardware 101.

When you have a floating-point unit, you get your 32 bits: 23 bits go into the mantissa FU, 8 into the exponent FU, and 1 is the sign. For instance, if you multiply floats, you send the two exponents into an adder, you send the two mantissas into a 24-bit multiplier (you add a leading 1), and you XOR the sign bits.

You look at the carry out of the multiplier (or, equivalently, count the leading zeroes of the 48-bit multiply result), shift by that amount, and add the shift to the exponent.

If you get a carry in the exponent adder, you saturate and emit an infinity.
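Here is that datapath as a minimal C sketch (illustrative only: it truncates instead of rounding, and ignores zeroes, subnormals, NaNs, and exponent underflow):

    #include <stdint.h>

    /* Software model of the fixed wiring: bit 31 goes to the sign unit,
       bits 30..23 to the exponent unit, bits 22..0 to the mantissa unit. */
    static uint32_t fmul_sketch(uint32_t a, uint32_t b)
    {
        uint32_t sign = (a >> 31) ^ (b >> 31);         /* XOR of the sign bits */
        int32_t  ea   = (a >> 23) & 0xFF;              /* exponent fields      */
        int32_t  eb   = (b >> 23) & 0xFF;
        uint64_t ma   = (a & 0x7FFFFF) | 0x800000;     /* add the leading 1    */
        uint64_t mb   = (b & 0x7FFFFF) | 0x800000;

        int32_t  e = ea + eb - 127;                    /* exponent adder       */
        uint64_t m = ma * mb;                          /* 24x24 -> 48-bit mul  */

        /* Normalize: the product is in [2^46, 2^48), so shift depending on
           the carry into bit 47 and bump the exponent by the extra shift. */
        if (m & (1ULL << 47)) { m >>= 24; e += 1; }
        else                  { m >>= 23; }

        if (e >= 255)                                  /* exponent carry:      */
            return (sign << 31) | (0xFFu << 23);       /* saturate to infinity */

        return (sign << 31) | ((uint32_t)e << 23) | ((uint32_t)m & 0x7FFFFF);
    }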

Each bit goes into a given functional unit. That means you need one wire from the input to the functional unit it goes to. Same for the results.

Now, if the format is variadic, you need to wire all bits to all functional units, because any bit can potentially end up in any of them. That's a lot of wire; in fact, the number of wires grows quadratically (n input bits times n possible destinations) with that joke.

The author keeps repeating that wire has become the expensive thing, and he is right. Which means a solution with quadratic wiring is not going to cut it.

September 16, 2015
On Wednesday, 16 September 2015 at 20:06:43 UTC, deadalnix wrote:
> You know, when you have no idea what you are talking about, you can just move on to something you understand.

Ah, nice move. Back to your usual habits?

> Prefetching would not change anything here. The problem comes from variable-size encoding and the challenges it causes for hardware. You can have a 100% L1 hit rate and still have the same problem.

There is _no_ cache. The compiler fully controls the layout of the scratchpad.

> That's hardware 101.

Is it?

The core point is this:

1. If there is academic interest (i.e. publishing opportunities), you get research.

2. If there is research, you get new algorithms.

3. Then you get funding.

etc.

You cannot predict at this point what the future will be like. Is it unlikely that any specific thing will change the status quo? Yes. Is it highly probable that something will change the status quo? Yes. Will it happen overnight? No.

50+ years have been invested in floating-point design. Will that be offset overnight? No.

It'll probably take 10+ years before anyone has a numerical ALU on their desktop other than IEEE 754. By that time we'll be in a new era.

September 16, 2015
On Wednesday, 16 September 2015 at 08:53:24 UTC, Ola Fosheim Grøstad wrote:
>
> I don't think he is downplaying it. He has said that it will probably take at least 10 years before it is available in hardware. There is also a company called Rex Computing that is looking at unum:
>
Oh hey, I remember these jokers.  They were trying to blow some smoke about moving 288 GB/s at 4W.  They're looking at unum?  Of course they are; care to guess who's advising them?  Yep.

I'll be shocked if they ever even get to tape out.

-Wyatt
September 16, 2015
On Wednesday, 16 September 2015 at 20:30:36 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 16 September 2015 at 20:06:43 UTC, deadalnix wrote:
>> You know, when you have no idea what you are talking about, you can just move on to something you understand.
>
> Ah, nice move. Back to your usual habits?
>

Stop

>> Prefetching would not change anything here. The problem comes from variable-size encoding and the challenges it causes for hardware. You can have a 100% L1 hit rate and still have the same problem.
>
> There is _no_ cache. The compiler fully controls the layout of the scratchpad.
>

You are the king of goalpost shifting. You got served an answer about x86 decoding, so now you change the subject.

You want to talk about a scratchpad? Good! How does the data end up in the scratchpad to begin with? Using magic? What is the scratchpad made of, if not flip-flops? And if so, how is it different from a cache as far as the hardware is concerned?

You can play with words, but the problem remains the same. When you have on-chip memory, be it cache or scratchpad, and a variadic encoding, you can't even feed a handful of ALUs. How do you expect to feed 256+ VLIW cores? There are three orders of magnitude of gap in your reasoning.

You can't pull three orders of magnitude out of your ass and just pretend it can be done.

>> That's hardware 101.
>
> Is it?
>

Yes, wire is hardware 101. I mean, seriously, if one does not get how components can be wired together, one should probably abstain from making any hardware comments.

> You cannot predict at this point what the future will be like. Is it unlikely that any specific thing will change the status quo? Yes. Is it highly probable that something will change the status quo? Yes. Will it happen overnight? No.
>
> 50+ years have been invested in floating-point design. Will that be offset overnight? No.
>
> It'll probably take 10+ years before anyone has a numerical ALU on their desktop other than IEEE 754. By that time we'll be in a new era.

OK, listen, this is not complicated.

I don't know what cars will come out next year. But I know there won't be a car that can go 10,000 km on 10 centiliters of gasoline. That would be physics-defying stuff.

Same thing here: you won't be able to feed 256+ cores if you load data sequentially.

Don't give me this stupid "we don't know what's going to happen tomorrow" bullshit. We won't have unicorn meat in supermarkets. We won't have free energy. We won't have interstellar travel. And we won't have the capability to feed 256+ cores sequentially.

I gave you numbers; you gave me bullshit.

September 16, 2015
On Wednesday, 16 September 2015 at 20:35:16 UTC, Wyatt wrote:
> On Wednesday, 16 September 2015 at 08:53:24 UTC, Ola Fosheim Grøstad wrote:
>>
>> I don't think he is downplaying it. He has said that it will probably take at least 10 years before it is available in hardware. There is also a company called Rex Computing that is looking at unum:
>>
> Oh hey, I remember these jokers.  They were trying to blow some smoke about moving 288 GB/s at 4W.  They're looking at unum?  Of course they are; care to guess who's advising them?  Yep.
>
> I'll be shocked if they ever even get to tape out.

Yes, of course, most hardware startups don't succeed. I assume they get know-how from Adapteva.

September 16, 2015
On Wednesday, 16 September 2015 at 20:53:37 UTC, deadalnix wrote:
> On Wednesday, 16 September 2015 at 20:30:36 UTC, Ola Fosheim Grøstad wrote:
>> On Wednesday, 16 September 2015 at 20:06:43 UTC, deadalnix wrote:
>>> You know, when you have no idea what you are talking about, you can just move on to something you understand.
>>
>> Ah, nice move. Back to your usual habits?
>>
>
> Stop

OK. I stop. You are beyond reason.

September 16, 2015
On Wednesday, 16 September 2015 at 21:12:11 UTC, Ola Fosheim Grøstad wrote:
> On Wednesday, 16 September 2015 at 20:53:37 UTC, deadalnix wrote:
>> On Wednesday, 16 September 2015 at 20:30:36 UTC, Ola Fosheim Grøstad wrote:
>>> On Wednesday, 16 September 2015 at 20:06:43 UTC, deadalnix wrote:
>>>> You know, when you have no idea what you are talking about, you can just move on to something you understand.
>>>
>>> Ah, nice move. Back to your usual habits?
>>>
>>
>> Stop
>
> OK. I stop. You are beyond reason.

True, how blind I was. It is fairly obvious now, thinking about it, that you can get a three-orders-of-magnitude increase in sequential decoding in hardware by having a compiler with a vectorized SSA and a scratchpad!

Or maybe you have numbers to present that show I'm wrong?

September 16, 2015
On Wed, Sep 16, 2015 at 08:06:42PM +0000, deadalnix via Digitalmars-d wrote: [...]
> When you have a floating-point unit, you get your 32 bits: 23 bits go into the mantissa FU, 8 into the exponent FU, and 1 is the sign. For instance, if you multiply floats, you send the two exponents into an adder, you send the two mantissas into a 24-bit multiplier (you add a leading 1), and you XOR the sign bits.
>
> You look at the carry out of the multiplier (or, equivalently, count the leading zeroes of the 48-bit multiply result), shift by that amount, and add the shift to the exponent.
>
> If you get a carry in the exponent adder, you saturate and emit an infinity.
>
> Each bit goes into a given functional unit. That means you need one wire from the input to the functional unit it goes to. Same for the results.
>
> Now, if the format is variadic, you need to wire all bits to all functional units, because any bit can potentially end up in any of them. That's a lot of wire; in fact, the number of wires grows quadratically (n input bits times n possible destinations) with that joke.
>
> The author keeps repeating that wire has become the expensive thing, and he is right. Which means a solution with quadratic wiring is not going to cut it.

I found this .pdf that explains the unum representation a bit more:

	http://sites.ieee.org/scv-cs/files/2013/03/Right-SizingPrecision1.pdf

On p.31, you can see the binary representation of unum. The utag has 3 bits for exponent size, presumably meaning the exponent can vary in size up to 7 bits.  There are 5 bits in the utag for the mantissa, so it can be anywhere from 0 to 31 bits.

It's not completely variadic, but it's complex enough that you will probably need some kind of shift register to extract the exponent and mantissa so that you can pass them in the right format to the various parts of the hardware.  It definitely won't be as straightforward as the current floating-point format; you can't just wire the bits directly to the adders and multipliers. This is probably what the author meant by needing "more transistors".  I guess his point was that we have to do more work in the CPU, but in return we (hopefully) reduce the traffic to DRAM, thereby saving the cost of data transfer.
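For illustration, here's roughly what that extraction step looks like in C, taking the slide's fields at face value (exponent size 0-7 bits, fraction size 0-31 bits, utag trailing; Gustafson's book actually stores es-1 and fs-1, but the idea is the same, and the struct and names here are mine, made up for the example):

    #include <stdint.h>

    /* Hypothetical decoded form of a unum. The packing assumed here is
       sign | exponent | fraction | utag, with the 9-bit utag in the low
       bits: ubit (1), exponent size (3), fraction size (5). */
    typedef struct {
        uint32_t sign;
        uint32_t exponent;   /* es bits wide */
        uint64_t fraction;   /* fs bits wide */
        uint32_t ubit;       /* 1 = inexact (open interval) */
        uint32_t es, fs;
    } Unum;

    static Unum unum_decode(uint64_t bits)
    {
        Unum u;
        /* The utag is fixed-width, so it can be peeled off first; only
           after reading it do we know how wide the other fields are. */
        u.fs   = (uint32_t)( bits       & 0x1F);
        u.es   = (uint32_t)((bits >> 5) & 0x07);
        u.ubit = (uint32_t)((bits >> 8) & 0x01);

        /* Everything below is a data-dependent shift -- the "shift
           register" part that plain floats don't need. */
        uint64_t payload = bits >> 9;
        u.fraction =  payload & ((1ULL << u.fs) - 1);
        u.exponent = (uint32_t)((payload >> u.fs) & ((1u << u.es) - 1));
        u.sign     = (uint32_t)((payload >> (u.fs + u.es)) & 1);
        return u;
    }

Every shift amount depends on data read earlier from the same word, which is exactly why this won't wire up as directly as IEEE 754 does.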

I'm not so sure how well this will work in practice, though, unless we have a working prototype that proves the benefits.  What if you have a 10*10 unum matrix, and during some operation the size of the unums in the matrix changes?  Assuming the worst case, you could have started out with 10*10 unums with small exponent/mantissa, maybe fitting in 2-3 cache lines, but after the operation most of the entries expand to 7-bit exponent and 31-bit mantissa, so now your matrix doesn't fit into the allocated memory anymore.  So now your hardware has to talk to druntime to have it allocate new memory for storing the resulting unum matrix?
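Just to put numbers on that 10*10 case (my back-of-the-envelope figures, not from the slides), again taking the fields at face value, so the 9-bit utag bounds a unum between 1+0+0+9 = 10 bits and 1+7+31+9 = 48 bits:

    #include <stdio.h>

    /* Back-of-the-envelope sizing of a 10x10 unum matrix, assuming the
       utag layout from the slides (illustrative numbers only). */
    int main(void)
    {
        const int utag_bits = 1 + 3 + 5;               /* ubit + es + fs  */
        const int min_bits  = 1 + 0 + 0  + utag_bits;  /* 10-bit unum     */
        const int max_bits  = 1 + 7 + 31 + utag_bits;  /* 48-bit unum     */
        const int entries   = 10 * 10;

        /* Best case: 1000 bits = 125 bytes, about 2 cache lines of 64 B.
           Worst case: 4800 bits = 600 bytes, nearly 10 cache lines. */
        printf("best case:  %d bits (%d bytes)\n",
               entries * min_bits, entries * min_bits / 8);
        printf("worst case: %d bits (%d bytes)\n",
               entries * max_bits, entries * max_bits / 8);
        return 0;
    }

That's nearly a 5x spread between the allocation you started with and the one you might end up needing.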

The only sensible solution seems to be to allocate the maximum size for each matrix entry, so that if the value changes you won't run out of space.  But that means we have lost the benefit of having a variadic encoding to begin with -- you will have to transfer the maximum size's worth of data when you load the matrix from DRAM, even if most of that data is unused (because the unum only takes up a small percentage of the space).  The author proposed GC, but I have a hard time imagining a GC implemented in the *CPU*, no less, colliding with the rest of the world where it's the *software* that controls DRAM allocation.  (GC too slow for your application? Too bad, gotta upgrade your CPU...)

The way I see it from reading the PDF slides, is that what the author is proposing would work well as a *software* library, perhaps backed up by hardware support for some of the lower-level primitives.  I'm a bit skeptical of the claims of data traffic / power savings, unless there is hard data to prove that it works.


T

-- 
"The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."