October 24, 2019
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
> I discovered something very interesting: GNU wc was generally on par with, or outperformed the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines.
>
> Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit hack based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm.  So when the lines were long, the overall performance was dominated by the 8-byte at a time scanning code, which was very fast for large buffers.  However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.
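
(For concreteness, the word-at-a-time idea described above looks roughly like the sketch below. This is a minimal illustration of the technique, not the actual glibc code; it assumes a little-endian target and 8-byte words.)

```d
import core.bitop : bsf;

// Minimal sketch of a word-at-a-time byte search (not glibc's code).
// Unaligned head and tail are handled byte by byte; the aligned middle
// is scanned 8 bytes per iteration using the "has zero byte" bit trick.
size_t indexOfByte(const(ubyte)[] haystack, ubyte needle)
{
    size_t i = 0;

    // Byte-by-byte until the pointer is 8-byte aligned.
    while (i < haystack.length && (cast(size_t) &haystack[i]) % 8 != 0)
    {
        if (haystack[i] == needle) return i;
        ++i;
    }

    // Broadcast the needle into every byte of a 64-bit word.
    immutable ulong pattern = 0x0101010101010101UL * needle;

    // 8 bytes at a time: (x - lo) & ~x & hi is nonzero iff x has a zero byte,
    // i.e. iff some byte of the word equals the needle.
    while (i + 8 <= haystack.length)
    {
        ulong word = *cast(const(ulong)*) &haystack[i];
        ulong x = word ^ pattern;
        ulong found = (x - 0x0101010101010101UL) & ~x & 0x8080808080808080UL;
        if (found != 0)
            return i + bsf(found) / 8;  // little-endian: lowest set bit = first match
        i += 8;
    }

    // Byte-by-byte for the remaining tail.
    for (; i < haystack.length; ++i)
        if (haystack[i] == needle) return i;

    return size_t.max;  // not found
}
```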

Interesting observation. On the surface it seems this might also apply to splitter and find when used on narrow strings. I believe these call memchr on narrow strings. A common paradigm is to read lines, then call splitter to identify individual fields. Fields are often short, even when lines are long.
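
Something like this hypothetical snippet, assuming tab-separated input (the field counting is just there to have something to do with the fields):

```d
import std.stdio : stdin, writeln;
import std.algorithm : splitter;

void main()
{
    size_t fieldCount = 0;
    foreach (line; stdin.byLine)              // lines may be long...
    {
        foreach (field; line.splitter('\t'))  // ...but fields are often short
            ++fieldCount;
    }
    writeln("fields: ", fieldCount);
}
```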

--Jon


October 27, 2019
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
> Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
>
> T

Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.

It is also a bit analogous to the GC vs. deterministic manual memory management debate.
October 27, 2019
27.10.2019 23:11, Mark wrote:
> On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
>> Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
>>
>> T
> 
> Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.
> 
> It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't the Itanium failure caused by AMD offering another architecture that could run existing software, whereas Itanium forced users to recompile their source code? So Itanium failed simply because it was incompatible with x86?
October 27, 2019
On 10/23/19 5:37 PM, welkam wrote:
> I watched many of his talks and he frequently talks about optimizations that produce single-digit % speedups in frequently used algorithms, but doesn't provide adequate proof that his change in the algorithm was the reason we see the performance differences. Modern CPUs are sensitive to many things, and one of them is code layout in memory. Hot loops are the most susceptible to this, to the point where changing the user name under which the executable is run changes performance. The paper below goes deeper into this.
> 
> Producing Wrong Data Without Doing Anything Obviously Wrong!
> https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf 

I know of the paper and its follow-up by Berger et al. Thanks.

October 28, 2019
On 28/10/2019 9:39 AM, drug wrote:
> 27.10.2019 23:11, Mark wrote:
>> On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
>>> Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
>>>
>>> T
>>
>> Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.
>>
>> It is also a bit analogous to the GC vs. deterministic manual memory management debate.
> Wasn't the Itanium failure caused by AMD offering another architecture that could run existing software, whereas Itanium forced users to recompile their source code? So Itanium failed simply because it was incompatible with x86?

Intel created Itanium.
AMD instead created AMD64 aka x86_64.
October 28, 2019
On 28/10/2019 9:11 AM, Mark wrote:
> On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
>> Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
>>
>> T
> 
> Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.
> 
> It is also a bit analogous to the GC vs. deterministic manual memory management debate.

I have described modern x86 CPUs as an application VM, so I will agree with you :)
October 31, 2019
On Sunday, 27 October 2019 at 20:11:38 UTC, Mark wrote:
> Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions?

Old CISC CPUs did just that, so they could have high level instructions, and were reprogrammable... (I guess x86 also has that feature, at least to some extent).

Then RISC CPUs came in the 90s and didn't do that, thus they were faster and more compact as they could throw out the decoder (the bits in the instructions were carefully designed so that the decoding was instantaneous). But then memory bandwidth became an issue and developers started to write more and more bloated software...

x86 is an old CISC architecture and simply survives because of market dominance and R&D investments. Also, with increased real estate (more transistors) they can sacrifice lots of space for the instruction decoding...

The major change over the past 40 years that is causing sensitivity to instruction ordering is that modern CPUs can have deep pipelines (executing many instructions at the same time in a long staging queue), that they are superscalar (execute instructions in parallel), execute instructions speculatively (execute instructions even though the result might be discarded later), do tight-loop instruction unrolling before pipelining, and have various schemes for branch prediction (so that they execute the right sequence after a branch before they know what the branch condition looks like).
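
To make the branch prediction point concrete, here is a hypothetical D micro-benchmark (my own sketch, not taken from anywhere): the same loop over the same values tends to run noticeably faster once the data is sorted, because the branch becomes predictable. Results vary by CPU, and an aggressive optimizer may replace the branch with a conditional move or vector code, hiding the effect entirely.

```d
import std.algorithm : sort;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.random : uniform;
import std.stdio : writefln;

long sumAbove(const int[] data)
{
    long sum = 0;
    foreach (x; data)
        if (x >= 128)   // hard to predict on random data, trivial on sorted data
            sum += x;
    return sum;
}

void main()
{
    auto data = new int[](1_000_000);
    foreach (ref x; data) x = uniform(0, 256);

    auto sw = StopWatch(AutoStart.yes);
    auto r1 = sumAbove(data);
    writefln("unsorted: %s ms (sum=%s)", sw.peek.total!"msecs", r1);

    data.sort();
    sw.reset();
    auto r2 = sumAbove(data);
    writefln("sorted:   %s ms (sum=%s)", sw.peek.total!"msecs", r2);
}
```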

Is this a good approach? Probably not... You would get much better performance from the same number of transistors by using many simple cores and a clever memory architecture, but that would not work with current software and development practice...

> branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.

VLIW is not a bad concept, re RISC, but perhaps not profitable in terms of R&D.

You could probably get better instruction scheduling if it were determined statically, since the compiler would then have a "perfect model" of how much of the CPU is being utilized and could give programmers feedback on it too. But then you would need to recompile software for the actual CPU, have more advanced compilers, and perhaps write software in a different manner to avoid bad branching patterns.

Existing software code bases and a developer culture that is resistant to change do limit progress...

People pay to have their existing stuff run well; they won't pay if they have to write new stuff in new ways, unless the benefits are extreme (e.g. GPUs)...

October 31, 2019
On 10/28/19 2:03 AM, rikki cattermole wrote:
> On 28/10/2019 9:39 AM, drug wrote:
>> 27.10.2019 23:11, Mark wrote:
>>> On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
>>>> Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
>>>>
>>>> T
>>>
>>> Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.
>>>
>>> It is also a bit analogous to the GC vs. deterministic manual memory management debate.
>> Wasn't the Itanium failure caused by AMD offering another architecture that could run existing software, whereas Itanium forced users to recompile their source code? So Itanium failed simply because it was incompatible with x86?
> 
> Intel created Itanium.
> AMD instead created AMD64 aka x86_64.
That's a well-known fact, I believe.

I meant that Intel planned for Itanium to be the next-generation processor after x86. But AMD extended x86 and created amd64, and Intel's plan failed because to use Itanium you had to recompile everything, whereas with amd64 you could just run your software as before.