December 06, 2013
Paulo Pinto:

> That is why most safe systems programming language compilers allow disabling bounds checking. :)

Disabling bounds checking (BC) is an admission of defeat (or just of practicality over technical refinement).

Various languages approach the situation in different ways, some examples:
- Python has BC, and it can't be disabled.
- Java has BC, but thanks to the large engineering effort put into the OracleVM, some array accesses are proven to be in bounds and their checks are removed (some BC is also removed thanks to inlining). The effect of this optimization is visible in matrix-processing code, etc. (It's not a large effect, but it's useful, and it could be handy to have in D too.)
- D has BC, but you can disable it for each compilation unit.
- Ada is like D, but its stronger typing (and the strongly typed style most Ada programs are written in, because it's also a matter of how you use a language) allows the compiler to optimize away some BC by very simple means, without the need for the analysis done by the OracleVM.
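For completeness, here is how the D side of this looks in practice (a sketch; the exact flag and behavior depend on the compiler and its version):

```d
// Bounds checking in D, and two ways around it (dmd assumed;
// flag names can differ between compilers and versions).
void main()
{
    auto a = [1, 2, 3];
    // a[5] = 0;         // out of bounds: throws a RangeError
    auto x = a.ptr[1];   // raw pointer indexing skips the check
    // Compiling the unit with `dmd -noboundscheck` removes the
    // checks from the whole compilation unit.
}
```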

Bye,
bearophile
December 06, 2013
On 12/6/2013 3:06 PM, Maxim Fomin wrote:
> and what about holes in immutable, pure and rest type system?

If there are bugs in the type system, then that optimization breaks.

> C doesn't have virtual functions.

Right, but you can (and people do) fake virtual functions with tables of function pointers. No, C doesn't devirtualize those.

> By the way, does D devirtualize them?

It does for classes/methods marked 'final' and also in cases where it can statically tell that a class instance is the most derived type.
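A small sketch of the two cases (hypothetical class names):

```d
class Base { void f() { } }

final class Derived : Base
{
    override void f() { }
}

void viaFinal(Derived d)
{
    d.f();  // Derived is final: no vtable lookup is needed,
            // and the call becomes inlinable.
}

void viaStaticType()
{
    auto d = new Derived;  // most derived type known statically
    d.f();                 // also eligible for devirtualization
}
```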

December 06, 2013
Another thing to keep in mind is that C is no longer quite the gold standard for code performance. Today people who want to write fast code often use parallel algorithms on GPUs with CUDA/OpenCL (which look like C with extras), and when they run on CPUs they need to use all cores efficiently, plus the SIMD units of the CPUs. (See the ideas of the Intel language compiled by the ispc compiler: http://ispc.github.io/  It contains ideas useful for improving the D language. D is lacking here.)

Bye,
bearophile
December 06, 2013
Walter Bright:

> It does for classes/methods marked 'final' and also in cases where it can statically tell that a class instance is the most derived type.

Recently I have seen this through Reddit (with a comment by Anon):

http://eli.thegreenplace.net/2013/12/05/the-cost-of-dynamic-virtual-calls-vs-static-crtp-dispatch-in-c/

The JavaVM is often able to de-virtualize virtual calls.


Regarding Java performance, from my experience another significant and often overlooked source of optimization in the JavaVM is that it is able to partially unroll even loops with a statically unknown number of iterations. Currently I think GCC/DMD/LDC2 are not able (or not willing) to do this. I think LLVM was working on this problem a little.

Bye,
bearophile
December 06, 2013
On Fri, Dec 06, 2013 at 02:20:22PM -0800, Walter Bright wrote:
> 
> "there is no way proper C code can be slower than those languages."
> 
>   -- http://www.reddit.com/r/programming/comments/1s5ze3/benchmarking_d_vs_go_vs_erlang_vs_c_for_mqtt/cduwwoy
> 
> comes up now and then. I think it's incorrect, D has many inherent advantages in generating code over C:
> 
> 1. D knows when data is immutable. C has to always make worst case assumptions, and assume indirectly accessed data mutates.

Does the compiler currently take advantage of this, e.g., in aliasing analysis?


> 2. D knows when functions are pure. C has to make worst case assumptions.

Does the compiler currently take advantage of this?


> 3. Function inlining has generally been shown to be of tremendous value in optimization. D has access to all the source code in the program, or at least as much as you're willing to show it, and can inline across modules. C cannot inline functions unless they appear in the same module or in .h files. It's a rare practice to push many functions into .h files. Of course, there are now linkers that can do whole program optimization for C, but those are kind of herculean efforts to work around that C limitation of being able to see only one module at a time.

To be frank, dmd's current inlining capabilities are rather disappointing. However, gdc IME has shown better track record in that area, so it looks promising.


> 4. C strings are 0-terminated, D strings have a length property. The former has major negative performance consequences:
> 
>     a. lots of strlen()'s are necessary

Yeah, this is a common C bottleneck that most obsessively-optimizing C coders overlook. Many a time profiling of my "optimal" code turned up surprising bottlenecks, like fprintf's in the wrong place, or strlen's in inner loops. D's strings may be "heavier" (in the sense that they require a length field as opposed to a mere pointer -- y'know, the kind of stuff obsessively-optimizing C coders lose sleep over), but it comes with so many advantages that it's more than worth the price.


>     b. using substrings usually requires a malloc/copy/free sequence

Yeah, string manipulation in C (and to a great extent C++) is a royal pain. It actually drove me to use Perl for non-performance critical string manipulating code for quite a good number of years. The heavy syntax required for using a regex library in C, the memory management nightmare of keeping track of substrings, all contribute to fragile code, increased development times, and poorer performance. Only obsessively-optimizing C coders can fail to see this (and I used to be among them). ;-)
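The contrast in points (a) and (b) fits in a few lines of D (a sketch; the C equivalents are shown as comments for comparison):

```d
// C:  size_t n = strlen(s);          // O(n) scan for '\0'
//     char* sub = malloc(n - 1);     // substring needs a full
//     memcpy(sub, s + 1, n - 2); ... // malloc/copy/free cycle
void demo(string s)
{
    auto n = s.length;         // O(1): the length is stored
    auto sub = s[1 .. $ - 1];  // O(1): a view, no allocation or copy
}
```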


> 5. CTFE can push a lot of computation to compile time rather than run time. This has had spectacular positive performance consequences for things like regex. C has no CTFE ability.

D's compile-time capabilities were one of the major factors in my adoption of D. The fact that it allows one to write regex code with nice syntax yet zero runtime overhead is awesome.
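For instance (assuming a std.regex recent enough to have ctRegex and matchFirst):

```d
import std.regex;

// The pattern is parsed and turned into specialized matcher code
// at compile time; at run time only that generated code executes.
enum iso = ctRegex!`^\d{4}-\d{2}-\d{2}$`;

bool isIsoDate(string s)
{
    return !matchFirst(s, iso).empty;
}
```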

One of my favorite demonstrations of D's compile-time capabilities is in TDPL, where Andrei describes a PRNG that checks for poor PRNG parameters at compile time, and stops compilation if the checks fail. So if it compiles, you're guaranteed the PRNG is good. In C? Well, you google for prime numbers and stick randomly-chosen ones into your PRNG parameters, and cross your fingers and hope for the best... (Or at best, the code will do the checks at runtime and abort, but honestly, *what* C coder would do such a thing?)
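In the spirit of that TDPL example (a simplified sketch; the real conditions for a good LCG are stricter than these checks):

```d
struct Lcg(uint a, uint c, uint m)
{
    // Bad parameters are rejected before the program can compile.
    static assert(m > 0, "modulus must be positive");
    static assert(0 < a && a < m && c < m, "parameters out of range");

    uint x = 1;
    uint next() { return x = cast(uint)((cast(ulong)a * x + c) % m); }
}

alias Ok = Lcg!(48271, 0, 2147483647);
// alias Bad = Lcg!(0, 12345, 42);  // fails at compile time
```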


> 6. D's array slicing coupled with GC means that many malloc/copy/free's normally done in C are unnecessary in D.

It does mean we need to get our act together and improve the GC's performance, though. ;-)

But this is a major advantage in array manipulation (esp. string manipulation) in D. In equivalent C code, best practices dictate that slicing should always be done by allocating a new array and copying, because otherwise you introduce intricate dependencies between pointers and your code either becomes wrong (leaks memory / has dangling pointers), or extremely convoluted (reinvent the GC with reference counting, etc.). Being able to freely slice any array you want without needing to keep track of every single pointer, is a big savings in development time, not to mention better runtime performance because you don't have to keep allocating & copying. All that copying does add up!

There's also another aspect to this: being built into the language, D slices allow different pieces of code to freely exchange them without needing to worry about memory management. In equivalent C code, an efficient implementation of slices would require the use of custom structs (as a pair of pointers, or a pointer + length ala D), or specific function parameter conventions (e.g. always pass pointer + length). Since everyone will implement this in a slightly different way, it makes interoperating different bits of code a pain. One function takes a pointer + length, another function takes a pointer pair, a third function takes a custom struct that encapsulates the foregoing, and a fourth function takes *another* custom struct that does the same thing but with a different implementation -- so now to make the code work together, you have to insert intermediate layers of code to convert between the representations. It's a big development time sink, and yet another corner for bugs to hide in. In D, since slices are built-in, everybody takes a slice, no interconversion is necessary, and everybody is happy.
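A sketch of slices as the one common currency (hypothetical function names):

```d
// Everybody takes a slice; no pointer+length or custom-struct
// conventions to convert between at module boundaries.
size_t countSpaces(const(char)[] s)
{
    size_t n;
    foreach (c; s)
        if (c == ' ')
            ++n;
    return n;
}

void demo()
{
    string msg = "hello wide world";
    auto n = countSpaces(msg);          // the whole string
    auto m = countSpaces(msg[6 .. $]);  // a view into it, no copy
}
```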


> 7. D's "final switch" enables more efficient switch code generation, because the default doesn't have to be considered.

Not to mention the maintenance advantages of final switch: modify the enum, and the compiler automatically points you to all the places in the code where the new enum value has to be handled. In C, you just modify the enum, and everything still compiles, except now things don't quite work until you add the new value to every switch statement that needs to handle it. But it's easy to miss one or two places where this is needed, so you have latent bugs that only show up through thorough testing. (And everybody knows that C codebases always come with thorough unittests, right? *ahem* More likely, these bugs go unnoticed until it blows up in a customer's production environment, then you have to pull an all-nighter to track down what's causing it.)

It's not surprising that this led to the belief that switch statements are evil, as any OO aficionado would tell you. Well, *I* say that they're evil only because C's implementation of them is so primitive. In D, they can actually be a good thing!

(Of course, you can still screw yourself over in D by inserting empty default cases into the wrong switch statements, but you have to actively shoot yourself in the foot to do it, in which case it's really your own fault, not the language's.)
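A minimal final switch sketch:

```d
enum Color { red, green, blue }

int channel(Color c)
{
    // No default case: add a member to Color and every
    // `final switch` that doesn't handle it stops compiling.
    final switch (c)
    {
        case Color.red:   return 0;
        case Color.green: return 1;
        case Color.blue:  return 2;
    }
}
```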

//

And I'd also add 8.: most applications can get by with using a GC, and this actually lets you write more efficient code.

One particular example I have in mind is a parse function that returns an AST. In C, the AST nodes would be malloc'd, of course, and then everyone downstream will have to make sure they call free on each and every AST node in order to avoid memory leakage. In the first version of the code, this is no biggie: just write a function called freeAST() or some such, and make sure everyone calls it. But then you want some parts of the code to hold on to bits of the AST -- say, store a function body with its corresponding symbol table entry -- and now you have to either recursively copy the AST (inefficient), or transplant it from the main AST and make sure you NULL the pointers in the parent tree so that it doesn't get double-freed, or some other such manual technique (very error-prone). But what if *two* places need to hold on to the same AST subtree? Conventional C wisdom says, duplicate it, since otherwise you'll have a major headache later on trying to keep track of when you can free something. Or reinvent the GC with reference-counting, etc., which you'll eventually have to once parts of the trees start cross-referencing each other.

And later yet, you add error-handling, and now you have to make sure every time an error occurs, you call freeAST() on partially-constructed AST subtrees, and soon your initially simple code turns into a spaghetti of "if(error) goto cleanupAST". Miss just *one* of these cases, and you've a memory leak.

In D? Just allocate the nodes and return them freely, and let callers cross-reference them as much as they like -- the GC will clean up after you. Faster to code, less error-prone, and more performant (avoids duplicating AST subtrees, avoids complicated bookkeeping to keep track of when a node can be freed, etc.). Sure the GC will have to pause your program to run its collection cycles -- but that's not much different from a recursive free() of an AST subtree, which doesn't guarantee an upper bound on running time (if your tree is 1 million nodes, then it has to do 1 million free's, right then, right there, and your program's gonna wait for that to finish, no matter what -- the thought that manual memory management is pause-free is a myth).
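A sketch of that, with a hypothetical Node type and parse function:

```d
class Node
{
    string value;
    Node[] children;
    this(string v, Node[] kids = null) { value = v; children = kids; }
}

Node parseExpr()  // hypothetical parser entry point
{
    auto lhs = new Node("x");
    auto rhs = new Node("1");
    return new Node("+", [lhs, rhs]);
}

void demo()
{
    auto ast = parseExpr();
    auto kept = ast.children[0];  // keep a subtree: no copy, no
                                  // NULLing of parent pointers, no
                                  // freeAST(); unreachable nodes
                                  // are simply collected.
}
```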

It's true that manual memory management targeted for your specific application will always outperform a generic GC, since it can take advantage of your application's specific usage patterns. But 95% of the time, you don't *need* this kind of tweaking; yet in the C world, you have no choice but to manually manage everything. And chances are, your ad hoc implementation will perform poorer than the GC. After all, not everybody is an expert at writing memory management code. And, in a team project, expect people to screw up and introduce poor performance and/or pointer bugs, even if you've tweaked the thing to perfection. Besides, it's so much effort for only a little gain. C coders who have never used a GC'd language don't know what they're missing, when it comes to ease of dealing with complex cross-referencing data structures.


T

-- 
Indifference will certainly be the downfall of mankind, but who cares? -- Miquel van Smoorenburg
December 06, 2013
On 12/6/2013 3:40 PM, bearophile wrote:
> Recently I have seen this through Reddit (with a comment by Anon):
>
> http://eli.thegreenplace.net/2013/12/05/the-cost-of-dynamic-virtual-calls-vs-static-crtp-dispatch-in-c/
>
> The JavaVM is often able to de-virtualize virtual calls.

I know. It is an advantage that JITing has. It's also an advantage if you can do whole-program analysis, which can easily be done in Java.

December 06, 2013
On Fri, Dec 06, 2013 at 03:19:24PM -0800, Walter Bright wrote:
> On 12/6/2013 3:02 PM, Maxim Fomin wrote:
[...]
> >Such advantages are offset by:
> >
> >- huge runtime library
> 
> C has a huge runtime library, too, it's just that you normally don't notice it because it's not statically linked in. Be that as it may, 2.064 substantially reduced the size of "hello world" programs.

Are there any upcoming further improvements in this area? It would be nice to not have multi-MB programs that only print "hello world". :) (Not that that's any meaningful indicator of typical program size, since typically programs do more than just print "hello world", but still, I think there's still lots of low-hanging fruit here.)


[...]
> >- phobos snowball - one invocation of some function in standard library leads to dozens template instantiations and invocations of pretty much stuff
> 
> True enough, but does that lead to non-performant code? 2.064 cuts that down quite a bit anyway, and I think we can make more improvements in this regard.

It would be nice to decouple Phobos modules more. A *lot* more. Currently there is a rather nasty tangle of mutual imports between several large modules (e.g., std.stdio, std.format, std.algorithm, std.conv, and a few others). Import just one of them, and it pulls in *everything*. This goes against the Phobos philosophy as advertised on dlang.org -- that dependencies between modules should be minimal.

One low-hanging fruit that comes to mind is to use local imports instead of module-wide imports. If the local imports are inside templated functions, I *think* it would prevent pulling in the imports until the function is actually used, which would have the desired effect. (Right?) Much of Phobos was written before we had this feature, but since we have it now, might as well make good use of it.
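A sketch of the pattern (hypothetical function; whether the import is truly deferred depends on the compiler):

```d
// Module-wide import: everyone importing this module pays for it.
// import std.conv;

string toCsv(T)(T[] values)
{
    import std.conv : to;  // local import inside a template:
                           // only pulled in on instantiation
    string s;
    foreach (i, v; values)
        s ~= (i ? "," : "") ~ to!string(v);
    return s;
}
```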


T

-- 
"A one-question geek test. If you get the joke, you're a geek: Seen on a California license plate on a VW Beetle: 'FEATURE'..." -- Joshua D. Wachs - Natural Intelligence, Inc.
December 06, 2013
On 12/6/2013 3:39 PM, H. S. Teoh wrote:
>> 1. D knows when data is immutable. C has to always make worst case
>> assumptions, and assume indirectly accessed data mutates.
>
> Does the compiler currently take advantage of this, e.g., in aliasing
> analysis?

I'm pretty sure dmd does, don't know about others.
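As an illustration of the opportunity immutability creates (a sketch of what the language permits, not a claim about what dmd emits today):

```d
int f(immutable(int)* p, int* q)
{
    int a = *p;
    *q = 5;        // cannot affect *p: immutable data never mutates
    return a + *p; // so *p may legally be kept in a register;
                   // the equivalent C code must reload it
}
```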


>> 2. D knows when functions are pure. C has to make worst case assumptions.
>
> Does the compiler currently take advantage of this?

dmd does.
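A sketch of the kind of optimization purity enables:

```d
pure int sq(int x) { return x * x; }

int f(int x)
{
    // sq is pure and receives the same argument, so the two calls
    // form a common subexpression: one of them can be elided.
    // A C compiler must assume the second call may behave differently.
    return sq(x) + sq(x);
}
```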


>> 3. Function inlining has generally been shown to be of tremendous
>> value in optimization. D has access to all the source code in the
>> program, or at least as much as you're willing to show it, and can
>> inline across modules. C cannot inline functions unless they appear
>> in the same module or in .h files. It's a rare practice to push many
>> functions into .h files. Of course, there are now linkers that can
>> do whole program optimization for C, but those are kind of herculean
>> efforts to work around that C limitation of being able to see only
>> one module at a time.
>
> To be frank, dmd's current inlining capabilities are rather
> disappointing. However, gdc IME has shown better track record in that
> area, so it looks promising.

This is about inherent language opportunities, not whether current implementations fall short or not.



>> 4. C strings are 0-terminated, D strings have a length property. The
>> former has major negative performance consequences:
>>
>>      a. lots of strlen()'s are necessary
>
> Yeah, this is a common C bottleneck that most obsessively-optimizing C
> coders overlook. Many a time profiling of my "optimal" code turned up
> surprising bottlenecks, like fprintf's in the wrong place, or strlen's
> in inner loops. D's strings may be "heavier" (in the sense that they
> require a length field as opposed to a mere pointer -- y'know, the kind
> of stuff obsessively-optimizing C coders lose sleep over), but it comes
> with so many advantages that it's more than worth the price.

Yup.


>>      b. using substrings usually requires a malloc/copy/free sequence
>
> Yeah, string manipulation in C (and to a great extent C++) is a royal
> pain. It actually drove me to use Perl for non-performance critical
> string manipulating code for quite a good number of years. The heavy
> syntax required for using a regex library in C, the memory management
> nightmare of keeping track of substrings, all contribute to fragile
> code, increased development times, and poorer performance. Only
> obsessively-optimizing C coders can fail to see this (and I used to be
> among them). ;-)

I failed to recognize the problem for what it was for years, too. I think the 0 terminated string issue is a severe fault in C, and it was carried over into C++'s std::string. C++ missed a big opportunity there.


>> 6. D's array slicing coupled with GC means that many
>> malloc/copy/free's normally done in C are unnecessary in D.
>
> It does mean we need to get our act together and improve the GC's
> performance, though. ;-)

The best memory allocation algorithm is one that doesn't do allocations at all. While this may sound trite, I think it is a crucial insight. The mere existence of the GC enables many allocations to not be done. The other thing is that D's ranges can also be used to eliminate a lot of low level allocations.
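A sketch of a range pipeline where no intermediate arrays are allocated at all:

```d
import std.algorithm : filter, map;
import std.range : iota;

// Elements are computed lazily as the result is consumed;
// nothing in this pipeline touches the heap.
auto evenSquares(int n)
{
    return iota(n)
        .filter!(x => (x & 1) == 0)
        .map!(x => x * x);
}
```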


> But this is a major advantage in array manipulation (esp. string
> manipulation) in D. In equivalent C code, best practices dictate that
> slicing should always be done by allocating a new array and copying,
> because otherwise you introduce intricate dependencies between pointers
> and your code either becomes wrong (leaks memory / has dangling
> pointers), or extremely convoluted (reinvent the GC with reference
> counting, etc.). Being able to freely slice any array you want without
> needing to keep track of every single pointer, is a big savings in
> development time, not to mention better runtime performance because you
> don't have to keep allocating & copying. All that copying does add up!

Exactly.


> There's also another aspect to this: being built into the language, D
> slices allow different pieces of code to freely exchange them without
> needing to worry about memory management. In equivalent C code, an
> efficient implementation of slices would require the use of custom
> structs (as a pair of pointers, or a pointer + length ala D), or
> specific function parameter conventions (e.g. always pass pointer +
> length). Since everyone will implement this in a slightly different way,
> it makes interoperating different bits of code a pain. One function
> takes a pointer + length, another function takes a pointer pair, a third
> function takes a custom struct that encapsulates the foregoing, and a
> fourth function takes *another* custom struct that does the same thing
> but with a different implementation -- so now to make the code work
> together, you have to insert intermediate layers of code to convert
> between the representations. It's a big development time sink, and yet
> another corner for bugs to hide in. In D, since slices are built-in,
> everybody takes a slice, no interconversion is necessary, and everybody
> is happy.

Yup.

December 06, 2013
H. S. Teoh:

>(if your tree is 1 million nodes, then it
> has to do 1 million free's, right then, right there,

In practice real C programs use arenas and pools to allocate the nodes from. This sometimes doubles the performance of C code that has to allocate many nodes of a tree data structure. A simple example:

http://rosettacode.org/wiki/Self-referential_sequence#Faster_Low-level_Version

Some such code will become useless once Phobos has Andrei's allocators :-)

In C sometimes you also use hierarchical memory allocation to simplify memory management (http://swapped.cc/?_escaped_fragment_=/halloc#!/halloc ); this is not currently supported by Andrei's allocators.
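A minimal arena sketch in D terms (simplified; real pool allocators handle growth, per-type alignment, destructors, etc.):

```d
struct Arena
{
    ubyte[] buf;  // one big block, e.g. new ubyte[](1 << 20)
    size_t used;

    void* alloc(size_t n)
    {
        n = (n + 15) & ~cast(size_t)15;  // keep 16-byte alignment
        assert(used + n <= buf.length, "arena exhausted");
        void* p = buf.ptr + used;  // carve the node out of the block
        used += n;
        return p;
    }

    void releaseAll() { used = 0; }  // frees every node at once
}
```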

Bye,
bearophile
December 06, 2013
On Sat, Dec 07, 2013 at 12:40:35AM +0100, bearophile wrote: [...]
> Regarding Java performance, from my experience another significant and often overlooked source of optimization in the JavaVM is that it is able to partially unroll even loops with a statically unknown number of iterations. Currently I think GCC/DMD/LDC2 are not able (or not willing) to do this.
[...]

Really? I've seen gcc/gdc unroll loops with an unknown number of iterations, esp. when you're using -O3. It just unrolls into something like:

	loop_start:
		if (!loopCondition) goto end;
		loopBody();
		if (!loopCondition) goto end;
		loopBody();
		if (!loopCondition) goto end;
		loopBody();
		if (!loopCondition) goto end;
		loopBody();
		goto loop_start;
	end:	...

I'm pretty sure I've seen gcc/gdc do this before.


T

-- 
Windows 95 was a joke, and Windows 98 was the punchline.