June 01, 2017
On Thursday, 1 June 2017 at 04:39:17 UTC, Jonathan M Davis wrote:
> On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via Digitalmars-d-learn wrote:
>> [...]
>
> If you're really trying to make it fast, there may be something that you can do with SIMD. IIRC, Brian Schott did that with his lexer (or maybe he was just talking about it - I don't remember for sure).
>

See my link above to realworldtech. Using SIMD can give good results in micro-benchmarks but completely screw up the performance of other things in practice (the alignment requirements are heavy and result in code bloat, cache misses, TLB misses, the cost of context switches, AVX warm-up time (Agner Fog observed around 10,000 cycles before AVX switches from 128-bit to 256-bit operations), reduced turboing, etc.).
May 31, 2017
On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn wrote:
> On Thursday, 1 June 2017 at 04:39:17 UTC, Jonathan M Davis wrote:
> > On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via Digitalmars-d-learn wrote:
> >> [...]
> >
> > If you're really trying to make it fast, there may be something that you can do with SIMD. IIRC, Brian Schott did that with his lexer (or maybe he was just talking about it - I don't remember for sure).
>
> See my link above to realworldtech. Using SIMD can give good results in micro-benchmarks but completely screw up the performance of other things in practice (the alignment requirements are heavy and result in code bloat, cache misses, TLB misses, the cost of context switches, AVX warm-up time (Agner Fog observed around 10,000 cycles before AVX switches from 128-bit to 256-bit operations), reduced turboing, etc.).

Whenever you attempt more complicated optimizations, it becomes harder to get it right, and you always have the problem of figuring out whether you really did make it better in general. It's the sort of thing that's easier when you have a specific use case and it's very difficult to get right when dealing with a general solution for a standard library. So, it doesn't surprise me at all if a particular optimization turns out to be a bad idea for Phobos even if it's great for some use cases.

- Jonathan M Davis

May 31, 2017
On Wed, May 31, 2017 at 10:05:50PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn
[...]
> > See my link above to realworldtech. Using SIMD can give good results in micro-benchmarks but completely screw up the performance of other things in practice (the alignment requirements are heavy and result in code bloat, cache misses, TLB misses, the cost of context switches, AVX warm-up time (Agner Fog observed around 10,000 cycles before AVX switches from 128-bit to 256-bit operations), reduced turboing, etc.).
> 
> Whenever you attempt more complicated optimizations, it becomes harder to get it right, and you always have the problem of figuring out whether you really did make it better in general. It's the sort of thing that's easier when you have a specific use case and it's very difficult to get right when dealing with a general solution for a standard library. So, it doesn't surprise me at all if a particular optimization turns out to be a bad idea for Phobos even if it's great for some use cases.
[...]

It makes me want to just say, write a naïve loop expressing exactly what you intend to achieve, and let the compiler's optimizer figure out how to best optimize it for your target architecture.

Unfortunately, just earlier today while testing an incomplete version of count() that uses the ulong iteration optimization, I discovered to my horror that ldc (at -O3) apparently recognizes that exact hack and turns the loop into a massive bowl of SSE/MMX/AVX/etc. soup that's many times the size of the "unoptimized" loop.  After reading the thread Patrick pointed out on realworldtech, I'm starting to wonder whether the result is actually faster in practice, or whether it only looks good in benchmarks, because that code bloat is going to add instruction cache pressure, probably TLB misses, etc.  If your program mostly calls count() on smallish arrays (which I'd say is rather likely in the cases that matter, because I can't imagine someone wanting to count bytes in 1MB arrays inside an inner loop -- in inner loops you tend to be working with smaller chunks of data, so you'd want count() to be fast for small to medium-sized arrays), then a small, tight "unoptimized" loop is going to perform much better: it's easily inlineable, it won't add 1KB to your function body, and it's far less likely to overflow the instruction cache.  Plus, reducing the number of complicated branches and other paraphernalia makes the CPU's branch predictor more likely to get it right, so you're less likely to cause pipeline stalls.

Perhaps the right approach is to check if the array length is less than some arbitrary threshold, and just use a naïve loop below that, and only switch to the complicated hackish stuff where you're sure it will actually benefit, rather than hurt, performance.
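Something like the following, as a rough sketch (the 128-byte cutoff is an arbitrary placeholder that would need tuning, and std.algorithm.count merely stands in for whatever the "clever" large-input path ends up being; this is not what Phobos actually does):

	size_t countByte(const(ubyte)[] haystack, ubyte needle)
	{
	    import std.algorithm.searching : count;

	    enum sizeThreshold = 128;   // made-up cutoff; needs real measurements

	    if (haystack.length < sizeThreshold)
	    {
	        // Small input: a plain loop is tiny, easily inlined, and doesn't
	        // bloat the caller or stress the branch predictor.
	        size_t n = 0;
	        foreach (b; haystack)
	            if (b == needle) ++n;
	        return n;
	    }

	    // Large input: hand off to the heavyweight implementation
	    // (ulong-chunked, SIMD, whatever wins on the target).
	    return haystack.count(needle);
	}

That way the common small-array case never pays for the vectorized code at all.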


T

-- 
Not all rumours are as misleading as this one.
June 01, 2017
On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> […]
> 
> An even more down-to-earth counterargument is that if CPU vendors had
> been content with understandable, simple CPU implementations, and
> eschewed "heroic", hard-to-understand things like instruction
> pipelines
> and cache hierarchies, we'd still be stuck with 16 MHz CPU's in 2017.
> 

The people looking at modern, ultra-parallel hardware architectures are indeed looking to use very simple CPUs with ultra-low power use. Just because Moore's Law, the demand for computation, etc., during the 1980s, 1990s and 2000s led to x86_64 with its "heroic" silicon wafer layout doesn't mean that is where we have to stay. That's legacy thinking based on huge investments of capital and the requirement for a company to keep forcing an income stream from its customers. The current state of mainstream hardware is all about not innovating.

The problem with supercomputing just at the moment is that you have to build a power station for each one. The x86_64 and GPGPU approach hasn't hit the end of Moore's Law, it's hit the "we can't supply enough power to run it" wall. The Cloud is also not the answer; it's just an income stream for a couple of companies pretending to continue to innovate.

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

June 01, 2017
On Wednesday, May 31, 2017 22:50:19 H. S. Teoh via Digitalmars-d-learn wrote:
> Perhaps the right approach is to check if the array length is less than some arbitrary threshold, and just use a naïve loop below that, and only switch to the complicated hackish stuff where you're sure it will actually benefit, rather than hurt, performance.

Based on some previous discussions, I think that this is the sort of thing that std.algorithm.sort does (switch algorithms depending on the size of the range to be sorted), but I've never actually verified it.
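The general pattern would look something like this (purely illustrative; the 32-element cutoff is invented, and I'm not claiming this is what Phobos actually does):

	// Top-down merge sort that switches to insertion sort below a small,
	// made-up cutoff: the same "pick the algorithm by size" pattern.
	void hybridSort(T)(T[] a)
	{
	    enum smallThreshold = 32;   // placeholder; real cutoffs are measured

	    if (a.length <= smallThreshold)
	    {
	        // Insertion sort: O(n^2) comparisons, but very low overhead on
	        // tiny ranges, so it usually beats the fancy algorithm there.
	        foreach (i; 1 .. a.length)
	        {
	            auto x = a[i];
	            size_t j = i;
	            for (; j > 0 && a[j - 1] > x; --j)
	                a[j] = a[j - 1];
	            a[j] = x;
	        }
	        return;
	    }

	    // Otherwise, ordinary top-down merge sort.
	    auto mid = a.length / 2;
	    hybridSort(a[0 .. mid]);
	    hybridSort(a[mid .. $]);

	    auto merged = new T[](a.length);
	    size_t i = 0, j = mid, k = 0;
	    while (i < mid && j < a.length)
	        merged[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
	    while (i < mid)      merged[k++] = a[i++];
	    while (j < a.length) merged[k++] = a[j++];
	    a[] = merged[];
	}

Whether the real thing uses insertion sort, and where the cutoff sits, is exactly the kind of detail that has to be measured rather than guessed.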

- Jonathan M Davis


June 01, 2017
On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> 
[…]
> With D, we can have the cake and eat it too.  The understandable /
> naïve
> implementation can be available as a fallback (and reference
> implementation), with OS-specific optimized implementations guarded
> under version() or static-if blocks so that one could, in theory,
> provide implementations specific to each supported platform that
> would
> give the best performance.

But isn't that the compiler's job?

The lesson of functional programming, logic programming, etc. (when the acolytes  remember the actual lesson) is that declarative expression of goal is more comprehensible to people than detailed explanation of how the computer calculates a result. Object-oriented computing lost the plot somewhere in the early 2000s.

The advance of Scala, Kotlin, Groovy, and now (to some extent) Rust and Go, an advance which D also has, is to express algorithms as declarative requirements in a dataflow manner.

One day hardware people will catch up with the hardware innovations of the 1970s and 1980s regarding support of dataflow rather than state.

> I disagree about the philosophy of "if you need to go faster, use a
> bigger computer".  There are some inherently complex problems (such
> as
> NP-complete, PSPACE-complete, or worse, outright exponential class
> problems) where the difference between a "heroic implementation" of a
> computational primitive and a naïve one may mean the difference
> between
> obtaining a result in this lifetime vs. practically never. Or, more
> realistically speaking, the difference between being able to solve
> moderately-complex problem instances vs. being able to solve only
> trivial toy instances.  When you're dealing with exponential
> complexity,
> every small bit counts, and you can never get a big enough computer.

There are always places for experimentation and innovation. Hard problems will always be hard; we just need to find the least hard way of expressing the solutions. The crucial thing is that people always work to remove the heroicism of the initial solutions, creating new computational models, new programming languages, etc. to do it.

-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

June 01, 2017
On Wed, May 31, 2017 at 12:13:04PM -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> On Tue, May 30, 2017 at 05:13:46PM -0700, Ali Çehreli via Digitalmars-d-learn wrote: [...]
> > I could not make the D program come close to wc's performance when the data was piped from stdin.
> [...]
> 
> Hmm. This is a particularly interesting case, because I adapted some of my algorithms to handle reading from stdin (i.e., std.mmfile is not an option), and I could not get it to reach wc's performance!  I even installed ldc just to see if that made a difference... it was somewhat faster than gdc, but still, the timings were about twice as slow as wc.
[...]

Here's a little update on my experiments w.r.t. reading from stdin
without being able to use std.mmfile: I found that I was able to achieve
decent performance by modifying the loop so that it loads data from the
array in size_t chunks rather than by individual bytes, and looping over
the bytes in the size_t to check for matches. Here's the code:

	import std.stdio : File;    // for File and its byChunk

	size_t lineCount7(ref File input)
	{
	    import std.algorithm.searching : count;

	    ubyte[] buf;
	    size_t c = 0;

	    buf.length = 8192;

	    foreach (d; input.byChunk(buf))
	    {
		// Full 8 KB chunk: reinterpret it as ulong[] and scan 8 bytes at a time.
		if (d.length == buf.length)
		{
		    auto ichunk = cast(ulong[]) d;
		    size_t subtotal = 0;

		    foreach (i; ichunk)
		    {
			// Check each of the 8 byte lanes of the ulong against '\n'.
			enum eol = cast(ulong) '\n';
			if ((i & 0x00000000000000FF) ==  eol       ) subtotal++;
			if ((i & 0x000000000000FF00) == (eol <<  8)) subtotal++;
			if ((i & 0x0000000000FF0000) == (eol << 16)) subtotal++;
			if ((i & 0x00000000FF000000) == (eol << 24)) subtotal++;
			if ((i & 0x000000FF00000000) == (eol << 32)) subtotal++;
			if ((i & 0x0000FF0000000000) == (eol << 40)) subtotal++;
			if ((i & 0x00FF000000000000) == (eol << 48)) subtotal++;
			if ((i & 0xFF00000000000000) == (eol << 56)) subtotal++;
		    }
		    c += subtotal;
		}
		else
		{
		    // Short (final) chunk: fall back to the byte-by-byte count.
		    c += d.count('\n');
		}
	    }
	    return c;
	}

When the last chunk of the file (possibly the entire file, if it's < 8 KB) is incomplete, we revert to the naïve loop-over-bytes search.

While this superficially may seem like unnecessary complication, it actually makes a significant performance difference, because:

(1) Reading the array in size_t chunks means fewer round trips to RAM or the L1/L2/L3 caches.

(2) Since a size_t fits within a single CPU register, the inner loop can be completely done inside the CPU without needing to even go to L1 cache, which, while it's pretty fast, is still a memory roundtrip. The register file is the fastest memory of all, so we maximize this advantage here.

(3) Since size_t has a fixed size, the loop can be completely unrolled
(ldc does this) and thus completely eliminate branch hazards from the
inner loop.

I originally tried to copy the glibc memchr implementation's xor trick for checking whether a size_t word contains any matching bytes, but I got mixed results, and in any case it loses out to my system's wc implementation.  I suppose given enough effort I could track down what's causing the D code to slow down, but then I realized something else about wc: since it uses memchr to find EOL bytes, and now that I know memchr's implementation in glibc, it means that a lot of overhead is introduced when the data being scanned contains a lot of matches.
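For reference, the xor trick is essentially the standard SWAR zero-byte test; a sketch of it (not glibc's exact code) looks like this:

	// Does any byte lane of a 64-bit word equal `needle`?  Broadcast the
	// needle into all 8 lanes, XOR (matching lanes become zero), then apply
	// the classic zero-byte test.
	bool wordHasByte(ulong word, ubyte needle)
	{
	    immutable ulong x = word ^ (0x0101010101010101UL * needle);

	    // (x - 0x01..01) & ~x & 0x80..80 is nonzero iff some byte of x is 0.
	    return ((x - 0x0101010101010101UL) & ~x & 0x8080808080808080UL) != 0;
	}

	unittest
	{
	    assert( wordHasByte(0x11223344550A6677UL, '\n'));   // one lane is 0x0A
	    assert(!wordHasByte(0x1122334455667788UL, '\n'));   // no 0x0A anywhere
	}

The catch is that once the test says "there's a match somewhere in this word", you still have to find and count the actual byte(s), which is where the extra bookkeeping comes in.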

So I did a little test: I created two text files, both 200 million bytes long: one consisting entirely of newlines (i.e., 200,000,000 blank lines), and one with 100-character lines (2,000,000 lines in total).  Then, as an intermediate between these two extremes, I concatenated all of the .d files in Phobos 10 times to make a file with 2.8 million lines of varying lengths.  Then I tested both my system's wc and my linecount written in D to see how they performed on these files (piped through stdin, so no mmap-ing is involved).
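(If anyone wants to reproduce this, the two synthetic files can be generated with something like the following; the file names are arbitrary:)

	import std.stdio : File;

	void main()
	{
	    // 200,000,000 newline bytes, i.e. 200,000,000 blank lines.
	    auto blanks = File("blank_lines.txt", "wb");
	    auto nl = new ubyte[](1_000_000);
	    nl[] = '\n';
	    foreach (_; 0 .. 200)
	        blanks.rawWrite(nl);

	    // 2,000,000 lines of 99 'x' characters plus '\n' = 200,000,000 bytes.
	    auto longLines = File("long_lines.txt", "wb");
	    auto line = new ubyte[](100);
	    line[] = 'x';
	    line[99] = '\n';
	    foreach (_; 0 .. 2_000_000)
	        longLines.rawWrite(line);
	}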

Here are the results (note: these are raw measurements; I did not account for system background noise):

	    +------------------+-------------------+------------------+
	    | 200M blank lines | 2M 100-byte lines | 10x Phobos code  |
+-----------+------------------+-------------------+------------------+
| wc -l     | real    0m0.475s | real    0m0.080s  | real    0m0.083s |
|           | user    0m0.417s | user    0m0.034s  | user    0m0.062s |
|           | sys     0m0.058s | sys     0m0.046s  | sys     0m0.020s |
+-----------+------------------+-------------------+------------------+
| linecount | real    0m0.181s | real    0m0.190s  | real    0m0.099s |
|           | user    0m0.138s | user    0m0.129s  | user    0m0.059s |
|           | sys     0m0.042s | sys     0m0.060s  | sys     0m0.040s |
+-----------+------------------+-------------------+------------------+

As expected, wc -l loses when dealing with blank lines (and, presumably, short lines); the D version was able to beat it by more than a factor of 2.  On the file with 100-byte lines, though, the performance of wc improved tremendously, because glibc's memchr is optimized for scanning large amounts of data before finding a match, whereas the D version performs more or less on par with the blank-line case but loses out to wc by about a factor of 2.

The results of the 10x Phobos code runs are not directly comparable with the first two test files, because the total file size is different. Here, the D code still loses out slightly to wc, presumably because memchr is ultimately still more efficient given the average line length in Phobos code.

These results show that the performance of these algorithms depends on the kind of data you feed them, and there's probably no "optimal" line-counting algorithm unless you can predict the average line length in advance.  In general, though, if you're dealing with text, I'd wager that the average line length is closer to the 100-byte-line end of the spectrum than to a file full of blank lines, so wc probably wins on your typical text file.  I guess that means that using memchr or its equivalent in D would be the best strategy to obtain results on par with wc, as far as reading from stdin is concerned.
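I.e., per chunk, something along these lines (just a sketch; the function name is made up):

	import core.stdc.string : memchr;

	// Count '\n' bytes by letting the C library's heavily-optimized memchr
	// find each match; we only do the bookkeeping between hits.
	size_t countNewlines(const(ubyte)[] data)
	{
	    size_t lines = 0;
	    auto p = data.ptr;
	    auto remaining = data.length;

	    while (remaining > 0)
	    {
	        auto hit = cast(const(ubyte)*) memchr(p, '\n', remaining);
	        if (hit is null)
	            break;                       // no more newlines in this buffer
	        ++lines;
	        remaining -= (hit - p) + 1;      // skip past the newline just found
	        p = hit + 1;
	    }
	    return lines;
	}

Called once per byChunk buffer, that keeps the hot inner loop inside libc's tuned code, which is essentially what wc is doing anyway.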

When you can use std.mmfile, though, the D version still beats wc by a large margin. :-D


T

-- 
Help a man when he is in trouble and he will remember you when he is in trouble again.
June 02, 2017
On Thu, Jun 01, 2017 at 08:39:07AM +0100, Russel Winder via Digitalmars-d-learn wrote:
> On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> > 
> […]
> > With D, we can have the cake and eat it too.  The understandable / naïve implementation can be available as a fallback (and reference implementation), with OS-specific optimized implementations guarded under version() or static-if blocks so that one could, in theory, provide implementations specific to each supported platform that would give the best performance.
> 
> But isn't that the compiler's job?

Unfortunately, the compiler can only go so far, because it doesn't understand the larger context of what you're trying to accomplish. Modern optimizing compilers certainly go a lot further than before, but still, at some point some amount of human judgment is needed.

Also, compiler implementors do still have to do the "heroics", or rather, teach the compiler to do the "heroics" when compiling straightforward code. So while the general programmer probably will have less need for it, compiler writers still need to know how to do them in order to write the optimizing compilers in the first place.


> The lesson of functional programming, logic programming, etc. (when the acolytes  remember the actual lesson) is that declarative expression of goal is more comprehensible to people than detailed explanation of how the computer calculates a result. Object-oriented computing lost the plot somewhere in the early 2000s.

There is no argument that straightforward code is more comprehensible to people.  The question here is whether it delivers maximum performance.

We know from Kolmogorov complexity theory that global optimization, in the general case, is undecidable, so no optimizing compiler is going to be able to generate optimal code in all cases. There will always be cases where you have to manually tweak it yourself.  Of course, that doesn't mean you go around micro-optimizing everything -- the usual approach is to write it the straightforward way first, then profile it, identify the hotspots, and find ways to improve performance in the hotspots. Well, at a higher level, the first order of business is really to look at it from an algorithmic POV and decide whether or not a different algorithm ought to be used (and no optimizing compiler can help you there).  Then if that's still not enough, then you dig into the details and see if you can squeeze more juice out of your current algorithms -- if the profiler has indicated that they are the bottleneck.


> The advance of Scala, Kotlin, Groovy, and now (to some extent) Rust and Go, an advance which D also has, is to express algorithms as declarative requirements in a dataflow manner.
> 
> One day hardware people will catch up with the hardware innovations of the 1970s and 1980s regarding support of dataflow rather than state.

Dataflow can only capture a limited subset of algorithms.  Of course, in my observation, 90% of code being written today generally belongs in the category of what I call "linear shuffling of data", i.e., iterate over some linear set of objects and perform some transformation X on each object, copying linear memory region X to linear memory region Y, rearranging some linear set of objects, permuting the order of some linear set of things, etc..  This category basically subsumes all of GUI programming and web programming, which in my estimation covers 90% or maybe even 95% of code out there.  The bulk of game code also falls under this category -- they are basically a matter of copying X items from A to B, be they pixels to be copied to the video buffer, traversing the list of in-game objects to update their current position, direction, speed, or traversing scanlines of a polygon to map a 3D object to 2D, etc.. Almost all business logic also falls under this category. All of these are easily captured by dataflow models.

However, there are algorithms outside of this category that are not easily captured by the dataflow model.  Solving certain graph theory problems, for example, requires actual insight into the structure of the problem rather than mere "move X items from A to B".  Route planning is an NP-complete problem that, for practical applications, can only be approximated, so actual thought is required about how you go about solving the problem beyond "data X moves from A to B".  Computing the convex hull of a set of input points, used for solving optimization problems, has O(n^3) time complexity if expressed and solved in a naïve way, and is therefore impractical for non-trivial problem instances.

True, your average general programmer may not even know what a convex hull problem is, and probably doesn't even need to care -- at worst, there are already libraries out there that implement the algorithms for you.  But the point is, *somebody* out there needs to write these algorithms, and they need to implement these algorithms in an optimal way so that it will continue to be useful for non-trivial problem sizes. You cannot just say, "here is the problem specification, now just let the computer / AI system / whatever figure out for themselves how to obtain the answer".  *Somebody* has to actually sit down and specify exactly how to compute the answer, because generic ways of arriving at the answer are exponential or worse and are therefore useless. And even a feasible algorithm may require a lot of "heroics" in order to make medium-sized problems more tractable.  You want your weather forecasting model to produce an answer by tomorrow before the 10am news, not 3 months later when the answer is no longer relevant.


> > I disagree about the philosophy of "if you need to go faster, use a
> > bigger computer".  There are some inherently complex problems (such
> > as NP-complete, PSPACE-complete, or worse, outright exponential
> > class problems) where the difference between a "heroic
> > implementation" of a computational primitive and a naïve one may
> > mean the difference between obtaining a result in this lifetime vs.
> > practically never. Or, more realistically speaking, the difference
> > between being able to solve moderately-complex problem instances vs.
> > being able to solve only trivial toy instances.  When you're dealing
> > with exponential complexity, every small bit counts, and you can
> > never get a big enough computer.
> 
> There are always places for experimentation and innovation. Hard problems will always be hard; we just need to find the least hard way of expressing the solutions.

Some problems are inherently hard, and no amount of searching can reduce its complexity past a certain limit. These require extreme measures to be even remotely tractable.


> The crucial thing is that people always work to remove the heroicism of the initial solutions, creating new computational models, new programming languages, etc. to do it.
[...]

But *somebody* has to implement those computational models and programming languages.  If nobody knows how to write "heroic" code, then nobody would know how to write an optimizing compiler that produces such code either, and these computational models and programming languages wouldn't exist in the first place.

I know that the average general programmer doesn't (and shouldn't) care. But *somebody* has to, in order to implement the system in the first place. *Somebody* had to implement the "heroic" version of memchr so that others can use it as a primitive. Without that, everyone would have to roll their own, and it's almost a certainty that the results will be underwhelming.


T

-- 
Question authority. Don't ask why, just do it.
June 03, 2017
On Fri, 2017-06-02 at 10:32 -0700, H. S. Teoh via Digitalmars-d-learn wrote:
> […]
> 
> Also, compiler implementors do still have to do the "heroics", or
> rather, teach the compiler to do the "heroics" when compiling
> straightforward code. So while the general programmer probably will
> have
> less need for it, compiler writers still need to know how to do them
> in
> order to write the optimizing compilers in the first place.

There are many different sorts of programming. Operating systems, compilers, GUIs, Web services, machine learning, etc., etc. all require different techniques. Also there are always new areas, where idioms and standard approaches are yet to be discovered. There will always be a place for "heroic", but to put it up on a pedestal as being a Good Thing For All™ is to do "heroic" an injustice.

We should also note that in the Benchmark Game, the "heroic" solutions are targeted specifically at Isaac's execution machine, which often means they are crap programs on anyone else's computer.

> […]
> be able to generate optimal code in all cases. There will always be
> cases where you have to manually tweak it yourself.  Of course, that
> doesn't mean you go around micro-optimizing everything -- the usual
> approach is to write it the straightforward way first, then profile
> it,
> identify the hotspots, and find ways to improve performance in the
> hotspots. Well, at a higher level, the first order of business is
> really
> to look at it from an algorithmic POV and decide whether or not a
> different algorithm ought to be used (and no optimizing compiler can
> help you there).  Then if that's still not enough, then you dig into
> the
> details and see if you can squeeze more juice out of your current
> algorithms -- if the profiler has indicated that they are the
> bottleneck.

The optimisations, though, are generally aimed at the current execution computer. Which is fine in the short term. However, in the long term, the optimisations become the problem. When the execution context of optimised code changes, the optimisations should be backed out and new optimisations applied. Sadly this rarely happens, and you end up with new optimisations laid on old (redundant) optimisations, and hence incomprehensible code that people daren't amend as they have no idea what the #### is going on.

> 
[…]
> But *somebody* has to implement those computational models and
> programming languages.  If nobody knows how to write "heroic" code,
> then
> nobody would know how to write an optimizing compiler that produces
> such
> code either, and these computational models and programming languages
> wouldn't exist in the first place.

Which returns us to the point that there are different sorts of programming, and that there are people at "the bleeding edge" of languages, techniques, and hardware researching these new things. Or there ought to be; it's just that you need funds to do it.

> I know that the average general programmer doesn't (and shouldn't)
> care.
> But *somebody* has to, in order to implement the system in the first
> place. *Somebody* had to implement the "heroic" version of memchr so
> that others can use it as a primitive. Without that, everyone would
> have
> to roll their own, and it's almost a certainty that the results will
> be
> underwhelming.

It may be worth noting that far too few supposedly professional programmers actually know enough about the history of their subject to be deemed competent.


-- 
Russel. ============================================================================= Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net 41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@winder.org.uk London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

June 02, 2017
On Sat, Jun 03, 2017 at 07:00:47AM +0100, Russel Winder via Digitalmars-d-learn wrote: [...]
> There are many different sorts of programming. Operating systems, compilers, GUIs, Web services, machine learning, etc., etc. all require different techniques. Also there are always new areas, where idioms and standard approaches are yet to be discovered. There will always be a place for "heroic", but to put it up on a pedestal as being a Good Thing For All™ is to do "heroic" an injustice.

Fair enough.  I can see how this would lead to unnecessarily ugly, prematurely-optimized code.  It's probably the origin of the premature-optimization culture that's especially prevalent in C circles, where you just get into the habit of automatically thinking things like "i = i + 1 is less efficient than ++i", which may have been true in some bygone era but is no longer relevant with the machines and optimizing compilers of today. And also of constantly "optimizing" code that isn't actually the bottleneck, because of some vague notion of wanting "everything" to be fast, yet not being willing to use a profiler to find out where the real bottleneck is.  As a result you spend inordinate amounts of time writing the absolute fastest O(n^2) algorithm rather than substituting a moderately unoptimized O(n) algorithm that's far superior, thus actually introducing new bottlenecks instead of fixing existing ones.


> We should also note that in the Benchmark Game, the "heroic" solutions are targeted specifically at Isaac's execution machine, which often means they are crap programs on anyone else's computer.

Well, there is some value in targeting a specific execution environment, but I agree that holding that up as being exemplary of how code should be written would be rather misguided.


[...]
> The optimisations, though, are generally aimed at the current execution computer. Which is fine in the short term. However, in the long term, the optimisations become the problem. When the execution context of optimised code changes, the optimisations should be backed out and new optimisations applied. Sadly this rarely happens, and you end up with new optimisations laid on old (redundant) optimisations, and hence incomprehensible code that people daren't amend as they have no idea what the #### is going on.

I've been thinking about this for a while now, actually.  It almost seems as though there ought to be two distinct layers of abstraction in a given piece of code: a high-level, logical layer that specifies, in some computational model, the desired results, and a lower-level layer that contains implementation details or target-specific tweaks.  There should be an automatic translation from the upper layer to the lower layer, but after the automatic translation you could go in and tweak the lower layer *while keeping the upper layer intact*, and the system (IDE or whatever) would keep track of *both*, with the lower-layer customizations tracked as a set of diffs against the automatically translated version.  When the upper layer changes, any corresponding diffs in the lower layer get invalidated and either produce a conflict the programmer must manually resolve, or else default to the new automated translation.

Furthermore, there should be some system of tracking multiple diff sets for the lower layer, so that you can specify diff A as applying to target machine X, and diff B as applying to target machine Y. So you can target the same logical piece of code to different target machines with different implementations.


[...]
> > I know that the average general programmer doesn't (and shouldn't) care.  But *somebody* has to, in order to implement the system in the first place. *Somebody* had to implement the "heroic" version of memchr so that others can use it as a primitive. Without that, everyone would have to roll their own, and it's almost a certainty that the results will be underwhelming.
> 
> It may be worth noting that far too few supposedly professional programmers actually know enough about the history of their subject to be deemed competent.
[...]

Yes, and that is why the people who actually know what they're doing need to be able to write the "hackish", optimized implementations of the nicer APIs provided by the language / system, so that at the very least the API calls would do something sane, even if the code above that is crap.


T

-- 
Questions are the beginning of intelligence, but the fear of God is the beginning of wisdom.