January 16, 2007 Re: C/C++ style Crashes?
Posted in reply to Sebastian Biallas

On Mon, 15 Jan 2007 17:54:57 +0100, Sebastian Biallas <groups.5.sepp@spamgourmet.com> wrote:

>a) Where do you need extremely large lists? Lists have a search time in
>O(n). Ok, you have a O(1) insertion/deletion here, but only at one
>place, since your iterators get void after an insertion/deletion
>(contrary to normal double linked lists)

Depends what you're storing in the list. Consider a word processor that is expected to handle large documents, for instance. You keep a few iterators for key current positions, but almost all accesses are sequential. Jumping to a specific page number is not, but that can afford a delay, or else you can keep a separate mapping of page numbers to start-position iterators.

A huge array of characters would cause problems with inserts and deletes, but a list handles this easily. There are other approaches, but few as simple or elegant, as they tend to require a pair of data structures working together (for example, a linked list of short strings).

>b) They don't work in a GC environment or require special handling.

It's called opting out of garbage collection, and certainly that would be required in this case. It's rather handy, therefore, that D allows it.

>Are they really used somewhere out of the IOCCC? I'd be curious to see a real world example.

The obfuscated C contest is about obfuscated code style. These lists can be implemented as cleanly and clearly as any other data structure - the code is far simpler than, for instance, the red-black balanced binary trees used in the STL.

So consider the word processor example. One alternative approach would be to use a huge vector of characters, but that wouldn't work well for insertions and deletions - the larger the document, the longer it takes to resize the array. Another approach is to use a list of short arrays of characters, but then you're mixing two data structure approaches. Issues like merging overly small arrays and splitting overly long ones would complicate the code - by your logic, making it a candidate for the obfuscated C contest itself.

It is possible to create a tree data structure that implements a vector-style container and supports efficient inserts and deletes anywhere; using a multiway tree keeps the overheads relatively small and helps with cache performance. But of course you won't find support for this data structure in any existing language or standard library that I know of, so we are back in creating-a-new-core-data-structure territory, with the usual bloat-management issues - and it is a significantly more complex data structure to implement than the list approach already described (trust me, I've done exactly this in C++).

I've never needed to write a word processor myself (I've written text-layout code, but it was non-interactive, and I've written a text editor, but that was back in the days of DOS 3), but if you ever need to do so, I'd recommend that you give serious consideration to the list approach I described. Of the options that can actually handle the job efficiently, it just might be the simplest and most elegant one available.
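For what it's worth, here is a minimal C++ sketch of the kind of pointer-arithmetic list the thread seems to be discussing - an XOR / pointer-difference list that stores prev XOR next in a single link field. This is only an illustration under that assumption; the names and layout are mine, not anything from an existing library.

// Minimal XOR-linked list sketch: one link field per node holding
// addr(prev) XOR addr(next), so traversal needs the previous node too.
#include <cstdint>
#include <cstdio>

struct Node {
    int       value;
    uintptr_t link;   // addr(prev) XOR addr(next); null neighbours count as 0
};

static uintptr_t addr(const Node* p) { return reinterpret_cast<uintptr_t>(p); }
static Node*     ptr(uintptr_t a)    { return reinterpret_cast<Node*>(a); }

// Insert n between a and b, where a and b are currently adjacent
// (either may be null at an end of the list).
void insert_between(Node* a, Node* n, Node* b) {
    n->link = addr(a) ^ addr(b);
    if (a) a->link ^= addr(b) ^ addr(n);   // a's neighbour b becomes n
    if (b) b->link ^= addr(a) ^ addr(n);   // b's neighbour a becomes n
}

// Walk forwards from 'head', printing each value.
void print_all(Node* head) {
    Node* prev = nullptr;
    for (Node* cur = head; cur; ) {
        std::printf("%d ", cur->value);
        Node* next = ptr(cur->link ^ addr(prev));
        prev = cur;
        cur  = next;
    }
    std::printf("\n");
}

int main() {
    Node a{1, 0}, b{3, 0}, c{2, 0};
    insert_between(nullptr, &a, nullptr);   // list: a
    insert_between(&a, &b, nullptr);        // list: a b
    insert_between(&a, &c, &b);             // list: a c b
    print_all(&a);                          // prints: 1 2 3
}

Note that the link field never holds a valid pointer on its own, which is exactly why such a list needs special handling (or opting out) under a garbage collector, and why an iterator has to carry a (previous, current) pair.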
>> In addition, pointer casting is useful for managing template bloat. The
>> basic approach is to write type-unsafe non-template code, and then use a
>> template to add a typesafe wrapper.
>
>This works with normal pointer casting (no need for arithmetic). Hey, this was even the normal usage of Java containers before they had the template stuff.

Yes - that was explanation leading into the point, not the point itself.

>> But in many cases, more flexibility is needed than can be handled using
>> simple pointer casting.
>>
>> For example, consider a library to handle flexible tree data structures. The container being implemented may or may not have a key field. It may or may not have a data field. It may or may not have subtree summary data stored within nodes. The non-template part needs a way to access fields such that the position within a node struct (of an unknown datatype) is determined at run-time.
>
>I don't think it's a good idea to trade portability for a few cycles. And this might be even slower, since you confuse the compiler (alias analysis etc).

It should be clear that we're not talking about everyday apps programming here. To the extent that it would be used in an application at all, the word processor example above is probably relevant - it would implement a core data structure in a large application, but most developers working on the app wouldn't touch the internals of that data structure.

The portability issues are real but not hard to handle. For getting the offset of a specific field, for instance, the C standard library has an offsetof macro for precisely this kind of work. The template wrapper simply defines the struct it needs and uses offsetof to extract the needed offsets - technicalities such as alignment are handled automatically. The biggest issue is making sure that offsets are held in integers of the right size, and even that isn't much of a hassle - if you can't find an integer-the-same-size-as-a-pointer type in the language or library, you simply define your own using conditional compilation or template specialisation or whatever.

As for the 'a few cycles' thing, remember that we are talking about defining core data structures here - things that will be accessed a great deal, often in inner loops. The performance of core data structures is important enough that Walter makes a big deal of the few cycles per access saved by D's handling of hash tables (using binary trees to handle collisions, as opposed to other common implementations that use linked lists), and rightly so. If you have an application that does millions of accesses, those few cycles per access can really start to accumulate. In fact, one of the reasons why scripting languages have their core data structures built into the language is simply that, if they were implemented in the scripting languages themselves, the performance would be far too poor to be practical. Having efficient operations on core data structures is extremely important.

It is of course true that, in general, programmers shouldn't fuss about code optimisation too much - choice of algorithm and data structure, yes, but fiddling with implementation details definitely not. Key operations on core data structures are often an exception to this rule, however. And yes, there is a real chance that doing this may be slower for some operations - very much slower. That is why I said...

: The most efficient way to handle this depends on what kind of
: operation is being implemented. A single data structure may have two
: or more ways to access the same field, just to ensure that all
: operations can be handled efficiently - many accesses will probably be
: done as part of small functions supplied by the typesafe wrapper, to
: keep the call overheads down in inner loops.

... and also why I said these techniques are needed to manage bloat (rather than completely eliminate it).
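To make that layering concrete, here is a rough C++ sketch of the split described above: the bulk of the work in type-unsafe, non-template code driven by run-time offsets, with a thin template wrapper restoring type safety via offsetof. All names (ListCore, core_push and so on) are my own illustration, and it assumes a trivially copyable, standard-layout element type, as the C-style technique implies; cleanup is omitted for brevity.

#include <cstddef>   // offsetof, size_t
#include <cstring>   // memcpy

// ---- non-template core: knows nothing about the element type ----
struct ListCore {
    struct RawNode { RawNode* next; };   // payload lives at a known offset
    RawNode* head = nullptr;
};

// Link in a node and copy 'size' bytes of payload to 'offset' bytes past its start.
inline void core_push(ListCore& c, ListCore::RawNode* n,
                      std::size_t offset, std::size_t size, const void* value) {
    std::memcpy(reinterpret_cast<char*>(n) + offset, value, size);
    n->next = c.head;
    c.head  = n;
}

// ---- template wrapper: defines the real node layout and supplies the offsets ----
template <class T>
class List {
    struct Node { ListCore::RawNode raw; T value; };
    ListCore core;
public:
    void push_front(const T& v) {
        Node* n = new Node;
        // offsetof hands us the payload position; alignment comes out right
        // automatically because the compiler laid out Node itself.
        core_push(core, &n->raw, offsetof(Node, value), sizeof(T), &v);
    }
    T* front() {
        return core.head
            ? reinterpret_cast<T*>(reinterpret_cast<char*>(core.head) + offsetof(Node, value))
            : nullptr;
    }
};

int main() {
    List<int> xs;
    xs.push_front(42);
    return *xs.front() == 42 ? 0 : 1;
}

The wrapper's push_front and front are the sort of small typesafe functions referred to in the quoted paragraph: cheap to inline in inner loops, but duplicated for every instantiation of the template.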
The compiler can handle these small inner-loop functions very well, but they create bloat, since each instantiation of the template wrapper creates a new set of these small functions. Therefore, there is an issue of balance.

So it is absolutely true that these are not things to mess with every day. They require care, and for most things you are far better off sticking with the containers supplied by the language itself or its standard libraries. But there are exceptions to this rule, and depending on the kind of code that you tend to work on, these exceptions can happen more often than you might realise. And in any case, someone has to write the standard library containers in the first place.

--
Remove 'wants' and 'nospam' from e-mail.