July 19, 2010
On Mon, 19 Jul 2010 15:53:46 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> Steven Schveighoffer:
>
>> Not quite :)  There is one byte for padding because (insert gasp-inspiring
>> music accent) all struct heap allocations are allocated through newArray
>> with a size of 1.  I discovered this when working on the array append
>> patch.
>
> How much more hidden shit like this do I have to see?
> I have filed a bug report:
> http://d.puremagic.com/issues/show_bug.cgi?id=4487
> Maybe Walter has to fix this before porting dmd to 64 bits.
>
>
>> Most of this is mitigated if you have a custom allocator that allocates an
>> array of nodes at once
>
> The GC is supposed to not suck that much. If I want to do it all manually and use custom allocators, then maybe it's better if I start switching to the C language.
> Thank you Steven.

What's so horrible about it?  It's a corner case.  If you were allocating a 20-byte struct, you'd be wasting 12 bytes per value anyway.  Or what about a 36-byte struct?

All I'm saying is that the pad byte itself isn't a huge issue; even if it didn't exist, there would always be inefficient allocations.  Take the red-black tree node, for instance.  Get rid of the pad byte, and it's tuned for a word-size payload.  But go over that, and you're back to wasting pretty much 50% of your memory.  If you want to tune your app to be the most efficient at memory allocation, you need to study the allocator and learn its tricks and nuances.  I think you have some misguided notion that C's allocator is perfect and there's no hidden cost to it.  That's not the case.

That being said, it would be good if we could get rid of the 1-byte pad for single struct allocations on the heap.  As a bonus, the code will be more efficient because it doesn't have to deal with the "array" notion.

-Steve
July 19, 2010
Steven Schveighoffer:
> What's so horrible about it?

Using about 210% of the RAM I was planning to use. And it isn't even mentioned in fine print somewhere in the docs.


> It's a corner case.

4-word structs are quite common. It's not a common corner.


> All I'm saying is the pad byte itself isn't a huge issue, even if it didn't exist, there would always be inefficient allocations.

That is far too inefficient.


> If you want to tune your app to be the most efficient at memory allocation, you need to study the allocator and learn its tricks and nuances.

The allocator can have some overhead, but this is offensive.


>  I think you have some misguided notion that C's
> allocator is perfect, and there's no hidden cost to it.  That's not the
> case.

I have written plenty of code in C, and its cost is not that high.

I see you have changed my bug report, which I had marked as 'major', into an 'enhancement'. And you said:
> DMD functions exactly as designed.

Then I can answer that it's a *design* bug. I feel offended.
July 19, 2010
On 07/19/2010 03:09 PM, Steven Schveighoffer wrote:
> Take the
> red-black tree node, for instance. Get rid of the pad byte, and it's tuned
> for word-size payload. But go over that, and you're back to pretty much
> wasting 50% of your memory.

I think this characterization is a bit inaccurate because it suggests that there are gains only for one-word payloads. I think the truth is that there are gains for several payloads; their relative value decreases with the size of the payload.

Long story short - the less slack (byte of overhead, bool for red/black information), the better (in various quanta and of various values).


Andrei
July 19, 2010
On Mon, 19 Jul 2010 16:24:04 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> Steven Schveighoffer:
>> What's so horrible about it?
>
> Using about 210% of the RAM I was planning to use. And it isn't even mentioned in fine print somewhere in the docs.
>
>
>> It's a corner case.
>
> 4-word structs are quite common. It's not a common corner.

You mean it *is* a common corner?  I agree, and I think the bug report is warranted; we should get it fixed.  But again, you have specifically designed your input to thwart the system.  Such examples will always be possible to construct unless you have a perfect memory allocator, which would probably be so slow as to be unusable.

>
>
>> All I'm saying is the pad byte itself isn't a huge issue, even if it
>> didn't exist, there would always be inefficient allocations.
>
> That is far too inefficient.

That is the cost of allocation schemes that use fixed-size memory blocks, especially when they grow in powers of 2.  Tune your app for it, and you won't have this problem.

>
>
>> If you want to tune your app to be the most
>> efficient at memory allocation, you need to study the allocator and learn
>> its tricks and nuances.
>
> The allocator can have some overhead, but this is offensive.

I guess my answer is: tune your app to the allocator.  If you allocate a lot of little items, you will go a long way by allocating an array of them instead and using a free list.
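A minimal C sketch of that approach (the names `Pool`, `pool_alloc`, etc. are illustrative, not from any library): allocate one big array of nodes up front and chain the unused ones together, so "allocation" is just popping the list head:

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative node type; the payload stands in for your struct. */
typedef struct Node {
    struct Node *next; /* free-list link while the node is unused */
    int payload;
} Node;

typedef struct {
    Node *storage;   /* one big allocation holding every node */
    Node *free_head; /* head of the chain of unused nodes */
} Pool;

static int pool_init(Pool *pool, size_t capacity)
{
    if (capacity == 0)
        return 0;
    pool->storage = malloc(capacity * sizeof(Node));
    if (!pool->storage)
        return 0;
    /* Chain every node onto the free list. */
    for (size_t i = 0; i + 1 < capacity; i++)
        pool->storage[i].next = &pool->storage[i + 1];
    pool->storage[capacity - 1].next = NULL;
    pool->free_head = pool->storage;
    return 1;
}

static Node *pool_alloc(Pool *pool)
{
    Node *n = pool->free_head; /* pop the head: O(1), no per-node overhead */
    if (n)
        pool->free_head = n->next;
    return n;
}

static void pool_free(Pool *pool, Node *n)
{
    n->next = pool->free_head; /* push back onto the list: O(1) */
    pool->free_head = n;
}
```

The nodes pack back to back in one block, so the only waste is whatever rounding the single large allocation incurs, amortized over the whole array.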

>
>
>>  I think you have some misguided notion that C's
>> allocator is perfect, and there's no hidden cost to it.  That's not the
>> case.
>
> I have written plenty of code in C, and its cost is not that high.

So C does not use pools of fixed-size memory?  All the blocks it hands out are exactly the size you ask for?

Hm... let me test...

Yep, C does the same thing.  I wrote a small program to allocate 1,000,000 blocks of a command-line-supplied size.

Up to 12 bytes per block allocates 17MiB, even if the blocks I request are 1 byte.  For a single pointer, that's 300% overhead.  How horrid, let's lambaste the C allocator developers.  WHERE'S THE FINE PRINT!????

16 bytes per block allocates 24.6MiB.  Wait, shouldn't it be 17MiB?  Surely that holds 16MiB of data?  What's going on here?  How can anyone be asked to deal with this shit?  Oh wait, I didn't *tune my app for the allocator*.

Above that, the C allocator does a good job of minimizing overhead, but that's for a contrived example of simply calling malloc 1M times in a row.  And C has less to worry about than D when it comes to memory allocation, and is far more experienced at it.  But it's not perfect in all situations.  It's much easier to tune your app to the allocator than to tune the allocator to your app.

>
> I see you have changed my bug report, which I had marked as 'major', into an 'enhancement'. And you said:
>> DMD functions exactly as designed.
>
> Then I can answer it's a *design* bug. I feel offended.

A design bug *is* an enhancement :)

-Steve
July 19, 2010
On Mon, 19 Jul 2010 16:41:50 -0400, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 07/19/2010 03:09 PM, Steven Schveighoffer wrote:
>> Take the
>> red-black tree node, for instance. Get rid of the pad byte, and it's tuned
>> for word-size payload. But go over that, and you're back to pretty much
>> wasting 50% of your memory.
>
> I think this characterization is a bit inaccurate because it suggests that there are gains only for one-word payloads. I think the truth is that there are gains for several payloads; their relative value decreases with the size of the payload.
>
> Long story short - the less slack (byte of overhead, bool for red/black information), the better (in various quanta and of various values).

There is a cost though...  which was my point.  Isn't everyone always saying around here how cheap memory is these days? ;)

-Steve
July 19, 2010
Steven Schveighoffer:
> That is the cost of allocation schemes that use fixed size memory blocks. Especially when they grow in powers of 2.  Tune your app for it, and you won't have this problem.

I knew about the power-of-2 allocations for small memory blocks, and I know they're useful to reduce memory fragmentation. So I had tuned my code for that; that's why I have several structs 16 bytes long, but now I have to target 15 bytes, which is not a power of 2 :o)


> A design bug *is* an enhancement :)

I see; I didn't know that. Sorry for losing my temper, Steven...

Bye,
bearophile
July 19, 2010
Steven Schveighoffer:
> Isn't everyone always saying around here how cheap memory is these days? ;)

RAM is cheap, but the CPU doesn't use RAM directly; it mostly uses the L1 cache (and the L2/L3 caches a bit too), and those cost a lot :-) The more space your data structure uses, the less of it you can fit in the cache. Today, cache effects are important for code performance.

This is a nice example; it shows how reducing the size of the data structure and changing its arrangement (the original was a normal tree, traversed for each pixel) can improve performance by something like an order of magnitude for ray tracing:
http://www.cs.utah.edu/~bes/papers/fastRT/paper-node12.html

Bye,
bearophile
July 19, 2010
On Mon, 19 Jul 2010 17:01:34 -0400, bearophile <bearophileHUGS@lycos.com> wrote:

> Steven Schveighoffer:
>> That is the cost of allocation schemes that use fixed size memory blocks.
>> Especially when they grow in powers of 2.  Tune your app for it, and you
>> won't have this problem.
>
> I knew about the power-of-2 allocations for small memory blocks, and I know they're useful to reduce memory fragmentation. So I had tuned my code for that; that's why I have several structs 16 bytes long, but now I have to target 15 bytes, which is not a power of 2 :o)

Hm... unfortunately, I think you will end up in the same boat, because any struct of size 15 is aligned to a 16-byte boundary.  From memory, I don't think the array allocation code takes into account whether the final element coincides with the pad byte, but I may be wrong.  Make sure to test this theory before going through and trying to trim bytes off all your structs.  I think if you use 12-byte structs, they will fit fine, but then of course you are wasting 25% of your memory :)

If you can deal with some manual memory management, you may want to pre-allocate a large array of the structs and then use a free list to "allocate" and "deallocate" them.  This should pack them in as tightly as possible with almost no overhead.  Of course, if you depend on the GC to free your elements, it might be more of a burden to convert all your code to manual memory management.

-Steve