September 24, 2013
On 9/24/13 9:12 AM, Paulo Pinto wrote:
>>> On 24.09.2013 17:29, Andrei Alexandrescu wrote:
>> On 9/23/13 11:32 PM, Jacob Carlborg wrote:
>>> On 2013-09-23 19:53, Andrei Alexandrescu wrote:
>>>
>>>>
>>>> I think this is debatable. For one, languages such as Java and C++
>>>> still have a built-in "new" but quite ubiquitously recommend
>>>> against its use in user code. As far as I can tell that's been a
>>>> successful campaign.
>>>
>>> Since when is it _not_ recommended to use "new" in Java?
>>
>> http://goo.gl/KVmAom
>>
>>
>> Andrei
>>
>>
>
> The top results are from around 2002, when most JVMs still had basic GC
> implementations and did not perform escape analysis.

I had in mind the polymorphism/flexibility argument; I'd forgotten about the efficiency argument.

Andrei
September 24, 2013
On Tuesday, 24 September 2013 at 08:46:36 UTC, Dmitry Olshansky wrote:
> On 23-Sep-2013 03:49, Andrei Alexandrescu wrote:
>> Hello,
>>
>>
>> I am making good progress on the design of std.allocator, and I am
>> optimistic about the way it turns out. D's introspection capabilities
>> really shine through, and in places the design does feel really
>> archetypal - e.g. "this is the essence of a freelist allocator". It's a
>> very good feeling. The overall inspiration comes from Berger's
>> HeapLayers, but D's introspection takes that pattern to a whole new level.
>>
>> Several details are still in flux, but at the top level it seems most
>> natural to divide things in two tiers:
>>
>> 1. Typed allocator, i.e. every request for allocation comes with the
>> exact type requested;
>>
>> 2. Untyped allocator - traffics exclusively in ubyte[].
>>
>
> Looks good (s/ubyte[]/void[] per current discussion).
>
> Do you imagine typed allocators as something more than adapters that simplify the common pattern of allocate + emplace / destroy + deallocate? (create!T/destroy!T)
>
>> struct NullAllocator
>> {
>>     enum alignment = real.alignof;
>>     enum size_t available = 0;
>>     ubyte[] allocate(size_t s) shared { return null; }
>>     bool expand(ref ubyte[] b, size_t minDelta, size_t maxDelta) shared
>>     { assert(b is null); return false; }
>>     bool reallocate(ref ubyte[] b, size_t) shared
>>     { assert(b is null); return false; }
>>     void deallocate(ubyte[] b) shared { assert(b is null); }
>>     void collect() shared { }
>>     void deallocateAll() shared { }
>>     static shared NullAllocator it;
>> }
>>
>> Primitives:
>>
>
> First things first - can't the allocator return alignment as a run-time value - a property (just like 'available' does)? The implementation contract is that it must be an O(1), vanilla, syscall-free function. (Read this as: consult system info exactly once, at construction.)
>
> Thinking more of it - it may also be an immutable size_t? Then it gets its proper value at construction and is never changed afterwards.
>
>> * expand(b, minDelta, maxDelta) grows b's length by at least minDelta
>> (and on a best-effort basis by at least maxDelta) and returns true, or
>> does nothing and returns false. In most allocators this should be @safe.
>> (One key insight is that expand() can be made @safe whereas shrink() or
>> realloc() are most often not; such mini-epiphanies are very exciting
>> because they put the design on a beam guide with few principles and many
>> consequences.) The precondition is that b is null or has been previously
>> returned by allocate(). This method is optional.
>
> Great to see this primitive. Classic malloc-ators are so lame...
> (+ WinAPI Heaps fits here)
>
>> * deallocate(b) deallocates the memory allocated for b. b must have been
>> previously allocated by the same allocator. This method is usually
>> unsafe (but there are notable implementations that may offer safety,
>> such as unbounded freelists.) This method is optional.
>
> Does the following implication hold: "has a deallocate" --> must be manually managed? Otherwise how would one reliably work with it and not leak? This brings us to traits that allocators may (should) have,
> à la: automatic? free-all on termination? zeroes on allocate (more common than one would think)? etc.
>
>> * deallocateAll() deallocates in one shot all memory previously
>> allocated by this allocator. This method is optional, and when present
>> is almost always unsafe (I see no meaningful @safe implementation.)
>> Region allocators are notable examples of allocators that define this
>> method.
>
> Okay... I presume region "mark" works by splitting off a subinstance of the allocator, and "release" by deallocateAll().
>
>>
>> * collect() frees unused memory that had been allocated with this
>> allocator. This optional method is implemented by tracing collectors and
>> is usually @safe.
>>
>
> This seems hard and/or suboptimal w/o typeinfo --> typed allocators? I guess they could be more than a simple helper.
>
>> There are quite a few more things to define more precisely, but this
>> part of the design has become quite stable. To validate the design, I've
>> defined some simple allocators (Mallocator, GCAllocator, Freelist,
>> StackRegion, Region etc) but not the more involved ones, such as
>> coalescing heaps, buddy system etc.
>
> The problem I see is that allocators are rooted in a poor foundation - malloc is so out of fashion (its interface is simply too thin on guarantees), and sbrk is frankly a stone-age syscall.
>
> I would love the root allocator to be one built on top of mmap/VirtualAlloc.
> This leads me to my biggest problem with classical memory management - ~20 years of PCs having virtual memory support directly in the CPU, and yet hardly anything outside of the OS takes advantage of it. Allocators seem simply not aware of it - nothing in their interfaces implies anything that would help the user take advantage of it. I'm talking of (potentially) large allocations that may often be resized.
>
> (large allocations are actually almost always a blind spot/fallback in allocators)
>
> To the point - we have address space, and optionally memory committed within it, leading us to 3 states of an _address range_:
> available, reserved, committed. Ranges in the 2nd state are dirt cheap (abundant on 64-bit) and can easily be flipped to committed and back.
>
> So I have a pattern I want to try to fit into your design. At present it is a data structure + allocation logic. What I want is clean, nice composition for ultimate reuse.
>
> An example is a heavy-duty array (which could be used for a potentially large stack). The requirements are:
> a) Never reallocate on resize - in certain cases it may even be too large to reallocate in RAM (!)
> b) Never waste RAM beyond some limit (a few pages or a tiny fraction of the size is fine)
> c) O(1) amortized appending as in a plain array, plus the other properties of an array -> it's a dynamic array after all
>
> Currently I use mmap/madvise and related entities directly. It goes as follows.
>
> Allocate a sufficiently large _address range_ - as large as "it" may ever get (1MB, 1GB, 10GB, you name it). Call this size the 'capacity'.
> Commit some memory (optionally on first resize) - call this the 'committed' amount. Appending uses up committed space as a typical array would. Whenever it needs more than that, it applies the same extension algorithm as usual, but on the (committed, capacity) pair.
> Ditto on resizing back down - (with some amortization) it asks the OS to decommit pages, decreasing the committed amount.
>
> In short - a C++ "vector" that has 2 capacities (current and absolute maximum), with the plus that it never reallocates. What to do if/when it hits the upper bound is up to the specific application (terminate, fall back to something else). It has the danger of sucking up address space on 32 bits though... but who cares :)
>
>

Somewhat related: http://probablydance.com/2013/05/13/4gb-per-vector/
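For readers who want the shape of the reserve/commit scheme described above in code, here is a minimal POSIX-only sketch (all names invented for illustration; on Windows the same shape falls out of VirtualAlloc with MEM_RESERVE and MEM_COMMIT):

```d
// Sketch only: a reserve/commit growable buffer along the lines described
// above. POSIX-specific; all names are invented for illustration.
import core.sys.posix.sys.mman;

struct ReservedArray
{
    void* base;        // start of the reserved address range
    size_t capacity;   // size of the reserved range (the absolute maximum)
    size_t committed;  // bytes currently backed by committed memory
    size_t length;     // bytes in use

    this(size_t maxBytes)
    {
        capacity = maxBytes;
        // Reserve address space only; no physical memory is committed yet.
        base = mmap(null, capacity, PROT_NONE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);
        assert(base != MAP_FAILED);
    }

    // Grow the array by n bytes, committing more pages as needed.
    bool append(size_t n)
    {
        if (length + n > capacity) return false; // hit the hard upper bound
        enum pageSize = 4096; // assumed for brevity
        while (length + n > committed)
        {
            // Double the committed amount, clamped to capacity.
            size_t grow = committed ? committed : pageSize;
            if (committed + grow > capacity) grow = capacity - committed;
            if (mprotect(cast(ubyte*) base + committed, grow,
                         PROT_READ | PROT_WRITE) != 0) return false;
            committed += grow;
        }
        length += n;
        return true;
    }
}
```

Decommitting on shrink would go the other way (mprotect back to PROT_NONE plus madvise), omitted here for brevity.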
September 24, 2013
On 9/24/13 9:39 AM, Brad Anderson wrote:
> Somewhat related: http://probablydance.com/2013/05/13/4gb-per-vector/

Great insight, long scroll. Please trim quoting appropriately.

Andrei

September 24, 2013
On 24.09.2013 18:14, Andrei Alexandrescu wrote:
> On 9/24/13 9:12 AM, Paulo Pinto wrote:
>> On 24.09.2013 17:29, Andrei Alexandrescu wrote:
>>> On 9/23/13 11:32 PM, Jacob Carlborg wrote:
>>>> On 2013-09-23 19:53, Andrei Alexandrescu wrote:
>>>>
>>>>>
>>>>> I think this is debatable. For one, languages such as Java and C++
>>>>> still have a built-in "new" but quite ubiquitously recommend
>>>>> against its use in user code. As far as I can tell that's been a
>>>>> successful campaign.
>>>>
>>>> Since when is it _not_ recommended to use "new" in Java?
>>>
>>> http://goo.gl/KVmAom
>>>
>>>
>>> Andrei
>>>
>>>
>>
>> The top results are from around 2002, when most JVMs still had basic GC
>> implementations and did not perform escape analysis.
>
> I had in mind the polymorphism/flexibility argument; I'd forgotten
> about the efficiency argument.
>
> Andrei

Ah ok. There I agree with you.

The common trend is to favour interfaces for behaviour and make use of dependency injection instead of new.

--
Paulo
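As a hedged illustration of that trend (invented names, not from any particular framework), constructor injection looks like:

```d
// Invented names, purely illustrative: the consumer depends on an
// interface; the concrete implementation is injected from outside
// instead of being new'ed inside the class that uses it.
interface Storage
{
    void put(string key, string value);
}

class MemoryStorage : Storage
{
    string[string] data;
    void put(string key, string value) { data[key] = value; }
}

class Service
{
    private Storage storage; // the dependency, injected via the constructor
    this(Storage storage) { this.storage = storage; }
    void save() { storage.put("status", "ok"); }
}
```

The only `new` expressions then live at the composition root (`new Service(new MemoryStorage)`), which is exactly the spot a DI container, factory, or custom allocator can take over.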
September 24, 2013
On Tuesday, 24 September 2013 at 15:25:11 UTC, Andrei Alexandrescu wrote:
>> What are they paying exactly? An extra arg to allocate that can probably
>> be defaulted?
>>   void[] allocate(size_t bytes, size_t align = this.alignment) shared;
>
> For allocating relatively small objects (say up to 32K), we're looking at tens of cycles, no more. An extra argument needs to be passed around and more importantly looked at and acted upon. At this level it's a serious dent in the time budget.

The cost of a few cycles really doesn't matter for memory allocation... If you are really allocating memory so frequently that those few extra cycles matter, then you are probably going to be memory-bound anyway.

I think this is a situation where you need to justify yourself with something concrete. Can you provide an example of some code whose performance is significantly impacted by the addition of an alignment parameter? It has to be "real code" that does something useful, not just a loop that continually calls allocate.

September 24, 2013
On 9/24/13 9:58 AM, Peter Alexander wrote:
> On Tuesday, 24 September 2013 at 15:25:11 UTC, Andrei Alexandrescu wrote:
>>> What are they paying exactly? An extra arg to allocate that can probably
>>> be defaulted?
>>>   void[] allocate(size_t bytes, size_t align = this.alignment) shared;
>>
>> For allocating relatively small objects (say up to 32K), we're looking
>> at tens of cycles, no more. An extra argument needs to be passed
>> around and more importantly looked at and acted upon. At this level
>> it's a serious dent in the time budget.
>
> The cost of a few cycles really doesn't matter for memory allocation...
> If you are really allocating memory so frequently that those few extra
> cycles matter then you are probably going to be memory bound anyway.

It does. I'm not even going to argue this.

> I think this is a situation where you need to justify yourself with
> something concrete. Can you provide an example of some code whose
> performance is significantly impacted by the addition of an alignment
> parameter? It has to be "real code" that does something useful, not just
> a loop that continually calls allocate.

Strings.


Andrei

September 24, 2013
On Tuesday, 24 September 2013 at 16:58:27 UTC, Peter Alexander wrote:
> The cost of a few cycles really doesn't matter for memory allocation... If you are really allocating memory so frequently that those few extra cycles matter then you are probably going to be memory bound anyway.

That is true if you are using malloc, but not for a pool allocator, a scratchpad allocator, or a similar performance-enhancing allocator.
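To make the point concrete: in a freelist-style allocator the hot path is just a pointer pop, so a handful of extra cycles is a large relative cost. A rough sketch (fixed block size, single-threaded; this is not std.allocator's actual code):

```d
// Rough sketch of a fixed-size freelist (single-threaded; not
// std.allocator's actual code). The fast path of allocate() is a
// couple of loads and one store, so every extra argument or branch
// is a measurable fraction of the whole operation.
import core.memory : GC;

struct FreeList
{
    enum blockSize = 64;
    private static struct Node { Node* next; }
    private Node* root;

    void[] allocate()
    {
        if (root is null) // cold path: fall back to the backing allocator
            return GC.malloc(blockSize)[0 .. blockSize];
        auto b = root;
        root = root.next; // the entire fast path
        return (cast(void*) b)[0 .. blockSize];
    }

    void deallocate(void[] b)
    {
        // Thread the block back onto the list; one store of bookkeeping.
        auto n = cast(Node*) b.ptr;
        n.next = root;
        root = n;
    }
}
```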
September 24, 2013
On 9/24/13 10:05 AM, Dicebot wrote:
> On Tuesday, 24 September 2013 at 16:58:27 UTC, Peter Alexander wrote:
>> The cost of a few cycles really doesn't matter for memory
>> allocation... If you are really allocating memory so frequently that
>> those few extra cycles matter then you are probably going to be memory
>> bound anyway.
>
> It is true if you are using malloc. Not for pool allocator or scratchpad
> allocator or similar performance-enhancing allocator.

Even for malloc it's a concern. For small allocations, jemalloc takes only about one hundred cycles to take you from where you are to where you are plus a little memory.

Andrei

September 24, 2013
On Tuesday, 24 September 2013 at 16:06:39 UTC, Andrei Alexandrescu wrote:
> On 9/24/13 4:38 AM, Dan Schatzberg wrote:
>> One thing I'm not sure is addressed by this design is memory locality. I
>> know of libnuma http://linux.die.net/man/3/numa which allows me to
>> express what NUMA domain my memory should be allocated from at run-time
>> for each allocation.
>>
>> In the case that I want to allocate memory in a specific NUMA domain
>> (not just local vs non-local), I believe this design is insufficient
>> because the number of domains are only known at run-time.
>>
>> Also, as far as alignment is concerned I will throw in that x86 is
>> relatively unique in having a statically known cache-line size. Both ARM
>> and PowerPC cores can differ in their cache-line sizes. I feel this is a
>> significant argument for the ability to dynamically express alignment.
>
> Could you send a few links so I can take a look?
>
> My knee-jerk reaction to this is that NUMA allocators would provide their own additional primitives and not participate naively in compositions with other allocators.
>
>
> Andrei

Not sure what kind of links you're looking for.

The following link is a good discussion of the issue and the current solutions
http://queue.acm.org/detail.cfm?id=2513149

In particular:

"The application may want fine-grained control of how the operating system handles allocation for each of its memory segments. For that purpose, system calls exist that allow the application to specify which memory region should use which policies for memory allocations.

The main performance issues typically involve large structures that are accessed frequently by the threads of the application from all memory nodes and that often contain information that needs to be shared among all threads. These are best placed using interleaving so that the objects are distributed over all available nodes."

The Linux/libc interfaces are linked in my first comment. Specifically with the mbind() call one can specify the policy for allocations from a virtual address range (which NUMA node to allocate the backing physical page from). More generally you could imagine specifying this per allocation.

What is your objective, though? Aren't you trying to define a hierarchy of allocators where more specific allocators can be composed from general ones? In which case, what is the concern with including locality at the base level? It seems to be one characteristic of memory that programmers might be concerned with, and rather trivially you can compose a non-locality-aware allocator from a locality-aware one.
September 24, 2013
On Tuesday, 24 September 2013 at 17:02:18 UTC, Andrei Alexandrescu wrote:
> On 9/24/13 9:58 AM, Peter Alexander wrote:
>> On Tuesday, 24 September 2013 at 15:25:11 UTC, Andrei Alexandrescu wrote:
>>>> What are they paying exactly? An extra arg to allocate that can probably
>>>> be defaulted?
>>>>  void[] allocate(size_t bytes, size_t align = this.alignment) shared;
>>>
>>> For allocating relatively small objects (say up to 32K), we're looking
>>> at tens of cycles, no more. An extra argument needs to be passed
>>> around and more importantly looked at and acted upon. At this level
>>> it's a serious dent in the time budget.
>>
>> The cost of a few cycles really doesn't matter for memory allocation...
>> If you are really allocating memory so frequently that those few extra
>> cycles matter then you are probably going to be memory bound anyway.
>
> It does. I'm not even going to argue this.

Sorry, but I find this insulting. Manu and I, both senior professional game developers with a lot of experience in performance, are arguing against you. I'm not saying this makes us automatically right, but I think it's rude to dismiss our concerns as not even worthy of discussion.


>> I think this is a situation where you need to justify yourself with
>> something concrete. Can you provide an example of some code whose
>> performance is significantly impacted by the addition of an alignment
>> parameter? It has to be "real code" that does something useful, not just
>> a loop that continually calls allocate.
>
> Strings.

Strings what? Just allocating lots of small strings?

Ok, I've put together a benchmark of the simplest allocator I can think of (pointer bump) doing *nothing* but allocating 12 bytes at a time and copying a pre-defined string into the allocated memory: http://dpaste.dzfl.pl/59636d82

On my machine, the difference between the version with alignment and the version without is 1%. I tried changing the allocator to a class so that the allocation was virtual and not inlined, and the difference was still only ~2% (yes, I verified in the generated code that nothing was being omitted).

In a real scenario, much more will be going on outside the allocator, making the overhead much less than 1%.

Please let me know if you take issue with the benchmark. I wrote this quickly so hopefully I have not made any mistakes.
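For reference, the kind of pointer-bump allocator being benchmarked might look roughly like this (a sketch with a defaulted alignment parameter; this is not the actual dpaste code, and the names are invented):

```d
// A sketch of a pointer-bump allocator with a defaulted alignment
// parameter (NOT the benchmarked dpaste code; names invented).
// Callers that don't care about alignment pay only for the rounding
// arithmetic below.
struct BumpAllocator
{
    private ubyte[] buffer;
    private size_t offset;

    this(ubyte[] storage) { buffer = storage; }

    void[] allocate(size_t bytes, size_t alignment = size_t.sizeof)
    {
        // Round the current offset up to the requested alignment
        // (alignment must be a power of two).
        immutable aligned = (offset + alignment - 1) & ~(alignment - 1);
        if (aligned + bytes > buffer.length) return null; // out of space
        offset = aligned + bytes;
        return buffer[aligned .. offset];
    }
}
```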