Jump to page: 1 2
Thread overview
Reimplementing software building blocks like malloc and free in D
Aug 11, 2018
Aruna Maurya
Aug 11, 2018
rikki cattermole
Aug 11, 2018
Eugene Wissner
Aug 11, 2018
Mike Franklin
Aug 12, 2018
Aruna Maurya
Aug 12, 2018
Mike Franklin
Aug 12, 2018
Eugene Wissner
Aug 13, 2018
Mike Franklin
Aug 13, 2018
bpr
Aug 12, 2018
Aruna Maurya
Aug 12, 2018
Eugene Wissner
Aug 12, 2018
Mike Franklin
Aug 12, 2018
Aruna Maurya
Aug 12, 2018
Mike Parker
Aug 12, 2018
Aruna Maurya
Aug 12, 2018
Mike Franklin
Aug 12, 2018
Mike Franklin
August 11, 2018
Hello all! I am a Computer Science undergraduate student. Currently a junior, I find the idea of 're-implementing software building blocks like malloc and free in D' listed in the wiki page of SAOC(Symmetry Autumn of Code) quite interesting.

I don't have experience of coding in D however, I have 4 years of coding in C++ and have a good hold on pointers and the standard template library. I have also successfully completed a course of Operating Systems and Data Structures in my 4th semester. I believe I can pick up D in due course of time.

I had few queries regarding the same and would be glad if anyone could pitch in to help me out.

1. If it would have been C++, the above can be implemented by using the mmap() syscall to                  allocate memory from the heap. However, something that I am not able to understand in the idea description is 'runtime hooks'.

2. Also as far as I can understand, this idea is about implementing the above from scratch, and not tweaking the existing codebase. Do let me know if I am wrong.

Thank you for your time!

August 12, 2018
On 12/08/2018 6:59 AM, Aruna Maurya wrote:
> 1. If it would have been C++, the above can be implemented by using the mmap() syscall to                  allocate memory from the heap. However, something that I am not able to understand in the idea description is 'runtime hooks'.

A runtime hook is a function that the compiler injects a call to, to perform some function. Like allocate a class.

By reimplementing a lot of these primitives, the compiler will be able to optimize them better e.g. inlining and removing dependencies on other runtime information.

> 2. Also as far as I can understand, this idea is about implementing the above from scratch, and not tweaking the existing codebase. Do let me know if I am wrong.

It is modifying existing druntime and with it dmd (frontend) after developing these building blocks.

This project is just an idea, you can set your own set of goals and scope in the proposal. For example, you may choose to rewrite the memory management stuff in libc in D and support it across Windows, Linux and OSX. Instead of dealing with integration into druntime/dmd.

August 11, 2018
On Saturday, 11 August 2018 at 18:59:00 UTC, Aruna Maurya wrote:
> Hello all! I am a Computer Science undergraduate student. Currently a junior, I find the idea of 're-implementing software building blocks like malloc and free in D' listed in the wiki page of SAOC(Symmetry Autumn of Code) quite interesting.
>
> I don't have experience of coding in D however, I have 4 years of coding in C++ and have a good hold on pointers and the standard template library. I have also successfully completed a course of Operating Systems and Data Structures in my 4th semester. I believe I can pick up D in due course of time.
>
> I had few queries regarding the same and would be glad if anyone could pitch in to help me out.
>
> 1. If it would have been C++, the above can be implemented by using the mmap() syscall to                  allocate memory from the heap. However, something that I am not able to understand in the idea description is 'runtime hooks'.
>
> 2. Also as far as I can understand, this idea is about implementing the above from scratch, and not tweaking the existing codebase. Do let me know if I am wrong.
>
> Thank you for your time!

I missed this project; To let everyone know, I've implemented malloc and free:

https://github.com/caraus-ecms/tanya/blob/master/source/tanya/memory/mmappool.d

(see allocate() and deallocate() for a starting point). My implementation is just not thread-safe.
August 11, 2018
On Saturday, 11 August 2018 at 18:59:00 UTC, Aruna Maurya wrote:

> 1. If it would have been C++, the above can be implemented by using the mmap() syscall to                  allocate memory from the heap. However, something that I am not able to understand in the idea description is 'runtime hooks'.

The compiler recognizes certain expressions and lowers them to functions in the D runtime that we call runtime hooks.  See https://wiki.dlang.org/Runtime_Hooks for an unmaintained, but still relevant, list of some of them.

For example, the expression `cast(int[])a`, where `a` is an array of another type will get lowered to `_d_arraycast`.

We're trying to rewrite many of these hooks as templates.  D has changed quite a bit over the past few years, and at the time the existing runtime hooks were written D didn't have templates and many other features.  See https://github.com/dlang/druntime/pull/2264 for an example converting `_d_arraycast` to a template.

Many of the existing runtime hooks leverage software building blocks like `memcpy`, `memcmp`, `malloc`, `free`, etc. in their implementation.  If the runtime hooks are implemented as templates, we will have access to the type information and other information at compile-time rather than run-time.  This will give us opportunities to specialize implementations at compile-time.  See https://github.com/JinShil/memcpyD/blob/master/memcpyd.d for a proof of concept exploring such opportunities for `memcpy`.

> 2. Also as far as I can understand, this idea is about implementing the above from scratch, and not tweaking the existing codebase. Do let me know if I am wrong.

I'm the one who proposed the idea, and, yes, I envisioned implementing these building blocks from scratch, translating an existing implementation from another language (watch the license!), or implementing an existing published algorithm in D.

There might even be a way to leverage implementation code in https://dlang.org/phobos/std_experimental_allocator.html, but if that code were used in the D runtime, it couldn't have dependencies on Phobos (D's standard library), so it'd probably have to be modified to make it free-standing.  I don't know, though, I haven't looked into it in any detail.

I didn't envision the participant modifying any existing code.  In the case of `malloc`, `realloc`, and `free`, I'd just like to have an idiomatic D implementation that we might be able to leverage when re-implementing the runtime hooks as templates so we no longer have to depend on the C standard library.  I think implementing them in D will make D more portable to other platforms, including freestanding bare-metal platforms like operating systems or microcontrollers, and D's unique feature set may provide opportunities to improve upon existing implementations in terms of performance and compile-time guarantees.

It's difficult for me to articulate all this, as it's a complex vision in my head, so if the above just raised more questions, keep asking.

Also see https://wiki.dlang.org/Memory_Management for some perspective, especially if you're new to D.

The ideas on the wiki page are just to plant the seeds of creativity.  Feel free to deviate from the idea as you wish, or submit a proposal formulated from your own ideas.

Mike


August 12, 2018
On Saturday, 11 August 2018 at 20:15:06 UTC, Mike Franklin wrote:
> On Saturday, 11 August 2018 at 18:59:00 UTC, Aruna Maurya wrote:
>
>> [...]
>
> The compiler recognizes certain expressions and lowers them to functions in the D runtime that we call runtime hooks.  See https://wiki.dlang.org/Runtime_Hooks for an unmaintained, but still relevant, list of some of them.
>
> [...]

So there are essentially two requirements here. Building the memory management functionalities from scratch in idiomatic D and implementing the existing runtime hooks as templates. The latter isn't quite a part of the idea as mentioned on the page.

Also I see in one of the above replies that malloc and free have been already implemented. So is that off the shelf or still available for implementation?
>
> [...]

Thanks for the resource! Will go through it and get back in case of any doubts.
>
> [...]

August 12, 2018
On Sunday, 12 August 2018 at 04:16:24 UTC, Aruna Maurya wrote:

> So there are essentially two requirements here. Building the memory management functionalities from scratch in idiomatic D and implementing the existing runtime hooks as templates. The latter isn't quite a part of the idea as mentioned on the page.

I'd say there is only 1 requirement:  implementing `malloc`, `realloc`, and `free` in idiomatic D without any dependencies on other libraries from other languages.

The "idiomatic D" part means you don't even have to name them "malloc", "realloc", and "free", nor use the same type-unsafe signatures found in the C standard library.  Do what makes sense for D, for example, you may want to implement an overload set like `T[] heapAllocate(T)(size_t length)` and `T heapAllocate(T)()`, or something totally different.  You may even want to provide some other features for gathering runtime statistics on the heap.  The world's your oyster here.

The only reason I mentioned the runtime hooks is to help make the case for why these D implementations are needed and provide some perspective about how they may be used.  I don't envision the Autumn of Code participant learning about the compiler/runtime interface and refactoring the D runtime with the D implementations (unless they really want to).

Regardless of whether we decide to incorporate these D implementations into the official D runtime or not, they will still be worth gold to those who desire using D in freestanding systems programming, so don't concern yourself too much with the d runtime; just make a kick-ass dynamic heap allocator in D, and people will use it.


> Also I see in one of the above replies that malloc and free have been already implemented. So is that off the shelf or still available for implementation?

I wasn't aware of Eugene's implementation.  At first glance it looks real nice.  It appears to have some dependencies on other features of his Tanya library, so at a minimum, those will have to be removed or replaced.

But, how does it perform?  In order to make the case for using these D implementations in druntime, we'll have to gather some metrics on the implementations in terms of performance, memory fragmentation, etc. and ensure they are on par with the C standard library implementations or alternative implementations like jemalloc and tcmalloc?

Also, can Eugene's implementation be used in `@safe` code and is it compatible with the DIP1000? (See https://github.com/dlang/DIPs/blob/master/DIPs/DIP1000.md and the -dip1000 compiler switch at https://github.com/dlang/DIPs/blob/master/DIPs/DIP1000.md)

Perhaps the D implementations can be enhanced to provide compile-time or runtime tuning parameters to achieve better performance in different use cases.

Perhaps, with Eugene's permission, you can use his implementation as a base, take some measurements, and see if it can be improved upon with either more features or better performance.

I'm sorry if I'm just muddying the waters with all this talk.  All I want is a high-quality, idiomatic D implementation of a heap allocator that I can use in my own freestanding software and as a substitute for the C standard library implementations in the D runtime.

Mike
August 12, 2018
On Sunday, 12 August 2018 at 05:24:56 UTC, Mike Franklin wrote:
> I wasn't aware of Eugene's implementation.  At first glance it looks real nice.  It appears to have some dependencies on other features of his Tanya library, so at a minimum, those will have to be removed or replaced.
>

It depends only on internal libc-free syscalls that can be replaced with mmap/munmap and on copy()/copyBackward() that are just memcpy and memmove respectively.

> But, how does it perform?  In order to make the case for using these D implementations in druntime, we'll have to gather some metrics on the implementations in terms of performance, memory fragmentation, etc. and ensure they are on par with the C standard library implementations or alternative implementations like jemalloc and tcmalloc?

Some years ago I had a BigInt implementation that allocated and deallocated memory like crazy. I compared how it works with malloc and my own allocator and my allocator was slightly faster than glibc (probably because of thread-safety). But yes, a good implementation needs benchmarks which test it systematically under various conditions.

Tanya's allocator is also tested only on x86-64 Linux, but is trivially to port to Windows and other architectures. I actually had a Windows version before but removed it because I was too lazy to read msdn.

> Perhaps, with Eugene's permission, you can use his implementation as a base, take some measurements, and see if it can be improved upon with either more features or better performance.

You can use it as reference when implementing something else or as a starting point for the further work. Anyway feel free to use it as you like.

P.S. In the last weeks I had thoughts to split low-level stuff from tanya and put it into a separate library, kind of native, gc-free x86-64 (and maybe aarch64) Linux runtime for D. Probably I should go for it :)
August 12, 2018
On Sunday, 12 August 2018 at 05:24:56 UTC, Mike Franklin wrote:
> On Sunday, 12 August 2018 at 04:16:24 UTC, Aruna Maurya wrote:
>
>> [...]
>
> I'd say there is only 1 requirement:  implementing `malloc`, `realloc`, and `free` in idiomatic D without any dependencies on other libraries from other languages.
>
> [...]
Also what about other implementations like memset or memcmp. Have they been implemented or require work from scratch?

> But, how does it perform?  In order to make the case for using these D implementations in druntime, we'll have to gather some metrics on the implementations in terms of performance, memory fragmentation, etc. and ensure they are on par with the C standard library implementations or alternative implementations like jemalloc and tcmalloc?
>
> [...]

So the only option to implement the above is to go through Eugene's code, take some measurements and see if it can be improved.

> Perhaps the D implementations can be enhanced to provide compile-time or runtime tuning parameters to achieve better performance in different use cases.
>
> [...]
That's totally okay. Your every comment is making the idea more clearer!
>
> [...]

August 12, 2018
On Sunday, 12 August 2018 at 06:44:37 UTC, Aruna Maurya wrote:
> On Sunday, 12 August 2018 at 05:24:56 UTC, Mike Franklin wrote:
>> On Sunday, 12 August 2018 at 04:16:24 UTC, Aruna Maurya wrote:
>>
>>> [...]
>>
>> I'd say there is only 1 requirement:  implementing `malloc`, `realloc`, and `free` in idiomatic D without any dependencies on other libraries from other languages.
>>
>> [...]
> Also what about other implementations like memset or memcmp. Have they been implemented or require work from scratch?

These functions are still mostly implemented in asm, so I'm not sure there is an "idiomatic D" way. I would only wrap them into a function working with slices and checking the length. Mike?
>
> So the only option to implement the above is to go through Eugene's code, take some measurements and see if it can be improved.
>
As said my allocator is nothing official but I would be overhappy if someone finds it useful. I can also advice to look at Doug Lea malloc:

http://g.oswego.edu/dl/html/malloc.html

It is like a book. It's one of the best documented programs I've ever seen.
August 12, 2018
On Sunday, 12 August 2018 at 07:00:30 UTC, Eugene Wissner wrote:

>> Also what about other implementations like memset or memcmp. Have they been implemented or require work from scratch?
>
> These functions are still mostly implemented in asm, so I'm not sure there is an "idiomatic D" way. I would only wrap them into a function working with slices and checking the length. Mike?

Inline ASM is a feature of D, so "idiomatic D" includes assembly implementations.  Where D shines here is with it's metaprogramming capabilities and the ability to select an implementation at compile-time or compose implementations.  See https://github.com/JinShil/memcpyD/blob/master/memcpyd.d  There you will that the compiler selects an implementation based on size in powers of 2.  Sizes that aren't powers of 2 can be composed of the powers of 2 implementations.  For example a 6 byte implementation would be comprised of a 4-byte implementation plus a the 2 byte implementation.

I have more code than what's currently check into that repository, but I stalled because I need AVX512 to do an optimized implementation, and that doesn't seem to be a feature of DMD at the moment.  That was one reason I posted https://wiki.dlang.org/SAOC_2018_ideas#Implement_AVX2_and_AVX-512_in_DMD.2FDRuntime on the wiki.

Agner Fog had this to say in https://agner.org/optimize/optimizing_assembly.pdf

---
There are several ways of moving large blocks of data. The most common methods are:
1. REP MOVS instruction.
2. If data are aligned: Read and write in a loop with the largest available register size.
3. If size is constant: inline move instructions.
165
4. If data are misaligned: First move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the largest available register size.
5. If data are misaligned: Read aligned, shift to compensate for misalignment and write aligned.
6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary.

As you can see, it can be very difficult to choose the optimal method in a given situation. The best advice I can give for a universal memcpy function, based on my testing, is as follows:
* On Intel Wolfdale and earlier, Intel Atom, AMD K8 and earlier, and VIA Nano, use the aligned read - shift - aligned write method (5).
* On Intel Nehalem and later, method (4) is up to 33% faster than method (5).
166
* On AMD K10 and later and Bobcat, use the unaligned read - aligned write method (4).
* The non-temporal write method (6) can be used for data blocks bigger than half the size of the largest-level cache.
---

I think D is well suited for an implementation that takes all these things into consideration, selecting an appropriate implementation, and composing complex implementations of the simpler implementations.

Mike
« First   ‹ Prev
1 2