Jump to page: 1 2
Thread overview
Returning variable-sized stack data
Jul 15
IchorDev
Jul 15
ryuukk_
Jul 15
monkyyy
Jul 17
IchorDev
Jul 17
IchorDev
Jul 18
IchorDev
Aug 10
IchorDev
Jul 17
monkyyy
Jul 17
monkyyy
Aug 12
ryuukk_
Aug 13
ryuukk_
Aug 17
IchorDev
Jul 17
IchorDev
July 15

I mentioned in the monthly meeting how I would like to see a more convenient way to return variable-sized data to the stack in D. Walter mentioned that he wouldn't like to break the C ABI, which is understandable, but you can certainly make this work without a different ABI. In fact, you can even return variable-sized data to the stack in C:

#include <alloca.h>
#include <stdio.h>

struct A{ int a,b,c,d,e,f,g; };
struct B{ int a,b,c; };

int myFnRetSize(int n){ return n == 0 ? sizeof(struct A) : sizeof(struct B); }

void myFn(void* mem, int n){
	if(n == 0){
		struct A a = {1,2,3,4,5,6,7};
		*((struct A*)mem) = a;
	}else{
		struct B b = {1,2,7};
		*((struct B*)mem) = b;
	}
}

int main(){
	int n = 1; //<—— can be any number
	int size = myFnRetSize(n);
	void* mem = alloca(size);
	myFn(mem, n);
	
	//write out the result:
	for(int i=0; i<size/sizeof(int); i++){
		printf("%d ", ((int*)mem)[i]);
	}
	printf("\n");
	return 0;
}

Sorry if the code is terrible, but hopefully it demonstrates my point adequately. You might say that this is technically returning by reference, but at the machine-code level all stack access is done via pointers.

You might be wondering: what the point of having this feature would even be?
Well, unions always take as much space as their largest member. If a union contains a struct that's (for example) 512 bytes large, it will always take 512 bytes, when really we might only need to store a 4–8 byte number most of the time. With sumtypes, variable-sized stack returns could greatly optimise their stack consumption in cases where they have vastly different type sizes, with the smaller types being used most frequently.
Some might say reference types should be used for such a purpose; but when you're programming a largely data-driven system that uses masses of structs, the sheer amount of heap allocations could become a huge performance bottleneck, whereas stack allocation is practically instant. Of course, you can always pre-allocate a huge amount of data onto the stack, but then a lot of it will go to waste and your code will be more vulnerable to stack-overflows.

A way of making variable-sized stack returns less cumbersome in D would be to have some syntactic sugar that works something like this:

struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }

@stackArrayReturn myFn(int n){
	auto nSqr      = n * n; //demonstrate how variables can affect the return value
	auto condition = n * n;
	if(condition == 0){
		return A(nSqr+1,2,3,4,5,6,7);
	}else{
		return B(nSqr,2,7);
	}
}

void main(){
	int n = 1; //<—— can be any number
	void[] myMem = myFn(n);
}

Which gets lowered to this:

import std.typecons;
struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }

size_t myFn(out return scope void function(void[] memory, Tuple!(int,"nSqr") context) callback, out Tuple!(int,"nSqr") context, int n){
	auto nSqr      = n * n;
	auto condition = n * n;
	if(condition == 0){
		context = Tuple!(int,"nSqr")(nSqr);
		callback = (void[] m, Tuple!(int,"nSqr") ctx){
			*cast(A*)&m[0] = A(ctx.nSqr+1,2,3,4,5,6,7);
		};
		return A.sizeof;
	}else{
		context = Tuple!(int,"nSqr")(nSqr);
		callback = (void[] m, Tuple!(int,"nSqr") ctx){
			*cast(B*)&m[0] = B(ctx.nSqr,2,7);
		};
		return B.sizeof;
	}
}

void main(){
	int n = 1; //<—— can be any number
	void[] myMem;
	{
		scope void function(void[] memory, Tuple!(int,"nSqr") context) __returnCallback;
		Tuple!(int,"nSqr") __returnContext;
		size_t __returnSize = myFn(__returnCallback, __returnContext, n);
		import core.stdc.stdlib: alloca;
		myMem = alloca(__returnSize)[0..__returnSize];
		__returnCallback(myMem, __returnContext);
	}
}

This is an example of returning one of two different struct types, but this could also be used to return slices of any type (e.g. int[]).
Ideally there would be a nice way to do this with scoped delegates, but alloca will re-allocate the data that their context pointer points to. Another less stack-wasteful (albeit potentially significantly more CPU-wasteful) method would be to have two completely separate functions. One that determines the return size, and another that does all the other logic. However, this method would either limit the function's structure significantly, or generate wasteful code.

I would love to hear feedback and suggestions for improvements, or of other possible implementations for variadic stack returns.

July 15

+1, however the syntax proposed sucks

ubyte[$] get_variable_size_array(int cond) {
    if (cond == 0)
        return [1,2,3];
    else
        return [1,2,3,4,5];
}


// OR

scope(ubyte[]) get_variable_size_array(int cond) {
    if (cond == 0)
        return [1,2,3];
    else
        return [1,2,3,4,5];
}
July 15

On Monday, 15 July 2024 at 11:12:59 UTC, ryuukk_ wrote:

>

ubyte[$]

this one


I think you should also preemptively handle runtime stack allocated arrays inside the function

int[$] foo(size_t i){
  int[i] output;
  foreach(...){
    ...
  }
  return output;
}
July 17

On Monday, 15 July 2024 at 17:20:57 UTC, monkyyy wrote:

>

I think you should also preemptively handle runtime stack allocated arrays inside the function

int[$] foo(size_t i){
  int[i] output;
  foreach(...){
    ...
  }
  return output;
}

Oh yes, that would be fantastic!
It'd be lowered to something like this, right?

int[] output = (cast(int*)alloca(i * int.sizeof))[0..i];
July 17

On Monday, 15 July 2024 at 11:12:59 UTC, ryuukk_ wrote:

>

the syntax proposed sucks

I'm not really proposing any particular syntax at this point, I just gave an example of how an example implementation could be nicely lowered into existing D code.

July 17

On Wednesday, 17 July 2024 at 08:23:45 UTC, IchorDev wrote:

>

On Monday, 15 July 2024 at 17:20:57 UTC, monkyyy wrote:

>
int[$] foo(size_t i){
  int[i] output;

Isn't that a stack overflow footgun?

> >

foreach(...){
...
}
return output;
}

Oh yes, that would be fantastic!
It'd be lowered to something like this, right?

int[] output = (cast(int*)alloca(i * int.sizeof))[0..i];

Wouldn't that allocate on foo's stack? It needs to be on the caller's stack.

July 17

On Wednesday, 17 July 2024 at 09:31:37 UTC, Nick Treleaven wrote:

>

Isn't that a stack overflow footgun?

How so?

>

Wouldn't that allocate on foo's stack? It needs to be on the caller's stack.

Depends on whether you place the alloca call in the caller or the callee.
This brings us to an important implementation question: what if there's 3 layers of function calls, and all of them each want to return a variadic stack allocation? The example in my OP could certainly be chained with many returns, but perhaps there's a better way? What if we just return a pointer to the stack memory, use alloca, and then copy the data to the newly allocated pointer? Granted, the newly allocated memory will overlap the old memory if it's sufficiently large, but this shouldn't be a problem under any calling conventions that I'm familiar with. This example works with dmd >= 2.084.1, but fails with ldc2/old dmd for some reason:

import std.typecons;
struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }
void[] myFn(int n){
	import core.stdc.stdlib: alloca;
	auto nSqr = n * n;
	if(nSqr == 0){
		auto __returnPtr = alloca(A.sizeof);
		*cast(A*)__returnPtr = A(nSqr+1,2,3,4,5,6,7);
		return __returnPtr[0..A.sizeof];
	}else{
		auto __returnPtr = alloca(B.sizeof);
		*cast(B*)__returnPtr = B(nSqr,2,7);
		return __returnPtr[0..B.sizeof];
	}
}
void main(){
	int n = 0; //<—— can be any number
	void[] myMem;
	{
		import core.stdc.stdlib: alloca;
		void[] __returnMemory = myFn(n);
		myMem = alloca(__returnMemory.length)[0..__returnMemory.length];
		myMem[] = __returnMemory[];
	}
	import std.stdio;
	writeln(cast(int[])myMem);
}
July 17

On Wednesday, 17 July 2024 at 09:31:37 UTC, Nick Treleaven wrote:

>

On Wednesday, 17 July 2024 at 08:23:45 UTC, IchorDev wrote:

>

On Monday, 15 July 2024 at 17:20:57 UTC, monkyyy wrote:

>
int[$] foo(size_t i){
  int[i] output;

Isn't that a stack overflow footgun?

the size_t? probaly; but that ships sailed, slices use size_t lengths, its also a foot gun that 0-1 sized arrays will very much get back a "no" from the os when allocated and crash runtime, and theres wildly unnessery bits there.

July 17

On Wednesday, 17 July 2024 at 08:23:45 UTC, IchorDev wrote:

>

On Monday, 15 July 2024 at 17:20:57 UTC, monkyyy wrote:

>

I think you should also preemptively handle runtime stack allocated arrays inside the function

int[$] foo(size_t i){
  int[i] output;
  foreach(...){
    ...
  }
  return output;
}

Oh yes, that would be fantastic!
It'd be lowered to something like this, right?

int[] output = (cast(int*)alloca(i * int.sizeof))[0..i];

I have no idea, Im just saying it seems incomplete to have it only be returns; because they hack im imagining to make a, lets say, 0..10 element array where the magic is at the return call, will be some awful static foreach in a switch statement or some recursive functional meme.

July 18

On Wednesday, 17 July 2024 at 10:56:28 UTC, IchorDev wrote:

>

This example works with dmd >= 2.084.1, but fails with ldc2/old dmd for some reason:

OK I fixed it. It was the pesky slice copy that I forgot always breaks with overlapping slices. This updated example works with dmd >= 2.075.1 (earlier versions fail to compile because of what I must assume is a bug in write), and it also works with recent versions of LDC2:

import std.typecons;
struct A{ int a,b,c,d,e,f,g; }
struct B{ int a,b,c; }
ubyte[] myFn(int n){
	auto nSqr = n * n;
	import core.stdc.stdlib: alloca;
	if(nSqr == 0){
		auto __returnPtr = alloca(A.sizeof);
		*cast(A*)__returnPtr = A(nSqr+1,2,3,4,5,6,7);
		return (cast(ubyte*)__returnPtr)[0..A.sizeof];
	}else{
		auto __returnPtr = alloca(B.sizeof);
		*cast(B*)__returnPtr = B(nSqr,2,7);
		return (cast(ubyte*)__returnPtr)[0..B.sizeof];
	}
}
void main(){
	int n = 0; //<—— can be any number
	ubyte[] myMem;
	{
		ubyte[] __returnMemory = myFn(n);
		import core.stdc.stdlib: alloca;
		myMem = (cast(ubyte*)alloca(__returnMemory.length))[0..__returnMemory.length];
		foreach(i; 0..myMem.length){
			myMem[i] = __returnMemory[i];
		}
	}
	import std.stdio;
	writeln(cast(int[])myMem);
}

For some reason casting back to void[] after the copy causes problems with the output.

« First   ‹ Prev
1 2