Jump to page: 1 2
Thread overview
AA vs __gshared
Jul 27, 2023
IchorDev
Jul 27, 2023
Dennis
Jul 27, 2023
Jonathan M Davis
Jul 28, 2023
IchorDev
Jul 28, 2023
Kagamin
Jul 28, 2023
IchorDev
Jul 28, 2023
IchorDev
Jul 28, 2023
IchorDev
Aug 10
IchorDev
Aug 12
IchorDev
Aug 16
IchorDev
Jul 28, 2023
Kagamin
July 27, 2023

I've been getting a lot of segfaults from using associative arrays recently. The faults happen seemingly at random, and from pretty mundane stuff like if(auto x = y in z) that run very often:

Segmentation fault.
#0  0x0000555555670f4a in rt.aaA.Impl.findSlotLookup(ulong, scope const(void*), scope const(TypeInfo)) inout ()
#1  0x0000555555661662 in _aaInX ()

I suspect that this is because they've all been placed inside __ghared structs. Are DRuntime's AAs simply incompatible with __gshared? Do they need to be marked as shared to prevent these shenanigans?

July 27, 2023

On Thursday, 27 July 2023 at 15:57:51 UTC, IchorDev wrote:

>

The faults happen seemingly at random, and from pretty mundane stuff like if(auto x = y in z) that run very often:

Are you accessing the AA from multiple threads?

July 27, 2023
On Thursday, July 27, 2023 9:57:51 AM MDT IchorDev via Digitalmars-d-learn wrote:
> I've been getting a lot of segfaults from using associative
> arrays recently. The faults happen seemingly at random, and from
> pretty mundane stuff like `if(auto x = y in z)` that run very
> often:
> ```
> Segmentation fault.
> #0  0x0000555555670f4a in rt.aaA.Impl.findSlotLookup(ulong, scope
> const(void*), scope const(TypeInfo)) inout ()
> #1  0x0000555555661662 in _aaInX ()
> ```
>
> I suspect that this is because they've all been placed inside `__ghared` structs. Are DRuntime's AAs simply incompatible with `__gshared`? Do they need to be marked as `shared` to prevent these shenanigans?

Strictly speaking, __gshared is really only intended for stuff like C globals (which can't be shared due to name-mangling issues). Using it on anything else can at least potentially cause problems due to the fact that the compiler will assume that the variable is thread-local. So, I would strongly advise against using __gshared in a case like this. In practice, you can often get away with it, because the compiler doesn't do much in the way of optimizing stuff based on objects being thread-local right now, but it's definitely risking problems with the type system if you used __gshared when you're not trying to do something like bind to a C global.

What should normally be happening is that you use shared, and then when you've protected the object so that you know that it can only be accessed on the current thread by the section of code that you're in (e.g. by locking a mutex), you temporarily cast away shared to operate on the object via a thread-local reference. Then, before exiting that section of code and removing the protections that are preventing other threads from accessing the object (e.g. by unlocking the mutex), you make sure that you've gotten rid of all of the thread-local references to the object so that only the shared reference exists. That way, you don't accidentally mutate the object while it's not protected from access by other threads.

Now, as to what's happening in your code that's causing segfaults, the most likely culprit would be that you're accessing the AA without actually having done anything to prevent other threads from accessing it at the same time (or your protections were inadequate). And because the object is being treated as thread-local by the compiler, it would be easy to have accidentally let a reference to it leak somewhere that wasn't being protected by whatever mutex you're using, whereas if the AA were shared, the only sections of code where you would have to worry about thread-local references escaping would be in the sections of code where you've cast away shared after locking the relevant mutex. So, similar to what happens with @safe and @trusted, using shared allows you to limit the code that you have to examine to find the problem.

- Jonathan M Davis



July 28, 2023

On Thursday, 27 July 2023 at 21:31:02 UTC, Jonathan M Davis wrote:

>

Now, as to what's happening in your code that's causing segfaults, the most likely culprit would be that you're accessing the AA without actually having done anything to prevent other threads from accessing it at the same time (or your protections were inadequate).

I use a shared Mutex from core.sync.mutex.
The AA itself and the final class that it's a member of is only ever accessed by one thread for now.
The code inside that final class that accesses the AA is called by a function from another final class, and that other final class is used by multiple threads, but it's fully guarded by Mutex locks—I haven't had any issues with the millions of non-atomic reads and writes on its data I've performed from the two threads. Both objects are "static" (declared at module scope) if that matters at all.

>

if the AA were shared, the only sections of code where you would have to worry about thread-local references escaping would be in the sections of code where you've cast away shared after locking the relevant mutex. So, similar to what happens with @safe and @trusted, using shared allows you to limit the code that you have to examine to find the problem.

I was told that using __gshared is quite a bit faster at runtime than using shared, but I also don't really know anything concrete about shared because the spec is so incredibly vague about it.

July 28, 2023

On Friday, 28 July 2023 at 03:54:53 UTC, IchorDev wrote:

>

I was told that using __gshared is quite a bit faster at runtime than using shared, but I also don't really know anything concrete about shared because the spec is so incredibly vague about it.

The difference between them is purely formal if you're not on an old gdc, where shared was synchronized like C# volatile. If the crashes are frequent, can you reproduce a crash with a minimal amount of code, start many threads and access the locked AA concurrently.

July 28, 2023

On Friday, 28 July 2023 at 04:13:13 UTC, Kagamin wrote:

>

The difference between them is purely formal if you're not on an old gdc, where shared was synchronized like C# volatile.

I'm not sure that's correct. Also I always use the latest compiler versions where possible.

>

start many threads and access the locked AA concurrently.

It's only being accessed from one thread, so such a test wouldn't be an accurate reproduction of the issue I'm having. Also I'm only using 2 threads total.

July 28, 2023

On Thursday, 27 July 2023 at 21:31:02 UTC, Jonathan M Davis wrote:

>

What should normally be happening is that you use shared, and then when you've protected the object so that you know that it can only be accessed on the current thread by the section of code that you're in (e.g. by locking a mutex), you temporarily cast away shared to operate on the object via a thread-local reference. Then, before exiting that section of code and removing the protections that are preventing other threads from accessing the object (e.g. by unlocking the mutex), you make sure that you've gotten rid of all of the thread-local references to the object so that only the shared reference exists. That way, you don't accidentally mutate the object while it's not protected from access by other threads.

Doing this doesn't help, unfortunately.
This is an abstract version what the problematic code looks like now:

import std.stdio: writeln;
import core.sync.mutex;
import core.thread;
import core.time;
static import core.stdc.stdlib;

struct Pos{
	long x,y;
}

shared Obj obj;

final class Obj{
	CacheData[Pos] cache;
	CacheData* getCache(Pos key){
		if(auto item = key in cache){
			if(++item.useCount >= 8){
				cache.remove(key);
				//I thought this ^ might cause a use-after-free issue, but apparently not.
			}
			return item;
		}
		return null;
	}
	struct CacheData{
		double[1000] data;
		uint useCount = 1;
	}
	
	double[1000] doStuff(Pos pos){
		immutable data = (){
			if(auto data = getCache(pos)){
				return *data;
			}else{
				double[1000] data;
				//initialisses it with something, this is arbirray though:
				foreach(ref item; data){
					import std.random;
					item = uniform01();
				}
				cache[pos] = CacheData(data);
				return CacheData(data);
			}
		}();
		
		//do stuff with data
		
		return data.data;
	}	
}

__gshared OtherObj otherObj;

final class OtherObj{
	shared Mutex mutex;
	__gshared ubyte[2^^18] data;
	
	this(long n){
		obj = cast(shared Obj)alloc!Obj(n);
		mutex = new shared Mutex;
	}
	
	void callObj(Pos pos){
		double[1000] data;
		
		{
			auto objRef = cast(Obj)obj;
			data = objRef.doStuff(pos);
		}
		
		//do things with returned value...
	}
}

void thread(){
	bool run = true;
	Pos pos;
	while(run){
		otherObj.mutex.lock();
		foreach(i; 0..100){
			otherObj.callObj(pos);
			//this is pretty arbirary:
			import std.random;
			if(uniform01() > 0.9){
				auto v = uniform01();
				if(v < 0.25) pos.x--;
				else if(v < 0.5) pos.y++;
				else if(v < 0.75) pos.y--;
				else pos.x++;
			}
		}
		otherObj.mutex.unlock();
		
		Thread.sleep(1.seconds / 20);
	}
}

void main(){
	otherObj = alloc!OtherObj(-7);
	assert(otherObj !is null);
	
	auto otherThread = new Thread(&thread);
	otherThread.start();
	
	bool run = true;
	while(run){
		if(!otherThread.isRunning()) otherThread.join();
		otherObj.mutex.lock();
		foreach(val; otherObj.data){
			//do stuff
		}
		otherObj.mutex.unlock();
		Thread.sleep(1.seconds / 80);
	}
}

T alloc(T, A...)(auto ref A args){
	enum classSize = __traits(classInstanceSize, T);
	void* mem = core.stdc.stdlib.malloc(classSize);
	scope(failure) core.stdc.stdlib.free(mem);
	assert(mem !is null, "Out of memory");
	mem[0..classSize] = __traits(initSymbol, T);
	T inst = cast(T)mem;
	static if(__traits(hasMember, T, "__ctor")){
		
		inst.__ctor(__traits(parameters));
	}
	return inst;
}

Issue is, this code I posted actually runs just fine, unlike the real code.
My actual code does this HIGHLY SUSPICIOUS thing when printing their length each time before using them:

766
766
765
766
767
768
768
768
768
768
768
768
768
768
768
768
768
768
768
128000
Error Program exited with code -11
(backtrace:)
#0  0x0000555555670c46 in rt.aaA.Impl.findSlotLookup(ulong, scope const(void*), scope const(TypeInfo)) inout ()
#1  0x0000555555661592 in _aaInX ()

Therefore I must logically conclude that DRuntime's AAs are cursed! Unless this is a well-known issue or something...
Thinking back, I've actually had them cause segfaults in non-threaded code, maybe __gshared was never the cause at all.

July 28, 2023

On 7/28/23 4:39 AM, IchorDev wrote:

>

Issue is, this code I posted actually runs just fine, unlike the real code.
My actual code does this HIGHLY SUSPICIOUS thing when printing their length each time before using them:

766
766
765
766
767
768
768
768
768
768
768
768
768
768
768
768
768
768
768
128000
Error Program exited with code -11
(backtrace:)
#0  0x0000555555670c46 in rt.aaA.Impl.findSlotLookup(ulong, scope const(void*), scope const(TypeInfo)) inout ()
#1  0x0000555555661592 in _aaInX ()

My suspicion would be a race condition in your real code, and no race condition in this toy code. Second suspicion would be memory corruption (possibly caused by a race condition).

>

Therefore I must logically conclude that DRuntime's AAs are cursed! Unless this is a well-known issue or something...

AAs have worked forever. I've never had problems with them. Not saying there's not a bug here, maybe there is. But I would be surprised.

>

Thinking back, I've actually had them cause segfaults in non-threaded code, maybe __gshared was never the cause at all.

All __gshared does is give you storage that is accessible from all threads, but is not typed as shared. It doesn't change how the data is used.

-Steve

July 28, 2023

Your error is using allocating the object with malloc. Since gc doesn't see your AA, the AA is freed and you get UAF.

July 28, 2023

On Friday, 28 July 2023 at 11:15:31 UTC, Steven Schveighoffer wrote:

>

All __gshared does is give you storage that is accessible from all threads,

"All __gshared does is give you [a live bomb, ready to go off at any moment]"
!!

On Friday, 28 July 2023 at 14:10:16 UTC, Kagamin wrote:

>

Your error is using allocating the object with malloc. Since gc doesn't see your AA, the AA is freed and you get UAF.

Friend, I think you nailed it. After adding this I haven't been able to reproduce the segfault again:

this(long n){
	(){
		cache = new typeof(cache);
		assert(cache);
		import core.memory;
		core.memory.GC.addRoot(cast(void*)cache);
	}();
	// ...

It did always happen at random, so perhaps I haven't spent enough time testing it yet, but I've gone far longer without it segfaulting than ever before.

« First   ‹ Prev
1 2