February 15

I have a program where the GC seems to be overwriting memory still in use and corrupting data.

Here's the code. It's massively reduced from the original program. It's hard to reduce it further because minor changes can prevent the problem from triggering. I'll explain the important parts below.

import std.stdio;

struct S {
    int check;
    S* next;
    int[4] data;
}

int main(string[] args) {
    void*[] allocs;
    enum bad_iter = 268;
    for (int n = 0; n < bad_iter+1; n++) {
        allocs.length = 0;
        auto x = "                   ";
        x ~= ' ';

        int[10][] ts;
        for(int i = 0; i < 21; i++) {
            ts.length++;
        }

        S head;
        S* s = &head;
        if (n == bad_iter) {
            n = bad_iter; // convenient line to set a breakpoint only for the last iteration
        }
        for(int i = 0; i < 8; i++) {
            auto ns = new S;
            ns.check = 1; // set test value here
            s.next = ns;
            s = ns;
        }
        s = head.next; // get the first S allocated this iteration
        if (s.check != 1) { // check test value here
            writefln("check=%d", s.check);
            return -1;
        }

        new int[10];
        allocs ~= null;
        new size_t[3];
    }
    return 0;
}

The important part is the following. On each iteration we create 8 instances of S. For each S value, we set its check field to 1. Then we check the value of that field (for the first instance of S). When compiled with the address sanitizer, we observe it's been corrupted and it's no longer 1.

Am I doing something incorrectly in the code? AFAIK I'm respecting the rules required by the GC. Maybe there's a silly bug I overlooked?

Tested with LDC 1.40.0 on x86_64 Linux:

$ ldc2 app.d -g --frame-pointer=all && ./app # OK
$ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app # BUG
check=-337690816
$

By setting a watchpoint on the address of the field, I see that the code that writes to check is part of the GC implementation. Here's the backtrace:

* thread #1, name = 'app', stop reason = watchpoint 1
  * frame #0: 0x00007ffff7f4695c libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx15recoverNextPageMFNbEQCmQCkQCeQCeQCcQCn4BinsZb + 348
    frame #1: 0x00007ffff7f46278 libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw3Gcx10smallAllocMFNbmKmkxC8TypeInfoZPv + 776
    frame #2: 0x00007ffff7f417d9 libdruntime-ldc-shared.so.110`_D4core8internal2gc4impl12conservativeQw14ConservativeGC__T9runLockedS_DQCsQCqQCkQCkQCiQCtQBy12mallocNoSyncMFNbmkKmxC8TypeInfoZPvS_DQFaQEyQEsQEsQEqQFb10mallocTimelS_DQGiQGgQGaQGaQFyQGj10numMallocslTmTkTmTxQDlZQFuMFNbKmKkKmKxQEeZQDx + 89
    frame #3: 0x00007ffff7f449d3 libdruntime-ldc-shared.so.110`_DThn16_4core8internal2gc4impl12conservativeQw14ConservativeGC6qallocMFNbmkMxC8TypeInfoZSQDd6memory8BlkInfo_ + 83
    frame #4: 0x00007ffff7f4ddec libdruntime-ldc-shared.so.110`gc_qalloc + 28
    frame #5: 0x000055555556be9a app`_D4core8lifetime__T11_d_newitemTTS3app1SZQwFNaNbNeZPQt at lifetime.d:2837:5
    frame #6: 0x000055555556b745 app`D main(args=string[] @ 0x00007fffffffe438) at app.d:28:13
    frame #7: 0x00007ffff7f68ecd libdruntime-ldc-shared.so.110`_D2rt6dmain212_d_run_main2UAAamPUQgZiZ6runAllMFZv + 77
    frame #8: 0x00007ffff7f68ce7 libdruntime-ldc-shared.so.110`_d_run_main2 + 407
    frame #9: 0x00007ffff7f68b3d libdruntime-ldc-shared.so.110`_d_run_main + 141
    frame #10: 0x000055555556c2b2 app`main(argc=1, argv=0x00007fffffffe728) at entrypoint.d:42:17
    frame #11: 0x00007ffff7745e08 libc.so.6`__libc_start_call_main(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728) at libc_start_call_main.h:58:16
    frame #12: 0x00007ffff7745ecc libc.so.6`__libc_start_main_impl(main=(app`main at entrypoint.d:39), argc=1, argv=0x00007fffffffe728, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe718) at libc-start.c:360:3
    frame #13: 0x000055555556b3a5 app`_start + 37

There is a subsequent write to that memory location in the leak sanitizer and LSan complains:

==4056526==LeakSanitizer has encountered a fatal error. (though usually this message isn't flushed)

I assume the original problem was caused by the GC and ASan/LSan are just subsequent victims, but it's hard to be sure. Apparently, LSan is automatically enabled for Linux when ASan is used. Although the ASan documentation says that LSan "can be enabled using ASAN_OPTIONS=detect_leaks=1 on macOS", setting that to 0 didn't seem to disable it, so I couldn't test with ASan but not LSan.

Any ideas of what might be going on?

February 16

On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:

> I have a program where the GC seems to be overwriting memory still in use and corrupting data.

Do I understand that this corruption is happening only with address sanitizer turned on?

I don't see any red flags here, though I'm assuming a lot of these weird random things you are doing (like appending a space to a string every loop) are essential to making the thing fail? It's possible these are tickling GC patterns that cause problems, or it's possible it's tickling bugs in code generation that might prevent the GC from seeing memory!

Writing to the "check" field might be because a gc cycle ran, and that item was incorrectly collected, and now the gc is writing to it because it thinks that data is fair game to use.

The writing is probably not the problem, the problem is the previous collection of that data.

I have learned a lot of tricks when implementing the new GC, and when faced with problems like this, it's super-difficult to figure out how to properly find the problem.

One technique I used is to fork after scanning, but before collection, and put that forked process to sleep. If the failure happens, then I can gdb into the forked process and see what state the GC was in, including the entire graph of memory, and I could see how a piece of memory is or isn't referenced.
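To make the fork trick concrete, here is a minimal sketch of the mechanics only. `parkSnapshot` is a made-up helper, not a druntime API, and in a real investigation the call would have to be patched into the GC between marking and sweeping; this demo just calls it once from `main` and then discards the snapshot.

```d
import core.stdc.stdio : printf;
import core.sys.posix.signal : kill, SIGKILL;
import core.sys.posix.sys.types : pid_t;
import core.sys.posix.sys.wait : waitpid;
import core.sys.posix.unistd : fork, pause;

/// Hypothetical helper (not a druntime API): fork a frozen copy of the
/// process. Returns the child's pid in the parent, or -1 if fork failed.
/// It never returns in the child.
pid_t parkSnapshot()
{
    const pid = fork();
    if (pid == 0)
    {
        // Child: a copy-on-write image of the parent's memory at this instant.
        // Block until a debugger attaches (gdb -p <pid>) or we are killed.
        for (;;)
            pause();
    }
    return pid;
}

int main()
{
    // In the real GC the call would sit after marking and before sweeping;
    // here it is called once just to show the mechanics.
    const child = parkSnapshot();
    if (child < 0)
        return 1;
    printf("snapshot parked as pid %d\n", cast(int) child);

    // The original process keeps running. If it fails later, inspect the
    // parked child; this demo simply discards the snapshot.
    kill(child, SIGKILL);
    int status;
    waitpid(child, &status, 0);
    return 0;
}
```

Because the child is a copy of the address space at the moment of the fork, pointers seen in the failing parent can be looked up at the same addresses in the parked child.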

This is a tedious process, and requires a lot of knowledge and patience. If this is indeed a problem with the GC, it's going to be tough to track down. If it's a problem with the codegen, it's probably also difficult, but this function is small enough that maybe someone can look at the assembly and verify that it's doing the right thing? I don't know.

February 16

On Saturday, 15 February 2025 at 23:31:42 UTC, Luís Marques wrote:

> Tested with LDC 1.40.0 on x86_64 Linux:
>
> $ ldc2 app.d -g --frame-pointer=all && ./app # OK
> $ ldc2 app.d -fsanitize=address -g --frame-pointer=all && ./app # BUG
> check=-337690816
> $

Can you run with ASAN_OPTIONS=verbosity=1 and make sure that FakeStack is not enabled? (detect_stack_use_after_return=false)

(And also test with a little bit older LDC, with different LLVM version, to see if it is a new issue or not)

-Johan

February 16

On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:

> Can you run with ASAN_OPTIONS=verbosity=1 and make sure that FakeStack is not enabled? (detect_stack_use_after_return=false)

The fake stack allocator is enabled. If I disable it via ASAN_OPTIONS=detect_stack_use_after_return=1 the problem no longer reproduces.

According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

Thanks!

[1] https://github.com/google/sanitizers/wiki/AddressSanitizerUseAfterReturn#garbage-collection

February 16

On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:

> The fake stack allocator is enabled. If I disable it via ASAN_OPTIONS=detect_stack_use_after_return=1 the problem no longer reproduces.

I meant =0, of course.
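For anyone who would rather bake the workaround into the binary than set the environment variable, ASan also asks the program for default flags through a `__asan_default_options` hook. A minimal sketch, assuming the runtime picks up a definition written in D the same way it picks up a C one (I have not verified this against the reproducer above):

```d
import core.stdc.stdio : printf;

// ASan calls this hook, if the program defines it, to obtain default flags.
// Assumption: a D definition is found by the runtime just like a C one.
extern (C) const(char)* __asan_default_options()
{
    // D string literals are zero-terminated, so .ptr is a valid C string.
    return "detect_stack_use_after_return=0".ptr;
}

void main()
{
    // Without ASan linked in, the hook is just an ordinary function.
    printf("default ASan options: %s\n", __asan_default_options());
}
```

This only hides the symptom by turning the fake stack off; it does nothing about the GC not scanning fake stack memory.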

February 16

On Sunday, 16 February 2025 at 21:48:31 UTC, Luís Marques wrote:

> On Sunday, 16 February 2025 at 20:18:18 UTC, Johan wrote:
>> Can you run with ASAN_OPTIONS=verbosity=1 and make sure that FakeStack is not enabled? (detect_stack_use_after_return=false)
>
> The fake stack allocator is enabled. If I disable it via ASAN_OPTIONS=detect_stack_use_after_return=1 the problem no longer reproduces.
>
> According to [1], integrating Fake Stack with GC requires special consideration. What's the status of ASan / fake stack support in LDC? (was it supposed to work, to be disabled by default, etc. ...?)

FakeStack allocates (!) space for stack variables, and points to that "fake stack" memory with a pointer in actual CPU stack memory. This means that the stack variables are now no longer in memory that is scanned by the GC. The fix for that, of course, is to include all FakeStacks in the GC scanning [1a][1b].
This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]
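
To make the failure mode concrete, here is a toy model of it. This is not druntime or ASan code: `scanFinds` and the calloc'd buffers are made up for illustration, and the "scan" is just a linear search over words, roughly as a conservative collector would treat a stack range.

```d
import core.stdc.stdlib : calloc, free;

/// Toy conservative scan: does any word in `words` equal `target`?
bool scanFinds(const(void*)[] words, const(void)* target)
{
    foreach (w; words)
        if (w is target)
            return true;
    return false;
}

void main()
{
    // Stand-in for a GC-allocated object (calloc keeps the toy independent
    // of the real GC).
    auto obj = cast(int*) calloc(1, int.sizeof);

    // Stand-in for a FakeStack frame: heap memory that holds the "local".
    auto frame = (cast(void**) calloc(4, (void*).sizeof))[0 .. 4];
    frame[1] = obj;

    // Stand-in for the real CPU stack: it only points at the fake frame.
    void*[4] cpuStack;
    cpuStack[2] = frame.ptr;

    // Scanning the CPU stack alone misses the object...
    assert(!scanFinds(cpuStack[], obj));
    // ...but the fake frame is reachable from it, and scanning that finds it.
    assert(scanFinds(cpuStack[], frame.ptr));
    assert(scanFinds(frame, obj));

    free(frame.ptr);
    free(obj);
}
```

In the toy, the object is only found once the fake frame is scanned as well, which is the extra step the GC integration has to perform for every FakeStack frame.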

You are very welcome to help investigate why it is no longer working!
[3] is an interesting test case of how code should work.

-Johan

[1a]
https://github.com/ldc-developers/druntime/compare/d6b328be91db63aff979f584b0d1def0f746d730...1d938e0b7f668b099f9fa694b135c82ef13dec59

[1b] https://github.com/ldc-developers/ldc/pull/3888

[2] https://github.com/ldc-developers/ldc/blob/d3f065816ec7d420f370e4c95c6000eb78187e25/tests/sanitizers/asan_fakestack_GC.d#L3

[3] https://github.com/llvm/llvm-project/blob/main/compiler-rt/test/asan/TestCases/Posix/gc-test.cpp

February 16

On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:

> You are very welcome to help investigate why it is no longer working!

Sure, I'll have a look. Thanks.

February 17

On Sunday, 16 February 2025 at 22:40:58 UTC, Luís Marques wrote:

> On Sunday, 16 February 2025 at 22:06:37 UTC, Johan wrote:
>> This used to work, but somehow does not work anymore since LDC 2.100 (I perhaps have forgotten about this and just noticed it). [2]
>>
>> You are very welcome to help investigate why it is no longer working!
>
> Sure, I'll have a look. Thanks.

I don't think this broke with D 2.100. For instance, LDC 1.29.0 is based on 2.099.1 and exhibits the same problem. Even older LDC versions don't trip on this exact program, but they do output AddressSanitizer CHECK failures:

==4108825==AddressSanitizer CHECK failed: /home/vsts/work/1/s/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp:556 "((*tls_addr + *tls_size)) <= ((*stk_addr + *stk_size))" (0x78927e371080, 0x78927e371000)

I'll need some time to dig through the IR, the GC, etc. If you are going to look at this, please let me know, to avoid duplicate efforts.

February 17

Is this an issue with the new GC, or the old one?

February 18

On Monday, 17 February 2025 at 22:14:13 UTC, Walter Bright wrote:

> Is this an issue with the new GC, or the old one?

Old gc

-Steve