October 23, 2020
On Friday, 23 October 2020 at 14:37:41 UTC, H. S. Teoh wrote:
> On Fri, Oct 23, 2020 at 12:47:11PM +0000, donallen via Digitalmars-d wrote:
>> On Thursday, 22 October 2020 at 21:58:04 UTC, rikki cattermole wrote:
>> > On 23/10/2020 4:56 AM, donallen wrote:
>> > > ==3854==    by 0x1B7E98: void verifier.main(immutable(char)[][]).walk_account_tree(verifier.main(immutable(char)[][]).Account,
>> > > int) (verifier.d:272)
>> > > ==3854==  Uninitialised value was created by a stack allocation
>> > > ==3854==    at 0x1B77FC: void verifier.main(immutable(char)[][]).walk_account_tree(verifier.main(immutable(char)[][]).Account,
>> > > int) (verifier.d:84)
>> > > ==3854==
>> > > ==3854== Conditional jump or move depends on uninitialised value(s)
> [...]
>
> Hmm. Which lines in your posted code do lines 272 and 84 refer to? Those seem to be the problematic spots, but it's not clear where in the function they are, since your posted code doesn't indicate line numbers.
>
>
> I also noticed this bit of code (which may or may not have anything to
> do with it):
>
>>                 int temp;
>>                 if ((temp = one_row(check_quantities,
>> &get_int).integer_value) > 0)
>>                 {
>>                     writeln(account.path,
>>                             " %s is not an asset account but has splits with
>> non-zero quantities. These will be fixed.");
>> 
>>                     bind_text(fix_quantities, 1, account.guid);
>>                     run_dm_stmt(fix_quantities, &reset_stmt);
>>                 }
>
> The value of `temp` is not used after assignment. This is a rather odd way of writing it.  What's the reason you didn't just write:
>
> 	if (one_row(...) > 0)
>
> instead?

This code was ported from the original C. In the C code, I printf-ed the quantity in the error message and so needed to stash it somewhere (temp). When I transformed the code to D, I think I hadn't discovered format at that point, so just used writeln's ability to print the account.path prepended to the fixed text of the message. I forgot to take out the temporary and write it as you suggest above. So what you see in my code is a vestige of the C that is no longer applicable, but not harmful.

>
>
> [...]
>> MultiType one_row(T)(sqlite3_stmt* stmt, T get_values)
>
> What's the definition of MultiType?

union MultiType
{
    int integer_value;
    sqlite3_int64 long_integer_value;
    double double_value;
    string string_value;
    void* pointer_value;
}

>
>
> [...]
>>     return result; // Need this for compiler happiness -- can't get here
>
> Note for the future: you could write `assert(0);` here to indicate that it should not be reachable. (It compiles to a single 'hlt' instruction, which will abort if the program somehow reaches it anyway.)

Thanks for the tip, useful if there's D in my future (which there will be if you guys can find something wrong with my code rather than the system).

>
>
> T


October 23, 2020
On Friday, 23 October 2020 at 15:49:49 UTC, Steven Schveighoffer wrote:
> On 10/23/20 8:47 AM, donallen wrote:
>> On Thursday, 22 October 2020 at 21:58:04 UTC, rikki cattermole wrote:
>>> On 23/10/2020 4:56 AM, donallen wrote:
>>>> ==3854==    by 0x1B7E98: void verifier.main(immutable(char)[][]).walk_account_tree(verifier.main(immutable(char)[][]).Account, int) (verifier.d:272)
>>>> ==3854==  Uninitialised value was created by a stack allocation
>>>> ==3854==    at 0x1B77FC: void verifier.main(immutable(char)[][]).walk_account_tree(verifier.main(immutable(char)[][]).Account, int) (verifier.d:84)
>>>> ==3854==
>>>> ==3854== Conditional jump or move depends on uninitialised value(s)
>>>
>>> That is coming up an awful lot.
>>>
>>> We'd need walk_account_tree source, to know if its doing something funky.
>>>
>>> But upon reviewing how the entire thing is working, I think we need one_row and next_row_available_p source as well. Oh and reset_stmt too.
>> 
>> Here's the source code you requested:
>
> This all looks ok. The only think I can possibly consider is that you are using bind_text a lot, to presumably bind sqlite prepared statements to D strings.
>
> This means it's *possible* that the prepared statements have pointers to the strings, but nothing else does (hard to work it out from just reading the code, I can't tell when things go out of scope easily, and your prepared statements clearly exist outside these functions). If a prepared statement is a C-malloc'd item, then it's not scanned by the GC. This means that what possibly might be happening is that you bind a D string to a text parameter, the D string goes out of scope, the GC collects it and allocates it elsewhere, and then sqlite tries to use it to send some requests.
>
> To test this theory, maybe try a version of the code that when calling bind_text, you append the string itself to a GC allocated array stored in a global. Might not prove anything, but if it's fixing the problem, maybe that's where the GC is collecting things that it shouldn't.

It's a good theory, but I don't think it explains this.

What you are hypothesizing is that there's an instance where the lifetime of a D string that is used in a sqlite bind_text is shorter than the lifetime of the sqlite binding. I don't think that's the case.

The calls to bind_text all immediately precede calls to run_dm_stmt, one_row, etc. Those calls either run an insert or update, or in the case of one_row, a select. When those
calls return, the statement has been reset, which means, among other things, that the
bindings have been released, ending the binding's life.

The other case to look at is the use of next_row_available_p in a loop. There, bindings are established before entering the loop, which processes multiple rows coming back from a select. When next_row_available_p runs out of rows, it calls the cleanup function, usually reset_stmt (which just calls sqlite3_reset; I found that I could not pass an extern (c) function to next_row_available_p, so used this workaround; maybe there is a way to do this, but I just used this simple expedient). So at that point the bindings are cleared and next_row_available_p returns false, ending the loop and life of the bindings. Unless I've missed something, I think you will find that all the strings passed to bind_text (which get massaged by toStringz, so they look like C strings) live beyond the point where the statements they are bound to are reset.

Most of the calls to bind_text in the tree walker bind fields in account, which is an argument to the tree walker and thus lives until it returns. The others are to local variables that are declared before they are used in bind-text calls and the statement gets reset before the scope ends.

Yes, your theory could be tested using some variant of what you describe, in case my there's a flaw in my hand-waving above.

Some food for thought:

If I insert a writeln at the very beginning of walk_account_tree, printing, say, the name and guid of the account, the problem goes away -- the program behaves correctly.

If I change
children[i].path = format("%s:%s", account.path, children[i].name);
to
children[i].path = format("%s:%s", account.path, children[i].name).dup;

in the loop that fills in the child Account structs in children, same thing -- correct behavior. The dup of the format should not be necessary, since format returns a new GC-allocated string.

I suspect all of these odd "cures" have the effect they do because they change the allocation pattern of this section of the code. Without any of them, it appears that an allocation is triggering a GC at exactly the wrong moment. What is not understood is what is wrong with the moment.

>
> -Steve


October 23, 2020
On Friday, 23 October 2020 at 18:31:41 UTC, donallen wrote:
> ...

Have you tried Dustmite?

https://dlang.org/blog/2020/04/13/dustmite-the-general-purpose-data-reduction-tool/

Matheus.
October 23, 2020
On Fri, Oct 23, 2020 at 06:31:41PM +0000, donallen via Digitalmars-d wrote: [...]
> Some food for thought:
> 
> If I insert a writeln at the very beginning of walk_account_tree, printing, say, the name and guid of the account, the problem goes away -- the program behaves correctly.

Curious.  IME, this is usually the sign of memory corruption elsewhere in the code, and it's just pure happenstance that changing the allocation pattern causes the corruption to overwrite something else that doesn't lead to observable symptoms.

I wonder what might happen if instead of inserting a writeln, you inserted a dummy allocation? Like:

	int[] dummy = [ 1, 2, 3 ];
	printf("%d\n", dummy.length); // don't let optimizer elide it


> If I change
> children[i].path = format("%s:%s", account.path, children[i].name);
> to
> children[i].path = format("%s:%s", account.path, children[i].name).dup;
> 
> in the loop that fills in the child Account structs in children, same thing -- correct behavior. The dup of the format should not be necessary, since format returns a new GC-allocated string.

Indeed.  So again it looks like some kind of memory corruption.  Only, without a reproducible test case we're just shooting in the dark here as to what exactly is causing it.  Have you looked into Dustmite?  Once you set it up, it's fully automated, and you can leave it running in the background until it has reduced the code to a minimal(-ish) test case, so it won't consume too much of your time.

Another random shot in the dark is, what compiler flag(s) are you using to compile the program?  If you're using dmd with -O or -inline (or both), there's a small chance you might be seeing a codegen bug.  (I've had bad experiences with dmd's backend before, both in terms of codegen bugs and poor optimization quality, and nowadays only use dmd for quick prototyping or fast turnaround during development; for production I avoid dmd and use LDC's much better and more reliable backend.)


> I suspect all of these odd "cures" have the effect they do because they change the allocation pattern of this section of the code. Without any of them, it appears that an allocation is triggering a GC at exactly the wrong moment. What is not understood is what is wrong with the moment.
[...]

Indeed.  If you could get dustmite to produce a minimal testcase that we can use to reproduce the problem locally, we'd be able to help you pinpoint exactly what is wrong with your code.  Or determine that it's a GC bug, should it come to that.


T

-- 
Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
October 23, 2020
On 10/23/20 2:31 PM, donallen wrote:
> On Friday, 23 October 2020 at 15:49:49 UTC, Steven Schveighoffer wrote:
>> On 10/23/20 8:47 AM, donallen wrote:

>> This all looks ok. The only think I can possibly consider is that you are using bind_text a lot, to presumably bind sqlite prepared statements to D strings.
>>
>> This means it's *possible* that the prepared statements have pointers to the strings, but nothing else does (hard to work it out from just reading the code, I can't tell when things go out of scope easily, and your prepared statements clearly exist outside these functions). If a prepared statement is a C-malloc'd item, then it's not scanned by the GC. This means that what possibly might be happening is that you bind a D string to a text parameter, the D string goes out of scope, the GC collects it and allocates it elsewhere, and then sqlite tries to use it to send some requests.
>>
>> To test this theory, maybe try a version of the code that when calling bind_text, you append the string itself to a GC allocated array stored in a global. Might not prove anything, but if it's fixing the problem, maybe that's where the GC is collecting things that it shouldn't.
> 
> It's a good theory, but I don't think it explains this.
> 
> What you are hypothesizing is that there's an instance where the lifetime of a D string that is used in a sqlite bind_text is shorter than the lifetime of the sqlite binding. I don't think that's the case.
> 
> The calls to bind_text all immediately precede calls to run_dm_stmt, one_row, etc. Those calls either run an insert or update, or in the case of one_row, a select. When those
> calls return, the statement has been reset, which means, among other things, that the
> bindings have been released, ending the binding's life.

OK, I didn't know how that worked.

> 
> The other case to look at is the use of next_row_available_p in a loop. There, bindings are established before entering the loop, which processes multiple rows coming back from a select. When next_row_available_p runs out of rows, it calls the cleanup function, usually reset_stmt (which just calls sqlite3_reset; I found that I could not pass an extern (c) function to next_row_available_p, so used this workaround; maybe there is a way to do this, but I just used this simple expedient).

extern(C) function types are different than extern(D) function types. So it depends on what the parameter type is.

> So at that point the bindings are cleared and next_row_available_p returns false, ending the loop and life of the bindings. Unless I've missed something, I think you will find that all the strings passed to bind_text (which get massaged by toStringz, so they look like C strings) live beyond the point where the statements they are bound to are reset.
> 
> Most of the calls to bind_text in the tree walker bind fields in account, which is an argument to the tree walker and thus lives until it returns. The others are to local variables that are declared before they are used in bind-text calls and the statement gets reset before the scope ends.
> 
> Yes, your theory could be tested using some variant of what you describe, in case my there's a flaw in my hand-waving above.
> 
> Some food for thought:
> 
> If I insert a writeln at the very beginning of walk_account_tree, printing, say, the name and guid of the account, the problem goes away -- the program behaves correctly.

as with many memory corruption issues, introducing new code paths can change where the corruption occurs, or change that the corruption occurs at all.

You might want to try instead using printf, to avoid affecting anything in the D world. In the past to try and find corruption issues like this that are finnicky, I've also pre-allocated data to write to, and don't touch it for the duration of the problematic area, and then print it after.

> 
> If I change
> children[i].path = format("%s:%s", account.path, children[i].name);
> to
> children[i].path = format("%s:%s", account.path, children[i].name).dup;
> 
> in the loop that fills in the child Account structs in children, same thing -- correct behavior. The dup of the format should not be necessary, since format returns a new GC-allocated string.

Hm... that's interesting. You are correct that it allocates a new GC block. The dup is (should be) completely unnecessary. I don't think Appender (which is what format uses) has GC memory corruption issues, because it's used literally everywhere, it should be rock-solid. Always possible though.

> I suspect all of these odd "cures" have the effect they do because they change the allocation pattern of this section of the code. Without any of them, it appears that an allocation is triggering a GC at exactly the wrong moment. What is not understood is what is wrong with the moment.

Yeah, I agree. For sure most of these constructs you are using are very well tested. There are many programs which use them, for many years, which still points me back at the sqlite library wrapper/binding. Not impossible that format and the GC have a bug, but it seems unlikely. Any time you have GC memory being prematurely collected, it smells like something was stored where it shouldn't be. The other major cause of this is generally alignment. But it doesn't seem like you are specifying alignment anywhere.

In any case, I understand if you don't want to go through any more exercises, but if you do, it might help find something that is super-obscure. Thanks for taking the time to investigate as much as you have!

-Steve
October 23, 2020
On 10/21/20 12:24 PM, donallen wrote:

> The problem is that at some point, the verifier starts spewing bogus error messages about what it is seeing in the tree. Oddly, putting in debugging writelns results in the error messages not occurring -- a Heisenbug.  But working with gdb, I found that the account structs after the error messages start are zeroed. Turning on gc profiling tells me that 3 gcs have occurred. Disabling the gc results in the program running correctly -- no error messages (and I know the database has no errors because the C version, which has been around for awhile, confirms that the db is well formed).

Something I thought of - try running manual GC collections:

import core.memory;

GC.collect(); // peppered around your code

It might make your code fail closer to the true problem.

-Steve
October 24, 2020
Thanks!

The problem is not in that block of code I believe.

Now how about bind_text?

Is that still using c variadic args?
October 24, 2020
On Friday, 23 October 2020 at 23:41:05 UTC, rikki cattermole wrote:
> Thanks!
>
> The problem is not in that block of code I believe.
>
> Now how about bind_text?
>
> Is that still using c variadic args?

No. Read on.

The detective in me couldn't resist (I really need a Peugeot and a basset hound) and so I did some more work on this despite not really having the time and I have made some progress.

The walk_account_tree function is defined inside the main function. Before it is defined,
I prepare all the sql statements needed to do the verification, stored in local variables. One of the queries is count_children:

"select count(guid) from accounts where parent_guid = ?1"

That query is run in the section of walk_account_tree that we've been looking at:
        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);

So the result of that query is saved in n_children and determines the size of the children array.
Recall that 'Account account' is the first argument to walk_account_tree, so n_children is the number of accounts that have account.guid,
an account's unique identifier, as its parent. In other words, its children.

If there are children, I allocate the children array and run the find_children query:

            "select name, guid, ifnull(commodity_guid, ''), flags
                from accounts where parent_guid = ?1");

The number of rows that come back ought to be same as n_children. On a hunch, I put in a
test to check this:

        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                children[i].path = format("%s:%s", account.path, children[i].name);
               children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s", i, n_children, account.path));
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }
        }

And, sure enough, the panic happens:
dca@pangloss:~/Software/newcash_d/verifier$ ./verifier /tmp/Finances.newcash
walk_account_tree: bad child index, 231, 344, :Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade
dca@pangloss:~/Software/newcash_d/verifier$

Something has caused sqlite to end the return of rows prematurely. But next_row_p did not panic,
which it will if the the return code from sqlite3_step, which it calls, is other than SQLITE_ROW or SQLITE_DONE.
Since next_row_p just returned 'false', sqlite3_step must have returned SQLITE_DONE. sqlite returned 231 of 344 rows but is acting like everything is normal.

This explains why my earlier gdb session revealed that the error messages the verifier was spewing were due to being fed account structs that were all zero.
The recursive calls are being done on all 344 of the elements of the children array, but only 231 have data in them.
The rest are zero, as they were when the whole children array was allocated and those dataless "children" look bogus to the verifier.

Now, I know from prior experiment that if I disable the gc, the problem does not occur, so I tried this again with the newly installed panic:

dca@pangloss:~/Software/newcash_d/verifier$ ./verifier "--DRT-gcopt=disable:1" /tmp/Finances.newcash
dca@pangloss:~/Software/newcash_d/verifier$

This was run with the same executable as the above. No panic.

Next, I put in a call to GC.disable() just before the 'while' and a call to GC.enable{} after the loop returns. No panic.

So I moved GC.disable and enable calls into the loop, at first with the disable call at the top of the loop, which did not panic. I moved it down
one line at a time and when I get to this:

        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                children[i].path = format("%s:%s", account.path, children[i].name);
                GC.disable();
               children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
                GC.enable();
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s", i, n_children, account.path));
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }
        }

the panic occurs.

Then I had another of my hunches. I re-wrote the line containing the call to format like so:

        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                size_t n_collections = GC.ProfileStats.numCollections;
                children[i].path = (account.path ~ ":" ~ children[i].name);
                bool collection_occured = GC.ProfileStats.numCollections != n_collections;
               children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s, %s", i, collection_occurred, n_children, account.path));
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }
        }

This does not panic! In theory, this should consume the same amount of GC memory as the format call (though I have not looked at the format code; there may well
be intermediate allocations as it does its work).

Now I decided to instrument the code so I could determine whether GCs were occurring as a result of the call to format:

        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            size_t collections_before = 0;
            size_t collections_after = 0;
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                collections_before += GC.profileStats().numCollections;
                //children[i].path = (account.path ~ ":" ~ children[i].name);
                children[i].path = format("%s:%s", account.path, children[i].name);
                collections_after += GC.profileStats().numCollections;
              children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s, %d, %d", i, n_children, account.path, collections_before, collections_after));
            if (account.path == ":Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade") {
                writeln(format("walk_account_tree: %d, %d, %s, %d, %d", i, n_children, account.path, collections_before, collections_after));
            }
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }
        }

Running this produces:
dca@pangloss:~/Software/newcash_d/verifier$ ./verifier "--DRT-gcopt=profile:1" /tmp/Finances.newcash
walk_account_tree: bad child index, 231, 344, :Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade, 33, 34
dca@pangloss:~/Software/newcash_d/verifier$

This proves that at least once during the loop that failed, a GC occurred as a result of the format call.

Next I tried uncommenting the concatenation line and commenting out the format call. That produces:
dca@pangloss:~/Software/newcash_d/verifier$ ./verifier "--DRT-gcopt=profile:1" /tmp/Finances.newcash
walk_account_tree: 344, 344, :Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade, 0, 0
        Number of collections:  2
        Total GC prep time:  0 milliseconds
        Total mark time:  0 milliseconds
        Total sweep time:  0 milliseconds
        Max Pause Time:  0 milliseconds
        Grand total GC time:  0 milliseconds
GC summary:    5 MB,    2 GC    0 ms, Pauses    0 ms <    0 ms
dca@pangloss:~/Software/newcash_d/verifier$

So again, using concatenation instead of format fixes the problem. And no GCs have occurred after finishing the loop on the account at issue.

And one other tidbit. If I add a precision to %s for the path in the format call, like so and change the test for the problem account by checking for its
guid rather than the truncated path:
        // Now do the children of this account
        // First determine how many there are
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            size_t collections_before = 0;
            size_t collections_after = 0;
            Account[] children = new Account[n_children];
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                collections_before += GC.profileStats().numCollections;
                //children[i].path = (account.path ~ ":" ~ children[i].name);
                children[i].path = format("%s:%.10s", account.path, children[i].name);
                collections_after += GC.profileStats().numCollections;
              children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s, %d, %d", i, n_children, account.path, collections_before, collections_after));
            if (account.guid == "c369abf6ec2c7222e3fdd174ce2c0c9a") {
                writeln(format("walk_account_tree: %d, %d, %s, %d, %d", i, n_children, account.path, collections_before, collections_after));
            }
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }
        }

I get:
dca@pangloss:~/Software/newcash_d/verifier$ ./verifier "--DRT-gcopt=profile:1" /tmp/Finances.newcash
walk_account_tree: 344, 344, :Assets:Investment:Equities a:Taxable:Donald C. :TD Ameritr, 0, 0
        Number of collections:  3
        Total GC prep time:  0 milliseconds
        Total mark time:  0 milliseconds
        Total sweep time:  0 milliseconds
        Max Pause Time:  0 milliseconds
        Grand total GC time:  1 milliseconds
GC summary:    5 MB,    3 GC    1 ms, Pauses    1 ms <    0 ms
dca@pangloss:~/Software/newcash_d/verifier$

No panic. So there appears to be a problem in format when dealing with strings fed to an unconstrained %s, probably when interacting the garbage collector. My guess is that the issue is
in format, not the gc.

That's as far as I've gotten. But I think we know more now than before.


October 24, 2020
On Saturday, 24 October 2020 at 14:21:52 UTC, donallen wrote:
> On Friday, 23 October 2020 at 23:41:05 UTC, rikki cattermole wrote:
>> [...]
>
> No. Read on.
>
> [...]

Great writeup. Thanks for taking the time me to do it! (Iirc there's a nogc format somewhere in https://code.dlang.org/packages/ocean)
October 24, 2020
On 10/24/20 10:21 AM, donallen wrote:
> On Friday, 23 October 2020 at 23:41:05 UTC, rikki cattermole wrote:
>> Thanks!
>>
>> The problem is not in that block of code I believe.
>>
>> Now how about bind_text?
>>
>> Is that still using c variadic args?
> 
> No. Read on.
> 
> The detective in me couldn't resist (I really need a Peugeot and a basset hound) and so I did some more work on this despite not really having the time and I have made some progress.

Very nice!

I wonder if the corruption is happening on the account.guid. Since we are talking about sqlite, which is a local DB, and not sending information to a server (where it can't be corrupted), it's possible the guid is being corrupted causing the premature exit of the loop.

Can you print out the guid after in the premature exit case to make sure it hasn't changed?

> This does not panic! In theory, this should consume the same amount of GC memory as the format call (though I have not looked at the format code; there may well
> be intermediate allocations as it does its work).

Just for informational purposes, an append statement between N strings allocates one block that holds all of them. Format appends, so it may involve multiple allocations, though it's possible it does a similar shortcut. I'm very interested in this code, I will look at the case for format, and see if I can sleuth out any possible corruption possibilities.

> 
> No panic. So there appears to be a problem in format when dealing with strings fed to an unconstrained %s, probably when interacting the garbage collector. My guess is that the issue is
> in format, not the gc.

Yeah, I tend to agree. More specifically, it could be in appender as well (which is what format uses to allocate memory).

> That's as far as I've gotten. But I think we know more now than before.

Good work, and I think we can probably get to the bottom of it, lots of great information!

-Steve