October 24, 2020
On Sat, Oct 24, 2020 at 02:21:52PM +0000, donallen via Digitalmars-d wrote: [...]
> The detective in me couldn't resist (I really need a Peugeot and a basset hound) and so I did some more work on this despite not really having the time and I have made some progress.
[...]
> Something has caused sqlite to end the return of rows prematurely. But next_row_p did not panic, which it will if the the return code from sqlite3_step, which it calls, is other than SQLITE_ROW or SQLITE_DONE. Since next_row_p just returned 'false', sqlite3_step must have returned SQLITE_DONE. sqlite returned 231 of 344 rows but is acting like everything is normal.
[...]

Very interesting finding indeed!


> Then I had another of my hunches. I re-wrote the line containing the call to format like so:
[...]
>                 children[i].path = (account.path ~ ":" ~ children[i].name);
[...]
> This does not panic! In theory, this should consume the same amount of GC memory as the format call (though I have not looked at the format code; there may well be intermediate allocations as it does its work).

Am I right in assuming that account.path and children[i].name are both of type `string`?  As opposed to some other object that defined a toString method?

AFAIK, format does allocate intermediate buffers for its work; that may explain the difference you observed here.


> Now I decided to instrument the code so I could determine whether GCs were occurring as a result of the call to format:
[...]
> Running this produces:
> dca@pangloss:~/Software/newcash_d/verifier$ ./verifier
> "--DRT-gcopt=profile:1" /tmp/Finances.newcash
> walk_account_tree: bad child index, 231, 344, :Assets:Investments:Equities
> and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade,
> 33, 34
> dca@pangloss:~/Software/newcash_d/verifier$
> 
> This proves that at least once during the loop that failed, a GC occurred as a result of the format call.
[...]

Here, there appears to have been 33 GC collections before the format call, and 34 after. Am I reading this right?


> Next I tried uncommenting the concatenation line and commenting out
> the format call. That produces:
> dca@pangloss:~/Software/newcash_d/verifier$ ./verifier
> "--DRT-gcopt=profile:1" /tmp/Finances.newcash
> walk_account_tree: 344, 344, :Assets:Investments:Equities and
> derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade, 0, 0
[...]
> So again, using concatenation instead of format fixes the problem. And no GCs have occurred after finishing the loop on the account at issue.

However, here there appears to have been no collections at all (0 before, 0 after)!  So based on this we still cannot rule out a problem in the GC.

Which makes me wonder: what would happen if you inserted:

        if (account.path == ":Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade")
		GC.collect();

right after the concatenation line to trigger a collection?  Would it reintroduce the panic?


> And one other tidbit. If I add a precision to %s for the path in the format call, like so and change the test for the problem account by checking for its guid rather than the truncated path:
[...]
>                 children[i].path = format("%s:%.10s", account.path,
> children[i].name);
[...]
> I get:
> dca@pangloss:~/Software/newcash_d/verifier$ ./verifier
> "--DRT-gcopt=profile:1" /tmp/Finances.newcash
> walk_account_tree: 344, 344, :Assets:Investment:Equities a:Taxable:Donald C.
> :TD Ameritr, 0, 0
[...]
> No panic. So there appears to be a problem in format when dealing with strings fed to an unconstrained %s, probably when interacting the garbage collector. My guess is that the issue is in format, not the gc.

Interestingly enough, again there appears to be no collections done at all (0 before, 0 after).  So I'm still not sure if we could entirely rule out the GC at this point.  It would be more interesting to find out what happens if you insert the GC.collect() line to manually trigger a collection with the constrained version of the format call.


> That's as far as I've gotten. But I think we know more now than before.

This is extremely informative, and thank you for taking the time to do this in spite of being very busy.  I hope we can eventually ferret out the cause of the problem, whether it's in the GC, in format, or in your code somehow.


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
October 24, 2020
On 10/24/20 11:04 AM, Steven Schveighoffer wrote:
> On 10/24/20 10:21 AM, donallen wrote:
>> This does not panic! In theory, this should consume the same amount of GC memory as the format call (though I have not looked at the format code; there may well
>> be intermediate allocations as it does its work).
> 
> Just for informational purposes, an append statement between N strings allocates one block that holds all of them. Format appends, so it may involve multiple allocations, though it's possible it does a similar shortcut. I'm very interested in this code, I will look at the case for format, and see if I can sleuth out any possible corruption possibilities.

Can you try this? This will remove format from the picture, but still use appender in the same way format does (to focus on whether the problem is in std.format or appender):

// instead of this
// children[i].path = format("%s:%s", account.path, children[i].name);

{ // need a scope to ensure appender goes out of scope
   import std.array;
   appender!string app;
   app.put(account.path);
   app.put(":");
   app.put(children[i].name);
   children[i].path = app[];
}

If this still panics, then I can eliminate all of std.format. If it doesn't panic, then it's most likely in std.format.

A closer look at appender, and it seems all correct. I'm kinda hoping this doesn't panic, but at the same time, the appender code is a lot less complex, so easier to examine closely.

-Steve
October 24, 2020
On Saturday, 24 October 2020 at 15:04:53 UTC, Steven Schveighoffer wrote:
> On 10/24/20 10:21 AM, donallen wrote:
>> On Friday, 23 October 2020 at 23:41:05 UTC, rikki cattermole wrote:
>>> Thanks!
>>>
>>> The problem is not in that block of code I believe.
>>>
>>> Now how about bind_text?
>>>
>>> Is that still using c variadic args?
>> 
>> No. Read on.
>> 
>> The detective in me couldn't resist (I really need a Peugeot and a basset hound) and so I did some more work on this despite not really having the time and I have made some progress.
>
> Very nice!
>
> I wonder if the corruption is happening on the account.guid. Since we are talking about sqlite, which is a local DB, and not sending information to a server (where it can't be corrupted), it's possible the guid is being corrupted causing the premature exit of the loop.
>
> Can you print out the guid after in the premature exit case to make sure it hasn't changed?

A good thought, but it doesn't change:

        // Now do the children of this account
        // First determine how many there are
        string guid_before = account.guid;
        bind_text(count_children, 1, account.guid);
        int n_children = one_row(count_children, &get_int).integer_value;
        if (n_children > 0)
        {
            size_t collections_before = 0;
            size_t collections_after = 0;
            Account[] children = new Account[n_children];
            string guid_after = account.guid;
            bind_text(find_children, 1, account.guid);
            int i = 0;
            while (next_row_available_p(find_children, &reset_stmt))
            {
                children[i].name = fromStringz(sqlite3_column_text(find_children, 0)).dup;
                collections_before += GC.profileStats().numCollections;
                //children[i].path = (account.path ~ ":" ~ children[i].name);
                children[i].path = format("%s:%s", account.path, children[i].name);
                collections_after += GC.profileStats().numCollections;
              children[i].guid = fromStringz(sqlite3_column_text(find_children, 1)).dup; // guid
                children[i].commodity_guid = fromStringz(sqlite3_column_text(find_children, 2)).dup; // commodity_guid
                children[i].flags = sqlite3_column_int(find_children,
                        3) | account.flags & (account_flag_descendents_are_assets
                        | account_flag_descendents_are_liabilities | account_flag_descendents_are_income
                        | account_flag_descendents_are_expenses | account_flag_descendents_are_marketable
                        | account_flag_self_and_descendents_are_tax_related
                        | account_flag_descendents_need_commodity_link); // flags
                i = i + 1;
            }
            if (i != n_children)
                panic(format("walk_account_tree: bad child index, %d, %d, %s, %s, %s, %d, %d", i, n_children, account.path, guid_before, guid_after, collections_before, collections_after));
            if (account.guid == "c369abf6ec2c7222e3fdd174ce2c0c9a") {
                writeln(format("walk_account_tree: %d, %d, %s, %d, %d", i, n_children, account.path, collections_before, collections_after));
            }
            foreach (child; children) {
                walk_account_tree(child, ancestor_flags | account.flags);
            }


dca@pangloss:~/Software/newcash_d/verifier$ ./verifier "--DRT-gcopt=profile:1" /tmp/Finances.newcash
walk_account_tree: bad child index, 231, 344, :Assets:Investments:Equities and derivatives:Taxable:Donald C. Allen 2003 Revocable Trust:TD Ameritrade, c369abf6ec2c7222e3fdd174ce2c0c9a, c369abf6ec2c7222e3fdd174ce2c0c9a, 33, 34

>
>> This does not panic! In theory, this should consume the same amount of GC memory as the format call (though I have not looked at the format code; there may well
>> be intermediate allocations as it does its work).
>
> Just for informational purposes, an append statement between N strings allocates one block that holds all of them. Format appends, so it may involve multiple allocations, though it's possible it does a similar shortcut. I'm very interested in this code, I will look at the case for format, and see if I can sleuth out any possible corruption possibilities.
>
>> 
>> No panic. So there appears to be a problem in format when dealing with strings fed to an unconstrained %s, probably when interacting the garbage collector. My guess is that the issue is
>> in format, not the gc.
>
> Yeah, I tend to agree. More specifically, it could be in appender as well (which is what format uses to allocate memory).
>
>> That's as far as I've gotten. But I think we know more now than before.
>
> Good work, and I think we can probably get to the bottom of it, lots of great information!
>
> -Steve


October 24, 2020
On Saturday, 24 October 2020 at 16:00:24 UTC, donallen wrote:
> On Saturday, 24 October 2020 at 15:04:53 UTC, Steven Schveighoffer wrote:
>> [...]
>
> A good thought, but it doesn't change:
>
> [...]

Have you tried formattedWrite or the nogc package?
October 25, 2020
On Saturday, 24 October 2020 at 20:00:42 UTC, Imperatorn wrote:
> On Saturday, 24 October 2020 at 16:00:24 UTC, donallen wrote:
>> On Saturday, 24 October 2020 at 15:04:53 UTC, Steven Schveighoffer wrote:
>>> [...]
>>
>> A good thought, but it doesn't change:
>>
>> [...]
>
> Have you tried formattedWrite or the nogc package?

No. At this point, the verifier works and I have stopped further porting to D of my financial tools.
October 26, 2020
On Sunday, 25 October 2020 at 17:03:00 UTC, donallen wrote:
> On Saturday, 24 October 2020 at 20:00:42 UTC, Imperatorn wrote:
>> On Saturday, 24 October 2020 at 16:00:24 UTC, donallen wrote:
>>> On Saturday, 24 October 2020 at 15:04:53 UTC, Steven Schveighoffer wrote:
>>>> [...]
>>>
>>> A good thought, but it doesn't change:
>>>
>>> [...]
>>
>> Have you tried formattedWrite or the nogc package?
>
> No. At this point, the verifier works and I have stopped further porting to D of my financial tools.

I should clarify that I've stopped further work with D because of the issue that came up in trying to port my verifier application. I am hoping that the information I provided over the weekend will enable someone to come up with a definitive explanation of what happened. If that occurs, I will then decide whether I want to continue porting my code to D.

I will add that I am impressed with the language -- a big improvement over C and C++, and with the documentation and the helpfulness of the community. I'm just concerned, based on the verifier experience, that D doesn't have the critical mass needed to make a complex system rock solid. Perhaps the explanation of the verifier problem will change that.
October 26, 2020
On 10/26/20 10:17 AM, donallen wrote:
> I should clarify that I've stopped further work with D because of the issue that came up in trying to port my verifier application. I am hoping that the information I provided over the weekend will enable someone to come up with a definitive explanation of what happened. If that occurs, I will then decide whether I want to continue porting my code to D.

Thanks for your reports. Unfortunately, without something reproducible locally, there's not much we can do without your help. As you know I'm sure, I can't fix problems without being able to test solutions. So far, nothing has jumped out as an obvious error in your code.

If you have time to run some more tests, then I would appreciate it. In particular, if you could try this test, it could narrow it down a bit: https://forum.dlang.org/post/rn1ie3$2363$1@digitalmars.com

Alternatively, if you feel comfortable with it, you can send me the code/data that fails (or maybe dummy data that also fails), and I can keep looking at it.

My email is schveiguy at gmail dot com.

-Steve
October 27, 2020
Basically what Steven said, is what I was going to reply.

We need to work with the full code, I understand that it is probably proprietary. If this is a concern, perhaps somebody associated with the D Language Foundation will be willing to offer some help, assuming that is some comfort.

October 26, 2020
On Monday, 26 October 2020 at 14:33:15 UTC, Steven Schveighoffer wrote:
> On 10/26/20 10:17 AM, donallen wrote:
>> I should clarify that I've stopped further work with D because of the issue that came up in trying to port my verifier application. I am hoping that the information I provided over the weekend will enable someone to come up with a definitive explanation of what happened. If that occurs, I will then decide whether I want to continue porting my code to D.
>
> Thanks for your reports. Unfortunately, without something reproducible locally, there's not much we can do without your help. As you know I'm sure, I can't fix problems without being able to test solutions. So far, nothing has jumped out as an obvious error in your code.
>
> If you have time to run some more tests, then I would appreciate it. In particular, if you could try this test, it could narrow it down a bit: https://forum.dlang.org/post/rn1ie3$2363$1@digitalmars.com

Sorry, I missed this suggestion in your earlier message. I'll give it a try. I'll let you know what happens.

>
> Alternatively, if you feel comfortable with it, you can send me the code/data that fails (or maybe dummy data that also fails), and I can keep looking at it.

As I said earlier, the code is not the issue, it's the data. If we get to this bridge and have to cross it, I'll look trying to cook up a dummy database that provokes the problem.

>
> My email is schveiguy at gmail dot com.
>
> -Steve


November 15, 2020
With the help of an excellent suggestion by Steve Schveighoffer, I finally identified this problem yesterday. It was my error, not D's.

I will spare you all the details. The problem is in this function:

void bind_text(sqlite3_stmt* stmt, int index, const char[] the_text)
{
    if (sqlite3_bind_text(stmt, index, toStringz(the_text), -1, null)
!= sqlite_ok)
    {
        stderr.writeln("bind_text failed");
        auto db = sqlite3_db_handle(stmt);
        stderr.writeln("Error: %s", fromStringz(sqlite3_errmsg(db)));
        exit(EXIT_FAILURE);
    }
}

which takes a sqlite prepared statement, a query parameter index and a D string and attempts to convert the D string to a null-terminated C string and call sqlite3_bind_text to bind the C string to the indicated query parameter.

The culprit is the call to toStringz. The documentation for that
function give some good advice that I failed to follow:
"Important Note: When passing a char* to a C function, and the C
function keeps it around for any reason, make sure that you keep a
reference to it in your D code. Otherwise, it may become invalid
during a garbage collection cycle and cause a nasty bug when the C
code tries to use it."

That's precisely what is happening here. There is no reference to the
string returned by the toStringz call retained and if a GC happens
during a loop retrieving all the rows of a query, the C version of the
string that has been passed to sqlite by the sqlite3_bind_text call gets
freed by the GC before sqlite is done with it. So now the sqlite query
no longer has a proper value for the query parameter, and all bets are
off as far as the query behaving correctly. This also explains the mystery of why the loop
terminated prematurely when I first discovered this problem.

I fixed it by having bind_text return the C string to its caller, to be stored in a variable to protect it from the garbage collector. The original error, which was reproducible, is now gone. Steve's suggestion, which was to insert frequent calls to the garbage collector, provoked a different error reliably that is also gone with the fix.

Thanks to everyone who tried to help figure this one out.