Thread overview
d2sqlite3 db.run, where lies the bug?
Apr 10, 2018
Ralph Amissah
Apr 10, 2018
ag0aep6g
Apr 11, 2018
Ralph Amissah
Apr 11, 2018
ag0aep6g
Apr 15, 2018
biozic
April 10, 2018
/+
Not sure where to report this, nor of where the bug lies. I hope
SQLite (and d2sqlite3) is used widely enough for this to be of
interest here.

I have sets of document files that are broken up and placed (inserted) into an sqlite3 db, some of which fail with what is to me an inexplicable utf-8 error as they contain no special characters and it is "corrected" without the removal of any character in particular, and I so far, cannot predict them. Below I have concocted a sample string that fails and variations of it that pass (with one character removed or added).

Note,
(i) There is no problem inserting the string in question into
  sqlite3 directly (i.e. using sqlite3 directly).
(ii) Likewise there is no problem using d2sqlite3 with a prepared
  statement on the string (.execute), but,... But for the use case,
  each document has several thousand content strings for each of
  which sqlite3 then requires a data locks and releases for the
  insertion of the next, thousands per document and this makes the
  operation conservatively several tens of time slower, than:
  generating a prepared sql statement inserting all document rows
  (objects/paragraphs) as a single sql statement with db.run begin
  and commit. Basically, it makes a significant difference that this
  works.
(iii) There do not appear to be any offending utf-8 characters

Assumption, most likely candidate for blame is the d2sqlite3 wrapper (or my code, (something that needs to be escaped in certain circumstances? please tell me)); sqlite3 does not have a problem with the string in question, and the C api for which d2sqlite3 provides a wrapper is much used and seems unlikely to be to blame.

The exact location of problem may be provided in the error statement "core.exception.UnicodeException@src/rt/util/utf.d(292): invalid UTF-8 sequence".

>From a laymans perspecive it would appear that db.run the passing of
an sql statement as a string to sqlite3 should be the simplest type of transaction; pass a statement/string unchanged to sqlite3

Sample offending text used:
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",

The prepared sql statement that fails with d2sqlite3 (but succeeds
using sqlite3 directly):

        BEGIN;
        DROP TABLE IF EXISTS test;
        CREATE TABLE test (
          lid     BIGINT PRIMARY KEY,
          txt     TEXT NULL
        );
        INSERT INTO test (txt) VALUES ('Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.');
        COMMIT;

core.exception.UnicodeException@rt/util/utf.d(292): invalid UTF-8 sequence

in the sample problem text/string/statement for d2sqlite3 (above):
  - utf-8 error goes away on removal of either:
    - one of the semi-colons (of which there are 2)
    - any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6)
  - comments
    - placing a random semi-colon within the two offending semi-colons sometimes works
    - double semi-colon does not help

Mock problem string with test code follows (d2sqlite3 required):

+/

module d2sqlite3_utf8.issue;
import std.conv : to;
import std.format;
import std.stdio;
import d2sqlite3;
void main() {
  string[] info_tag = ["pass", "fault"];
  foreach (t, tests; [ok_check, faults]) {
    foreach (i, insert_test; tests) {
      // auto db = Database("test.sqlite");
      auto db = Database(":memory:");
      string _sql_statement = format(q"¶
        BEGIN;
        DROP TABLE IF EXISTS test;
        CREATE TABLE test (
          lid     BIGINT PRIMARY KEY,
          txt     TEXT NULL
        );
        INSERT INTO test (txt) VALUES ('%s');
        COMMIT;
      ¶",
        insert_test[0],
      );
      writeln(
        (i + 1), ". ", info_tag[t], ": ", insert_test[1],
        "\n", _sql_statement);
      db.run(_sql_statement);
      foreach (parse; db.execute("SELECT txt FROM test;")) {
        foreach (s; parse) {
          writeln(
            "  (test string == sqlite database content): ",
            (insert_test[0] == s.to!string), "\n");
          assert(insert_test[0] == s.to!string);
          // writeln(s);
        }
      }
      assert(db.totalChanges == 1);
      db.close;
    }
  }
}
string[][] faults = [
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
    "sample problem text, utf-8 error?\n\"core.exception.UnicodeException@src/rt/util/utf.d(292): invalid UTF-8 sequence\"?",
  ]
];
/+
  in the sample problem text (above):
    - utf-8 error goes away on removal of either:
      - one of the semi-colons (of which there are 2)
      - any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6)
    - comments
      - placing a random semi-colon within the two offending semi-colons sometimes works
      - double semi-colon does not help
  eat text that passes tests below
+/

string[][] ok_check = [
  [
    "Contrary to Peter’s cake all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
     "remove one of the semi-colons (of which there are 2) from sample problem text:"
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries funny.",
     "remove one of the semi-colons (of which there are 2) from sample problem text:"
  ],
  [
    "Contrary to Peter’s cake; all versions of Peters, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text\n[here changing the word \"berries\" before the last semi-colon to a less than 7 character word like \"sugar\" or \"garlic\" results in failure]:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cooks permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peters recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peters lunch. You didn’t eat Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didnt eat Peter’s berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peters berries; funny.",
    "remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
  ],
  [
    "Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu; shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
    "place a random semi-colon within the two offending semi-colons and sample problem text sometimes works, an example:",
  ],
];

/+
  Sorry this became so verbose, (also if unclear, or if I have
  overlooked anything obvious), I probably should report directly to
  Nicolas Sicard first but the bug seems so off the wall. Nicolas
  thanks for the d2sqlite3 wrapper btw, and you are probably my best
  bet to have this fixed, unless I have stumbled upon something even
  more interesting.

  Thanks,
  Ralph Amissah

  P.S. if relevant, currently using:
  - d2sqlite3 0.16.0;
  - DMD64 D Compiler v2.074.1
    | LDC - the LLVM D compiler (1.8.0) (based on DMD v2.078.3 and
      LLVM 5.0.1)
    | gdc (Debian 8-20180402-1) 8.0.1 20180402 (experimental)
  - Debian (latest, so called unstable);

+/
April 10, 2018
On 04/10/2018 08:04 PM, Ralph Amissah wrote:
> The exact location of problem may be provided in the error statement
> "core.exception.UnicodeException@src/rt/util/utf.d(292): invalid
> UTF-8 sequence".
> 
[...]
> Mock problem string with test code follows (d2sqlite3 required):
> 
[... code ...]

A more minimal test case, reduced from your code:

----
module d2sqlite3_utf8.issue;
import d2sqlite3;
void main() {
  string[] info_tag = ["pass", "fault"];
    auto db = Database(":memory:");
    string _sql_statement = `SELECT '’’';`;
    db.run(_sql_statement);
    db.close;
}
----

From the exception's stack trace we see that `d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd` is the deepest non-Phobos function involved. So that's a good first spot to look for a bug. Let's check it out.

https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83

And indeed, there's a bug in these lines:

----
auto tail = sql[pos .. $];
immutable offset = tail.countUntil(';') + 1;
pos += offset;
----

`pos` is used to slice the string `sql`. That means, `pos` is interpreted as a number of UTF-8 code *units*. But then the result of `countUntil` is added. `countUntil` counts code *points*. So a number of code points is mistaken as a number of code units. That means the next slicing can be incorrect and split up a multibyte sequence. And then `countUntil` will complain about broken UTF-8.

This can be fixed by letting `countUntil` operate on count code units instead:

----
import std.utf: byCodeUnit;
immutable offset = tail.byCodeUnit.countUntil(';') + 1;
----

If you want, you can make a bug report or a pull request with the fix. Otherwise, if you're not up to that, I can make one.

[...]
>    - DMD64 D Compiler v2.074.1

That's rather old. I'd recommend updating if possible.
April 10, 2018
ag0aep6g, thank you most of all for fixing the bug, the removal of my immediate frustration! Also for the thorough step-wise instruction on sleuthing and fixing it.

>
> If you want, you can make a bug report or a pull request with the fix. Otherwise, if you're not up to that, I can make one.
>

Generous, I merely stumbled on it and shouted out, you fixed it. I would be grateful if you please file the bug and your fix. (I thought I might have more to report, but your fix has resolved my multitude of issues with several texts). I did copy my original post to the author of d2sqlite3 (and again this reply which includes your fix), so he hopefully has the information required to make the upstream fix. The diff/patch I used:

diff -Naur d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d
--- d2sqlite3-0.16.0/d2sqlite3/source/d2sqlite3/internal/util.d	2018-04-10 20:34:11.584498926 -0400
+++ d2sqlite3-0.16.0-1/d2sqlite3/source/d2sqlite3/internal/util.d	2018-04-10 20:46:09.869812899 -0400
@@ -65,13 +65,14 @@
         {
             import std.algorithm : countUntil;
             import std.string : toStringz;
+            import std.utf: byCodeUnit;

             size_t pos;
             bool complete;
             do
             {
                 auto tail = sql[pos .. $];
-                immutable offset = tail.countUntil(';') + 1;
+                immutable offset = tail.byCodeUnit.countUntil(';') + 1;
                 pos += offset;
                 if (offset == 0)
                     pos = sql.length;

(if patch is contained in patchfile called d2sqlite3_bugfix.diff and at the same
directory level as subdirectory containing d2sqlite3-0.16.0)
cp -av d2sqlite3-0.16.0 d2sqlite3-0.16.0-1 \
&& cd d2sqlite3-0.16.0-1 \
&& patch -Np1 < ../d2sqlite3_bugfix.diff

On Tue, Apr 10 2018, ag0aep6g via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> On 04/10/2018 08:04 PM, Ralph Amissah wrote:
>> The exact location of problem may be provided in the error statement "core.exception.UnicodeException@src/rt/util/utf.d(292): invalid UTF-8 sequence".
>>
> [...]
>> Mock problem string with test code follows (d2sqlite3 required):
>>
> [... code ...]

[... snip ...]

>
>  From the exception's stack trace we see that
> `d2sqlite3.internal.util.byStatement(immutable(char)[]).ByStatement.findEnd`
> is the deepest non-Phobos function involved. So that's a good first spot
> to look for a bug. Let's check it out.
>
> https://github.com/biozic/d2sqlite3/blob/2e8211946ae0e09646d561aeae1361a695adcc17/source/d2sqlite3/internal/util.d#L64-L83
>
> And indeed, there's a bug in these lines:
>
> ----
> auto tail = sql[pos .. $];
> immutable offset = tail.countUntil(';') + 1;
> pos += offset;
> ----
>
> `pos` is used to slice the string `sql`. That means, `pos` is interpreted as a number of UTF-8 code *units*. But then the result of `countUntil` is added. `countUntil` counts code *points*. So a number of code points is mistaken as a number of code units. That means the next slicing can be incorrect and split up a multibyte sequence. And then `countUntil` will complain about broken UTF-8.
>
> This can be fixed by letting `countUntil` operate on count code units instead:
>
> ----
> import std.utf: byCodeUnit;
> immutable offset = tail.byCodeUnit.countUntil(';') + 1;
> ----
>
> If you want, you can make a bug report or a pull request with the fix. Otherwise, if you're not up to that, I can make one.
>
> [...]
>>    - DMD64 D Compiler v2.074.1
>
> That's rather old. I'd recommend updating if possible.

for the time being it reflects the status of the rolling development branch of Debian. LDC is based on a newer version.

Thanks again.
Ralph Amissah
April 11, 2018
On 04/11/2018 04:08 AM, Ralph Amissah wrote:
> Generous, I merely stumbled on it and shouted out, you fixed it. I would be
> grateful if you please file the bug and your fix.

There you go: https://github.com/biozic/d2sqlite3/pull/43
April 15, 2018
On Wednesday, 11 April 2018 at 11:00:30 UTC, ag0aep6g wrote:
> On 04/11/2018 04:08 AM, Ralph Amissah wrote:
>> Generous, I merely stumbled on it and shouted out, you fixed it. I would be
>> grateful if you please file the bug and your fix.
>
> There you go: https://github.com/biozic/d2sqlite3/pull/43

Thanks for fixing this !

-- Nicolas