Ralph Amissah
| /+
Not sure where to report this, nor of where the bug lies. I hope
SQLite (and d2sqlite3) is used widely enough for this to be of
interest here.
I have sets of document files that are broken up and placed (inserted) into an sqlite3 db, some of which fail with what is to me an inexplicable utf-8 error as they contain no special characters and it is "corrected" without the removal of any character in particular, and I so far, cannot predict them. Below I have concocted a sample string that fails and variations of it that pass (with one character removed or added).
Note,
(i) There is no problem inserting the string in question into
sqlite3 directly (i.e. using sqlite3 directly).
(ii) Likewise there is no problem using d2sqlite3 with a prepared
statement on the string (.execute), but,... But for the use case,
each document has several thousand content strings for each of
which sqlite3 then requires a data locks and releases for the
insertion of the next, thousands per document and this makes the
operation conservatively several tens of time slower, than:
generating a prepared sql statement inserting all document rows
(objects/paragraphs) as a single sql statement with db.run begin
and commit. Basically, it makes a significant difference that this
works.
(iii) There do not appear to be any offending utf-8 characters
Assumption, most likely candidate for blame is the d2sqlite3 wrapper (or my code, (something that needs to be escaped in certain circumstances? please tell me)); sqlite3 does not have a problem with the string in question, and the C api for which d2sqlite3 provides a wrapper is much used and seems unlikely to be to blame.
The exact location of problem may be provided in the error statement "core.exception.UnicodeException@src/rt/util/utf.d(292): invalid UTF-8 sequence".
>From a laymans perspecive it would appear that db.run the passing of
an sql statement as a string to sqlite3 should be the simplest type of transaction; pass a statement/string unchanged to sqlite3
Sample offending text used:
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
The prepared sql statement that fails with d2sqlite3 (but succeeds
using sqlite3 directly):
BEGIN;
DROP TABLE IF EXISTS test;
CREATE TABLE test (
lid BIGINT PRIMARY KEY,
txt TEXT NULL
);
INSERT INTO test (txt) VALUES ('Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.');
COMMIT;
core.exception.UnicodeException@rt/util/utf.d(292): invalid UTF-8 sequence
in the sample problem text/string/statement for d2sqlite3 (above):
- utf-8 error goes away on removal of either:
- one of the semi-colons (of which there are 2)
- any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6)
- comments
- placing a random semi-colon within the two offending semi-colons sometimes works
- double semi-colon does not help
Mock problem string with test code follows (d2sqlite3 required):
+/
module d2sqlite3_utf8.issue;
import std.conv : to;
import std.format;
import std.stdio;
import d2sqlite3;
void main() {
string[] info_tag = ["pass", "fault"];
foreach (t, tests; [ok_check, faults]) {
foreach (i, insert_test; tests) {
// auto db = Database("test.sqlite");
auto db = Database(":memory:");
string _sql_statement = format(q"¶
BEGIN;
DROP TABLE IF EXISTS test;
CREATE TABLE test (
lid BIGINT PRIMARY KEY,
txt TEXT NULL
);
INSERT INTO test (txt) VALUES ('%s');
COMMIT;
¶",
insert_test[0],
);
writeln(
(i + 1), ". ", info_tag[t], ": ", insert_test[1],
"\n", _sql_statement);
db.run(_sql_statement);
foreach (parse; db.execute("SELECT txt FROM test;")) {
foreach (s; parse) {
writeln(
" (test string == sqlite database content): ",
(insert_test[0] == s.to!string), "\n");
assert(insert_test[0] == s.to!string);
// writeln(s);
}
}
assert(db.totalChanges == 1);
db.close;
}
}
}
string[][] faults = [
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"sample problem text, utf-8 error?\n\"core.exception.UnicodeException@src/rt/util/utf.d(292): invalid UTF-8 sequence\"?",
]
];
/+
in the sample problem text (above):
- utf-8 error goes away on removal of either:
- one of the semi-colons (of which there are 2)
- any of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6)
- comments
- placing a random semi-colon within the two offending semi-colons sometimes works
- double semi-colon does not help
eat text that passes tests below
+/
string[][] ok_check = [
[
"Contrary to Peter’s cake all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"remove one of the semi-colons (of which there are 2) from sample problem text:"
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries funny.",
"remove one of the semi-colons (of which there are 2) from sample problem text:"
],
[
"Contrary to Peter’s cake; all versions of Peters, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text\n[here changing the word \"berries\" before the last semi-colon to a less than 7 character word like \"sugar\" or \"garlic\" results in failure]:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cooks permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peters recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peters lunch. You didn’t eat Peter’s berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didnt eat Peter’s berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peters berries; funny.",
"remove any one of the closing single quote mark symbols ’ occurring between the two semi-colons (of which there are 6) from sample problem text:",
],
[
"Contrary to Peter’s cake; all versions of Peter’s, custard. Eating without the cook’s permission is, of course, naughty. Now, a quick check of my menu; shows that sure enough, a number of files have Peter’s recipe in them. John had better tell us if he ate Peter’s lunch. You didn’t eat Peter’s berries; funny.",
"place a random semi-colon within the two offending semi-colons and sample problem text sometimes works, an example:",
],
];
/+
Sorry this became so verbose, (also if unclear, or if I have
overlooked anything obvious), I probably should report directly to
Nicolas Sicard first but the bug seems so off the wall. Nicolas
thanks for the d2sqlite3 wrapper btw, and you are probably my best
bet to have this fixed, unless I have stumbled upon something even
more interesting.
Thanks,
Ralph Amissah
P.S. if relevant, currently using:
- d2sqlite3 0.16.0;
- DMD64 D Compiler v2.074.1
| LDC - the LLVM D compiler (1.8.0) (based on DMD v2.078.3 and
LLVM 5.0.1)
| gdc (Debian 8-20180402-1) 8.0.1 20180402 (experimental)
- Debian (latest, so called unstable);
+/
|