Interpolated strings and SQL (page 3)

Settings

Help

Index » General » Interpolated strings and SQL (page 3)

January 09, 2024

Re: Interpolated strings and SQL

Posted by Walter Bright
in reply to Timon Gehr

Permalink

Walter Bright

Posted in reply to Timon Gehr

Permalink

On 1/9/2024 1:24 PM, Timon Gehr wrote:
> You actually did not.

See my other reply to you in this thread.

January 09, 2024

Re: Interpolated strings and SQL

Posted by Timon Gehr
in reply to Walter Bright

Permalink

Timon Gehr

Posted in reply to Walter Bright

Permalink

On 1/9/24 21:01, Walter Bright wrote:
> On 1/9/2024 4:35 AM, Timon Gehr wrote:
>> However, let's for the sake of argument assume that, miraculously, `execi` can read the format string at compile time, then:
> 
> Adam's implementation of execi() also runs at run time, not compile time.
> ...

Adam's `execi` partially runs at compile time and partially of course it will ultimately run at run time (like code generated by a metaprogram tends to do).

The SQL statement is prepared at compile time. Therefore, by construction, it cannot depend on any runtime parameters, preventing an SQL injection. (And it can be checked at compile time, like people are already doing with less convenient syntax).

> 
>> - With this signature, if you pass a manually-constructed string to it, it would just accept the SQL injection.
> 
> It was just a proof of concept piece of code.

So is Adam's example code. In any case:

I am talking about the function _signature_. Whatever crazy advanced thing you do in the implementation, the signature that DIP1027 expects `execi` to have is fundamentally significantly less safe.

> execi could check for format strings that contain ?n sequences. It could also check the number of %s formats against the number of arguments.
> ...

That does not fix the security issue.

> 
>  > But you get a useful error message that exactly pinpoints what the problem is.
>  > Also, they could be supported, which is the point.
>> - It does not give a proper error message for nested istrings.
> 
> execi could be extended to reject arguments that contain %s sequences. 

And now suddenly you can no longer store anything that looks like a format string in your data base.

> Or, if there was an embedded istring, the number of %s formats can be checked against the number of arguments.

Maybe at runtime. But why introduce this failure mode in the first place?

> An embedded istring would show a mismatch.
> ...

The error message would be phrased in overly general terms and hence be confusing.

> I expect that use of nested istrings would be exceedingly rare. If they are used, wrapping them in text() will make work.

Depends on how exactly they are used. For the SQL case, not allowing them is a decent option.

> Besides, would a nested istring in an sql call be intended as part of the sql format, or would a text string be the intended result?
> ...

Whatever it is, with DIP1036e and compile-time SQL construction, user data does not make it into the SQL expression sent to the database.

> 
>> - It has to manually parse the format string. It iterates over each character of the original format string.
> 
> Correct. And it does not need to iterate over and remove all the Interpolation arguments.

Adam's implementation does the filtering at compile time.

The function body will be something like:

auto statement = Statement(db, "...?1...?2...?3..."); // replace ... by query
int number = 0;
statement.bind(++number, firstArg);
statement.bind(++number, secondArg);
statement.bind(++number, thirdArg);

But yes, DIP1036e does make some concessions and it will indeed pass empty struct arguments in case the function is not inlined (could use pragma(inline, true) to avoid it.)

> Nor does it need the extra two arguments, which aren't free of cost.
> ...

Are you really going to argue that some extra empty struct arguments are in some way more expensive than runtime query construction including format string parsing and query construction using GC strings?

But anyway, if you think interpolation is not worth runtime overhead that would perhaps need to be mitigated using additional features or an improved calling convention, that's up to you, but then DIP1027 loses too.

> 
>> - It (ironically!) constructs a new format string, the original one was useless.
> 
> Yes, it converts the format specifiers to the sql ones. Why is this a problem?
> ...

You argued earlier like it is in some way an ironic benefit of DIP1027 that the DB interface requires something that is similar to a format string under the hood. Well, it does not require the kind of format string that DMD is generating.

> 
>> - If you pass a bad format string to it (for example, by specifying a manual format), it will just do nonsense, while DIP1036e avoids bad format strings by construction.
> 
> What happens when ?3 is included in a DIP1036 istring? `i"string ?3 ($betty)" ? I didn't see any check for that.

That's a fair point in general, but I was specifically talking about the format string that you pass into the function that accepts the istring, not similar kinds of strings that may or may not be generated in the implementation.

In any case, DIP1027 istrings can also create a format string with `?3`, and there no way to check within `execi` if that `?3` came from malicious data that was read as input to the program or was put there by an incompetent programmer.

> Of course, one could add such a check to the 1036 execi.
> ...

With DIP1036e the check could be done at compile time.

> printf format strings are checked by the compiler,

As a one-off special case that only supports a specific kind of format string.

> and writef format strings are checked by writef.

`writef` allows the format string to be passed as a template parameter if compile-time parsing and checking is requested. DIP1027 does not naturally support this.

> execi is also capable of being extended to check the format string to ensure the format matches the args.

With DIP1027, you'd have to do it at runtime.

January 09, 2024

Re: Interpolated strings and SQL

Posted by Steven Schveighoffer
in reply to Walter Bright

Permalink

Steven Schveighoffer

Posted in reply to Walter Bright

Permalink

At the end of the day, DIP1027 is an improvement of writef, and writef only (not even printf works correctly). The interpolation DIP Atila is writing (I'll call it IDIP) supports all manner of interpolated transformations, efficiently and effectively, with proper compiler checks.

Let's go through the points made...

On Monday, 8 January 2024 at 23:06:40 UTC, Walter Bright wrote:

Here's how SQL support is done for DIP1036:

https://github.com/adamdruppe/interpolation-examples/blob/master/lib/sql.d

...

This:

The istring, after converted to a tuple of arguments, is passed to the execi template.

Yes, and with an explicit type to be matched against, enabling overloading. Note that execi could be called the same thing as the normal execution function, and then users could use whatever form they prefer -- sql string + args or istring. It's a seamless experience.

Compare to DIP1027 where you can accidentally use the wrong form with string args.

It loops over the arguments, essentially turing it (ironically!) back into a format
string. The formats, instead of %s, are ?1, ?2, ?3, etc.

There is no formatting, sqlite does not have any kind of format specifiers.

No, it is not "turned back" into a format string, because there was no format string to begin with. The sql is constructed using the given information from the compiler clearly identifying which portions are sql and which portions are parameters.

And the SQL query is built at compile time, not runtime (as DIP1027 must do). This incurs no memory allocations at runtime.

It skips all the Interpolation arguments inserted by DIP1036.

Sure, those are not necessary here. Should be a no-op, as no data is actually passed.

The remaining argument are each bound to the indices 1, 2, 3, ...

Yes.

Then it executes the sql statement.

Yes.

Note that nested istrings are not supported.

Note that nested istrings can be detected.

And they are not supported as explicitly specified! This is not a defect or limitation but a choice of the particular example library.

Noting this "limitation" is like noting the limitation that void foo(int) can't be called with a string argument.

Let's see how this can work with DIP1027:

auto execi(Args...)(Sqlite db, Args args) {
    import arsd.sqlite;

    // sqlite lets you do ?1, ?2, etc

    enum string query = () {
        string sql;
        int number;
        import std.conv;
        auto fmt = args[0];
        for (size_t i = 0; i < fmt.length, ++i)
        {
            char c = fmt[i];
            if (c == '%' && i + 1 < fmt.length && fmt[i + 1] == 's')
            {
                sql ~= "?" ~ to!string(++number);
                ++i;
            }
            else if (c == '%' && i + 1 < fmt.length && fmt[i + 1] == '%')
                ++i;  // skip escaped %
            else
                sql ~= c;
        }
        return sql;
    }();

As mentioned several times, this fails to compile -- an enum cannot be built from the runtime variable args.

Now, you can just do this without an enum, and yes, it will compile, build a string at runtime, and you are now at the mercy of the user to not have put in specialized placeholder (poorly named as a "format specifier" in DIP1027 because it is solely focused on writef). No compiler help for you!

To put it another way, you have given up complete control of the API of your library to the compiler and the user. Instead of understanding what the user has said, you have to guess.

And BTW, this is valid SQL:

SELECT * FROM someTable WHERE fieldN LIKE '%something%'

Which means, the poor user needs to escape % in a way completely unrelated to the sql language or the istring specification, something that IDIP doesn't require.

This is a further burden on the user that is wholly unnecessary, just because DIP1027 decided to use %s as "the definitive ~~placeholder~~ format specifier".

    auto statement = Statement(db, query);
    int number;
    foreach(arg; args[1 .. args.length]) {
        statement.bind(++number, arg);
    }

    return statement.execute();
}

This:

The istring, after converted to a tuple of arguments, is passed to the execi template.

A tuple with an incorrect parameter that needs runtime transformation and allocations.

The first tuple element is the format string.
A replacement format string is created by replacing all instances of "%s" with
"?n", where n is the index of the corresponding arg.

SQL doesn't use format strings, so the parameter must be transformed at runtime using memory allocations.

And it does this without knowing whether the "%s" came from the "format string" or from a parameter.

Not to mention the user can pass in other "format specifiers" at will.

The replacement format string is bound to statement, and the arguments are bound
to their indices.

Maybe. sqlite frowns upon mismatching arguments because the library decided your search string was actually a placeholder in some unrelated domain specific language (the language of writef).

Then it executes the sql statement.

Maybe.

It is equivalent.

It is most certainly not. The two are only slightly comparable. IDIP is a mechanism for an SQL library author (and many other domains, see Adam's repository) to effectively and gracefully consume succinct and intuitive instructions from a user to avoid SQL injections, and use the compiler to weed out problematic calls.

Whereas DIP1027 is a loaded footgun which is built for writef that can be shoehorned into an SQL lib, which necessitates allocations and all checks are done at runtime.

-Steve

January 09, 2024

Re: Overhead of DIP1036

Posted by Timon Gehr
in reply to Walter Bright

Permalink

Timon Gehr

Posted in reply to Walter Bright

Permalink

On 1/9/24 20:05, Walter Bright wrote:
> On 1/9/2024 12:45 AM, Alexandru Ermicioi wrote:
>> If that's the case, then 1036 wins imho, by simple thing of not doing any parsing of format string.
> 
> Consider the overhead 1036 has by comparing it with plain writeln or writefln:
> 
> ```
> void test(int baz)
> {
>      writeln(i"$(baz + 4)");
>      writeln(baz + 5);
>      writefln("%d", baz + 6);
> }
> ```
> 
> ...

I think Alexandru and Nickolay already discharged the concerns about overhead pretty well, but just note that with DIP1027, `test(3)` prints:

%s7
8
9

There is fundamentally no way to make this work correctly, due to how DIP1027 throws away the information about the format string.

With DIP1036e, `test(3)` prints:

7
8
9

And you can get rid of the runtime overhead by adding a `pragma(inline, true)` `writeln` overload. (I guess with DMD that will still bloat the executable, but I think other compiler backends and linkers can be made elide such symbols completely.)

January 10, 2024

Re: Interpolated strings and SQL

Posted by Timon Gehr
in reply to Steven Schveighoffer

Permalink

Timon Gehr

Posted in reply to Steven Schveighoffer

Permalink

On 1/9/24 23:30, Steven Schveighoffer wrote:
> 
> And BTW, this is valid SQL:
> 
> ```sql
> SELECT * FROM someTable WHERE fieldN LIKE '%something%'
> ```
> 
> Which means, the poor user needs to escape `%` in a way completely unrelated to the sql language *or* the istring specification, something that IDIP doesn't require.

I had typed up a similar point in my post, but then thought that most likely DIP1027 does the escaping automatically and dropped the line of inquiry. But actually checking it now, it indeed does not seem to do anything to prevent such hijacking.

https://github.com/dlang/DIPs/blob/master/DIPs/rejected/DIP1027.md
https://github.com/dlang/dmd/compare/master...WalterBright:dmd:dip1027#diff-a556a8e6917dd4042f541bdb19673f96940149ec3d416b0156af4d0e4cc5e4bdR16347-R16452

Having the SQL library arbitrarily interpret a substring `%s` in your SQL query as a placeholder seems like unnecessary pain, and it also renders moot the idea that DIP1027 code is able to detect mismatches.

January 10, 2024

Re: Interpolated strings and SQL

Posted by Timon Gehr
in reply to Walter Bright

Permalink

Timon Gehr

Posted in reply to Walter Bright

Permalink

On 1/9/24 20:16, Walter Bright wrote:
> On 1/9/2024 4:35 AM, Timon Gehr wrote:
>> This does not work.
> 
> How so?

It does not compile. The arg->args fix I'll grant you as it is a typo whose only significance is to make it even more clear that you never tried to run any version of the code, but then you still get another compile error. I suggest you mock out the SQL library, you don't actually need to install it to try your code.

If we remove the `enum` then your code still does not work correctly, for example because it does not prevent an SQL injection attack if the user constructs the SQL string manually by accidentally using `format`. I and other people already pointed out this flaw and other flaws in other posts.

> Consider this:
> 
> ```
> import std.stdio;
> 
> auto execi(Args...)(Args args)
> {
>      auto fmt = args[0].dup;
>      fmt[0] = 'k';
>      writefln(fmt, args[1 .. args.length]);
> }
> 
> void main()
> {
>      string b = "betty";
>      execi(i"hello $b");
> }
> ```
> 
> which compiles and runs, printing:
> 
> kello betty

I considered it and it did not have an impact on the way I view the DIP1027 `execi` implementation you have given.

January 09, 2024

Re: Interpolated strings and SQL

Posted by Walter Bright
in reply to Nickolay Bukreyev

Permalink

Walter Bright

Posted in reply to Nickolay Bukreyev

Permalink

P.S. Thank you for your well constructed arguments.

On 1/9/2024 1:35 PM, Nickolay Bukreyev wrote:
> A valid point, thanks. Could you test if that fixes the issue?

Yes, that works.

> We are probably talking about different things. Adam’s implementation constructs a format string at compile time thanks to `enum` storage class [in line 36](https://github.com/adamdruppe/interpolation-examples/blob/a8a5d4d4ee37ee9ae3942c4f4e8489011c3c4673/lib/sql.d#L36).

Yes, you're right.

> Constructing it at compile time is essential so that we can validate the generated SQL and abort compilation, as Paolo [demonstrated](https://forum.dlang.org/post/qbtbyxcglwijjbeygtvi@forum.dlang.org).

That only checks one aspect of correctness - nested string interpolations.

>>  execi could be extended to reject arguments that contain %s sequences.
> I disagree. Storing a string that contains `%s` in a database should be allowed (storing any string should obviously be allowed, regardless of its contents).

True, which is why a % that is not intended as a format specifier is entered as %%.

> But `execi` is unable to differentiate between a string that happens to contain `%s` and a nested format string:
> 
> ```
> // DIP1027
> example(i"prefix1 $(i"prefix2 $(x) suffix2") suffix1");
> // Gets rewritten as:
> example("prefix1 %s suffix1", "prefix2 %s suffix2", x);
> ```
> 
> I might be wrong, but it appears to me that DIP1027 is not able to deal with nested format strings, in a general case.

The expansion for `example` has a mismatch in the number of formats (1) and number of arguments (2). This can be detected at runtime by `example`, as I've explained.

A compile time way is DIP1027 can be modified to reject any arguments that consist of tuples with other than one element. This would eliminate nested istring tuples at compile time.

> DIP1036 has no such limitation (demonstrated in point 2 [here](https://forum.dlang.org/post/lizjwxdgsnmgykaoczyf@forum.dlang.org)).

DIP1036 cannot detect other problems with the string literals. It seems like a lot of complexity to deal with only one issue with malformed strings at compile time rather than runtime.

> I explained [here](https://forum.dlang.org/post/qkvxnbqjefnvjyytfana@forum.dlang.org) why these two arguments are valuable. Aren’t free of cost—correct unless you enable inlining. `execi` may require some changes (like `filterOutEmpty` I showed above) to make them free of cost, but it is doable.

You'd have to also make every formatted writer a template, and add the filter to them.

> You are right, it doesn’t. Timon’s point (expressed as “This does not work”) is that DIP1036 is able to do validation at compile time while DIP1027 is only able to do it at runtime, when this function actually gets invoked.

The only validation it does is check for nested string interpolations.

January 10, 2024

Re: Interpolated strings and SQL

Posted by Timon Gehr
in reply to Walter Bright

Permalink

Timon Gehr

Posted in reply to Walter Bright

Permalink

On 1/9/24 21:01, Walter Bright wrote:
> 
> 
> I expect that use of nested istrings would be exceedingly rare. If they are used, wrapping them in text() will make work.

One more point here is that `text` will of course only work with DIP1038e, with DIP1027 you need `format`. In any case, unfortunately I have to bow out of this discussion now as it is consuming too much of my time right in front of a deadline. I can get back to this in a couple of days.

January 09, 2024

Re: Overhead of DIP1036

Posted by Steven Schveighoffer
in reply to Walter Bright

Permalink

Steven Schveighoffer

Posted in reply to Walter Bright

Permalink

On Tuesday, 9 January 2024 at 19:05:40 UTC, Walter Bright wrote:

With the istring, there are 4 calls to struct member functions that just return null.

Yeah, and writeln could avoid those if it's that important. A good optimizer will remove that call.

This can't be good for performance or program size.

Then use writeln the way you want? I don't see it as significant at all.

We can compute the number of arguments passed to the function:

istring: 1 + 3 * <number of arguments> + 1 + 1  (*)
writeln: <number of arguments>
writefln: 1 + <number of arguments>

(*) includes string literals before, between, and after arguments

I find it bizarre to be concerned about the call performance of zero-sized structs and empty strings to writeln or writef, like the function is some shining example of performance or efficient argument passing. If you do not have inlining or optimizations enabled, do you think the call tree of writefln is going to be compact? Not to mention it eventually just calls into C opaquely.

Note that you can write a simple wrapper that can be inlined, which will mitigate all of this via compile-time transformations.

If you like, I can write it up and you can try it out!

-Steve

January 09, 2024

Re: Interpolated strings and SQL

Posted by Paolo Invernizzi
in reply to Walter Bright

Permalink

Paolo Invernizzi

Posted in reply to Walter Bright

Permalink

On Tuesday, 9 January 2024 at 23:21:34 UTC, Walter Bright wrote:
> P.S. Thank you for your well constructed arguments.
>
> On 1/9/2024 1:35 PM, Nickolay Bukreyev wrote:
>> A valid point, thanks. Could you test if that fixes the issue?
>
> Yes, that works.
>
>> We are probably talking about different things. Adam’s implementation constructs a format string at compile time thanks to `enum` storage class [in line 36](https://github.com/adamdruppe/interpolation-examples/blob/a8a5d4d4ee37ee9ae3942c4f4e8489011c3c4673/lib/sql.d#L36).
>
> Yes, you're right.
>
>> Constructing it at compile time is essential so that we can validate the generated SQL and abort compilation, as Paolo [demonstrated](https://forum.dlang.org/post/qbtbyxcglwijjbeygtvi@forum.dlang.org).
>
> That only checks one aspect of correctness - nested string interpolations.

No.

If you look at the errors raised during the compilation of our codebase, we are checking FAR MORE, for example the second error is related to wrong missing grant condition on a select.

I've included it as an example, just not syntax, table names, semantic or so, but also permissions, at compile time.

And that's a concrete codebase, used in production, not speculations.

/P

Top | Forum index | About this forum

Forums