Jump to page: 1 2
Thread overview
toStringz and toUTFz potentially unsafe
Jul 25, 2011
Johann MacDonagh
Jul 25, 2011
Jonathan M Davis
Jul 25, 2011
Brad Roberts
Jul 25, 2011
Jonathan M Davis
Jul 25, 2011
Johann MacDonagh
Jul 25, 2011
Jonathan M Davis
Jul 25, 2011
Vladimir Panteleev
Jul 25, 2011
Jonathan M Davis
July 25, 2011
Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code:

string s = "abc";

will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this:

auto x = toStringz("abc");

is efficient. No relocations.

As @AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues:

import std.string;
import std.stdio;

struct A
{
    immutable char[2] foo;
    char[2] bar;
}

void main()
{
    auto a = A("aa", "\0b");
    auto charptr = toStringz(a.foo[]);

    a.bar = "bo";
    printf(charptr);   // two chars, then garbage
}

Another issue not mentioned is with slices. If I do...

string s = "abc";
string y = s[];
string z = y[];

z ~= '\0';

auto c = toStringz(y);

assert(c.ptr == y.ptr);

... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"?

The question is, how can we keep this optimization so that:

toStringz("abc");

remains efficient?

The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack:

auto x = "abc";
assert(x.capacity == 0);
char[3] y = "abc";
assert(x.capacity == 0);

So, this isn't safe either. This code:

    char[3] x = "abc";
    char y = '\0';

will put y right after x, so changing y after calling toStringz will cause issues.

In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases).

Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.

Comments? Ideas?
July 25, 2011
On 7/24/11 7:41 PM, Johann MacDonagh wrote:
> Both toStringz and toUTFz do something potentially unsafe. Both check
> whether the character after the end of the string is NULL. If so, then
> it simply returns a pointer to the original string. This is a good
> optimization in theory because this code:
>
> string s = "abc";
>
> will be a slice to a read-only section of the executable. The compiler
> will insert a NULL after the string in the read-only section. So this:
>
> auto x = toStringz("abc");
>
> is efficient. No relocations.
>
> As @AndrejMitrovic commented in Phobos pull request 123
> https://github.com/D-Programming-Language/phobos/pull/123, this has
> potential issues:
>
> import std.string;
> import std.stdio;
>
> struct A
> {
> immutable char[2] foo;
> char[2] bar;
> }
>
> void main()
> {
> auto a = A("aa", "\0b");
> auto charptr = toStringz(a.foo[]);
>
> a.bar = "bo";
> printf(charptr); // two chars, then garbage
> }
>
> Another issue not mentioned is with slices. If I do...
>
> string s = "abc";
> string y = s[];
> string z = y[];
>
> z ~= '\0';
>
> auto c = toStringz(y);
>
> assert(c.ptr == y.ptr);
>
> ... what happens if I change that last character of z before I pass c to
> the C routine? Bad news. I think this optimization is great, but doesn't
> it go against D's motto of doing "the right thing by default"?
>
> The question is, how can we keep this optimization so that:
>
> toStringz("abc");
>
> remains efficient?
>
> The capacity field is 0 if the string is in a read-only section *or* if
> the string is on the stack:
>
> auto x = "abc";
> assert(x.capacity == 0);
> char[3] y = "abc";
> assert(x.capacity == 0);
>
> So, this isn't safe either. This code:
>
> char[3] x = "abc";
> char y = '\0';
>
> will put y right after x, so changing y after calling toStringz will
> cause issues.
>
> In reality, the only time it's safe to do the "peek after end" is if the
> string is in the read-only section. Otherwise, there are potential
> issues (even if they are edge cases).
>
> Do we care about this? Is there something we can add to druntime arrays
> that will tell whether or not the data backing a slice is in read-only
> memory (or perhaps an enum: read-only, stack, heap, other)? In reality,
> the only time this changes is when a read-only / stack heap is appended
> to, so performance issues are minimal.
>
> Comments? Ideas?

I'm not too worried. I think it is fair to guarantee the pointer returned from toStringz is guaranteed to point to a zero-terminated string only up until the first relevant change. It is difficult to define what a relevant change is, but practically I think it is understood what's going on. If there's a need for a persistent stringz, creating a private copy immediately after the call to toStringz is always an option.


Andrei
July 25, 2011
On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
> Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code:
> 
> string s = "abc";
> 
> will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this:
> 
> auto x = toStringz("abc");
> 
> is efficient. No relocations.
> 
> As @AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues:
> 
> import std.string;
> import std.stdio;
> 
> struct A
> {
>      immutable char[2] foo;
>      char[2] bar;
> }
> 
> void main()
> {
>      auto a = A("aa", "\0b");
>      auto charptr = toStringz(a.foo[]);
> 
>      a.bar = "bo";
>      printf(charptr);   // two chars, then garbage
> }
> 
> Another issue not mentioned is with slices. If I do...
> 
> string s = "abc";
> string y = s[];
> string z = y[];
> 
> z ~= '\0';
> 
> auto c = toStringz(y);
> 
> assert(c.ptr == y.ptr);
> 
> ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"?
> 
> The question is, how can we keep this optimization so that:
> 
> toStringz("abc");
> 
> remains efficient?
> 
> The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack:
> 
> auto x = "abc";
> assert(x.capacity == 0);
> char[3] y = "abc";
> assert(x.capacity == 0);
> 
> So, this isn't safe either. This code:
> 
>      char[3] x = "abc";
>      char y = '\0';
> 
> will put y right after x, so changing y after calling toStringz will cause issues.
> 
> In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases).
> 
> Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.
> 
> Comments? Ideas?

The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem.

If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value if it's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless.

The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem.

So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem (simply append '\0' instead of calling toStringz/toUTFz).

- Jonathan M Davis
July 25, 2011
On Sunday, July 24, 2011 5:56:04 PM, Jonathan M Davis wrote:
> On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
>> Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code:
>>
>> string s = "abc";
>>
>> will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this:
>>
>> auto x = toStringz("abc");
>>
>> is efficient. No relocations.
>>
>> As @AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues:
>>
>> import std.string;
>> import std.stdio;
>>
>> struct A
>> {
>>      immutable char[2] foo;
>>      char[2] bar;
>> }
>>
>> void main()
>> {
>>      auto a = A("aa", "\0b");
>>      auto charptr = toStringz
(a.foo[]);
>>
>>      a.bar = "bo";
>>      printf(charptr);   // two chars, then garbage
>> }
>>
>> Another issue not mentioned is with slices. If I do...
>>
>> string s = "abc";
>> string y = s[];
>> string z = y[];
>>
>> z ~= '\0';
>>
>> auto c = toStringz(y);
>>
>> assert(c.ptr == y.ptr);
>>
>> ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"?
>>
>> The question is, how can we keep this optimization so that:
>>
>> toStringz("abc");
>>
>> remains efficient?
>>
>> The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack:
>>
>> auto x = "abc";
>> assert(x.capacity == 0);
>> char[3] y = "abc";
>> assert(x.capacity == 0);
>>
>> So, this isn't safe either. This code:
>>
>>      char[3] x = "abc";
>>      char y = '\0';
>>
>> will put y right after x, so changing y after c
alling toStringz will
>> cause issues.
>>
>> In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases).
>>
>> Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.
>>
>> Comments? Ideas?
> 
> The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem.
> 
> If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStrin
gz/toUTFz. toStringz has pretty much 0 value if
> it's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless.
> 
> The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem.
> 
> So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem
 (simply append '\0' instead of
> calling toStringz/toUTFz).
> 
> - Jonathan M Davis

Not entire true.  There's one more very important way this can cause a serious problem:  when the string in question is at the end of a block of memory and the +1 location is outside that.  The result will be a segv.

While I agree the chances of problem is low, it's certainly not 0 and is the sort of thing that if we saw another language do it we'd laugh and point.  This kind of code produces hard to debug problems and doesn't belong in the standard library.
July 25, 2011
On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
> On Sunday 24 July 2011 20:41:47 Johann MacDonagh wrote:
> > Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string. This is a good optimization in theory because this code:
> > 
> > string s = "abc";
> > 
> > will be a slice to a read-only section of the executable. The compiler will insert a NULL after the string in the read-only section. So this:
> > 
> > auto x = toStringz("abc");
> > 
> > is efficient. No relocations.
> > 
> > As @AndrejMitrovic commented in Phobos pull request 123 https://github.com/D-Programming-Language/phobos/pull/123, this has potential issues:
> > 
> > import std.string;
> > import std.stdio;
> > 
> > struct A
> > {
> > 
> >      immutable char[2] foo;
> >      char[2] bar;
> > 
> > }
> > 
> > void main()
> > {
> > 
> >      auto a = A("aa", "\0b");
> >      auto charptr = toStringz(a.foo[]);
> > 
> >      a.bar = "bo";
> >      printf(charptr);   // two chars, then garbage
> > 
> > }
> > 
> > Another issue not mentioned is with slices. If I do...
> > 
> > string s = "abc";
> > string y = s[];
> > string z = y[];
> > 
> > z ~= '\0';
> > 
> > auto c = toStringz(y);
> > 
> > assert(c.ptr == y.ptr);
> > 
> > ... what happens if I change that last character of z before I pass c to the C routine? Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"?
> > 
> > The question is, how can we keep this optimization so that:
> > 
> > toStringz("abc");
> > 
> > remains efficient?
> > 
> > The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack:
> > 
> > auto x = "abc";
> > assert(x.capacity == 0);
> > char[3] y = "abc";
> > assert(x.capacity == 0);
> > 
> > So, this isn't safe either. This code:
> >      char[3] x = "abc";
> >      char y = '\0';
> > 
> > will put y right after x, so changing y after calling toStringz will cause issues.
> > 
> > In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases).
> > 
> > Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.
> > 
> > Comments? Ideas?
> 
> The _only_ time that this matters is if you actually keep the result of toStringz or toUTFz around. If you do what is almost always done - that is pass the the result of toStringz/toUTFz directly to a function and not save it at all - then there is _zero_ problem.
> 
> If you want to save the char* instead, then you can simply choose to append '\0' instead of calling toStringz/toUTFz. toStringz has pretty much 0 value if it's forced to copy the string in all cases, and unless checking past the end of the string _and_ guaranteeing that the character one past the end won't change can be done and done efficiently, then the only way to guarantee that then value one past the end won't change is to make a copy, in which case toStringz is essentially pointless.
> 
> The documentation for toUTFz has an appropriate warning on it, so the issue should be clear. If you want to avoid it, simply append '\0' instead of calling toStringz or toUTFz. In 99.99% of cases, the char* is passed directly to a C function anyway, in which case there is not problem.
> 
> So, I really don't think that this is really an issue. If we can find an efficient way to make better guarantees, then that would be great, but the risk at this point is _very_ minimal, and there's an easy way to get around it in the _rare_ case where it could be a problem (simply append '\0' instead of calling toStringz/toUTFz).

The real question is what to do with to!(char*)(str). The plan is to make it call toUTFz, but at that point, the warning about toUTFz is not as obvious (though it can re-iterate the warning or point you to the toUTFz documentation to read it). Also, since you already have toUTFz, calling to!(char*) is kind of pointless. So, I think that there's a good argument for forcing to!(char*) to append '\0' instead of checking one past the end. Then when you want a guarantee that the '\0' isn't going to change, you can use to!(char*), and if you want the efficiency, you can call toUTFz. But it is debatable whether we should do that or just have to!(char*) call toUTFz in all cases. I'm leaning towards making it always copy though.

- Jonathan M Davis
July 25, 2011
On 7/24/11 8:05 PM, Brad Roberts wrote:
> Not entire true.  There's one more very important way this can cause a
> serious problem:  when the string in question is at the end of a block
> of memory and the +1 location is outside that.  The result will be a
> segv.

Good point. In fact the implementation already takes care of that.

https://github.com/D-Programming-Language/phobos/blob/master/std/string.d#L403

Andrei
July 25, 2011
On 7/24/2011 9:06 PM, Jonathan M Davis wrote:
> On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
> The real question is what to do with to!(char*)(str). The plan is to make it
> call toUTFz, but at that point, the warning about toUTFz is not as obvious
> (though it can re-iterate the warning or point you to the toUTFz documentation
> to read it). Also, since you already have toUTFz, calling to!(char*) is kind
> of pointless. So, I think that there's a good argument for forcing to!(char*)
> to append '\0' instead of checking one past the end. Then when you want a
> guarantee that the '\0' isn't going to change, you can use to!(char*), and if
> you want the efficiency, you can call toUTFz. But it is debatable whether we
> should do that or just have to!(char*) call toUTFz in all cases. I'm leaning
> towards making it always copy though.
>
> - Jonathan M Davis

In that case, maybe we should implement @schveiguy's suggestion.

immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow

That way the user can decide whether to take the optimization risk (or if they know the string is on the stack, etc...). In addition, always copying is wasteful. We're usually able to append a NULL to a dynamic array without relocation / copying.
July 25, 2011
On Sunday 24 July 2011 21:57:34 Johann MacDonagh wrote:
> On 7/24/2011 9:06 PM, Jonathan M Davis wrote:
> > On Sunday 24 July 2011 17:56:04 Jonathan M Davis wrote:
> > The real question is what to do with to!(char*)(str). The plan is to
> > make it call toUTFz, but at that point, the warning about toUTFz is not
> > as obvious (though it can re-iterate the warning or point you to the
> > toUTFz documentation to read it). Also, since you already have toUTFz,
> > calling to!(char*) is kind of pointless. So, I think that there's a
> > good argument for forcing to!(char*) to append '\0' instead of checking
> > one past the end. Then when you want a guarantee that the '\0' isn't
> > going to change, you can use to!(char*), and if you want the
> > efficiency, you can call toUTFz. But it is debatable whether we should
> > do that or just have to!(char*) call toUTFz in all cases. I'm leaning
> > towards making it always copy though.
> > 
> > - Jonathan M Davis
> 
> In that case, maybe we should implement @schveiguy's suggestion.
> 
> immutable(char)* toStringz(string s, bool unsafe = true) pure nothrow
> 
> That way the user can decide whether to take the optimization risk (or if they know the string is on the stack, etc...). In addition, always copying is wasteful. We're usually able to append a NULL to a dynamic array without relocation / copying.

If you always append a '\0', then there's no point to toStringz at all. So, sure we _could_ add the unsafe parameter like that, but I seriously question the value of it. The primary value in toStringz is to give you the optimization of looking one past the end of the string and attempting to completely avoid any chance of reallocation.

And yes, you'd use ~= which _might_ copy rather than forcing a copy every time, but the cases where you could have just checked one past the end of the array and done nothing if it were '\0' are generally going to be the cases where ~= has to reallocate. So, in reality, you're pretty much going to copy every time that you could have avoided the copy if you had gone with true for unsafe rather than false.

- Jonathan M Davis
July 25, 2011
On Sun, 24 Jul 2011 20:41:47 -0400, Johann MacDonagh <johann.macdonagh.no@spam.gmail.com> wrote:

> Both toStringz and toUTFz do something potentially unsafe. Both check whether the character after the end of the string is NULL. If so, then it simply returns a pointer to the original string.

It is of critical importance to note that this *ONLY* happens for immutable data.  It is not done for const or mutable strings (instead a zero is always appended).

> Another issue not mentioned is with slices. If I do...
>
> string s = "abc";
> string y = s[];
> string z = y[];
>
> z ~= '\0';
>
> auto c = toStringz(y);
>
> assert(c.ptr == y.ptr);
>
> ... what happens if I change that last character of z before I pass c to the C routine?

Not possible.  There are two important errors in your assumptions.  First, z is an immutable(char)[], so you cannot change the last character without a cast.  And if you do that, it's undefined behavior.

Second, z.ptr != y.ptr.  You cannot append to a ROM string, it will reallocate.

However, it's quite possible that *unallocated* data may change.  That is:

string s = "abc".dup;  // it's possible that the data after "abc" is a 0.  The GC does not default initialize the unallocated portion of a data block if you are allocating a NO_SCAN block (for efficiency).

auto c = toStringz(s); // might just return s.ptr.

s ~= 'a'; // now if c == s.ptr, c is affected.

However, I still am not concerned about this (see my statement at the end of this message).

> Bad news. I think this optimization is great, but doesn't it go against D's motto of doing "the right thing by default"?
>
> The question is, how can we keep this optimization so that:
>
> toStringz("abc");
>
> remains efficient?
>
> The capacity field is 0 if the string is in a read-only section *or* if the string is on the stack:
>
> auto x = "abc";
> assert(x.capacity == 0);
> char[3] y = "abc";
> assert(x.capacity == 0);
>
> So, this isn't safe either. This code:
>
>      char[3] x = "abc";
>      char y = '\0';
>
> will put y right after x, so changing y after calling toStringz will cause issues.

A not-so-small flaw in this argument is char[3] x is not immutable -- as I stated above, only an immutable string causes the optimization to occur.

It's highly unlikely anyone would ever write:

immutable char[3] y = "abc";

> In reality, the only time it's safe to do the "peek after end" is if the string is in the read-only section. Otherwise, there are potential issues (even if they are edge cases).
>
> Do we care about this? Is there something we can add to druntime arrays that will tell whether or not the data backing a slice is in read-only memory (or perhaps an enum: read-only, stack, heap, other)? In reality, the only time this changes is when a read-only / stack heap is appended to, so performance issues are minimal.

I think there could be a compromise solution:

If the slice is for immutable data (i.e. string), AND the slice is of a heap-allocated array (the runtime can determine this via a block lookup), AND the characer after the slice is part of the used block data, it would be safe to assume that data is also immutable.  There are remote possibilities that this could not be the case, but this involves improbable casting scenarios.

Note that the lookup cost is not trivial -- it's O(lgn), and it requires a GC lock.

This, of course, would have to be in addition to checking if the string is in ROM.  I don't know if this is possible...

BTW, I agree with Jonathan that this is a very rare corner-case (it might even be non-existent in real code).

I've only ever used toStringz or seen it used as an expression for a parameter -- I've never actually saved the result and did other stuff before passing it.

I think the warning he states in documentation is sufficient.

-Steve
July 25, 2011
On Mon, 25 Jul 2011 05:09:17 +0300, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> If you always append a '\0', then there's no point to toStringz at all.

Semantics? toStringz(s) is much more readable than (s~'\0').ptr

-- 
Best regards,
 Vladimir                            mailto:vladimir@thecybershadow.net
« First   ‹ Prev
1 2