toStringz and predictability
January 18, 2005
There's something about toStringz that has me uncomfortable. Consider this code:

import std.string;

int main() {
    char* x;
    uint b1;
    char[4] y;
    uint b2;
    y[0] = 'a';
    y[1] = 'b';
    y[2] = 'c';
    y[3] = 'd';
    x = toStringz(y);
    printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
    b1 = 0x11223344;
    b2 = 0x11223344;
    printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
    return 0;
}

Here's what it prints when I run it:
x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874

The reason why the length changed is that toStringz looks at one past the length of the string to see if it is 0 and does nothing to the string if it is. But the sample code then changes the byte past the string by touching a completely different variable and so the toStringz result is "corrupted". I have toStringz calls sprinkled through my code when I call C functions and now I'm starting to get nervous about the lifespans of those strings and how to figure out if they are valid or not. Thoughts? Walter, is there a guideline I should follow? The most extreme one that comes to mind is "only call toStringz for strings that get immediately copied".

-Ben


January 18, 2005
"Ben Hinkle" <Ben_member@pathlink.com> wrote in message news:csj4hq$1cvi$1@digitaldaemon.com...
> The reason why the length changed is that toStringz looks at one past the length
> of the string to see if it is 0 and does nothing to the string if it is. But the
> sample code then changes the byte past the string by touching a completely
> different variable and so the toStringz result is "corrupted". I have toStringz
> calls sprinkled through my code when I call C functions and now I'm starting to
> get nervous about the lifespans of those strings and how to figure out if they
> are valid or not. Thoughts? Walter, is there a guideline I should follow? The
> most extreme one that comes to mind is "only call toStringz for strings that get
> immediately copied".

It's "COW" (Copy On Write) to the rescue. The idea is to modify a string only when you know it is unique. If you don't know it is unique, make a copy of it before modifying it. After the toStringz(), you're modifying the argument to toStringz(), but there's another reference to that string that expects it not to change.
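
To make the copy-on-write rule concrete, here is a minimal sketch in present-day D syntax. The helper name capitalizeFirst is hypothetical and not part of std.string; it only illustrates "share when nothing changes, copy before writing".

```d
import std.stdio : writeln;

// Hypothetical helper (not from std.string): returns s with its first
// letter upper-cased, following the copy-on-write convention.
char[] capitalizeFirst(char[] s)
{
    if (s.length == 0 || s[0] < 'a' || s[0] > 'z')
        return s;                            // nothing to write, so share the input
    char[] r = s.dup;                        // not known to be unique: copy first
    r[0] = cast(char)(r[0] - ('a' - 'A'));   // write only to the private copy
    return r;
}

void main()
{
    char[] original = "abc".dup;
    char[] changed = capitalizeFirst(original);
    writeln(original);   // abc (other references still see the old contents)
    writeln(changed);    // Abc
}
```

Note that the rule covers writes to the string's own bytes; it says nothing about the byte that follows the string in memory.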


January 18, 2005
Ben Hinkle wrote:

> There's something about toStringz that has me uncomfortable. Consider this code:
> 
> import std.string;
> 
> int main() {
> char* x;
> uint b1;
> char[4] y;
> uint b2;
> y[0] = 'a';
> y[1] = 'b';
> y[2] = 'c';
> y[3] = 'd';
> x = toStringz(y);
> printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
> b1 = 0x11223344;
> b2 = 0x11223344;
> printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
> return 0;
> }
> 
> Here's what it prints when I run it:
> x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
> x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874

That's dependent on the compiler, and the alignment:

GDC Linux:
x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0

GDC Mac OS X:
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8


But why are you calling toStringz on a simple (char*),
without having it properly NUL-terminated at the end ?
If you change the code to : char[] y = new char[4];

Then it prints:
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c


A more interesting question is why : x = toStringz(y[0..4]);
does *not* make a copy of the converted pointer-to-characters,
just because the next byte in memory happens to be a NUL char?
(ie. it works if first byte of "b1" is 42, but not if it's 0)

Having to use x = toStringz(y[0..4].dup); just because of
this little "optimization" feature is not exactly a given...
There should probably be a small warning printed about using
toStringz on slices (since it works with literals and arrays)

But that it fails on pointers and static arrays is not surprising?

--anders

PS. If you add a -O on Mac OS X, then it prints "12" instead.
    So just because it printed 4 above doesn't mean it works.
January 18, 2005
In article <csjffu$1qtp$1@digitaldaemon.com>, Walter says...
>
>
>"Ben Hinkle" <Ben_member@pathlink.com> wrote in message news:csj4hq$1cvi$1@digitaldaemon.com...
>> The reason why the length changed is that toStringz looks at one past the length
>> of the string to see if it is 0 and does nothing to the string if it is. But the
>> sample code then changes the byte past the string by touching a completely
>> different variable and so the toStringz result is "corrupted". I have toStringz
>> calls sprinkled through my code when I call C functions and now I'm starting to
>> get nervous about the lifespans of those strings and how to figure out if they
>> are valid or not. Thoughts? Walter, is there a guideline I should follow? The
>> most extreme one that comes to mind is "only call toStringz for strings that get
>> immediately copied".
>
>It's "COW" (Copy On Write) to the rescue. The idea is only modify a string that you know is unique. If you don't know it is unique, make a copy of it before modifying it.

But the string doesn't necessarily own the byte after the string. It's a random piece of memory. Even if the string is living on the heap the byte one past the array can be changed at pretty much any time by anything. Modifying the byte following a string is different than modifying a string.

> After the toStringz(), you're modifying the argument to
> toStringz() [...]

Actually I'm not. I'm modifying another variable.


January 18, 2005
>That's dependent on the compiler, and the alignment:
>
>GDC Linux:
>x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
>x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
>
>GDC Mac OS X:
>x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
>x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8

even more interesting...

>But why are you calling toStringz on a simple (char*), without having it properly NUL-terminated at the end ?

The point of toStringz is to make a D string null terminated.

>If you change the code to : char[] y = new char[4];
>
>Then it prints:
>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c

That's because the "new" allocates space on the heap, so it has nothing to do with b1 and b2 after that. To corrupt the string on the heap you'll have to wait until something else gets allocated right after that string and then assign something to its first byte.

>A more interesting question is why : x = toStringz(y[0..4]); does *not* make a copy of the converted pointer-to-characters, just because the next byte in memory happens to be a NUL char? (ie. it works if first byte of "b1" is 42, but not if it's 0)
>
>Having to use x = toStringz(y[0..4].dup); just because of this little "optimization" feature is not exactly a given... There should probably be a small warning printed about using toStringz on slices (since it works with literals and arrays)

I'm starting to think the only safe usage of toStringz is on arrays where you can guarantee the byte after the string is owned by the string - which includes literals and maybe some other special cases.
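
One way to get that guarantee is to make the terminator part of a fresh allocation, so the byte after the text belongs to the same array. The following is only a sketch in present-day D syntax; toStringzCopy is a hypothetical name, not a function in std.string, and it trades an allocation per call for predictability.

```d
import core.stdc.string : strlen;
import std.stdio : writeln;

// Hypothetical helper (not from std.string): always copies and never
// reads the byte past the end of the argument.
char* toStringzCopy(const(char)[] s)
{
    char[] copy = new char[s.length + 1];   // one extra byte owned by the copy
    copy[0 .. s.length] = s[];              // copy the characters
    copy[s.length] = '\0';                  // terminator lives inside the copy
    return copy.ptr;
}

void main()
{
    char[4] y = "abcd";
    char* x = toStringzCopy(y[]);
    y[] = 'z';               // later writes to y or its neighbours do not reach x
    writeln(strlen(x));      // 4
}
```

The copy is garbage-collected memory, so if the C function stores the pointer beyond the call, the D side still needs to keep a reference to it alive.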

>But that it fails on pointers and static arrays is not surprising?
>
>--anders


>PS. If you add a -O on Mac OS X, then it prints "12" instead.
>     So just because it printed 4 above doesn't mean it works.

ok.


January 18, 2005
Ben Hinkle wrote:

>>But why are you calling toStringz on a simple (char*),
>>without having it properly NUL-terminated at the end ?
> 
> The point of toStringz is to make a D string null terminated.

Never mind, I was thinking in C (just because it is implemented
that way), forgetting that D treats static arrays as having lengths...

--anders
January 18, 2005
Ben Hinkle wrote:

> But the string doesn't necessarily own the byte after the string. It's a random
> piece of memory. Even if the string is living on the heap the byte one past the
> array can be changed at pretty much any time by anything. Modifying the byte
> following a string is different than modifying a string.

The bug is in std/string.d :

> 	p = &string[0] + string.length;
> 
> 	// Peek past end of string[], if it's 0, no conversion necessary.
> 	// Note that the compiler will put a 0 past the end of static
> 	// strings, and the storage allocator will put a 0 past the end
> 	// of newly allocated char[]'s.
> 	if (*p == 0)
> 	    return string;

Yes, it does work for string literals and for dynamic arrays...
But it doesn't work for slices of pointers, or static arrays ?

Unless there is a way to separate them, it should be avoided.
(since with the pointers/statics, the byte after is off-limits)

--anders
January 18, 2005
>Yes, it does work for string literals and for dynamic arrays...

Actually it doesn't even work for dynamic arrays:

import std.string;
int main() {
    char* x;
    char[] y = new char[32];
    y[] = 0;
    char[] z = new char[32];
    z[] = 32;
    x = toStringz(z);
    printf("x length is %d\n",strlen(x));
    y[] = 32;
    printf("x length is %d\n",strlen(x));
    return 0;
}

outputs
x length is 32
x length is 67

This is due to how the memory manager allocates memory. -Ben



January 18, 2005
Ben Hinkle wrote:

>>If you change the code to : char[] y = new char[4];
>>
>>Then it prints:
>>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
>>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
> 
> That's because the "new" allocates space on the heap and so it has nothing to do
> with b1 and b2 after that. To corrupt the string on the heap you'll have to wait
> until something else gets allocated right after that string and then assign
> something to the first byte.

Right, I think I only got lucky because of how it allocates memory...

I couldn't find any traces of "the storage allocator will put a 0
past the end of newly allocated char[]'s", so that must be just DMC.

In fact, I'm not sure that even DMD does it ? This test program:

> void main()
> {
>   for (int i = 1; i <= 1024; i++)
>   {
>     char[] a = new char[i];
>     char *p = &a[0] + a.length;
>     if(*p != 0) printf("%d\n",i);
>   }
> }

Prints out 16,32,64,128,256,512,1024 for *all* the various D compilers.

So that toStringz peeks beyond the length of the array is clearly a bug!
Perhaps if it could tell that the argument is a string literal ? Naah...
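
The list of sizes printed by the test program is what one would expect if the allocator rounds each request up to a power-of-two block with a 16-byte minimum. That is an assumption made for illustration, not something confirmed here, and the functions below are a hypothetical model rather than the real memory manager.

```d
import std.stdio : writeln;

// Hypothetical model: requests are rounded up to power-of-two size
// classes, the smallest being 16 bytes. Assumption for illustration only.
size_t sizeClass(size_t request)
{
    size_t c = 16;
    while (c < request)
        c *= 2;
    return c;
}

// True when the block has at least one unused byte after the array,
// i.e. when peeking one past the end stays inside the same block.
bool hasSpareByte(size_t request)
{
    return sizeClass(request) > request;
}

void main()
{
    foreach (size_t n; 1 .. 1025)
        if (!hasSpareByte(n))
            writeln(n);   // prints 16, 32, 64, 128, 256, 512, 1024, one per line
}
```

Under this model an array that exactly fills its block has no byte of its own after the end, so the peek lands in whatever block comes next.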

--anders
January 18, 2005
Ben Hinkle wrote:

>>Yes, it does work for string literals and for dynamic arrays...
> 
> Actually it doesn't even work for dynamic arrays:

Funny, I was just writing that :-)

It breaks down for certain powers of two.
(16, 32, 64, 128, 256, 512, 1024, and so on)

Sample test program:

> import std.string;
> void main()
> {
>   for (int x = 15; x <= 17; x++)
>   {
>     char[] a = new char[x];
>     char[] b = new char[x];
>     char[] c = new char[x];
>     a[0] = 0;
>     b[0] = 0;
>     c[0] = 0;
>     printf("%d %p\n",a);
>     printf("%d %p\n",b);
>     printf("%d %p\n",c);
>     char *p = &a[0] + a.length;
>     if(*p != 0) printf("not 0\n"); else printf("is 0\n");
>     for(int i = 0; i < b.length; i++)
>       b[i] = 'A' + i;
>     char *z = toStringz(b);
>     for(int i = 0; i < a.length; i++)
>       a[i] = 'X';
>     for(int i = 0; i < c.length; i++)
>       c[i] = 'X';
>     printf("%s\n",z);
>   }
> }

Prints:

> 15 0xbf498fe0
> 15 0xbf498fd0
> 15 0xbf498fc0
> is 0
> ABCDEFGHIJKLMNO
> 16 0xbf498fb0
> 16 0xbf498fa0
> 16 0xbf498f90
> not 0
> ABCDEFGHIJKLMNOPXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> 17 0xbf497fa0
> 17 0xbf497f80
> 17 0xbf497f60
> is 0
> ABCDEFGHIJKLMNOPQ

Perhaps a bit contrived, but shows how it works...

std.string.toStringz is broken.

--anders