January 18, 2005 toStringz and predictability

There's something about toStringz that has me uncomfortable. Consider this code:

import std.string;

int main() {
    char* x;
    uint b1;
    char[4] y;
    uint b2;
    y[0] = 'a';
    y[1] = 'b';
    y[2] = 'c';
    y[3] = 'd';
    x = toStringz(y);
    printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
    b1 = 0x11223344;
    b2 = 0x11223344;
    printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
    return 0;
}

Here's what it prints when I run it:
x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874

The reason why the length changed is that toStringz looks at one past the length of the string to see if it is 0 and does nothing to the string if it is. But the sample code then changes the byte past the string by touching a completely different variable and so the toStringz result is "corrupted". I have toStringz calls sprinkled through my code when I call C functions and now I'm starting to get nervous about the lifespans of those strings and how to figure out if they are valid or not.

Thoughts? Walter, is there a guideline I should follow? The most extreme one that comes to mind is "only call toStringz for strings that get immediately copied".

-Ben
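The effect described above can be reproduced without relying on stack layout. The following is a minimal sketch, written for a current D compiler rather than the 2005 one used in this thread. `peekingToStringz` is a hypothetical stand-in for the fast path, not the actual Phobos source, and the 8-byte `frame` buffer is an assumption that stands in for adjacent stack variables.

```d
import core.stdc.stdio : printf;
import core.stdc.string : strlen;

// Hypothetical re-creation of the fast path discussed in this thread:
// if the byte just past the slice is already 0, hand back the slice's
// own pointer without copying. Not the real Phobos code.
const(char)* peekingToStringz(const(char)[] s)
{
    const(char)* past = s.ptr + s.length;
    if (*past == 0)
        return s.ptr;                     // no copy: result aliases s
    char[] copy = new char[s.length + 1];
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return copy.ptr;
}

void main()
{
    // One buffer stands in for the stack frame: bytes 0..4 are the
    // "string", bytes 4..8 play the role of the neighbouring variable.
    char[8] frame = '\0';
    frame[0 .. 4] = "abcd";

    const(char)* x = peekingToStringz(frame[0 .. 4]);
    printf("before: %d\n", cast(int) strlen(x));  // terminator is frame[4]

    frame[4 .. 7] = "XYZ";                // write to the "neighbour" only
    printf("after:  %d\n", cast(int) strlen(x));  // terminator is now frame[7]
}
```

Because the returned pointer aliases the original bytes, its terminator is whatever happens to follow the slice. In this sketch strlen goes from 4 to 7 once the neighbouring bytes are written, although the slice itself is never modified.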
January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

"Ben Hinkle" <Ben_member@pathlink.com> wrote in message news:csj4hq$1cvi$1@digitaldaemon.com...
> The reason why the length changed is that toStringz looks at one past the length
> of the string to see if it is 0 and does nothing to the string if it is. But the
> sample code then changes the byte past the string by touching a completely
> different variable and so the toStringz result is "corrupted". I have toStringz
> calls sprinkled through my code when I call C functions and now I'm starting to
> get nervous about the lifespans of those strings and how to figure out if they
> are valid or not. Thoughts? Walter, is there a guideline I should follow? The
> most extreme one that comes to mind is "only call toStringz for strings that get
> immediately copied".

It's "COW" (Copy On Write) to the rescue. The idea is only modify a string that you know is unique. If you don't know it is unique, make a copy of it before modifying it.

After the toStringz(), you're modifying the argument to toStringz() but there's another reference to that string that expects it to not change.
January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
> There's something about toStringz that has me uncomfortable. Consider this code:
>
> import std.string;
>
> int main() {
>     char* x;
>     uint b1;
>     char[4] y;
>     uint b2;
>     y[0] = 'a';
>     y[1] = 'b';
>     y[2] = 'c';
>     y[3] = 'd';
>     x = toStringz(y);
>     printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
>     b1 = 0x11223344;
>     b2 = 0x11223344;
>     printf("x length is %d, ptr %p b1 %p b2 %p\n",strlen(x),x,&b1,&b2);
>     return 0;
> }
>
> Here's what it prints when I run it:
> x length is 4, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
> x length is 17, ptr 0xfefff870 b1 0xfefff86c b2 0xfefff874
That's dependent on the compiler, and the alignment:
GDC Linux:
x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
GDC Mac OS X:
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
But why are you calling toStringz on a simple (char*),
without having it properly NUL-terminated at the end ?
If you change the code to : char[] y = new char[4];
Then it prints:
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
A more interesting question is why : x = toStringz(y[0..4]);
does *not* make a copy of the converted pointer-to-characters,
just because the next byte in memory happens to be a NUL char?
(ie. it works if first byte of "b1" is 42, but not if it's 0)
Having to use x = toStringz(y[0..4].dup); just because of
this little "optimization" feature is not exactly a given...
There should probably be a small warning printed about using
toStringz on slices (since it works with literals and arrays)
But that it fails on pointers and static arrays is not surprising?
--anders
PS. If you add a -O on Mac OS X, then it prints "12" instead.
So just because it printed 4 above doesn't mean it works.

January 18, 2005 Re: toStringz and predictability
Posted in reply to Walter

In article <csjffu$1qtp$1@digitaldaemon.com>, Walter says...
>
>"Ben Hinkle" <Ben_member@pathlink.com> wrote in message news:csj4hq$1cvi$1@digitaldaemon.com...
>> The reason why the length changed is that toStringz looks at one past the length
>> of the string to see if it is 0 and does nothing to the string if it is. But the
>> sample code then changes the byte past the string by touching a completely
>> different variable and so the toStringz result is "corrupted". I have toStringz
>> calls sprinkled through my code when I call C functions and now I'm starting to
>> get nervous about the lifespans of those strings and how to figure out if they
>> are valid or not. Thoughts? Walter, is there a guideline I should follow? The
>> most extreme one that comes to mind is "only call toStringz for strings that get
>> immediately copied".
>
>It's "COW" (Copy On Write) to the rescue. The idea is only modify a string that
>you know is unique. If you don't know it is unique, make a copy of it before
>modifying it.

But the string doesn't necessarily own the byte after the string. It's a random piece of memory. Even if the string is living on the heap the byte one past the array can be changed at pretty much any time by anything. Modifying the byte following a string is different than modifying a string.

> After the toStringz(), you're modifying the argument to
> toStringz() [...]

actually I'm not. I'm modifying another variable.
January 18, 2005 Re: toStringz and predictability
Posted in reply to Anders F Björklund

>That's dependent on the compiler, and the alignment:
>
>GDC Linux:
>x length is 4, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
>x length is 25, ptr 0xbff772b8 b1 0xbff772bc b2 0xbff772b0
>
>GDC Mac OS X:
>x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8
>x length is 4, ptr 0xbffffaa0 b1 0xbffffa9c b2 0xbffffaa8

even more interesting...

>But why are you calling toStringz on a simple (char*),
>without having it properly NUL-terminated at the end ?

The point of toStringz is to make a D string null terminated.

>If you change the code to : char[] y = new char[4];
>
>Then it prints:
>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c

That's because the "new" allocates space on the heap and so it has nothing to do with b1 and b2 after that. To corrupt the string on the heap you'll have to wait until something else gets allocated right after that string and then assign something to the first byte.

>A more interesting question is why : x = toStringz(y[0..4]);
>does *not* make a copy of the converted pointer-to-characters,
>just because the next byte in memory happens to be a NUL char?
>(ie. it works if first byte of "b1" is 42, but not if it's 0)
>
>Having to use x = toStringz(y[0..4].dup); just because of
>this little "optimization" feature is not exactly a given...
>There should probably be a small warning printed about using
>toStringz on slices (since it works with literals and arrays)

I'm starting to think the only safe usage of toStringz is on arrays where you can guarantee the byte after the string is owned by the string - which includes literals and maybe some other special cases.

>But that it fails on pointers and static arrays is not surprising?
>
>--anders
>PS. If you add a -O on Mac OS X, then it prints "12" instead.
> So just because it printed 4 above doesn't mean it works.

ok.
January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
>>But why are you calling toStringz on a simple (char*),
>>without having it properly NUL-terminated at the end ?
>
> The point of toStringz is to make a D string null terminated.
Never mind, I was thinking in C (just because it is implemented
that way), forget that D treats static arrays as having lengths...
--anders

January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
> But the string doesn't necessarily own the byte after the string. It's a random
> piece of memory. Even if the string is living on the heap the byte one past the
> array can be changed at pretty much any time by anything. Modifying the byte
> following a string is different than modifying a string.

The bug is in std/string.d :
> p = &string[0] + string.length;
>
> // Peek past end of string[], if it's 0, no conversion necessary.
> // Note that the compiler will put a 0 past the end of static
> // strings, and the storage allocator will put a 0 past the end
> // of newly allocated char[]'s.
> if (*p == 0)
>     return string;

Yes, it does work for string literals and for dynamic arrays...
But it doesn't work for slices of pointers, or static arrays ?

Unless there is a way to separate them, it should be avoided.
(since with the pointers/statics, the byte after is off-limits)

--anders
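One way to avoid depending on the byte past the end is to never peek and always copy. A minimal sketch under that assumption follows. `copyToStringz` is a hypothetical helper name, not a Phobos function, and the code targets a current D compiler rather than the 2005 one. It trades an allocation per call for a terminator that lives in memory the result owns.

```d
import core.stdc.string : strlen;

// Hypothetical helper (not part of Phobos): always allocate
// length + 1 bytes and write the terminator into the copy itself,
// so no byte outside the copy is ever inspected or relied on.
char* copyToStringz(const(char)[] s)
{
    char[] copy = new char[s.length + 1];
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return copy.ptr;
}

void main()
{
    // Bytes 0..4 are the "string"; bytes 4..8 stand in for whatever
    // happens to sit next to it in memory.
    char[8] frame = '\0';
    frame[0 .. 4] = "abcd";

    char* x = copyToStringz(frame[0 .. 4]);

    frame[4 .. 7] = "XYZ";      // the neighbouring bytes change...
    assert(strlen(x) == 4);     // ...but the copy keeps its own terminator
}
```

The cost is one allocation and one copy on every call, which is what the peeking shortcut was trying to avoid.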
January 18, 2005 Re: toStringz and predictability
Posted in reply to Anders F Björklund

>Yes, it does work for string literals and for dynamic arrays...
Actually it doesn't even work for dynamic arrays:
import std.string;
int main() {
    char* x;
    char[] y = new char[32];
    y[] = 0;
    char[] z = new char[32];
    z[] = 32;
    x = toStringz(z);
    printf("x length is %d\n",strlen(x));
    y[] = 32;
    printf("x length is %d\n",strlen(x));
    return 0;
}
outputs
x length is 32
x length is 67
This is due to how the memory manager allocates memory. -Ben

January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
>>If you change the code to : char[] y = new char[4];
>>
>>Then it prints:
>>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
>>x length is 4, ptr 0xbf429fe0 b1 0xbff3c758 b2 0xbff3c74c
>
> That's because the "new" allocates space on the heap and so it has nothing to do
> with b1 and b2 after that. To corrupt the string on the heap you'll have to wait
> until something else gets allocated right after that string and then assign
> something to the first byte.

Right, I think I only got lucky because how it allocates memory...

I couldn't find any traces of "the storage allocator will put a 0 past the end of newly allocated char[]'s", so that must be just DMC. In fact, I'm not sure that even DMD does it ?

This test program:
> void main()
> {
>     for (int i = 1; i <= 1024; i++)
>     {
>         char[] a = new char[i];
>         char *p = &a[0] + a.length;
>         if(*p != 0) printf("%d\n",i);
>     }
> }

Prints out 16,32,64,128,256,512,1024 for *all* the various D compilers.

So that toStringz peeks beyond the length of the array is clearly a bug!
Perhaps if it could tell that the argument is a string literal ? Naah...

--anders
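The list 16, 32, ..., 1024 is what one would expect if the allocator rounds each request up to a power-of-two bin, so that only exact powers of two leave no spare byte after the array. That rounding rule is an assumption made for illustration, not something confirmed in this thread, and `binSize` and `slack` are hypothetical names. A small model of it, for a current D compiler:

```d
// Hypothetical model: round each request up to a power-of-two bin,
// with 16 bytes as the smallest bin. This is an assumption, not the
// documented behaviour of any D allocator.
size_t binSize(size_t request)
{
    size_t bin = 16;
    while (bin < request)
        bin *= 2;
    return bin;
}

// Bytes left over inside the bin after the array's own elements.
size_t slack(size_t length)
{
    return binSize(length) - length;
}

void main()
{
    import core.stdc.stdio : printf;

    // Print every length from 1 to 1024 that fills its bin exactly,
    // i.e. where the byte after the array belongs to something else.
    foreach (size_t n; 1 .. 1025)
        if (slack(n) == 0)
            printf("%d\n", cast(int) n);
}
```

Under this model the lengths with no slack are exactly the powers of two from 16 to 1024, which matches the list reported in the post. For those lengths the byte past the end would belong to the next block rather than to padding.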
January 18, 2005 Re: toStringz and predictability
Posted in reply to Ben Hinkle

Ben Hinkle wrote:
>>Yes, it does work for string literals and for dynamic arrays...
>
> Actually it doesn't even work for dynamic arrays:

Funny, I was just writing that :-)

It breaks down for certain multiples of two.
(16, 32, 64, 128, 256, 512, 1024, and so on)

Sample test program:
> import std.string;
> void main()
> {
>     for (int x = 15; x <= 17; x++)
>     {
>         char[] a = new char[x];
>         char[] b = new char[x];
>         char[] c = new char[x];
>         a[0] = 0;
>         b[0] = 0;
>         c[0] = 0;
>         printf("%d %p\n",a);
>         printf("%d %p\n",b);
>         printf("%d %p\n",c);
>         char *p = &a[0] + a.length;
>         if(*p != 0) printf("not 0\n"); else printf("is 0\n");
>         for(int i = 0; i < b.length; i++)
>             b[i] = 'A' + i;
>         char *z = toStringz(b);
>         for(int i = 0; i < a.length; i++)
>             a[i] = 'X';
>         for(int i = 0; i < c.length; i++)
>             c[i] = 'X';
>         printf("%s\n",z);
>     }
> }

Prints:
> 15 0xbf498fe0
> 15 0xbf498fd0
> 15 0xbf498fc0
> is 0
> ABCDEFGHIJKLMNO
> 16 0xbf498fb0
> 16 0xbf498fa0
> 16 0xbf498f90
> not 0
> ABCDEFGHIJKLMNOPXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> 17 0xbf497fa0
> 17 0xbf497f80
> 17 0xbf497f60
> is 0
> ABCDEFGHIJKLMNOPQ

Perhaps a bit contrived, but shows how it works...

std.string.toStringz is broken.

--anders