Thread overview
BSTR-style array length
Apr 19, 2004
Russ Lewis
Apr 19, 2004
Walter
Apr 19, 2004
Robert Atkinson
Apr 19, 2004
Walter
Apr 19, 2004
Walter
April 19, 2004
Forgive me if this has been discussed before; I could not find a way to search the forums.

Perhaps you are familiar with how COM's BSTR represents strings. D's arrays are represented as an 8-byte value, with a 32-bit pointer to the array contents at offset 0 and the 32-bit length of the array at offset 4. A BSTR instead stores the length in the same block of memory as the char (wchar) array, just behind the location the pointer refers to, as in:

BSTR variable:
Offset:   Contents:
 0        pointer to array data

Array data:
Offset:   Contents:
-4        32-bit length of array
 0        first char of array

Naturally, this requires additional APIs in Windows for managing BSTRs, but it provides direct compatibility with existing wchar strings, since the pointer still points at the first character in the array. The variable also remains a 32-bit value, while the length information is kept at the pointed-to memory location.
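
As a rough sketch of the layout described above, a length-prefixed buffer can be built by hand in present-day D. This is not the Windows API: `makePrefixed`, `prefixedByteLength` and `freePrefixed` are made-up names, and the choices to store the length in bytes and to append a terminator are assumptions modelled on how real BSTRs are documented to behave.

```d
import core.stdc.stdlib : malloc, free;
import core.stdc.string : memcpy;

// Hypothetical helper: allocates [uint byte length][wchar data...][0]
// and returns a pointer to the first wchar, not to the start of the block.
wchar* makePrefixed(const(wchar)[] text)
{
    enum prefix = uint.sizeof;
    auto bytes = text.length * wchar.sizeof;
    auto block = cast(ubyte*) malloc(prefix + bytes + wchar.sizeof);
    assert(block !is null, "allocation failed");

    *cast(uint*) block = cast(uint) bytes;   // length lives at offset -4
    auto data = cast(wchar*)(block + prefix);
    if (bytes)
        memcpy(data, text.ptr, bytes);
    data[text.length] = 0;                   // trailing terminator
    return data;
}

// Reads the length stored just behind the data pointer.
uint prefixedByteLength(const(wchar)* data)
{
    return *(cast(const(uint)*) data - 1);
}

// Frees the whole block, starting from the hidden prefix.
void freePrefixed(wchar* data)
{
    free(cast(ubyte*) data - uint.sizeof);
}
```

The pointer returned by `makePrefixed` can be handed to code that expects a plain terminated wchar string, while `prefixedByteLength` recovers the length from the same pointer.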

A similar approach in D could let its arrays remain a single 32-bit pointer while retaining their smart aspects. No external API for managing them would be necessary, since D already handles arrays specially. An additional benefit would be the ability to pass array pointers into C (or other languages) and get them back without losing the length, since the length is stored next to the data the pointer refers to. Currently, passing only the first 32 bits of the array value into a language that is not D-aware "kills" the length by the time the pointer comes back out.

Has this been considered?

- Charlie



April 19, 2004
I think things are even simpler than you expect.

D arrays can be implicitly cast to pointers.  So this code works:
	char[] foo = "something";
	char *bar = foo;
The same happens with function calls into C library functions:
	extern(C) int baz(char *arg1);
	int rc = baz(foo);
The baz() function will see a pointer to the 's' character of "something".
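
For readers on current compilers: as far as I know, the implicit array-to-pointer conversion shown above is no longer accepted, and the pointer half of an array is requested explicitly with `.ptr`. A minimal sketch, with `firstCharViaPointer` as a made-up name:

```d
// Takes the pointer half of a D array, which is all a C callee receives.
char firstCharViaPointer(char[] foo)
{
    char* bar = foo.ptr;   // explicit form of `char *bar = foo;`
    return *bar;           // assumes foo is not empty
}
```

The length stays behind in the D array value (`foo.length`); slicing the raw pointer with that length, `foo.ptr[0 .. foo.length]`, gives back the same array.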

This works pretty well, but you have to remember a few things:

1) D strings are not null terminated.

Since we have a built-in length field, D strings do not need to be null-terminated.  So, in most cases, you need to add a null terminator to the string before sending it to C.  The standard library function:
	import std.string;
	char *toStringz(char[]);
will do this for you.  So if you don't know whether your string is null terminated or not, you would call it like this:
	baz(toStringz(foo));
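
A small sketch of why the terminator matters, written against present-day `std.string.toStringz` and the C `strlen` binding in `core.stdc.string`. `cLength` is a made-up helper name.

```d
import std.string : toStringz;
import core.stdc.string : strlen;

// Length that a C function would compute for `s` once it has been
// converted to a '\0'-terminated string.
size_t cLength(const(char)[] s)
{
    return strlen(toStringz(s));
}
```

For a slice such as `"hello world"[0 .. 5]`, `cLength` reports 5, whereas calling `strlen` directly on the slice's pointer would run on past the slice into the rest of the literal.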

FYI: The compiler is smart enough to add a null after all constant strings.  (This null does NOT count toward the total length of the string.)  So, if you are passing an argument which is a constant string (not something you've generated at runtime), then toStringz() is unnecessary.  You can call it, but it won't do anything because it will see that there is a null out just past the "end" of the string.
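
The hidden terminator on literals can be checked directly. This is only a sketch; `hasTrailingNull` is a made-up name, and peeking one element past the end is only safe for string literals, where the language provides the extra byte.

```d
// True when the byte just past the end of `s` is '\0'.
// Only call this with string literals: for any other array the
// byte past the end may not belong to the array at all.
bool hasTrailingNull(string s)
{
    return s.ptr[s.length] == '\0';
}
```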

2) D doesn't reset the length if C modifies the string.

So if C changes the length, your D code will have to account for that by hand.  The easiest way to do this is to reinitialize the array using the slice syntax and the strlen() function:
	import std.string; // gives you extern(C) int strlen(char*)
	int rc = baz(foo);
	foo = foo[0..strlen(foo)];
But you would have to do something like that with BSTR as well.
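
The same resynchronisation step in present-day D might look like the sketch below. `demo_shorten` is a stand-in for whatever C function rewrites the buffer, and `callAndResync` is a made-up name; the sketch assumes the buffer has room for at least six chars.

```d
import core.stdc.string : strcpy, strlen;

// Stand-in for a C function that overwrites the caller's buffer with a
// shorter, '\0'-terminated string (hypothetical, for illustration only).
extern (C) void demo_shorten(char* buf)
{
    strcpy(buf, "short");
}

// Calls the C function, then rebuilds the D array so that its length
// matches what C left in the buffer.
char[] callAndResync(char[] buf)
{
    demo_shorten(buf.ptr);
    return buf[0 .. strlen(buf.ptr)];
}
```

The returned array still points at the original storage; only the length half of the array value has changed.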


April 19, 2004
"Charles Oliver Nutter" <Charles_member@pathlink.com> wrote in message news:c61fq7$1pim$1@digitaldaemon.com...
> A similar approach in D could allow its arrays to remain a 32 bit pointer while
> retaining the smart aspects. An external API for managing such would not be
> necessary, since D already handles arrays specially. The additional benefit here
> would be the ability to pass array pointers into C/others and get them back
> without losing the length data (since the length is also stored at or "near" the
> pointer location), where currently passing only the first 32 bits of the array
> pointer into a non-D-aware language will "kill" the length data when it comes
> back out.
>
> Has this been considered?

Yes. The good reasons for it are as you stated. The reason against it is that such a technique does not allow array slicing: a slice points into the middle of an existing array, where there is no room to store a length in front of the data. Another problem is that a C string cannot be converted into a BSTR without copying the entire string.
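
To make the slicing point concrete, here is a minimal sketch in present-day D (`tail` is a made-up name). A slice is just a new pointer and length pair aimed into the existing data, so nothing is copied and nothing is written.

```d
// Returns the part of `s` from index `from` to the end, without copying.
const(char)[] tail(const(char)[] s, size_t from)
{
    return s[from .. $];
}
```

With `s = "hello world"`, `tail(s, 6)` is `"world"` and its pointer is `s.ptr + 6`. A length-prefixed layout would need a length stored immediately before the `w`, but that memory already holds the preceding characters of `s`.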


April 19, 2004
The case I had in mind, where the length is "lost" for D strings but not for a BSTR, is the following:

char[256] mystring;
mystring = some_c_function_that_returns_char_pointer(mystring);

At this point, since we've passed 32 bits in and gotten 32 bits out, we've lost the length part of the variable. Perhaps this assignment is not even legal?

If the length was stored at the memory location pointed to by the variable, rather than in the variable itself, the length would never be lost simply by passing the value around.

There may not be a lot of functions that return a pointer to unmodified memory contents, but perhaps this illustrates where the BSTR method of length management avoids a possible problem.
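
For comparison, this is roughly how the round trip can be handled in present-day D when the C side is known to return a '\0'-terminated string. `demo_fill` stands in for the C function and `roundTrip` is a made-up name; `std.string.fromStringz` rebuilds an array from the pointer by scanning for the terminator.

```d
import std.string : fromStringz;
import core.stdc.string : strcpy;

// Stand-in for a C function that fills the buffer and returns the same
// pointer it was given (hypothetical, for illustration only).
extern (C) char* demo_fill(char* buf)
{
    return strcpy(buf, "from C");
}

// Passes only the pointer to C, then recovers a D array from the
// pointer that comes back.
char[] roundTrip(ref char[256] storage)
{
    char* p = demo_fill(storage.ptr);
    return fromStringz(p);
}
```

The recovered array covers the text C wrote, not the full 256-char capacity; the capacity is still available from the original fixed-size array.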

I have another question regarding point #1 of Russ's reply (D strings are not null terminated): I'm sure this has been discussed, but why not null-terminate strings? Considering the vast number of APIs that expect strings to be null terminated, wouldn't it make sense? I can appreciate wanting all arrays to be uniform, but anyone using D would never have to see the difference between char arrays and all other arrays, and existing APIs that expect terminating null chars would work without quirks.

- Charlie

April 19, 2004
"Charles Oliver Nutter" <headius@headius.com> wrote in message news:c61hok$1teo$1@digitaldaemon.com...
> If the length was stored at the memory location pointed to by the variable, rather than in the variable itself, the length would never be lost simply by passing the value around.

Yes, it would, because there's no way to tell if whatever happens to be before the string data is a valid length or a random bit pattern.
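
A small sketch of what "whatever happens to be before the string data" can look like (`fourBytesBefore` is a made-up name). It assumes the pointer is at least four bytes into some larger block, as is the case for the slice used here.

```d
// Returns the four bytes stored immediately in front of `slice`.
// Only valid when `slice` starts at least 4 bytes into its parent block.
const(char)[] fourBytesBefore(const(char)[] slice)
{
    return (slice.ptr - 4)[0 .. 4];
}
```

For the slice `"hello world"[6 .. $]`, those four bytes are the characters `"llo "`. Code that treated them as a length prefix would read an arbitrary number.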

> I have another question regarding #1 below: I'm sure this has been discussed, but why not null-terminate strings? Considering the vast number of APIs that expect strings to be null terminated, wouldn't it make sense? I can appreciate wanting all arrays to be uniform, but anyone using D would never have to see the difference between char arrays and all other arrays, and existing APIs that expect terminating null chars would work without quirks.

Null terminated strings cannot be sliced, and cause problems representing binary data.
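
The binary data problem can be seen with a few bytes. In this sketch (`visibleToC` is a made-up name) the D array knows its full length, while a C string function stops at the first zero byte.

```d
import core.stdc.string : strlen;

// Number of bytes a C string function would see in `data`.
// Assumes `data` contains at least one zero byte.
size_t visibleToC(const(ubyte)[] data)
{
    return strlen(cast(const(char)*) data.ptr);
}
```

For the four bytes `[0x41, 0x00, 0x42, 0x00]`, the array's `length` is 4 but `visibleToC` reports 1.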


April 19, 2004
Why not accommodate both?
Have all strings allocate an extra char internally, and always have it be \0?

For D's purposes, the BSTR length property would be used, and the extra char ignored, but it would always be there for calling C functions expecting a zero terminated string.

- Rob

April 19, 2004
"Robert Atkinson" <z1zg1@NO.SPAM.unb.ca> wrote in message news:c61n39$27o5$1@digitaldaemon.com...
> Why not accommodate both?

There's no reason why one cannot program in D right now using char*'s and null terminated strings.

> Have all strings allocate an extra char internally, and always have it be \0?

This is already done for static string literals.

> For D's purposes, the BSTR length property would be used, and the extra char ignored, but it would always be there for calling C functions expecting a zero terminated string.

One cannot slice a string and put a null at the end without corrupting memory. But if you stick with char*'s and the C string library functions, it will work fine with 0 terminated strings.
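
A sketch of the corruption in question, in present-day D (`terminateInPlace` is a made-up name). The naive way to terminate a slice is to write a zero one element past its end, but that element belongs to the parent array.

```d
// Naive in-place termination: writes '\0' one element past the slice.
// Assumes the slice does not end at the end of its allocation;
// otherwise this writes out of bounds.
void terminateInPlace(char[] slice)
{
    slice.ptr[slice.length] = '\0';
}
```

After taking `whole[0 .. 5]` from a copy of `"hello world"` and terminating it this way, the space at `whole[5]` has been overwritten, so the parent no longer holds its original text.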