String literal consistency and implicit array conversions..
July 31, 2005
Okay, here's another array topic to discuss while we're on this whole "null array" issue.

Recently, I was working on some stuff that interfaced with a C function that took a char*.  It was working fine.  Then, I changed something, and started getting access violations.  Guess what it was?

myFunc("a string"); // works fine
myFunc(`a wysiwyg string?`); // access violation

I tracked down the problem - I wasn't calling toStringz() on the char[] before passing it to the C function.

This weird problem occurred because double-quoted string literals in D are null-terminated.  I guess this is for quick-and-dirty interop with C libraries.  This is inconsistent with D's design, however.   D strings should not be null-terminated, as in D, strings are represented by a length and data.  Any interaction with strings in C functions should use toStringz() on the char[] before passing it into the C function.
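
A minimal sketch of the length-and-data point, assuming a recent D compiler with Phobos (std.string.toStringz and core.stdc.string.strlen). It only shows that a slice has no terminator of its own and that toStringz supplies one; it makes no claim about which kinds of literal are terminated:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string s = "a string";

    // A slice is a pointer plus a length. The byte after the slice is
    // whatever happens to follow it in memory, here the space of " string".
    string part = s[0 .. 1];
    assert(part.length == 1);

    // toStringz returns a pointer to zero-terminated data, copying if needed,
    // so a C function stops reading where the D slice ends.
    auto z = toStringz(part);
    assert(strlen(z) == 1);
}
```

Passing part.ptr straight to strlen would read past the slice until it happened to hit a zero byte.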

The second issue I had with this was that I was simply passing the array into the C function, like so:

void myFunc(char[] str)
{
    someCFunc(str);
}

This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a char[].  They are two different things, and a D char[] does not have any built-in type that resembles it in C.  It should not be possible to implicitly cast T[] to T*; that's why the .ptr property was invented.  I can't think of any time when you would need to implicitly cast T[] to T* besides a quick-and-dirty way to pass arrays into C functions.  It's sloppy.

Interfacing with C functions should be possible in D, but it should be obvious when it happens.  If implicit conversion from T[] to T* is made illegal, it will make bugs like passing an un-toStringz'ed char[] into a C function show up.
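
A sketch of what the explicit version looks like, assuming a recent D compiler (which, as far as I know, rejects the implicit T[] to T* conversion) and Phobos. someCFunc here is a stand-in defined in D purely for illustration:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

// Stand-in for someCFunc from the post: any C function taking a char*.
extern (C) size_t someCFunc(const(char)* s)
{
    return strlen(s);
}

void myFunc(const(char)[] str)
{
    // someCFunc(str.ptr) would also compile, but passes unterminated data.
    // toStringz makes the conversion explicit and adds the terminator.
    assert(someCFunc(toStringz(str)) == str.length);
}

void main()
{
    // Assumption: on the compiler used, the implicit conversion is an error
    // and the explicit .ptr form is accepted.
    static assert(!__traits(compiles, { char[] a; char* p = a; }));
    static assert( __traits(compiles, { char[] a; char* p = a.ptr; }));

    myFunc("a wysiwyg string?"[2 .. 9]);   // the slice "wysiwyg"
}
```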


July 31, 2005
On Sun, 31 Jul 2005 12:37:31 -0400, Jarrett Billingsley <kb3ctd2@yahoo.com> wrote:
> Okay, here's another array topic to discuss while we're on this whole "null array" issue.
>
> Recently, I was working on some stuff that interfaced with a C function that took a char*.  It was working fine.  Then, I changed something, and started getting access violations.  Guess what it was?
>
> myFunc("a string"); // works fine
> myFunc(`a wysiwyg string?`); // access violation
>
> I tracked down the problem - I wasn't calling toStringz() on the char[]
> before passing it to the C function.
>
> This weird problem occurred because double-quoted string literals in D are null-terminated.  I guess this is for quick-and-dirty interop with C
> libraries.  This is inconsistent with D's design, however.   D strings
> should not be null-terminated, as in D, strings are represented by a length and data.  Any interaction with strings in C functions should use
> toStringz() on the char[] before passing it into the C function.
>
> The second issue I had with this was that I was simply passing the array
> into the C function, like so:
>
> void myFunc(char[] str)
> {
>     someCFunc(str);
> }
>
> This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a
> char[].  They are two different things, and a D char[] does not have any
> built-in type that resembles it in C.  It should not be possible to
> implicitly cast T[] to T*; that's why the .ptr property was invented.  I
> can't think of any time when you would need to implicitly cast T[] to T*
> besides a quick-and-dirty way to pass arrays into C functions.  It's sloppy.
>
> Interfacing with C functions should be possible in D, but it should be
> obvious when it happens.  If implicit conversion from T[] to T* is made
> illegal, it will make bugs like passing an un-toStringz'ed char[] into a C function show up.

Interesting..

Technically C's "char*" type is analogous to D's "byte*" type, not D's "char*" type.

So, technically the solution is to replace all the "char*" in phobos' C function declarations with "byte*". This will cause errors everywhere a char[] is passed as byte*.

To solve these I would replace toStringz with several toX functions where 'X' is the character set required. We'd need to write these functions, they could verify and/or convert the data if required to the character set.

The code would then resemble:

char* s = strchr(toISO9660("abc"),'a');

Or, if you prefer a less invasive solution...

In C a "char*" will be null terminated in all but the strangest cases. I'd guess 99.9% of cases are null terminated. So, when interfacing with C 99.9% of the time "char*" instances should be null terminated. I don't see much use of "char*" in straight D code, char[] is simply a much better choice.

So, the solution? Well, the one proposed:
 - make T[] to T* illegal.

Would probably solve the issue. Though a programmer could still change "array" to "array.ptr" and remove the error but not the crash.

Consider however byte, long, int or other arrays. These arrays are not typically "null terminated" because the value 0 generally has no special meaning for these types. In fact when using arrays of these types in C a special value is chosen which is outside the range of possible values, if that is possible, if not a length is passed with the array.

So, as a general rule:
 - make T[] to T* illegal.

has a negative impact on the usability of other array types WRT calling C functions. Granted these array types are not used anywhere near as often as char.
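
For non-character arrays the usual C convention is pointer plus explicit length. A small sketch, where sumInts is a hypothetical C-style function written in D just so the example is self-contained:

```d
// Hypothetical C-style function: takes a pointer plus an explicit length,
// because 0 is an ordinary value for ints and cannot act as a terminator.
extern (C) int sumInts(const(int)* data, size_t count)
{
    int total = 0;
    foreach (i; 0 .. count)
        total += data[i];
    return total;
}

void main()
{
    int[] values = [3, 0, 4];

    // With an explicit-only rule, the call site spells out both halves
    // of the D array: .ptr for the data and .length for the count.
    assert(sumInts(values.ptr, values.length) == 7);
}
```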

So, the solution: If we assume 99.9% of char* cases should be null terminated, can we ensure/make 100% of cases null terminated at no (or small) negative effect, in other words:
 - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly call toStringz

This will solve the issue, implicitly, silently, in fact most users won't even realise it's being done. That, however, is the negative aspect: this could cause a reallocation and it's all done silently.

Regan
July 31, 2005
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsusta6aa23k2f5@nrage.netwin.co.nz...
> Technically C's "char*" type is analogous to D's "byte*" type, not D's "char*" type.

I was getting more at the fact that C doesn't represent arrays in the same way that D does, but that interpretation works too.

> So, technically the solution is to replace all the "char*" in phobos' C function declarations with "byte*". This will cause errors everywhere a char[] is passed as byte*.
>
> To solve these I would replace toStringz with several toX functions where 'X' is the character set required. We'd need to write these functions, they could verify and/or convert the data if required to the character set.
>
> The code would then resemble:
>
> char* s = strchr(toISO9660("abc"),'a');

No offense, but I really don't have any idea what you're getting at here :)

> Or, if you prefer a less invasive solution...
>
> In C a "char*" will be null terminated in all but the strangest cases. I'd guess 99.9% of cases are null terminated. So, when interfacing with C 99.9% of the time "char*" instances should be null terminated. I don't see much use of "char*" in straight D code, char[] is simply a much better choice.
>
> So, the solution? Well, the one proposed:
>  - make T[] to T* illegal.
>
> Would probably solve the issue. Though a programmer could still change "array" to "array.ptr" and remove the error but not the crash.

Yes, but making this illegal

char[] someString;
someCFunc(someString);  // error, no implicit cast from char[] to char*

Should be a red flag that says "oh yeah, this is a string, I should probably be doing something to it."  Sure, you could write someString.ptr, but hopefully you'll realize that you need to write toStringz(someString) instead.

> Consider however byte, long, int or other arrays. These arrays are not typically "null terminated" because the value 0 generally has no special meaning for these types. In fact when using arrays of these types in C a special value is chosen which is outside the range of possible values, if that is possible, if not a length is passed with the array.
>
> So, as a general rule:
>  - make T[] to T* illegal.
>
> has a negative impact on the usability of other array types WRT calling C functions. Granted these array types are not used anywhere near as often as char.

The frequency that you pass a numerical array to a C function is very low indeed, and the only change that would need to be made would be writing ".ptr" explicitly.  If nothing else, it makes it more obvious that the array is being turned into a C-style "array."

> So, the solution: If we assume 99.9% of char* cases should be null
> terminated, can we ensure/make 100% of cases null terminated at no (or
> small) negative effect, in other words:
>  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly
> call toStringz
>
> This will solve the issue, implicitly, silently, in fact most users won't even realise it's being done. That, however, is the negative aspect: this could cause a reallocation and it's all done silently.

Most of the time, when passing a string to a C function, you're going to be calling toStringz anyway.  And about the only time you take advantage of the implicit casting from [w/d]char[] to [w/d]char* is when passing a string to a C function.  So there wouldn't really be any loss in speed, though perhaps a non-trivial operation such as toStringz shouldn't be done implicitly, for the sake of clarity.

How about a .toStringz property for char[]?  ;)
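
A sketch of how that could read, assuming a recent D compiler where a free function can be called with member syntax, and Phobos' std.string.toStringz. No new property is defined here:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    char[] someString = "hello".dup;

    // The free function toStringz called with property-like syntax.
    auto z = someString.toStringz;
    assert(strlen(z) == 5);
}
```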


August 01, 2005
On Sun, 31 Jul 2005 19:09:59 -0400, Jarrett Billingsley <kb3ctd2@yahoo.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsusta6aa23k2f5@nrage.netwin.co.nz...
>> Technically C's "char*" type is analogous to D's "byte*" type, not D's
>> "char*" type.
>
> I was getting more at the fact that C doesn't represent arrays in the same way that D does, but that interpretation works too.

I think this is the root of the problem. D's arrays are definitely not C's arrays. At the same time D's "char*" is not C's "char*" either (tho these are more similar) because D's char* is UTF-8 encoded by definition. C's char* has no definite encoding. D's "byte*" is C's "char*", an array of signed 8 bit values.

>> So, technically the solution is to replace all the "char*" in phobos' C
>> function declarations with "byte*". This will cause errors everywhere a
>> char[] is passed as byte*.
>>
>> To solve these I would replace toStringz with several toX functions where
>> 'X' is the character set required. We'd need to write these functions,
>> they could verify and/or convert the data if required to the character
>> set.
>>
>> The code would then resemble:
>>
>> char* s = strchr(toISO9660("abc"),'a');
>
> No offense, but I really don't have any idea what you're getting at here :)

toStringz takes char/wchar/dchar and simply ensures it has a trailing null character. It does not deal with whether it is encoded in the correct character set.

My knowledge of character encodings is a little shaky, so someone correct me if I'm wrong.. C functions expect a certain encoding, that is defined by locale information. I believe it can differ on each PC based on the user's language etc. On my PC I'd guess it's Windows-1252 (or something).

D's char[] is UTF-8. UTF-8 is a superset of ASCII, and Unicode covers every character in Windows-1252, but the two encodings only share byte values in the ASCII range. Unicode also contains characters that cannot be represented in Windows-1252 at all, Chinese characters for example. If a D char[] contains any non-ASCII character and you pass it to a C function you will get strange results (because it expects Windows-1252 and you're sending it UTF-8).
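
A small sketch of the mismatch, assuming a recent D compiler and Phobos. It only inspects the bytes a C function would receive; it does not call any locale-aware C API:

```d
import core.stdc.string : strlen;
import std.string : toStringz;

void main()
{
    string s = "\u00E9";            // one character, e with acute accent

    // In UTF-8 it occupies two code units: 0xC3 0xA9.
    assert(s.length == 2);

    auto z = toStringz(s);
    assert(strlen(z) == 2);         // the C side sees two bytes
    assert(cast(ubyte) z[0] == 0xC3);
    assert(cast(ubyte) z[1] == 0xA9);
}
```

A function that interprets those two bytes as Windows-1252 would treat them as two separate characters rather than one.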

So, instead of simply ensuring the trailing null is present why not ensure the encoding is correct also.

Thinking about it, I believe we can obtain the locale information and automatically transcode to the local encoding, and/or throw an exception if that's impossible. So, in fact, (changing my mind/idea here) we don't need a separate function for each encoding but can do the transcoding to the local encoding automatically inside toStringz.

If we want to leave toStringz "as is" we could instead add a toStringLocal function to do the transcoding. And/or a toCString function to do both. The transcoding requires some sort of character encoding library to be added to phobos, I believe there has been some work done in this area.. converting a C lib? (I forget the name)
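
A rough sketch of the "transcode or throw" idea, assuming a recent D compiler. toLatin1z is a hypothetical helper, not part of Phobos, and it targets ISO-8859-1 as a simple stand-in for the local encoding (no locale lookup is done):

```d
import std.exception : assertThrown;

// Hypothetical helper, not part of Phobos: transcode UTF-8 to
// zero-terminated ISO-8859-1, throwing on unrepresentable characters.
ubyte* toLatin1z(const(char)[] s)
{
    ubyte[] result;
    foreach (dchar c; s)            // decodes UTF-8 one code point at a time
    {
        if (c > 0xFF)
            throw new Exception("character not representable in ISO-8859-1");
        result ~= cast(ubyte) c;    // Latin-1 bytes equal code points 0..255
    }
    result ~= 0;                    // terminator expected by C functions
    return result.ptr;
}

void main()
{
    auto p = toLatin1z("caf\u00E9");
    assert(p[3] == 0xE9 && p[4] == 0);   // one byte for the accented e

    assertThrown(toLatin1z("\u4E2D"));   // a Chinese character: no mapping
}
```

A real version would pick the target encoding from the locale and would need a proper transcoding library for anything beyond this one-to-one case.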

>> Or, if you prefer a less invasive solution...
>>
>> In C a "char*" will be null terminated in all but the strangest cases. I'd
>> guess 99.9% of cases are null terminated. So, when interfacing with C
>> 99.9% of the time "char*" instances should be null terminated. I don't see
>> much use of "char*" in straight D code, char[] is simply a much better
>> choice.
>>
>> So, the solution? Well, the one proposed:
>>  - make T[] to T* illegal.
>>
>> Would probably solve the issue. Though a programmer could still change
>> "array" to "array.ptr" and remove the error but not the crash.
>
> Yes, but making this illegal
>
> char[] someString;
> someCFunc(someString);  // error, no implicit cast from char[] to char*
>
> Should be a red flag that says "oh yeah, this is a string, I should probably be doing something to it."  Sure, you could write someString.ptr, but
> hopefully you'll realize that you need to write toStringz(someString)
> instead.

Sure, "hopefully", that was my point. I'd prefer something less error prone.

>> Consider however byte, long, int or other arrays. These arrays are not
>> typically "null terminated" because the value 0 generally has no special
>> meaning for these types. In fact when using arrays of these types in C a
>> special value is chosen which is outside the range of possible values, if
>> that is possible, if not a length is passed with the array.
>>
>> So, as a general rule:
>>  - make T[] to T* illegal.
>>
>> has a negative impact on the usability of other array types WRT calling C
>> functions. Granted these array types are not used anywhere near as often
>> as char.
>
> The frequency that you pass a numerical array to a C function is very low
> indeed

Yes, like I said.

> , and the only change that would need to be made would be writing
> ".ptr" explicitly. If nothing else, it makes it more obvious that the array is being turned into a C-style "array."

Yes, but that means it's .ptr for non string types and toStringz for string types. A little inconsistent, tho acceptable. I prefer the change to "byte*" and if not that, I'd prefer to implicitly convert from char[] to null terminated char*.

>> So, the solution: If we assume 99.9% of char* cases should be null
>> terminated, can we ensure/make 100% of cases null terminated at no (or
>> small) negative effect, in other words:
>>  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly
>> call toStringz
>>
>> This will solve the issue, implicitly, silently, in fact most users won't
>> even realise it's being done. That, however, is the negative aspect: this
>> could cause a reallocation and it's all done silently.
>
> Most of the time, when passing a string to a C function, you're going to be calling toStringz anyway.

Yes, so make it implicit. I say.

> And about the only time you take advantage of the implicit casting from [w/d]char[] to [w/d]char* is when passing a string to a C function.  So there wouldn't really be any loss in speed, though perhaps a non-trivial operation such as toStringz shouldn't be done implicitly, for the sake of clarity.

Maybe. That is the potential negative aspect of the idea. I wonder what someone else thinks.

> How about a .toStringz property for char[]?  ;)

.ptr could give a null terminated char* (i.e. call toStringz on the data) if we want it to be explicit.

Regan