July 31, 2005
String literal consistency and implicit array conversions..
Okay, here's another array topic to discuss while we're on this whole "null 
array" issue.

Recently, I was working on some stuff that interfaced with a C function that 
took a char*.  It was working fine.  Then, I changed something, and started 
getting access violations.  Guess what it was?

myFunc("a string"); // works fine
myFunc(`a wysiwyg string?`); // access violation

I tracked down the problem - I wasn't calling toStringz() on the char[] 
before passing it to the C function.

This weird problem occurred because double-quoted string literals in D are 
null-terminated.  I guess this is for quick-and-dirty interop with C 
libraries.  This is inconsistent with D's design, however.   D strings 
should not be null-terminated, as in D, strings are represented by a length 
and data.  Any interaction with strings in C functions should use 
toStringz() on the char[] before passing it into the C function.

The second issue I had with this was that I was simply passing the array 
into the C function, like so:

void myFunc(char[] str)
{
   someCFunc(str);
}

This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a 
char[].  They are two different things, and a D char[] does not have any 
built-in type that resembles it in C.  It should not be possible to 
implicitly cast T[] to T*; that's why the .ptr property was invented.  I 
can't think of any time when you would need to implicitly cast T[] to T* 
besides a quick-and-dirty way to pass arrays into C functions.  It's sloppy.

Interfacing with C functions should be possible in D, but it should be 
obvious when it happens.  If implicit conversion from T[] to T* is made 
illegal, it will make bugs like passing an un-toStringz'ed char[] into a C 
function show up.
July 31, 2005
Re: String literal consistency and implicit array conversions..
On Sun, 31 Jul 2005 12:37:31 -0400, Jarrett Billingsley  
<kb3ctd2@yahoo.com> wrote:
> Okay, here's another array topic to discuss while we're on this whole  
> "null array" issue.
>
> Recently, I was working on some stuff that interfaced with a C function  
> that took a char*.  It was working fine.  Then, I changed something, and  
> started getting access violations.  Guess what it was?
>
> myFunc("a string"); // works fine
> myFunc(`a wysiwyg string?`); // access violation
>
> I tracked down the problem - I wasn't calling toStringz() on the char[]
> before passing it to the C function.
>
> This weird problem occurred because double-quoted string literals in D  
> are null-terminated.  I guess this is for quick-and-dirty interop with C
> libraries.  This is inconsistent with D's design, however.   D strings
> should not be null-terminated, as in D, strings are represented by a  
> length and data.  Any interaction with strings in C functions should use
> toStringz() on the char[] before passing it into the C function.
>
> The second issue I had with this was that I was simply passing the array
> into the C function, like so:
>
> void myFunc(char[] str)
> {
>     someCFunc(str);
> }
>
> This is an issue.  someCFunc() takes a char*.  str is not a char*, it's a
> char[].  They are two different things, and a D char[] does not have any
> built-in type that resembles it in C.  It should not be possible to
> implicitly cast T[] to T*; that's why the .ptr property was invented.  I
> can't think of any time when you would need to implicitly cast T[] to T*
> besides a quick-and-dirty way to pass arrays into C functions.  It's  
> sloppy.
>
> Interfacing with C functions should be possible in D, but it should be
> obvious when it happens.  If implicit conversion from T[] to T* is made
> illegal, it will make bugs like passing an un-toStringz'ed char[] into a  
> C function show up.

Interesting..

Technically C's "char*" type is analogous to D's "byte*" type, not D's  
"char*" type.

So, technically the solution is to replace all the "char*" in phobos' C  
function declarations with "byte*". This will cause errors everywhere a  
char[] is passed as byte*.

To solve these I would replace toStringz with several toX functions, where  
'X' is the character set required. We'd need to write these functions;  
they could verify and/or convert the data to the required character set.

The code would then resemble:

char* s = strchr(toISO9660("abc"),'a');

Or, if you prefer a less invasive solution...

In C a "char*" will be null terminated in all but the strangest cases. I'd  
guess 99.9% of cases are null terminated. So, when interfacing with C  
99.9% of the time "char*" instances should be null terminated. I don't see  
much use of "char*" in straight D code, char[] is simply a much better  
choice.

So, the solution? Well, the one proposed:
 - make T[] to T* illegal.

Would probably solve the issue. Though a programmer could still change  
"array" to "array.ptr" and remove the error but not the crash.

Consider however byte, long, int or other arrays. These arrays are not  
typically "null terminated", because the value 0 generally has no special  
meaning for these types. Instead, when using arrays of these types in C, a  
sentinel value outside the range of possible values is chosen, if that is  
possible; if not, a length is passed along with the array.

So, as a general rule:
 - make T[] to T* illegal.

has a negative impact on the usability of other array types WRT calling C  
functions. Granted these array types are not used anywhere near as often  
as char.

So, the solution: If we assume 99.9% of char* cases should be null  
terminated, can we ensure/make 100% of cases null terminated at no (or  
small) negative effect, in other words:
 - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly  
call toStringz

This would solve the issue implicitly and silently; in fact most users won't  
even realise it's being done. That, however, is also the negative aspect:  
it could cause a reallocation, and it's all done silently.

Regan
July 31, 2005
Re: String literal consistency and implicit array conversions..
"Regan Heath" <regan@netwin.co.nz> wrote in message 
news:opsusta6aa23k2f5@nrage.netwin.co.nz...
> Technically C's "char*" type is analogous to D's "byte*" type, not D's 
> "char*" type.

I was getting more at the fact that C doesn't represent arrays in the same 
way that D does, but that interpretation works too.

> So, technically the solution is to replace all the "char*" in phobos' C 
> function declarations with "byte*". This will cause errors everywhere a 
> char[] is passed as byte*.
>
> To solve these I would replace toStringz with several toX functions where 
> 'X' is the character set required. We'd need to write these functions, 
> they could verify and/or convert the data if required to the character 
> set.
>
> The code would then resemble:
>
> char* s = strchr(toISO9660("abc"),'a');

No offense, but I really don't have any idea what you're getting at here :)

> Or, if you prefer a less invasive solution...
>
> In C a "char*" will be null terminated in all but the strangest cases. I'd 
> guess 99.9% of cases are null terminated. So, when interfacing with C 
> 99.9% of the time "char*" instances should be null terminated. I don't see 
> much use of "char*" in straight D code, char[] is simply a much better 
> choice.
>
> So, the solution? Well, the one proposed:
>  - make T[] to T* illegal.
>
> Would probably solve the issue. Though a programmer could still change 
> "array" to "array.ptr" and remove the error but not the crash.

Yes, but making this illegal

char[] someString;
someCFunc(someString);  // error, no implicit cast from char[] to char*

Should be a red flag that says "oh yeah, this is a string, I should probably 
be doing something to it."  Sure, you could write someString.ptr, but 
hopefully you'll realize that you need to write toStringz(someString) 
instead.

> Consider however byte, long, int or other arrays. These arrays are not 
> typically "null terminated" because the value 0 generally has no special 
> meaning for these types. In fact when using arrays of these types in C a 
> special value is chosen which is outside the range of possible values, if 
> that is possible, if not a length is passed with the array.
>
> So, as a general rule:
>  - make T[] to T* illegal.
>
> has a negative impact on the usability of other array types WRT calling C 
> functions. Granted these array types are not used anywhere near as often 
> as char.

The frequency that you pass a numerical array to a C function is very low 
indeed, and the only change that would need to be made would be writing 
".ptr" explicitly.  If nothing else, it makes it more obvious that the array 
is being turned into a C-style "array."

> So, the solution: If we assume 99.9% of char* cases should be null 
> terminated, can we ensure/make 100% of cases null terminated at no (or 
> small) negative effect, in other words:
>  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly 
> call toStringz
>
> This will solve the issue, implicitly, silently, in fact most users won't 
> even realise it's being done. That, however is the negative aspect, this 
> could cause a reallocation and it's all done silenty.

Most of the time, when passing a string to a C function, you're going to be 
calling toStringz anyway.  And about the only time you take advantage of the 
implicit casting from [w/d]char[] to [w/d]char* is when passing a string to 
a C function.  So there wouldn't really be any loss in speed, though perhaps 
a non-trivial operation such as toStringz shouldn't be done implicitly, for 
the sake of clarity.

How about a .toStringz property for char[]?  ;)
August 01, 2005
Re: String literal consistency and implicit array conversions..
On Sun, 31 Jul 2005 19:09:59 -0400, Jarrett Billingsley  
<kb3ctd2@yahoo.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsusta6aa23k2f5@nrage.netwin.co.nz...
>> Technically C's "char*" type is analogous to D's "byte*" type, not D's
>> "char*" type.
>
> I was getting more at the fact that C doesn't represent arrays in the  
> same way that D does, but that interpretation works too.

I think this is the root of the problem. D's arrays are definitely not C's  
arrays. At the same time D's "char*" is not C's "char*" either (tho these  
are more similar), because D's char* points to UTF-8 encoded data by  
definition, while C's char* has no definite encoding. D's "byte*" is the  
real match for C's "char*": a pointer to signed 8-bit values.

>> So, technically the solution is to replace all the "char*" in phobos' C
>> function declarations with "byte*". This will cause errors everywhere a
>> char[] is passed as byte*.
>>
>> To solve these I would replace toStringz with several toX functions  
>> where
>> 'X' is the character set required. We'd need to write these functions,
>> they could verify and/or convert the data if required to the character
>> set.
>>
>> The code would then resemble:
>>
>> char* s = strchr(toISO9660("abc"),'a');
>
> No offense, but I really don't have any idea what you're getting at here  
> :)

toStringz takes char/wchar/dchar and simply ensures it has a trailing null  
character. It does not deal with whether it is encoded in the correct  
character set.

My knowledge of character encodings is a little shaky, so someone correct  
me if I'm wrong.. C functions expect a certain encoding, that is defined  
by locale information. I believe it can differ on each PC based on the  
users language etc. On my PC I'd guess it's Windows-1252 (or something).

D's char[] is UTF-8. UTF-8 is byte-compatible with ASCII, and Unicode's  
character repertoire is a superset of Windows-1252's, so a char[] can  
contain characters that cannot be represented in Windows-1252 at all,  
Chinese characters for example. If a D char[] contains one of these  
characters and you pass it to a C function you will get strange results  
(because it expects Windows-1252 and you're sending it UTF-8).

So, instead of simply ensuring the trailing null is present, why not ensure  
the encoding is correct as well?

Thinking about it, I believe we can obtain the locale information and  
transcode to the local encoding automatically, and/or throw an exception  
if that's impossible. So, in fact (changing my mind/idea here), we don't  
need a separate function for each encoding, but can do the transcoding to  
the local encoding automatically inside toStringz.

If we want to leave toStringz "as is" we could instead add a toStringLocal  
function to do the transcoding. And/or a toCString function to do both.  
The transcoding requires some sort of character encoding library to be  
added to phobos, I believe there has been some work done in this area..  
converting a C lib? (I forget the name)

>> Or, if you prefer a less invasive solution...
>>
>> In C a "char*" will be null terminated in all but the strangest cases.  
>> I'd
>> guess 99.9% of cases are null terminated. So, when interfacing with C
>> 99.9% of the time "char*" instances should be null terminated. I don't  
>> see
>> much use of "char*" in straight D code, char[] is simply a much better
>> choice.
>>
>> So, the solution? Well, the one proposed:
>>  - make T[] to T* illegal.
>>
>> Would probably solve the issue. Though a programmer could still change
>> "array" to "array.ptr" and remove the error but not the crash.
>
> Yes, but making this illegal
>
> char[] someString;
> someCFunc(someString);  // error, no implicit cast from char[] to char*
>
> Should be a red flag that says "oh yeah, this is a string, I should  
> probably be doing something to it."  Sure, you could write  
> someString.ptr, but
> hopefully you'll realize that you need to write toStringz(someString)
> instead.

Sure, "hopefully", that was my point. I'd prefer something less error  
prone.

>> Consider however byte, long, int or other arrays. These arrays are not
>> typically "null terminated" because the value 0 generally has no special
>> meaning for these types. In fact when using arrays of these types in C a
>> special value is chosen which is outside the range of possible values,  
>> if
>> that is possible, if not a length is passed with the array.
>>
>> So, as a general rule:
>>  - make T[] to T* illegal.
>>
>> has a negative impact on the usability of other array types WRT calling  
>> C
>> functions. Granted these array types are not used anywhere near as often
>> as char.
>
> The frequency that you pass a numerical array to a C function is very low
> indeed

Yes, like I said.

> , and the only change that would need to be made would be writing
> ".ptr" explicitly. If nothing else, it makes it more obvious that the  
> array is being turned into a C-style "array."

Yes, but that means it's .ptr for non string types and toStringz for  
string types. A little inconsistent, tho acceptable. I prefer the change  
to "byte*" and if not that, I'd prefer to implicitly convert from char[]  
to null terminated char*.

>> So, the solution: If we assume 99.9% of char* cases should be null
>> terminated, can we ensure/make 100% of cases null terminated at no (or
>> small) negative effect, in other words:
>>  - make char/wchar/dchar[] to char/wchar/dchar* conversions implicitly
>> call toStringz
>>
>> This will solve the issue, implicitly, silently, in fact most users  
>> won't
>> even realise it's being done. That, however is the negative aspect, this
>> could cause a reallocation and it's all done silenty.
>
> Most of the time, when passing a string to a C function, you're going to  
> be calling toStringz anyway.

Yes, so make it implicit. I say.

> And about the only time you take advantage of the implicit casting from  
> [w/d]char[] to [w/d]char* is when passing a string to a C function.  So  
> there wouldn't really be any loss in speed, though perhaps a non-trivial  
> operation such as toStringz shouldn't be done implicitly, for the sake  
> of clarity.

Maybe. That is the potential negative aspect of the idea. I wonder what  
someone else thinks.

> How about a .toStringz property for char[]?  ;)

.ptr could give a null terminated char* (i.e. call toStringz on the data)  
if we want it to be explicit.

Regan