November 20, 2005
Derek Parnell wrote:
> Let's call a halt to this discussion. I suspect that you and I will
> not agree about this function signature matching issue anytime soon.

Whew, I was just starting to wonder what to do. :-)

Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-)

But it was nice to learn that with some folks you really can disagree long and hard, and still not start fighting.

georg
November 20, 2005
On Mon, 21 Nov 2005 00:23:39 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Derek Parnell wrote:
>> Let's call a halt to this discussion. I suspect that you and I will
>> not agree about this function signature matching issue anytime soon.
>
> Whew, I was just starting to wonder what to do. :-)

I'm interested in both your opinions on:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5612

> Maybe we'll save the others some headaches too. Besides, at this point, I guess nobody else reads this thread anyway. :-)

Or they prefer to lurk. Or we scared them away.

> But it was nice to learn that with some folks you really can disagree long and hard, and still not start fighting.

It's how it's supposed to work :)

The key, I believe, is to realise that it's not personal; it's a discussion/argument of opinion. Disagreeing with an opinion is not the same as disliking the person who holds that opinion. Of course, this is only true when the participants do not make comments which can be taken as directed at the person, as opposed to the points of the argument itself. This is harder than it sounds, because the written word often does not convey your meaning as well as your face and voice can in a face-to-face conversation.

My 2c.

Regan
November 21, 2005
On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:

Ok, I'll comment but only 'cos you asked ;-)

> On Sun, 20 Nov 2005 17:28:33 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>> Regan Heath wrote:
>>>  Georg/Derek, I replied to Georg here:
>>> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587
>>>  saying essentially the same things as Derek has above. I reckon we
>>> combine  these threads and continue in this one, as opposed to the one
>>> I linked  above. I or you can link the other thread to here with a post
>>> if you're in  agreement.
>>
>> Good suggestion!
>>
>> I actually intended that, but forgot about it while reading and thinking. :-/
>>
>> So, the reply is to it directly.
> 
> Ok. I have taken your reply, clicked reply, and pasted it in here :)
> (I hope this post isn't confusing for anyone)
> 
> -------------------------
> Copied from: http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5607
> -------------------------
> 
> On Sun, 20 Nov 2005 17:17:33 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> 
>> Regan Heath wrote:
>>> On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede
>>> <georg.wrede@nospam.org>  wrote:
>>>  Let's assume there are 2 functions of the same name (unintentionally),
>>> doing different things.
>>>  In that source file the programmer writes:
>>>  write("test");
>>>  DMD tries to choose the storage type of "test" based on the available
>>> overloads. There are 2 available overloads X and Y. It currently
>>> fails and gives an error.
>>>  If instead it picked an overload (X) and stored "test" in the type
>>> for X, calling the overload for X, I agree, there would be
>>> _absolutely no problems_ with the stored data.
>>>  BUT
>>>  the overload for X doesn't do the same thing as the overload for Y.
>>
>> Isn't that a problem with having overloading at all in a language? Sooner or later, most of us have done it. If not all of us already? Isn't this a problem with overloading in general, and not with UTF?
> 
> You're right. The problem is not limited to string literals; integer literals exhibit exactly the same problem, AFAICS. So, you've convinced me. Here is why...
> 
> http://www.digitalmars.com/d/lex.html#integerliteral
> (see "The type of the integer is resolved as follows")
> 
> In essence integer literals _default_ to 'int' unless another type is specified or required.
> 
> This suggested change does that, and nothing else? (can anyone see a difference?)

Are you suggesting that in the situation where multiple function signatures
could possibly match an undecorated string literal, D should assume that the
string literal is actually in UTF-8 format, and if that then fails to find a
match, it should signal an error?

> If so, and if I can accept the behaviour for integer literals, why can't I for string literals?
> 
> The only logical reason I can think of for not accepting it is if there exists a difference between integer literals and string literals which affects this behaviour.
> 
> I can think of differences, but none which affect the behaviour. So, it seems that if I accept the risk for integers, I have to accept the risk for string literals too.

What might be a relevant point about this is that we are trying to talk about strings, but as far as D is concerned, we are really talking about arrays (of code-units). And for arrays, the current D behaviour is self-consistent. If, however, D supported a true string data type, then a great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the compiler!
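
To illustrate what I mean by arrays of code-units, here's a quick sketch (the accented character is just an arbitrary example, and toUTF32 is the existing std.utf converter):

 import std.stdio;
 import std.utf;

 void main()
 {
     char[] s = "caf\u00E9".dup;   // 'é' occupies two UTF-8 code units
     writefln("%d", s.length);     // prints 5: code units, not characters
     dchar[] t = toUTF32(s);
     writefln("%d", t.length);     // prints 4: one element per character
 }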

> ---
> 
> Note that string promotion should occur just like integer promotion does, eg:
> 
> void foo(long i) {}
> foo(5); //calls foo(long) with no error

But what happens when ...
 void foo(long i) {}
 void foo(short i) {}

 foo(5); //calls ???

> void foo(wchar[] s) {}
> foo("test"); //should call foo(wchar[]) with no error
> 
> this behaviour is current and should not change.

Agreed.

 void foo(wchar[] s) {}
 void foo(char[]  s) {}
 foo("test"); //should call ???

I'm now thinking that it should call the char[] signature without error.

But in this case ...

 void foo(wchar[] s) {}
 void foo(dchar[] s) {}
 foo("test"); //should call an error.


If we had a generic string type we'd probably just code ....

 void foo(string s) {}
 foo("test");  // Calls the one function
 foo("test"d); // Also calls the one function

D would convert to an appropriate UTF format silently before (and after)
calling.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
21/11/2005 10:41:35 AM
November 21, 2005
On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek@psych.ward> wrote:
> On Mon, 21 Nov 2005 10:58:28 +1300, Regan Heath wrote:
>
> Ok, I'll comment but only 'cos you asked ;-)

Thanks <g>.

>>> Regan Heath wrote:
>>>> On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede
>>>> <georg.wrede@nospam.org>  wrote:
>>>>  Let's assume there are 2 functions of the same name (unintentionally),
>>>> doing different things.
>>>>  In that source file the programmer writes:
>>>>  write("test");
>>>>  DMD tries to choose the storage type of "test" based on the available
>>>> overloads. There are 2 available overloads X and Y. It currently
>>>> fails and gives an error.
>>>>  If instead it picked an overload (X) and stored "test" in the type
>>>> for X, calling the overload for X, I agree, there would be
>>>> _absolutely no problems_ with the stored data.
>>>>  BUT
>>>>  the overload for X doesn't do the same thing as the overload for Y.
>>>
>>> Isn't that a problem with having overloading at all in a language?
>>> Sooner or later, most of us have done it. If not all of us already? Isn't
>>> this a problem with overloading in general, and not with UTF?
>>
>> You're right. The problem is not limited to string literals; integer
>> literals exhibit exactly the same problem, AFAICS. So, you've convinced
>> me. Here is why...
>>
>> http://www.digitalmars.com/d/lex.html#integerliteral
>> (see "The type of the integer is resolved as follows")
>>
>> In essence integer literals _default_ to 'int' unless another type is
>> specified or required.
>>
>> This suggested change does that, and nothing else? (can anyone see a
>> difference?)
>
> Are you suggesting that in the situation where multiple function signatures could possibly match an undecorated string literal, D should assume
> that the string literal is actually in UTF-8 format, and if that then fails to find a match, it should signal an error?

I'm suggesting that an undecorated string literal could default to char[], similar to how an undecorated integer literal defaults to 'int', and that the risk created by that behaviour would be no different in either case.
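
In other words (hypothetical behaviour under the suggestion, not what DMD does today):

 void write(char[]  s) {}
 void write(wchar[] s) {}

 write("test");  // proposed: the literal defaults to char[], calls write(char[])
 write("test"w); // still calls write(wchar[]), as the suffix decides the type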

>> If so, and if I can accept the behaviour for integer literals, why can't I
>> for string literals?
>>
>> The only logical reason I can think of for not accepting it is if there
>> exists a difference between integer literals and string literals which
>> affects this behaviour.
>>
>> I can think of differences, but none which affect the behaviour. So, it
>> seems that if I accept the risk for integers, I have to accept the risk
>> for string literals too.
>
> What might be a relevant point about this is that we are trying to talk
> about strings, but as far as D is concerned, we are really talking about
> arrays (of code-units). And for arrays, the current D behaviour is
> self-consistent. If, however, D supported a true string data type, then a
> great deal of our messy code dealing with UTF conversions would disappear, just as it does with integers and floating point values. Imagine the problems we would have if integers were regarded as arrays of bits by the
> compiler!

I'm not sure it makes any difference that char[] is an array. If you imagine that we removed the current integer literal rules, here:
  http://www.digitalmars.com/d/lex.html#integerliteral
  (see "The type of the integer is resolved as follows")

then short/int/long would exhibit the same problem that char[]/wchar[]/dchar[] do; this would be illegal:

void foo(short i) {}
void foo(int i) {}
void foo(long i) {}
foo(5);

requiring:

foo(5s); //to call short version
foo(5i); //to call int version
foo(5l); //to call long version

or:

foo(cast(short)5); //to call short version
foo(cast(int)5); //to call int version
foo(cast(long)5); //to call long version

just like char[]/wchar[]/dchar[] does today.
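
For reference, this is what you have to write today to disambiguate the string case (c/w/d are the existing string literal suffixes):

 void foo(char[]  s) {}
 void foo(wchar[] s) {}
 void foo(dchar[] s) {}

 //foo("test"); // error today: matches all three
 foo("test"c);  // calls the char[] version
 foo("test"w);  // calls the wchar[] version
 foo("test"d);  // calls the dchar[] version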

>> ---
>>
>> Note that string promotion should occur just like integer promotion does,
>> eg:
>>
>> void foo(long i) {}
>> foo(5); //calls foo(long) with no error
>
> But what happens when ...
>  void foo(long i) {}
>  void foo(short i) {}
>
>  foo(5); //calls ???

You get:

test.d(8): function test.foo called with argument types:
	(int)
matches both:
	test.foo(short)
and:
	test.foo(long)

which is correct IMO because 'int' can be promoted to both 'short' and 'long' with equal preference. ("that's the long and short of it" <g>)

>> void foo(wchar[] s) {}
>> foo("test"); //should call foo(wchar[]) with no error
>>
>> this behaviour is current and should not change.
>
> Agreed.
>
>  void foo(wchar[] s) {}
>  void foo(char[]  s) {}
>  foo("test"); //should call ???
>
> I'm now thinking that it should call the char[] signature without error.

That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour, which I have blithely accepted for years now - perhaps due to lack of knowledge when I first started programming, and now because I am used to it and it seems natural).

> But in this case ...
>
>  void foo(wchar[] s) {}
>  void foo(dchar[] s) {}
>  foo("test"); //should call an error.

Agreed, just like the integer literal example above.

> If we had a generic string type we'd probably just code ....
>
>  void foo(string s) {}
>  foo("test");  // Calls the one function
>  foo("test"d); // Also calls the one function
>
> D would convert to an appropriate UTF format silently before (and after)
> calling.

It's an interesting idea. I was thinking the same thing recently: why not have one super-type "string" and have it convert to the format required when asked, eg.

//writing strings
void c_function_call(char *string) {}
void os_function_call(wchar[] string) {}
void write_to_file_in_specific_encoding(dchar[] string) {}

string a = "test"; //"test" is stored in application defined default internal representation (more on this later)

c_function_call(a.utf8);
os_function_call(a.utf16);
write_to_file_in_specific_encoding(a.utf32);
normal_d_function(a);

//reading strings
void read_from_file_in_specific_encoding(inout dchar[]) {}

string a;
read_from_file_in_specific_encoding(a.utf32);


or, perhaps we can go one step further and implicitly transcode where required, eg:

c_function_call(a);
os_function_call(a);
write_to_file_in_specific_encoding(a);
read_from_file_in_specific_encoding(a);

The properties (Sean's idea, thanks Sean) utf8, utf16, and utf32 would be of type char[], wchar[], and dchar[] respectively (so these types remain).

Slicing string would give characters as opposed to code units (parts of characters).

I still believe the only times you care which encoding it is in, and/or should be transcoding, are on input and output; for performance reasons you do not want it converting all over the place.

To address performance concerns, each application may want to define the default internal encoding of strings, and/or we could use the encoding specified on assignment/creation, eg.

string a; //stored in application defined default (or char[] as that is D's general purpose default)

string a = "test"w; //stored as wchar[] internally
a.utf16 //does no transcoding
a.utf32; a.utf8 //causes transcoding

or, when you have nothing to assign, a special syntax is used to specify the internal encoding:

//some options off the top of my head...
string a = string.UTF16;
string a!(wchar[]); //random thought: can all this be achieved with a template?
string a(UTF16);

read_from_file_in_specific_encoding(a.utf32);

the above would create an empty/non-existent (let's not go here yet <g>) utf16 string in memory, and transcode from the file, which is utf32, to utf16 for internal representation; then:
a.utf16 //does no transcoding
a.utf8 a.utf32 //causes transcoding

Assignment of strings of different internal representation would cause transcoding. This should be rare, as most strings would be in the application defined internal representation; it would naturally occur on input and output, where you cannot avoid it anyway.
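
As a very rough sketch of how the type might look (everything here is hypothetical; I've fixed the internal representation to dchar[] just to keep the sketch short, and toUTF8/toUTF16 are the existing std.utf converters):

 import std.utf;

 struct String
 {
     dchar[] data; // internal representation, fixed here for simplicity

     char[]  utf8()  { return toUTF8(data);  } // transcodes on demand
     wchar[] utf16() { return toUTF16(data); } // transcodes on demand
     dchar[] utf32() { return data; }          // no transcoding needed
 }

The application-defined default representation and character-wise slicing are the hard parts; this only shows the transcode-on-demand properties.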

This idea has me quite excited; if no-one can poke large unsightly holes in it, perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks)

Regan
November 21, 2005
On Mon, 21 Nov 2005 14:16:22 +1300, Regan Heath wrote:


[snip]

> This idea has me quite excited; if no-one can poke large unsightly holes in it, perhaps we could work on a draft spec for it? (i.e. post it to digitalmars.D and see what everyone thinks)

Not only do great minds think alike, so do you and I! I'm starting to think
that you (and your minion helpers) have hit upon a 'Great Idea(tm)'.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 12:57:40 PM
November 21, 2005
"Regan Heath" <regan@netwin.co.nz> wrote...
> On Mon, 21 Nov 2005 11:10:30 +1100, Derek Parnell <derek@psych.ward> wrote:
[snip]
>>  void foo(wchar[] s) {}
>>  void foo(char[]  s) {}
>>  foo("test"); //should call ???
>>
>> I'm now thinking that it should call the char[] signature without error.
>
> That's what they've been suggesting. I have started to agree (because it's no more risky than current integer literal behaviour

Aye!


>> But in this case ...
>>
>>  void foo(wchar[] s) {}
>>  void foo(dchar[] s) {}
>>  foo("test"); //should call an error.
>
> Agreed, just like the integer literal example above.

Aye!


November 21, 2005
Derek Parnell wrote:
> On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
> 
> 
>>Georg Wrede wrote:
>>
>>>If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>>>
>>>A cast should do precisely what our toUTFxxx functions currently do.
>>>
>>
>>It should? Why?, what is the problem of using the toUTFxx functions?
> 
> 
> Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(),
> .... ?
> 
No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions, on the other hand (as you surely are aware), are not at all trivial (in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.
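
That is, today the conversion is written out explicitly (toUTF16 is the std.utf routine; it loops over the source, allocates a new array, and can throw on malformed input):

 import std.utf;

 char[]  a = "test";
 wchar[] b = toUTF16(a); // explicit and non-trivial: loops and allocates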

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 21, 2005
On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:

> Derek Parnell wrote:
>> On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
>> 
>> 
>>>Georg Wrede wrote:
>>>
>>>>If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>>>>
>>>>A cast should do precisely what our toUTFxxx functions currently do.
>>>>
>>>
>>>It should? Why?, what is the problem of using the toUTFxx functions?
>> 
>> 
>> Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(),
>> .... ?
>> 
> No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions, on the other hand (as you surely are aware), are not at all trivial (in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.

Why? If documented, the user can be prepared.

And where is the tipping point? The point at which an operation becomes non-trivial? You mention 'assembly-level', by which I think you mean that a sub-routine is not called but the machine code is generated in-line for the operation. Would that be the trivial/non-trivial divide?

Is conversion from byte to real done in-line or via sub-routine call? I don't actually know, just asking.

-- 
Derek Parnell
Melbourne, Australia
21/11/2005 10:47:34 PM
November 21, 2005
Derek Parnell wrote:
> On Mon, 21 Nov 2005 11:14:47 +0000, Bruno Medeiros wrote:
> 
> 
>>Derek Parnell wrote:
>>
>>>On Sun, 20 Nov 2005 11:30:34 +0000, Bruno Medeiros wrote:
>>>
>>>
>>>
>>>>Georg Wrede wrote:
>>>>
>>>>
>>>>>If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>>>>>
>>>>>A cast should do precisely what our toUTFxxx functions currently do.
>>>>>
>>>>
>>>>It should? Why?, what is the problem of using the toUTFxx functions?
>>>
>>>
>>>Do we have a toReal(), toFloat(), toInt(), toDouble(), toLong(), toULong(),
>>>.... ?
>>>
>>
>>No, we don't. But the case is different: between primitive numbers the casts are usually (if not always?) implicit, but most importantly, they are quite trivial. And by trivial I mean Assembly-level trivial. String encoding conversions, on the other hand (as you surely are aware), are not at all trivial (in terms of code, run time, and heap memory usage), and I don't think a cast should perform such non-trivial operations.
> 
> 
> Why? If documented, the user can be prepared.
> 
> And where is the tipping point? The point at which an operation becomes
> non-trivial? You mention 'assembly-level', by which I think you mean that a
> sub-routine is not called but the machine code is generated in-line for the
> operation. Would that be the trivial/non-trivial divide?

I would think so. I'd define trivial as: "the assembly code doesn't have any loops".

> Is conversion from byte to real done in-line or via sub-routine call? I
> don't actually know, just asking.

On x86,
int -> real can be done with the FILD instruction. Or can be done without FPU, in a couple of instructions.
short -> int is done with MOVSX
ushort -> uint is done with MOVZX.

HOWEVER -- I don't think this is really relevant. The real issue is about literals, which, as Georg rightly said, could be stored in ANY format. Conversion from a literal to any type has ZERO runtime cost.

I think that in a few respects, the existing situation for strings is BETTER than the situation for integers.

I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.


One intriguing possibility would be to have literals carry NO type (or, more accurately, an unassigned type), the type only being assigned when the literal is used.

eg  "abc" is of type: const __unassignedchar [].
There are implicit conversions from __unassignedchar [] to char[], wchar[], and dchar[]. But there are none from char[] to wchar[].

Adding a suffix changes the type from __unassignedchar[] to char[], wchar[], or dchar[], preventing any implicit conversions.
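
A sketch of the intended conversions (hypothetical, using the type name from above):

 char[]  a = "abc";  // __unassignedchar[] converts to char[]
 wchar[] b = "abc";  // the same literal converts to wchar[]
 wchar[] c = a;      // error: no implicit char[] -> wchar[]
 wchar[] d = "abc"c; // error: the suffix committed the literal to char[]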

(__unassignedchar could also be called __stringliteral -- it's inaccessible, anyway).

Similarly, an integral constant could be of type __integerliteral
UNTIL it is assigned to something.
At this point, a check is performed to see if the value can actually fit in the type. If not (eg when an extended UTF char is assigned to a char), it's an error.
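
To make that concrete (hypothetical semantics under this proposal, not current DMD behaviour):

 byte b = 100;      // OK: the literal takes type byte, since 100 fits
 byte c = 1000;     // error under the proposal: 1000 doesn't fit in a byte
 char d = '\u00E9'; // error: é doesn't fit in a single UTF-8 code unit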

Admittedly, it's more difficult to deal with when you have integers, and especially with reals, where no lossless conversion exists (because 1.0/3.0f + 1.0/5.0f is not the same as cast(float)(1.0/3.0L + 1.0/5.0L) -- the roundoff errors are different).

There are some vagaries -- what rounding mode is used when performing calculations on reals? This is implementation-defined in C and C++; it would be nice if it were specified in D.

UTF strings are not the only worm in this can of worms :-)
November 21, 2005
In article <dlsj73$2fod$1@digitaldaemon.com>, Don Clugston says...

>I think that in a few respects, the existing situation for strings is BETTER than the situation for integers.
>
>I personally don't like the fact that integer literals default to 'int', unless you suffix them with L. Even if the number is too big to fit into an int! And floating-point constants default to 'double', not real.

I agree with you.

>One intriguing possibility would be to have literals carry NO type (or, more accurately, an unassigned type), the type only being assigned when the literal is used.
>
>eg  "abc" is of type: const __unassignedchar [].
>There are implicit conversions from __unassignedchar [] to char[],
>wchar[], and dchar[]. But there are none from char[] to wchar[].

String literals already work like this. :)
String literals without suffix are char[], but not "committed". String literals
with a suffix are "committed" to their type.

Check the frontend sources. StringExp::implicitConvTo(Type *t) allows conversion of non-committed string literals to {,w,d}char arrays and pointers.

This is what makes this an error:
# void print(char[] x) {}
# void print(wchar[] x) {}
# void main() { print("test"); }
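
And committing the literal with a suffix resolves it:

# void main() { print("test"c); } // committed to char[], calls print(char[])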

Regards,

/Oskar