November 13, 2005
Regan Heath wrote:
> 
> It's not quite the same as short/int/long, as you can lose data converting from long to int, and from int to short, but you can't lose data converting from dchar to wchar or from wchar to char.

True enough.  Though the compiler is smart enough to disambiguate overloads correctly in the case I described:

import std.c.stdio;

void fn(int x)   { printf("int\n"); }
void fn(byte x)  { printf("byte\n"); }
void fn(short x) { printf("short\n"); }
void fn(long x)  { printf("long\n"); }

int main()
{
   fn( 1 );
   fn( 2147483647 );
   fn( 21474836470 );
}

prints:

int
int
long


Sean
November 14, 2005
"Georg Wrede" <georg.wrede@nospam.org> wrote...
> Sean Kelly wrote:
>> Georg Wrede wrote:
>>>
>>> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.
>>
>> This seems vaguely consistent with integer promotion, as this is unambiguous:
>>
>> void fn(short x) {} void fn(int x) {} void fn(long x) {}
>>
>> The int version will be selected by default.  And since int has a
>> specific size in D (unlike C), this is really not substantially different
>> from having a default encoding for unqualified string
>> literals.
>
> Right!

Indeed.


November 14, 2005
Regan Heath wrote:
> On Mon, 14 Nov 2005 00:07:30 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
>> 6. There is _no_ reason for not having a default encoding for  undecorated string literals in source code.
> 
> What if you have:
> 
> void bob(char[] a)  { printf("1\n"); }
> void bob(wchar[] a) { printf("2\n"); }
> void bob(dchar[] a) { printf("3\n"); }
> 
> void main()
> {
>     bob("test");
> }
> 
> In other words, with 2 or 3 functions of the same name which do _different_
> things, the compiler cannot correctly choose which function to call,
> right?

I'd say "shoot the programmer!"

When I was young, there was a law against overloading with different semantics!

> This can only occur if the functions are in the same module; if they are in different modules you get collision errors requiring 'alias' to resolve. So, really, it should never happen... but even then, once it happens (assuming the compiler picks one silently) it could be a
> very hard bug to find.
> 
>> 8. The only thing for which it matters is performance.
>> 
>> 9. Even this performance difference is minimal. (Here we are
>> talking only about string literals. Of course the performance
>> difference is big  for Real Work string processing, but here we are
>> only talking about  Undecorated string literals.)
> 
> What if the literal is used in the data processing i.e. inserted into
>  or  searched for within a large block of text in another encoding?
> What if the  literal is thus transcoded millions of times in the
> normal operation of  the program. I don't think you can discount
> performance so easily.

Neither the current practice (where you have to be explicit about the UTF width with every usage), nor the proposed one (where String Literals are Understood), is relevant to the question. By the time the string literal is inserted or searched for, the target UTF width is known.

This means that it should be implicitly converted to whatever is needed.

Now, implicit conversions are not dangerous -- when there is no risk of data loss.

The only argument for not having implicit conversions "always, and between all the UTF widths" is to keep programmers from writing code that causes a lot of redundant (or otherwise unnecessary) conversions. (Which in itself is fine by me.)
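As an aside, here is a minimal sketch of the explicit route we have today, using std.utf's toUTF8/toUTF16/toUTF32, just to show that the round trip really is lossless:

import std.utf;

void main()
{
    char[]  s8  = "hello, world";   // stored as UTF-8
    wchar[] s16 = toUTF16(s8);      // lossless widening to UTF-16
    dchar[] s32 = toUTF32(s8);      // lossless widening to UTF-32
    char[]  back = toUTF8(s32);     // lossless narrowing back to UTF-8

    // Same content throughout; only the storage width differs.
    assert(back == s8);
}

Since none of these conversions can lose content, the compiler could just as well do them implicitly.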

Oh, and by the way, if a compiler generates code where the same string literal gets transcoded "millions of times", its writer probably gets a phone call from Walter! ;-)

>> 10. When the programmer doesn't explicitly specify, the compiler should  be free to choose what width an undecorated string literal
>> is.
> 
> Unless it is affected by cases like #6 and/or #9.

(( I thought I'd murdered them already! ))  :-)

>> As to Unicode, a string is either Unicode or not. Period. What this
>>  "width thing" means, is just a storage attribute. Therefore, the
>>  contents of the string do not change however you transfer it from
>> one width to the other. (Of course the width varies, as well as the
>> bit pattern, but the contents do not change.)
> 
> This is a key concept people _must_ understand before talking about
> unicode issues. The 3 types can all represent the same data; it's
> just represented in different ways. It's not like short/int/long,
> where a long can represent values the other two cannot.

Agreed!!!

>> Sheesh, I feel like I'm banging my head against everybody else's head.  Ok, let's try another tack:
>> 
>> I just ran all the test programs that were compiled from different
>> sorts of UTF source. Then I saved the output of each, and checked
>> the exact file type of each.
>> 
>> Turns out they _all_ were in UTF-8 format.
> 
> That is as it should be, assuming the program was intending to output
>  UTF-8, the source file encoding should never have any effect on the
>  program output.

Yes, but see immediately below.

>> Now, how perverted a person would I be if I implicitly assumed
>> that an undecorated string literal on this machine is in UTF-8?
>> 
>> Think about it -- one of the lines in the program looks like this:
>> 
>> ds = toUTF32(s); writefln(ds);
>> 
>> and the output still turns out to be in UTF-8.
> 
> This is what was confusing me. I would have expected the line above
> to print in UTF-32. The only explanation I can think of is that the
> output stream is converting to UTF-8. In fact, I find it quite
> likely.

Frankly, I would have expected it too.

Seems like the OS creators decided "this is a UTF-8 system", so whatever gets output by anybody, gets ultimately converted to UTF-8.

Actually, not a bad idea, especially considering the hardships we've all had here with understanding any of the issue. :-)

---

Having got this far, one is almost compelled to suggest we abandon all of '"c', '"d', and '"w'.

If we don't dare to go for universal implicit width coercion (I can live with that), then at least let us have it for string literals!

Why? Because you never do anything final with a string literal. It is always going somewhere, right? And this somewhere is _always_ a context, where the width is known already.

So, merely _having_ the decorators is just creating misunderstanding, false hopes, and misconceptions.
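
For reference, the decorators in question are the c/w/d postfixes on literals. A minimal sketch of how they settle the bob() overloads from earlier in the thread, under the current rules:

import std.c.stdio;

void bob(char[] a)  { printf("1\n"); }
void bob(wchar[] a) { printf("2\n"); }
void bob(dchar[] a) { printf("3\n"); }

void main()
{
    bob("test"c);   // char[],  UTF-8  -- prints 1
    bob("test"w);   // wchar[], UTF-16 -- prints 2
    bob("test"d);   // dchar[], UTF-32 -- prints 3
}

The proposal above would make the undecorated bob("test") legal as well, by letting the compiler pick a width for it.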
November 14, 2005
Actually, it's still used a lot - for example, in emails (as well as what you're doing right now - news servers.)  Yes, it's old and not as useful in most cases as the others, but it's also necessary in some protocols (currently.)

But, for files... yes, it's probably not worth the time you spent to ask if it should be supported :P.

-[Unknown]


> Derek Parnell wrote:
> 
>> On Sun, 13 Nov 2005 22:12:36 +0200, Georg Wrede wrote:
>>
>> [snip]
>>
>>> Then I saved the file as UTF-7, UTF-8, UTF-16, UCS-2, UCS-4. 
>>
>>
>> I've just fixed the Build utility to read UTF-8, UTF-16le/be, UTF-32le/be
>> encodings, but is there any reason I should support UTF-7? It seems a bit
>> superfluous, and under-supported elsewhere too.
> 
> 
> No.
> 
> It's a relic.
> 
> Those who may need it have other worries than choosing between languages (so D is not an option for those guys).
November 14, 2005
"Georg Wrede" <georg.wrede@nospam.org> wrote
> Regan Heath wrote:
>> On Mon, 14 Nov 2005 00:07:30 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>>
>>> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.
>>
>> What if you have:
>>
>> void bob(char[] a)  { printf("1\n"); }
>> void bob(wchar[] a) { printf("2\n"); }
>> void bob(dchar[] a) { printf("3\n"); }
>>
>> void main()
>> {
>>     bob("test");
>> }
>>
>> In other words, with 2 or 3 functions of the same name which do _different_
>> things, the compiler cannot correctly choose which function to call,
>> right?
>
> I'd say "shoot the programmer!"
>
> When I was young, there was a law against overloading with different semantics!

Yes, indeed. Overloading method names with different semantics is silly, cantankerous, and/or asking for trouble. Using it as the basis of an argument is therefore, IMO, either naive or disingenuous. Hopefully, we can move forward without giving this aspect further deliberation.


November 14, 2005
On Sun, 13 Nov 2005 16:45:34 -0800, Kris wrote:

> "Georg Wrede" <georg.wrede@nospam.org> wrote
>> Regan Heath wrote:
>>> On Mon, 14 Nov 2005 00:07:30 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>>>
>>>> 6. There is _no_ reason for not having a default encoding for undecorated string literals in source code.
>>>
>>> What if you have:
>>>
>>> void bob(char[] a)  { printf("1\n"); }
>>> void bob(wchar[] a) { printf("2\n"); }
>>> void bob(dchar[] a) { printf("3\n"); }
>>>
>>> void main()
>>> {
>>>     bob("test");
>>> }
>>>
>>> In other words, with 2 or 3 functions of the same name which do _different_
>>> things, the compiler cannot correctly choose which function to call,
>>> right?
>>
>> I'd say "shoot the programmer!"
>>
>> When I was young, there was a law against overloading with different semantics!
> 
> Yes, indeed. Overloading method names with different semantics is silly, cantankerous, and/or asking for trouble. Using it as the basis of an argument is therefore, IMO, either naive or disingenuous. Hopefully, we can move forward without giving this aspect further deliberation.

Ummm ... but there are *some* valid uses for this.

 void SendTextToFile(char[] a)
    { SendBOM(utf8_bom);  Send(UTF8,  a, 0); }
 void SendTextToFile(wchar[] a)
    { SendBOM(utf16_bom); Send(UTF16, a, LittleEndian); }
 void SendTextToFile(dchar[] a)
    { SendBOM(utf32_bom); Send(UTF32, a, LittleEndian); }

The 'semantics' are identical but the implementation is necessarily different depending on the parameter's data type.

So which variation of 'SendTextToFile' is to be assumed by the compiler here?

void main()
{
     SendTextToFile("test");
}

Sure, it could choose any and be done with it, but would it hurt the coder to be made aware that a decision is required here?
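
For the record, this is what that decision looks like with the current rules -- the coder states the width with a literal postfix:

void main()
{
     SendTextToFile("test"c);    // write the file as UTF-8
     // SendTextToFile("test"w); // or as UTF-16
     // SendTextToFile("test"d); // or as UTF-32
}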

I'm in favour of D's current behaviour, even though it means I must decorate some string literals.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
14/11/2005 12:09:11 PM
November 14, 2005
On Mon, 14 Nov 2005 01:24:25 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>> On Sun, 13 Nov 2005 14:21:35 -0800, Sean Kelly <sean@f4.ca> wrote:
>>
>>> Georg Wrede wrote:
>>>
>>>>  6. There is _no_ reason for not having a default encoding for  undecorated string literals in source code.
>>>
>>> This seems vaguely consistent with integer promotion, as this is  unambiguous:
>>>
>>> void fn(short x) {}
>>> void fn(int x) {}
>>> void fn(long x) {}
>>>
>>> The int version will be selected by default.  And since int has a  specific size in D (unlike C), this is really not substantially  different from having a default encoding for unqualified string literals.
>> It's not quite the same as short/int/long, as you can lose data converting from long to int, and from int to short, but you can't lose data converting from dchar to wchar or from wchar to char.
>
> Ehh, I'd have thought that the implication goes the other way?
>
> That is, if cardinals are handled like this (with the risk of losing content if done the wrong way), then I don't see how that can be taken as an example of not doing the same with UTF, especially when UTF-to-UTF conversions _don't_ lose content!

I wasn't making a statement about the implications, just stating a fact as I see it.

Yes, I agree, the fact that data is not lost implies that any encoding can be chosen at will. However, there are other considerations as I posted in my other reply.

Regan


November 14, 2005
On Sun, 13 Nov 2005 15:30:53 -0800, Sean Kelly <sean@f4.ca> wrote:
> Regan Heath wrote:
>> It's not quite the same as short/int/long, as you can lose data converting from long to int, and from int to short, but you can't lose data converting from dchar to wchar or from wchar to char.
>
> True enough.  Though the compiler is smart enough to disambiguate overloads correctly in the case I described:
>
> import std.c.stdio;
>
> void fn(int x)   { printf("int\n"); }
> void fn(byte x)  { printf("byte\n"); }
> void fn(short x) { printf("short\n"); }
> void fn(long x)  { printf("long\n"); }
>
> int main()
> {
>     fn( 1 );
>     fn( 2147483647 );
>     fn( 21474836470 );
> }
>
> prints:
>
> int
> int
> long

But of course it does! It has rules for this sort of thing, namely that "21474836470" can't be an 'int' because "21474836470" cannot be represented by an 'int', so it does the logical thing and makes it a 'long'.

The same cannot be said for a dchar[] literal: it can be represented as a wchar[] or char[] with no trouble.

Essentially all I'm saying is that I don't think we can necessarily compare the behaviour of long/int/short to how dchar/wchar/char does or should behave because they're simply different.

Regan
November 14, 2005
On Mon, 14 Nov 2005 02:21:57 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>> On Mon, 14 Nov 2005 00:07:30 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
>>  In other words 2 or 3 functions of the same name which do _different_
>>  things, the compiler cannot correctly choose the function to call,
>> right?
>
> I'd say "shoot the programmer!"
>
> When I was young, there was a law against overloading with different semantics!

Yes, it's a bad idea(TM). I'm not saying it isn't. What I am saying is:

 - It is possible for this to occur.
 - When it does, it creates a bug which is hard to find.

Given the above, the decision about what the compiler should do is a pro vs con decision where there is no "right" answer for everyone, only a line drawn in the sand which either suits you or not.

As I personally have had no trouble with the current behaviour I guess I see no point in changing it. My role in this discussion has simply been to point out the consequences I see with the suggested changes.

>> What if the  literal is thus transcoded millions of times in the
>> normal operation of  the program. I don't think you can discount
>> performance so easily.
>
> Neither the current practice (where you have to be explicit about the UTF width with every usage), nor the proposed one (where String Literals are Understood), is relevant to the question. By the time the string literal is inserted or searched for, the target UTF width is known.
>
> This means that it should be implicitly converted to whatever is needed.

On thinking further, I agree. I can't come up with a situation where the literal would be transcoded for every use.

Most likely there will simply be 3 copies of the literal, one for each branch of the logic dealing with the different UTF encodings of the data.
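
Something like this, I imagine (search() here is a made-up helper, not Phobos -- the point is just that each overload carries its own copy of the literal in the matching width, so nothing is transcoded at run time):

size_t search(char[]  haystack, char[]  needle) { /* ... */ return 0; }
size_t search(wchar[] haystack, wchar[] needle) { /* ... */ return 0; }
size_t search(dchar[] haystack, dchar[] needle) { /* ... */ return 0; }

size_t findNeedle(char[]  s) { return search(s, "needle"c); }
size_t findNeedle(wchar[] s) { return search(s, "needle"w); }
size_t findNeedle(dchar[] s) { return search(s, "needle"d); }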

> Now, implicit conversions are not dangerous -- when there is no risk of data loss.

This statement is incorrect (as a whole) because there are more considerations than data loss; my first point about picking the wrong overload was intended to show this.

I suspect, however, that you're ready to ignore that 'problem' because in your eyes it isn't one, which is fine; everyone draws the line in a different place. The question is, where should the compiler draw the line? After all, it's drawing the line for everyone.

>>> 10. When the programmer doesn't explicitly specify, the compiler should  be free to choose what width an undecorated string literal
>>> is.
>> Unless it is affected by cases like #6 and/or #9.
>
> (( I thought I'd murdered them already! ))  :-)

You've shot the programmer, and probably buried him/her in your backyard. That doesn't prevent the issue from occurring again, and eventually you'll run out of backyard space to bury the bodies of those people you had to kill ... ;)

>>> Now, how perverted a person would I be if I implicitly assumed
>>> that an undecorated string literal on this machine is in UTF-8?
>>>  Think about it -- one of the lines in the program looks like this:
>>>  ds = toUTF32(s); writefln(ds);
>>>  and the output still turns out to be in UTF-8.
>> This is what was confusing me. I would have expected the line above
>> to print in UTF-32. The only explanation I can think of is that the
>> output stream is converting to UTF-8. In fact, I find it quite
>> likely.
>
> Frankly, I would have expected it too.
>
> Seems like the OS creators decided "this is a UTF-8 system", so whatever gets output by anybody, gets ultimately converted to UTF-8.

Are we positive it's the OS and not D's output routines?

Regan
November 14, 2005
"Derek Parnell" <derek@psych.ward> wrote
> Ummm ... but there are *some* valid uses for this.
>
> void SendTextToFile(char[] a)
>    { SendBOM(utf8_bom);  Send(UTF8,  a, 0); }
> void SendTextToFile(wchar[] a)
>    { SendBOM(utf16_bom); Send(UTF16, a, LittleEndian); }
> void SendTextToFile(dchar[] a)
>    { SendBOM(utf32_bom); Send(UTF32, a, LittleEndian); }
>
> The 'semantics' are identical but the implementation is necessarily different depending on the parameter's data type.

Good point, yet the usage is somewhat different here. These methods are called once per file, and therefore it's rather unlikely one would invoke them multiple times, back to back, with string literals. I think you'd agree it's more likely to be one or two isolated calls, using an array rather than a literal. Due to the latter, the compiler would know exactly which one to call. In other words, I think it's a somewhat contrived example. No offense intended.

Yes, one could argue that is splitting hairs somewhat. But consider the other side of the coin, where one might very well be using such methods back-to-back with literals, very often:

output.write("<html><head><title>")
         .write(title)
         .write("</title></head><body>")
         .write() ...

where the write() method could, and should, be overloaded for all native data types? And all corresponding array types? The Stream methods are named like this specifically to avoid the problem:

write  (char[])
writeW (wchar[])
writeD (dchar[])

I think you'd agree that the above naming convention is superfluous, and should not be required. However, it is required unless one decorates each and every string literal with its type. NOTE: string literals are the only D type where this is currently unresolvable.

I suspect you, and many others, would justly and loudly complain if you were forced to qualify each and every string literal written to the Stream library. Yes? And, I suspect, more so if writef() and printf() were sensitive to the literal type?
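
To make that concrete, here's a rough sketch (a hypothetical Writer class, not the Phobos Stream) of what qualifying every literal looks like once write() is overloaded for all three array types:

class Writer
{
    Writer write(char[] s)  { /* emit as UTF-8  */ return this; }
    Writer write(wchar[] s) { /* emit as UTF-16 */ return this; }
    Writer write(dchar[] s) { /* emit as UTF-32 */ return this; }
}

void main()
{
    Writer output = new Writer;
    char[] title = "example";

    output.write("<html><head><title>"c)    // every literal needs its 'c'
          .write(title)                     // arrays are fine; their type is known
          .write("</title></head><body>"c);
}

Hardly the end of the world for three calls, but multiply it across a whole code base and the annoyance adds up.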

As an exercise, I'd like to suggest someone recompiles Phobos with those names made equivalent, and see how (un)easy it is to use the Stream class then. That's a somewhat trivial example of a bigger problem.

One of the reasons I've complained about this for so long is that there's a reasonable set of problem domains, and body of existing work, where it's of serious concern. Imagine if /all/ Phobos IO methods were sensitive to this, and you'll get some idea of the distaste it can leave in people's mouths.

BTW: it's only a problem for output; one cannot input into a literal <g>


> I'm in favour of D's current behaviour, even though it means I must decorate some string literals.

I strongly suspect you'd feel quite differently if you had to do it pretty much all the time :-)

This is an issue that just won't go away by itself, and has probably remained low on the radar since most folk likely still use char[] only, or use some funky naming convention such as Stream does. I believe it would be in the interests of D to further address the concern. Technically speaking, there is at least one approach that can satisfy both perspectives.