November 11, 2005
"Derek Parnell" <derek@psych.ward> wrote ...
<snip>
> The source file encoding is a function of the editor and not the source
> code. To rely on the coder's editor preferences to determine the implied
> encoding of a string literal will end in tears for someone. In other
> words,
> one should be able to alter the source file encoding without altering the
> meaning of the code it contains.

Perhaps.

The derived notion is still applicable though: just suppose there were a compiler option to specify what the default literal type should be ~ would the discussed changes not result in an effective resolution?
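For illustration, a rough sketch of what such a switch might mean (the flag name below is hypothetical; no such option exists):

// imagine compiling with something like: dmd -literaltype=char foo.d
void write( char[] x) {}
void write(wchar[] x) {}

void main()
{
    write("part 1"); // ambiguous today; would resolve to write(char[])
                     // under the imagined char[] default
}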


November 11, 2005
Kris says...
>
>"Don Clugston" <dac@nospam.com.au> wrote ..
>> Kris wrote:
>>> D doesn't know whether to invoke the char[] or wchar[] signature, since the literal is treated as though it's possibly any of the three types. This is the kind of non-determinism you get when the compiler becomes too 'smart' (unwarranted automatic conversion, in this case).
>>
>> I agree, except that I think the problem in this case is that it's not converting "from" anything! There's no "exact match" which it tries first.
>
>There would be if the auto-casting were disabled, and the type were determined via the literal content, in conjunction with the /default/ literal type suggested by GW. Yes?
>

Gosh, all I wanted was a simple explanation. :-)  (kidding)

I used writeString and it works:

|17:24:22.68>type ftest.d
|import std.file;
|import std.stream;
|int main()
|{
|  File log = new File("myfile.txt",FileMode.Out);
|  log.writeString("this is a test");
|  log.close();
|  return 1;
|}

thanks.  Please, continue with your discussion. :-)

josé


November 11, 2005
bert says...
>
>In article <4374598B.30604@nospam.org>, Georg Wrede says...
>> 
>
>> 
>>The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!
>> 
>
>The *programmer* assumes so *anyway*.
>
>Why on earth should the compiler assume anything else!
>
>BTW, D is really cool!

It is really cool. :-)


November 11, 2005

Nick says...
>
>In article <dl0hja$2aal$1@digitaldaemon.com>, jicman says...
>>
>>So, I have this complicated piece of code:
>>
>>|import std.file;
>>|import std.stream;
>>|int main()
>>|{
>>|  File log = new File("myfile.txt",FileMode.Out);
>>|  log.write("this is a test");
>>|  log.close();
>>|  return 1;
>>|}
>
>Also note one thing though: Stream.write() will write the string in binary format, i.e. it will write a binary int with the length, and then the string. If you want a plain ASCII file, which is probably what you want in a log file, you should use Stream.writeString(), or Stream.writeLine(), which inserts a line break. Or you can use writef/writefln for more advanced formatting. If you already knew this then disregard this post ;-)
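>
>For example (roughly, from memory of the std.stream API; note the 'c' suffix on write's argument, which sidesteps the char[]/wchar[] ambiguity that started this thread):
>
>  File log = new File("myfile.txt", FileMode.Out);
>  log.write("test"c);       // binary: a length prefix, then the data
>  log.writeString("test");  // plain: just the characters
>  log.writeLine("test");    // plain: the characters plus a line break
>  log.close();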

Disregard this post?  Oh, no!  My friend, you wrote it, so I am going to read it. (Yes, I knew that.  I was trying to quickly write some debugging code for something at work, and I ran into that compiler error and asked.)  Thanks.

josé


November 12, 2005
On Fri, 11 Nov 2005 14:03:36 -0800, Kris <fu@bar.com> wrote:
> "Derek Parnell" <derek@psych.ward> wrote ...
> <snip>
>> The source file encoding is a function of the editor and not the source
>> code. To rely on the coder's editor preferences to determine the implied
>> encoding of a string literal will end in tears for someone. In other
>> words,
>> one should be able to alter the source file encoding without altering the
>> meaning of the code it contains.
>
> Perhaps.
>
> The derived notion is still applicable though: just suppose there were a
> compiler option to specify what the default literal type should be ~ would
> the discussed changes not result in an effective resolution?

Yes, but now a change in compiler options can change how the application behaves (i.e. calling a function that has the same name but a different purpose to the intended one)

I'm with Derek above and I can't think of a better solution than the current behaviour. Take this example (similar to your original one):

void write (char[] x){}
void write (wchar[] x){}

void main()
{
  write ("part 1");
}

the compiler will error, as it cannot decide which function to call. The options I can think of:

A - pick one at random
B - pick one using some sort of rule i.e. char[] first, then wchar[], then dchar[]
C - pick one based on string contents i.e. dchar literal, wchar literal, ascii
D - pick one based on file encoding
E - pick one based on compiler switch
F - (current behaviour) require a string suffix of 'c', 'w' or 'd' to disambiguate

I dislike A as it's not predictable: what if the two functions are in fact different in purpose/function? (They shouldn't really have the same name, but it's entirely possible.)

I dislike B for much the same reason as A: it may be predictable, but I worry that it picks a function the programmer did not intend, and does so silently.

I dislike C as ideally all your strings should be of one type internally(*) and transcoding should only be done at the input and output stages. Besides, a dchar literal can exist quite happily as one or more wchars or chars once transcoded, and if your application wants to use char internally, why not transcode at compile time and save some run time?

I dislike D for reasons Derek gave above.

I dislike E for my reasons above.

I like F (current behaviour) because it forces you, the programmer, to decide which function to call.
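For example, with F a one-character suffix settles it:

write("part 1"c);  // calls write(char[])
write("part 1"w);  // calls write(wchar[])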

(*) The bigger problem here IMO is transcoding in general, and between libraries. Ideally it should only occur at the input and output stages, i.e. loading data from a certain source encoding, manipulating it internally using the desired internal encoding (specific to each application), and outputting in the desired encoding. Ideally 2 of those 3 encodings would be the same, reducing transcoding to one point: input or output.

Each library needs to be aware of this and provide 3 versions of each function; otherwise every call to that library may cause transcoding, depending on the app's chosen internal encoding and the encoding the library uses.

The habit/trap here is to provide 3 functions but have 2 of them call the 3rd. That's common sense, as it avoids the maintenance nightmare of a bug existing in one implementation and not another. The 3rd function does all the work in one encoding; however, this means that unless the application's internal encoding matches that 3rd type, all calls to that library result in transcoding.
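In code, the pattern looks something like this (the names are made up; toUTF8 is from std.utf):

import std.utf;

// the 3rd function does the real work, in one encoding (UTF-8 here)
void put(char[] s) { /* real work */ }

// the other 2 merely forward, transcoding on every call
void put(wchar[] s) { put(toUTF8(s)); }
void put(dchar[] s) { put(toUTF8(s)); }

Unless the application also uses char[] internally, every call it makes pays for a conversion.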

I can't really see a good solution to this "transcoding nightmare"; then again, maybe I am seeing a bigger problem than there really is. Is transcoding going to be a significant problem (in efficiency terms) for a common application?

Just my 2c, take it or leave it.

Regan
November 12, 2005
Regan Heath wrote:
> The habit/trap here is to provide 3 functions but have 2 of them call
> the 3rd. That's common sense, as it avoids the maintenance nightmare
> of a bug existing in one implementation and not another. The 3rd
> function does all the work in one encoding; however, this means that
> unless the application's internal encoding matches that 3rd type, all
> calls to that library result in transcoding.
> 
> I can't really see a good solution to this "transcoding nightmare";
> then again, maybe I am seeing a bigger problem than there really is.
> Is transcoding going to be a significant problem (in efficiency
> terms) for a common application?

THIS IS NOT AGAINST YOU, OR ANYBODY ELSE

-- I am cross posting this to digitalmars.D
-- I suggest follow-ups be written only there

To create a common understanding,

I propose a thought experiment. Those who are active or hard-headed may actually carry out the experiment. (Oh, and by Bob, if my reasoning below turns out to be incorrect, please inform us all about it!)  :-(

Given the below code:

void main()
{
    for(int i=0; i<50; i++)
    {
        // myprint("Foobar"); // see below about this line
        myprint("Foobar"c);
        myprint("Foobar"w);
        myprint("Foobar"d);
    }
}

int myrandom(int b, int e)
{
// return uniform random integer int >= b || int < e
}

void myprint( char[] s) {_myprint(char[] s}
void myprint(wchar[] s) {_myprint(char[] s}
void myprint(dchar[] s) {_myprint(char[] s}

void _myprint(char[]s)
{
    char[] cs;
    wchar[] ws;
    dchar[] ds;

    switch ( myrandom(0,3) )
    {
        case 0: cs = s; printfln(cs); break;
        case 1: ws = s; printfln(ws); break;
        case 2: ds = s; printfln(ds); break;
        default: static assert(0);
    }
}


Question 1: do the printed lines look alike?

Question 2: if one has a text editor that can produce a source code file in UTF-8, UTF-16, UTF-32, and US-ASCII formats, and then "saves as" this source code to all of those -- would the program outputs look alike?

----

Supposing the reader has answered "yes" to both of the above, then a further question:

Question 3: in that case, if we uncomment line 5 above, AND do the exercise in question 2, would the program outputs look alike?

Question 4: would you say that it actually makes _no_ difference how the compiler chooses to consider an undecorated string literal?

Question 5: would you say that we could let the compiler vendors decide how they choose to interpret the undecorated string literal?

----

Supposing that the reader still has answered "yes" to all questions, we go further:

Question 6: would it now seem obvious that undecorated string literals should default to _something_, and that when the programmer writes an undecorated string literal, he is thereby expressly indicating that he does not care, AND that he specifically does not want to argue with the compiler about it?
November 12, 2005
Regan Heath wrote:
> On Fri, 11 Nov 2005 14:03:36 -0800, Kris <fu@bar.com> wrote:
> 
>> "Derek Parnell" <derek@psych.ward> wrote ...
>> <snip>
>>
>>> The source file encoding is a function of the editor and not the source
>>> code. To rely on the coder's editor preferences to determine the implied
>>> encoding of a string literal will end in tears for someone. In other
>>> words,
>>> one should be able to alter the source file encoding without altering  the
>>> meaning of the code it contains.
>>
>>
>> Perhaps.
>>
>> The derived notion is still applicable though: just suppose there were a
>> compiler option to specify what the default literal type should be ~  would
>> the discussed changes not result in an effective resolution?
> 
> 
> Yes, but now a change in compiler options can change how the application  behaves (i.e. calling a function that has the same name but a different  purpose to the intended one)
> 
> I'm with Derek above and I can't think of a better solution than the  current behaviour. Take this example (similar to your original one):
> 
> void write (char[] x){}
> void write (wchar[] x){}
> 
> void main()
> {
>   write ("part 1");
> }
> 
> the compiler will error, as it cannot decide which function to call. The options I can think of:
> 
> A - pick one at random
> B - pick one using some sort of rule i.e. char[] first, then wchar[], then  dchar[]
> C - pick one based on string contents i.e. dchar literal, wchar literal,  ascii
> D - pick one based on file encoding
> E - pick one based on compiler switch
> F - (current behaviour) require a string suffix of 'c', 'w' or 'd' to  disambiguate
> 
>...

There is also the option of all undecorated string literals having a default type (like char[] for instance), instead of "it being inferred from the context."
This seems best to me, at first glance at least. What consequences could there be from this approach?
Well, when passing an undecorated string literal argument to a dchar or wchar parameter, it would be an error and one would have to specify the string type; however, I don't see that as an inconvenience.
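A sketch of the consequence, assuming char[] were the default:

void show(dchar[] s) {}

void main()
{
    show("hello");   // would become an error: char[] does not match dchar[]
    show("hello"d);  // fine: the type is specified explicitly
}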

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 12, 2005
Kris wrote on 2005-11-11:
> "Derek Parnell" <derek@psych.ward> wrote ...
><snip>
>> The source file encoding is a function of the editor and not the source
>> code. To rely on the coder's editor preferences to determine the implied
>> encoding of a string literal will end in tears for someone. In other
>> words,
>> one should be able to alter the source file encoding without altering the
>> meaning of the code it contains.
>
> Perhaps.
>
> The derived notion is still applicable though: just suppose there were a compiler option to specify what the default literal type should be ~ would the discussed changes not result in an effective resolution?

Please keep the encoding and the interpretation of the encoded data apart.
The same code might behave differently on Windows (usually UTF-16)
and Linux (usually UTF-8).

Thomas


November 13, 2005
On Sat, 12 Nov 2005 15:30:22 +0000, Bruno Medeiros <daiphoenixNO@SPAMlycos.com> wrote:
> Regan Heath wrote:
>> On Fri, 11 Nov 2005 14:03:36 -0800, Kris <fu@bar.com> wrote:
>>
>>> "Derek Parnell" <derek@psych.ward> wrote ...
>>> <snip>
>>>
>>>> The source file encoding is a function of the editor and not the source
>>>> code. To rely on the coder's editor preferences to determine the implied
>>>> encoding of a string literal will end in tears for someone. In other
>>>> words,
>>>> one should be able to alter the source file encoding without altering  the
>>>> meaning of the code it contains.
>>>
>>>
>>> Perhaps.
>>>
>>> The derived notion is still applicable though: just suppose there were a
>>> compiler option to specify what the default literal type should be ~  would
>>> the discussed changes not result in an effective resolution?
>>   Yes, but now a change in compiler options can change how the application  behaves (i.e. calling a function that has the same name but a different  purpose to the intended one)
>>  I'm with Derek above and I can't think of a better solution than the  current behaviour. Take this example (similar to your original one):
>>  void write (char[] x){}
>> void write (wchar[] x){}
>>  void main()
>> {
>>   write ("part 1");
>> }
>>  the compiler will error, as it cannot decide which function to call. The options I can think of:
>>  A - pick one at random
>> B - pick one using some sort of rule i.e. char[] first, then wchar[], then  dchar[]
>> C - pick one based on string contents i.e. dchar literal, wchar literal,  ascii
>> D - pick one based on file encoding
>> E - pick one based on compiler switch
>> F - (current behaviour) require a string suffix of 'c', 'w' or 'd' to  disambiguate
>>
>  >...
>
> There is also the option of all undecorated string literals having a default type (like char[] for instance), instead of "it being inferred from the context."

This is similar to B above, except that this rule would cause an error here:

void write(dchar[] str) {}
void main() { write("a"); }

> This seems best to me, at first glance at least. What consequences could there be from this approach?

The same as for B. Say you wrote the code above, and say another 'write' function existed elsewhere that took a char[]: the compiler would silently call that other write function, not the one above. If the functions do the same thing, no problem; if not, a silent bug.
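Concretely (a made-up sketch; the function bodies are hypothetical):

// two functions with the same name but different purposes
// (they shouldn't share a name, but it's entirely possible)
void write( char[] s) { /* appends to a log file */ }
void write(dchar[] s) { /* sends over the network */ }

void main()
{
    write("a");  // today this is an ambiguity error; under a silent
                 // char[] default, the log-file version would be chosen
}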

> Well, when passing an undecorated string literal argument to a dchar or wchar parameter, it would be an error and one would have to specify the string type; however, I don't see that as an inconvenience.

It is more inconvenient than the current situation, where you have to decorate only in cases where a collision exists.

Regan
November 13, 2005
On Sat, 12 Nov 2005 17:06:58 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>> The habit/trap here is to provide 3 functions but have 2 of them call
>> the 3rd. That's common sense, as it avoids the maintenance nightmare
>> of a bug existing in one implementation and not another. The 3rd
>> function does all the work in one encoding; however, this means that
>> unless the application's internal encoding matches that 3rd type, all
>> calls to that library result in transcoding.
>> I can't really see a good solution to this "transcoding nightmare";
>> then again, maybe I am seeing a bigger problem than there really is.
>> Is transcoding going to be a significant problem (in efficiency
>> terms) for a common application?
>
> THIS IS NOT AGAINST YOU, OR ANYBODY ELSE
>
> -- I am cross posting this to digitalmars.D
> -- I suggest follow-ups be written only there
>
> To create a common understanding,

Sounds good to me.

> I propose a thought experiment. Those who are active or hard-headed may actually carry out the experiment. (Oh, and by Bob, if my reasoning below turns out to be incorrect, please inform us all about it!)  :-(

I have done so. My code changes are not criticism but rather, I hope, examples of how it currently works. The full modified code I used is posted at the end of this message.

> Given the below code:
>
> void main()
> {
>      for(int i=0; i<50; i++)
>      {
>          // myprint("Foobar"); // see below about this line

I suspect you meant:
>          // _myprint("Foobar"); // see below about this line

here (note the _)? Am I correct?

>          myprint("Foobar"c);
>          myprint("Foobar"w);
>          myprint("Foobar"d);
>      }
> }
>
> int myrandom(int b, int e)
> {
> // return uniform random integer int >= b || int < e
> }
>
> void myprint( char[] s) {_myprint(char[] s}
> void myprint(wchar[] s) {_myprint(char[] s}
> void myprint(dchar[] s) {_myprint(char[] s}

Were these supposed to read?

void myprint( char[] s) {_myprint(s); }
void myprint(wchar[] s) {_myprint(cast(char[])s); }
void myprint(dchar[] s) {_myprint(cast(char[])s); }

If so, see below for why this does not work, instead you require:

void myprint( char[] s) {_myprint(s); }
void myprint(wchar[] s) {_myprint(toUTF8(s)); }
void myprint(dchar[] s) {_myprint(toUTF8(s)); }

> void _myprint(char[]s)
> {
>      char[] cs;
>      wchar[] ws;
>      dchar[] ds;
>
>      switch ( myrandom(0,3) )
>      {
>          case 0: cs = s; printfln(cs); break;
>          case 1: ws = s; printfln(ws); break;
>          case 2: ds = s; printfln(ds); break;
>          default: static assert(0);
>      }
> }

These lines:
>          case 0: cs = s; printfln(cs); break;
>          case 1: ws = s; printfln(ws); break;
>          case 2: ds = s; printfln(ds); break;

cause errors: D does not implicitly transcode or implicitly paint (thank Bob!) one char type as another.

These lines:
>          case 0: cs = s; printfln(cs); break;
>          case 1: ws = cast(wchar[])s; printfln(ws); break;
>          case 2: ds = cast(dchar[])s; printfln(ds); break;

compile but cause an "array cast misalignment" exception, due to the cast 'painting' the data as opposed to transcoding it. These lines:

>          case 0: cs = s; printfln(cs); break;
>          case 1: ws = toUTF16(s); printfln(ws); break;
>          case 2: ds = toUTF32(s); printfln(ds); break;

will compile and run 'correctly' (as I see it).

> Question 1: do the printed lines look alike?

It depends on the contents of the string being printed.
It also depends on the format the output device is expecting.

Assuming:
1. the text is ASCII
2. the output device expects ASCII

then all 3 will be identical (the test code shows this with the Windows console default).

However, if the text contains characters outside of ASCII and the output device expects, say, UTF-8, then UTF-16 and UTF-32 strings will come out as garbage in some cases.

Take the Windows console, for example: it expects Latin-1 on my PC, so printing non-ASCII UTF-8, UTF-16, or UTF-32 code units (code units are parts of, or whole, characters) can cause garbage to be displayed for certain output. Setting it to Unicode or UTF-8 will cause UTF-8 to be displayed correctly, but can cause garbage for certain UTF-16 or UTF-32 output.

> Question 2: if one has a text editor that can produce a source code file in UTF-8, UTF-16, UTF-32, and US-ASCII formats, and then "saves as" this source code to all of those -- would the program outputs look alike?

Saving the source file as a different encoding will have no effect on the output assuming you save as one of the valid encodings (as per D spec):
  ASCII
  UTF-8
  UTF-16BE
  UTF-16LE
  UTF-32BE
  UTF-32LE

D takes the literal as stored in whatever encoding and implicitly transcodes it at compile time to the required type, which is stored as that type in the binary (I believe), unless there is a collision (the original complaint that started this thread).
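A small example of what I mean (as I understand it):

void f(wchar[] s) {}

void main()
{
    f("Foobar");  // no collision here, so the literal is transcoded to
                  // UTF-16 at compile time and stored as wchar[] in the
                  // binary, whatever encoding the source file used
}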

> ----
>
> Supposing the reader has answered "yes" to both of the above, then a further question:

I haven't, but I believe I can continue anyway...

> Question 3: in that case, if we uncomment line 5 above, AND do the exercise in question 2, would the program outputs look alike?

Uncommenting that line causes a compile error, adding the "_" works as I believe you intended.

The result is again identical lines; however (see the answer to question #1, as it applies here as well), outputting UTF-16 or UTF-32 to a UTF-8 console can cause garbage for certain output.

> Question 4: would you say that it actually makes _no_ difference how the compiler chooses to consider an undecorated string literal?

No.

For the reasons I have posted earlier, namely:

#1 If you have 2 functions of the same name, doing different things, and one takes char[] and the other wchar[], then allowing the compiler to choose an encoding can result in different behaviour depending on the function chosen.

#2 Depending on the encoding the program uses internally (say it uses char[] everywhere), if the compiler chooses wchar[], then calling a wchar[] function that returns a wchar[] (for example) means the result will need to be transcoded back to char[] for use with the rest of the program code.

Ignoring problem #1, in cases where the function takes a wchar literal and does not return a char, it makes no difference, because the compiler does the transcoding at compile time and stores the literal as a wchar[] in the binary. (However, the same literal may also appear in other encodings, if it is passed to functions taking different types; in those cases you'll get a binary containing the same literal encoded in several ways.)

In other words, ignoring #1, the encoding is important for efficiency reasons. The question is: how inefficient is it? Does it make any real difference?

> Question 5: would you say that we could let the compiler vendors decide how they choose to interpret the undecorated string literal?

No. I believe it should be the programmer's choice, for reasons above.

> ----
>
> Supposing that the reader still has answered "yes" to all questions, we go further:
>
> Question 6: would it now seem obvious that undecorated string literals should default to _something_, and that when the programmer writes an undecorated string literal, he is thereby expressly indicating that he does not care, AND that he specifically does not want to argue with the compiler about it?

No. For reasons above.

To actually see the problem I describe in #1:

- Set the Windows console to UTF-8 (I forget how, sorry)
- Find a character that is encoded differently in all 3 UTF encodings (as different code unit sequences)(*)
- Type/paste the character into the source
- Save the source as any of the valid UTF encodings (ASCII will not work as that char will not exist in ASCII)
- Run the test again.

What you should see (assuming it is done correctly and that I am correct in my reasoning) is that one of them, the UTF-8 one, will look correct, and the other 2 will look wrong.

(*) This character will be one represented by 1 dchar code unit, 2 wchar code units, or 3+ char code units. (Arcane Jill, were she still frequenting this place, would be able to tell us one in an instant; sadly you have only me, a poor replacement for true expertise.)
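For what it's worth, one character that fits the bill is U+1D11E, MUSICAL SYMBOL G CLEF. A quick check, using \U escapes so the source file itself stays plain ASCII:

import std.stdio;

void main()
{
     char[] c = "\U0001D11E"c;  // 4 UTF-8 code units
    wchar[] w = "\U0001D11E"w;  // 2 UTF-16 code units (a surrogate pair)
    dchar[] d = "\U0001D11E"d;  // 1 UTF-32 code unit
    writefln("%d %d %d", c.length, w.length, d.length);  // prints: 4 2 1
}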

Regan

----test code----
import std.random;
import std.stdio;
import std.utf;

void main()
{
	for(int i=0; i<50; i++)
	{
		//_myprint("Foobar"); // see below about this line
		myprint("Foobar"c);
		myprint("Foobar"w);
		myprint("Foobar"d);
	}
}


int myrandom(int b, int e)
{
	// return uniform random integer i, where b <= i < e
	return rand()%(e-b) + b;
}

void myprint( char[] s) {_myprint(s); }
void myprint(wchar[] s) {_myprint(toUTF8(s)); }
void myprint(dchar[] s) {_myprint(toUTF8(s)); }


void _myprint(char[]s)
{
	char[] cs;
	wchar[] ws;
	dchar[] ds;

	switch ( myrandom(0,3) )
	{
		case 0: cs = s; writefln(cs); break;
		case 1: ws = toUTF16(s); writefln(ws); break;
		case 2: ds = toUTF32(s); writefln(ds); break;
		default: assert(0);
	}
}