November 16, 2005
On Tue, 15 Nov 2005 20:52:34 -0800, Kris <fu@bar.com> wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote...
>> "Kris" <fu@bar.com> wrote in message
>> news:dl2p7i$11gf$1@digitaldaemon.com...
>>> Don't you think the type can be inferred from the content?
>>
>> Not 100%.
>>
>>> For the sake of
>>> discussion, how about this:
>>>
>>> 1) if the literal has a double-wide char contained, it is defaulted to a
>>> dchar[] type.
>>>
>>> 2) if not 1, and if the literal has a wide-char contained, it defaults to
>>> being of wchar[] type.
>>>
>>> 3) if neither of the above, it defaults to char[] type.
>>>
>>> Given the above, I can't think of a /common/ situation where casting
>>> would
>>> thus be required.
>>
>> I did consider that for a while, but eventually came to the conclusion
>> that
>> its behavior would be surprising to someone who did not very carefully
>> read
>> the spec. Also, the distinction between the various character types is not
>> obvious when looking at the rendered text, further making it surprising.
>>
>> I think it's better to now and then have to type in an extra character to
>> nail down an ambiguity than to have a complicated set of rules to try and
>> guess what the programmer's intent was.
>
> ======================================
>
> It's difficult to counter that position since it appears fair and considered
> (though perhaps a bit thin on detail :-)
>
> Yet, isn't the following example wholly contrary to what you claim?
>
> struct Foo
> {
>     void write (char x){}
>     void write (wchar x){}
>     void write (dchar x){}
> }
>
> void main()
> {
>   Foo f;
>
>   f.write ('1');  // invokes the first method
>   f.write ('\u0001');  // invokes the second method
>   f.write ('\u00000001');  // invokes the third method
> }
>
> To be clear: the compiler is doing explicitly what I describe above, but for char/wchar/dchar as opposed to their array counterparts. I feel this clearly contradicts your answer above, so what's the deal? Please? I'd really like to understand the basis for this distinction ...

The difference is that '\u00000001' cannot be a wchar or char and _must_ therefore be a dchar. This is the same as the integer literal example Sean gave earlier, where a large number had to be a long and could not be an int or short. The same is not true for string literals, because every possible string literal can be represented as every type of string: char[], wchar[] and dchar[].

In short, the difference is, as you mentioned earlier, that there is no symmetry between char, wchar, dchar and char[], wchar[], dchar[]: data is potentially lost going from dchar->wchar->char, but never lost going from dchar[]->wchar[]->char[].

There exists no string literal for which one type is favoured or required based solely on the information provided by the programmer. Therefore, asking the compiler to choose means that its choice will be arbitrary and/or based on something not related specifically to the instance at hand (perhaps performance, i.e. picking the type that involves the smallest amount of memory, or similar).
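
A minimal sketch of that asymmetry (the names and values here are arbitrary):

void demo()
{
    char[]  a = "hello";       // every string literal can be stored as char[] ...
    wchar[] b = "hello";       // ... or as wchar[] ...
    dchar[] c = "hello";       // ... or as dchar[], with no information lost

    dchar d = '\U00010000';    // a code point above U+FFFF fits only in a dchar
    // char  e = '\U00010000'; // error: the value does not fit in a char
}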

Regan
November 16, 2005
Kris wrote:
> "Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote...
> 
>>Kris wrote:
>>
>>>"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote
>>>
>>>
>>>>Dear gods, no! Think of the code readability man.
>>>>One should keep all the elements necessary to understand a piece of code to a minimum, (and as close and local as possible). Keep things simple.
>>>
>>>
>>>Oh, I most fully agree, so suspect I may have been misunderstood.
>>>
>>>
>>
>>Didn't you just say you wanted a pragma directive that could change/specify the type of undecorated string literals *in an individual class* (i.e., not a global,program-wide directive, but a class specific one)?
> 
> 
> No, and yes. The post didn't say "wanted a pragma directive" per se (it said "pragma/something", in reference to the preceeding posts). And it was a suggestion for the purposes of discussion; not a "want", as you indicate. Don't mean to split hairs, Bruno, but implications are sometimes tough to manage <g>
Ok, it could be a mechanism other than a pragma directive, and it was not a 'want' but more of a suggestion.
I stand clarified on those two points, but they are side issues and don't affect my opposition to the idea.

> 
> I did suggest that a mechanism might exist at the class/struct/interface level to describe how the {overloaded methods therein expect to handle literals}. Said mechanism might be scoped, it might not require additional syntax at all, it might be inherited, or might not. But any mechanism whose effect is constrained to a group of highly related methods (such as class/struct/interface) is, in my opinion, "local and as close as possible". As you described it. It also, IMO, keeps those "elements necessary to understand a piece of code to a minimum", since it's all tied up in a nice little code bundle. Being optional, that also helps to keep things simple.
> 
> I'm afraid code maintenance (and hence readability) is a /big/ soapbox of mine ~ but I'm beginning to suspect that matters less and less these days. T'was discussed in another thread ages ago, which you might be interested in.
> 
> Perhaps I misunderstood you :-) 
> 
Indeed, such an element would be somewhat local and close; however, being local and close would not beat the absence of that comprehension element altogether. That is, not having that element at all is simpler (and better) than having one, however close and local it is. Way better, in my opinion. I mean, having to find out, for each class concerned, how it treats undecorated string literals... :S


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 16, 2005
On Tue, 15 Nov 2005 22:50:19 +0000, Bruno Medeiros <daiphoenixNO@SPAMlycos.com> wrote:
> Walter Bright wrote:
>> "Kris" <fu@bar.com> wrote in message news:dl0ngf$2f7s$1@digitaldaemon.com...
>>
>>> Still, it troubles me that one has to decorate string literals in this
>>> manner, when the type could be inferred by the content, or by a cast()
>>> operator where explicit conversion is required. Makes it a hassle to
>>  create
>>
>>> and use APIs that deal with all three array types.
>>   One doesn't have to decorate them in that manner, one could use a cast
>> instead.
>>
> Speaking of string casting, correct me if this is wrong: the only string cast that makes a (Unicode encoding) conversion is the cast of string literals, right? The other string casts are opaque/conversion-less casts.

I believe you're correct: runtime casting of a char[] to a wchar[] only 'paints' it; it does not transcode the data. I think this behaviour is incorrect and that it should transcode. In the somewhat rare case where you actually need to paint raw data, you should be painting a byte[] to a char[], etc.
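
A minimal sketch of the paint/transcode distinction, assuming std.utf's toUTF16 for the conversion (the example string is arbitrary):

import std.utf;

void demo()
{
    char[] s = "caf\u00E9!";              // the e-acute makes this 6 bytes in UTF-8

    wchar[] painted   = cast(wchar[]) s;  // reinterprets the raw bytes as 3 wchars; garbage as UTF-16
    wchar[] converted = toUTF16(s);       // actually transcodes UTF-8 to UTF-16 (5 wchars)
}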

Regan
November 16, 2005
Georg Wrede wrote:

> I'd probably have to have a huge table, or then maybe there is an algorithm that tells me whether a particular form is non-shortest.

Yup :)
http://www.unicode.org/charts/normalization/


>> I think you misread that - the security thing applies to encoding characters in more UTF8 codepoints than necessary (for example, if you encode a space character in 4 bytes instead of 1). This allows string comparisons using simple byte scanning, instead of always having to decode into characters..
> 
> 
> Read for yourself. The following is pasted from Fedora Core 4, man utf-8 command.
> 
> 
> SECURITY
> 
> The Unicode and UCS standards require that producers of UTF-8 shall use
> the shortest form possible, e.g., producing a two-byte sequence with
> first byte 0xc0 is non-conforming. Unicode 3.1 has added the requirement
> that conforming programs must not accept non-shortest forms in their
> input. This is for security reasons: if user input is checked for
> possible security violations, a program might check only for the ASCII
> version of "/../" or ";" or NUL and overlook that there are many
> non-ASCII ways to represent these things in a non-shortest UTF-8 encoding.

Well, that's exactly what I said :) This applies to encoding characters in UTF-8 and says that each character shall be represented in UTF-8 using the shortest possible byte sequence. It says nothing about which characters to encode in the first place..
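
A sketch of the non-shortest-form issue; the assumption here is that std.utf.validate rejects overlong sequences:

import std.utf;

void demo()
{
    char[] shortest = "/";       // '/' in its shortest form: the single byte 0x2F

    char[] overlong;             // '/' in the forbidden two-byte form 0xC0 0xAF
    overlong ~= cast(char) 0xC0;
    overlong ~= cast(char) 0xAF;

    validate(shortest);          // fine
    validate(overlong);          // a conforming decoder must reject this; validate throws

    // A naive byte scan for '/' would miss the overlong form, which is
    // exactly the security concern the man page describes.
}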


xs0
November 16, 2005
Derek Parnell wrote:
> On Tue, 15 Nov 2005 22:50:19 +0000, Bruno Medeiros wrote:
>>Speaking of string casting, correct me if this is wrong: the only string cast that makes a (Unicode encoding) conversion is the cast of string literals, right? The other string casts are opaque/conversion-less casts.
>>
>>Also, should the following be allowed? :
>>   cast(wchar[])("123456"c)
>>If so what should the meaning be? (I already know what the DMD.139 meaning *is*, just wondering if it's conceptually correct)
> 
> 
> I believe it's saying that "123456" is a UTF-8 encoded string that you are
> tricking the compiler into thinking is a UTF-16 encoded string. Not a
> wise thing to do in most situations.
> 
> The 'cast(<X>char[])' idiom never does conversions of one utf encoding to
> another; not for literals or for variables. The apparent exception is that
> an undecorated string literal with a 'cast' is syntactically equivalent to
> a decorated string literal.
> 

Yes, that too was what I was thinking it should do; however, it's not what happens. The string literal becomes a wchar string literal (just like this: "123456"w ), the decoration thus being ignored.

Bug, isn't it?
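
A sketch of the two readings being discussed; the DMD .139 behaviour described in the comments is as reported in this thread, not independently verified:

void demo()
{
    // Reading 1 (Derek): the cast should merely paint the six UTF-8 bytes,
    // yielding three wchars of nonsense.
    // Reading 2 (observed by Bruno): the 'c' suffix is ignored and the
    // expression behaves like the decorated literal "123456"w.
    wchar[] w = cast(wchar[]) ("123456"c);

    // On a runtime char[] variable, by contrast, the cast only paints:
    char[] c = "123456";
    wchar[] painted = cast(wchar[]) c;    // 3 wchars of reinterpreted bytes
}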

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 16, 2005
On Wed, 16 Nov 2005 11:35:20 +0000, Bruno Medeiros wrote:

> Derek Parnell wrote:

[snip]

>> The 'cast(<X>char[])' idiom never does conversions of one utf encoding to another; not for literals or for variables. The apparent exception is that an undecorated string literal with a 'cast' is syntactically equivalent to a decorated string literal.
>> 
> 
> Yes, that too was what I was thinking it should do, however it's not what happens. The string literal becomes a wchar string literal (just like this: "123456"w ), the decoration thus being ignored.
> 
> Bug, isn't it?

Oh my Bob! You're right.   Yes, this has to be a bug.

-- 
Derek Parnell
Melbourne, Australia
17/11/2005 12:29:11 AM
November 16, 2005
(having trouble with unicode strings not displaying properly ...)



It struck me that perhaps I might write this out more clearly:

1) It was suggested that the default storage type for a string literal could be inferred from the content therein. The suffix would be used to override that default, in the manner it does today.

2) The storage-class for said literal would be used to select an appropriate method overload, thus avoiding both the compile error and the need to /always/ decorate literals where overloading is present.

3) Walter notes that he considered that, but felt it was not good enough.

4) Turns out that char/wchar/dchar instances take exactly the approach described above, contrary to #3. In other words, the approach outlined is good enough for characters, but somehow not good enough for character arrays.

Quote from below: "I did consider that for a while, but eventually came to the conclusion that its behavior would be surprising to someone who did not very carefully read the spec. Also, the distinction between the various character types is not obvious when looking at the rendered text, further making it surprising."

Counterpoint: Should that not apply equally to non-array instances also? If it is not clear what storage class "???????????" should have, then it is not clear what storage class '?' should have (the latter is just a single character). Such things should probably be treated equally? Thanks to GW for the unicode example.

I think D has /almost/ got a perfect solution -- for resolving these cases where literals are concerned:

a) leave character literals as they are. It works in an appropriate fashion.

b) leave string suffixes as they are. It works as a mechanism to explicitly dictate the storage class of a string literal.

Now add the following:

c) treat string literals just like char literals, in terms of extracting the storage class from the content. Usage of a suffix will always override. This provides the /default/ storage class desired, in addition to the fine control already exposed by the suffix mechanism ~ thus, the need in Stream.d for writeC()/writeW()/writeD() is made redundant, and op-overloads work just as cleanly with string literals as they do with char literals. It's surely a win-win?


Examples:

String Usage
=========

struct Foo
{
    void write (char[] x){}
    void write (wchar[] x){}
    void write (dchar[] x){}
}

void main()
{
  Foo f;

  f.write ("I'm an ascii string");  // should invoke the char method
  f.write ("I'm an ascii string, but in wide chars"w);  // explicitly
invokes the wchar method
  f.write ("???????????"d);  // explicity invokes the dchar method
}


Character Usage
============

struct Foo
{
    void write (char x){}
    void write (wchar x){}
    void write (dchar x){}
}

void main()
{
  Foo f;

  f.write ('1');  // currently invokes the char method
  f.write ('\u0001');  // currently invokes the wchar method
  f.write ('\u00000001');  // currently invokes the dchar method
  f.write ('?');     // currently invokes the wchar method
}



"Kris" <fu@bar.com> wrote in message news:dledui$4ri$1@digitaldaemon.com...

> "Walter Bright" <newshound@digitalmars.com> wrote...
>> "Kris" <fu@bar.com> wrote in message news:dl2p7i$11gf$1@digitaldaemon.com...
>>> Don't you think the type can be inferred from the content?
>>
>> Not 100%.
>>
>>> For the sake of
>>> discussion, how about this:
>>>
>>> 1) if the literal has a double-wide char contained, it is defaulted to a dchar[] type.
>>>
>>> 2) if not 1, and if the literal has a wide-char contained, it defaults
>>> to
>>> being of wchar[] type.
>>>
>>> 3) if neither of the above, it defaults to char[] type.
>>>
>>> Given the above, I can't think of a /common/ situation where casting
>>> would
>>> thus be required.
>>
>> I did consider that for a while, but eventually came to the conclusion
>> that
>> its behavior would be surprising to someone who did not very carefully
>> read
>> the spec. Also, the distinction between the various character types is
>> not
>> obvious when looking at the rendered text, further making it surprising.
>>
>> I think it's better to now and then have to type in an extra character to nail down an ambiguity than to have a complicated set of rules to try and guess what the programmer's intent was.
>
> ======================================
>
> It's difficult to counter that position since it appears fair and considered (though perhaps a bit thin on detail :-)
>
> Yet, isn't the following example wholly contrary to what you claim?
>
> struct Foo
> {
>    void write (char x){}
>    void write (wchar x){}
>    void write (dchar x){}
> }
>
> void main()
> {
>  Foo f;
>
>  f.write ('1');  // invokes the first method
>  f.write ('\u0001');  // invokes the second method
>  f.write ('\u00000001');  // invokes the third method
> }
>
> To be clear: the compiler is doing explicitly what I describe above, but for char/wchar/dchar as opposed to their array counterparts. I feel this clearly contradicts your answer above, so what's the deal? Please? I'd really like to understand the basis for this distinction ...
>
>
> P.S. I don't wish to somehow eliminate the suffix ~ I think that's great. What I'm after is a means to render the suffix redundant in the common case (common case as defined by the developer). It might also resolve the apparent inconsistency above?
>


November 16, 2005
Kris wrote:
>
> 3) Walter notes that he considered that, but felt it was not good enough.
> 
> 4) Turns out that char/wchar/dchar instances take the exact approach as described above, contrary to #3. In other words, the approach outlined is good enough for characters but somehow not good enough for character arrays.

It's relatively obvious which method will be chosen for a char literal as it's just one character.  Compare that to a string literal which may be pages of text with one unicode char embedded in the middle.
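
A sketch of that concern, written with explicit suffixes so that it compiles as-is; the literals are arbitrary:

struct Foo
{
    void write (char[] x)  {}
    void write (wchar[] x) {}
    void write (dchar[] x) {}
}

void main()
{
    Foo f;

    // Today the suffix makes the chosen overload visible at the call site:
    f.write ("a long run of ordinary text ..."c);
    f.write ("a long run of ordinary text ... with one wide char: \u00E9"w);

    // Under content-based inference the same two calls, written without
    // suffixes, would resolve to different overloads even though the
    // literals look almost identical on screen.
}
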
> 
> Quote from below: "I did consider that for a while, but eventually came to the conclusion that its behavior would be surprising to someone who did not very carefully read the spec. Also, the distinction between the various character types is not obvious when looking at the rendered text, further making it surprising."

See above.

> Counterpoint: Should that not apply equally to non-array instances also? If it is not clear what storage class "???????????" should have, then it is not clear what storage class '?' should have (the latter is just a single character). Such things should probably be treated equally? Thanks to GW for the unicode example.

I agree that they probably should, simply for the sake of consistency. And since this is already the case for char literals, I don't know that the comment about "carefully reading the spec" is fair--if the user understands how char literal overloads are resolved then it seems reasonable that he should expect string literal overloads to be resolved the same way.

For what it's worth however, I'm unable to reproduce GW's example with DMD .139.  Here is my test code:

import std.c.stdio;

void fn( char c )  { printf( "char\n" );  }
void fn( wchar c ) { printf( "wchar\n" ); }
void fn( dchar c ) { printf( "dchar\n" ); }

void main()
{
    fn( '1' );
    fn( '\u0001' );
    //fn( '\u00000001' );
}

This prints:

char
wchar

as expected.  If I uncomment the last function call however, I get these error messages:

C:\code\d>dmd test
test.d(11): unterminated character constant
test.d(11): found '1' when expecting ','
test.d(11): unterminated character constant

I've tried varying the number of zeroes in this char and this only seems to work if it's a valid wchar.  Is this a scan error in DMD or am I missing something obvious?


Sean
November 16, 2005
"Sean Kelly" <sean@f4.ca> wrote in message news:dlg7g2$27f7$1@digitaldaemon.com...
> Kris wrote:
>>
>> 3) Walter notes that he considered that, but felt it was not good enough.
>>
>> 4) Turns out that char/wchar/dchar instances take the exact approach as described above, contrary to #3. In other words, the approach outlined is good enough for characters but somehow not good enough for character arrays.
>
> It's relatively obvious which method will be chosen for a char literal as it's just one character.  Compare that to a string literal which may be pages of text with one unicode char embedded in the middle.

Pages of text? In a literal? You're a braver man than I <g>

<snip>
> I've tried varying the number of zeroes in this char and this only seems to work if it's a valid wchar.  Is this a scan error in DMD or am I missing something obvious?

That should be an uppercase \U, not a lowercase one. Darn typos ...
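
With the corrected escape, a variation of Sean's test (a sketch; the expected output follows from how character literal types are assigned):

import std.c.stdio;

void fn( char c )  { printf( "char\n" );  }
void fn( wchar c ) { printf( "wchar\n" ); }
void fn( dchar c ) { printf( "dchar\n" ); }

void main()
{
    fn( '1' );           // prints "char"
    fn( '\u0001' );      // prints "wchar"
    fn( '\U00000001' );  // uppercase \U takes 8 hex digits; prints "dchar"
}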


BTW: How does one get unicode to show up in these posts? I tried setting the encoding to UTF8, but to no avail. Any ideas?


November 16, 2005
"Kris" <fu@bar.com> wrote

> Pages of text? In a literal? You're a braver man than I <g>

I'm sorry Sean; that should have read "In a literal /within a function call/ ?"