November 18, 2005
Regan Heath wrote:
> A simple example:
> 
> void write(char[] s)  { printf("1"); }
> void write(wchar[] s) { printf("2"); }
> void write(dchar[] s) { printf("3"); }
> 
> write("test");

This changes content.

Conversions between UTF types _do_not_change_ content, _ever_.
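
A minimal sketch of that round trip, assuming a present-day D compiler and the Phobos std.utf functions toUTF8, toUTF16, and toUTF32 (the sample text is made up for illustration). The number of code units differs per encoding, but the text itself survives every conversion:

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    // "naive" with a diaeresis plus a euro sign, written with escapes
    // so the example does not depend on the source file's encoding.
    string  s8  = "na\u00EFve \u20AC"; // UTF-8
    wstring s16 = toUTF16(s8);         // UTF-16
    dstring s32 = toUTF32(s16);        // UTF-32

    // The storage width changes with the encoding...
    assert(s8.length  == 10); // 10 UTF-8 code units
    assert(s16.length == 7);  // 7 UTF-16 code units
    assert(s32.length == 7);  // 7 UTF-32 code units

    // ...but converting back yields the original text.
    assert(toUTF8(s32) == s8);
}
```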
November 18, 2005
Kris wrote:
...
> The latter is clearly a bug. Yet, the auto keyword, IMO, is more reason to support default-storage-class via content. Sure, the suffixed version should 

Kris wrote:
> a) string literal storage-class is not derived in the same manner as char
> literal. If the rule is too complex to comprehend for string literals then

Kris wrote:
> A /decorated/ string literal is certainly explicit about its storage-class,
> but an undecorated one is not. The distinction is important in that

Kris wrote:
> 1) auto is selecting char[] as the "default" storage class, regardless of
> the content. We see this from the prior example ~ it operates as though
> there were a 'c' suffix attached to those literals.

Since when does auto (type inference) determine (or even have anything to do with) storage class?? And how is char[] a "'default' storage class"? char[] is a type! You are equating storage classes with types; where the hell did that come from?

*Sigh*, another terminology frak up?..


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to
be... unnatural."
November 18, 2005
On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>> A simple example:
>>
>> void write(char[] s)  { printf("1"); }
>> void write(wchar[] s) { printf("2"); }
>> void write(dchar[] s) { printf("3"); }
>>
>> write("test");
>
> This changes content.
>
> Conversions between UTF types _do_not_change_ content, _ever_.

Correct, that's not what I am worried about. Let me try again.

In the example above there are 3 functions, they have the same name but they do different things.

This is poor programming practice to be sure, but it can occur unintentionally in a complex source file (also bad programming practice IMO). It's more likely for there to be 2 as opposed to 3. The likelihood of any occurring unintentionally is fairly low, but it still exists.

Let's assume there are 2 functions of the same name (unintentionally), doing different things.

In that source file the programmer writes:

write("test");

DMD tries to choose the storage type of "test" based on the available overloads. There are 2 available overloads X and Y. It currently fails and gives an error.

If instead it picked an overload (X) and stored "test" in the type for X, calling the overload for X, I agree, there would be _absolutely no problems_ with the stored data.

BUT

the overload for X doesn't do the same thing as the overload for Y.

In short, it's not the choice of encoding which is the problem, it's the choice of overload. The fact that the choice of overload is tied directly to the choice of encoding makes the choice of encoding a problem.

Regan
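
For comparison, a hedged sketch of how the same three overloads behave in present-day D (D2), which is not the 2005 compiler discussed above. The functions return a number instead of printing, purely so the chosen overload can be checked. As far as I know, an unsuffixed literal now prefers the char-based overload instead of producing an error, and a w or d suffix selects the others:

```d
// Same name, different behaviour per encoding, as in the example above.
// Return values replace the printf calls so the choice can be asserted.
int write(string s)  { return 1; }
int write(wstring s) { return 2; }
int write(dstring s) { return 3; }

void main()
{
    assert(write("test")  == 1); // unsuffixed literal: char-based overload
    assert(write("test"w) == 2); // explicit UTF-16 literal
    assert(write("test"d) == 3); // explicit UTF-32 literal
}
```

The concern raised above still applies to this sketch: the stored text is the same in all three calls, yet which body runs depends entirely on the encoding the literal ends up with.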
November 18, 2005
Hey, Bruno.

I described it as storage-class, since we're dealing with literals (static data). The suffixes cause the compiler to "store" the strings in the declared format, within the object file.

If we'd been talking about non-literals, then I would have just written 'type' instead.

I'm happy to think about it in either terms, but was trying to maintain the distinction between how literal and non-literal char[]/wchar[]/dchar[] were treated. One might make a similar distinction between static/dynamic arrays of any type. Both have type, while the former also has storage-class.

Like you say, it's just terminology. Didn't mean to cause confusion, so my apologies for that.

Cheers!



"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote in message news:dlkeso$10es$1@digitaldaemon.com...
> Kris wrote:
> ...
>> The latter is clearly a bug. Yet, the auto keyword, IMO, is more reason to support default-storage-class via content. Sure, the suffixed version should
>
> Kris wrote:
>> a) string literal storage-class is not derived in the same manner as char literal. If the rule is too complex to comprehend for string literals then
>
> Kris wrote:
>> A /decorated/ string literal is certainly explicit about its
>> storage-class,
>> but an undecorated one is not. The distinction is important in that
>
> Kris wrote:
>> 1) auto is selecting char[] as the "default" storage class, regardless of the content. We see this from the prior example ~ it operates as though there were a 'c' suffix attached to those literals.
>
> Since when does auto (type inference) determine (or even have anything to do with) storage class?? And how is char[] a "'default' storage class"? char[] is a type! You are equating storage classes with types; where the hell did that come from?
>
> *Sigh*, another terminology frak up?..
>
>
> -- 
> Bruno Medeiros - CS/E student
> "Certain aspects of D are a pathway to many abilities some consider to
> be... unnatural."


November 20, 2005
Kris wrote:
> Hey, Bruno.
> 
> I described it as storage-class, since we're dealing with literals (static data). The suffixes cause the compiler to "store" the strings in the declared format, within the object file.
Yes, because they are string literals they will always have static storage class. And so, their storage class is not inferred/determined/selected by auto or anything else. Only the type is inferred, which has nothing to do with their storage classes.
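
A small sketch of that distinction, assuming present-day D (D2), where an unsuffixed literal is typed as string (immutable(char)[]) rather than the char[] of 2005. The variable names are made up for illustration. The inferred type is the same whatever the literal contains and whatever storage class the declaration carries:

```d
void main()
{
    auto plain = "abc";    // ASCII-only content
    auto euro  = "\u20AC"; // non-ASCII content

    // Type inference gives the same type in both cases;
    // the content of the literal plays no part.
    static assert(is(typeof(plain) == string));
    static assert(is(typeof(euro)  == string));

    // The storage class is a separate property of the declaration:
    // this variable has static storage, and its type is still inferred.
    static s = "abc";
    static assert(is(typeof(s) == string));

    assert(euro.length == 3); // the euro sign takes 3 UTF-8 code units
}
```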

> 
> If we'd been talking about non-literals, then I would have just written 'type' instead.
> 
> I'm happy to think about it in either terms, but was trying to maintain the distinction between how literal and non-literal char[]/wchar[]/dchar[] were treated. One might make a similar distinction between static/dynamic arrays of any type. Both have type, while the former also has storage-class.

What? "while the former also has storage-class" ?? You're making incorrect statements again. All variables have a storage class, including dynamic arrays. (It can be a bit more confusing because dynamic arrays are a dual entity, so one can say they have two storage classes: that of the reference, and that of the referenced data.)

> 
> Like you say, it's just terminology. Didn't mean to cause confusion, so my apologies for that.
> 
> Cheers!
>

It didn't cause confusion. Undefined, ambiguous, or redundant terminology causes confusion. This was incorrect terminology, and it causes annoyance more than anything (when you come to understand it's incorrect, that is).
I'm sure you'll agree with this last statement, even if you don't (yet?) agree that your terminology was incorrect.


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 20, 2005
Regan Heath wrote:
> On Fri, 18 Nov 2005 13:02:05 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
> 
> Let's assume there are 2 functions of the same name (unintentionally), doing different things.
> 
> In that source file the programmer writes:
> 
> write("test");
> 
> DMD tries to choose the storage type of "test" based on the available
> overloads. There are 2 available overloads X and Y. It currently
> fails and gives an error.
> 
> If instead it picked an overload (X) and stored "test" in the type
> for X, calling the overload for X, I agree, there would be
> _absolutely no problems_ with the stored data.
> 
> BUT
> 
> the overload for X doesn't do the same thing as the overload for Y.

Isn't that a problem with having overloading in a language at all?
Sooner or later most of us, if not all of us already, have done it. Isn't this a problem with overloading in general, and not with UTF?

Equally, implicit casting in itself creates a risk of this. I, at least, admit to having had underflow of unsigned types, overflow of signed types, and the like. (Fortunately not for a while, though.)

It's like forbidding for(...;...;...) loops for fear of off-by-one errors. Crap happens, even to the best of us, but this is not a language for VB programmers.

> In short, it's not the choice of encoding which is the problem, it's the choice of overload. The fact that the choice of overload is tied directly to the choice of encoding makes the choice of encoding a
> problem.
November 20, 2005
Bruno,

thanks for being alert here! At least I, and possibly others too, sometimes write late at night, but it's good to see someone awake here.

BTW, we really do need a good word to describe the fact that UTF strings are stored in dissimilar encodings and widths. Would you have any suggestions?



Bruno Medeiros wrote:
> Kris wrote:
> 
>> Hey, Bruno.
>> 
>> I described it as storage-class, since we're dealing with literals
>> (static data). The suffixes cause the compiler to "store" the
>> strings in the declared format, within the object file.
> 
> Yes, because they are string literals they will always have static storage class. And so, their storage class is not inferred/determined/selected by auto or anything else. Only the type
> is inferred, which has nothing to do with their storage classes.
> 
>> 
>> If we'd been talking about non-literals, then I would have just written 'type' instead.
>> 
>> I'm happy to think about it in either terms, but was trying to maintain the distinction between how literal and non-literal char[]/wchar[]/dchar[] were treated. One might make a similar distinction between static/dynamic arrays of any type. Both have
>> type, while the former also has storage-class.
> 
> 
> What? "while the former also has storage-class" ?? You're making incorrect statements again. All variables have a storage class,
> including dynamic arrays. (It can be a bit more confusing because
> dynamic arrays are a dual entity, so one can say they have two storage classes:
> that of the reference, and that of the referenced data.)
> 
>> Like you say, it's just terminology. Didn't mean to cause
>> confusion, so my apologies for that.
> 
> It didn't cause confusion. Undefined, ambiguous, or redundant terminology causes confusion. This was incorrect terminology, and it
> causes annoyance more than anything (when you come to understand it's incorrect, that is). I'm sure you'll agree with this last statement,
> even if you don't (yet?) agree that your terminology was incorrect.
> 
> 
November 21, 2005
Georg Wrede wrote:
> Bruno,
> 
> thanks for being alert here! At least I, and possibly others too, sometimes write late at night, but it's good to see someone awake here.
> 
> BTW, we really do need a good word to describe the fact that UTF strings are stored in dissimilar encodings and widths. Would you have any suggestions?
> 
> 
It's type, of course! The different UTF-encoding strings have different types: char[], wchar[] and dchar[]. (And if they hadn't, you could just call it "encoding".)

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 21, 2005
"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote
[snip]
> It didn't cause confusion. Undefined, ambiguous, or redundant terminology
> causes confusion. This was incorrect terminology, and it causes annoyance
> more than anything (when you come to understand it's incorrect, that is).
> I'm sure you'll agree with this last statement, even if you don't (yet?)
> agree that your terminology was incorrect.

Yes, you are right. Thought it was a useful distinction at the time, but was clearly wrong.


November 21, 2005
Kris wrote:
> "Georg Wrede" <georg.wrede@nospam.org> wrote ...
> 
>>Kris wrote:
>>
>>>This is the long standing mishmash between character literal
>>>arguments and parameters of type char[], wchar[], and/or dchar[].
>>>Character literals don't really have a "solid" type ~ the compiler
>>>can, and will, convert between wide and narrow representations on the
>>>fly.
>>
>>Compared to the bit thing I recently "bitched" about, this, IMHO, is an
>>issue one can accept better. :-)
> 
> 
> That doesn't make it any less problematic :-)
> 
> 
> 
>>It is a problem for small example programs. Larger programs tend to
>>(and IMHO should) have wrappers anyhow:
> 
> 
> Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
> 
> 
> 
>>Why not change Phobos
>>
>>void write ( char[] s) {.....};
>>void write (wchar[] s) {.....};
>>void write (dchar[] s) {.....};
>>
>>into
>>
>>void _write ( char[] s) {.....};
>>void _write (wchar[] s) {.....};
>>void _write (dchar[] s) {.....};
>>void write (char[] s) {_write(s);};
>>
>>I think this would solve the issue with string literals as discussed in this thread.
> 
> 
> Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead.
> 
> One might infer the literal type from the content therein?
> 
> 
> 
>>Also, overloading would not be hampered.
>>
>>And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.
> 
> 
> Ahh. I think non-ASCII folks would be troubled by this bias <g> 
> 

char[] does NOT NECESSARILY MEAN an ASCII-only string in D.

char[] can be a sequence of UTF-8 code units, which further confuses the matter.

So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings.  The only effect of the choice is the efficiency with which your project processes strings.  You should not lose any data, unless you make incorrect assumptions in your code.
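
A hedged sketch of the kind of incorrect assumption meant here, written for present-day D (D2). The helper name codePoints is made up for illustration. The length of a char-based string counts UTF-8 code units, not characters, while a foreach loop with a dchar loop variable decodes whole code points:

```d
// Hypothetical helper: counts code points by letting foreach decode UTF-8.
size_t codePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s) // a dchar loop variable makes foreach decode
        ++n;
    return n;
}

void main()
{
    // Three non-ASCII letters, each 2 code units long in UTF-8.
    string s = "\u00E5\u00E4\u00F6";

    assert(s.length == 6);      // code units, not characters
    assert(codePoints(s) == 3); // decoded code points

    // For ASCII-only text the two counts happen to coincide, which is
    // how the wrong assumption tends to go unnoticed.
    assert("abc".length == codePoints("abc"));
}
```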

I think it was a very wise decision to make the char type separate from byte and ubyte, but I don't think the separation has gone far enough.  There should be code-unit types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32).  Then, there should be a single ASCII character type called 'char'.  This would allow strings to be defined to hold ASCII characters, UTF-8 code units, UTF-16 code units, or UTF-32 code units.

String literals created by the D compiler should be stored in a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32, and should be represented as the corresponding static array of that character type.  The default encoding should be modifiable with either command-line options or pragmas, preferably pragmas.

For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').

Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
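
For reference, a sketch of the per-literal suffixes mentioned earlier in this thread, assuming present-day D (D2) and its string, wstring, and dstring aliases rather than the char8/char16/char32 names proposed above, which are not D types. It also shows a literal initializing a fixed-size array of matching length; the variable names are made up for illustration:

```d
void main()
{
    // A postfix character selects the encoding literal by literal.
    auto a = "hello world"c; // UTF-8
    auto b = "hello world"w; // UTF-16
    auto c = "hello world"d; // UTF-32

    static assert(is(typeof(a) == string));
    static assert(is(typeof(b) == wstring));
    static assert(is(typeof(c) == dstring));

    // A literal can also initialize a static array of matching length.
    immutable(char)[11]  fixed8  = "hello world";
    immutable(wchar)[11] fixed16 = "hello world"w;
    immutable(dchar)[11] fixed32 = "hello world"d;

    // Same 11 code units each time, stored at different widths.
    static assert(fixed8.sizeof  == 11);
    static assert(fixed16.sizeof == 22);
    static assert(fixed32.sizeof == 44);
}
```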