November 16, 2005
Walter Bright wrote:
> "Kris" <fu@bar.com> wrote in message news:dl3048$17nj$1@ digitaldaemon.com...
> 
>> Ahh. GW made a suggestion in the other thread (d.learn) that would help here. The notion is that the default type of string literals can be implied through the file encoding (D can handle multiple file encodings), or might be set explicitly via a pragma of some
>> kind. I think the former is an interesting idea.

> But then the meaning of the program would change if the source was transliterated into a different UTF encoding. I don't think this is a
> good idea, as it would be surprising, and would work against someone
> wanting to edit code in the UTF format their editor happens to be
> good at.

That sounds like you haven't read these two:

Xref: digitalmars.com digitalmars.D.bugs:5436 digitalmars.D:29904
Xref: digitalmars.com digitalmars.D.bugs:5440 digitalmars.D:29906

Those show that the meaning of the program does not change when the source code is transliterated to a different UTF encoding.

They also show that editing code in different UTF formats, and inserting "foreign" text directly into string literals, survives intact when the source file is converted between UTF formats.

Further, they show that decorating a string literal with c, w, or d does not change the interpretation of the string's contents, whether or not it contains directly inserted "foreign" characters.

Most permutations of the above 3 paragraphs were tested.
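
Not the actual test program from those articles, but a minimal sketch of the kind of round trip they exercise:

import std.utf;

void main()
{
    // "Foreign" content, written with escapes here only so that the
    // sketch does not depend on this file's own encoding.
    char[] s = "p\u00E4iv\u00E4\u00E4";   // "päivää"

    // Converting between the three UTF widths and back changes nothing:
    assert(toUTF8(toUTF16(s)) == s);
    assert(toUTF8(toUTF32(s)) == s);
}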

---

I'm sure that having read those two articles, you'd agree that the concept of UTF string literal decorations becomes obsolete.

---

The fact still remains that a UTF file (or indeed any non-US-ASCII file) is at risk of not being understood when transferred between disparate systems. From that it follows that source code files containing "foreign" characters, be it within string literals, in comments, or as variable names, are risky at best.

String Literal Decorations in D, however, do _not_ alleviate any of these problems. They serve merely as "storage attributes". The particular decoration has no impact on the interpretation of the string's contents.

(This should be obvious reading the test program source and its output.)
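
In miniature (a sketch, not the test program itself):

import std.utf;

void main()
{
    // The same literal under each decoration; only the storage width
    // differs, never the characters it denotes.
    char[]  c = "abc\u20AC"c;
    wchar[] w = "abc\u20AC"w;
    dchar[] d = "abc\u20AC"d;

    // Decoded back to code points, all three are identical.
    assert(toUTF32(c) == d);
    assert(toUTF32(w) == d);
}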

---

At this point, I'm also confident that Implicit Casts between the different UTF widths _will_not_ carry _any_ risk of content deterioration.

From all of this it follows that I suggest we get rid of the string literal decorations altogether, and that we let the compiler implicitly cast these strings as needed.

I do not have an opinion for or against _implicit_ UTF width casts in _other_ contexts. (That is a non-technical issue(!), mostly related to expected programmer behavior, and somewhat to performance.)

November 16, 2005
On Wed, 16 Nov 2005 03:09:04 +0200, Georg Wrede wrote:


[snip]

> String Literal Decorations in D, however, do _not_ alleviate any of these problems. They serve merely as "storage attributes". The particular decoration has no impact on the interpretation of the string's contents.

[snip]

>  From all of this it follows that I suggest we get rid of the string
> literal decorations altogether, and that we let the compiler implicitly
> cast these strings as needed.

It's not the content of the string literal that is at issue, but the treatment that the content will receive once it is passed to a function. The issue with undecorated string literals is how does the compiler decide which function to call when more than one possible contender is available?

Your suggestions and findings do not solve that issue. One can never guarantee that functions with the same name but different signatures will do 'the same thing', therefore the compiler cannot arbitrarily decide between them. The resolution must come from the coder and decorating strings is one such method to resolve this.
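
For example, a sketch of the situation (show() is a made-up name):

void show (char[]  s) {}
void show (wchar[] s) {}

void main()
{
    // show ("hello");   // ambiguous: the undecorated literal matches both
    show ("hello"c);     // a suffix (or a cast) is needed to pick one
}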

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
16/11/2005 12:37:20 PM
November 16, 2005
"Derek Parnell" <derek@psych.ward> wrote ...
> On Tue, 15 Nov 2005 16:00:08 -0800, Kris wrote:
>> <snip>
>> This was not always the case, as past topics will regale in bountiful
>> measure. That's what all the implicit-utf-conversion huffing and puffing
>> used to be about. Hurrah! Hurrah!
>
> I don't think that D ever worked that way. I think D has not changed in
> this aspect. It now works the way it always has.

Perhaps you're right. Perhaps I became blinded by the implicit-conversion brigade?


> I believe that all the huffing and puffing was about function signature matching and not about conversion of strings.

It is right now, Derek. But I was referring to topics in the past re implicit transcoding.


November 16, 2005
Derek Parnell wrote:
> On Wed, 16 Nov 2005 03:09:04 +0200, Georg Wrede wrote:
> 
>> From all of this it follows that I suggest we get rid of the string literal decorations altogether, and that we let the compiler
>> implicitly cast these strings as needed.
> 
> It's not the content of the string literal that is at issue, but the treatment that the content will receive once it is passed to a
> function. The issue with undecorated string literals is how does the
> compiler decide which function to call when more than one possible
> contender is available?

My point is that even if it chose at random, there's no harm done!

> Your suggestions and findings do not solve that issue. One can never guarantee that functions with the same name but different signatures
> will do 'the same thing', 

That same problem should exist with integer types, right? And with floating-point types. What if you have three functions that take a short, an int, and a long, respectively? How can you guarantee that they do 'the same thing'?
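
After all, the language already picks one of them for an integer literal without complaint (a sketch with made-up names):

void f (short x) {}
void f (int   x) {}
void f (long  x) {}

void main()
{
    f (43);              // the literal is an int, so the int overload wins
    f (cast(short) 43);  // the coder can always override that default
}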

> therefore the compiler cannot arbitrarily
> decide between them. The resolution must come from the coder and
> decorating strings is one such method to resolve this.

The choice is between 8, 16, and 32 bit UTF. Suppose we make a bad choice. The _worst_ thing that can happen is that it takes a bit more time to convert (cast) the string into whatever is needed, be it for an output stream or an assignment.

Since UTF-to-UTF conversions are computationally a lot lighter than one would think, _and_ considering that there really aren't that many string literals in real programs (compared to everything else), the cost is minimal.

Whatever we do, there's _no_ risk to the contents!
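
For example, if the default turns out to be the "wrong" width for some consumer, the fix is one conversion call (a sketch; emit() and its appetite for dchar[] are made up):

import std.utf;

// Hypothetical consumer that happens to want UTF-32.
void emit (dchar[] s)
{
}

void main()
{
    // Suppose the undecorated literal defaulted to char[], the "wrong"
    // width for emit(). The worst case is one conversion; the contents
    // come through untouched.
    char[] s = "hello";
    emit (toUTF32(s));
}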

Let's take another example, if I write:

int i = 43;
ubyte j = 43;

Does it matter to anybody how the 43 is stored before it gets assigned? I'd say no. As long as I have numbers in the range 0..127, nobody cares how the number gets stored before assignment or before output. And with numbers in that range, there is no risk of loss of content.

Now, the exact same thing applies to UTF.


Typically a program may contain a dozen string literals, but it also typically goes through megabytes worth of data at runtime, or clock cycles in the 1..1000 * 10^9 range. Not terribly many of those clock cycles get wasted on (possibly) unnecessary UTF width conversions -- however badly we choose the default for string literals.

In every other place in the program except those literals, we of course define storage, and therefore width. We get to choose whether it is char, wchar, or dchar. From then on, there are no conversions needed unless we explicitly ask for them.

But the mere fact that you _seem_ to be able to make a huge difference in _something_ (though _what_ that something is, most folks don't even comprehend) makes people believe that how you decorate a string literal has some fundamental importance.

Suppose I decorated the two example numbers, like this:

int i = 43L;
ubyte j = 43L;

What would be different? I submit, pretty much nothing.
November 16, 2005
"Georg Wrede" <georg.wrede@nospam.org> ...
> Derek Parnell wrote:
>> On Wed, 16 Nov 2005 03:09:04 +0200, Georg Wrede wrote:
>>
>>> From all of this it follows that I suggest we get rid of the string literal decorations altogether, and that we let the compiler implicitly cast these strings as needed.
>>
>> It's not the content of the string literal that is at issue, but the treatment that the content will receive once it is passed to a function. The issue with undecorated string literals is how does the compiler decide which function to call when more than one possible contender is available?
>
> My point is that even if it chose at random, there's no harm done!
>

That's an interesting thought :-)

If we assume method overloading does not change the high-level semantics (for example, all write() methods still actually write the output in some manner rather than, say, one of them eating the content), then what you suggest is certainly food for thought. It fits in with what Sean had suggested earlier today too: he wondered whether the compiler might pick the 'weakest' or 'strongest' of the common signatures.

Where this might break down, though, is where single char/wchar/dchar instances are involved. Suppose I have a binary protocol that's supposed to get a dchar stuffed into it. Now suppose the compiler chooses the wrong method because a literal 'x' happens to fit the char signature instead? This is a contrived example, but it's not entirely unrealistic.
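
Something along these lines, to make that concrete (Protocol and put() are made up):

struct Protocol
{
    void put (char c)  { /* writes one byte to the wire   */ }
    void put (dchar c) { /* writes four bytes to the wire */ }
}

void main()
{
    Protocol p;

    p.put ('x');              // 'x' is a char literal: one byte goes out,
                              // even if the peer expects four
    p.put (cast(dchar) 'x');  // the coder has to force the dchar path
}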

The same is true of array instances too, if we consider wire protocols (both ends have to agree on what the data type is, yet 'read' does not expose the issue enjoyed by 'write' at all). It's just a variation on the problem. Hence, I'd still maintain that a fully deterministic, /developer-controlled/, mechanism be applied.

Some feel that a suffix is sufficient ~ I agree, for its intended purpose. What I'm after is a means to render that suffix /redundant/ in the common case (common case as defined by the developer); not to eliminate it. Doing so will enable operator-overloads and method-overloading to operate "cleanly" with these literal types, and make D more succinct into the bargain.

As you stated, char[], wchar[], and dchar[] are all interchangeable without loss (given certain considerations). Alas, that does not apply to char, wchar, and dchar. There's that lack of symmetry to deal with as well :-(
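
Roughly, with std.utf doing the array conversions (a sketch):

import std.utf;

void main()
{
    // The array forms can carry any code point, in any width:
    dchar[] d = "\U00010000"d;   // a code point outside the BMP
    char[]  c = toUTF8(d);       // four UTF-8 code units
    wchar[] w = toUTF16(d);      // a surrogate pair, two code units
    assert(toUTF32(c) == d && toUTF32(w) == d);

    // The single-character forms are not symmetric: that code point
    // fits a dchar, but no single char or wchar can hold it.
    dchar one = '\U00010000';
}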

Hey, I see it as a good sign that we're sweating these kinds of details ~ it indicates some of the bigger items are currently in bed or have been killed off completely.  Now, if we could just get "read only arrays" ... <g>




November 16, 2005
"Kris" <fu@bar.com> wrote
<snip>
> Some feel that a suffix is sufficient ~ I agree, for its intended purpose. What I'm after is a means to render that suffix /redundant/ in the common case (common case as defined by the developer); not to eliminate it. Doing so will enable operator-overloads and method-overloading to operate "cleanly" with these literal types, and make D more succinct into the bargain.

Addendum: I meant to note that overload resolution (and hence operator overloads) actually uses the value of the char literal to determine which method should be used, e.g.

struct Foo
{
    void write (char x){}
    void write (wchar x){}
    void write (dchar x){}
}

void main()
{
  Foo f;

  f.write ('1');  // invokes the first method
  f.write ('\u0001');  // invokes the second method
  f.write ('\U00000001');  // invokes the third method
}


Three things to note in that respect:

- contrary to discussed array behavior, this did not fail to resolve the method (!)

- the only way to get a char /literal/ into the dchar method is to cast(dchar) it.

- char/wchar/dchar resolving operates differently than char[],wchar[],dchar[]


November 16, 2005
"Walter Bright" <newshound@digitalmars.com> wrote...
> "Kris" <fu@bar.com> wrote in message news:dl2p7i$11gf$1@digitaldaemon.com...
>> Don't you think the type can be inferred from the content?
>
> Not 100%.
>
>> For the sake of
>> discussion, how about this:
>>
>> 1) if the literal contains a double-wide char, it defaults to the dchar[] type.
>>
>> 2) if not 1, and the literal contains a wide char, it defaults to the wchar[] type.
>>
>> 3) if neither of the above, it defaults to the char[] type.
>>
>> Given the above, I can't think of a /common/ situation where casting
>> would thus be required.
>
> I did consider that for a while, but eventually came to the conclusion
> that its behavior would be surprising to someone who did not very
> carefully read the spec. Also, the distinction between the various
> character types is not obvious when looking at the rendered text,
> further making it surprising.
>
> I think it's better to now and then have to type in an extra character to nail down an ambiguity than to have a complicated set of rules to try and guess what the programmer's intent was.

======================================

It's difficult to counter that position since it appears fair and considered (though perhaps a bit thin on detail :-)

Yet, isn't the following example wholly contrary to what you claim?

struct Foo
{
    void write (char x){}
    void write (wchar x){}
    void write (dchar x){}
}

void main()
{
  Foo f;

  f.write ('1');  // invokes the first method
  f.write ('\u0001');  // invokes the second method
  f.write ('\U00000001');  // invokes the third method
}

To be clear: the compiler is doing explicitly what I describe above, but for char/wchar/dchar as opposed to their array counterparts. I feel this clearly contradicts your answer above, so what's the deal? Please? I'd really like to understand the basis for this distinction ...
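
For contrast, here is the array counterpart of the Foo example (Bar is just a made-up mirror of it):

struct Bar
{
    void write (char[]  x){}
    void write (wchar[] x){}
    void write (dchar[] x){}
}

void main()
{
  Bar b;

  // b.write ("1");   // fails to resolve: all three overloads match
  b.write ("1"c);     // only a suffix (or a cast) settles it
}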


P.S. I don't wish to somehow eliminate the suffix ~ I think that's great. What I'm after is a means to render the suffix redundant in the common case (common case as defined by the developer). It might also resolve the apparent inconsistency above?


November 16, 2005
On Tue, 15 Nov 2005 22:02:51 +0000, Bruno Medeiros <daiphoenixNO@SPAMlycos.com> wrote:
> Bruno Medeiros wrote:
>> Damn... :(
>> I'm no Unicode expert, but from what I've just read in Wikipedia about Unicode, quite a few of the above posts in this discussion have been made with an incorrect notion of "code point". A code point is an integer number for each Unicode grapheme/symbol. What actually varies is the (encoding) code unit:
>>  http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
>> "All normal unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded one or more of these code units will represent a Unicode code point."
>>  http://en.wikipedia.org/wiki/Unicode
>> "Unicode takes the role of providing a unique code point — a number, not a glyph — for each character."
>>  Is this not correct?
>>
>
> Also, it seems that the term "code value" is used with the same meaning as "code unit":
>
> http://en.wikipedia.org/wiki/Unicode
> "In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code value actually manifests as a bit sequence). In the other cases, each code point may be represented by a variable number of code values."
>
> (frankly I don't like that term very much. Prefer "code unit")

Damn, sorry. So...

"character" is "grapheme" is "code point".
"code unit" is part of a "code point".

Regan
November 16, 2005
"Derek Parnell" <derek@psych.ward> wrote ...
>> Before you say "I don't have to", consider that those functions skirt
>> the issue by mostly avoiding such method overloading, or by kludging the
>> names to do so. That doesn't mean other designs are somehow less valid.
>> Does it?
>
> Just like yourself, I'm suggesting *alternatives* not *replacements*.

True.

> The judicious use of named (string) literals will also pay dividends at
> code maintenance time.

Amen to that.


November 16, 2005
> "character" is "grapheme" is "code point".
> "code unit" is part of a "code point".

Thanks Bruno and Regan!

A common language makes it a lot easier to communicate.
And it's even nicer when the language is known to be right.