November 14, 2005
On Mon, 14 Nov 2005 16:12:32 +0200, Georg Wrede wrote:


[snip]

> 
> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
> 
> There's no end to the misunderstanding this proliferates and breeds.

Ain't that the truth! Once I finally understood that D chars were not C's chars, a whole lot of my thinking had to be revised. Calling UTF-8 code units 'char' was a huge mistake. I now have my GetFileText() function that reads a ubyte stream that might be in UTF-8, -16, -32, or ASCII and converts it into the D char[] format. However, I find that a lot of my applications' internal character representation is now in dchar[] strings so that I can do character manipulations. Doing so with char[] and wchar[] strings is a PITA.
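[Derek's actual GetFileText() isn't shown; the sketch below is a hypothetical reconstruction of the idea -- sniff the BOM, then normalise to char[]. It only handles the BOMs it checks for and assumes host byte order for the UTF-16 cast.]

```d
import std.file;   // read()
import std.utf;    // toUTF8()

char[] getFileText(char[] name)
{
    ubyte[] raw = cast(ubyte[]) std.file.read(name);

    // UTF-8 BOM: EF BB BF
    if (raw.length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF)
        return cast(char[]) raw[3 .. raw.length];

    // UTF-16 little-endian BOM: FF FE (assumes a little-endian host;
    // the big-endian case is omitted for brevity)
    if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
        return toUTF8(cast(wchar[]) raw[2 .. raw.length]);

    // no BOM: assume plain ASCII/UTF-8
    return cast(char[]) raw;
}
```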

-- 
Derek Parnell
Melbourne, Australia
15/11/2005 1:56:09 AM
November 14, 2005
In article <4374598B.30604@nospam.org>, Georg Wrede says...
>
>The compiler knows (or at least _should_ know) the character width of the source code file. Now, if there's an undecorated string literal in it, then _simply_assume_ that is the _intended_ type of the string!
>
>(( At this time opponents will say "what if the source code file gets converted into another character width?" -- My answer: "Tough, ain't it!", since there's a law against gratuituous mucking with source code.  ))

Well, that's a nice attitude. It makes copy-and-paste impossible, and makes writing code from HTML, plain text, and books impossible too, since the code's behaviour now depends on your language environment. I'm sure that won't cause any bugs at all ;-)

Nick


November 14, 2005
Sorry if this gets posted twice. The web-interface gave me a cryptic message on the first try.

In article <43789B50.7090203@nospam.org>, Georg Wrede says...

>I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
>
>There's no end to the misunderstanding this proliferates and breeds.

The problem is not D calling UTF-8 code units char, it is C calling bytes char.
;)
Both Java's and C#'s char are different from C. Their users cope. This is just
one more of D's subtle differences from C. (Like == null, default assert(0) in
switch, etc.)

-----

Currently in D, char seems to have a prominent role over w/dchar:

- std.file uses char[] for file and pathnames (and does toUTF16z on Win32)
- std.string split/find only works on char[]
- std.regexp uses char[] (but does it through an alias)
- etc...

Ideally, all string functions should accept all string types. split/find etc should become generic and work on int[], double[] etc.

There is one solution to this without adding lots of code and without changing current function: Templated polymorphism (through implicit template instantiation).

But (unfortunately there is one), this will not work if string literals remain typeless (with the type inferred from the wrong end).

Say we have a template:

template write(T : T[]) {
    void write(T[] x) { ... }
}

How would a future compiler with implicit template instantiation instantiate:

write("hello");

?

-----

Interestingly:

#void write(char[] a) { printf("1\n"); }
#void write(dchar[] a) { printf("2\n"); }
#void write(wchar[] a) { printf("3\n"); }
#
#int main() {
#  typeof("hej") test = "hej";
#  write(test);                          // Compiles and prints "1\n"
#  writef("%s\n",typeid(typeof("hej"))); // Prints char[3]
#  // write("hej");                      // Doesn't compile
#  return 0;
#}

Lines 6-7 and line 9 ought to behave identically. This feels like an inconsistency in the language...

Type-inferred string literals are just as bad as return-type function overloading.

/Oskar


November 14, 2005
Oskar Linde wrote:

> Currently in D, char seems to have a prominent role over w/dchar: 
> 
> - std.file uses char[] for file and pathnames (and does toUTF16z on Win32)
> - std.string split/find only works on char[]
> - std.regexp uses char[] (but does it through an alias)
> - etc...

You meant to say "char[] has a prominent role over dchar[]",
which is different. "char" is good for ASCII, but otherwise
you'll see a lot of dchars in the actual codepoint manipulation...

--anders
November 14, 2005
Georg Wrede wrote:
> 
> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!

I agree that it's a tad weird, especially since a char literal can only hold a subset of valid UTF-8 values (i.e. the single-byte ones).

> There's no end to the misunderstanding this proliferates and breeds.
> 
> For example, who would have thought that it is illegal to chop a char[] at an arbitrary point? I bet hardly anybody. Not 2 weeks ago anyway.

So long as you're only dealing with ASCII values (common for us US folks) you *can* chop a char[] anywhere you want.  Though this is obviously risky for some applications.
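[An illustrative sketch, not from the thread, of why byte-wise chopping breaks outside ASCII: "é" encodes as two bytes in UTF-8 (C3 A9), so a slice taken at byte index 2 ends in the middle of a character.]

```d
import std.stdio;
import std.utf;

void main()
{
    char[] s = "aé";          // 3 bytes: 61 C3 A9
    writefln(s.length);       // prints 3 -- bytes, not characters
    writefln(stride(s, 1));   // prints 2 -- the é occupies two bytes
    char[] bad = s[0 .. 2];   // a legal slice, but invalid UTF-8
    // validate(bad);         // would throw: truncated multibyte sequence
}
```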

> Parallax thinking. That's what we are doing with our (so called) strings. (Parallax thinking is a term in Bubblefield, which essentially means "thinking of a red car that is blue".)
> 
>    If we had the types
>    utf8, utf16, utf32,
>    while forbidding char,
>    we could get somewhere.

I think this would be misleading, as it seems to imply that a utf8 literal may occupy 1-4 bytes.  This would make sense for strings however.

> When somebody wants to do array twiddling while disregarding the risk of encountering a multi-byte utf8, then one should have to be explicit about it.
>
> Equally, when somebody wants to twiddle with arrays of "virtually just byte size entities, but in the rare case potentially multibyte" entities, then _this_ should have to be explicit. This would also keep the programmer himself remembering what he is twiddling.

This is why I implemented readf using dchars.  It's just a lot simpler to manage this sort of thing when you don't have to worry about multibyte characters.  The name "Dchar" also implies to me that it's the char type for the D language, even if it really means "double-word char."


Sean
November 14, 2005
Oskar Linde wrote:
> 
> Currently in D, char seems to have a prominent role over w/dchar: 
> 
> - std.file uses char[] for file and pathnames (and does toUTF16z on Win32)
> - std.string split/find only works on char[]
> - std.regexp uses char[] (but does it through an alias)
> - etc...

This is something that really irritates me about Windows.  If I understand correctly, Posix systems all implement wchar_t as UTF-32, while Win32 implements wchar_t as UTF-16 (because it had too much legacy code to easily transition to UTF-32).  I think the best approach may be to deprecate all ASCII functions in favor of their wide counterparts, and then add something like this to Phobos:

template toPlatformString(R,T) {
    R toPlatformString( T str ) {
        version(Win32)
            return toUTF16z(str);
        else
            return toUTF32z(str);
    }
}

version(Win32) {
    alias toPlatformString!(wchar*,char[]) toPlatformString;
    alias toPlatformString!(wchar*,wchar[]) toPlatformString;
    alias toPlatformString!(wchar*,dchar[]) toPlatformString;
}
else {
    alias toPlatformString!(dchar*,char[]) toPlatformString;
    alias toPlatformString!(dchar*,wchar[]) toPlatformString;
    alias toPlatformString!(dchar*,dchar[]) toPlatformString;
}

The alternative would be to offer functions to encode Unicode strings to C code pages, which just seems horrible.



Sean
November 14, 2005
On Mon, 14 Nov 2005 12:44:20 +0000, Bruno Medeiros <daiphoenixNO@SPAMlycos.com> wrote:
> Regan Heath wrote:
>> On Sat, 12 Nov 2005 15:30:22 +0000, Bruno Medeiros  <daiphoenixNO@SPAMlycos.com> wrote:
>>
>>> Regan Heath wrote:
>>>
>>>> On Fri, 11 Nov 2005 14:03:36 -0800, Kris <fu@bar.com> wrote:
>>>>
>>>> Yes, but now a change in compiler options can change how the application behaves (i.e. calling a function that has the same name but a different purpose to the intended one)
>>>> I'm with Derek above and I can't think of a better solution than the current behaviour. Take this example (similar to your original one):
>>>>  void write (char[] x){}
>>>> void write (wchar[] x){}
>>>>  void main()
>>>> {
>>>>   write ("part 1");
>>>> }
>>>> the compiler will error as it cannot decide which function to call. The options that I can think of:
>>>> A - pick one at random
>>>> B - pick one using some sort of rule, i.e. char[] first, then wchar[], then dchar[]
>>>> C - pick one based on string contents, i.e. dchar literal, wchar literal, ASCII
>>>> D - pick one based on file encoding
>>>> E - pick one based on compiler switch
>>>> F - (current behaviour) require a string suffix of 'c', 'w' or 'd' to disambiguate
>>>>
>>>  >...
>>>
>>> There is also the option of all undecorated string literals having a default type (like char[] for instance), instead of "it being inferred from the context."
>> This is similar to B above, except that this rule would cause an error here:
>> void write(dchar[] str) {}
>> void main() { write("a"); }
>>
>>> This seems best to me, at first glance at least. What consequences could there be from this approach?
>> The same as for B. Say you wrote the code above, say another 'write' function existed elsewhere, say it took a char[]: the compiler would silently call the other write function and not the one above. If the functions do the same thing, no problem; if not, a silent bug.
>>
>
> But here you would know with certainty that if the code was compiling then it was calling a char[] parameter function, and never a dchar[] one.

True, once it was "the rule" and once people got used to it, you should know.

The error shown above still bugs me though. It's the sort of thing Georg and Kris want to 'fix' with changes and this idea doesn't seem to do that.

The upshot of this rule may be that people are forced to write the interface to their library using char[] as opposed to one of the other types; this may actually be a good thing.

> In the current case, if you have only one function, you cannot tell the type of the string (and thus the parameter of the called function) just by looking at the function call.

True, there are already several places in D where you can't do this, 'out' and 'inout' parameters for example. I'm not saying we shouldn't be able to, just that it often gets sacrificed for other functionality.

In the current situation when there is 1 function you do not know by looking at the call site, but when there are 2 or more you will, i.e.

write("test"d);

>>> Well, when passing an undecorated string literal argument to a dchar or wchar parameter, it would be an error and one would have to specify the string type; however, I don't see that as an inconvenience.
>> It is more inconvenient than the current situation, which is that you have to decorate only in cases where a collision exists.
>>  Regan
> Yes, but which case is more frequent I wonder?

At the moment char[] is probably the most commonly used char type, and it may always be the case. I think that the more important internationalization etc. becomes, the more common wchar and dchar will become, and the more frequently this idea would require casts or suffixes to call wchar and dchar functions.

Of course, it's only a problem with literals, which are probably less common than actual wchar[] and dchar[] variables.

Regan
November 14, 2005
On Mon, 14 Nov 2005 16:12:32 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
>>> I'm in favour of D's current behaviour, even though it means I must
>>>  decorate some string literals.
>> As am I. The current behaviour hasn't caused me any trouble and I dislike the compiler silently making choices for me when those
>> choices have a chance of being wrong and causing bugs.
>
> I think we've royally shot ourselves in the foot -- since way back!
>
> We have these things called char, wchar and dchar. They are defined as UTF-8, UTF-16 and UTF-32. Fine.
>
> C has this thing called char, too. Many of us work in both languages.

Indeed, I work in C on a daily basis and D in my spare time.

> So? Well, what we do a lot is manipulating arrays of "char"s. While at it, in D we are doing most of this manipulating without further thinking, which results in code that essentially manipulates arrays of "ubyte". Since practically nobody here has a mother tongue that _needs_ characters of more than 8 bits, we "haven't had any trouble".

That is not what I meant. Though it's also true: if I had to parse a language involving multi-byte UTF-8 characters I would likely have had some trouble doing relatively simple things like getting the string length.

In part what I actually meant is explained by Kris when he referred to the stream libraries, I quote:

"The Stream methods are named like this specifically to avoid the problem:

write  (char[])
writeW (wchar[])
writeD (dchar[])"

I suspect he is correct in that _if_ the stream functions were all called "write" I would have encountered the problem with this:

write("test");

not knowing what function to call and requiring a suffix.

That said, I think I still prefer the current behaviour to a change which could allow a silent bug to creep in, however unlikely it may be (and I agree it's unlikely).

> ---
>
> I think we've painted ourselves in the corner by calling the UTF-8 entity "char"!!
>
> There's no end to the misunderstanding this proliferates and breeds.

This is true; technically char and char* from C should be byte and byte* in D, as C's char is an 8-bit type with "no specific encoding", whereas D's char is an 8-bit unsigned type with a specified encoding of UTF-8.

Perhaps if we had used byte in D code ported from C, it would have highlighted the differences to new D programmers.
It would certainly make things like this:

char[] a;
strlwr(a);

illegal right off the bat, which is the opposite of what D does: D actually implicitly converts the char[] to char* where required.

In order to call strlwr safely we have to use toStringz anyway, right? So why not write it:

byte* toStringz(char[] a);
byte* toStringz(wchar[] a);
byte* toStringz(dchar[] a);

and turn all the C 'char' parameter types into 'byte'?
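[A sketch of what that proposal would mean at a call site. The strlwr binding is shown only for illustration -- it is a DMC/Windows extension, not standard C -- and this toStringz body is a hypothetical implementation of the signature proposed above, not Phobos code.]

```d
extern (C) byte* strlwr(byte* s);  // illustration only

// the proposed byte* overload, sketched: copy and null-terminate
byte* toStringz(char[] a)
{
    byte[] copy = new byte[a.length + 1];
    copy[0 .. a.length] = cast(byte[]) a;
    copy[a.length] = 0;  // C strings need the trailing '\0'
    return copy.ptr;
}

void demo(char[] a)
{
    // strlwr(a);          // error: cannot convert char[] to byte*
    strlwr(toStringz(a));  // the conversion is now explicit and safe
}
```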

I suspect this change would cause a bunch of compile errors for people, but a large percentage of those may be bugs waiting to happen and the rest will be:

strlwr("literal string");

Though not likely calling strlwr ;)

And we can resolve this with either:
 - another suffix, 'z' anyone?
 - a call to toStringz (not as efficient, I know).

> For example, who would have thought that it is illegal to chop a char[] at an arbitrary point? I bet hardly anybody. Not 2 weeks ago anyway.

I did, but only because a year or so ago Arcane Jill was stomping round this place righting all the misinformation and confusion about Unicode issues and the D types etc.

> Parallax thinking. That's what we are doing with our (so called) strings. (Parallax thinking is a term in Bubblefield, which essentially means "thinking of a red car that is blue".)
>
>     If we had the types
>     utf8, utf16, utf32,
>     while forbidding char,
>     we could get somewhere.
>
> When somebody wants to do array twiddling while disregarding the risk of encountering a multi-byte utf8, then one should have to be explicit about it.
>
> Equally, when somebody wants to twiddle with arrays of "virtually just byte size entities, but in the rare case potentially multibyte" entities, then _this_ should have to be explicit. This would also keep the programmer himself remembering what he is twiddling.

D has a tradition, borrowed from C of giving you access to the bare metal. char[] etc are examples of this.

The concerns you've voiced above are the reasons a lot of people have proposed that a string type is required, in fact I'm surprised you haven't done the same.

A string type could theoretically encapsulate string data in any of the 3 types and provide access on a character by character basis instead of byte by byte. It could provide a strlen style method which would actually be correct for multibyte char[] data, and so on and so forth.
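[A minimal sketch of such a wrapper -- the name String and both methods are hypothetical, just to show the shape of the idea: store UTF-8 internally, but expose code points via std.utf.]

```d
import std.utf;  // stride(), decode()

struct String
{
    char[] data;  // UTF-8 storage

    // length in code points, not bytes (the "correct strlen" above)
    size_t length()
    {
        size_t n, i;
        while (i < data.length)
        {
            i += stride(data, i);  // bytes used by the code point at i
            n++;
        }
        return n;
    }

    // decode the code point at byte index i, advancing i past it
    dchar decodeAt(inout size_t i)
    {
        return decode(data, i);
    }
}
```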

Regan
November 14, 2005
> There's no end to the misunderstanding this proliferates and breeds.

The online manual (http://www.digitalmars.com/d/type.html) states that "char is an unsigned 8 bit UTF-8".

That means that something like the "C kind of 'char'" simply does not exist in D.

A few lines down, under Integer Promotions, there is a table that says char and wchar are converted to int, and dchar to uint.

---

So, _WHAT_ should the following program output?

import std.stdio;

void main()
{
	struct U {
		union { char c; int m; }
	}
	U* u = new U;

	for (int i = 0; i < 0xFF; i++)
	{
		u.c = i;
		int ii = u.c;

		writefln("i:%4d  ii:%4d  m:%5x m:%d", i, ii, u.m, u.m);
	}
}
November 14, 2005
On Mon, 14 Nov 2005 16:50:32 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Derek Parnell wrote:
>> void main() { SendTextToFile("test"); }
>>  Sure, it could choose any and be done with it, but would it hurt the
>> coder to be made aware that a decision is required here?
>>  I'm in favour of D's current behaviour, even though it means I must decorate some string literals.
>
> Suppose D, one day behaves like this:
>
> When using literal or other strings, D internally uses the width that is customary on that particular OS/hardware combination.
>
> Any undecorated string literal, any array and other string capable data structure defaults to this width, any library routines default to this.
>
> Still, if the programmer specifically wants a certain width, then that is used instead. (But it's ok with me if he then has to specify the width each time he defines a reference, data structure, input parameter, or literal string.)
>
> Upsides:
>
>   - normal program would get easier to write
>   - not being width dependent becomes explicit
>
>   - programs that need a certain width
>      - would stand out
>      - a width decision has been made
>
>   What would the downsides be?

As before:

1. It is possible for the compiler to pick the wrong function overload silently, causing a bug.

2. You're using X[] internally (X defined by the OS) and interfacing to libraries which may or may not support an X interface, instead supporting a [char|wchar|dchar] interface, resulting in toUTFX() calls and excess transcoding.

As I've mentioned before: ideally the encoding you use within your program is also your input and/or output encoding, and you only transcode at input and/or output, and only if required. Excess transcoding is wasteful. How wasteful? It depends. How much of a problem is it? I cannot say with any certainty.

In short the encoding you want to use internally may not be the OS encoding (tho I agree it's likely in many cases to be the OS encoding).

Regan