Thread overview
the D crowd does bobdamn Rocket Science
Nov 18, 2005 - Georg Wrede
Nov 18, 2005 - Derek Parnell
Nov 18, 2005 - Georg Wrede
Nov 18, 2005 - Sean Kelly
Nov 18, 2005 - Regan Heath
Nov 18, 2005 - Sean Kelly
Nov 19, 2005 - Regan Heath
Re: the D crowd does bobdamn Rocket Science ~ wth ICU example
Nov 18, 2005 - Kris
Nov 19, 2005 - Sean Kelly
Nov 19, 2005 - Kris
Nov 18, 2005 - Regan Heath
Nov 18, 2005 - Derek Parnell
Nov 18, 2005 - Regan Heath
Nov 20, 2005 - Georg Wrede
Nov 20, 2005 - Regan Heath
Nov 21, 2005 - Derek Parnell
Nov 21, 2005 - Regan Heath
Nov 21, 2005 - Derek Parnell
Nov 21, 2005 - Kris
Nov 20, 2005 - Georg Wrede
Nov 20, 2005 - Derek Parnell
Nov 20, 2005 - Georg Wrede
Nov 20, 2005 - Regan Heath
Nov 20, 2005 - Bruno Medeiros
Nov 20, 2005 - Georg Wrede
Nov 20, 2005 - Derek Parnell
Nov 21, 2005 - Bruno Medeiros
Nov 21, 2005 - Derek Parnell
Nov 21, 2005 - Don Clugston
Nov 21, 2005 - Oskar Linde
Nov 21, 2005 - Sean Kelly
Nov 21, 2005 - Oskar Linde
Nov 21, 2005 - Sean Kelly
Nov 23, 2005 - Bruno Medeiros
November 18, 2005
I've spent a week studying the UTF issue, and another trying to explain it. Some progress, but not enough. Either I'm a bad explainer (hence, skip dreams of returning to teaching CS when I'm old), or, this really is an intractable issue. (Hmmmm, or everybody else is right, and I need to get on the green pills.)

My almost final try: Let's imagine (please, pretty please!), that Bill Gates descends from Claudius. This would invariably lead to cardinals being represented as Roman Numerals in computer user interfaces, ever since MSDOS times.

Then we'd have that as an everyday representation of integers. Obviously we'd have the functions r2a and a2r for changing string representations between Arabic ("1234...") and Roman ("I II III ...") numerals.

To make this example work we also need to imagine that we need a notation for roman numerals within strings. Let this notation be:

"\R" as in "\RXIV" (to represent the number 14)

(Since this in reality is not needed, we have to (again) imagine that it is, for Historical Reasons -- the MSDOS machines sometimes crashed when there were too many capital X on a command line, but at the time nobody found the reason, so the \R notation was created as a [q&d] fix.)

So, since it is politically incorrect to write "December 24", we have to write "December XXIV" but since the ancient bug lurks if this file gets transferred to "a major operating system", we have to be careful and write "December \RXXIV".

Now, programmers are lazy, and they end up writing "\Rxxiv" and getting all kinds of error messages like "invalid string literal". So a few anarchist programmers decided to implement the possibility of writing lower case roman numerals, even if the Romans themselves disapproved of it from the beginning.

The prefix \r is already taken, so they had two choices: either make computers smart enough to understand \Rxxiv, which would risk making Bill angry, or come up with another prefix. They went with the prefix, and chose \N (this choice is a unix inside joke).

---

Then a compiler guru (who happened to descend from Asterix the Gaul) decided to write a new language. In the midst of all that, he stumbled upon the Roman issue. Being diligent (which I've wanted to be all my life too but never succeeded (go ask my mother, my teacher, my bosses)), he decided to implement strings in a non-breakable way.

So now we have:

char[], Rchar[] and Nchar[], the latter two being for situations where the string might contain [expletive deleted] roman values.

The logical next step was to decorate the strings themselves, so that the computer can unambiguously know what to assign where. Therefore we now have "", ""R and ""N kind of strings. Oh, and to be totally unambiguous and symmetric, the redundant ""C was introduced to explicitly denote the non-R, non-N kind of string, in case such might be needed some day.

Now, being modern, the guru had already made the "" kinds of strings Roman Proof, with the help of Ancient Gill, an elusive but legendary oracle.

---

The III wise men had also become aware of this problem space. Since everything in modern times grows exponentially, the letters X, C and M (for ten, hundred and thousand), would sooner than later need to be accompanied by letters for larger numbers. For a million M was already taken, so they chose N. And then G, T, P, E, Z, Y for giga, tera, peta, exa, zetta and yotta. Then the in-betweens had to be worked out too, for 5000, 5000000 etc. 50 was already L and 500 was D..... whatever, you get the picture. .-)

So, they decided that, to make a string spec that lasts "forever" the new string had to be stored with 32 bits per character. (Since exponential growth (it's true, just look how Bill's purse grows!), is reality, they figured the numerals would run out of letters, and that's why new glyphs would have to be invented eventually, ad absurdum. 32 bits would carry us until the end of the Universe.) They called it N. This was the official representation of strings that might contain roman numerals way into the future.

Then some other guys thought "Naaw, t's nuff if we have strings that take us till the day we retire, so 16 bits oughtta be plenty." That became the R string. Practical Bill adopted the R string. Later the other guys had to admit that their employer might catch the "retiring plot", so they amended the R string to SOMETIMES contain 32 bits.

Now, Bill, practical as he is, ignored the issue (probably on the same "retiring plot"). And what Bill does defines what is right (at least with suits, and hey, they rule -- as opposed to us geeks).

---

Luckily, II blessed-by-Bob Sourcerers (notice the spelling) thought the R and N stuff was wasting space, was needed only occasionally, and was in general cumbersome. Everybody's R but Bill's had to be able to handle 32 bits every once in a while, and the N stuff really was overkill.

They figured "The absolute majority of crap needs 7 bits, the absolute majority of the rest needs 9 bits, and the absolute majority of the rest needs 12 bits. So there's pretty little left after all this -- however, since we are blessed-by-Bob, and we do stuff properly, we won't give up until we can handle all this, and handle it gracefully."

They decided that their string (which they christened ""C) has to be compact, handle 7-bit stuff as fast as non-roman-aware programs do, 9-bit stuff almost as fast as the R programs, and it has to be lightning fast to convert to and from. Also, they wanted the C strings to be usable as much as possible by old library routines, so for example, the old routines should be able to search and sort their strings without upgrades. And they knew that strings do get chopped, so they designed them so that you can start wherever, and just by looking at the particular octet, you'd know whether it's proper to chop the string there. And if it isn't, it should be trivial to look a couple of octets forward (or even back), and just immediately see where the next breakable place is. Ha, and they wanted the C strings to be endianness-proof!!
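
(Not from the original post: a minimal D sketch of that "look at one octet and know whether you may chop here" property. The helper names are made up.)

// In UTF-8, continuation octets always look like 10xxxxxx.
// Any octet that is NOT a continuation octet starts a new code point,
// so the string may be chopped right before it.
bool isBreakable(char c)
{
    return (c & 0xC0) != 0x80;
}

// Step forward to the next breakable position at or after index i.
size_t nextBreak(char[] s, size_t i)
{
    while (i < s.length && !isBreakable(s[i]))
        i++;
    return i;
}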

The II were already celebrities with the Enlightened, so it was decided that the C string will be standard on POSIX systems. Smart crowd.

*** if I don't see light here, I'll write some more one day ***

---

If I write the following strings here, and somebody pastes them in his source code,

"abracadabra"
"räyhäpäivä"
"ШЖЯЮЄШ"

compiles his D program, and runs it, what should (and probably will!) happen, is that the program output looks like the strings here.

If the guy has never heard of our Unbelievable utf-discussion, he is probably never aware that UTF or any other such crap is or has been involved. (Hell, I've used Finnish letters in my D source code all the time, and never thought anything of it.)

After having seen this discussion, he gets nervous, and quickly changes all his strings so that they are ""c ""w and ""d decorated. From then on, he hardly dares to touch strings that contain non-US content. Like us here.

The interesting thing is, did I originally write them in UTF-8, UTF-16 or UTF-32?

How many times were they converted between these widths while travelling from my keyboard to this newsgroup to his machine to the executable to the output file?

Probably they've been in UTF-7 too, since they've gone through mail transport, which still is from the previous millennium.
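
(Again not in the original post, but a small round-trip sketch of that point, using std.utf: whatever widths the text passes through, the code points survive. The literal is just an example.)

import std.utf;

void main()
{
    char[] original = "räyhäpäivä";               // UTF-8 in the source
    dchar[] widest  = toUTF32(toUTF16(original)); // via UTF-16, then UTF-32
    char[] back     = toUTF8(widest);             // and back again
    assert(back == original);                     // nothing was lost
}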

---

At this point I have to ask, are there any folks here who do not believe that the following proves anything:

> Xref: digitalmars.com digitalmars.D.bugs:5436 digitalmars.D:29904
> Xref: digitalmars.com digitalmars.D.bugs:5440 digitalmars.D:29906
> 
> Those show that the meaning of the program does not change when the
> source code is transliterated to a different UTF encoding.
> 
> They also show that editing code in different UTF formats, and
> inserting "foreign" text even directly to string literals, does
> survive intact when the source file is converted between different
> UTF formats.
> 
> Further, they show that decorating a string literal to c, w, or d,
> does not change the interpretation of the contents of the string
> whether it contains "foreign" literals directly inserted, or not.
> 
> Most permutations of the above 3 paragraphs were tested.

(Oh, correction to the last line: "_All_ cross permutations of the 3 paragraphs were tested.")

Endianness was not considered, but hey, with wrong endianness, either your text editor can't read the file to begin with, or if it can, then you _can_ edit the strings with even more "foreign characters" and still be ok!

###################################################
I hereby declare, that it makes _no_ difference
whatsoever in which width a string literal is stored,
as long as the compiler implicitly casts it when
it gets used.
###################################################
I hereby also declare, that implicit casts of strings
(be they literal or heap or stack allocated), carries
no risks whatsoever. Period.
###################################################
I hereby declare that string literal decorations
are not only unneeded, they create an enormous
amount of confusion. (Even we are totally bewildered,
so _every_ newcomer to D will be that too.) There
are _no_ upsides to them.
###################################################
I hereby declare that it should be illegal to
implicitly convert char or wchar to any integer
type. Further it should be illegal to even cast
char or wchar to any integer type. The cast should
have to be via a cast to void! (I.e. difficult but
possible.) With dchar even implicit casts are ok.
Cast from char or wchar via dchar should be illegal.
(Trust me, illegal. While at the same time even
implicit casts from char[] and wchar[] to each
other and to and from dchar[] are ok!) Casts
between char, wchar and dchar should be illegal,
unless via void. (A small example of what is
currently legal follows below.)
###################################################
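
(An illustration, mine not Georg's, of the conversions the declaration above wants outlawed; as far as I can tell, all of it is accepted by the compiler today.)

void main()
{
    char c = 'x';
    int i = c;                 // implicit char -> int, currently legal
    short s = cast(short) c;   // explicit char -> short, currently legal
    wchar w = 'y';
    long n = w;                // implicit wchar -> long, currently legal
}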

A good programmer would use the same width all over the place. An even better programmer would typedef his own anyway. If an idiot has his program convert width at every other assignment, then he'll have other idiocies in his code too. He should go to VB.

---

  But some other things are (both now, and even
  if we fix the above) downright hazardous, and
  should cause a throw, and in non-release
  programs a runtime assert failure:

Copying any string to a fixed length array, _if_ the array is either wchar[] or char[]. (dchar[] is ok.) The (throw or) assert should fail if the copied string is not breakable where the receiving array gets full.

whatever foo = "ää";   // foo and "" can be any of c/w/d
char[3] barf = foo;    // write a cast if needed
// "ää" is 4 bytes in UTF-8, so a 3-unit destination cuts the second ä in half.

Same goes for wchar[3].
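
(A sketch of that check, assuming a char[] destination; copyChecked is a made-up name.)

// Refuse to copy into a fixed-length char array if the cut would land
// in the middle of a multi-octet UTF-8 sequence.
void copyChecked(char[] dst, char[] src)
{
    size_t n = dst.length < src.length ? dst.length : src.length;
    if (n < src.length && (src[n] & 0xC0) == 0x80)
        throw new Exception("copy would break a UTF-8 sequence");
    dst[0 .. n] = src[0 .. n];
}

A wchar[] version would check for an unpaired high surrogate at the cut instead.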

---

Once we agree on this, then it's time to see if some more AJ stuff is left to fix. But not before.
November 18, 2005
On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:

[snip]

It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does.

   dchar[] y;
   wchar[] x;

   x = cast(wchar[])y;

does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.

  x = std.utf.toUTF16(y);

However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?

   foo(wchar[] x) { . . .  } // #1
   foo(dchar[] x) { . . .  } // #2
   dchar y;
   foo(y);  // Obviously should call #2
   foo("Some Test Data"); // Which one now?

Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?

D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!

-- 
Derek Parnell
Melbourne, Australia
18/11/2005 10:42:31 PM
November 18, 2005
Derek Parnell wrote:
> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
> 
> [snip]
> 
> It seems that you use the word 'cast' to mean conversion of one utf
> encoding to another. However, this is not what D does.
> 
>    dchar[] y;
>    wchar[] x;
> 
>    x = cast(wchar[])y;
> 
> does *not* convert the content of 'y' to utf-16 encoding. Currently you
> *must* use the toUTF16 function to do that.

If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)

A cast should do precisely what our toUTFxxx functions currently do.
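
(Purely illustrative, not Georg's code: one way to keep the bit pattern via a union. Caveat: nothing about the array is rescaled, so utf16.length still reports the number of char units stored.)

// A second, differently-typed view of the same array reference,
// with no transcoding taking place.
union RawStringView
{
    char[]  utf8;
    wchar[] utf16;
}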

> However, are you saying that D should change its behaviour such that
> it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on
> function calls?

Both. And everywhere else (in case we forgot to name some situation).

>    foo(wchar[] x) { . . .  } // #1
>    foo(dchar[] x) { . . .  } // #2
>    dchar y;
>    foo(y);  // Obviously should call #2
>    foo("Some Test Data"); // Which one now?

Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!

(Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)

> Given just the function signature and an undecorated string, it is not
> possible for the compiler to call the 'correct' function. In fact, it
> is not possible for a person (other than the original designer) to
> know which is the right one to call?

That is (I'm sorry, no offense), based on a misconception.

Please see my other posts today, where I try to clear (among other things) this very issue.

> D has currently got the better solution to this problem; get the
> coder to identify the storage characteristics of the string!

He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

This also I try to explain in the other posts.

(The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?)

We are all seeing bogeymen all over the place, where there are none. It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic.

Aaaaaaaaaaah, now I got it. It's been Halloween again. Sigh!
November 18, 2005
Georg Wrede wrote:
> Derek Parnell wrote:
>> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
>>
>> [snip]
>>
>> It seems that you use the word 'cast' to mean conversion of one utf
>> encoding to another. However, this is not what D does.
>>
>>    dchar[] y;
>>    wchar[] x;
>>
>>    x = cast(wchar[])y;
>>
>> does *not* convert the content of 'y' to utf-16 encoding. Currently you
>> *must* use the toUTF16 function to do that.
> 
> If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
> 
> A cast should do precisely what our toUTFxxx functions currently do.

I somewhat agree.  Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK).  On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.  Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly.  And either way, I would love to have UTF conversion for strings supported in-language.  It does make some sense, given that the three encodings exist as distinct value types in D already.

>> However, are you saying that D should change its behaviour such that
>> it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on
>> function calls?
> 
> Both. And everywhere else (in case we forgot to name some situation).

I disagree.  While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.

>> D has currently got the better solution to this problem; get the
>> coder to identify the storage characteristics of the string!
> 
> He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

True.

> This also I try to explain in the other posts.
> 
> (The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?)
> 
> We are all seeing bogeymen all over the place, where there are none. It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic.

What I like about the current behavior (no implicit conversion) is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate. That said, I agree that the overall runtime cost is likely consistent between a program with and without implicit conversion--either the API calls will have overloads for all types and thus allow you to avoid conversion, or they will only support one type and require conversion if you've standardized on a different type. It may well be that concerns over implicit conversion are unfounded, but I'll have to give the matter some more thought before I can say one way or the other. My current experience with D isn't such that I've had to deal with this particular issue much.


Sean
November 18, 2005
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Derek Parnell wrote:
>> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
>>  [snip]
>>  It seems that you use the word 'cast' to mean conversion of one utf
>> encoding to another. However, this is not what D does.
>>     dchar[] y;
>>    wchar[] x;
>>     x = cast(wchar[])y;
>>  does *not* convert the content of 'y' to utf-16 encoding. Currently you
>> *must* use the toUTF16 function to do that.
>
> If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>
> A cast should do precisely what our toUTFxxx functions currently do.

You have my vote here.

>> However, are you saying that D should change its behaviour such that
>> it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on
>> function calls?
>
> Both. And everywhere else (in case we forgot to name some situation).

The main argument against this last time it was proposed was that an expression containing several char[] types would implicitly convert any number of times during the expression. This transcoding would be inefficient, and silent, and thus bad, eg.

 char[] a = "this is a test string";
wchar[] b = "regan was here";
dchar[] c = "georg posted this thing";

 char[] d = c[0..7] ~ b[6..10] ~ a[10..14] ~ c[20..$] ~ a[14..$] ~ c[16..17];
 //supposed to be: georg was testing strings :)

How many times does the above transcode using the current implicit conversion rules? (The last time this topic was aired it branched into a discussion about how these rules could change to improve the situation.)

>>    foo(wchar[] x) { . . .  } // #1
>>    foo(dchar[] x) { . . .  } // #2
>>    dchar y;
>>    foo(y);  // Obviously should call #2
>>    foo("Some Test Data"); // Which one now?
>
> Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Which is what it does currently, right?

> I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!

I'm still not convinced. I will raise my issues in the later posts you promise.

> (Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)

I tend to agree here but as I say above, last time this aired people complained about this very thing.

>> Given just the function signature and an undecorated string, it is not
>> possible for the compiler to call the 'correct' function. In fact, it
>> is not possible for a person (other than the original designer) to
>> know which is the right one to call?
>
> That is (I'm sorry, no offense), based on a misconception.
>
> Please see my other posts today, where I try to clear (among other things) this very issue.

Ok.

>> D has currently got the better solution to this problem; get the
>> coder to identify the storage characteristics of the string!
>
> He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
>
> This also I try to explain in the other posts.

Ok.

Regan
November 18, 2005
On Fri, 18 Nov 2005 09:48:51 -0800, Sean Kelly <sean@f4.ca> wrote:
> Georg Wrede wrote:
>> Derek Parnell wrote:
>>> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
>>>
>>> [snip]
>>>
>>> It seems that you use the word 'cast' to mean conversion of one utf
>>> encoding to another. However, this is not what D does.
>>>
>>>    dchar[] y;
>>>    wchar[] x;
>>>
>>>    x = cast(wchar[])y;
>>>
>>> does *not* convert the content of 'y' to utf-16 encoding. Currently you
>>> *must* use the toUTF16 function to do that.
>>  If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>>  A cast should do precisely what our toUTFxxx functions currently do.
>
> I somewhat agree.  Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK).  On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.  Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly.  And either way, I would love to have UTF conversion for strings supported in-language.  It does make some sense, given that the three encodings exist as distinct value types in D already.

Making the cast explicit sounds like a good compromise to me.

The way I see it casting from int to float is similar to casting from char[] to wchar[]. The data must be converted from one form to another for it to make sense; you'd never 'paint' an 'int' as a 'float', it would be meaningless, and the same is true for char[] to wchar[].

The correct way to paint data as char[], wchar[] or dchar[] is to paint a byte[](or ubyte[]). In other words if you have some data of unknown encoding you should be reading it into byte[](or ubyte[]) and then painting as the correct type, once it is known.
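
(A small sketch of that read-as-bytes-then-label approach; the file name is made up and the snippet assumes the data turns out to be UTF-8.)

import std.file;

void main()
{
    ubyte[] raw = cast(ubyte[]) std.file.read("input.txt"); // encoding unknown yet
    // ... inspect a BOM, a header, or other metadata here ...
    char[] text = cast(char[]) raw;  // now painted as UTF-8; no transcoding done
}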

Regan
November 18, 2005
On Fri, 18 Nov 2005 15:31:48 +0200, Georg Wrede wrote:

> Derek Parnell wrote:
>> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
>> 
>> [snip]
>> 
>> It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does.
>> 
>>    dchar[] y;
>>    wchar[] x;
>> 
>>    x = cast(wchar[])y;
>> 
>> does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.
> 
> If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
> 
> A cast should do precisely what our toUTFxxx functions currently do.

Agreed. There are times, I suppose, when the coder does not want this to happen, but those cases could be written with a cast(byte[]) to avoid it.

>> However, are you saying that D should change its behaviour such that it should always implicitly convert between encoding types? Should this happen only with assignments or should it also happen on function calls?
> 
> Both. And everywhere else (in case we forgot to name some situation).

We have problems with inout and out parameters.

  foo(inout wchar[] x) {}

  dchar[] y = "abc";
  foo(y);


In this case, if automatic conversion took place, it would have to do it twice. It would be like doing ...

   auto wchar[] temp;
   temp = toUTF16(y);
   foo(temp);
   y = toUTF32(temp);

>>    foo(wchar[] x) { . . .  } // #1
>>    foo(dchar[] x) { . . .  } // #2
>>    dchar y;
>>    foo(y);  // Obviously should call #2
>>    foo("Some Test Data"); // Which one now?
> 
> Test data is undecorated, hence char[]. Technically on the last line above it could pick at random, when it has no "right" alternative, but I think it would be Polite Manners to make the compiler complain.

Yes, and that's what happens now.

> I'm still trying to get through the notion that it _really_does_not_matter_ what it chooses!

I disagree. Without knowing what the intention of the function is, one has no way of knowing which function to call.

Try it. Which one is the right one to call in the example above? It is quite possible that there is no right one.

If we have automatic conversion and it chooses one at random, there is no way of knowing that it's doing the 'right' thing to the data we give it. In my opinion, it's a coding error and the coder needs to provide more information to the compiler.

> (Of course performance is slower with a lot of unnecessary casts ( = conversions), but that's the programmer's fault, not ours.)
> 
>> Given just the function signature and an undecorated string, it is not possible for the compiler to call the 'correct' function. In fact, it is not possible for a person (other than the original designer) to know which is the right one to call?
> 
> That is (I'm sorry, no offense), based on a misconception.
> 
> Please see my other posts today, where I try to clear (among other things) this very issue.

I challenge you, right here and now, to tell me which of those two functions above is the one that the coder intended to be called.

If the coder had written

    foo("Some Test Data"w);

then it's pretty clear which function was intended.


For example, D rightly complains when the similar situation occurs with the various integers.

void foo(long x) {}
void foo(int x) {}
void main()
{
  short y;
  foo(y);
}

If D did implicit conversions and chose one at random I'm sure we would complain.

>> D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!
> 
> He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.

But it *DOES* make a difference when doing signature matching. I'm not talking about assignments to variables.

-- 
Derek Parnell
Melbourne, Australia
19/11/2005 8:59:16 AM
November 18, 2005
On Sat, 19 Nov 2005 09:19:28 +1100, Derek Parnell <derek@psych.ward> wrote:
>> I'm still trying to get through the notion that it
>> _really_does_not_matter_ what it chooses!
>
> I disagree. Without knowing what the intention of the function is, one has no
> way of knowing which function to call.
>
> Try it. Which one is the right one to call in the example above? It is
> quite possible that there is no right one.
>
> If we have automatic conversion and it chooses one at random, there is no
> way of knowing that it's doing the 'right' thing to the data we give it. In
> my opinion, it's a coding error and the coder needs to provide more
> information to the compiler.
>
>> (Of course performance is slower with a lot of unnecessary casts ( =
>> conversions), but that's the programmer's fault, not ours.)
>>
>>> Given just the function signature and an undecorated string, it is not
>>> possible for the compiler to call the 'correct' function. In fact, it
>>> is not possible for a person (other than the original designer) to
>>> know which is the right one to call?
>>
>> That is (I'm sorry, no offense), based on a misconception.
>>
>> Please see my other posts today, where I try to clear (among other
>> things) this very issue.
>
> I challenge you, right here and now, to tell me which of those two
> functions above is the one that the coder intended to be called.
>
> If the coder had written
>
>     foo("Some Test Data"w);
>
> then its pretty clear which function was intended.
>
>
> For example, D rightly complains when the similar situation occurs with the
> various integers.
>
> void foo(long x) {}
> void foo(int x) {}
> void main()
> {
>   short y;
>   foo(y);
> }
>
> If D did implicit conversions and chose one at random I'm sure we would
> complain.
>
>>> D has currently got the better solution to this problem; get the
>>> coder to identify the storage characteristics of the string!
>>
>> He does, at assignment to a variable. And, up till that time, it makes
>> no difference. It _really_ does not.
>
> But it *DOES* make a difference when doing signature matching. I'm not
> talking about assignments to variables.

Georg/Derek, I replied to Georg here:
  http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/5587

saying essentially the same things as Derek has above. I reckon we should combine these threads and continue in this one, as opposed to the one I linked above. Either of us can link the other thread to here with a post if you're in agreement.

Regan
November 18, 2005
"Sean Kelly" <sean@f4.ca> wrote in message
> Georg Wrede wrote:
>> Derek Parnell wrote:
>>> On Fri, 18 Nov 2005 12:56:24 +0200, Georg Wrede wrote:
>>>
>>> [snip]
>>>
>>> It seems that you use the word 'cast' to mean conversion of one utf encoding to another. However, this is not what D does.
>>>
>>>    dchar[] y;
>>>    wchar[] x;
>>>
>>>    x = cast(wchar[])y;
>>>
>>> does *not* convert the content of 'y' to utf-16 encoding. Currently you *must* use the toUTF16 function to do that.
>>
>> If somebody wants to retain the bit pattern while storing the contents to something else, it should be done with a union. (Just as you can do with pointers, or even objects! To name a few "workarounds".)
>>
>> A cast should do precisely what our toUTFxxx functions currently do.
>
> I somewhat agree.  Since the three char types in D really do represent various encodings, the current behavior of casting a char[] to dchar[] produces a meaningless result (AFAIK).  On the other hand, this would make casting strings behave differently from casting anything else in D, and I abhor inconsistencies.


Amen to that.


> Though for what it's worth, I don't consider the conversion cost to be much of an issue so long as strings must be cast explicitly.  And either way, I would love to have UTF conversion for strings supported in-language.  It does make some sense, given that the three encodings exist as distinct value types in D already.
>


FWIW, I agree. And it should be explicit, to avoid unseen /runtime/ conversion (the performance issue).

But I have a feeling that cast([]) is not the right approach here? One reason is that structs/classes can have only one opCast() method. Perhaps there's another approach for such syntax? That's assuming, however, that one does not create a special case for char[] types (per the above inconsistencies).
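
(A throwaway sketch of the opCast limitation just mentioned; MyString is made up.)

import std.utf;

struct MyString
{
    char[] data;

    // A type gets exactly one opCast(); there is no way to also offer a
    // dchar[] flavour, since overloading on return type alone is illegal.
    wchar[] opCast()
    {
        return std.utf.toUTF16(data);
    }
}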


>>> However, are you saying that D should change its behaviour such that
>>> it should always implicitly convert between encoding types? Should this
>>> happen only with assignments or should it also happen on
>>> function calls?
>>
>> Both. And everywhere else (in case we forgot to name some situation).
>
> I disagree.  While this would make programming quite simple, it would also incur hidden runtime costs that would be difficult to ferret out. This might be fine for a scripting-type language, but not for a systems programming language IMO.


Amen to that, too!



>>> D has currently got the better solution to this problem; get the coder to identify the storage characteristics of the string!
>>
>> He does, at assignment to a variable. And, up till that time, it makes no difference. It _really_ does not.
>
> True.
>
>> This also I try to explain in the other posts.
>>
>> (The issue and concepts are crystal clear, maybe it's just me not being able to describe them with the right words. Not to you, or Walter, or the others?)
>>
>> We are all seeing bogeymen all over the place, where there are none. It's like my kids this time of the year, when it is always dark behind the house, under the bed, and on the attic.
>
> What I like about the current behavior (no implicit conversion), is that it makes it readily obvious where translation needs to occur and thus makes it easy for the programmer to decide if that seems appropriate.


Right on!


> That said, I agree that the overall runtime cost is likely consistent between a program with and without implicit conversion--either the API calls will have overloads for all types and thus allow you to avoid conversion, or they will only support one type and require conversion if you've standardized on a different type.


As long as the /runtime/ penalties are clear within the code design (not quietly 'padded' by the compiler), that makes sense.


> It may well be that concerns over implicit conversion are unfounded, but I'll have to give the matter some more thought before I can say one way or the other.  My current experience with D isn't such that I've had to deal with this particular issue much.


I'm afraid I have. Both in Mango.io and in the ICU wrappers. While there are no metrics for such things (that I'm aware of) my gut feel was that 'hidden' conversion would not be a good thing. Of course, that depends upon the "level" one is talking about:

High level :: slow to medium performance
Low level :: high performance

A lot of folks just don't care about performance (oh, woe!) and that's fine. But I think it's worth keeping the distinction in mind when discussing this topic. I'd be a bit horrified to find the compiler adding hidden transcoding at the IO level (via Mango.io for example). But then, I'm a dinosaur.

So. That doesn't mean that the language should not perhaps support some sugar for such operations. Yet the difficulty there is said sugar would likely bind directly to some internal runtime support (such as utf.d), which may not be the most appropriate for the task (it tends to be character oriented, rather than stream oriented). In addition, there's often a need for multiple return-values from certain types of transcoding ops. I imagine that would be tricky via such sugar? Maybe not.

Transcoding is easy when the source content is reasonably small and fully contained within block of memory. It quickly becomes quite complex when streaming instead. That's really worth considering.

To illustrate, here's some of the transcoder signatures from the ICU code:

uint   function (Handle, wchar*, uint, void*, uint, inout Error) ucnv_toUChars;
uint   function (Handle, void*, uint, wchar*, uint, inout Error) ucnv_fromUChars;

Above are the simple ones, where all of the source is present in memory.

void   function (Handle, void**, void*, wchar**, wchar*, int*, ubyte, inout Error) ucnv_fromUnicode;
void   function (Handle, wchar**, wchar*, void**, void*, int*, ubyte, inout Error) ucnv_toUnicode;
void   function (Handle, Handle, void**, void*, void**, void*, wchar*, wchar*, wchar*, wchar*, ubyte, ubyte, inout Error) ucnv_convertEx;

And those are the ones for handling streaming; note the double pointers? That's so one can handle "trailing" partial characters. Non-trivial :-)

Thus, I'd suspect it may be appropriate for D to add some transcoding sugar. But it would likely have to be highly constrained (per the simple case). Is it worth it?





November 18, 2005
Regan Heath wrote:
> 
> Making the cast explicit sounds like a good compromise to me.
> 
> The way I see it casting from int to float is similar to casting from char[] to wchar[]. The data must be converted from one form to another for it to make sense; you'd never 'paint' an 'int' as a 'float', it would be meaningless, and the same is true for char[] to wchar[].

This is the comparison I was thinking of as well.  Though I've never tried casting an array of ints to floats.  I suspect it doesn't work, does it?  My only other reservation is that the behavior could not be preserved for casting char types, and unlike narrowing conversions (such as float to int), meaning can't even be preserved in narrowing char conversions (such as wchar to char).


Sean