August 24, 2004
In article <cgelmc$dqu$1@digitaldaemon.com>, antiAlias says...

>And what happens when just one additional byte-oriented encoding is introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed because there's no flexibility.

Yes. But I don't see D introducing a utf7char or a utfEbcdicChar type any time soon. There's no flexibility because there are no plans to extend it.

I can live with ONE native string type.
I can also live with 3 mutually interoperable native string types.

But neither of these schemes precludes anyone else from writing/using other string types.





>We /are/ talking about arrays. Perhaps if that sentence had ended "array reference rather than an *array* instance?", it might have been more clear? The point being made is that you would not be able to do anything like this:
>
>char[15] dst;
>dchar[10] src;
>
>dst = cast(char[]) src;

But that's because you already can't do:

#    dst = toUTF8(src);

and that's not going to change, so that's the end of that. Regan isn't suggesting anything beyond a bit of syntactic sugar here. No-one (so far) has suggested that dynamic arrays be converted to static arrays, or that auto-casting should write into a user-supplied fixed-size buffer.

D does not prohibit you from doing the things you have suggested. It's just that you'd have to do them explicitly. /Implicit/ conversion is only being suggested for the three D string types.
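A concrete before-and-after (a sketch using std.utf's existing toUTF8, this time with a dynamic dst):

#    dchar[10] src;
#    char[] dst;
#    dst = std.utf.toUTF8(src);   // what you must write today
#    dst = src;                   // what the suggestion would allow; same effect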




>because there's no ability via a cast() to indicate how many items from src were converted, and how many items in dst were populated. You are forced into this kind of thing:
>
>char[] dst;
>dchar[10] src;
>
>dst = cast(char[]) src;

Actually it would be:

#    dst = src;

if I had my way.


>You see the distinction? It may be subtle to some, but it's a glaring imbalance to others. The lValue must always be a reference because it's gonna' be allocated dynamically to ensure all of the rValue will fit. In the end, it's just far better to use functions/methods that provide the feedback required so you can actually control what's going on (or such that a library function can). That way, you're not restricted in the same fashion. We don't need more asymmetry in D, and this just reeks of poor design, IMO.

I don't agree with this claim. It is being suggested that:

#    wchar[] s1;
#    char[] s2 = s1;

be equivalent to:

#    wchar[] s1;
#    char[] s2 = std.utf.toUTF8(s1);

Nobody loses anything by this. All that happens is that things work more smoothly. If you want to call a different function to do your transcoding then there's nothing to stop you.

I assume that the use you have in mind is stream-internal transcoding via buffers. You can still do that. The above won't stop you.



>To drive this home, consider the decoding version (rather than the encoding
>above):
>
>char[15] src;
>dchar[] dst;
>
>dst = cast(dchar[]) src;
>
>What happens when there's a partial character left undecoded at the end of 'src'?

*ALL* that is being suggested is that, given the above declarations:

#    dst = src;

would be equivalent to:

#    dst = std.utf.toUTF32(src);

Nothing more. So the answer to your question is, it would throw an exception.
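For instance (a sketch; I catch the base Exception here because the exact class std.utf throws has varied between releases):

#    char[2] src;
#    src[0] = '\xE2';   // the first two bytes of a three-byte UTF-8
#    src[1] = '\x82';   // sequence, i.e. a dangling partial character
#    try
#    {
#        dchar[] dst = std.utf.toUTF32(src);
#    }
#    catch (Exception e)
#    {
#        // the partial character is reported, not silently dropped
#    }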



>There's nothing here to tell you that you've got a dangly bit left at
>the end of the source-buffer. It's gone. Poof! Any further decoding from the
>same file/socket/whatever is henceforth trashed, because the ball has been
>both dropped and buried. End of story.

This has got nothing to do with either transcoding or streams. This is not being suggested as a general transcoding mechanism, merely as an internal conversion between D's three string types. /General/ transcoding will have to work for all supported encodings, and won't be relying on the std.utf functions. Files and sockets won't use the std.utf functions either because they will employ the general transcoding mechanism.
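To sketch what I mean by "general mechanism" (hypothetical names; the point is that a converter carries state between calls, so a partial trailing sequence is held over rather than lost):

#    interface Transcoder
#    {
#        // Consume as much of src as possible, append the decoded result
#        // to dst, and return how many bytes of src were used. An
#        // incomplete trailing sequence stays buffered for the next call.
#        uint convert(ubyte[] src, inout dchar[] dst);
#    }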

Your transcoding ideas are excellent, but they are not relevant to this.



>There are many things a programmer should take
>responsibility for; transcoding comes under that umbrella because (a) there
>can be subtle complexity involved and (b) it is relatively expensive to
>churn through text and convert it; particularly so with the Phobos utf-8
>code.

I'm surprised at you. I would have said:

>There are some things a programmer should /not/ have to take responsibility for; transcoding comes under that umbrella because (a) there can be subtle complexity involved and (b) it is relatively expensive to churn through text and convert it;

A library is the appropriate place for this stuff. That could be Phobos, Mango, whatever. Here's my take on it:

(1) Implicit conversion between native D strings should be handled by the compiler in whatever way it sees fit. If it chooses to invoke a library function then so be it. (And it /should/ be Phobos, because Phobos ships with D.) Note that this is exactly the situation the future "cent" type will present: if I divide one "cent" by another, I would expect the D compiler to translate that into a call to a Phobos function. Why is that such a crime?

(2) Fully featured transcoding should be done by ICU, as this is a high performance mature product, and all the code is already written.

(3) Some adaptation of the ICU wrapper may be necessary to integrate this more neatly with "the D way". But I'm confident we can do this in a way which is not biased toward Phobos.



Phobos' UTF-8 code could be faster, I grant you; perhaps it will be in the next release. We're only talking about a tiny number of functions here, after all.




>What you appear to be suggesting is that this kind of thing should happen silently whilst one nonchalantly passes arguments around between methods.

I would vote for that, yes.
But only as a second choice.

My /first/ choice would be to have D standardize on one single kind of string, the wchar[].


>That's insane, so I hope that's not what you're advocating. Java, for example, does that at one specific layer (I/O), but you're apparently suggesting doing it at any old place! And several times over, just in case it wasn't good enough the first time :-)  Sorry man. This is inane. D is /not/ a scripting language; instead it's supposed to be a systems language.

How about we all get on the same side here?

Like I said, my /first/ choice would be to have D standardize on one single kind of string, the wchar[].

But we don't always get our first choice. Walter doesn't like this idea. I suggest you add your voice to mine and help try to persuade him that ONE kind of string is the way to go. BUT - if we fail to convince him - the second choice is better than the status quo. Why? Because if we fail to convince Walter that multiple string types are bad, then all that conversion is going to happen ANYWAY. It will happen because Object.toString() will (often) return UTF-8; because string literals will generate UTF-8; and because the functions in ICU will require and return UTF-16. The ONLY difference between this suggestion and the status quo is that we won't have to write "cast(char[])" and "cast(wchar[])" all over the place.

You're doing a lot of arguing /against/. What are you /for/? Arguing against a suggestion is usually interpreted as a vote for the status quo. Are you really doing that?



>That's pretty short-sighted IMO. You appear to be saying that implicit transcoding would take the place of ICU; terribly misleading.

Of course he's not.


>Excuse me for jesting, but perhaps the Bidi text algorithms plus date/numeric formatting & parsing will all fit into a single operator also? That's kind of what's being suggested.

Nothing is being suggested except syntactic sugar.

My first choice vote still goes to ditching the char. With char gone, it would be natural for wchar[] to become the "standard" D string, which fits in well with ICU. There would be no need for implicit (or even explicit) cast-conversion between wchar[] and dchar[], because dchar[] would be a specialist thing, only used by speed-efficiency fanatics, while wchar[] would be business as usual.

But if I can't have my first choice, I'll vote for implicit conversion as my second choice.

Arcane Jill


August 24, 2004
"antiAlias" <fu@bar.com> wrote in message news:cge5h3$5hv$1@digitaldaemon.com...
> On the one hand, this would be well served by an opCast() and/or opCast_r() on the primitive types; just the kind of thing suggested in a related thread (which talked about overloadable methods for primitive types).
>
> On the other hand, we're talking transcoding here. Are you gonna' limit this to UTF-8 only?

No, to between char[], wchar[], and dchar[].

> Then, since the source and destination will typically be of different sizes, do you then force all casts between these types to have the destination be an array reference rather than an instance? One that is always allocated on the fly?

I see it as implicitly calling the conversion function(s) in std.utf.

> If you do implement overloadable primitive-methods (like properties) then, will you allow a programmer to override them? So they can make the opCast() do something more specific to their own specific task?
>
> That's seems like a lot to build into the core of a language.

I don't see adding opCast() for builtin types.

> Personally, I think it's 'borderline' to have so many data types available for similar things. If there were an "alias wchar[] string", and the core Object supported that via "string toString()", and the ICU library were adopted, then I think some of the general confusion would perhaps melt somewhat.
>
> In many respects, too many choices is simply a BadThing (TM). Especially when there's precious little solid guidance to help. That guidance might come from a decent library that indicates how the types are used, and uses one obvious type (string?) consistently. Besides, if ICU were adopted, less people would have to worry about the distinction anyway.
>
> Believe me when I say that Mango would dearly love to go dchar[] only. Actually, it probably will at the higher levels because it makes life simple for everyone. Oh, and I've been accused many times of being an efficiency fanatic, especially when it comes to servers. But there's always a tradeoff somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM. Which one changes dramatically over time? Hmmmm ... let me see now ... 64bit-OS for desktops just around the corner?
>
> Even on an embedded device I'd probably go "dchar only" regarding I18N. Simply because the quantity of text processed on such devices is very limited. Before anyone shoots me over this one, I regularly write code for devices with just 4KB RAM ~ still use 16bit chars there when dealing with XML input.
>
> So what am I saying here? Available RAM will always increase in great leaps. Contemplating that the latter should dictate ease-of-use within D is a serious breach of logic, IMO. Ease of use, and above all, /consistency/ should be paramount; if you have the programmer in mind.

I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server hardware. Remember that using 4 bytes per char doesn't just consume more RAM; it consumes a LOT more processor cycles managing the extra memory (scanning, copying, initializing, gc marking, etc.).


August 24, 2004
"Arcane Jill" <Arcane_member@pathlink.com>
> >There's nothing here to tell you that you've got a dangly bit left at the end of the source-buffer. It's gone. Poof! Any further decoding from the same file/socket/whatever is henceforth trashed, because the ball has been both dropped and buried. End of story.
>
> This has got nothing to do with either transcoding or streams. This is not being suggested as a general transcoding mechanism, merely as an internal conversion between D's three string types. /General/ transcoding will have to work for all supported encodings, and won't be relying on the std.utf functions. Files and sockets won't use the std.utf functions either because they will employ the general transcoding mechanism.

Oh. I was under the impression the 'solution' being tendered was a jack-of-all-trades. If we're simply talking about converting /static/ strings between different representations, then, cool. It's done at compile-time.


> I'm surprised at you. I would have said:
>
> >There are some things a programmer should /not/ have to take responsibility for; transcoding comes under that umbrella because (a) there can be subtle complexity involved and (b) it is relatively expensive to churn through text and convert it;
>
> A library is the appropriate place for this stuff. That could be Phobos, Mango, whatever. Here's my take on it:

We agree. My point was that expensive operations such as these should perhaps not be hidden "under the covers" but handled explicitly by a library call instead. However, I clearly have the wrong impression about the extent of what this implicit conversion is attempting.


> How about we all get on the same side here?

I really think we are, Jill. My concerns are about trying to build partial versions of ICU functionality into the D language itself, rather than let that extensive and capable library take care of it. But apparently that's not what's happening here. My mistake.


> You're doing a lot of arguing /against/. What are you /for/? Arguing against a suggestion is usually interpreted as a vote for the status quo. Are you really doing that?

Nope. There appeared to be some consensus-building that transcoding could all be handled via a cast() operator. I felt it worth pointing out where and why that's not a valid approach.

The other aspect involved here is that of string-concatenation. D cannot have more than one return type for toString(), as you know. It's fixed at char[]. If string concatenation uses the toString() method to retrieve its components (as is being proposed elsewhere), then there will be multiple, redundant, implicit conversions going on where the string really wanted to be dchar[] in the first place. That is:

A a; // class instances ...
B b;
C c;

dchar[] message = c ~ b ~ a;

Under the proposed "implicit" scheme, if each toString() of A, B, and C wishes to return dchar[], then each concatenation causes an implicit conversion/encoding from each dchar[] to char[] (for the toString() return). Then another full conversion/decoding is performed back to the dchar[] assignment once each has been concatenated. This is like the Wintel 'plot' for selling more cpu's :-)

Doing this manually, one would forego the toString() altogether:

dchar[] message = c.getString() ~ b.getString() ~ a.getString();

... where getString() is a programmer-specific idiom to return the (natural) dchar[] for these classes, and we carefully avoided all those darned implicit-conversions. However, which approach do you think people will use? My guess is that D may become bogged down in conversion hell over such things.


So, to answer your question:
What I'm /for/ is not covering up these types of issues with blanket-style
implicit conversions. Something more constructive (and with a little more
forethought) needs to be done.


August 24, 2004
"Walter"  wrote in...
> > So what am I saying here? Available RAM will always increase in great leaps. Contemplating that the latter should dictate ease-of-use within D is a serious breach of logic, IMO. Ease of use, and above all, /consistency/ should be paramount; if you have the programmer in mind.
>
> I thought so, too, until I built a server app that used all dchar[] internally. Server apps tend to be driven to the limit, and reaching that limit 4x sooner means that the customer has to buy 4x more server hardware. Remember that using 4 bytes per char doesn't just consume more RAM; it consumes a LOT more processor cycles managing the extra memory (scanning, copying, initializing, gc marking, etc.).

I disagree with that for a number of reasons <g>

a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.

b) There's an implication that all server apps are text-bound. That's just not the case, but perhaps I'm being pedantic.

c) People who write servers have (traditionally) been a little more careful about what they do. There are plenty of ways to avoid allocating memory and thrashing the GC, where that's a concern. I do it all the time. In fact, one of the unwritten goals of writing server software is to avoid regularly using malloc/calloc where possible.

d) The predominant modern cpu's all have prefetch built-in, because of the marketing craze for streaming-style applications. This is great news for wide chars! It means that a server can stream dchar[] much more effectively than it could just a few years back. It's the conversions that are arguably a problem.

e) dchar is the natural width of a 32bit processor, so it's not gonna take more Processor Cycles to process those than 8bit chars. In fact, it's the other way round where UTF-8 is involved. The bottleneck used to be the front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel quad-pumped bus, and prefetch everywhere.

So, no. I simply cannot agree that using dchar[] automatically means the customer has to buy 4x more server hardware <g>


August 24, 2004
In article <cgg5ko$16ja$1@digitaldaemon.com>, antiAlias says...
>
>a) I was saying that usage of memory should not dictate language ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it would arguably make it easier to use for the everyday folks. Those who really care about squeezing bytes can, and should, deal with text encoding and decoding issues. As it is, /everyone/ currently has to deal with those issues at various levels.

Ideally, an i/o library should be able to handle most conversions invisibly, so the user can work in whatever internal format they want without worrying too much about the external format.  doFormat already takes char, wchar, and dchar arguments and outputs UTF-8 or UTF-16 as appropriate, and I designed unFormat to do pretty much the same.  I will say, however, that multibyte encoding schemes are generally not very easy to deal with, so internal use of dchars still makes a lot of sense.
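A sketch of what I have in mind (hypothetical names; the writer fixes one external encoding and transcodes whatever internal format it is handed):

import std.utf;

class TextWriter
{
    // the external format is UTF-8 here; the caller's internal format can
    // be any of the three, and conversion happens at the i/o boundary
    void write(char[] s)  { emit(s); }                  // already UTF-8
    void write(wchar[] s) { emit(std.utf.toUTF8(s)); }
    void write(dchar[] s) { emit(std.utf.toUTF8(s)); }

    private void emit(char[] utf8)
    {
        // hand the encoded bytes to the underlying device
    }
}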

>b) There's an implication that all server apps are text-bound. That's just not the case, but perhaps I'm being pedantic.

More often than not they are, especially in this age of XML and such.  And for the ones that aren't text-bound, who cares how D deals with strings? :)

>e) dchar is the natural width of a 32bit processor, so it's not gonna take more Processor Cycles to process those than 8bit chars. In fact, it's the other way round where UTF-8 is involved. The bottleneck used to be the front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel quad-pumped bus, and prefetch everywhere.

I think that UTF-8 is still more efficient in terms of memory reads and writes, simply because it tends to take up less space than UTF-16 or UTF-32.  The tradeoff is in processing time when the data ventures into the multibyte realm, which is becoming increasingly more common.  But I'll have to think some more about whether I would always write text-bound servers using dchars.  I'd like to since it would simplify handling XML data, but I'm not particularly keen on those 1GB streams suddenly becoming 4GB streams.  That's a decent bit of i/o overhead, especially if I know that little to no data in that stream lives beyond the ASCII charset.

At the end of the day, I think the programmer should be able to choose the appropriate charset for the job.  Implicit conversion between char types is a great idea and should clear up most of the confusion.  And the UTF-32 version is called "dchar" which implies to me that it's the native character format for D anyway.  Perhaps "char" should be renamed to something else?


Sean


August 24, 2004
"Sean Kelly" <sean@f4.ca> wrote ...
> I think that UTF-8 is still more efficient in terms of memory reads and writes, simply because it tends to take up less space than UTF-16 or UTF-32.  The tradeoff is in processing time when the data ventures into the multibyte realm, which is becoming increasingly more common.  But I'll have to think some more about whether I would always write text-bound servers using dchars.

I'm not saying that they should always be dchar[] :-)  I'm saying that trading language ease-of-use against the additional memory usage of dchar[] is an invalid tradeoff. You can always dip down into ubyte[] for those apps that actually care.

> I'd like to since it would simplify handling XML data, but I'm not particularly keen on those 1GB streams suddenly becoming 4GB streams.  That's a decent bit of i/o overhead, especially if I know that little to no data in that stream lives beyond the ASCII charset.

I'm rather tempted to suggest that a 1GB stream of XML has other problems to contend with <g>



August 24, 2004
In article <cggca6$19k9$1@digitaldaemon.com>, antiAlias says...
>
>I'm rather tempted to suggest that a 1GB stream of XML has other problems to contend with <g>

Its only problem is that XML is a terrible text format.  But for one server I wrote, it was not impossible that there might be a 1GB (uncompressed) data stream. This was obviously not going to a web browser :)


Sean


August 25, 2004
On Mon, 23 Aug 2004 23:05:45 -0700, antiAlias <fu@bar.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsc7u56zf5a2sq9@digitalmars.com...
>> On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu@bar.com> wrote:
>> > "Walter" <newshound@digitalmars.com> wrote in message ...
>> >> "Regan" wrote:
>> >> > There are two ways of looking at this, on one hand you're saying they should all 'paint' as that is consistent. However, on the other I'm saying they should all produce a 'valid' result. So my argument here is that when you cast you expect a valid result, much like casting a float to an int does not just 'paint'.
>> >> >
>> >> > I am interested to hear your opinions on this idea.
>> >>
>> >> I think your idea has a lot of merit. I'm certainly leaning that way.
>> >
>> > On the one hand, this would be well served by an opCast() and/or
>> > opCast_r()
>> > on the primitive types; just the kind of thing  suggested in a related
>> > thread (which talked about overloadable methods for primitive types).
>>
>> True. However, what else should the opCast for char[] to dchar[] do except
>> transcode it? What about opCast for int to dchar.. it seems to me there is
>> only 1 choice about what to do, anything else would be operator abuse.
>>
>> > On the other hand, we're talking transcoding here. Are you gonna' limit this to UTF-8 only?
>>
>> I hope not, I am hoping to see:
>>
>>        | UTF-8 | UTF-16 | UTF-32
>> -------+-------+--------+-------
>> UTF-8  |   -   |   +    |   +
>> UTF-16 |   +   |   -    |   +
>> UTF-32 |   +   |   +    |   -
>>
>> (+ indicates transcoding occurs)
>
> ========================
> And what happens when just one additional byte-oriented encoding is
> introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed
> because there's no flexibility.

---------------------------------------
You'll have to check with Walter but I believe he has no plans to add another basic type to hold any specific encoding. Encodings other than UTF-x will be done with library functions and ubyte[], ushort[], uint[], ulong[].
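For example, an EBCDIC decoder would traffic in ubyte[] rather than pretending to be a char type. A sketch, with hypothetical names and the mapping table contents omitted:

import std.utf;

dchar[256] ebcdicToUnicode;   // 256-entry EBCDIC-to-Unicode table (not filled in)

// decode EBCDIC bytes into a native D string
char[] fromEbcdic(ubyte[] src)
{
    dchar[] tmp = new dchar[src.length];
    foreach (i, b; src)
        tmp[i] = ebcdicToUnicode[b];
    return std.utf.toUTF8(tmp);
}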

>> > Then, since the source and destination will typically be of different sizes, do you then force all casts between these types to have the destination be an array reference rather than an instance?
>>
>> I don't think transcoding makes any sense unless you're talking about a
>> 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment
>> (i.e. char, wchar)
>
> ========================
> We /are/ talking about arrays. Perhaps if that sentence had ended "array
> reference rather than an *array* instance?", it might have been more clear?

---------------------------------------
Or maybe I just misread or misunderstood it :)

> The point being made is that you would not be able to do anything like this:
>
> char[15] dst;
> dchar[10] src;
>
> dst = cast(char[]) src;
>
> because there's no ability via a cast() to indicate how many items from src
> were converted, and how many items in dst were populated. You are forced
> into this kind of thing:
>
> char[] dst;
> dchar[10] src;
>
> dst = cast(char[]) src;

---------------------------------------
How is that different to what we have to do now?

  dst = toUTF8(src);

?

> You see the distinction? It may be subtle to some, but it's a glaring
> imbalance to others. The lValue must always be a reference because it's
> gonna' be allocated dynamically to ensure all of the rValue will fit. In the
> end, it's just far better to use functions/methods that provide the feedback
> required so you can actually control what's going on (or such that a library
> function can). That way, you're not restricted in the same fashion. We don't
> need more asymmetry in D, and this just reeks of poor design, IMO.

---------------------------------------
Correct me if I'm wrong, but you're suggesting the use of a library function like this...

bool toUTF8(char[] dst, dchar[] src) {}

or similar, where the caller passes a buffer for the result, and has full control of the size/location/allocation etc. of that buffer, correct?

1. The implicit casting idea does not prevent this.
2. The above function doesn't exist currently; instead we have
   char[] toUTF8(dchar[] src) {}
which is identical to what an implicit cast would do.
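If feedback is the real concern, the signature could just as easily report it (hypothetical; this is not std.utf's actual API):

// hypothetical: fills the caller-supplied dst, returns the number of chars
// written, and reports how many dchars of src were actually converted
size_t toUTF8(char[] dst, dchar[] src, out size_t consumed);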

> To drive this home, consider the decoding version (rather than the encoding
> above):
>
> char[15] src;
> dchar[] dst;
>
> dst = cast(dchar[]) src;
>
> What happens when there's a partial character left undecoded at the end of
> 'src'?

---------------------------------------
How is that even possible? Assuming src is a 'valid' UTF-8 sequence, all the characters can and will be encoded into UTF-32, producing a 'valid' UTF-32 sequence.

If 'src' _is_ invalid then an exception will be thrown much like the ones we currently have for invalid UTF sequences.

> There's nothing here to tell you that you've got a dangly bit left at
> the end of the source-buffer. It's gone. Poof! Any further decoding from the
> same file/socket/whatever is henceforth trashed, because the ball has been
> both dropped and buried. End of story.
>
>
>> Having 3 types requiring manual transcoding between them _is_ a pain.
>
> ========================
> It certainly is. That's why other languages try to avoid it at all costs.
>
> Having it done "generously" by the compiler is also a pain, inflexible, and
> likely expensive.

---------------------------------------
I'd argue that you're wrong about all but the last assertion above. Having it done by the compiler would not be:

1. a 'pain' - you wouldn't notice, and if you did and did not desire the behaviour you can manually convert just like you have to do now.

2. 'inflexible' - this idea does not preclude you doing things another way. It simply provides a default, which IMO is the sensible/correct thing to do.

It is, however, more 'expensive' than the current situation, but it's no more expensive than doing the conversion manually, which is what you currently have to do.

> There are many things a programmer should take
> responsibility for; transcoding comes under that umbrella because (a) there
> can be subtle complexity involved and (b) it is relatively expensive to
> churn through text and convert it; particularly so with the Phobos utf-8
> code.
>
> What you appear to be suggesting is that this kind of thing should happen
> silently whilst one nonchalantly passes arguments around between methods.

---------------------------------------
Yes.

> That's insane, so I hope that's not what you're advocating. Java, for
> example, does that at one specific layer (I/O)

---------------------------------------
Which it can do because it only has one string type.

> , but you're apparently
> suggesting doing it at any old place! And several times over, just in case
> it wasn't good enough the first time :-)

---------------------------------------
Yes, as we only want to go to an inefficient type temporarily, e.g.

Walter's comment:

"dchars are very convenient to work with, however, and make a great deal of sense as the temporary common intermediate form of all the conversions. I stress the temporary, though, as if you keep it around you'll start to notice the slowdowns it causes."

reflects exactly what implicit conversion will give you: the ability to store things in memory in the format you think is most efficient, and dip in and out of utf-32 _if_ required.

utf-32 is not 'required' for anything; it's simply 'convenient' for certain things. As AJ has frequently pointed out, you can encode every Unicode character in all 3 formats: UTF-8, UTF-16 and UTF-32.
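Concretely, the pattern Walter describes is just this (a sketch using the existing std.utf functions):

import std.utf;

char[] text = "stored compactly as UTF-8";

// dip into UTF-32 temporarily, where one element == one character
dchar[] wide = std.utf.toUTF32(text);

// ... index, slice, examine individual characters ...

// then drop back to the compact form for storage
text = std.utf.toUTF8(wide);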

> Sorry man. This is inane. D is
> /not/ a scripting language; instead it's supposed to be a systems language.

---------------------------------------
Which is why having a native UTF-8 type is so useful, and why being able to 'temporarily' convert to utf-32 implicitly, while storing in the most space-efficient format, is so convenient.

>> > Besides, if ICU were adopted, less people would have to worry about the distinction anyway.
>>
>> The same could be said for the implicit transcoding from one type to the
>> other.
>
> ========================
> That's pretty short-sighted IMO. You appear to be saying that implicit
> transcoding would take the place of ICU; terribly misleading.

---------------------------------------
Not at all. I implied this statement:

"if <implicit transcoding> were adopted, less people would have to worry about the distinction anyway"

or tried to.

> Transcoding is
> just a very small part of that package. Please try to reread the comment as
> "most people would be shielded completely by the library functions,
> therefore there's far fewer scenarios where they'd ever have a need to drop
> into anything else".
>
> This would be a very GoodThing for D users. Far better to have a good
> library to take case of /all/ this crap than have the D language do some
> partial-conversions on the fly, and then quit because it doesn't know how to
> provide any further functionality. This is the classic
> core-language-versus-library-functionality bitchfest all over again.
> Building all this into a cast()? Hey! Let's make Walter's Regex class part
> of the compiler too; and make it do UTF-8 decoding /while/ it's searching,
> since you'll be able to pass it a dchar[] that will be generously converted
> to the accepted char[] for you "on-the-fly".
>
> Excuse me for jesting, but perhaps the Bidi text algorithms plus
> date/numeric formatting & parsing will all fit into a single operator also?
> That's kind of what's being suggested. I believe there's a serious
> underestimate of the task being discussed.

---------------------------------------
This is an over-exaggeration; perhaps you should give Jim Carrey acting lessons <g>

>> > Believe me when I say that Mango would dearly love to go dchar[] only. Actually, it probably will at the higher levels because it makes life simple for everyone. Oh, and I've been accused many times of being an efficiency fanatic, especially when it comes to servers. But there's always a tradeoff somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM. Which one changes dramatically over time? Hmmmm ... let me see now ... 64bit-OS for desktops just around the corner?
>>
>> What you really mean is you'd dearly love to not have to worry about the
>> differences between the 3 types, implicit transcoding will give you that.
>> Furthermore it's simplicity without sacrificing RAM.
>
> ========================
> Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...

---------------------------------------
Forgive me for sharing my interpretation of what you said.


>> > Even on an embedded device I'd probably go "dchar only" regarding I18N. Simply because the quantity of text processed on such devices is very limited. Before anyone shoots me over this one, I regularly write code for devices with just 4KB RAM ~ still use 16bit chars there when dealing with XML input.
>>
>> Were you to use implicit transcoding you could store the data in memory
>> in UTF-8, or UTF-16 then only transcode to UTF-32 when required, this
>> would be more efficient.
>
> ========================
> That's a rather large assumption, don't you think? More efficient? In which
> particular way? Is memory usage or CPU usage more important in /my/
> particular applications? Please either refrain, or commit to rewriting all
> my old code more efficiently for me ... for free <g>

---------------------------------------
As stated earlier, this concept does not stop you from optimising your app in any way, shape or form; it's simply a sensible default behaviour IMO.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
On Tue, 24 Aug 2004 11:36:45 -0700, antiAlias <fu@bar.com> wrote:

<snip>

> Oh. I was under the impression the 'solution' being tendered was a
> jack-of-all-trades. If we're simply talking about converting /static/
> strings between different representations, then, cool. It's done at
> compile-time.

Nope.

We are talking about implicit conversion to/from all 3 forms of UTF-x encoded base types where required.

Does this make any sense whatsoever currently?

char[]  c = ?; //some valid utf-8 sequence
dchar[] d;

d = cast(dchar[])c;


I believe the answer is "no", reasoning: d will now (possibly) contain an invalid utf-32 sequence. The only sensible thing to do is transcode.

If you want to 'paint' a char type as a smaller type to get at its bits/bytes or shorts (snicker) you can and should use ubyte or ushort, _not_ another char type.

char types imply a default encoding, so painting one char type to another is illegal; painting something else to/from a char type is legal and useful.
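In code, the rule I mean looks like this (a sketch; today the second cast compiles as a paint):

char[] c = "some UTF-8 text";

ubyte[] raw = cast(ubyte[]) c;   // fine: you asked for the raw bytes
dchar[] d   = cast(dchar[]) c;   // should transcode (or be rejected), because
                                 // painting here yields garbage, not UTF-32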

<snip>

> We agree. My point was that expensive operations such as these should
> perhaps not be hidden "under the covers"; but explicitly handled by a
> library call instead.

On principle I totally agree with this statement.

However in this case I am simply suggesting implicit conversion where you would already have to write toUTFxx(), the idea does not _add_ any expense, only convenience.

Yes, an unaware programmer might not realise it's transcoding, and might make some inefficient choices, but that same programmer will probably also do this:

char[]  c;
dchar[] d;

c = cast(char[])d;

and create invalid utf-x sequences (some of the time), and a bug.

> However, I clearly have the wrong impression about the
> extent of what this implicit-conversion is attempting.

No.. I _think_ you understood it fine. I _think_ we just disagree about what is efficient and what is not.

>> How about we all get on the same side here?
>
> I really think we are, Jill. My concerns are about trying to build partial
> versions of ICU functionality into the D language itself, rather than let
> that extensive and capable library take care of it. But apparently that's
> not what's happening here. My mistake.

Err... it is, a small part: the part that already exists in std.utf, conversion from one UTF-x to another.

>> You're doing a lot of arguing /against/. What are you /for/? Arguing against a suggestion is usually interpreted as a vote for the status quo. Are you really doing that?
>
> Nope. There appeared to be some consensus-building that transcoding could
> all be handled via a cast() operator. I felt it worth pointing out where and
> why that's not a valid approach.

Actually I want it to transcode implicitly eg.

char[]  c;
dchar[] d;

d = c;  //transcodes

Can you enumerate your reasons why this is 'not a valid approach'? (I could search the previous posts and try to do that for you, but I might misinterpret what you meant.)

> The other aspect involved here is that of string-concatenation. D cannot
> have more than one return type for toString() as you know.

True.

> It's fixed at
> char[].

Is it?!
I didn't realise that, so this is invalid?

class A {
  dchar[] toString() {}
}

> If string concatenation uses the toString() method to retrieve its
> components (as is being proposed elsewhere), then there will be multiple,
> redundant, implicit conversions going on where the string really wanted to
> be dchar[] in the first place. That is:
>
> A a; // class instances ...
> B b;
> C c;
>
> dchar[] message = c ~ b ~ a;
>
> Under the proposed "implicit" scheme, if each toString() of A, B, and C wishes
> to return dchar[], then each concatenation causes an implicit
> conversion/encoding from each dchar[] to char[] (for the toString() return).

Assuming toString returned char[] and not dchar[], yes.
And assuming that trying to return dchar[] data without transcoding would create invalid UTF-8 sequences.

If implicit transcoding were implemented then you should be able to define the return value of your class's toString to be any of char[], wchar[] or dchar[], as it will implicitly transcode to whatever type is required.

Basically you use the most applicable type, for example AJ's Int class would use char[] (unless AJ has another reason not to) as all the string data required is ASCII and fits best in UTF-8.
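Under the proposal, something like this would become legal (purely hypothetical, since today the return type is fixed at char[]):

import std.utf;

class Int
{
    char[] toString() { return "42"; }   // pure ASCII; UTF-8 fits best
}

class Wide
{
    dchar[] toString() { return std.utf.toUTF32("mostly non-ASCII text"); }
}

dchar[] message = (new Wide).toString() ~ (new Int).toString();
// the char[] result transcodes implicitly; the dchar[] one is used as-is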

> Then another full conversion/decoding is performed back to the dchar[]
> assignment once each has been concatenated. This is like the Wintel 'plot' for selling more cpu's :-)

Not if toString can return dchar[] and all 3 classes do that.

> Doing this manually, one would forego the toString() altogether:
>
> dchar[] message = c.getString() ~ b.getString() ~ a.getString();
>
> ... where getString() is a programmer-specific idiom to return the (natural)
> dchar[] for these classes, and we carefully avoided all those darned
> implicit-conversions. However, which approach do you think people will use?

Taking into account what I have said above.. the easy one, i.e. implicit transcoding.

> My guess is that D may become bogged down in conversion hell over such
> things.
>
> So, to answer your question:
> What I'm /for/ is not covering up these types of issues with blanket-style
> implicit conversions. Something more constructive (and with a little more
> forethought) needs to be done.

I believe implicit conversion to be constructive: it stops bugs and makes string handling much easier. What we are doing here _is_ the forethought; after all, nothing has been implemented yet.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
"Regan Heath" <regan@netwin.co.nz> wrote ..
> > What happens when there's a partial character left undecoded at the end of  'src'?
> ---------------------------------------
> How is that even possible?

It happens all the time with streamed input. However, as AJ pointed out, neither you nor Walter are apparently suggesting that the cast() approach be used for anything other than trivial conversions. That is, one would not use this approach with respect to IO streaming. I had the (distinctly wrong) impression this implied-conversion was intended to be a jack-of-all-trades.

Everything else in the post is therefore cast(void)  ~  so let's stop
wasting our breath :)

If these implicit conversions are put in place, then I respectfully suggest the std.utf functions be replaced with something that avoids fragmenting the heap in the manner they currently do (for non-Latin-1 text); and it's not hard to make them an order of magnitude faster, too.
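For instance, something along these lines does the whole encode in a single allocation (a sketch that skips surrogate/validity checking):

char[] toUTF8Fast(dchar[] src)
{
    char[] buf = new char[src.length * 4];   // worst case, one allocation
    size_t n = 0;

    foreach (dchar c; src)
    {
        if (c < 0x80)
            buf[n++] = cast(char) c;
        else if (c < 0x800)
        {
            buf[n++] = cast(char) (0xC0 | (c >> 6));
            buf[n++] = cast(char) (0x80 | (c & 0x3F));
        }
        else if (c < 0x10000)
        {
            buf[n++] = cast(char) (0xE0 | (c >> 12));
            buf[n++] = cast(char) (0x80 | ((c >> 6) & 0x3F));
            buf[n++] = cast(char) (0x80 | (c & 0x3F));
        }
        else
        {
            buf[n++] = cast(char) (0xF0 | (c >> 18));
            buf[n++] = cast(char) (0x80 | ((c >> 12) & 0x3F));
            buf[n++] = cast(char) (0x80 | ((c >> 6) & 0x3F));
            buf[n++] = cast(char) (0x80 | (c & 0x3F));
        }
    }
    return buf[0 .. n];   // slice off the unused tail; no piecemeal appends
}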

Finally, there are still the problems related to string-concatenation and toString(), as described toward the end of this post: news:cgg1mi$14l2$1@digitaldaemon.com