August 25, 2004
"antiAlias" <fu@bar.com> escribió en el mensaje
news:cgg1mi$14l2$1@digitaldaemon.com
| A a; // class instances ...
| B b;
| C c;
|
| dchar[] message = c ~ b ~ a;

I have a question regarding this: what if A, B, and C were like this?

//////////////////////////
class A
{
    ... opCat_r (B b) { ... }
    ...
}

class B
{
    ... opCat (A a) { ... }
    ... opCat_r (C c) { ... }
    ...
}

class C
{
    ... opCat (B b) { ... }
    ...
}
//////////////////////////

How would "c ~ b ~ a" work with the proposed automatic call to .toString?

-----------------------
Carlos Santander Bernal


August 25, 2004
On Tue, 24 Aug 2004 12:03:17 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:

<big snip> Thanks for those explanations.

> The "confusion" in D arises (IMO) because we don't have implicit conversion.

That is my thought also, though I note you would rather have 1 string type.

Let's do a pros/cons list for implicit conversion and for one string type, because I am not totally convinced one is better than the other. Let me start (trying to be as objective as possible and not favour 'my' idea):

[implicit conversion]
PROS:
 P1 - will cause:
     dchar[] d;
     char[] c = d;
   to produce valid UTF sequences (see the sketch below).

 P2 - allows you to write 1 of each string-returning function (instead of 3)

 P3 - explicit conversion calls not required, e.g. toUTFxx().


p1: is vital, IMO.
p2: this means less code replication, and less code in general needs to be written.
p3: could be argued to be 'laziness'; I've been called lazy in the past.

CONS:
 C1 - transcoding is not FREE and it will happen without obvious indicators that it is happening.

 C2 - people will not learn the difference between char, wchar, and dchar as quickly.


c1: I would argue it's not as big a deal as it first appears; where it happens you would need a toUTFxx call anyway. In string concatenations some extra transcoding will occur, and I have no good solution for that, though allowing toString to return any of the 3 types would lessen this effect.

c2: Might be a 'pro' in disguise: they don't learn the difference because, with implicit conversion, it doesn't matter.
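
For illustration, here's roughly what P1 buys you. This is only a sketch: the toUTFxx calls are today's std.utf, and the commented line is the hypothetical implicit version.

import std.utf;

dchar[] d = toUTF32("hello, world");  // some UTF-32 text
char[]  c = toUTF8(d);                // today: explicit call, yields valid UTF-8
// char[] c = d;                      // proposed: the compiler inserts the same
//                                    // transcode; note cast(char[]) d would
//                                    // merely repaint the bits, not transcode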


[one string type]
PROS:
P1 - allows you to write 1 of each string-returning function (instead of 3)
     SAME AS ABOVE

CONS:
C1 - all your string characters are 16 bits wide; space is wasted when support for ASCII or any other 8-bit encoding is all that is required.


c1: I believe this to be a major 'con' for embedded and other small-systems programming. (E.g. 1 MB of pure ASCII text occupies 2 MB as 16-bit characters.)


Please, everyone, add to this list any/all you can think of, and correct any you think I have wrong or misrepresented.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
"Regan Heath" <regan@netwin.co.nz> wrote
> Is it?!
> I didn't realise that, so this is invalid?
>
> class A {
>    dchar[] toString() {}
> }

Yes. It most certainly is, Regan. I (incorrectly) assumed you understood that. Sorry. There have been a number of posts that note this, and its implications.


August 25, 2004
> 
> The other aspect involved here is that of string-concatenation. D cannot have more than one return type for toString() as you know. It's fixed at char[]. If string concatenation uses the toString() method to retrieve its components (as is being proposed elsewhere), then there will be multiple, redundant, implicit conversions going on where the string really wanted to be dchar[] in the first place. That is:
> 
> A a; // class instances ...
> B b;
> C c;
> 
> dchar[] message = c ~ b ~ a;
> 
> Under the proposed "implicit" scheme, if each toString() of A, B, and C wish to return dchar[], then each concatenation causes an implicit conversion/encoding from each dchar[] to char[] (for the toString() return). Then another full conversion/decoding is performed back to the dchar[] assignment once each has been concatenated. This is like the Wintel 'plot' for selling more cpu's :-)
> 
> Doing this manually, one would forego the toString() altogether:
> 
> dchar[] message = c.getString() ~ b.getString() ~ a.getString();
> 
> ... where getString() is a programmer-specific idiom to return the (natural) dchar[] for these classes, and we carefully avoided all those darned implicit-conversions. However, which approach do you think people will use? My guess is that D may become bogged down in conversion hell over such things.

"Conversion hell" will exist any time three standards are in use. It doesn't
matter how those standards are wrapped up - implicit, explicit, String
class or whatever. That's why we all win by agreeing on one standard and
trying to stick to it. In D now life is peachy in char[] land, slightly
less peachy in wchar[] and dchar[] land. I don't think there's any way to
make life peachy for all three cases.

> So, to answer your question:
> What I'm /for/ is not covering up these types of issues with blanket-style
> implicit conversions. Something more constructive (and with a little more
> forethought) needs to be done.

August 25, 2004
On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu@bar.com> wrote:
> "Walter"  wrote in...
>> > So what am I saying here? Available RAM will always increase in great
>> > leaps. Contemplating that the latter should dictate ease-of-use within D
>> > is a serious breach of logic, IMO. Ease of use, and above all,
>> > /consistency/ should be paramount; if you have the programmer in mind.
>>
>> I thought so, too, until I built a server app that used all dchar[]
>> internally. Server apps tend to be driven to the limit, and reaching that
>> limit 4x sooner means that the customer has to buy 4x more server
>> hardware.
>> Remember, that using 4 bytes per char doesn't just consume more ram, it
>> consumes a LOT more processor cycles with managing the extra memory.
>> (scanning, copying, initializing, gc marking, etc.)
>
> I disagree with that for a number of reasons <g>
>
> a) I was saying that usage of memory should not dictate language
> ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should
> all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it
> would arguably make it easier to use for the everyday folks. Those who
> really care about squeezing bytes can, and should, deal with text encoding
> and decoding issues. As it is, /everyone/ currently has to deal with those
> issues at various levels.

The exact same thing can be said for implicit transcoding. What I mean is...

Implicit transcoding will make it easier to use for everyday folks. Those who really care about squeezing bytes can.

The advantage implicit transcoding has is that everyday folks will mostly be dealing with ASCII, and so will likely use char[], which handles ASCII more compactly than dchar[].

Furthermore, implicit transcoding removes the need to deal with encoding/decoding issues, generally speaking. Those that need to worry about it can and will optimise where the implicit transcoding causes inefficient behaviour.

> b) There's an implication that all server apps are text-bound. That's just
> not the case, but perhaps I'm being pedantic.

It depends on the server. The mail server I work on is a candidate to be text-bound; in fact it's disk-bound, meaning we cannot write our email text out to disk as fast as we can receive it (tcp/ip) and process it (transcoding etc.).

> c) People who write servers have (traditionally) been a little more careful
> about what they do. There are plenty of ways to avoid allocating memory and
> thrashing the GC, where that's a concern. I do it all the time. In fact, one
> of the unwritten goals of writing server software is to avoid regularly
> using malloc/calloc where possible.

Definitely. Having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], UTF-32) simply makes this easier IMO.
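
For example (a sketch only; upperFirst and its ASCII-only logic are mine, not anything proposed, and the toUTFxx calls are today's std.utf):

import std.utf;

// Decode to UTF-32 for easy per-character work, then encode back.
// With implicit transcoding the toUTF32/toUTF8 calls would vanish.
char[] upperFirst(char[] s)
{
    dchar[] wide = toUTF32(s);
    if (wide.length && wide[0] >= 'a' && wide[0] <= 'z')
        wide[0] = cast(dchar)(wide[0] - ('a' - 'A'));  // naive ASCII upcase
    return toUTF8(wide);
}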

> d) The predominant modern cpu's all have prefetch built-in, because of the
> marketing craze for streaming-style application. This is great news for wide
> chars! It means that a server can stream dchar[] much more effectively than
> it could just a few years back. It's the conversions that are arguably a
> problem.

If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.

> e) dchar is the natural width of a 32bit processor, so it's not gonna take
> more Processor Cycles to process those than 8bit chars. In fact, it's the
> other way round where UTF-8 is involved. The bottleneck used to be the
> front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
> quad-pumped bus, and prefetch everywhere.
>
> So, no. I simply cannot agree that using dchar[] automatically means the
> customer has to buy 4x more server hardware <g>

All this arguing about what is more efficient is IMO totally pointless; applications vary so much that one method will be best for one application/situation and the other method for another.

D's goal is not to be specialised for any one style or application; as such, 3 char types make sense, don't they?

Regardless the only way to settle the performance argument is to benchmark something, therefore...

In what situations do you believe using UTF-32 dchar throughout the application will be faster than using all 3 types and implicit transcoding? Consider:
 - the input may be in any encoding
 - the output may be in any encoding
 - it may need to store large amounts of the input in memory

...can you think of any more?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
On Tue, 24 Aug 2004 19:45:07 -0700, antiAlias <fu@bar.com> wrote:

> "Regan Heath" <regan@netwin.co.nz> wrote
>> Is it?!
>> I didn't realise that, so this is invalid?
>>
>> class A {
>>    dchar[] toString() {}
>> }
>
> Yes. It most certainly is, Regan. I (incorrectly) assumed you understood
> that.

Either:
a. I am overly sensitive/insecure
b. You didn't realise
c. You're intentionally trying to belittle me

because ... "understood" is not the right word "knew" is a better choice.. "understood" implies I knew but didn't understand. That isn't the case. (this time)

> Sorry. There have been a number of posts that note this, and its
> implications.

I must have missed them, or missed the importance of that fact. Strange, given that I read *everything* in all the D NGs on digitalmars.com.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
On Tue, 24 Aug 2004 21:31:56 -0500, Carlos Santander B. <carlos8294@msn.com> wrote:
> "antiAlias" <fu@bar.com> escribió en el mensaje
> news:cgg1mi$14l2$1@digitaldaemon.com
> | A a; // class instances ...
> | B b;
> | C c;
> |
> | dchar[] message = c ~ b ~ a;
>
> I have a question regarding this: what if A, B, and C were like this?
>
> //////////////////////////
> class A
> {
>     ... opCat_r (B b) { ... }
>     ...
> }
>
> class B
> {
>     ... opCat (A a) { ... }
>     ... opCat_r (C c) { ... }
>     ...
> }
>
> class C
> {
>     ... opCat (B b) { ... }
>     ...
> }
> //////////////////////////
>
> How would "c ~ b ~ a" work with the proposed automatic call to .toString?

I assumed opCat's parameter would have to be char[], wchar[] or dchar[], as would its return value. E.g.:

class B
{
  char[] text;  // hypothetical payload for illustration
  char[] opCat(char[] rhs) { return text ~ rhs; }
}

Given implicit transcoding, you could then say:

char[]  c;
wchar[] w;
dchar[] d;

B b = new B();

char[] p;

p = b ~ c;
p = b ~ w;
p = b ~ d;
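
(Under the hood, each of those would presumably lower to something like "p = b.opCat(toUTF8(w));", with the toUTFxx call inserted by the compiler.)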

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
On Tue, 24 Aug 2004 22:53:47 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:

<snip>

> "Conversion hell" will exist any time three standards are in use. It doesn't
> matter how those standards are wrapped up - implicit, explicit, String
> class or whatever. That's why we all win by agreeing on one standard and
> trying to stick to it. In D now life is peachy in char[] land, slightly
> less peachy in wchar[] and dchar[] land. I don't think there's any way to
> make life peachy for all three cases.

Let's assume implicit transcoding is implemented; why wouldn't that make life peachy in all 3?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 25, 2004
Regan: I appeal to you to try and read things in context. I'm not even vaguely interested in getting into a pissing contest with you, so please try and follow this (and correlate with the text below if you have to):

a) I say available-memory will always increase in great leaps, so using that as a design guide vis-a-vis a computer language doesn't make sense to me.

b) Walter says he used to think so too, until he built a wide-char-only server; and points out that wide chars can force the customer into purchasing much more hardware due to memory consumption and additional CPU usage.

c) I disagree with that position, and try to illustrate why I don't think wide-chars are the demon they might once have been considered. And that perhaps they get a 'bad rap' for the wrong reasons.

What you added here seems intended to fan some imaginary flames, or to be argumentative purely for the sake of it, rather than to make any cohesive point. In fact, four out of the five items you managed to completely misconstrue. That may be my failing in terms of language use, so I'll accept the consequences. I will not, however, bite.

Good-day, my friend  :-)


"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc9o30tt5a2sq9@digitalmars.com...
> On Tue, 24 Aug 2004 12:44:03 -0700, antiAlias <fu@bar.com> wrote:
> > "Walter"  wrote in...
> >> > So what am I saying here? Available RAM will always increase in great
> >> > leaps. Contemplating that the latter should dictate ease-of-use within
> >> > D is a serious breach of logic, IMO. Ease of use, and above all,
> >> > /consistency/ should be paramount; if you have the programmer in mind.
> >>
> >> I thought so, too, until I built a server app that used all dchar[]
> >> internally. Server apps tend to be driven to the limit, and reaching that
> >> limit 4x sooner means that the customer has to buy 4x more server
> >> hardware.
> >> Remember, that using 4 bytes per char doesn't just consume more ram, it consumes a LOT more processor cycles with managing the extra memory. (scanning, copying, initializing, gc marking, etc.)
> >
> > I disagree with that for a number of reasons <g>
> >
> > a) I was saying that usage of memory should not dictate language
> > ease-of-use. I didn't say that byte[], ubyte[], char[] and wchar[] should
> > all be dumped.  If D were dchar[] oriented, rather than char[] oriented, it
> > would arguably make it easier to use for the everyday folks. Those who
> > really care about squeezing bytes can, and should, deal with text encoding
> > and decoding issues. As it is, /everyone/ currently has to deal with those
> > issues at various levels.
>
> The exact same thing can be said for implicit transcoding. What I mean is...
>
> Implicit transcoding will make it easier to use for everyday folks. Those who really care about squeezing bytes can.
>
> The advantage implicit transcoding has is that everyday folks will mostly be dealing with ASCII, and so will likely use char[], which handles ASCII more compactly than dchar[].
>
> Furthermore, implicit transcoding removes the need to deal with encoding/decoding issues, generally speaking. Those that need to worry about it can and will optimise where the implicit transcoding causes inefficient behaviour.
>
> > b) There's an implication that all server apps are text-bound. That's just
> > not the case, but perhaps I'm being pedantic.
>
> It depends on the server. The mail server I work on is a candidate to be text-bound; in fact it's disk-bound, meaning we cannot write our email text out to disk as fast as we can receive it (tcp/ip) and process it (transcoding etc.).
>
> > c) People who write servers have (traditionally) been a little more careful
> > about what they do. There are plenty of ways to avoid allocating memory and
> > thrashing the GC, where that's a concern. I do it all the time. In fact, one
> > of the unwritten goals of writing server software is to avoid regularly
> > using malloc/calloc where possible.
>
> Definitely. Having a UTF-8 char type which you can implicitly convert to a more convenient format temporarily (dchar[], UTF-32) simply makes this easier IMO.
>
> > d) The predominant modern cpu's all have prefetch built-in, because of the
> > marketing craze for streaming-style application. This is great news for wide
> > chars! It means that a server can stream dchar[] much more effectively than
> > it could just a few years back. It's the conversions that are arguably a
> > problem.
>
> If we're talking streaming as in streaming to disk or tcp/ip etc, I would argue that the time it takes to transcode is much less than the time it takes to write/send.
>
> > e) dchar is the natural width of a 32bit processor, so it's not gonna take
> > more Processor Cycles to process those than 8bit chars. In fact, it's the
> > other way round where UTF-8 is involved. The bottleneck used to be the
> > front-side bus. Not so these days of 1Ghz HyperTransport, 800MHz Intel
> > quad-pumped bus, and prefetch everywhere.
> >
> > So, no. I simply cannot agree that using dchar[] automatically means the customer has to buy 4x more server hardware <g>
>
> All this arguing about what is more efficient is IMO totally pointless; applications vary so much that one method will be best for one application/situation and the other method for another.
>
> D's goal is not to be specialised for any one style or application; as such, 3 char types make sense, don't they?
>
> Regardless the only way to settle the performance argument is to benchmark something, therefore...
>
> In what situations do you believe using UTF-32 dchar throughout the
> application will be faster than using all 3 types and implicit
> transcoding? Consider:
>   - the input may be in any encoding
>   - the output may be in any encoding
>   - it may need to store large amounts of the input in memory
>
> ...can you think of any more?
>
> Regan
>
> --
> Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/


August 25, 2004
On Tue, 24 Aug 2004 19:29:24 -0700, antiAlias <fu@bar.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote ..
>> > What happens when there's a partial character left undecoded
>> > at the end of  'src'?
>> ---------------------------------------
>> How is that even possible?
>
> It happens all the time with streamed input.

Ahhh... I get it: you were referring to not having all the input at one time, with some being left in the 'stream'. I can see your concern now.
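
For what it's worth, here's a sketch of the bookkeeping a streaming decoder needs; completePrefix is my own name for it, not anything in std.utf. It finds how much of a buffer ends on a character boundary, so the caller can decode src[0 .. n] and carry src[n .. src.length] over to the next read:

// How many leading bytes of 'src' form only complete UTF-8 sequences?
size_t completePrefix(ubyte[] src)
{
    if (src.length == 0)
        return 0;
    size_t i = src.length;
    // step back over at most 3 trailing continuation bytes (10xxxxxx)
    while (i > 0 && src.length - i < 3 && (src[i-1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return src.length;      // nothing but continuation bytes: malformed,
                                // so pass it all through and let the decoder
                                // report the error
    ubyte lead = src[i-1];
    size_t need;
    if (lead < 0x80)       need = 1;           // ASCII
    else if (lead < 0xC0)  return src.length;  // stray continuation byte
    else if (lead < 0xE0)  need = 2;
    else if (lead < 0xF0)  need = 3;
    else                   need = 4;
    // keep everything if the final sequence is complete, else trim it
    return (src.length - (i - 1) >= need) ? src.length : i - 1;
}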

> However, as AJ pointed out,
> neither you nor Walter are apparently suggesting that the cast() approach be
> used for anything other than trivial conversions.

Correct; those are the cases where the current approach can actually create a bug, one that only sometimes happens.

> That is, one would not use
> this approach with respect to IO streaming. I had the (distinctly wrong)
> impression this implied-conversion was intended to be a jack-of-all-trades.
>
> Everything else in the post is therefore cast(void)  ~  so let's stop
> wasting our breath :)

Yay :)

> If these implicit conversions are put in place, then I respectfully suggest
> the std.utf functions be replaced with something that avoids fragmenting the
> heap in the manner they currently do (for non Latin-1); and it's not hard to
> make them an order-of-magnitude faster, too.

Good idea. If it's done, it has to be done as efficiently as possible.

> Finally; there's still the problems related to string-concatentation and
> toString(), as described toward the end of this post

Yep. I think the toString restriction should be lifted; with implicit transcoding, any of the string types should be valid.

I am still concerned about the number of transcoding operations that might occur in an unsuspecting programmer's string concatenations...
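
To make that concrete: assuming toString() stays fixed at char[] and conversions become implicit, something like

dchar[] message = c ~ b ~ a;

would presumably expand to

dchar[] message = toUTF32(c.toString() ~ b.toString() ~ a.toString());

If A, B and C keep their text as dchar[] internally, each toString() already paid for a UTF-32 to UTF-8 encode, and the assignment then decodes the whole lot straight back: several transcodes the programmer never asked for.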

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/