August 23, 2004
Roald Ribe wrote:

> 
> "Ben Hinkle" <bhinkle4@juno.com> wrote in message news:cgcoe6$2cq4$1@digitaldaemon.com...
> 
> [snip]
> 
>> There were huge threads about char vs wchar vs dchar a while ago (on the
> old
>> newsgroup, I think). All kinds of things like what the default should be,
>> what the names should be, what a string class could be etc. For example
>>  http://www.digitalmars.com/d/archives/20361.html
>>  http://www.digitalmars.com/d/archives/12382.html
>> or actually anything at
>>  http://www.digitalmars.com/d/archives/index.html
>> with the word "unicode" in the subject.
>>
>> By the way, if there are N encodings, why are there N^2-N converters? Shouldn't there just be ~2*N to convert to/from one standard like dchar[]? IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as the standard.
> 
> Indeed. There were several large discussions about this. Only a few
> scandinavian/north european readers of this group seemed to be
> positive at the time.
> I am happy to see that more people are warming to the idea.
> wchar (16-bit) is enough. It is even suggested as the best
> implementation size by some UNICODE coding experts.
> IBM / Sun / MS can not all be stupid at the same time...
> I think it would be smart to interoperate with the 16-bit size
> used internally in ICU, Java and MS-Windows. Only on unix/linux
> would it make sense to use a 32-bit dchar.
> The 16 bits is enough for 99% of the cases/languages.
> The last 1% can be handled quite fast by cached indexing
> techniques in a String object. (this does not make for
> optimal speed in the 1% case, but it will more than pay
> for itself speedwise in 99% of all binary i/o operations :)
> 
> However, that is Walter's main issue, I think. He wants 8-bit chars to be the default because this will make for the best possible i/o speed with the current state of affairs. That is what I understood from the last discussion at least. I am sure he will comment on this thread ;-) and correct me if I am wrong.
> 
> Regards,
> Roald

I didn't mean to suggest D ditch char or wchar or dchar. I'm just saying ICU uses wchar internally as the intermediate representation when converting between encodings. That is different from changing D's concept of strings. I should have added a sentence to my original post saying that I think D should keep its support of the "big three" char, wchar and dchar (with char[] as the standard concept of a string) and have the library that handles conversions between unicode and non-unicode (or between non-unicode) encodings use whatever it wants as the intermediate representation. I think dchar would probably be fine for that - but I have no experience with it, so that is just a naive guess.
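
To sketch what I mean about needing only ~2*N converters (the names below are invented for illustration - they are not ICU's or Phobos's API): each encoding needs one decoder into the intermediate dchar[] and one encoder out of it, and any-to-any conversion is just the two composed.

// hypothetical decoder for one legacy encoding
dchar[] fromLatin1(ubyte[] src)
{
    dchar[] result;
    foreach (ubyte b; src)
        result ~= cast(dchar) b;   // Latin-1 maps 1:1 onto U+0000..U+00FF
    return result;
}

// converting Latin-1 to any other supported encoding then reuses that
// encoding's single encoder, e.g. encodeShiftJIS(fromLatin1(src)) -
// again a made-up name - instead of a dedicated Latin-1-to-Shift-JIS routine.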

Treating the unicode encodings specially seems more practical than saying all non-standard encodings are treated the same.

-Ben
August 23, 2004
Arcane Jill wrote:

> 
> The case for retaining wchar has been made, and essentially won. Please see separate thread about ICU (and maybe move this discussion there).

It's won already? What is this - a Mike Tyson fight? :-)
Are you referring to the old threads (on the old newsgroup) or this new
thread?

> "char", however, is still up for deletion, since all arguments against it still apply.
> 
> Jill

Once I argued that D's current concept of char should be called uchar or something to indicate the UTF-8 encoding (as opposed to C's char encoding) and that string literals should have type uchar[]. I still think it would be interesting to try, but it's a tweak on the current system that isn't *that* important.

August 23, 2004
I'm not sure what transcoding means, I assume it's converting from one string type to another.

I have some experience with having only one character type - implementing a C compiler, which is largely string processing using ASCII, implementing a Java compiler and working on the Java vm, which does everything as a wchar, and implementing a javascript compiler, interpreter, and runtime which does everything as a dchar.

dchar implementations consume memory at an alarming rate, and if you're doing server side code, this means you'll soon run into the point where virtual memory starts thrashing, and performance goes quickly down the tubes. I had to do a lot of work trying to overcome this problem. dchars are very convenient to work with, however, and make a great deal of sense as the temporary common intermediate form of all the conversions. I stress the temporary, though, as if you keep it around you'll start to notice the slowdowns it causes.
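
To put rough numbers on it: char, wchar and dchar are 1, 2 and 4 bytes each, so the same mostly-ASCII text stored as dchar[] costs four times what it does as char[].

char[]  a = "hello, world";   // 12 bytes of character data
wchar[] b = "hello, world";   // 24 bytes
dchar[] c = "hello, world";   // 48 bytes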

char implementations can be made really fast and memory efficient. I don't know if anyone has run any statistics, but the vast bulk of text processing done by programs is in ASCII.

wchar implementations are, of course, halfway in between. Microsoft went with wchars back when a wchar could handle all of unicode in one word. Now that it can't anymore, all wchar code is going to have to handle multiword encodings. So it's just as much extra code to write as for chars.

Java uses wchars, but Java was not designed for high performance (although years of herculean efforts by a lot of very smart people have brought Java a long ways in the performance department). My Java compiler written in C++, which used UTF-8 internally, ran 10x faster than the one written in Java, which used UTF-16. The speedup wasn't all due to the character encoding, but it helped.

C# uses wchars, and that maps directly onto the Win32 API, so that makes a lot of sense for Microsoft. D isn't just a Win32 language, however, and linux uses UTF-8. Furthermore, a lot of embedded systems stick with ASCII.

In other words, I think there's a strong case for all three character encodings in D. I agree that those using code pages will run into trouble now and then with this, but they will anyway, because code pages are just endless trouble since which code page some data is in is not inherent in the data. Your idea of having the compiler detect invalid UTF-8 sequences in string literals is very helpful here in heading a lot of these issues off at the pass.

I think it makes perfect sense for a transcoding library to standardize on dchar and dchar[] as its intermediate form. But don't take away char[] for people like me that want to use it for other purposes!

I also agree that for single characters, using dchar is the best way to go. Note that I've been redoing Phobos internals this way. For example, std.format generates dchars to feed to the consumer function. std.ctype takes dchar's as arguments.
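
The general shape of that consumer pattern is something like this (just a sketch, not the actual Phobos code; 'emit' and 'sink' are made-up names):

void emit(char[] s, void delegate(dchar) sink)
{
    // foreach with a dchar loop variable decodes the UTF-8 as it goes,
    // so the consumer always sees whole characters
    foreach (dchar c; s)
        sink(c);
}

void main()
{
    dchar[] collected;
    void sink(dchar c) { collected ~= c; }
    emit("héllo", &sink);
}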


August 23, 2004
Walter,

I pretty much agree with everything you have said here; I don't think we should remove any of the char types. That said, I think...

The idea that a cast from one char type to another char type (implicit or explicit) should perform the correct UTF transcoding (conversion) is a good one; my arguments for this are as follows:

 - Q: if you paint a dchar[] as a char[] what do you get?
   A: a possibly (likely?) invalid UTF-8 sequence.

So why do it? I agree we do want to be able to 'paint' one type as another, but for a type with a specified encoding I don't think this makes any sense, does it? Can you think of a reason to do it? Given that you could use ubyte, ushort or uint (types with no specified encoding) instead.

 - Doing the transcoding means people writing string handling routines need only provide one routine and the result will automatically be transcoded to the type they're using.

This is such a great bonus! It will reduce the number of string handling routines by 2/3, as any routine will have its result converted to the required type auto-magically.

 - The argument about consistency, that a ubyte cast to a ushort does no transcoding so a char cast to a dchar shouldn't either.

There are two ways of looking at this: on one hand you're saying they should all 'paint', as that is consistent. On the other, I'm saying they should all produce a 'valid' result. So my argument here is that when you cast you expect a valid result, much like casting a float to an int does not just 'paint'.
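
To make the contrast concrete (this uses today's std.utf for the explicit conversion; the final comment describes what I'm proposing the cast should do):

import std.utf;

void main()
{
    dchar[] d;
    d ~= cast(dchar) 0x00E9;          // 'é' as a single UTF-32 code unit

    char[] u8 = std.utf.toUTF8(d);    // explicit transcoding: a valid 2-byte UTF-8 sequence

    ubyte[] raw = cast(ubyte[]) d;    // 'painting': 4 raw bytes of UTF-32, not UTF-8

    // under the proposal, cast(char[]) d would behave like toUTF8(d),
    // while cast(ubyte[]) d stays available for anyone who wants the raw bytes
}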

I am interested to hear your opinions on this idea.

Regan.

On Mon, 23 Aug 2004 12:59:28 -0700, Walter <newshound@digitalmars.com> wrote:
> I'm not sure what transcoding means, I assume it's converting from one
> string type to another.
>
> I have some experiences with having only one character type - implementing a
> C compiler, which is largely string processing using ascii, implementing a
> Java compiler and working on the Java vm, which does everything as a wchar,
> and implementing a javascript compiler, interpreter, and runtime which does
> everything as a dchar.
>
> dchar implementations consume memory at an alarming rate, and if you're
> doing server side code, this means you'll soon run into the point where
> virtual memory starts thrashing, and performance goes quickly down the
> tubes. I had to do a lot of work trying to overcome this problem. dchars are
> very convenient to work with, however, and make a great deal of sense as the
> temporary common intermediate form of all the conversions. I stress the
> temporary, though, as if you keep it around you'll start to notice the
> slowdowns it causes.
>
> char implementations can be made really fast and memory efficient. I don't
> know if anyone has run any statistics, but the vast bulk of text processing
> done by programs is in ASCII.
>
> wchar implementations are, of course, halfway in between. Microsoft went
> with wchars back when wchars could handle all of unicode in one word. Now
> that it doesn't anymore, that means that all wchar code is going to have to
> handle multiword encodings. So it's just as much extra code to write as for
> chars.
>
> Java uses wchars, but Java was not designed for high performance (although
> years of herculean efforts by a lot of very smart people have brought Java a
> long ways in the performance department). My Java compiler written in C++,
> which used UTF-8 internally, ran 10x faster than the one written in Java,
> which used UTF-16. The speedup wasn't all due to the character encoding, but
> it helped.
>
> C# uses wchars, and that maps directly onto the Win32 API, so that makes a
> lot of sense for Microsoft. D isn't just a Win32 language, however, and
> linux uses UTF-8. Furthermore, a lot of embedded systems stick with ASCII.
>
> In other words, I think there's a strong case for all three character
> encodings in D. I agree that those using code pages will run into trouble
> now and then with this, but they will anyway, because code pages are just
> endless trouble since which code page some data is in is not inherent in the
> data. Your idea of having the compiler detect invalid UTF-8 sequences in
> string literals is very helpful here in heading a lot of these issues off at
> the pass.
>
> I think it makes perfect sense for a transcoding library to standardize on
> dchar and dchar[] as its intermediate form. But don't take away char[] for
> people like me that want to use it for other purposes!
>
> I also agree that for single characters, using dchar is the best way to go.
> Note that I've been redoing Phobos internals this way. For example,
> std.format generates dchars to feed to the consumer function. std.ctype
> takes dchar's as arguments.
>
>



-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 23, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc7hhm0a5a2sq9@digitalmars.com...
> Walter,
>
> I pretty much agree with everything you have said here, I don't think we should remove any of the char types. That said I think...
>
> The idea that a cast from one char type to another char type (implicit or
> explicit) should perform the correct UTF transcoding(conversion) is a good
> one, my arguments for this are as follows:
>
>   - Q: if you paint a dchar[] as a char[] what do you get?
>     A: a possibly (likely?) invalid UTF-8 sequence.
>
> So why do it? I agree we do want to be able to 'paint' one type as another, but for a type with a specified encoding I don't think this makes any sense, does it? can you think of a reason to do it? Given that you could use ubyte, ushort or ulong instead. (types with no specified encoding).
>
>   - Doing the transcoding means people writing string handling routines
> need only provide one routine and the result will automatically be
> transcoded to the type they're using.
>
> This is such a great bonus! it will reduce the number of string handling routines by 2/3 as any routine will have it's result converted to the required type auto-magically.
>
>   - The argument about consistency, that a ubyte cast to a ushort does not
> transcoding so a char to a dchar shouldn't either.
>
> There are two ways of looking at this, on one hand you're saying they should all 'paint' as that is consistent. However, on the other I'm saying they should all produce a 'valid' result. So my argument here is that when you cast you expect a valid result, much like casting a float to an int does not just 'paint'.
>
> I am interested to hear your opinions on this idea.

I think your idea has a lot of merit. I'm certainly leaning that way.


August 24, 2004
Ben Hinkle wrote:
...
> There were huge threads about char vs wchar vs dchar a while ago (on the old
> newsgroup, I think). All kinds of things like what the default should be,
> what the names should be, what a string class could be etc. For example
>  http://www.digitalmars.com/d/archives/20361.html
>  http://www.digitalmars.com/d/archives/12382.html
> or actually anything at
>  http://www.digitalmars.com/d/archives/index.html
> with the word "unicode" in the subject.

In case anyone is interested, here's a page with links to many Unicode threads in D newsgroups:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
August 24, 2004
"Walter" <newshound@digitalmars.com> wrote in message ...
> "Regan" wrote:
> > There are two ways of looking at this, on one hand you're saying they should all 'paint' as that is consistent. However, on the other I'm
saying
> > they should all produce a 'valid' result. So my argument here is that
when
> > you cast you expect a valid result, much like casting a float to an int does not just 'paint'.
> >
> > I am interested to hear your opinions on this idea.
>
> I think your idea has a lot of merit. I'm certainly leaning that way.

On the one hand, this would be well served by an opCast() and/or opCast_r()
on the primitive types; just the kind of thing  suggested in a related
thread (which talked about overloadable methods for primitive types).

On the other hand, we're talking transcoding here. Are you gonna' limit this to UTF-8 only? Then, since the source and destination will typically be of different sizes, do you then force all casts between these types to have the destination be an array reference rather than an instance? One that is always allocated on the fly? Then there's performance. It's entirely possible to write transcoders that are minimally between 5 and 30 times faster than the std.utf ones. Some people actually do care about efficiency. I'm one of them.

If you do implement overloadable primitive-methods (like properties) then, will you allow a programmer to override them? So they can make the opCast() do something more specific to their own specific task?

That seems like a lot to build into the core of a language.

Personally, I think it's 'borderline' to have so many data types available for similar things. If there were an "alias wchar[] string", and the core Object supported that via "string toString()", and the ICU library were adopted, then I think some of the general confusion would perhaps melt somewhat.

In many respects, too many choices is simply a BadThing (TM). Especially when there's precious little solid guidance to help. That guidance might come from a decent library that indicates how the types are used, and uses one obvious type (string?) consistently. Besides, if ICU were adopted, fewer people would have to worry about the distinction anyway.

Believe me when I say that Mango would dearly love to go dchar[] only. Actually, it probably will at the higher levels because it makes life simple for everyone. Oh, and I've been accused many times of being an efficiency fanatic, especially when it comes to servers. But there's always a tradeoff somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM. Which one changes dramatically over time? Hmmmm ... let me see now ... 64bit-OS for desktops just around the corner?

Even on an embedded device I'd probably go "dchar only" regarding I18N. Simply because the quantity of text processed on such devices is very limited. Before anyone shoots me over this one, I regularly write code for devices with just 4KB RAM ~ still use 16bit chars there when dealing with XML input.

So what am I saying here? Available RAM will always increase in great leaps. Contemplating that the latter should dictate ease-of-use within D is a serious breach of logic, IMO. Ease of use, and above all, /consistency/ should be paramount; if you have the programmer in mind.


August 24, 2004
On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu@bar.com> wrote:
> "Walter" <newshound@digitalmars.com> wrote in message ...
>> "Regan" wrote:
>> > There are two ways of looking at this, on one hand you're saying they
>> > should all 'paint' as that is consistent. However, on the other I'm
> saying
>> > they should all produce a 'valid' result. So my argument here is that
> when
>> > you cast you expect a valid result, much like casting a float to an 
>> int
>> > does not just 'paint'.
>> >
>> > I am interested to hear your opinions on this idea.
>>
>> I think your idea has a lot of merit. I'm certainly leaning that way.
>
> On the one hand, this would be well served by an opCast() and/or opCast_r()
> on the primitive types; just the kind of thing  suggested in a related
> thread (which talked about overloadable methods for primitive types).

True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.

> On the other hand, we're talking transcoding here. Are you gonna' limit this
> to UTF-8 only?

I hope not, I am hoping to see:

       | UTF-8 | UTF-16 | UTF-32
-------+-------+--------+--------
UTF-8  |   -   |   +    |   +
UTF-16 |   +   |   -    |   +
UTF-32 |   +   |   +    |   -

(+ indicates transcoding occurs)
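
As far as I can tell all six of those cells are already covered by std.utf today, so the implicit casts would simply be calling something like:

import std.utf;

void main()
{
    char[]  u8  = "\u20AC 100";           // some text containing the euro sign

    wchar[] u16 = std.utf.toUTF16(u8);    // UTF-8  -> UTF-16
    dchar[] u32 = std.utf.toUTF32(u8);    // UTF-8  -> UTF-32

    char[]  a   = std.utf.toUTF8(u16);    // UTF-16 -> UTF-8
    dchar[] b   = std.utf.toUTF32(u16);   // UTF-16 -> UTF-32

    char[]  c   = std.utf.toUTF8(u32);    // UTF-32 -> UTF-8
    wchar[] w   = std.utf.toUTF16(u32);   // UTF-32 -> UTF-16
}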

> Then, since the source and destination will typically be of
> different sizes, do you then force all casts between these types to have the
> destination be an array reference rather than an instance?

I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)

As AJ has frequently pointed out a char or wchar does not equal one "character" in some cases. Given that and assuming 'c' below is "not a whole character", I cannot see how:

dchar d;
char  c = 0xE9;   // not a whole character (a UTF-8 fragment)

d = c;            //implicit
d = cast(dchar)c; //explicit

would ever be able to transcode. So in this instance this should either be:

1. a compile error
2. a runtime error
3. a simple copy of the value, creating an invalid? (will it be AJ?) utf-32 character.

I opt for #1 where possible, otherwise #2, and consider #3 very bad if it has the possibility to create an invalid utf-x code-point/fragment (I don't know the right term)
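
A rough sketch of what #2 might look like ('widenChar' is a made-up name; the check simply rejects anything outside ASCII, since only ASCII survives a char -> dchar widening intact):

dchar widenChar(char c)
{
    if (c >= 0x80)   // a lead or continuation byte, i.e. only a fragment
        throw new Error("char value is a UTF-8 fragment, not a whole character");
    return c;        // ASCII values are the same code point in UTF-32
}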

> One that is
> always allocated on the fly?

No.. bad idea methinks.

> Then there's performance. It's entirely
> possible to write transcoders that are minimally between 5 and 30 times
> faster than the std.utf ones. Some people actually do care about efficiency.
> I'm one of them.

Same here. I believe we all have the same goal, we just have different ideas about how best to get there.

> If you do implement overloadable primitive-methods (like properties) then,
> will you allow a programmer to override them? So they can make the opCast()
> do something more specific to their own specific task?

Can you think of a task you'd want to put one of these opCast methods to? (and give an example).

> That's seems like a lot to build into the core of a language.

The opCast overloading? or the original idea?

The only reservation I had about the original idea was that it seemed like "a lot to build into the core of a language" at first, and then I realised that if you cast from one char type to another, the _only_ sensible thing to do is transcode; anything else is a bug/error.

In addition, this sort of cast (char type to char type) doesn't occur a lot, but when it does you end up calling toUTFxx manually; why not just have it happen?

> Personally, I think it's 'borderline' to have so many data types available for similar things.

Like byte, short, int and long?

> If there were an "alias wchar[] string", and the core
> Object supported that via "string toString()", and the IUC library were
> adopted, then I think some of the general confusion would perhaps melt
> somewhat.

Maybe.

However, toString for an 'int' (for example) only needs a char[] to represent all its possible values (with one character equalling one 'char'), so why use wchar or dchar for its toString?

Conversely, something else might find it more efficient to return its toString as a dchar[].

You might argue either example is rare, certainly the latter is rarer than the former, but the efficiency gnome in my head won't shut up about this..

If we simply alias wchar[] as 'string' then these examples will need manual conversion to 'string' which involves calling toUTF16 or .. what would you call that conversion function such that it was obvious? toString? :)

> In many respects, too many choices is simply a BadThing (TM).

I think too many very similar but slightly different choices/methods is a bad thing; however, I don't see char, wchar and dchar as that. I think they are more like byte, short and int: different types for different uses, castable to/from each other where required. The choice they give you is being able to choose the right type for the job.

> Especially
> when there's precious little solid guidance to help. That guidance might
> come from a decent library that indicates how the types are used, and uses
> one obvious type (string?) consistently.

I think the confusion comes from being used to only 1 string type; for C/C++ programmers it's typically an 8-bit unsigned type usually containing ASCII, for Java a 16-bit signed/unsigned? type containing UTF-16?

Internationalisation is a new topic for me and many others (I suspect) even for Walter(?).

Having 3 types requiring manual transcoding between them _is_ a pain.

> Besides, if IUC were adopted, less
> people would have to worry about the distinction anyway.

The same could be said for the implicit transcoding from one type to the other.

> Believe me when I say that Mango would dearly love to go dchar[] only.
> Actually, it probably will at the higher levels because it makes life simple
> for everyone. Oh, and I've been accused many times of being an efficiency
> fanatic, especially when it comes to servers. But there's always a tradeoff
> somewhere. Here, the tradeoff is simplicity-of-use versus quantities of RAM.
> Which one changes dramatically over time? Hmmmm ... let me see now ...
> 64bit-OS for desktops just around corner?

What you really mean is you'd dearly love to not have to worry about the differences between the 3 types; implicit transcoding will give you that. Furthermore it's simplicity without sacrificing RAM.

The thing I like about implicit transcoding is that if for example you have a lot of string data stored in memory, you can store it in the most efficient format for that data, which may be char, wchar or dchar.

If you then want to call a function which takes a dchar[] on some/all of your data, it will be implicitly converted to dchar[] (if not already) for the call.

If at a later date you decide to change the format of the stored data, you don't have to find every call to a function and insert/remove toUTFxx calls.
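
In other words, today every call site has to name the conversion ('display' below is a made-up function, just for illustration):

import std.utf;

void display(dchar[] s) { /* ... */ }

void main()
{
    char[] title = "some stored text";    // kept compactly as UTF-8
    display(std.utf.toUTF32(title));      // the conversion is spelled out today

    // if 'title' later becomes a wchar[], this call (and every one like it)
    // has to be edited; with implicit transcoding, display(title) would
    // keep compiling and do the right thing either way.
}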

To me, this sounds like the most efficient way to handle it.

> Even on an embedded device I'd probably go "dchar only" regarding I18N.
> Simply because the quantity of text processed on such devices is very
> limited. Before anyone shoots me over this one, I regularly write code for
> devices with just 4KB RAM ~ still use 16bit chars there when dealing with
> XML input.

Were you to use implicit transcoding you could store the data in memory in UTF-8 or UTF-16, then only transcode to UTF-32 when required; this would be more efficient.

> So what am I saying here? Available RAM will always increase in great leaps.

Probably.

> Contemplating that the latter should dictate ease-of-use within D is a
> serious breach of logic, IMO.

"the latter"?

> Ease of use, and above all, /consistency/
> should be paramount; if you have the programmer in mind.

This last statement applies to implicit transcoding perfectly.
It's easy and consistent.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 24, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc7u56zf5a2sq9@digitalmars.com...
> On Mon, 23 Aug 2004 18:29:54 -0700, antiAlias <fu@bar.com> wrote:
> > "Walter" <newshound@digitalmars.com> wrote in message ...
> >> "Regan" wrote:
> >> > There are two ways of looking at this, on one hand you're saying they should all 'paint' as that is consistent. However, on the other I'm
> > saying
> >> > they should all produce a 'valid' result. So my argument here is that
> > when
> >> > you cast you expect a valid result, much like casting a float to an
> >> int
> >> > does not just 'paint'.
> >> >
> >> > I am interested to hear your opinions on this idea.
> >>
> >> I think your idea has a lot of merit. I'm certainly leaning that way.
> >
> > On the one hand, this would be well served by an opCast() and/or
> > opCast_r()
> > on the primitive types; just the kind of thing  suggested in a related
> > thread (which talked about overloadable methods for primitive types).
>
> True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.
>
> > On the other hand, we're talking transcoding here. Are you gonna' limit
> > this
> > to UTF-8 only?
>
> I hope not, I am hoping to see:
>
>         | UTF-8 | UTF-16 | UTF-32
> --------------------------------
> UTF-8  |   -        +       +
> UTF-16 |   +        -       +
> UTF-32 |   +        +       -
>
> (+ indicates transcoding occurs)

========================
And what happens when just one additional byte-oriented encoding is
introduced? Perhaps UTF-7? Perhaps EBCDIC? The basic premise is flawed
because there's no flexibility.

>
> > Then, since the source and destination will typically be of
> > different sizes, do you then force all casts between these types to have
> > the
> > destination be an array reference rather than an instance?
>
> I don't think transcoding makes any sense unless you're talking about a
> 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment
> (i.e. char, wchar)

========================
We /are/ talking about arrays. Perhaps if that sentence had ended "array
reference rather than an *array* instance?", it might have been more clear?
The point being made is that you would not be able to do anything like this:

char[15] dst;
dchar[10] src;

dst = cast(char[]) src;

because there's no ability via a cast() to indicate how many items from src were converted, and how many items in dst were populated. You are forced into this kind of thing:

char[] dst;
dchar[10] src;

dst = cast(char[]) src;

You see the distinction? It may be subtle to some, but it's a glaring imbalance to others. The lValue must always be a reference because it's gonna' be allocated dynamically to ensure all of the rValue will fit. In the end, it's just far better to use functions/methods that provide the feedback required so you can actually control what's going on (or such that a library function can). That way, you're not restricted in the same fashion. We don't need more asymmetry in D, and this just reeks of poor design, IMO.

To drive this home, consider the decoding version (rather than the encoding
above):

char[15] src;
dchar[] dst;

dst = cast(dchar[]) src;

What happens when there's a partial character left undecoded at the end of 'src'? There's nothing here to tell you that you've got a dangly bit left at the end of the source buffer. It's gone. Poof! Any further decoding from the same file/socket/whatever is henceforth trashed, because the ball has been both dropped and buried. End of story.
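
A function-based approach can report exactly that (a hand-rolled sketch; I'm going from memory on the std.utf names, so treat 'stride' and 'decode' as assumptions, and it presumes well-formed input):

import std.utf;

// returns the number of bytes of 'src' actually consumed; whatever is left
// over is an incomplete trailing sequence the caller carries into the next buffer
uint decodeSome(char[] src, inout dchar[] dst)
{
    uint i = 0;
    while (i < src.length)
    {
        uint need = std.utf.stride(src, i);   // bytes in the sequence starting at i
        if (i + need > src.length)
            break;                            // dangly bit: stop before it
        dst ~= std.utf.decode(src, i);        // decodes one character, advances i
    }
    return i;
}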


> Having 3 types requiring manual transcoding between them _is_ a pain.

========================
It certainly is. That's why other languages try to avoid it at all costs.

Having it done "generously" by the compiler is also a pain, inflexible, and likely expensive. There are many things a programmer should take responsibility for; transcoding comes under that umbrella because (a) there can be subtle complexity involved and (b) it is relatively expensive to churn through text and convert it; particularly so with the Phobos utf-8 code.

What you appear to be suggesting is that this kind of thing should happen silently whilst one nonchalantly passes arguments around between methods. That's insane, so I hope that's not what you're advocating. Java, for example, does that at one specific layer (I/O), but you're apparently suggesting doing it at any old place! And several times over, just in case it wasn't good enough the first time :-)  Sorry man. This is inane. D is /not/ a scripting language; instead it's supposed to be a systems language.


> > Besides, if IUC were adopted, less
> > people would have to worry about the distinction anyway.
>
> The same could be said for the implicit transcoding from one type to the other.

========================
That's pretty short-sighted IMO. You appear to be saying that implicit
transcoding would take the place of ICU; terribly misleading. Transcoding is
just a very small part of that package. Please try to reread the comment as
"most people would be shielded completely by the library functions,
therefore there's far fewer scenarios where they'd ever have a need to drop
into anything else".

This would be a very GoodThing for D users. Far better to have a good library to take care of /all/ this crap than have the D language do some partial-conversions on the fly, and then quit because it doesn't know how to provide any further functionality. This is the classic core-language-versus-library-functionality bitchfest all over again. Building all this into a cast()? Hey! Let's make Walter's Regex class part of the compiler too; and make it do UTF-8 decoding /while/ it's searching, since you'll be able to pass it a dchar[] that will be generously converted to the accepted char[] for you "on-the-fly".

Excuse me for jesting, but perhaps the Bidi text algorithms plus date/numeric formatting & parsing will all fit into a single operator also? That's kind of what's being suggested. I believe there's a serious underestimate of the task being discussed.


>
> > Believe me when I say that Mango would dearly love to go dchar[] only.
> > Actually, it probably will at the higher levels because it makes life
> > simple
> > for everyone. Oh, and I've been accused many times of being an
efficiency
> > fanatic, especially when it comes to servers. But there's always a
> > tradeoff
> > somewhere. Here, the tradeoff is simplicity-of-use versus quantities of
> > RAM.
> > Which one changes dramatically over time? Hmmmm ... let me see now ...
> > 64bit-OS for desktops just around corner?
>
> What you really mean is you'd dearly love to not have to worry about the differences between the 3 types, implicit transcoding will give you that. Furthermore it's simplicity without sacraficing RAM.

========================
Ahh Thanks. I didn't realize that's what I "really meant". Wait a minute ...


>
> > Even on an embedded device I'd probably go "dchar only" regarding I18N.
> > Simply because the quantity of text processed on such devices is very
> > limited. Before anyone shoots me over this one, I regularly write code
> > for
> > devices with just 4KB RAM ~ still use 16bit chars there when dealing
with
> > XML input.
>
> We're you to use implicit transcoding you could store the data in memory in UTF-8, or UTF-16 then only transcode to UTF-32 when required, this would be more efficient.

========================
That's a rather large assumption, don't you think? More efficient? In which
particular way? Is memory usage or CPU usage more important in /my/
particular applications? Please either refrain, or commit to rewriting all
my old code more efficiently for me ... for free <g>



August 24, 2004
In article <opsc7u56zf5a2sq9@digitalmars.com>, Regan Heath says...

>True. However, what else should the opCast for char[] to dchar[] do except transcode it? What about opCast for int to dchar.. it seems to me there is only 1 choice about what to do, anything else would be operator abuse.

Correct me if I'm wrong, but according to the docs, there /is/ no from-to
version of opCast(). opCast() remains almost completely useless, despite many
suggestions to fix it.

But there's no need to opCast() anything. Compiler magic can just call
std.toUTFxx() directly (which of course is what you said).

If you want to use a different encoder, just do it explicitly. e.g.
#    dchar[] d = someOtherConverter(c);



>I don't think transcoding makes any sense unless you're talking about a 'string' (i.e. char[], wchar[], or dchar[]) as opposed to a UTF-x fragment (i.e. char, wchar)

Correct. It doesn't.


>As AJ has frequently pointed out a char or wchar does not equal one "character" in some cases. Given that and assuming 'c' below is "not a whole character", I cannot see how:
>
>dchar d;
>char  c = d; //not a whole character
>
>d = c;            //implicit
>d = cast(dchar)c; //explicit

The existing behavior is already flawed, but it's not going to change (unless char is ditched), because these are primitive types, and Walter says so. Here's what's wrong:

In the direction dchar -> char:

#    dchar d = '\u1234';
#    char c = cast(char) d;    // c = '4';

Well, you'd /expect/ that to go wrong. It's a narrowing conversion. But:

#    char c = 0xE9;  // The first byte of a three-byte UTF-8 sequence which,
#                    //  when encoded, will represent a character in the range
#                    //  U+9000 to U+9FFF
#    dchar d = c;    // Whoops! d now equals 'é'

Casting from char to wchar or dchar will /only/ work if the character is actually ASCII.




>3. a simple copy of the value, creating an invalid? (will it be AJ?) utf-32 character.

No value in the range U+0000 to U+00FF is invalid UTF32. But you can't convert a UTF-8-fragment to a character (see example above) and expect the result to be meaningful, except for ASCII.



>utf-x code-point/fragment (I don't know the right term)

Unicode uses the term "code unit". I have avoided that term as it's not altogether clear to those unfamiliar with the jargon. I usually say "UTF-8 fragment" on this newsgroup.



>> Then there's performance. It's entirely
>> possible to write transcoders that are minimally between 5 and 30 times
>> faster than the std.utf ones. Some people actually do care about
>> efficiency.
>> I'm one of them.
>
>Same here. I believe we all have the same goal, we just have different ideas about how best to get there.

The current implementation of the std.utf functions is not really relevant. /Tomorrow/'s implementation could be way faster than today's. I don't see any reason why Walter couldn't some day replace the implementation with the fastest one on the planet. No code would break.


>I think the confusion comes from being used to only 1 string type, for C/C++ programmers it's typically and 8bit unsigned type usually containing ASCII, for Java a 16bit signed/unsigned? type containing UTF-16?

Mebee, but in C++ I can do this:

#    class JillsString { /*...*/ };
#    class RegansString { /*...*/ };
#
#    JillsString s1 = "hello world";
#    RegansString s2 = s1;
#    std::string s3 = s2;
#    if (s3 == s1) { /*...*/ }

The "confusion" in D arises (IMO) because we don't have implicit conversion.

Arcane Jill