The case for ditching char and wchar (and renaming "dchar" as "char")

August 23, 2004
D has come a long way, and much of the original architecture is now redundant. Template/library-based containers are now making built-in associative arrays redundant, for example. And now a new revolution is on its way - transcoding, which makes built-in support for UTF-8 and friends equally redundant. (It does not, of course, make Unicode itself redundant!)

D's "char" type is, by definition, a fragment of UTF-8.
But UTF-8 is just an encoding.

D's "wchar" type is, by definition, a fragment of UTF-16. But UTF-16 is also just an encoding (or two).

D's "dchar" type flits ambiguously between a fragment of UTF-32 and an actual Unicode codepoint (the two are more or less interchangeable).

<sarcasm>
By extension of this logic, why not:

schar - a fragment of UTF-7
ichar - a fragment of ISO-8859-1
cchar - a fragment of WINDOWS-1252
... and so on, for every encoding you can think of. Hang on - we're going to run out of letters!

and of course, Phobos would have to implement all the conversion functions:
toUTF7(), toISO88591(), and so on.
</sarcasm>

Nonsense? Of course it is. But the analogy is intended to show that the current behavior of D is also nonsense. For N encodings, you need (N squared minus N) conversion functions, so the number is going to grow quite rapidly as the number of supported encodings increases. But if you instead use transcoding, then the number of conversion functions you need is simply N. Not only that, the mechanism is smoother, neater. Your code is more elegant. You simply don't have to /worry/ about all that nonsense trying to get the three built-in encodings to match, because the issue has simply gone away.
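
(To make the scaling argument concrete, here's a minimal sketch of a pivot-style transcoding layer - hypothetical names, not the interface Kris and I are actually working on: each encoding supplies just a decoder and an encoder to and from dchar[], and any-to-any conversion goes through that pivot, so N encodings cost N codecs instead of N^2 - N functions.)

    // Hypothetical sketch only.
    interface Codec
    {
        dchar[] decode(ubyte[] raw);    // encoding-specific bytes -> codepoints
        ubyte[] encode(dchar[] text);   // codepoints -> encoding-specific bytes
    }

    // Any encoding to any other is just two codec calls via the dchar[] pivot.
    ubyte[] transcode(Codec from, Codec to, ubyte[] raw)
    {
        return to.encode(from.decode(raw));
    }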

And once the issue has gone away, you no longer need a special type to hold fragments of UTF-8 or UTF-16. Bye bye char. Bye bye wchar.

Kris (antiAlias) has sent me the transcoding interface which Mango requires. (Received, thanks). I've already written a generic one, though it didn't take those requirements into account. So today I'm going to merge the two approaches together and see what Kris thinks. I'm pretty confident that within a few days, Kris and I will have put together a transcoding architecture we're both happy with - and since Kris has expertise in streams/Mango, and I have expertise in Unicode/internationalization, I'd make a pretty good wager that between us we're going to get it right. And we'll plumb in the UTF transcoders first. You can probably expect all that to be done within days rather than weeks.

So why would we then need old-style-char or wchar any more?

For reasons of space-efficiency, one might want to store text in memory in UTF-8 format. Fair enough. But if char were to be ditched, you could still do that. You'd simply use a ubyte[] for that purpose (just as you are now required to do if you want to store text in memory in UTF-7). After all - what actually /is/ a UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in isolation? Answer - none. It has meaning only in the context of the bytes surrounding it. You don't need a special primitive type just to hold that fragment. And of course, there is /nothing/ to stop special string classes from being written to provide implementations of such space-efficient abstractions.
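
(A quick illustration of just how little a lone fragment means - the codepoint below is arbitrary, picked only because its UTF-8 encoding happens to end in 0x83:)

    void example()
    {
        char[] s = "\u0443";            // CYRILLIC SMALL LETTER U
        ubyte[] raw = cast(ubyte[]) s;  // the same text as raw bytes: 0xD1, 0x83
        // The trailing 0x83 is only a continuation byte; in isolation it
        // identifies no character at all, so it hardly needs a primitive
        // type of its own.
    }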

A further argument against char is that people coming from C/C++ /will/ try to store ISO-8859-1 encoded strings in a char[]. And they will get away with it, too, so long as they don't try calling any toUTFxx() routines on them. Bugs can be deeply buried in this confusion, failing to surface for a very long time.
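
(A minimal sketch of how that bug stays hidden - nothing from a real codebase, it just shows the failure mode:)

    import std.utf;

    void demo()
    {
        char[] s = "caf" ~ cast(char) 0xE9;  // 0xE9 is 'é' in ISO-8859-1,
                                             // but never a valid UTF-8 sequence
        // Length, slicing, copying, printing the raw bytes: all appear fine...
        wchar[] w = toUTF16(s);  // ...until a toUTFxx() call finally rejects
                                 // the invalid sequence at runtime.
    }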

Discussion in another thread has focused on the fact that Object.toString() returns a char[]. Regan and I have made the suggestion that the three string types be interchangeable. But there's a better way: have just the /one/ string type. (As they say in Highlander, "There can be only one"). Problem gone away.

With new-style-char redefined, not merely as a UTF-32 code unit (a fragment of an encoding), but as an actual Unicode character, things become much, much simpler.

AND it would make life easier for Walter - fewer primitive types; less for the compiler to understand/do.

Java tried to do it this way. When Java was invented, they had a char type intended to hold a single Unicode character. They also had a byte type, an array of which could store ASCII or ISO-8859-1 or UTF-8 encoded text. They also had transcoding built in, to make it all hang together. Where it went wrong for Java was that Unicode changed from being 16-bits wide to being 21-bits wide (so suddenly Java's char was no longer wide enough, and they were forced to redefine Java strings as being UTF-16 encoded). But please note that Java did /not/ attempt to have separate char types for each encoding. Even /after/ Unicode exceeded 16-bits, Java was not tempted to introduce a new kind of char. Why not? Because having more than one char type is an ugly kludge (particularly if you're using Unicode by definition). It's an ugly kludge in D, too. I thought it was really good, once upon a time, but now that transcoding is moving out to libraries, and encompasses many /more/ encodings than merely UTF-8/16/32, I no longer think that. Now is the best time of all for a rethink.

But ...

there's a down-side ... it would break a lot of existing code. Well, so what? This is a pre-1.0 wart-removing exercise. Like all of those other suggestions we're voting on in another thread, the time to make this change is now, before it's too late.

Arcane Jill


August 23, 2004
After you proposed these ideas about allowing toString to return any character type, I started thinking about it, and finally I thought: Why do we have more than one character type? (Just like you do.)


August 23, 2004
Arcane Jill wrote:

> But there's a better way:
> have just the /one/ string type. (As they say in Highlander, "There can be
> only one"). Problem gone away.

Yes, it makes a lot of sense. You have my (useless) vote.

> AND it would make life easier for Walter - fewer primitive types; less for the compiler to understand/do.

I think Walter should like it, if only for this.

> But ...
> 
> there's a down-side ... it would break a lot of existing code. Well, so what? This is a pre-1.0 wart-removing exercise. Like all of those other suggestions we're voting on in another thread, the time to make this change is now, before it's too late.

I'll be very happy to change my (little) code now.
August 23, 2004
Arcane Jill wrote:

[snip]

There were huge threads about char vs wchar vs dchar a while ago (on the old
newsgroup, I think). All kinds of things like what the default should be,
what the names should be, what a string class could be etc. For example
 http://www.digitalmars.com/d/archives/20361.html
 http://www.digitalmars.com/d/archives/12382.html
or actually anything at
 http://www.digitalmars.com/d/archives/index.html
with the word "unicode" in the subject.

By the way, why, if there are N encodings, are there N^2-N converters? Shouldn't there just be ~2*N to convert to/from one standard like dchar[]? IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as the standard.

-Ben
August 23, 2004
In article <cgcoe6$2cq4$1@digitaldaemon.com>, Ben Hinkle says...
>

>There were huge threads about char vs wchar vs dchar a while ago (on the old newsgroup, I think). All kinds of things like what the default should be, what the names should be, what a string class could be etc. For example
> http://www.digitalmars.com/d/archives/20361.html http://www.digitalmars.com/d/archives/12382.html
>or actually anything at
> http://www.digitalmars.com/d/archives/index.html
>with the word "unicode" in the subject.

Well spotted. I had a look at some of those old threads, and it does seem that most of the views back then were saying much the same thing as I'm suggesting now - which is good, as I'm happy to count it as more votes for the proposal, AND as evidence of ongoing discontent over some years. The difference between now and then is that /now/ we have transcoding classes underway, and we'll have a working architecture very very soon, which will be able to plug into any kind of string or stream class. This is the difference which makes ditching char and wchar an actual practical possibility now.

Incidentally, there were plenty of views in those archives which basically said that the Unicode functions which now exist in etc.unicode (and which didn't exist at the time) should exist. That's one problem solved.




>By the way, why if there are N encodings are there N^2-N converters? Shouldn't there just be ~2*N to convert to/from one standard like dchar[]?

Well, that's how transcoding will do it, obviously. I was comparing it to the
present system, in which N == 3 (UTF-8, UTF-16 and UTF-32), and there are 6 (=
3^2-3) converters in std.utf, these being:

*) toUTF8(wchar[]);
*) toUTF8(dchar[]);
*) toUTF16(char[]);
*) toUTF16(dchar[]);
*) toUTF32(char[]);
*) toUTF32(wchar[]);

If the current (std.utf) scheme were to be extended to include, say, UTF-7 and UTF-EBCDIC, how would that scale up?
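
(Counting it out: five encodings would mean 5^2 - 5 = 20 direct conversion functions - four overloads each of toUTF8, toUTF16, toUTF32 and the hypothetical toUTF7 and toUTFEBCDIC - against just five encode/decode pairs through a single pivot.)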



>IBM's ICU (at http://oss.software.ibm.com/icu/)

Bloody hell. I wish someone had pointed me at ICU earlier. That is exceptional. They've even got Unicode Regular Expressions! And transcoding functions. And it's open source, too!

Should I just give up on etc.unicode? Maybe we should just put a D wrapper around ICU instead, which would give D full Unicode support right now, and leave me free to do crypto stuff!


>IBM's ICU (at http://oss.software.ibm.com/icu/)
>uses wchar[] as the standard.

Ah, no it doesn't. I just checked. ICU has the types UChar (platform dependent,
but wchar for us) and UChar32 (definitely a dchar). So you see, both wchar[] and
dchar[] are "standards" for ICU. (That said, I've only looked at it for a few
seconds, so I may have misunderstood).

Anyway, UTF-16 transcoding will easily take care of interfacing with any UTF-16 architecture. The present situation in D is no more compatible than what I'm suggesting.

Slightly modified proposal then - ditch char and wchar as before, PLUS, incorporate ICU into D's core and write a D wrapper for it. (And ditch etc.unicode - erk!) The ICU license is at http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.
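
(For the curious, a thin binding could start as little more than a few extern (C) declarations over ICU's C converter API in ucnv.h. The sketch below is indicative only - the void*/int parameters stand in for ICU's UConverter* and UErrorCode*, and ICU's exported symbols are versioned in some builds, so a real wrapper needs rather more care.)

    import std.string;   // for toStringz

    extern (C)
    {
        void* ucnv_open(char* converterName, int* err);
        void  ucnv_close(void* converter);
    }

    // Hypothetical convenience function on the D side (error handling omitted):
    void* openConverter(char[] name)
    {
        int err = 0;
        return ucnv_open(toStringz(name), &err);
    }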

Arcane Jill


August 23, 2004
"Ben Hinkle" <bhinkle4@juno.com> wrote in message news:cgcoe6$2cq4$1@digitaldaemon.com...

[snip]

> There were huge threads about char vs wchar vs dchar a while ago (on the
old
> newsgroup, I think). All kinds of things like what the default should be,
> what the names should be, what a string class could be etc. For example
>  http://www.digitalmars.com/d/archives/20361.html
>  http://www.digitalmars.com/d/archives/12382.html
> or actually anything at
>  http://www.digitalmars.com/d/archives/index.html
> with the word "unicode" in the subject.
>
> By the way, why if there are N encodings are there N^2-N converters? Shouldn't there just be ~2*N to convert to/from one standard like dchar[]? IBM's ICU (at http://oss.software.ibm.com/icu/) uses wchar[] as the standard.

Indeed. There were several large discussions about this. Only a few
Scandinavian/North European readers of this group seemed to be
positive at the time.
I am happy to see that more people are warming to the idea.
wchar (16-bit) is enough. It is even suggested as the best
implementation size by some Unicode coding experts.
IBM / Sun / MS cannot all be stupid at the same time...
I think it would be smart to interoperate with the 16-bit size
used internally in ICU, Java and MS-Windows. Only on unix/linux
would it make sense to use a 32-bit dchar.
16 bits is enough for 99% of the cases/languages.
The last 1% can be handled quite fast by cached indexing
techniques in a String object. (This does not make for
optimal speed in the 1% case, but it will more than pay
for itself speed-wise in 99% of all binary I/O operations. :)
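
(A rough sketch of the kind of cached-indexing String wrapper I mean - a hypothetical class, nothing that exists today: it notes at construction time whether any surrogate pairs are present, so the common case keeps plain array indexing.)

    class String16
    {
        wchar[] data;
        bool    hasSurrogates;

        this(wchar[] s)
        {
            data = s;
            for (size_t i = 0; i < s.length; i++)
            {
                if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                {
                    hasSurrogates = true;
                    break;
                }
            }
        }

        // O(1) in the 99% case; a linear scan only when surrogates exist.
        dchar charAt(size_t n)
        {
            if (!hasSurrogates)
                return data[n];

            size_t pos = 0;
            for (size_t i = 0; i < data.length; i++)
            {
                dchar c = data[i];
                if (c >= 0xD800 && c <= 0xDBFF)   // lead surrogate: merge pair
                {
                    c = cast(dchar) (0x10000 + ((c - 0xD800) << 10)
                                     + (data[i + 1] - 0xDC00));
                    i++;                          // skip the trail unit
                }
                if (pos++ == n)
                    return c;
            }
            assert(0);                            // index out of range
        }
    }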

However, that is Walter's main issue, I think. He wants 8-bit chars to be the default because this will make for the best possible I/O speed with the current state of affairs. That is what I understood from the last discussion, at least. I am sure he will comment on this thread ;-) and correct me if I am wrong.

Regards,
Roald


August 23, 2004
In article <cgct0l$2eut$1@digitaldaemon.com>, Roald Ribe says...

>However, that is Walter's main issue, I think. He wants 8-bit chars to be the default because this will make for the best possible I/O speed with the current state of affairs. That is what I understood from the last discussion, at least. I am sure he will comment on this thread ;-) and correct me if I am wrong.

Is that true?

But UTF-8 /doesn't/ give the best possible I/O speed. To achieve that, you'd need to be using the OS-native encoding internally (ISO-8859-1 on most Linux boxes, WINDOWS-1252 on most Windows boxes). If UTF-8 is not used natively (which most of the time it isn't), you'd still need transcoding. Fact is, transcoding from UTF-16 to ISO-8859-1 or WINDOWS-1252 is going to be much faster than transcoding from UTF-8 to those encodings.

And in any case, the time spent transcoding is almost always going to be insignificant compared to time spent doing actual I/O. Think console input; writing to disk; reading from CD-ROM; writing to a socket; .... Transcoding is really not a bottleneck.

Jill


August 23, 2004
The case for retaining wchar has been made, and essentially won. Please see separate thread about ICU (and maybe move this discussion there).

"char", however, is still up for deletion, since all arguments against it still apply.

Jill


August 23, 2004
Arcane Jill wrote:


> Slightly modified proposal then - ditch char and wchar as before, PLUS, incorporate ICU into D's core and write a D wrapper for it. (And ditch etc.unicode - erk!) The ICU license is at http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html.

I suppose that if dchar is what's left after this ditching, it should be renamed to char.

AJ, I don't know a shit^D^D^D^D too much about Unicode, but your excitement about ICU is really contagious. Only one question: are the C wrappers at the same level as the C++/Java ones? If so, it seems that with a little easy and boring (compared to writing etc.unicode) wrapping we're going to have a first-class Unicode lib :) => (i18n version of <g>)

August 23, 2004
Arcane Jill wrote:

> For reasons of space-efficiency, one might want to store text in memory in UTF-8
> format. Fair enough. But if char were to be ditched, you could still do that.
> You'd simply use a ubyte[] for that purpose (just as you are now required to do
> if you want to store text in memory in UTF-7). After all - what actually /is/ a
> UTF fragment anyway? What meaning does the UTF-8 fragment 0x83 have in
> isolation? Answer - none. It has meaning only in the context of the bytes
> surrounding it. You don't need a special primitive type just to hold that
> fragment. And of course, there is /nothing/ to stop special string classes from
> being written to provide implementations of such space-efficient abstractions.

I think it might be worth it for the conceptual clarity.  UTF-32 happens to be the character type that's hardest to break.  It seems logical that it be the default.

The programmer can still take control and use another encoding when the problem domain allows for it, but it's an optimization, not business as usual.

 -- andy