Unicode in D
January 16, 2003
I think you'll be making a big mistake if you adopt C's obsolete char == byte concept of strings. Savvy language designers these days realize that, like ints and floats, chars should be a fundamental data type at a higher level of abstraction than raw bytes. The model that most modern language designers are turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.

If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use, instead of the nightmare that C/C++ programmers face when passing string parameters ("now, let's see, is this a char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a char[] or a wchar_t[] or an instance of one of countless string classes...?"). The fact that not just every library but practically every project feels the need to reinvent its own string type is proof of the need for a good, solid, canonical form built right into the language.

Most language designers these days either get this from the start or they figure it out later and have to screw up their language with multiple string types.

Having canonical UTF-16 chars and strings internally does not mean that you can't deal with other character encodings externally. You can convert to canonical form on import and convert back to some legacy encoding on export. When you create the strings yourself, or when they are created in Java or C# or JavaScript or default XML or most new text protocols, no conversion will be necessary. It will only be needed for legacy data (or for a very lightweight switch between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding conversion in and out, you can still treat the external strings as byte arrays instead of strings, assuming you have a "byte" data type, and do direct byte manipulation on them. That's essentially what you would have been doing anyway if you had used the old char == byte model I see in your docs. You just call it "byte" instead of "char" so it doesn't end up being your default string type.
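To make the import half concrete, here is a minimal sketch in present-day D syntax; Latin-1 is picked purely as an example of a legacy encoding, and the function name is made up for illustration:

    // Hypothetical import boundary: widen a legacy Latin-1 byte buffer into
    // UTF-16 code units. Latin-1 is convenient for a sketch because every byte
    // value is also the number of the Unicode code point it represents.
    wchar[] importLatin1(const(ubyte)[] raw)
    {
        auto text = new wchar[](raw.length);
        foreach (i, b; raw)
            text[i] = cast(wchar) b;   // U+0000..U+00FF each fit in a single code unit
        return text;
    }

Export would be the reverse walk, and data that really must stay in its legacy encoding can simply live in a "byte" array as described above.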

Having a modern UTF-16 char type, separate from arrays of "byte", gives you a consistency that allows for the creation of great libraries (since text is such a fundamental type). Java and C# designers figured this out from the start, and their libraries universally use a single string type. Perl figured it out pretty late, and as a result, with the addition of UTF-8 to Perl in v5.6, it's never clear which CPAN modules will work and which ones will fail, so you have to use pragmas ("use utf8" vs. "use bytes") and do lots of testing.

I hope you'll consider making this change to your design. Have an 8-bit unsigned "byte" type and a 16-bit unsigned UTF-16 "char", and forget about this "8-bit char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff, or I'm quite sure you'll regret it later. C/C++ are in that sorry state for legacy reasons only, not because their designers were foolish, but any new language that intentionally copies that "design" is likely to regret the decision.



January 16, 2003
On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:

> I think you'll be making a big mistake if you adopt C's obsolete char == byte

What about embedded work? This needs to be lightweight.

In any case, a 16-bit character set doesn't hold all the charsets needed by the world's languages, but a 20-bit charset (UTF-8) is overkill. Then again, most programmers get by with 8 bits 99% of the time, so you need to give people options.

-paul

January 16, 2003
On Thu, 16 Jan 2003 14:40:15 +0200
"Paul Sheer" <psheer@icon.co.za> wrote:

> On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
> 
> > I think you'll be making a big mistake if you adopt C's obsolete char == byte
> 
> what about embedded work? this needs to be lightweight
> 
> in any case, a 16 bit character set doesn't hold all
> the charsets needed by the worlds languages, but a
> 20 bit charset (UTF-8) is overkill. then again, most
> programmers get by with 8 bits 99% of the time. So you
> need to give people options.

But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.) Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)

UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings.
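For a rough sense of how much longer, a small sketch (in present-day D, where string literals are UTF-8 and .length counts bytes):

    // ASCII text costs one byte per character in UTF-8; accented Latin letters
    // cost two bytes each; most CJK characters cost three.
    static assert("hello".length == 5);    // 5 characters, 5 bytes
    static assert("héllo".length == 6);    // 'é' (U+00E9) takes 2 bytes
    static assert("日本語".length == 9);    // 3 characters, 3 bytes each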

FWIW, I wholeheartedly support Unicode strings in D.

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem
January 16, 2003
I'm all for UTF-8.  Most fonts don't come anywhere close to having all the glyphs anyway, but it's still nice to use an encoding that actually has a real definition (whereas "byte" has no meaning whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS).  UTF-8 allows you the full Unicode range, but the part we use every day takes just 1 byte per char, as usual.  I believe it even maps almost 1:1 to ASCII in that range.

You cannot, however, make a UTF-8 data type.  By definition, each character may take more than one byte.  But you don't make arrays of characters; you make arrays of character building blocks (bytes) that are interpreted as characters.

Anyway, we'd need some automated way to step through the array one character at a time.  Maybe a string could be an array of bytes that pretends to be an array of 32-bit Unicode characters?
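Something along these lines could do the stepping; this is only a hand-rolled sketch in present-day D syntax, it assumes well-formed UTF-8, and the function name is made up rather than taken from any library:

    // Decode the code point that starts at str[i] and advance i past it.
    // Assumes well-formed UTF-8; real code would validate the trail bytes.
    dchar nextCodePoint(const(char)[] str, ref size_t i)
    {
        uint b = str[i++];
        if (b < 0x80)                                  // 1-byte form: U+0000..U+007F
            return cast(dchar) b;
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        uint cp = b & (0x3F >> extra);                 // payload bits of the lead byte
        while (extra--)
            cp = (cp << 6) | (str[i++] & 0x3F);        // fold in 6 bits per trail byte
        return cast(dchar) cp;
    }

A loop like `for (size_t i = 0; i < s.length; ) use(nextCodePoint(s, i));` then walks the byte array one character at a time.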

Sean

"Theodore Reed" <rizen@surreality.us> wrote in message news:20030116081437.1a593197.rizen@surreality.us...
> On Thu, 16 Jan 2003 14:40:15 +0200
> "Paul Sheer" <psheer@icon.co.za> wrote:
>
> > On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
> >
> > > I think you'll be making a big mistake if you adopt C's obsolete char == byte
> >
> > what about embedded work? this needs to be lightweight
> >
> > in any case, a 16 bit character set doesn't hold all
> > the charsets needed by the worlds languages, but a
> > 20 bit charset (UTF-8) is overkill. then again, most
> > programmers get by with 8 bits 99% of the time. So you
> > need to give people options.
>
> But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.) Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)
>
> UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings.
>
> FWIW, I wholeheartedly support Unicode strings in D.
>
> --
> Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~
>
> "We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem


January 16, 2003
On Thu, 16 Jan 2003 09:49:58 -0800
"Sean L. Palmer" <seanpalmer@directvinternet.com> wrote:

> I'm all for UTF-8.  Most fonts don't come anywhere close to having all the glyphs anyway, but it's still nice to use an encoding that actually has a real definition (whereas "byte" has no meaning whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) UTF-8 allows you the full unicode range but the part that we use everyday just takes 1 byte per char, like usual.  I believe it even maps almost 1:1 to ASCII in that range.

AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.

-- 
Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~

"The word of Sin is Restriction. O man! refuse not thy wife, if she will! O lover, if thou wilt, depart! There is no bond that can unite the divided but love: all else is a curse. Accursed! Accursed be it to the aeons! Hell." -- Liber AL vel Legis, 1:41
January 16, 2003
Theodore Reed wrote:
> On Thu, 16 Jan 2003 09:49:58 -0800
> "Sean L. Palmer" <seanpalmer@directvinternet.com> wrote:
> 
> 
>>I'm all for UTF-8.  Most fonts don't come anywhere close to having all
>>the glyphs anyway, but it's still nice to use an encoding that
>>actually has a real definition (whereas "byte" has no meaning
>>whatsoever and could mean ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.) UTF-8 allows you the full unicode range but the part that we use
>>everyday just takes 1 byte per char, like usual.  I believe it even
>>maps almost 1:1 to ASCII in that range.
> 
> 
> AFAIK, Unicode between 0 and 127 is the exact same thing as ASCII.
> 


As I see it there are two issues here. Firstly, there is the ability to read and manipulate text streams that are encoded in one of the many multi-byte/variable-width formats; and secondly, there is allowing code itself to be written in mb/vw formats. The first can be achieved (though perhaps not transparently) using a library, while the second obviously requires work on the front end of the compiler. The front end is freely available under the GPL/Artistic licences, and I don't think it would be difficult to augment it with mb/vw support.
However, this doesn't give us an integrated solution such as you might find in other languages, but it is a start.

Alix Pexton
Webmaster - "the D journal"
www.thedjournal.com

PS
Who needs mb/vw when we have Lojban? ;)

January 16, 2003
Hi,

I have been thinking about this issue too, and I also think that Unicode strings should be a prime concern of D. And, yes, UTF-8 is the way to go. I would very much like to see a string using canonical UTF-8 encoding built right into the language, as a class with value semantics.

What we are faced with is:

1. We need char and wchar_t for compatibility with APIs.
2. We need good Unicode support.
3. We need a memory-efficient representation of strings.
4. We need easy manipulation of strings.

There are two fundamental types of text data: a character and a string. Java, for its part, uses two kinds of strings: a String class for storing strings and a StringBuffer for manipulating them. This separation solves many problems.

I believe that:

- A single character should be represented using 32-bit UCS-4 with native endianness, like the wchar_t commonly seen on UNIX. It should probably be a struct in order to avoid the overhead of a vtbl while still supporting character methods such as isUpper() and toUpper().

- A non-modifiable string should be stored using UTF-8. By non-modifiable I mean that it does not allow individual characters to be manipulated, but it does allow reassignment. Read-only forward character iterators could also be supported in an efficient manner. As has already been stated, such strings would in most cases be as memory-efficient as C's char arrays. This also addresses Walter's concern about performance issues with CPU caches. But it also means that the concept of simply using arrays is not good enough. This string class should also provide functionality such as a collate() method.

- A modifiable string should support manipulation of individual characters, and could likely be an array of UCS-4 characters; a rough sketch of these three pieces follows below.

Methods should be provided for converting to/from char* and wchar_t* (whether it is 16- or 32-bit) as needed for supporting C APIs. Some will argue that this would involve too many conversions. However, if you are using char* today on Windows, Windows does this conversion all the time, and you probably do not notice. And if it really becomes a bottleneck, optimization would be simple in most cases: just cache the converted string. And if you are only concerned with using C APIs, use the C string functions such as strcat()/wcscat() or specialized classes.

In addition, character encoders could be provided for whatever representation is needed. I myself would like support for US-ASCII, EBCDIC, ISO-8859, UTF-7, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, and US-ASCII/ISO-8859 with characters encoded as in HTML (I don't remember what that standard is called, but it specifies characters using "&somename;"). Others would have different needs, so it should be simple to implement a new character encoder/decoder.
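In rough outline, and only as a sketch in present-day D syntax with placeholder names and stubbed-out bodies, the three pieces might look like this:

    // A single character: a 32-bit code point in a struct, so there is no vtbl.
    struct Char
    {
        dchar value;
        bool isUpper() const { return value >= 'A' && value <= 'Z'; }   // ASCII-only stub
        Char toUpper() const
        {
            return (value >= 'a' && value <= 'z') ? Char(cast(dchar)(value - 32)) : this;
        }
    }

    // A non-modifiable string: UTF-8 storage, reassignment allowed, and only
    // read-only forward iteration over code points (plus collate() and friends).
    struct Text
    {
        private immutable(char)[] utf8;
        // forward iterator, comparison, collate(), encode()/decode() would go here
    }

    // A modifiable string: an array of 32-bit characters, cheap to index and edit.
    alias TextBuffer = dchar[];

The point of the split is the one made above: the compact, immutable UTF-8 form is what gets stored and passed around, and the UCS-4 array is what you edit.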

Regards,
Martin M. Pedersen.

"globalization guy" <globalization_member@pathlink.com> wrote in message news:b05pdd$13bv$1@digitaldaemon.com...
> I think you'll be making a big mistake if you adopt C's obsolete char == byte concept of strings. Savvy language designers these days realize that, like ints and floats, chars should be a fundamental data type at a higher level of abstraction than raw bytes. The model that most modern language designers are turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.
>
> If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use, instead of the nightmare that C/C++ programmers face when passing string parameters ("now, let's see, is this a char* or a const char* or an ISO C++ string or an ISO wstring or a wchar_t* or a char[] or a wchar_t[] or an instance of one of countless string classes...?"). The fact that not just every library but practically every project feels the need to reinvent its own string type is proof of the need for a good, solid, canonical form built right into the language.
>
> Most language designers these days either get this from the start or they figure it out later and have to screw up their language with multiple string types.
>
> Having canonical UTF-16 chars and strings internally does not mean that you can't deal with other character encodings externally. You can convert to canonical form on import and convert back to some legacy encoding on export. When you create the strings yourself, or when they are created in Java or C# or JavaScript or default XML or most new text protocols, no conversion will be necessary. It will only be needed for legacy data (or for a very lightweight switch between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding conversion in and out, you can still treat the external strings as byte arrays instead of strings, assuming you have a "byte" data type, and do direct byte manipulation on them. That's essentially what you would have been doing anyway if you had used the old char == byte model I see in your docs. You just call it "byte" instead of "char" so it doesn't end up being your default string type.
>
> Having a modern UTF-16 char type, separate from arrays of "byte", gives you a consistency that allows for the creation of great libraries (since text is such a fundamental type). Java and C# designers figured this out from the start, and their libraries universally use a single string type. Perl figured it out pretty late, and as a result, with the addition of UTF-8 to Perl in v5.6, it's never clear which CPAN modules will work and which ones will fail, so you have to use pragmas ("use utf8" vs. "use bytes") and do lots of testing.
>
> I hope you'll consider making this change to your design. Have an 8-bit unsigned "byte" type and a 16-bit unsigned UTF-16 "char", and forget about this "8-bit char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff, or I'm quite sure you'll regret it later. C/C++ are in that sorry state for legacy reasons only, not because their designers were foolish, but any new language that intentionally copies that "design" is likely to regret the decision.


January 17, 2003
In article <b065i9$19aa$1@digitaldaemon.com>, Paul Sheer says...
>
>On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
>
>> I think you'll be making a big mistake if you adopt C's obsolete char == byte
>
>what about embedded work? this needs to be lightweight

Good questions. I think you'll find, if you sniff around, that more and more embedded work is going to Unicode. The reason is that it is inevitable that any successful device that deals with natural language will be required to handle more and more characters as its market expands. When you add new characters by changing character sets, you get a high marginal cost per market, and you still can't handle mixed-language scenarios (which have become very common due to the Internet). When you add new characters by *adding* character sets, you lose all of your "lightweight" benefits.

I attended a Unicode conference once where there was a separate embedded systems conference going on in the same building. By the end of the conference, we had almost merged, at least in the hallways. ;-)

Unicode, done right, gives you universality at a fraction of the cost of patchwork solutions to worldwide markets. Even in English, the range of characters being demanded by customers has continued to grow. It grew beyond ASCII years ago and has now gone beyond Latin-1. MS Windows had to add a proprietary extension to Latin-1 before giving up entirely and going full Unicode, as did Apple with OS X, Sun with Java, Perl, HTML 4....

>
>in any case, a 16 bit character set doesn't hold all
>the charsets needed by the worlds languages, but a
>20 bit charset (UTF-8) is overkill. then again, most
>programmers get by with 8 bits 99% of the time. So you
>need to give people options.
>
>-paul

UTF-16 isn't a 16-bit character set. It's a 16-bit encoding of a character set that has an enormous repertoire. There is room for well over a million characters in the Universal Character Set (shared by Unicode and ISO 10646), and many of those "characters" are actually components meant to be combined with others to create a truly enormous variety of what most people think of as "characters". It is no longer correct to assume a 1:1 correspondence between a Unicode character and a glyph you see on a screen or on paper. (And that correspondence was lost way back when TrueType was created anyway).

The length of a string in these modern times is an abstract concept, not a physical one, when dealing with natural language. The nice 1:1 correspondences between code point / character / glyph are still available for artificial symbols created as sequences of ASCII printing characters, though, and that is true even in UTF-16 Unicode.

Unicode certainly does have room for all of the world's character sets. It is a superset of them all -- with "all" meaning those considered significant by the various national bodies represented in ISO and all of the industrial bodies providing input to the Unicode Technical Committee. It's not a universal superset in an absolute sense.

When you say "most programmers get by with 8 bits 99% of the time", I think you may be thinking a bit too narrowly. The composition of programmers has become more international than perhaps you realize, and the change isn't slowing down. Even in the West, most major companies have moved to Unicode *to solve their own problems*. MS programmers can't get by with 8 bits. Neither can Apple's, or Sun's, or Oracle's, or IBM's....

Another thing to consider is that programmers naturally use the tools that exist. For a long time, major programming languages had the fundamental equivalence of byte and char at their core. Many people who got by with 8 bits did so because there was no practical alternative.

These days there are practical alternatives, and modern languages need to be designed to take advantage of all the benefits that come with using Unicode.


>


January 17, 2003
In article <b06r0m$1l3u$1@digitaldaemon.com>, Sean L. Palmer says...
>
>I'm all for UTF-8.  Most fonts don't come anywhere close to having all the glyphs anyway,...

Modern font systems cover different Unicode ranges with different fonts. A font that contains all the Unicode glyphs is of very limited use. (It tends to be useful for primitive tools that assume a single font for all glyphs. Such tools are being superseded by modern tools, though, and the complexities of rendering are being delegated to central rendering subsystems.)

>... but it's still nice to use an encoding that actually has a
>real definition (whereas "byte" has no meaning whatsoever and could mean
>ANSI, DOS OEM, ASCII-7, UTF-8, or MBCS.)  UTF-8 allows you the full unicode
>range but the part that we use everyday just takes 1 byte per char, like
>usual.

I'd be careful about the "part we use every day" idea. I don't really know who's involved in this "D" project, but big-company developers tend to work more and more in systems that handle a rich range of characters. The reason is that that's what their company needs to do every day, whether they personally do or not. That's what is swirling around the Internet every day.

It is true, though, that for Westerners, ASCII characters occur more commonly, so UTF-8 has a sort of "poor man's compression" advantage that is often useful.

> I believe it even maps almost 1:1 to ASCII in that range.
>
>You cannot however make a UTF-8 data type.  By definition each character may take more than one byte.  But you don't make arrays of characters, you make arrays of character building blocks (bytes) that are interpreted as characters.
>

No, you make arrays of UTF-16 code units. When you need to do work with arrays of characters, UTF-16 is a better choice than UTF-8, though UTF-8 is better for data interchange with unknown recipients.

>Anyway we'd need some automated way to step through the array one character at a time.  Maybe string could be an array of bytes that pretends that it's an array of 32-bit unicode characters?

UTF-16. That's what it's for. UTF-32 is not practical for most purposes that involve large amounts of text.
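A concrete example of the code-unit distinction, as a small sketch in present-day D syntax (wstring there is an array of UTF-16 code units):

    // U+1D11E (a musical G clef) lies outside the BMP, so UTF-16 stores it as a
    // surrogate pair: one character, two 16-bit code units.
    enum wstring clef = "\U0001D11E"w;
    static assert(clef.length == 2);                        // code units, not characters
    static assert(clef[0] == 0xD834 && clef[1] == 0xDD1E);  // high + low surrogate

Length in code units and length in characters part ways exactly at this point, which is the distinction being made above.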

>
>Sean
>
>"Theodore Reed" <rizen@surreality.us> wrote in message news:20030116081437.1a593197.rizen@surreality.us...
>> On Thu, 16 Jan 2003 14:40:15 +0200
>> "Paul Sheer" <psheer@icon.co.za> wrote:
>>
>> > On Thu, 16 Jan 2003 08:10:21 +0000, globalization guy wrote:
>> >
>> > > I think you'll be making a big mistake if you adopt C's obsolete char == byte
>> >
>> > what about embedded work? this needs to be lightweight
>> >
>> > in any case, a 16 bit character set doesn't hold all
>> > the charsets needed by the worlds languages, but a
>> > 20 bit charset (UTF-8) is overkill. then again, most
>> > programmers get by with 8 bits 99% of the time. So you
>> > need to give people options.
>>
>> But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.) Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)
>>
>> UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer. And it's better than having to deal with 50 million 8-bit encodings.
>>
>> FWIW, I wholeheartedly support Unicode strings in D.
>>
>> --
>> Theodore Reed (rizen/bancus)       -==-       http://www.surreality.us/ ~OpenPGP Signed/Encrypted Mail Preferred; Finger me for my public key!~
>>
>> "We have committed a greater crime, and for this crime there is no name. What punishment awaits us if it be discovered we know not, for no such crime has come in the memory of men and there are no laws to provide for it." -- Equality 7-2521, Ayn Rand's Anthem
>
>


January 17, 2003
In article <20030116081437.1a593197.rizen@surreality.us>, Theodore Reed says...
>
>On Thu, 16 Jan 2003 14:40:15 +0200
>"Paul Sheer" <psheer@icon.co.za> wrote:
>
>But the default option should be UTF-8 with a module available for conversion. (I tend to stay away from UTF-16 because of endian issues.)

The default (and only) form should be UTF-16 in the language itself. There is no endianness issue unless data is serialized. Serialization is a type of output, like printing on paper, and I'm not suggesting serializing into UTF-16 by default; UTF-8 is the way to go for that. I'm only talking about the "model" used by the programming language.

Another way to look at it is to consider ints. Do you try to avoid the int data type? It has exactly the same endianness issues as UTF-16.

>Also, I'm not sure where you're getting the 20-bit part. UTF-8 can encode everything in the Unicode 32-bit range. (Although it takes like 8 bytes towards the end.)

He's right, actually. Unicode has a range of slightly over 20 bits. (1M + 62K, to be exact.) Originally, Unicode had a 16-bit range and ISO 10646 had a 31-bit range (not 32), but both have now converged on a little over 20 bits.
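Spelling that figure out (just arithmetic, written as a quick present-day-D check):

    // 17 planes of 65,536 code points each, minus the 2,048 values reserved as
    // UTF-16 surrogates, is the usable Unicode range.
    enum codePoints  = 17 * 65_536;              // 1_114_112 (0x110000), just over 20 bits
    enum surrogates  = 0xE000 - 0xD800;          //     2_048
    enum usableRange = codePoints - surrogates;  // 1_112_064
    static assert(usableRange == 1_048_576 + 63_488);   // i.e. "1M + 62K"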

>
>UTF-8 also addresses the lightweight bit, as long as you aren't using non-English characters, but even if you are, they aren't that much longer.

So does UTF-16: although Western characters take a little more space than with UTF-8, processing is lighter weight, and that is usually more significant.

> And it's better than having to deal with 50 million 8-bit
>encodings.
>

Amen to that! Talk about heavyweight...

>FWIW, I wholeheartedly support Unicode strings in D.

Yes, indeed. It is a real benefit to give users, because with Unicode strings as standard you get libraries that can take a lot of the really arcane issues off programmers' shoulders (and put them on the library authors' shoulders, where tough stuff belongs). When D programmers then deal with Unicode XML, HTML 4, Unicode databases, talking to Java or C#, or working with JavaScript, they can just send the strings to the libraries, confident that the "Unicode stuff" will be taken care of.

That's the kind of advantage modern developers get from Java that they don't get from good ol' C.


