January 17, 2003
In article <b07jht$22v4$1@digitaldaemon.com>, globalization guy says...

>That's the kind of advantage modern developers get from Java that they don't get from good ol' C.

Provided Solaris/the JVM is configured correctly by the (friggin') service provider.


January 17, 2003
You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?


January 17, 2003
In article <b08cdr$2fld$1@digitaldaemon.com>, Walter says...
>
>You make some great points. I have to ask, though, why UTF-16 as opposed to UTF-8?

Good question, and actually it's not an open and shut case. UTF-8 would not be a big mistake, but it might not be quite as good as UTF-16.

The biggest reason I think UTF-16 has the edge is that I think you'll probably want to treat your strings as arrays of characters on many occasions, and that's *almost* as easy to do with UTF-16 as with ASCII. It's really not very practical with UTF-8, though.

UTF-16 characters are almost always a single 16-bit code unit. Once in a billion characters or so, you get a character that is composed of two "surrogates". Sort of like half characters. Your code does have to keep this exceptional case in mind and handle it when necessary, though that is usually the type of problem you delegate to the standard library. In most cases, a function can just think of each surrogate as a character and not worry that it might be just half of the representation of a character -- as long as the two don't get separated. In almost all cases, though, you can think of a character as a single 16-bit entity, which is almost as simple as thinking of it as a single 8-bit entity. You can do bit operations on them and other C-like things and it should be very efficient.
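
To make the surrogate case concrete, here is a minimal sketch (in D, assuming wchar[] holds UTF-16 as proposed; the helper name and the literal are just for illustration) of counting characters while treating a surrogate pair as a single character:

    import std.stdio;

    // Count code points in a UTF-16 array, treating each surrogate pair
    // as one character.
    size_t codePointCount(const(wchar)[] s)
    {
        size_t n = 0;
        for (size_t i = 0; i < s.length; ++i)
        {
            // 0xD800-0xDBFF is a high surrogate, the first half of a pair,
            // so skip the low surrogate that must follow it.
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                ++i;
            ++n;
        }
        return n;
    }

    void main()
    {
        wchar[] s = "A\U0001D11E"w.dup;  // 'A' plus U+1D11E, which needs a surrogate pair
        writeln(s.length, " code units, ", codePointCount(s), " code points"); // 3, 2
    }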

Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four cases, three of which are very common. All of your code needs to do a good job with those three cases. Only the fourth can be considered exceptional. (Of course it has to be handled, too, but it is like the exceptional UTF-16 case, where you don't have to optimize for it because it rarely occurs). Most strings will tend to have mixed-width characters, so a model of an array of elements isn't a very good one.
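
And here are the four UTF-8 cases, classified by lead byte, in a rough sketch (again D; the helper is hypothetical, and well-formed input is assumed). Three of the four branches are hit constantly in non-ASCII text:

    import std.stdio;

    // Length of a UTF-8 sequence, determined from its lead byte.
    // (Real code must also validate the continuation bytes.)
    size_t sequenceLength(ubyte lead)
    {
        if (lead < 0x80)           return 1; // ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // U+0080 .. U+07FF
        if ((lead & 0xF0) == 0xE0) return 3; // U+0800 .. U+FFFF
        if ((lead & 0xF8) == 0xF0) return 4; // U+10000 .. U+10FFFF (the rare case)
        assert(0, "invalid UTF-8 lead byte");
    }

    void main()
    {
        foreach (s; ["a", "é", "€", "\U0001D11E"])
            writefln("%s starts a %s-byte sequence", s, sequenceLength(cast(ubyte) s[0]));
    }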

You can still implement your language with accessors that reach into a UTF-8 string and parse out the right character when you say "str[5]", but it will be further removed from the physical implementation than if you use UTF-16. For a somewhat lower-level language like "D", this probably isn't a very good fit.
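
Such an accessor would look roughly like this (a sketch only; charAt is a hypothetical helper, and std.utf.decode is a present-day library routine standing in for whatever decoding the runtime would provide). The point is that reaching the n-th character means walking the array from the start:

    import std.utf : decode;

    // Hypothetical accessor: fetch the n-th *character* of a UTF-8 array.
    // O(n), because there is no fixed byte offset to jump to.
    // Assumes n is within range.
    dchar charAt(const(char)[] s, size_t n)
    {
        size_t i = 0;
        dchar c;
        foreach (k; 0 .. n + 1)
            c = decode(s, i); // decodes one code point and advances i past it
        return c;
    }

    void main()
    {
        assert(charAt("héllo", 1) == 'é');
    }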

The main benefit of UTF-8 is when exchanging text data with arbitrary external parties. UTF-8 has no endianness problem, so you don't have to worry about the *internal* memory model of the recipient. It has some other features that make it easier to digest by legacy systems that can only handle ASCII. They won't work right outside ASCII, but they'll often work for ASCII, and they'll fail more gracefully than would be the case with UTF-16 (which is likely to contain embedded \0 bytes).

None of these issues are relevant to your own program's *internal* text model. Internally, you're not worried about endianness. (You don't worry about the endianness of your int variables, do you?) You don't have to worry about losing a byte in RAM, etc.

When talking to external APIs, you'll still have to output in a form that the API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs want UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so many and they aren't coordinated by a single body. Some will only be able to handle ASCII, others will be upgraded to UTF-8. I don't think the Unix system APIs will become UTF-16 because legacy is such a ball and chain in the Unix world, but the process is underway to upgrade the standard system encoding for all major Linux distributions to UTF-8.
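
As a sketch of what converting at the boundary looks like (toUTF16/toUTF8 are present-day library routines, named here only to make the shape of the call concrete):

    import std.utf : toUTF16, toUTF8; // illustrative; any converter would do

    void main()
    {
        string  internal = "Olá, mundo";      // kept as UTF-8 inside the program
        wstring forApi   = toUTF16(internal); // what a Win32/Java/.Net-style API expects
        assert(toUTF8(forApi) == internal);   // and it round-trips losslessly
    }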

If Linux APIs (and probably most Unix APIs eventually) are of primary importance, UTF-8 is still a possibility. I'm not totally ruling it out. It wouldn't hurt you much to use UTF-8 internally, but accessing strings as arrays of characters would require sort of a virtual string model that doesn't match the physical model quite as closely as you could get with UTF-16. The additional abstraction might have more overhead than you would prefer internally. If it's a choice between internal inefficiency and inefficiency when calling external APIs, I would usually go for the latter.

Most language designers who understand internationalization have decided to go with UTF-16 for languages that have their own rich set of internal libraries, and they have mechanisms for calling external APIs that convert the string encodings.



January 17, 2003
I read your post with great interest. However, I'm leaning towards UTF-8 for the following reasons (some of which you've covered):

1) In googling around and reading various articles, it seems that UTF-8 is gaining momentum as the encoding of choice, including for HTML.

2) Linux is moving towards UTF-8 permeating the OS. Doing UTF-8 in D means that D will mesh naturally with Linux system APIs.

3) Is Win32's "wide char" really UTF-16, including the multi-word encodings?

4) I like the fact that there are no endianness issues, which is important when writing files and transmitting text - it's a much more important issue than the endianness of ints.

5) 16-bit accesses on Intel CPUs can be pretty slow compared to byte or dword accesses (varies by CPU type).

6) Sure, UTF-16 reduces the frequency of multi-character encodings, but the code to deal with it must still be there and must still execute.

7) I've converted some large Java text processing apps to C++, and converted the Java 16-bit chars to UTF-8. That change resulted in *substantial* performance improvements.

8) I suspect that 99% of the text processed in computers is ASCII. UTF-8 is a big win in memory and speed for processing English text.

9) A lot of diverse systems and lightweight embedded systems need to work with 8-bit chars. Going to UTF-16 would, I think, reduce the scope of applications and systems that D would be useful for. Going to UTF-8 would make it as broad as possible.

10) Interestingly, making char[] in D to be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.

11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching (a quick sketch follows below).

See http://www.cl.cam.ac.uk/~mgk25/unicode.html
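
Regarding point 11, a quick sketch of why that is (in D; std.string.indexOf is present-day library code doing an ordinary byte-wise substring search): no byte of a multi-byte UTF-8 sequence falls in the ASCII range, so searching for an ASCII pattern can never produce a false match inside a multi-byte character.

    import std.stdio;
    import std.string : indexOf;

    void main()
    {
        string s = "prix : 25 €, taxe incluse"; // contains a 3-byte UTF-8 character
        auto pos = indexOf(s, "taxe");          // plain byte-oriented substring search
        assert(pos != -1);                      // found, with no decoding anywhere
        writeln("found at byte offset ", pos);
    }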

"globalization guy" <globalization_member@pathlink.com> wrote in message news:b09qpe$aff$1@digitaldaemon.com...
> In article <b08cdr$2fld$1@digitaldaemon.com>, Walter says...
> >
> >You make some great points. I have to ask, though, why UTF-16 as opposed
to
> >UTF-8?
>
> Good question, and actually it's not an open and shut case. UTF-8 would
not be a
> big mistake, but it might not be quite as good as UTF-16.
>
> The biggest reason I think UTF-16 has the edge is that I think you'll
probably
> want to treat your strings as arrays of characters on many occasions, and
that's
> *almost* as easy to do with UTF-16 as with ASCII. It's really not very
practical
> with UTF-8, though.
>
> UTF-16 characters are almost always a single 16-bit code unit. Once in a
billion
> characters or so, you get a character that is composed of two
"surrogates". Sort
> of like half characters. Your code does have to keep this exceptional case
in
> mind and handle it when necessary, though that is usually the type of
problem
> you delegate to the standard library. In most cases, a function can just
think
> of each surrogate as a character and not worry that it might be just half
of the
> representation of a character -- as long as the two don't get separated.
In
> almost all cases, though, you can think of a character as a single 16-bit entity, which is almost as simple as thinking of it as a single 8-bit
entity.
> You can do bit operations on them and other C-like things and it should be
very
> efficient.
>
> Unlike UTF-16's two cases, one of which is very rare, UTF-8 has four
cases,
> three of which are very common. All of your code needs to do a good job
with
> those three cases. Only the fourth can be considered exceptional. (Of
course it
> has to be handled, too, but it is like the exceptional UTF-16 case, where
you
> don't have to optimize for it because it rarely occurs). Most strings will
tend
> to have mixed-width characters, so a model of an array of elements isn't a
very
> good one.
>
> You can still implement your language with accessors that reach into a
UTF-8
> string and parse out the right character when you say "str[5]", but it
will be
> further removed from the physical implementation than if you use UTF-16.
For a
> somewhat lower-level language like "D", this probably isn't a very good
fit.
>
> The main benefit of UTF-8 is when exchanging text data with arbitrary
external
> parties. UTF-8 has no endianness problem, so you don't have to worry about
the
> *internal* memory model of the recipient. It has some other features that
make
> it easier to digest by legacy systems that can only handle ASCII. They
won't
> work right outside ASCII, but they'll often work for ASCII and they'll
fail more
> gracefully than would be the case with UTF-16 (that is likely to contain
> embedded \0 bytes.)
>
> None of these issues are relevant to your own program's *internal* text
model.
> Internally, you're not worried about endianness. (You don't worry about
the
> endianness of your int variables, do you?) You don't have to worry about
losing
> a byte in RAM, etc.
>
> When talking to external APIs, you'll still have to output in a form that
the
> API can handle. Win32 APIs want UTF-16. Mac APIs want UTF-16. Java APIs
want
> UTF-16, as do .Net APIs. Unix APIs are problematic, since there are so
many and
> they aren't coordinated by a single body. Some will only be able to handle ASCII, others will be upgraded to UTF-8. I don't think the Unix system
APIs will
> become UTF-16 because legacy is such a ball and chain in the Unix world,
but the
> process is underway to upgrade the standard system encoding for all major
Linux
> distributions to UTF-8.
>
> If Linux APIs (and probably most Unix APIs eventually) are of primary importance, UTF-8 is still a possibility. I'm not totally ruling it out.
It
> wouldn't hurt you much to use UTF-8 internally, but accessing strings as
arrays
> of characters would require sort of a virtual string model that doesn't
match
> the physical model quite as closely as you could get with UTF-16. The
additional
> abstraction might have more overhead than you would prefer internally. If
it's a
> choice between internal inefficiency and inefficiency when calling
external
> APIs, I would usually go for the latter.
>
> Most language designers who understand internationalization have decided
to go
> with UTF-16 for languages that have their own rich set of internal
libraries,
> and they have mechanisms for calling external APIs that convert the string encodings.
>
>
>


January 18, 2003
Walter wrote:
> 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on
> or prevent dealing with wchar_t[] arrays being UTF-16.

You're planning on making this a part of char[]?  I was thinking of generating a StringUTF8 instance during compilation, but whatever.

I think we should kill off wchar if we go in this direction.  The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate.  If you need different encodings, use a library.

> 11) I'm not convinced the char[i] indexing problem will be a big one. Most
> operations done on ascii strings remain unchanged for UTF-8, including
> things like sorting & searching.

It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.

12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively.  That's not a minor advantage when you're trying to get people to switch to it!
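
A quick sketch of what that buys you (D calling straight into C; toStringz just appends the terminating NUL): every byte of a multi-byte UTF-8 sequence is >= 0x80, so an encoding-unaware routine like strlen never sees a stray NUL or control byte inside a character.

    import core.stdc.string : strlen; // plain C strlen
    import std.string : toStringz;    // appends the terminating NUL
    import std.stdio;

    void main()
    {
        string s = "naïve";            // 'ï' takes 2 bytes in UTF-8
        writeln(strlen(toStringz(s))); // prints 6: strlen counts bytes, but nothing breaks
    }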

January 18, 2003
"Burton Radons" <loth@users.sourceforge.net> wrote in message news:b0a6rl$i4m$1@digitaldaemon.com...
> Walter wrote:
> > 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.
> You're planning on making this a part of char[]?  I was thinking of generating a StringUTF8 instance during compilation, but whatever.

I think making char[] a UTF-8 is the right way.

> I think we should kill off wchar if we go in this direction.  The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate.  If you need different encodings, use a library.

I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.

> > 11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching.
> It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.

Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.
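
Something along these lines, say (just a sketch of how it could look; asking for dchar elements is what would tell foreach to decode):

    import std.stdio;

    void main()
    {
        char[] s = "héllo".dup;      // 6 bytes, 5 characters
        foreach (dchar c; s)         // decodes one code point per iteration
            writef("U+%04X ", cast(uint) c);
        writeln();
    }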

> 12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively.  That's not a minor advantage when you're trying to get people to switch to it!

You're right.


January 18, 2003
"Walter" <walter@digitalmars.com> wrote in message news:b0a7ft$iei$1@digitaldaemon.com...
>
> "Burton Radons" <loth@users.sourceforge.net> wrote in message news:b0a6rl$i4m$1@digitaldaemon.com...
> > Walter wrote:
> > > 10) Interestingly, making char[] in D to be UTF-8 does not seem to step on or prevent dealing with wchar_t[] arrays being UTF-16.
> > You're planning on making this a part of char[]?  I was thinking of generating a StringUTF8 instance during compilation, but whatever.
>
> I think making char[] a UTF-8 is the right way.

I would be more in favor of a String class that was UTF-8 internally. The problem with UTF-8 is that the number of bytes and the number of chars are dependent on the data. char[] to me implies an array of chars, so

char[] foo = "aa\u0555";

is 4 bytes, but only 3 chars. So what is foo[2]? And what if I set foo[1] = '\u0467'? And what about wanting 8-bit ASCII strings?

If you are going UTF-8, then think about the minor extension Java added to the encoding: allowing a two-byte encoding of 0, which allows embedded zeros in strings without messing up C's strlen (which returns the byte length).
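
i.e. something like this (a sketch; the 0xC0 0x80 pair is Java's "modified UTF-8" overlong form of 0, which a strict standard UTF-8 decoder must reject):

    import core.stdc.string : strlen;
    import std.stdio;

    void main()
    {
        // 'a', then U+0000 encoded as the overlong pair 0xC0 0x80, then 'b',
        // then the real terminating 0.
        immutable(ubyte)[] modified = [ 0x61, 0xC0, 0x80, 0x62, 0x00 ];
        writeln(strlen(cast(const(char)*) modified.ptr)); // 4: the embedded zero survives strlen
    }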

> > I think we should kill off wchar if we go in this direction.  The char/wchar conflict is probably the worst part of D's design right now as it doesn't fit well with the rest of the language (limited and ambiguous overloading), and it would provide absolutely nothing that char doesn't already encapsulate.  If you need different encodings, use a library.

> I agree that the char/wchar conflict is a screwup in D's design, and one I've not been happy with. UTF-8 offers a way out. wchar_t should still be retained, though, for interfacing with the win32 api.
>
> > > 11) I'm not convinced the char[i] indexing problem will be a big one. Most operations done on ASCII strings remain unchanged for UTF-8, including things like sorting & searching.
> > It's not such a speed hit any longer that all code absolutely must use slicing and iterators to be useful.
>
> Interestingly, if foreach is done right, iterating through char[] will work right, UTF-8 or not.


> > 12) UTF-8 doesn't embed ANY control characters, so it can interface with unintelligent C libraries natively.  That's not a minor advantage when you're trying to get people to switch to it!
>
> You're right.
>
>


January 18, 2003
>
> 6) Sure, UTF-16 reduces the frequency of multi-character encodings, but the code to deal with it must still be there and must still execute.
>
I was under the impression UTF-16 was glyph based, so each char (16 bits) was a glyph of some form; not every glyph causes the rendering to move to the next character position, so accents can be encoded as a postfix to the char they are over/under, and charsets like Chinese have sequences that generate the correct visual representation.

UTF-8 is just a way to encode UTF-16 so that it is compatible with ASCII: 0..127 map to 0..127, and 128..255 are used as special values identifying multi-byte sequences, so the string can be processed as 8-bit ASCII by software without problems; only the visual representation changes (128..255 on DOS are the box-drawing and international chars).
However, a sequence of 3 UTF-16 chars will encode to 3 UTF-8 sequences, and if they are all >127 then that would be 6 or more bytes. So if you consider the 3 UTF-16 values to be one "char", then UTF-8 should also consider the 6-or-more-byte sequence as one "char" rather than 3 "chars".



January 18, 2003
"globalization guy" <globalization_member@pathlink.com> escreveu na mensagem news:b05pdd$13bv$1@digitaldaemon.com...
> I think you'll be making a big mistake if you adopt C's obsolete char ==
byte
> concept of strings. Savvy language designers these days realize that, like
int's
> and float's, char's should be a fundamental data type at a higher-level of abstraction than raw bytes. The model that most modern language designers
are
> turning to is to make the "char" a 16-bit UTF-16 (Unicode) code unit.
>
> If you do so, you make it possible for strings in your language to have a single, canonical form that all APIs use. Instead of the nightmare that
C/C++
> programmers face when passing string parameters ("now, let's see, is this
a
> char* or a const char* or an ISO C++ string or an ISO wstring or a
wchar_t* or a
> char[] or a wchar_t[] or an instance of one of countless string
classes...?).
> The fact that not just every library but practically every project feels
the
> need to reinvent its own string type is proof of the need for a good,
solid,
> canonical form built right into the language.
>
> Most language designers these days either get this from the start of they
later
> figure it out and have to screw up their language with multiple string
types.
>
> Having canonical UTF-16 chars and strings internally does not mean that
you
> can't deal with other character encodings externally. You can can convert
to
> canonical form on import and convert back to some legacy encoding on
export.
> When you create the strings yourself, or when they are created in Java or
C# or
> Javascript or default XML or most new text protocols, no conversion will
be
> necessary. It will only be needed for legacy data (or a very lightweight
switch
> between UTF-8 and UTF-16). And for those cases where you have to work with legacy data and yet don't want to incur the overhead of encoding
conversion in
> and out, you can still treat the external strings as byte arrays instead
of
> strings, assuming you have a "byte" data type, and do direct byte
manipulation
> on them. That's essentially what you would have been doing anyway if you
had
> used the old char == byte model I see in your docs. You just call it
"byte"
> instead of "char" so it doesn't end up being your default string type.
>
> Having a modern UTF-16 char type, separate from arrays of "byte", gives
you a
> consistency that allows for the creation of great libraries (since text is
such
> a fundamental type). Java and C# designers figured this out from the
start, and
> their libraries universally use a single string type. Perl figured it out
pretty
> late and as a result, with the addition of UTF-8 to Perl in v. 5.6, it's
never
> clear which CPAN modules will work and which ones will fail, so you have
to use
> pragmas ("use utf-8" vs. "use bytes") and do lots of testing.
>
> I hope you'll consider making this change to your design. Have an 8-bit
unsigned
> "byte" type and a 16-bit unsigned UTF-16 "char" and forget about this
"8-bit
> char plus 16-bit wide char on Win32 and 32-bit wide char on Linux" stuff
or I'm
> quite sure you'll later regret it. C/C++ are in that sorry state for
legacy
> reasons only, not because their designers were foolish, but any new
language
> that intentionally copies that "design" is likely to regret that decision.
>

Hi,

    There was a thread a year ago on the SmallEiffel mailing list (starting at http://groups.yahoo.com/group/smalleiffel/message/4075 ) about Unicode strings in Eiffel. It's quite an interesting read about the problems of adding string-like Unicode classes. The main point is that true Unicode support is very difficult to achieve; only a few libraries provide good, correct and complete Unicode encoders/decoders/renderers/etc.
    While I agree that some Unicode support is a necessity today (my mother tongue is Brazilian Portuguese, so I use non-ASCII characters every day), we can't just add some base types and pretend everything is all right. We won't correct incorrectly written code with a primitive Unicode string. Most programmers don't think about Unicode when they develop their software, so almost every line of code dealing with text contains some assumptions about the character sets being used. Java has a primitive 16-bit char, but the basic library functions (because they need good performance) use incorrect code for string handling (the correct classes are in java.text, which provides the means to collate strings correctly). Sometimes we are just using plain old ASCII but get bitten by encoding issues, and when we do need true Unicode support the libraries trick us into believing everything is ok.
    IMO D should support a simple char array to deal with ASCII (as it does today) and some kind of standard library module to deal with Unicode glyphs and text. This could be included in Phobos or even in Deimos. Any volunteers? With this we could force the programmer to deal with another (albeit similar) set of tools when dealing with each kind of string: ASCII or Unicode. This module should allow creation of variable-sized strings and glyphs through an opaque ADT. Each kind of usage has different semantics and optimization strategies (e.g. Boyer-Moore is good for ASCII, but with Unicode the space and time usage are worse).

    Best regards,
    Daniel Yokomiso.

P.S.: I had to write some libraries and components (EJBs) in several Java projects to deal with data transfer in plain ASCII (communication with IBM mainframes). Each day I dreamed of using a language with simple one-byte character strings, without problems with encoding and endianness (Solaris vs. Linux vs. Windows NT have some nice "features" in their JVMs if you aren't careful when writing Java code that uses "ASCII" Strings). But Java has a 16-bit character type and a SIGNED byte type, both awkward for this usage. A language shouldn't get in the way of simple code.

"Never argue with an idiot. They drag you down to their level then beat you with experience."




January 18, 2003
> I was under the impression UTF-16 was glyph based, so each char (16 bits) was a glyph of some form; not every glyph causes the rendering to move to the next character position, so accents can be encoded as a postfix to the char they are over/under, and charsets like Chinese have sequences that generate the correct visual representation.

First, UTF-16 is just one of the many standard encodings for Unicode. UTF-16 allows more than 16-bit characters: with surrogates it can represent all of the more than one million code points.
(Unicode 2.0 used UCS-2, which is a 16-bit-only encoding.)


> I was under the impression UTF-16 was glyph based

from The Unicode Standard, ch2 General Structure
http://www.unicode.org/uni2book/ch02.pdf
"Characters, not glyphs  -  The Unicode Standard encodes characters, not
glyphs.
The Unicode Standard draws a distinction between characters, which are the
smallest components of written language that have semantic value, and
glyphs, which represent the shapes that characters can have when they are
rendered or displayed. Various relationships may exist between characters
and glyphs: a single glyph may correspond to a single character, or to a
number of characters, or multiple glyphs may result from a single
character."

btw, there are many precomposed characters in Unicode which can be represented with combining characters as well ([â] and [a, (combining ^)] are equally valid representations of [a with circumflex]).
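
To make that concrete, a small sketch (in D; std.uni.normalize is a present-day library routine, named here just for illustration):

    import std.uni : normalize; // defaults to NFC composition
    import std.stdio;

    void main()
    {
        string precomposed = "\u00E2";  // â as a single code point
        string combining   = "a\u0302"; // 'a' followed by a combining circumflex
        writeln(precomposed == combining);            // false: the code point sequences differ
        writeln(normalize(combining) == precomposed); // true once normalized
    }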