Unicode discussion

DISCLAIMER: I am not a "D programmer". I certainly haven't written any
real-world applications in the language yet but I am very knowledgeable
in localisation issues.

After the recent discussion regarding Unicode in D, which seems to
have faded away now, I have decided to write some initial comments on
what needs to be done to the language and APIs to make them support all
languages, not only English and Latin (which to my knowledge are the
only languages that can be written using 7-bit ASCII).

char types
----------

Today, according to the specification, there are three char
types: char, wchar and dchar. These are then used in arrays to
create three different kinds of internal string representations: UTF-8,
UTF-16 and UTF-32.

There are several problems with this. First and foremost, when you see
a declaration such as "char[] foo" you get the impression that this
is an array of characters. This is wrong. The UTF-8 specification
dictates that a UTF-8 string is an array of bytes, not
characters. This is an important distinction to make since you cannot
take the n'th character from a UTF-8 stream like this: string[n],
since you may get a part of a multibyte character sequence.

The wchar data type has the exact same problem, since it uses UTF-16
which also uses variable lengths for its characters.
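
To make the indexing problem concrete, here is a minimal sketch. It
assumes a present-day D compiler and Phobos rather than the library
being discussed in this thread; the helper names codeUnits and
codePoints are made up for illustration.

```d
import std.utf : count;

/// Number of UTF-8 code units (bytes) in s. This is what .length reports.
size_t codeUnits(const(char)[] s)
{
    return s.length;
}

/// Number of code points in s, found by decoding the whole string.
size_t codePoints(const(char)[] s)
{
    return count(s);
}

unittest
{
    // "angstrom" spelled with U+00E5 and U+00F6, each of which
    // takes two bytes in UTF-8.
    string s = "\u00E5ngstr\u00F6m";
    assert(codeUnits(s) == 10);
    assert(codePoints(s) == 8);
    // Indexing yields a code unit, here the lead byte of U+00E5,
    // not the character itself.
    assert(s[0] == 0xC3);
}
```

The same mismatch shows up with wchar[] as soon as a character outside
the Basic Multilingual Plane is stored as a surrogate pair.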

What is needed is a "char" datatype that is in fact able to hold a
character. You need 21 bits to describe a Unicode character (Unicode
allocates 17*2^16 code points, not all of which are assigned yet) and
therefore it seems reasonable to use a 32-bit data type for this.

In my opinion this data type should be named "char". For UTF-8 and
UTF-16 strings, one can use the "byte" and "short" data types, which
would be in keeping with the Unicode standards which (to my knowledge,
I'd have to look up the exact wording) declare UTF-8 and UTF-16 strings
as being sequences of bytes and 16-bit words respectively, and not
"characters".

String classes and functions
----------------------------

There is a set of const char[] arrays containing various character
sequences including: hexdigits, digits, uppercase, letters,
whitespace, etc... There are also character classification functions
that accept 8-bit characters. These should really be replaced by a new
but similar set of functions that work with 32-bit char types.

    isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()

These cannot be inlined functions since newer versions of the Unicode
standard can declare new code points and we need to be forward
compatible.

Another function is also needed: getCharacterCategory() which returns
the Unicode category. Some other functions are needed to determine
other properties of the characters such as the directionality. Take a
look at the Java classes java.text.BreakIterator and java.text.Bidi to
get some ideas.

Streams
-------

The current std.stream is not adequate for Unicode. It doesn't seem to
take encodings into consideration at all but is simply a binary
interface.

Strings in the Phobos stream library seem to deal primarily with
char[] and wchar[]. The most important stream type, dchar[], is not
even considered. Another problem with the library is that the point at
which native encoding<->Unicode conversion is performed is not
defined.

Personally, I have not given this much consideration yet, although I
kind of like the way Java did it by introducing two different kinds of
streams, byte streams and character streams. More discussion is clearly
needed.

Interoperability
----------------

In particular, C often uses 8-bit char arrays to represent
strings. This causes a problem when all strings are 32-bit
internally. The most straightforward solution is to convert UTF-32
char[] to UTF-8 byte[] before a call to a legacy function. This would
also very elegantly deal with the problem of zero-terminated C
strings vs. non-zero-terminated D strings (one of the char[]->UTF-8
conversion functions should create a zero-terminated byte array).

December 15, 2003

Re: Unicode discussion

Posted by Walter
in reply to Elias Martenson

"Elias Martenson" <elias-m@algonet.se> wrote in message news:brjvsf$28lb$1@digitaldaemon.com...
> char types
> ----------
>
> Today, according to the specification, there are three char
> types: char, wchar and dchar. These are then used in arrays to
> create three different kinds of internal string representations: UTF-8,
> UTF-16 and UTF-32.
>
> There are several problems with this. First and foremost, when you see
> a declaration such as "char[] foo" you get the impression that this
> is an array of characters. This is wrong. The UTF-8 specification
> dictates that a UTF-8 string is an array of bytes, not
> characters. This is an important distinction to make since you cannot
> take the n'th character from a UTF-8 stream like this: string[n],
> since you may get a part of a multibyte character sequence.
>
> The wchar data type has the exact same problem, since it uses UTF-16 which also uses variable lengths for its characters.
>
> What is needed is a "char" datatype that is in fact able to hold a character. You need 21 bits to describe a Unicode character (Unicode allocates 17*2^16 code points, not all of which are assigned yet) and therefore it seems reasonable to use a 32-bit data type for this.
>
> In my opinion this data type should be named "char". For UTF-8 and UTF-16 strings, one can use the "byte" and "short" data types, which would be in keeping with the Unicode standards which (to my knowledge, I'd have to look up the exact wording) declare UTF-8 and UTF-16 strings as being sequences of bytes and 16-bit words respectively, and not "characters".

The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future Unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification: 'byte' in D is signed. I think you meant 'ubyte', since UTF-8 bytes are unsigned.)

> String classes and functions
> ----------------------------
>
> There is a set of const char[] arrays containing various character sequences including: hexdigits, digits, uppercase, letters, whitespace, etc... There are also character classification functions that accept 8-bit characters. These should really be replaced by a new but similar set of functions that work with 32-bit char types.
>
>      isAlpha(), isNumber(), isUpper(), isLower(), isWhiteSpace()
>
> These cannot be inlined functions since newer versions of the Unicode standard can declare new code points and we need to be forward compatible.
>
> Another function is also needed: getCharacterCategory() which returns the Unicode category. Some other functions are needed to determine other properties of the characters such as the directionality. Take a look at the Java classes java.text.BreakIterator and java.text.Bidi to get some ideas.

I agree that more needs to be done in the D runtime library along these lines. I am not an expert on Unicode. Would you care to write those functions and contribute them to the D project?

> Streams
> -------
>
> The current std.stream is not adequate for Unicode. It doesn't seem to take encodings into consideration at all but is simply a binary interface.

That's correct.

> Strings in the Phobos stream library seem to deal primarily with char[] and wchar[]. The most important stream type, dchar[], is not even considered. Another problem with the library is that the point at which native encoding<->Unicode conversion is performed is not defined.

That's correct as well. The library's support for Unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever Unicode stream type the underlying operating system API supports. (For Win32 this would be UTF-16; I am unsure what Linux supports.)

> Personally, I have not given this much consideration yet, although I kind of like the way Java did it by introducing two different kinds of streams, byte streams and character streams. More discussion is clearly needed.
>
> Interoperability
> ----------------
>
> In particular, C often uses 8-bit char arrays to represent
> strings. This causes a problem when all strings are 32-bit
> internally. The most straightforward solution is to convert UTF-32
> char[] to UTF-8 byte[] before a call to a legacy function. This would
> also very elegantly deal with the problem of zero-terminated C
> strings vs. non-zero-terminated D strings (one of the char[]->UTF-8
> conversion functions should create a zero-terminated byte array).

D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name APIs to UTF-16 via std.utf.toUTF16z(), for use in calling the Win32 APIs.
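
As a rough illustration of the std.utf conversions mentioned above, the following sketch assumes the present-day Phobos signatures of toUTF8, toUTF16, toUTF32 and toUTF16z, which may differ from the 2003 library. The wrapper name toWideCString is hypothetical and only there to show the intent.

```d
import std.utf : toUTF8, toUTF16, toUTF16z, toUTF32;

/// Re-encodes UTF-8 text as UTF-16 and returns a pointer to a
/// zero-terminated copy, the form a wide-character C API expects.
const(wchar)* toWideCString(const(char)[] s)
{
    return toUTF16z(s);
}

unittest
{
    string  s8  = "na\u00EFve";   // U+00EF takes two bytes in UTF-8
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    assert(s8.length == 6);       // UTF-8 code units
    assert(s16.length == 5);      // UTF-16 code units
    assert(s32.length == 5);      // code points
    assert(toUTF8(s32) == s8);    // the conversions round-trip

    auto z = toWideCString(s8);
    assert(z[s16.length] == 0);   // terminator follows the text
}
```

Each conversion allocates a new array, so a library that stores char[] internally pays this cost once per call into a UTF-16 API.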

December 15, 2003

Re: Unicode discussion

Posted by Elias Martenson
in reply to Walter

On Mon, 15 Dec 2003 02:28:01 -0800, Walter wrote:

>> In my opinion this data type should be named "char". For UTF-8 and UTF-16 strings, one can use the "byte" and "short" data types, which would be in keeping with the Unicode standards which (to my knowledge, I'd have to look up the exact wording) declare UTF-8 and UTF-16 strings as being sequences of bytes and 16-bit words respectively, and not "characters".
> 
> The data type you're looking for is implemented in D and is the 'dchar'. A 'dchar' is 32 bits wide, wide enough for all the current and future Unicode characters. A 'char' is really a UTF-8 byte and a 'wchar' is really a UTF-16 short. Having 'char' be a separate type from 'byte' is pretty handy for overloading purposes. (A minor clarification: 'byte' in D is signed. I think you meant 'ubyte', since UTF-8 bytes are unsigned.)

Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.

The overloading issue is interesting, but may I suggest that char and wchar at least be renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.

And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.

I was almost going to provide a summary of the issues we're having in C with regards to this, but I don't know if it's necessary, and it's also getting late here (work early tomorrow).

>> [ my own comments regarding strings snipped ]
> 
> I agree that more needs to be done in the D runtime library along these lines. I am not an expert on Unicode. Would you care to write those functions and contribute them to the D project?

I'd love to help out and do these things. But two things are needed first:

    - At least one other person needs to volunteer.
      I've had bad experiences when one person does this by himself.

    - The core concepts need to be decided upon. Things seem to be
      somewhat in flux right now, with three different string types
      and all. At the very least it needs to be decided what a "string"
      really is: is it a UTF-8 byte sequence or a UTF-32 character
      sequence? I haven't hidden the fact that I would prefer the latter.

>> Streams
>> -------
>>
>> The current std.stream is not adequate for Unicode. It doesn't seem to take encodings into consideration at all but is simply a binary interface.
> 
> That's correct.

Agree. And as such it's very good.

>> Strings in the Phobos stream library seem to deal primarily with char[] and wchar[]. The most important stream type, dchar[], is not even considered. Another problem with the library is that the point at which native encoding<->Unicode conversion is performed is not defined.
> 
> That's correct as well. The library's support for Unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever Unicode stream type the underlying operating system API supports. (For Win32 this would be UTF-16; I am unsure what Linux supports.)

Yes. But this would then assume that char[] is always in native encoding, which doesn't square very well with the assertion that char[] is a UTF-8 byte sequence.

Or, the specification could be read as saying that the stream actually performs native decoding to UTF-8 when reading into a char[] array.

Unless fundamental encoding/decoding is embedded in the streams library,
it would be best to simply read text data into a byte array and then
perform native decoding manually afterwards using functions similar
to the C mbstowcs() and wcstombs(). The drawback to this is that you
cannot read text data in platform encoding without copying through
a separate buffer, even in cases when this is not needed.

>> In particular, C often uses 8-bit char arrays to represent
>> strings. This causes a problem when all strings are 32-bit
>> internally. The most straightforward solution is to convert UTF-32
>> char[] to UTF-8 byte[] before a call to a legacy function. This would
>> also very elegantly deal with the problem of zero-terminated C
>> strings vs. non-zero-terminated D strings (one of the char[]->UTF-8
>> conversion functions should create a zero-terminated byte array).
> 
> D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name APIs to UTF-16 via std.utf.toUTF16z(), for use in calling the Win32 APIs.

This can be done in a much better, platform-independent way, by using the native<->Unicode conversion routines. In C, as already mentioned, these are called mbstowcs() and wcstombs(). For Windows, these would convert to and from UTF-16. For Unix, these would convert to and from whatever encoding the application is running under (dictated by the LC_CTYPE environment variable). There really is no need to make the APIs platform-dependent in any way here.

In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest. This goes for all the other methods and functions that accept string parameters. This of course still depends on what a "string" really is; that needs to be decided, and I think you are the only one who can make that call. Although more discussion on the subject might be needed first?

Regards

Elias Mårtenson

December 16, 2003

Re: Unicode discussion

Posted by Walter
in reply to Elias Martenson

"Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047@spam.spam...
> Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.

In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.


> The overloading issue is interesting, but may I suggest that char and wchar at least be renamed to something more appropriate? Maybe utf8byte and utf16byte? I feel it's important to point out that they aren't characters.

I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.


> And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.

I think the library functions should be improved to handle Unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.

> I'd love to help out and do these things. But two things are needed first:
>     - At least one other person needs to volunteer.
>       I've had bad experiences when one person does this by himself.

You're not by yourself. There's a whole D community here!

>     - The core concepts need to be decided upon. Things seem to be
>       somewhat in flux right now, with three different string types
>       and all. At the very least it needs to be decided what a "string"
>       really is: is it a UTF-8 byte sequence or a UTF-32 character
>       sequence? I haven't hidden the fact that I would prefer the latter.

A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.

> > That's correct as well. The library's support for Unicode is inadequate. But there also is a nice package (std.utf) which will convert between char[], wchar[], and dchar[]. This can be used to convert the text strings into whatever Unicode stream type the underlying operating system API supports. (For Win32 this would be UTF-16; I am unsure what Linux supports.)
>
> Yes. But this would then assume that char[] is always in native encoding,
> which doesn't square very well with the assertion that char[] is a UTF-8
> byte sequence.
> Or, the specification could be read as saying that the stream actually
> performs native decoding to UTF-8 when reading into a char[] array.

char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.

> Unless fundamental encoding/decoding is embedded in the streams library,
> it would be best to simply read text data into a byte array and then
> perform native decoding manually afterwards using functions similar
> to the C mbstowcs() and wcstombs(). The drawback to this is that you
> cannot read text data in platform encoding without copying through
> a separate buffer, even in cases when this is not needed.

If you're talking about Win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, putting them into a char[] will require running them through a conversion.

> > D is headed that way. The current version of the library I'm working on converts the char[] strings in the file name APIs to UTF-16 via std.utf.toUTF16z(), for use in calling the Win32 APIs.
> This can be done in a much better, platform-independent way, by using the native<->Unicode conversion routines.

The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.

> In C, as already mentioned,
> these are called mbstowcs() and wcstombs(). For Windows, these would
> convert to and from UTF-16. For Unix, these would convert to and from
> whatever encoding the application is running under (dictated by the
> LC_CTYPE environment variable). There really is no need to make the
> APIs platform-dependent in any way here.

After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on Linux vs 16 bits on Win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)

UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ASCII code, you can write portable code that is fully capable of handling the complete Unicode character set.

> In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.

The libraries do that now, except that they take a char[].

> This
> goes for all the other methods and functions that accept string
> parameters. This of course still depends on what a "string" really is;
> that needs to be decided, and I think you are the only one who
> can make that call. Although more discussion on the subject might be
> needed first?

It's been debated here before <g>.

December 16, 2003

Re: Unicode discussion

Posted by Lewis
in reply to Walter

Walter wrote:
> [ full quote of Walter's previous message snipped ]

Here's a page I found with some C++ code that may help in creating decoders
etc.:

http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html

For Windows coding, isn't it easy enough to use the COM APIs to manipulate and
create Unicode strings (for UTF-16)?

December 16, 2003

Re: Unicode discussion

Posted by Sean L. Palmer
in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:brll85$1oko$1@digitaldaemon.com...
>
> "Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047@spam.spam...
> > Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.
>
> In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.

The problem is that [] would be a horribly inefficient way to index UTF-8 characters.  foreach would be ok.
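
A small sketch of why that is, assuming present-day D, where foreach over a char[] with a dchar loop variable decodes on the fly. The function name nthCodePoint is made up for this example.

```d
/// Returns the n-th code point of a UTF-8 string, counting from zero.
/// The cost is O(n): every earlier character has to be decoded first,
/// because its width in bytes is not known in advance.
dchar nthCodePoint(const(char)[] s, size_t n)
{
    size_t seen = 0;
    foreach (dchar c; s)        // foreach decodes UTF-8 as it goes
    {
        if (seen == n)
            return c;
        ++seen;
    }
    throw new Exception("code point index out of range");
}

unittest
{
    // 'a', U+00E9 (two bytes), 'z'
    assert(nthCodePoint("a\u00E9z", 1) == '\u00E9');
    assert(nthCodePoint("a\u00E9z", 2) == 'z');
}
```

A wrapper that used this for [] would turn a loop over all indices into quadratic work, whereas a single foreach pass stays linear.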

Sean

December 16, 2003

Re: Unicode discussion

Posted by Elias Martenson
in reply to Walter

Walter wrote:

> "Elias Martenson" <no@spam.spam> wrote in message
> news:pan.2003.12.15.23.07.24.569047@spam.spam...
> 
>>Actually, byte or ubyte doesn't really matter. One is not supposed to
>>look at the individual elements in a UTF-8 or a UTF-16 string anyway.
> 
> In a higher level language, yes. But in doing systems work, one always seems
> to be looking at the lower level elements anyway. I wrestled with this for a
> while, and eventually decided that char[], wchar[], and dchar[] would be low
> level representations. One could design a wrapper class for them that
> overloads [] to provide automatic decoding if desired.

All right. I can accept this, of course. The problem I still have with this is the syntax though. We have to remember here that most English-only speakers have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual characters.

> I see your point, but I just can't see making utf8byte into a keyword <g>.
> The world has already gotten used to multibyte 'char' in C and the funky
> 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much
> of an issue here.

Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possible to get the multiple encoding support to work in D, but it needs wrappers. More on that later.

>>And here is also the core of the problem: having an array of "char"
>>implies to the unwary programmer that the elements in the sequence
>>are in fact "characters", and that you should be allowed to do stuff
>>like isspace() on them. The fact that the libraries provide such
>>functions doesn't help either.
> 
> I think the library functions should be improved to handle Unicode chars.
> But I'm not much of an expert on how to do it right, so it is the way it is
> for the moment.

As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
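
For illustration, a minimal sketch of what a dchar-based classification call looks like. It assumes the std.uni module of present-day Phobos, which is not the library discussed in this thread; countLetters is a hypothetical helper.

```d
import std.uni : isAlpha;

/// Counts the alphabetic characters in a UTF-8 string. The string is
/// decoded to dchar first, so isAlpha sees whole code points and never
/// a stray byte from the middle of a multibyte sequence.
size_t countLetters(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
    {
        if (isAlpha(c))
            ++n;
    }
    return n;
}

unittest
{
    // U+00E5 (Latin) and U+03B2 (Greek) are letters; the space and
    // the digit are not.
    assert(countLetters("\u00E5 \u03B2 1") == 2);
}
```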

>>I'd love to help out and do these things. But two things are needed first:
>>    - At least one other person needs to volunteer.
>>      I've had bad experiences when one person does this by himself.
> 
> You're not by yourself. There's a whole D community here!

Indeed, but no one else has volunteered yet. :-)

>>    - The core concepts need to be decided upon. Things seem to be
>>      somewhat in flux right now, with three different string types
>>      and all. At the very least it needs to be decided what a "string"
>>      really is: is it a UTF-8 byte sequence or a UTF-32 character
>>      sequence? I haven't hidden the fact that I would prefer the latter.
> 
> 
> A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8,
> UTF-16, or UTF-32 representations.

OK, if that is your decision then you will not see me argue against it. :-)

However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options:

    void log_to_file(char[] str);
    void log_to_file(wchar[] str);
    void log_to_file(dchar[] str);

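Which one of these should I use? Should I use all of them? Today, people seem to use the first option, but UTF-8 is horribly inefficient performance-wise whenever you need the n'th character, since it can only be found by scanning from the start of the string.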
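
One way to avoid writing the function three times, sketched here under the assumption of a present-day D compiler and Phobos (std.traits.isSomeString and std.utf.toUTF8), is a single template that accepts any of the three string types and converts once at the boundary. The name toLogLine is hypothetical. It only builds the line that a log_to_file() would then write.

```d
import std.traits : isSomeString;
import std.utf : toUTF8;

/// Hypothetical helper: turns a char[], wchar[] or dchar[] message
/// into one UTF-8 encoded log line.
string toLogLine(S)(S msg) if (isSomeString!S)
{
    // toUTF8 transcodes UTF-16 and UTF-32 input to UTF-8.
    return toUTF8(msg) ~ "\n";
}

unittest
{
    assert(toLogLine("abc"c) == "abc\n");
    assert(toLogLine("\u00E5"w) == "\u00E5\n");
    assert(toLogLine("\U0001D11E"d) == "\U0001D11E\n");
}
```

This moves the choice of encoding out of the function signature, at the cost of one conversion per call for wchar[] and dchar[] arguments.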

Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).
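
For reference, this is roughly what the decode() route looks like today. It is only a sketch: it assumes the std.utf.decode of a present-day Phobos, which takes the string and a byte index by reference and advances that index past the decoded character. The helper name nthChar is made up.

```d
import std.exception : enforce;
import std.utf : decode;

/// Hypothetical helper: returns the n'th code point (0-based) of a UTF-8 string.
dchar nthChar(const(char)[] s, size_t n)
{
    size_t i = 0;   // position in bytes, advanced by decode()
    dchar c;
    foreach (_; 0 .. n + 1)
    {
        enforce(i < s.length, "character index out of range");
        c = decode(s, i);   // decodes one code point starting at byte i
    }
    return c;
}

unittest
{
    auto name = "M\u00E5rtenson";   // U+00E5 takes two bytes in UTF-8
    assert(name[2] == 0xA5);        // plain indexing lands inside that sequence
    assert(nthChar(name, 1) == '\u00E5');
    assert(nthChar(name, 2) == 'r');
}
```

The loop is the whole point: every lookup walks the string from the beginning, which is exactly the work that str[n] silently skips.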

Obviously the three different string types need to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings.

Would it be possible to use something like this?

    dchar get_first_char(string str)
    {
        return str[0];
    }

    string str1 = (dchar[])"A UTF-32 string";
    string str2 = (char[])"A UTF-8 string";

    // call the function to demonstrate that the "string"
    // type can be used in declarations
    dchar x = get_first_char(str1);
    dchar y = get_first_char(str2);

I.e. the "string" data type would be a wrapper or supertype for the three different string types.
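
A library-level version of this idea could look like the sketch below. It is an illustration only, written for a present-day D compiler and Phobos. The wrapper is called Text here because current D already uses the name "string" for immutable(char)[], and every name in it (Text, Kind, nth, getFirstChar) is hypothetical.

```d
import std.exception : enforce;
import std.utf : decode;

/// Hypothetical wrapper around the three string representations.
struct Text
{
    private enum Kind { utf8, utf16, utf32 }
    private Kind kind;
    private const(char)[]  s8;
    private const(wchar)[] s16;
    private const(dchar)[] s32;

    this(const(char)[] s)  { kind = Kind.utf8;  s8 = s; }
    this(const(wchar)[] s) { kind = Kind.utf16; s16 = s; }
    this(const(dchar)[] s) { kind = Kind.utf32; s32 = s; }

    /// Returns the n'th code point, whatever the stored encoding is.
    dchar opIndex(size_t n)
    {
        final switch (kind)
        {
            case Kind.utf8:  return nth(s8, n);
            case Kind.utf16: return nth(s16, n);
            case Kind.utf32: return nth(s32, n);
        }
    }

    // Decodes from the start of the data, so indexing costs O(n).
    private static dchar nth(C)(const(C)[] s, size_t n)
    {
        size_t i = 0;
        dchar c;
        foreach (_; 0 .. n + 1)
        {
            enforce(i < s.length, "character index out of range");
            c = decode(s, i);
        }
        return c;
    }
}

dchar getFirstChar(Text t) { return t[0]; }

unittest
{
    assert(getFirstChar(Text("\u00E5bc"c)) == '\u00E5');
    assert(Text("a\U0001D11Eb"w)[1] == 0x1D11E);   // surrogate pair decoded
    assert(Text("a\U0001D11Eb"w)[2] == 'b');
    assert(Text("xyz"d)[2] == 'z');
}
```

A real design would also need length, slicing and iteration, and would probably cache positions so that repeated indexing does not rescan the string each time.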

> char[] strings are UTF-8, and as such I don't know what you mean by 'native
> decoding'. There is only one possible conversion of UTF-8 to UTF-16.

The native encoding is what the operating system uses. On Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.

On Unix the platform-specific encoding is determined by the locale, usually set through the environment variables LC_ALL, LC_CTYPE or LANG, although the trend is towards UTF-8 for all locales. We're not quite there yet, though. Check out http://www.utf-8.org/ for some information about this.

> If you're talking about win32 code pages, I'm going to draw a line in the
> sand and assert that D char[] strings are NOT locale or code page dependent.
> They are UTF-8 strings. If you are reading code page or locale dependent
> strings, to put them into a char[] will require running it through a
> conversion.

Right. So what you are saying is basically that there is a difference between reading into a ubyte[] and reading into a char[], in that native decoding is performed in the latter case but not the former? (In other words, when reading into a char[] the data is passed through something like mbstowcs() internally?)

> The UTF-8 to UTF-16 conversion is defined and platform independent. The D
> runtime library includes routines to convert back and forth between them.
> They could probably be optimized better, but that's another issue. I feel
> that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale
> dependent character sets are pushed off to the side as merely an input or
> output translation nuisance. The core routines all expect UTF strings, and
> so are platform and language independent. I personally think the future is
> UTF, and locale dependent encodings will fall by the wayside.

Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. This layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory-mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
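
As an illustration of such a byte-array function, here is a minimal sketch for one fixed source encoding, ISO-8859-1, whose byte values map one-to-one onto the first 256 Unicode code points. It assumes the std.utf.encode of a present-day Phobos, which appends the UTF-8 form of a dchar to a char[]. The name latin1ToUtf8 is made up, and a real mbstowcs()-style function would have to look up the current locale instead of hard-coding the encoding.

```d
import std.utf : encode;

/// Hypothetical boundary function: ISO-8859-1 bytes in, UTF-8 out.
char[] latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        // A Latin-1 byte value is also its Unicode code point.
        encode(result, cast(dchar) b);
    }
    return result;
}

unittest
{
    immutable ubyte[] raw = [0x4D, 0xE5, 0x72];   // "Mår" in ISO-8859-1
    assert(latin1ToUtf8(raw) == "M\u00E5r");
    assert(latin1ToUtf8(raw).length == 4);        // U+00E5 became two bytes
}
```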

> After wrestling with this issue for some time, I finally realized that
> supporting locale dependent character sets in the core of the language and
> runtime library is a bad idea. The core will support UTF, and locale
> dependent representations will only be supported by translating to/from UTF.
> This should wind up making D a far more portable language for
> internationalization than C/C++ are (ever wrestle with tchar.h? How about
> wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having
> #ifdef _UNICODE all over the place? I've done that too much already. No
> thanks!)

Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". Most Unix systems go a step further and make wchar_t a 32-bit type that holds a full Unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.

However, Windows defined wchar_t to be a 16-bit Unicode character back in the days when Unicode fit inside 16 bits. This is the same mistake Java made, and both have now ended up with UTF-16 strings internally.

So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty-gritty details of the various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
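
For anyone who has not had to do it, this is what dealing with a surrogate pair manually amounts to. The arithmetic is the one UTF-16 defines. The function name fromSurrogates is made up, and the sketch assumes a present-day D compiler.

```d
/// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
dchar fromSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "not a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "not a low surrogate");
    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

unittest
{
    wstring clef = "\U0001D11E"w;   // MUSICAL SYMBOL G CLEF, outside the 16-bit range
    assert(clef.length == 2);       // stored as two wchar elements
    assert(clef[0] == 0xD834 && clef[1] == 0xDD1E);
    assert(fromSurrogates(clef[0], clef[1]) == 0x1D11E);
}
```

Code that indexes a wchar[] directly sees the two halves as separate elements, which is the same trap as indexing a char[] in the middle of a multibyte sequence.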

> UTF-8 is really quite brilliant. With just some minor extra care over
> writing ordinary ascii code, you can write portable code that is fully
> capable of handling the complete unicode character set.

Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encodings, and if this is to remain, it needs to be abstracted away somehow.

>>In general, you should be able to open a file, by specifying the file
>>name as a dchar[], and then the libraries should handle the rest.
> 
> It does that now, except they take a char[].

Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

Regards

Elias Mårtenson

December 16, 2003

Re: Unicode discussion

Posted by Elias Martenson
in reply to Lewis


Lewis wrote:

> Here's a page I found with some C++ code that may help in creating decoders
> etc...
> 
> http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html
> 
> For Windows coding, isn't it easy enough to use the COM APIs to manipulate and
> create Unicode strings (for UTF-16)?

IBM has a set of Unicode tools. Last time I googled for them I found them right away, but now I can't. I'll keep looking and post again when I find the link.

Regards

Elias Mårtenson

December 16, 2003

Re: Unicode discussion

Posted by uwem
in reply to Elias Martenson


You mean icu?!

http://oss.software.ibm.com/icu/

Bye
uwe

In article <brmlf3$83b$1@digitaldaemon.com>, Elias Martenson says...
>
>Lewis wrote:
>
>> Here's a page I found with some C++ code that may help in creating decoders etc...
>> 
>> http://www.elcel.com/docs/opentop/API/ot/io/InputStreamReader.html
>> 
>> For Windows coding, isn't it easy enough to use the COM APIs to manipulate and create Unicode strings (for UTF-16)?
>
>IBM has a set of Unicode tools. Last time I googled for them I found them right away, but now I can't. I'll keep looking and post again when I find the link.
>
>Regards
>
>Elias Mårtenson

December 16, 2003

Re: Unicode discussion

Posted by Ben Hinkle
in reply to Elias Martenson


I think Walter once said char had been called 'ascii'. That doesn't sound all that bad to me. Perhaps we should have the primitive types 'ascii', 'utf8', 'utf16' and 'utf32' and remove char, wchar and dchar. Insane, I know, but at least then you will never mistake an ascii[] for a utf32[] (or a utf8[], for that matter).

-Ben
