August 01, 2006
"Derek Parnell" <derek@nomail.afraid.org> wrote in message news:8n0koj5wjiio.qwc8ok4mrvr3$.dlg@40tude.net...
> On Mon, 31 Jul 2006 18:23:19 -0700, Andrew Fedoniouk wrote:
>
>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eam1ec$10e1$1@digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> The problem as I can see it is this:
>>>> D proposes to use a transport encoding for manipulation purposes,
>>>> which is the main problem imo here - transport encodings are not
>>>> designed for manipulation - it is extremely difficult to use
>>>> them for manipulation in practice, as we may see.
>>>
>>> I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
>>
>> Sorry, but strings in DMDScript are quite different in these terms:
>> 0) there is no such thing as char in JavaScript.
>> 1) strings are Strings - not vectors of octets - js::string[] and
>> d::char[] are different things.
>> 2) they are not supposed to be used by any OS API.
>> 3) there are 12 or so methods of the String class in JS - a limited perimeter -
>> so what model you've chosen to store them in is irrelevant -
>> in some implementations they are even represented by a list of fixed runs.
>
> For what it's worth, to do *character* manipulation I convert strings to UTF-32, do my stuff, and convert back to the initial format.
>
> char[] somefunc(char[] x)
> {
>   return std.utf.toUTF8( somefunc( std.utf.toUTF32(x) ) );
> }
>
> wchar[] somefunc(wchar[] x)
> {
>   return std.utf.toUTF16( somefunc( std.utf.toUTF32(x) ) );
> }
>
> dchar[] somefunc(dchar[] x)
> {
>   dchar[] result;
>   ...
>   return result;
> }
>
> This seems to work fast enough for my purposes. DBuild (nee Build) uses
> this a lot.
>
> -- 

Derek, using dchar (the ultimate char) is perfectly fine in DBuild's(*)
circumstances - you are parsing, not dealing with the OS on each line.

Using dchar has a drawback - you need to recreate all string primitive ops from scratch, including RegExp, etc.

Again, dchar is ok - the only thing not ok is the strange selection of dchar's null/nothing/nihil/nil/whatever value.
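
(For illustration - a minimal sketch, not DBuild code: for read-only
character scans, foreach with a dchar loop variable decodes the UTF-8 in
place, so neither a dchar[] round-trip nor hand-rolled primitives are
needed. countWide is a made-up name.)

    // Count code points above U+00FF in a UTF-8 string.
    // foreach with a dchar loop variable decodes each
    // multibyte sequence into a full code point on the fly.
    int countWide(char[] s)
    {
        int n;
        foreach (dchar c; s)   // implicit UTF-8 -> UTF-32 decode
        {
            if (c > 0xFF)
                n++;
        }
        return n;
    }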

(* dbuild does not sound good in Russian - it is very close to "idiot" in the
medical sense.
Consider builDer/buildDer/creaDor for example - with a red D in the middle -
stylish at least)

Andrew.




August 01, 2006
Andrew Fedoniouk wrote:
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eam1ec$10e1$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> The problem as I can see it is this:
>>> D proposes to use a transport encoding for manipulation purposes,
>>> which is the main problem imo here - transport encodings are not
>>> designed for manipulation - it is extremely difficult to use
>>> them for manipulation in practice, as we may see.
>> I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
> 
> Sorry, but strings in DMDScript are quite different in these terms:
> 0) there is no such thing as char in JavaScript.

ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.

> 1) strings are Strings - not vectors of octets - js::string[] and d::char[] are different things.
> 2) they are not supposed to be used by any OS API.
> 3) there are 12 or so methods of the String class in JS - a limited perimeter -
> so what model you've chosen to store them in is irrelevant -
> in some implementations they are even represented by a list of fixed runs.

I agree that how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as UTF-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.
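
(A sketch of that "minor extra effort", assuming the string is stored as a
char[]: std.utf.decode walks one code point at a time, which is enough to
build a JS-style charAt over UTF-8 storage. charAt here is a hypothetical
helper, not DMDScript's actual code.)

    import std.utf;

    // Return the n-th code point of a UTF-8 string as its own
    // one-character string, JavaScript-style; empty if out of range.
    char[] charAt(char[] s, size_t n)
    {
        size_t i = 0;
        while (i < s.length)
        {
            size_t start = i;
            std.utf.decode(s, i);    // advances i past one code point
            if (n-- == 0)
                return s[start .. i];
        }
        return "";
    }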


>>> - ***A  functions in Windows take byte string (LPSTR) and current
>>>   codepage id  to render text. ( byte + codepage = Unicode Code Point )
>> Win9x only supports the A functions,
> 
> You are not right here.
>
> TextOutA and TextOutW are both supported by Win98.
> And the intention in Harmonia was to use only those ***W
> functions that come out of the box on Win98 (without needing MSLU).

You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x.

Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
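
(Schematically, the dispatch looks like this - a hedged sketch modeled on
what std.file does, not its actual code; the GetVersion test and the
location of toMBSz are assumptions standing in for the library's real
checks.)

    import std.c.windows.windows;
    import std.file;   // assumed home of toMBSz
    import std.utf;

    // Open a file for writing, dispatching on the OS family:
    // the NT line gets CreateFileW with UTF-16, the Win9x line
    // gets CreateFileA with text converted to the current code page.
    HANDLE openForWrite(char[] name)
    {
        bool isNT = (GetVersion() & 0x80000000) == 0;
        if (isNT)
            return CreateFileW(std.utf.toUTF16z(name), GENERIC_WRITE, 0, null,
                CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, cast(HANDLE)null);
        else
            return CreateFileA(toMBSz(name), GENERIC_WRITE, 0, null,
                CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, cast(HANDLE)null);
    }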


>> Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.
> 
> There is a huge market of embedded devices.
> If you think that computer evolution expands only in the more-RAM-and-speed
> direction, then you are in trouble.
> 
> http://www.litepc.com/graphics/eossystem.jpg

I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.


>> When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.
> http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx

That is consistent with what I wrote about it.


>>> - ***W functions in Windows use LPWSTR things which are
>>>   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
>>>   (  cast(dword) word  = Unicode Code Point )
>>>   Only few functions in Windows API treat LPWSTR as UTF-16.
>> BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
> 
> Sorry, this scares me: "BMP is a proper subset of UTF-16".
> UTF-16 is a group name for *byte stream encodings*
> (UTF-16LE and UTF-16BE) of the Unicode code set.
>
> BTW: which one of these UTFs does D use? Platform dependent, I believe.

D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>.

As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.
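
(If you want to see which byte order you actually got, a trivial check -
purely illustrative:)

    import std.stdio;

    void main()
    {
        wchar w = 0x00FF;
        ubyte* p = cast(ubyte*)&w;
        // Little-endian platforms store the low byte first.
        writefln(p[0] == 0xFF ? "UTF-16LE in memory" : "UTF-16BE in memory");
    }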

>> So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine.
>>
>> The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
> 
> It should work well. Efficient, I mean.

Yes.

> The language shall be agnostic to the meaning of char as much as possible.

That's C/C++'s approach, and it does not work very well. Check out tchar.h - it's a lovely disaster <g>. For another example, just try using std::string with Shift-JIS.

> It shall not prevent you from writing effective algorithms.

Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize the C++ code for DMDScript are what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).


> Practically it is enough to have 16 bits (the BMP), but...

I agree you can write code using just the BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and be reworked to deal with surrogate pairs.
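
(To make the failure mode concrete - a minimal sketch using std.utf; the
asserted values follow directly from the UTF-16 encoding rules.)

    import std.utf;

    void main()
    {
        dchar[] s = "\U0001D11E"d;   // musical symbol G clef, outside the BMP
        wchar[] w = std.utf.toUTF16(s);
        // One character becomes two UTF-16 code units: a surrogate pair.
        assert(w.length == 2);
        assert(w[0] == 0xD834 && w[1] == 0xDD1E);
        // Code that treats w as UCS-2 counts "two characters" here - the bug.
    }
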
August 01, 2006
On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news@terrainformatica.com> wrote:

>
> (* dbuild does not sound good in Russian - it is very close to "idiot" in the
> medical sense.
> Consider builDer/buildDer/creaDor for example - with a red D in the middle -
> stylish at least)
>
> Andrew.
>
>

Really, Andrew, you are getting carried away with your demands.  You almost sound self-centered :).  dbuild is not made for Russians only.  Almost any English word conceived for a name might just have some sort of bad connotation in one of the thousands of languages in this world.  Why should anyone feel obligated to accommodate your culture here?  I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here.

That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

-JJR
August 01, 2006
"John Reimer" <terminal.node@gmail.com> wrote in message news:op.tdlccd0b6gr7xp@epsilon-alpha...
> On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news@terrainformatica.com> wrote:
>
>>
>> (* dbuild does not sound good in Russian - it is very close to "idiot" in the
>> medical sense.
>> Consider builDer/buildDer/creaDor for example - with a red D in the middle -
>> stylish at least)
>>
>> Andrew.
>>
>>
>
> Really, Andrew, you are getting carried away with your demands.  You almost sound self-centered :).  dbuild is not made for Russians only.  Almost any English word conceived for a name might just have some sort of bad connotation in one of the thousands of languages in this world.  Why should anyone feel obligated to accommodate your culture here?  I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here.
>
> That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D

:D

BTW: debilita [lat.], as a word with many variations, is used in almost all languages directly derived from Latin.

You can say d'buil' on the streets of, say, Munich and they will understand you. Trust me, free beer will be yours.

So it is far from Russian-centric :-P

Andrew.






August 01, 2006
"Walter Bright" <newshound@digitalmars.com> wrote in message news:eamql8$1jgc$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eam1ec$10e1$1@digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> The problem as I can see it is this:
>>>> D proposes to use a transport encoding for manipulation purposes,
>>>> which is the main problem imo here - transport encodings are not
>>>> designed for manipulation - it is extremely difficult to use
>>>> them for manipulation in practice, as we may see.
>>> I disagree with the characterization that it is "extremely difficult" to use for manipulation. foreach's direct support for it, as well as the functions in std.utf, make it straightforward. DMDScript is built around UTF-8, and manipulating multibyte characters in it has not turned out to be a significant problem.
>>
>> Sorry, but strings in DMDScript are quite different in these terms: 0) there is no such thing as char in JavaScript.
>
> ECMAScript 262-3 (Javascript) defines the source character set to be UTF-16, and the source character set is what JS programs manipulate for strings and characters.

Walter, please, forget about such a thing as "the character set is UTF-16" - it is nonsense.

Regarding ECMA-262:

"A conforming implementation of this International standard shall interpret
characters in conformance with the Unicode Standard, Version 2.1 or later,
and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding
form..."

That is quite different from your interpretation. The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16. There is no mention that String[n] will return you a UTF-16 code unit. That would be weird.


>
>> 1) strings are Strings - not vectors of octets - js::string[] and
>> d::char[] are different things.
>> 2) they are not supposed to be used by any OS API.
>> 3) there are 12 or so methods of the String class in JS - a limited perimeter -
>> so what model you've chosen to store them in is irrelevant -
>> in some implementations they are even represented by a list of fixed runs.
>
> I agree that how it's stored in the JS implementation is irrelevant. My point was that in DMDScript they are stored as UTF-8 strings, and they work with only minor extra effort - DMDScript implements all the string handling functions JS defines.

Again, it is up to you how they are stored internally and what you did there.

In D the situation is completely different - there are char and char[], open to all winds.


>
>
>>>> - ***A  functions in Windows take byte string (LPSTR) and current
>>>>   codepage id  to render text. ( byte + codepage = Unicode Code Point )
>>> Win9x only supports the A functions,
>>
>> You are not right here.
>>
>> TextOutA and TextOutW are both supported by Win98.
>> And the intention in Harmonia was to use only those ***W
>> functions that come out of the box on Win98 (without needing MSLU).
>
> You're right in that Win98 exports a small handful of W functions without MSLU - but what those W functions actually do under the hood is translate the data based on the current code page and then call the corresponding A function. In other words, the Win9x W functions are rather pointless and don't support characters that are not in the current code page anyway. MSLU extends the same poor behavior to a bunch more pseudo W functions. This is why Phobos does not call W functions under Win9x.

I wouldn't be so pessimistic about Win98 :)


>
> Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
>

Ok. And how do you call the A functions?

Do you use the proposed koi8chars, latin1chars, etc.?

You are using char for that. But wait, char cannot contain anything other than UTF-8 :-P


>
>>> Win9x is obsolete anyway, and there's no reason to cripple a new language by accommodating the failures of an obsolete system.
>>
>> There is a huge market of embedded devices.
>> If you think that computer evolution expands only in the more-RAM-and-speed
>> direction, then you are in trouble.
>>
>> http://www.litepc.com/graphics/eossystem.jpg
>
> I agree there's a huge ecosystem of 32 bit embedded processors. And D works fine with Win9x - it just isn't crippled by Win9x's shortcomings.
>
>
>>> When running on NT or later Windows, the W functions are used instead which work directly with UTF-16. Later Windows also support UTF-8 with the A functions.
>> http://blogs.msdn.com/michkap/archive/2005/05/11/416552.aspx
>
> That is consistent with what I wrote about it.
>

No doubts about it.
>
>>>> - ***W functions in Windows use LPWSTR things which are
>>>>   sequence of codepoints from Unicode Basic Multilingual Plane (BMP).
>>>>   (  cast(dword) word  = Unicode Code Point )
>>>>   Only few functions in Windows API treat LPWSTR as UTF-16.
>>> BMP is a proper subset of UTF-16. The only difference is that BMP doesn't do the 2-word surrogate pair encodings. But those are reserved in BMP anyway, so there is no conflict. Windows has been upgraded to handle them. Early versions of NT that couldn't handle surrogate pairs didn't work with those code points anyway, so nothing is gained by going to code pages.
>>
>> Sorry, this scares me: "BMP is a proper subset of UTF-16".
>> UTF-16 is a group name for *byte stream encodings*
>> (UTF-16LE and UTF-16BE) of the Unicode code set.
>>
>> BTW: which one of these UTFs does D use? Platform dependent, I believe.
>
> D has been used for many years with foreign languages under Windows. If UTF-16 didn't work with Windows, I think it would have come up by now <g>.
>
> As for whether it is LE or BE, it is whatever the local platform is, just like ints, shorts, longs, etc. are.

>
>>> So, the W functions can and do take UTF-16 directly, and in fact the Phobos implementation does use the W functions, transmitting wchar[] to them, and it works fine.
>>>
>>> The neat thing about Phobos is it adapts to whether you are using Win9x, full 32 bit Windows, or Linux, and adjusts the char output accordingly so it "just works."
>>
>> It should work well. Efficient, I mean.
>
> Yes.
>
>> The language shall be agnostic to the meaning of char as much as possible.
>
> That's C/C++'s approach, and it does not work very well. Check out tchar.h - it's a lovely disaster <g>. For another example, just try using std::string with Shift-JIS.
>
>> It shall not prevent you from writing effective algorithms.
>
> Does UTF-8 prevent writing effective algorithms? I don't see how. DMDScript works, and is faster than any other JS implementation out there, including my own C++ version <g>. And frankly, my struggles with trying to internationalize the C++ code for DMDScript are what led to D's support for UTF. The D implementation is shorter, simpler, and faster than the C++ one (which uses wchar's).
>
>
>> Practically it is enough to have 16 bits (the BMP), but...
>
> I agree you can write code using just the BMP and ignore surrogate pairs today, and you'll probably never notice the bugs. But sooner or later, the surrogate pair problem is going to show up. Windows, Java, and Javascript have all had to go back and be reworked to deal with surrogate pairs.

Why? JavaScript, for example, has no such thing as char.

String.charAt() returns guess what? Correct - a String object.

No char - no problem :D

Why do they need to redefine anything then?

Again - let people decide what char is and how to interpret it, and that will be it.

Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied).  Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. That char#[] (read-only arrays) would also be of benefit here. oh.....

Changing char's init value to 0 will not harm anybody, but it will allow char to be used for purposes other than UTF-8 - which is only one of the 40 or so encodings in active use anyway.

For persistence purposes (in a compiled EXE), UTF is probably the best choice. But at runtime - please, not at the language level.

Educated IMO, of course.

Andrew.


August 01, 2006
Andrew Fedoniouk wrote:
> "John Reimer" <terminal.node@gmail.com> wrote in message news:op.tdlccd0b6gr7xp@epsilon-alpha...
> 
>>On Mon, 31 Jul 2006 20:46:53 -0700, Andrew Fedoniouk <news@terrainformatica.com> wrote:
>>
>>
>>>(* dbuild does not sound good in Russian - it is very close to "idiot" in the
>>>medical sense.
>>>Consider builDer/buildDer/creaDor for example - with a red D in the middle -
>>>stylish at least)
>>>
>>>Andrew.
>>>
>>>
>>
>>Really, Andrew, you are getting carried away with your demands.  You almost sound self-centered :).  dbuild is not made for Russians only.  Almost any English word conceived for a name might just have some sort of bad connotation in one of the thousands of languages in this world.  Why should anyone feel obligated to accommodate your culture here?  I know Russians tend to be quite proud of their heritage, which is fine... but really, you are being quite silly to make these demands here.
>>
>>That aside... my personal, self-centered feeling is that the name "bud" is quite adequate. :D
> 
> 
> :D
> 
> BTW: debilita [lat.], as a word with many variations, is used in
> almost all languages directly derived from Latin.
> 
> You can say d'buil' on the streets of, say, Munich and they will understand you.
> Trust me, free beer will be yours.
> 
> So it is far from Russian-centric :-P
> 
> Andrew.

As with "débile" in French, pronounced something like "day bill". One has to correctly pronounce the ending D of dbuild to disambiguate it, but since we generally know what we're speaking about in an IT-related discussion, it should be OK - or even funny, if we ambiguously pronounce it in the presence of humorous enough people.
August 01, 2006
Andrew Fedoniouk wrote:
> The compiler accepts the input stream as either BMP codes or the full Unicode set encoded using UTF-16.

BMP is a subset of UTF-16.

> There is no mention that String[n] will return you a UTF-16 code
> unit. That would be weird.

String.charCodeAt() will give you the UTF-16 code unit.

>> Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
> Ok. And how do you call the A functions?

Take a look at std.file for an example.


>> Windows, Java, and Javascript have all had to go back and be reworked to deal with surrogate pairs.
> Why? JavaScript, for example, has no such thing as char.
> String.charAt() returns guess what? Correct - a String object.
> No char - no problem :D

See String.fromCharCode() and String.charCodeAt()

> Again - let people decide what char is and how to interpret it, and that will be it.

I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to C++.

> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied).

C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).

> Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. That char#[] (read-only arrays) would also be of benefit here. oh.....
> 
> Changing char's init value to 0 will not harm anybody, but it will allow char to be used for purposes other than UTF-8 - which is only one of the 40 or so encodings in active use anyway.
> 
> For persistence purposes (in a compiled EXE), UTF is probably the best choice. But at runtime - please, not at the language level.

ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
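
(In practice that division of labor would look like this - a hedged sketch;
koi8ToUTF8 is hypothetical, standing in for whatever converter you use,
e.g. one built on MultiByteToWideChar.)

    import std.file;

    // Hypothetical converter; not a Phobos function.
    char[] koi8ToUTF8(ubyte[] raw);

    void readKoi8()
    {
        // Raw KOI8-R bytes are not UTF-8, so they belong in ubyte[] ...
        ubyte[] raw = cast(ubyte[])std.file.read("example.koi8");
        // ... and only become char[] after an explicit conversion.
        char[] text = koi8ToUTF8(raw);
    }
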
August 02, 2006
(I hope this long dialog will help all of us better understand what Unicode is.)

"Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> The compiler accepts the input stream as either BMP codes or the full
>> Unicode set encoded using UTF-16.
>
> BMP is a subset of UTF-16.

Walter, with deepest respect, it is not. They are two different things.

UTF-16 is a variable-length encoding - a byte stream. The Unicode BMP, strictly speaking, is a range of numbers.

If you treat a UTF-16 sequence as a sequence of UCS-2 (BMP) codes you
are in trouble. See:

The sequence of two words D834 DD1E as UTF-16 will give you
one Unicode character with code 0x1D11E (musical G clef).
And the same sequence interpreted as a UCS-2 sequence will
give you two (invalid, non-printable, but still) character codes.

You will get a different length for the string, at least.
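
(The arithmetic behind that pair, for reference - a small illustrative
sketch; the formula is the standard UTF-16 surrogate decoding.)

    void main()
    {
        wchar hi = 0xD834, lo = 0xDD1E;
        // (hi - D800h) * 400h + (lo - DC00h) + 10000h
        dchar c = cast(dchar)((hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000);
        assert(c == 0x1D11E);   // musical G clef: one character, two words
    }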

>
> > There is no mention that String[n] will return you a UTF-16 code unit. That would be weird.
>
> String.charCodeAt() will give you the UTF-16 code unit.
>
>>> Conversely, the A functions under NT and later translate the characters to - you guessed it - UTF-16 and then call the corresponding W function. This is why Phobos under NT does not call the A functions.
>> Ok. And how do you call the A functions?
>
> Take a look at std.file for an example.

You mean here?

    char* namez = toMBSz(name);
    h = CreateFileA(namez, GENERIC_WRITE, 0, null, CREATE_ALWAYS,
        FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN, cast(HANDLE)null);

The char* here is far from a UTF-8 sequence.

>
>
>>> Windows, Java, and Javascript have all had to go back and be reworked to deal with surrogate pairs.
>> Why? JavaScript, for example, has no such thing as char. String.charAt() returns guess what? Correct - a String object. No char - no problem :D
>
> See String.fromCharCode() and String.charCodeAt()

ECMA-262:

    String.prototype.charCodeAt(pos)
    Returns a number (a nonnegative integer less than 2^16) representing the
    code point value of the character at position pos in the string....

As you may see, it returns a (Unicode) *code point* from the BMP set, which is far from the UTF-16 code unit you declared above.

Relaxing "a nonnegative integer less than 2^16" to
"a nonnegative integer less than 2^21" would not harm anybody. Or at least
the probability of harm is vanishingly small.

>
>> Again - let people decide of what char is and how to interpret it And that will be it.
>
> I've already explained the problems C/C++ have with that. They're real problems, bad and unfixable enough that there are official proposals to add new UTF basic types to to C++.

Basic types of what?

>
>> Phobos can work with utf-8/16 and satisfy you and other UTF-masochists (no offence implied).
>
> C++'s experience with this demonstrates that char* does not work very well with UTF-8. It's not just my experience, it's why new types were proposed for C++ (and not by me).

Because char in C is not supposed to hold multi-byte encodings.
At least std::string is strictly a single-byte thing by definition. And this
is perfectly fine. There is wchar_t for holding the OS-supported range in full.
On Win32 wchar_t is 16-bit (UCS-2 legacy), and in GCC/*nix it is 32-bit.

>
>> Ordinary people will do their own strings anyway. Just give them opAssign and a dtor in structs and you will see an explosion of perfect strings. That char#[] (read-only arrays) would also be of benefit here. oh.....
>>
>> Changing char's init value to 0 will not harm anybody, but it will allow char to be used for purposes other than UTF-8 - which is only one of the 40 or so encodings in active use anyway.
>>
>> For persistence purposes (in a compiled EXE), UTF is probably the best choice. But at runtime - please, not at the language level.
>
> ubyte[] will enable you to use any encoding you wish - and that's what it's there for.

Thus the whole set of Windows API headers (and std.c.string, for example)
seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
Is this the idea?

Andrew.




August 02, 2006
On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:

> (I hope this long dialog will help all of us better understand what Unicode is.)
> 
> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>> Andrew Fedoniouk wrote:
>>> The compiler accepts the input stream as either BMP codes or the full
>>> Unicode set encoded using UTF-16.
>>
>> BMP is a subset of UTF-16.
> 
> Walter, with deepest respect, it is not. They are two different things.
> 
> UTF-16 is a variable-length encoding - a byte stream. The Unicode BMP, strictly speaking, is a range of numbers.

Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is the subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.

...

>> ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
> 
> Thus the whole set of Windows API headers (and std.c.string, for example)
> seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
> Is this the idea?

Yes. I believe this is how it now should be done. The Phobos library is not correctly using char, char[], and ubyte[] when interfacing with Windows and C functions.

My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...

  char  ==> An unsigned 8-bit byte. An alias for ubyte.
  schar ==> A UTF-8 code unit.
  wchar ==> A UTF-16 code unit.
  dchar ==> A UTF-32 code unit.

  char[] ==> A 'C' string
  schar[] ==> A UTF-8 string
  wchar[] ==> A UTF-16 string
  dchar[] ==> A UTF-32 string

And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].
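
(Under today's compiler the closest approximation is aliases - a sketch of
the spirit of this proposal, not anything Phobos provides; cchar, schar,
cstring, and sstring are invented names.)

    // Approximating the proposed scheme with what D has today.
    alias ubyte cchar;     // a 'C' char: one unsigned byte, any encoding
    alias char  schar;     // a UTF-8 code unit (today's char)
    // wchar and dchar already mean UTF-16 / UTF-32 code units.

    alias cchar[] cstring; // a 'C' string in some code page
    alias schar[] sstring; // a UTF-8 string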



-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
2/08/2006 1:08:51 PM
August 02, 2006
"Derek Parnell" <derek@nomail.afraid.org> wrote in message news:13qrud1m5v15d$.ydqvoi8nx4f8.dlg@40tude.net...
> On Tue, 1 Aug 2006 19:57:08 -0700, Andrew Fedoniouk wrote:
>
>> (I hope this long dialog will help all of us better understand what
>> Unicode is.)
>>
>> "Walter Bright" <newshound@digitalmars.com> wrote in message news:eao5st$2r1f$1@digitaldaemon.com...
>>> Andrew Fedoniouk wrote:
>>>> The compiler accepts the input stream as either BMP codes or the full
>>>> Unicode set encoded using UTF-16.
>>>
>>> BMP is a subset of UTF-16.
>>
>> Walter, with deepest respect, it is not. They are two different things.
>>
>> UTF-16 is a variable-length encoding - a byte stream. The Unicode BMP, strictly speaking, is a range of numbers.
>
> Andrew is correct. In UTF-16, characters are variable length, from 2 to 4 bytes long. In UTF-8, characters are from 1 to 4 bytes long (this used to be up to 6, but that has changed). UCS-2 is the subset of Unicode characters that are all represented by 2-byte integers. Windows NT implemented UCS-2 but not UTF-16, but Windows 2000 and above support UTF-16 now.
>
> ...
>
>>> ubyte[] will enable you to use any encoding you wish - and that's what it's there for.
>>
>> Thus the whole set of Windows API headers (and std.c.string, for example)
>> seen in D has to be rewritten to accept ubyte[], as char in D is not char in C.
>> Is this the idea?
>
> Yes. I believe this is how it now should be done. The Phobos library is not
> correctly using char, char[], and ubyte[] when interfacing with Windows and
> C functions.
>
> My guess is that Walter originally used 'char' to make things easier for C coders to move over to D, but in doing so, now with UTF support built-in, has caused more problems than the idea was supposed to solve. The move to UTF support is good, but the choice of 'char' for the name of a UTF-8 code-unit was, and still is, a big mistake. I would have liked something more like ...
>
>  char  ==> An unsigned 8-bit byte. An alias for ubyte.
>  schar ==> A UTF-8 code unit.
>  wchar ==> A UTF-16 code unit.
>  dchar ==> A UTF-32 code unit.
>
>  char[] ==> A 'C' string
>  schar[] ==> A UTF-8 string
>  wchar[] ==> A UTF-16 string
>  dchar[] ==> A UTF-32 string
>
> And then have built-in conversions between the UTF encodings. So if people want to continue to use code from C/C++ that uses code-pages or similar they can stick with char[].
>
>

Yes, Derek, this would probably be near the ideal.

Andrew.