Thread overview
Unicode Character and String Intrinsics
  Mar 31, 2003  Mark Evans
  Mar 31, 2003  Mark Evans
  Mar 31, 2003  Walter
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Walter
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Sean L. Palmer
  Apr 10, 2003  Ilya Minkov
  Apr 11, 2003  Helmut Leitner
  Apr 01, 2003  Matthew Wilson
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Matthew Wilson
  Apr 01, 2003  Peter Hercek
  Apr 10, 2003  Ilya Minkov
  Apr 01, 2003  Bill Cox
  Apr 01, 2003  Matthew Wilson
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Bill Cox
  Apr 02, 2003  Mark Evans
  DataDraw (was Re: Unicode Character and String Intrinsics)
  Apr 02, 2003  Helmut Leitner
  Apr 02, 2003  Bill Cox
  Apr 02, 2003  Helmut Leitner
  Apr 03, 2003  Bill Cox
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Bill Cox
  Apr 02, 2003  Mark Evans
  Apr 02, 2003  Matthew Wilson
  Apr 03, 2003  Luna Kid
  May 21, 2003  Walter
  May 22, 2003  J. Daniel Smith
  Apr 01, 2003  Matthew Wilson
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Matthew Wilson
  Apr 01, 2003  Sean L. Palmer
  May 21, 2003  Walter
  May 22, 2003  Sean L. Palmer
  Apr 01, 2003  Mark Evans
  Apr 01, 2003  Sean L. Palmer
  Apr 02, 2003  Mark Evans
  May 21, 2003  Walter
  Apr 01, 2003  Sean L. Palmer
  May 21, 2003  Walter
  May 23, 2003  Mark Evans
March 31, 2003
Walter says (in response to my post)...
>> D needs a Unicode string primitive.
>It does already. In D, a char[] is really a utf-8 array.

I'm dubious about this claim.  ANSI C char arrays are UTF-8 too, if the contents are 7-bit ASCII (a subset of UTF-8).  That doesn't mean they support UTF-8.

UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html

UTF-8 has a maximum encoding length of 6 bytes for one character.  If such a character appears at index 100 in char[] myString, what is the return value from myString[100]?  The answer should be "one UTF-8 char with an internal 6-byte representation."  I don't think D does that.
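
A minimal sketch in present-day D of the behavior in question (std.utf.decode and the variable names here are illustrative, not anything the 2003 compiler provided): subscripting a char[] hands back a single UTF-8 code unit, and assembling the whole character takes an explicit decoding step.

    import std.utf : decode;

    void main()
    {
        char[] myString = "né".dup;             // 'é' occupies two bytes in UTF-8
        char unit = myString[1];                // one byte: the first half of 'é'
        size_t i = 1;
        dchar assembled = decode(myString, i);  // the complete code point
    }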

Besides which, my idea was a native string primitive, not a quasi-array.  The
confusion of strings with arrays was a basic, fundamental mistake of C.  While
some string semantics do resemble those of arrays, this resemblance should not
mandate identical data types.  Strings are important enough to merit their own
intrinsic type.  Icon is not the only language to recognize that fact.  D
documents make no mention of any string primitive:
http://www.digitalmars.com/d/type.html
D has two intrinsic character types, a dynamic array type, and _no_ intrinsic
string type.

Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide."  The differing cross-platform widths of the 'wide' char are asking for trouble; poof goes data portability.  D characters are not based on Unicode, but archaic MS Windows API and legacy C terminology spot-welded onto Linux.  How about Unicode as a basis?

The ideal type system would offer as intrinsic/primitive/native language types:
- UTF-8 char
- UTF-16 char
- UTF-32 char
- UTF-8 string
- UTF-16 string
- UTF-32 string
- built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
- built-in conversions to/from UTF strings and C-style byte arrays

The preceding list will not seem very long when you consider how many numeric types D supports.  Strings are as important as numbers.

The old C 'char' type is merely a byte; D already has 'ubyte.'  The distinction between ubyte and char in D escapes me.  Maybe the reasoning is that a char might be 'wide' so D needs a separate type?  But that reason disappears once you have nice UTF characters.  So even if the list is a bit long it also eliminates two redundant types, char and wchar.

I would not be against retention of char and char[] for C compatibility purposes if someone could point out why 'ubyte' and 'char[]' do not suffice.  Otherwise I would just alias 'char' into 'ubyte' and be done with it.  The wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a struct.

To the user, strings would act like dynamic arrays.  Internally they are different animals.  Each 'element' of the 'array' can have varying length per Unicode specifications.  String primitives would hide Unicode complexity under the hood.
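
For illustration only, a hypothetical wrapper along those lines, sketched in present-day D (the UString name and its layout are invented for this example, not proposed syntax):

    import std.utf : decode;

    struct UString
    {
        private char[] data;   // UTF-8 bytes, hidden from the user

        // Index by character position; the variable-width walk stays under the hood.
        dchar opIndex(size_t n)
        {
            size_t i = 0;
            dchar c;
            foreach (_; 0 .. n + 1)
                c = decode(data, i);
            return c;
        }
    }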

That's just the beginning.  Now that you have string intrinsics, you can give them special behaviors pertaining to i/o streams and such.  You can define 'streaming' conversions from other intrinsic types to strings for i/o purposes. And...permit me to dream!...you can define Icon-style string scanning expressions.

Mark


March 31, 2003
>if someone could point out why 'ubyte' and 'char[]' do not suffice.

Typo:  that was "why 'ubyte' and 'ubyte[]' do not suffice."

- Mark


March 31, 2003
"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6abjh$12m8$1@digitaldaemon.com...
> Walter says (in response to my post)...
> >> D needs a Unicode string primitive.
> >It does already. In D, a char[] is really a utf-8 array.
> I'm dubious about this claim.  ANSI C char arrays are UTF-8 too, if the contents are 7-bit ASCII (a subset of UTF-8).  That doesn't mean they support UTF-8.
> UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html

It is incompletely implemented, sure.

> UTF-8 has a maximum encoding length of 6 bytes for one character.  If such a character appears at index 100 in char[] myString, what is the return value from myString[100]?  The answer should be "one UTF-8 char with an internal 6-byte representation."  I don't think D does that.

No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.

> Besides which, my idea was a native string primitive, not a quasi-array.  The confusion of strings with arrays was a basic, fundamental mistake of C.  While some string semantics do resemble those of arrays, this resemblance should not mandate identical data types.  Strings are important enough to merit their own intrinsic type.  Icon is not the only language to recognize that fact.  D documents make no mention of any string primitive:
> http://www.digitalmars.com/d/type.html
> D has two intrinsic character types, a dynamic array type, and _no_ intrinsic string type.

D does have an intrinsic string literal.

> Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide."  The differing cross-platform widths of the 'wide' char are asking for trouble; poof goes data portability.  D characters are not based on Unicode, but archaic MS Windows API and legacy C terminology spot-welded onto Linux.  How about Unicode as a basis?

Actually, this has changed. Wide chars are now fixed at 16 bits, i.e. UTF-16. For UTF-32, just use uint's.

> The ideal type system would offer as intrinsic/primitive/native language types:
> - UTF-8 char
> - UTF-16 char
> - UTF-32 char
> - UTF-8 string
> - UTF-16 string
> - UTF-32 string
> - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
> - built-in conversions to/from UTF strings and C-style byte arrays
>
> The preceding list will not seem very long when you consider how many numeric types D supports.  Strings are as important as numbers.

That's actually pretty close to what D supports.

> The old C 'char' type is merely a byte; D already has 'ubyte.'  The distinction between ubyte and char in D escapes me.  Maybe the reasoning is that a char might be 'wide' so D needs a separate type?  But that reason disappears once you have nice UTF characters.  So even if the list is a bit long it also eliminates two redundant types, char and wchar.

The distinction is char is UTF-8, and byte is well, just a byte. The distinction comes in handy when dealing with overloaded functions.
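
For instance (a sketch in present-day D; the put functions are hypothetical), the distinct types let overload resolution route raw bytes and UTF-8 code units to different code paths:

    void put(ubyte b) { /* raw binary path */ }
    void put(char c)  { /* UTF-8 code unit path */ }

    void main()
    {
        ubyte b = 0x41;
        char  c = 'A';
        put(b);   // selects the ubyte overload
        put(c);   // selects the char overload
    }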

> I would not be against retention of char and char[] for C compatibility purposes if someone could point out why 'ubyte' and 'char[]' do not suffice.

Function overloading.

> Otherwise I would just alias 'char' into 'ubyte' and be done with it.  The wchar could be stored inside a UTF-16 or UTF-32 char, or be declared as a struct.
>
> To the user, strings would act like dynamic arrays.  Internally they are different animals.  Each 'element' of the 'array' can have varying length per Unicode specifications.  String primitives would hide Unicode complexity under the hood.
>
> That's just the beginning.  Now that you have string intrinsics, you can give them special behaviors pertaining to i/o streams and such.  You can define 'streaming' conversions from other intrinsic types to strings for i/o purposes.  And...permit me to dream!...you can define Icon-style string scanning expressions.
>
> Mark


April 01, 2003
Hi again Bill

After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...

Under my scheme we gain 3 character types and drop 2: net gain 1. We gain 3 string types and drop 1: net gain 2. Total net gain, 3 types. What does that buy us? Complete internationalization of D, complete freedom from ugly C string idioms, data portability across platforms, ease of interfacing with Win32 APIs and other software languages.

The idea of "just one" Unicode type holds little water. Why don't you make the same argument about numeric types, of which we have some twenty-odd? Or how about if D offered just one data type, the bit, and let you construct everything else from that? If D does Unicode then D should do it right. It's a poor, asymmetric design to have some Unicode built-in and the rest tacked on as library routines.

Mark


> This is a rare occasion when I agree with Mark. The fact that a minimalist like me, and a maximalist like Mark, and a pragmatist like yourself seem to agree is something Walter should consider. I would want to hold built-in string support to just UTF-8. D could offer some support for the other formats through conversion routines in a standard library. Having a single string format would surely be simpler than supporting them all. Bill


April 01, 2003
>>The answer should be "one UTF-8 char with an internal 6-byte representation."
>No, it doesn't do that. Sometimes you want the byte, sometimes the assembled unicode char.

But the only use for raw bytes is precisely such low-level format conversions as are proposed to go under the hood. String usage involves character analysis, not bit shuffling. There is a place for getting raw bytes, but a string subscript is not it. Maybe a typecast to ubyte[], and then an array subscript.

The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.
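
In present-day D terms (an illustration, not a description of the compiler as it stood then), the difference shows up in how the loop variable's type selects between bytes and characters:

    void main()
    {
        string s = "grüße";   // five characters, seven UTF-8 bytes

        foreach (char u; s)  { /* 7 iterations: raw code units     */ }
        foreach (dchar c; s) { /* 5 iterations: decoded characters */ }
    }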

>D does have an intrinsic string literal.

But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.

>Wide chars are now fixed at 16 bits, i.e. UTF-16.

Ditto. Wide chars are not UTF-16 chars since they are fixed width. UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)
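
A concrete case, sketched in present-day D (the literal chosen is just an example): a character outside the Basic Multilingual Plane, such as U+1D11E, needs two 16-bit code units in UTF-16 and four bytes in UTF-8.

    void main()
    {
        wstring w = "\U0001D11E";   // MUSICAL SYMBOL G CLEF
        assert(w.length == 2);      // a surrogate pair: two UTF-16 code units

        string u = "\U0001D11E";
        assert(u.length == 4);      // four UTF-8 bytes for the same character
    }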

> For UTF-32, just use uint's.

Possible, but see my final point.

>That's actually pretty close to what D supports.

I don't see anything close. (a) There is no Unicode string primitive
(char[] is not a string primitive, let alone Unicode; it's an array
type). (b) There are no Unicode characters. There are merely types
with similar 'average' sizes being touted as Unicode capable (they are
not).

>> if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
>Function overloading.

This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.

Thanks for taking all our thoughts into consideration.

Mark


April 01, 2003
"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6an4p$1bpj$1@digitaldaemon.com...
> The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.

That's only partially true - the downside is that when you need high performance you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding. In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.
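
For example (a sketch against today's Phobos; the strings are made up), a search returns a byte index, and slicing on that index never needs to decode anything:

    import std.string : indexOf;

    void main()
    {
        string s = "prix: 10\u20AC";   // the euro sign is three UTF-8 bytes
        auto i = s.indexOf("10");      // a byte index, found without decoding
        string amount = s[i .. $];     // slicing by byte index, still valid UTF-8
    }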

> >D does have an intrinsic string literal.
> But it's not Unicode, just char or wchar. Those are both fixed byte-width, but all Unicode chars, except UTF-32, are variable byte-width.

No, in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.
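
That behavior can be seen in present-day D (an illustrative sketch): one and the same literal takes on whichever encoding the destination asks for.

    void main()
    {
        string  a = "résumé";   // stored as UTF-8  (8 code units)
        wstring b = "résumé";   // the same literal as UTF-16 (6 code units)
        dstring c = "résumé";   // the same literal as UTF-32 (6 code units)
    }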

> >Wide chars are now fixed at 16 bits, i.e. UTF-16.
> Ditto. Wide chars are not UTF-16 chars since they are fixed width.

What I meant is they do not change size from implementation to implementation. They are 16 bits, and line up with the UTF-16 API's of Win32.

> UTF-16 characters can be 16 or 32 bits wide. (UTF-8 characters can be anywhere from 1 byte to 6 bytes wide.)

Yes.

> > For UTF-32, just use uint's.
> Possible, but see my final point.
> >That's actually pretty close to what D supports.
> I don't see anything close. (a) There is no Unicode string primitive
> (char[] is not a string primitive, let alone Unicode; it's an array
> type).

I think that's a matter of perspective.

> (b) There are no Unicode characters. There are merely types
> with similar 'average' sizes being touted as Unicode capable (they are
> not).

I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.

> >> if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
> >Function overloading.
> This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.

I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.
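
In today's D a dedicated 32-bit character type plays that role (a sketch; dchar/dstring stand in here for the raw uint suggested above):

    void main()
    {
        dstring s = "naïve";   // UTF-32: one code unit per code point
        assert(s.length == 5);
        assert(s[2] == 'ï');   // plain indexing lands on whole characters
    }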

> Thanks for taking all our thoughts into consideration.

You're welcome.


April 01, 2003
One minor point:

We *must* have char/wchar and byte/ubyte/short/ushort as separate, and overloadable, entities. This is about the most egregious and toxic aspect of C/C++ that I can think of. Absolute nightmare when trying to write generic serialisation components, messing around with compiler discrimination pre-processor guff to work out whether the compiler "knows" about wchar_t, and crying oneself to sleep with char, signed char, unsigned char, etc. etc.

Following this logic, if D does evolve to support different character encoding schemes, it would be nice to have separate char types, although I know this will draw the succinctness crowd down on me like a pack of blood-thirsty vultures.

Swoop away flying beasties, my gizzard is exposed.



"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6abjh$12m8$1@digitaldaemon.com...
> Walter says (in response to my post)...
> >> D needs a Unicode string primitive.
> >It does already. In D, a char[] is really a utf-8 array.
>
> I'm dubious about this claim.  ANSI C char arrays are UTF-8 too, if the
contents
> are 7-bit ACSII (a subset of UTF-8).  That doesn't mean they support
UTF-8.
>
> UTF-8 is on D's very own 'to-do' list: http://www.digitalmars.com/d/future.html
>
> UTF-8 has a maximum encoding length of 6 bytes for one character.  If such
a
> character appears at index 100 in char[] myString, what is the return
value from
> myString[100]?  The answer should be "one UTF-8 char with an internal
6-byte
> representation."  I don't think D does that.
>
> Besides which, my idea was a native string primitive, not a quasi-array.
The
> confusion of strings with arrays was a basic, fundamental mistake of C.
While
> some string semantics do resemble those of arrays, this resemblance should
not
> mandate identical data types.  Strings are important enough to merit their
own
> intrinsic type.  Icon is not the only language to recognize that fact.  D
> documents make no mention of any string primitive:
> http://www.digitalmars.com/d/type.html
> D has two intrinsic character types, a dynamic array type, and _no_
intrinsic
> string type.
>
> Characters should be defined as UTF-8 or UTF-16 or UTF-32, not "short" and "wide."  The differing cross-platform widths of the 'wide' char is asking
for
> trouble; poof goes data portability.  D characters are not based on
Unicode, but
> archaic MS Windows API and legacy C terminology spot-welded onto Linux.
How
> about Unicode as a basis?
>
> The ideal type system would offer as intrinsic/primitive/native language
types:
> - UTF-8 char
> - UTF-16 char
> - UTF-32 char
> - UTF-8 string
> - UTF-16 string
> - UTF-32 string
> - built-in conversions between all of the above (e.g. UTF-8 to UTF-16)
> - built-in conversions to/from UTF strings and C-style byte arrays
>
> The preceding list will not seem very long when you consider how many
numeric
> types D supports.  Strings are as important as numbers.
>
> The old C 'char' type is merely a byte; D already has 'ubyte.'  The
distinction
> between ubyte and char in D escapes me.  Maybe the reasoning is that a
char
> might be 'wide' so D needs a separate type?  But that reason disappears
once you
> have nice UTF characters.  So even if the list is a bit long it also
eliminates
> two redundant types, char and wchar.
>
> I would not be against retention of char and char[] for C compatibility
purposes
> if someone could point out why 'ubyte' and 'char[]' do not suffice.
Otherwise I
> would just alias 'char' into 'ubyte' and be done with it.  The wchar could
be
> stored inside a UTF-16 or UTF-32 char, or be declared as a struct.
>
> To the user, strings would act like dynamic arrays.  Internally they are different animals.  Each 'element' of the 'array' can have varying length
per
> Unicode specifications.  String primitives would hide Unicode complexity
under
> the hood.
>
> That's just the beginning.  Now that you have string intrinsics, you can
give
> them special behaviors pertaining to i/o streams and such.  You can define 'streaming' conversions from other intrinsic types to strings for i/o
purposes.
> And...permit me to dream!...you can define Icon-style string scanning expressions.
>
> Mark
>
>


April 01, 2003
> This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
>

Agree. Let's have more char types


April 01, 2003
I'm sold. Where can I sign up?

I presume you'll be working on the libraries ... ;)

To suck up: I've been faffing around with this issue for years, and have been (unjustifiably, in my opinion) called on numerous times to expertly opine on it for clients. (My expertise is limited to the C/C++ char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full picture.) Your discussion here is the first time I even get a hint that I'm listening to someone who knows what they're talking about. It's nasty, nasty stuff, and I hope that your promise can bear fruit for D. If it can, then it'll earn massive brownie points for D over its peer languages. There's a big market out there of people whose character sets don't fall into 7-bits ...



"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6al79$1ahd$1@digitaldaemon.com...
> Hi again Bill
>
> After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...
>
> Under my scheme we gain 3 character types and drop 2: net gain 1. We gain 3 string types and drop 1: net gain 2. Total net gain, 3 types. What does that buy us? Complete internationalization of D, complete freedom from ugly C string idioms, data portability across platforms, ease of interfacing with Win32 APIs and other software languages.
>
> The idea of "just one" Unicode type holds little water. Why don't you make the same argument about numeric types, of which we have some twenty-odd? Or how about if D offered just one data type, the bit, and let you construct everything else from that? If D does Unicode then D should do it right. It's a poor, asymmetric design to have some Unicode built-in and the rest tacked on as library routines.
>
> Mark
>
>
> > This is a rare occasion when I agree with Mark. The fact that a minimalist like me, and a maximalist like Mark, and a pragmatist like yourself seem to agree is something Walter should consider. I would want to hold built-in string support to just UTF-8. D could offer some support for the other formats through conversion routines in a standard library. Having a single string format would surely be simpler than supporting them all. Bill
>
>


April 01, 2003
>> The whole point of built-in Unicode support is to let users avoid dealing with bytes and let them deal with characters instead.
>
> That's only partially true - the downside is that when you need high performance you'll need byte indices, not UTF character strides. There is no getting away from the variable byte encoding.

If I understand correctly, the translation is that it's better to let end users process bytes, so they can waste hours <g> tuning inner loops, than to offer language support, with pre-tuned inner loops. I don't see that. In fact native language support is better from a performance perspective (both in time of execution and in time of development).

> In my (limited) experience with string processing and UTF-8, rarely is it necessary to decode it. Most manipulation is done with indices.

Manipulation is done with indices in C, because that is all C offers. It's one of the big problems with C vis-a-vis Unicode.

> in D, the intrinsic string literal is not just char or wchar. It's a unicode string - its internal format is not fixed until semantic processing, when it is adjusted to be UTF-8, -16, or -32 as needed.

I think your definition of "Unicode" is basically wrong. What you are calling UTF-8 and UTF-16 is really just fixed-width slots that the user must conglomerate, not true native Unicode characters. So we are talking past each other.  For example when you say "internal format" I don't suppose you have in mind that 6-byte-wide UTF-8 character I mentioned.

When I say Unicode character, I mean an object that the language recognizes, intrinsically, as a variable-byte-width object, but which it presents to the user as an integrated (opaque) whole. I do not mean a user-defined conglomeration of fixed-width fields. That seems to be your working definition and it does not satisfy me.

>> >Wide chars are now fixed at 16 bits, i.e. UTF-16.
>> Ditto. Wide chars are not UTF-16 chars since they are fixed width.
>
> What I meant is they do not change size from implementation to implementation.

That's what I understood you to mean; and that much is good, as far as it goes, but doesn't address Unicode.

> They are 16 bits, and line up with the UTF-16 API's of Win32.

If Windows supports full UTF-16, then D does not support UTF-16 API's of Win32 with any native data type. The user still faces the same labor (more or less) as supporting Unicode in ANSI C.


> I think that's a matter of perspective.... I believe they are unicode capable. Now, I have not written the I/O routines so they will print as unicode, and there are other gaps in the implementation, but the core concept is there.

I've tried to explain why there is no Unicode character in D, and on that basis alone, I could say there is no Unicode string in D.

The syntax and semantics of char[] are identical across all types of arrays, not limited to strings. (What syntax or semantics are unique to strings?)
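
To put that question concretely (a small sketch in present-day D): whatever works on char[] works identically on any other dynamic array.

    void main()
    {
        char[] s = "ab".dup;
        int[]  n = [1, 2];

        s ~= 'c';  s = s[0 .. 2];   // append, slice...
        n ~= 3;    n = n[0 .. 2];   // ...exactly the same operations on int[]
    }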

End users can create and manipulate almost any data structure -- any collection of bits -- in D, or for that matter C, or assembly language, or even machine language. What I'm talking about is intrinsic language support to save the labor (and mistakes).

I could build Unicode strings with a Turing machine if I wanted to. That's not "language support" in my book.

Saying that we already have 8-bit things, and 16-bit things, and 32-bit things, and that users can do Unicode by combining these things in various ways, is not a reasonable argument that the language supports Unicode. At best one might say, D does not prevent users from implementing Unicode, if they want to take the extra trouble.

>> >> if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
>> >Function overloading.
>> This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
>
> I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding.

Then you are ignoring your own argument about function overloading!

:-)

Mark


