November 21, 2005
On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:

> Kris wrote:
>> "Georg Wrede" <georg.wrede@nospam.org> wrote ...
>> 
>>>Kris wrote:
>>>
>>>>This is the long standing mishmash between character literal arguments and parameters of type char[], wchar[], and/or dchar[]. Character literals don't really have a "solid" type ~ the compiler can, and will, convert between wide and narrow representations on the fly.
>>>
>>>Compared to the bit thing I recently "bitched" about, this, IMHO, is an issue one can accept better. :-)
>> 
>> That doesn't make it any less problematic :-)
>> 
>>>It is a problem for small example programs. Larger programs tend to (and IMHO should) have wrappers anyhow:
>> 
>> Not so. You'd see people complaining about this constantly if Stream.write() was not decorated to distinguish between the three relevant methods. Generally speaking, any code that deals with all three array types will bump into this. Mango.io has the same problem, since it exposes write() methods for every D type plus their array counterparts.
>> 
>>>Why not change Phobos
>>>
>>>void write ( char[] s) {.....};
>>>void write (wchar[] s) {.....};
>>>void write (dchar[] s) {.....};
>>>
>>>into
>>>
>>>void _write ( char[] s) {.....};
>>>void _write (wchar[] s) {.....};
>>>void _write (dchar[] s) {.....};
>>>void write (char[] s) {_write(s);};
>>>
>>>I think this would solve the issue with string literals as discussed in this thread.
>> 
>> Then, how would one write a dchar[] literal? You just moved the problem to the _write() method instead. I think there needs to be a general resolution instead.
>> 
>> One might infer the literal type from the content therein?
>> 
>>>Also, overloading would not be hampered.
>>>
>>>And, those who really _need_ types other than the 8 bit chars, could still have their types work as usual.
>> 
>> Ahh. I think non-ASCII folks would be troubled by this bias <g>
>> 
> 
> char[] does NOT NECESSARILY MEAN an ASCII-only string in D.
> 
> char[] can be a collection of UTF-8 code points, which further confuses the matter.
> 
> So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings.  The only effect of the choice is the efficiency with which your project processes strings.  You should not lose any data, unless you make incorrect assumptions in your code.
> 
> I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough.  There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32).  Then, there should be a single ASCII character type called 'char'.  This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.
> 
> String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character.  The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.
> 
> For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').
> 
> Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.

Very nice. Well said, James. It makes so much sense when laid out like this. D is only halfway to supporting international character sets.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
22/11/2005 10:51:23 AM
November 22, 2005
Derek Parnell wrote:
> On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
>>
>> I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough.  There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32).  Then, there should be a single ASCII character type called 'char'.  This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.
>>
>> String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character.  The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.
>>
>> For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').
>>
>> Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
> 
> Very nice. Well said, James. It makes so much sense when laid out like this.
> D is only halfway to supporting international character sets.

I agree, but there must be a way to improve internationalization without this degree of complexity.  If D ends up with 6+ character types I think I might scream.  Is there any reason to support C-style code pages in-language in D?  I would like to think not.  As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity.  The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate.  I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable.  Why bother with all that old C code page nonsense?
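
Something like this should Just Work, if I'm not mistaken (a minimal sketch; toStringz is the Phobos helper that appends the NUL terminator C expects):

import std.string;             // for toStringz

extern (C) int puts(char* s);  // plain C function

void example()
{
    char[] s = "hello";        // pure ASCII, so valid UTF-8 and valid for C
    puts(toStringz(s));
}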


Sean
November 22, 2005
"James Dunne" <james.jdunne@gmail.com> wrote ...
> Kris wrote:
[snip]
>> Ahh. I think non-ASCII folks would be troubled by this bias <g>
>
> char[] does NOT NECESSARILY MEAN an ASCII-only string in D.
>
> char[] can be a collection of UTF-8 code points, which further confuses the matter.

Indeed. I was alluding to encoding multi-byte utf8 literals by hand, but it was a piss-poor attempt at humour.


> So long as you can process each variant of Unicode encodings (UTF-8, UTF-16, and UTF-32), it should NOT matter which you choose as your default encoding for your project's strings.  The only effect of the choice is the efficiency with which your project processes strings.  You should not lose any data, unless you make incorrect assumptions in your code.

Right.


> String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character.

They are. The 'c', 'w', and 'd' suffixes provide the fine control. Auto instances map implicitly to 'c'. Explicitly typed instances (e.g. wchar[] s = "a wide string";) also provide fine control. The minor concern I have with this aspect is that the literal content does not play a role, whereas it does with char literals (such as '?', '\u0001', and '\U00000001'). No big deal there, although perhaps it's food for another topic?


> The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.

I wondered about that also. Walter pointed out it would be similar to the signed/unsigned char-type switch prevalent in C compilers, which can cause grief. Perhaps D does need defaults like that, but some consistency in the interpretation of string literals would have to happen first. That would require a subtle change:

That change is to assign a resolvable type to 'undecorated' string-literal arguments in the same way as the "auto" keyword does. This would also make it consistent with undecorated integer-literals (as noted elsewhere). In short, an undecorated argument "literal" would be treated as a decorated "literal"c (that 'c' suffix makes it utf8), just like auto does. This would mean all uses of string literals are treated consistently, and all undecorated literals (string, char, numeric) have consistent rules when it comes to overload resolution (currently they do not).

To elaborate, here's the undecorated string literal asymmetry:

auto s = "literal";   // effectively adds an implicit 'c' suffix

myFunc ("literal");  // Should be changed to behave as above

What I hear you asking for is a way to alter that implicit suffix?  I'd be really happy to just get the consistency first :-)
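
Concretely (a sketch ~ myFunc and its overloads here are hypothetical):

void myFunc (char[] s) {}
void myFunc (wchar[] s) {}
void myFunc (dchar[] s) {}

void test ()
{
    auto s = "literal";   // fine today: s is typed as char[]
    myFunc (s);           // fine: resolves to the char[] overload
    myFunc ("literal");   // ambiguous today; with the change it would behave
                          // just like the two lines above
}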


These instances are all (clearly) explicitly typed:

char[] s = "literal";   // utf8
wchar[] s = "literal"; // utf16
dchar[] s = "literal"; // utf32

auto s = "literal"c;  // utf8
auto s = "literal"w;  // utf16
auto s = "literal"d;  // utf32

myFunc ("literal"c); // utf8
myFunc ("literal"w); // utf16
myFunc ("literal"d); // ut32


> For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').
>
> Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.

If I understand correctly, you can. See above.


November 22, 2005
On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:

> Derek Parnell wrote:
>> On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
>>>
>>> I think it was a very wise decision to make char type separate from byte and ubyte, but I don't think it has separated far enough.  There should be code-point types such as char8, char16, and char32 (or cdpt8, cdpt16, cdpt32).  Then, there should be a single ASCII character type called 'char'.  This would allow strings to be defined to hold ASCII characters, UTF-8 code points, UTF-16 code points, or UTF-32 code points.
>>>
>>> String literals created from the D compiler should be stored as a specific encoding, whether that be ASCII, UTF-8, UTF-16, or UTF-32 and should be represented as the corresponding static array of the type of character.  The default encoding should be modifiable with either commandline options or with pragmas, preferably pragmas.
>>>
>>> For instance, if the default encoding were to be UTF-8, then a string literal "hello world" should have a type of 'char8[11]' (or 'cdpt8[11]').
>>>
>>> Also, it should be possible to explicitly specify the encoding for each string literal on a case-by-case basis.
>> 
>> Very nice. Well said, James. It makes so much sense when laid out like this. D is only halfway to supporting international character sets.
> 
> I agree, but there must be a way to improve internationalization without this degree of complexity.  If D ends up with 6+ character types I think I might scream.

Where did you get "6+ character types" from?

James is (at worst) only adding one, ASCII. So we would end up with

  utf8  <==> schar[]  (Short? chars)
  utf16 <==> wchar[]  (Wide chars)
  utf32 <==> dchar[]  (Double-wide chars)
  ascii <==> char[]   (byte size chars)

But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now.

Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.

In this scheme, the old 'char' would be directly compatible with C/C++ legacy code.
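
Something like this (the syntax is purely hypothetical, of course):

  schar[] s = "aé€";       // 3 *characters*, even though UTF-8 needs 6 bytes
  auto len = s.length;     // 3, not 6
  schar c = s[1];          // the second character, not the second byte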


> Is there any reason to support C-style code pages in-language in D?

Huh? What code pages? This is nowhere near anything James was talking about.

> I would like to think not.  As it stands, D supports three compatible encodings (char, wchar, dchar) that the programmer may choose between for reasons of data size and algorithm complexity.  The ASCII-compatible subset of UTF-8 works fine with the char-based C functions, and the full UTF-16 or UTF-32 character sets are compatible with the wchar-based C functions (depending on platform)... so far as I know at any rate.  I grant that the variable size of wchar in C is an irritating problem, but it's not insurmountable.  Why bother with all that old C code page nonsense?

Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM.
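
For example, just walking a char[] character-by-character currently means something like this (give or take the exact std.utf.decode signature):

  import std.utf;

  void perCharacter (char[] s)
  {
    size_t i = 0;
    while (i < s.length)
    {
      dchar c = decode (s, i);  // steps i past 1, 2, 3 or 4 code units
      // ... work with the whole character c ...
    }
  }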

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
22/11/2005 12:15:38 PM
November 22, 2005
"Derek Parnell" <derek@psych.ward> wrote ...
> On Mon, 21 Nov 2005 16:43:50 -0800, Sean Kelly wrote:
[snip]
>
>> Derek Parnell wrote:
>>> On Mon, 21 Nov 2005 17:23:29 -0600, James Dunne wrote:
>>> Very nice. Well said, James. It makes so much sense when laid out like
>>> this.
>>> D is only halfway to supporting international character sets.
>>
>> I agree, but there must be a way to improve internationalization without this degree of complexity.  If D ends up with 6+ character types I think I might scream.
>
> Where did you get "6+ character types" from?
>
> James is (at worst) only adding one, ASCII. So we would end up with
>
>  utf8  <==> schar[]  (Short? chars)
>  utf16 <==> wchar[]  (Wide chars)
>  utf32 <==> dchar[]  (Double-wide chars)
>  ascii <==> char[]   (byte size chars)
>
> But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now.
>
> Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.

Maybe. To maintain array indexing semantics, the compiler might implement such things as an array of pointers to byte arrays?

Then, there's at least this problem: dchar is always self-contained. It does not have surrogates, ever. Given that it's more efficient to store as a one-dimensional array, surely this would cause inconsistencies in usage? And what about BMP-only utf16? It doesn't need such treatment either (though utf16 beyond the BMP would).

But I agree in principle ~ the semantics of indexing (as in arrays) don't work well with multi code-unit encodings. Packages to deal with such things typically offer iterators as a supplement. Take a look at ICU for examples?


[snip]
> Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM.

I suspect it's a tall order to build such things into the compiler, especially when the issues are not clear-cut, and when there are heavy-duty libraries to take up the slack? Don't those libraries take care of data representation and incidental housekeeping on behalf of the developer?


November 22, 2005
Derek Parnell wrote:
> 
> Where did you get "6+ character types" from?

I misunderstood and thought his cdpt8 would be added in addition to the existing character types.

> James is (at worst) only adding one, ASCII. So we would end up with
> 
>   utf8  <==> schar[]  (Short? chars)
>   utf16 <==> wchar[]  (Wide chars)
>   utf32 <==> dchar[]  (Double-wide chars)
>   ascii <==> char[]   (byte size chars)
> 
> But the key point is that each element in these arrays would be a
> *character* (a.k.a. Code Point) rather than Code Units as they are now.
> 
> Thus a schar is an atomic value that represents a single character even if
> that character takes up one, two, or four bytes in RAM. And 'schar[4]'
> would represent a fixed size array of 4 code points.

This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again?  And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?

> Sure the current system can work, but only if the coder does a lot of
> mundane, error-prone work, to make it happen. The compiler is a tool to
> help coders do better, so it should help us take care of incidental
> housekeeping so we can concentrate on algorithms rather than data
> representations in RAM.

The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true.  I agree that this is a problem, but I'm not sure that variable width characters is the solution.  It makes array manipulations oddly inconsistent, for one thing.  Should the length property return the number of characters in the array?  Would a size property be needed to determine the memory footprint of this array?  What if I try something like this:

utf8[] myString = "multiwidth";
utf8[] slice = myString[0..1];
slice[0] = '\U00000001';

Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted?


Sean
November 22, 2005
On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:

> Derek Parnell wrote:
>> 
>> Where did you get "6+ character types" from?
> 
> I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
> 
>> James is (at worst) only adding one, ASCII. So we would end up with
>> 
>>   utf8  <==> schar[]  (Short? chars)
>>   utf16 <==> wchar[]  (Wide chars)
>>   utf32 <==> dchar[]  (Double-wide chars)
>>   ascii <==> char[]   (byte size chars)
>> 
>> But the key point is that each element in these arrays would be a *character* (a.k.a. Code Point) rather than Code Units as they are now.
>> 
>> Thus a schar is an atomic value that represents a single character even if that character takes up one, two, or four bytes in RAM. And 'schar[4]' would represent a fixed size array of 4 code points.
> 
> This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again?  And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?

That is what I'm doing now to Build. Internally, all strings will be dchar[], but what I'm finding out is the huge lack of support for dchar[] in phobos. I've now coded my own routine to read text files in UTF formats, but store them as dchar[] in the application. Then I've had to code appropriate routines for all the other support functions: split(), strip(), find(), etc ...
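
The reading side boils down to something like this (a sketch; it assumes the file is UTF-8, whereas a real loader sniffs the BOM first and picks the matching toUTF32 overload):

  import std.file;
  import std.utf;

  dchar[] loadText (char[] filename)
  {
    // std.file.read hands back the raw bytes; assume they are UTF-8
    char[] raw = cast(char[]) std.file.read (filename);
    return toUTF32 (raw);
  }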

>> Sure the current system can work, but only if the coder does a lot of mundane, error-prone work, to make it happen. The compiler is a tool to help coders do better, so it should help us take care of incidental housekeeping so we can concentrate on algorithms rather than data representations in RAM.
> 
> The only somewhat confusing issue to me is that the symbol names "char" and "wchar" imply that the data stored therein is a complete character, when this is only sometimes true.  I agree that this is a problem, but I'm not sure that variable width characters is the solution.  It makes array manipulations oddly inconsistent, for one thing.  Should the length property return the number of characters in the array?

Yes.

>  Would a
> size property be needed to determine the memory footprint of this array?

Yes.

>   What if I try something like this:
> 
> utf8[] myString = "multiwidth";
> utf8[] slice = myString[0..1];
> slice[0] = '\U00000001';
> 
> Would the sliced array resize to fit the potentially different-sized character being inserted, or would myString end up corrupted?

Yes, it would be complex. No, myString would not be corrupted. It would just be the same as doing it 'manually', only the compiler would do the hack work for you.

  char[] myString = "multiwidth";
  char[] slice = myString[0..1];
  // modify base string.
  myString = "\U00000001" ~ myString[1..$];
  // reslice it because its address might have changed.
  slice = myString[0..1];

Messy doing it manually, so that's why a code-point array would be better than a byte/short/int array for strings.
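
With a code-point array, the same edit would collapse to (hypothetical, again):

  schar[] myString = "multiwidth";
  myString[0] = '\U00000001';  // the compiler re-encodes and shifts the rest for us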

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
22/11/2005 2:17:40 PM
November 22, 2005
On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu@bar.com> wrote:
> The minor concern I have with
> this aspect is that the literal content does not play a role, whereas it
> does with char literals (such as '?', '\u0001', and '\U00000001').

But that makes sense, right? Character literals, e.g. '\U00000001', will only _fit_ in certain types; the same is not true for string literals, which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?)

If this were to change would it make this an error:

void foo(wchar[] foo) {}
foo("\U00000040");

> No big
> deal there, although perhaps it's food for another topic?

Here seems like as good a place as any.

Regan
November 22, 2005
"Regan Heath" <regan@netwin.co.nz> wrote...
> On Mon, 21 Nov 2005 17:35:26 -0800, Kris <fu@bar.com> wrote:
>> The minor concern I have with
>> this aspect is that the literal content does not play a role, whereas it
>> does with char literals (such as '?', '\u0001', and '\U00000001').
>
> But that makes sense, right? Character literals, e.g. '\U00000001', will only _fit_ in certain types; the same is not true for string literals, which will always _fit_ in all 3 even if the way they end up being represented is not exactly what you've typed (or is that the problem?)
>
> If this were to change would it make this an error:
>
> void foo(wchar[] foo) {}
> foo("\U00000040");
>
>> No big
>> deal there, although perhaps it's food for another topic?
>
> Here seems like as good a place as any.


Oh, that minor concern was in regard to consistency here also. I have no quibble with the character type being implied by content (consistent with numeric literals):

1) The type for literal chars is implied by their content ('?', '\u0001', '\U00000001')

2) The type of a numeric literal is implied by the content (0xFF, 0xFFFFFFFF, 1.234)

3) The type for literal strings is not influenced at all by the content.

Further; both #2 & #3 have suffixes to cement the type, but #1 does not (as far as I'm aware). These two inconsistencies are small, but they may influence concerns elsewhere ...


November 23, 2005
Derek Parnell wrote:
> On Mon, 21 Nov 2005 19:05:53 -0800, Sean Kelly wrote:
> 
> 
>>Derek Parnell wrote:
>>
>>>Where did you get "6+ character types" from?
>>
>>I misunderstood and thought his cdpt8 would be added in addition to the existing character types.
>>
>>
>>>James is (at worst) only adding one, ASCII. So we would end up with
>>>
>>>  utf8  <==> schar[]  (Short? chars)
>>>  utf16 <==> wchar[]  (Wide chars)
>>>  utf32 <==> dchar[]  (Double-wide chars)
>>>  ascii <==> char[]   (byte size chars)
>>>
>>>But the key point is that each element in these arrays would be a
>>>*character* (a.k.a. Code Point) rather than Code Units as they are now.
>>>
>>>Thus a schar is an atomic value that represents a single character even if
>>>that character takes up one, two, or four bytes in RAM. And 'schar[4]'
>>>would represent a fixed size array of 4 code points.
>>
>>This seems like it would invite a great degree of compiler complexity. What problem are we trying to solve again?  And why not just use dchar if it's important to have a 1-1 correspondence between element and character representation?
> 
> 
> That is what I'm doing now to Build. Internally, all strings will be
> dchar[], but what I'm finding out is the huge lack of support for dchar[]
> in phobos. I've now coded my own routine to read text files in UTF formats,
> but store them as dchar[] in the application. Then I've had to code
> appropriate routines for all the other support functions: split(), strip(),
> find(), etc ... 
> 
> 
Then, wouldn't having good dchar[] support in Phobos be a better solution than having to introduce another type in the language to do the same thing that dchar[] does?
The only difference I see between such a type (a codepoint string) and a dchar string is the better storage size for the codepoint string, but is that difference worth it? (Not to mention a codepoint string would have, in certain cases, much worse modification performance than a dchar string.)

Also, what is Phobos lacking in dchar[] support?

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."