November 24, 2005
I want to thank everyone for reading and posting opinions on my proposal.

It appears I have done a bad job explaining some of it, and some of it simply doesn't work. I have a modified idea in mind which I think will work a good deal better and should also be much simpler.

Thanks everyone.

Regan
November 24, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä  <jmjmak@invalid_utu.fi> wrote:
> 
>> Regan, your proposal is absolutely too complex. I don't get it and I  really don't like it. D is supposed to be a _simple_ language. Here's an  alternative proposal:

Sorry for being a bit impolite. I just wanted to show that it's completely possible to write Unicode-compliant programs without the need for several string keywords. I believe a careful design and implementation removes most of the performance drawbacks.

> Thanks for your opinion. It appears some parts of my idea were badly thought out. I was trying to end up with something simple; it seems a few of my choices were bad ones and they simply complicated the idea.
> 
Thanks for bringing up some conversation. As you can see, neither of us is perfect => designing a modern programming language isn't as easy as it might have seemed.

> I was trying to avoid picking any 1 type over the others (as you have  suggested here).

Actually I have to change my opinion. I think it would be good if the compiler were allowed to choose the correct encoding. I don't think there will be any serious problems, since nowadays most Win32 APIs use UTF-16 and *nix systems UTF-8.

> It appears now that I should replace all my talk about cp1, cp2, cp4 and  cpn with "all characters are stored in a 32 bit type called uchar". If  anyone has a problem with that, I'd direct them to take a look at  std.format.doFormat and std.stdio.writef which convert all char[] data  into individual dchar's before converting it back to UTF-8 for output to  the screen.

That is one solution, although I might let the compiler decide the encoding.
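
(What that doFormat path boils down to is roughly the loop below. This is only a sketch, assuming std.utf.decode and std.utf.toUTF8 with their current Phobos signatures, and with 's' standing for the char[] being printed:)

dchar[] tmp;
size_t i = 0;
while (i < s.length)
    tmp ~= std.utf.decode(s, i);   // one code point at a time; i advances past its code units
char[] back = std.utf.toUTF8(tmp); // and it all gets re-encoded as UTF-8 before reaching the console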

>> Please stop whining about the slowness of utf-conversions. If it's  really so slow, I would certainly want to see some real world benchmarks.
> 
> I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

I can't say anything about the overall performance cost of programs that do Unicode, but at least my simple experiments [1] show that unoptimized use of writefln is 'only' 50% slower than optimized use of printf in C (both using the same GCC backend). Though I'm not 100% sure this program of mine actually did any transcoding. In addition, I think most 'static' conversions can be precalculated.

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/1983
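
(In case someone wants to try something similar: a comparison of this general shape would do. This is only a sketch, not the actual program behind [1]; timing is assumed to be done externally, e.g. with the shell's time command:)

import std.stdio;   // writefln
import std.c.stdio; // printf

void main()
{
    for (int i = 0; i < 1_000_000; i++)
    {
        writefln("line %d with some text: öäå", i);
        // the C-style variant prints the same line with:
        // printf("line %d with some text: öäå\n", i);
    }
}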

Jari-Matti
November 24, 2005
Regan Heath wrote:
> But people don't care about code units, they care about characters. When  do you want to inspect or modify a single code unit? I would say, just  about never. On the other hand you might want to change the 4th character  of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[]  array. Ick.

True. BTW, is there a bug in std.string.insert? I tried to do:

char[] a = "blaahblaah";
std.string.insert(a, 5, "öö");
std.stdio.writefln(a);

Outputs:

blaahblaah

>> When would you actually need character based indexing?
>> I believe the answer is less often than you think.
> 
> Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never; however, D forces you to know: you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

I agree. You don't need it very often, but when you do, there's currently no way to do it. I think char[] slicing and indexing should be a bit better (work at the Unicode character level), since you _never_ want to change code units. (And in case you do, just cast it to void[].)
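
(For reference, the workaround today is a full round trip through dchar[]. A sketch, using std.utf.toUTF32/toUTF8 from Phobos; 'R' is just an arbitrary replacement character:)

char[] s = "smörgåsbord";
dchar[] tmp = std.utf.toUTF32(s); // one array element per Unicode character
tmp[3] = 'R';                     // now the 4th character really is index 3
s = std.utf.toUTF8(tmp);          // back to char[]: "smöRgåsbord"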

Jari-Matti
November 24, 2005
On Fri, 25 Nov 2005 00:31:32 +0200, Jari-Matti Mäkelä wrote:


> True. BTW, is there a bug in std.string.insert? I tried to do:
> 
> char[] a = "blaahblaah";
> std.string.insert(a, 5, "öö");
> std.stdio.writefln(a);
> 
> Outputs:
> 
> blaahblaah

No bug. The function is not designed to modify the string passed to it; it returns a new string with the insertion applied.

char[] a = "blaahblaah";
a = std.string.insert(a, 5, "öö");
std.stdio.writefln(a);




-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 10:04:41 AM
November 25, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu wrote:
>> 
>> It makes far more sense to have only 1 _character_ type, that holds
>> any UNICODE character. Whether it comes from an utf8, utf16 or
>> utf32 string shouldn't matter:

True!

> Yeah, I'm starting to think that is the only way it works. The 3
> types were an attempt to avoid that for programs which do not need an
> int-sized  type for a char i.e. the quick and dirty ASCII program
> for example.
> 
> Interestingly it seems std.stdio and std.format are already involved
> in a  conspiracy to convert all our char[] output to dchar and back
> again one  character at a time before it eventually makes it to the
> screen.

Must've been the specters in the night again. :-)

> I like "uchar". I agree "char" should go back to being C's char type.
> I don't think we need a char[]; all the C functions expect a null-terminated char*.

That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!

November 25, 2005
On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:

> Regan Heath wrote:
>> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu wrote:
>>> 
>>> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:
> 
> True!
> 
>> Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized  type for a char i.e. the quick and dirty ASCII program for example.
>> 
>> Interestingly it seems std.stdio and std.format are already involved in a  conspiracy to convert all our char[] output to dchar and back again one  character at a time before it eventually makes it to the screen.
> 
> Must've been the specters in the night again. :-)
> 
>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
> 
> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!

I think that would interfere with the slice concept.
  char[] a = "some text";
  char[] b = a[4 .. 7];
  // Making 'b' a reference into 'a'.
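  // (If char[] were implicitly zero-terminated, terminating 'b' would have to overwrite a[7], the 'x' inside 'a'.)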

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 11:37:08 AM
November 25, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede@nospam.org>  wrote:
...
> "cpn" would need to change size for each character. It would be more than  a simple alias.
> 
> If it cannot change size, then it would need to be the largest size  required.
> 
> If that was also too weird/difficult then it would need to be 32 bits in  size all the time. I was trying to avoid this but it seems it may be  required?

Yes. I see no way around "cpn" being a full 32 bits.
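
(A quick illustration, with dchar standing in for "cpn":)

dchar clef = '\U0001D11E'; // MUSICAL SYMBOL G CLEF, code point 0x1D11E:
                           // it needs 21 bits, so no 16-bit unit can hold
                           // every character on its own.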

> My proposal didn't suggest different encodings based on the system. It was  UTF-8 by default (all systems) and application specific otherwise. There  is nothing stopping us making the windows default to UTF-16 if that makes  sense. Which it seems to.

Windows, <sigh>. Looks like it.

They seem to have a habit of choosing what seems easiest at the outset, without ever learning to dig into issues first. Had they done so, they'd have chosen UTF-8, like everybody else. :-(
November 25, 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede wrote:
> 
> I was trying to avoid it being 32 bits large all the  time, but it seems to be the only way it works.

I agree. And I share the feeling.  :-)

>> If this is true, then we might consider blatantly skipping cp1 and cp2,  and only having cp4 (possibly also renaming it utfchar).
>>
>> This would make it possible for us to fully automate the extraction and  insertion of single "characters" into our new strings.
>>
>>      string foo = "gagaga";
>>      utfchar bar = '\uFE9D'; // you don't want to know the name :-)
>>      utfchar baf = 'a';
>>      foo ~= bar ~ baf;
> 
> It seems this may be the best solution. Oskar had a good name for it: "uchar". It means quick-and-dirty ASCII apps will have to use a 32-bit char type. I can hear people complain already... but it's odd that no-one is complaining about writef doing this exact same thing!

Not too many have dissected writef. Or else we'd have heard some complaints already. ;-)

I actually thought about "uchar" for a while, but then I remembered that a lot of this utf disaster originates from unfortunate names. And C has a uchar type. So, I'd suggest "utfchar" or "unicode" or something to-the-point and unambiguous that's not in C.

>> For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (rare) situations where the programmer _does_ want to do serious tinkering on our strings.
>>
>>      ubyte[] myarr1 = cast(ubyte[])foo;
>>      ushort[] myarr2 = cast(ushort[]) foo;
>>      uint[] myarr3 = cast(uint[]) foo;
>>
>> These give raw arrays, like exact images of the string.  The burden of COW would lie on the programmer.
> 
> I was thinking of using properties (Sean's idea) to access the data as a  certain type, eg.
> 
> ubyte[]  b = foo.utf8;
> ushort[] s = foo.utf16;
> uint[]   i = foo.utf32;
> 
> these properties would return the string in the specified encoding using  those array types.

So it'd be the same thing, except your code looks a lot nicer!
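
Something along these lines, perhaps? (Purely a sketch; the struct, its name and its dchar[] storage are made up for illustration, and only toUTF8/toUTF16 are existing Phobos routines:)

import std.utf;

struct String
{
    dchar[] data; // one element per Unicode character

    ubyte[]  utf8()  { return cast(ubyte[])  toUTF8(data);  }
    ushort[] utf16() { return cast(ushort[]) toUTF16(data); }
    uint[]   utf32() { return cast(uint[])   data.dup;      }
}

With implicit property calls, "ubyte[] b = foo.utf8;" from your post then reads exactly as quoted.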

November 25, 2005
Derek Parnell wrote:
> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:

>>> I like "uchar". I agree "char" should go back to being C's char
>>> type. I don't think we need a char[]; all the C functions expect a
>>> null-terminated char*.
>> 
>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when
>> (under pressure) converting code from C(++)!
> 
> I think that would interfere with the slice concept.
>   char[] a = "some text";
>   char[] b = a[4 .. 7];
>   // Making 'b' a reference into 'a'.

Slicing C's char[] implies byte-wide, non-UTF semantics.
November 25, 2005
On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:

> Derek Parnell wrote:
>> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
> 
>>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>>> 
>>> That would be nice! What if we even decided that char[] is
>>> null-terminated? That'd massively reduce all kinds of bugs when
>>> (under pressure) converting code from C(++)!
>> 
>> I think that would interfere with the slice concept.
>>   char[] a = "some text";
>>   char[] b = a[4 .. 7];
>>   // Making 'b' a reference into 'a'.
> 
> Slicing C's char[] implies byte-wide, non-UTF semantics.

Exactly, and that's why I'm worried by the suggestion that char[] be automatically zero-terminated: slices are usually not zero-terminated.
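
(That boundary case is what std.string.toStringz is for; a sketch, reusing the earlier slice example:)

  char[] a = "some text";
  char[] b = a[4 .. 7];                        // slice into 'a', not zero-terminated
  std.c.stdio.puts(std.string.toStringz(b));   // make a terminated copy only at the C boundary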

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 3:12:51 PM