November 24, 2005
Re: Unified String Theory..
I want to thank everyone for reading and posting opinions on my proposal.

It appears I explained some of it badly, and some of it simply doesn't
work. I have a modified idea in mind which I think might work a good deal
better and should also be much simpler.

Thanks everyone.

Regan
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:
> On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä  
> <jmjmak@invalid_utu.fi> wrote:
> 
>> Regan, your proposal is absolutely too complex. I don't get it and I
>> really don't like it. D is supposed to be a _simple_ language. Here's
>> an alternative proposal:

Sorry for being a bit impolite. I just wanted to show that it's
completely possible to write Unicode-compliant programs without the need
for several string keywords. I believe a careful design and implementation
removes most of the performance drawbacks.

> Thanks for your opinion. It appears some parts of my idea were badly
> thought out. I was trying to end up with something simple, it seems a
> few of my choices were bad ones and they simply complicated the idea.
> 
Thanks for bringing up some conversation. As you can see, neither of us
is perfect => designing a modern programming language isn't as easy as
it might have seemed.

> I was trying to avoid picking any 1 type over the others (as you have  
> suggested here).

Actually I have to change my opinion. I think it would be good if the
compiler were allowed to choose the correct encoding. I don't think
there will be any serious problems, since nowadays most things on Win32
use UTF-16 and *nix systems use UTF-8.

> It appears now that I should replace all my talk about cp1, cp2, cp4
> and cpn with "all characters are stored in a 32 bit type called uchar".
> If anyone has a problem with that, I'd direct them to take a look at
> std.format.doFormat and std.stdio.writef which convert all char[] data
> into individual dchar's before converting it back to UTF-8 for output
> to the screen.

That is one solution. Although I might let the compiler decide the encoding.
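
To make that round trip concrete, here is a minimal sketch of what "convert to dchar and back to UTF-8, one character at a time" amounts to. It is written against present-day D (D2) and Phobos' std.utf rather than the 2005 library discussed in this thread, and roundTrip is a hypothetical helper for illustration, not what std.format does internally.

```d
import std.stdio : writeln;
import std.utf : decode, encode;

// Hypothetical helper: decode each UTF-8 sequence to a dchar,
// then encode that dchar back to UTF-8.
string roundTrip(string s)
{
    char[] result;
    size_t i = 0;
    while (i < s.length)
    {
        dchar c = decode(s, i);    // reads one code point, advances i past it
        char[4] buf;               // a code point needs at most 4 UTF-8 code units
        size_t n = encode(buf, c); // number of code units written to buf
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    string s = "sm\u00F6rg\u00E5sbord"; // "smörgåsbord", written with escapes
    assert(roundTrip(s) == s);
    writeln(roundTrip(s));
}
```

The work per character is small and linear in the length of the string, which is why a benchmark would be needed before calling it slow.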

>> Please stop whining about the slowness of utf-conversions. If it's  
>> really so slow, I would certainly want to see some real world benchmarks.
> 
> I mention performance only because people have been concerned with it in
> the past. I too have no idea how much time it takes and would like to
> see a benchmark. The fact that D is already doing it with writef and
> no-one has complained...

I can't say anything about the overall complexity class for programs 
that do Unicode, but at least my simple experiments [1] show that 
unoptimized use of writefln is 'only' 50% slower than optimized use of 
printf in C (both using the same gcc-backend). Though I'm not 100% sure 
this program of mine actually did any transcoding. In addition, I think 
most 'static' conversions can be precalculated.

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/1983

Jari-Matti
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:
> But people don't care about code units, they care about characters.
> When do you want to inspect or modify a single code unit? I would say,
> just about never. On the other hand you might want to change the 4th
> character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index
> in a char[] array. Ick.

True. BTW, is there a bug in std.string.insert? I tried to do:

char[] a = "blaahblaah";
std.string.insert(a, 5, "öö");
std.stdio.writefln(a);

Outputs:

blaahblaah

>> When would you actually need character based indexing?
>> I believe the answer is less often than you think.
> 
> Really? How often do you care what UTF-8 fragments are used to
> represent your characters? The answer _should_ be never, however D
> forces you to know, you cannot replace the 4th letter of "smörgåsbord"
> without knowing. This is the problem, IMO.

I agree. You don't need it very often, but when you do, there's
currently no way to do it. I think char[] slicing and indexing
should be a bit better (work at the Unicode character level) since you
_never_ want to change code units. (And in case you do, just cast it to
void[])
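
A rough sketch of the difference between code unit indexing and character indexing, assuming present-day D (D2) and Phobos' std.utf rather than the 2005 library. replaceChar is a hypothetical helper, and "character" here means a Unicode code point.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF32, toUTFindex;

// Hypothetical helper: replace the n-th character (0-based, counted in
// code points) by transcoding to UTF-32, editing, and transcoding back.
string replaceChar(string s, size_t n, dchar c)
{
    dchar[] wide = toUTF32(s).dup; // one array element per code point
    wide[n] = c;
    return toUTF8(wide);
}

void main()
{
    string s = "sm\u00F6rg\u00E5sbord"; // "smörgåsbord" with precomposed ö and å

    assert(s.length == 13);          // UTF-8 code units
    assert(toUTF32(s).length == 11); // code points
    assert(toUTFindex(s, 3) == 4);   // the 4th character starts at code unit 4

    writeln(replaceChar(s, 3, 'R'));
}
```

Note that this counts code points. If the ö were stored as 'o' plus a combining diaeresis, the 4th code point would no longer be the 4th letter a reader sees.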

Jari-Matti
November 24, 2005
Re: Unified String Theory..
On Fri, 25 Nov 2005 00:31:32 +0200, Jari-Matti Mäkelä wrote:


> True. BTW, is there a bug in std.string.insert? I tried to do:
> 
> char[] a = "blaahblaah";
> std.string.insert(a, 5, "öö");
> std.stdio.writefln(a);
> 
> Outputs:
> 
> blaahblaah

No bug. The function is not designed to update the same string passed to
the function. It returns an updated string.

char[] a = "blaahblaah";
a = std.string.insert(a, 5, "öö");
std.stdio.writefln(a);
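
The same "returns a new string" behaviour can be sketched with plain slicing and concatenation. This assumes present-day D (D2); insertAt is a hypothetical stand-in written for illustration, not the Phobos function, and its index is a code unit offset.

```d
import std.stdio : writeln;

// Hypothetical stand-in for an insert that returns a new string.
// `index` is a code unit offset, as with slicing.
string insertAt(string s, size_t index, string sub)
{
    return s[0 .. index] ~ sub ~ s[index .. $];
}

void main()
{
    string a = "blaahblaah";

    string b = insertAt(a, 5, "\u00F6\u00F6"); // "öö", written with escapes
    assert(a == "blaahblaah");                 // the argument is left unchanged
    assert(b == "blaah\u00F6\u00F6blaah");

    a = insertAt(a, 5, "\u00F6\u00F6");        // assign the result to update a
    writeln(a);
}
```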




-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 10:04:41 AM
November 25, 2005
Re: Unified String Theory..
Regan Heath wrote:
> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu 
>> 
>> It makes far more sense to have only 1 _character_ type, that holds
>> any UNICODE character. Whether it comes from an utf8, utf16 or
>> utf32 string shouldn't matter:

True!

> Yeah, I'm starting to think that is the only way it works. The 3
> types were an attempt to avoid that for programs which do not need an
> int-sized  type for a char i.e. the quick and dirty ASCII program
> for example.
> 
> Interestingly it seems std.stdio and std.format are already involved
> in a  conspiracy to convert all our char[] output to dchar and back
> again one  character at a time before it eventually makes it to the
> screen.

Must've been the specters in the night again. :-)

> I like "uchar". I agree "char" should go back to being C's char type.
> I don't think we need a char[] all the C functions expect a null 
> terminated  char*.

That would be nice! What if we even decided that char[] is 
null-terminated? That'd massively reduce all kinds of bugs when (under 
pressure) converting code from C(++)!
November 25, 2005
Re: Unified String Theory..
On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:

> Regan Heath wrote:
>> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu 
>>> 
>>> It makes far more sense to have only 1 _character_ type, that holds
>>> any UNICODE character. Whether it comes from an utf8, utf16 or
>>> utf32 string shouldn't matter:
> 
> True!
> 
>> Yeah, I'm starting to think that is the only way it works. The 3
>> types were an attempt to avoid that for programs which do not need an
>> int-sized  type for a char i.e. the quick and dirty ASCII program
>> for example.
>> 
>> Interestingly it seems std.stdio and std.format are already involved
>> in a  conspiracy to convert all our char[] output to dchar and back
>> again one  character at a time before it eventually makes it to the
>> screen.
> 
> Must've been the specters in the night again. :-)
> 
>> I like "uchar". I agree "char" should go back to being C's char type.
>> I don't think we need a char[] all the C functions expect a null 
>> terminated  char*.
> 
> That would be nice! What if we even decided that char[] is 
> null-terminated? That'd massively reduce all kinds of bugs when (under 
> pressure) converting code from C(++)!

I think that would interfere with the slice concept.
 char[] a = "some text";
 char[] b = a[4 .. 7];
 // Making 'b' a reference into 'a'.

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 11:37:08 AM
November 25, 2005
Re: Unified String Theory
Regan Heath wrote:
> On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede 
> <georg.wrede@nospam.org>  wrote:
...
> "cpn" would need to change size for each character. It would be more 
> than  a simple alias.
> 
> If it cannot change size, then it would need to be the largest size  
> required.
> 
> If that was also too weird/difficult then it would need to be 32 bits 
> in  size all the time. I was trying to avoid this but it seems it may 
> be  required?

Yes. I see no way to avoid "cpn" being 32 bit only.

> My proposal didn't suggest different encodings based on the system. It 
> was  UTF-8 by default (all systems) and application specific otherwise. 
> There  is nothing stopping us making the windows default to UTF-16 if 
> that makes  sense. Which it seems to.

Windows, <sigh>. Looks like it.

They seem to have a habit of choosing what seems easiest at the outset,
without ever learning to dig into issues first. Had they done so, they'd
have chosen UTF-8, like everybody else. :-(
November 25, 2005
Re: Unified String Theory [READ THIS FIRST]
Regan Heath wrote:
> On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede wrote:
> 
> I was trying to avoid it being 32 bits large all the time, but it
> seems to be the only way it works.

I agree. And I share the feeling.  :-)

>> If this is true, then we might consider blatantly skipping cp1 and 
>> cp2,  and only having cp4 (possibly also renaming it utfchar).
>>
>> This would make it possible for us to fully automate the extraction 
>> and  insertion of single "characters" into our new strings.
>>
>>      string foo = "gagaga";
>>      utfchar bar = '\UFE9D'; // you don't want to know the name :-)
>>      utfchar baf = 'a';
>>      foo ~= bar ~ baf;
> 
> It seems this may be the best solution. Oskar had a good name for it  
> "uchar". It means quick and dirty ASCII apps will have to use a 32 bit  
> sized char type. I can hear people complain already.. but it's odd that  
> no-one is complaining about writef doing this exact same thing!

Not too many have dissected writef. Or else we'd have heard some 
complaints already. ;-)

I actually thought about "uchar" for a while, but then I remembered that 
a lot of this utf disaster originates from unfortunate names. And C has 
a uchar type. So, I'd suggest "utfchar" or "unicode" or something 
to-the-point and unambiguous that's not in C.

>> For completeness, we could have the painting casts (as opposed to  
>> converting casts). They'd be for the (seldom) situations where the  
>> programmer _does_ want to do serious tinkering on our strings.
>>
>>      ubyte[] myarr1 = cast(ubyte[])foo;
>>      ushort[] myarr2 = cast(ushort[]) foo;
>>      uint[] myarr3 = cast(uint[]) foo;
>>
>> These give raw arrays, like exact images of the string.  
>> The burden of COW would lie on the programmer.
> 
> I was thinking of using properties (Sean's idea) to access 
> the data as a  certain type, eg.
> 
> ubyte[]  b = foo.utf8;
> ushort[] s = foo.utf16;
> uint[]   i = foo.utf32;
> 
> these properties would return the string in the specified encoding 
> using  those array types.

So it'd be the same thing, except your code looks a lot nicer!
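
For comparison, here is roughly how the two ideas look with facilities that exist in present-day D (D2), which is an assumption on top of this thread: std.string.representation gives the "painting" view, and the std.utf functions do the converting. There is no built-in string type with .utf8/.utf16/.utf32 properties; those remain the proposal.

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : toUTF16, toUTF32;

void main()
{
    // 'a', U+00F6 (ö), and U+1D11E, a code point outside the BMP
    string foo = "a\u00F6\U0001D11E";

    // "Painting": view the existing UTF-8 storage as raw bytes, no copy
    immutable(ubyte)[] raw = foo.representation;
    assert(raw.length == 7);  // 1 + 2 + 4 code units

    // "Converting": transcode into the other encodings
    wstring s16 = toUTF16(foo);
    dstring s32 = toUTF32(foo);
    assert(s16.length == 4);  // 1 + 1 + a surrogate pair
    assert(s32.length == 3);  // one element per code point

    writeln(raw.length, " ", s16.length, " ", s32.length);
}
```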
November 25, 2005
Re: Unified String Theory..
Derek Parnell wrote:
> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:

>>> I like "uchar". I agree "char" should go back to being C's char
>>> type. I don't think we need a char[] all the C functions expect a
>>> null terminated char*.
>> 
>> That would be nice! What if we even decided that char[] is 
>> null-terminated? That'd massively reduce all kinds of bugs when
>> (under pressure) converting code from C(++)!
> 
> I think that would interfere with the slice concept.
>   char[] a = "some text";
>   char[] b = a[4 .. 7];
>   // Making 'b' a reference into 'a'.

Slicing C's char[] implies byte-wide, and non-UTF.
November 25, 2005
Re: Unified String Theory..
On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:

> Derek Parnell wrote:
>> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
> 
>>>> I like "uchar". I agree "char" should go back to being C's char
>>>> type. I don't think we need a char[] all the C functions expect a
>>>> null terminated char*.
>>> 
>>> That would be nice! What if we even decided that char[] is 
>>> null-terminated? That'd massively reduce all kinds of bugs when
>>> (under pressure) converting code from C(++)!
>> 
>> I think that would interfere with the slice concept.
>>   char[] a = "some text";
>>   char[] b = a[4 .. 7];
>>   // Making 'b' a reference into 'a'.
> 
> Slicing C's char[] implies byte-wide, and non-UTF.

Exactly, and that's why I'm worried by the suggestion that char[] be
automatically zero-terminated, because slices are usually not
zero-terminated.
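
A small sketch of the problem, assuming present-day D (D2) with std.string.toStringz and the C strlen binding from core.stdc.string:

```d
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : toStringz;

void main()
{
    char[] a = "some text".dup;
    char[] b = a[4 .. 7];           // " te", a view into a's memory

    assert(b.ptr == a.ptr + 4);     // no copy was made
    assert(a[4 + b.length] == 'x'); // what follows the slice is 'x', not '\0'

    b[1] = 'T';                     // writing through b changes a
    assert(a == "some Text");

    // To hand the slice to C, make a terminated copy first.
    auto cstr = toStringz(b);
    assert(strlen(cstr) == 3);
    writeln(a);
}
```

Terminating b in place would mean overwriting the 'x' in a, so a slice can only be given a terminator by copying it.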

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 3:12:51 PM