November 25, 2005
Long post, with code examples.

Originally thought I was responding to John, but that might not be the case anymore. It's just a general heads-up kind of reply on this lengthy topic. No offence intended to anyone.


"John Reimer" <terminal.node@gmail.com> wrote
> Georg Wrede wrote:
>> John Reimer wrote:
>>
>>> I have a proposal... okay it's not about strings.  After trying to follow all these posts, I now can say I'm thoroughly confused about everything UTF-like, unicodish, pointless, valuable, or characteristically encoded in 8, 16, and 32 discrete portions.
>>
>>
>> Yes, and I bet a bunch of other folks who don't write are even more confused. And that gets us conveniently to the exact point of all this: we who carry on this debate, do it precisely so that future D users could gain a few things:
>>
>>   1. Not get drowned in the current utf maze of glass walls and mirrors.
>>
>>   2. Real Soon Now, be able to do their coding without being forced to
>> know a single thing about utf.
>>
>>   3. Not have downright disinformation stuffed down their throats by D
>> documentation, specs, or the existing choice of data types in D.
>>
>>   4. Get rid of all the gotchas (especially the unobvious) hidden in the
>> current framework of what appears to be character types and handling.


==============


I certainly don't wish to discourage anyone from writing a useful String API. Quite the opposite, in fact. To that end, I'll add some long-winded concerns to watch out for.


1. It seems reasonable that one should come up with an abstraction of what the String should do, using either an abstract class or an interface. This eliminates ctor() considerations completely, and permits pretty much anyone to write a compatible String. Compatibility is important if this is going to become fundamental to D. So, write an abstract specification; and then provide a rudimentary concrete class that implements the spec. Perhaps a simple dchar[] implementation? Anyway, onto an example specification:


2. Suppose we have this:

~~~~~~~~~~~~

class String
{
    // some read only methods
    abstract bool startsWith (String);
    abstract bool endsWith (String);
    abstract int indexOf (String, int start=0);

    // transcoding methods
    char*    cStr();
    char[]    utf8();
    wchar[] utf16();
    dchar[]  utf32();
    ...

    // some mutating methods
    abstract void prepend (String);
    abstract void append (String);

    abstract void setCharAt (int index, dchar chr);
    ...
}

~~~~~~~~~~~~~

There are immediately three things to note.

(a) many arguments are other instances of String, since otherwise you'd have to provide char/wchar/dchar instances of every method (what you're trying to avoid, right?).

(b) the setCharAt() method takes a dchar. How do you avoid that without wrapping dchar too? I don't think it would be practical, so dchar it stays?

(c) the operations noted explicitly avoid certain functionality that would add seriously to the complexity of a basic implementation (add your favourite collation-sequence example here). On the other hand, one can hide such nasties with careful choice of methods. For example, one could add a trimWhitespace() method, which can be implemented without the requirement of full Unicode character classification. The point is to be careful about the methods chosen.


3. Notice the distinction between read-only and mutating methods. To assist in writing deterministic (and performant) multi-threaded code, it would be advantageous to split the specification into mutable and non-mutable variations (I'll assume the benefits of doing so are acknowledged):

~~~~~~~~~~

class String    // a read-only String
{
    // some read only methods
    abstract bool startsWith (String);
    abstract bool endsWith (String);
    abstract int indexOf (String, int start=0);

    // transcoding methods
    char*    cStr();
    char[]    utf8();
    wchar[] utf16();
    dchar[]  utf32();
    ...
}


class MutableString : String   // a modifiable String
{
    // some mutating methods
    abstract void prepend (String);
    abstract void append (String);

    abstract void setCharAt (int index, dchar chr);
    ...
}

~~~~~~~~~~~

Now you can pass either type to a method that accepts the read-only String, yet can be somewhat assured of the intent when a called function expects a MutableString as an argument (it is expecting to change the darned thing <g>), and the compiler will catch such mismatches appropriately.
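
A minimal sketch of how that plays out at a call site, assuming a present-day D compiler. The specification is cut down to three methods, and `Utf32String`, `isGreeting` and `exclaim` are hypothetical names invented for the illustration, not part of any existing library:

```d
// Sketch only: a cut-down read-only base class and a mutable subclass.
// Build with: dmd -unittest -main sketch.d
abstract class String
{
    abstract bool startsWith(String other);
    abstract dchar[] utf32();          // hands out a copy in this sketch
}

abstract class MutableString : String
{
    abstract void append(String other);
}

// Hypothetical reference implementation that stores UTF-32 internally.
class Utf32String : MutableString
{
    private dchar[] content;

    this(const(dchar)[] text) { content = text.dup; }

    override bool startsWith(String other)
    {
        auto prefix = other.utf32();
        return prefix.length <= content.length
            && content[0 .. prefix.length] == prefix;
    }

    override dchar[] utf32() { return content.dup; }

    override void append(String other) { content ~= other.utf32(); }
}

// Accepts either kind; only read-only methods are reachable here.
bool isGreeting(String s)
{
    return s.startsWith(new Utf32String("hello"d));
}

// Declares its intent to modify; a plain String argument is rejected.
void exclaim(MutableString s)
{
    s.append(new Utf32String("!"d));
}

unittest
{
    auto s = new Utf32String("hello world"d);
    assert(isGreeting(s));             // MutableString converts to String
    exclaim(s);
    assert(s.utf32() == "hello world!"d);

    String readOnly = s;
    static assert(!__traits(compiles, exclaim(readOnly)));
}
```

The `static assert` is the compiler catching the mismatch: a reference typed as the read-only `String` cannot be handed to a function that asks for a `MutableString`.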


4. Using abstract classes is cool, but it limits the ability of someone trying to build compatible, alternate implementations: they'd be limited in what to use as a base-class. To open up the compatibility aspect, we adopt interfaces instead:

~~~~~~~~~~~~~

interface IString
{
    // some read only methods
    bool startsWith (IString);
    bool endsWith (IString);
    int indexOf (IString, int start=0);

    // transcoding methods
    char*    cStr();
    char[]    utf8();
    wchar[] utf16();
    dchar[]  utf32();
    ...
}

interface IMutableString : IString
{
    // some mutating methods
    void prepend (IString);
    void append (IString);

    void setCharAt (int index, dchar chr);
    ...
}

~~~~~~~~~~~

At this point, there's little stopping another developer making a compatible implementation, yet with completely different internals (and ctors) than the original reference implementation.

For example: the ICU wrappers could implement these interfaces, and Hey Presto! Full compatibility with the basic specification!
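
As a rough sketch of that idea (not the ICU wrappers themselves), here are two unrelated implementations with different internals and constructors that still interoperate through the interfaces. It assumes a present-day D compiler and Phobos `std.utf`; `Utf8String` and `Utf32String` are hypothetical classes, and the interfaces are trimmed to a few methods:

```d
import std.utf : toUTF8, toUTF32;

interface IString
{
    bool startsWith(IString other);
    char[]  utf8();
    dchar[] utf32();
}

interface IMutableString : IString
{
    void append(IString other);
}

// Hypothetical implementation that stores UTF-8.
class Utf8String : IMutableString
{
    private char[] content;

    this(const(char)[] text) { content = text.dup; }

    bool startsWith(IString other)
    {
        auto prefix = other.utf8();
        return prefix.length <= content.length
            && content[0 .. prefix.length] == prefix;
    }

    char[]  utf8()  { return content.dup; }
    dchar[] utf32() { return toUTF32(content).dup; }

    // Asks the argument for UTF-8, whatever it stores internally.
    void append(IString other) { content ~= other.utf8(); }
}

// Hypothetical, unrelated implementation that stores UTF-32.
class Utf32String : IString
{
    private dchar[] content;

    this(const(dchar)[] text) { content = text.dup; }

    bool startsWith(IString other)
    {
        auto prefix = other.utf32();
        return prefix.length <= content.length
            && content[0 .. prefix.length] == prefix;
    }

    char[]  utf8()  { return toUTF8(content).dup; }
    dchar[] utf32() { return content.dup; }
}

unittest
{
    auto a = new Utf8String("smör");
    IString b = new Utf32String("gås"d);

    a.append(b);                       // transcoded through b.utf8()
    assert(a.utf8() == "smörgås");
    assert(a.startsWith(new Utf32String("smö"d)));
}
```

Every transcoding method returns a fresh copy here, which keeps the sketch safe but is exactly the cost that caching would try to avoid.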


5. So here's where some little gotchas come into play:

(a) Notice that the transcoding routines should never be providing access to the internal content? After all, it's read-only. This is where a read-only attribute would come in handy; e.g. "readonly char[] utf8();". The upside here is that the class could 'cache' the utf8 transcoding, or not transcode at all if the implementation is native utf8. Unfortunately, D currently expects the recipient to "play by the rules" of CoW; something that is completely unenforceable by the class designer. String is just the kind of class that needs readonly support. Let's hope that support comes along soon (and, yes, it can be done with CoW ~ but that's not enforceable).

(b) Notice also that the prepend() and append() methods take a String as an argument. They do this such that alternate implementations are allowed to play too. However, this requires any implementation of append(String) to call one of the transcoder methods of its argument, to get the appending content. From a functional perspective, this is wonderful. From a performance perspective it's not. This is another reason why a String class might 'cache' its transcodings. Again, there's the readonly concern, since CoW is not enforceable by the class designer.
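
For what it's worth, present-day D has type qualifiers that cover this case. A hedged sketch, assuming a current compiler where `string` means `immutable(char)[]`; `CachedString` is a hypothetical class, not an existing one:

```d
import std.utf : toUTF8;

// Hypothetical class: stores UTF-32 and caches its UTF-8 transcoding.
class CachedString
{
    private dstring content;
    private string  cache;
    private bool    cached;

    this(dstring text) { content = text; }

    // The result is immutable(char)[]: callers may read and slice the
    // cached array, but the compiler rejects writes to its elements.
    string utf8()
    {
        if (!cached)
        {
            cache  = toUTF8(content);
            cached = true;
        }
        return cache;
    }
}

unittest
{
    auto s = new CachedString("some text"d);
    string first  = s.utf8();
    string second = s.utf8();

    assert(first == "some text");
    assert(first.ptr is second.ptr);   // the same cached array both times

    // Writing through the returned array does not compile.
    static assert(!__traits(compiles, first[0] = 'S'));
}
```

With an immutable return type the class can hand out its cache directly, so append(String) on another implementation pays for the transcoding at most once.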




>> At least I personally expect this whole "utf" issue to be over and done with, in a couple of weeks. (Heh, knowing SW projects, that probably means before year's end.) Once this is fixed, we have a _much_ smoother API, both factually, but especially in concept. And then -- we'll not hear a word about utf during the entire next year. 8-|
>
>
> Really? Such optimism! :)  Who says the smoother API will be agreed upon or even adopted?  I hope it does, whatever that API is or wherever that API currently exists.  If that API resides in ICU, then there's a stopgap solution until people make up their minds (specifically people like Walter).  If otherwise, then a solution will be long in coming, I think, and will be debated until the end of our days. Although, I'm still curious to know why people think they can change D without Walter's input in the matter.  I expect you people are planning on making a submission to phobos or something and hoping Walter will agree?


Some very real concerns. Trying to get the NG to agree on anything has historically been a notable waste of time. Example: I don't expect many people to agree with the considerations laid out above. Getting Walter to agree on something even where there's group consensus has, at times, proved futile in the past also. However, this is presumably a Phobos submission rather than a language change? Big difference there.

The reason such a String class does not exist today (in Phobos) is that nobody could agree if they even wanted a class implementation; let alone what it should do! :)

I hope something good happens here. But ain't holding my breath for long <g>



>> Well, IMHO, Kris and Regan have been talking about apples and oranges, without either noticing.
>>
>> Regan is talking about this utf thing in terms of what we here have been discussing, while Kris means the entire ICU issue. At least I believe this is so, and that they've not necessarily believed the other one understands the ("same") issue. (Kris, Regan, correct me if I'm wrong here.)


I do hope it's perfectly clear by now that I've been saying "take a look at the bigger picture first!" all along?


>> And John, I assume you think, taking on the whole ICU issue (I'm using a wrong term here, I know, but you know what I mean) is a little too big job for us, right? Which I wholeheartedly agree with.
>
>
> Well, it is.  But, I certainly respect the necessity of the discussion. I get a little listless, though, wondering whether it's going anywhere.

It's a misconception that my stance is about adopting ICU, so I hope this has been clarified for all who might have felt that way <g>


>> More specifically, the ICU thing is not something I believe D should even tackle. For the next couple of years, I think *those* application programmers who care about such, should use a library (like ICU, or whatever). What D provides will be adequate UTF handling -- as far as slicin' n' streamin' are concerned, nothing fancier. After a couple of years, we can always check the issue again. Maybe by that time a bunch of now broken issues have been settled, maybe there actually is some need for such functionality, maybe by that time they have stopped fighting with the available compilers (bugs), maybe... Then we can check it out.
>
>
> That's a fair assessment of the situation.  That makes much more sense.

Again; to do a good job we have to take such things into account <g>

Really thought I would be replying to John here, but it turned out otherwise. Hope that's OK with JJR?



November 25, 2005
Kris wrote:
> Long post, with code examples.

<snip long post>

> 
>>>More specifically, the ICU thing is not something I believe D should even tackle. For the next couple of years, I think *those* application programmers who care about such, should use a library (like ICU, or whatever). What D provides will be adequate UTF handling -- as far as slicin' n' streamin' are concerned, nothing fancier. After a couple of years, we can always check the issue again. Maybe by that time a bunch of now broken issues have been settled, maybe there actually is some need for such functionality, maybe by that time they have stopped fighting with the available compilers (bugs), maybe... Then we can check it out.
>>
>>
>>That's a fair assessment of the situation.  That makes much more sense.
> 
> 
> Again; to do a good job we have to take such things into account <g>
> 
> Really thought I would be replying to John here, but it turned out otherwise. Hope that's OK with JJR?
> 
> 
> 

This is perfectly fine, Kris, and I thank you for it.  I think you've clarified your perspective well here.  Really, I think we were referring to "ICU" rather loosely here as a _symbol_ representing a comprehensive unicode solution for D, which would be a major undertaking.

That said, I've always liked the idea of a solid String class, something that could be built upon or expanded over time.  Your sample specification is food for thought.  If that's the type of API that people could agree upon, then I think the D community can get somewhere.  Georg, when you mentioned an API, is that the general idea to which you were referring?  Or did you mean something else? Regan, your thoughts?

In the past, quite a few people in the community rejected the idea of a string class; they said it wasn't necessary, or they didn't want any string management turning Object Oriented.  Their resistance perplexed me because I figured there could be only benefits to adopting such a package. Those that didn't want to use it could stick to the basic D types.

Another benefit of adopting a string package is that it can be a ready addition to Phobos.  And that, like you said Kris, is much more likely to happen than any promotions for language changes.

-JJR
November 25, 2005
On Fri, 25 Nov 2005 10:32:47 -0800, kris wrote:


[snip]

> There's the one that says "add some array properties as a convenience for transcoding", such as adding .utf8 .utf16 and .utf32 properties as appropriate. That would be nice!

For what it's worth, here's a small convenience module...

==================================
module transcode;
private import std.utf;

void transcode(  char[] a,  inout  char[] b ) { b = a; }
void transcode(  char[] a,  inout wchar[] b ) { b = std.utf.toUTF16(a); }
void transcode(  char[] a,  inout dchar[] b ) { b = std.utf.toUTF32(a); }

void transcode( wchar[] a,  inout  char[] b ) { b = std.utf.toUTF8 (a); }
void transcode( wchar[] a,  inout wchar[] b ) { b = a; }
void transcode( wchar[] a,  inout dchar[] b ) { b = std.utf.toUTF32(a); }

void transcode( dchar[] a,  inout  char[] b ) { b = std.utf.toUTF8 (a); }
void transcode( dchar[] a,  inout wchar[] b ) { b = std.utf.toUTF16(a); }
void transcode( dchar[] a,  inout dchar[] b ) { b = a; }

unittest
{
   char[] s8;
  wchar[] s16;
  dchar[] s32;

   char[] t8;
  wchar[] t16;
  dchar[] t32;

  s8 = "some text";
  transcode(s8, s16);
  transcode(s16, s32);
  transcode(s32, t16);
  transcode(t16, t8);

  assert(s8 == t8);

  transcode(t8,t32);
  assert(t32 == s32);

  transcode(s32,t8);
  assert(t8 == s8);
  assert(s8 != cast(char[])s16);

}
=================================

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 9:41:52 AM
November 25, 2005
"Derek Parnell" <derek@psych.ward> wrote in message news:1ypwa2hwmja.q9pdllu3i85s.dlg@40tude.net...
> On Fri, 25 Nov 2005 10:32:47 -0800, kris wrote:
>
>
> [snip]
>
>> There's the one that says "add some array properties as a convenience for transcoding", such as adding .utf8 .utf16 and .utf32 properties as appropriate. That would be nice!
>
> For what it's worth, here's a small convenience module...
>
> [snip transcode module and unittest]

or this somewhat dubious variation :)

s8 = "some text";
s8.transcode(s16);

Now, let's assume for a moment that the user intends to somehow modify the return content.

This again brings up the issue about CoW ~ a user might consider these as always being /copies/ of the original content, since they've been transcoded. Right? After it's been transcoded into a freshly allocated chunk of the heap, as a user I wouldn't expect to .dup the result.

Yet, this is not a valid assumption ~ your example returns the original content directly in 3 cases, which a user might happily modify, thinking s/he's working with a private copy. To get around this, the user must explicitly follow CoW and always .dup the result. Even when it's presumably redundant to do so. Alternatively, as the utility designer, you must always .dup the non-transcoded return value. Just in case.

This seems utterly wrong. And it's such a fundamental thing too. Perhaps I don't get it?
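
To make the aliasing concrete, here is a small sketch assuming a present-day D compiler, where the `inout` parameters used in the module are spelled `ref`. Only two of the nine overloads are reproduced:

```d
import std.utf : toUTF16;

// Same shape as two of the posted overloads, with `ref` replacing `inout`.
void transcode(char[] a, ref char[]  b) { b = a; }
void transcode(char[] a, ref wchar[] b) { b = toUTF16(a).dup; }

unittest
{
    char[]  s8 = "some text".dup;
    char[]  t8;
    wchar[] t16;

    transcode(s8, t8);     // same encoding: t8 now aliases s8
    transcode(s8, t16);    // different encoding: freshly allocated copy

    t8[0]  = 'S';
    t16[0] = 'X';

    assert(t8.ptr is s8.ptr);
    assert(s8 == "Some text");   // the "original" changed through t8
}
```

Writing through `t16` leaves `s8` alone, while writing through `t8` changes it, and nothing at the call site tells the two cases apart.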




November 26, 2005
John Reimer wrote:
> Georg Wrede wrote:
> 
> Thanks for your remarks, Georg.  It's always a pleasure.

Good thing I always read the whole thing before commenting.

:-)
November 26, 2005
Kris wrote:
> Long post, with code examples.

Read it. Now it is 3:15 AM, so I won't waste my sleep and everybody else's reading bandwidth with replying to this (well thought out, and containing crucial issues) post.

I hope I'll have time enough to write a decent reply tomorrow.
This post certainly deserves it. :-)
g
November 26, 2005
Georg Wrede wrote:
> John Reimer wrote:
> 
>> Georg Wrede wrote:
>>
>> Thanks for your remarks, Georg.  It's always a pleasure.
> 
> 
> Good thing I always read the whole thing before commenting.
> 
> :-)

Uh, oh, I must have said something bad... :-P
November 26, 2005
On Fri, 25 Nov 2005 15:09:44 -0800, Kris wrote:

> "Derek Parnell" <derek@psych.ward> wrote in message news:1ypwa2hwmja.q9pdllu3i85s.dlg@40tude.net...
>> On Fri, 25 Nov 2005 10:32:47 -0800, kris wrote:
>>
>>
>> [snip]
>>
>>> There's the one that says "add some array properties as a convenience for transcoding", such as adding .utf8 .utf16 and .utf32 properties as appropriate. That would be nice!
>>
>> For what it's worth, here's a small convenience module...
>>
>> [snip transcode module and unittest]
> 
> or this somewhat dubious variation :)
> 
> s8 = "some text";
> s8.transcode(s16);
> 
> Now, let's assume for a moment that the user intends to somehow modify the return content.
> 
> This again brings up the issue about CoW ~ a user might consider these as always being /copies/ of the original content, since they've been transcoded. Right? After it's been transcoded into a freshly allocated chunk of the heap, as a user I wouldn't expect to .dup the result.
> 
> Yet, this is not a valid assumption ~ your example returns the original content directly in 3 cases,

Yeah, I realized this just before I went out to do the shopping. The 'fix' is easy though.

 void transcode(  char[] a,  inout  char[] b ) { b = a.dup; }
 void transcode( wchar[] a,  inout wchar[] b ) { b = a.dup; }
 void transcode( dchar[] a,  inout dchar[] b ) { b = a.dup; }

> which a user might happily modify, thinking s/he's working with a private copy. To get around this, the user must explicitly follow CoW and always .dup the result. Even when it's presumably redundant to do so. Alternatively, as the utility designer, you must always .dup the non-transcoded return value. Just in case.
> 
> This seems utterly wrong. And it's such a fundamental thing too. Perhaps I don't get it?

It's called a mistake, Kris. And yes, even I make them on rare occasions ;-)

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 6:39:42 PM
November 26, 2005
kris wrote:
> Georg Wrede wrote:
>> kris wrote:
>>> Georg Wrede wrote:
>>>> kris wrote:
>>>>> 
>>>>> Designing it with respect to performance and immutability are
>>>>> also not so tough (though D badly needs read-only arrays).
>>>> 
>>>> (OT) never thought about that! Please elaborate.
>>> 
>>> On Read-Only arrays? Sure.
>>> 
>>> One can easily design a class such that it cannot be mutated when
>>> passed from one function to another. However, when it comes to arrays, access to content by the callee is wide open to abuse.
>>> That is, if funcA wants to give funcB read-only access to a large
>>> quantity of data, one should clone the thing /just in case/ funcB
>>> mutates it. This then pervades throughout structs and classes
>>> without respect to attribute visibility.
>>> 
>>> The D notion is that CoW will somehow be adhered to by the
>>> callee ~ it will be a "good" function, and clone the array before
>>> touching it. Yet this is not enforced by the compiler, to any
>>> degree. Thus the caller ends up doing the work, just to be sure.
>>> 
>>> This, I'm sure you'll agree is a bit daft. It's also a
>>> significant performance problem for server-code, or anywhere
>>> where immutability is a high priority. Anyone who regularly uses
>>> multiple threads will attest that enforced immutability is a
>>> welcoming lifeboat within a cold sea of unrest and uncertainty.
>> 
>> Ah, right. Interesting that there's no convenient hardware support
>> for such. Well, can't have everything, can we? :-)
> 
> Hardware support is not needed for such things. 

True.

> Instead the language needs a means to decorate a return-type as being
> read only (or something akin), and enforce subsequent usage as an
> rValue only.

Since we use references in D instead of pointers, it might not be too hard to do. The reference might have an attribute for read-only.

Of course there could (and would) be more than one reference to an array in the application, some of which are "read-only", but that is no problem.

I'd imagine this would be quite easy to implement into D.

And add the same to references to structs, while at it.
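
As a point of comparison, present-day D expresses a read-only array reference with a type qualifier on the element type. A minimal sketch, assuming a current compiler; `sum` is just an illustrative function:

```d
// `const(int)[]` is a read-only view: the callee can read and slice it,
// but any attempt to assign to an element is rejected at compile time.
int sum(const(int)[] data)
{
    int total = 0;
    foreach (x; data)
        total += x;
    return total;
}

unittest
{
    int[] numbers = [1, 2, 3];
    const(int)[] view = numbers;       // no copy and no cast needed

    assert(sum(numbers) == 6);

    numbers[0] = 10;                   // the owner can still write
    assert(view[0] == 10);             // and the read-only view sees it

    static assert(!__traits(compiles, view[0] = 5));
}
```

This matches the point above about several references to one array: the mutable and the read-only reference coexist, and only the read-only one is restricted.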
November 26, 2005
On Fri, 25 Nov 2005 14:31:05 -0800, John Reimer <terminal.node@gmail.com> wrote:
> Kris wrote:
>> Long post, with code examples.
>
> <snip long post>
>
>>
>>>> More specifically, the ICU thing is not something I believe D should even tackle. For the next couple of years, I think *those* application programmers who care about such, should use a library (like ICU, or whatever). What D provides will be adequate UTF handling -- as far as slicin' n' streamin' are concerned, nothing fancier. After a couple of years, we can always check the issue again. Maybe by that time a bunch of now broken issues have been settled, maybe there actually is some need for such functionality, maybe by that time they have stopped fighting with the available compilers (bugs), maybe... Then we can check it out.
>>>
>>>
>>> That's a fair assessment of the situation.  That makes much more sense.
>> Again; to do a good job we have to take such things into account <g>
>>
>> Really thought I would be replying to John here, but it turned out
>> otherwise. Hope that's OK with JJR?
>>
>
> This is perfectly fine, Kris, and I thank you for it.  I think you've clarified your perspective well here.  Really, I think we were referring to "ICU" rather loosely here as a _symbol_ representing a comprehensive unicode solution for D, which would be a major undertaking.

Not just a "unicode" solution but a solution for all major character encodings. It's waaay more than I am looking for at this point in time. D can already convert between the 3 UTF encodings it uses and I'm not looking for anything more at this stage.

> That said, I've always liked the idea of a solid String class, something
> that could be built upon or expanded over time.  Your sample
> specification is food for thought.  If that's the type of API that
> people could agree upon, then I think the D community can get somewhere.
> Georg, when you mentioned an API, is that the general idea to which
> you were referring?  Or did you mean something else? Regan, your
> thoughts?

I'm looking at this from a slightly different angle. I don't want a class with an API which defines methods like "startsWith" and "endsWith" etc. I think they're unnecessary at this stage and here's why...

The first stage, to my mind, is being able to index and slice complete characters, as opposed to fragments of characters, and to be able to do this regardless of the actual encoding used to store the data.

For example:

   string test = "smörgåsbord";
   assert(test[2] == 'ö');

Regardless of whether this is stored in UTF-8, UTF-16 or UTF-32 this
should just work.
(the string class I posted to start this thread can do this)
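
A rough sketch of the indexing half of this with plain arrays, assuming a present-day D compiler. `charAt` is a hypothetical helper, not the attached class, and it walks the text from the start rather than indexing in constant time:

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: the n-th code point of UTF-8, UTF-16 or UTF-32 text.
dchar charAt(C)(const(C)[] text, size_t n)
{
    size_t i = 0;
    foreach (dchar c; text)    // foreach decodes code units to code points
    {
        if (i == n)
            return c;
        ++i;
    }
    assert(0, "index out of range");
}

unittest
{
    string  s8  = "sm\u00F6rg\u00E5sbord";   // "smörgåsbord"
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    // The same index gives the same character in every encoding.
    assert(charAt(s8,  2) == '\u00F6');
    assert(charAt(s16, 2) == '\u00F6');
    assert(charAt(s32, 2) == '\u00F6');

    // Plain indexing of the UTF-8 array yields a code unit instead.
    assert(s8.length == 13);          // 11 characters, two of them 2 bytes
    assert(s8[2] != '\u00F6');
}
```

The linear walk is one of the speed-for-space trade-offs: UTF-32 storage would make the lookup a direct index at the cost of four bytes per character.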

Once we can do this, we can write "startsWith" and "endsWith" trivially.

   bool startsWith(string s, string text) {}
   bool endsWith(string s, string text) {}

Provided D supports its array method calling feature here too, we could call these like so:

   string s;
   s.startsWith(new string("test"));

Which should hopefully keep the people who prefer the object call style happy.

Further, if "string" becomes a built in type this becomes:

   string s;
   s.startsWith("test");
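
A small sketch of that calling style, assuming a present-day D compiler (where free functions can be called with method syntax) and using UTF-8 arrays in place of the proposed "string" type. These two functions are written out for illustration only:

```d
// Free functions over UTF-8 text. Comparing code units is sufficient
// when both arguments are valid UTF-8.
bool startsWith(const(char)[] s, const(char)[] text)
{
    return text.length <= s.length && s[0 .. text.length] == text;
}

bool endsWith(const(char)[] s, const(char)[] text)
{
    return text.length <= s.length && s[$ - text.length .. $] == text;
}

unittest
{
    string s = "test case";

    assert(s.startsWith("test"));      // method-call syntax
    assert(startsWith(s, "test"));     // the same call, ordinary syntax
    assert(s.endsWith("case"));
    assert(!s.startsWith("case"));
}
```

Neither function needs to be a class member, which is the point being made: the object-call style comes from the language rather than from a class API.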

I now see this "string" type as an addition to the language, not as a replacement for char[], wchar[] and dchar[].

"string" would ideally become the type everyone used for general-purpose string handling. Only in areas where the encoding itself, or code fragments, were important would people use the char[], wchar[] or dchar[] types; and even some of that could be handled by a "string" whose encoding is specified at run time (as opposed to only at compile time, like the class I posted).

My latest "string" class effort is attached. It is by no means exactly what I envision; it's more of a "test the theory", "explore the consequences" sort of thing. As Kris said, there are consequences and trade-offs, speed for space etc. I hope for a "string" which can provide the trade-offs each person and situation desires, and if not, we still have char[], wchar[] and dchar[] to fall back to.

> In the past, quite a few people in the community rejected the idea of a string class; they said it wasn't necessary, or they didn't want any string management turning Object Oriented.  Their resistance perplexed me because I figured there could be only benefits to adopting such a package. Those that didn't want to use it could stick to the basic D types.

These are all reasons why I think it should be a built in type, and why we should not define a class-style API at this stage.

Regan