String theory and confusion (page 5) - D Programming Language Discussion Forum

November 26, 2005

Re: String theory and confusion

Posted by Georg Wrede
in reply to kris

kris wrote:
> Georg Wrede wrote:
>> 
>> Slothfullness... I'll tell Bill you're picking on me!
> 
> Sorry. That wasn't intended to be a personal attribution <g>

Actually, I was picking on Bill and Windows. He'd be the last person there's any use discussing slothfulness with. :-)

>> But I'm trying to maintain a balance here. Right now we are
>> pressured for time (or actually Walter is -- I can't imagine that
>> his wife hasn't left him already, especially considering how
>> quickly .140 came out with all those things), and we should get
>> this utf thing out of the way, so other things can be tackled.
> 
> Which UTF thing? There are so many threads going on it's hard to keep
> track.

True. :-)  And everybody has their own idea, so I'm referring mostly to what I've written.

And judging from the verbosity and diversity of this "utf discussion", there are still a lot of misconceptions, ideas based on them, and plain beliefs around.

>> I'd say we can right now implement stuff less-than well --
>> _as_long_ as the setup is drawn right. In other words, so that
>> later we (without changing the API) can rewrite and optimize the
>> individual routines. 
>
> I'm really missing something here. You're talking about an API for
> what? It must be a String class, yes? Getting it "right" so that it
> doesn't change, is not something that can be done on a whim. I know
> you know that, so what's the huge rush all of a sudden? Don't you
> think it would be better to build something and let it mature with
> use for a period of time? 

The "rush":

We almost had a feature freeze already. The metaprogramming thing kind of broke it. Then the "utf"-or-whatchamacallit popped up. If you read my posts from the last month or so, you'll see what effort it took to get us even to this level. Now it's time to put that to use, and make the few changes to D I've suggested. And then let this "utf" thing rest for a good while. (Let things "sink in", so to speak.) At the current rate, it'll take some 6 months before enough folks are clear enough on these issues that a meaningful attempt to go further is feasible, without 95% of the ng. writing time going to re-re-re-explaining the obvious to each participant separately. Or to comparing apples and oranges. (This was not personal! I meant in general between posters.)

> Why not pick up one of the String classes that's been around for a
> year or more? There's at least three of them that old. As you might
> guess, this is not a new topic at all :-)

:-) And the likelihood of String classes getting into D may not have changed. (Which is why I'm not pushing it.)

UTF32 would not need any String class, it can simply be used as an array. The other two (16 and 8) are problematic, so it would be natural to have them as classes. (And for symmetry, obviously a 32 then too.)
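
To illustrate why UTF-32 can be treated as a plain array while UTF-8 cannot, here is a minimal sketch. It assumes present-day D (`string`, `dstring`, `std.utf.validate`), which postdates this thread, and the helper name `codePointCount` is made up for the example.

```d
import std.exception : assertThrown;
import std.utf : toUTF32, validate, UTFException;

/// Hypothetical helper: number of code points in a UTF-8 string.
size_t codePointCount(string s)
{
    return toUTF32(s).length;
}

void main()
{
    // U+00EF (i with diaeresis) needs two code units in UTF-8.
    string  s8  = "na\u00EFve";
    dstring s32 = toUTF32(s8);

    assert(s8.length == 6);           // .length counts UTF-8 code units
    assert(s32.length == 5);          // one array element per code point
    assert(codePointCount(s8) == 5);

    // Indexing the UTF-32 array yields a whole character.
    assert(s32[2] == '\u00EF');

    // Slicing the UTF-8 array at an arbitrary offset can cut a
    // character in half; the slice is then not valid UTF-8.
    assertThrown!UTFException(validate(s8[0 .. 3]));
}
```

With `dchar[]`, slicing and indexing line up with code points; with `char[]` and `wchar[]` the offsets are code units, which is the gap a class (or a small non-OO API) would have to cover. Note that even UTF-32 indexes code points, not user-perceived characters.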

But then again, I can understand Walter's reluctance. D is a C-family language, and it would be kind of neat to have the language itself nice and tight. With USASCII there was no problem with that. And that made it possible to have strings just be arrays, which is kinda cool.

Having string classes in libraries, or even in Phobos, would be natural now. Somehow I understand Walter's reluctance, though. The compilers (DMC, DMD) are (almost?) totally non-OO, and Walter writes the DMD front-end in D, i.e. he uses D in another way than the future average D programmers will (and should).

This doesn't necessarily mean I agree, of course.

>> So, fixing a print function that does superfluous intermediate conversions, should, IMHO, not be on our agenda at all before
>> spring. DMD 1.0 or not.
> 
> I believe you're badly misconstruing something here, Georg. Who ever
> said anything about fixing writef? I certainly countered an argument
> that was effectively stating "if writef can do it like that, then
> that's probably good enough for everything else". Merely pointing out
> that printf exists for convenience, not performance, should hardly be
> interpreted in this manner :-D

I totally agree with you! My mistake, I read you as promoting an aggressively time-optimizing rewrite of all such things. ;-(

> Can you please tell me about this Spring date? Is something important
>  happening then?

There's no date. Just the everlasting assumption that 1.0 is 3 to 6 months away from now. :-D :-D

November 26, 2005

Re: String theory by request?

Posted by Georg Wrede
in reply to John Reimer

>>>> More specifically, the ICU thing is not something I believe D
>>>> should even tackle. For the next couple of years, I think
>>>> *those* application programmers who care about such, should use
>>>> a library (like ICU, or whatever). What D provides will be
>>>> adequate UTF handling -- as far as slicin' n' streamin' are
>>>> concerned, nothing fancier. After a couple of years, we can
>>>> always check the issue again. Maybe by that time a bunch of now
>>>> broken issues have been settled, maybe there actually is some
>>>> need for such functionality, maybe by that time they have
>>>> stopped fighting with the available compilers (bugs), maybe...
>>>> Then we can check it out.
>>> 
>>> That's a fair assessment of the situation.  That makes much more
>>> sense.
>> 
>> Again; to do a good job we have to take such things into account
>> <g>
>> 
>> Really thought I would be replying to John here, but it turned out
>> otherwise. Hope that's OK with JJR?
> 
> This is perfectly fine, Kris, and I thank you for it.  I think you've
> clarified your perspective well here.  Really, I think we were
> referring to "ICU" rather loosely here as a _symbol_ representing a
> comprehensive unicode solution for D, which would be a major
> undertaking.
> 
> That said, I've always liked the idea of a solid String class,
> something that could be built upon or expanded over time.  Your
> sample specification is food for thought.  If that's the type of API
> that people could agree upon, then I think the D community can get
> somewhere. Georg, when you mentioned an API, is that the general idea
> to which you were referring?  Or did you mean something else? Regan,
> your thoughts?

Yes. Except I wasn't including the String class at the time, but I'm now picking up the can of Weider Body Building Proteins, which will be needed in the upcoming debate with Walter. ;-)

Ideally, we'd have both OO and non-OO for strings. The Docs would have the Strings prominently placed, while the non-OO API would be somewhere that the casual reader doesn't stumble upon too soon. :-)

The String class would be needed for normal usage of D, and the non-OO API "because D is a systems language", or whatever. And, the non-OO API would be much smaller, the excuse being that mostly Systems Programmers will use it, for other purposes than general application programming. (Slicin' n' streamin', not very much else.)

(And I believe the changes needed for the non-OO API are truly minimal. The rewards, however, would be big.)

> In the past, quite a few people in the community rejected the idea of
> a string class; they said it wasn't necessary, or they didn't want
> any string management turning Object Oriented.  Their resistance
> perplexed me because I figured there could be only benefits to
> adopting such a package. Those that didn't want to use it could stick
> to the basic D types.

Understandable. The past was USASCII, while folks (not even noticing) already used utf. So strings-as-arrays seemed too tempting.
"Want a String class in your app? Write one into the app."

> Another benefit of adopting a string package is that it can be a
> ready addition to Phobos.

Yup. Sending Walter the diffs (and docs) would probably stand a better chance than just asking.

November 26, 2005

Re: String theory and confusion

Posted by Kris
in reply to Derek Parnell

"Derek Parnell" <derek@psych.ward> wrote in message news:k04hb75t8pzh$.q9xcre0h0fgw.dlg@40tude.net...
> On Fri, 25 Nov 2005 15:09:44 -0800, Kris wrote:
>
>> "Derek Parnell" <derek@psych.ward> wrote in message news:1ypwa2hwmja.q9pdllu3i85s.dlg@40tude.net...
>>> On Fri, 25 Nov 2005 10:32:47 -0800, kris wrote:
>>>
>>>
>>> [snip]
>>>
>>>> There's the one that says "add some array properties as a convenience for transcoding", such as adding .utf8 .utf16 and .utf32 properties as appropriate. That would be nice!
>>>
>>> For what it's worth, here's a small convenience module...
>>>
>>> ==================================
>>> module transcode;
>>> private import std.utf;
>>>
>>> void transcode(  char[] a,  inout  char[] b ) { b = a; }
>>> void transcode(  char[] a,  inout wchar[] b ) { b = std.utf.toUTF16(a); }
>>> void transcode(  char[] a,  inout dchar[] b ) { b = std.utf.toUTF32(a); }
>>>
>>> void transcode( wchar[] a,  inout  char[] b ) { b = std.utf.toUTF8(a); }
>>> void transcode( wchar[] a,  inout wchar[] b ) { b = a; }
>>> void transcode( wchar[] a,  inout dchar[] b ) { b = std.utf.toUTF32(a); }
>>>
>>> void transcode( dchar[] a,  inout  char[] b ) { b = std.utf.toUTF8(a); }
>>> void transcode( dchar[] a,  inout wchar[] b ) { b = std.utf.toUTF16(a); }
>>> void transcode( dchar[] a,  inout dchar[] b ) { b = a; }
>>>
>>> unittest
>>> {
>>>   char[] s8;
>>>  wchar[] s16;
>>>  dchar[] s32;
>>>
>>>   char[] t8;
>>>  wchar[] t16;
>>>  dchar[] t32;
>>>
>>>  s8 = "some text";
>>>  transcode(s8, s16);
>>>  transcode(s16, s32);
>>>  transcode(s32, t16);
>>>  transcode(t16, t8);
>>>
>>>  assert(s8 == t8);
>>>
>>>  transcode(t8,t32);
>>>  assert(t32 == s32);
>>>
>>>  transcode(s32,t8);
>>>  assert(t8 == s8);
>>>  assert(s8 != cast(char[])s16);
>>>
>>> }
>>> =================================
>>
>> or this somewhat dubious variation :)
>>
>> s8 = "some text";
>> s8.transcode(s16);
>>
>> Now, let's assume for a moment that the user intends to somehow modify
>> the return content.
>>
>> This again brings up the issue about CoW ~ a user might consider these as
>> always being /copies/ of the original content, since they've been
>> transcoded. Right? After being transcoded into a freshly allocated chunk
>> of the heap, as a user I wouldn't expect to .dup the result.
>>
>> Yet, this is not a valid assumption ~ your example returns the original
>> content directly in 3 cases,
>
> Yeah, I realized this just before I went out to do the shopping. The 'fix' is easy though.
>
> void transcode(  char[] a,  inout  char[] b ) { b = a.dup; }
> void transcode( wchar[] a,  inout wchar[] b ) { b = a.dup; }
> void transcode( dchar[] a,  inout dchar[] b ) { b = a.dup; }
>
>> which a user might happily modify, thinking
>> s/he's working with a private copy. To get around this, the user must
>> explicitly follow CoW and always .dup the result. Even when it's
>> presumably redundant to do so. Alternatively, as the utility designer,
>> you must always .dup the non-transcoded return value. Just in case.
>>
>> This seems utterly wrong. And it's such a fundamental thing too. Perhaps
>> I don't get it?
>
> It's called a mistake, Kris. And yes, even I make them on rare occasions ;-)

:-)

I didn't mean your code was utterly wrong, Derek. I meant the philosophy ain't straight. One should not have to make copies "just in case". It's terribly wasteful ...
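
The aliasing problem under discussion can be shown in a few lines. This is only a sketch in present-day D: the function names are hypothetical stand-ins for the quoted `transcode` overloads, and they return their result rather than writing through an `inout` parameter.

```d
import std.utf : toUTF16;

// Hypothetical stand-ins for the quoted overloads.
char[]  sameEncodingAliased(char[] a) { return a; }               // like `b = a`
char[]  sameEncodingCopied(char[] a)  { return a.dup; }           // like `b = a.dup`
wchar[] toWide(char[] a)              { return toUTF16(a).dup; }  // real transcoding

void main()
{
    char[] original = "some text".dup;

    // Same-encoding "transcode" without .dup: the result aliases the input.
    char[] aliased = sameEncodingAliased(original);
    aliased[0] = 'S';
    assert(original == "Some text");   // the caller's data changed as well

    // With .dup the result is private, at the price of an allocation
    // that is wasted whenever the caller never writes to it.
    char[] copied = sameEncodingCopied(original);
    copied[0] = 'X';
    assert(original == "Some text");

    // A real transcoding lands in fresh storage anyway.
    wchar[] wide = toWide(original);
    wide[0] = 'Z';
    assert(original == "Some text");
}
```

Without a way to mark the returned array read-only, the library author has to choose between the surprise in the first case and the "just in case" copy in the second.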

November 26, 2005

Re: String theory and confusion

Posted by Derek Parnell
in reply to Kris

On Sat, 26 Nov 2005 12:18:08 -0800, Kris wrote:


> I didn't mean your code was utterly wrong, Derek. I meant the philosophy ain't straight. One should not have to make copies "just in case". It's terribly wasteful ...

My code was mistaken because its behaviour was not consistent, but you're right too.

-- 
Derek Parnell
Melbourne, Australia
27/11/2005 8:03:39 AM

November 26, 2005

Re: String theory by request?

Posted by Georg Wrede
in reply to Kris

Kris wrote:

...

>>> 1. Not get drowned in the current utf maze of glass walls and
>>> mirrors.
>>> 
>>> 2. Real Soon Now, be able to do their coding without being forced
>>> to know a single thing about utf.
>>> 
>>> 3. Not have downright disinformation stuffed down their throats
>>> by D documentation, specs, or the existing choice of data types
>>> in D.
>>> 
>>> 4. Get rid of all the gotchas (especially the unobvious) hidden
>>> in the current framework of what appears to be character types
>>> and handling.

...

> 1. It seems reasonable that one should come up with an abstraction
> of what the String should do, using either an abstract class or an
> interface.
> 
> This eliminates ctor() considerations completely, and permits
> pretty much anyone to write a compatible String. Compatibility
> is important if this is going to become fundamental to D. So,
> write an abstract specification; and then provide a rudimentary
> concrete class that implements the spec. Perhaps a simple dchar[]
> implementation?

Sounds good.

> Anyway, onto an example specification:
>
> 2. Suppose we have this:
> 
> ~~~~~~~~~~~~
> 
> class String
> {
>     // some read only methods
>     abstract bool startsWith (String);
>     abstract bool endsWith (String);
>     abstract int indexOf (String, int start=0);
> 
>     // transcoding methods
>     char*    cStr();
>     char[]    utf8();
>     wchar[] utf16();
>     dchar[]  utf32();
>     ...
> 
>     // some mutating methods
>     abstract void prepend (String);
>     abstract void append (String);
> 
>     abstract void setCharAt (int index, dchar chr);
>     ...
> }
> 
> ~~~~~~~~~~~~~
> 
> There are immediately three things to note.
> 
> (a) many arguments are other instances of String, since otherwise
> you'd have to provide char/wchar/dchar instances of every method
> (what you're trying to avoid, right?).
> 
> (b) the setCharAt() method takes a dchar. How do you avoid that
> without wrapping dchar too? I don't think it would be practical, so
> dchar it stays?
> 
> (c) the operations noted explicitly avoid certain functionality that
> would add seriously to the complexity of a basic implementation (add
> your favourite collation-sequence example here). On the other hand,
> one can hide such nasties with careful choice of methods. For
> example, one could add a trimWhitespace() method, which can be
> implemented without the requirement of full Unicode character
> classification. The point is to be careful about the methods chosen.
> 
> 3. Notice the distinction between read-only and mutating methods.
> To assist in writing deterministic (and performant) multi-threaded
> code, it would be advantageous to split the specification into
> mutable and non-mutable variations (I'll assume the benefits of
> doing so are acknowledged)
> 
> ~~~~~~~~~~
> 
> class String    // a read-only String
> {
>     // some read only methods
>     abstract bool startsWith (String);
>     abstract bool endsWith (String);
>     abstract int indexOf (String, int start=0);
> 
>     // transcoding methods
>     char*    cStr();
>     char[]    utf8();
>     wchar[] utf16();
>     dchar[]  utf32();
>     ..
> }
> 
> 
> class MutableString : String   // a modifiable String
> {
>     // some mutating methods
>     abstract void prepend (String);
>     abstract void append (String);
> 
>     abstract void setCharAt (int index, dchar chr);
>     ...
> }
> 
> ~~~~~~~~~~~
> 
> Now you can pass either type to a method that accepts the read-only
> String, yet can be somewhat assured of the intent when a called
> function expects a MutableString as an argument (it is expecting
> to change the darned thing <g>), and the compiler will catch such mismatches appropriately.
> 
> 4. Using abstract classes is cool, but it limits the ability of
> someone trying to build compatible, alternate, implementations:
> they'd be  limited in what to use as a base-class.
> To open up the compatibility aspect, we adopt interfaces instead:
> 
> ~~~~~~~~~~~~~
> 
> interface IString
> {
>     // some read only methods
>     bool startsWith (IString);
>     bool endsWith (IString);
>     int indexOf (IString, int start=0);
> 
>     // transcoding methods
>     char*    cStr();
>     char[]    utf8();
>     wchar[] utf16();
>     dchar[]  utf32();
>     ..
> }
> 
> interface IMutableString : IString
> {
>     // some mutating methods
>     void prepend (IString);
>     void append (IString);
> 
>     void setCharAt (int index, dchar chr);
>     ...
> }
> 
> ~~~~~~~~~~~
> 
> At this point, there's little stopping another developer making a
> compatible implementation, yet with completely different internals
> (and ctors) than the original reference implementation.
> For example: the ICU wrappers could implement these interfaces,
> and Hey Presto! Full compatibility with the basic specification!

Now this looks good!

Not having looked closer at the ICU wrappers etc. yet, I can't say anything for or against the ICU part, but you certainly made a compelling case for using interfaces.
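
As a rough illustration of the interface approach, here is a sketch of a reference implementation backed by a `dchar[]`. It assumes present-day D, trims the interfaces to a handful of methods, returns `string`/`wstring`/`dstring` where the quoted spec has `char[]`/`wchar[]`/`dchar[]`, and the names `DString` and `looksLikeGreeting` are hypothetical.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Read-only view, following the quoted IString sketch.
interface IString
{
    bool startsWith(IString prefix);
    bool endsWith(IString suffix);
    string  utf8();
    wstring utf16();
    dstring utf32();
}

// Mutating operations live in a separate interface.
interface IMutableString : IString
{
    void prepend(IString s);
    void append(IString s);
    void setCharAt(size_t index, dchar chr);
}

// Hypothetical reference implementation storing UTF-32.
final class DString : IMutableString
{
    private dchar[] data;

    this(string s) { data = toUTF32(s).dup; }

    bool startsWith(IString prefix)
    {
        auto p = prefix.utf32();
        return p.length <= data.length && data[0 .. p.length] == p;
    }

    bool endsWith(IString suffix)
    {
        auto p = suffix.utf32();
        return p.length <= data.length && data[$ - p.length .. $] == p;
    }

    // Each transcoder hands out immutable data, never the internal buffer.
    string  utf8()  { return toUTF8(data); }
    wstring utf16() { return toUTF16(data); }
    dstring utf32() { return data.idup; }

    // Arguments are read through the interface, so any implementation works.
    void prepend(IString s) { data = s.utf32().dup ~ data; }
    void append(IString s)  { data ~= s.utf32(); }
    void setCharAt(size_t index, dchar chr) { data[index] = chr; }
}

// Accepts any implementation, mutable or not, but can only read it.
bool looksLikeGreeting(IString s)
{
    return s.startsWith(new DString("Hello"));
}

void main()
{
    auto s = new DString("world");
    s.prepend(new DString("hello "));
    s.append(new DString("!"));
    s.setCharAt(0, 'H');

    assert(s.utf8() == "Hello world!");
    assert(looksLikeGreeting(s));
    assert(s.endsWith(new DString("world!")));
}
```

Another implementation (an ICU wrapper, say) would only need to implement the same two interfaces to be passed to `looksLikeGreeting` or to `append`.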

> 5. So here's where some little gotchas come into play:
> 
> (a) Notice that the transcoding routines should never be providing
> access to the internal content? After all, it's read-only. This
> is where a read-only attribute would come in handy; e.g. "readonly char[] utf8();". The upside here is that the class could
> 'cache' the utf8 transcoding, or not transcode
> at all if the implementation is native utf8. Unfortunately, D
> currently expects the recipient to "play by the rules" of CoW;
> something that is completely unenforceable by the class designer.
> String is just the kind of class that needs readonly support. Let's
> hope that support comes along soon
> (and, yes, it can be done with CoW ~ but that's not enforceable).
> 
> (b) Notice also that the prepend() and append() methods take a String
> as an argument. They do this such that alternate implementations are
> allowed to play too. However, this requires any implementation of
> append(String) to call one of the transcoder methods of its
> argument, to get the appending content. From a functional
> perspective, this is wonderful. From a performance perspective it's
> not. This is another reason why a String class
> might 'cache' its transcodings. Again, there's the readonly concern,
> since CoW is not enforceable by the class designer.
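
The caching idea in 5(a) and 5(b) might look something like the sketch below. It assumes present-day D, where `immutable` data (the `string` and `wstring` types) gives the read-only guarantee that the language lacked when this was written; the class name `CachedString` and the `transcodeCount` counter are hypothetical and exist only for the demonstration.

```d
import std.utf : toUTF16;

// Hypothetical read-only string that is natively UTF-8 and
// transcodes to UTF-16 at most once.
final class CachedString
{
    private string  native;    // stored as UTF-8
    private wstring cache16;   // filled on first request
    private bool    has16;
    size_t transcodeCount;     // demonstration only

    this(string s) { native = s; }

    // Native encoding: no transcoding and no copy. The data is
    // immutable, so sharing it is safe.
    string utf8() { return native; }

    wstring utf16()
    {
        if (!has16)
        {
            cache16 = toUTF16(native);
            has16 = true;
            ++transcodeCount;
        }
        return cache16;
    }
}

void main()
{
    auto s = new CachedString("some text");

    auto a = s.utf16();
    auto b = s.utf16();
    assert(a == "some text"w);
    assert(a is b);                  // same cached array both times
    assert(s.transcodeCount == 1);   // transcoded only once

    // Writing through the returned slice is rejected at compile time,
    // so the cache cannot be corrupted by a caller.
    static assert(!__traits(compiles, { s.utf8()[0] = 'X'; }));
}
```

Because the returned arrays cannot be written to, the class can hand out its internal or cached storage without relying on every caller to follow CoW.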
