April 28, 2008
Janice Caron wrote:

> 2008/4/28  <p9e883002@sneakemail.com>:
> >  import std.string;
> > 
> >  int main( string[] args) {
> >     char[] a = cast(char[])args[0].toupper();
> 
> **** UNDEFINED BEHAVIOR ****
> (1) args might be placed in a hardware-locked read-only segment. Then
> the following line would fail
> (2) there might be other pointers to the string, which expect it never
> to change.
> 
> >     a[2..5] = "XXX";
> >     a = cast(char[])tolower( cast(string)a );
> >     writefln(a);
> >     return 0;
> >  }
> > 
> >  Finally. It works.
> 
> But not necessarily on all architectures, because of the undefined behavior. This is how you do it without undefined behavior.
> 
>     import std.stdio;   // needed for writefln
>     import std.string;
> 
>     int main( string[] args) {
>         string a = args[0].toupper();
>         a = a[0..2] ~ "XXX" ~ a[5..$];
>         a = a.tolower();
>         writefln(a);
>         return 0;
>     }

Ack! That's horrible. Instead of using the information I have, the offset and length of the slice I want to manipulate, I have to derive two offset/length pairs to the bits I do not want to do anything to.

1) Whatever happened to polymorphism?

E.g. why can't the standard string library recognise that I, as the programmer, know what I need to do to my data? It's my job.

So, if I assign the results of a string library function/method to a mutable variable (Just a variable really. An invariant variable is a constant!), then it should be possible (*IS* possible) to recognise that and return an appropriate result. Duplicating the input if required.

The idea that runtime obtained or derived strings can be made truly invariant is purely theoretical. Whilst the compiler can place compile time constants into hardware protected, read-only memory segments, doing this at runtime would be horribly costly and hardly beneficial.

IA-32 allows memory to be set read-only at runtime, but only in page-sized chunks. Which means that either:

- every derived string would need to be placed in its own chunk of RAM, sized to a multiple of 4k,

- or each page would have to be constantly switched from read-only to read-write and back again as new entities are added and old ones go out of scope.

And if you are not using hardware protection, then the invariance is only notional, as D can call C, and C allows me access to pointers. And once I have one of those, I can scribble anywhere that isn't hardware protected.

All this smacks of D reinventing, with all the same mistakes, the whole Java String vs. StringBuffer dichotomy:

    http://www.javaworld.com/javaworld/jw-03-2000/jw-0324-javaperf.html

And Java had the VM to isolate it from non-compliant code.

One of several "mission statements" that drew me to D when I first encountered it nearly 3 years ago was the pragmatism embodied in articles like this:

    http://www.digitalmars.com/d/2.0/builtin.html

and this:

    http://www.digitalmars.com/d/2.0/cppstrings.html

and statements like this:

    "No pointless wrappers around C runtime library functions or OS API
    functions. D provides direct access to C runtime library functions and
    operating system API functions. Pointless D wrappers around those
    functions just adds blather, bloat, baggage and bugs."

Coming back to try and use D after a prolonged absence, the changes in the interim period seem to be eschewing that pragmatism in favour of some kind of mixed OO/functional purity ethic. Is there an ex-Haskeller in the house?

I admit openly to still being in the throes of finding my way around the language and the library, and have been making seemingly elementary mistakes in interpreting the documentation. But one of the major attractions of D over C/C++ is its built-in string types and manipulations. As good as these are, there is still the need for a library of common operations upon them. If every time I want to use one of these library calls, I have to cast my mutable string into an invariant and then cast the result back to mutable in order to be able to use the built-in manipulations, life's going to get very boring, very fast.

The alternative I guess is to sit down and write my own library that performs the same operations as std.string, but on the native string type. Which kinda dilutes the purpose of having standard libraries.


Sorry to be so verbose, and please don't anyone take any of this personally. I'm critiquing the code I am encountering, and the problems I am having using it. Not the people who wrote it.

Cheers, b.
-- 

April 28, 2008
<p9e883002@sneakemail.com> wrote
> Hi all,
>
> 'scuse me for not being familiar with previous or ongoing discussion on
> this
> subject, but I'm just coming back to D after a couple of years away.
>
> I have some strings read in from external source that I need to convert to
> uppercase. A quick look at Phobos and I find std.string has a toupper
> method.
> <very good example case removed>

This is all not an issue if Walter adopts 'scoped const' contracts.

http://d.puremagic.com/issues/show_bug.cgi?id=1961

The current con for this method is that it is another 'confusing' const syntax.  So is what I propose more confusing, or is what this poor developer had to go through more confusing?

-Steve


April 28, 2008
2008/4/28 Me Here <p9e883002@sneakemail.com>:
>  Ack! That's horrible. Instead of using the information I have, the
>  offset and length of the slice I want to manipulate, I have to derive
>  two offset/length pairs to the bits I do not want to do anything to.

Not necessarily. Instead of

    a = a[0..2] ~ "XXX" ~ a[5..$];

you could do

    char[] tmp = a.dup;
    tmp[2..5] = "XXX";
    a = assumeUnique(tmp);

(I forget which module you have to import to get assumeUnique). But what you mustn't ever do is cast away invariant.
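For reference, a minimal self-contained sketch of the dup-then-assumeUnique approach. It assumes a current Phobos, where assumeUnique is found in std.exception and the case functions are spelled toUpper/toLower rather than toupper/tolower; the helper name overwrite is made up for illustration and is not a library function.

```d
// Run with: dmd -unittest -main -run overwrite.d
import std.exception : assumeUnique;
import std.string;

// Hypothetical helper: returns a copy of s in which the characters
// s[from .. from + patch.length] are replaced by patch.
// The original string is never modified.
string overwrite(string s, size_t from, string patch)
{
    char[] tmp = s.dup;                          // private mutable copy
    tmp[from .. from + patch.length] = patch[];  // in-place slice assignment
    return assumeUnique(tmp);                    // fine: nothing else refers to tmp
}

unittest
{
    string a = "hello world".toUpper();
    a = overwrite(a, 2, "XXX");   // uses the offset and length directly
    a = a.toLower();
    assert(a == "hexxx world");
}
```

The call site is back to one line and still works from the offset of the slice to be changed; the price is one copy per call.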


>  1) Whatever happened to polymorphism?

What's polymorphism got to do with anything? A string is an array, not a class.


>  So, if I assign the results of a string library function/method to a
>  mutable variable (Just a variable really. An invariant variable is a
>  constant!), then it should be possible (*IS* possible) to recognise
>  that and return an appropriate result.

Functions don't overload on return value.


>  The idea that runtime obtained or derived strings can be made truly
>  invariant is purely theoretical.

But the fact that someone else might be sharing the data is not.

>  But one of the
>  major attractions of D over C/C++ is its built-in string types

D has no built in string type. string is just an alias for invariant(char)[].

>  If every time I want to use one
>  of these library calls, I have to cast my mutable string into an
>  invariant and then cast the result back to mutable

That's one approach. Another is don't try to treat strings as mutable.

April 28, 2008
2008/4/28 Steven Schveighoffer <schveiguy@yahoo.com>:
>  > I have some strings read in from external source that I need to convert to
>  > uppercase. A quick look at Phobos and I find std.string has a toupper
>  > method.
>  > <very good example case removed>
>
>  This is all not an issue if Walter adopts 'scoped const' contracts.

toupper() couldn't be reused for all constancies, because the invariant version should employ copy-on-write, whereas any other versions would not be able to do this.

That is,

    toupper("HELLO");

can return the original, if and only if the string is invariant.
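
A minimal sketch of the copy-on-write behaviour described above, assuming a current Phobos (std.ascii, std.exception) and ASCII-only input. The name cowToUpper is made up; this is not the actual std.string implementation.

```d
// Run with: dmd -unittest -main -run cow.d
import std.ascii : isLower, toUpper;
import std.exception : assumeUnique;

// Hypothetical copy-on-write upper-casing (ASCII only).
string cowToUpper(string s)
{
    foreach (i, c; s)
    {
        if (isLower(c))
        {
            // First lower-case character found: only now pay for a copy.
            char[] buf = s.dup;
            foreach (ref ch; buf[i .. $])
                ch = cast(char) toUpper(ch);
            return assumeUnique(buf);
        }
    }
    // Nothing to change: hand back the caller's own string, no allocation.
    return s;
}

unittest
{
    string s = "HELLO";
    assert(cowToUpper(s) is s);              // same memory returned
    assert(cowToUpper("Hello") == "HELLO");  // fresh copy returned
}
```

Returning the input is only safe because nobody can later modify an invariant string. With a char[] argument, the caller could change the shared buffer afterwards, and the "result" would change with it.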
April 28, 2008
"Janice Caron" wrote
> 2008/4/28 Steven Schveighoffer:
>>  > I have some strings read in from external source that I need to
>> convert to
>>  > uppercase. A quick look at Phobos and I find std.string has a toupper
>>  > method.
>>  > <very good example case removed>
>>
>>  This is all not an issue if Walter adopts 'scoped const' contracts.
>
> toupper() couldn't be reused for all constancies, because the invariant version should employ copy-on-write, whereas any other versions would not be able to do this.
>
> That is,
>
>    toupper("HELLO");
>
> can return the original, if and only if the string is invariant.

toupper is probably a bad example, as your case seems like the rarest :) But I understand what you are saying.

The desire to have string processing functions work with all constancies seems very reasonable and useful to me.  Denying the use of toupper unless you idup the array, just to keep the ability to optimize a corner case, seems incorrect, and will probably produce less efficient code in 90% of cases. If the scoped const proposal were never accepted, and I used Phobos, I'd probably suggest const and mutable versions of toupper, so that those of us who use mutable strings a lot, and maybe not so much multithreading, would not have to jump through hoops for any string processing.

Maybe the solution to this is to write specializations which use COW with the invariant version, perhaps with pure functions, which always assume invariant parameters.  So you have a pure toupper which handles the invariant version, and a scoped const version which allows using the function on non-invariant parameters, which can't be optimized the same anyways...

-Steve


April 28, 2008
Janice Caron wrote:

> you could do
> 
>     char[] tmp = a.dup;
>     tmp[2..5] = "XXX";
>     a = assumeUnique(tmp);
> 

Ah! Again, 3 lines instead of 1. Plus two function calls and a temporary variable.

You do realise that there is a very strong correlation between bugs and line count? That's been so for all of the last 30+ years, regardless of language or paradigm.

So, you made it more verbose and more complex and much slower.

And, in doing so, introduced more scopes for errors than you've cured.

>That's one approach. Another is don't try to treat strings as mutable.

RAM is mutable--it's its purpose in being. Variables live in RAM, and vary--that's their purpose in being.

Making a copy of a <strike>string</strike> piece of RAM and throwing the old one away, every time I want to alter its contents...kinda reminds me of disposable nappies. A costly convenience.

I'll revert to 1.x and pray that 2.x fades away through lack of interest before it turns D into Yet Another Dead Language--for OO purists and academics only.

Cheers, b.

-- 

April 28, 2008
On 28/04/2008, Me Here <p9e883002@sneakemail.com> wrote:
> Ah! Again, 3 lines instead of 1. Plus two function calls and a temporary variable.

To be fair though, the problem here is that the functions you are calling (std.string.toupper and std.string.tolower) don't do what you want. This is not a fault of the language - it's a limitation of the library.

To that end, as others have said, this problem could be solved simply enough by the addition of another module - say, std.stringbuffer - in which we alias char[] to stringbuffer (or maybe a StringBuffer class - I'm not sure what's best) and provide a whole bunch of functions optimized for those mutable char arrays.
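
A rough sketch of the kind of function such a module might contain, assuming a current Phobos (std.ascii) and ASCII-only case mapping. The name upperInPlace is hypothetical; no std.stringbuffer module exists.

```d
// Run with: dmd -unittest -main -run inplace.d
import std.ascii : toUpper;

// Hypothetical in-place routine for mutable char arrays.
// ASCII only: code units outside a-z are left untouched.
void upperInPlace(char[] buf)
{
    foreach (ref c; buf)
        c = cast(char) toUpper(c);
}

unittest
{
    char[] a = "hello world".dup;   // mutable from the start, no casts
    upperInPlace(a);                // no allocation, no copy
    a[2 .. 5] = "XXX";              // built-in slice assignment still applies
    assert(a == "HEXXX WORLD");
}
```

Because the buffer is mutable throughout, there is no cast and no temporary, which is the usage the original poster was after.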

To blame the language for the lack of library is the wrong approach. D2 has some killer, kickass features. The template metaprogramming power alone is enough to make C++ programmers weep. I'm looking forward to pure functions, and a new generation of multithreading.

If there's enough interest, and if Walter approves, I could certainly kickstart std.stringbuffer. Is that the right way to go? What do people think?

April 28, 2008
p9e883002@sneakemail.com wrote:
> So, that's two copies of the string, plus a slice, plus an extra method call to achieve what used to be achievable in place on the original string. Which is now immutable, but I'll never need it again. 
> 
> Of course, on these short 1-off strings it doesn't matter a hoot. But when the strings are 200 to 500 characters a pop and there are 20,000,000 of them. It matters.
> 
> Did I suggest this was an optimisation?

You bring up a good point.

On a tiny example such as yours, where you can see everything that is going on at a glance, such as where strings come from and where they are going, there isn't any point to immutable strings. You're right about that.

The problems start happening as the complexity rises. Strings get passed around, stored, modified, etc. It's real easy to lose track of who owns a string, who else has references to the string, who has rights to change the string and who doesn't.

For example, you're changing the char[][] passed in to main(). What if one of those strings is a literal in the read-only data section?

So what happens is code starts defensively making copies of the string "just in case." I'll argue that in a complex program, you'll actually wind up making far more copies than you will with invariant strings.

April 28, 2008
Janice Caron wrote:
> If there's enough interest, and if Walter approves, I could certainly
> kickstart std.stringbuffer. Is that the right way to go? What do
> people think?

What it will do is provide a useful solution for those who really want to use mutable strings. I bet that, though, after a while they'll evolve to eschew it in favor of immutable strings. It's easier than arguing about it <g>.

April 28, 2008
== Quote from Janice Caron (caron800@googlemail.com)'s article
> 2008/4/28 Steven Schveighoffer <schveiguy@yahoo.com>:
> >  > I have some strings read in from external source that I need to convert to
> >  > uppercase. A quick look at Phobos and I find std.string has a toupper
> >  > method.
> >  > <very good example case removed>
> >
> >  This is all not an issue if Walter adopts 'scoped const' contracts.
> toupper() couldn't be reused for all constancies, because the
> invariant version should employ copy-on-write, whereas any other
> versions would not be able to do this.
> That is,
>     toupper("HELLO");
> can return the original, if and only if the string is invariant.

Can you explain this in light of Steven's 'scoped const' proposal?  By my understanding (assuming scoped const):

    string bufI = "HELLO";
    char[] bufM = "HELLO".dup;
    const(char)[] bufC = bufM;

    string retI = toupper( bufI ); // return value is invariant - ok
    char[] retM = toupper( bufM ); // return value is mutable - ok
    const(char)[] retC = toupper( bufC ); // return value is const - ok
    const(char)[] retC2 = toupper( bufI ); // return value is invariant - ok

    bufM[0] = 'J';
    assert( retC[0] == 'J' );

The above seems perfectly fine, because it's impossible to pass a mutable array and return a const reference to it--the return value will be mutable as well.

By contrast, let's assume the invariant implementation:

    string toupper( string buf );

    char[] buf = "HELLO".dup;

    toupper( buf ); // fails
    toupper( buf.idup ); // works
    toupper( assumeUnique( buf ) ); // works

In the second case I have to copy buf to pass it to toupper, and in the third I have
to perform a cast operation (albeit wrapped in a function to hide the truth).
Assuming for a moment that mutable strings are useful and so I won't be able to
use the 'string' alias all the time, can you explain what is good about either of
these scenarios?
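
For comparison, here is a sketch of how one implementation can already serve all three constancies without scoped const, by templating on the element type. This is a different technique from Steven's proposal, it always copies (so there is no copy-on-write and no aliasing of the input), and it is ASCII-only. The name upperCopy is made up for illustration.

```d
// Run with: dmd -unittest -main -run uppercopy.d
import std.ascii : toUpper;

// Hypothetical: accepts char[], const(char)[] or string and returns a
// freshly allocated array carrying the same qualifier as the argument.
C[] upperCopy(C)(C[] s)
if (is(immutable C == immutable char))
{
    auto buf = new char[s.length];
    foreach (i, c; s)
        buf[i] = cast(char) toUpper(c);
    // The cast is safe here because buf has no other references.
    return cast(C[]) buf;
}

unittest
{
    string bufI = "hello";
    char[] bufM = "hello".dup;
    const(char)[] bufC = bufM;

    string retI = upperCopy(bufI);         // invariant in, invariant out
    char[] retM = upperCopy(bufM);         // mutable in, mutable out
    const(char)[] retC = upperCopy(bufC);  // const in, const out
    assert(retI == "HELLO" && retM == "HELLO" && retC == "HELLO");
}
```

The caller needs neither idup nor assumeUnique, at the cost of one template instantiation per constancy.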


Sean