April 28, 2008
"Walter Bright" wrote
> Lars Ivar Igesund wrote:
>> After working with Java for quite some time, I have naturally drifted
>> from
>> using invariant strings to stringbuffers.
>
> Java strings lack slicing, so they're crippled anyway. I believe that slicing is one of those paradigm-shifting features, so I am not making an irrelevant point.

Java's String.substring(start, last) works just like slicing...

Not that I don't prefer D slicing over calling a function, but saying that Java doesn't have slicing is completely false.

Where Java strings fall short is in the support of mutable strings, and especially in having strings treated as native arrays.  D excels in those areas.
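For what it's worth, a minimal sketch of what I mean (plain Java; note that whether the result shares the backing array has changed between JDK versions -- it did share in JDK 6 and earlier, while later JDKs copy):

```java
public class SubstringSlice {
    public static void main(String[] args) {
        String line = "key=value";
        // substring(start, end) returns the characters [start, end) --
        // the same operation a D slice expresses as line[start .. end].
        // In JDK 6 and earlier the result shared the backing char[]
        // (an O(1) slice); since JDK 7u6 it copies the characters.
        String key = line.substring(0, 3);
        String value = line.substring(4, 9);
        System.out.println(key + " / " + value); // key / value
    }
}
```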

-Steve


April 28, 2008
Walter Bright wrote:

> Lars Ivar Igesund wrote:
>> After working with Java for quite some time, I have naturally drifted from using invariant strings to stringbuffers.
> 
> Java strings lack slicing, so they're crippled anyway. I believe that slicing is one of those paradigm-shifting features, so I am not making an irrelevant point.

I agree that Java strings are crippled, but considering that String is easier to use there than StringBuffer, I certainly would need good reasons to prefer the latter. And I have them.

Your point about slicing may not be irrelevant, but the kickass-ness of the feature only truly comes into its own when combined with non-allocating string operations.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
April 29, 2008
Walter Bright wrote:

>p9e883002@sneakemail.com wrote:

>>Did I suggest this was an optimisation?
>
>You bring up a good point.

Sorry to have provoked you Walter, but thanks for your reply.

>On a tiny example such as yours, where you can see everything that is going on at a glance, such as where strings come from and where they are going, there isn't any point to immutable strings. You're right about that.

Well, obviously, the example was trivial in order to concentrate attention upon the issue I was having.

>  It's real easy to lose track of who owns a string, who else has references to the string, who has rights to change the string and who doesn't.

The keyword in there is "who". The problem is that you are pessimising the entire language, once rightly famed for its performance, for *all* users, for the notional convenience of those few writing threaded applications. Now don't go taking that the wrong way. In other circles, I am known as "Mr. Threading". At least for my advocacy of them, if not my expertise. Though I have been using threads for a relatively long time, going way back to pre-1.0 OS/2 (then known internally as CP/DOS). I mention this only to show I'm not in the "thread is spelt f-o-r-k" camp.

>For example, you're changing the char[][] passed in to main(). What if one of those strings is a literal in the read-only data section?

Okay. So that raises the question of how runtime external data ends up in a read-only data section. Of course, it can be done, but that then raises the further question: why? But let's ignore that for now and concentrate on the development of my application that wants to mutate one or more of those strings.

The first time I try to mutate one, I'm going to hit an error, either at compile time or runtime, and immediately know, assuming the error message is reasonably understandable, that I need to make a copy of the immutable string into something I can mutate. A quick, *single* dup, and I'm away and running.

Provided that I have the tools to do what I need, that is. In this case, and the entire point of the original post, that means a library of common string manipulation functions that work on my good old-fashioned char[]s without my needing to jump through the hoops of neo-orthodoxy to use them.
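To be concrete about that single dup, here's the Java analogue: one explicit copy into a mutable buffer, after which every mutation is in place (a sketch only; the literal stands in for runtime input):

```java
public class DupOnce {
    public static void main(String[] args) {
        String immutable = "hello world";      // stand-in for runtime input
        char[] buf = immutable.toCharArray();  // the one copy -- D's .dup
        for (int i = 0; i < buf.length; i++) { // now mutate in place freely
            buf[i] = Character.toUpperCase(buf[i]);
        }
        System.out.println(new String(buf)); // HELLO WORLD
    }
}
```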

But, as I tried to point out in the post to which you replied, the whole 'args' thing is a red herring. It was simply a convenient source of non-compile-time data. I couldn't get the std.stream example to compile. Apparently due to a bug in the v2 libraries--see elsewhere.

In this particular case, I turned to D in order to manipulate 125,000,000 x 500 to 2000 byte strings. A dump of an inverted index DB. I usually do this kind of stuff in a popular scripting language, but that proved to be rather too slow for this volume of data. Each of those records needs to go through multiple mutations: from uppercasing of certain fields, to the complete removal of certain characters within substantial subsets of each record, to the recalculation and adjustment of an embedded hex digest within each record to reflect the preceding changes. All told, each record may go through anything from 5 to 300 separate mutations.

Doing this via immutable buffers is going to create scads and scads of short-lived, immutable sub-elements that will just tax the GC to hell and impose unnecessary and unacceptable time penalties on the process. And I almost certainly will have to go through the process many times before I get the data in the ultimate form I need.
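For illustration only, a toy Java version of the kind of in-place pipeline I mean -- the field layout and transforms are made up, but the point is that one reusable buffer absorbs every mutation instead of each step allocating a fresh immutable string:

```java
public class RecordPipeline {
    // Hypothetical record transform: uppercase the field before the ':'
    // and strip '-' characters from the rest, all in place on one buffer.
    static void transform(StringBuilder rec) {
        int sep = rec.indexOf(":");
        for (int i = 0; i < sep; i++) {
            rec.setCharAt(i, Character.toUpperCase(rec.charAt(i)));
        }
        // iterate backwards so deletions don't shift unvisited indices
        for (int i = rec.length() - 1; i > sep; i--) {
            if (rec.charAt(i) == '-') rec.deleteCharAt(i);
        }
    }

    public static void main(String[] args) {
        StringBuilder rec = new StringBuilder("docid:ab-cd-ef");
        transform(rec);
        System.out.println(rec); // DOCID:abcdef
    }
}
```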

>So what happens is code starts defensively making copies of the string "just in case." I'll argue that in a complex program, you'll actually wind up making far more copies than you will with invariant strings.
>[from another post] I bet that, though, after a while they'll evolve to eschew it in favor of immutable strings. It's easier than arguing about it

You are so wrong here. I spent 2 of the worst years of my coding career working in Java, and ended up fighting it all the way. Some of that was due to the sudden re-invention of major parts of the system libraries in completely incompatible ways when the transition from (from memory) 1.2 to 1.3 occurred--and being forced to make the change because of the near total abandonment of support or bug fixing for the 'old' libraries. But another big part of the problem was the endless complexities involved in switching between the String type and the StringBuffer type.

Please learn from history. Talk to (experienced) Java programmers. I mean real working stiffs, not OO-purists from academia. Preferably some that have experience of other languages also. It took until v1.5 before the performance of Java--and the dreaded GC pregnant pause--finally reached a point where Java performance for manipulating large datasets was both reasonable and, more importantly, reasonably deterministic. Don't make their mistakes over again.

Too many times in the last thirty years I've seen promising, pragmatic software technologies tail off into academic obscurity because the primary motivators suddenly "got religion". Whether OO dogma or functional purity or whatever other flavour of neo-orthodoxy became the flavour du jour, the assumption that "they'll see the light eventually" has been the downfall of many a promising start.

Just as the answer to the occasional hit-and-run death is not banning cars, so fixing unintentional aliasing in threaded applications does not lie in forcing all character arrays to be immutable.

For one reason, it doesn't stop there. Character arrays are just arrays of numbers. Exactly the same problems arise with arrays of integers, reals, associative arrays, etc. Imagine the costs of duplicating an entire hash every time you add a new key or alter a value. The penalties grow quadratically with the size of the hash (array of ints, longs, reals ...).

And before you reject this notion on the basis that "I'd never do that", what's the difference? Are strings any more vulnerable to the problems invariance is meant to tackle than these other datatypes?
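A sketch of that cost difference, in Java since that's the comparison at hand: copying the whole hash before every update does quadratic total work, versus linear for in-place updates (sizes here are arbitrary):

```java
import java.util.HashMap;
import java.util.Map;

public class CopyCost {
    public static void main(String[] args) {
        int n = 1000;
        // In-place: n inserts, O(n) total work.
        Map<Integer, Integer> mutable = new HashMap<>();
        for (int i = 0; i < n; i++) mutable.put(i, i * i);

        // Invariant-style: a full copy before each insert,
        // O(n^2) total work -- the penalty described above.
        Map<Integer, Integer> frozen = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Map<Integer, Integer> next = new HashMap<>(frozen); // full copy
            next.put(i, i * i);
            frozen = next;
        }
        System.out.println(mutable.equals(frozen)); // true
    }
}
```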

Try manipulating large datasets--images, DNA data, signal processing, finite element analysis; any of the types of applications for which multi-threading isn't just a way to allow the program to do something useful while the user decides which button to click--in any of the "referentially transparent" languages that are concurrency capable, and see the hoops you have to leap through to achieve anything like decent performance. E.g. Haskell's Unsafe* library routines (basically: abandon referential transparency for this data so that we can get something done in a reasonable time frame!). Look for "If you can match 1-core C speed using 4-core Haskell parallelism without "unsafe pseudo-C in Haskell" trickery, I will be impressed. ..." in the following article:   http://reddit.com/r/programming/info/61p6f/comments/

The abandonment or deprecation of lvalue slices on string types is the thin end of the wedge toward referential transparency. Despite all the academic hype and impressive (small scale) demos of the 'match made in heaven' that is 'referential transparency & concurrency', try to seek out real-world examples of the combination running in real-world environments--i.e. where someone other than the taxpayer of whatever country is paying for the development, and the time pressure to obtain results is a little more demanding than a thesis submission date--and you'll find them very conspicuous by their absence.

Such ideas look great on paper, in the heady world of ideal Turing Machines with unlimited length tapes (unbounded memory). But once you bring them back to the real world of finite RAM, fragmentable heaps and GC, they become impractical. Unworkable for real data sets in real time.

Don't feel the need to argue this on-forum. If it hasn't persuaded you that forcing invariance upon one datatype, through providing a string library that only works with invariant strings, will do little to address the problems it attempts to solve, then I doubt further discussion will. Please return to the pragmatism that so stood out in your early visions for D and abandon this folly before, as with so many of the follies of the gentleman academics of yore, it becomes a life-long quest ending up as a memorial or tombstone.

Cheers, b.
-- 
April 29, 2008
Steven Schveighoffer wrote:
> Java's String.substring(start, last) works just like slicing...

No it doesn't. It makes a copy (I don't know if this is true of *all* versions of Java).
April 29, 2008
== Quote from Me Here (p9e883002@sneakemail.com)'s article
> Don't feel the need to argue this on-forum. If it hasn't persuaded you that forcing invariance upon one datatype, through providing a string library that only works with invariant strings, will do little to address the problems it attempts to solve, then I doubt further discussion will.

There's always Tango :p

> Please return to the pragmatism that so stood out in your early visions for D and abandon this folly before, as with so many of the follies of the gentleman academic of yore, it becomes a life-long quest ending up as a memorial or tombstone.

As a point of interest, this quote is at the top of the DigitalMars D page:

"It seems to me that most of the "new" programming languages fall into one
of two categories: Those from academia with radical new paradigms and those
from large corporations with a focus on RAD and the web. Maybe it's time for a
new language born out of practical experience implementing compilers." -- Michael


Sean
April 29, 2008
Walter Bright wrote:
> Steven Schveighoffer wrote:
>> Java's String.substring(start, last) works just like slicing...
> 
> No it doesn't. It makes a copy (I don't know if this is true of *all* versions of Java).

A String holds a char[], the "start" into it, and its "length". A substring just creates another String instance with "start" and "length" changed.

So it makes a new String, but the underlying char[] remains the same.
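Part of this is even observable from outside: the degenerate full-range substring doesn't allocate anything at all, it returns the very same object (true of the JDK 6 implementation discussed here, and that particular case still holds in current JDKs):

```java
public class SharedSlice {
    public static void main(String[] args) {
        String s = "immutable";
        // A full-range substring takes the (beginIndex == 0 && endIndex == count)
        // fast path and returns `this` -- reference equality, no allocation.
        System.out.println(s.substring(0, s.length()) == s); // true
    }
}
```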
April 29, 2008
Walter Bright wrote:
> Steven Schveighoffer wrote:
>> Java's String.substring(start, last) works just like slicing...
> 
> No it doesn't. It makes a copy (I don't know if this is true of *all* versions of Java).

Java 6's String.substring method (JDK 1.6.0_04, 64-bit Windows):

public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

The important part is new String(offset + beginIndex, endIndex - beginIndex, value) which does indeed do a "slice" of sorts (that is, it returns a string with the same char array backing it with a new offset and length). No copying of data is done.
April 29, 2008
== Quote from Robert Fraser (fraserofthenight@gmail.com)'s article
> Walter Bright wrote:
> > Steven Schveighoffer wrote:
> >> Java's String.substring(start, last) works just like slicing...
> >
> > No it doesn't. It makes a copy (I don't know if this is true of *all*
> > versions of Java).
> Java 6's String.substring method (JDK 1.6.0_04, 64-bit Windows):
> public String substring(int beginIndex, int endIndex) {
> if (beginIndex < 0) {
> 	throw new StringIndexOutOfBoundsException(beginIndex);
> }
> if (endIndex > count) {
> 	throw new StringIndexOutOfBoundsException(endIndex);
> }
> if (beginIndex > endIndex) {
> 	throw new StringIndexOutOfBoundsException(endIndex -beginIndex);
> }
> return ((beginIndex == 0) && (endIndex == count)) ? this :
> 	new String(offset + beginIndex, endIndex - beginIndex, value);
> }
> The important part is new String(offset + beginIndex, endIndex -
> beginIndex, value) which does indeed do a "slice" of sorts (that is, it
> returns a string with the same char array backing it with a new offset
> and length). No copying of data is done.

Right.  The issue in Java is that the String wrapper class is still allocated on the heap, so dynamic memory allocation is still occurring.  D, on the other hand, uses a fat reference, so creating a slice doesn't touch the heap at all.


Sean
April 29, 2008
Robert Fraser wrote:
> The important part is new String(offset + beginIndex, endIndex - beginIndex, value) which does indeed do a "slice" of sorts (that is, it returns a string with the same char array backing it with a new offset and length). No copying of data is done.

Yes, you are right. I was wrong. But Java is still new'ing a new instance of String for each slice. And it still uses two levels of indirection to get to the string data.
April 29, 2008
Janice Caron wrote:

>2008/4/28 Me Here <p9e883002@sneakemail.com>:
>(I forget which module you have to import to get assumeUnique). But
>what you mustn't ever do is cast away invariant.
>
>>  1) Whatever happened to polymorphism?
>
>What's polymorphism got to do with anything? A string is an array, not a class.
>
>
>>  So, if I assign the results of a string library function/method to a
>>  mutable variable (Just a variable really. An invariant variable is a
>>  constant!), then it should be possible (*IS* possible) to recognise
>>  that and return an appropriate result.
>
>Functions don't overload on return value.

They don't? Why not? Seems like a pretty obvious step to me.

Rather than having to have methods:

    futzIt_returnString()
    futzIt_returnInt()
    futzIt_returnReal()
    futzIt_returnComplex()

where 'futzIt' might be "read a string from the command line and return it to me as some type (if possible)",

I can just do

int i = futzIt( ... );
real r = futzIt( ... );

And let the compiler work out which futzIt() I need to call, and take care of mangling the names to allow them to coexist.
You mean D doesn't already have this facility?

Seems like it would be a far more productive and useful expenditure of effort than all this invariant stuff.
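For comparison, Java doesn't overload on return type either; the usual workaround is to pass the desired type explicitly--e.g. a Class token--so that resolution happens on the arguments. A sketch (futzIt is of course the hypothetical name from above):

```java
public class FutzIt {
    // Return-type "overloading" simulated with an explicit type token:
    // the caller states the desired result type as an argument.
    @SuppressWarnings("unchecked")
    static <T> T futzIt(String input, Class<T> as) {
        if (as == Integer.class) return (T) Integer.valueOf(input.trim());
        if (as == Double.class)  return (T) Double.valueOf(input.trim());
        if (as == String.class)  return (T) input.trim();
        throw new IllegalArgumentException("unsupported type: " + as);
    }

    public static void main(String[] args) {
        int i    = futzIt(" 42 ", Integer.class);
        double r = futzIt(" 2.5 ", Double.class);
        System.out.println(i + " " + r); // 42 2.5
    }
}
```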

>
>
>>  The idea that runtime obtained or derived strings can be made truly
>>  invariant is purely theoretical.
>
>But the fact that someone else might be sharing the data is not.

By "someone else" you mean 'another thread'?
If so, then if that is a possibility--if my code is using threads--then I, the programmer,
will be aware of that and will be able to make appropriate choices.

I /might/ choose to use invariance to 'protect' this particular piece of data from the problems
of shared state concurrency--if there is any possibility that I intend to share this particular piece of data.
But in truth, it is very unlikely that I *will* make /that/ choice. Here's why.

What does it mean to make and hold multiple (mutated) copies of a single entity?

That is, I obtain a piece of data from somewhere and make it invariant.
Somehow two threads obtain references to that piece of data.
If none of them attempts to change it, then it makes no difference that it is marked invariant.
If, however, one of them is programmed to change it, then it now has a different
version of that entity to the other thread. But what does that mean? Who has the 'right' version?

Show me a real situation where two threads can legitimately be making disparate modifications to a single entity,
string or otherwise, and I'll show you a programming error. Once two threads make disparate modifications to an entity,
they are separate entities. And they should have been given copies, not references to a single copy, in the first place.

If the intent is that they share a single entity, then any legitimate modifications to that single entity should be reflected
in the views of that single entity by both threads, and therefore subjected to locking, or STM, or whatever mechanism is
used to control that modification.
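That's exactly what, say, a shared StringBuffer under its built-in lock gives you in Java: one entity, one version, visible to both threads (a toy example; the payload is arbitrary):

```java
public class SharedBuffer {
    public static void main(String[] args) throws InterruptedException {
        // One shared entity, modified by two threads under a lock, so both
        // always see the same single version -- as opposed to handing each
        // thread its own immutable copy.
        StringBuffer shared = new StringBuffer(); // internally synchronized
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) shared.append('x');
        };
        Thread a = new Thread(writer), b = new Thread(writer);
        a.start(); b.start();
        a.join(); b.join();
        // Every append was serialised by the lock: no lost updates.
        System.out.println(shared.length()); // 2000
    }
}
```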

This whole thing of invariance and concurrency seems to be aimed at enabling the use of COW.
Which smacks of someone trying to emulate fork-like behaviours using threads.

And if that is the case, and I very much hope it isn't, then let me tell you, as someone who is intimately familiar with the
one existing system that went this route (iThreads: look 'em up), that it is a total disaster.

The whole purpose and advantage of multi-threading, over multi-processing, is (mutable) shared state, and the elimination of
the costs of serialisation and the narrow bandwidth of IPC in the forking concurrency model. Attempting to emulate that model
using threading gives few of its advantages, all of its disadvantages, and throws away all of the advantages of threading.
It is a complete and utter waste of time and effort.

If the aim is to simplify the use of threading for common programming scenarios
and bring it within the grasp of non-threading specialist programmers,
then there are far more effective and less costly ways of achieving that.

>
>>  But one of the
>>  major attractions of D over C/C++ is its built-in string types
>
>D has no built in string type. string is just an alias for invariant(char)[].

Semantics.

D has built-in support for a string-type (see http://www.digitalmars.com/d/2.0/overview.html) from which I quote:

    "Strings"

    "String manipulation is so common, and so clumsy in C and C++, that it needs direct support in the language".
    "Modern languages handle string concatenation, copying, etc., and so does D".
    "Strings are a direct consequence of improved array handling."

What invariant strings do--and as far as I can see the only significant thing they do--is reinvent the clumsiness
of C & C++ by making strings a second-class data type again.

If the point is to try and make threading easier, it will fail miserably once people realise that it creates the scope for
multiple concurrent versions of supposedly single entities. Which breaks just about every programming rule in the book,
and creates scope for far more intractable errors than it fixes.

>That's one approach. Another is don't try to treat strings as mutable.

If the intention of invariance is some move toward OO or functional purity, then I again quote from the same document:

    "Who D is Not For"

    [some categories elided]

    "Language purists. D is a practical language, and each feature of it is evaluated in that light, rather than by an ideal. "
    "For example, D has constructs and semantics that virtually eliminate the need for pointers for ordinary tasks. "
    "But pointers are still there, because sometimes the rules need to be broken."
    "Similarly, casts are still there for those times when the typing system needs to be overridden."

Cheers, b.


--