December 21, 2006
Andrei Alexandrescu (See Website For Email) wrote:
> Don Clugston wrote:
>> Andrei Alexandrescu (See Website For Email) wrote:
>>> Similarly, let's say that a group of revolutionaries convinces Walter (as I understand happened in case of using "length" and "$" inside slice expressions, which is a shame and an absolute disaster that must be undone at all costs) to implement "auto"
>>
>> This off-hand remark worries me. I presume that you mean being able to reference the length of a string, from inside the slice? (rather than simply the notation).
>> And the problem being that it requires a sliceable entity to know its length? Or is the problem more serious than that?
>> It's worrying because any change would break an enormous amount of code.
> 
> It would indeed break an enormous amount of code, but "all costs" includes "enormous costs". :o) A reasonable migration path is to deprecate them soon and make them illegal over the course of one year.
> 
> A small book could be written on just how bad language design is using "length" and "$" to capture slice size inside a slice expression. I managed to write two lengthy emails to Walter about them, and just barely got started. Long story short, "length" introduces a keyword through the back door, effectively making any use of "length" anywhere unrecommended and highly fragile. 

That hadn't occurred to me, but you're right.  I never use length in that context precisely because it does look like it could be a local identifier, whereas I know it'll be clear it's not if I use $.  Also "length" is just too long to be of much use to me as a shortcut.  If I'm going to be that verbose I might as well type out the whole "varname.length".

> Using "$" is a waste of symbolic real estate to serve a narrow purpose; the semantics isn't naturally generalized to its logical conclusion; 

I do use this one, but I agree.  It is unnecessarily special cased for built-in array types.  For user-defined types, in 'myvar[0..$]' the $ does not expand to 'myvar.length' as one would naturally expect it to. Or any sort of opLength() call.  It's just a syntax error.

> and the choice of symbol itself as a reminiscent of Perl's regexp is at best dubious ("#" would have been vastly better as it has count connotation in natural language, and making it into an operator would have fixed the generalization issue). 

I think you'll have to admit that's just your personal taste there. Using $ to indicate 'end' is a regexp thing, but regexps go way beyond Perl.

I don't really care what it is as long as there's a terse way to specify 'the end' in an indexing expression.

> As things stand now, the rules governing the popping up of "length" and "$" constitute a sudden boo-boo on an otherwise carefully designed expression landscape.


After trying to write a multi-dimensional array class, my opinion is that D slice support could use some upgrades overall.  What I'd like to see:

--MultiRange Slice--
* A way to have multiple ranges in a slice, and a mix of slice and non-slice indices:
    A[i..j, k..m]
    A[i..j, p, k..m]

  I'm not saying built-in arrays like int[] should allow the above expressions, but that at least user types should be allowed to have such opSlice methods.  (Currently opSlice's are limited to having 2 arguments that represent the values that appear on either side of a single '..' token. You can only have two arguments max, but the arguments can be of any type.)

The problem is that opSlice has to look like opSlice(T1 lo, T2 hi) right now -- just two parameters (or zero).

One possible solution is to turn a single i..j into a single int[2] argument (or a mytype[2], for the general case).  But that means one won't be able to distinguish A[[1,3]] from A[1..3].  It also means more interesting extensions to slice syntax, like adding a stepsize on a range, will be ruled out.

Another solution is a built-in slice type.  Ranges like a..b would get converted to slice instances automatically.  It would basically be a struct with two ints in the simplest case, but to support user types as indexes it would need to be template-like, i.e. slice!(type).  A slice would look basically like
    struct slice(T=int) { T lo,hi; }
It could also have a .step property.  With the above, lo and hi would have to be of the same type, but really it makes sense to let them differ, so slice!(T1,T2).  For a range with stepsize, slice!(Tlo,Thi,Tstep).

To make writing opSlice methods sane, a single number like the p above should be converted to a slice also.  So all arguments passed to opSlice would be of type slice, and in the simple case of integer indices, it would just be:
    Type opSlice(slice s) { return x[s.lo..s.hi]; }
since integers would be the default types for slice.
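
For comparison, Python's indexing protocol already behaves much like the built-in slice type sketched above: a range written inside the brackets arrives as a slice object, and several comma-separated entries arrive together as a tuple. The following is only an analogy in Python, not D code. The class name Grid and the helper _as_slice are made up for illustration, and the sketch assumes exactly two non-negative indices or ranges.

```python
class Grid:
    """Tiny 2-D array; a hypothetical stand-in for a multi-dimensional array class."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @staticmethod
    def _as_slice(index):
        # A bare index p is normalized to the one-element range p..p+1,
        # so the rest of the method only ever deals with slice objects.
        if isinstance(index, slice):
            return index
        return slice(index, index + 1)

    def __getitem__(self, key):
        # key is a tuple with one entry per dimension, e.g. (slice(0, 2), 1).
        row_sel, col_sel = (self._as_slice(k) for k in key)
        return [row[col_sel] for row in self.rows[row_sel]]


A = Grid([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A[0:2, 1:3])  # [[2, 3], [5, 6]]
print(A[1, 0:2])    # [[4, 5]]
```

Normalizing the bare index to a slice is what keeps the indexing method down to a single code path, which is the point made above about keeping opSlice methods sane.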


--User Definable '$'--
* A way to specify 'the end' in user types.  In the general case the meaning of '$' in a slice cannot be known (because any type can be used as an index), nor can it be simply substituted with something like a .length property, because it may depend on context.  Consider a multi-dimensional array class --

     A[0..$,3..$]

The first $ means one thing, and the second one means another.

One solution - make an opLength that gets called with the parameter number in which the $ appears.  [My hypothesis is that the param# is the only context that ever matters in determining the meaning of $.]  So in the above int opLength(int i) would get called twice, once with i==0, once with i==1.   opLength can be made to return any type if the user just wants it to get 'passed through' to the opSlice call.  If you don't need the context you can define it as opLength().
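
A rough way to try out the "parameter number is the only context" hypothesis is to model '$' as a placeholder that is resolved once the position it appears in is known. The Python sketch below is an analogy only. END, End, Matrix and op_length are hypothetical names, with op_length standing in for the proposed opLength(int i).

```python
class End:
    """Hypothetical stand-in for '$': a placeholder resolved later, per dimension."""

    def __init__(self, offset=0):
        self.offset = offset

    def __sub__(self, n):
        # Allows expressions such as END - 1.
        return End(self.offset - n)

    def resolve(self, length):
        return length + self.offset


END = End()


class Matrix:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def op_length(self, dim):
        # Plays the role of the proposed opLength(int i).
        return len(self.rows) if dim == 0 else len(self.rows[0])

    def _bound(self, value, dim):
        # Only placeholders need the dimension; plain integers pass through.
        return value.resolve(self.op_length(dim)) if isinstance(value, End) else value

    def __getitem__(self, key):
        (r_lo, r_hi), (c_lo, c_hi) = (
            (self._bound(s.start, dim), self._bound(s.stop, dim))
            for dim, s in enumerate(key)
        )
        return [row[c_lo:c_hi] for row in self.rows[r_lo:r_hi]]


A = Matrix([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(A[0:END, 3:END])      # [[4], [8], [12]]
print(A[END - 1:END, 0:2])  # [[9, 10]]
```

The first END resolves to 3 (rows) and the second to 4 (columns), using nothing but the position in which each one appears.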

--Step sizes--
This is a handy feature of Python slices.  The general syntax for a slice in Python is lo:hi:step, meaning go from 'lo' to 'hi', stepping by 'step' at a time.   But any of the 3 components can be left out.
lo:hi means step=1.
lo::2 means go to the end, stepping by 2.
:hi means 0 to hi.
Negative steps are also allowed:
hi:lo:-1 means go backwards from hi to lo.
::-1 means go backwards from the last to the first element.

D syntax could be something like lo..hi:step.  I like the omission part of Python's syntax.  If D had that then most uses of $ would go away since we'd have A[3..] as an alternative to A[3..$].
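
For readers who don't use Python, here are the forms listed above run against a plain Python list. This is ordinary Python behavior, shown only to make the omission rules concrete.

```python
a = list(range(10))  # [0, 1, ..., 9]

print(a[2:8])     # lo:hi, step defaults to 1      -> [2, 3, 4, 5, 6, 7]
print(a[2::2])    # lo::2, runs to the end by twos -> [2, 4, 6, 8]
print(a[:3])      # :hi, starts at 0               -> [0, 1, 2]
print(a[8:2:-1])  # hi:lo:-1, backwards            -> [8, 7, 6, 5, 4, 3]
print(a[::-1])    # whole list reversed

# Omitting the upper bound does the job that '$' does in A[3..$].
print(a[3:] == a[3:len(a)])  # True
```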



--bb
December 21, 2006
== Quote from Andrei Alexandrescu (See Website For Email)
(SeeWebsiteForEmail@erdani.org)'s article
> Don Clugston wrote:
> > Andrei Alexandrescu (See Website For Email) wrote:
> >> Similarly, let's say that a group of revolutionaries convinces Walter (as I understand happened in case of using "length" and "$" inside slice expressions, which is a shame and an absolute disaster that must be undone at all costs) to implement "auto"
> >
> > This off-hand remark worries me. I presume that you mean being able to reference the length of a string, from inside the slice? (rather than simply the notation).
> >
> > And the problem being that it requires a sliceable entity to know its
> > length? Or is the problem more serious than that?
> > It's worrying because any change would break an enormous amount of code.
>
> It would indeed break an enormous amount of code, but "all costs" includes "enormous costs". :o) A reasonable migration path is to deprecate them soon and make them illegal over the course of one year.
>
> A small book could be written on just how bad language design is using "length" and "$" to capture slice size inside a slice expression. I managed to write two lengthy emails to Walter about them, and just barely got started. Long story short, "length" introduces a keyword through the back door, effectively making any use of "length" anywhere unrecommended and highly fragile. Using "$" is a waste of symbolic real estate to serve a narrow purpose; the semantics isn't naturally generalized to its logical conclusion; and the choice of symbol itself as a reminiscent of Perl's regexp is at best dubious ("#" would have been vastly better as it has count connotation in natural language, and making it into an operator would have fixed the generalization issue). As things stand now, the rules governing the popping up of "length" and "$" constitute a sudden boo-boo on an otherwise carefully designed expression landscape.

I guess the question is, what is the best alternative.  I agree about 'length', and I usually don't use "length" in this way, but I do things like x[$-2..$] all the time.  Some proposals:

1. Symbols

Going in the symbol direction, it might also make sense to *add* something like "^" for the start of a container.  This would be useful with AAs and user defined types.  We could use both: a[^+2..$-2].  This would only really be useful with containers that did not index from 0, i.e. non-integer or AA indices.

char[char[]] words;
words[^.."brink"]; // all words in dictionary before 'brink'
words["brack"..$] // instead of symbols

Which could translate to:
words.opSlice(words.opBegin(), "brink")
words.opSlice("brack", words.opEnd())
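
To get a feel for begin/end markers on a container with non-integer keys, here is a small Python sketch. It is an analogy, not a D design. WordIndex is a made-up class, and Python's omitted slice bounds stand in for ^ and $.

```python
import bisect


class WordIndex:
    """Hypothetical sorted word container that can be sliced by key."""

    def __init__(self, words):
        self._words = sorted(words)

    def __getitem__(self, key):
        # An omitted bound means 'begin' or 'end' of the container.
        lo = 0 if key.start is None else bisect.bisect_left(self._words, key.start)
        hi = len(self._words) if key.stop is None else bisect.bisect_left(self._words, key.stop)
        return self._words[lo:hi]


words = WordIndex(["zebra", "brink", "apple", "bread", "brack"])
print(words[:"brink"])  # ['apple', 'brack', 'bread']
print(words["brack":])  # ['brack', 'bread', 'brink', 'zebra']
```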

2. I like this better: I call it "with without with"

In order to maximize the dollar value :) of syntax symbol real estate, the meaning of $ could be expanded as follows:

Something like X[$begin..$end] could be a shortcut for either X[0..X.length]
for arrays, or X[X.opBegin()..X.opEnd()] for user types.

I think the above solves the problem, doesn't it?  The "$end" phrase is terse enough for most coders, unique enough to avoid namespace conflicts, avoids the problem of keywords ghosting in and out of existence in mid-expression, and avoids ruining $ (or # if #end is used instead) for the symbol space.

We can stop right there... or go on to something for post-1.0:

Other applications of $ could be:

A. Syntax reduction for enumerated types and fields:

  struct Colors {
    enum { red, green, blue };
    void set(int c);
  };

  Colors c;
  c.set($red);

  This use of enumerated type is becoming more common, having "$" be a
  shortcut for <context>.X might make a lot of code more readable.  The
  question then becomes, "Which contexts are searched for .X?"

B. Reserved for language features.

  Leave this open for language designer use.  All $xyz expressions are context
  dependent keywords.  This allows much shorter words to be used, and allows
  language features to be named intelligently without worrying about crashing
  into user-defined names.  For example, C could never introduce a new keyword
  called "begin" or "end", since it would break nearly every C program, but we
  can easily add a keyword called $begin which will not conflict with anything,
  since the $ saves us from conflicts.

  Most of the discussions for new features here have at least some arguments on
  how to add the new syntax for the feature, what other uses those symbols could
  be used for, etc.  The $xyz route allows Walter to introduce lots of language
  concepts in the future without conflicts.  It could even be used to prototype
  keywords that are experimental.  They can even be removed or promoted to non-$
  status later if desired.

  NOTE that if # was used instead of $, it would dovetail nicely with the
  "#line" and "#file" quasi-keywords.

> > These issues you're raising seem to be far too fundamental to be fixed in the next few days, casting grave doubts on whether a D1.0 release on Jan 1 is a good idea.
>
> The lvalue/rvalue issue is fundamental. I'm not in the position to assess whether it's a maker or breaker of D 1.0.
>
> The "length"/"$" issue is not fundamental the same way that C's declaration syntax, Java's throw specifications, C++'s use of "<" and ">" for templates, and Mao Zedong's refusal to use a toothbrush are not fundamental. It will "just" go down in history as a huge embarrassment and a good resource for cheap shooters and naysayers. If I understand its genesis, it will also be a canonical example of why design by committee is bad.
>
> Andrei

I like the terseness of "$" but I'm willing to do away with it if it really is that bad.  What I'm wondering, is how far do you think we need to roll back the syntax, before it's "The Right Thing" (tm) again?

Do we really need to go all the way to myarray[0..myarray.length], or can some intermediate solution work?

Kevin
December 21, 2006
Thomas Kuehne wrote:
> enum S{ FOO }
> template Templ(S T) { }
> mixin Templ!(S.FOO) bar;
> 
> Do you consider S a keyword here?

You're right, it makes parsing dependent on the symbol table, breaking a nice property of D. Back to the drawing board.

Andrei
December 21, 2006
Chris Nicholson-Sauls wrote:
> Benji Smith wrote:
>> Are there languages where this is currently possible?
> 
> C++, by returning a reference.

Perl 5 too.

Andrei
December 21, 2006
Benji Smith wrote:
> Andrei Alexandrescu (See Website For Email) wrote:
>> Let me illustrate further why ident is important and what solution we should have for it. Consider C's response to ident:
>>
>> #define IDENT(e) (e)
>>
>> ...
>>
>> ...leading to the following implementation of ident:
>>
>> auto ident(auto x) {
>>   return x;
>> }
> 
> I don't get it.
> 
> Why is it necessary (or even desirable) for functions to return lvalues?

Methods might want to return lvalues, but indeed the need is not overwhelming. (They could return pointers after all.) But the point is different. You want to have a grip on all types, and ident shows that you can't. For example, in current D you can't (barring a hack that I saw in a post around here) have a template that takes a function and creates one of the exact signature. That is a vastly useful and desirable thing to want; think e.g. of a function that memoizes any other function.
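
To make the memoization example concrete, this is roughly what such a wrapper looks like in a dynamically typed language, where forwarding an arbitrary signature is easy. It is a Python sketch for illustration only and says nothing about how D should do it. It assumes positional, hashable arguments, and memoize, square and calls are made-up names.

```python
import functools


def memoize(fn):
    """Wrap fn so that repeated calls with the same arguments reuse the result."""
    cache = {}

    @functools.wraps(fn)  # keeps fn's name and docstring on the wrapper
    def wrapper(*args):
        if args not in cache:
            cache[args] = fn(*args)
        return cache[args]

    return wrapper


calls = []  # records every real invocation of the wrapped function


@memoize
def square(x):
    calls.append(x)
    return x * x


print(square(4), square(4))  # 16 16
print(calls)                 # [4]
```

The difficulty described above is doing the same thing in a statically typed template while preserving the wrapped function's exact parameter and return types.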


Andrei
December 21, 2006
Derek Parnell wrote:
> On Wed, 20 Dec 2006 06:24:28 -0800, Andrei Alexandrescu (See Website For
> Email) wrote:
> 
>  
>> A small book could be written on just how bad language design is using "length" and "$" to capture slice size inside a slice expression. I managed to write two lengthy emails to Walter about them, and just barely got started. 
> 
> Please share your thoughts here if you can too.

Gladly; I dug through my email, so let me share a couple of excerpts.

---------

int length = 5;
int[] a = new int[length * 2];
int[] b = a[length .. length * 2];
int[] c = a[length - 1 .. (b[0 .. length])[0]];

In each of its uses, length has a different semantics. The behavior is well-defined for all cases, but nonintuitive and about as pleasant as nails on the blackboard.

Now D has a compile-time option to ban the "length" name in scopes in which the slice operator is used. That would render the example above illegal. There is also a rule that identifiers in nested scopes cannot mask one another. So length will be banned from *any* scope that nests a scope using a slice:

int length;
if (a) {
  foreach (b; c) {
    while (d) {
      switch (e) {
        case f: g = h[0 .. length - 1];
        ...
      }
    }
  }
}

This code will not compile. Worse, it *will* compile until you add the slice operation. Combining the two rules and taking them to their logical conclusion, any code using "length" is frail because there's always a risk that somebody might insert a slice, rendering the entire function uncompilable. What happened is that now "length" has become a backdoor-introduced keyword. Books will advise users to never use it even when it works, coding standards will ban it, language lawyers will use it to detract D, and users of other languages will smile condescendingly and stay with their languages.

There are a few ways out of it. "length" could be actually made a keyword. But even that one isn't very uniform, and steals yet another good identifier name.

Another way out of it is to ban "length" but stick with "$". But "$" has another bunch of problems. It's a special character used only once, and only in a very particular situation. There is no general concept standing behind its usage: it sticks out like a sore thumb. "$" isn't the last index in an array. It's that only when used inside a slice, and refers only to the innermost index of the array. Quite a waste of a special character out there, and to little usefulness.

But if we made "$" into an operator identifying the last element of _any_ array, which could refer to the last element of _the left-hand side_ array if we so want, then all of a sudden it becomes useful in a myriad of situations:

int i = a[$ - 1]; // get last element
int i = a[$b - 1]; // get a's element at position b.length - 1
if (a[$ - 1] == x) { ... }
if ($a > 0) { ... }
if ($a == $b) { ... }
swap(a[0], a[$ - 1]); // swap first and last element

---------------

Grammar for nullary/unary $:

---------------

I think I nailed down the way the count operator $ can work in a manner that's terse, expressive, and safe.

My basic goal is to enable the operator $ to be unary (applying to an array) to return its size, and also nullary (applying to nothing) to implicitly mean "fetch the size of the innermost array in the expression". So this code should work:

int[] foo;
foo[$ - 1]; // refers to foo's last element
foo[$foo - 1]; // same
int[][] bar;
bar[foo[$]]; // refers to bar indexed with foo's last element
bar[foo[$bar]]; // refers to bar indexed with foo's element at $bar

To insert my operator $ within D's grammar, go to the grammar page: http://www.digitalmars.com/d/expression.html#UnaryExpression and scroll down to Unary Expression. There, add the following rules:

UnaryExpression:
    PostfixExpression
    & UnaryExpression
    ... etc. etc. ...
    $ Identifier
    $ PostfixExpression . Identifier
    $ PostfixExpression ( )
    $ PostfixExpression ( ArgumentList )
    $ IndexExpression
    $ SliceExpression
    $ ArrayLiteral
    $ ( Expression )

Now a unary expression can be the $ operator followed by an identifier, a member access, a function call, an array access, or a slice expression (awesome! pick the size of the slice!), a literal array (for conformity), or a parenthesized expression. Perfect!

But we haven't yet filled the role of $ as a nullary operator. To do so, let's go in the grammar to http://www.digitalmars.com/d/expression.html#PrimaryExpression and append one more rule to the PrimaryExpression rule:

PrimaryExpression:
    Identifier
    .Identifier
    ... etc. etc. ...
    $

Now the grammar is unambiguous and will properly distinguish unary and nullary uses of the $ operator.

This is more elegant than the current crap with "$" and "length" popping up. Besides, you can now use $ in many more places than inside []s. However, the grammar size does increase quite a bit, which is more fuss than I hoped for just one operator.

A simpler grammar would have been to simply allow:

UnaryExpression:
    PostfixExpression
    & UnaryExpression
    ... etc. etc. ...
    $ PostfixExpression

But this would have been ambiguous. If the compiler sees "$-1", then the bad grammar says that's a unary use of $ because -1 is a PostfixExpression. But that's not what we wanted! We wanted $ to be nullary. That's why I needed to put all the cases in UnaryExpression.



Andrei
December 21, 2006
Andrei Alexandrescu (See Website For Email) wrote:
> Derek Parnell wrote:
>> On Wed, 20 Dec 2006 06:24:28 -0800, Andrei Alexandrescu (See Website For
>> Email) wrote:
>>
>>  
>>> A small book could be written on just how bad language design is using "length" and "$" to capture slice size inside a slice expression. I managed to write two lengthy emails to Walter about them, and just barely got started. 
>>
>> Please share your thoughts here if you can too.
> 
> Gladly; I dug my email and let me share a couple of excerpts.
<snipped excerpts>

Wow, I understand it now. I only hope that at least 'length' will be deprecated before 1.0.

I like your dollars. I'm not so good with grammars, will your proposal also work for user defined types?
December 21, 2006
Lutger wrote:
> Wow, I understand it now. I only hope that at least 'length' will be deprecated before 1.0.
> 
> I like your dollars.

Well, just don't take'em away from my bank account :o).

> I'm not so good with grammars, will your proposal also work for user defined types?

The plan is that $expression is rewritten into (expression).length. The
consistent thing to do is to make that into an opXyz() function, but I
don't find this name inconsistency jarring.


Andrei
December 21, 2006
Bill Baxter wrote:
> After trying to write a multi-dimensional array class, my opinion is that D slice support could use some upgrades overall.

I'd be very interested in looking at what you've come up with. With my own implementation of a multi-dimensional array type a couple of months ago, I came to the same conclusion. I posted about it in:

news://news.digitalmars.com:119/edrv0n$hth$1@digitaldaemon.com
http://www.digitalmars.com/d/archives/digitalmars/D/announce/4717.html

> What I'd like to see:
> 
> --MultiRange Slice--
> * A way to have multiple ranges in a slice, and a mix of slice and non-slice indices:
>     A[i..j, k..m]
>     A[i..j, p, k..m]
(snip)
>      A[0..$,3..$]

Yes, I would too. It is quite frustrating having the syntax in the language but not being allowed to utilize it... :)

I work around this by using a custom slice syntax instead:

A[range(i,j), range(k,m)]
A[range(i,j), p, range(k,m)]
A[range(0,end), range(3,end)]
A[end-1, p % end]

Basically, the transformation is:

$ => end
a..b => range(a,b)

I briefly described this in:
news://news.digitalmars.com:119/eft9id$2aq3$1@digitaldaemon.com

The resulting code becomes quite optimal without the need for a position dependent opLength type of operator, but handling all the cases puts a larger burden on the implementor of opIndex.

> The problem is that opSlice has to look like opSlice(T1 lo, T2 hi) right now -- just two parameters (or zero).
[snip]
> Another solution is a built-in slice type.  Ranges like a..b would get converted to slice instances automatically.  

Yes, this would be my suggestion too. Adding an opApply to one such built in range type would also have the nice side effect of allowing the syntactical sugar:

foreach(i; 5..10)
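
As a rough illustration of that side effect, here is a Python sketch of a range type that can be iterated. IntRange is a hypothetical name, and its __iter__ method plays the part an opApply would play in D.

```python
class IntRange:
    """Hypothetical half-open integer range lo..hi."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def __iter__(self):
        # Yields lo, lo + 1, ..., hi - 1.
        i = self.lo
        while i < self.hi:
            yield i
            i += 1


for i in IntRange(5, 10):
    print(i)  # prints 5 through 9, one per line
```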

> --User Definable '$'--
[snip]
> One solution - make an opLength that gets called with the parameter number in which the $ appears. 

Yes, that is probably the cleanest solution. And if no such opLength(int) overload exists, return the result of opLength() (or possibly .length)

/Oskar
December 21, 2006
Andrei Alexandrescu (See Website For Email) wrote:
> 
> A simpler grammar would have been to simply allow:
> 
> UnaryExpression:
>     PostfixExpression
>     & UnaryExpression
>     ... etc. etc. ...
>     $ PostfixExpression
> 
> But this would have been ambiguous. If the compiler sees "$-1", then the bad grammar says that's a unary use of $ because -1 is a PostfixExpression. But that's not what we wanted! We wanted $ to be nullary. That's why I needed to put all the cases in UnaryExpression.
> 

Nice post, and one heck of an argument!

FWIW, I advocated something similar during the last round of debates before the '$' operator was introduced.  What I wanted to see was '$' to become like 'this' within slice and array expressions, so that the issues regarding 'length' could be resolved.  In essence one could simply say '$.length' and mean 'the length of the current array':

b[0 .. $.length];
a[0 .. $.getIndexOf(';')];

So in essence, every use of '$' would be a 'nullary' operator - an alias if you will.
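
One way to play with the "'$' as an alias for the current array" idea is to let a bound be a small function that receives the container being indexed. The Python sketch below is an analogy only. Text and get_index_of are made-up names, and the lambda parameter plays the part of '$'.

```python
class Text:
    """Hypothetical string wrapper whose slice bounds may refer back to the object."""

    def __init__(self, s):
        self.s = s

    def __len__(self):
        return len(self.s)

    def get_index_of(self, ch):
        return self.s.index(ch)

    def __getitem__(self, key):
        def bound(value):
            # A callable bound is evaluated with the container itself,
            # which is the role '$' would play in the expression.
            return value(self) if callable(value) else value

        return self.s[bound(key.start):bound(key.stop)]


a = Text("key=value;rest")
print(a[0:(lambda me: len(me))])               # key=value;rest
print(a[0:(lambda me: me.get_index_of(';'))])  # key=value
```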

I'd imagine that extending things in this manner would simplify things grammatically while allowing for a wider category of uses.  However, it doesn't solve the issue that you brought up, and that I've quoted above.

c[$-1];

It looks like it should be an implicit cast of the '$' to a size_t (length), via its use in an expression.  Any thoughts on this?

-- 
- EricAnderton at yahoo