Thread overview
Re: iteration over a string
May 28, 2013
Jonathan M Davis
May 28, 2013
Timothee Cour
May 28, 2013
bearophile
May 28, 2013
bearophile
May 28, 2013
Timothee Cour
May 28, 2013
bearophile
May 28, 2013
Timothee Cour
May 28, 2013
bearophile
May 28, 2013
Jonathan M Davis
May 28, 2013
On Tuesday, May 28, 2013 00:26:03 Timothee Cour wrote:
> Questions regarding iteration over code points of a utf8 string:
> 
> In all that follows, I don't want to go through intermediate UTF32 representation by making a copy of my string, but I want to iterate over its code points.
> 
> say my string is declared as:
> string a="Ωabc"; //if email reader screws this up, it's a 'Omega' followed
> by abc
> 
> A)
> this doesn't work obviously:
> foreach(i,ai; a){
>   write(i,",",ai," ");
> }
> //prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5
> elements)

Yes. I'd love it if it were a warning or error to not give an explicit iteration type for foreach with strings, but I don't think that Walter is willing to do that. He seems to think that everyone should understand Unicode and therefore have no problems with the fact that foreach iterates over code units rather than code points.

> B)
> foreach(i,dchar ai;a){
>   write(i,",",ai," ");
> }
> // prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
> But index i skips position 1, indicating the start index of code points; it
> prints [0,2,3,4]
> Is that a bug or a feature?

Feature. It's the index of the array, so it's code units, and in general is more useful (at least for more advanced string processing).

> C)
> writeln(a.walkLength); // prints 4
> for(size_t i;!a.empty;a.popFront,i++)
>   write(i,",",a.front," ");
> 
> // prints 0,Ω 1,a 2,b 3,c
> This seems the most correct for interpreting a string as a range over code
> points, where index i has positions [0,1,2,3]
> 
> Is there a more idiomatic way?

Not really (though maybe it could be done with zip and iota or something if you really wanted to), but it's also not something that you'd do normally. Remember that ranges don't provide indices unless they're random access, and narrow strings aren't random access as far as ranges are concerned. You have to count the elements for pretty much _any_ non-random-access range if you want to know which element you're on.

> D)
> How to make the standard algorithms (std.map, etc) work well with the
> iteration over code points as in method C above ?
> 
> For example this one is very confusing for me:
> string a="ΩΩab";
> auto b1=a.map!(a=>"<"d~a~">"d).array;
> writeln(b1.length);//6

For ranges, use walkLength, not length. length will _not_ be correct in the general case when strings are involved, because that's the number of code units rather than code points.

> E)
> The fact that there are 2 ways to iterate over strings is confusing:
> For example reading at docs, ForeachType is different from ElementType and
> ElementType is special cased for narrow strings;
> foreach(i;ai;a){foo(i,ai);} doesn't behave as for(size_t
> i;!a.empty;a.popFront,i++) {foo(i,a.front);}
> walkLength != length for strings
> 
> F)
> Why can't we have the following design instead:
> * no special case with isNarrowString scattered throughout phobos
> * iteration with foreach behaves as iteration with popFront/empty/front,
> and walkLength == length
> * ForeachType == ElementType (ie one is redundant)
> * require *explicit user syntax* to construct a range over code points from
> a string:
> 
> struct CodepointRange{
>  this(string a){...}
>  auto popFront(){}
>  auto empty(){}
>  auto length(){}//
> }
> 
> now the user can do:
> a.map!foo => will iterate over char
> a.CodepointRange.map!foo => will iterate over code points.
>
> Everything seems more orhogonal that way, and user has clear understanding of complexity of each operation.

The reason that we don't do that is mostly because it makes pretty much no sense to iterate over ranges of code units 99.99% of the time. Very nearly the _only_ time that it makes sense is when you know that all of the characters that you're operating on are ASCII characters. By iterating over code points, the situation is far more correct (the not completely correct, due to the difference between code points and graphemes). If we did everything on code units by default, D code in general would not operate at all correctly on Unicode most of the time.

The difference between foreach and ranges for strings is most definitely unfortunate, but it really isn't all that complicated ultimately. Just remember that foreach itself operates on code units (so you have to use dchar explicitly if you want to iterate over code points) and that all of the range functions operate on code points. So, when you're operating on narrow strings as ranges, don't use any operations that the range API doesn't consider them to have (in particular, length, random access, and slicing). In general, this is quite easy to do, because you write range-based functions in a generic fashion using traits such as hasLength and hasSlicing to determine a range's capabilities. So, you really don't have to do anything special for strings - unless you want to optimize your code for strings. And if you want to do that, then you need to understand how Unicode works with regards to code units and code points and code up your specialiazation appropriately.

Maybe there's a better way to explain this, but it really seems to me like you're overcomplicating it. If you want to operate on code units, use the built-in operations, as the language itself considers strings to be arrays of code units. If you want to operate on code points, use the range operations - and only the range operations that a given range supports (meaning don't do things like use length when hasLength!R is false like it is for narrow strings).

- Jonathan M Davis
May 28, 2013
2A)
Thanks for your answer;
You skipped over this one, which I don't understand:

> string a="ΩΩab";
> auto b1=a.map!(a=>"<"d~a~">"d).array;
> writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
> Why are there 2 empty strings at the end? (one per Omega if you vary the
number of such symbols in the string).

The above is just weird; is that a bug?

2B)
> The reason that we don't do that is mostly because it makes pretty much
no sense to iterate over ranges of code units 99.99% of the time

And yet "ΩΩab".map!foo seems to operate on code units and in order to operate on code points we need "ΩΩab".stride(1).map!foo, so isn't it inconsistent with the fact that std.algorithm.find("ΩΩab") operates on code points ?

2C)
> Remember that ranges don't provide indices unless they're random access,

For input ranges ranges, I don't understand why the compiler can't accept
the foreach(i,ai;a) syntax:

it should behave as follows:
foreach(i , ai; a){expr}

rewritten as:
for(size_t i=0, ai=a.front; !a.empty; a.popFront;){expr}

but it doesn't compile (it only accepts foreach(ai;a){expr}




On Tue, May 28, 2013 at 1:25 AM, Jonathan M Davis <jmdavisProg@gmx.com>wrote:

> On Tuesday, May 28, 2013 00:26:03 Timothee Cour wrote:
> > Questions regarding iteration over code points of a utf8 string:
> >
> > In all that follows, I don't want to go through intermediate UTF32 representation by making a copy of my string, but I want to iterate over its code points.
> >
> > say my string is declared as:
> > string a="Ωabc"; //if email reader screws this up, it's a 'Omega'
> followed
> > by abc
> >
> > A)
> > this doesn't work obviously:
> > foreach(i,ai; a){
> >   write(i,",",ai," ");
> > }
> > //prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5
> > elements)
>
> Yes. I'd love it if it were a warning or error to not give an explicit
> iteration type for foreach with strings, but I don't think that Walter is
> willing to do that. He seems to think that everyone should understand
> Unicode
> and therefore have no problems with the fact that foreach iterates over
> code
> units rather than code points.
>
> > B)
> > foreach(i,dchar ai;a){
> >   write(i,",",ai," ");
> > }
> > // prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
> > But index i skips position 1, indicating the start index of code points;
> it
> > prints [0,2,3,4]
> > Is that a bug or a feature?
>
> Feature. It's the index of the array, so it's code units, and in general is more useful (at least for more advanced string processing).
>
> > C)
> > writeln(a.walkLength); // prints 4
> > for(size_t i;!a.empty;a.popFront,i++)
> >   write(i,",",a.front," ");
> >
> > // prints 0,Ω 1,a 2,b 3,c
> > This seems the most correct for interpreting a string as a range over
> code
> > points, where index i has positions [0,1,2,3]
> >
> > Is there a more idiomatic way?
>
> Not really (though maybe it could be done with zip and iota or something if
> you really wanted to), but it's also not something that you'd do normally.
> Remember that ranges don't provide indices unless they're random access,
> and
> narrow strings aren't random access as far as ranges are concerned. You
> have
> to count the elements for pretty much _any_ non-random-access range if you
> want to know which element you're on.
>
> > D)
> > How to make the standard algorithms (std.map, etc) work well with the
> > iteration over code points as in method C above ?
> >
> > For example this one is very confusing for me:
> > string a="ΩΩab";
> > auto b1=a.map!(a=>"<"d~a~">"d).array;
> > writeln(b1.length);//6
>
> For ranges, use walkLength, not length. length will _not_ be correct in the general case when strings are involved, because that's the number of code units rather than code points.
>
> > E)
> > The fact that there are 2 ways to iterate over strings is confusing:
> > For example reading at docs, ForeachType is different from ElementType
> and
> > ElementType is special cased for narrow strings;
> > foreach(i;ai;a){foo(i,ai);} doesn't behave as for(size_t
> > i;!a.empty;a.popFront,i++) {foo(i,a.front);}
> > walkLength != length for strings
> >
> > F)
> > Why can't we have the following design instead:
> > * no special case with isNarrowString scattered throughout phobos
> > * iteration with foreach behaves as iteration with popFront/empty/front,
> > and walkLength == length
> > * ForeachType == ElementType (ie one is redundant)
> > * require *explicit user syntax* to construct a range over code points
> from
> > a string:
> >
> > struct CodepointRange{
> >  this(string a){...}
> >  auto popFront(){}
> >  auto empty(){}
> >  auto length(){}//
> > }
> >
> > now the user can do:
> > a.map!foo => will iterate over char
> > a.CodepointRange.map!foo => will iterate over code points.
> >
> > Everything seems more orhogonal that way, and user has clear
> understanding
> > of complexity of each operation.
>
> The reason that we don't do that is mostly because it makes pretty much no
> sense to iterate over ranges of code units 99.99% of the time. Very nearly
> the
> _only_ time that it makes sense is when you know that all of the characters
> that you're operating on are ASCII characters. By iterating over code
> points,
> the situation is far more correct (the not completely correct, due to the
> difference between code points and graphemes). If we did everything on code
> units by default, D code in general would not operate at all correctly on
> Unicode most of the time.
>
> The difference between foreach and ranges for strings is most definitely
> unfortunate, but it really isn't all that complicated ultimately. Just
> remember that foreach itself operates on code units (so you have to use
> dchar
> explicitly if you want to iterate over code points) and that all of the
> range
> functions operate on code points. So, when you're operating on narrow
> strings
> as ranges, don't use any operations that the range API doesn't consider
> them
> to have (in particular, length, random access, and slicing). In general,
> this
> is quite easy to do, because you write range-based functions in a generic
> fashion using traits such as hasLength and hasSlicing to determine a
> range's
> capabilities. So, you really don't have to do anything special for strings
> -
> unless you want to optimize your code for strings. And if you want to do
> that,
> then you need to understand how Unicode works with regards to code units
> and
> code points and code up your specialiazation appropriately.
>
> Maybe there's a better way to explain this, but it really seems to me like
> you're overcomplicating it. If you want to operate on code units, use the
> built-in operations, as the language itself considers strings to be arrays
> of
> code units. If you want to operate on code points, use the range
> operations -
> and only the range operations that a given range supports (meaning don't do
> things like use length when hasLength!R is false like it is for narrow
> strings).
>
> - Jonathan M Davis
>


May 28, 2013
Timothee Cour:

>> string a="ΩΩab";
>> auto b1=a.map!(a=>"<"d~a~">"d).array;
>> writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
>> Why are there 2 empty strings at the end? (one per Omega if you vary the
> number of such symbols in the string).
>
> The above is just weird; is that a bug?

I think it's a bug, it's shown here, I will put it in Bugzilla:

import std.stdio: writeln;
import std.algorithm: map;
import std.array: array;
void main() {
    string a = "ΩΩab";
    a.map!(a => "<"d ~ a ~ ">"d).writeln;
    a.map!(a => "<"d ~ a ~ ">"d).array.writeln;
}


The output shows that maybe it's a problem of array():

["<Ω>", "<Ω>", "<a>", "<b>"]
["<Ω>", "<Ω>", "<a>", "<b>", "", ""]



> For input ranges ranges, I don't understand why the compiler can't accept
> the foreach(i,ai;a) syntax:
>
> it should behave as follows:
> foreach(i , ai; a){expr}
>
> rewritten as:
> for(size_t i=0, ai=a.front; !a.empty; a.popFront;){expr}
>
> but it doesn't compile (it only accepts foreach(ai;a){expr}

The idea must work in all cases. For opApply, built-in arrays, built-in associative arrays and for ranges. I think it causes some clashing. If you will find it causes no clashing, then it's a candidate to become an enhancement request.

Bye,
bearophile
May 28, 2013
> I think it's a bug, it's shown here, I will put it in Bugzilla:

http://d.puremagic.com/issues/show_bug.cgi?id=10191

Bye,
bearophile
May 28, 2013
On Tue, May 28, 2013 at 3:20 AM, bearophile <bearophileHUGS@lycos.com>wrote:

> Timothee Cour:
>
>  string a="ΩΩab";
>>> auto b1=a.map!(a=>"<"d~a~">"d).**array;
>>> writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
>>> Why are there 2 empty strings at the end? (one per Omega if you vary the
>>>
>> number of such symbols in the string).
>>
>> The above is just weird; is that a bug?
>>
>
> I think it's a bug, it's shown here, I will put it in Bugzilla:
>
> import std.stdio: writeln;
> import std.algorithm: map;
> import std.array: array;
> void main() {
>     string a = "ΩΩab";
>     a.map!(a => "<"d ~ a ~ ">"d).writeln;
>     a.map!(a => "<"d ~ a ~ ">"d).array.writeln;
> }
>
>
> The output shows that maybe it's a problem of array():
>
> ["<Ω>", "<Ω>", "<a>", "<b>"]
>
> ["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
>
>
>
>  For input ranges ranges, I don't understand why the compiler can't accept
>> the foreach(i,ai;a) syntax:
>>
>> it should behave as follows:
>> foreach(i , ai; a){expr}
>>
>> rewritten as:
>> for(size_t i=0, ai=a.front; !a.empty; a.popFront;){expr}
>>
>> but it doesn't compile (it only accepts foreach(ai;a){expr}
>>
>
> The idea must work in all cases. For opApply, built-in arrays, built-in associative arrays and for ranges. I think it causes some clashing. If you will find it causes no clashing, then it's a candidate to become an enhancement request.
>
> Bye,
> bearophile
>

thanks for filing 10191!

Let's dig more the indexed foreach proposal:

opApply => present with corresponding signature to foreach(i,ai;a) => use it
built-in arrays => its an randaccess iterator so already works
associative arrays => already works, compiler rewrites to key,value
ranges => if it's not an associative array, random access iterator, doesn't
have opApply with corresponding signature, then use the size_t index I
mentioned (existing code makes it compile error, this proposal allows it).

Where do you think it can cause clashing?


May 28, 2013
On Tuesday, May 28, 2013 03:05:52 Timothee Cour wrote:
> 2A)
> Thanks for your answer;
> 
> You skipped over this one, which I don't understand:
> > string a="ΩΩab";
> > auto b1=a.map!(a=>"<"d~a~">"d).array;
> > writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
> > Why are there 2 empty strings at the end? (one per Omega if you vary the
> 
> number of such symbols in the string).
> 
> The above is just weird; is that a bug?

I didn't look into it, since I was in a hurry, but it definitely looks like a bug.

> 2B)
> 
> > The reason that we don't do that is mostly because it makes pretty much
> 
> no sense to iterate over ranges of code units 99.99% of the time
> 
> And yet "ΩΩab".map!foo seems to operate on code units and in order to operate on code points we need "ΩΩab".stride(1).map!foo, so isn't it inconsistent with the fact that std.algorithm.find("ΩΩab") operates on code points ?

Any and all range-based functions operate on code points when operating on strings. If they do otherwise, it's a bug - probably due to a failed attempt to optimize for strings.

> 2C)
> 
> > Remember that ranges don't provide indices unless they're random access,
> 
> For input ranges ranges, I don't understand why the compiler can't accept
> the foreach(i,ai;a) syntax:
> 
> it should behave as follows:
> foreach(i , ai; a){expr}
> 
> rewritten as:
> for(size_t i=0, ai=a.front; !a.empty; a.popFront;){expr}
> 
> but it doesn't compile (it only accepts foreach(ai;a){expr}

foreach and ranges don't do anything with indices. I believe that the reason that's usually brought up as to why is that for some types of ranges, it would make no sense. There's probably an enhancement request open for it though as it would certainly seem to make sense for most ranges. Though if you're looking at the counter as being an index, it really doesn't make sense for anything other than random-access ranges, as only they have indexing operations.

In any case, if you're feeling fancy, you can use iota and zip to solve the problem:

 foreach(i, e; zip(iota(0, size_t.max), range))
 {...}

and lockstep uses opApply to add an index, but it doesn't work with just one range, so it wouldn't help with just giving foreach an index with normal range iteration.

- Jonathan M Davis
May 28, 2013
Timothee Cour:

> Where do you think it can cause clashing?

I don't know.

But my solution is to introduce a simple "enumerate" range:
http://d.puremagic.com/issues/show_bug.cgi?id=5550

So if you have a range and you don't need indexes you use:

foreach (item; myrange) {}

If you also want an index:

foreach (index, item; myrange.enumerate) {}

This is the solution used by Python, it's clean, and it doesn't make the foreach syntax & compiler more complex. Later the compiler should optimize this idiom well.

Bye,
bearophile
May 28, 2013
Thanks for the enumerate pointer.

3A)
why not having an Enumerate(R) containing a single 'opApply' public method,
to avoid returning tuples in 'front()'

3B)
requiring one to use 'myrange.enumerate' for inputRanges and 'myrange' in
other cases is bad for generic programming and user time. This is a very
frequent idiom that should just work out of the box. Language is more
orthogonal and less surprising in this case. That being said, enumerate
still has its value (eg for foreach (i, k, v;
associativeArray.byPair.enumerate()) )

3C)
if no-one can find a concrete argument why it would cause clashing /
confusion, I will file an enhancement request for it (dip sounds overkill)



On Tue, May 28, 2013 at 10:31 AM, bearophile <bearophileHUGS@lycos.com>wrote:

> Timothee Cour:
>
>
>  Where do you think it can cause clashing?
>>
>
> I don't know.
>
> But my solution is to introduce a simple "enumerate" range: http://d.puremagic.com/issues/**show_bug.cgi?id=5550<http://d.puremagic.com/issues/show_bug.cgi?id=5550>
>
> So if you have a range and you don't need indexes you use:
>
> foreach (item; myrange) {}
>
> If you also want an index:
>
> foreach (index, item; myrange.enumerate) {}
>
> This is the solution used by Python, it's clean, and it doesn't make the foreach syntax & compiler more complex. Later the compiler should optimize this idiom well.
>
> Bye,
> bearophile
>


May 28, 2013
Timothee Cour:

> 3A)
> why not having an Enumerate(R) containing a single 'opApply' public method,

Unfortunately opApply doesn't not work well with all the other range-based functions.
Also that Enumerate code is for illustrative purposes, it's not meant to be good library code.


> 3B)
> requiring one to use 'myrange.enumerate' for inputRanges and 'myrange' in
> other cases is bad for generic programming and user time.

Generic programming must know if you are iterating only on items or on index-item pairs. As Python Zen says, explicit is better than implicit.


> This is a very
> frequent idiom that should just work out of the box. Language is more
> orthogonal and less surprising in this case.

Yet Python designers have preferred to use enumerate instead of messing with the Python for.

Bye,
bearophile