iteration over a string
May 28, 2013
Timothee Cour
May 28, 2013
Ali Çehreli
May 28, 2013
Ali Çehreli
May 28, 2013
Diggory
May 28, 2013
Questions regarding iteration over code points of a utf8 string:

In all that follows, I don't want to go through intermediate UTF32 representation by making a copy of my string, but I want to iterate over its code points.

say my string is declared as:
string a="Ωabc"; // if the email reader garbles this, it's an 'Omega' followed by abc

A)
this doesn't work obviously:
foreach(i,ai; a){
  write(i,",",ai," ");
}
//prints 0,� 1,� 2,a 3,b 4,c (ie decomposes at the 'char' level, so 5 elements)

B)
foreach(i,dchar ai;a){
  write(i,",",ai," ");
}
// prints 0,Ω 2,a 3,b 4,c (ie decomposes at code points, so 4 elements)
But index i skips position 1, indicating the start index of each code point; it prints [0,2,3,4].
Is that a bug or a feature?

C)
writeln(a.walkLength); // prints 4
for(size_t i;!a.empty;a.popFront,i++)
  write(i,",",a.front," ");

// prints 0,Ω 1,a 2,b 3,c
This seems the most correct for interpreting a string as a range over code points, where index i has positions [0,1,2,3]

Is there a more idiomatic way?

D)
How can the standard algorithms (std.algorithm.map, etc) be made to work well with iteration over code points as in method C above?

For example this one is very confusing for me:
string a="ΩΩab";
auto b1=a.map!(a=>"<"d~a~">"d).array;
writeln(b1.length);//6
writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
Why are there 2 empty strings at the end? (one per Omega if you vary the
number of such symbols in the string).


E)
The fact that there are 2 ways to iterate over strings is confusing:
For example, reading the docs, ForeachType is different from ElementType, and ElementType is special-cased for narrow strings;
foreach(i,ai;a){foo(i,ai);} doesn't behave like for(size_t i;!a.empty;a.popFront,i++) {foo(i,a.front);}
walkLength != length for strings

F)
Why can't we have the following design instead:
* no special case with isNarrowString scattered throughout phobos
* iteration with foreach behaves as iteration with popFront/empty/front, and walkLength == length
* ForeachType == ElementType (ie one is redundant)
* require *explicit user syntax* to construct a range over code points from a string:

struct CodepointRange{
 this(string a){...}
 auto popFront(){}
 auto empty(){}
 auto length(){}//
}

now the user can do:
a.map!foo => will iterate over char
a.CodepointRange.map!foo => will iterate over code points.

Everything seems more orthogonal that way, and the user has a clear understanding of the complexity of each operation.
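
A minimal sketch of what such an explicit range could look like, assuming std.utf.decode does the decoding. The member names follow the outline above; front is added because the range primitives need it, and length is left out because counting code points would require a full walk. This is only an illustration of the proposal, not an existing Phobos type.

```d
import std.algorithm : equal, map;
import std.stdio;
import std.utf : decode;

// Hypothetical explicit range over the code points of a UTF-8 string.
// It only slices the original string; no UTF-32 copy is made.
struct CodepointRange
{
    private string s;

    this(string a) { s = a; }

    @property bool empty() { return s.length == 0; }

    // Decodes the code point at the start of the remaining slice.
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);
    }

    // Skips as many code units as the first code point occupies.
    void popFront()
    {
        size_t i = 0;
        decode(s, i);
        s = s[i .. $];
    }
}

void main()
{
    string a = "ΩΩab";

    // Explicit iteration over code points: 4 elements.
    auto wrapped = CodepointRange(a).map!(c => "<"d ~ c ~ ">"d);
    writeln(wrapped);

    // The range yields the same code points as the UTF-32 literal.
    assert(CodepointRange(a).equal("ΩΩab"d));

    // The underlying array still has 6 code units.
    writeln(a.length);
}
```

With a type like this, the cost of decoding is visible at the call site, which is the point of the proposal.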


May 28, 2013
On 05/28/2013 12:26 AM, Timothee Cour wrote:

> In all that follows, I don't want to go through intermediate UTF32
> representation by making a copy of my string, but I want to iterate over
> its code points.

Yes, the whole situation is a little messy. :)

There is also std.range.stride:

    foreach (ai; a.stride(1)) {
        // ...
    }

If you need the index as well, and do not want to manage it explicitly, one way is to use zip and sequence:

import std.stdio;
import std.range;

void main()
{
    string a="Ωabc";

    foreach (i, ai; zip(sequence!"n", a.stride(1))) {
        write(i,",",ai," ");
    }
}

The output:

Ω a b c

Ali

May 28, 2013
On 05/28/2013 12:42 AM, Ali Çehreli wrote:

> The output:
>
> Ω a b c

Rather:

0,Ω 1,a 2,b 3,c

Ali

May 28, 2013
Most algorithms for strings need the offset rather than the character index, so:

foreach (i, dchar c; str)

Gives the offset into the string for "i"
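
A small sketch of that behaviour, assuming the "Ωabc" string from the first post. The helper name codePointOffsets is made up for the example; std.utf.codeLength is used only to show why the index jumps from 0 to 2.

```d
import std.stdio;
import std.utf : codeLength;

// Collects the index that foreach reports for each decoded code point.
// (Hypothetical helper, only for illustration.)
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s)
        offsets ~= i;
    return offsets;
}

void main()
{
    string a = "Ωabc";

    // 'Ω' (U+03A9) occupies two UTF-8 code units, so 'a' starts at offset 2.
    writeln(codeLength!char('Ω'));  // 2
    writeln(codePointOffsets(a));   // [0, 2, 3, 4]

    // Because i is a code unit offset, it can be used directly for slicing.
    writeln(a[2 .. $]);             // abc
}
```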

If you really need the character index just count it:

int charIndex = 0;
foreach (dchar c; str) {
   // ...

   ++charIndex;
}
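
The same counting can probably be written without the manual counter. The sketch below assumes std.range.enumerate is available (it was added to Phobos in a release later than this thread) and relies on strings being iterated as ranges of dchar. The helper name indexedCodePoints is made up for the example.

```d
import std.conv : text;
import std.range : enumerate;
import std.stdio;

// Formats each code point together with its character index.
// (Hypothetical helper, only for illustration.)
string indexedCodePoints(string str)
{
    string result;
    // enumerate pairs each decoded code point with a running count.
    foreach (charIndex, c; str.enumerate)
        result ~= text(charIndex, ",", c, " ");
    return result;
}

void main()
{
    writeln(indexedCodePoints("Ωabc")); // 0,Ω 1,a 2,b 3,c
}
```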

If strings were treated specially so that they looked like arrays of dchars but used UTF-8 internally, it would hide all sorts of performance costs. Random access into a UTF-8 string by the character index is O(n), whereas indexing by the offset is O(1).

If you are using random access by character index heavily you should therefore convert to a dstring first and then you can get the O(1) random access time.
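
To make the trade-off concrete, here is a rough sketch of both approaches. It assumes std.utf.toUTFindex for mapping a character index to a code unit offset and std.conv.to for the dstring conversion; codePointAt is a made-up helper name.

```d
import std.conv : to;
import std.stdio;
import std.utf : decode, toUTFindex;

// Returns the n-th code point of a UTF-8 string.
// toUTFindex has to walk the string from the start, so this is O(n) per call.
// (Hypothetical helper, only for illustration.)
dchar codePointAt(string s, size_t n)
{
    size_t offset = s.toUTFindex(n);
    return decode(s, offset);
}

void main()
{
    string a = "Ωabc";

    writeln(a.toUTFindex(1));    // 2: code unit offset of character index 1
    writeln(codePointAt(a, 1));  // a

    // One O(n) conversion (which copies), then every lookup is O(1).
    dstring d = a.to!dstring;
    writeln(d[1]);               // a
    writeln(d.length);           // 4
}
```

The first form avoids the copy but pays the walk on every lookup; the second pays once up front, which only makes sense when there are many lookups.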