iteration over a string

May 28, 2013

Timothee Cour

May 28, 2013

Ali Çehreli

May 28, 2013

Ali Çehreli

May 28, 2013

Diggory

May 28, 2013


Posted by Timothee Cour


Questions regarding iteration over code points of a utf8 string:

In all that follows, I don't want to go through intermediate UTF32 representation by making a copy of my string, but I want to iterate over its code points.

Say my string is declared as:
string a="Ωabc"; // if the email reader mangles this, it's an 'Omega' followed by abc

A)
This obviously doesn't work:
foreach(i,ai; a){
  write(i,",",ai," ");
}
// prints 0,� 1,� 2,a 3,b 4,c (i.e. decomposes at the 'char' level, so 5 elements)

B)
foreach(i,dchar ai;a){
  write(i,",",ai," ");
}
// prints 0,Ω 2,a 3,b 4,c (i.e. decomposes at code points, so 4 elements)
But index i skips position 1: it holds the start offset of each code point, so it takes the values [0,2,3,4].
Is that a bug or a feature?
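
The index in B is the byte offset at which each code point starts, which is why it can be used to slice the original string. A minimal sketch of that, assuming `std.utf.codeLength` to get the encoded size of each code point; `codePointOffsets` is a made-up helper name, not something from Phobos:

```d
import std.stdio : writeln;
import std.utf : codeLength;

// Hypothetical helper: collect the start offset of every code point.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    foreach (i, dchar c; s) // i is a byte offset, c a decoded code point
        offsets ~= i;
    return offsets;
}

void main()
{
    string a = "Ωabc";
    writeln(codePointOffsets(a)); // [0, 2, 3, 4]

    foreach (i, dchar c; a)
    {
        // The offset plus the encoded length gives a valid slice of a.
        writeln(i, " ", a[i .. i + codeLength!char(c)]);
    }
}
```

The slice for offset 0 covers two bytes because 'Ω' takes two UTF-8 code units.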

C)
writeln(a.walkLength); // prints 4
for(size_t i;!a.empty;a.popFront,i++)
  write(i,",",a.front," ");

// prints 0,Ω 1,a 2,b 3,c
This seems the most correct for interpreting a string as a range over code points, where index i has positions [0,1,2,3]

Is there a more idiomatic way?

D)
How can I make the standard algorithms (std.algorithm.map, etc.) work well with iteration over code points, as in method C above?

For example this one is very confusing for me:
string a="ΩΩab";
auto b1=a.map!(a=>"<"d~a~">"d).array;
writeln(b1.length);//6
writeln(b1);//["<Ω>", "<Ω>", "<a>", "<b>", "", ""]
Why are there 2 empty strings at the end? (One per Omega, if you vary the number of such symbols in the string.)
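
The two trailing empty strings look like the result array being sized from the string's code-unit length (6) instead of its code-point count (4). That reading is a guess from the numbers, not something confirmed in this thread. A sketch of the same expression, assuming a Phobos where `hasLength` is false for narrow strings (checked by the `static assert`); `bracketed` is a hypothetical wrapper name:

```d
import std.algorithm : map;
import std.array : array;
import std.range : hasLength, walkLength;
import std.stdio : writeln;

// Assumption: narrow strings do not advertise a length to range algorithms.
static assert(!hasLength!string);

// Hypothetical wrapper around the expression from the post.
auto bracketed(string s)
{
    // map sees the string as a range of dchar, so c is one code point.
    return s.map!(c => "<"d ~ c ~ ">"d).array;
}

void main()
{
    string a = "ΩΩab";
    auto b1 = bracketed(a);
    writeln(a.length);     // 6 code units
    writeln(a.walkLength); // 4 code points
    writeln(b1.length);    // expected to match walkLength here
    writeln(b1);
}
```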


E)
The fact that there are 2 ways to iterate over strings is confusing:
For example, reading the docs, ForeachType is different from ElementType, and ElementType is special-cased for narrow strings;
foreach(i,ai;a){foo(i,ai);} doesn't behave as for(size_t i;!a.empty;a.popFront,i++) {foo(i,a.front);}
walkLength != length for strings
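
The mismatch described in E can be checked at compile time. A small sketch using `std.traits.ForeachType` and `std.range.ElementType`, with the same example string as above:

```d
import std.range : ElementType, walkLength;
import std.stdio : writeln;
import std.traits : ForeachType;

// foreach without an explicit type iterates code units...
static assert(is(ForeachType!string == immutable(char)));
// ...while the range interface (front/popFront) yields decoded code points.
static assert(is(ElementType!string == dchar));

void main()
{
    string a = "Ωabc";
    writeln(a.length);     // 5 (code units)
    writeln(a.walkLength); // 4 (code points)
}
```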

F)
Why can't we have the following design instead:
* no special case with isNarrowString scattered throughout phobos
* iteration with foreach behaves as iteration with popFront/empty/front, and walkLength == length
* ForeachType == ElementType (ie one is redundant)
* require *explicit user syntax* to construct a range over code points from a string:

struct CodepointRange{
 this(string a){...}
 auto popFront(){}
 auto empty(){}
 auto length(){}//
}

now the user can do:
a.map!foo => will iterate over char
a.CodepointRange.map!foo => will iterate over code points.

Everything seems more orthogonal that way, and the user has a clear understanding of the complexity of each operation.
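
To make the proposal concrete, here is one way such an explicit wrapper could be written today on top of `std.utf.decode` and `std.utf.stride`. This is only a sketch of the idea, not an existing Phobos type: it adds the `front` that the outline above leaves out, and it leaves out `length`, since counting code points is O(n). `std.string.representation` stands in for the "iterate over char" side.

```d
import std.algorithm : equal, map;
import std.array : array;
import std.range : isInputRange, walkLength;
import std.stdio : writeln;
import std.string : representation;
import std.utf : decode, stride;

// Hypothetical explicit code point view over a UTF-8 string.
struct CodepointRange
{
    private string s;

    this(string s) { this.s = s; }

    @property bool empty() { return s.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(s, i);
    }

    // Drop the code units that make up the first code point.
    void popFront() { s = s[stride(s, 0) .. $]; }
}

static assert(isInputRange!CodepointRange);

void main()
{
    string a = "Ωabc";
    writeln(a.representation.length);      // 5 code units
    writeln(CodepointRange(a).walkLength); // 4 code points
    writeln(CodepointRange(a).map!(c => cast(uint) c).array); // [937, 97, 98, 99]
}
```

With this shape, the cost of decoding is visible at the call site, which is the point of the proposal.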

On 05/28/2013 12:26 AM, Timothee Cour wrote:

> In all that follows, I don't want to go through intermediate UTF32
> representation by making a copy of my string, but I want to iterate over
> its code points.

Yes, the whole situation is a little messy. :)

There is also std.range.stride:

foreach (ai; a.stride(1)) {
    // ...
}

If you need the index as well, and do not want to manage it explicitly, one way is to use zip and sequence:

import std.stdio;
import std.range;

void main()
{
    string a="Ωabc";

    foreach (i, ai; zip(sequence!"n", a.stride(1))) {
        write(i,",",ai," ");
    }
}

The output:

0,Ω 1,a 2,b 3,c

Ali
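
A variant of the zip/sequence idea, assuming a Phobos release that ships `std.range.enumerate` (which, as far as I know, is newer than this thread). `indexed` is a hypothetical helper that builds the output as a string so it can be checked:

```d
import std.format : format;
import std.range : enumerate;
import std.stdio : writeln;

// Hypothetical helper: pair each code point with a running counter.
string indexed(string s)
{
    string result;
    // enumerate walks the string as a range of dchar and counts from 0.
    foreach (i, c; s.enumerate)
        result ~= format("%s,%s ", i, c);
    return result;
}

void main()
{
    writeln(indexed("Ωabc")); // 0,Ω 1,a 2,b 3,c
}
```

Here the counter is the code point index, not the byte offset that `foreach (i, dchar c; s)` reports.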

Most algorithms for strings need the offset rather than the character index, so:

foreach (i, dchar c; str)

gives the offset into the string for "i".

If you really need the character index, just count it:

int charIndex = 0;
foreach (dchar c; str)
{
    // ...

    ++charIndex;
}

If strings were treated specially, so that they looked like arrays of dchars but used UTF-8 internally, it would hide all sorts of performance costs. Random access into a UTF-8 string by character index is O(n), whereas indexing by offset is O(1).

If you are using random access by character index heavily, you should therefore convert to a dstring first; then you get O(1) random access.
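
The cost difference can be sketched as follows. `nthCodePoint` is a hypothetical helper that has to decode every earlier code point, while the `dstring` copy can be indexed directly:

```d
import std.conv : to;
import std.stdio : writeln;
import std.utf : decode;

// Hypothetical helper: O(n) lookup by code point index.
// n must be smaller than the number of code points in s.
dchar nthCodePoint(string s, size_t n)
{
    size_t offset = 0;
    foreach (_; 0 .. n)
        decode(s, offset); // skip one code point, advancing offset
    return decode(s, offset);
}

void main()
{
    string a = "Ωabc";
    writeln(a[2]);               // a: O(1), but 2 is a byte offset
    writeln(nthCodePoint(a, 1)); // a: O(n) walk from the start

    dstring d = a.to!dstring;    // one-time O(n) conversion and copy
    writeln(d[1]);               // a: O(1) by code point index
}
```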
