Why the hell doesn't foreach decode strings (page 4)

On 20/10/2011 21:37, Martin Nowak wrote:
> It just took me over one hour to find out the unthinkable.
> foreach(c; str) will deduce c to immutable(char) and doesn't care about
> unicode.
> Now there is so many unicode transcoding happening in the language that
> it starts to get annoying,
> but the most basic string iteration doesn't support it by default?

Maybe I didn't fully get your point, but you do know that you can do the following, right?

    string str = "Ñandú";
    foreach(dchar c; str)
        ...

and it decodes full Unicode characters just fine. Or maybe you are just talking about the auto type inference; just making sure...

OTOH, as others say, it's not rare to iterate over 8-bit units if you're dealing with ASCII, or if you are parsing and looking for operators and separators (which are normally 8-bit). Then you can leave the rest untouched or extract the parts in between without caring whether they are 1 byte per character or several (e.g. when parsing XML, JSON, CSV, INI, conf, D source, etc.).
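A minimal sketch of the two loop forms discussed in this post. The helper names `countCodeUnits` and `countCodePoints` are made up for illustration, and the string is written with escapes so that the counts do not depend on how the source file is encoded.

```d
import std.stdio : writeln;

// Count iterations when foreach infers the element type.
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s) // c is inferred as immutable(char): one UTF-8 code unit
        ++n;
    return n;
}

// Count iterations when the loop variable is declared as dchar.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s) // the loop decodes UTF-8 into code points
        ++n;
    return n;
}

void main()
{
    // "Ñandú" with the two non-ASCII letters written as escapes.
    string str = "\u00D1and\u00FA";

    writeln(countCodeUnits(str));  // 7
    writeln(countCodePoints(str)); // 5
}
```

The two non-ASCII letters take two UTF-8 code units each, which accounts for the difference between 7 and 5.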

On 22.10.2011 12:55, Alvaro wrote:
> On 20/10/2011 21:37, Martin Nowak wrote:
>> It just took me over one hour to find out the unthinkable.
>> foreach(c; str) will deduce c to immutable(char) and doesn't care about
>> unicode.
>> Now there is so many unicode transcoding happening in the language that
>> it starts to get annoying,
>> but the most basic string iteration doesn't support it by default?
>
> Maybe I didn't fully get your point, but, you do know that you can do
> the following, right?
>
> string str = "Ñandú";
> foreach(dchar c; str)
> ...

One visible Unicode character can consist of several dchars. (This is called a "grapheme".)

Cheers,
- Daniel
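A small sketch of that point, assuming a Phobos recent enough to ship `std.uni.byGrapheme` (it postdates this thread). The same text gives three different counts depending on whether code units, code points or graphemes are counted.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT; displayed as a single "é".
    string s = "e\u0301";

    writeln(s.length);                // 3 UTF-8 code units
    writeln(s.walkLength);            // 2 code points (dchars)
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```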

On 2011-10-21 20:38, Walter Bright wrote:
> On 10/21/2011 2:51 AM, Martin Nowak wrote:
>> You have a good point here. I would have immediately thrown out the
>> loop AFTER
>> profiling.
>> What hits me here is that I had an incorrect program with built-in
>> unicode aware
>> strings.
>> This is counterintuitive to correct unicode handling throughout the
>> std library,
>> and even more to the complementary operation of appending any char
>> type to strings.
>
> I understand the issue, but I don't think it's resolvable. It's a lot
> like the signed/unsigned issue. Java got rid of it by simply not having
> any unsigned types.

Can't we implement a new string type that people can choose to use if they want? It would hide all the Unicode details that have been brought up in this thread.

--
/Jacob Carlborg

October 22, 2011

Re: Why the hell doesn't foreach decode strings

Posted by Timon Gehr
in reply to Jacob Carlborg

On 10/22/2011 02:14 PM, Jacob Carlborg wrote:
> On 2011-10-21 20:38, Walter Bright wrote:
>> On 10/21/2011 2:51 AM, Martin Nowak wrote:
>>> You have a good point here. I would have immediately thrown out the
>>> loop AFTER
>>> profiling.
>>> What hits me here is that I had an incorrect program with built-in
>>> unicode aware
>>> strings.
>>> This is counterintuitive to correct unicode handling throughout the
>>> std library,
>>> and even more to the complementary operation of appending any char
>>> type to strings.
>>
>> I understand the issue, but I don't think it's resolvable. It's a lot
>> like the signed/unsigned issue. Java got rid of it by simply not having
>> any unsigned types.
>
> Can't we implement a new string type that people can choose to use if
> they want. It will hide all the Unicode details that has been brought up
> by this thread.
>

Having multiple standard string types is bad. Furthermore, it is hard to meaningfully hide all the Unicode details. Not even immutable(dchar)[] necessarily encodes one character as one code unit.
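To illustrate the last sentence, here is a hedged sketch using a UTF-32 string. It assumes `std.uni.normalize`, which was added to Phobos after this thread. Composition happens to help for this particular input, but not every base-plus-combining-mark sequence has a precomposed form, so normalization is not a general fix.

```d
import std.stdio : writeln;
import std.uni;

void main()
{
    // UTF-32: every array element is a full code point.
    dstring decomposed = "e\u0301"d; // 'e' + combining acute accent

    // Two elements, although the text displays as one character.
    writeln(decomposed.length); // 2

    // NFC composes this pair into the single code point U+00E9.
    writeln(normalize!NFC(decomposed).length); // 1
}
```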

Walter Bright <newshound2@digitalmars.com> wrote:
> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>> Which operations do you believe would be less efficient?
>
> All of the ones that don't require decoding, such as searching, would be less efficient if decoding was done.

You can use std.algorithm.find to do searching, not foreach. The former can choose whichever method is most efficient.

On 10/22/2011 09:37 PM, kennytm wrote:
> Walter Bright<newshound2@digitalmars.com> wrote:
>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>> Which operations do you believe would be less efficient?
>>
>> All of the ones that don't require decoding, such as searching, would be
>> less efficient if decoding was done.
>
> You can std.algorithm.find to do searching, not foreach. The former can
> decide whichever efficient method to use.

Afaics the current std.algorithm.find implementation decodes its arguments.

On 10/22/11 3:05 PM, Timon Gehr wrote:
> On 10/22/2011 09:37 PM, kennytm wrote:
>> Walter Bright<newshound2@digitalmars.com> wrote:
>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>> Which operations do you believe would be less efficient?
>>>
>>> All of the ones that don't require decoding, such as searching, would be
>>> less efficient if decoding was done.
>>
>> You can std.algorithm.find to do searching, not foreach. The former can
>> decide whichever efficient method to use.
>
> Afaics the current std.algorithm.find implementation decodes its arguments.

That can be easily fixed. Currently single-element find does decoding but substring find avoids it if possible:

https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819

Andrei
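A sketch of a search that skips decoding altogether. It does not show how `find` is implemented in Phobos. It only assumes `std.string.representation` to view the string as raw bytes, and it relies on a property of UTF-8: an ASCII byte never occurs inside a multi-byte sequence, so cutting at its index cannot split a character.

```d
import std.algorithm : countUntil, find;
import std.stdio : writeln;
import std.string : representation;

void main()
{
    string line = "\u00D1and\u00FA,emu"; // "Ñandú,emu"

    // immutable(ubyte)[] view of the same memory: no UTF-8 decoding.
    auto bytes = line.representation;
    auto comma = bytes.countUntil(cast(ubyte) ',');

    // The separator is found here, so the byte index is safe to slice with.
    writeln(comma);                // 7
    writeln(line[0 .. comma]);     // the field before the comma
    writeln(line[comma + 1 .. $]); // emu

    // Substring search through the library, as suggested in the thread.
    writeln(line.find("emu"));     // emu
}
```

Real code would check for a result of -1 from `countUntil` before slicing.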

On 10/22/2011 10:42 PM, Andrei Alexandrescu wrote:
> On 10/22/11 3:05 PM, Timon Gehr wrote:
>> On 10/22/2011 09:37 PM, kennytm wrote:
>>> Walter Bright<newshound2@digitalmars.com> wrote:
>>>> On 10/22/2011 2:21 AM, Peter Alexander wrote:
>>>>> Which operations do you believe would be less efficient?
>>>>
>>>> All of the ones that don't require decoding, such as searching,
>>>> would be
>>>> less efficient if decoding was done.
>>>
>>> You can std.algorithm.find to do searching, not foreach. The former can
>>> decide whichever efficient method to use.
>>
>> Afaics the current std.algorithm.find implementation decodes its
>> arguments.
>
> That can be easily fixed. Currently single-element find does decoding
> but substring find avoids it if possible:
>
> https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L2819
>

Ok, I actually did not see that, thanks. However, this is usually still not the most efficient implementation in case the first argument is string and the other is wstring/dstring.

On 10/22/11 5:32 PM, Timon Gehr wrote:
> Ok, I actually did not see that, thanks. However this is usually still
> not the most efficient implementation in case the first argument is
> string and the other is wstring/dstring.

I understand. For some reason I'm not seeing the URL of your pull request fixing that :o).

Andrei

On Fri, 21 Oct 2011 14:39:58 -0400, Walter Bright <newshound2@digitalmars.com> wrote:
> On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
>>> Making such a string type would be terribly inefficient. It would make D
>>> completely uncompetitive for processing strings.
>>
>> I don't think it would. Do you have any proof to support this?
>
> I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.

What I mean is, default to a well-built string type, and let people who want to deal with arrays of code-units deal with arrays of code-units. This schizophrenic view Phobos has of char[] arrays as not being arrays is horrendous to work with.

For my usage, I almost never iterate over string characters or graphemes, I just pass strings.

-Steve
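The split view of char[] described above can be made concrete with the range traits in `std.range`. This is a sketch that assumes the auto-decoding behaviour under discussion is still in effect in the Phobos version being compiled against.

```d
import std.range : ElementEncodingType, ElementType, hasLength,
    isRandomAccessRange;

// As a built-in array, a string has a length and indexes by code unit.
static assert(is(typeof(string.init.length) == size_t));
static assert(is(typeof(string.init[0]) == immutable(char)));

// As a Phobos range, its elements are decoded dchars...
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));

// ...so the range traits report no length and no random access,
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);

// unlike arrays of other element types.
static assert(hasLength!(int[]));
static assert(isRandomAccessRange!(int[]));

void main() {}
```

All checks run at compile time, so the program compiles only if every statement about the two views holds.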
