Why the hell doesn't foreach decode strings
October 20, 2011
It just took me over an hour to find out the unthinkable.
foreach (c; str) deduces c as immutable(char) and doesn't care about Unicode.
There is now so much Unicode transcoding happening in the language that it starts to get annoying,
but the most basic string iteration doesn't support it by default?
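
A minimal sketch of the behavior described above may help readers following along. The sample string and the helper names countUnits and countPoints are assumptions made for illustration, not part of the original post:

```d
// Hypothetical helper: counts iterations when the loop variable type is deduced.
size_t countUnits(string str)
{
    size_t n;
    foreach (c; str) // c is deduced from the element type of the array
    {
        static assert(c.sizeof == 1); // a single UTF-8 code unit
        ++n;
    }
    return n;
}

// Hypothetical helper: counts iterations when dchar is requested explicitly.
size_t countPoints(string str)
{
    size_t n;
    foreach (dchar c; str) // an explicit dchar makes foreach decode
        ++n;
    return n;
}

void main()
{
    string str = "caf\u00E9"; // U+00E9 takes two UTF-8 code units
    assert(countUnits(str) == 5);  // 3 ASCII units + 2 units for U+00E9
    assert(countPoints(str) == 4); // 4 code points
}
```

With ASCII-only test data both counts agree, which is presumably why this kind of bug can go unnoticed for a long time.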
October 20, 2011
On 20/10/11 8:37 PM, Martin Nowak wrote:
> It just took me over an hour to find out the unthinkable.
> foreach (c; str) deduces c as immutable(char) and doesn't care about
> Unicode.
> There is now so much Unicode transcoding happening in the language that
> it starts to get annoying,
> but the most basic string iteration doesn't support it by default?

D has got itself into a tricky situation in this regard. Doing it either way introduces an unintuitive mess.

The way it is now, you get the problem that you just described where foreach is unaware of Unicode.

If you changed it to decode code points by default, the indices would no longer match up:

immutable(int)[] a = ...
foreach (i, x; a)
    assert(x == a[i]); // ok

immutable(char)[] b = ...
foreach (i, x; b)
    assert(x == b[i]); // not necessarily!

Also, the loop won't necessarily iterate b.length times. There are inconsistencies all over the place.

The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.
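
The mismatch can already be observed today by requesting dchar explicitly. The following is a sketch under stated assumptions: the helper decodedIterations and the sample string are hypothetical, chosen only to illustrate the point made in this post.

```d
// Hypothetical helper: iterates b as code points, returns the number of
// iterations, and reports how often the decoded value differed from b[i].
size_t decodedIterations(string b, out size_t mismatches)
{
    size_t iterations;
    foreach (i, dchar x; b) // i is the code-unit offset where x starts
    {
        if (x != b[i])
            ++mismatches;
        ++iterations;
    }
    return iterations;
}

void main()
{
    immutable(char)[] b = "a\u00E9z"; // 'a', two units for U+00E9, 'z'
    size_t mismatches;
    immutable iterations = decodedIterations(b, mismatches);

    assert(b.length == 4);
    assert(iterations == 3); // fewer iterations than b.length
    assert(mismatches == 1); // the non-ASCII code point differs from b[i]
}
```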
October 21, 2011
On 10/20/2011 2:49 PM, Peter Alexander wrote:
> The whole mess is caused by conflating the idea of an array with a variable
> length encoding that happens to use an array for storage. I don't believe there
> is any clean and tidy way to fix the problem without breaking compatibility.

There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.

There is no way to program strings in D without being aware of UTF-8 encoding.
October 21, 2011
Walter Bright wrote:
> Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars.

Well, they could always cast it to a ubyte[].

But on the other hand, you can just ask for dchar now.

So yeah.
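
Both options mentioned in this post can be sketched roughly as follows. The helper name asBytes is an assumption for illustration; the cast itself only reinterprets the same memory and does not copy or decode.

```d
// Hypothetical helper: view a string's storage as raw bytes.
immutable(ubyte)[] asBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "\u00E9"; // U+00E9 is encoded in UTF-8 as 0xC3 0xA9

    // Option 1: treat the data as plain 8-bit values.
    auto bytes = asBytes(s);
    assert(bytes.length == 2);
    assert(bytes[0] == 0xC3 && bytes[1] == 0xA9);

    // Option 2: ask foreach for dchar to get decoded code points.
    foreach (dchar c; s)
        assert(c == 0xE9);
}
```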
October 21, 2011
On Thursday, October 20, 2011 19:26:43 Walter Bright wrote:
> On 10/20/2011 2:49 PM, Peter Alexander wrote:
> > The whole mess is caused by conflating the idea of an array with a variable length encoding that happens to use an array for storage. I don't believe there is any clean and tidy way to fix the problem without breaking compatibility.
> There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8 bit characters, and sometimes as 20 bit dchars. Someone will be dissatisfied no matter what.
> 
> There is no way to program strings in D without being aware of UTF-8 encoding.

True, but if the default were dchar, then the common case would have fewer bugs (still allowing you to explicitly use char or wchar when you want to). At minimum, I think that it would be a good idea to implement http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not to explicitly give the type with foreach for arrays of char or wchar. It would catch bugs without changing the behavior of any existing code, and it still allows you to iterate over either code units or code points.
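
To make the choice concrete, here is a small sketch of what giving the type explicitly looks like for each character type. The template countAs and the sample code point are assumptions for illustration only:

```d
// Hypothetical helper: counts foreach iterations over s when the loop
// variable is explicitly typed as T (char, wchar, or dchar).
size_t countAs(T)(string s)
{
    size_t n;
    foreach (T c; s)
        ++n;
    return n;
}

void main()
{
    string clef = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, outside the BMP

    assert(countAs!char(clef) == 4);  // UTF-8 code units
    assert(countAs!wchar(clef) == 2); // UTF-16 code units (a surrogate pair)
    assert(countAs!dchar(clef) == 1); // one code point
}
```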

- Jonathan M Davis
October 21, 2011
On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
> True, but if the default were dchar, then the common case would have fewer
> bugs

Is that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.

> (still allowing you to explicitly use char or wchar when you want to). At
> minimum, I think that it would be a good idea to implement
> http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning not
> to explicitly give the type with foreach for arrays of char or wchar. It would
> catch bugs without changing the behavior of any existing code, and it still
> allows you to iterate over either code units or code points.

I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T.

I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.
October 21, 2011
On Thursday, October 20, 2011 20:37:40 Walter Bright wrote:
> On 10/20/2011 7:37 PM, Jonathan M Davis wrote:
> > True, but if the default were dchar, then the common case would have fewer bugs
> 
> Is that really the common case? It's certainly the *slow* case. Common string operations like searching, copying, etc., do not require decoding.

And why would you iterate over a string with foreach without decoding it unless you specifically need to operate on code units (which I would expect to be _very_ rare)? Sure, copying doesn't require decoding, but searching sure does (unless you're specifically looking for a code unit rather than a code point, which would not be normal). Most anything which needs to operate on the characters of a string needs to decode them. And iterating over them to do much of anything would require decoding, since otherwise you're operating on code units, and how often does anyone do that unless they're specifically messing around with character encodings?

Sure, if you _know_ that you're dealing with a string containing only ASCII, it's faster to just iterate over chars, and in that case you can explicitly give the type of the foreach variable as char. But what people normally care about is iterating over characters, not pieces of characters, so I would expect the case where people _want_ to iterate over chars to be rare. In most cases, iterating over a string as chars is a bug, and one which often won't be caught quickly, because the programmer is English speaking and uses almost exclusively ASCII for whatever testing they do.

The default for string handling really should be to treat them as ranges of dchar but still make it easy for them to be treated as arrays of code units when necessary. There's plenty of code in Phobos which is able to special case strings and make operating on them more efficient when it's not necessary to operate on them as ranges of dchar or when decoding the string explicitly with functions such as stride. But the default is still to operate on them as ranges of dchar, because that is what is normally correct. Defaulting to the guaranteed correct handling of characters and special casing when it's possible to write code more efficiently than that is definitely the way to go about it, and it's how Phobos generally does it. The fact that foreach doesn't is incongruous with how strings are handled in most other cases.
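
The dual view described here can be sketched as follows, assuming a Phobos version in which narrow strings are auto-decoded by the range primitives. The helper name codePoints and the sample string are hypothetical:

```d
import std.range.primitives : front, walkLength;
import std.utf : stride;

// Hypothetical helper: number of code points, via the range interface.
size_t codePoints(string s)
{
    return s.walkLength;
}

void main()
{
    string s = "\u00E9a"; // U+00E9 (two code units) followed by 'a'

    // As an array: .length and indexing work on UTF-8 code units.
    assert(s.length == 3);
    static assert(is(typeof(s[0]) == immutable(char)));

    // As a range: front decodes, so the element type is dchar.
    static assert(is(typeof(s.front) == dchar));
    assert(s.front == 0xE9);
    assert(codePoints(s) == 2);

    // stride reports how many code units the sequence at an index occupies,
    // which lets code skip over characters without decoding them.
    assert(stride(s, 0) == 2);
    assert(stride(s, 2) == 1);
}
```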

> > (still allowing you to explicitly use char or wchar when you want to).
> > At
> > minimum, I think that it would be a good idea to implement
> > http://d.puremagic.com/issues/show_bug.cgi?id=6652 and make it a warning
> > not to explicitly give the type with foreach for arrays of char or
> > wchar. It would catch bugs without changing the behavior of any
> > existing code, and it still allows you to iterate over either code
> > units or code points.
> 
> I like the type deduction feature of foreach, and don't think it should be removed for strings. Currently, it's consistent - T[] gets an element type of T.

Sure, the type deduction of foreach is great, and it's completely consistent that iterating over an array of chars would iterate over chars rather than dchars when you don't give the type. However, in most cases, that is _not_ what the programmer actually wants. They want to iterate over characters, not pieces of characters. So, the default at this point is _wrong_ in the common case. As such, I'm very leery of any code which uses foreach over a string without specifying the iteration type. And in fact, unless the code is clearly intended to operate on code units, I would expect a comment indicating that the use of char instead of dchar was intentional, or I'd still consider it likely that it's a bug and a mistake on the programmer's part (likely due to a misunderstanding of unicode and how D deals with it).

> I want to reiterate that there's no way to program strings in D without being cognizant of them being a multibyte representation. D is both a high level and a low level language, and you can pick which to use, but you still gotta pick.

I fully agree that programmers need to properly understand unicode to use strings in D properly. However, the problem is that the default handling of strings with foreach is _not_ what programmers are going to normally want, so the default will cause bugs. If strings defaulted to iterating as ranges of dchar, or if programmers had to say what type they wanted to iterate over when dealing with strings (or at least got a warning if they didn't), then there would be fewer bugs. Pretty much every time that the use of strings with foreach comes up on this list, most everyone agrees that it's a wart in the language that the default is to iterate over chars rather than dchars. Not everyone agrees on the best way to fix the problem, but most everyone agrees that it _is_ a problem.

- Jonathan M Davis
October 21, 2011
Actually, I'd have to say that having foreach default to iterating over char for string is a bit like if it defaulted to iterating over byte for int[]. Sure, it would work in some cases, but in the general case, it would be very wrong.

Yes, it's consistent with how arrays are normally iterated over to have foreach iterate over char for char[], which is why it's arguably not a good idea to make it default to dchar. But it's _wrong_ in most cases, so at least giving a warning when the programmer doesn't give an iteration type for foreach with an array of char or wchar would be a big help.

It's this very problem that leads some people to argue that string should be its own type which holds an array of code units (which can be accessed when needed) rather than doing what we do now where we try and treat a string as both an array of chars and a range of dchars. The result is schizophrenic.

- Jonathan M Davis
October 21, 2011
On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
> And why would you iterate over a string with foreach without decoding it
> unless you specifically need to operate on code units (which I would expect to
> be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
> does

No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching.

Even more, 99.9999% of searches involve an ASCII search string. It is simply not necessary to decode the searched string, because the code units of a multibyte UTF-8 sequence are never in the ASCII range. For example:

   foreach (c; somestring)
       if (c == '+')
           break; // found it!

gains absolutely nothing by decoding somestring.
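
The substring approach for a non-ASCII dchar might look roughly like this. It is a sketch, not code from Phobos or dmd: the helper findCodePoint is a hypothetical name, and it relies only on std.utf.encode plus plain slice comparison.

```d
import std.utf : encode;

// Hypothetical helper: returns the code-unit offset of needle in haystack,
// or -1 if it is absent. The haystack is never decoded.
ptrdiff_t findCodePoint(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // encode the needle to UTF-8 once
    const(char)[] pattern = buf[0 .. len];

    if (haystack.length < len)
        return -1;

    foreach (i; 0 .. haystack.length - len + 1)
    {
        // Slice equality compares raw code units.
        if (haystack[i .. i + len] == pattern)
            return i;
    }
    return -1;
}

void main()
{
    assert(findCodePoint("a+b", '+') == 1);
    assert(findCodePoint("na\u00EFve", '\u00EF') == 2);
    assert(findCodePoint("abc", '\u00E9') == -1);
}
```

This works because UTF-8 is self-synchronizing: the encoded form of one code point cannot appear in the middle of another, so a match on code units is a match on the code point.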


> (unless you're specifically looking for a code unit rather than a code
> point, which would not be normal). Most anything which needs to operate on the
> characters of a string needs to decode them. And iterating over them to do
> much of anything would require decoding, since otherwise you're operating on
> code units, and how often does anyone do that unless they're specifically
> messing around with character encodings?

What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.


> However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters. So, the default at this point is _wrong_ in the common
> case.

This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win.

Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.
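
As an illustration of the general idea only (this is not taken from dmd's lexer), a scanner can treat ASCII code units directly and fall back to decoding just for bytes at or above 0x80. The function countLetters is a hypothetical example:

```d
import std.ascii : isAsciiAlpha = isAlpha;
import std.uni : isUniAlpha = isAlpha;
import std.utf : decode;

// Hypothetical scanner: counts alphabetic characters, decoding only when a
// non-ASCII code unit is encountered.
size_t countLetters(string src)
{
    size_t n;
    size_t i = 0;
    while (i < src.length)
    {
        immutable char c = src[i];
        if (c < 0x80)
        {
            // Fast path: an ASCII code unit is a complete character.
            if (isAsciiAlpha(c))
                ++n;
            ++i;
        }
        else
        {
            // Slow path: decode advances i past the whole sequence.
            immutable dchar d = decode(src, i);
            if (isUniAlpha(d))
                ++n;
        }
    }
    return n;
}

void main()
{
    assert(countLetters("a1 \u00E9+") == 2); // 'a' and U+00E9
    assert(countLetters("x = y + 1;") == 2); // pure ASCII never hits decode
}
```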

October 21, 2011
On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
> It's this very problem that leads some people to argue that string should be
> its own type which holds an array of code units (which can be accessed when
> needed) rather than doing what we do now where we try and treat a string as
> both an array of chars and a range of dchars. The result is schizophrenic.

Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.