October 21, 2011
On 2011-10-21 03:58:50 +0000, Jonathan M Davis <jmdavisProg@gmx.com> said:

> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars

It works for non-ASCII too. You're probably missing an interesting property of UTF encodings: if you want to search for a substring in a well-formed UTF sequence, you do not need to decode the larger string; comparing the UTF-x code units of the substring with the UTF-x code units of the larger string is sufficient.

Similarly, if you're searching for the 'ê' code point in a UTF-8 string, the most efficient way is to search the string for the two-byte UTF-8 sequence you would use to encode 'ê' (in other words, convert 'ê' to a string first). Decoding the whole string is wasteful.
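A minimal sketch of that idea in D, using std.string.indexOf (which compares code units and never decodes the haystack); the haystack string here is just an illustrative example:

```d
import std.conv : to;
import std.string : indexOf;

void main()
{
    // Hypothetical haystack. indexOf compares code units directly,
    // so the haystack is never decoded.
    string haystack = "la fenêtre";

    // Encode the code point once, up front: 'ê' becomes a
    // two-code-unit UTF-8 sequence.
    string needle = 'ê'.to!string;
    assert(needle.length == 2);

    // The result is a code-unit index, not a code-point index.
    assert(haystack.indexOf(needle) == 6);
}
```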


> Sure, if you _know_ that you're dealing with a string with only ASCII, it's
> faster to just iterate over chars, but then you can explicitly give the type
> of the foreach variable as char, but normally what people care about is
> iterating over characters, not pieces of characters.

If you want to iterate over what people consider characters, then you need to take into account combining marks that form multi-code-point graphemes. (You'll probably want to deal with Unicode normalization too.) Treating code points as if they were characters is a misconception in the same way that treating UTF-16 code units as characters is: both work most of the time but fail in a number of cases.
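A small illustration of the three levels, using std.uni.byGrapheme (which, for what it's worth, only landed in Phobos after this thread):

```d
import std.range : walkLength;
import std.uni : byGrapheme; // note: added to Phobos later than this thread

void main()
{
    // "é" written as 'e' plus the combining acute accent U+0301:
    // one character to the reader, two code points, three code units.
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (decoded dchars)
    assert(s.byGrapheme.walkLength == 1); // user-perceived characters
}
```

So iterating by dchar still splits this "character" in two, exactly as iterating by char splits 'é' into two code units.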


> So, I would expect the
> case where people _want_ to iterate over chars to be rare. In most cases,
> iterating over a string as chars is a bug - one which in many cases won't be
> quickly caught, because the programmer is English speaking and uses almost
> exclusively ASCII for whatever testing that they do.

That's a real problem. But is treating everything as dchar the only solution to that problem?


> Defaulting to the
> guaranteed correct handling of characters and special casing when it's
> possible to write code more efficiently than that is definitely the way to go
> about it, and it's how Phobos generally does it.

Iterating on dchar is not guaranteed to be correct; it merely has a significantly better chance of being correct.


> The fact that foreach doesn't
> is incongruous with how strings are handled in most other cases.

You could also argue that ranges are doing things the wrong way.


>> I like the type deduction feature of foreach, and don't think it should be
>> removed for strings. Currently, it's consistent - T[] gets an element type
>> of T.
> 
> Sure, the type deduction of foreach is great, and it's completely consistent
> that iterating over an array of chars would iterate over chars rather than
> dchars when you don't give the type. However, in most cases, that is _not_
> what the programmer actually wants. They want to iterate over characters, not
> pieces of characters.

I note that you keep confusing characters with code units.

>> I want to reiterate that there's no way to program strings in D without
>> being cognizant of them being a multibyte representation. D is both a high
>> level and a low level language, and you can pick which to use, but you
>> still gotta pick.
> 
> I fully agree that programmers need to properly understand unicode to use
> strings in D properly. However, the problem is that the default handling of
> strings with foreach is _not_ what programmers are going to normally want, so
> the default will cause bugs.

That said, I wouldn't expect most programmers to understand Unicode. Giving them dchars by default won't eliminate bugs related to multi-code-point characters, but it will likely eliminate bugs related to multi-code-unit sequences. That could be a good start. I'd say choosing dchar is a practical compromise between "characters by default" and "the type of the array by default", but it is neither of those ideals. How is that pragmatic trade-off going to fare a few years in the future? I'm a little skeptical that this is the ideal solution.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

October 21, 2011
On 2011-10-20 23:49, Peter Alexander wrote:
> On 20/10/11 8:37 PM, Martin Nowak wrote:
>> It just took me over one hour to find out the unthinkable.
>> foreach(c; str) will deduce c to immutable(char) and doesn't care about
>> unicode.
> Now there is so much unicode transcoding happening in the language that
> it starts to get annoying,
>> but the most basic string iteration doesn't support it by default?
>
> D has got itself into a tricky situation in this regard. Doing it either
> way introduces an unintuitive mess.
>
> The way it is now, you get the problem that you just described where
> foreach is unaware of Unicode.
>
> If you changed it to loop as Unicode, then indices won't match up:
>
> immutable(int)[] a = ...
> foreach (i, x; a)
> assert(x == a[i]); // ok
>
> immutable(char)[] b = ...
> foreach (i, x; b)
> assert(x == b[i]); // not necessarily!

The index could skip certain positions to make that assert pass. But that would be confusing as well.
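D in fact already does this when the loop variable is explicitly typed as dchar: the index is the code-unit offset at which each code point starts, so it skips. A quick sketch:

```d
void main()
{
    string s = "aéz"; // 'é' occupies two UTF-8 code units

    size_t[] indices;
    dchar[] chars;

    // With an explicit dchar loop variable, foreach decodes, and the
    // index is the code-unit offset where each code point begins.
    foreach (i, dchar c; s)
    {
        indices ~= i;
        chars ~= c;
    }

    assert(indices == [0, 1, 3]); // index 2 is skipped
    assert(chars == "aéz"d);
}
```

So `s[indices[n]]` is the first code unit of the n-th code point, which is arguably the least confusing of the skipping schemes.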

-- 
/Jacob Carlborg
October 21, 2011
On Thu, 20 Oct 2011 22:37:56 +0300, Martin Nowak <dawg@dawgfoto.de> wrote:

> It just took me over one hour to find out the unthinkable.
> foreach(c; str) will deduce c to immutable(char) and doesn't care about unicode.
> Now there is so much unicode transcoding happening in the language that it starts to get annoying,
> but the most basic string iteration doesn't support it by default?

I really can't see why people expect that.
By definition, a string is an array of chars, and this way it is consistent.
All we need is:

foreach (c; whatever(str))

instead of:

foreach (c; str)
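In a sense that opt-in already exists without any wrapper: giving the loop variable an explicit dchar type makes foreach decode, while the default deduction walks raw code units. A sketch (the string is just an example):

```d
void main()
{
    string str = "año"; // 'ñ' is two UTF-8 code units

    // Default deduction: c is immutable(char), one code unit at a time.
    size_t units;
    foreach (c; str)
    {
        static assert(is(typeof(c) == immutable(char)));
        ++units;
    }
    assert(units == 4);

    // Explicit dchar: foreach decodes, one code point at a time.
    size_t points;
    foreach (dchar c; str)
        ++points;
    assert(points == 3);
}
```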
October 21, 2011
On Fri, 21 Oct 2011 06:39:56 +0200, Walter Bright <newshound2@digitalmars.com> wrote:

> On 10/20/2011 8:58 PM, Jonathan M Davis wrote:
>> And why would you iterate over a string with foreach without decoding it
>> unless you specifically need to operate on code units (which I would expect to
>> be _very_ rare)? Sure, copying doesn't require decoding, but searching sure
>> does
>
> No, it doesn't. If I'm searching for a dchar, I'll be searching for a substring in the UTF-8 string. It's far, FAR more efficient to search as a substring rather than decoding while searching.
>
> Even more, 99.9999% of searches involve an ascii search string. It is simply not necessary to decode the searched string, as encoded chars cannot be ascii. For example:
>
>     foreach (c; somestring)
>           if (c == '+')
> 		found it!
>
> gains absolutely nothing by decoding somestring.
>
>
>> (unless you're specifically looking for a code unit rather than a code
>> point, which would not be normal). Most anything which needs to operate on the
>> characters of a string needs to decode them. And iterating over them to do
>> much of anything would require decoding, since otherwise you're operating on
>> code units, and how often does anyone do that unless they're specifically
>> messing around with character encodings?
>
> What you write sounds intuitively correct, but in my experience writing Unicode processing code, it simply isn't true. One rarely needs to decode.
>
>
>> However, in most cases, that is _not_
>> what the programmer actually wants. They want to iterate over characters, not
>> pieces of characters. So, the default at this point is _wrong_ in the common
>> case.
>
> This is simply not my experience when working with Unicode. Performance takes a big hit when one structures an algorithm to require decoding/encoding. Doing the algorithm using substrings is a huge win.
>
> Take a look at dmd's lexer, it handles Unicode correctly and avoids doing decoding as much as possible.
>
You have a good point here. I would have immediately thrown out the loop AFTER profiling.
What hits me here is that I had an incorrect program despite built-in unicode-aware strings.
This is counterintuitive given the correct unicode handling throughout the std library,
and even more so given the complementary operation of appending any char type to strings.

martin
October 21, 2011
On Fri, 21 Oct 2011 00:43:20 -0400, Walter Bright <newshound2@digitalmars.com> wrote:

> On 10/20/2011 9:06 PM, Jonathan M Davis wrote:
>> It's this very problem that leads some people to argue that string should be
>> its own type which holds an array of code units (which can be accessed when
>> needed) rather than doing what we do now where we try and treat a string as
>> both an array of chars and a range of dchars. The result is schizophrenic.
>
> Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings.

I don't think it would.  Do you have any proof to support this?

-Steve
October 21, 2011
On 21/10/11 3:26 AM, Walter Bright wrote:
> On 10/20/2011 2:49 PM, Peter Alexander wrote:
>> The whole mess is caused by conflating the idea of an array with a
>> variable
>> length encoding that happens to use an array for storage. I don't
>> believe there
>> is any clean and tidy way to fix the problem without breaking
>> compatibility.
>
> There is no 'fixing' it, even to break compatibility. Sometimes you want
> to look at an array of utf8 as 8-bit chars, and sometimes as 21-bit
> dchars. Someone will be dissatisfied no matter what.

Then separate those ways of viewing strings.

Here's one solution that I believe would satisfy everyone:

1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T.

2. Add string, wstring and dstring structs with the following interface:

 a. foreach should iterate as dchar.
 b. @property front() would be dchar.
 c. @property length() would not exist.
 d. @property buffer() returns the underlying immutable array of char, wchar etc.
 e. Remove opIndex and co.

What this does:
- Makes all array types consistent and intuitive.
- Makes looping over strings do the expected thing.
- Provides an interface to the underlying 8-bit chars for those that want it.


Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.
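A rough sketch of what such a struct might look like (all names here are illustrative, not an actual or proposed Phobos API), built on std.utf.decode:

```d
import std.utf : decode;

// Sketch of the proposed string struct: a range of decoded code
// points over an immutable code-unit buffer, with no length and
// no indexing.
struct String
{
    private immutable(char)[] data;

    // d. access to the underlying code units
    @property immutable(char)[] buffer() { return data; }

    // a/b. input-range interface yielding dchar
    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    void popFront()
    {
        size_t i = 0;
        decode(data, i); // advances i past one code point
        data = data[i .. $];
    }

    // c/e. deliberately no length and no opIndex
}

void main()
{
    auto s = String("héllo");

    dchar[] got;
    foreach (c; s) // iterates as dchar via the range interface
        got ~= c;

    assert(got == "héllo"d);
    assert(s.buffer.length == 6); // 'é' takes two code units
}
```

Because it is an input range of dchar, foreach and the range algorithms get the decoded view, while `.buffer` hands the code units to anyone who wants to work at the low level.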
October 21, 2011
On 10/21/2011 2:51 AM, Martin Nowak wrote:
> You have a good point here. I would have immediately thrown out the loop AFTER
> profiling.
> What hits me here is that I had an incorrect program with built-in unicode aware
> strings.
> This is counterintuitive to correct unicode handling throughout the std library,
> and even more to the complementary operation of appending any char type to strings.

I understand the issue, but I don't think it's resolvable. It's a lot like the signed/unsigned issue. Java got rid of it by simply not having any unsigned types.
October 21, 2011
On Friday, October 21, 2011 11:11 Peter Alexander wrote:
> On 21/10/11 3:26 AM, Walter Bright wrote:
> > On 10/20/2011 2:49 PM, Peter Alexander wrote:
> >> The whole mess is caused by conflating the idea of an array with a
> >> variable
> >> length encoding that happens to use an array for storage. I don't
> >> believe there
> >> is any clean and tidy way to fix the problem without breaking
> >> compatibility.
> > 
> > There is no 'fixing' it, even to break compatibility. Sometimes you want to look at an array of utf8 as 8-bit chars, and sometimes as 21-bit dchars. Someone will be dissatisfied no matter what.
> 
> Then separate those ways of viewing strings.
> 
> Here's one solution that I believe would satisfy everyone:
> 
> 1. Remove the string, wstring and dstring aliases. An array of char should be an array of char, i.e. the same as array of byte. Same for arrays of wchar and dchar. This way, arrays of T have no subtle differences for certain kinds of T.
> 
> 2. Add string, wstring and dstring structs with the following interface:
> 
> a. foreach should iterate as dchar.
> b. @property front() would be dchar.
> c. @property length() would not exist.
> d. @property buffer() returns the underlying immutable array of char,
> wchar etc.
> e. Remove opIndex and co.
> 
> What this does:
> - Makes all array types consistent and intuitive.
> - Makes looping over strings do the expected thing.
> - Provides an interface to the underlying 8-bit chars for those that
> want it.
> 
> 
> Of course, people will still need to understand UTF-8. I don't think that's a problem. It's unreasonable to expect the language to do the thinking for you. The problem is that we have people that *do* understand UTF-8 (like the OP), but *don't* understand D's strings.

In another post in this thread, Walter said in reference to post on essentially this idea: "Making such a string type would be terribly inefficient. It would make D completely uncompetitive for processing strings." Now, whether that's true is debatable, but that's his stance on the idea.

- Jonathan M Davis
October 21, 2011
On 10/20/2011 09:37 PM, Martin Nowak wrote:
> It just took me over one hour to find out the unthinkable.
> foreach(c; str) will deduce c to immutable(char) and doesn't care about
> unicode.
> Now there is so much unicode transcoding happening in the language

In the standard library. Not in the language.

> that it starts to get annoying,
> but the most basic string iteration doesn't support it by default?

I actually like the current behaviour.
October 21, 2011
On 10/21/2011 4:14 AM, Steven Schveighoffer wrote:
>> Making such a string type would be terribly inefficient. It would make D
>> completely uncompetitive for processing strings.
>
> I don't think it would. Do you have any proof to support this?

I've done string processing code, and done a lot of profiling of them. Every cycle is critical, and decoding adds a *lot* of cycles.