December 31, 2011
On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
> On 12/30/11 6:07 PM, Timon Gehr wrote:
>> alias std.string.representation raw;
>
> I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).

>
> But the main point is that presence of representation/raw is not the
> issue.
> The availability of good-for-nothing .length and operator[] is
> the issue. Putting in place the convention of using .raw is hardly
> useful within the context.
>

D strings are arrays. An array without .length and operator[] is close to being good for nothing. The language specification is quite clear about the fact that, e.g., char is not a character but a UTF-8 code unit. Therefore char[] is an array of code units: .length gives the number of code units, and indexing with [i] gives the i-th code unit. There is nothing wrong or good-for-nothing about that.

.raw would return ubyte[], and would therefore lose all type information. Effectively, what .raw does is a type cast that lets string data alias with integral data.

Consider:

import std.stdio;

// Assume .raw is the proposed alias for std.string.representation,
// yielding a ubyte[] view of the same memory.
void foo(ubyte[] b) in { assert(b.length); } body {
    b[0] = 2; // perfectly fine: it's just a ubyte[]
}

void main() {
    char[] s = "☺".dup; // three UTF-8 code units
    auto b = s.raw;
    foo(b);             // clobbers the first code unit of the sequence
    writeln(s);         // oops... s is no longer valid UTF-8
}

I fail to understand why that is desirable.
December 31, 2011
On 2011-12-30 23:00:49 +0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> said:

> Using .raw is /optimal/ because it states the assumption appropriately. The user knows '$' cannot be in the prefix of any other symbol, so she can state the byte alone is the character. If that were a non-ASCII character, the assumption wouldn't have worked.
> 
> So yeah, UTF-8 is great. But it is not miraculous. We need .raw.

After reading most of the thread, it seems to me like you're deconstructing strings as arrays one piece at a time, to the point where instead of arrays we'd basically get a string struct and do things on it. Maybe it's part of a grand scheme, more likely it's one realization after another leading to one change after another… let's see where all this will lead us:

0. in the beginning, strings were char[] arrays
1. arrays are generalized as ranges
2. Phobos starts treating char arrays as bidirectional ranges of dchar (instead of random-access ranges of char)
3. foreach on char[] should iterate over dchar by default
4. remove .length, random access, and slicing from char arrays
5. replace char[] with a struct { ubyte[] raw; }

Number 1 is great by itself, no debate there. Number 2 is debatable. Number 3 and 4 are somewhat required for consistency with number 2. Number 5 is just the logical conclusion of all these changes.
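
Point 2, incidentally, is already the status quo; a minimal sketch (using ElementType and isRandomAccessRange from today's std.range) shows the split personality that results:

import std.range;

void main() {
    string s = "héllo"; // 5 code points, 6 UTF-8 code units

    // To the range primitives, a narrow string is a range of dchar...
    static assert(is(ElementType!string == dchar));
    // ...and not a random-access range:
    static assert(!isRandomAccessRange!string);

    // Yet the built-in array primitives still count and index code units:
    assert(s.length == 6);
}

One and the same value answers both as a range of dchar and as an array of char, which is the inconsistency steps 3-5 would progressively remove.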

If we want a fundamental change to what strings are in D, perhaps we should start focusing on the broader issue instead of trying to pass piecemeal changes one after the other. For consistency's sake, I think we should either stop after 1 or go all the way to 5. Either we do it fully or we don't do it at all.

All those divergent interpretations of strings end up hurting the language. Walter and Andrei ought to find a way to agree with each other.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/

December 31, 2011
On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
> 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

The problem is that, in a lot of cases, what's more likely to happen is that they use it wrong and don't notice, because they only use ASCII in testing, _but_ they have bugs all over the place, because their code is actually used with Unicode in the field.

Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen. I'm not sure that Andrei's suggestion is the best one at this point, but I sure wouldn't be against it being introduced. It wouldn't entirely fix the problem by any means, but programmers would then have to work harder at screwing it up and so there would be fewer mistakes.
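
To make the failure mode concrete, here is a minimal sketch of such a bug (the strings are just illustrations): code that treats .length and slicing as character counts passes its ASCII tests, then produces invalid UTF-8 on real-world input.

import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main() {
    string tested = "cafe";
    string field  = "café"; // 4 characters, 5 UTF-8 code units

    assert(tested[0 .. 4] == "cafe"); // "first four characters" -- fine in testing

    auto chopped = field[0 .. 4];     // same logic slices é in half
    assertThrown!UTFException(validate(chopped));
}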

Arguably, the first issue with D strings is that we have char. In most languages, char is supposed to be a character, so many programmers will code with that expectation. If we had something like utf8unit, utf16unit, and utf32unit (arguably very bad, albeit descriptive, names) and no char, then it would force programmers to become semi-educated about the issues. There's no way that that's changing at this point though.

- Jonathan M Davis
December 31, 2011
On 12/31/2011 04:30 AM, Jonathan M Davis wrote:
> On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
>> 1. They don't notice. Then it is not a problem, because they are
>> obviously only using ASCII characters and it is perfectly reasonable to
>> assume that code units and characters are the same thing.
>
> The problem is that what's more likely to happen in a lot of cases is that
> they use it wrong and don't notice, because they're only using ASCII in
> testing, _but_ they have bugs all over the place, because their code is
> actually used with unicode in the field.
>

Then that is the fault of the guy who created the tests. At least that guy should be familiar with the issues; otherwise he is in the wrong position. Software should never be released without thorough testing.

> Yes, diligent programmers will generally find such problems, but with the
> current scheme, it's _so_ easy to use length when you shouldn't, that it's
> pretty much a guarantee that it's going to happen. I'm not sure that Andrei's
> suggestion is the best one at this point, but I sure wouldn't be against it
> being introduced. It wouldn't entirely fix the problem by any means, but
> programmers would then have to work harder at screwing it up and so there
> would be fewer mistakes.

Programmers would then also have to work harder at doing it right and at memorizing special cases, so there is absolutely no net gain.

>
> Arguably, the first issue with D strings is that we have char. In most
> languages, char is supposed to be a character, so many programmers will code
> with that expectation. If we had something like utf8unit, utf16unit, and
> utf32unit (arguably very bad, albeit descriptive, names) and no char, then it
> would force programmers to become semi-educated about the issues. There's no
> way that that's changing at this point though.
>
> - Jonathan M Davis

A programmer has to have basic knowledge of the language he is programming in. That includes knowing the meaning of all the basic types. If he fails at that, testing should definitely catch that kind of trivial bug.
December 31, 2011
On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
> Yes, diligent programmers will generally find such problems, but with the
> current scheme, it's _so_ easy to use length when you shouldn't, that it's
> pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.

This was definitely not true of older multibyte schemes, like Shift-JIS (shudder), but those schemes ought to be terminated with extreme prejudice. It will definitely take a long time to live down the bugs and the miasma of code that had to deal with them, though. C and C++ still live with that because of their agenda of backwards compatibility. They still support EBCDIC, after all, which was obsolete even in the '70s. And I still see posts on comp.lang.c++.moderated that say "you shouldn't write string code like that, because it won't work on EBCDIC!" Sheesh!
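
For what it's worth, the reason such naive code keeps working is a designed-in property of UTF-8: code units below 0x80 never occur inside a multi-byte sequence, so scanning by code unit for an ASCII delimiter is exact. A sketch (the function is just illustrative):

ptrdiff_t findDollar(const(char)[] s) {
    // foreach over char[] iterates code units; that is safe here because
    // '$' can never appear as part of a multi-byte sequence.
    foreach (i, c; s)
        if (c == '$')
            return i;
    return -1;
}

void main() {
    assert(findDollar("héllo$wörld") == 6); // an index in code units
}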
December 31, 2011
On 12/30/11 10:09 PM, Walter Bright wrote:
> On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
>> Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.
>
> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
> correctly, but it turned out that the naive version that used [i] and
> .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: in the vast majority of cases the programmer may consider UTF-16 an encoding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.
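
For the record, the UTF-16 analogue is easy to demonstrate (walkLength from std.range counts code points):

import std.range : walkLength;

void main() {
    wstring s = "\U0001F600"w; // one code point outside the BMP
    assert(s.length == 2);     // two UTF-16 code units: a surrogate pair
    assert(s.walkLength == 1); // one code point
}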

We need .raw and we must abolish .length and [] for narrow strings.


Andrei
December 31, 2011
On Sat, Dec 31, 2011 at 12:09 AM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 12/30/11 10:09 PM, Walter Bright wrote:
>
>> On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
>>
>>> Yes, diligent programmers will generally find such problems, but with the current scheme, it's _so_ easy to use length when you shouldn't, that it's pretty much a guarantee that it's going to happen.
>>>
>>
>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.
>>
>
> The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: in the vast majority of cases the programmer may consider UTF-16 an encoding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.
>
> We need .raw and we must abolish .length and [] for narrow strings.
>
>
> Andrei
>


I don't know whether Phobos would be an appropriate place for it, but offering easy-to-access string data containing extensive and advanced Unicode, which users could add to their programs' unit tests, may help people ensure proper Unicode handling. Unicode seems to be one of those things where you either know it really well or you know just enough to get yourself in trouble, so having test data written by Unicode experts could be very useful for the rest of us mortals.

I googled around a bit. This Stack Overflow question came up: <http://stackoverflow.com/questions/6136800/unicode-test-strings-for-unit-tests>. It recommends these:
 - UTF-8 stress test: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
 - Quick Brown Fox in a variety of languages: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt

I didn't see too much beyond those two.
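
As a sketch of how such data might be used (the samples below are mere stand-ins for the real files), a unit test could pull in a few non-ASCII pangrams and check the properties naive code tends to get wrong, e.g. that code-unit and code-point counts differ:

import std.range : walkLength;

// Stand-in samples in the spirit of quickbrown.txt:
immutable nonAsciiSamples = [
    "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg",
    "いろはにほへと ちりぬるを",
];

unittest {
    foreach (s; nonAsciiSamples) {
        // .length counts code units, walkLength counts code points;
        // any code conflating the two will trip on these inputs.
        assert(s.length > s.walkLength);
    }
}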

Regards,
Brad A.


December 31, 2011
On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
> On 12/30/11 10:09 PM, Walter Bright wrote:
>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
>> correctly, but it turned out that the naive version that used [i] and
>> .length worked correctly. This is typical, not exceptional.
>
> The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: in the vast majority of cases the programmer may consider UTF-16 an encoding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were surrogate pairs; they kinda got the rug pulled out from under them. So they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit == codepoint; that assumption is embedded in the design of the language, library, and culture.

This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
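
For example, foreach picks code units or code points based purely on the declared element type:

void main() {
    string s = "héllo";
    foreach (char c; s)  {} // iterates 6 UTF-8 code units
    foreach (dchar c; s) {} // decodes on the fly: 5 code points
}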

> We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything, and it would break every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia.

And, we already have a type to deal with it: dstring
December 31, 2011
2011/12/31 Walter Bright <newshound2@digitalmars.com>:
> On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
>>
>> On 12/30/11 10:09 PM, Walter Bright wrote:
>>>
>>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 correctly, but it turned out that the naive version that used [i] and .length worked correctly. This is typical, not exceptional.
>>
>>
>> The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: in the vast majority of cases the programmer may consider UTF-16 an encoding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.
>
>
> I'm not so sure it's quite the same. Java was designed before there were surrogate pairs; they kinda got the rug pulled out from under them. So they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit == codepoint; that assumption is embedded in the design of the language, library, and culture.
>
> This is not true of D. It's designed from the ground up to deal properly with UTF. D has very simple language features to deal with it.
>
>
>> We need .raw and we must abolish .length and [] for narrow strings.
>
>
> I don't believe that fixes anything, and it would break every D project out there. We're chasing phantoms here, and I worry a lot about over-engineering trivia.
>
> And, we already have a type to deal with it: dstring

I fully agree with Walter. We don't need yet another wrapper for string.

Kenji Hara
December 31, 2011
On 12/31/11 2:04 AM, Walter Bright wrote:
> On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
>> On 12/30/11 10:09 PM, Walter Bright wrote:
>>> I'm not so sure about that. Timon Gehr's X macro tried to handle
>>> UTF-8 correctly, but it turned out that the naive version that
>>> used [i] and .length worked correctly. This is typical, not
>>> exceptional.
>>
>> The lower frequency of bugs makes them that much more difficult to spot. This is essentially similar to the UTF-16/UCS-2 morass: in the vast majority of cases the programmer may consider UTF-16 an encoding with one code unit per code point (which is what UCS-2 is). The existence of surrogates didn't make much of a difference because, again, very often the wrong assumption just worked. Well, that all didn't go over all that well.
>
> I'm not so sure it's quite the same. Java was designed before there were surrogate pairs; they kinda got the rug pulled out from under them. So they simply have no decent way to deal with it. There isn't even a notion of a dchar character type. Java was designed with codeunit == codepoint; that assumption is embedded in the design of the language, library, and culture.
>
> This is not true of D. It's designed from the ground up to deal
> properly with UTF.

I disagree. It is designed to make dealing with UTF possible.

> D has very simple language features to deal with
> it.

Disagree. I mean, simple they are, no contest. But they could and should be much better: they should make correct code easier to write and incorrect code more difficult to write. Claiming we have reached perfection there doesn't quite fit.

>> We need .raw and we must abolish .length and [] for narrow
>> strings.
>
> I don't believe that fixes anything, and it would break every D project out there.

I agree. This is the only reason that keeps me from furthering the issue.

> We're chasing phantoms here, and I worry a lot about over-engineering
> trivia.

I disagree. I understand that this seems like trivia to you, but that doesn't make your opinion any less wrong, not to mention provincial through its insistence that it's applicable beyond a small team of experts. Again: I know no other - I literally mean not one - person who writes string code like you do (and myself, after learning it from you); the current system is adequate; the proposed system is perfect - save for breaking backwards compatibility, which makes the discussion moot. But its being moot does not oblige me to concede this point. I am right.

> And, we already have a type to deal with it: dstring

No.


Andrei