December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 01:12 AM, Andrei Alexandrescu wrote:
> On 12/30/11 6:07 PM, Timon Gehr wrote:
>> alias std.string.representation raw;
>
> I meant your implementation is incomplete.

It was more a sketch than an implementation. It is not even type safe :o).

>
> But the main point is that presence of representation/raw is not the
> issue.
> The availability of good-for-nothing .length and operator[] are
> the issue. Putting in place the convention of using .raw is hardly
> useful within the context.
>

D strings are arrays. An array without .length and operator[] is close 
to being good for nothing. The language specification is quite clear 
that char, for example, is not a character but a UTF-8 code unit. 
Therefore char[] is an array of code units: .length gives the number of 
code units, and operator[i] gives the i-th code unit. Nothing wrong or 
good-for-nothing about that. .raw would return ubyte[], so it would 
lose all type information. Effectively, .raw is a type cast that lets 
code unit data alias with integral data.

Consider:

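import std.stdio : writeln;
import std.string : representation;

alias raw = representation; // the alias sketched earlier in the thread

void foo(ubyte[] b)
in { assert(b.length); }
do // spelled `body` in compilers of the time
{
    b[0] = 2; // perfectly fine for a ubyte[]
}

void main()
{
    char[] s = "☺".dup;
    auto b = s.raw; // ubyte[] aliasing the same memory as s
    foo(b);         // overwrites the UTF-8 lead byte of "☺"
    writeln(s);     // oops... s no longer holds valid UTF-8
}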

I fail to understand why that is desirable.
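
To make the code unit versus character distinction above concrete, here is a minimal sketch. The helper names codeUnits and codePoints are made up for illustration; the sketch assumes current Phobos behaviour, where narrow strings are decoded to dchar when iterated as ranges, so walkLength counts code points while .length counts code units.

```d
import std.range : walkLength;

// What .length reports for a string: the number of UTF-8 code units.
size_t codeUnits(string s) { return s.length; }

// Iterating a narrow string as a range decodes it to dchar,
// so walkLength counts code points instead.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string s = "☺"; // U+263A, encoded in UTF-8 as E2 98 BA
    assert(codeUnits(s) == 3);
    assert(codePoints(s) == 1);
    assert(s[0] == 0xE2); // s[0] is the first code unit, not a character
}
```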
December 31, 2011
Re: string is rarely useful as a function argument
On 2011-12-30 23:00:49 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> Using .raw is /optimal/ because it states the assumption appropriately. 
> The user knows '$' cannot be in the prefix of any other symbol, so she 
> can state the byte alone is the character. If that were a non-ASCII 
> character, the assumption wouldn't have worked.
> 
> So yeah, UTF-8 is great. But it is not miraculous. We need .raw.

After reading most of the thread, it seems to me like you're 
deconstructing strings as arrays one piece at a time, to the point 
where instead of arrays we'd basically get a string struct and do 
things on it. Maybe it's part of a grand scheme, more likely it's one 
realization after another leading to one change after another… let's 
see where all this will lead us:

0. in the beginning, strings were char[] arrays
1. arrays are generalized as ranges
2. phobos starts treating char arrays as bidirectional ranges of dchar 
(instead of random access ranges of char)
3. foreach on char[] should iterate over dchar by default
4. remove .length, random access, and slicing from char arrays
5. replace char[] with a struct { ubyte[] raw; }

Number 1 is great by itself, no debate there. Number 2 is debatable. 
Number 3 and 4 are somewhat required for consistency with number 2. 
Number 5 is just the logical conclusion of all these changes.
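
For illustration only, step 5 might look roughly like the sketch below. The type name Utf8String and its layout are hypothetical (nobody in the thread posted such a type); it exposes code units only through .raw and otherwise behaves as an input range of dchar, using decode and stride from std.utf.

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : decode, stride;

// Hypothetical wrapper: no .length, no indexing, only .raw and range access.
struct Utf8String
{
    immutable(ubyte)[] raw;

    @property bool empty() const { return raw.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front() const
    {
        auto s = cast(string) raw;
        size_t i = 0;
        return decode(s, i);
    }

    // Skip over all code units of the first code point.
    void popFront()
    {
        auto s = cast(string) raw;
        raw = raw[stride(s, 0) .. $];
    }
}

void main()
{
    auto s = Utf8String("☺!".representation);
    assert(s.raw.length == 4);  // code units are still reachable via .raw
    assert(s.walkLength == 2);  // range iteration sees two code points
    static assert(!__traits(compiles, s.length)); // no .length
    static assert(!__traits(compiles, s[0]));     // no random access
}
```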

If we want a fundamental change to what strings are in D, perhaps we 
should start focusing on the broader issue instead of trying to pass 
piecemeal changes one after the other. For consistency's sake, I think 
we should either stop after 1 or go all the way to 5. Either we do it 
fully or we don't do it at all.

All those divergent interpretations of strings end up hurting the 
language. Walter and Andrei ought to find a way to agree with each 
other.

-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
December 31, 2011
Re: string is rarely useful as a function argument
On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
> 1. They don't notice. Then it is not a problem, because they are
> obviously only using ASCII characters and it is perfectly reasonable to
> assume that code units and characters are the same thing.

The problem is that what's more likely to happen in a lot of cases is that 
they use it wrong and don't notice, because they're only using ASCII in 
testing, _but_ they have bugs all over the place, because their code is 
actually used with unicode in the field.

Yes, diligent programmers will generally find such problems, but with the 
current scheme, it's _so_ easy to use length when you shouldn't, that it's 
pretty much a guarantee that it's going to happen. I'm not sure that Andrei's 
suggestion is the best one at this point, but I sure wouldn't be against it 
being introduced. It wouldn't entirely fix the problem by any means, but 
programmers would then have to work harder at screwing it up and so there 
would be fewer mistakes.

Arguably, the first issue with D strings is that we have char. In most 
languages, char is supposed to be a character, so many programmers will code 
with that expectation. If we had something like utf8unit, utf16unit, and 
utf32unit (arguably very bad, albeit descriptive, names) and no char, then it 
would force programmers to become semi-educated about the issues. There's no 
way that that's changing at this point though.

- Jonathan M Davis
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 04:30 AM, Jonathan M Davis wrote:
> On Friday, December 30, 2011 20:55:42 Timon Gehr wrote:
>> 1. They don't notice. Then it is not a problem, because they are
>> obviously only using ASCII characters and it is perfectly reasonable to
>> assume that code units and characters are the same thing.
>
> The problem is that what's more likely to happen in a lot of cases is that
> they use it wrong and don't notice, because they're only using ASCII in
> testing, _but_ they have bugs all over the place, because their code is
> actually used with unicode in the field.
>

Then that is the fault of the guy who created the tests. At least that 
guy should be familiar with the issues, otherwise he is in the wrong 
position. Software should never be released without thorough testing.

> Yes, diligent programmers will generally find such problems, but with the
> current scheme, it's _so_ easy to use length when you shouldn't, that it's
> pretty much a guarantee that it's going to happen. I'm not sure that Andrei's
> suggestion is the best one at this point, but I sure wouldn't be against it
> being introduced. It wouldn't entirely fix the problem by any means, but
> programmers would then have to work harder at screwing it up and so there
> would be fewer mistakes.

Programmers would then also have to work harder at doing it right and at 
memorizing special cases, so there is absolutely no net gain.

>
> Arguably, the first issue with D strings is that we have char. In most
> languages, char is supposed to be a character, so many programmers will code
> with that expectation. If we had something like utf8unit, utf16unit, and
> utf32unit (arguably very bad, albeit descriptive, names) and no char, then it
> would force programmers to become semi-educated about the issues. There's no
> way that that's changing at this point though.
>
> - Jonathan M Davis

A programmer has to have basic knowledge of the language he is 
programming in. That includes knowing the meaning of all basic types. If 
he fails at that, testing should definitely catch that kind of trivial bug.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
> Yes, diligent programmers will generally find such problems, but with the
> current scheme, it's _so_ easy to use length when you shouldn't, that it's
> pretty much a guarantee that it's going to happen.

I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8 
correctly, but it turned out that the naive version that used [i] and .length 
worked correctly. This is typical, not exceptional.
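
A small sketch of why the naive approach often works: in UTF-8, every byte of a multi-byte sequence has its high bit set, so an ASCII code unit such as '$' can never appear inside the encoding of another character. The function name indexOfDollar is made up for this example.

```d
// Index of the first '$' code unit, or -1 if there is none.
// Scanning code units is safe here because '$' is ASCII.
ptrdiff_t indexOfDollar(string s)
{
    foreach (i; 0 .. s.length)
        if (s[i] == '$')
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    string s = "€5 or $7"; // '€' occupies three code units
    auto i = indexOfDollar(s);
    assert(i == 8);
    assert(s[i .. $] == "$7"); // slicing at an ASCII unit keeps valid UTF-8
}
```

The same loop would be wrong for a non-ASCII needle, which is the case the rest of the thread argues about.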

This was definitely not true of older multibyte schemes, like Shift-JIS 
(shudder), but those schemes ought to be terminated with extreme prejudice. But 
it definitely will take a long time to live down the bugs and miasma of code 
that had to deal with them. C and C++ still live with that because of their 
agenda of backwards compatibility. They still support EBCDIC, after all, which 
was obsolete even in the 70's. And I still see posts on comp.lang.c++.moderated 
that say "you shouldn't write string code like that, because it won't work on 
EBCDIC!" Sheesh!
December 31, 2011
Re: string is rarely useful as a function argument
On 12/30/11 10:09 PM, Walter Bright wrote:
> On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
>> Yes, diligent programmers will generally find such problems, but with the
>> current scheme, it's _so_ easy to use length when you shouldn't, that
>> it's
>> pretty much a guarantee that it's going to happen.
>
> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
> correctly, but it turned out that the naive version that used [i] and
> .length worked correctly. This is typical, not exceptional.

The lower frequency of bugs makes them that much more difficult to spot. 
This is essentially similar to the UTF16/UCS-2 morass: in a vast 
majority of the time the programmer may consider UTF16 a coding with one 
code unit per code point (which is what UCS-2 is). The existence of 
surrogates didn't make much of a difference because, again, very often 
the wrong assumption just worked. Well that all didn't go over all that 
well.
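
The surrogate issue can be reproduced in D with wstring. This is a minimal sketch with made-up helper names; it assumes Phobos range decoding for narrow strings, as above.

```d
import std.range : walkLength;

// UTF-16 code units, as reported by .length on a wstring.
size_t utf16Units(wstring w) { return w.length; }

// Code points, counted by decoding the wstring as a range.
size_t utf16Points(wstring w) { return w.walkLength; }

void main()
{
    wstring bmp = "\u263A"w;        // inside the BMP: one code unit
    wstring astral = "\U0001D11E"w; // outside the BMP: a surrogate pair

    // The UCS-2 assumption holds for the first string...
    assert(utf16Units(bmp) == 1 && utf16Points(bmp) == 1);
    // ...and silently fails for the second.
    assert(utf16Units(astral) == 2 && utf16Points(astral) == 1);
}
```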

We need .raw and we must abolish .length and [] for narrow strings.


Andrei
December 31, 2011
Re: string is rarely useful as a function argument
On Sat, Dec 31, 2011 at 12:09 AM, Andrei Alexandrescu <
SeeWebsiteForEmail@erdani.org> wrote:

> On 12/30/11 10:09 PM, Walter Bright wrote:
>
>> On 12/30/2011 7:30 PM, Jonathan M Davis wrote:
>>
>>> Yes, diligent programmers will generally find such problems, but with the
>>> current scheme, it's _so_ easy to use length when you shouldn't, that
>>> it's
>>> pretty much a guarantee that it's going to happen.
>>>
>>
>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
>> correctly, but it turned out that the naive version that used [i] and
>> .length worked correctly. This is typical, not exceptional.
>>
>
> The lower frequency of bugs makes them that much more difficult to spot.
> This is essentially similar to the UTF16/UCS-2 morass: in a vast majority
> of the time the programmer may consider UTF16 a coding with one code unit
> per code point (which is what UCS-2 is). The existence of surrogates didn't
> make much of a difference because, again, very often the wrong assumption
> just worked. Well that all didn't go over all that well.
>
> We need .raw and we must abolish .length and [] for narrow strings.
>
>
> Andrei
>


I don't know that Phobos would be an appropriate place for it, but it may
help to offer some easy-to-access string data containing extensive and
advanced Unicode that users could add to their programs' unit tests to
ensure proper Unicode usage. Unicode seems to be one of those things where
you either know it really well or you know just enough to get yourself in
trouble, so having test data written by Unicode experts could be very useful
for the rest of us mortals.

I googled around a bit. This Stack Overflow question came up
<http://stackoverflow.com/questions/6136800/unicode-test-strings-for-unit-tests>
and recommends these:
- UTF-8 stress test:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
- Quick Brown Fox in a variety of languages:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt

I didn't see too much beyond those two.
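
As a rough idea of what such test data could look like inside a program, here is a sketch. The Sample type, the samples table, and the check function are hypothetical and hand-written for this example; they are not taken from the files linked above.

```d
import std.range : walkLength;

// One test string with its expected sizes.
struct Sample
{
    string text;
    size_t codeUnits;  // expected text.length
    size_t codePoints; // expected text.walkLength
}

// Hypothetical samples covering 1-, 2-, 3- and 4-byte UTF-8 sequences.
immutable Sample[] samples = [
    Sample("plain ASCII", 11, 11),
    Sample("Zwölf", 6, 5),
    Sample("日本語", 9, 3),
    Sample("\U0001D11E", 4, 1),
];

// True if both counts match the expectations recorded in the sample.
bool check(const Sample s)
{
    return s.text.length == s.codeUnits
        && s.text.walkLength == s.codePoints;
}

void main()
{
    foreach (s; samples)
        assert(check(s));
}
```

Code that passes with the first sample and fails with the others is most likely confusing code units with code points.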

Regards,
Brad A.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
> On 12/30/11 10:09 PM, Walter Bright wrote:
>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
>> correctly, but it turned out that the naive version that used [i] and
>> .length worked correctly. This is typical, not exceptional.
>
> The lower frequency of bugs makes them that much more difficult to spot. This is
> essentially similar to the UTF16/UCS-2 morass: in a vast majority of the time
> the programmer may consider UTF16 a coding with one code unit per code point
> (which is what UCS-2 is). The existence of surrogates didn't make much of a
> difference because, again, very often the wrong assumption just worked. Well
> that all didn't go over all that well.

I'm not so sure it's quite the same. Java was designed before there were 
surrogate pairs, they kinda got the rug pulled out from under them. So, they 
simply have no decent way to deal with it. There isn't even a notion of a dchar 
character type. Java was designed with codeunit==codepoint, it is embedded in 
the design of the language, library, and culture.

This is not true of D. It's designed from the ground up to deal properly with 
UTF. D has very simple language features to deal with it.

> We need .raw and we must abolish .length and [] for narrow strings.

I don't believe that fixes anything and breaks every D project out there. We're 
chasing phantoms here, and I worry a lot about over-engineering trivia.

And, we already have a type to deal with it: dstring
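
A short sketch of the two features referred to here, assuming std.conv.to for the conversion. The function name countWithForeach is made up. Note the trade-off: a dstring stores four bytes per code point.

```d
import std.conv : to;

// foreach with an explicit dchar loop variable decodes UTF-8 on the fly.
size_t countWithForeach(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "☺ ok";
    dstring d = s.to!dstring; // one array element per code point

    assert(s.length == 6);    // UTF-8 code units
    assert(d.length == 4);    // code points
    assert(d[0] == '☺');      // random access by code point
    assert(countWithForeach(s) == 4);
}
```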
December 31, 2011
Re: string is rarely useful as a function argument
2011/12/31 Walter Bright <newshound2@digitalmars.com>:
> On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
>>
>> On 12/30/11 10:09 PM, Walter Bright wrote:
>>>
>>> I'm not so sure about that. Timon Gehr's X macro tried to handle UTF-8
>>> correctly, but it turned out that the naive version that used [i] and
>>> .length worked correctly. This is typical, not exceptional.
>>
>>
>> The lower frequency of bugs makes them that much more difficult to spot.
>> This is
>> essentially similar to the UTF16/UCS-2 morass: in a vast majority of the
>> time
>> the programmer may consider UTF16 a coding with one code unit per code
>> point
>> (which is what UCS-2 is). The existence of surrogates didn't make much of
>> a
>> difference because, again, very often the wrong assumption just worked.
>> Well
>> that all didn't go over all that well.
>
>
> I'm not so sure it's quite the same. Java was designed before there were
> surrogate pairs, they kinda got the rug pulled out from under them. So, they
> simply have no decent way to deal with it. There isn't even a notion of a
> dchar character type. Java was designed with codeunit==codepoint, it is
> embedded in the design of the language, library, and culture.
>
> This is not true of D. It's designed from the ground up to deal properly
> with UTF. D has very simple language features to deal with it.
>
>
>> We need .raw and we must abolish .length and [] for narrow strings.
>
>
> I don't believe that fixes anything and breaks every D project out there.
> We're chasing phantoms here, and I worry a lot about over-engineering
> trivia.
>
> And, we already have a type to deal with it: dstring

I fully agree with Walter. There is no need for another wrapper around string.

Kenji Hara
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/11 2:04 AM, Walter Bright wrote:
> On 12/30/2011 11:09 PM, Andrei Alexandrescu wrote:
>> On 12/30/11 10:09 PM, Walter Bright wrote:
>>> I'm not so sure about that. Timon Gehr's X macro tried to handle
>>> UTF-8 correctly, but it turned out that the naive version that
>>> used [i] and .length worked correctly. This is typical, not
>>> exceptional.
>>
>> The lower frequency of bugs makes them that much more difficult to
>>  spot. This is essentially similar to the UTF16/UCS-2 morass: in a
>> vast majority of the time the programmer may consider UTF16 a
>> coding with one code unit per code point (which is what UCS-2 is).
>> The existence of surrogates didn't make much of a difference
>> because, again, very often the wrong assumption just worked. Well
>> that all didn't go over all that well.
>
> I'm not so sure it's quite the same. Java was designed before there
> were surrogate pairs, they kinda got the rug pulled out from under
> them. So, they simply have no decent way to deal with it. There isn't
> even a notion of a dchar character type. Java was designed with
> codeunit==codepoint, it is embedded in the design of the language,
> library, and culture.
>
> This is not true of D. It's designed from the ground up to deal
> properly with UTF.

I disagree. It is designed to make dealing with UTF possible.

> D has very simple language features to deal with
> it.

Disagree. I mean simple they are, no contest. They could and should be 
much better, make correct code easier to write, and make incorrect code 
more difficult to write. Claiming we reached perfection there doesn't 
quite fit.

>> We need .raw and we must abolish .length and [] for narrow
>> strings.
>
> I don't believe that fixes anything and breaks every D project out
> there.

I agree. This is the only reason that keeps me from furthering the issue.

> We're chasing phantoms here, and I worry a lot about over-engineering
> trivia.

I disagree. I understand that seems trivia to you, but that doesn't make 
your opinion any less wrong, not to mention provincial through 
insistence it's applicable beyond a small team of experts. Again: I know 
no other - I literally mean not one - person who writes string code like 
you do (and myself after learning it from you); the current system is 
adequate; the proposed system is perfect - save for breaking backwards 
compatibility, which makes the discussion moot. But it being moot does 
not afford me to concede this point. I am right.

> And, we already have a type to deal with it: dstring

No.


Andrei