December 29, 2011
On 12/29/11 12:28 PM, Don wrote:
> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>> Oh, one more thing - one good thing that could come out of this thread
>> is abolition (through however slow a deprecation path) of s.length and
>> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
>> Then, people would access the decoding routines on the needed occasions,
>> or would consciously use the representation.
>>
>> Yum.
>
>
> If I understand this correctly, most others don't. Effectively, .rep
> just means, "I know what I'm doing", and there's no change to existing
> semantics, purely a syntax change.

Exactly!

> If you change s[i] into s.rep[i], it does the same thing as now. There's
> no loss of functionality -- it just stops you from accidentally doing
> the wrong thing. Like .ptr for getting the address of an array.
> Typically all the ".rep" everywhere would get annoying, so you would write:
> ubyte [] u = s.rep;
> and use u from then on.
>
> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
> Apart from that, I think this would be perfect.

Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great.

Now I'm twice sorry this will not happen...
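
A rough sketch of what such an escape hatch could look like as a plain library shim today, with no compiler change (hypothetical code; the free-function name "raw" and the cast are my illustration, not part of the proposal):

// Hypothetical library-level stand-in for the proposed .raw/.rep property:
// expose the code units of a narrow string as ubyte[], preserving qualifiers.
@property inout(ubyte)[] raw(inout(char)[] s)
{
    return cast(inout(ubyte)[]) s;
}

void main()
{
    string s = "Düse";
    auto u = s.raw;        // immutable(ubyte)[]: code units, no auto-decoding
    assert(u.length == 5); // 'ü' occupies two UTF-8 code units
    assert(u[1] == 0xC3);  // first byte of the UTF-8 encoding of 'ü'
}

The built-in version would differ in that plain s.length and s[i] would stop compiling, which a library shim alone cannot enforce.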


Andrei
December 29, 2011
Don't we already have String-like support with ranges?  I'm not sure I understand the point in having special behavior for char arrays.

Sent from my iPhone

On Dec 28, 2011, at 8:17 PM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 12/28/11 4:18 PM, foobar wrote:
>> On Wednesday, 28 December 2011 at 21:57:00 UTC, Andrei Alexandrescu wrote:
>>> On 12/28/11 1:48 PM, foobar wrote:
>>>> On Wednesday, 28 December 2011 at 19:30:04 UTC, Andrei Alexandrescu wrote:
>>>>> On 12/28/11 1:18 PM, foobar wrote:
>>>>>> That's a good idea which I wonder about its implementation strategy.
>>>>> 
>>>>> Implementation would entail a change in the compiler.
>>>>> 
>>>>> Andrei
>>>> 
>>>> Why? D should be plenty powerful to implement this without modifying the compiler. Sounds like you suggest that char[] will behave differently than other T[], which is a very poor idea IMO.
>>> 
>>> It's an awesome idea, but for an academic debate at best.
>>> 
>>> Andrei
>> 
>> I don't follow you. You've suggested a change that I agree with. Adam provided a prototype string library type that accomplishes your specified goals without any changes to the compiler. What are we missing here? IF it boils down to changing the compiler or leaving the status-quo, I'm voting against the compiler change.
> 
> If we have two facilities (string and e.g. String) we've lost. We'd need to slowly change the built-in string type.
> 
> I discussed the matter with Walter. He completely disagrees, and sees the idea as a sheer way to complicate stuff for no good reason. He mentions how he frequently uses .length, indexing, and slicing in narrow strings.
> 
> I know Walter's code, so I know where he's coming from. He understands UTF in and out, and I have zero doubt he actually knows all essential constants, masks, and ranges by heart. I've seen his code and indeed it's an amazing feat of minimal opportunistic on-demand decoding. So I know where he's coming from, but I also know next to nobody codes like him. A casual string user almost always writes string code (iteration, indexing) the wrong way and would be tremendously helped by a clean distinction between abstraction and representation.
> 
> Nagonna happen.
> 
> 
> Andrei
> 
December 29, 2011
On Thursday, December 29, 2011 11:32:52 Sean Kelly wrote:
> Don't we already have String-like support with ranges?  I'm not sure I understand the point in having special behavior for char arrays.

To avoid common misuse. It's way too easy to misuse the length property on narrow strings. Programmers shouldn't be using the length property on narrow strings unless they know what they're doing, but it's likely the first thing that any programmer is going to use for the length of a string, because that's how arrays in general work.
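
A tiny sketch of the mistake being described (the string literal and asserts are my illustration): on a narrow string, .length counts UTF-8 code units, while getting the number of code points requires decoding, e.g. via std.range.walkLength:

import std.range : walkLength;

void main()
{
    string s = "tschüß";        // 6 characters, but 8 UTF-8 code units
    assert(s.length == 8);      // .length counts code units
    assert(s.walkLength == 6);  // counting code points requires decoding
}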

If it weren't legal to simply use the length property of a char[] or to directly slice or index it, then those common misuses would be harder to do. You could still do them via .rep or .raw or whatever we'd call it, but it would no longer be the path of least resistance.

Yes, Phobos may avoid the issue, because for the most part its developers understand the issues, but many programmers who do not understand them will make mistakes in their own code, mistakes which should arguably be harder to make, simply because the wrong way is the path of least resistance and they don't know any better.

- Jonathan M Davis
December 29, 2011
On Thursday, December 29, 2011 17:01:19 deadalnix wrote:
> On 28/12/2011 21:43, Jonathan M Davis wrote:
> > Agreed. And for a number of functions, taking const(char)[] would be worse, because they would have to dup or idup the string, whereas with immutable(char)[], they can safely slice it without worrying about its value changing.
> 
> Is inout a solution for the standard lib here?
> 
> The user could idup if a string is needed from a const/mutable char[]

In some places, yes. Phobos doesn't use inout as much as it probably should, simply because it was only recently that inout was made to work properly. Regardless, you have to be careful about taking const(char)[], because there's a risk of forcing what could be an unnecessary idup. The best solution to that, however, depends on what exactly the function is doing. If it's simply slicing a portion of the string that's passed in and returning it, then inout is a great solution. On the other hand, if it actually needs an immutable(char)[] internally, then there's a good chance that it should just take a string. It depends on what the function is ultimately doing.
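
For instance, a minimal sketch of the slicing case (the helper firstWord is hypothetical, purely for illustration): because the function only returns a slice of its argument, inout lets it accept mutable, const, and immutable character arrays alike, and callers passing a string get a string back without any idup:

// Returns a slice of the input; inout carries the caller's qualifier through.
inout(char)[] firstWord(inout(char)[] s)
{
    foreach (i, c; s)
        if (c == ' ')
            return s[0 .. i];
    return s;
}

void main()
{
    string s = "hello world";
    char[] m = "hello world".dup;
    string a = firstWord(s);   // stays immutable, no copy made
    char[] b = firstWord(m);   // stays mutable
    assert(a == "hello" && b == "hello");
}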

- Jonathan M Davis
December 30, 2011
On Thu, 29 Dec 2011 18:36:27 -0000, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 12/29/11 12:28 PM, Don wrote:
>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>> Oh, one more thing - one good thing that could come out of this thread
>>> is abolition (through however slow a deprecation path) of s.length and
>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
>>> Then, people would access the decoding routines on the needed occasions,
>>> or would consciously use the representation.
>>>
>>> Yum.
>>
>>
>> If I understand this correctly, most others don't. Effectively, .rep
>> just means, "I know what I'm doing", and there's no change to existing
>> semantics, purely a syntax change.
>
> Exactly!
>
>> If you change s[i] into s.rep[i], it does the same thing as now. There's
>> no loss of functionality -- it just stops you from accidentally doing
>> the wrong thing. Like .ptr for getting the address of an array.
>> Typically all the ".rep" everywhere would get annoying, so you would write:
>> ubyte [] u = s.rep;
>> and use u from then on.
>>
>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>> Apart from that, I think this would be perfect.
>
> Yes, I mean "rep" as a short for "representation" but upon first sight the connection is tenuous. "raw" sounds great.
>
> Now I'm twice sorry this will not happen...

+1 for this idea, however named.

R

-- 
Using Opera's revolutionary email client: http://www.opera.com/mail/
December 30, 2011
There are a lot of people suggesting changes to how string behaves. But remember, D is awesome compared to other languages for not wrapping string in a class or struct.

You can use string/char[] without losing your _nativeness_. Programmers targeting embedded systems are really happy because of this.

By the way, I don't want to blame anyone, but I think we have diverged from the original purpose of this topic: __"string is rarely useful as a function argument"__

I think he points out that choosing the _string_ type for function arguments is _wrong_ in most cases. And there isn't much use of inout in Phobos, as it was broken for a long time.
December 30, 2011
Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
> On 12/29/11 12:28 PM, Don wrote:
>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>> Oh, one more thing - one good thing that could come out of this thread
>>> is abolition (through however slow a deprecation path) of s.length and
>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
>>> Then, people would access the decoding routines on the needed occasions,
>>> or would consciously use the representation.
>>>
>>> Yum.
>>
>>
>> If I understand this correctly, most others don't. Effectively, .rep
>> just means, "I know what I'm doing", and there's no change to existing
>> semantics, purely a syntax change.
>
> Exactly!
>
>> If you change s[i] into s.rep[i], it does the same thing as now. There's
>> no loss of functionality -- it just stops you from accidentally doing
>> the wrong thing. Like .ptr for getting the address of an array.
>> Typically all the ".rep" everywhere would get annoying, so you would
>> write:
>> ubyte [] u = s.rep;
>> and use u from then on.
>>
>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>> Apart from that, I think this would be perfect.
>
> Yes, I mean "rep" as a short for "representation" but upon first sight
> the connection is tenuous. "raw" sounds great.
>
> Now I'm twice sorry this will not happen...
>

Maybe it could happen if we
 1. make dstring the default string type -- code units and characters would be the same
 or 2. forward string.length to std.utf.count and opIndex to std.utf.toUTFindex

so programmers could use slicing/indexing/length (no laziness problems), and if they really want code units, use .raw/.rep (or better .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
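
A rough sketch of how those library pieces behave today (the string literal and asserts are my illustration, not from the post):

import std.string : representation;
import std.utf : count, toUTFindex;

void main()
{
    string s = "häuser";
    immutable(ubyte)[] raw = s.representation; // code units, like the proposed .rep/.raw
    assert(raw.length == 7);        // 'ä' takes two UTF-8 code units
    assert(count(s) == 6);          // what a forwarded .length would return
    assert(toUTFindex(s, 2) == 3);  // code-unit index where code point 2 ('u') starts
}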

But generally I liked the idea of just having an alias for strings...

>
> Andrei

-- Joshua Reusch
December 30, 2011
On 12/30/2011 08:33 PM, Joshua Reusch wrote:
> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>> On 12/29/11 12:28 PM, Don wrote:
>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>> Oh, one more thing - one good thing that could come out of this thread
>>>> is abolition (through however slow a deprecation path) of s.length and
>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>> char/wchar.
>>>> Then, people would access the decoding routines on the needed
>>>> occasions,
>>>> or would consciously use the representation.
>>>>
>>>> Yum.
>>>
>>>
>>> If I understand this correctly, most others don't. Effectively, .rep
>>> just means, "I know what I'm doing", and there's no change to existing
>>> semantics, purely a syntax change.
>>
>> Exactly!
>>
>>> If you change s[i] into s.rep[i], it does the same thing as now. There's
>>> no loss of functionality -- it just stops you from accidentally doing
>>> the wrong thing. Like .ptr for getting the address of an array.
>>> Typically all the ".rep" everywhere would get annoying, so you would
>>> write:
>>> ubyte [] u = s.rep;
>>> and use u from then on.
>>>
>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>> Apart from that, I think this would be perfect.
>>
>> Yes, I mean "rep" as a short for "representation" but upon first sight
>> the connection is tenuous. "raw" sounds great.
>>
>> Now I'm twice sorry this will not happen...
>>
>
> Maybe it could happen if we
> 1. make dstring the default string type --

Inefficient.

> code units and characters would be the same

Wrong.

> or 2. forward string.length to std.utf.count and opIndex to
> std.utf.toUTFindex

Inconsistent and inefficient (it blows up the algorithmic complexity).
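
To make the cost concrete, a hypothetical sketch (my illustration, not from the thread) of what forwarding opIndex to std.utf.toUTFindex would imply: each lookup has to scan from the front of the string, so an innocent indexed loop degrades from O(n) to O(n^2):

import std.utf : count, decode, toUTFindex;

// What a forwarded s[i] would have to do under the hood.
dchar nthChar(string s, size_t i)
{
    size_t idx = toUTFindex(s, i); // O(i): scan from the start to find the offset
    return decode(s, idx);         // decode the code point at that offset
}

void main()
{
    string s = "äöü äöü";
    dstring result;
    foreach (i; 0 .. count(s))
        result ~= nthChar(s, i);   // every iteration pays the scan again: O(n^2) overall
    assert(result == "äöü äöü"d);
}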

>
> so programmers could use slicing/indexing/length (no laziness
> problems), and if they really want code units, use .raw/.rep (or better
> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32))
>

Anyone who intends to write efficient string processing code needs this. Anyone who does not want to write string processing code will not need to index into a string -- standard library functions will suffice.

> But generally I liked the idea of just having an alias for strings...

Me too. I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?

1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.

2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.


I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.

December 30, 2011
On Friday, 30 December 2011 at 19:55:45 UTC, Timon Gehr wrote:
> I think the way we have it now is optimal. The only reason we are discussing this is because of fear that uneducated users will write code that does not take into account Unicode characters above code point 0x80. But what is the worst thing that can happen?
>
> 1. They don't notice. Then it is not a problem, because they are obviously only using ASCII characters and it is perfectly reasonable to assume that code units and characters are the same thing.
>
> 2. They get screwed up string output, look for the reason, patch up their code with some functions from std.utf and will never make the same mistakes again.
>
>
> I have *never* seen a user in D.learn complain about it. There might have been some I missed, but it is certainly not a prevalent problem. Also, just because a user can type .rep does not mean he understands Unicode: he is able to make just the same mistakes as before, even more so, as the array he is getting back has the _wrong element type_.

I strongly agree with this. It would be nice to have everything be simple, work correctly *and* efficiently at the same time, but I don't believe the proposed changes make a definite improvement.

In the end, if you don't want to use the standard library or other UTF-aware string libraries, you'll have to know the basics of UTF to write correct code. I too wish it were harder to write it incorrectly, but the current solution is simply the best one to have appeared yet.
December 30, 2011
On 12/30/11 1:55 PM, Timon Gehr wrote:
> Me too. I think the way we have it now is optimal.

What we have now is adequate. The scheme I proposed is optimal.

I agree with all of your other remarks.


Andrei