March 07, 2014
On Friday, 7 March 2014 at 14:13:54 UTC, Adam D. Ruppe wrote:
> On Friday, 7 March 2014 at 10:44:46 UTC, Kagamin wrote:
>> Now it's passed by value.
>
> That won't work for operator overloading though (which is the really interesting case here).

Alternatively for small methods you can rely on inlining, which dereferences the argument. If the method is big, the reference is probably unimportant.
March 07, 2014
On Friday, 7 March 2014 at 13:40:31 UTC, Michel Fortin wrote:
> if you want to parse XML then you'll need to work with code points (in theory, in practice you might still want direct access to code units for performance reasons)

AFAIK, XML control characters are all ASCII, and whatever is between them you can slice or dup without further consideration, so code units should be more than enough.
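A minimal sketch of that idea, assuming UTF-8 input. The helper name `firstTextNode` is made up for illustration, and the scan goes through `std.string.representation` so nothing is decoded:

```d
import std.algorithm : countUntil;
import std.string : representation;

/// Hypothetical helper: returns the text between the first '>' and the
/// next '<', looking only at raw UTF-8 code units.
string firstTextNode(string xml)
{
    auto bytes = xml.representation;      // immutable(ubyte)[], no decoding
    auto open = bytes.countUntil('>');
    if (open < 0) return null;
    auto close = bytes[open + 1 .. $].countUntil('<');
    if (close < 0) return null;
    // Slicing at ASCII delimiters cannot split a multi-byte sequence,
    // because every byte of such a sequence is >= 0x80.
    return xml[open + 1 .. open + 1 + close];
}

void main()
{
    assert(firstTextNode("<a>日本語</a>") == "日本語");
}
```

The content between the delimiters is passed through as a slice, so it is never inspected or copied.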
March 07, 2014
I don't like it at all.

1) It is a huge breakage and you have been refusing to do one even for more important problems. What is behind this sudden change of mind?

2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for that, it will be extremely widespread.

Rendering correctness is very application-specific, but providing basic guarantees that a string is not completely broken is useful.

Now real problems I see:

1) stuff like readText() returns char[] instead of requiring explicit default encoding

2) lack of convenient .raw property which will effectively do cast(ubyte[])

3) the fact that std.string always assumes Unicode and never forwards to std.ascii for http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]
March 07, 2014
On Friday, 7 March 2014 at 14:44:43 UTC, Kagamin wrote:
> Alternatively for small methods you can rely on inlining, which dereferences the argument.

Yeah, that's usually the way to go; inlining can also avoid pushing other arguments to the stack on 32-bit, which is a big win too. But you can't inline an asm function, and checking the overflow flag needs asm (or a compiler intrinsic).
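For reference, a sketch of the intrinsic route, assuming a druntime that ships `core.checkedint` (an assumption on my part, not something this thread depends on). Its `adds` function reports overflow through a `ref bool` instead of requiring asm:

```d
import core.checkedint : adds;

void main()
{
    bool overflow = false;      // sticky: set on overflow, never cleared by adds
    int sum = adds(int.max, 1, overflow);
    assert(overflow);

    overflow = false;
    sum = adds(2, 3, overflow);
    assert(!overflow && sum == 5);
}
```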

For the library typedef case too, this means wrapping any function that returns a struct, which is annoying if nothing else.
March 07, 2014
I'm with Walter on this, and it's why I don't use char ranges. Though converting to ubyte feels weird.
March 07, 2014
On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
> I don't like it at all.
>
> 1) It is a huge breakage

Can we look at some example situations that this will break?

> and you have been refusing to do one even for more important problems.

This is a fallacy.

> 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for that, it will be extremely widespread.

Thinking about dstrings as character arrays is less flawed only to a certain extent.

> Now real problems I see:
>
> 1) stuff like readText() returns char[] instead of requiring explicit default encoding
>
> 2) lack of convenient .raw property which will effectively do cast(ubyte[])
>
> 3) the fact that std.string always assumes Unicode and never forwards to std.ascii for http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]

I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).
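A minimal sketch of what such a property could look like. The name `raw` is hypothetical and taken from the proposal above; as far as I know, `std.string.representation` already performs the same cast in Phobos:

```d
/// Hypothetical `.raw` helper: views a UTF-8 string as its code units.
@property immutable(ubyte)[] raw(string s) pure nothrow
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    string s = "日本語";
    assert(s.raw.length == 9);   // nine UTF-8 code units
    assert(s.raw[0] == 0xE6);    // first byte of U+65E5
}
```

Since it is a free function called through UFCS, it needs no change to the string type itself.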
March 07, 2014
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev wrote:
> On Friday, 7 March 2014 at 15:03:24 UTC, Dicebot wrote:
>> I don't like it at all.
>>
>> 1) It is a huge breakage
>
> Can we look at some example situations that this will break?

Any code that relies on countUntil to count dchars? Or, to generalize, almost any code that uses std.algorithm functions with string?

>> and you have been refusing to do one even for more important problems.
>
> This is a fallacy.

Ok :)

>> 2) It is a regression back to the C++ days of no-one-cares-about-Unicode pain. Thinking about strings as character arrays is so natural and convenient that if the language/Phobos won't punish you for that, it will be extremely widespread.
>
> Thinking about dstrings as character arrays is less flawed only to a certain extent.

Sure. But I find this extent practical enough to make the difference. It is a good compromise between perfectly correct (and very slow) string processing and having your program unusable with anything but the basic Latin symbol set.

>> Now real problems I see:
>>
>> 1) stuff like readText() returns char[] instead of requiring explicit default encoding
>>
>> 2) lack of convenient .raw property which will effectively do cast(ubyte[])
>>
>> 3) the fact that std.string always assumes Unicode and never forwards to std.ascii for http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]
>
> I think these are fixable without breaking anything? So why not go for it? The first two sound trivial (.raw can be an UFCS property).

(1) will likely require deprecation (== breakage) of the old interface, but yes, those are relatively trivial. It just has not been important enough to me to spend time on pushing it. Still struggling to finish my template argument list proposal :(

March 07, 2014
On Friday, 7 March 2014 at 16:43:30 UTC, Dicebot wrote:
> On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev wrote:
>> Can we look at some example situations that this will break?
>
> Any code that relies on countUntil to count dchars? Or, to generalize, almost any code that uses std.algorithm functions with string?

This is a pretty fragile design in the first place, since we use the same basic type (integers) to count two different things (code units / code points). Code that relies on this behavior would need to be explicitly tested with Unicode data to be sure that it works correctly. Otherwise, if it's only tested with ASCII, it will merely appear to work.

Correct code where these indices never left the equation will not be affected, e.g.:

auto s = "日本語";
auto x = s.countUntil("本語"); // was 1, will be 3
s = s.drop(x);
assert(s == "本語"); // still OK
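To make the two counting schemes concrete, here is a sketch that runs against the decoding Phobos as it stands. The code-unit variant goes through `std.string.representation`, which only approximates what a non-decoding countUntil would return:

```d
import std.algorithm : countUntil;
import std.range : drop;
import std.string : representation;

void main()
{
    auto s = "日本語";

    // Decoding range API: the index counts code points.
    auto x = s.countUntil("本語");
    assert(x == 1);
    assert(s.drop(x) == "本語");

    // Explicit code units: the index counts bytes, and so does the slice.
    auto y = s.representation.countUntil("本語".representation);
    assert(y == 3);
    assert(s[y .. $] == "本語");
}
```

Either index is fine as long as it is consumed by an operation that uses the same unit; mixing them (e.g. slicing with x) is where things break.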

>> Thinking about dstrings as character arrays is less flawed only to a certain extent.
>
> Sure. But I find this extent practical enough to make the difference. It is a good compromise between perfectly correct (and very slow) string processing and having your program unusable with anything but the basic Latin symbol set.

I think that if we are to draw a line somewhere on what to support and what not, the decision should not be embedded so deep in the language. Ideally, it would be clearly visible in the code that you are counting code points.
March 07, 2014
On Friday, 7 March 2014 at 17:04:30 UTC, Vladimir Panteleev wrote:
> I think that if we are to draw a line somewhere on what to support and what not, the decision should not be embedded so deep in the language. Ideally, it would be clearly visible in the code that you are counting code points.

Well, if you consider really breaking changes, simply prohibiting plain random access to char[] and forcing the use of either .raw or .decode is one thing I'd love to see (with .byGrapheme as the library cherry on top).
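A sketch of the three levels on one string, using Phobos names that I believe exist today as stand-ins for the proposed ones: `std.string.representation` for .raw, the default decoding range behaviour for .decode, and `std.uni.byGrapheme`:

```d
import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(s.representation.length == 3);   // code units (UTF-8 bytes)
    assert(s.walkLength == 2);              // code points (decoded dchars)
    assert(s.byGrapheme.walkLength == 1);   // user-perceived characters
}
```

Three different answers to "how long is this string", which is the argument for making the choice visible at the call site.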
March 07, 2014
On Friday, 7 March 2014 at 03:52:42 UTC, Walter Bright wrote:
> Ok, I have a plan. Each step will be separated by at least one version:
>
> 1. implement decode() as an algorithm for string types, so one can write:
>
>     string s;
>     s.decode.algorithm...
>
> suggest that people start doing that instead of:
>
>     s.algorithm...

I think .decode should be something more explicit (byCodePoint or something like that), just so it's clear that it's not magical and does not solve all problems.

> 2. Emit warning when people use std.array.front(s) with strings.
>
> 3. Deprecate std.array.front for strings.
>
> 4. Error for std.array.front for strings.
>
> 5. Implement new std.array.front for strings that doesn't decode.

Until then, how will people use strings with algorithms when they mean to use them per-byte? A .raw property which casts to ubyte[]?
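Something along these lines, as a sketch: it uses `std.string.representation` as a stand-in for the hypothetical .raw, and shows that the element type an algorithm sees changes with it:

```d
import std.algorithm : count;
import std.range : ElementType;
import std.string : representation;

// With decoding, algorithms see dchar; on the byte view they see ubyte.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!(immutable(ubyte)[]) : ubyte));

void main()
{
    string s = "na\u00EFve";   // U+00EF encodes as two bytes in UTF-8

    assert(s.count!(c => c > 0x7F) == 1);                  // one non-ASCII code point
    assert(s.representation.count!(b => b > 0x7F) == 2);   // two non-ASCII bytes
}
```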