May 07, 2021
On Friday, 7 May 2021 at 18:44:26 UTC, Steven Schveighoffer wrote:
> The end result of something like you allude to would result in nearly all of phobos NOT working with arrays.

int[5] arr;
arr.sort(); // fails, you need to use []

Array!int arr;
arr.sort(); // fails, you need to use []

some random phobos functions special-case this to make it work which is the real wtf and those should be undone, just get the user to slice a static array.

So I'd just make it all consistent.
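
A minimal sketch of the "just slice it" rule described above, assuming `std.algorithm.sorting.sort` and `std.container.array.Array`; the variable names are only illustrative:

```d
import std.algorithm.sorting : sort;
import std.container.array : Array;

// Neither a static array nor an Array!int is itself a range,
// so calling sort on them directly is rejected at compile time.
static assert(!__traits(compiles, { int[5] a; a.sort(); }));
static assert(!__traits(compiles, { Array!int a; a.sort(); }));

void main()
{
    int[5] fixed = [5, 3, 1, 4, 2];
    fixed[].sort();                  // slice first: int[] is a random-access range
    assert(fixed == [1, 2, 3, 4, 5]);

    auto boxed = Array!int(5, 3, 1, 4, 2);
    boxed[].sort();                  // Array's opSlice returns its range type
    assert(boxed[0] == 1 && boxed[4] == 5);
}
```

Under the consistent rule, a plain `int[]` would presumably need the same `[]` (or some other explicit step), which is the part Steven objects to.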


But tbh I don't feel that strongly about it... except for string. string should no longer be a range. Delete its popFront overload and let the user pick byCodeUnit or byCodePoint or whatever. Just rip that band aid right off.
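
To illustrate what "let the user pick" looks like with what is already in Phobos, here is a small sketch using `std.utf.byCodeUnit` and `std.utf.byDchar` (the latter standing in for the "byCodePoint" mentioned above; that substitution is my assumption):

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

void main()
{
    string s = "h\u00E9llo";              // U+00E9 occupies two UTF-8 code units

    assert(s.byCodeUnit.walkLength == 6); // iterate raw code units, no decoding
    assert(s.byDchar.walkLength == 5);    // iterate decoded code points
    assert(s.length == 6);                // .length always counts code units
}
```

With the explicit adapters the caller states which interpretation is wanted, instead of getting decoding implicitly through the range primitives on `string`.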

Just even for the others, even if the [] was deemed unacceptable, i don't love the ufcs solution.

So many people try to write freestanding functions for their own types, inspired by the phobos popFront... and isInputRange fails because phobos itself would have to import the module that defines those functions for ufcs to find them. Other new people do foo.empty and it fails because they didn't import the phobos module that provides it.
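
A short sketch of that failure mode, assuming a made-up `Countdown` type (both structs here are hypothetical and exist only for the demonstration):

```d
import std.range.primitives : isInputRange;

struct Countdown { int n; }

// Range primitives written as free functions in the user's own module.
bool empty(Countdown c) { return c.n == 0; }
int front(Countdown c) { return c.n; }
void popFront(ref Countdown c) { --c.n; }

// The same primitives written as members.
struct MemberCountdown
{
    int n;
    bool empty() const { return n == 0; }
    int front() const { return n; }
    void popFront() { --n; }
}

// UFCS resolves here, in the module that declares the free functions...
static assert(__traits(compiles, { Countdown c; if (!c.empty) c.popFront(); }));
// ...but isInputRange is evaluated inside std.range.primitives, which cannot see them.
static assert(!isInputRange!Countdown);
static assert(isInputRange!MemberCountdown);

void main() {}
```

The free functions work fine when called locally, which is what makes the `isInputRange` failure surprising to people who copy the Phobos pattern.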

So like even if the behavior remained the same as today, I'd like to define it a little differently.

but meh dont wanna continue too far down this particular thing since it is the part of my rant i care the least about.
May 07, 2021
On 5/7/21 1:05 PM, Steven Schveighoffer wrote:
> The problem I have is, you have a function like:
> 
> foo(T)(T s) if (isSomeString!T)
> 
> The *intention* here is that, I want to NOT have to write:
> 
> foo(string s) { impl }
> foo(wstring s) { impl }
> foo(dstring s) { impl }
> ... // etc with const, mutable
> 
> BUT, if I have an enum that converts to a string, then if I actually DID write all those, then it would compile. However, the template version does not. This is the confusion that a user and library author has.

Of course. I understand that very well. But that's a minor confusion and inconvenience; people understand very well that e.g. this won't work:

void foo(float);
void foo(double);
void main() { foo(1); }

The reason is slightly different but the point is the same: convertibility has its subtleties and programming languages comprehend small surprises.

Supporting enum strings and alias this at the huge cost we incur now is definitely over two standard deviations away from what's reasonable.

> I think the problem here is that the language doesn't give you a good way to express that. So we rely on template constraints that both can't exactly express that intention, and where the approximations create various template instantiations that cause strange problems (i.e. if you accept an enum that converts to string, it's still an enum inside the template). Whereas the language
> 
> I'm not suggesting any specific changes here, but I recognize there is a disconnect from what we *want* to express, and what the language provides.

That I am on board with.
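
For reference, the mismatch between explicit overloads and a constrained template discussed above can be sketched as follows. The names `Color`, `lengthOverload` and `lengthTemplate` are invented for the example, and it assumes a Phobos version in which `isSomeString` rejects enums, as described in this thread:

```d
import std.traits : isSomeString;

enum Color : string { red = "red", green = "green" }

// Explicit overload: the enum converts implicitly to its base type.
size_t lengthOverload(string s) { return s.length; }

// Constrained template: T is deduced as Color, and isSomeString!Color is false.
size_t lengthTemplate(T)(T s) if (isSomeString!T) { return s.length; }

static assert(!isSomeString!Color);
static assert(__traits(compiles, lengthOverload(Color.red)));
static assert(!__traits(compiles, lengthTemplate(Color.red)));

void main()
{
    assert(lengthOverload(Color.green) == 5);
    assert(lengthTemplate("green") == 5);   // a plain string still works
}
```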

May 07, 2021
On 5/7/21 2:22 PM, Jacob Carlborg wrote:
> On 2021-05-07 17:24, Andrei Alexandrescu wrote:
> 
>> Compare all that with:
>>
>> 0. We put a String type in the standard library. It uses UTF8 inside and supports iteration by either bytes, UTF8, UTF16, or UTF32. It manages its own memory so no need for the GC. It disallows remote coupling across callers/callees. Case closed.
> 
> You can have enums with the base type being a struct or a class. How does putting a String type in the standard library help with the enum problem you're describing?

The solution to that is "We do not support enums". But if you use a non-templated class String, you won't feel much of a pain in the first place because the enums will be converted to String objects upon call.

The String type solves all other problems mentioned.
May 07, 2021
On 5/7/21 2:25 PM, Jacob Carlborg wrote:
> On 2021-05-07 17:24, Andrei Alexandrescu wrote:
> 
>> 0. We put a String type in the standard library.
> 
> If you're going to make strings a user defined type, how are you planning to support things like switch statements with strings?

Built-in strings remain as they are.

May 07, 2021
On 5/7/21 2:44 PM, Steven Schveighoffer wrote:
> On 5/7/21 2:17 PM, Adam D. Ruppe wrote:
>> I think it was actually a mistake for Phobos to UFCS shoe-horn in range functions on arrays too - this includes strings as well as int[] and such as well.
> 
> The most common range BY FAR in all of D code is an array.
> 
> The end result of something like you allude to would result in nearly all of phobos NOT working with arrays.
> 
> Just a taste:
> 
> int[] arr = genArray;
> arr.sort(); // fail.
> 
> I don't want to go to that place, ever.
> 
> -Steve

Yah, ranges are a generalization of arrays. It would be odd if the generalization of arrays didn't work when tried with arrays.
May 07, 2021

On Friday, 7 May 2021 at 15:24:42 UTC, Andrei Alexandrescu wrote:

> 0. We put a String type in the standard library. It uses UTF8 inside and supports iteration by either bytes, UTF8, UTF16, or UTF32. It manages its own memory so no need for the GC. It disallows remote coupling across callers/callees. Case closed.

This is a bit orthogonal, but... An important characteristic of utf-8 arrays is that they are simultaneously a random access range of bytes and an input range of utf-8 characters. For efficiency it's often important to switch back and forth between these two interpretations.

byLine is one type of example, where a byte oriented search is done (e.g. with memchr), but afterward the representation array is accessed as a utf-8 input range.

byLine implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this both halves are treated as utf-8 input ranges, not random access.

This switching between interpretations doesn't fit well with the current distinction between char[] and byte[]. A number of algorithms in phobos operate on one or the other, but not both.

It'd be very useful to have an approach to utf-8 strings that enabled switching interpretations easily, without casting.
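
The split-in-half case above can be sketched roughly like this. `alignToBoundary` is a hypothetical helper, and it scans forward to the next boundary rather than to the nearest one, to keep the example short:

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : byDchar;

/// Moves a byte offset forward until it no longer points at a UTF-8
/// continuation byte (bit pattern 10xxxxxx). Assumes well-formed input.
size_t alignToBoundary(const(ubyte)[] bytes, size_t i)
{
    while (i < bytes.length && (bytes[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    string s = "a\u00F1\u20ACbc";        // 1 + 2 + 3 + 1 + 1 = 8 bytes
    auto bytes = s.representation;        // random-access byte view, no copy

    // Offset 4 falls inside the three-byte U+20AC, so it moves up to 6.
    immutable mid = alignToBoundary(bytes, bytes.length / 2);
    assert(mid == 6);

    // Back to the character view: both halves are valid UTF-8 on their own.
    assert(s[0 .. mid].byDchar.walkLength == 3);
    assert(s[mid .. $].byDchar.walkLength == 2);
}
```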

--Jon

May 08, 2021

On Friday, 7 May 2021 at 15:24:42 UTC, Andrei Alexandrescu wrote:

> Compare all that with:
>
> We put a String type in the standard library. It uses UTF8 inside and supports iteration by either bytes, UTF8, UTF16, or UTF32. It manages its own memory so no need for the GC. It disallows remote coupling across callers/callees. Case closed.

True. But why have it easy when you can have it complicated?

May 08, 2021

On Friday, 7 May 2021 at 17:05:08 UTC, Steven Schveighoffer wrote:

> The problem I have is, you have a function like:
>
> auto foo(T)(T s) if (isSomeString!T) { impl }
>
> The intention here is that, I want to NOT have to write:
>
> auto foo(string s) { impl }
> auto foo(wstring s) { impl }
> auto foo(dstring s) { impl }
> ... // etc with const, mutable
>
> BUT, if I have an enum that converts to a string, then if I actually DID write all those, then it would compile. However, the template version does not. This is the confusion that a user and library author has.

Maybe this is special casing here, but if you have a finite list of types you want to support, it might be easier to add an AliasSeq of all string types to std.traits or so and use

static foreach (String; Strings)
    auto foo(String s) { impl }

Looks generic, but actually isn't. The implementation bloat is a different beast though.
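
A compilable sketch of that idea, under the assumption that `Strings` is a locally defined `AliasSeq` (nothing by that name exists in std.traits today) and with `codeUnitCount` and `Color` as made-up names:

```d
import std.meta : AliasSeq;

// Hypothetical list of supported string types; qualified variants omitted.
alias Strings = AliasSeq!(string, wstring, dstring);

// Generates one ordinary, non-template overload per listed type.
static foreach (String; Strings)
{
    size_t codeUnitCount(String s) { return s.length; }
}

enum Color : string { red = "red" }

void main()
{
    Color c = Color.red;
    assert(codeUnitCount(c) == 3);        // enum converts to the string overload
    assert(codeUnitCount("hi"w) == 2);    // wstring overload
    assert(codeUnitCount("hi"d) == 2);    // dstring overload
}
```

Because these are real overloads rather than template instances, implicit conversions apply as usual, at the cost of one compiled copy of the body per type.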

May 07, 2021
On 5/7/21 6:34 PM, Jon Degenhardt wrote:
> On Friday, 7 May 2021 at 15:24:42 UTC, Andrei Alexandrescu wrote:
>> 0. We put a String type in the standard library. It uses UTF8 inside and supports iteration by either bytes, UTF8, UTF16, or UTF32. It manages its own memory so no need for the GC. It disallows remote coupling across callers/callees. Case closed.
> 
> This is a bit orthogonal, but... An important characteristic of utf-8 arrays is that they are simultaneously a random access range of bytes and an input range of utf-8 characters. For efficiency it's often important to switch back and forth between these two interpretations.
> 
> `byLine` is one type of example, where a byte oriented search is done (e.g. with `memchr`), but afterward the representation array is accessed as a utf-8 input range.
> 
> `byLine` implementations will usually work by iterating forward, but there are random access use cases as well. For example, it is perfectly reasonable to divide a utf-8 array roughly in half using byte offsets, then search for the nearest utf-8 character boundary. After this both halves are treated as utf-8 input ranges, not random access.
> 
> This switching between interpretations doesn't fit well with the current distinction between `char[]` and `byte[]`. A number of algorithms in phobos operate on one or the other, but not both.
> 
> It'd be very useful to have an approach to utf-8 strings that enabled switching interpretations easily, without casting.

String s;
func1(s.bytes);
func2(s.dchars);
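
Spelled out a little further, the three lines above could look like the sketch below. This `String` is a hypothetical stand-in that only demonstrates the two views; it wraps a GC-backed `string` and does not attempt the memory management or decoupling that the proposal calls for:

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : byDchar;

// Hypothetical stand-in for the proposed type; GC-free storage is out of scope here.
struct String
{
    private string data;

    this(string s) { data = s; }

    // Random-access view of the underlying code units as bytes.
    immutable(ubyte)[] bytes() { return data.representation; }

    // Input-range view that decodes UTF-8 into code points.
    auto dchars() { return data.byDchar; }
}

void main()
{
    auto s = String("na\u00EFve");       // U+00EF is two bytes in UTF-8
    assert(s.bytes.length == 6);
    assert(s.bytes[2] == 0xC3);          // indexable like any ubyte array
    assert(s.dchars.walkLength == 5);
}
```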


May 08, 2021

On Saturday, 8 May 2021 at 02:05:42 UTC, Andrei Alexandrescu wrote:

> On 5/7/21 6:34 PM, Jon Degenhardt wrote:
>> It'd be very useful to have an approach to utf-8 strings that enabled switching interpretations easily, without casting.
>
> String s;
> func1(s.bytes);
> func2(s.dchars);

That's not quite what I was getting at. But that's my fault. A hastily written message that muddled a couple of concepts. Sorry about that, I need to write up a better description. But there are two underlying thoughts.

One is being able to convert from a random access byte array to char input range (e.g. byUTF), do something with it (e.g. popFront), then convert that form back to a random access byte range. This is logically doable because both are views on the same physical array. However, once something is an input range it doesn't convert simply to a random access range.

This first one strikes me as potentially challenging because this dual view on the underlying data is not common, so there's not a lot of incentive to support it as a general concept.
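
One workaround available today, shown only as a sketch, is to avoid the input range altogether and decode by index with `std.utf.decode`, so the byte position is never lost. `skipOneCodePoint` is a hypothetical helper, not a Phobos function:

```d
import std.string : representation;
import std.utf : decode;

/// Decodes one code point from the front of `s` and returns the remaining
/// data as a random-access byte slice. Throws UTFException on invalid input.
immutable(ubyte)[] skipOneCodePoint(string s, out dchar decoded)
{
    size_t index = 0;
    decoded = decode(s, index);           // advances index by the code point's width
    return s.representation[index .. $];
}

void main()
{
    dchar first;
    auto rest = skipOneCodePoint("\u20ACuro", first);
    assert(first == '\u20AC');
    assert(rest.length == 3);             // three ASCII bytes remain
    assert(rest[0] == 'u');               // still random access
}
```

This sidesteps the conversion problem rather than solving it: a range such as the one returned by `byUTF` still has no general way to hand back its position in the underlying array.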

The second issue is more about current Phobos algorithms that specialize their implementations depending on whether the argument is a char[] or a byte[]. This normally involves conditioning on isSomeString or isSomeChar. char[] / char pass these tests, byte[] / byte do not. The cases I remember are cases where the string form was specialized to have better performance than the byte form. Look through searching.d for isSomeString use to see this.

The trouble with this is that at the application level it can be necessary to use a byte array when working with a number of facilities. This often involves I/O, e.g. reading fixed-size blocks from an input stream (File.byChunk). This operates on ubyte[] arrays. These can be cast to char[], but this can run afoul of autodecoding-related routines that expect correctly formed utf-8 characters. When reading fixed-size buffers, the starts and ends of the buffer will often not fall on utf-8 boundaries, so examining the bytes is necessary to handle these cases. (And input streams may contain corrupt utf-8 characters.)
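
As a rough illustration of the byte examination involved, the hypothetical helper below finds how much of a chunk can safely be treated as text, leaving a cut-off trailing sequence to be carried into the next chunk. It inspects only the tail and does not validate the rest of the data:

```d
/// Returns how many leading bytes of `chunk` end on a complete UTF-8 sequence.
size_t completePrefixLength(const(ubyte)[] chunk)
{
    size_t i = chunk.length;
    size_t seen = 0;                       // bytes examined from the end
    while (i > 0 && seen < 4)
    {
        --i;
        ++seen;
        immutable b = chunk[i];
        if ((b & 0xC0) == 0x80)
            continue;                      // continuation byte, keep looking for the lead
        // Sequence length announced by the lead byte (1 for ASCII or invalid leads).
        immutable size_t need = (b & 0xE0) == 0xC0 ? 2
                              : (b & 0xF0) == 0xE0 ? 3
                              : (b & 0xF8) == 0xF0 ? 4 : 1;
        return seen >= need ? chunk.length : i;
    }
    return chunk.length;                   // no lead byte near the end; leave it to validation
}

void main()
{
    immutable ubyte[] data = [0x61, 0xE2, 0x82, 0xAC]; // 'a' followed by U+20AC
    assert(completePrefixLength(data[0 .. 3]) == 1);   // U+20AC is cut off after two bytes
    assert(completePrefixLength(data) == 4);           // whole buffer is complete
}
```

Only the prefix would then be cast to `char[]`; the leftover bytes are prepended to the next block read from the stream.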

I know the above is still not an adequate description. At some point I'll try to write up something more compelling.

--Jon