May 08, 2021

On Saturday, 8 May 2021 at 20:19:51 UTC, Berni44 wrote:

>

Oh, yes, you can! Think of an algorithm which is doing cryptographic analysis and counting consecutive pairs of ascii characters. For that it doesn't matter if there is RTL text cut into pieces.

Cryptography isn't done on strings; it's done on byte arrays. Why would you even want to use a string here? Its methods won't be of any help.
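
For illustration, a minimal sketch of counting consecutive byte pairs on a byte array rather than on a string. The function name `countPairs` and the 16-bit key packing are hypothetical choices made for this example, not something proposed in the thread; `representation` is the existing `std.string` function that exposes a string's code units as `ubyte`s.

```d
import std.string : representation;

/// Counts every pair of adjacent bytes in `data`.
/// Each pair is packed into a ushort key: first byte high, second byte low.
size_t[ushort] countPairs(const(ubyte)[] data)
{
    size_t[ushort] counts;
    foreach (i; 1 .. data.length)
    {
        const key = cast(ushort)((data[i - 1] << 8) | data[i]);
        counts[key]++; // a missing key starts at 0 before the increment
    }
    return counts;
}

void main()
{
    // View the string as raw bytes; no decoding takes place.
    auto counts = countPairs("abab".representation);
    assert(counts[0x6162] == 2); // "ab" occurs twice
    assert(counts[0x6261] == 1); // "ba" occurs once
    assert(counts.length == 2);
}
```

Because the function only ever sees `ubyte`s, nothing in it depends on how the text is segmented into characters.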

May 08, 2021

On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:

>

If all an algorithm needs to do is split a string roughly in half, then use the byte offsets to find the halfway point and then look for a utf-8 character boundary. If the algorithm is based on some other boundary, say, token boundaries, then find one of those boundaries.

The algorithms you are talking about either don't need strings at all, but rather byte/char arrays, or would produce garbage for any input other than ASCII.
Your example with log files mixes binary data with text. A properly written logger will escape delimiters inside text chunks, so it isn't even a string per se; it's binary data from which you first need to extract a string.
A lot of bugs are caused by this mixing of text with binary. And I think it is better to distinguish them properly on a type level.

May 08, 2021

On Saturday, 8 May 2021 at 20:06:35 UTC, guai wrote:

> >

The thing is making the range be of dchars doesn't help with this.

At least it won't induce more problems

This is what Phobos already does and it has already created more problems. It was a mistake to do it this way.

But if string was just an opaque(ish) blob with a variety of accessor properties it would work better then. The big mistake Phobos made was trying to automatically do something and causing friction by that automatic thing not being right.

May 08, 2021

On Saturday, 8 May 2021 at 21:54:28 UTC, Adam D. Ruppe wrote:

>

On Saturday, 8 May 2021 at 20:06:35 UTC, guai wrote:

> >

The thing is making the range be of dchars doesn't help with this.

At least it won't induce more problems

This is what Phobos already does and it has already created more problems. It was a mistake to do it this way.

But if string was just an opaque(ish) blob with a variety of accessor properties it would work better then. The big mistake Phobos made was trying to automatically do something and causing friction by that automatic thing not being right.

The opaque blob model also allows SSO much more easily.

May 08, 2021

On Saturday, 8 May 2021 at 21:47:21 UTC, guai wrote:

>

On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:

>

If all an algorithm needs to do is split a string roughly in half, then use the byte offsets to find the halfway point and then look for a utf-8 character boundary. If the algorithm is based on some other boundary, say, token boundaries, then find one of those boundaries.

The algorithms you are talking about either don't need strings at all, but rather byte/char arrays, or would produce garbage for any input other than ASCII.

I don't understand the point you are trying to make. Perhaps you could rephrase.

I've implemented any number of these types of algorithms. It's very common to mix interpretation as unicode strings with interpretation as utf-8 bytes. For example, maybe it's necessary to do case conversion at some stage of processing. This has to be done on unicode characters, not bytes. But needing to do such processing at some point does not exclude treating the data as utf-8 bytes for other purposes.

Also, a char[] in D is defined to be utf-8, and a string is an immutable(char)[]. So why would utf-8 data, including non-ascii characters, read into a char[] produce garbage? The answer is that it wouldn't. No, you cannot simply start on an arbitrary byte boundary, but nobody has suggested this.
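
A minimal sketch of the "split roughly in half" case, assuming valid UTF-8 input: start at the middle byte offset and step forward past any continuation bytes. The name `splitPoint` is hypothetical, and the function finds a code point boundary only, not a grapheme boundary.

```d
/// Returns a byte index near the middle of `s` that does not fall inside a
/// multi-byte UTF-8 sequence. Assumes `s` holds valid UTF-8.
size_t splitPoint(const(char)[] s)
{
    size_t mid = s.length / 2;
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while (mid < s.length && (s[mid] & 0xC0) == 0x80)
        ++mid;
    return mid;
}

void main()
{
    string s = "日本語"; // three code points, three bytes each
    const mid = splitPoint(s);
    assert(mid == 6);           // byte 4 is a continuation byte, so advance to 6
    assert(s[0 .. mid] == "日本");
    assert(s[mid .. $] == "語");
}
```

Both halves remain valid `string` slices, and no decoding to `dchar` was needed to find the boundary.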

>

Your example with log files mixes binary data with text. A properly written logger will escape delimiters inside text chunks, so it isn't even a string per se; it's binary data from which you first need to extract a string.

Again, I'm not following the logic. Log files may or may not include binary data. But I'm not sure why that matters. I'm talking about log files where the text portions are encoded as utf-8.

>

A lot of bugs are caused by this mixing of text with binary. And I think it is better to distinguish them properly on a type level.

Perhaps it would help if you described what you mean by "binary". I tend to think of "binary" as things like image data, binary serialization formats, base-64 coding, compressed or encrypted text. These are quite different than utf-8 encoded unicode text.

May 08, 2021
On 5/7/2021 7:16 AM, Paul Backus wrote:
> "Is a string type" and "is implicitly convertible to a string type" are not the same thing.

Language lawyer point:

An enum can be implicitly converted to its base type, but it's a match level 2:

https://dlang.org/spec/function.html#function-overloading

(Agreeing with Paul)
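
A small sketch of the distinction, using a hypothetical enum `Color` and overload set `f` that are not taken from the thread: the enum converts implicitly to `string`, yet it is a distinct type, so an overload taking the enum is preferred as the exact match.

```d
enum Color : string { red = "red" }

string f(string s) { return "string overload"; }
string f(Color c)  { return "Color overload"; }

// "Is implicitly convertible to string" holds, "is string" does not.
static assert(is(Color : string));
static assert(!is(Color == string));

void main()
{
    string s = Color.red;                      // implicit conversion to the base type
    assert(s == "red");
    assert(f(Color.red) == "Color overload");  // exact match beats the conversion
    assert(f("red") == "string overload");     // a string does not convert to Color
}
```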
May 09, 2021
On 5/7/2021 7:05 PM, Andrei Alexandrescu wrote:
> String s;
> func1(s.bytes);
> func2(s.dchars);

Already done:

s.byCodeUnit
s.byChar
s.byWchar
s.byDchar
s.byUTF

https://dlang.org/phobos/std_utf.html
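
A brief sketch of how these views differ on the same data; the sample string is an arbitrary choice for this example.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar, byWchar;

void main()
{
    string s = "héllo"; // 'é' takes two UTF-8 code units

    assert(s.length == 6);                // length in bytes
    assert(s.byCodeUnit.walkLength == 6); // chars, no decoding
    assert(s.byWchar.walkLength == 5);    // UTF-16 code units
    assert(s.byDchar.walkLength == 5);    // code points
}
```

The caller picks the view explicitly, so nothing is decoded unless it is asked for.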
May 10, 2021
On Friday, 7 May 2021 at 15:33:56 UTC, Adam D. Ruppe wrote:
> On Friday, 7 May 2021 at 15:25:30 UTC, Andrei Alexandrescu wrote:
>> Enums derived from strings should not be supported as strings in the standard library.
>
> I don't think the stdlib should special case much of anything.
>
> Special casing enums is a mistake. If the user wants it treated as a string, they can cast it to a string.
>
> [...]
>
> Kill all the special cases!

100% agreed, but, back to my original point, why is the enum thing a special case to begin with?

The fact that it is a special case to begin with flies in the face of Liskov's substitution principle - the enum type clearly is a subtype of string.

You have to wonder how it came to be that it just doesn't work automatically to begin with. Adding special cases is indeed the wrong path. There is something deeper rotten here, and just saying "no, this shouldn't work" is not cutting it.

Note that there should be special cases, but it'd be good to understand why these are special cases to begin with, and fix this.

Alternatively, we decide enums are not subtypes, in which case they shouldn't be implicitly convertible either. That wouldn't be such a bad idea as I've often missed the ability to do opaque type aliasing in D, but that seems way more disruptive than just admitting that "enum strings" are indeed a subtype of string.
May 10, 2021
On Sunday, 9 May 2021 at 02:57:42 UTC, Walter Bright wrote:
> On 5/7/2021 7:16 AM, Paul Backus wrote:
>> "Is a string type" and "is implicitly convertible to a string type" are not the same thing.
>
> Language lawyer point:
>
> An enum can be implicitly converted to its base type, but it's a match level 2:
>
> https://dlang.org/spec/function.html#function-overloading
>
> (Agreeing with Paul)

Sorry to be blunt, but this is a complete language lawyering fail.

Classes implementing an interface are subtypes of it and are a match level 2 (implicit conversion) when matching against the interface.

In fact, any subtype is expected to be a match level 2. Arguably, this isn't bijective, as not all level 2 matches will be subtypes, so that doesn't definitively settle the topic at hand, but the arguments made in this thread are disturbingly unsound.
May 10, 2021
On Friday, 7 May 2021 at 16:43:20 UTC, Andrei Alexandrescu wrote:
> Well you see here is the problem. An enum with base string can be coerced to a string, but is not a true subtype of string. This came to a head with ranges, too - you can pop off the head of a string and still have a string, but if you pop off the head of an enum string you get some enum value that is not present in the set of enum values. Concatenation has similar problems, e.g. s ~ s for enum strings yields string, not an enum string. (Weirdly s ~= s works...)
>

Popping the head out of an enum value ought to be a string, not that enum's value. I don't really see where the problem is here, this is subtyping 101.

I have raised a few times in the past that unsound operations are allowed (as in "Weirdly s ~= s works..."), but I don't think turning compiler bugs into standard library policies is going to lead to better tomorrows.