No we should not support enum types derived from strings (page 6)

On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:

If all an algorithm needs to do is split a string roughly in half, then use the byte offsets to find the halfway point and then look for a utf-8 character boundary. If the algorithm is based on some other boundary, say, token boundaries, then find one of those boundaries.

The algorithms you are talking about either don't need strings at all, only byte/char arrays, or would produce garbage for any input other than ASCII.
Your example with log files mixes binary data with text. A properly done logger will escape delimiters inside text chunks, so it isn't even a string per se; it's some binary data from which you need to extract a string first.
A lot of bugs are caused by this mixing of text with binary. And I think it is better to distinguish them properly on a type level.

May 08, 2021

Re: No we should not support enum types derived from strings

Posted by Jon Degenhardt
in reply to guai

On Saturday, 8 May 2021 at 21:47:21 UTC, guai wrote:

On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:

The algorithms you are talking about either don't need strings at all, only byte/char arrays, or would produce garbage for any input other than ASCII.

I don't understand the point you are trying to make. Perhaps you could rephrase.

I've implemented any number of these types of algorithms. It's very common to mix interpretation as Unicode strings with interpretation as UTF-8 bytes. For example, it may be necessary to do case conversion at some stage of processing. This has to be done on Unicode characters, not bytes. But needing to do such processing at some point does not exclude treating the data as UTF-8 bytes for other purposes.
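A minimal sketch of this kind of mixing, assuming the input is valid UTF-8. The helper names `countNewlines` and `upperCased` are hypothetical and exist only for illustration; `std.string.representation` and `std.uni.toUpper` are the Phobos functions used for the byte view and the character-level case conversion.

```d
import std.string : representation;
import std.uni : toUpper;

/// Hypothetical helper: scans the raw UTF-8 code units, no decoding needed.
/// The byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so a
/// byte-level scan is safe here.
size_t countNewlines(string text)
{
    size_t n;
    foreach (ubyte b; text.representation)
        if (b == '\n')
            ++n;
    return n;
}

/// Hypothetical helper: case conversion has to work on characters, so it
/// goes through std.uni rather than touching individual bytes.
string upperCased(string text)
{
    return toUpper(text);
}

void main()
{
    string data = "caf\u00E9\nna\u00EFve\n";
    assert(countNewlines(data) == 2);                  // byte-level view
    assert(upperCased("caf\u00E9") == "CAF\u00C9");    // character-level view
}
```

The same `string` value serves both purposes; only the view onto it changes.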

Also, a char[] in D is defined to be utf-8, and a string is an immutable(char)[]. So why would utf-8 data, including non-ascii characters, read into a char[] produce garbage? The answer is that it wouldn't. No, you cannot simply start on an arbitrary byte boundary, but nobody has suggested this.
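As a rough illustration of splitting near the halfway byte offset without landing inside a character, here is a sketch. The function name `midBoundary` is hypothetical, and the code assumes the input is already valid UTF-8.

```d
import std.utf : validate;

/// Hypothetical helper: returns a byte index near the middle of `s` that
/// falls on a UTF-8 code point boundary.
size_t midBoundary(const(char)[] s)
{
    size_t i = s.length / 2;
    // Continuation bytes have the bit pattern 10xxxxxx; skip past them.
    while (i < s.length && (s[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

void main()
{
    // Three two-byte characters: C3 A9 C3 A9 C3 A9 (6 code units).
    string s = "\u00E9\u00E9\u00E9";
    immutable mid = midBoundary(s);   // 6 / 2 = 3 is a continuation byte, so mid moves to 4
    auto left = s[0 .. mid];
    auto right = s[mid .. $];
    validate(left);                   // throws if a half were not valid UTF-8
    validate(right);
    assert(left ~ right == s);
}
```

Scanning forward is an arbitrary choice for this sketch; scanning backward to the previous boundary would work just as well.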

Your example with log files mixes binary data with text. A properly done logger will escape delimiters inside text chunks, so it isn't even a string per se; it's some binary data from which you need to extract a string first.

Again, I'm not following the logic. Log files may or may not include binary data. But I'm not sure why that matters. I'm talking about log files where the text portions are encoded as UTF-8.

A lot of bugs are caused by this mixing of text with binary. And I think it is better to distinguish them properly on a type level.

Perhaps it would help if you described what you mean by "binary". I tend to think of "binary" as things like image data, binary serialization formats, base-64 coding, compressed or encrypted text. These are quite different than utf-8 encoded unicode text.

On 5/7/2021 7:16 AM, Paul Backus wrote:

> "Is a string type" and "is implicitly convertible to a string type" are not the same thing.

Language lawyer point: An enum can be implicitly converted to its base type, but it's a match level 2:

https://dlang.org/spec/function.html#function-overloading

(Agreeing with Paul)
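A small sketch of the distinction being drawn here. The enum `Color` and the functions `describe` and `takesString` are hypothetical names made up for illustration; only the `is` expressions and ordinary overload resolution are used.

```d
enum Color : string { red = "red", green = "green" }

// "Is implicitly convertible to string" holds...
static assert(is(Color : string));
// ...but "is the string type" does not.
static assert(!is(Color == string));
static assert(is(Color == enum));

string describe(string s) { return "string overload"; }
string describe(Color c) { return "Color overload"; }

string takesString(string s) { return s; }

void main()
{
    // An exact match on Color is preferred over the implicit conversion.
    assert(describe(Color.red) == "Color overload");
    assert(describe("red") == "string overload");
    // With only a string parameter available, the enum converts implicitly.
    assert(takesString(Color.green) == "green");
}
```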

On 5/7/2021 7:05 PM, Andrei Alexandrescu wrote:

> String s;
> func1(s.bytes);
> func2(s.dchars);

Already done:

s.byCodeUnit
s.byChar
s.byWchar
s.byDchar
s.byUTF

https://dlang.org/phobos/std_utf.html
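For illustration, a minimal sketch using two of these `std.utf` ranges. The wrapper names `codeUnits` and `codePoints` are hypothetical; `byCodeUnit`, `byDchar` and `walkLength` are the Phobos functions.

```d
import std.range : walkLength;
import std.utf : byCodeUnit, byDchar;

/// Hypothetical helper: number of UTF-8 code units (bytes) in `s`.
size_t codeUnits(string s)
{
    return s.byCodeUnit.walkLength;
}

/// Hypothetical helper: number of decoded code points in `s`.
size_t codePoints(string s)
{
    return s.byDchar.walkLength;
}

void main()
{
    string s = "h\u00E9llo";      // U+00E9 takes two bytes in UTF-8
    assert(codeUnits(s) == 6);
    assert(codePoints(s) == 5);
}
```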

On Friday, 7 May 2021 at 15:33:56 UTC, Adam D. Ruppe wrote:

> On Friday, 7 May 2021 at 15:25:30 UTC, Andrei Alexandrescu wrote:
>> Enums derived from strings should not be supported as strings in the standard library.
>
> I don't think the stdlib should special case much of anything.
>
> Special casing enums is a mistake. If the user wants it treated as a string, they can cast it to a string.
>
> [...]
>
> Kill all the special cases!

100% agreed, but, back to my original point, why is the enum thing a special case to begin with?

The fact that it is a special case to begin with flies in the face of Liskov's substitution principle: the enum type clearly is a subtype of string. You have to wonder how it came to be that it just doesn't work automatically to begin with.

Adding special cases is indeed the wrong path. There is something deeper rotten here, and just saying "no, this shouldn't work" is not cutting it.

Not that there should be special cases, but it'd be good to understand why these are special cases to begin with, and fix this.

Alternatively, we decide enums are not subtypes, in which case they shouldn't be implicitly convertible either. That wouldn't be such a bad idea, as I've often missed the ability to do opaque type aliasing in D, but that seems way more disruptive than just admitting that "enum strings" are indeed a subtype of string.

On Sunday, 9 May 2021 at 02:57:42 UTC, Walter Bright wrote:

> On 5/7/2021 7:16 AM, Paul Backus wrote:
>> "Is a string type" and "is implicitly convertible to a string type" are not the same thing.
>
> Language lawyer point:
>
> An enum can be implicitly converted to its base type, but it's a match level 2:
>
> https://dlang.org/spec/function.html#function-overloading
>
> (Agreeing with Paul)

Sorry to be blunt, but this is a complete language lawyering fail.

Classes implementing an interface are a subtype and are match level 2 (implicit conversion) when matching against the interface. In fact, any subtype is expected to be a match level 2. Arguably, this isn't bijective, as not all level 2 matches will be subtypes, so that doesn't definitively settle the topic at hand, but the arguments made in this thread are disturbingly unsound.

On Friday, 7 May 2021 at 16:43:20 UTC, Andrei Alexandrescu wrote:

> Well you see here is the problem. An enum with base string can be coerced to a string, but is not a true subtype of string. This came to a head with ranges, too - you can pop off the head of a string still have a string, but if you pop off the head of an enum string you get some enum value that is not present in the set of enum values. Concatenation has similar problems, e.g. s ~ s for enum strings yields string, not an enum string. (Weirdly s ~= s works...)

Popping the head out of an enum value ought to be a string, not that enum's value. I don't really see where the problem is here; this is subtyping 101.

I raised a few times in the past that there were unsound operations being performed (as in "Weirdly s ~= s works...") but I don't think turning compiler bugs into standard library policies is going to lead to better tomorrows.
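A hedged sketch of the "ought to be a string" behaviour described above, written so that the result is explicitly of the base type. The enum `Color` and the helpers `tailOf` and `twice` are hypothetical names for illustration; the code does not depend on whether the compiler types `c ~ c` as `string` or as the enum, since the result is returned as a `string` either way.

```d
import std.range.primitives : popFront;

enum Color : string { red = "red", green = "green" }

/// Hypothetical helper: step down to the base type first, then pop.
/// The result is an ordinary string, not a Color.
string tailOf(Color c)
{
    string s = c;     // implicit conversion to the base type
    s.popFront();
    return s;
}

/// Hypothetical helper: concatenation returned as a plain string.
string twice(Color c)
{
    return c ~ c;
}

void main()
{
    assert(tailOf(Color.red) == "ed");
    assert(twice(Color.red) == "redred");
}
```

Neither "ed" nor "redred" is a member of `Color`, which is why treating the results as plain `string` values is the sound reading.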
