November 10, 2021

On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat wrote:
> It seems in practice it doesn't have to be UTF-8 until you use something that assumes it is. Which is OK for me.

Hm… for me the key advantage of stricter typing is that you can make more functions free of exceptions and error-handling without using much human judgment. The ideal is to only do error handling in I/O call-trees.
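
A minimal sketch of that idea, assuming a hypothetical wrapper type `Utf8Text` (the type and the function names are invented for illustration and are not part of Phobos). Validation happens once, at the boundary, via `std.utf.validate`; everything downstream can be `nothrow` because the type carries the guarantee.

```d
import std.utf : validate;

/// Hypothetical wrapper: holding a value of this type means the bytes were validated.
/// Note: `private` only protects the field from code in other modules.
struct Utf8Text
{
    private string data;

    /// The single checked entry point, meant to be called at the I/O boundary.
    /// Throws std.utf.UTFException if `raw` is not valid UTF-8.
    static Utf8Text fromBytes(const(ubyte)[] raw)
    {
        auto chars = cast(const(char)[]) raw;
        validate(chars);
        return Utf8Text(chars.idup);
    }

    string text() const nothrow @nogc { return data; }
}

/// No error handling here: the parameter type already guarantees valid input.
size_t codePointCount(Utf8Text s) nothrow @nogc
{
    size_t n;
    foreach (char c; s.text)
        if ((c & 0xC0) != 0x80) // count every byte that is not a continuation byte
            ++n;
    return n;
}
```

Only the call to `Utf8Text.fromBytes` can throw, so the error handling stays in the code that reads the data.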

November 10, 2021

On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat wrote:
> import("file.stuff") yields string.
> So there is at least one gap, as it is often used with binary files that ain't UTF-8.

Maybe a «binary_import!T("file.data")» that yields a slice of type T?

November 11, 2021
On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim Grøstad wrote:
> I had a look at the [documentation]( https://dlang.org/spec/arrays.html#strings ) today, and it said:
>
> «char[] strings are in UTF-8 format.»
>
> I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»
>
> So, I think a messed up ```string``` should be considered a type error and it would be good if the compiler checked this statically where possible (e.g. literals) and simply assumed it to hold when parsing strings (like in a ```for``` loop).

I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so it has the same limitations as e.g. bounds checking.)
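
As a rough library-level sketch of the two checks (the compiler does not do this today; `checkedCast` and `checkedSlice` are hypothetical names): the cast has to scan every byte, while the slice only has to look at the two cut points, assuming the string was valid to begin with.

```d
import std.utf : validate;

/// Expensive: scans every byte. Throws std.utf.UTFException on invalid input.
string checkedCast(immutable(ubyte)[] raw)
{
    auto s = cast(string) raw;
    validate(s);
    return s;
}

/// An index is a code point boundary if it is the end of the string
/// or the byte there is not a continuation byte (bit pattern 10xxxxxx).
private bool onBoundary(string str, size_t i)
{
    return i == str.length || (str[i] & 0xC0) != 0x80;
}

/// Cheap: assuming `s` is already valid UTF-8, only the two cut points are inspected.
string checkedSlice(string s, size_t lo, size_t hi)
{
    if (!onBoundary(s, lo) || !onBoundary(s, hi))
        throw new Exception("slice would split a UTF-8 code point");
    return s[lo .. hi]; // the usual bounds check still applies here
}
```

With the three bytes `68 C3 A9`, `checkedSlice(s, 1, 3)` succeeds, while `checkedSlice(s, 0, 2)` throws because index 2 falls on a continuation byte.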
November 11, 2021
On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Exactly.

> Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so has the same limitations as e.g. boundschecking.)

The compiler could do such checks in an extra-solid-debug-mode. That could certainly improve unit-testing and other testing. In such a mode you could also do overflow checks for signed integers (if they are changed so they don't wrap).
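
For the integer part, a sketch of what such a check looks like when written by hand. D's signed integers wrap by definition today, so a compiler mode would need a language change; the druntime routine `core.checkedint.adds` reports overflow through a flag instead. `checkedAdd` is a hypothetical name.

```d
import core.checkedint : adds;

/// Hypothetical helper: addition that throws instead of silently wrapping.
int checkedAdd(int a, int b)
{
    bool overflow = false;
    immutable sum = adds(a, b, overflow); // sets `overflow` if the result does not fit
    if (overflow)
        throw new Exception("signed integer overflow");
    return sum;
}
```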

November 12, 2021
On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>
> Exactly.

[...]

> The compiler could do such checks in an extra-solid-debug-mode.

This requires lots of changes or additions

```
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

   [...]
          R = ubyte[]`
     must satisfy one of the following constraints:
   `       isSomeChar!(ElementType!R)
          is(StringTypeOf!R)`
November 15, 2021

On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>>
>> Exactly.
>
> [...]
>
>> The compiler could do such checks in an extra-solid-debug-mode.
>
> This requires lots of changes or additions
>
>     import std.stdio;
>     import std.file;
>
>     void main ()
>     {
>        ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
>        auto s = readText (filename);
>     }
>
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

Yes, because readText is typed in a way that it excludes valid filenames. But it's already wrong - this feature would only expose the wrongness, as filename is already not a validly typed string. File a bug?

November 15, 2021
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
November 15, 2021
On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
> On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
>> This does not yet compile:
>>
>>    [...]
>>           R = ubyte[]`
>>      must satisfy one of the following constraints:
>>    `       isSomeChar!(ElementType!R)
>>           is(StringTypeOf!R)`
>
> auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.

I meant decode then re-enc to utf
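
A sketch of what "decode, then re-encode" could look like on the caller's side, under the loud assumption that the bytes are ISO-8859-1 (Latin-1); the code cannot discover that by itself. It uses `std.encoding.transcode`; `latin1ToUtf8` is a hypothetical name.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: reinterpret raw bytes as Latin-1 and re-encode them as UTF-8.
/// The choice of Latin-1 is an assumption made by the caller, not something detected.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    auto latin1 = cast(Latin1String) raw; // same bytes, now typed as Latin-1
    string utf8;
    transcode(latin1, utf8);
    return utf8;
}
```

Note that the result is a different byte sequence from the input, so it is no longer the name the file system stores.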
November 15, 2021
On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>>
>> Exactly.
>
> [...]
>
>> The compiler could do such checks in an extra-solid-debug-mode.
>
> This requires lots of changes or additions
>
> ```
> import std.stdio;
> import std.file;
>
> void main ()
> {
>    ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
>    auto s = readText (filename);
> }
> ```
>
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

One idea that has come up would be compile time checking of strings.

But thinking about the garbage-in, garbage-out concept in general, maybe functions should really just accept data, and it's the caller's responsibility to make sure it's valid.

This becomes a philosophical discussion, but could maybe be interesting (increased compile times ofc, but could be worth it). This would be more of a D3 thing. The Erlang path is fail fast. Fix the error at its root.

Don't get me wrong, I understand why Phobos is the way it is now, and it works. It's more in the "ideas to explore" category. One might say "but what about external data, I don't know if that's valid". The answer there would be to sanitize it before passing it to the function. It would also be better from a composability viewpoint.

In summary: Keep the functions themselves short and friendly. Make the data in correct. Put the constraints outside the function.

Pros and cons as with everything ofc
November 15, 2021

On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
> On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
>> On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
>>> This does not yet compile:
>>>
>>>    [...]
>>>           R = ubyte[]`
>>>      must satisfy one of the following constraints:
>>>    `       isSomeChar!(ElementType!R)
>>>           is(StringTypeOf!R)`
>>
>> auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
>
> I meant decode then re-enc to utf

I don't see how that could work. readText would need to encode it to the OS codepage, but readText has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, i.e. the kind of conversion iconv does. There's no way around readText taking ubyte[].
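
To make that concrete, a sketch of a reader that takes the file name as raw bytes and hands them to the C runtime untouched. `readRaw` is a hypothetical function, not a Phobos API, and the sketch assumes a platform whose `fopen` accepts arbitrary byte strings as names (typical on POSIX systems; Windows interprets them in the ANSI code page).

```d
import core.stdc.stdio : FILE, fclose, fopen, fread;

/// Hypothetical: read a whole file whose name is given as raw bytes (no terminator).
ubyte[] readRaw(const(ubyte)[] name)
{
    // The C API wants a NUL-terminated byte string, so append the terminator here.
    auto zname = name ~ cast(ubyte) 0;
    FILE* f = fopen(cast(const(char)*) zname.ptr, "rb");
    if (f is null)
        throw new Exception("cannot open file");
    scope (exit) fclose(f);

    ubyte[] result;
    ubyte[4096] buf;
    for (;;)
    {
        immutable n = fread(buf.ptr, 1, buf.length, f);
        if (n == 0)
            break; // end of file (or read error, not distinguished in this sketch)
        result ~= buf[0 .. n];
    }
    return result;
}
```

No encoding is assumed anywhere: the name is never typed as `string`, so it never has to be valid UTF-8.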