dmd foreach loops throw exceptions on invalid UTF sequences, use replacementDchar instead (page 11)

November 10, 2021

Posted by Ola Fosheim Grøstad
in reply to Guillaume Piolat

On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat wrote:

> It seems in practice it doesn't have to be utf-8 until you use something that assume it is. Which is ok for me.

Hm… for me the key advantage of stricter typing is that you can make more functions free of exceptions and error-handling without using much human judgment. The ideal is to only do error handling in I/O call-trees.
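
As a hedged illustration of the quoted point: a `char[]` can hold arbitrary bytes without complaint until something decodes it. In the sketch below, iterating by `char` only visits code units, while iterating by `dchar` makes the runtime decode, which is where the exception in the thread title comes from. The helper names `countCodeUnits` and `decodingThrows` are made up for this example, and the throwing behaviour is assumed to match the compilers discussed in this thread.

```d
/// Hypothetical helper: counts code units. Nothing is decoded, so invalid
/// UTF-8 cannot make this fail.
size_t countCodeUnits(const(char)[] s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

/// Hypothetical helper: reports whether decoding `s` in a foreach loop throws.
bool decodingThrows(const(char)[] s)
{
    try
    {
        size_t n;
        foreach (dchar c; s) // the runtime decodes UTF-8 here
            ++n;
        return false;
    }
    catch (Exception e) // deliberately broad: the concrete exception type is not assumed
    {
        return true;
    }
}
```

With a buffer such as `['a', cast(char) 0x80, 'b']`, `countCodeUnits` simply returns 3; only the decoding loop has to decide what to do with the stray `0x80`.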

November 10, 2021

Posted by Ola Fosheim Grøstad
in reply to Guillaume Piolat

On Wednesday, 10 November 2021 at 11:47:16 UTC, Guillaume Piolat wrote:

> import("file.stuff") yields string.
> So there is at least one gap, as it is often used with binary files that ain't UTF-8.

Maybe a «binary_import!T("file.data")» that yields slice of type T?

November 11, 2021

Posted by Elronnd
in reply to Ola Fosheim Grøstad

On Wednesday, 10 November 2021 at 10:23:31 UTC, Ola Fosheim Grøstad wrote:
> I had a look at the [documentation]( https://dlang.org/spec/arrays.html#strings ) today, and it said:
>
> «char[] strings are in UTF-8 format.»
>
> I would assume that this is normative? Maybe change the documentation to use more forceful specification language so that it says: «char[] strings MUST be in UTF-8 format.»
>
> So, I think a messed up ```string``` should be considered a type error and it would be good if the compiler checked this statically where possible (e.g. literals) and simply assumed it to hold when parsing strings (like in a ```for``` loop).

I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so has the same limitations as e.g. boundschecking.)
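
A minimal sketch of what those two checks could look like as library code today, assuming the validation is done with `std.utf.validate`. The names `toCheckedString` and `isCodePointBoundary` are hypothetical; nothing like them is inserted by the compiler.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: the "expensive" check, a full O(n) validation on the
/// way from bytes to string.
string toCheckedString(immutable(ubyte)[] bytes)
{
    auto s = cast(string) bytes; // reinterprets the same memory, no copy
    validate(s);                 // std.utf.validate throws UTFException if malformed
    return s;
}

/// Hypothetical helper: the "cheap" check, O(1) per slice end. In a string
/// that is already valid UTF-8, an index is a safe cut point unless it lands
/// on a continuation byte (bit pattern 10xxxxxx).
bool isCodePointBoundary(const(char)[] s, size_t i)
{
    return i == s.length || (i < s.length && (s[i] & 0xC0) != 0x80);
}
```

The slicing check is cheap only because it relies on the invariant established by the cast check: if the whole string is known to be valid, looking at one byte per slice end is enough.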

November 11, 2021

Posted by Ola Fosheim Grøstad
in reply to Elronnd

On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].

Exactly.

> Go further: require a runtime check on cast from ubyte[] to char[] (expensive), and on slicing char[] (cheap).  (If you abuse unions you are on your own; but obviously that is not allowed in @safe code, so has the same limitations as e.g. boundschecking.)

The compiler could do such checks in an extra-solid-debug-mode. That could certainly improve unit-testing and other testing. In such a mode you could also do overflow checks for signed integers (if they are changed so they don't wrap).
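
For the integer half of that idea, a rough library-level approximation already exists in `core.checkedint`. The sketch below assumes that route; `checkedAdd` is a hypothetical wrapper, not something a debug mode currently generates.

```d
import core.checkedint : adds;

/// Hypothetical helper: adds two ints and throws instead of wrapping.
int checkedAdd(int a, int b)
{
    bool overflow = false;
    immutable result = adds(a, b, overflow); // sets `overflow` if the sum does not fit
    if (overflow)
        throw new Exception("signed integer overflow");
    return result;
}
```

A compiler mode would insert the equivalent of this check at every `+` on signed integers, which is why it is suggested here as a testing aid and not as the default.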

November 12, 2021

Posted by kdevel
in reply to Ola Fosheim Grøstad

On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>
> Exactly.

[...]

> The compiler could do such checks in an extra-solid-debug-mode.

This requires lots of changes or additions

```
import std.stdio;
import std.file;

void main ()
{
   ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
   auto s = readText (filename);
}
```

This does not yet compile:

   [...]
          R = ubyte[]`
     must satisfy one of the following constraints:
   `       isSomeChar!(ElementType!R)
          is(StringTypeOf!R)`

November 15, 2021

Posted by FeepingCreature
in reply to kdevel

On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>>
>> Exactly.
>
> [...]
>
>> The compiler could do such checks in an extra-solid-debug-mode.
>
> This requires lots of changes or additions
>
>     import std.stdio;
>     import std.file;
>
>     void main ()
>     {
>        ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
>        auto s = readText (filename);
>     }
>
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

Yes, because readText is typed in a way that excludes valid filenames. But that is already wrong today; this feature would only expose the wrongness, since such a filename is already not a validly typed string. File a bug?

November 15, 2021

Posted by user1234
in reply to kdevel

On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

Auto-decoding or not, you need to decode from whatever the OS encoding is (some ancient ANSI code page, I presume?) to UTF-8.

November 15, 2021

Posted by user1234
in reply to user1234

On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
> On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
>> This does not yet compile:
>>
>>    [...]
>>           R = ubyte[]`
>>      must satisfy one of the following constraints:
>>    `       isSomeChar!(ElementType!R)
>>           is(StringTypeOf!R)`
>
> auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.

I meant decode, then re-encode to UTF-8.
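
A small sketch of that decode-then-re-encode step, under the loud assumption that the source encoding is known and happens to be Latin-1. It uses `std.encoding.transcode`; the wrapper name `latin1ToUtf8` is made up for this example. When the encoding is not known, nothing here can guess it.

```d
import std.encoding : Latin1String, transcode;

/// Hypothetical helper: re-encodes bytes that are ASSUMED to be Latin-1
/// as a UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    string utf8;
    transcode(cast(Latin1String) raw, utf8); // decodes Latin-1, encodes UTF-8
    return utf8;
}
```

Note that this converts text for display or processing; the result is no longer the byte sequence the OS knows the file by.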

November 15, 2021

Posted by Imperatorn
in reply to kdevel

On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
> On Thursday, 11 November 2021 at 07:58:54 UTC, Ola Fosheim Grøstad wrote:
>> On Thursday, 11 November 2021 at 01:31:46 UTC, Elronnd wrote:
>>> I agree this should be required.  If you want something which is not valid UTF-8, _do not put it into a string_.  Use ubyte[].
>>
>> Exactly.
>
> [...]
>
>> The compiler could do such checks in an extra-solid-debug-mode.
>
> This requires lots of changes or additions
>
> ```
> import std.stdio;
> import std.file;
>
> void main ()
> {
>    ubyte [] filename = [ 'a', 0x80, 'b', '\0' ]; // valid filename in some OS
>    auto s = readText (filename);
> }
> ```
>
> This does not yet compile:
>
>    [...]
>           R = ubyte[]`
>      must satisfy one of the following constraints:
>    `       isSomeChar!(ElementType!R)
>           is(StringTypeOf!R)`

One idea that has come up would be compile time checking of strings.

But thinking about the garbage-in, garbage-out concept in general, maybe functions should really just accept data, and it's the caller's responsibility that it's valid.

This becomes a philosophical discussion, but it could be interesting (increased compile times ofc, but could be worth it). This would be more of a D3 thing. The Erlang path is fail fast: fix the error at its root.

Don't get me wrong, I understand why Phobos is the way it is now, and it works. It's more in the "ideas to explore" category. One might say "but what about external data, I don't know if that's valid". The answer there would be to sanitize it before passing it to the function. It would also be better from a composability viewpoint.

In summary: Keep the functions themselves short and friendly. Make the incoming data correct. Put the constraints outside the function.

Pros and cons as with everything ofc.
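
One possible shape for the "sanitize it before passing it to the function" step, sketched with `std.utf.byDchar`, which substitutes `replacementDchar` (U+FFFD) for invalid sequences and does not throw. The name `sanitizeUtf8` is hypothetical, and replacing bad input is only one policy; rejecting it outright is the other obvious choice.

```d
import std.array : appender;
import std.utf : byDchar;

/// Hypothetical helper: returns a valid UTF-8 copy of `raw`, with every
/// invalid sequence replaced by U+FFFD.
string sanitizeUtf8(const(char)[] raw)
{
    auto result = appender!string();
    foreach (dchar c; raw.byDchar) // byDchar yields replacementDchar on bad input
        result.put(c);             // the appender re-encodes each code point as UTF-8
    return result.data;
}
```

Called once at the I/O boundary, a step like this lets the functions further down assume valid input and stay free of error handling.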

November 15, 2021

Posted by FeepingCreature
in reply to user1234

On Monday, 15 November 2021 at 08:22:13 UTC, user1234 wrote:
> On Monday, 15 November 2021 at 08:20:57 UTC, user1234 wrote:
>> On Friday, 12 November 2021 at 10:42:15 UTC, kdevel wrote:
>>> This does not yet compile:
>>>
>>>    [...]
>>>           R = ubyte[]`
>>>      must satisfy one of the following constraints:
>>>    `       isSomeChar!(ElementType!R)
>>>           is(StringTypeOf!R)`
>>
>> auto-decoding or not... you need to decode from whatever is the OS encoding (must be ancient ANSI I presume ?) to UTF-8.
>
> I meant decode then re-enc to utf

I don't see how that could work. readText would need to encode it to the OS codepage, but readText has no idea what encoding you intend. And the encoding of a filename isn't even always determined by the locale; consider trying to access filenames saved in a different locale, i.e. what iconv does. There's no way around readText taking ubyte[].
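
As a rough sketch of what a byte-typed entry point could look like today, the code below bypasses `readText` and hands the raw bytes to the C runtime's `fopen`. This is not a Phobos API: `readBytesByRawName` is a hypothetical name, the filename is assumed to be NUL-terminated as in kdevel's example, and passing arbitrary bytes through unchanged is assumed to be what the OS expects (true on typical POSIX systems, not in general on Windows).

```d
import core.stdc.stdio : FILE, fclose, fopen, fread;

/// Hypothetical helper: reads a whole file whose name is given as raw,
/// NUL-terminated bytes in whatever encoding the file system uses.
ubyte[] readBytesByRawName(const(ubyte)[] zeroTerminatedName)
{
    assert(zeroTerminatedName.length && zeroTerminatedName[$ - 1] == 0,
           "filename must end with a NUL byte");

    FILE* f = fopen(cast(const(char)*) zeroTerminatedName.ptr, "rb");
    if (f is null)
        throw new Exception("cannot open file");
    scope (exit) fclose(f);

    ubyte[] content;
    ubyte[4096] buf;
    for (;;)
    {
        immutable n = fread(buf.ptr, 1, buf.length, f);
        if (n == 0)
            break;              // end of file (or read error)
        content ~= buf[0 .. n]; // appending copies the chunk out of the buffer
    }
    return content;
}
```

The result is deliberately `ubyte[]` as well: whether the file's contents are UTF-8 is a separate question from how its name is encoded.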
