March 27, 2018
My IRC bot is suddenly seeing crashes. It reads characters from a Socket into a ubyte[] array, then idups parts of that (full lines) into strings for parsing. Parsing involves slicing such strings into meaningful segments: sender, event type, target channel/user, message content, etc. I can assume all of them to be char[]-compliant except for the content field.

Running it in a debugger I see I'm tripping an assert in utf.d[1] when calling stripRight on a content slice[2].

> /++
>     Returns the number of code units that are required to encode the code point
>     $(D c) when $(D C) is the character type used to encode it.
>   +/
> ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
> if (isSomeChar!C)
> {
>     static if (C.sizeof == 1)
>     {
>         if (c <= 0x7F) return 1;
>         if (c <= 0x7FF) return 2;
>         if (c <= 0xFFFF) return 3;
>         if (c <= 0x10FFFF) return 4;
>         assert(false);  // <--
>     }
>     // ...

This trips it:

> import std.string;
>
> void main()
> {
>     string s = "\355\342\256 \342\245\341⮢\256\245 ᮮ\241饭\250\245".stripRight;  // <-- asserts false
> }

The real backtrace:
> #0  _D3std3utf__T10codeLengthTaZQpFNaNbNiNfwZh (c=26663461) at /usr/include/dlang/dmd/std/utf.d:2530
> #1  0x000055555578d7aa in _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi (this=0x7fffffff99c0, __applyArg1=@0x7fffffff9978: 26663461, __applyArg0=@0x7fffffff9970: 17) at /usr/include/dlang/dmd/std/string.d:2918
> #2  0x00007ffff7a47014 in _aApplyRcd2 () from /usr/lib/libphobos2.so.0.78
> #3  0x000055555578d731 in _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...) at /usr/include/dlang/dmd/std/string.d:2915
> #4  0x00005555558e0cc7 in _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv (slice=..., event=...,parser=...) at source/kameloso/irc.d:1184


Should that not be an Exception, as it's based on input? I'm not sure where the character 26663461 came from. Even so, should it assert?

I don't know what to do right now. I'd like to avoid sanitizing all lines. I could catch an Exception but not so much an AssertError.


[1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
[2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184
March 27, 2018
On Tuesday, March 27, 2018 23:29:57 Anonymouse via Digitalmars-d-learn wrote:
> My IRC bot is suddenly seeing crashes. It reads characters from a Socket into a ubyte[] array, then idups parts of that (full lines) into strings for parsing. Parsing involves slicing such strings into meaningful segments: sender, event type, target channel/user, message content, etc. I can assume all of them to be char[]-compliant except for the content field.
>
> Running it in a debugger I see I'm tripping an assert in utf.d[1] when calling stripRight on a content slice[2].
>
> > /++
> >     Returns the number of code units that are required to encode the code point
> >     $(D c) when $(D C) is the character type used to encode it.
> >   +/
> > ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
> > if (isSomeChar!C)
> > {
> >     static if (C.sizeof == 1)
> >     {
> >         if (c <= 0x7F) return 1;
> >         if (c <= 0x7FF) return 2;
> >         if (c <= 0xFFFF) return 3;
> >         if (c <= 0x10FFFF) return 4;
> >         assert(false);  // <--
> >     }
> >     // ...
>
> This trips it:
>
> > import std.string;
> >
> > void main()
> > {
> >     string s = "\355\342\256 \342\245\341⮢\256\245 ᮮ\241饭\250\245".stripRight;  // <-- asserts false
> > }
>
> The real backtrace:
> > #0  _D3std3utf__T10codeLengthTaZQpFNaNbNiNfwZh (c=26663461) at /usr/include/dlang/dmd/std/utf.d:2530
> > #1  0x000055555578d7aa in _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi (this=0x7fffffff99c0, __applyArg1=@0x7fffffff9978: 26663461, __applyArg0=@0x7fffffff9970: 17) at /usr/include/dlang/dmd/std/string.d:2918
> > #2  0x00007ffff7a47014 in _aApplyRcd2 () from /usr/lib/libphobos2.so.0.78
> > #3  0x000055555578d731 in _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...) at /usr/include/dlang/dmd/std/string.d:2915
> > #4  0x00005555558e0cc7 in _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv (slice=..., event=...,parser=...) at source/kameloso/irc.d:1184
>
> Should that not be an Exception, as it's based on input? I'm not sure where the character 26663461 came from. Even so, should it assert?
>
> I don't know what to do right now. I'd like to avoid sanitizing all lines. I could catch an Exception but not so much an AssertError.
>
> [1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
> [2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184

It means that codeLength requires that the dchar be a valid code point, though the documentation doesn't say so, and it probably should. Presumably no one expected it to be passed an invalid code point, especially since it's usually called with well-known values rather than data from a source like a socket. Regardless, the workaround is to call isValidDchar on the dchar before passing it to codeLength, so that you can handle an invalid code point yourself instead of tripping the assertion.
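
As a rough sketch of that check (not tested against the bot itself): safeCodeLength and isValidUTF8 below are hypothetical helper names, not Phobos functions. The first guards codeLength with isValidDchar as described above. The second goes a step beyond that and is only an assumption about what might suit the stripRight case: since the caller never sees the individual dchars there, it runs std.utf.validate over the whole slice, which throws a catchable UTFException rather than asserting.

```d
import std.utf : UTFException, codeLength, isValidDchar, validate;

/// Hypothetical helper: number of UTF-8 code units needed for `c`,
/// or 0 if `c` is not a valid code point.
ubyte safeCodeLength(dchar c) @safe pure nothrow @nogc
{
    if (!isValidDchar(c)) return 0;
    return codeLength!char(c);
}

/// Hypothetical helper: true if `line` is well-formed UTF-8.
/// validate throws UTFException on malformed input, which is caught here.
bool isValidUTF8(const(char)[] line) @safe pure
{
    try
    {
        validate(line);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(safeCodeLength('a') == 1);
    assert(safeCodeLength('\u00E9') == 2);

    // The value from the backtrace is above 0x10FFFF, so it is rejected.
    assert(safeCodeLength(cast(dchar) 26_663_461) == 0);

    assert(isValidUTF8("hello  "));

    // First three bytes of the failing line, written as hex escapes.
    assert(!isValidUTF8("\xED\xE2\xAE"));
}
```

With something like isValidUTF8 in place, only the content field would need checking before stripRight is called on it, rather than sanitizing every line.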

- Jonathan M Davis