Thread overview
Crash in byCodeUnit() <- byDchar() when converting faulty text to HTML
Jun 15, 2014
Nordlöw
Jun 15, 2014
Nordlöw
Jun 16, 2014
monarch_dodra
Jun 16, 2014
monarch_dodra
Jun 16, 2014
Nordlöw
June 15, 2014
I'm using the following snippet to convert a UTF-8 string to HTML:

import std.traits: isSomeChar;

/** Convert character $(D c) to HTML representation. */
string toHTML(C)(C c) @safe pure if (isSomeChar!C)
{
    import std.conv: to;
    if      (c == '&')  return "&amp;"; // ampersand
    else if (c == '<')  return "&lt;"; // less than
    else if (c == '>')  return "&gt;"; // greater than
    else if (c == '\"') return "&quot;"; // double quote
    else if (0 < c && c < 128)
        return to!string(cast(char)c);
    else
        return "&#" ~ to!string(cast(int)c) ~ ";";
}

static if (__VERSION__ >= 2066L)
{
    /** Convert string $(D s) to HTML representation. */
    auto encodeHTML(string s) @safe pure
    {
        import std.utf: byDchar;
        import std.algorithm: joiner, map;
        return s.byDchar.map!toHTML.joiner("");
    }
}

Note that it uses Walter's new std.utf.byDchar.

But it triggers

core.exception.RangeError@std/utf.d(2703): Range violation
----------------
Stack trace:
#1: ?? line (0)
#2: ?? line (0)
#3: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (2703)
#4: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (3232)
#5: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (510)
#6: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3440)
#7: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3540)
#8: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/range.d line (1861)
#9: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2172)
#10: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2843)
#11: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (3167)
#12: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (526)
#13: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/stdio.d line (1168)

for non-UTF-8 input.

Is this intentional?

utf.d on line 2703 is inside byCodeUnit().

When I use byChar() it doesn't crash, but then I get incorrect conversions.

Could somebody explain the difference between byChar, byWchar and byDchar?
June 15, 2014
> But it triggers

See also: https://github.com/nordlow/justd/blob/master/test/t_err.d
June 16, 2014
On Sunday, 15 June 2014 at 23:09:24 UTC, Nordlöw wrote:
> Is this intentional?
>
> utf.d on line 2703 is inside byCodeUnit().

AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them.

I'll look into it.

> When I use byChar() it doesn't crash, but then I get incorrect conversions.
>
> Could somebody explain the difference between byChar, byWchar and byDchar?

What's there to say? They all take a range of characters, and return it as a range of the corresponding requested type.

In the case of "byDchar", it decodes the string (while returning a "BadChar" for invalid encodings).

The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char-type.
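To make that concrete, here is a minimal sketch of why mapping over byChar gives "incorrect conversions" for non-ASCII text. It only uses valid UTF-8 input, so it does not touch the crash itself. numericRef is a hypothetical helper written just for this illustration; it is not part of Phobos or of the original snippet.

```d
import std.algorithm : equal, map;
import std.conv : to;
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

/// Hypothetical helper (not in Phobos): numeric character reference
/// for a single code unit or code point.
string numericRef(C)(C c)
{
    return "&#" ~ to!string(cast(uint) c) ~ ";";
}

void main()
{
    // 'a' followed by U+00F6, which UTF-8 encodes as the bytes 0xC3 0xB6.
    string s = "a\u00F6";

    // The element count depends on the requested character type.
    assert(s.byChar.walkLength == 3);  // UTF-8 code units: 1 + 2
    assert(s.byWchar.walkLength == 2); // UTF-16 code units: 1 + 1
    assert(s.byDchar.walkLength == 2); // one element per code point

    // byDchar hands the mapped function the code point 246 ...
    assert(s.byDchar.map!(c => numericRef(c)).equal(["&#97;", "&#246;"]));

    // ... while byChar hands it the two UTF-8 bytes 195 and 182.
    assert(s.byChar.map!(c => numericRef(c)).equal(["&#97;", "&#195;", "&#182;"]));
}
```

So a per-element conversion such as toHTML needs byDchar: with byChar, every byte of a multi-byte sequence would become its own entity.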
June 16, 2014
On Monday, 16 June 2014 at 10:02:16 UTC, monarch_dodra wrote:
> I'll look into it.

Yeah, there's an issue in the implementation. I brought it up on the pull request page. If it doesn't get attention there, I'll file it.
June 16, 2014
> AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them.
>
> I'll look into it.

Superb!

> What's there to say? They all take a range of characters, and return it as a range of the corresponding requested type.

Excuse me for the kind of dumb question. I was unsure about the details. Is there a bleeding-edge (in sync with git master) variant of the dlang.org docs I can read instead of the source? If not, I build dmd, druntime and Phobos daily for testing purposes, so I might as well build the docs too and get it from there.

> In the case of "byDchar", it decodes the string (while returning a "BadChar" for invalid encodings).

This is what I want/need :)

> The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char-type.

Ok. Got it!

Thx a lot.