Crash in byCodeUnit() <- byDchar() when converting faulty text to HTML

Jun 15, 2014

Nordlöw

Jun 15, 2014

Nordlöw

Jun 16, 2014

monarch_dodra

Jun 16, 2014

monarch_dodra

Jun 16, 2014

Nordlöw

June 15, 2014

Posted by Nordlöw

I'm using the following snippet to convert a UTF-8 string to HTML:

import std.traits: isSomeChar; // needed for the template constraint below

/** Convert character $(D c) to HTML representation. */
string toHTML(C)(C c) @safe pure if (isSomeChar!C)
{
    import std.conv: to;
    if      (c == '&')  return "&amp;"; // ampersand
    else if (c == '<')  return "&lt;"; // less than
    else if (c == '>')  return "&gt;"; // greater than
    else if (c == '\"') return "&quot;"; // double quote
    else if (0 < c && c < 128)
        return to!string(cast(char)c);
    else
        return "&#" ~ to!string(cast(int)c) ~ ";";
}

static if (__VERSION__ >= 2066L)
{
    /** Convert string $(D s) to HTML representation. */
    auto encodeHTML(string s) @safe pure
    {
        import std.utf: byDchar;
        import std.algorithm: joiner, map;
        return s.byDchar.map!toHTML.joiner("");
    }
}

Note that it uses Walter's new std.utf.byDchar.

But it triggers

core.exception.RangeError@std/utf.d(2703): Range violation
----------------
Stack trace:
#1: ?? line (0)
#2: ?? line (0)
#3: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (2703)
#4: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (3232)
#5: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (510)
#6: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3440)
#7: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3540)
#8: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/range.d line (1861)
#9: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2172)
#10: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2843)
#11: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (3167)
#12: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (526)
#13: /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/stdio.d line (1168)

for non-UTF-8 input.

Is this intentional?

utf.d on line 2703 is inside byCodeUnit().

When I use byChar() it doesn't crash, but then I get incorrect conversions.

Could somebody explain the difference between byChar, byWchar and byDchar?
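
The incorrect conversions with byChar are probably what you would expect if each UTF-8 code unit is passed to toHTML on its own instead of a decoded code point. A minimal sketch of that effect, assuming a recent Phobos in which std.utf.byChar and std.utf.byDchar accept a string directly (toHTML is repeated from the post so the example is self-contained):

```d
import std.algorithm : equal, joiner, map;
import std.conv : to;
import std.traits : isSomeChar;
import std.utf : byChar, byDchar;

/// Same logic as toHTML in the post, repeated here for self-containment.
string toHTML(C)(C c) @safe pure if (isSomeChar!C)
{
    if      (c == '&')  return "&amp;";
    else if (c == '<')  return "&lt;";
    else if (c == '>')  return "&gt;";
    else if (c == '"')  return "&quot;";
    else if (0 < c && c < 128)
        return to!string(cast(char)c);
    else
        return "&#" ~ to!string(cast(int)c) ~ ";";
}

void main()
{
    // 'ö' (U+00F6) is stored as the two UTF-8 code units 0xC3 0xB6.
    string s = "\u00F6";

    // byDchar decodes to a single code point, decimal 246.
    assert(s.byDchar.map!toHTML.joiner("").equal("&#246;"));

    // byChar yields the code units one by one: 0xC3 = 195, 0xB6 = 182.
    assert(s.byChar.map!toHTML.joiner("").equal("&#195;&#182;"));
}
```

So for HTML numeric character references, iterating by dchar is the variant that matches the intent of the snippet.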

On Sunday, 15 June 2014 at 23:09:24 UTC, Nordlöw wrote:
> Is this intentional?
>
> utf.d on line 2703 is inside byCodeUnit().

AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them.

I'll look into it.

> When I use byChar() it doesn't crash, but then I get incorrect conversions.
>
> Could somebody explain the difference between byChar, byWchar and byDchar?

What's there to say? They all take a range of characters, and return it as a range of the corresponding requested type.

In the case of "byDchar", it decodes the string (returning a "BadChar" for invalid encodings).

The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char-type.
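
To make the difference concrete, here is a small sketch that counts the elements each range produces for the same string. It assumes a recent Phobos where byChar, byWchar and byDchar can be applied to a string directly; the sample string is only an illustration.

```d
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar;

void main()
{
    // 'ö' (U+00F6), '€' (U+20AC) and the musical G clef (U+1D11E).
    string s = "\u00F6\u20AC\U0001D11E";

    assert(s.byChar.walkLength  == 9); // UTF-8 code units: 2 + 3 + 4
    assert(s.byWchar.walkLength == 4); // UTF-16 code units: 1 + 1 + 2 (surrogate pair)
    assert(s.byDchar.walkLength == 3); // code points
}
```

Only byDchar yields one element per code point; the other two yield the code units of the requested encoding.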

On Monday, 16 June 2014 at 10:02:16 UTC, monarch_dodra wrote:
> I'll look into it.

Yeah, there's an issue in the implementation. I brought it up in the pull page. If it doesn't get attention there, I'll file it.
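
Until the implementation is fixed, one possible workaround (a sketch, not something proposed in this thread) is to check the input with std.utf.validate before handing it to the range-based conversion. The helper name isValidUTF8 is hypothetical, and the truncated two-byte sequence below is only one example of malformed input.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: returns true if s is well-formed UTF-8.
bool isValidUTF8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 'a' followed by a lead byte (0xC3) whose continuation byte is missing.
    ubyte[] raw = [0x61, 0xC3];
    auto bad = cast(const(char)[]) raw;

    assert(isValidUTF8("a < b"));
    assert(!isValidUTF8(bad));
}
```

Input that fails the check can then be rejected or cleaned up before calling encodeHTML.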

> AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them.
>
> I'll look into it.

Superb!

> What's there to say? They all take a range of characters, and return it as a range of the corresponding requested type.

Excuse me for the kind of dumb question. I was unsure about the details.

Is there a bleeding-edge (in sync with git master) variant of the dlang.org docs I can read instead of the source? If not, I build dmd, druntime and phobos daily for testing purposes, so I might as well build the docs too and get them from there.

> In the case of "byDchar", it decodes the string (returning a "BadChar" for invalid encodings).

This is what I want/need :)

> The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char-type.

Ok. Got it! Thx a lot.