Decoding HTML escape sequences

May 12, 2014

Posted by Hugo Florentino

Hi, I have some documents in which some strings appear as HTML escape sequences in one of these forms:

\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e

%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri module:


import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;

static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
  if (args.length < 2)
  {
    writeln("Usage: unescape file1.htm > file2.htm");
    return -1;
  }
  auto input = cast(Latin1String) read(args[1]);
  string buffer;
  transcode(input, buffer);

  string output;
  foreach(m; matchAll(buffer, re)) output ~= decode(m.hit);

  writeln(output);

  return 0;
}


Unfortunately it doesn't seem to work 100%.

I would appreciate any suggestion.

Regards, Hugo

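You should use decodeComponent instead of decode in your matchAll loop. decode leaves escape sequences that resolve to reserved URI characters (for example %3D, which is '=') undecoded, while decodeComponent decodes every escape sequence. In my opinion, encodeComponent and decodeComponent are the only two useful URI encoding functions in std.uri. The same applies in JavaScript: use decodeURIComponent rather than the other functions, which have odd rules.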
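
As a rough sketch of the suggestion above, and not necessarily what either poster ended up using, the following D program shows the difference between decode and decodeComponent and then handles both the \xHH and %HH forms in one pass. The helper name unescapeHex is hypothetical, and treating each escaped byte as a Latin-1 code point is an assumption based on the Latin1String read in the original code. Unlike the loop in the question, it uses replaceAll, so text that is not escaped is kept in the output.

```d
import std.conv : to;
import std.regex : Captures, regex, replaceAll;
import std.stdio : writeln;
import std.uri : decode, decodeComponent;

// Hypothetical helper (not from the thread): replaces every \xHH or %HH
// escape with the character it encodes and leaves all other text untouched.
// Assumption: each byte value is taken as a Latin-1 code point.
string unescapeHex(string text)
{
    static string toChar(Captures!string m)
    {
        // m[1] holds the two hex digits captured by the group below.
        return to!string(cast(dchar) to!uint(m[1], 16));
    }

    // Matches either prefix (\x or %) followed by exactly two hex digits.
    auto re = regex(`(?:\\x|%)([0-9A-Fa-f]{2})`);
    return replaceAll!toChar(text, re);
}

void main()
{
    // decode() leaves escapes of reserved URI characters such as '=' alone;
    // decodeComponent() decodes them as well.
    writeln(decode("%4C%3D%22"));          // L%3D"
    writeln(decodeComponent("%4C%3D%22")); // L="

    // Both escape forms from the question go through the same helper.
    writeln(unescapeHex(`\x3C\x53\x43\x52\x49\x50\x54\x3e`)); // <SCRIPT>
    writeln(unescapeHex("%3C%53%43%52%49%50%54%3e"));         // <SCRIPT>
}
```

One design note: decodeComponent interprets escaped bytes of 0x80 and above as UTF-8 sequences, so if the documents contain Latin-1 bytes escaped one at a time, a byte-wise helper like the one sketched here may be the safer choice.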
