May 12, 2014 Decoding HTML escape sequences
Hi,

I have some documents where some strings appear as HTML escape sequences in one of these forms:

    \x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e
    %3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

I would like to recode them to readable form:

    <SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri module:

    import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;

    static auto re = regex(`(%[a-fA-F0-9]{2})`);

    int main(in string[] args)
    {
        if (args.length < 2)
        {
            writeln("Usage: unescape file1.htm > file2.htm");
            return -1;
        }
        auto input = cast(Latin1String) read(args[1]);
        string buffer;
        transcode(input, buffer);
        string output;
        foreach(m; matchAll(buffer, re))
            output ~= decode(m.hit);
        writeln(output);
        return 0;
    }

Unfortunately it doesn't seem to work 100%. I would appreciate any suggestions.

Regards, Hugo
May 13, 2014 Re: Decoding HTML escape sequences
Posted in reply to Hugo Florentino

You should use decodeComponent instead of decode in your matchAll loop. IMO encodeComponent and decodeComponent are the only two useful URI encoding functions in std.uri (btw, the same holds in JS: use decodeURIComponent instead of the other functions). The other ones have weird rules.
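
To illustrate the suggestion, here is a minimal sketch rather than a definitive fix. As far as I read the std.uri documentation, decode leaves escapes that resolve to reserved URI characters (such as %3D for '=') undecoded, while decodeComponent decodes them. The helper names unescapeAll and decodeMatch are hypothetical. Treating each \xXX value as a single code point is an assumption about the input (the original post reads the file as Latin-1). The sketch also uses replaceAll instead of collecting matches, so text between escapes is kept.

```d
import std.conv : to;
import std.regex : Captures, regex, replaceAll;
import std.stdio : writeln;
import std.uri : decodeComponent;

// Hypothetical helper: turns one regex match into its decoded text.
string decodeMatch(Captures!string m)
{
    // A run of %XX escapes is decoded as a whole, so multi-byte UTF-8
    // sequences stay together. decodeComponent throws a URIException
    // if the bytes do not form valid UTF-8.
    if (m.hit[0] == '%')
        return decodeComponent(m.hit);

    // Otherwise this is a \xXX escape: parse the two hex digits and
    // treat the value as a single code point (assumption: Latin-1 input).
    return to!string(cast(dchar) to!uint(m[1], 16));
}

// Hypothetical helper: decodes both escape forms in a single pass and
// leaves all other text untouched.
string unescapeAll(string text)
{
    auto re = regex(`(?:%[0-9a-fA-F]{2})+|\\x([0-9a-fA-F]{2})`);
    return replaceAll!decodeMatch(text, re);
}

void main()
{
    // Both lines print: <SCRIPT LANGUAGE="JavaScript">
    writeln(unescapeAll(`%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e`));
    writeln(unescapeAll(`\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e`));
}
```

Handling both forms in one regex pass avoids decoding the same text twice, for example when a percent escape itself decodes to a backslash.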