June 24, 2018 Can I parse this kind of HTML with arsd.dom module?
This is the module I'm speaking about: https://arsd-official.dpldocs.info/arsd.dom.html

So I have this HTML that not even parseGarbage() can deal with:

<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>

There are spaces between "href", "=" and "https...", which make the code below fail:

string html = get(page, client).text;
auto document = new Document();
document.parseGarbage(html);
Element attEle = document.querySelector("span[id=link2]");
Element aEle = attEle.querySelector("a");
string link = aEle.href; // <-- if the href contains spaces, it returns "href" rather than the link

Let's say the page HTML looks like this:

<body bgcolor="#000000">
<font color="yellow">
<h2>
Hello, dear world!
<span id="link2">
<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
</span>
</h2>
</font>

I know the library author posts on this forum often, I hope he sees this and helps somehow to make it work. But if anyone else knows how to fix this, that would be very welcome too!
June 24, 2018 Re: Can I parse this kind of HTML with arsd.dom module?
Posted in reply to Dr.No | On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> string html = get(page, client).text;
> auto document = new Document();
> document.parseGarbage(html);
> Element attEle = document.querySelector("span[id=link2]");
> Element aEle = attEle.querySelector("a");
> string link = aEle.href; // <-- if the href contains spaces, it returns "href" rather than the link
>
> [...]
>
> <body bgcolor="#000000">
> <font color="yellow">
> <h2>
> Hello, dear world!
> <span id="link2">
> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
> </span>
> </h2>
> </font>
The closing </body> is missing.
Seems to be buggy; the part of the parsed document referring to "a" looks like this:
<a "https:=""https:" href="href" />G!
June 24, 2018 Re: Can I parse this kind of HTML with arsd.dom module?
Posted in reply to Timoses | On Sunday, 24 June 2018 at 10:49:51 UTC, Timoses wrote:
>> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
>> </span>
>> </h2>
>> </font>
> missing </body>
>
> Seems to be buggy, the parsed document part referring to "a" looks like this:
>
> <a "https:=""https:" href="href" />G!
It reads href as a no-content attribute (like `checked`, which becomes `checked="checked"` in XHTML style), then ignores the = as misplaced trash, then does the same with the https.
So the fix is to collapse whitespace around the =.
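
Until the parser handles this itself, one possible workaround is to collapse that whitespace before the HTML reaches parseGarbage(). This is a rough sketch only: `collapseAttributeSpaces` is a hypothetical helper, not part of arsd.dom, and the regex is deliberately crude. It assumes quoted attribute values and will also rewrite any `word = "` sequence that happens to appear in text or script content.

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper (not part of arsd.dom): removes whitespace around
/// an '=' that sits between a word character and an opening quote.
string collapseAttributeSpaces(string html)
{
    // $1 = last character of the attribute name, $2 = the opening quote
    auto re = regex(`(\w)\s*=\s*(["'])`);
    return replaceAll(html, re, "$1=$2");
}

unittest
{
    auto fixed = collapseAttributeSpaces(
        `<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>`);
    // The '=' characters inside the URL are not followed by a quote,
    // so they are left alone.
    assert(fixed == `<a href="https://hostname.com/?file=foo.png&foo=baa">G!</a>`);
}
```

With that in place, the code from the original post would call `document.parseGarbage(collapseAttributeSpaces(html));` instead of passing `html` directly.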
June 24, 2018 Re: Can I parse this kind of HTML with arsd.dom module?
Posted in reply to Dr.No | On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> I know the library author posts on this forum often, I hope he sees this and helps somehow
Yeah, I'm out this week, but it shouldn't be too hard to add: the garbage attribute parser can special-case an = surrounded by spaces and just skip the spaces.
I won't get to it today, but I might be able to tomorrow. Shoot me a reminder email if I don't by tomorrow night. The parser code is unbelievably bad, but the code to change is somewhere around line 450 if you wanna take a stab at it yourself.
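
To illustrate the special case described above, here is a simplified attribute scanner that looks past any spaces for an = before deciding that an attribute has no content. It is only a sketch and not the actual arsd.dom code: `parseAttributes` and its input format (the text between the tag name and the closing >) are assumptions made for the example.

```d
import std.ascii : isWhite;

/// Hypothetical, simplified attribute scanner; NOT the arsd.dom parser.
/// `s` is the text between the tag name and the closing '>'.
string[string] parseAttributes(string s)
{
    string[string] attrs;
    size_t i = 0;

    void skipWhite()
    {
        while (i < s.length && isWhite(s[i])) i++;
    }

    while (true)
    {
        skipWhite();
        if (i >= s.length) break;

        // Read the attribute name up to whitespace or '='.
        size_t start = i;
        while (i < s.length && !isWhite(s[i]) && s[i] != '=') i++;
        string name = s[start .. i];

        // The special case: look past spaces to see whether '=' follows.
        size_t save = i;
        skipWhite();
        if (i < s.length && s[i] == '=')
        {
            i++;          // consume '='
            skipWhite();  // and any spaces after it
            string value;
            if (i < s.length && (s[i] == '"' || s[i] == '\''))
            {
                char quote = s[i++];
                start = i;
                while (i < s.length && s[i] != quote) i++;
                value = s[start .. i];
                if (i < s.length) i++; // consume the closing quote
            }
            else
            {
                // Unquoted value: runs until the next whitespace.
                start = i;
                while (i < s.length && !isWhite(s[i])) i++;
                value = s[start .. i];
            }
            if (name.length) attrs[name] = value;
        }
        else
        {
            // No '=' follows: a no-content attribute, stored XHTML style.
            i = save;
            if (name.length) attrs[name] = name;
        }
    }
    return attrs;
}

unittest
{
    auto a = parseAttributes(`href = "https://hostname.com/?file=foo.png&foo=baa"`);
    assert(a["href"] == "https://hostname.com/?file=foo.png&foo=baa");
    assert(parseAttributes("checked")["checked"] == "checked");
}
```

The key design choice is remembering the position before skipping spaces (`save`), so that a genuine no-content attribute followed by another attribute is still handled as before.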
June 25, 2018 Re: Can I parse this kind of HTML with arsd.dom module?
Posted in reply to Dr.No | On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> to make it work. But if anyone else knows how to fix this, that would be very welcome too!
Try it now.
Thanks to Sandman83 on GitHub.