Thread overview
Can I parse this kind of HTML with arsd.dom module?
Jun 24, 2018
Dr.No
Jun 24, 2018
Timoses
Jun 24, 2018
Adam D. Ruppe
Jun 24, 2018
Adam D. Ruppe
Jun 25, 2018
Adam D. Ruppe
June 24, 2018
This is the module I'm speaking about: https://arsd-official.dpldocs.info/arsd.dom.html

So I have this HTML that not even parseGarbage() can deal with:

<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>

There are spaces between "href", "=", and "https...", which make the code below fail:


	string html = get(page, client).text;
	auto document = new Document();
	document.parseGarbage(html);
	Element attEle = document.querySelector("span[id=link2]");
	Element aEle = attEle.querySelector("a");
	string link = aEle.href; // <-- if there are spaces around the "=", this returns "href" rather than the link



Let's say the page's HTML looks like this:

<body bgcolor="#000000">
<font color="yellow">
<h2>
	Hello, dear world!
	<span id="link2">
<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
	</span>
</h2>
</font>

I know the library author posts on this forum often; I hope he sees this and can help somehow to make it work. But if anyone else knows how to fix this, that would be very welcome too!
June 24, 2018
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> 	string html = get(page, client).text;
> 	auto document = new Document();
> 	document.parseGarbage(html);
> 	Element attEle = document.querySelector("span[id=link2]");
> 	Element aEle = attEle.querySelector("a");
> 	string link = aEle.href; // <-- if there are spaces around the "=", this returns "href" rather than the link
>
> [...]
>
> <body bgcolor="#000000">
> <font color="yellow">
> <h2>
> 	Hello, dear world!
> 	<span id="link2">
> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
> 	</span>
> </h2>
> </font>
missing </body>

Seems to be buggy; the part of the parsed document referring to "a" looks like this:

<a "https:="&quot;https:" href="href" />G!


June 24, 2018
On Sunday, 24 June 2018 at 10:49:51 UTC, Timoses wrote:
>> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
>> 	</span>
>> </h2>
>> </font>
> missing </body>
>
> Seems to be buggy; the part of the parsed document referring to "a" looks like this:
>
> <a "https:="&quot;https:" href="href" />G!

It reads href as a no-content attribute (like `checked`, which becomes `checked="checked"` in XHTML style), then ignores the = as misplaced trash, then does the same with the https.

So the fix is to collapse whitespace around the =.
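
Until the parser itself handles this, one possible workaround is to collapse that whitespace yourself before calling parseGarbage(). The following is only a minimal sketch: `collapseAttrEquals` is a hypothetical helper, not part of arsd.dom, and it is deliberately naive. It rewrites every `=` that is followed by a quote, so it would also touch matching text outside of tags (for example inside scripts or plain text content).

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper (not part of arsd.dom): removes whitespace around
/// any '=' that is followed by an opening quote, so that
/// `href = "..."` becomes `href="..."`.
string collapseAttrEquals(string html)
{
    // optional spaces, '=', optional spaces, then a single or double quote
    auto re = regex(`\s*=\s*(["'])`);
    return replaceAll(html, re, "=$1");
}
```

It would be used as `document.parseGarbage(collapseAttrEquals(html));`. A `=` inside the URL, such as `file=foo.png`, is left alone because no quote follows it.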
June 24, 2018
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> I know the library author posts on this forum often; I hope he sees this and can help somehow

Yeah, I'm out this week, but it shouldn't be too hard to add: the garbage attribute parser can special-case an = surrounded by spaces and just skip the spaces.

I won't get to it today, but I might be able to tomorrow. Shoot me a reminder email if I don't by tomorrow night. The parser code is unbelievably bad, but the code to change is somewhere around line 450 if you wanna take a stab at it yourself.
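
For anyone who wants to see the idea in isolation, here is a rough sketch of an attribute parser that skips spaces on both sides of the =. This is not the actual arsd.dom code; `parseAttributes` is a hypothetical, simplified function that takes only the text between the tag name and the closing `>`, and it ignores entity decoding and other details a real parser needs.

```d
import std.ascii : isWhite;

/// Hypothetical, simplified attribute parser; not arsd.dom's actual code.
/// `s` is the text after the tag name, e.g. `href = "..." checked`.
string[string] parseAttributes(string s)
{
    string[string] attrs;
    size_t i = 0;

    void skipSpaces() { while (i < s.length && isWhite(s[i])) i++; }

    while (true)
    {
        skipSpaces();
        if (i >= s.length) break;

        // Read the attribute name up to whitespace or '='.
        size_t start = i;
        while (i < s.length && !isWhite(s[i]) && s[i] != '=') i++;
        string name = s[start .. i];
        if (name.length == 0) { i++; continue; } // stray '=', skip it

        // The key step: look past any spaces for an '='.
        skipSpaces();
        if (i < s.length && s[i] == '=')
        {
            i++;          // consume the '='
            skipSpaces(); // and any spaces after it
            string value;
            if (i < s.length && (s[i] == '"' || s[i] == '\''))
            {
                // Quoted value: read up to the matching quote.
                char quote = s[i++];
                start = i;
                while (i < s.length && s[i] != quote) i++;
                value = s[start .. i];
                if (i < s.length) i++; // consume the closing quote
            }
            else
            {
                // Unquoted value: read up to the next whitespace.
                start = i;
                while (i < s.length && !isWhite(s[i])) i++;
                value = s[start .. i];
            }
            attrs[name] = value;
        }
        else
        {
            // No '=' follows: a no-content attribute such as `checked`.
            attrs[name] = name;
        }
    }
    return attrs;
}
```

The only difference from a parser that trips over `href = "..."` is the pair of skipSpaces() calls around the = check; without them, href is taken as a no-content attribute and the rest is treated as trash.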
June 25, 2018
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> to make it work. But if anyone else knows how to fix this, that would be very welcome too!

Try it now.

Thanks to Sandman83 on GitHub.