Can I parse this kind of HTML with arsd.dom module? - D Programming Language Discussion Forum


June 24, 2018

Can I parse this kind of HTML with arsd.dom module?

Posted by Dr.No

This is the module I'm speaking about: https://arsd-official.dpldocs.info/arsd.dom.html

So I have this HTML that not even parseGarbage() can deal with:

<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>

There are spaces between "href", "=" and "https...", which make the code below fail:


	string html = get(page, client).text;
	auto document = new Document();
	document.parseGarbage(html);
	Element attEle = document.querySelector("span[id=link2]");
	Element aEle = attEle.querySelector("a");
	string link = aEle.href; // <-- if there are spaces around the "=", this returns "href" rather than the link



Let's say the page HTML looks like this:

<body bgcolor="#000000">
<font color="yellow">
<h2>
	Hello, dear world!
	<span id="link2">
<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
	</span>
</h2>
</font>

I know the library author posts on this forum often; I hope he sees this and can help somehow to make it work. But if anyone else knows how to fix this, that would be very welcome too!

June 24, 2018

Re: Can I parse this kind of HTML with arsd.dom module?

Posted by Timoses
in reply to Dr.No

On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> 	string html = get(page, client).text;
> 	auto document = new Document();
> 	document.parseGarbage(html);
> 	Element attEle = document.querySelector("span[id=link2]");
> 	Element aEle = attEle.querySelector("a");
> 	string link = aEle.href; // <-- if there are spaces around the "=", this returns "href" rather than the link
>
> [...]
>
> <body bgcolor="#000000">
> <font color="yellow">
> <h2>
> 	Hello, dear world!
> 	<span id="link2">
> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
> 	</span>
> </h2>
> </font>
missing </body>

Seems to be buggy; the part of the parsed document referring to "a" looks like this:

<a "https:="&quot;https:" href="href" />G!

June 24, 2018

Re: Can I parse this kind of HTML with arsd.dom module?

Posted by Adam D. Ruppe
in reply to Timoses

On Sunday, 24 June 2018 at 10:49:51 UTC, Timoses wrote:
>> <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
>> 	</span>
>> </h2>
>> </font>
> missing </body>
>
> Seems to be buggy; the part of the parsed document referring to "a" looks like this:
>
> <a "https:="&quot;https:" href="href" />G!

It reads href as a no-content attribute (like `checked`, which becomes `checked="checked"` in XHTML style), then ignores the = as misplaced trash, then does the same with the https.

So the fix is to collapse whitespace around the =...
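
Until the parser handles this itself, the same idea can be applied on the caller's side by normalizing the markup before parsing. The following is only a sketch: `collapseAttributeSpaces` is a hypothetical helper, not part of arsd.dom, and it assumes attribute values are quoted. It is deliberately naive and will also rewrite matching text outside of tags (for example `x = "y"` in ordinary page text).

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper (not part of arsd.dom): removes whitespace around
/// an '=' that sits between a name character and an opening quote, so that
/// `href = "..."` becomes `href="..."`.
string collapseAttributeSpaces(string html)
{
    // Group 1: last character of the attribute name.
    // Group 2: the opening quote of the value.
    auto re = regex(`(\w)\s*=\s*(["'])`);
    return replaceAll(html, re, `$1=$2`);
}
```

With the code from the original post, this would be used as `document.parseGarbage(collapseAttributeSpaces(html));`. An `=` inside the URL, as in `file=foo.png`, is left alone because it is not followed by a quote.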

June 24, 2018

Re: Can I parse this kind of HTML with arsd.dom module?

Posted by Adam D. Ruppe
in reply to Dr.No

On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> I know the library author posts on this forum often; I hope he sees this and can help somehow

Yeah, I'm out this week, but it shouldn't be too hard to add: the garbage attribute parser can special-case an = surrounded by spaces and just skip the spaces.

I won't get to it today, but I might be able to tomorrow. Shoot me a reminder email if I don't by tomorrow night. The parser code is unbelievably bad, but the code to change is somewhere around line 450 if you wanna take a stab at it yourself.
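
To illustrate the special case described above, here is a small standalone attribute scanner. It is a simplified sketch and not the arsd.dom code: the name `parseAttributes` and its whole structure are assumptions made for illustration. It takes the text between the tag name and the closing `>`, and after reading a name it looks past any spaces for an `=` before deciding that the attribute has no value.

```d
import std.ascii : isWhite;

/// Hypothetical, simplified attribute scanner; NOT the arsd.dom parser.
/// `s` is the text between the tag name and the closing '>'.
string[string] parseAttributes(string s)
{
    string[string] attrs;
    size_t i = 0;

    void skipSpaces() { while (i < s.length && isWhite(s[i])) i++; }

    while (true)
    {
        skipSpaces();
        if (i >= s.length) break;

        // Read the attribute name up to whitespace or '='.
        size_t start = i;
        while (i < s.length && !isWhite(s[i]) && s[i] != '=') i++;
        string name = s[start .. i];

        // The special case: look past any spaces for an '='.
        size_t save = i;
        skipSpaces();
        if (i < s.length && s[i] == '=')
        {
            i++;          // consume '='
            skipSpaces(); // also allow spaces before the value
            string value;
            if (i < s.length && (s[i] == '"' || s[i] == '\''))
            {
                char quote = s[i++];
                start = i;
                while (i < s.length && s[i] != quote) i++;
                value = s[start .. i];
                if (i < s.length) i++; // consume the closing quote
            }
            else
            {
                // Unquoted value: runs until the next whitespace.
                start = i;
                while (i < s.length && !isWhite(s[i])) i++;
                value = s[start .. i];
            }
            attrs[name] = value;
        }
        else
        {
            // No '=' follows: a valueless attribute such as `checked`,
            // stored XHTML-style as name="name".
            i = save;
            attrs[name] = name;
        }
    }
    return attrs;
}
```

Given `href = "https://hostname.com/?file=foo.png&foo=baa"`, this sketch yields a single `href` entry holding the full URL, while a bare `checked` is still stored as `checked="checked"`.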

June 25, 2018

Re: Can I parse this kind of HTML with arsd.dom module?

Posted by Adam D. Ruppe
in reply to Dr.No

On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
> to make it work. But if anyone else knows how to fix this, that would be very welcome too!

Try it now.

Thanks to Sandman83 on GitHub.

Copyright © 1999-2021 by the D Language Foundation