Thread overview
html fetcher/parser
Aug 12, 2017
Faux Amis
Aug 12, 2017
Adam D. Ruppe
Aug 12, 2017
Michael
Aug 13, 2017
Faux Amis
Aug 13, 2017
Adam D. Ruppe
Aug 14, 2017
Faux Amis
Aug 15, 2017
Adam D. Ruppe
Aug 12, 2017
Soulsbane
Aug 13, 2017
Faux Amis
August 12, 2017
I would like to get into D again by making a small program which fetches a website at a set interval and keeps track of all changes within specified DOM elements.

Fetching: should I go for std.net.curl, vibe.d or something else?
Parsing: I could only find these dub packages: htmld & libdominator.
They don't seem overly active; any recommendations?

As I haven't been using D for some time I just don't want to get off to a bad start :)
thx

August 12, 2017
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program which fetches a website at a set interval and keeps track of all changes within specified DOM elements.

My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

and support file for random encodings:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d


Or via dub:

http://code.dlang.org/packages/arsd-official

the dom and http subpackages are the ones you want.


Docs: http://dpldocs.info/arsd.dom


Sample program:

---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}

import std.stdio;
import arsd.dom;

void main() {
        auto document = Document.fromUrl("https://dlang.org/");
        writeln(document.optionSelector("p").innerText);
}
---

Output:

D is a general-purpose programming language with
        static typing, systems-level access, and C-like syntax.
        It combines efficiency, control and modeling power with safety
        and programmer productivity.




Note that HTTPS support requires OpenSSL to be available on your system. It works best on Linux with OpenSSL installed as a devel lib (so like openssl-devel or whatever, just like you would if using it from C).



How it works:


Document.fromUrl uses the http lib to fetch the page, then automatically parses the contents as a DOM document. It will correct for common errors in webpage markup, character sets, etc.

Document and Element both have various methods for navigating, modifying, and accessing the DOM tree. Here, I used `optionSelector`, which works like `querySelector` in JavaScript (and the same syntax is used for CSS), returning the first matching element.

querySelector, however, returns null if nothing is found. optionSelector returns a dummy object instead, so you don't have to explicitly test for null and can just access its methods.

`innerText` returns the text inside, stripped of markup. You might also want `innerHTML`, or `toString` to get the whole thing, markup and all.



There's a lot more you can do too, but I think just these few functions will be enough for your task.
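
The "keep track of changes" half of the task can be sketched with Phobos alone. This is only an illustration under stated assumptions: `ChangeTracker` and its `update` method are hypothetical names, not part of arsd, and the hard-coded snapshots stand in for text that would come from successive fetches.

```d
import std.stdio : writeln;

/// Hypothetical helper: remembers the last text seen for each CSS
/// selector and reports whether a new observation differs from it.
struct ChangeTracker
{
    private string[string] lastSeen; // selector -> text from the previous poll

    /// Records `text` for `selector`. Returns true if it differs from the
    /// previously recorded value. The first observation of a selector is
    /// stored but not reported as a change.
    bool update(string selector, string text)
    {
        if (auto previous = selector in lastSeen)
        {
            if (*previous == text)
                return false;
            *previous = text;
            return true;
        }
        lastSeen[selector] = text;
        return false;
    }
}

void main()
{
    ChangeTracker tracker;

    // Made-up snapshots standing in for three successive fetches.
    string[] snapshots = ["first version", "first version", "second version"];
    foreach (text; snapshots)
    {
        if (tracker.update("p", text))
            writeln("changed: ", text);
    }
}
```

In a real program the text would presumably come from `document.optionSelector(selector).innerText` after each `Document.fromUrl` call, with something like `Thread.sleep(5.minutes)` (from `core.thread` and `core.time`) between polls.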


Bonus fact: the standard library's levenshteinDistanceAndPath function makes doing a diff display of before and after pretty simple: http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html
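
A minimal sketch of such a diff display, assuming the before and after text have already been split into lines. `diffLines` is a hypothetical helper name and the sample lines are made up; only `levenshteinDistanceAndPath` and `EditOp` come from Phobos.

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.stdio : writeln;

/// Returns a line-by-line diff of `before` and `after`:
/// "  " marks unchanged lines, "- " removed lines, "+ " added lines.
string[] diffLines(string[] before, string[] after)
{
    // result[0] is the edit distance, result[1] the sequence of edit
    // operations that turns `before` into `after`.
    auto result = levenshteinDistanceAndPath(before, after);

    string[] output;
    size_t i = 0; // index into `before`
    size_t j = 0; // index into `after`
    foreach (op; result[1])
    {
        final switch (op)
        {
        case EditOp.none:
            output ~= "  " ~ before[i];
            i++;
            j++;
            break;
        case EditOp.substitute:
            output ~= "- " ~ before[i];
            output ~= "+ " ~ after[j];
            i++;
            j++;
            break;
        case EditOp.insert:
            output ~= "+ " ~ after[j];
            j++;
            break;
        case EditOp.remove:
            output ~= "- " ~ before[i];
            i++;
            break;
        }
    }
    return output;
}

void main()
{
    // Made-up snapshots of an element's text, split into lines.
    auto oldText = ["Welcome", "Version 1.0", "Contact us"];
    auto newText = ["Welcome", "Version 1.1", "Contact us"];
    foreach (line; diffLines(oldText, newText))
        writeln(line);
}
```

Comparing whole lines keeps the edit matrix small; the same function also works on plain strings if a character-level diff is wanted.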
August 12, 2017
On Saturday, 12 August 2017 at 20:22:44 UTC, Adam D. Ruppe wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> [...]
>
> My dom.d and http2.d combine to make this easy:
>
> https://github.com/adamdruppe/arsd/blob/master/dom.d
> https://github.com/adamdruppe/arsd/blob/master/http2.d
>
> [...]

Sometimes it feels like there's the standard D library, Phobos, and then for everything else you have already developed a suitable library to supplement it haha!
August 12, 2017
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program which fetches a website at a set interval and keeps track of all changes within specified DOM elements.
>
> Fetching: should I go for std.net.curl, vibe.d or something else?
> Parsing: I could only find these dub packages: htmld & libdominator.
> They don't seem overly active; any recommendations?
>
> As I haven't been using D for some time I just don't want to get off to a bad start :)
> thx

I've found the requests module nice to work with: http://code.dlang.org/packages/requests
August 13, 2017
On 2017-08-12 22:22, Adam D. Ruppe wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> [...]
> 
> [...]
> ---
> // compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
> 
> import std.stdio;
> import arsd.dom;
> 
> void main() {
>          auto document = Document.fromUrl("https://dlang.org/");
>          writeln(document.optionSelector("p").innerText);
> }
> ---
Nice!

> [...]
> Document.fromUrl uses the http lib to fetch the page, then automatically parses the contents as a DOM document. It will correct for common errors in webpage markup, character sets, etc.

Just curious, but is there a spec of sorts which defines which errors should be fixed and such?

> [...] Bonus fact: the standard library's levenshteinDistanceAndPath function makes doing a diff display of before and after pretty simple: http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html
Thanks for the pointer!
August 13, 2017
On 2017-08-13 01:49, Soulsbane wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> I would like to get into D again by making a small program which fetches a website at a set interval and keeps track of all changes within specified DOM elements.
>>
>> Fetching: should I go for std.net.curl, vibe.d or something else?
>> Parsing: I could only find these dub packages: htmld & libdominator.
>> They don't seem overly active; any recommendations?
>>
>> As I haven't been using D for some time I just don't want to get off to a bad start :)
>> thx
> 
> I've found the requests module nice to work with: http://code.dlang.org/packages/requests
Thanks, looks nice! I'll try it if Adam's modules fail me :)
August 13, 2017
On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
> Just curious, but is there a spec of sorts which defines which errors should be fixed and such?

The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup.

My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites, and fixed bugs as they came up to give good enough results for me. (One thing I found is that a lot of sites claiming to be UTF-8 are actually Latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar: I hit a server once that ignored the Accept-Encoding header and always sent gzip anyway, so I had to handle that... and I noticed curl actually didn't!)
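
The validate-then-fall-back idea can be sketched with Phobos alone. This is an illustration under assumptions, not the code arsd actually uses: `decodePage` is a hypothetical name, and the fallback treats every byte as a Latin-1 character, relying on the fact that Latin-1 byte values equal the first 256 Unicode code points.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns `raw` as a UTF-8 string. If the bytes are
/// not valid UTF-8, they are reinterpreted as Latin-1 and transcoded.
string decodePage(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return text.idup;
    }
    catch (UTFException)
    {
        string result;
        foreach (b; raw)
            result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
        return result;
    }
}

void main()
{
    // "café" with é as the single Latin-1 byte 0xE9 (invalid as UTF-8).
    const(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];
    // "café" with é as the UTF-8 byte pair 0xC3 0xA9.
    const(ubyte)[] utf8Bytes = [0x63, 0x61, 0x66, 0xC3, 0xA9];

    writeln(decodePage(latin1Bytes));
    writeln(decodePage(utf8Bytes));
}
```

Pages mislabeled this way are often really Windows-1252 rather than strict Latin-1, so a more thorough fallback would need a mapping table for the 0x80 to 0x9F range.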

So on the one hand, there are surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use, so I am fairly confident it will be OK for most things.

August 15, 2017
On 2017-08-13 19:51, Adam D. Ruppe wrote:
> On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
>> Just curious, but is there a spec of sorts which defines which errors should be fixed and such?
> 
> The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup.
> 
> My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites, and fixed bugs as they came up to give good enough results for me. (One thing I found is that a lot of sites claiming to be UTF-8 are actually Latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar: I hit a server once that ignored the Accept-Encoding header and always sent gzip anyway, so I had to handle that... and I noticed curl actually didn't!)
> 
> So on the one hand, there are surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use, so I am fairly confident it will be OK for most things.
> 

Sounds good!
(Although following the spec would be the first step to a D HTML layout engine :D )
August 15, 2017
On Monday, 14 August 2017 at 23:15:13 UTC, Faux Amis wrote:
> (Although following the spec would be the first step to a D HTML layout engine :D )

Oh, I've actually done some of that before too.
https://github.com/adamdruppe/arsd/blob/master/htmlwidget.d


It is pretty horrible... but it managed to render my old homepage, which used CSS float, boxes, and basic tables. I don't know if it still compiles; I haven't even tried it for years.