March 23, 2016 parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Hello!
I want to set up a web robot to detect changes on certain web pages or sites.
Any hint to similar projects or libraries at dub or git to look at,
before starting to develop my own RegExp for parsing?

Best regards
mt.
|
March 23, 2016 Re: parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Posted in reply to Martin Tschierschke | On Wednesday, 23 March 2016 at 09:02:37 UTC, Martin Tschierschke wrote:
> Hello!
> I want to set up a web robot to detect changes on certain web pages or sites.
> Any hint to similar projects or libraries at dub or git to look at,
> before starting to develop my own RegExp for parsing?
>
> Best regards
> mt.
Adam's dom.d will get you pretty far. I believe it can also handle documents that aren't completely well-formed.

https://github.com/adamdruppe/arsd/blob/master/dom.d
|
March 23, 2016 Re: parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rene Zwanenburg | On Wednesday, 23 March 2016 at 09:06:37 UTC, Rene Zwanenburg wrote:
[...]
>
> Adam's dom.d will get you pretty far. I believe it can also handle documents that aren't completely well-formed.
>
> https://github.com/adamdruppe/arsd/blob/master/dom.d
Thank you! This forum has an incredibly fast auto responder ;-)
|
March 23, 2016 Re: parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Posted in reply to Rene Zwanenburg | On Wednesday, 23 March 2016 at 09:06:37 UTC, Rene Zwanenburg wrote:
> Adam's dom.d will get you pretty far. I believe it can also handle documents that aren't completely well-formed.
>
> https://github.com/adamdruppe/arsd/blob/master/dom.d
HTML docs here:

http://dpldocs.info/experimental-docs/arsd.dom.html

through Adam's own web service.
|
March 23, 2016 Re: parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Posted in reply to Martin Tschierschke | On Wednesday, 23 March 2016 at 09:02:37 UTC, Martin Tschierschke wrote:
> Hello!
> I want to set up a web robot to detect changes on certain web pages or sites.
> Any hint to similar projects or libraries at dub or git to look at,
> before starting to develop my own RegExp for parsing?
>
> Best regards
> mt.
See also: http://code.dlang.org/packages/htmld
|
March 24, 2016 Re: parsing HTML for a web robot (crawler) like application | ||||
---|---|---|---|---|
| ||||
Posted in reply to Nordlöw | On Wednesday, 23 March 2016 at 10:49:03 UTC, Nordlöw wrote:
> HTML-docs here:
>
> http://dpldocs.info/experimental-docs/arsd.dom.html
Indeed, though the docs are still a work in progress (the lib is now about 6 years old, but until recently, ddoc blocked me from using examples in the comments so I didn't bother. I've fixed that now though, but haven't finished writing them all up).

Basic idea though for web scraping:

    import arsd.dom; // dom.d from the arsd repository

    auto document = new Document();
    document.parseGarbage(your_html_string);

    // supports most of the CSS syntax, and you might also know it from jQuery
    Element[] elements = document.querySelectorAll("css selector");

    // or if you just want the first hit or null if none...
    Element element = document.querySelector("css selector");

And once you have a reference:

    element.innerText
    element.innerHTML

to print its contents in some form.

You can do a lot more too (a LOT more), but just these functions should get you started.

The parseGarbage function will also need you to compile in the characterencodings.d file from my same github. It will handle charset detection and translation as well as tag soup parsing.

I use it for a lot of web scraping myself.
|
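For the change-detection goal from the original question, the calls above could be combined with a digest from Phobos. The following is only a rough sketch, not something taken from the arsd documentation: the helper name `fingerprint`, the `#news` selector, and the sample pages are made up for illustration, and it assumes dom.d and characterencodings.d are compiled in as described above. It hashes the `innerText` of the first element matching a selector, so that two fetches of a page can be compared without storing the full HTML.

```d
import arsd.dom;       // dom.d (plus characterencodings.d) from the arsd repository
import std.digest.sha; // sha256Of and toHexString from Phobos

/// Hypothetical helper: hex digest of the text inside the first element
/// matching `selector`, or null if nothing matches.
string fingerprint(string html, string selector)
{
    auto document = new Document();
    document.parseGarbage(html);

    Element element = document.querySelector(selector);
    if (element is null)
        return null;

    auto digest = sha256Of(element.innerText); // ubyte[32]
    auto hex = toHexString(digest);            // fixed-size char array
    return hex[].idup;                         // copy into a heap-allocated string
}

void main()
{
    import std.stdio : writeln;

    // Made-up sample pages; a real robot would download these instead.
    enum before = `<html><body><div id="news"><p>Old headline</p></div></body></html>`;
    enum after  = `<html><body><div id="news"><p>New headline</p></div></body></html>`;

    auto oldPrint = fingerprint(before, "#news");
    auto newPrint = fingerprint(after, "#news");

    writeln(oldPrint != newPrint ? "changed" : "unchanged");
}
```

Hashing only the selected element's text, rather than the whole page, means that markup-only differences such as a changed attribute should not register as a change. Whether that is the right granularity depends on the pages being watched.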