Thread overview
Class for fetching a web page and parsing it into a DOM
Dec 15, 2011
breezes
Dec 15, 2011
Adam D. Ruppe
Dec 15, 2011
Nick Sabalausky
December 15, 2011
Is there a class that can fetch a web page from the internet? And is std.xml the right module for parsing it into a DOM tree?
December 15, 2011
On Thursday, 15 December 2011 at 09:55:22 UTC, breezes wrote:
> Is there a class that can fetch a web page from the internet? And is std.xml the right module for parsing it
> into a DOM tree?

You might want to use my dom.d

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff

Grab dom.d, characterencodings.d, and curl.d.

Here's an example program:

====
import arsd.dom;
import arsd.curl;

import std.stdio;

void main() {
	auto document = new Document();
	// fetch the page and build a DOM tree from its (possibly messy) html
	document.parseGarbage(curl("http://digitalmars.com/"));

	// print the first paragraph element as html
	writeln(document.querySelector("p"));
}
====

Compile like this:

dmd yourfile.d dom.d characterencodings.d curl.d

You'll need the curl C library (libcurl) itself from an outside source.
If you're on Linux, it is probably already installed. If you're on
Windows, you'll have to find a libcurl build online.

// this downloads a page from the web and returns it as a string
string html = curl("http://digitalmars.com/");

// this builds a DOM tree out of html. It's called parseGarbage because
// it tries to make sense of really bad html - so it works on a lot of
// web sites.
document.parseGarbage(html);

// My dom.d includes a lot of functions you might know from
// JavaScript, like getElementById, getElementsByTagName, and the
// get-element-by-CSS-selector functions
document.querySelector("p"); // get the first paragraph


And then, finally, the writeln prints out the html of the element.
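
Here's a rough sketch of the other JavaScript-style functions on the
same document. (The id "menu" is just made up for illustration, and
getAttribute/innerText are assumed to behave like their JavaScript
namesakes.)

====
import arsd.dom;
import arsd.curl;

import std.stdio;

void main() {
	auto document = new Document();
	document.parseGarbage(curl("http://digitalmars.com/"));

	// getElementsByTagName returns all the matching elements
	foreach(link; document.getElementsByTagName("a"))
		writeln(link.getAttribute("href"));

	// getElementById works like its JavaScript namesake;
	// "menu" is a made-up id, so check for null before using it
	auto menu = document.getElementById("menu");
	if(menu !is null)
		writeln(menu.innerText);
}
====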
December 15, 2011
"Adam D. Ruppe" <destructionator@gmail.com> wrote in message news:nlccexskkftzaapfdnti@dfeed.kimsufi.thecybershadow.net...
> You might want to use my dom.d
>
> [...]

Yup, I can confirm Adam's tools are great for this. std.xml is known to have problems and is currently undergoing a rewrite.