Web crawler/scraping
February 17, 2021
Hi,
I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.

Is there a D library that can help me with this?

Thank you
February 17, 2021
On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
> Hi,
> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.
>
> Is there a D library that can help me with this?
>
> Thank you

I found this but it looks outdated:

https://github.com/gedaiu/selenium.d
February 17, 2021
On Wednesday, 17 February 2021 at 12:27:16 UTC, Ferhat Kurtulmuş wrote:
> On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
>> Hi,
>> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.
>>
>> Is there a D library that can help me with this?
>>
>> Thank you
>
> I found this but it looks outdated:
>
> https://github.com/gedaiu/selenium.d

Thanks!
This seems to depend on Selenium. I was looking for something standalone, like

crawler.get(...)
crawler.post(...)
crawler.parse(...)

so that I can deploy it in the client's network as a single executable (the website I'm crawling is only available internally...).
February 17, 2021
On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.

Does the website need javascript?

If not, my dom.d may be able to help. It can download some HTML, parse it, fill in forms, then my http2.d submits it (I never implemented Form.submit in dom.d but it is pretty easy to make with other functions that are implemented, heck maybe I'll implement it now if it sounds like it might work).

Or if it is all json you might be able to just craft some requests with my lib or even phobos' std.net.curl that submits the login request, saves a cookie, then fetches some json stuff.

I literally just rolled out of bed but in an hour or two I can come back and make some example code for you if this sounds plausible.
February 17, 2021
On Wednesday, 17 February 2021 at 13:13:00 UTC, Adam D. Ruppe wrote:
> On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
>> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.
>
> Does the website need javascript?
>
> If not, my dom.d may be able to help. It can download some HTML, parse it, fill in forms, then my http2.d submits it (I never implemented Form.submit in dom.d but it is pretty easy to make with other functions that are implemented, heck maybe I'll implement it now if it sounds like it might work).
>
> Or if it is all json you might be able to just craft some requests with my lib or even phobos' std.net.curl that submits the login request, saves a cookie, then fetches some json stuff.
>
> I literally just rolled out of bed but in an hour or two I can come back and make some example code for you if this sounds plausible.

No, I don't think it needs JS.
I think I can submit the login form and then just fetch/save the json response using the login cookie, as you suggest. A crawler/scraping solution may be overkill...

I'll try with std.net.curl and get back to you in a couple of hours.

Thank you!!


February 17, 2021
On Wednesday, 17 February 2021 at 13:13:00 UTC, Adam D. Ruppe wrote:
> On Wednesday, 17 February 2021 at 12:12:56 UTC, Carlos Cabral wrote:
>> I'm trying to collect some json data from a website/admin panel automatically, which is behind a login form.
>
> Does the website need javascript?
>
> If not, my dom.d may be able to help. It can download some HTML, parse it, fill in forms, then my http2.d submits it (I never implemented Form.submit in dom.d but it is pretty easy to make with other functions that are implemented, heck maybe I'll implement it now if it sounds like it might work).
>
> Or if it is all json you might be able to just craft some requests with my lib or even phobos' std.net.curl that submits the login request, saves a cookie, then fetches some json stuff.
>
> I literally just rolled out of bed but in an hour or two I can come back and make some example code for you if this sounds plausible.

...and it's working :)
thank you Adam and Ferhat

Leaving this here in case anyone needs it:

```
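import std.stdio;
import std.string;
import std.net.curl;
import core.thread;

void main()
{
    int waitTime = 5; // seconds to pause between requests
    auto domain = "https://example.com";
    auto cookiesFile = "cookies.txt";
    auto http = HTTP();

    // Kept from the original; libcurl documents use_ssl for FTP, SMTP, POP3
    // and IMAP, so it is likely unnecessary for an https:// URL.
    http.handle.set(CurlOption.use_ssl, 1);
    // Disables certificate verification; only acceptable for a trusted internal host.
    http.handle.set(CurlOption.ssl_verifypeer, 0);
    // cookiefile turns on the cookie engine; cookiejar saves the cookies on cleanup.
    http.handle.set(CurlOption.cookiefile, cookiesFile);
    http.handle.set(CurlOption.cookiejar, cookiesFile);
    http.setUserAgent("...");
    // The handler must return the number of bytes it consumed.
    http.onReceive = (ubyte[] data) { /* ... */ return data.length; };

    // 1. Load the login page to pick up the initial session cookie.
    http.method = HTTP.Method.get;
    http.url = domain ~ "/login";
    http.perform();

    Thread.sleep(waitTime.seconds);

    // 2. Submit the login form.
    auto data = "username=user&password=pass";
    http.method = HTTP.Method.post;
    http.url = domain ~ "/login";
    http.setPostData(data, "application/x-www-form-urlencoded");
    http.perform();

    Thread.sleep(waitTime.seconds);

    // 3. Fetch the JSON using the session cookie from the login.
    http.method = HTTP.Method.get;
    http.url = domain ~ "/fetchjson";
    http.perform();
}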
```
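
For anyone filling in the elided onReceive handler: one possible way to do it is to append each chunk to a buffer and parse the whole thing with std.json once perform() returns. This is only a sketch. JsonCollector and the sample payload are hypothetical names made up for illustration, and the response is assumed to be UTF-8 JSON.

```d
import std.array : Appender;
import std.json : JSONValue, parseJSON;
import std.stdio : writeln;

/// Hypothetical helper: collects response chunks and parses them as JSON.
struct JsonCollector
{
    Appender!(ubyte[]) buffer;

    /// Same shape as the HTTP.onReceive callback: returns the bytes consumed.
    size_t receive(ubyte[] data)
    {
        buffer.put(data); // copies the chunk, since curl may reuse its buffer
        return data.length;
    }

    /// Call after the transfer has finished; throws if the body is not valid JSON.
    JSONValue parse()
    {
        return parseJSON(cast(const(char)[]) buffer.data);
    }
}

void main()
{
    JsonCollector collector;

    // Simulate a response arriving in two chunks (made-up payload).
    collector.receive(cast(ubyte[]) `{"items": [1, 2,`.dup);
    collector.receive(cast(ubyte[]) ` 3], "ok": true}`.dup);

    auto json = collector.parse();
    writeln(json["items"].array.length);
}
```

With the snippet above it would presumably be wired in as `http.onReceive = &collector.receive;` before the final perform(), with `collector.parse()` called afterwards. If the same HTTP object is reused for several requests, the buffer needs to be cleared between them.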