January 10, 2015
On Saturday, 10 January 2015 at 15:52:21 UTC, Tobias Pankrath wrote:
> I think he's wrong, because it spoils the comparison. Every answer should delegate those tasks to a library that Stroustrup used as well, e.g. regex matching, string-to-number conversion and some kind of TCP sockets. But it must do the same work that his solution does: create and parse the HTML header and extract the HTML links, probably using regex, but I wouldn't mind another solution.

The challenge is completely pointless. Different languages have different ways of hacking together a compact incorrect solution. How to directly translate a C++ hack into another language is a task for people who are drunk.

For the challenge to make sense, it would entail parsing all legal HTML5 documents, extracting all resource links, converting them into absolute form and printing them one per line. With no hiccups.

January 10, 2015
On Saturday, 10 January 2015 at 17:23:31 UTC, Ola Fosheim Grøstad wrote:
> For the challenge to make sense, it would entail parsing all legal HTML5 documents, extracting all resource links, converting them into absolute form and printing them one per line. With no hiccups.

Though, that's still a library thing rather than a language thing.

dom.d and the Url struct in cgi.d should be able to do all that, in just a few lines even, but that's just because I've done a *lot* of web scraping with the libs before so I made them work for that.

In fact... let me do it. I'll use my http2.d instead of cgi.d, actually; it has a similar Url struct, just more focused on client requests.


import arsd.dom;
import arsd.http2;
import std.stdio;

void main() {
	auto base = Uri("http://www.stroustrup.com/C++.html");
	// http2 is a newish module of mine that aims to imitate
	// a browser in some ways (without depending on curl, btw)
	auto client = new HttpClient();
	auto request = client.navigateTo(base);
	auto document = new Document();

	// http2 provides an asynchronous api, but you can
	// pretend it is sync by just calling waitForCompletion
	auto response = request.waitForCompletion();

	// parseGarbage uses a few tricks to fix up invalid/broken HTML
	// tag soup and auto-detect character encodings, including when
	// the page lies about being UTF-8 but is actually Windows-1252
	document.parseGarbage(response.contentText);

	// Uri.basedOn returns a new absolute URI based on something else
	foreach(a; document.querySelectorAll("a[href]"))
		writeln(Uri(a.href).basedOn(base));
}


Snippet of the printouts:

[...]
http://www.computerhistory.org
http://www.softwarepreservation.org/projects/c_plus_plus/
http://www.morganstanley.com/
http://www.cs.columbia.edu/
http://www.cse.tamu.edu
http://www.stroustrup.com/index.html
http://www.stroustrup.com/C++.html
http://www.stroustrup.com/bs_faq.html
http://www.stroustrup.com/bs_faq2.html
http://www.stroustrup.com/C++11FAQ.html
http://www.stroustrup.com/papers.html
[...]

The latter ones were relative links that got resolved against the base, and the first few were already absolute. Seems to have worked.


There are other kinds of links than just a[href], but fetching them is as simple as adding them to the selector or looping over them separately:

	foreach(a; document.querySelectorAll("script[src]"))
		writeln(Uri(a.src).basedOn(base));

None on that page, and no <link>s either, but it is easy enough to do with the lib.
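For instance, one combined selector can grab several href-carrying link types in a single pass. Just a sketch, and hypothetical for this particular page since it has neither; the getAttribute spelling is only one way to do it:

	foreach(e; document.querySelectorAll("a[href], link[href]"))
		writeln(Uri(e.getAttribute("href")).basedOn(base));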



Looking at the source of that page, I find some invalid HTML and a lie about the character set. How did Document.parseGarbage do? Pretty well: outputting the parsed DOM tree shows it auto-corrected the problems I spotted by eye.
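(Dumping the tree from the earlier main is just serializing the document back out; a one-liner sketch, assuming the usual toString:)

	writeln(document.toString()); // the corrected, re-serialized HTML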
January 10, 2015
On Saturday, 10 January 2015 at 14:56:09 UTC, Adam D. Ruppe wrote:
> On Saturday, 10 January 2015 at 13:22:57 UTC, Nordlöw wrote:
>> on dmd git master. Ideas anyone?
>
> Don't use git master :P

Do use git master. The more people do, the fewer regressions will slip into the final release.

You can use Dustmite to reduce the code to a simple example, and Digger to find the exact pull request which introduced the regression. (Yes, shameless plug, preaching to the choir, etc.)
January 10, 2015
On Friday, 9 January 2015 at 13:50:29 UTC, eles wrote:
> https://codegolf.stackexchange.com/questions/44278/debunking-stroustrups-debunking-of-the-myth-c-is-for-large-complicated-pro

From the link: "Let's show Stroustrup what small and readable program actually is."

Alright, there are a lot of examples in many languages, but shouldn't those examples handle exceptions like the original code does?

Matheus.
January 10, 2015
On Saturday, 10 January 2015 at 17:39:17 UTC, Adam D. Ruppe wrote:
> Though, that's still a library thing rather than a language thing.

It is a language-library-platform thing; things like how composable the ecosystem is would be interesting to compare. But it would be unfair to require a minimalistic language to forgo third-party libraries. One should probably require that the library used is generic (not a spider framework), doesn't rely on FFI, and is mature and maintained?

> 	document.parseGarbage(response.contentText);
>
>         // Uri.basedOn returns a new absolute URI based on something else
> 	foreach(a; document.querySelectorAll("a[href]"))
> 		writeln(Uri(a.href).basedOn(base));
> }
>

Nice and clean code; does it expand HTML entities ("&amp;")?

The HTML5 standard improves on HTML4 by being explicit, in section 8.2, about how incorrect documents shall be interpreted. That ought to be sufficient, since it is what web browsers are supposed to do.

http://www.w3.org/TR/html5/syntax.html#html-parser
January 10, 2015
On Saturday, 10 January 2015 at 19:17:22 UTC, Ola Fosheim Grøstad wrote:
> Nice and clean code; does it expand HTML entities ("&amp;")?

Of course. It does it both ways:

<span>a &amp;</span>

span.innerText == "a &"

span.innerText = "a \" b";
assert(span.innerHTML == "a &quot; b");

parseGarbage also tries to fix broken entities, so a & standing alone will be translated to &amp; for you. There's also parseStrict, which just throws an exception in cases like that.
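Roughly, side by side (just a sketch; I'm catching plain Exception since the exact exception type isn't the point here):

import arsd.dom;
import std.stdio;

void entityExample() {
	auto broken = "<p>fish & chips</p>"; // lone &, not a valid entity

	auto lenient = new Document();
	lenient.parseGarbage(broken); // quietly fixes the & up to &amp;

	auto strict = new Document();
	try {
		strict.parseStrict(broken); // throws on markup like that
	} catch(Exception e) {
		writeln("strict parse rejected it: ", e.msg);
	}
}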

That's one thing a lot of XML parsers don't do in the name of speed, but I do, since it is pretty rare that I don't want them translated. One thing I did for a speedup, though, was to scan the string for &: if it doesn't find one, return a slice of the original; if it does, return a new string with the entities translated. That gave a surprisingly big speed boost without costing anything in convenience.
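The idea, as a rough sketch (not the actual dom.d code, which walks the string and knows the full entity table; the helper name and the tiny replace chain are mine):

import std.string : indexOf;
import std.array : replace;

// if there is no '&' at all, hand back a slice of the original;
// only allocate when something actually needs translating
string decodeEntitiesSketch(string s) {
	if(s.indexOf('&') == -1)
		return s; // fast path: no entities, no allocation

	// simplified slow path; &amp; goes last so it isn't double-decoded
	return s.replace("&lt;", "<")
		.replace("&gt;", ">")
		.replace("&quot;", "\"")
		.replace("&amp;", "&");
}

unittest {
	string plain = "no entities here";
	assert(decodeEntitiesSketch(plain) is plain); // same slice, nothing copied
	assert(decodeEntitiesSketch("fish &amp; chips") == "fish & chips");
}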

> The HTML5 standard improves on HTML4 by being explicit, in section 8.2, about how incorrect documents shall be interpreted. That ought to be sufficient, since it is what web browsers are supposed to do.
>
> http://www.w3.org/TR/html5/syntax.html#html-parser

Huh, I never read that; my thing just did what looked right to me over hundreds of test pages that were broken in various strange and bizarre ways.