Extracting Structure from HTML using Adam's dom.d

I'm trying to figure out the easiest way to extract structured information using Adam D. Ruppe's dom.d.

Typically I want the following HTML example

...
<h2> <span class="mw-headline" id="H2_A">More important</span> </h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> </h2>
<p>This is not important.</p>
...

to be reduced to

This is <i>important</i>.

This means that I need some kind of interface to extract all the contents of each <p> paragraph that is preceded by a <h2> heading with a specific id (say "H2_A") or content (say "More important"). How do I accomplish that?

Further, is there a way to extract only the "contents" of an Element instance, that is "Stuff" from "<p>Stuff</p>", for each Element returned by, for example, getElementsByTagName(`p`)?

On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
> This means that I need some kind of interface to extract all the contents of each <p> paragraph that is preceded by a <h2> heading with a specific id (say "H2_A") or content (say "More important"). How do I accomplish that?

You can do that with a CSS selector like:

document.querySelector("#H2_A + p");

or even document.querySelectorAll("h2 + p") to get every P immediately following a h2.

My implementation works mostly the same as in JavaScript, so you can read more about CSS selectors anywhere on the net, like https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector

> Further, is there a way to extract the "contents" only of an Element instance, that is "Stuff" from "<p>Stuff</p>" for each Element in the return of for example getElementsByTagName(`p`)?

Element.innerText returns all the plain text inside with all tags stripped out (same as the function in IE)

Element.innerHTML returns all the content inside, including tags (same as the function in all browsers)

Element.firstInnerText returns all the text up to the first tag, but then stops there. (this is a custom extension)

You can call those in a regular foreach loop or with something like std.algorithm.map to get the info from an array of elements.
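One detail worth checking against the sample markup: the id "H2_A" sits on the <span> inside the <h2>, so the <p> is a sibling of the <h2> rather than of the element carrying the id. Under standard CSS sibling semantics, "#H2_A + p" would therefore not match that paragraph, while "h2 + p" still would. The following is a minimal sketch of one way around this, not a definitive implementation. It assumes dom.d is importable as arsd.dom and that Element.nextSibling accepts a tag name (please check against your copy of dom.d); sampleHtml and paragraphAfterHeading are hypothetical names made up for illustration.

```d
import std.algorithm : map;
import std.stdio : writeln;
import arsd.dom; // dom.d, assumed to be importable under this module name

// Sample markup from the question, wrapped in a complete document.
enum sampleHtml = `<html><body>
<h2> <span class="mw-headline" id="H2_A">More important</span> </h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> </h2>
<p>This is not important.</p>
</body></html>`;

/// Hypothetical helper (not part of dom.d): returns the inner HTML of the
/// next <p> sibling of the <h2> that contains the element with the given id,
/// or null if there is no such element, heading or paragraph.
string paragraphAfterHeading(Document document, string id)
{
    auto marker = document.querySelector("#" ~ id);
    if (marker is null)
        return null;

    // Climb from the marker (here the <span>) up to its enclosing <h2>.
    Element heading = marker;
    while (heading !is null && heading.tagName != "h2")
        heading = heading.parentNode;
    if (heading is null)
        return null;

    // Assumption: nextSibling takes an optional tag name to look for.
    auto paragraph = heading.nextSibling("p");
    return paragraph is null ? null : paragraph.innerHTML;
}

void main()
{
    auto document = new Document(sampleHtml);

    // The paragraph belonging to the heading marked with id "H2_A".
    writeln(paragraphAfterHeading(document, "H2_A"));

    // Every <p> that immediately follows an <h2>, with the three accessors.
    foreach (p; document.querySelectorAll("h2 + p"))
    {
        writeln(p.innerText);      // plain text, tags stripped
        writeln(p.innerHTML);      // content including tags
        writeln(p.firstInnerText); // text up to the first tag
    }

    // The same extraction expressed with std.algorithm.map.
    foreach (text; document.querySelectorAll("h2 + p").map!(p => p.innerText))
        writeln(text);
}
```

If the id were on the <h2> itself, the single call document.querySelector("#H2_A + p") from the reply above would be enough and the helper would be unnecessary.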

January 22, 2015

Re: Extracting Structure from HTML using Adam's dom.d

Posted by Per Nordlöw
in reply to Adam D. Ruppe

On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
> On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
>> This means that I need some kind of interface to extract all the contents of each <p> paragraph that is preceded by a <h2> heading with a specific id (say "H2_A") or content (say "More important"). How do I accomplish that?
>
> You can do that with a CSS selector like:
>
> document.querySelector("#H2_A + p");
>
> or even document.querySelectorAll("h2 + p") to get every P immediately following a h2.
>
>
> My implementation works mostly the same as in javascript so you can read more about css selectors anywhere on the net like https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector
>
>> Further, is there a way to extract the "contents" only of an Element instance, that is  "Stuff" from "<p>Stuff</p>" for each Element in the return of for example getElementsByTagName(`p`)?
>
> Element.innerText returns all the plain text inside with all tags stripped out (same as the function in IE)
>
> Element.innerHTML returns all the content inside, including tags (same as the function in all browsers)
>
> Element.firstInnerText returns all the text up to the first tag, but then stops there. (this is a custom extension)
>
>
> You can call those in a regular foreach loop or with something like std.algorithm.map to get the info from an array of elements.

Brilliant! Thanks!

BTW: Would you be interested in receiving a PR for dom.d where I replace array allocations with calls to lazy ranges?

On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
> You can do that with a CSS selector like:
>
> document.querySelector("#H2_A + p");

What is the meaning of selectors such as

`a[href]`

used in

doc.querySelectorAll(`a[href]`)

?

On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
> What is the meaning of selectors such as
>
> `a[href]`
>
> used in
>
> doc.querySelectorAll(`a[href]`)
>
> ?

Select all `a` tags that have a `href` attribute.

You can also select using the attribute value. For example, to get all the text inputs in a form:

doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

dom.d is awesome!
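To make the two attribute selectors concrete, here is a small sketch under the assumption that dom.d is importable as arsd.dom. The markup, the name formHtml and the attribute values are invented for illustration; only the selectors come from the posts above.

```d
import std.stdio : writeln;
import arsd.dom; // dom.d, assumed to be importable under this module name

// Made-up markup: one link with href, one anchor without, and a small form.
enum formHtml = `<html><body>
<a href="https://dlang.org/">D</a>
<a name="anchor-only">no link</a>
<form name="myform">
  <input type="text" name="user" />
  <input type="password" name="secret" />
</form>
</body></html>`;

void main()
{
    auto doc = new Document(formHtml);

    // `a[href]`: only <a> elements that carry an href attribute at all.
    foreach (a; doc.querySelectorAll(`a[href]`))
        writeln(a.getAttribute("href"));

    // Attribute values: text inputs inside the form named "myform".
    foreach (input; doc.querySelectorAll(`form[name="myform"] input[type="text"]`))
        writeln(input.getAttribute("name"));
}
```

The anchor that only has a name attribute and the password input are both skipped, which is the point of filtering on attribute presence and attribute value respectively.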

On Thu, 22 Jan 2015 11:40:52 +0000 Gary Willoughby via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:
> On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
> > What is the meaning of selectors such as
> >
> > `a[href]`
> >
> > used in
> >
> > doc.querySelectorAll(`a[href]`)
> >
> > ?
>
> Select all `a` tags that have a `href` attribute.
>
> You can also select using the attribute value too. For example get all the text inputs in a form:
>
> doc.querySelectorAll(`form[name="myform"] input[type="text"]`)
>
> dom.d is awesome!

i miss it in Phobos.

On Thursday, 22 January 2015 at 09:27:17 UTC, Per Nordlöw wrote:
> BTW: Would you be interested in receiving a PR for dom.d where I replace array allocations with calls to lazy ranges?

Maybe. It was on my todo list to do that for getElementsByTagName at least, which is supposed to be a live list rather than a copy of references.

querySelectorAll, however, is supposed to be a copy, so I don't want that to be a range. (This is to match the W3C standard and what JavaScript does.)

There are lazy range functions in there, btw: element.tree is a lazy range. If you combine it with things like std.algorithm.filter and map, it'd be easy to do a bunch of them. getElementsByTagName, for example, is filter!((e) => e.tagName == want)(element.tree). So the lazy implementations could just be written in those terms.

(Actually, that's not hard to write on the spot, so maybe it should just be explained instead of adding or changing methods. It is nice that they are plain methods instead of templates now, because they can be so easily wrapped in things like script code.)
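Written out, the filter-over-tree idea from this post might look like the sketch below. It assumes dom.d is importable as arsd.dom and that element.tree works as an input range with std.algorithm, as described above; lazyElementsByTagName and treeHtml are hypothetical names, not part of dom.d.

```d
import std.algorithm : filter, map;
import std.range : walkLength;
import std.stdio : writeln;
import arsd.dom; // dom.d, assumed to be importable under this module name

// Made-up markup with two paragraphs, one of them nested.
enum treeHtml = `<html><body>
<p>first</p>
<div><p>second, nested</p></div>
</body></html>`;

/// Hypothetical helper: walks start.tree lazily and keeps only the elements
/// whose tag name equals `want`. No result array is allocated up front.
auto lazyElementsByTagName(Element start, string want)
{
    return start.tree.filter!(e => e.tagName == want);
}

void main()
{
    auto doc = new Document(treeHtml);

    // Elements are visited on demand while the loop runs.
    foreach (text; lazyElementsByTagName(doc.root, "p").map!(e => e.innerText))
        writeln(text);

    // Counting consumes a fresh traversal of the tree.
    writeln(lazyElementsByTagName(doc.root, "p").walkLength);
}
```

Because the range is lazy, each call starts a new walk of the tree, and changing the tree while iterating is best avoided.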

On Thursday, 22 January 2015 at 11:40:53 UTC, Gary Willoughby wrote:
> doc.querySelectorAll(`form[name="myform"] input[type="text"]`)
>
> dom.d is awesome!

Something to remember, btw, is that this also works in browser JavaScript AND in CSS itself, since IE8 and Firefox 3.5. (No need for slow, bloated jQuery.)

My implementation is different in some ways but mostly compatible, including some of the more advanced features like [attr^=starts_with_this], $=, *= and so on, as well as the sibling selectors ~ and +. (Search for CSS selector info to learn more.)

dom.d also does things like :first-child, but it does NOT support :nth-of-type and a few more of those newer CSS3 things. I might add them some day, but I find this is pretty good as is.
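A short sketch of the selectors named in this post, using their standard CSS meanings: ^= matches a prefix, $= a suffix, *= a substring, ~ any later sibling, + the immediately following sibling. It assumes dom.d is importable as arsd.dom; the markup, listHtml and countMatches are invented for illustration, and the attribute values are left unquoted in the style of [attr^=starts_with_this] above.

```d
import std.stdio : writeln;
import arsd.dom; // dom.d, assumed to be importable under this module name

// Made-up markup, written without whitespace between the tags.
enum listHtml =
    `<html><body>` ~
    `<ul><li id="item_first">one</li><li id="item_mid_x">two</li>` ~
    `<li id="other_last">three</li></ul>` ~
    `<h2>Heading</h2><div>In between</div><p>Later sibling</p>` ~
    `</body></html>`;

/// Hypothetical helper: number of elements a selector matches.
size_t countMatches(Document doc, string selector)
{
    return doc.querySelectorAll(selector).length;
}

void main()
{
    auto doc = new Document(listHtml);

    auto selectors = [
        `li[id^=item]`,   // id starts with "item"
        `li[id$=last]`,   // id ends with "last"
        `li[id*=mid]`,    // id contains "mid"
        `li:first-child`, // first child of its parent
        `h2 ~ p`,         // any <p> sibling after an <h2>
        `h2 + p`,         // <p> directly after an <h2>
    ];

    foreach (selector; selectors)
        writeln(selector, " -> ", countMatches(doc, selector));
}
```

In this markup the <div> sits between the <h2> and the <p>, which is what separates the general sibling selector from the adjacent one.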

On Thursday, 22 January 2015 at 16:22:14 UTC, ketmar via Digitalmars-d-learn wrote:
> i miss it in Phobos.

I'm sure it'd fail the Phobos review process though. But since it is an independent file (or it + characterencodings.d for full functionality), it is easy to just download and add to your project anyway.
