January 20, 2006 html2txt library, anyone?

Yes, I know I can use Cygwin tools or lynx, w3m, etc., to take an HTML file and convert it to text, but has anyone written a D library to do this?

Thanks,
josé
January 21, 2006 Re: html2txt library, anyone?

Posted in reply to jicman

jicman wrote:
> yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
> change it to text, but has anyone written a d library to do this?
>
> Thanks,
>
> josé
>

I've just written such a thing for C#. The code is mostly platform-agnostic, so it should port to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?
January 22, 2006 Re: html2txt library, anyone?

Posted in reply to James Dunne

He he he he... C doesn't scare me. ;-) Neither does C#. :-) Yes, please. I would love to have it. Would you be so kind as to email it to
cabrera
at
wrc.xerox.com
thanks.
josé
James Dunne says...
>
>jicman wrote:
>> yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and change it to text, but has anyone written a d library to do this?
>>
>> Thanks,
>>
>> josé
>
>I've just written such a thing for C#... code is mostly platform-agnostic so it should move to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?
January 23, 2006 Re: html2txt library, anyone?

Posted in reply to jicman

Here's a PCRE regex that will do it

"jicman" <jicman_member@pathlink.com> wrote in message news:dqpvdf$1h3j$1@digitaldaemon.com...
>
> yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
> change it to text, but has anyone written a d library to do this?
>
> Thanks,
>
> josé
January 23, 2006 Re: html2txt library, anyone?

Posted in reply to Charles

Oops,

char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html

"Charles" <noone@nowhere.com> wrote in message news:dr2tmr$2g9a$1@digitaldaemon.com...
> Here's a PCRE regex that will do it
>
> "jicman" <jicman_member@pathlink.com> wrote in message news:dqpvdf$1h3j$1@digitaldaemon.com...
> >
> > yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
> > change it to text, but has anyone written a d library to do this?
> >
> > Thanks,
> >
> > josé
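
The pattern in this post can be tried out with a few lines of code. The sketch below is in Python rather than D, purely as an illustration; the names `TAG_OR_ENTITY` and `strip_markup` are hypothetical and do not come from the thread. It applies the quoted pattern unchanged and deletes every match, which is presumably the intended use.

```python
import re

# The pattern quoted in the post, unchanged: it matches either a tag such as
# <b> or </p>, or an entity reference such as &amp;
TAG_OR_ENTITY = re.compile(r"<([^>])+>|&([^;])+;")


def strip_markup(html: str) -> str:
    """Delete every tag and entity reference; everything else is left as-is."""
    return TAG_OR_ENTITY.sub("", html)


if __name__ == "__main__":
    print(strip_markup("<p>Hi <b>there</b></p>"))
    print(strip_markup("Fish &amp; chips"))
```

Note that this approach drops entity references instead of decoding them, so `&amp;` disappears from the output rather than becoming `&`, and whitespace is left exactly as it was in the source.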
January 28, 2006 Re: html2txt library, anyone?

Posted in reply to jicman

jicman wrote:
> He he he he... c doesn't scare me. ;-) Neither does c#. :-) yes, please. I
> would love to have it. Would you be so kind as to email it to,
>
> cabrera
> at
> wrc.xerox.com
>
> thanks.
>
> josé
>
> James Dunne says...
>
>>jicman wrote:
>>
>>>yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
>>>change it to text, but has anyone written a d library to do this?
>>>
>>>Thanks,
>>>
>>>josé
>>
>>I've just written such a thing for C#... code is mostly platform-agnostic so it should move to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?

So, keep me in suspense...

--
Regards,
James Dunne
January 28, 2006 Re: html2txt library, anyone?

Posted in reply to Charles

Charles wrote:
> Oops,
>
> char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html
>
> "Charles" <noone@nowhere.com> wrote in message news:dr2tmr$2g9a$1@digitaldaemon.com...
>
>>Here's a PCRE regex that will do it
>>
>>"jicman" <jicman_member@pathlink.com> wrote in message news:dqpvdf$1h3j$1@digitaldaemon.com...
>>
>>>yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
>>>change it to text, but has anyone written a d library to do this?
>>>
>>>Thanks,
>>>
>>>josé

What about reflowing whitespace runs? BR tags to newlines, P tags, ordered lists, bulleted lists? Incorrect tag close nestings? (i.e. <i><b></i></b>) Not to mention that you have to parse each tag's attributes so you don't accidentally hit a right angle bracket inside a string value... HTML/XML comments... the list never ends.

This is why HTML is such a hacked standard.

--
Regards,
James Dunne
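
The objections in this post are the usual case for tokenizing the markup instead of pattern-matching it. The following is a minimal sketch of that idea in Python (not D, and not the C# code mentioned earlier in the thread, which was never posted here). It relies on the standard library's `html.parser.HTMLParser`; the names `TextExtractor` and `html_to_text` are hypothetical, and the set of block-level tags is an arbitrary assumption. It covers only the points raised above: whitespace runs, BR and P tags, ordered and bulleted lists, mis-nested close tags, quoted attribute values containing `>`, and comments.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and turns a few structural tags into line breaks."""

    # Assumed set of block-level tags; a real converter would need more.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "tr"}

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        # One entry per open list: None for <ul>, a running counter for <ol>.
        self.list_stack = []

    def _bullet(self):
        if self.list_stack and self.list_stack[-1] is not None:
            self.list_stack[-1] += 1
            return f"{self.list_stack[-1]}. "
        return "* "

    def handle_starttag(self, tag, attrs):
        # Attributes arrive already parsed, so a ">" inside a quoted
        # value never ends the tag early.
        if tag == "br":
            self.parts.append("\n")
        elif tag in ("ul", "ol"):
            self.list_stack.append(0 if tag == "ol" else None)
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n" + self._bullet())
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        # Close tags are handled one at a time with no nesting check,
        # so <i><b></i></b> simply passes through.
        if tag in ("ul", "ol"):
            if self.list_stack:
                self.list_stack.pop()
            self.parts.append("\n\n")
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Source whitespace, including newlines, becomes a single space.
        self.parts.append(re.sub(r"\s+", " ", data))

    # handle_comment is inherited and does nothing, so comments are dropped.


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" {2,}", " ", text)            # collapse runs of spaces
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


if __name__ == "__main__":
    print(html_to_text(
        '<p>Hello   <i><b>world</i></b><br>next '
        '<a title="a>b">link</a></p><!-- hidden -->'
        "<ol><li>one</li><li>two</li></ol>"
    ))
```

This sketch deliberately leaves out tables, `<pre>` blocks, and the contents of `<script>` and `<style>` elements, any of which a complete converter would have to treat specially.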
January 29, 2006 Re: html2txt library, anyone?

Posted in reply to James Dunne

> This is why HTML is such a hacked standard.

Yeah, I agree. I've been using AJAX lately, but it's hard for me to get over how 'hackish' it is, jumping through tons of hoops just to overcome the limitations of HTTP/HTML. Have you seen HTML 2.0? http://www.w3.org/MarkUp/html-spec/html-spec_toc.html

I'd love to see a new design language for the web, connection-based and with some better widgets, using Mango for the server and the Harmonia code base to display this unnamed new language :D.

"James Dunne" <james.jdunne@gmail.com> wrote in message news:drf2b5$gb9$1@digitaldaemon.com...
> Charles wrote:
> > Oops,
> >
> > char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html
> >
> > "Charles" <noone@nowhere.com> wrote in message news:dr2tmr$2g9a$1@digitaldaemon.com...
> >
> >>Here's a PCRE regex that will do it
> >>
> >>"jicman" <jicman_member@pathlink.com> wrote in message news:dqpvdf$1h3j$1@digitaldaemon.com...
> >>
> >>>yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and
> >>>change it to text, but has anyone written a d library to do this?
> >>>
> >>>Thanks,
> >>>
> >>>josé
>
> What about reflowing whitespace runs? BR tags to newlines, P tags, ordered lists, bulleted lists? Incorrect tag close nestings? (i.e. <i><b></i></b>) Not to mention that you have to parse each tag's attributes so you don't accidentally hit a right angle bracket inside a string value... HTML/XML comments... the list never ends.
>
> This is why HTML is such a hacked standard.
>
> --
> Regards,
> James Dunne
Copyright © 1999-2021 by the D Language Foundation