Thread overview
html2txt library, anyone?
Jan 20, 2006  jicman
Jan 21, 2006  James Dunne
Jan 22, 2006  jicman
Jan 28, 2006  James Dunne
Jan 23, 2006  Charles
Jan 23, 2006  Charles
Jan 28, 2006  James Dunne
Jan 29, 2006  Charles
January 20, 2006
Yes, I know I can use Cygwin tools or lynx, w3m, etc., to take an HTML file and change it to text, but has anyone written a D library to do this?

Thanks,

josé


January 21, 2006
jicman wrote:
> Yes, I know I can use Cygwin tools or lynx, w3m, etc., to take an HTML file and
> change it to text, but has anyone written a D library to do this?
> 
> Thanks,
> 
> josé
> 
> 

I've just written such a thing for C#... code is mostly platform-agnostic so it should move to D easily enough.  It looks like C and scares everyone, which is why I like it :)  Interested?
January 22, 2006
He he he he... C doesn't scare me. ;-)  Neither does C#. :-)  Yes, please.  I would love to have it.  Would you be so kind as to email it to:

cabrera
at
wrc.xerox.com

thanks.

josé



January 23, 2006
Here's a PCRE regex that will do it



January 23, 2006
Oops,

char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html
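
A minimal sketch of how the pattern above might be applied, assuming a current D compiler and Phobos' std.regex rather than the PCRE binding mentioned in the thread. The pattern matches the markup itself (tags and character entities), so the plain text is what is left once the matches are removed. The function name stripMarkup is made up for illustration, and the capture groups are dropped because nothing reads them.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;

/// Hypothetical helper: delete everything that looks like a tag or an entity.
string stripMarkup(string html)
{
    // Same alternatives as the posted pattern, without the capture groups.
    // Caveat: a bare '&' followed later by a ';' is removed together with
    // whatever text sits between them.
    auto markup = regex(`<[^>]+>|&[^;]+;`);
    return replaceAll(html, markup, "");
}

void main()
{
    writeln(stripMarkup(`<p>Hello, <b>world</b>&nbsp;!</p>`)); // Hello, world!
}
```

Entities are simply deleted here, not decoded, so &amp; or &lt; in the input vanish from the output.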




January 28, 2006
jicman wrote:
> He he he he... C doesn't scare me. ;-)  Neither does C#. :-)  Yes, please.  I
> would love to have it.  Would you be so kind as to email it to:
> 
> cabrera
> at
> wrc.xerox.com
> 
> thanks.
> 
> josé
> 

So, keep me in suspense...

-- 
Regards,
James Dunne
January 28, 2006
Charles wrote:
> Oops,
> 
> char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html
> 
> 

What about reflowing whitespace runs?  BR tags to newlines, P tags, ordered lists, bulleted lists?  Incorrect tag close nestings (e.g. <i><b></i></b>)?  Not to mention that you have to parse each tag's attributes so you don't accidentally stop at a right angle bracket inside a string value...  HTML/XML comments...  The list never ends.
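
To make two of those points concrete, here is a rough sketch of a hand-written scanner, assuming modern D and Phobos. It covers only a right angle bracket inside a quoted attribute value and HTML comments; whitespace reflow, BR/P/list handling, entities, and misnested tags are left out. The name stripTags is hypothetical, and this is not the C# code mentioned earlier in the thread.

```d
import std.algorithm.searching : startsWith;
import std.stdio : writeln;
import std.string : indexOf;

/// Hypothetical helper: drop tags and comments, keep character data.
/// Every '<' is treated as the start of markup, which is itself a
/// simplification (a stray "a < b" in text would be mishandled).
string stripTags(string html)
{
    string text;
    size_t i = 0;
    while (i < html.length)
    {
        if (html[i] != '<')
        {
            text ~= html[i]; // ordinary character data
            ++i;
            continue;
        }
        if (html[i .. $].startsWith("<!--"))
        {
            // Comment: skip to just past "-->", or to the end if unterminated.
            auto end = html[i + 4 .. $].indexOf("-->");
            i = end < 0 ? html.length : i + 4 + end + 3;
            continue;
        }
        // Tag: find the first '>' that is not inside a quoted attribute value.
        char quote = 0;
        size_t j = i + 1;
        for (; j < html.length; ++j)
        {
            immutable c = html[j];
            if (quote != 0)
            {
                if (c == quote)
                    quote = 0; // closing quote
            }
            else if (c == '"' || c == '\'')
                quote = c; // opening quote
            else if (c == '>')
                break; // real end of the tag
        }
        i = j < html.length ? j + 1 : html.length;
    }
    return text;
}

void main()
{
    auto html = `<a title="a > b" href="x">link</a><!-- <b>hidden</b> --> text`;
    writeln(stripTags(html)); // link text
}
```

On the same input, a pattern like `<[^>]+>` ends the first tag at the '>' inside the title value and ends the comment at the '>' of the inner <b>, so fragments of the attribute and the comment body would leak into the output.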

This is why HTML is such a hacked standard.

-- 
Regards,
James Dunne
January 29, 2006
> This is why HTML is such a hacked standard.

Yeah, I agree.  I've been using AJAX lately, but it's hard for me to get over how 'hackish' it is, jumping through tons of hoops just to overcome the limitations of HTTP/HTML.  Have you seen HTML 2.0?  http://www.w3.org/MarkUp/html-spec/html-spec_toc.html


I'd love to see a new design language for the web, connection-based and with some better widgets, using Mango for the server and the Harmonia code base to display this unnamed new language :D


