std.xml2 (collecting features) control character (page 8)

February 19, 2016

Re: std.xml2 (collecting features)

Posted by Chris
in reply to Joakim

On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
> On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote:
>> std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it:
>>
>> - SAX and DOM parser
>> - in-situ / slicing parsing when possible (forward range?)
>> - compile time switch (CTS) for lazy attribute parsing
>> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
>> - CTS for input validating
>> - performance
>>
>> Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2
>>
>> Please post your feature requests, and please keep the posts DRY and on topic.
>
> My request: just skip it.  XML is a horrible waste of space for a standard; it's better if D doesn't support it well, anything to discourage its use.  I'd rather see you spend your time on something worthwhile.  If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.

Glad to hear that someone is working on XML support. We cannot just "skip it". XML/HTML-like markup comes up all the time, here and there. I recently had to write a mini-parser (nowhere near the stuff Robert is doing, just a quick fix!) to extract data from XML input. This has nothing to do with personal preferences, it's just there [1] and has to be dealt with.

[1] https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language

On 2016-02-19 11:58, Kagamin via Digitalmars-d wrote:
> On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
>> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
>
> http://dpaste.dzfl.pl/80888ed31958 like this?

No, the program just takes the hex dump as a string. You would need to do something like:

    import std.conv : to;
    ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
    string s = cast(string)arr;
    dstring ds = to!dstring(s);

and see what happens.
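The difference between the hex dump as text and the bytes it describes can be checked with a small D sketch. The helper names `parseHexDump` and `decodeUtf8` are hypothetical and not part of std.xml2 or of either paste; the sketch only assumes that the dump is whitespace-separated hex byte values and that the bytes are meant to be UTF-8.

```d
import std.algorithm : map;
import std.array : array, split;
import std.conv : to;
import std.utf : validate;

/// Hypothetical helper: turns a hex dump such as "3C 66 6F" into the bytes it describes.
ubyte[] parseHexDump(string dump)
{
    return dump.split                                  // split on whitespace
               .map!(h => cast(ubyte) h.to!uint(16))   // parse each token as base 16
               .array;
}

/// Hypothetical helper: treats the bytes as UTF-8 and decodes them to UTF-32.
dstring decodeUtf8(const(ubyte)[] bytes)
{
    auto s = cast(string) bytes.idup; // copy, then view the copy as UTF-8 text
    validate(s);                      // throws a UTFException on malformed UTF-8
    return s.to!dstring;
}

void main()
{
    auto bytes = parseHexDump("3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E");
    auto ds = decodeUtf8(bytes);

    assert(bytes.length == 13); // 13 bytes on the wire
    assert(ds.length == 12);    // C2 80 collapses into a single code point
    assert(ds[5] == 0x80);      // the control character U+0080 inside <foo>...</foo>
}
```

The two-byte sequence C2 80 is the UTF-8 encoding of U+0080, so the decoded document is `<foo>`, one control character, and `</foo>`.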

On Friday, 19 February 2016 at 12:30:06 UTC, Robert burner Schadek wrote:
> ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80, 0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
> string s = cast(string)arr;
> dstring ds = to!dstring(s);
>
> and see what happens

http://dpaste.dzfl.pl/2f8a8ff10bde like this?

On Thursday, 18 February 2016 at 21:53:24 UTC, Robert burner Schadek wrote:
> On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
>> Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
>
> thank you for making the effort
>
> https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml

In this case, Firefox just passes the control characters through to the contentHandler.characters method:

Starting runTest
Retrieved source
contentHandler.startDocument()
contentHandler.startElement("", "foo", "foo", {})
contentHandler.characters("\u0080")
contentHandler.endElement("", "foo", "foo")
contentHandler.endDocument()
Done reading
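One possible reading of this result is that XML 1.0 and XML 1.1 classify U+0080 differently. The sketch below encodes the `Char` production of XML 1.0 and the `RestrictedChar` production of XML 1.1 (section 2.2 of each specification) as plain D predicates. The function names are hypothetical, and nothing here is taken from std.xml2 or from Mozilla's parser.

```d
/// XML 1.0 `Char`: code points that may appear literally in a document.
bool isXml10Char(dchar c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

/// XML 1.1 `RestrictedChar`: code points that are only allowed
/// when written as character references such as &#x80;.
bool isXml11RestrictedChar(dchar c)
{
    return (c >= 0x1 && c <= 0x8)
        || c == 0xB || c == 0xC
        || (c >= 0xE && c <= 0x1F)
        || (c >= 0x7F && c <= 0x84)
        || (c >= 0x86 && c <= 0x9F);
}

void main()
{
    assert(isXml10Char(0x80));            // legal as a literal character in XML 1.0
    assert(isXml11RestrictedChar(0x80));  // restricted in XML 1.1
    assert(!isXml10Char(0x1));            // C0 controls are not XML 1.0 characters
    assert(!isXml11RestrictedChar(0x85)); // NEL is explicitly not restricted in 1.1
}
```

Under these rules a parser that only implements XML 1.0 has no reason to reject a literal U+0080, which would be consistent with the output above.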

On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote:
> Please post you feature requests...

- the ability to read documents with missing or incorrectly specified encoding
- additional feature: relaxed mode for reading html and broken XML documents

Some time ago I worked at Accusoft on document viewing/converting software. The main lesson I took from it: once an application is popular, every theoretically possible type of error in documents shows up in practice.
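As a rough illustration of the first request, a reader that cannot trust the declared encoding might start by looking for a byte order mark. This is a minimal sketch under that assumption only: `guessEncodingFromBom` and the returned labels are hypothetical, and a real relaxed mode would also need to inspect the XML declaration and fall back to heuristics.

```d
// Byte order marks for the common Unicode encodings.
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
immutable ubyte[] bomUtf16be = [0xFE, 0xFF];

/// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Hypothetical helper: returns an encoding label based on the BOM,
/// or null when no BOM is present and the caller has to guess.
string guessEncodingFromBom(const(ubyte)[] data)
{
    if (hasPrefix(data, bomUtf8))    return "UTF-8";
    // The UTF-32LE mark begins with the UTF-16LE mark, so test UTF-32 first.
    if (hasPrefix(data, bomUtf32le)) return "UTF-32LE";
    if (hasPrefix(data, bomUtf32be)) return "UTF-32BE";
    if (hasPrefix(data, bomUtf16le)) return "UTF-16LE";
    if (hasPrefix(data, bomUtf16be)) return "UTF-16BE";
    return null;
}

void main()
{
    immutable ubyte[] withBom = [0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E]; // BOM + "<a/>"
    immutable ubyte[] noBom   = [0x3C, 0x61, 0x2F, 0x3E];                   // "<a/>"

    assert(guessEncodingFromBom(withBom) == "UTF-8");
    assert(guessEncodingFromBom(noBom) is null);
}
```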

On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote:
> - the ability to read documents with missing or incorrectly specified encoding
> - additional feature: relaxed mode for reading html and broken XML documents

fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there.

https://github.com/adamdruppe/arsd/blob/master/dom.d

On Saturday, 20 February 2016 at 19:16:47 UTC, Adam D. Ruppe wrote:
> On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote:
>> - the ability to read documents with missing or incorrectly specified encoding
>> - additional feature: relaxed mode for reading html and broken XML documents
>
> fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there.
>
> https://github.com/adamdruppe/arsd/blob/master/dom.d

It works, thanks! I will use it in my experiments, but I think the getElementsBySelector() selector language needs to be improved.

On Sunday, 21 February 2016 at 23:01:22 UTC, crimaniak wrote:
> I will use it in my experiments, but getElementsBySelector() selector language need to be improved I think.

What, specifically, do you have in mind?
