May 03, 2015
On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
> Can it lazily read huge files (files greater than memory)?

If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.
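
A minimal sketch of that separation (parseXML and the helper below are hypothetical, not an existing API): the parser consumes any range of characters, and a lazily read file is just one way to produce such a range.

import std.algorithm.iteration : joiner, map;
import std.stdio : File;

// Produce a lazy range of chars from a file; only one 4 KiB chunk is
// buffered at a time, so files larger than memory are fine. Note this
// is an input range; a slicing parser would need extra buffering.
auto lazyChars(string path)
{
    return File(path).byChunk(4096).joiner.map!(b => cast(char) b);
}

void main()
{
    auto input = lazyChars("huge.xml");
    // parseXML(input);  // hypothetical range-based entry point, no I/O inside
}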
May 03, 2015
On 2015-05-03 17:39:46 +0000, "Robert burner Schadek" <rburners@gmail.com> said:

> std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it:
> 
> - SAX and DOM parser
> - in-situ / slicing parsing when possible (forward range?)
> - compile time switch (CTS) for lazy attribute parsing
> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> - CTS for input validating
> - performance
> 
> Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2
> 
> Please post your feature requests, and please keep the posts DRY and on topic.

This isn't a feature request (sorry?), but I just want to point out that you should feel free to borrow code from https://github.com/michelf/mfr-xml-d. There's probably a lot you can reuse in there.

-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

May 04, 2015
On 4/05/2015 5:39 a.m., Robert burner Schadek wrote:
> std.xml has been considered not up to spec for nearly 3 years now. Time to
> build a successor. I currently plan the following features for it:
>
> - SAX and DOM parser
> - in-situ / slicing parsing when possible (forward range?)
> - compile time switch (CTS) for lazy attribute parsing
> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> - CTS for input validating
> - performance
>
> Not much code yet, I'm currently building the performance test suite
> https://github.com/burner/std.xml2
>
> Please post your feature requests, and please keep the posts DRY and on
> topic.

Preferably, the interfaces would be designed first, 1:1 as the spec requires.
Then it's just a matter of building the actual reader/writer code.

That way we could theoretically rewrite the reader/writer to support other formats such as HTML5/SVG, independently of Phobos.

It would also be nice for it to be CTFE'able!
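
A toy illustration of what CTFE-ability buys (the function is my own example, not a proposed API): any parser written without I/O or other CTFE-hostile constructs can run at compile time, e.g. to bake a config file into the binary.

// CTFE-friendly because it only slices a string.
string firstTagName(string xml)
{
    size_t i;
    while (i < xml.length && xml[i] != '<') ++i;
    if (i == xml.length) return null;
    ++i;                 // skip '<'
    immutable start = i;
    while (i < xml.length && xml[i] != ' ' && xml[i] != '>' && xml[i] != '/')
        ++i;
    return xml[start .. i];
}

// Evaluated entirely at compile time:
static assert(firstTagName(`<root attr="1"/>`) == "root");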
May 04, 2015
On Sunday, 3 May 2015 at 23:32:28 UTC, Michel Fortin wrote:
>
> This isn't a feature request (sorry?), but I just want to point out that you should feel free to borrow code from https://github.com/michelf/mfr-xml-d. There's probably a lot you can reuse in there.

nice, thank you
May 04, 2015
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote:
> On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
>> Can it lazily read huge files (files greater than memory)?
>
> If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.

Wouldn't D ranges make it impossible to use SIMD optimizations when scanning?
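
For illustration, one way a range-based design could still reach SIMD speeds (a sketch under my own assumptions, not an existing design): keep the generic range API, but add a static-if fast path for the common case where the input is a contiguous slice, which a memchr-style scan or the auto-vectorizer can handle.

import std.string : indexOf;

// Advance `input` to the next '<'; returns true if one was found.
bool skipToTagStart(R)(ref R input)
{
    static if (is(R : const(char)[]))
    {
        // Contiguous slice: a linear scan the optimizer can vectorize
        // (or hand off to memchr/core.simd).
        immutable pos = input.indexOf('<');
        if (pos < 0) { input = input[$ .. $]; return false; }
        input = input[pos .. $];
        return true;
    }
    else
    {
        // Generic range fallback: one element at a time, no SIMD.
        import std.range.primitives : empty, front, popFront;
        while (!input.empty && input.front != '<') input.popFront();
        return !input.empty;
    }
}

unittest
{
    auto s = "text <tag/>";
    assert(skipToTagStart(s) && s == "<tag/>");
}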

However, it would make a lot of sense to just convert an existing XML solution with a Boost license. I don't know which ones are any good, but RapidXML is at least Boost-licensed.
May 04, 2015
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:

> My request: just skip it. XML is a horrible waste of space for a standard, better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.

On Sun, 03 May 2015 18:44:11 +0000, "w0rp" <devw0rp@gmail.com> wrote:

> I agree that JSON is superior through-and-through, but legacy support matters, and XML is in many places. It's good to have a quality XML parsing library.

You two are terrible at motivating people. "Better that D doesn't support it well" and "JSON is superior through-and-through" are overly dismissive. To me it sounds like someone saying to replace C++ with JavaScript because C++ is a horrible standard and JavaScript is so much superior. Honestly.

Remember that while JSON is simpler, XML is not just a structured container for bool, Number and String data. It comes with many official sidekicks covering a broad range of use cases:

XPath:
 * allows you to use XML files like a textual database
 * complex enough to allow for almost any imaginable query (examples below)
 * many tools have emerged to test XPath expressions against XML documents
 * also powers XSLT
   (http://www.liquid-technologies.com/xpath-tutorial.aspx)
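
For a taste, a tiny document and a few queries against it (my own example; the queries would be run with xmllint or a similar tool):

// XPath is 1-indexed; results shown in the trailing comments.
enum library = `<library>
  <book year="1993"><title>On Lisp</title></book>
  <book year="2010"><title>TDPL</title></book>
</library>`;

// /library/book[1]/title/text()   ->  On Lisp
// //book[@year > 2000]/title      ->  <title>TDPL</title>
// count(//book)                   ->  2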

XSL (Extensible Stylesheet Language) and
XSLT (XSL Transformations):
 * written as XML documents
 * standard way to transform XML from one structure into another
 * convert or "compile" data to XHTML or SVG for display in a browser
 * output to XSL-FO

XSL-FO (XSL formatting objects):
 * written as XSL
 * typesetting for XML; an XSL-FO processor is similar to a LaTeX processor
 * reads an XML document (a formatting-objects document) and outputs PDF, RTF or a similar format

XML Schema Definition (XSD):
 * written as XML
 * linked in by an XML file
 * defines structure and validates content to some extent
 * can set constraints on how often an element can occur in a list
 * can validate the data type of values (length, regex, positive, etc.; see the fragment below)
 * database-like unique IDs and references
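
For instance, a fragment expressing two of those constraints (my own example, not taken from any real schema):

// Occurrence limits plus a regex-restricted string type, as they
// would appear inside an <xs:schema> document.
enum schemaFragment = `
<xs:element name="author" minOccurs="1" maxOccurs="unbounded"/>
<xs:element name="isbn">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{13}"/>
    </xs:restriction>
  </xs:simpleType>
</xs:element>`;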

I think XML is the most eat-your-own-dog-food language ever and nicely covers a wide range of use cases. In any case there are many XML-based file formats that we might want to parse, amongst them SVG, OpenDocument (Open/LibreOffice), RSS feeds, several MS Office formats, XMP and other metadata formats.

When it comes to which features to support: I personally used XSD more than XPath and the tech built on it, but quite frankly both would be expected by users. Based on XPath, XSL transformations can be added at any time later. Anything beyond that doesn't feel quite "core" enough to be in an XML module.

-- 
Marco

May 04, 2015
On Sun, 03 May 2015 14:00:11 -0700, Walter Bright <newshound2@digitalmars.com> wrote:

> On 5/3/2015 10:39 AM, Robert burner Schadek wrote:
> > - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> 
> Encoding schemes should be handled by adapter algorithms, not in the XML parser itself, which should only handle UTF8.

Unlike JSON, XML actually declares the encoding in the prolog, e.g.: <?xml version="1.0" encoding="Windows-1252"?>
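
So the encoding has to be sniffed from raw bytes before full decoding can start. A minimal sketch of how an adapter in front of the parser might do that (my assumption of the approach; real code must also handle BOMs, UTF-16 and single-quoted attribute values):

import std.string : indexOf;

// The declaration itself is ASCII-compatible in UTF-8, Latin-1 and
// the Windows codepages, so it can be read before choosing a decoder.
string sniffEncoding(const(ubyte)[] raw)
{
    auto text = cast(const(char)[]) raw;
    enum key = `encoding="`;
    immutable pos = text.indexOf(key);
    if (pos < 0) return "UTF-8";  // the spec's default when undeclared
    auto rest = text[pos + key.length .. $];
    immutable end = rest.indexOf('"');
    return end < 0 ? "UTF-8" : rest[0 .. end].idup;
}

unittest
{
    auto doc = cast(const(ubyte)[]) `<?xml version="1.0" encoding="Windows-1252"?><a/>`;
    assert(sniffEncoding(doc) == "Windows-1252");
}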

-- 
Marco

May 04, 2015
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote:
> std.xml has been considered not up to specs nearly 3 years now. Time to build a successor. I currently plan the following featues for it:
>
> - SAX and DOM parser
> - in-situ / slicing parsing when possible (forward range?)
> - compile time switch (CTS) for lazy attribute parsing
> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> - CTS for input validating
> - performance
>
> Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2
>
> Please post you feature requests, and please keep the posts DRY and on topic.

If I were doing it, I'd do three types of parsers:

1. A parser that was pretty much as low-level as you can get, where you basically get a range of XML attributes or tags. Exactly how to build that could be a bit entertaining, since it would have to be hierarchical and ranges aren't, but something like a range of tags where you can get a range of its attributes and sub-tags from it, so that the whole document can be processed without actually getting to the level of even a SAX parser (see the sketch after this list). That parser could then be used to build the other parsers, and anyone who needed insanely fast speeds could use it rather than the SAX or DOM parser, so long as they were willing to pay the inevitable loss in user-friendliness.

2. SAX parser built on the low level parser.

3. DOM parser built either on the low level parser or the SAX parser (whichever made more sense).
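
For what it's worth, one possible shape for that lowest layer (every identifier below is hypothetical and the bodies are stubs; this is an API sketch, not an implementation):

struct Attribute { string name, value; }

// A lazy view of one element: nothing past the start tag is parsed
// until attributes() or children() is actually walked.
struct Element(R)  // R is the underlying character range
{
    string name;

    // Forward range of this element's attributes, parsed on demand.
    AttributeRange!R attributes() { assert(0, "sketch only"); }

    // Forward range of child Element!R values; nesting these ranges
    // is how a flat range interface regains the hierarchy.
    ChildRange!R children() { assert(0, "sketch only"); }
}

struct AttributeRange(R) { /* front/popFront/empty over Attribute */ }
struct ChildRange(R) { /* front/popFront/empty over Element!R */ }

// Entry point over any forward range of characters.
Element!R root(R)(R input) { assert(0, "sketch only"); }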

I doubt that I'm really explaining the low-level parser well enough, or have even thought it through enough, but I really think that even a SAX parser is too high-level for the base parser, and that something slightly higher than a lexer (high enough to actually be processing XML rather than individual tokens, but pretty much only as high as is required to do that) would be a far better choice.

IIRC, Michel Fortin's work went in that direction, and he linked to his code in another post, so I'd suggest at least looking at that for ideas.

Regardless, by building layers of XML parsers rather than just the standard ones, it should be possible to get higher performance while still having the more standard, user-friendly ones for those that don't need the full performance and do need the user-friendliness (though of course, we do want the SAX and DOM parsers to be efficient as well).

- Jonathan M Davis
May 04, 2015
On 2015-05-04 21:14, Jonathan M Davis wrote:

> If I were doing it, I'd do three types of parsers:
>
> 1. A parser that was pretty much as low level as you can get, where you
> basically get a range of XML attributes or tags. Exactly how to build that
> could be a bit entertaining, since it would have to be hierarchical, and
> ranges aren't, but something like a range of tags where you can get a
> range of its attributes and sub-tags from it so that the whole document
> can be processed without actually getting to the level of even a SAX
> parser. That parser could then be used to build the other parsers, and
> anyone who needed insanely fast speeds could use it rather than the SAX
> or DOM parser so long as they were willing to pay the inevitable loss in
> user-friendliness.
>
> 2. SAX parser built on the low level parser.
>
> 3. DOM parser built either on the low level parser or the SAX parser
> (whichever made more sense).
>
> I doubt that I'm really explaining the low level parser well enough or
> have even thought it through enough, but I really think that even a SAX
> parser is too high level for the base parser and that something
> slightly higher than a lexer (high enough to actually be processing XML
> rather than individual tokens but pretty much only as high as is
> required to do that) would be a far better choice.
>
> IIRC, Michel Fortin's work went in that direction, and he linked to his
> code in another post, so I'd suggest at least looking at that for ideas.

This is the way the XML parser is structured in Tango: a pull parser at the lowest level, a SAX parser on top of that, and I think the DOM parser builds on top of the pull parser.

The Tango pull parser can give you the following tokens (sketched as a D enum after the list):

* start element
* attribute
* end element
* end empty element
* data
* comment
* cdata
* doctype
* pi
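
As a D enum plus the usual consumption loop, that token set might look like this (identifiers assumed, not Tango's actual names):

import std.stdio : writeln;

enum XmlTokenType
{
    startElement, attribute, endElement, endEmptyElement,
    data, comment, cdata, doctype, pi,
}

// A pull parser inverts SAX: the caller drives the loop and
// dispatches on the token kind it pulled.
void handle(XmlTokenType type, string value)
{
    with (XmlTokenType) final switch (type)
    {
        case startElement:    writeln("<", value, ">");  break;
        case attribute:       writeln("  @", value);     break;
        case endElement:      writeln("</", value, ">"); break;
        case endEmptyElement: writeln("/>");             break;
        case data:            writeln("text: ", value);  break;
        case comment, cdata, doctype, pi: break;
    }
}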

-- 
/Jacob Carlborg
May 04, 2015
On 2015-05-03 19:39, Robert burner Schadek wrote:

> Not much code yet, I'm currently building the performance test suite
> https://github.com/burner/std.xml2

I recommend benchmarking against the Tango pull parser.

-- 
/Jacob Carlborg