May 04, 2015
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote:
> On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
>> Can it lazily read huge files (files greater than memory)?
>
> If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.

Indeed. It should operate on ranges without caring where they came from. It may end up supporting both input ranges and random-access ranges: an input range would allow reading from a socket, though less efficiently, while a random-access range over a whole file would allow more efficient parsing.

But if I/O is a big concern, I'd suggest just using std.mmfile to do the trick, since then you can still operate on the whole file as a single array without having to actually have the whole thing in memory.
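A minimal sketch of the std.mmfile suggestion, assuming a small scratch file so the example can run on its own. The helper name `countOpenAngles` and the file name are hypothetical; `MmFile` and its slice operator come from Phobos. The mapping is read as one `const(ubyte)[]` slice, and the operating system pages the data in on demand.

```d
import std.file : remove, tempDir, write;
import std.mmfile : MmFile;
import std.path : buildPath;

/// Hypothetical helper: counts '<' bytes in a file through a memory mapping.
size_t countOpenAngles(string path)
{
    scope mmf = new MmFile(path);            // read-only mapping of the file
    auto bytes = cast(const(ubyte)[]) mmf[]; // the whole file as a single slice
    size_t n;
    foreach (b; bytes)
        if (b == '<')
            ++n;
    return n;
}

void main()
{
    // Scratch file, only here to make the sketch self-contained.
    auto path = buildPath(tempDir(), "mmfile_sketch.xml");
    write(path, "<root><item>1</item></root>");
    scope (exit) remove(path);

    assert(countOpenAngles(path) == 4);
}
```

A parser written against arrays or random-access ranges could take `bytes` directly, which keeps the I/O decision outside the XML package.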

- Jonathan M Davis
May 04, 2015
On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote:
> However, it would make a lot of sense to just convert an existing XML solution with Boost license. I don't know which ones are any good, but RapidXML is at least Boost.

Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or using existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D, and the fact that std.xml botches that so badly is just embarrassing.
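To illustrate why slices matter here, a small sketch (the helper `firstElementText` is hypothetical and does no well-formedness checking): the extracted text is a view into the input buffer, so nothing is copied or allocated.

```d
import std.string : indexOf;

/// Hypothetical helper: returns the text between the first '>' and the
/// first "</" as a slice of `doc`, or null if that pattern is not found.
string firstElementText(string doc)
{
    immutable open = doc.indexOf('>');
    immutable close = doc.indexOf("</");
    if (open < 0 || close <= open)
        return null;
    return doc[open + 1 .. close]; // a slice, not a copy
}

void main()
{
    string doc = "<title>Hamlet</title>";
    auto text = firstElementText(doc);

    assert(text == "Hamlet");
    // The result points into the original buffer: same memory, no allocation.
    assert(text.ptr == doc.ptr + 7);
}
```

A real parser would hand out such slices for names, attributes and text nodes, which is the in-situ approach discussed in this thread.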

- Jonathan M Davis
May 04, 2015
On 2015-05-03 19:39, Robert burner Schadek wrote:

> Not much code yet, I'm currently building the performance test suite
> https://github.com/burner/std.xml2

There are a couple of interesting comments about the Tango pull parser that can be worth mentioning:

* Use -version=whitespace to retain whitespace as data nodes. We see a 25% increase in token count and 10% throughput drop when parsing "hamlet.xml" with this option enabled (pullparser alone)

* The parser is constructed with some tradeoffs relating to document integrity. It is generally optimized for well-formed documents, and currently may read past a document-end for those that are not well formed

* Making some tiny unrelated change to the code can cause notable throughput changes. We're not yet clear why these swings are so pronounced (for changes outside the code path) but they seem to be related to the alignment of codegen. It could be a cache-line issue, or something else

The last comment might not be relevant anymore, since these are all quite old comments.

-- 
/Jacob Carlborg
May 04, 2015
On 5/4/15 12:31 PM, Jonathan M Davis wrote:
> On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote:
>> However, it would make a lot of sense to just convert an existing XML
>> solution with Boost license. I don't know which ones are any good, but
>> RapidXML is at least Boost.
>
> Given how D's arrays work, we have the opportunity to have an
> _extremely_ fast XML parser thanks to slices. It's highly unlikely that
> any C or C++ solution is going to be able to compete, and if it can,
> it's likely to be far more complex than necessary. Parsing is an area
> where we definitely should write our own stuff rather than porting
> existing code from other languages or using existing libraries in other
> languages via C bindings. Fast parsing is definitely a killer feature of
> D and the fact that std.xml botches that so badly is just embarrassing.

To be frank, what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a cappella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei

May 04, 2015
On 5/4/2015 12:31 PM, Jonathan M Davis wrote:
> Given how D's arrays work, we have the opportunity to have an _extremely_ fast
> XML parser thanks to slices. It's highly unlikely that any C or C++ solution is
> going to be able to compete, and if it can, it's likely to be far more complex
> than necessary. Parsing is an area where we definitely should write our own
> stuff rather than porting existing code from other languages or using existing
> libraries in other languages via C bindings. Fast parsing is definitely a killer
> feature of D and the fact that std.xml botches that so badly is just embarrassing.

Tango's XML package was well regarded and the fastest in the business. It used slicing, and almost no memory allocation.

May 04, 2015
On 5/4/2015 2:35 AM, Ola Fosheim Grøstad wrote:
> Wouldn't D-ranges make it impossible to use SIMD optimizations when scanning?

Not at all. Algorithms can be specialized for various forms of input ranges, including ones where SIMD optimizations can be used.

Specialization is one of the very cool things about D algorithms.
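A rough sketch of what such specialization can look like, using template constraints. The name `countByte` is hypothetical. The array overload below is a plain loop and only marks the place where a SIMD or word-at-a-time scan could be substituted; no SIMD is actually used here.

```d
import std.range : chain;
import std.range.primitives : isInputRange;
import std.string : representation;

/// Generic fallback: walks any input range of bytes one element at a time.
size_t countByte(R)(R r, ubyte needle)
    if (isInputRange!R && !is(R : const(ubyte)[]))
{
    size_t n;
    for (; !r.empty; r.popFront())
        if (r.front == needle)
            ++n;
    return n;
}

/// Specialization for contiguous arrays. Because the data is one block of
/// memory, this is where a vectorized scan could go (plain loop in this sketch).
size_t countByte(R)(R data, ubyte needle)
    if (is(R : const(ubyte)[]))
{
    size_t n;
    foreach (b; data)
        if (b == needle)
            ++n;
    return n;
}

void main()
{
    immutable ubyte lt = 0x3C; // '<'

    auto whole = "<a><b/></a>".representation; // contiguous array
    auto pieces = chain("<a><b/>".representation,
                        "</a>".representation); // non-array input range

    assert(countByte(whole, lt) == 3);  // picks the array specialization
    assert(countByte(pieces, lt) == 3); // picks the generic fallback
}
```

The caller writes the same `countByte(...)` call in both cases; the compiler selects the overload from the range type.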

May 04, 2015
On 5/4/2015 12:28 PM, Jacob Carlborg wrote:
> On 2015-05-03 19:39, Robert burner Schadek wrote:
>
>> Not much code yet, I'm currently building the performance test suite
>> https://github.com/burner/std.xml2
>
> I recommend benchmarking against the Tango pull parser.

I agree. The Tango XML parser has set the performance bar. If any new solution can't match that, throw it out and try again.

May 04, 2015
On Monday, 4 May 2015 at 19:45:18 UTC, Andrei Alexandrescu wrote:
> On 5/4/15 12:31 PM, Jonathan M Davis wrote:
>> Fast parsing is definitely a killer feature of
>> D and the fact that std.xml botches that so badly is just embarrassing.
>
> To be frank, what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a cappella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei

Also true. Many of us just don't find enough time to work on D, and we don't seem to do a good job of encouraging larger contributions to Phobos, so newcomers don't tend to contribute like that. And there's so much to do all around that the big stuff just falls by the wayside, and it really shouldn't.

- Jonathan M Davis
May 04, 2015
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek
wrote:
> std.xml has been considered not up to specs for nearly 3 years now. Time to build a successor. I currently plan the following features for it:
>
> - SAX and DOM parser
> - in-situ / slicing parsing when possible (forward range?)
> - compile time switch (CTS) for lazy attribute parsing
> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> - CTS for input validating
> - performance
>
> Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2
>
> Please post your feature requests, and please keep the posts DRY and on topic.

Not a feature, but if `std.data.json` [1] gets accepted into
Phobos, it may be something to consider naming this
`std.data.xml` (although that might not as effectively
differentiate it from `std.xml`).

[1]: http://wiki.dlang.org/Review_Queue
May 05, 2015
On 5/05/2015 10:45 a.m., Liam McSherry wrote:
> On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek
> wrote:
>> std.xml has been considered not up to specs for nearly 3 years now. Time
>> to build a successor. I currently plan the following features for it:
>>
>> - SAX and DOM parser
>> - in-situ / slicing parsing when possible (forward range?)
>> - compile time switch (CTS) for lazy attribute parsing
>> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
>> - CTS for input validating
>> - performance
>>
>> Not much code yet, I'm currently building the performance test suite
>> https://github.com/burner/std.xml2
>>
>> Please post your feature requests, and please keep the posts DRY and on
>> topic.
>
> Not a feature, but if `std.data.json` [1] gets accepted into
> Phobos, it may be something to consider naming this
> `std.data.xml` (although that might not as effectively
> differentiate it from `std.xml`).
>
> [1]: http://wiki.dlang.org/Review_Queue

It really should be std.data.xml, to keep with the new structuring. Plus, it'll make transitioning a little easier.