May 05, 2015
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek wrote:
> std.xml has been considered not up to spec for nearly 3 years now. Time to build a successor. I currently plan the following features for it:
>
> - SAX and DOM parser
> - in-situ / slicing parsing when possible (forward range?)
> - compile time switch (CTS) for lazy attribute parsing
> - CTS for encoding (ubyte(ASCII), char(utf8), ... )
> - CTS for input validating
> - performance
>
> Not much code yet, I'm currently building the performance test suite https://github.com/burner/std.xml2
>
> Please post your feature requests, and please keep the posts DRY and on topic.

Maybe off-topic, but it would be nice if the standard JSON, XML, etc. modules all had identical interfaces (except for implementation-specific quirks). This might be worth discussing if it hasn't already been agreed upon.
May 05, 2015
Am Tue, 05 May 2015 02:01:50 +0000
schrieb "weaselcat" <weaselcat@gmail.com>:

> Maybe off-topic, but it would be nice if the standard JSON, XML, etc. modules all had identical interfaces (except for implementation-specific quirks). This might be worth discussing if it hasn't already been agreed upon.

I don't think this needs discussion. It is plainly impossible to
have a sophisticated JSON parser and a sophisticated XML
parser share the same API. The established function names,
the structure of the formats and their feature sets differ
too much.
For example, in XML attributes and child elements are used
somewhat interchangeably, whereas in JSON attributes don't
exist. So while "obj.field" makes sense in JSON, in XML you
would have to select either an attribute or an element with
the name "field".
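
To make the mismatch concrete, here is a minimal sketch in Python (chosen only because its standard library ships both a JSON and an XML module; the sample documents and variable names are made up for illustration). It shows that JSON has one way to reach "field", while XML needs the caller to say which kind of "field" is meant:

```python
import json
import xml.etree.ElementTree as ET

# The same record, once as JSON and twice as XML (hypothetical sample data).
json_doc = '{"obj": {"field": "42"}}'
xml_attr = '<obj field="42"/>'                # "field" as an attribute
xml_elem = '<obj><field>42</field></obj>'     # "field" as a child element

# JSON: there is exactly one way to reach "field".
json_value = json.loads(json_doc)["obj"]["field"]

# XML: the caller has to choose between attribute and element lookup.
attr_value = ET.fromstring(xml_attr).get("field")
elem_value = ET.fromstring(xml_elem).findtext("field")

# Asking for the wrong kind finds nothing instead of the value.
missing = ET.fromstring(xml_attr).findtext("field")

print(json_value, attr_value, elem_value, missing)
```

A shared "obj.field" accessor would have to pick one of the two XML lookups, or guess.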

-- 
Marco

May 05, 2015
On Monday, 4 May 2015 at 19:31:59 UTC, Jonathan M Davis wrote:
> Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices.

Yes, that would be great. XML is a flexible go-to archive, exchange and application format.
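
As a rough illustration of the slicing idea quoted above, here is a naive sketch in Python, where `memoryview` stands in for D's slices (that substitution is my assumption, and `tag_names` is a hypothetical helper, not part of any library). It returns element names as views into the input buffer instead of copying them. It is not a real XML tokenizer: it does not handle comments, CDATA or attribute values containing `<`.

```python
def tag_names(buf: bytes):
    """Yield zero-copy views of the element names found in start tags."""
    view = memoryview(buf)
    pos = 0
    while True:
        lt = buf.find(b"<", pos)
        if lt == -1:
            return
        start = lt + 1
        # Skip end tags, comments/doctype and processing instructions.
        if start < len(buf) and buf[start] in b"/!?":
            pos = start
            continue
        # The name runs until whitespace, "/" or ">".
        end = start
        while end < len(buf) and buf[end] not in b" \t\r\n/>":
            end += 1
        yield view[start:end]  # a view into buf, no bytes are copied
        pos = end


doc = b'<root><item id="1"/><item/></root>'
names = list(tag_names(doc))
print([bytes(n) for n in names])
```

Each yielded name refers back to the original buffer, which is what lets a slicing parser avoid allocating per token.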

Things like entities, namespaces and so on make it non-trivial, but being able to conveniently process Inkscape and OpenOffice files etc. would be very useful.

One should probably look at what applications generate XML and create some large test files with existing applications.
May 05, 2015
On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote:
> On 2015-05-03 19:39, Robert burner Schadek wrote:
>
>> Not much code yet, I'm currently building the performance test suite
>> https://github.com/burner/std.xml2
>
> I recommend benchmarking against the Tango pull parser.

Recently, I compared DOM parsers for an XML file of 100 MByte:

15.8 s tango.text.xml (SiegeLord/Tango-D2)
13.4 s ae.utils.xml (CyberShadow/ae)
 8.5 s xml.etree (Python)

Either the Tango DOM parser is slow compared to the Tango pull parser,
or the D2 port ruined the performance.
May 05, 2015
On Tuesday, 5 May 2015 at 10:41:37 UTC, Mario Kröplin wrote:
> On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote:
>> On 2015-05-03 19:39, Robert burner Schadek wrote:
>>
>>> Not much code yet, I'm currently building the performance test suite
>>> https://github.com/burner/std.xml2
>>
>> I recommend benchmarking against the Tango pull parser.
>
> Recently, I compared DOM parsers for an XML file of 100 MByte:
>
> 15.8 s tango.text.xml (SiegeLord/Tango-D2)
> 13.4 s ae.utils.xml (CyberShadow/ae)
>  8.5 s xml.etree (Python)
>
> Either the Tango DOM parser is slow compared to the Tango pull parser,
> or the D2 port ruined the performance.

As usual: system, compiler, compiler version, compilation flags?
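
For the Python entry in that list, a report could carry this information along with the timing. The following is only a sketch under my own assumptions: the generated document, its size and the `benchmark` helper are hypothetical and have nothing to do with the 100 MByte file used above.

```python
import io
import platform
import time
import xml.etree.ElementTree as ET


def make_document(n_items: int) -> bytes:
    """Build a synthetic XML document (a stand-in for a real test file)."""
    items = "".join(f'<item id="{i}">value {i}</item>' for i in range(n_items))
    return f"<root>{items}</root>".encode("utf-8")


def benchmark(n_items: int = 10_000) -> dict:
    """Time one DOM-style parse and record the environment next to it."""
    data = make_document(n_items)
    start = time.perf_counter()
    root = ET.parse(io.BytesIO(data)).getroot()
    elapsed = time.perf_counter() - start
    return {
        "system": platform.platform(),
        "machine": platform.machine(),
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        "input_bytes": len(data),
        "elements": len(root),  # direct children of <root>
        "seconds": elapsed,
    }


report = benchmark()
for key, value in report.items():
    print(f"{key}: {value}")
```

For the D entries the equivalent fields would be the compiler, its version and the flags used.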
May 05, 2015
On 05/05/2015 11:41, "Mario Kröplin" <linkrope@github.com> wrote:
>
> Recently, I compared DOM parsers for an XML file of 100 MByte:
>
> 15.8 s tango.text.xml (SiegeLord/Tango-D2)
> 13.4 s ae.utils.xml (CyberShadow/ae)
>   8.5 s xml.etree (Python)
>
> Either the Tango DOM parser is slow compared to the Tango pull parser,
> or the D2 port ruined the performance.


fwiw I did some tests a couple of years back with https://launchpad.net/d2-xml on 20 odd megabyte files and found it faster than Tango.
Unfortunately that would need some work to test now, as xmlp is abandoned and wouldn't build last time I tried it :-(

I also had some success with https://github.com/opticron/kxml, though it had some issues with chuffy entity decoding performance.


Also, profiling showed a lot of time spent in the GC, and the recent improvements in that area might have changed things by now.
May 05, 2015
On 2015-05-05 12:41, "Mario Kröplin" <linkrope@github.com> wrote:

> Recently, I compared DOM parsers for an XML file of 100 MByte:
>
> 15.8 s tango.text.xml (SiegeLord/Tango-D2)
> 13.4 s ae.utils.xml (CyberShadow/ae)
>   8.5 s xml.etree (Python)
>
> Either the Tango DOM parser is slow compared to the Tango pull parser,

Yes, of course it's slower. The DOM parser creates a DOM as well, which the pull parser doesn't.

These other libraries, what kind of parsers are those using? I mean, it's not fair to compare a pull parser against a DOM parser.
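
To show the difference in API shape, here is a small sketch using Python's xml.etree (one of the libraries in the list above). The sample document is made up. Note that ElementTree's `XMLPullParser` still creates element objects internally, so this illustrates the two styles, not the allocation savings a pull parser like Tango's is supposed to give:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input.
doc = '<log><entry level="info"/><entry level="error"/><entry level="info"/></log>'

# DOM style: build the whole tree first, then query it.
root = ET.fromstring(doc)
dom_errors = sum(1 for e in root.iter("entry") if e.get("level") == "error")

# Pull style: feed the parser and consume events one at a time.
parser = ET.XMLPullParser(events=("start",))
parser.feed(doc)
parser.close()
pull_errors = 0
for event, elem in parser.read_events():
    if elem.tag == "entry" and elem.get("level") == "error":
        pull_errors += 1

print(dom_errors, pull_errors)
```

Both count the same entries, but only the first has to have the complete tree available before it can answer.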

Could you try D1 Tango as well? Or do you have the benchmark available somewhere?

> or the D2 port ruined the performance.

Might be the case as well, see this comment [1].

[1] http://forum.dlang.org/thread/vsbsxfeciryrdsjhhfak@forum.dlang.org?page=3#post-mi8hs8:24b0j:241:40digitalmars.com

-- 
/Jacob Carlborg
May 05, 2015
On Tuesday, 5 May 2015 at 12:10:59 UTC, Jacob Carlborg wrote:
> Yes, of course it's slower. The DOM parser creates a DOM as well, which the pull parser doesn't.
>
> These other libraries, what kind of parsers are those using? I mean, it's not fair to compare a pull parser against a DOM parser.

I agree. Most applications will use a DOM parser for convenience, so sacrificing some speed initially in favour of ease of use makes a lot of sense, as long as it is possible to improve it later (e.g. use SIMD scanning to find the end of CDATA etc).

In my opinion it is rather difficult to build a good API without also using the API in an application in parallel. So it would be a good strategy to build a specific DOM along with writing the XML infrastructure, like SVG/HTML.

Also, some parsers, like RapidXML, only support a subset of XML, so they cannot be used for comparisons.
May 05, 2015
On 5/5/2015 4:16 AM, Richard Webb wrote:
> Also, profiling showed a lot of time spent in the GC, and the recent
> improvements in that area might have changed things by now.

I haven't read the Tango source code, but the performance of its XML parser was supposedly because it did not use the GC; it used slices.
May 06, 2015
On 2015-05-06 01:38, Walter Bright wrote:

> I haven't read the Tango source code, but the performance of its XML
> parser was supposedly because it did not use the GC; it used slices.

That's only true for the pull parser (not sure about the SAX parser). The DOM parser needs to allocate the nodes, but if I recall correctly those are allocated in a free list. Not sure which parser was used in the test.

-- 
/Jacob Carlborg