Replacing std.xml (page 5)

On 2013-08-31 17:43, ilya-stromberg wrote:
> Also, we have Tango Xml:
> https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml
>
> It's the fastest Xml parser in the world, so maybe you can find it useful:
> dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
> dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Unfortunately the Tango XML package will never end up in Phobos due to licensing issues.

--
/Jacob Carlborg

On 2013-08-31 15:43:00 +0000, "ilya-stromberg" <ilya-stromberg-2009@yandex.ru> said:
> On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath wrote:
>> There is http://dsource.org/projects/xmlp, which at some point has been proposed for std.xml2. But that has been stalled for some time now.
>
> Also, we have Tango Xml:
> https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml
>
> It's the fastest Xml parser in the world, so maybe you can find it useful:
> dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
> dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Someone should benchmark it against the XML implementation I made. It has many of the same characteristics.

For instance, Tango's SaxParser is based on its PullParser. This design requires the use of a dynamic array to maintain a stack of opened elements. While not a huge performance hit, you don't need that if you use recursion, which you can do with my implementation. You can do that even though you can also use it as a pull tokenizer[^1] when needed (recursion is optional on a token-by-token basis).

[^1]: IMHO, PullParser isn't a really good term for something that does not conform to the requirements of a parser in the XML spec. Tokenizer is a better term.

--
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

On 2013-08-31 20:53, Michel Fortin wrote:
> [^1]: IMHO, PullParser isn't a really good term for something that does not conform to the requirements of a parser in the XML spec. Tokenizer is a better term.

I guess "Pull" is the key here: it is the client's responsibility to fetch the next token, not the other way around.

--
/Jacob Carlborg
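
To make the pull/push distinction concrete, here is a minimal sketch in D. `PullTokenizer` and `pushTokens` are hypothetical names made up for illustration; they are not Tango's or Phobos's API. The tokenizer only splits on `<` and `>`, so it is nowhere near a conforming XML parser. The push function is deliberately layered on top of the pull tokenizer, in the spirit of a SAX-style parser built on a pull one.

```d
import std.string : indexOf;

/// Hypothetical minimal tokenizer: splits the input into tags ("<...>") and
/// text runs. No comments, CDATA, entities, or '>' inside attribute values.
struct PullTokenizer
{
    string input;

    /// Pull style: the client asks for the next token.
    /// Returns false once the input is exhausted.
    bool next(out string token)
    {
        if (input.length == 0)
            return false;

        size_t end;
        if (input[0] == '<')
        {
            // A tag runs up to and including the next '>'.
            auto close = input.indexOf('>');
            end = close < 0 ? input.length : cast(size_t) close + 1;
        }
        else
        {
            // A text run stops just before the next '<'.
            auto open = input.indexOf('<');
            end = open < 0 ? input.length : cast(size_t) open;
        }
        token = input[0 .. end];
        input = input[end .. $];
        return true;
    }
}

/// Push (SAX-like) style: the library owns the loop and calls the client back.
void pushTokens(string input, scope void delegate(string) onToken)
{
    auto tokenizer = PullTokenizer(input);
    string tok;
    while (tokenizer.next(tok))
        onToken(tok);
}

unittest
{
    // Pull: the client drives the loop and may stop whenever it likes.
    auto tokenizer = PullTokenizer("<a>hi</a>");
    string tok;
    string[] pulled;
    while (tokenizer.next(tok))
        pulled ~= tok;
    assert(pulled == ["<a>", "hi", "</a>"]);

    // Push: the library drives and the client only reacts.
    string[] pushed;
    pushTokens("<a>hi</a>", (string s) { pushed ~= s; });
    assert(pushed == pulled);
}
```

With the pull form the caller can interleave its own logic between calls to `next`, or simply stop early; with the push form it has to keep any state it needs inside the callback.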

On 8/29/2013 12:25 AM, w0rp wrote:
> Hello everybody. I've been wondering, what are the current plans to replace std.xml? I'd like to help with the effort to get a final XML library in phobos. So, I have a few questions.
>
> First, and most importantly, what do we expect out of a D XML library? I'd really like to have a discussion of the form, "Here is exactly the interface the structs/classes need to implement, go forth and implement." The general idea in my mind is "something SAX-like, with something a little DOM-like." I'm aware that std.xml has some issues supporting different encodings, so obviously that's included.
>
> Second, is there an existing library that has gotten close to meeting whatever we need for the first point? If so, how far away is it from being able to meet all of the requirements and become the standard library version?

The Tango implementation of XML has been very well received. I haven't looked at it, but it was designed to do no memory allocation - it just did slices over the input.

I don't believe it should make any attempt at decoding. Decoding entails both performance loss and memory consumption. If the user wants to do decoding, they can layer it on the output.

And lastly, it should of course sport a range interface.
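
A rough sketch of what "slices over the input" plus "a range interface" could look like in D. `TokenRange` is a hypothetical type invented for this example (it is not Tango's design and not a proposed Phobos API), and it only splits on `<` and `>`. Each `front` is a slice of the original string, and entities such as `&amp;` are passed through undecoded, so any decoding would be layered on by the caller.

```d
import std.range.primitives : isInputRange;
import std.string : indexOf;

/// Hypothetical input range over tags and text runs. Every token is a slice
/// of the original input; nothing is copied and nothing is decoded.
struct TokenRange
{
    private string rest;    // input not yet tokenized
    private string current; // token currently exposed by front
    private bool done;

    this(string input)
    {
        rest = input;
        popFront(); // prime the first token
    }

    @property bool empty() { return done; }
    @property string front() { return current; }

    void popFront()
    {
        if (rest.length == 0)
        {
            done = true;
            return;
        }

        size_t end;
        if (rest[0] == '<')
        {
            auto close = rest.indexOf('>');
            end = close < 0 ? rest.length : cast(size_t) close + 1;
        }
        else
        {
            auto open = rest.indexOf('<');
            end = open < 0 ? rest.length : cast(size_t) open;
        }
        current = rest[0 .. end]; // a view into the input, not a copy
        rest = rest[end .. $];
    }
}

// Compile-time check that the type satisfies the input range primitives.
static assert(isInputRange!TokenRange);

unittest
{
    import std.algorithm.comparison : equal;

    string xml = "<p>a &amp; b</p>";

    // The entity is left exactly as written in the source text.
    assert(TokenRange(xml).equal(["<p>", "a &amp; b", "</p>"]));

    // Every token points into the original buffer.
    foreach (tok; TokenRange(xml))
        assert(tok.ptr >= xml.ptr && tok.ptr + tok.length <= xml.ptr + xml.length);
}
```

Because it is an ordinary input range, it works with `foreach` and with the generic algorithms in `std.algorithm` without any XML-specific glue.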

On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
> On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
> [...]
>> Well, as I said, I couldn't remember exactly what the XML standard said about encodings, but if it can contain non-ASCII characters, then my first inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the fact that that's what we support in the language and in Phobos
>
> Take a look here:
>
> http://www.w3schools.com/xml/xml_encoding.asp
>
> XML files can have *any* valid encoding, including nastiness like windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we have a way around this, since existing XML files out there probably already have all of these encodings and more, and std.xml is gonna hafta support 'em all. Otherwise we're gonna get irate users complaining "why can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"

This is not the first time I have seen it used as a reliable source, so: no, w3schools is full of shit. Don't use that website when looking for precise, high-quality information.

On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
> Unfortunately the Tango XML package will never end up in Phobos due to licensing issues.

Yes, but we can always study the source code and pay attention to the design solutions.

On Sunday, September 01, 2013 10:02:50 ilya-stromberg wrote:
> On Saturday, 31 August 2013 at 18:03:10 UTC, Jacob Carlborg wrote:
>> Unfortunately the Tango XML package will never end up in Phobos due to licensing issues.
>
> Yes, but we can always study the source code and pay attention to the design solutions.

Not really. Looking at the source code effectively taints you. By doing so, you run the risk of being accused of copying if anything you do is similar enough. It's just safer to never look at source code when its license would prevent you from using that code.

- Jonathan M Davis

On 31/08/2013 16:43, ilya-stromberg wrote:
> It's the fastest Xml parser in the world, so maybe you can find it useful:
> dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
> dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/

Has anyone done any benchmarks recently to see if that is still the case? I did some (admittedly brief) tests last year and found that xmlp was actually faster at building large XML docs into a DOM. There have been lots of changes since then, so I don't know if that is still the case.

September 02, 2013

Re: Replacing std.xml

Posted by qznc
in reply to Michel Fortin

On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
> On 2013-08-31 15:43:00 +0000, "ilya-stromberg" <ilya-stromberg-2009@yandex.ru> said:
>
>> On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath wrote:
>>> There is http://dsource.org/projects/xmlp, which at some point has been proposed for std.xml2. But that has been stalled for some time now.
>> 
>> Also, we have Tango Xml:
>> https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml
>> 
>> It's the fastest Xml parser in the world, so maybe you can find it useful:
>> dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
>> dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/
>
> Someone should benchmark it against the XML implementation I made. It has many of the same characteristics.
>
> For instance, Tango's SaxParser is based on its PullParser. This design requires the use of a dynamic array to maintain a stack of opened elements. While not a huge performance hit, you don't need that if you use recursion, which you can do with my implementation. You can do that even though you can also use it as a pull tokenizer[^1] when needed (recursion is optional on a token-by-token basis).

Recursion means you use the call stack instead of a stack object on the heap.

Be careful about nesting depth. There are XML documents out there with thousands or more nested elements. With recursion on a 32-bit machine you might get a stack overflow, but a heap-allocated stack could handle a million nested elements.

On 2013-09-02 13:34:18 +0000, "qznc" <qznc@web.de> said:
> On Saturday, 31 August 2013 at 18:53:42 UTC, Michel Fortin wrote:
>> For instance, Tango's SaxParser is based on its PullParser. This design requires the use of a dynamic array to maintain a stack of opened elements. While not a huge performance hit, you don't need that if you use recursion, which you can do with my implementation. You can do that even though you can also use it as a pull tokenizer[^1] when needed (recursion is optional on a token-by-token basis).
>
> Recursion means you use the call stack instead of a stack object on the heap.
>
> Be careful about nesting depth. There are XML documents out there with thousands or more nested elements. With recursion on a 32-bit machine you might get a stack overflow, but a heap-allocated stack could handle a million nested elements.

Good point about caring for pathological cases.

--
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca
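
For anyone following along, the trade-off between the two approaches can be sketched in D roughly like this. Everything here is hypothetical illustration code: the function names are invented, the input is assumed to be already split into tokens of the form `<name>`, `</name>`, or plain text, and neither function reflects how Tango or any other library mentioned in this thread is actually written.

```d
private bool isClose(string tok)
{
    return tok.length >= 3 && tok[0 .. 2] == "</" && tok[$ - 1] == '>';
}

private bool isOpen(string tok)
{
    return tok.length >= 2 && tok[0] == '<' && tok[$ - 1] == '>' && !isClose(tok);
}

private string nameOf(string tok)
{
    return isClose(tok) ? tok[2 .. $ - 1] : tok[1 .. $ - 1];
}

/// Explicit stack in a dynamic array: nesting depth is bounded by available
/// heap memory rather than by the size of the call stack.
bool balancedWithStack(string[] tokens)
{
    string[] open; // names of the currently open elements
    foreach (tok; tokens)
    {
        if (isClose(tok))
        {
            if (open.length == 0 || open[$ - 1] != nameOf(tok))
                return false;
            open = open[0 .. $ - 1];
            open.assumeSafeAppend(); // reuse the capacity on the next push
        }
        else if (isOpen(tok))
            open ~= nameOf(tok);
    }
    return open.length == 0;
}

/// Recursive variant: no explicit stack, but every open element costs one
/// call-stack frame, so very deep documents can overflow the stack.
bool balancedWithRecursion(string[] tokens)
{
    return descend(tokens, null, 0);
}

private bool descend(ref string[] tokens, string parent, size_t depth)
{
    while (tokens.length)
    {
        auto tok = tokens[0];
        tokens = tokens[1 .. $];
        if (isClose(tok))
            return depth > 0 && nameOf(tok) == parent;
        if (isOpen(tok) && !descend(tokens, nameOf(tok), depth + 1))
            return false;
    }
    return depth == 0; // running out of input is only fine at top level
}

unittest
{
    import std.array : array;
    import std.range : chain, repeat;

    auto ok = ["<a>", "<b>", "text", "</b>", "</a>"];
    auto bad = ["<a>", "<b>", "</a>", "</b>"];
    assert(balancedWithStack(ok) && balancedWithRecursion(ok));
    assert(!balancedWithStack(bad) && !balancedWithRecursion(bad));

    // Pathological nesting: only the heap-backed version is exercised here,
    // since whether the recursive one survives depends on the stack size.
    auto deep = chain("<x>".repeat(100_000), "</x>".repeat(100_000)).array;
    assert(balancedWithStack(deep));
}
```

A recursive design that wants to be safe against hostile input would presumably need a depth limit of some kind, which is the pathological case being acknowledged above.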
