std.xml: Why is it so slow? Is there anything else wrong with it?

Mar 13, 2011

dsimcha

Mar 13, 2011

Daniel Gibson

There seems to be a consensus around here that Phobos needs a good XML module, and that std.xml doesn't cut it, at least partly due to performance issues. I have no clue how to write a good XML module from scratch. It seems like no one else is taking up the project either. This leads me to two questions:

1. Has anyone ever sat down and tried to figure out **why** std.xml is so slow? Seriously, if no one's bothered to profile it or read the code carefully, then for all we know there might be some low-hanging fruit and it might be an afternoon of optimization away from being reasonably fast. Basically every experience I've ever had suggests that, if a piece of code has not already been profiled and heavily optimized, at least a 5-fold speedup can almost always be obtained just by optimizing the low-hanging fruit. (For example, see my recent pull request for the D garbage collector. BTW, if excessive allocations are a contributing factor, then fixing the GC should help with XML, too.)

If the answer is no, this hasn't been done, please post some canned benchmarks and maybe I'll take a crack at it.

2. What other major defects/design flaws, if any, does std.xml have?

In other words, how are we really so sure that we need to start from scratch?

On 13.03.2011 05:34, dsimcha wrote:
> There seems to be a consensus around here that Phobos needs a good XML module, and that std.xml doesn't cut it, at least partly due to performance issues. I have no clue how to write a good XML module from scratch. It seems like no one else is taking up the project either. This leads me to two questions:

Isn't Tomek Sowiński working on it?

> 1. Has anyone ever sat down and tried to figure out **why** std.xml is so slow? Seriously, if no one's bothered to profile it or read the code carefully, then for all we know there might be some low-hanging fruit and it might be an afternoon of optimization away from being reasonably fast. Basically every experience I've ever had suggests that, if a piece of code has not already been profiled and heavily optimized, at least a 5-fold speedup can almost always be obtained just by optimizing the low-hanging fruit. (For example, see my recent pull request for the D garbage collector. BTW, if excessive allocations are a contributing factor, then fixing the GC should help with XML, too.)
>
> If the answer is no, this hasn't been done, please post some canned benchmarks and maybe I'll take a crack at it.
>
> 2. What other major defects/design flaws, if any, does std.xml have?
>
> In other words, how are we really so sure that we need to start from scratch?

(These questions should probably be discussed nevertheless.)

Cheers,
- Daniel

On Saturday 12 March 2011 20:39:31 Daniel Gibson wrote:
> On 13.03.2011 05:34, dsimcha wrote:
> > There seems to be a consensus around here that Phobos needs a good XML module, and that std.xml doesn't cut it, at least partly due to performance issues. I have no clue how to write a good XML module from scratch. It seems like no one else is taking up the project either. This leads me to two questions:
>
> Isn't Tomek Sowiński working on it?

Yes.

> > 1. Has anyone ever sat down and tried to figure out **why** std.xml is so slow? Seriously, if no one's bothered to profile it or read the code carefully, then for all we know there might be some low-hanging fruit and it might be an afternoon of optimization away from being reasonably fast. Basically every experience I've ever had suggests that, if a piece of code has not already been profiled and heavily optimized, at least a 5-fold speedup can almost always be obtained just by optimizing the low-hanging fruit. (For example, see my recent pull request for the D garbage collector. BTW, if excessive allocations are a contributing factor, then fixing the GC should help with XML, too.)
> >
> > If the answer is no, this hasn't been done, please post some canned benchmarks and maybe I'll take a crack at it.
> >
> > 2. What other major defects/design flaws, if any, does std.xml have?
> >
> > In other words, how are we really so sure that we need to start from scratch?

As I understand it, one of the main issues is that std.xml is delegate-based. I don't know how well it does with slicing and avoiding copying strings, but one of the biggest advantages that D has is its array slicing. Taking full advantage of that and avoiding string copying is one of the best ways, if not _the_ best way, to make std.xml lightning fast.

In any case, there was a discussion about std.xml recently, and the consensus was that we should just throw it out rather than leave it there and have people complain about how bad Phobos' XML module is.
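
To make the slicing point concrete, here is a minimal sketch in D. It is not taken from std.xml: `firstTextNode` is a hypothetical helper invented for illustration, and it ignores everything a real parser has to handle (attributes, entities, nesting). It only shows that a slice refers to the original buffer, whereas `.dup` allocates a copy.

```d
import std.string : indexOf;

// Hypothetical helper, not part of std.xml: returns the text between the
// first '>' and the next '<' as a slice of the input. No characters are copied.
string firstTextNode(string xml)
{
    auto start = xml.indexOf('>');
    if (start < 0)
        return null;                 // no tag end found
    auto rest = xml[start + 1 .. $]; // slice: same memory, new bounds
    auto end = rest.indexOf('<');
    if (end < 0)
        return null;                 // no following tag found
    return rest[0 .. end];
}

void main()
{
    string doc = "<name>Phobos</name>";
    string text = firstTextNode(doc);
    assert(text == "Phobos");

    // The slice points into the original buffer, 6 chars past "<name>".
    assert(text.ptr == doc.ptr + 6);

    // An explicit copy, by contrast, lives in freshly allocated memory.
    char[] copy = text.dup;
    assert(copy == text);
    assert(copy.ptr != text.ptr);
}
```

A parser built along these lines would hand out slices of the input and allocate only when the caller explicitly asks for a copy. Whether std.xml's slowness actually comes from copying is an assumption here; nobody in this thread has profiled it.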

As Daniel pointed out, Tomek Sowiński is currently working on a new std.xml. I don't know how far along he is or when he expects it to be done, but supposedly he's working on it and sometime reasonably soon we should have a new std.xml to review.

We are definitely _not_ going to be working on improving the current std.xml though. I think that the only reason that it's still there is that Andrei didn't get around to throwing it out before the last release (or at least deprecating it). That's definitely what he wants to do, and the consensus was in favor of that decision.

- Jonathan M Davis

On Sat, 2011-03-12 at 23:34 -0500, dsimcha wrote:
> There seems to be a consensus around here that Phobos needs a good XML module, and that std.xml doesn't cut it, at least partly due to performance issues. I have no clue how to write a good XML module from scratch. It seems like no one else is taking up the project either.

I just worry that creating a whole self-standing library is a waste of time when wrapping libxml2 and libxslt gets a fast XML subsystem for free. This is the direction Python has gone; cf. the lxml package as a replacement for ElementTree.

The elephant in the room is of course W3C DOM. Everyone believes they have to have an implementation, but no one then uses it.

> This leads me to two questions:
>
> 1. Has anyone ever sat down and tried to figure out **why** std.xml is so slow? Seriously, if no one's bothered to profile it or read the code carefully, then for all we know there might be some low-hanging fruit and it might be an afternoon of optimization away from being reasonably fast. Basically every experience I've ever had suggests that, if a piece of code has not already been profiled and heavily optimized, at least a 5-fold speedup can almost always be obtained just by optimizing the low-hanging fruit. (For example, see my recent pull request for the D garbage collector. BTW, if excessive allocations are a contributing factor, then fixing the GC should help with XML, too.)
>
> If the answer is no, this hasn't been done, please post some canned benchmarks and maybe I'll take a crack at it.
>
> 2. What other major defects/design flaws, if any, does std.xml have?
>
> In other words, how are we really so sure that we need to start from scratch?

Excellent question. Especially given the existence of libxml2 and libxslt.

--
Russel.