Re: dxml 0.1.0 released
February 09, 2018
On Fri, Feb 09, 2018 at 02:15:33PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
> I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.

Hooray!  Finally, a glimmer of hope for XML parsing in D!


> Currently, dxml contains only a range-based StAX / pull parser and related helper functions, but the plan is to add a DOM parser as well as two writers - one which is the writer equivalent of a StAX parser, and one which is DOM-based. However, in theory, the StAX parser is complete and quite useable as-is - though I expect that I'll be adding more helper functions to make it easier to use, and if you find that you're doing a particular operation with it frequently and that that operation is overly verbose, please point it out so that maybe a helper function can be added to improve that use case - e.g. I'm thinking of adding a function similar to std.getopt.getopt for handling attributes, because I personally find that dealing with those is more verbose than I'd like. Obviously, some stuff is just going to do better with a DOM parser, but thus far, I've found that a StAX parser has suited my needs quite well. I have no plans to add a SAX parser, since as far as I can tell, SAX parsers are just plain worse than StAX parsers, and the StAX approach is quite well-suited to ranges.
> 
> Of note, dxml does not support the DTD section beyond what is required to parse past it, since supporting it would make it impossible for the parser to return slices of the original input beyond the case where strings are used (and it would be forced to allocate strings in some cases, whereas dxml does _very_ minimal heap allocation right now), and parsing the DTD section significantly increases the complexity of the parser in order to support something that I honestly don't think should ever have been part of the XML standard and is unnecessary for many, many XML documents. So, if you're dealing with XML documents that contain entity references that are declared in the DTD section and then used outside of the DTD section, then dxml will not support them, but it will work just fine if a DTD section is there so long as it doesn't declare any entity references that are then referenced in the document proper.
> 
> Hopefully, the documentation is clear enough, but obviously, I'm not the best judge of that. So, have at it.
> 
> Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
> Github: https://github.com/jmdavis/dxml
> Dub: http://code.dlang.org/packages/dxml
[...]

Wonderful!  The docs are beautiful, I must say.  Good job on that. Though a simple example of basic usage in the module header would be very nice.

Glanced over the docs.  It's a pretty nice and clean API, and IMO, worthy of consideration to be included into Phobos.  IMO, the lack of SAX / DOM parsing is not a big deal, since it's not hard to build one given StAX primitives.
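
To illustrate the point about building a DOM on top of pull primitives, here is a minimal sketch in Python rather than D. It uses the standard library's xml.dom.pulldom as a stand-in for a StAX-style event stream; it is not dxml's API, and the build_tree name and the dict-based node layout are assumptions made purely for the example.

```python
from xml.dom import pulldom


def build_tree(xml_text):
    """Fold a stream of pull-parser events into a nested dict tree."""
    root = None
    stack = []  # elements that are currently open, innermost last
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            elem = {
                "tag": node.tagName,
                "attrs": dict(node.attributes.items()),
                "text": "",
                "children": [],
            }
            if stack:
                stack[-1]["children"].append(elem)
            else:
                root = elem
            stack.append(elem)
        elif event == pulldom.CHARACTERS and stack:
            # Attach character data to the innermost open element.
            stack[-1]["text"] += node.data
        elif event == pulldom.END_ELEMENT:
            stack.pop()
    return root


tree = build_tree('<a x="1"><b>hi</b><c/></a>')
```

The whole tree builder is one loop plus a stack, and a SAX-style layer would be the same loop calling user-supplied handlers instead of appending to a tree.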

Being range-based is very nice, but I'd say your choice to slice the input and defer expensive/allocating operations to normalize() is a big winning point.  This approach is fundamental to high performance, in the principle of not doing any operation that isn't strictly necessary until it's actually asked for.  If nothing else, this is a good design pattern that I plan to st^Wcopy in my own code. :-P
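
The deferral pattern can be sketched roughly as follows. This is Python, not D, and it is not how dxml is implemented: TextSlice is a hypothetical type, and Python copies on slicing where a D slice would not, so the sketch records offsets and only shows where the work happens.

```python
from dataclasses import dataclass
from xml.sax.saxutils import unescape


@dataclass(frozen=True)
class TextSlice:
    """Hypothetical handle to a span of the input; nothing is decoded up front."""
    source: str
    start: int
    end: int

    def raw(self) -> str:
        # The untouched input text (Python copies here; a D slice would not).
        return self.source[self.start:self.end]

    def normalize(self) -> str:
        # The costly step, run only on request: decode &lt;, &gt; and &amp;.
        return unescape(self.raw())


doc = "<msg>fish &amp; chips &lt;hot&gt;</msg>"
text = TextSlice(doc, doc.index(">") + 1, doc.index("</msg>"))
```

A caller that only needs to skip over or compare raw text never pays for normalize().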

As for DTDs, perhaps it might be enough to make normalize() configurable with some way to specify additional entities that may be defined in the DTD?  Once that's possible, I'd say it's Good Enough(tm), since the user will have the tools to build DTD support from what they're given.  Of course, "standard" DTD support can be added later, built on the current StAX parser.
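
As a rough picture of what a configurable normalization step could look like, Python's xml.sax.saxutils.unescape already accepts a caller-supplied entity table. The sketch below assumes a hypothetical normalize wrapper and made-up entity names; it says nothing about how dxml's normalize() works or might be extended.

```python
from xml.sax.saxutils import unescape

# Hypothetical table of entities that a document's DTD might declare.
dtd_entities = {"&company;": "Acme Ltd", "&copy;": "\u00a9"}


def normalize(raw, extra_entities=None):
    # unescape() decodes &lt;, &gt; and &amp;; caller-supplied replacements
    # are applied before &amp; is decoded.
    return unescape(raw, extra_entities or {})


result = normalize("&copy; 2018 &company; &amp; friends", dtd_entities)
```

This only covers entities whose replacement is plain text, since it works by string substitution on already-parsed character data.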

I would support it if you proposed dxml to be added to Phobos.


T

-- 
There are 10 kinds of people in the world: those who can count in binary, and those who can't.
February 09, 2018
On Friday, February 09, 2018 13:47:52 H. S. Teoh via Digitalmars-d-announce wrote:
> As for DTDs, perhaps it might be enough to make normalize() configurable with some way to specify additional entities that may be defined in the DTD?  Once that's possible, I'd say it's Good Enough(tm), since the user will have the tools to build DTD support from what they're given.  Of course, "standard" DTD support can be added later, built on the current StAX parser.

As I understand it (though IMHO, the spec isn't clear enough, and I'd have to go over it with a fine-tooth comb to make sure that I got it right), as soon as you start dealing with entity references, you can pretty much just drop whole sections of XML into your document, fundamentally changing the document. So, I don't think that it's possible to deal with the entity references after the fact. They're basically macros that have to be expanded while you're parsing, which is part of why they're so disgusting IMHO - even without getting into any of the document validation stuff.
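
A small sketch of the macro-like behaviour, using Python's standard xml.etree.ElementTree (backed by expat) rather than anything in dxml. The document and entity names are invented for the example. One entity expands to plain text, the other to markup, and the second should surface as an extra child element rather than as character data.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring one text entity and one markup entity.
doc = """<!DOCTYPE order [
  <!ENTITY who "world">
  <!ENTITY extra "<item>gift wrap</item>">
]>
<order><note>hello &who;</note>&extra;</order>"""

root = ET.fromstring(doc)
child_tags = [child.tag for child in root]  # element structure after expansion
note_text = root.find("note").text          # text after expansion
```

Because the second reference changes the element structure, it cannot be resolved by post-processing the text of an already-parsed document.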

Though honestly, the part about the DTD section that I find truly offensive is that the document itself is defining what constitutes valid input. Since when does it make any sense for the _input_ for a program to tell the program what constitutes valid input? That's for the program to decide. And considering how much more complicated the parser has to be to properly deal with the DTD, its inclusion in the spec seems absolutely insane to me.

And none of that mess is necessary for simple, sane XML documents that are just providing data.

I _might_ add a DTD parser later, but if I do, it will almost certainly be its own separate parser. However, given how much of my life I would then be wasting on something that I consider to be of essentially zero value (if not negative value), I don't see myself doing it without someone paying me to. IMHO, the only reason that it makes any sense to fully support the DTD section is for those poor folks who have to deal with XML documents where someone else decided to use those features, and they don't have any choice. I would hope that few programmers would actually _want_ to be using those features.

> I would support it if you proposed dxml to be added to Phobos.

I've thought about it, but I'd like to complete the writers and the DOM parser first as well as see it get at least somewhat battle-tested. Right now, it's just been used in a couple of my personal projects, which did affect some of my design choices (for the better, I think), but since no one else has done anything with it, there may be something that it needs that I've completely missed. The API is simple enough that I _think_ that it's good as-is and that improvements are largely a question of adding helper functions, but the library does need more widespread use and feedback.

- Jonathan M Davis

February 11, 2018
On Fri, 2018-02-09 at 13:47 -0800, H. S. Teoh via Digitalmars-d-announce wrote:
> On Fri, Feb 09, 2018 at 02:15:33PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
> > I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.
> 
> Hooray!  Finally, a glimmer of hope for XML parsing in D!

I wonder why no-one has tried using DStep to create a D binding for libxml2 and libxslt.

Whilst Python has SAX and DOM parsing capability (well, three different ones in the standard library), anyone doing serious XML work in Python uses lxml, which is just a Python binding to libxml2 and libxslt.

If Python people have given up on the XML stuff in its standard library and use a binding to a well known and distributed one, is this a good path for D?


-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk


February 11, 2018
On Sunday, February 11, 2018 10:11:05 Russel Winder via Digitalmars-d-announce wrote:
> On Fri, 2018-02-09 at 13:47 -0800, H. S. Teoh via Digitalmars-d-announce wrote:
> > On Fri, Feb 09, 2018 at 02:15:33PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
> > > I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.
> >
> > Hooray!  Finally, a glimmer of hope for XML parsing in D!
>
> I wonder why no-one has tried using DStep to create a D binding for libxml2 and libxslt.
>
> Whilst Python has a SAX and DOM parsing capability, well three different ones in the standard library, anyone doing serious XML work in Python uses lxml which is just a Python binding to libxml2 and libxslt.
>
> If Python people have given up on the XML stuff in its standard library and use a binding to a well known and distributed one, is this a good path for D?

Given how strings work in D, parsing is something that we should easily be able to do faster than other languages - or at least, other languages typically have to write much less idiomatic code and go to a lot more effort to reach the speeds that we can easily reach with idiomatic D code. So, in general, IMHO, parsers are one of those things that we should typically be writing natively.

That being said, if someone really wants full DTD support, I have no problem sending them off to deal with bindings to C/C++ libraries, since I for one am not willing to put in the time or effort to support that part of the XML spec, since it complicates things considerably while adding nothing positive IMHO. I'm sure that a D solution could compete excellently with a C/C++ solution, but it's sure not worth my time and effort, and no one else has stepped up to implement anything along those lines.

Also, we're not about to put bindings to a C/C++ library for XML in Phobos (it's already been argued quite a bit that doing so with curl was a big mistake), so if we want to replace std.xml, that calls for writing a replacement in D.

- Jonathan M Davis

February 11, 2018
On Sun, 2018-02-11 at 03:34 -0700, Jonathan M Davis via Digitalmars-d-announce wrote:
> 
[…]
> Given how strings work in D, parsing is something that we should easily be able to do faster than other languages - or at least, other languages typically have to write much less idiomatic code and go to a lot more effort to reach the speeds that we can easily reach with idiomatic D code. So, in general, IMHO, parsers are one of those things that we should typically be writing natively.

Works for me, and given you have given the project a massive kick start, hopefully others can get stuck in and Phobos can do a swap of what was with what is.

> That being said, if someone really wants full DTD support, I have no problem sending them off to deal with bindings to C/C++ libraries, since I for one am not willing to put in the time or effort to support that part of the XML spec, since it complicates things considerably while adding nothing positive IMHO. I'm sure that a D solution could compete excellently with a C/C++ solution, but it's sure not worth my time and effort, and no one else has stepped up to implement anything along those lines.

I am no longer doing XML stuff myself, but a couple of years ago DTDs were "dead" and everyone was using XML Schemas.

> Also, we're not about to put bindings to a C/C++ library for XML in Phobos (it's already been argued quite a bit that doing so with curl was a big mistake), so if we want to replace std.xml, that calls for writing a replacement in D.

True, and entirely reasonable. It is why lxml is only available via download or far more usually via PyPI.

D having really good XML (and XSLT) support in its standard library, with the crud removed, would be one up on what Python has done.

-- 
Russel.