dxml 0.1.0 released
February 09, 2018
I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.

Currently, dxml contains only a range-based StAX / pull parser and related helper functions, but the plan is to add a DOM parser as well as two writers - one which is the writer equivalent of a StAX parser, and one which is DOM-based. However, in theory, the StAX parser is complete and quite usable as-is, though I expect that I'll be adding more helper functions to make it easier to use. If you find that you're doing a particular operation with it frequently and that that operation is overly verbose, please point it out so that maybe a helper function can be added to improve that use case. For example, I'm thinking of adding a function similar to std.getopt.getopt for handling attributes, because I personally find that dealing with those is more verbose than I'd like. Obviously, some stuff is just going to do better with a DOM parser, but thus far, I've found that a StAX parser has suited my needs quite well. I have no plans to add a SAX parser, since as far as I can tell, SAX parsers are just plain worse than StAX parsers, and the StAX approach is quite well-suited to ranges.
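
For readers unfamiliar with the pull model, here is a minimal sketch of the idea. It uses Python's standard xml.dom.pulldom module rather than dxml, so none of the names below are dxml's API; the sample document and the book_titles helper are made up for illustration. The point is only that the consumer drives the parser, asking for the next event when it wants one, instead of registering callbacks as with SAX.

```python
from xml.dom import pulldom

# Made-up sample document, for illustration only.
XML = '<library><book id="1">Dune</book><book id="2">Emma</book></library>'

def book_titles(xml_text):
    """Pull events one at a time and yield (id, title) for each <book>."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "book":
            # Ask the parser to read the rest of this element into `node`.
            events.expandNode(node)
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            yield node.getAttribute("id"), text

titles = list(book_titles(XML))
print(titles)
```

Because the loop is just iteration over a sequence of events, the same shape maps naturally onto a range in D, which is presumably why the approach suits ranges so well.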

Of note, dxml does not support the DTD section beyond what is required to parse past it. Supporting it would make it impossible for the parser to return slices of the original input beyond the case where strings are used (and it would be forced to allocate strings in some cases, whereas dxml does _very_ minimal heap allocation right now). Parsing the DTD section also significantly increases the complexity of the parser in order to support something that I honestly don't think should ever have been part of the XML standard and is unnecessary for many, many XML documents. So, if you're dealing with XML documents that contain entity references that are declared in the DTD section and then used outside of the DTD section, then dxml will not support them, but it will work just fine if a DTD section is there so long as it doesn't declare any entity references that are then referenced in the document proper.
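
To make the slicing problem concrete, here is a small sketch in Python, under loudly stated assumptions: the document is made up, the regex is a toy stand-in for real DTD parsing, and none of this is dxml code. Python slices copy rather than alias the input the way D slices do, so the substring checks stand in for "could this be returned as a slice of the input".

```python
import re

# Made-up sample: the DTD declares an entity that the document body then uses.
DOC = '<!DOCTYPE note [<!ENTITY who "World">]><note>Hello &who;!</note>'

# Heavily simplified stand-in for real DTD parsing (assumption, not dxml code).
entities = dict(re.findall(r'<!ENTITY\s+(\w+)\s+"([^"]*)">', DOC))

# The element's text exactly as it appears in the input.
start = DOC.index("<note>") + len("<note>")
end = DOC.index("</note>")
raw = DOC[start:end]

# Expanding the entity reference has to build a new string.
expanded = re.sub(r"&(\w+);", lambda m: entities[m.group(1)], raw)

print(raw)              # Hello &who;!
print(expanded)         # Hello World!
print(raw in DOC)       # True
print(expanded in DOC)  # False
```

The unexpanded text is a contiguous piece of the input, so a slicing parser can hand it back without allocating. The expanded text appears nowhere in the input, so producing it means allocating a new string.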

Hopefully, the documentation is clear enough, but obviously, I'm not the best judge of that. So, have at it.

Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
Github: https://github.com/jmdavis/dxml
Dub: http://code.dlang.org/packages/dxml

- Jonathan M Davis

February 10, 2018
great work, Jonathan. Thank you.
We were missing xml for a long time and did so many hacks just to get xml somehow parsed.
February 10, 2018
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
> I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.
>
> [...]

FWIW we recently forked the experimental.xml repo to dlang-community:

https://github.com/dlang-community/experimental.xml

So PRs etc can be merged easily.
But yeah it's not moving anywhere atm :/
February 10, 2018
On 2018-02-09 22:15, Jonathan M Davis wrote:

> Currently, dxml contains only a range-based StAX / pull parser and related
> helper functions, but the plan is to add a DOM parser as well as two writers
> - one which is the writer equivalent of a StAX parser, and one which is
> DOM-based. However, in theory, the StAX parser is complete and quite useable
> as-is - though I expect that I'll be adding more helper functions to make it
> easier to use, and if you find that you're doing a particular operation with
> it frequently and that that operation is overly verbose, please point it out
> so that maybe a helper function can be added to improve that use case - e.g.

This is great news! Have you run any benchmarks to see how it performs?

-- 
/Jacob Carlborg
February 10, 2018
On Saturday, February 10, 2018 16:14:41 Jacob Carlborg via Digitalmars-d-announce wrote:
> On 2018-02-09 22:15, Jonathan M Davis wrote:
> > Currently, dxml contains only a range-based StAX / pull parser and related helper functions, but the plan is to add a DOM parser as well as two writers - one which is the writer equivalent of a StAX parser, and one which is DOM-based. However, in theory, the StAX parser is complete and quite useable as-is - though I expect that I'll be adding more helper functions to make it easier to use, and if you find that you're doing a particular operation with it frequently and that that operation is overly verbose, please point it out so that maybe a helper function can be added to improve that use case - e.g.
> This is great news! Have you run any benchmarks to see how it performs?

Kind of. I did some benchmarking to see if some code changes would improve performance, but I haven't tried benchmarking it against any other XML libraries. That would take a fair bit of time and effort, and IMHO, that would be better spent finishing the library first. Also, ldc's latest release is only up to dmd 2.077.1, and dxml needs an improvement that got added to byCodeUnit in 2.078.0, so any benchmarking that wants to do something like compare dxml with a C/C++ parsing library while taking the optimizer out of the equation isn't going to work yet unless I fork byCodeUnit for dxml until we get another release of ldc.

One result of the benchmarking that I did do allowed me to simplify the code quite a bit though. I'd originally had it be configurable whether the parser kept track of the line number and column of the document, just the line number, or neither on the theory that I really wanted access to the position in the document in error messages but that it would affect performance, so it should be configurable. However, benchmarking showed that it had negligible impact on performance to the point that different PositionTypes won out depending on the file and the particular run of the program, indicating that that extra complexity was buying me nothing. There were a fair number of static ifs to deal with that configuration option, so as soon as I was able to measure that they didn't matter particularly, I removed that option from the Config and all of its associated static ifs in the parser and was able to reduce the complexity of the code a fair bit. Testing that bit was actually the main reason that I did any benchmarking before releasing anything, since I wanted to avoid changing the API later if I could.
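
For anyone wondering what the position tracking amounts to, a rough sketch in Python follows. It is not dxml's implementation; position_of and the sample text are hypothetical. It just shows the bookkeeping involved: one comparison and one increment per character consumed, which is consistent with the cost being lost in the noise.

```python
def position_of(text, offset):
    """Return the 1-based (line, column) of text[offset], counting as a scanner would."""
    line, col = 1, 1
    for ch in text[:offset]:
        if ch == "\n":
            # A newline starts the next line at column 1.
            line, col = line + 1, 1
        else:
            col += 1
    return line, col

# Made-up sample document.
doc = "<root>\n  <item>\n</root>"
pos = position_of(doc, doc.index("<item>"))
print(pos)  # (2, 3)
```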

I am going to need to spend more time benchmarking code changes at some point here though to see if I can make the parser faster, and eventually, I will probably benchmark it against other parsing libraries. I fully expect that it will compare favorably given that it does almost no heap allocations and slices everything, but there's every possibility that I did something algorithmically internally that hurts performance more than it should - e.g. while it tries to parse everything only once, there are a few places where it ends up taking a second pass over a piece of text, and refactoring that is on my todo list (though most of the other potential improvements I did benchmark were a wash, so I may find that it doesn't matter much).

I'll probably be in more of a hurry to benchmark dxml against other parsing libraries if my dconf talk proposal on it gets accepted, since that's the sort of thing that should probably be in such a talk.

I haven't even taken the time yet to figure out which libraries it should be benchmarked against.

- Jonathan M Davis

February 10, 2018
On Saturday, February 10, 2018 12:04:48 Seb via Digitalmars-d-announce wrote:
> On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
> > I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.
> >
> > [...]
>
> FWIW we recently forked the experimental.xml repo to dlang-community:
>
> https://github.com/dlang-community/experimental.xml
>
> So PRs etc can be merged easily.
> But yeah it's not moving anywhere atm :/

Yeah, I got some e-mails about that the other day, since I had some open issues and PRs on it, and IIRC github was telling me that you'd migrated some of that over, but unless someone decides that they want to take up the torch on it, it seems pretty dead. I assume that the guy who did it simply got too busy with school once GSoC ended and then never got back to it even when he did have time. If he were serious about finishing it and being an active part of the D community, he would have at least looked at some of the PRs on the project, but he's been completely silent for quite a while now. So, I guess he moved on. I was able to use it on one of my projects by making some local changes and by working around some bugs, but it clearly needs work that it's not getting.

I had some rather specific ideas about what I wanted to do with an XML parser though and didn't want to spend the time trying to decipher what he'd done and morph it into something more like what I wanted, so I just started from scratch.

- Jonathan M Davis

February 10, 2018
On Saturday, February 10, 2018 10:27:42 Stefan via Digitalmars-d-announce wrote:
> great work, Jonathan. Thank you.
> We were missing xml for a long time and did so many hacks just to
> get xml somehow parsed.

LOL. Actually, one of the helper functions in std.datetime.timezone that has to deal with xml does it via hacks, because the XML in question was fairly simple, and I didn't want to deal with std.xml.

If dxml does end up going through the Phobos review process and eventually ends up in Phobos, I'll have to change that code so that it uses dxml instead of the hacks.

- Jonathan M Davis

February 10, 2018
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
> I have multiple projects that need an XML parser, and std_experimental_xml is clearly going nowhere, with the guy who wrote it having disappeared into the ether, so I decided to break down and write one. I've kind of wanted to for years, but I didn't want to spend the time on it. However, sometime last year I finally decided that I had to, and it's been what I've been working on in my free time for a while now. And it's finally reached the point when it makes sense to release it - hence this post.
>
> [...]

This is going to be really useful for people like me who work with web services using SOAP.

Thanks for the great work.
February 10, 2018
On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:

> Hopefully, the documentation is clear enough, but obviously, I'm not the best judge of that. So, have at it.
>
> Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
> Github: https://github.com/jmdavis/dxml
> Dub: http://code.dlang.org/packages/dxml
>
> - Jonathan M Davis

This looks so nice.

I can understand the concerns about the DTD, and it doesn't look like you needed to do anything special for namespaces with this parser.
February 10, 2018
On Saturday, February 10, 2018 19:53:48 Jesse Phillips via Digitalmars-d-announce wrote:
> On Friday, 9 February 2018 at 21:15:33 UTC, Jonathan M Davis wrote:
> > Hopefully, the documentation is clear enough, but obviously, I'm not the best judge of that. So, have at it.
> >
> > Documentation: http://jmdavisprog.com/docs/dxml/0.1.0/
> > Github: https://github.com/jmdavis/dxml
> > Dub: http://code.dlang.org/packages/dxml
> >
> > - Jonathan M Davis
>
> This looks so nice.
>
> I can understand the concerns about the DTD, and it doesn't look like you needed to do anything special for namespaces with this parser.

I confess that I haven't looked into namespaces in detail, but from what I understand about them, I don't see any reason to do anything beyond treating them as part of the name. If the application wants to do something special with them, then it's free to do so. Key goals of this parser were to make it fast and simple to use for the typical use case. As much as possible, I'd like to keep the complicated stuff out of it.
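
As an illustration of leaving namespaces to the application, here is a tiny Python sketch. The split_qname helper and the sample names are hypothetical and not part of dxml; it only assumes that the parser hands back the full name, prefix included, as a single string.

```python
def split_qname(name):
    """Split a qualified name like 'soap:Envelope' into (prefix, local name)."""
    prefix, sep, local = name.partition(":")
    # With no colon present, the whole name is the local part.
    return (prefix, local) if sep else ("", name)

print(split_qname("soap:Envelope"))  # ('soap', 'Envelope')
print(split_qname("Body"))           # ('', 'Body')
```

Resolving the prefix to a namespace URI would additionally require tracking the xmlns attributes in scope, which the application can do if it cares.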

Personally, I see XML only as data just like JSON is only data, and I think that the complications in the XML spec come from trying to treat it as more than that.

I had originally intended to provide at least minimal DTD support but leave most of it to some kind of helper functionality (e.g. have a helper function which took the DTD data and then validated the rest of the XML using it). However, as I got farther along, it became clear that that wasn't going to work without giving up on being able to just slice the input, and I wasn't willing to give up on that, especially when I don't see handling the DTD as valuable for anything but dealing with overly complicated XML that is outside of the programmer's control or to simply be able to say that I completely implemented the XML spec.

Slicing is part of why parsers written in D should tend to be inherently fast in comparison to those written in languages like C++, and I want to take advantage of that. In principle, something like an XML parser should be able to be a showcase for why D is great. Tango's was, but Phobos' hasn't been, and I'd like for dxml to be able to be that regardless of whether it eventually replaces std.xml or not.

- Jonathan M Davis
