February 12, 2018
On Monday, February 12, 2018 07:59:24 H. S. Teoh via Digitalmars-d-announce wrote:
> On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...]
>
> > However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design.
>
> Actually, thinking about this, I'm wondering if a combination of preprocessing and/or postprocessing might make it possible to implement DTD support without needing to rewrite the guts of dxml. AIUI, dxml does parse the DTD section correctly, i.e., as an XML directive, but simply doesn't look into its internal details. So one way to implement DTD support might be:
>
> - Write an auxiliary parser that's basically a wrapper around dxml,
>   forwarding XML events to the caller, except:
> - If a DTD event is encountered, eagerly parse it, store DTD
>   declarations internally for future reference.
> - If there's a DTD that has been seen, perform on-the-fly validation as
>   XML events are forwarded.
> - In PCDATA sections, if there are entity references to the DTD, expand
>   them, possibly inserting more XML events into the stream based on
>   what's defined in the DTD. (This may need to reuse some dxml internals
>   to parse XML snippets that might be contained in an entity definition,
>   for example.)

The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.
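
To make the slicing problem concrete, here is a minimal sketch using plain strings. `expandEntity` is a hypothetical helper written only for this illustration (it is not part of dxml), and the entity name and value are made up. It shows that text without a reference can be handed back as a slice of the input, while text with a reference forces a new allocation, and the replacement may itself contain markup that still has to be parsed.

```d
import std.algorithm.searching : canFind;
import std.array : replace;

// Hypothetical helper, not part of dxml: expands one known entity reference.
// The result is a slice of the input when there is nothing to expand and a
// newly allocated string when the text has to change.
string expandEntity(string text, string name, string value)
{
    string entityRef = "&" ~ name ~ ";";
    if (!text.canFind(entityRef))
        return text;                       // still a slice of the original input
    return text.replace(entityRef, value); // newly allocated string
}

void main()
{
    string input = "<p>plain</p><p>&greet;</p>";

    // No reference: the caller gets back the very same memory.
    auto untouched = expandEntity(input[3 .. 8], "greet", "<b>hi</b>");
    assert(untouched == "plain");
    assert(untouched.ptr is input.ptr + 3);

    // Reference present: the result is new text containing markup that a
    // parser would still have to process.
    auto expanded = expandEntity(input[15 .. 22], "greet", "<b>hi</b>");
    assert(expanded == "<b>hi</b>");
}
```

With `string` input, both outcomes have the same type, so the API can paper over the difference. With an arbitrary range of characters there is no equivalent of the second branch that still yields a slice of the original input.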

If we were going to stick to strings and only strings, it would be quite possible to define the API in a way that it may or may not do DTD processing, but that doesn't work with arbitrary ranges of characters, not unless you give up on returning slices of the original input, and that means harming the performance and usability for the common case in order to support DTDs.

Also, anything that has the concept of "events" would be drastically different from what dxml does. dxml is completely range-based. It has no callbacks or anything of the sort, and having anything like that would complicate it considerably.

There are lots of interesting things that could be done to try and deal with the DTD section, but they fundamentally don't work with returning slices of the original input unless you're only using strings.

In any case, I refuse to change dxml so that it has DTD support, and I refuse to change it so that it doesn't return slices of the original input. If I were to do so, it would make the parser worse for any use case I care about and require a lot of time and effort on my part that I'm not willing to spend. So, if that makes it so that dxml is never included in Phobos, then so be it.

Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens. I'd be just as fine with a decision to remove std.xml and not include dxml. I'm less fine with std.xml being left in Phobos and dxml being rejected, because std.xml has been recognized as bad, and it sure doesn't look like anyone else is going to write a replacement any time soon. I also think that dxml's approach is better for the common case than anything that supported DTDs would be, so I think that having dxml's solution in Phobos would be better for the community even if Phobos also had a solution that supported DTDs, but at this point, it looks like the options are going to be

1. std.xml stays and continues to suck.
2. std.xml gets ripped out and dxml replaces it.
3. std.xml gets ripped out and we have no xml solution in Phobos.

But as it stands, it doesn't seem likely that any XML solution that supports DTDs will make it into Phobos any time soon, if ever, because AFAIK, only three people have put in any real effort towards replacing std.xml since 2010 or whenever it was that we decided it needed to be replaced. The first two people both disappeared into oblivion without ever finishing, and here I am with a working StAX parser (now with DOM support) and an XML writer in the works - and given how involved I am with D, I think that it's pretty unlikely that I'm disappearing anywhere short of getting hit by a bus or whatnot. So, at least I've actually put in the time and effort towards a solution and made it available, and it will almost certainly be an essentially complete solution by the time that dconf rolls around if not well before.

So, I do expect that the question of Phobos inclusion will ultimately be a question of whether std.xml _ever_ gets replaced, but regardless, at least there is a solution, and it will continue to be available as a 3rd party library even if it never makes it into Phobos.

- Jonathan M Davis

February 12, 2018
On 2018-02-12 17:49, Chris wrote:

> How could it possibly make the situation any worse than it is now? Atm,
> nobody will ever use std.xml, because it is sub-standard and has no future.

I'm using std.xml in a new project right now. It's a really small private project that just needs to extract some data from an XML document. I started it a couple of days before dxml was announced.

-- 
/Jacob Carlborg
February 12, 2018
On Monday, 12 February 2018 at 19:47:09 UTC, Jacob Carlborg wrote:
> On 2018-02-12 17:49, Chris wrote:
>
>> How could it possibly make the situation any worse than it is now? Atm,
>> nobody will ever use std.xml, because it is sub-standard and has no future.
>
> I'm using std.xml in a new project right now. It's a really small private project that just needs to extract some data from an XML document. I started it a couple of days before dxml was announced.

A few lines of code that could be replaced easily once something better is available? But who will start an important commercial project with std.xml when it says in red letters:

"Warning: This module is considered out-dated and not up to Phobos' current standards. It will remain until we have a suitable replacement, but be aware that it will not remain long term."

I for my part wouldn't and I'm glad there's dxml now.


February 12, 2018
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released.
> [...]

Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent xml library for quite some time.

Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that.

So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. This absence of a usable high quality xml library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)
February 12, 2018
On Mon, Feb 12, 2018 at 09:50:16AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...]
> The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing.  As I understand it, they have to be replaced while the parsing is going on.  And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.
[...]

I think you missed my point.

What I'm trying to say is, given the current functionality of dxml, one *can* build an XML interface that implements DTD support.

Of course, some concessions obviously have to be made, such as needing to allocate memory (I don't see how else one could keep a dictionary of DTD rules / entity declarations otherwise, for example), or not being able to return only slices of the input anymore.  For example, entity support pretty much means plain slices are no longer an option, because you have to perform substitution of entity definitions, so you'll have to either wrap it in some kind of lazy range that chains the entity definition to the surrounding text, or you'll have to use strings or something else.  Which means you'll need to have memory allocation / slower parsing / whatever, but that's the price of DTD support.
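
As a rough illustration of the "lazy range that chains the entity definition to the surrounding text" option, here is a minimal sketch using `std.range.chain`. The input text and the value for `&name;` are invented for the example; a real wrapper would look the value up in whatever dictionary it built from the DTD.

```d
import std.algorithm.comparison : equal;
import std.range : chain;

void main()
{
    string input = "Hello &name;!";
    string nameValue = "world"; // hypothetical entity definition taken from a DTD

    // Slices of the original input on either side of the reference.
    auto before = input[0 .. 6];  // "Hello "
    auto after = input[12 .. $];  // "!"

    // chain() produces a lazy range; nothing is copied, but the result is
    // no longer a string and cannot be sliced like one.
    auto expanded = chain(before, nameValue, after);
    assert(expanded.equal("Hello world!"));
    static assert(!is(typeof(expanded) == string));
}
```

The pieces are still slices, but the caller now receives a different range type than the one that was passed in, which is the trade-off being described.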

But again, the point is, basic XML parsing (without DTD support) doesn't *need* to pay this price. What's currently in dxml doesn't need to change. DTD support can be implemented in a submodule / separate module that wraps around dxml and builds DTD support on top of it.

Put another way, we can implement DTD support *on top of* dxml this way:
- Parse the XML using dxml as an initial step (this can be done lazily,
  or semi-lazily, as needed).
- As an intermediate step, parse the DTD section, construct whatever
  internal state is needed to handle DTD rules, a dictionary of entity
  references, etc.
- Filter the output of dxml to insert whatever extra behaviour is needed
  to implement DTD support before handing it to the calling code, e.g.,
  expand entity references, or implement validation and throw an
  exception if validation fails, etc.
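
The filtering step could look roughly like the sketch below. Everything in it is an assumption made for illustration: `Kind`, `Event`, and `withEntities` are stand-ins, not dxml's actual types or API, and the entity table is hard-coded rather than read from a DTD. It only handles entity values that are plain text; a value containing tags would need to be parsed again, which this sketch does not attempt.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;
import std.array : replace;

// Stand-in for what a core parser might hand out; NOT dxml's actual types.
enum Kind { elementStart, text, elementEnd }
struct Event { Kind kind; string text; }

// Hypothetical filter: expands entity references in text events using
// definitions collected from the DTD. Only handles text-only values.
auto withEntities(R)(R events, string[string] entities)
{
    return events.map!((Event e) {
        if (e.kind == Kind.text)
        {
            foreach (name, value; entities)
                e.text = e.text.replace("&" ~ name ~ ";", value);
        }
        return e;
    });
}

void main()
{
    auto events = [Event(Kind.elementStart, "p"),
                   Event(Kind.text, "by &author;"),
                   Event(Kind.elementEnd, "p")];

    // The filter is lazy: events are rewritten only as they are consumed.
    auto filtered = events.withEntities(["author": "J. Doe"]);
    assert(filtered.equal([Event(Kind.elementStart, "p"),
                           Event(Kind.text, "by J. Doe"),
                           Event(Kind.elementEnd, "p")]));
}
```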

*We don't need to change dxml's current API at all.*

At the most, I anticipate that the only potential change needed is to expose an interface to parse XML fragments (i.e., not a complete XML document that contains an outer <xml> tag, but just some PCDATA that may contain entities or tags) so that the DTD support wrapper can use it to expand entities and insert any tags that may appear inside the entity definition.

The DTD wrapper doesn't guarantee (and doesn't need to!) to return slices of the input like dxml does. I don't see that as a problem, since I can't see how anyone would be able to implement full DTD support with only slices, even independently from the way dxml is implemented right now.

We can even design the DTD support wrapper to start with being just a thin wrapper around dxml, and lazily switch to full DTD mode only if a DTD section is encountered.  Then user code that doesn't care to use dxml's raw API won't even need to care about the difference.


T

-- 
Curiosity kills the cat. Moral: don't be the cat.
February 12, 2018
On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote: [...]
> Everything you have mentioned is not in Phobos. Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.

And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.


> Personally I find J.M.D. arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases.

As I have just said in another post, dxml itself does not need to be changed to implement DTD support.  It's perfectly possible to write a wrapper on top of it that *does* implement DTD support.  In fact, I dare say it might be possible to lazily switch from a thin wrapper over dxml to full DTD mode, so that end users don't even need to care about the difference if they don't care to.

As far as API is concerned, it could be as simple as something like:

	// Sketch only; needs `import std.typecons : Flag, Yes;`
	auto parseXml(Flag!"dtdSupport" dtdSupport = Yes.dtdSupport, R)(R input) if (...)
	{
		static if (dtdSupport)
			return dtdWrapper(dxmlParse(input));
		else
			return dxmlParse(input);
	}

Then just note in the documentation that turning off DTD support would provide extra features X, Y, and Z (speed, slices, whatever). Then let the user choose.
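
For anyone who wants to try the idea, here is a self-contained variant that compiles. It assumes `std.typecons.Flag` for the switch (`true` is a keyword, so it cannot be used as an enum member name), and `PlainParser`, `DtdParser`, `dxmlParse`, and `dtdWrapper` are empty placeholders invented for the sketch, not real dxml functions.

```d
import std.typecons : Flag, No, Yes;

// Placeholder implementations so the sketch compiles; all names are hypothetical.
struct PlainParser(R) { R input; enum hasDtdSupport = false; }
struct DtdParser(P) { P inner; enum hasDtdSupport = true; }

auto dxmlParse(R)(R input) { return PlainParser!R(input); }
auto dtdWrapper(P)(P parser) { return DtdParser!P(parser); }

// The flag comes first so that R can still be deduced from the argument.
auto parseXml(Flag!"dtdSupport" dtdSupport = Yes.dtdSupport, R)(R input)
{
    static if (dtdSupport)
        return dtdWrapper(dxmlParse(input));
    else
        return dxmlParse(input);
}

void main()
{
    auto full = parseXml("<root/>");                 // DTD wrapper by default
    auto fast = parseXml!(No.dtdSupport)("<root/>"); // plain parser, slices only
    static assert(typeof(full).hasDtdSupport);
    static assert(!typeof(fast).hasDtdSupport);
}
```

Because the choice is made at compile time, the two calls return different types, so code that opts out pays nothing for the wrapper.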

Seriously, I would have thought something like this would be obvious to programmers of the calibre found on these forums.  I'm a little astonished that this would even be such a point of contention in the first place, since the solution is so simple.


T

-- 
Many open minds should be closed for repairs. -- K5 user
February 12, 2018
On Monday, February 12, 2018 13:51:56 H. S. Teoh via Digitalmars-d-announce wrote:
> For example, entity
> support pretty much means plain slices are no longer an option, because
> you have to perform substitution of entity definitions, so you'll have
> to either wrap it in some kind of lazy range that chains the entity
> definition to the surrounding text, or you'll have to use strings or
> something else.  Which means you'll need to have memory allocation /
> slower parsing / whatever, but that's the price of DTD support.

Which was my point. The API as-is doesn't work with DTD support for those very reasons.

> But again, the point is, basic XML parsing (without DTD support) doesn't *need* to pay this price. What's currently in dxml doesn't need to change. DTD support can be implemented in a submodule / separate module that wraps around dxml and builds DTD support on top of it.
>
> Put another way, we can implement DTD support *on top of* dxml this way:
> - Parse the XML using dxml as an initial step (this can be done lazily,
>   or semi-lazily, as needed).
> - As an intermediate step, parse the DTD section, construct whatever
>   internal state is needed to handle DTD rules, a dictionary of entity
>   references, etc..
> - Filter the output of dxml to insert whatever extra behaviour is needed
>   to implement DTD support before handing it to the calling code, e.g.,
>   expand entity references, or implement validation and throw an
>   exception if validation fails, etc..
>
> *We don't need to change dxml's current API at all.*

I don't think that this works, because the entity references insert new XML and thus affect the parsing. And as such, you can't simply pass through the entity references to be processed by another parser. They need to be handled by the core parser, otherwise it's going to give incorrect results, not just results that need further parsing. I'm sure that dxml's internals could be refactored so that they could be shared with another parser that did that, but unless I'm misunderstanding how entity references work, you can't use what's there now as-is and build another parser on top of it. The entity reference replacement needs to happen in the core parser.

> The DTD wrapper doesn't guarantee (and doesn't need to!) to return slices of the input like dxml does. I don't see that as a problem, since I can't see how anyone would be able to implement full DTD support with only slices, even independently from the way dxml is implemented right now.

Yeah, if I were writing a parser that handled the DTD section, I wouldn't make it deal with slices of the input like dxml does unless I decided to make it always return string, in which case, you could get slices of the original input for strings but no other range types - it's either that or using a lazy range, which would be worse if you passed strings but better for other range types. And that's the main reason that I gave up on having dxml handle the DTD section. I consider that approach unacceptable. One of the key goals for dxml was that it would be providing slices of the input and not lazy ranges or allocating new strings.

In any case, unless I misunderstand how entity references work, that would have to be its own parser and not simply a wrapper around dxml because of how the entity references affect the parsing. If I'm wrong, then great, someone else can come along later and add some sort of DTD parser on top of dxml, and if I'm right, well, then anyone who wants to do anything like that is going to need to write a new parser, but that can then coexist alongside dxml's parser just fine. Either way, I like dxml's approach and don't want to compromise what it's doing in an attempt to fully deal with DTDs.

- Jonathan M Davis

February 12, 2018
On Monday, February 12, 2018 21:26:45 Johannes Loher via Digitalmars-d-announce wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> > dxml 0.2.0 has now been released.
> > [...]
>
> Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent xml library for quite some time.
>
> Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that.
>
> So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. This absence of a usable high quality xml library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)

Thanks. When you do use it, please give feedback - particularly if you find any problems or pain points. I definitely think that the API is solid overall, but that doesn't mean that I got it completely right, and even with all of the tests that I have, I could have missed something and ended up with a bug in the parser. I'm reasonably confident in the code quality, but that doesn't mean that I didn't miss anything.

- Jonathan M Davis

February 13, 2018
On 12/02/2018 10:02 PM, H. S. Teoh wrote:
> On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote:
> [...]
>> Everything you have mentioned is not in Phobos. Just because something
>> is 'good enough' does not make it 'good enough' for Phobos. In the
>> words of Andrei "Good enough is not good enough", we need to aim
>> higher to show what we actually can do.
> 
> And thus Phobos continues to let the perfect be the enemy of the good,
> and 10 years later std.xml will still be around, and we will still be
> arguing over how to replace it.
> 
> 
>> Personally I find J.M.D. arguments quite reasonable for a third-party
>> library, since yes it does cover 90% of the use cases.
> 
> As I have just said in another post, dxml itself does not need to be
> changed to implement DTD support.  It's perfectly possible to write a
> wrapper on top of it that *does* implement DTD support.  In fact, I dare
> say it might be possible to lazily switch from a thin wrapper over dxml
> to full DTD mode, so that end users don't even need to care about the
> difference if they don't care to.
> 
> As far as API is concerned, it could be as simple as something like:
> 
> 	// Sketch only; needs `import std.typecons : Flag, Yes;`
> 	auto parseXml(Flag!"dtdSupport" dtdSupport = Yes.dtdSupport, R)(R input) if (...)
> 	{
> 		static if (dtdSupport)
> 			return dtdWrapper(dxmlParse(input));
> 		else
> 			return dxmlParse(input);
> 	}
> 
> Then just note in the documentation that turning off DTD support would
> provide extra features X, Y, and Z (speed, slices, whatever). Then let
> the user choose.
> 
> Seriously, I would have thought something like this would be obvious to
> programmers of the calibre found on these forums.  I'm a little
> astonished that this would even be such a point of contention in the
> first place, since the solution is so simple.
> 
> 
> T

In other places it was said that it wasn't possible to build it on top of it.

But yes, I would be expecting an entry point like you described and is something that I mentioned :)

std.experimental.xml:
	- interfaces.d: interface Element {...}
	- entry.d: auto parseXML(...)(...) {...}
	- impl_subset:
		- dom.d
		ext.
	- impl_full:
		- entry.d
		ext.


February 12, 2018
On 02/12/2018 05:02 PM, H. S. Teoh wrote:
> On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote:
> [...]
>> Everything you have mentioned is not in Phobos. Just because something
>> is 'good enough' does not make it 'good enough' for Phobos. In the
>> words of Andrei "Good enough is not good enough", we need to aim
>> higher to show what we actually can do.
> 
> And thus Phobos continues to let the perfect be the enemy of the good,
> and 10 years later std.xml will still be around, and we will still be
> arguing over how to replace it.

+Several billion.

Like the improved assert messages we would've had since many years ago and was implemented, done and ready to go, but it was instead thrown away because...(and here's the real kicker, considering current D climate)...because it was a fully in-library solution instead of a new compiler feature. Go figure ::eyeroll::

> Seriously, I would have thought something like this would be obvious to
> programmers of the calibre found on these forums.  I'm a little
> astonished that this would even be such a point of contention in the
> first place, since the solution is so simple.

I would've expected so too, if it weren't that one of the top favorite activities 'round these parts is nitpicking reasonable ideas to death for stupid reasons. And, generally letting the perfect be the enemy of the good.