February 13, 2018
On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d- announce wrote:
> There's also the issue that entity references open a whole can of worms concerning security. It quite possible to have an exponential growing entity replacement that can take down any parser.

Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were.

The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.

If entity references are only legal in the text between start and end tags and between the quotes of attribute values, and whatever they're replaced with cannot actually affect anything else in the XML document (i.e. it can't just be a start or end tag or anything like that - it has to be fulling parseable on its own and not affect the parsing of the document itself), then passing them along should be fine.

Basically, if I can change dxml so that in the places where it currently allows one of the standard entity references to be, it then also allows other entity references but passes them along without replacing them instead of throwing an XMLParsingException, and that works without having documents be screwed up due to missing start tags or something, then passing them along should be fine. But if entity references allow arbitrary enough chunks of XML, that doesn't work. It also doesn't work if entity references are allowed in places other than the text between start and end tags or within attribute values. And it's not clear to me at all what is legal in an entity reference or where exactly they're legal. The spec talks about the grammar being the grammar _after_ all of the references have been replaced, which makes the grammar rather untrustworthy, and I find the spec very hard to understand in general.

Regardless, there's no risk of dxml's parser ever being changed to actually replace entity references. That doesn't work with returning slices of the original input, and it really doesn't work with a parser that's just supposed to take a range of characters and parse it. To fully handle all of the DTD stuff means actually reading files from disk or from the internet - which of course is where the security problems come in, but it also means that you're not just dealing with a parser anymore. In principle, dxml's parser should be pure (though some implementation make it so that it isn't right now), whereas an XML parser that fully handles the DTD section could never be pure.

- Jonathan M Davis

February 13, 2018
On Tue, Feb 13, 2018 at 09:18:12PM +0000, Patrick Schluter via Digitalmars-d-announce wrote:
> On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
[...]
> > If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.

This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure out what exactly is/isn't defined.  I discovered to my chagrin that XML entities are a huge rabbit hole with extremely pathological behaviour that makes it almost impossible to implement in any way that's even remotely efficient.

Here's a page with examples of how nasty it can get:

	http://www.floriankaeferboeck.at/XML/Comparison.html

Here's an example given in the W3C spec itself:

	<?xml version='1.0'?>
	<!DOCTYPE test [
	<!ELEMENT test (#PCDATA) >
	<!ENTITY % xx '&#37;zz;'>
	<!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
	%xx;
	]>
	<test>This sample shows a &tricky; method.</test>

A correct XML parser is supposed to produce the following text as the body of the <test>...</test> tag (the grammatical error is intentional):

	This sample shows a error-prone method.


Fortunately, there's a glimmer of hope on the horizon: in section 4.3.2 of the spec (https://www.w3.org/TR/xml/#wf-entities), it is explicitly stated:

	A consequence of well-formedness in general entities is that the
	logical and physical structures in an XML document are properly
	nested; no start-tag, end-tag, empty-element tag, element,
	comment, processing instruction, character reference, or entity
	reference can begin in one entity and end in another.

Meaning, if I understand it correctly, that you can't have a start tag in &entity1; and its corresponding end tag in &entity2;, and then have your document contain "&entity1; &entity2;".  This is because the body of the entity can only contain text or entire tags (the production "content" in the spec); an entity that contains an open tag without an end tag (or vice versa) does not match this rule and is thus illegal.

So this means that we *can* use dxml as a backend to drive a DTD-supporting XML parser implementation.  The wrapper / higher-level parser would scan the slices returned by dxml for entity references, and substitute them accordingly, which may involve handing the body of the entity to another instance of dxml to parse any tags that may be nested in there.

The nastiness involving partially-formed entity references (as seen in the above examples) apparently only applies inside the DOCTYPE declaration, so AIUI this can be handled by the higher-level parser as part of replacing inline entities with their replacement text.

(The higher-level parser has a pretty tall order to fill, though, because entities can refer to remote resources via URI, meaning that an innocuous-looking 5-line XML file can potentially expand to terabytes of XML tags downloaded from who knows how many external resources recursively. Not to mention a bunch of security issues like described below.)


> There's also the issue that entity references open a whole can of worms concerning security. It quite possible to have an exponential growing entity replacement that can take down any parser.
> 
> <!DOCTYPE root [
>  <!ELEMENT root ANY>
>  <!ENTITY LOL "LOL">
>  <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
>  <!ENTITY LOL2
> "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
>  <!ENTITY LOL3
> "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
>  <!ENTITY LOL4
> "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
>  <!ENTITY LOL5
> "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
>  <!ENTITY LOL6
> "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
>  <!ENTITY LOL7
> "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
>  <!ENTITY LOL8
> "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
>  <!ENTITY LOL9
> "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
> ]>
> <root>&LOL9;</root>
> 
> Hope you have enough memory (this expands to a 3 000 000 000 LOL's)
[...]

Yeah, after reading through relevant portions of the spec, I have to say that full DTD support is a HUGE can of worms.  I tip my hats off in advance to the brave soul (or poor fool :-P) who would attempt to implement the spec in full. :-D

There are ways to deal with exponential entity growth, e.g., if the expansion was carried out lazily.  But it's still a DOS vulnerability if the software then spins practically forever trying to traverse the huge range of stuff being churned out.

Not to mention that having embedded external references is itself a security issue, particular since the partial entity formation thing can be used to obfuscate the real URI of a referenced entity, so you could potentially trick a remote XML parser to download stuff from questionable sources.  It could be used as a covert surveillance method, for example, or a malware delivery vector, if combined with an exploitable bug in the parser code.  Or it could be used to read sensitive files (e.g., if an entity references file:///etc/passwd or some such system file).  Ick.

Ironically, the general advice I found online w.r.t XML vulnerabilities is "don't allow DTDs", "don't expand entities", "don't resolve externals", etc..  There also aren't many XML parsers out there that fully support all the features called for in the spec.  IOW, this basically amounts to "just use dxml and forget about everything else". :-D

Now of course, there *are* valid use cases for DTDs... but a naïve implementation of the spec is only going to end in tears.  My current inclination is, just merge dxml into Phobos, then whoever dares implement DTD support can do so on top of dxml, and shoulder their own responsibility for vulnerabilities or whatever.  (I mean, seriously, just for the sake of being able to say "my XML is validated" we have to implement network access, local filesystem access, a security framework, and what amounts to a sandbox to control pathological behaviour like exponentially recursive entities?  And all of this, just to handle rare corner cases?  That's completely ridiculous.  It's an obvious design smell to me.  The only thing missing from this poisonous mix is Turing completeness, which would have made XML hackers' heaven.  Oh wait, on further googling, I see that XSLT *is* Turing complete.  Great, just great.   Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.)


T

-- 
English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
February 13, 2018
On Tue, Feb 13, 2018 at 03:00:59PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...]
> The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.

AFAICT, section 4.3.2 in the spec (probably the one you're referring to) seems to be saying that you can't do that:

	A consequence of well-formedness in general entities is that the
	logical and physical structures in an XML document are properly
	nested; no start-tag, end-tag, empty-element tag, element,
	comment, processing instruction, character reference, or entity
	reference can begin in one entity and end in another.


> If entity references are only legal in the text between start and end tags and between the quotes of attribute values, and whatever they're replaced with cannot actually affect anything else in the XML document (i.e. it can't just be a start or end tag or anything like that - it has to be fulling parseable on its own and not affect the parsing of the document itself), then passing them along should be fine.

That's the approach I'm thinking of.


[...]
> Regardless, there's no risk of dxml's parser ever being changed to actually replace entity references. That doesn't work with returning slices of the original input, and it really doesn't work with a parser that's just supposed to take a range of characters and parse it. To fully handle all of the DTD stuff means actually reading files from disk or from the internet - which of course is where the security problems come in, but it also means that you're not just dealing with a parser anymore. In principle, dxml's parser should be pure (though some implementation make it so that it isn't right now), whereas an XML parser that fully handles the DTD section could never be pure.
[...]

Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top.  The only potential changes that might be needed is:

- provide a way to parse XML snippets that don't have a <?xml ...>
  declaration, so that a DTD implementation could, for example, hand an
  entity body over to dxml to extract any tags that may be nested in
  there (and if my reading of section 4.3.2 is correct, all such tags
  must always be closed inside the entity body, so there should be no
  errors produced).

- provide some way of hooking into non-default entities so that
  DTD-defined entities can be expanded by the DTD implementation.  This
  could be as simple as leaving such entities untouched in the returned
  range, or invent a special EntityType representing such entities (with
  a slice of the input containing the entity name) so that the DTD
  implementation can insert the replacement text.

Everything else should be handled by the DTD layer, e.g., parsing the DOCTYPE section (which is itself pretty pathological, given the actual examples in the W3C spec to this effect), expanding entities, looking up external entities, limiting recursive entity expansion, implementing a security model, etc..


T

-- 
Why do conspiracy theories always come from the same people??
February 13, 2018
On Tuesday, February 13, 2018 14:13:36 H. S. Teoh via Digitalmars-d-announce wrote:
> Great, just
> great.   Now I know why I've always had this gut feeling that
> *something* is off about the whole XML mania.)

Well, there are plenty of folks who talk like XML is a pile of steaming muck that should never be used (and then usually talk about how great JSON is). I think that basic XML is actually pretty okay - basically the subset that dxml supports, though if I were designing XML I'd take it a bit further.

Personally, I'd make XML documents completely recursive - meaning that the top level is the same as any deeper level, so you could have as many element tags at the top level as you want and as much text as you want, whereas XML requires a root element and only allows stuff like processing instructions, comments, and the DOCTYPE stuff outside of the root element.

I'd get rid of the <?xml...?> and <!DOCTYPE...> declarations as well as processing instructions, and I'd probably get rid of the CDATA section in favor of escaping characters with backslashes like you typically do in strings (or in JSON), and related to that, I'd get rid of the predefined entity references, making stuff like & legal. I also might get rid of empty element tags becase they're annoying to deal with when parsing, but they do reduce the verbosity of the document such that they might be worth keeping. It's also tempting to get rid of the tag name on end tags, which would actually make parsing much easier, but having them helps the legibility of XML documents, and it's a bit like semicolons in D in the sense that they can help ensure that error messages refer to the right thing rather than something later in the document, so I don't know. I'd also allow all Unicode characters instead of disallowing a number of them, since it won't really matter for most documents, and then the parser doesn't need to care about them when validating.

So, basically, you end up with start tags, end tags, and comments, with start tags optionally having attributes. backslashes would then be used for escaping stuff, and you end up with something pretty dead simple.

However, as you're finding out when reading through the XML spec, the folks who created XML didn't think like that at all, and were clearly coming from a _very_ different point of view as to what an XML document was for and should contain. But as you might imagine, given my take on what XML should have been, finding out in detail what XML actually _is_ was pretty horrifying.

I started dxml with the intention of fully implementing all aspects of the spec but ultimately decided that it simply wasn't worth it.

- Jonathan M Davis

February 13, 2018
On Tuesday, February 13, 2018 14:29:27 H. S. Teoh via Digitalmars-d-announce wrote:
> Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top.  The only potential changes that might be needed is:
>
> - provide a way to parse XML snippets that don't have a <?xml ...>
>   declaration, so that a DTD implementation could, for example, hand an
>   entity body over to dxml to extract any tags that may be nested in
>   there (and if my reading of section 4.3.2 is correct, all such tags
>   must always be closed inside the entity body, so there should be no
>   errors produced).

XML 1.0 does not require the <?xml...?> section - which is the main reason why dxml implements XML 1.0 and not 1.1. When working on one of my projects with std_experimental_xml, I had to keep adding the <?xml...?> declaration to the start of XML snippets in all of my tests which had to deal with sections of an XML document, and it was _really_ annoying.

dxml does require that what it's given be a valid XML 1.0 document, which means that you have to have exactly one root element in what it's passed, which does limit which kind of XML snippets you pass it, but it will work for a lot of XML snippets as-is.

> - provide some way of hooking into non-default entities so that
>   DTD-defined entities can be expanded by the DTD implementation.  This
>   could be as simple as leaving such entities untouched in the returned
>   range, or invent a special EntityType representing such entities (with
>   a slice of the input containing the entity name) so that the DTD
>   implementation can insert the replacement text.

After having actually implemented full parsing for the entire DTD section before figuring out that references could be inserted in it just about anywhere and that the grammar in the spec is only the grammar _after_ all of the replacements were made (when I figured that out was when I gave up on DTD support), I would strongly argue in favor of simply passing along entity references as-is and leaving any and all such processing to a DTD-enabled parser. Originally, the Config had options like SkipDTD and SkipProlog, and I even provided a way to get at the information in the <?xml...?> declaration if you wanted it, all that just wasn't worth the extra complexity.

- Jonathan M Davis

February 14, 2018
On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis wrote:
> On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d- announce wrote:
>> [...]
>
> Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were.
>
> The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
>
Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal.
February 14, 2018
On Wednesday, February 14, 2018 10:03:45 Patrick Schluter via Digitalmars-d- announce wrote:
> On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis
>
> wrote:
> > On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via
> >
> > Digitalmars-d- announce wrote:
> >> [...]
> >
> > Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were.
> >
> > The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
>
> Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal.

Well, since folks other than me are going to use this parser, and it's even potentially going to end up in D's standard library, it needs to at least be good enough to not let through invalid XML or incorrectly interpret any XML. It can potentially not support portions of the spec as long as it does so in a clear and clean manner, but it's going to have to correctly handle anything that it does handle.

For better or worse, I'm the sort of person who prefers to completely implement a spec when I'm implementing one, but in this case, it wasn't really reasonable. Fortunately however, from the perspective of implementing something that's useful for me personally, the DTD section is completely unnecessary. From that perspective, processing instructions and CDATA sections are also unnecessary, since I'd never do anythnig with them, but I don't think that it would be reasonable to skip those, so they're implemented. And it's not like they're hard to implement support for, unlike the DTD section.

- Jonathan M Davis

February 14, 2018
On Tuesday, 13 February 2018 at 22:29:27 UTC, H. S. Teoh wrote:
> - provide some way of hooking into non-default entities so that
>   DTD-defined entities can be expanded by the DTD implementation.

The parser now returns raw text, entity replacement can be done by DTD processor without any modification of API. So it's good for experimental if there's incentive to maintain it, but it's purely PR problem: there's nothing wrong in having xml support in dub registry and std.xml in phobos, if phobos is ok with it, it can stay as is.
It looks like EntityRange requires forward range, is it ok for a parser?
February 14, 2018
On Tuesday, 13 February 2018 at 22:13:36 UTC, H. S. Teoh wrote:

>
> Ironically, the general advice I found online w.r.t XML vulnerabilities is "don't allow DTDs", "don't expand entities", "don't resolve externals", etc..  There also aren't many XML parsers out there that fully support all the features called for in the spec.  IOW, this basically amounts to "just use dxml and forget about everything else". :-D
>
> Now of course, there *are* valid use cases for DTDs... but a naïve implementation of the spec is only going to end in tears.
>  My current inclination is, just merge dxml into Phobos, then whoever dares implement DTD support can do so on top of dxml, and shoulder their own responsibility for vulnerabilities or whatever.  (I mean, seriously, just for the sake of being able to say "my XML is validated" we have to implement network access, local filesystem access, a security framework, and what amounts to a sandbox to control pathological behaviour like exponentially recursive entities?  And all of this, just to handle rare corner cases?  That's completely ridiculous.  It's an obvious design smell to me.  The only thing missing from this poisonous mix is Turing completeness, which would have made XML hackers' heaven.  Oh wait, on further googling, I see that XSLT *is* Turing complete.  Great, just great.   Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.)
>
>
> T

Thanks for the analysis. I'd say you're right. It makes no sense to keep dxml from becoming std.xml's successor only because it doesn't support DTDs. Also, as I said before, if we had DTD support in std.xml, people would complain about the lack of efficiency, and the discussion about interpreting the specs correctly, implementing them 100%, complaints about the lack of security would just never end.
February 14, 2018
On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce wrote:
> It looks like EntityRange requires forward range, is it ok for a parser?

It's very difficult in general to write a parser that isn't at least a forward range, because without that, you're stuck at only one character of look ahead unless you play a lot of games with putting data from the input range in a buffer so that you can keep it around to look at it again after you've looked farther ahead.

Honestly, pure input ranges are borderline useless for a _lot_ of cases. It's generally only the cases where you only care about operating on each element individually irrespective of what's going on with other elements in the range that pure input ranges are really useable, and parsing definitely doesn't fall into that camp.

- Jonathan M Davis