February 12, 2018
On 02/12/2018 11:15 AM, rikki cattermole wrote:
> 
> dxml 7.5k LOC
> std.xml 3k LOC
> 
> dxml would make the situation a lot worse.

4.5k LOC == "a lot worse"?

Uuuuhhh...WAT?
February 12, 2018
On Monday, February 12, 2018 21:53:21 Nick Sabalausky via Digitalmars-d-announce wrote:
> On 02/12/2018 11:15 AM, rikki cattermole wrote:
> > dxml 7.5k LOC
> > std.xml 3k LOC
> >
> > dxml would make the situation a lot worse.
>
> 4.5k LOC == "a lot worse"?
>
> Uuuuhhh...WAT?

There is sometimes a tendency for folks to think that something having a lot of lines of code is bad, and there can be some truth to that. If something can be done in a simpler way, it tends to be shorter and easier to maintain, but shorter isn't always better, and simpler isn't always better - especially if that complexity is needed to get the job done. So, LOC tells you something, but what it really tells you is up for debate.

And actually, well-written D code is going to have a much higher line count in general because of stuff like documentation and unit tests being in the source file. In this case, while std.xml does seem to have a fair bit of documentation, it has very little in the way of unit tests, whereas dxml has fairly thorough unit tests - maybe not quite as extreme as std.datetime, but I do tend to be thorough with unit tests.

Andrei used to complain periodically about how large std.datetime was, thinking that it was way too much code, and then someone actually went to the effort of stripping out all of the comments and unit tests and whatnot to count the actual lines of code in the implementation, and it was a _way_ smaller number than the lines in the file (IIRC, it might have even been something like only 10% of the file, if that). That's what happens when you write documentation and unit tests that are thorough.

- Jonathan M Davis

February 12, 2018
On 02/12/2018 10:49 PM, Jonathan M Davis wrote:
> 
> Andrei used to complain periodically about how large std.datetime was,
> thinking that it was way too much code, and then someone actually went to
> the effort of stripping out all of the comments and unit tests and whatnot
> to count the actual lines of code in the implementation, and it was a _way_
> smaller number than the lines in the file (IIRC, it might have even been
> something like only 10% of the file, if that). That's what happens when you
> write documentation and unit tests that are thorough.
> 

Yea, totally. Another example: mysql-native used to be one (!!) source file. It was maybe a bit on the large side for a single module, but it was still workable. In the last several years, that library has grown to many times its old size. But now, I'd say that easily the majority of lines are either comments or tests. The *actual* implementation and API isn't really all that much more LOC than it used to be. The original one-module version, by contrast, was less documented and had...I don't think it even had a single test (IIRC, the now-old-and-probably-bitrotted "app.d" wasn't even there.)
February 13, 2018
On 2018-02-12 21:19, Chris wrote:

> A few lines of code that could be replaced easily once something better is available?

Fairly easy because it's so small. I'm actually using the SAX interface from std.xml and it quite nicely fits my needs.

-- 
/Jacob Carlborg
February 13, 2018
On Monday, 12 February 2018 at 21:51:56 UTC, H. S. Teoh wrote:
[...]
> We can even design the DTD support wrapper to start with being just a thin wrapper around dxml, and lazily switch to full DTD mode only if a DTD section is encountered.  Then user code that doesn't care to use dxml's raw API won't even need to care about the difference.
>
>
> T

In this vein, if a new version of std.xml didn't offer pure and fast parsing like dxml, but included DTD by default, people would complain that that was the real deal breaker (too slow, man!). Remember `autodecode`? Right.

DTD inclusion should only be available on demand. Imagine you want to implement a library project where ebooks (say classics) are catalogued and presented in an ebook reader on the web (or in an app on your smartphone). The whole DTD thing would probably be done at the cataloguing stage, but once the books are in the library most users will just want to go through them page by page or search for quotes etc. - and for that you'd need a fast tool like dxml with no overhead.
February 13, 2018
On Mon, 2018-02-12 at 14:54 +0000, rikki cattermole via Digitalmars-d-announce wrote:
> […]
> 
> Personally I find J.M.D. arguments quite reasonable for a third-party
> library, since yes it does cover 90% of the use cases.

The problem is that std.xml needs removing to make it clear there is no good XML package in Phobos. Then people will go looking in the Dub repository.

-- 
Russel.


February 13, 2018
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
> The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.

Standard entities like &amp; have the same problem, so the same solution should work too.
February 13, 2018
On Tuesday, 13 February 2018 at 02:53:21 UTC, Nick Sabalausky (Abscissa) wrote:
> On 02/12/2018 11:15 AM, rikki cattermole wrote:
>> 
>> dxml 7.5k LOC
>> std.xml 3k LOC
>> 
>> dxml would make the situation a lot worse.
>
> 4.5k LOC == "a lot worse"?
>
> Uuuuhhh...WAT?

And it's like 2k LOC of code and 5.5k LOC of tests and docs.
February 13, 2018
On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce wrote:
> On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
> > The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.
>
> Standard entities like &amp; have the same problem, so the same solution should work too.

That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regard to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out.

If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.
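
To make the concern concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree (an expat-based parser chosen only for illustration; it is not dxml or std.xml, and the entity name `greet` is made up). It assumes the parser expands internal entities declared in the DOCTYPE, which expat does. The replacement text of `&greet;` holds a complete element, so expanding it adds a node to the tree, whereas `&amp;` only ever yields a character.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity "greet" is invented for this example.
# Its replacement text contains markup, not just characters.
DOC = """<!DOCTYPE root [
  <!ENTITY greet "<b>hi</b>">
]>
<root>before &greet; after &amp; done</root>"""

root = ET.fromstring(DOC)

# The expanded entity shows up as a real child element of <root>.
child = root.find("b")

print(repr(root.text))   # text in front of the expanded entity
print(child.tag, repr(child.text))
print(repr(child.tail))  # text after it, with &amp; turned into a plain "&"
```

A parser that hands `&greet;` through as opaque text would report `<root>` as having no children, so anything layered on top would have to re-parse that stretch of the input to recover the `<b>` element.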

- Jonathan M Davis

February 13, 2018
On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:
> On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce wrote:
>> On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:
>> > The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.
>>
>> Standard entities like &amp; have the same problem, so the same solution should work too.
>
> That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regard to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out.
>
> If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.
>

There's also the issue that entity references open a whole can of worms concerning security. It is quite possible to have an exponentially growing entity replacement that can take down any parser.

<!DOCTYPE root [
 <!ELEMENT root ANY>
 <!ENTITY LOL "LOL">
 <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
 <!ENTITY LOL2 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
 <!ENTITY LOL3 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
 <!ENTITY LOL4 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
 <!ENTITY LOL5 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
 <!ENTITY LOL6 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
 <!ENTITY LOL7 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
 <!ENTITY LOL8 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
 <!ENTITY LOL9 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
]>
<root>&LOL9;</root>

Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. 3 000 000 000 characters)
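
The size can be checked without expanding anything. The following Python sketch rebuilds the same ten entity definitions in a dictionary and computes the length of each expansion recursively; the helper names (`expanded_length`, `is_safe`) and the 1 MB limit are assumptions made for illustration, not part of any XML library.

```python
import re
from functools import lru_cache

# Rebuild the entity table from the DOCTYPE above:
# LOL is the literal text, LOLn is ten references to LOL(n-1).
entities = {"LOL": "LOL"}
prev = "LOL"
for i in range(1, 10):
    name = f"LOL{i}"
    entities[name] = f"&{prev};" * 10
    prev = name

# Matches a general entity reference such as &LOL3;
REF = re.compile(r"&([A-Za-z_][\w.-]*);")


@lru_cache(maxsize=None)
def expanded_length(name: str) -> int:
    """Length in characters of the fully expanded entity, computed
    without ever building the expanded string."""
    text = entities[name]
    total = 0
    pos = 0
    for match in REF.finditer(text):
        total += match.start() - pos               # literal text before the reference
        total += expanded_length(match.group(1))   # size of the referenced entity
        pos = match.end()
    total += len(text) - pos                       # literal text after the last reference
    return total


def is_safe(name: str, limit: int = 1_000_000) -> bool:
    """Hypothetical guard: refuse entities whose expansion exceeds `limit` characters."""
    return expanded_length(name) <= limit


print(expanded_length("LOL9"))       # characters after full expansion
print(expanded_length("LOL9") // 3)  # number of "LOL" repetitions
print(is_safe("LOL9"))
```

Each level multiplies the size by ten, so nine levels turn the 3-character string into 3 * 10^9 characters. A parser that does replace entities could run a check of this kind, or cap the total expansion size, before materialising anything.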