Replacing std.xml

On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:

> On 08/29/2013 09:51 AM, Johannes Pfau wrote:
>> I think most points here also apply to std.xml:
>> http://wiki.dlang.org/Wish_list/std.json
>> Those are not strict requirements though, I just summarized what I remembered from old discussions.
>
> I think this even extends to access to all semi-structured and structured data. Think CSV, SQL, NoSQL, you name it. Something which deserves a name like Uniform Access. I don't want to care if data is laid out differently. I want to define my struct or class, mark the members to fill, and pass it to somebody's code, and I don't want to care if it's XML, SQL or whatever.

That's a really great point. All of these modules that can't know the types and structure in advance should probably all use the same techniques for handling the situation. Perhaps a new module to unify all this stuff is in order.

I seem to recall Adam D. Ruppe's "Is this D or is this Javascript?" thread[1] having some nice tricks to deal with dynamically typed data.

1. http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx@forum.dlang.org
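To make the "Uniform Access" idea concrete, here is a minimal sketch in Python rather than D (purely so it can be run as-is). Everything in it is hypothetical and not taken from any proposed Phobos API: the `Person` type, the `fill` helper and the two `records_from_*` adapters are illustration only. The assumption is that each backend can present a record as a mapping from field name to string, so the code that fills the user's type never sees the storage format.

```python
import csv
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields
from typing import get_type_hints


@dataclass
class Person:
    # The declared fields play the role of the "marked members".
    name: str
    age: int


def fill(cls, record):
    """Build cls from any mapping of field name -> string value."""
    hints = get_type_hints(cls)
    # Convert each raw string with the field's declared type (str, int, ...).
    return cls(**{f.name: hints[f.name](record[f.name]) for f in fields(cls)})


def records_from_xml(text):
    # Hypothetical layout: each child of the root is one record,
    # and its children are the fields.
    for node in ET.fromstring(text):
        yield {child.tag: child.text for child in node}


def records_from_csv(text):
    # The header row supplies the field names.
    yield from csv.DictReader(io.StringIO(text))


xml_src = "<people><person><name>Ada</name><age>36</age></person></people>"
csv_src = "name,age\nAda,36\n"

from_xml = [fill(Person, r) for r in records_from_xml(xml_src)]
from_csv = [fill(Person, r) for r in records_from_csv(csv_src)]
```

Both lists end up holding the same `Person` value. Only the small adapter differs per format, which is roughly the split being asked for here.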

On Thursday, 29 August 2013 at 19:40:08 UTC, Brad Anderson wrote:

> That's a really great point. All of these modules that can't know the types and structure in advance should probably all use the same techniques for handling the situation. Perhaps a new module to unify all this stuff is in order.
>
> I seem to recall Adam D. Ruppe's "Is this D or is this Javascript?" thread[1] having some nice tricks to deal with dynamically typed data.
>
> 1. http://forum.dlang.org/thread/kuxfkakrgjaofkrdvgmx@forum.dlang.org

(or maybe just improve Variant)

On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:

> One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then on top of this core, write some convenience wrappers that cast/convert to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file.
>
> Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion.

As long as autoconversion is optional. When parsing XML or JSON or whatever, I generally only care about specific strings, and sometimes don't want anything decoded at all. Having decoding done automatically before the event fires is a huge and potentially unnecessary performance hit. Not doing this decoding automatically is what makes the Tango XML parser so fast.

On Thursday, 29 August 2013 at 20:08:10 UTC, Sean Kelly wrote:

> As long as autoconversion is optional. When parsing XML or JSON or whatever, I generally only care about specific strings, and sometimes don't want anything decoded at all. Having decoding done automatically before the event fires is a huge and potentially unnecessary performance hit. Not doing this decoding automatically is what makes the Tango XML parser so fast.

This makes me wonder what kind of optimizations a hypothetical ctXml could perform.

On Thu, Aug 29, 2013 at 12:41:16PM -0700, Sean Kelly wrote:

> On Aug 29, 2013, at 11:57 AM, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:
>> One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then on top of this core, write some convenience wrappers that cast/convert to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file.
>>
>> Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion.
>
> As long as autoconversion is optional. When parsing XML or JSON or whatever, I generally only care about specific strings, and sometimes don't want anything decoded at all. Having decoding done automatically before the event fires is a huge and potentially unnecessary performance hit. Not doing this decoding automatically is what makes the Tango XML parser so fast.

Right, that's why I said the core of std.xml should handle everything as bytes, only specially treating the ASCII values of <, >, &, and other metacharacters. The tagname and tag body should just be a range over segments of the input.

T

-- 
What are you when you run out of Monet? Baroque.

On Thursday, August 29, 2013 14:27:22 H. S. Teoh wrote:

> Right, that's why I said the core of std.xml should handle everything as bytes, only specially treating the ASCII values of <, >, &, and other metacharacters. The tagname and tag body should just be a range over segments of the input.

That works especially well with how Michel and I were thinking it should be split up with a core that essentially just gives you a range of XML tokens/tags. You then have separate SAX and/or DOM parsers on top of that (which also should minimize decoding, but they actually have to care about decoding in some cases in order to do stuff like check matching tags).

- Jonathan M Davis
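The byte-level core described above can be sketched roughly as follows, in Python rather than D and only as an illustration. The `tokens` generator is a hypothetical name, not part of std.xml or Tango. It assumes an ASCII-compatible encoding such as UTF-8 (where the bytes for `<` and `>` never occur inside a multi-byte sequence), and it deliberately ignores comments, CDATA sections, entities and `>` inside attribute values. The point is only that tokens are undecoded slices of the input, and decoding happens later, if at all.

```python
def tokens(data: bytes):
    """Yield (kind, view) pairs; each view is an undecoded slice of data."""
    view = memoryview(data)  # slicing a memoryview does not copy the bytes
    pos = 0
    n = len(data)
    while pos < n:
        if data[pos] == 0x3C:  # ASCII '<' starts a tag
            end = data.find(b">", pos)
            if end == -1:
                raise ValueError("unterminated tag")
            yield "tag", view[pos + 1:end]
            pos = end + 1
        else:
            # Everything up to the next '<' (or end of input) is text.
            end = data.find(b"<", pos)
            if end == -1:
                end = n
            yield "text", view[pos:end]
            pos = end


doc = "<name>Zo\u00eb</name>".encode("utf-8")

# Structure is recovered without decoding a single character.
toks = [(kind, bytes(s)) for kind, s in tokens(doc)]

# The caller decodes only the pieces it actually cares about.
texts = [bytes(s).decode("utf-8") for kind, s in tokens(doc) if kind == "text"]
```

A SAX or DOM layer built on top would consume this token stream and decode names only where it must, for example to compare an opening tag against its closing tag.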

On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg wrote:

> On 2013-08-29 16:07, Chris wrote:
>> And while we're at it, what about YAML? It's a subset of JSON which means the new json.d module will handle it, I suppose.
>
> YAML is a superset of JSON, not the other way around. But yes, I would like to have YAML support as well.

Yes of course, you are right. I found this on the internet. Seems to be abandoned.

https://github.com/kiith-sa/D-YAML

On Thursday, 29 August 2013 at 22:56:36 UTC, Chris wrote:

> On Thursday, 29 August 2013 at 19:26:21 UTC, Jacob Carlborg wrote:
>> On 2013-08-29 16:07, Chris wrote:
>>> And while we're at it, what about YAML? It's a subset of JSON which means the new json.d module will handle it, I suppose.
>>
>> YAML is a superset of JSON, not the other way around. But yes, I would like to have YAML support as well.
>
> Yes of course, you are right. I found this on the internet. Seems to be abandoned.
>
> https://github.com/kiith-sa/D-YAML

It's not really abandoned, I keep updating it with compatibility fixes for new DMD releases as my other projects depend on it. Its API does not fit into Phobos, however (not range-based), and it won't unless I find a few weeks/months to work on it exclusively, which is unlikely in the near future. It also only supports YAML 1.1 at the moment, and recursive data structures are not yet supported.

August 30, 2013

Re: Replacing std.xml

Posted by Michel Fortin
in reply to Jonathan M Davis

On 2013-08-29 17:38:23 +0000, "Jonathan M Davis" <jmdavisProg@gmx.com> said:

> Well, as I said, I couldn't remember exactly what the XML standard said about
> encodings, but if it can contain non-ASCII characters, then my first
> inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the
> fact that that's what we support in the language and in Phobos (as I
> understand it, std.encoding is a bit of a joke that needs to be rethought and
> replaced, but regardless, it's the only Phobos module supporting any
> non-Unicode encodings).

The XML standard says that an XML parser MUST support UTF-8 and UTF-16, and MAY support other encodings.

Supporting non-UTF-8 encodings is a separate problem from parsing XML, and proper code for that would have much broader applications. Keep in mind that the more encodings you support, the more bloat you add to the executable, so there's a tradeoff to be made. In many cases, UTF-8 is enough, while in many others it's not.

(My XML implementation has a function that parses the XML prolog and tells you the encoding so you can take the appropriate code path before feeding the parser. A higher level API could handle encodings automatically based on that.)
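For readers who want a picture of what such a prolog check involves, here is a rough Python sketch. It is not the implementation mentioned above. `sniff_encoding` and `_DECL` are hypothetical names, and the logic is a simplified take on the autodetection heuristics in the XML specification: look for a byte order mark, then for the byte pattern of `<?` in UTF-16, then for an `encoding` pseudo-attribute in an ASCII-compatible declaration, and otherwise fall back to UTF-8. BOM-less UTF-32 and EBCDIC are deliberately left out.

```python
import re

# Matches an ASCII-compatible XML declaration and captures the encoding name.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding from the first bytes of an XML document."""
    # UTF-32 BOMs must be tested before UTF-16: FF FE is a prefix of FF FE 00 00.
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # UTF-16: either a BOM, or "<?" encoded as two-byte code units.
    if data.startswith(b"\xfe\xff") or data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe") or data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # No BOM: trust the declaration if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: fall back to UTF-8.
    return "utf-8"


latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
utf16_doc = '<?xml version="1.0"?><a/>'.encode("utf-16-le")

latin1_guess = sniff_encoding(latin1_doc)
utf16_guess = sniff_encoding(utf16_doc)
```

A caller would then pick the UTF-8 or UTF-16 code path directly, or transcode first for anything else, before handing the data to the parser.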


> However, because all of the XML special symbols should be ASCII, you should
> still be able to avoid decoding characters for the most part. It's only when
> you have to actually look at the content that Unicode would potentially
> matter. So, the performance hit of decoding Unicode characters should mostly
> be able to be avoided.

Just like my XML implementation does. (I made frontUnit/popFrontUnit functions I'm using when decoding code points is unnecessary.)


-- 
Michel Fortin
michel.fortin@michelf.ca
http://michelf.ca

On Thursday, 29 August 2013 at 07:53:46 UTC, Tobias Pankrath wrote:

> There is http://dsource.org/projects/xmlp, which at some point was proposed for std.xml2. But that has been stalled for some time now.

Also, we have Tango Xml: https://github.com/SiegeLord/Tango-D2/tree/d2port/tango/text/xml

It's the fastest XML parser in the world, so maybe you can find it useful:

dotnot.org/blog/archives/2008/03/10/xml-benchmarks-parsequerymutateserialize/
dotnot.org/blog/archives/2008/03/12/why-is-dtango-so-fast-at-parsing-xml/
