August 29, 2013 Re: Replacing std.xml
Posted in reply to Jacob Carlborg

On Thursday, August 29, 2013 15:20:39 Jacob Carlborg wrote:
> On 2013-08-29 11:23, Jonathan M Davis wrote:
> > IIRC, everything in XML is
> > ASCII anyway, with stuff like HTML codes to indicate Unicode characters.
> > And if that's the case, avoiding unnecessary decoding is trivial when
> > operating on strings.
>
> What! I hardly believe that. That might be the case for HTML but I don't think it is for XML. There are many file formats that are based on XML. I don't think all those use HTML codes.
>
> This is what W3 Schools says:
>
> "XML documents can contain non-ASCII characters, like Norwegian æ ø å, or French ê è é.
>
> To avoid errors, specify the XML encoding, or save XML files as Unicode.".
Well, as I said, I couldn't remember exactly what the XML standard says about encodings, but if it can contain non-ASCII characters, then my first inclination is to say that it has to be UTF-8, UTF-16, or UTF-32, based on the fact that that's what we support in the language and in Phobos (as I understand it, std.encoding is a bit of a joke that needs to be rethought and replaced, but regardless, it's the only Phobos module supporting any non-Unicode encodings).
However, because all of the XML special symbols should be ASCII, you should still be able to avoid decoding characters for the most part. It's only when you have to actually look at the content that Unicode potentially matters. So, the performance hit of decoding Unicode characters can mostly be avoided.
- Jonathan M Davis
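[Editorial note: the decoding-avoidance trick described in this post can be illustrated outside of D. Below is a minimal Python sketch, not code from the thread; the document and tag names are made up. The key facts it relies on are that every XML structural character is ASCII, and that UTF-8 never reuses ASCII byte values inside a multi-byte sequence.]

```python
# Because every XML structural character ('<', '>', '&', '=', quotes) is
# ASCII, and no byte of a multi-byte UTF-8 sequence falls in the ASCII
# range, tag boundaries can be located by scanning raw bytes without
# decoding any text.
doc = '<note lang="no"><body>Blåbærsyltetøy</body></note>'.encode("utf-8")

def tag_spans(data):
    """Yield (start, end) byte offsets of each tag, found by byte scanning."""
    pos = 0
    while True:
        start = data.find(b"<", pos)
        if start == -1:
            return
        end = data.find(b">", start)
        yield (start, end + 1)
        pos = end + 1

tags = [doc[s:e] for s, e in tag_spans(doc)]
print(tags)  # only the text *between* the tags ever needs UTF-8 decoding
```

The Norwegian text is skipped over as opaque bytes; decoding is deferred until (and unless) the caller actually asks for the element's content.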
August 29, 2013 Re: Replacing std.xml
Posted in reply to Michel Fortin

On Thursday, August 29, 2013 12:14:28 Michel Fortin wrote:
> On 2013-08-29 07:47:17 +0000, Jonathan M Davis <jmdavisProg@gmx.com> said:
> > On Thursday, August 29, 2013 09:25:35 w0rp wrote:
> >> The general idea in my mind is
> >> "something SAX-like, with something a little DOM-like."
> >
> > What I personally think would be best is to have multiple parsers. First you have something StAX-like (or maybe even lower level - I don't recall exactly what StAX gives you at the moment) that basically tokenizes the XML and returns a range of that. Then SAX and DOM parsers can be built on top of that. That way, you get the fastest parser possible as well as higher level, more functional parsers.
> >
> > But two of the biggest points of the design are that it's going to have to be range-based, and it's going to need to be able to take full advantage of slices (when used with any strings or random-access ranges) in order to avoid copying any of the data. That's the key design point which will allow a D parser to be extremely fast in comparison to parsers in most other languages.
> I wrote something like that a while ago.
>
> It only accepted arrays as input because of the lack of a "buffered range" concept that'd allow lookahead and efficient slicing from any kind of range, but that could be retrofitted in. It implements pretty much all of the XML spec, except for documents having an internal subset (which is something a little arcane). It does not deal with namespaces either, I feel like that should be done a layer above, but I'm not entirely sure.
>
> Lower-level parser: http://michelf.ca/docs/d/mfr/xmltok.html
>
> Higher-level parser built on the first one: http://michelf.ca/docs/d/mfr/xml.html
>
> The code:
> http://michelf.ca/docs/d/mfr-xml-2010-10-19.zip
>
> That code hasn't been compiled in a while, but it used to work very well for me. Feel free to use as a starting point.
Cool. I started looking at implementing something like that a while back but really didn't have time to get very far. But if we really care about efficiency, I think that's the basic approach we need to take. However, the trick, as always, is someone having the time to do it. Maybe one of us can take what you did and start from there, or at least use it as an example to start from.
- Jonathan M Davis
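[Editorial note: the layered design discussed in this exchange, a low-level tokenizer returning a range of tokens with SAX and DOM parsers built on top, can be sketched in Python, with a generator standing in for a D range. The token kinds and function name are illustrative, not any proposed API. One caveat: Python string slices copy, whereas the point being made about D is that slicing an immutable string is a zero-copy view into the original data.]

```python
# Lowest layer: lazily yield tokens that are slices of the input.
# In D, a slice of a string is a no-copy view; SAX- or DOM-style
# layers would consume this range rather than re-scanning the text.
# No error handling, entities, or attributes: this is only a sketch.
def tokenize(xml):
    pos = 0
    while pos < len(xml):
        if xml[pos] == "<":
            end = xml.index(">", pos) + 1
            kind = "close_tag" if xml[pos + 1] == "/" else "open_tag"
            yield (kind, xml[pos:end])
            pos = end
        else:
            end = xml.find("<", pos)
            end = len(xml) if end == -1 else end
            yield ("text", xml[pos:end])
            pos = end

tokens = list(tokenize("<a><b>hi</b></a>"))
print(tokens)
```

A SAX layer would dispatch callbacks per token; a DOM layer would push open tags onto a stack and build a tree, both without ever copying the character data.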
August 29, 2013 Re: Replacing std.xml
Posted in reply to Jonathan M Davis

On Thursday, 29 August 2013 at 17:38:43 UTC, Jonathan M Davis wrote:
>
> Well, as I said, I couldn't remember exactly what the XML standard said about
> encodings, but if it can contain non-ASCII characters, then my first
> inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the
> fact that that's what we support in the language and in Phobos (as I
> understand it, std.encodings is a bit of a joke that needs to be rethought and
> replaced, but regardless, it's the only Phobos module supporting any
> non-Unicode encodings).
>
> However, because all of the XML special symbols should be ASCII, you should
> still be able to avoid decoding characters for the most part. It's only when
> you have to actually look at the content that Unicode would potentially
> matter. So, the performance hit of decoding Unicode characters should mostly
> be able to be avoided.
>
> - Jonathan M Davis
You just specify the encoding in the XML declaration:
<?xml version="1.0" encoding="us-ascii"?>
<?xml version="1.0" encoding="windows-1252"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="UTF-16"?>
UTF-8 is the default in the absence of a BOM or declaration saying otherwise.
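[Editorial note: a sketch in Python of how a parser might pick its decoder from these declarations. This is illustrative only, not any particular library's API: a BOM wins, otherwise the prolog, which reads the same in every ASCII-compatible encoding, is scanned for the encoding attribute, and UTF-8 is the fallback. BOM-less UTF-16 detection is omitted for brevity.]

```python
import codecs
import re

def sniff_encoding(data):
    """Guess the encoding of an XML document given its leading bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # The prolog is ASCII-compatible, so a byte-level regex is safe here.
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # default in the absence of a BOM or declaration

print(sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```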
August 29, 2013 Re: Replacing std.xml
Posted in reply to Joakim

On Thursday, 29 August 2013 at 09:24:31 UTC, Joakim wrote:
> I think it's great that there's no std.xml, as it implies that nobody using D would use a dumb tech like XML. Let's keep it that way. :)

JSON is better than XML in every way I can think of: it's easier to map to data structures in whichever language you're using, much smaller in size, has fewer corner cases, etc. However, just saying XML is dumb isn't a useful policy. You need ways of parsing XML on hand until people stop using it.

On Thursday, 29 August 2013 at 08:15:39 UTC, Robert Schadek wrote:
> On 08/29/2013 09:51 AM, Johannes Pfau wrote:
>> Most points here also apply to std.xml: [...] Those are not strict requirements though, I just summarized what I remembered from old discussions.
> I think this even extends to access to all semi- and structured data. Think csv, sql, nosql, you name it. Something which deserves a name like Uniform Access. I don't want to care if data is laid out differently. I want to define my struct or class, mark the members to fill, pass it to somebody's code, and not have to care if it's xml, sql or whatever.

I'm really not so sure about that kind of approach. Automatic serialisation, I think, works one of two ways: either you have control over the data you're pulling in, and you can change it to map more easily to your data structures, or you don't, and you have to make your data structures uglier to fit the data you're pulling in. I prefer just writing functions that take format X and give you in-memory representation Y over automatic serialisation stuff. I know it's boring and easy to write functions like that, but why can't some things just be boring and easy?

This looks like a really popular topic, and it's cool that there seem to be quite a few implementations that are close to being what we want. I think we're probably not far off just lining up a few different implementations and reviewing them all for possible inclusion in Phobos.
August 29, 2013 Re: Replacing std.xml
On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
[...]
> Well, as I said, I couldn't remember exactly what the XML standard said about encodings, but if it can contain non-ASCII characters, then my first inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the fact that that's what we support in the language and in Phobos

Take a look here: http://www.w3schools.com/xml/xml_encoding.asp

XML files can have *any* valid encoding, including nastiness like windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we have a way around this, since existing XML files out there probably already have all of these encodings and more, and std.xml is gonna hafta support 'em all. Otherwise we're gonna get irate users complaining "why can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"

> (as I understand it, std.encodings is a bit of a joke that needs to be rethought and replaced, but regardless, it's the only Phobos module supporting any non-Unicode encodings).

No kidding! I was trying to write a program that navigates a website automatically using std.net.curl, and I'm running into all sorts of silly roadblocks, including std.encoding not supporting iso-8859-* encodings.

The good news is that on Linux, there's a handy utility called 'recode', which comes with a library called 'librecode', that supports converting between a huge number of different encodings -- many more than probably you or I have imagined existed -- including to/from Unicode. I know we don't like including external libraries in Phobos, but I honestly don't see any justification for reinventing the wheel by writing (and maintaining!) our own equivalent to librecode, unless licensing issues prevent us from including librecode in Phobos, nicely wrapped in a modern range-based D API.

> However, because all of the XML special symbols should be ASCII, you should still be able to avoid decoding characters for the most part. It's only when you have to actually look at the content that Unicode would potentially matter. So, the performance hit of decoding Unicode characters should mostly be able to be avoided.
[...]

One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then, on top of this core, write some convenience wrappers that cast/convert to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, and UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file.

Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion.

T

--
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen
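[Editorial note: the layering proposed in this post, an encoding-independent byte-level core with typed convenience wrappers on top, can be sketched in Python, with bytes standing in for ubyte[] and str standing in for string. The function names and the simplistic "text between tags" extraction are made up purely for illustration.]

```python
def extract_text_bytes(doc):
    """Core layer: return the byte slices between tags, never decoding."""
    chunks, pos = [], 0
    while True:
        gt = doc.find(b">", pos)
        if gt == -1:
            return chunks
        lt = doc.find(b"<", gt + 1)
        if lt == -1:
            return chunks
        if lt > gt + 1:          # non-empty content between '>' and '<'
            chunks.append(doc[gt + 1:lt])
        pos = lt

def extract_text(doc, encoding="utf-8"):
    """Wrapper layer: same result, decoded to text for the caller."""
    return [c.decode(encoding) for c in extract_text_bytes(doc)]

# The core never cares what the bytes mean; only the wrapper does.
print(extract_text("<a><b>héllo</b></a>".encode("utf-8")))
```

A caller with an oddly-encoded document could use the byte-level layer directly and decode later, which is exactly the "at least the user can get the data out" escape hatch described above.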
August 29, 2013 Re: Replacing std.xml
Posted in reply to Chris

On 2013-08-29 16:07, Chris wrote:
> And while we're at it, what about YAML? It's a subset of JSON which means the new json.d module will handle it, I suppose.

YAML is a superset of JSON, not the other way around. But yes, I would like to have YAML support as well.

--
/Jacob Carlborg
August 29, 2013 Re: Replacing std.xml
Posted in reply to Jonathan M Davis

On 2013-08-29 19:38, Jonathan M Davis wrote:
> However, because all of the XML special symbols should be ASCII, you should still be able to avoid decoding characters for the most part. It's only when you have to actually look at the content that Unicode would potentially matter. So, the performance hit of decoding Unicode characters should mostly be able to be avoided.

I don't understand. If I use a range of dchar and call "front" and "popFront", won't it do decoding then?

--
/Jacob Carlborg
August 29, 2013 Re: Replacing std.xml
Posted in reply to H. S. Teoh

On 2013-08-29 20:57, H. S. Teoh wrote:
> XML files can have *any* valid encoding, including nastiness like windows-1252 and relics like iso-8859-1.

Actually, does the encoding really matter (as long as it's compatible with ASCII)? Just use a range of ubytes; the parser will only be looking for characters in the ASCII table anyway.

--
/Jacob Carlborg
August 29, 2013 Re: Replacing std.xml
Posted in reply to H. S. Teoh

On Thursday, 29 August 2013 at 18:58:57 UTC, H. S. Teoh wrote:
> No kidding! I was trying to write a program that navigates a website automatically using std.net.curl, and I'm running into all sorts of silly roadblocks, including std.encoding not supporting iso-8859-* encodings.

It doesn't look like adding the rest of the ISO-8859 encodings would be all that difficult if you used the existing ISO-8859-1 (Latin1) as a base. I don't quite understand where and how transcoding is done, though.

> The good news is that on Linux, there's a handy utility called 'recode', which comes with a library called 'librecode', that supports converting between a huge number of different encodings -- many more than probably you or I have imagined existed -- including to/from Unicode. I know we don't like including external libraries in Phobos, but I honestly don't see any justification for reinventing the wheel by writing (and maintaining!) our own equivalent to librecode, unless licensing issues prevent us from including librecode in Phobos, nicely wrapped in a modern range-based D API.
>
>> However, because all of the XML special symbols should be ASCII, you should still be able to avoid decoding characters for the most part. It's only when you have to actually look at the content that Unicode would potentially matter. So, the performance hit of decoding Unicode characters should mostly be able to be avoided.
> [...]
>
> One way is to write the core code of std.xml in such a way that it handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit encodings) so that it's encoding-independent. Then on top of this core, write some convenience wrappers that casts/converts to string, wstring, dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32 if the user asks for string/wstring/dstring, and leave XML in other encodings up to the user to decode manually. This way, at least the user can get the data out of the file.
>
> Later on, once we've gotten our act together with std.encoding, we can hook it up to std.xml to provide autoconversion.
>
> T
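[Editorial note: as background for the ISO-8859-1 discussion, transcoding Latin-1 to Unicode is mechanical, because each of the 256 byte values maps directly to the code point with the same number. A Python sketch of that fact follows; it is not D and not std.encoding's API.]

```python
# ISO-8859-1 ("Latin-1") bytes map one-to-one onto code points
# U+0000..U+00FF, so decoding needs no lookup table at all.
latin1_bytes = bytes([0x68, 0xE9, 0x6C, 0x6C, 0x6F])  # "héllo" in ISO-8859-1

text = latin1_bytes.decode("iso-8859-1")  # byte value == code point
utf8_bytes = text.encode("utf-8")         # 0xE9 becomes the pair 0xC3 0xA9

print(text, utf8_bytes)
```

The other ISO-8859-* variants differ only in which glyphs occupy the upper 128 slots, which is why basing them on the existing Latin-1 support, as suggested above, is plausible.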
August 29, 2013 Re: Replacing std.xml
Posted in reply to Jacob Carlborg

On Thursday, August 29, 2013 21:28:09 Jacob Carlborg wrote:
> On 2013-08-29 19:38, Jonathan M Davis wrote:
> > However, because all of the XML special symbols should be ASCII, you
> > should
> > still be able to avoid decoding characters for the most part. It's only
> > when you have to actually look at the content that Unicode would
> > potentially matter. So, the performance hit of decoding Unicode
> > characters should mostly be able to be avoided.
>
> I don't understand. If I use a range of dchar and call "front" and "popFront", won't it do decoding then?
Any decent parser is going to special-case strings (especially if it's using slicing), in which case it won't call front unless it actually needs to decode. The only real question is whether generic char and wchar ranges should be supported, because then you could avoid the decoding for ranges that aren't strings, but strings are already covered simply by special-casing. You really can't afford not to special-case strings in algorithms in general if efficiency is a high priority.
- Jonathan M Davis
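[Editorial note: the special-casing idea can be sketched in Python, illustrative only. The generic path inspects every element, which is what per-dchar front/popFront decoding amounts to; the string path jumps straight to the next delimiter, which is what D's slicing-based special case buys.]

```python
def next_tag_generic(chars):
    """Generic range path: like front/popFront, inspect every element."""
    for i, c in enumerate(chars):
        if c == "<":
            return i
    return -1

def next_tag_string(s):
    """Special-cased string path: skip straight to the delimiter."""
    return s.find("<")

doc = "some text <tag/>"
print(next_tag_generic(iter(doc)), next_tag_string(doc))
```

Both return the same offset, but the second never examines the intervening characters, so no decoding work is done for content the parser doesn't care about.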
Copyright © 1999-2021 by the D Language Foundation