std.xml2 (collecting features) control character (page 7)

On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote:
> for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though

What char encoding does the document declare itself as?

On Thursday, 18 February 2016 at 16:47:35 UTC, Adam D. Ruppe wrote:
> On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner Schadek wrote:
>> for instance, quite often I find <80> in tests that are supposed to be valid xml 1.0. they are invalid xml 1.1 though
>
> What char encoding does the document declare itself as?

It does not; it has no prolog and therefore no EncodingInfo. unix file says it is a utf8 encoded file, but no BOM is present.

On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote:
> unix file says it is a utf8 encoded file, but no BOM is present.

the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"

February 18, 2016

Re: std.xml2 (collecting features) control character

Posted by Adam D. Ruppe
in reply to Robert burner Schadek

On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner Schadek wrote:
> It does not, it has no prolog and therefore no EncodingInfo.

In that case, it needs to be valid UTF-8 or valid UTF-16, and it is a fatal error if there are any invalid bytes:

https://www.w3.org/TR/REC-xml/#charencoding

==
 It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
==
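
As a rough illustration of the rule quoted above, here is a minimal Python sketch of how a parser might handle an entity with no encoding declaration. The function name `decode_undeclared_entity` is hypothetical, the BOM handling is simplified compared to the spec's autodetection appendix, and `UnicodeDecodeError` merely stands in for the spec's "fatal error". It is not how std.xml2 does it.

```python
import codecs


def decode_undeclared_entity(data: bytes) -> str:
    """Decode an XML entity that carries no encoding declaration.

    Hypothetical helper: a UTF-16 BOM selects UTF-16, anything else is
    treated as UTF-8. Decoding is strict, so ill-formed input raises
    UnicodeDecodeError, which stands in for the spec's "fatal error".
    """
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec consumes the BOM and takes the byte order from it.
        return data.decode("utf-16", errors="strict")
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the optional UTF-8 BOM.
        return data.decode("utf-8-sig", errors="strict")
    # No BOM and no declaration: fall back to the UTF-8 default.
    return data.decode("utf-8", errors="strict")


# The byte stream posted in this thread: no BOM, no prolog.
sample = bytes.fromhex("3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E")
text = decode_undeclared_entity(sample)
```

With this sketch the posted bytes decode cleanly, because C2 80 is the two-byte UTF-8 form of U+0080. A bare 0x80 byte without the leading C2 would be rejected instead.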

On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
>> unix file says it is a utf8 encoded file, but no BOM is present.
>
> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"

Gah, I should have read this before replying... well, that does appear to be valid utf-8.... why is it throwing an exception then?

I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.
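
The claim about this byte stream can be checked mechanically. The sketch below decodes the hex dump and tests each code point against the Char production of XML 1.0 and against XML 1.1's Char minus RestrictedChar, which is what may appear literally in a 1.1 document. The helper names are made up for illustration, and the check covers only the character ranges, not full well-formedness.

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_literal_char(cp: int) -> bool:
    """XML 1.1 Char minus RestrictedChar, i.e. allowed without a character reference."""
    is_char = (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
    is_restricted = (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )
    return is_char and not is_restricted


raw = bytes.fromhex("3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E")
decoded = raw.decode("utf-8")  # strict by default; C2 80 becomes U+0080

ok_in_10 = all(is_xml10_char(ord(ch)) for ch in decoded)
ok_in_11 = all(is_xml11_literal_char(ord(ch)) for ch in decoded)
```

Here `ok_in_10` comes out true and `ok_in_11` false: U+0080 falls in the ordinary [#x20-#xD7FF] range of XML 1.0 but inside the restricted [#x7F-#x84] range of XML 1.1, which matches the behaviour described at the start of the thread.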

On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote:
> If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations.

Oh, I absolutely agree, independent implementation is a bad thing. (Someone should rename DRY as "don't repeat yourself or others"... but DRYOO sounds weird.)

Where's your repo?

On Thursday, 18 February 2016 at 17:26:30 UTC, Adam D. Ruppe wrote:
> On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek wrote:
>>> unix file says it is a utf8 encoded file, but no BOM is present.
>>
>> the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
>
> Gah, I should have read this before replying... well, that does appear to be valid utf-8.... why is it throwing an exception then?
>
> I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.

Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
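
In the same spirit of using an existing parser as a reference point, the byte stream can also be fed to Python's standard-library `xml.etree.ElementTree`, which is backed by expat. This is only a sketch of a cross-check with a different parser than Mozilla's, and it uses the 13 bytes from the hex dump rather than the complete test file.

```python
import xml.etree.ElementTree as ET

# The bytes from the hex dump earlier in the thread: <foo>, U+0080 as C2 80, </foo>.
raw = bytes.fromhex("3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E")

try:
    root = ET.fromstring(raw)
    # Report the element name and the code points of its text content.
    result = (root.tag, [hex(ord(ch)) for ch in root.text or ""])
except ET.ParseError as exc:
    # A well-formedness or encoding failure would end up here.
    result = ("parse error", str(exc))
```

If the parser accepts the input, `result` holds the tag name and the code points of the text node; otherwise it holds the parser's error message, which is the kind of reference behaviour being asked about.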

On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
> Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.

Thank you for making the effort.

https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml

On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote:
> On Thursday, 18 February 2016 at 04:34:13 UTC, Alex Vincent wrote:
>> I'm looking for a status update. DUB doesn't seem to have many options posted. I was thinking about starting a SAXParser implementation.
>
> I'm working on it, but recently I had to do some major restructuring of the code.
> Currently I'm trying to get this merged https://github.com/D-Programming-Language/phobos/pull/3880 because I had some problems with the encoding of test files. XML has a lot of corner cases, it just takes time.
>
> If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations.

Would you be interested in mentoring a student for the Google Summer of Code to do work on std.xml?
