September 01, 2022

On Thursday, 25 August 2022 at 19:41:19 UTC, solidstate1991 wrote:

I took a look at experimental.xml. According to its tests, its biggest issue is that it accepts malformed documents. I'll attempt to reverse-engineer the code, then add the necessary checks to reject malformed documents. Since it has multiple options for allocators (stdx-allocator), it'll be a bit of a challenge, but at worst I can strip that functionality and replace it with GC-only allocation.

So work has begun here: https://github.com/ZILtoid1991/experimental.xml

Things I've done so far:

  • Stripped the allocators and the custom error handling functions. Not many people use allocators anyway, they just complicate the project, and the GC is otherwise the best option for anything that builds a complex tree structure. With those gone, I can just use exceptions for error handling, which can be toggled with a flag: turning it off will enable parsing badly formed XML documents, and even SGML in theory (see the sketch after this list).
  • Simplifying a lot of things in general, with array slicing and appending.
  • Enabled character escaping, which led me into the DTD hellhole.
  • Enabled checking for bad characters in names and texts.
  • Started working on the processing of XML declarations (important for setting version and checking for correct encoding), and the DTD.
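
To illustrate what the flag-toggled error handling could look like, here is a hypothetical sketch; names like Parser and XMLSyntaxException are illustrative only, not the library's actual API:

    // Hypothetical sketch, not the library's actual code.
    class XMLSyntaxException : Exception
    {
        this(string msg) { super(msg); }
    }

    struct Parser(bool strict = true)
    {
        void checkName(const(char)[] name)
        {
            // '<' can never appear inside an XML name.
            const bad = name.length == 0 || name[0] == '<';
            static if (strict)
            {
                if (bad)
                    throw new XMLSyntaxException("malformed name: " ~ name.idup);
            }
            // With strict == false the check is skipped, which is what lets
            // badly formed XML (or SGML-ish input) pass through.
        }
    }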

I know that the removal of the allocators might doom my project's chances of inclusion in Phobos, but even then I can just release it as a regular dub library. Soon I'll rename it to newXML or something similar, while keeping the credits to its previous authors.

September 12, 2022
On 8/24/22 08:16, Ali Çehreli wrote:
> On 8/22/22 15:51, Chris Piker wrote:
>
>  > So depending on the use case, dxml works quite well.  For my own
>  > purposes I'll need to find/create a ForwardRange adapter for stdin
>
> The 'cached' range adaptor I mentioned on these forums a couple of times
> and in my DConf 2022 lightning talk converts any InputRange to a
> ForwardRange. (It does this by evaluating the elements once; so it would
> be valuable with generators as well; and in fact, a generator use case
> was why I wrote it.)

It is now available:

  https://code.dlang.org/packages/alid
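
A minimal usage sketch (assuming the adapter is the `cached` function in the `alid.cached` module; check the package documentation for the exact API):

    import alid.cached : cached;
    import std.range : take;
    import std.stdio : stdin, writeln;

    void main()
    {
        // stdin.byLineCopy is only an InputRange; caching it is meant to give
        // a range with .save, while each line is read from stdin exactly once.
        auto lines = stdin.byLineCopy.cached;

        auto branch = lines.save;       // now a ForwardRange
        writeln(lines.take(3));
        writeln(branch.take(3));        // replays the cached elements
    }

The underlying InputRange is popped only once; every .save'd copy consumes from the shared cache.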

> (Aside: It actually makes a RandomAccessRange because it supports
> opIndex as well but it does not honor O(1): It will grab 'n' elements if
> you say myRange[n] and if those elements are not in the cache yet.)

I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.

> Currently it has an assumed performance issue because it uses a regular
> D slice, and the way it uses the slice incurs an allocation cost per
> element. There are different ways of dealing with that issue but I
> haven't finished that yet.

I solved that by writing the `alid.circularblocks` module.

Ali

September 14, 2022
On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:

> It is now available:
>
>   https://code.dlang.org/packages/alid
>
> > (Aside: It actually makes a RandomAccessRange because it supports
> > opIndex as well but it does not honor O(1): It will grab 'n' elements if
> > you say myRange[n] and if those elements are not in the cache yet.)
>
> I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.

Wow pretty slick, thanks!  I know everyone wants the D community to be larger, but there are some advantages to a tight group.  Heck, I just got help on a ground support project from my favorite computer textbook author.  Outstanding!

As soon as I get back around to working on that project again I'll try out alid.  Neck deep in another sprint right now which depends on dpq2.

Best,

September 14, 2022
On Monday, 22 August 2022 at 23:30:58 UTC, H. S. Teoh wrote:
> Do you need to parse xml on-the-fly, or would it work to just slurp the entire stdin into a buffer and then parse that?

Thanks for the suggestion, though the purpose of the lib is to support stream-based processing of very long time-series datasets (> 2 TB on occasion).  Due to data volume, we typically work with binary formats, but there is a supported XML representation and I'd prefer to apply the same mentality when processing it so as not to break user expectations.
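
For the record, the rough shape of that streaming setup is to present stdin as a lazy range of characters instead of slurping it; this is only a sketch, not the project's actual code:

    import std.stdio : stdin;
    import std.algorithm : joiner, map;

    // Read fixed-size chunks lazily and flatten them into one InputRange of
    // chars; memory use stays bounded no matter how large the stream is.
    auto stdinChars()
    {
        return stdin.byChunk(64 * 1024)
                    .joiner
                    .map!(b => cast(char) b);
    }

Since the result is still only an InputRange, a caching adapter such as the one discussed earlier in the thread would be needed before handing it to dxml, which wants a ForwardRange.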



October 02, 2022
On 8/22/22 7:48 AM, solidstate1991 wrote:
> Since the XML parsing library was removed from Phobos, I'm thinking about either getting dlang-community/experimental.xml into a usable state, or write a completely new parser.
> 
> First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.

It would be nice to have XSD support; many (most?) XML libraries I've looked at _across all languages_ support DTD but not XSD schema definitions.


October 02, 2022
On 9/13/22 19:00, Chris Piker wrote:

> As soon as I get back around to working on that project again I'll try
> out alid.

Please don't give up; do give feedback if it doesn't fit your use case. It desperately needs to be tested in the wild. :)

Ali


October 03, 2022

On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:

> On 8/24/22 08:16, Ali Çehreli wrote:
>
>   https://code.dlang.org/packages/alid
>
> > (Aside: It actually makes a RandomAccessRange because it supports
> > opIndex as well but it does not honor O(1): It will grab 'n' elements if
> > you say myRange[n] and if those elements are not in the cache yet.)
>
> I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.

I believe it's actually O(log(n)) amortized because CircularBlocks.addExistingBlock_ will reallocate blocks and move the contents over to the new address, an O(n) operation, for every O(log(n)) accesses to ElementCache.

(This extra O(log(n)) factor is typical for Appender-like systems.)

October 03, 2022

On Monday, 3 October 2022 at 07:28:46 UTC, tsbockman wrote:

> On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:
>
> > I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
>
> I believe it's actually O(log(n)) amortized because CircularBlocks.addExistingBlock_ will reallocate blocks and move the contents over to the new address, an O(n) operation, for every O(log(n)) accesses to ElementCache.
>
> (This extra O(log(n)) factor is typical for Appender-like systems.)

Never mind - I forgot that the n in each move of the old contents is actually a different, smaller value each time, with an average of O(n / log(n)); that makes the whole thing reduce to amortized O(1) time per element, as you claimed.
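
For the record, assuming the blocks array roughly doubles its capacity on each reallocation, the moving cost over n appended blocks tallies up as

    1 + 2 + 4 + ... + n/2 + n  <  2n

so the total copying work is O(n), i.e. O(1) amortized per append.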

October 03, 2022
On 10/3/22 00:28, tsbockman wrote:

> `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the
> contents over to the new address,

CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue.
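
To illustrate the idea (this is just a sketch, not alid's actual implementation):

    // Elements live in fixed-size blocks, so growth only appends a new slice
    // to `blocks`; the elements themselves never move, even if the `blocks`
    // array itself gets reallocated.
    struct BlocksSketch(T, size_t blockSize = 1024)
    {
        private T[][] blocks;
        private size_t lastLen;     // used slots in the last block

        void put(T element)
        {
            if (blocks.length == 0 || lastLen == blockSize)
            {
                blocks ~= new T[blockSize];
                lastLen = 0;
            }
            blocks[$ - 1][lastLen++] = element;
        }
    }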

Algorithmic complexity does get complicated :) if the block size is too small compared to the number of live elements: then there would be too many small blocks to shuffle around, e.g. to the back of the queue to be reused.

Ali


October 03, 2022

On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:

> CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue.

Repeatedly appending with ~= to a dynamic array, as CircularBlocks does here, will reallocate and move the elements over to new memory whenever the new .length would exceed .capacity:

    void addExistingBlock_(ubyte[] buffer)
    {
        import std.array : back;

        // This append can reallocate the `blocks` array itself (though not
        // the elements stored inside the blocks).
        blocks ~= ReusableBlock!T(buffer);
        capacity_ += blocks.back.capacity;
    }
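
A quick way to see that reallocation behavior in isolation, with plain dynamic arrays and nothing alid-specific:

    import std.stdio : writefln;

    void main()
    {
        int[] a;
        foreach (i; 0 .. 1000)
        {
            const oldPtr = a.ptr;
            a ~= i;
            if (oldPtr !is null && a.ptr !is oldPtr)
                writefln("append %s relocated the array (capacity now %s)",
                         i, a.capacity);
        }
    }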