February 08, 2012
On Wednesday, 8 February 2012 at 08:12:57 UTC, Johannes Pfau wrote:
> Use buffering, return strings (better:
> w/d/char[]) as slices to that buffer. If the user needs to keep a string, he can still copy it. (String decoding should also be done on-demand only).

The way Document.parse works now in my code is with slices.
I think the best way to speed mine up is to untangle the mess
of recursive nested functions.

Last time I attacked dom.d with the profiler, I found a lot
of time was spent on string decoding, which looked like this:

foreach(c; str) { if(isEntityStart(c)) value ~= decodeEntity(str); else value ~= c; }

basically.


This reallocation was slow... but I got a huge speedup, not by
skipping decoding, but by scanning it first:

bool decode = false;
foreach(c; str) { if(c == '&') { decode = true; break; } }

if(!decode) return str;
// still uses the old decoder, which is the fastest I could find;
// ~= actually did better than appender in my tests!
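
Fleshed out, the whole fast path looks roughly like this; decodeEntity here is a toy stand-in for the real decoder in dom.d, included only so the sketch is self-contained:

import std.string : indexOf;

// Toy stand-in for the real entity decoder: handles a few named entities
// and reports how many input bytes the entity consumed.
string decodeEntity(string s, out size_t consumed)
{
    auto end = s.indexOf(';');
    if (end < 0) { consumed = 1; return "&"; } // malformed: pass through
    consumed = end + 1;
    switch (s[0 .. consumed])
    {
        case "&amp;": return "&";
        case "&lt;":  return "<";
        case "&gt;":  return ">";
        default:      return s[0 .. consumed]; // unknown: pass through
    }
}

// Scan first; only fall into the allocating loop when an '&' is present.
string decodeEntities(string str)
{
    if (str.indexOf('&') < 0)
        return str;                  // common case: return the slice as-is

    string value;
    value.reserve(str.length);       // decoded text is never longer
    size_t i = 0;
    while (i < str.length)
    {
        if (str[i] == '&')
        {
            size_t consumed;
            value ~= decodeEntity(str[i .. $], consumed);
            i += consumed;
        }
        else
            value ~= str[i++];
    }
    return value;
}

unittest
{
    assert(decodeEntities("no entities here") == "no entities here");
    assert(decodeEntities("a &lt; b") == "a < b");
}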

But quickly scanning the string first and skipping the decode loop entirely
when there are no entities present, IIRC, tripled the parse speed.

Right now, if I comment the decode call out entirely, there's very
little difference in speed on the data I've tried, so I think
decoding like this works well.
February 08, 2012
On Wed, 08 Feb 2012 00:29:55 -0800,
Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> On Wednesday, February 08, 2012 09:12:57 Johannes Pfau wrote:
> > Using ranges of dchar directly can be horribly inefficient in some
> > cases, you'll need at least some kind of buffered dchar range. Some
> > std.json replacement code tried to use only dchar ranges and had to
> > reassemble strings character by character using Appender. That sucks
> > especially if you're only interested in a small part of the data and
> > don't care about the rest.
> > So for pull/sax parsers: Use buffering, return strings (better:
> > w/d/char[]) as slices to that buffer. If the user needs to keep a
> > string, he can still copy it. (String decoding should also be done
> > on-demand only).
> 
> That's why you accept ranges of dchar but specialize the code for strings. Then you can use any dchar range with it that you want but can get the extra efficiency of using strings if you want to do that.
> 
> - Jonathan M Davis

But specializing for strings is not enough: you could stream XML over the network and want to parse it on the fly (think of XMPP/Jabber), or you could read huge XML files which you do not want to load completely into RAM. Data is read into buffers anyway, so the parser should be able to deal with that (although a buffer of w/d/chars could be considered a string, the parser would then need to handle incomplete input).
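
To make the incomplete-input point concrete, here is a toy sketch (not a real XML parser; ChunkFeeder and its API are made up for illustration): bytes are buffered until a complete token is available, so input can be fed as it arrives from a socket.

struct ChunkFeeder
{
    private char[] pending;

    // Feed one chunk; return any complete "<...>" tokens and keep the
    // incomplete tail buffered for the next call.
    string[] feed(const(char)[] chunk)
    {
        pending ~= chunk;
        string[] tokens;
        size_t start = 0;
        foreach (i, c; pending)
        {
            if (c == '>')
            {
                tokens ~= pending[start .. i + 1].idup;
                start = i + 1;
            }
        }
        pending = pending[start .. $].dup;
        return tokens;
    }
}

unittest
{
    ChunkFeeder f;
    assert(f.feed("<a><b") == ["<a>"]); // "<b" stays buffered
    assert(f.feed(">") == ["<b>"]);     // completed by the next chunk
}
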
February 09, 2012
On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau <nospam@example.com> wrote:
> On Tue, 07 Feb 2012 20:44:08 -0500,
> "Jonathan M Davis" <jmdavisProg@gmx.com> wrote:
>> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
>> > On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
[snip]
>
> Using ranges of dchar directly can be horribly inefficient in some
> cases, you'll need at least some kind of buffered dchar range. Some
> std.json replacement code tried to use only dchar ranges and had to
> reassemble strings character by character using Appender. That sucks
> especially if you're only interested in a small part of the data and
> don't care about the rest.
> So for pull/sax parsers: Use buffering, return strings (better:
> w/d/char[]) as slices to that buffer. If the user needs to keep a
> string, he can still copy it. (String decoding should also be done
> on-demand only).

Speaking as the one proposing said JSON replacement, I'd like to point out that JSON strings != UTF strings: manual conversion is required some of the time. And I use appender as a dynamic buffer in exactly the manner you suggest. There's even an option to use a string cache to minimize total memory usage. (Hmm... that functionality should probably be refactored out and made into its own utility.)

That said, I do end up doing a bunch of useless encodes and decodes, so I'm going to special-case those away and add slicing support for strings. wstrings and dstrings will still need to be converted, as currently Json values only accept strings and therefore Json tokens also only support strings. As a potential user of the sax/pull interface, would you prefer the extra clutter of special side channels for zero-copy wstrings and dstrings?
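
To illustrate the JSON strings != UTF strings point: escape sequences such as \uXXXX must be decoded by hand before the bytes are valid UTF text. A minimal sketch (surrogate pairs and error handling elided; the function name is illustrative, not the library's API):

import std.conv : to;
import std.utf : encode;

// Decode a single "\uXXXX" escape, given its four hex digits, into UTF-8.
string decodeUnicodeEscape(string hex4) // e.g. "00E9" for \u00E9
{
    dchar c = cast(dchar) to!ushort(hex4, 16); // parse the code unit
    char[4] buf;
    auto len = encode(buf, c);                 // re-encode as UTF-8
    return buf[0 .. len].idup;
}

unittest
{
    assert(decodeUnicodeEscape("00E9") == "\u00E9"); // 'é'
}
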
February 09, 2012
On 2012-02-08 15:51, Adam D. Ruppe wrote:
> On Wednesday, 8 February 2012 at 07:37:23 UTC, Jacob Carlborg wrote:
>> Maybe Adam's code can be used as a base of implementing a library like
>> Rack in D.
>>
>> http://rack.rubyforge.org/
>
> That looks like it does the same job as cgi.d.
>
> cgi.d actually offers a uniform interface across various
> web servers and integration methods.
>
> If you always talk through the Cgi class, and use the GenericMain
> mixin, you can run the same program with:
>
> 1) cgi, tested on Apache and IIS (including implementations for methods
> that don't work on one or the other natively)
>
> 2) fast cgi (using the C library)
>
> 3) HTTP itself (something I expanded this last weekend and still want
> to make better)
>
> Sometimes I think I should rename it, to reflect this, but meh,
> misc-stuff-including blah blah shows how good I am at names!

It seems Rack supports additional interfaces besides CGI. But I think we could take this one step further. I'm not entirely sure what APIs Rack provides, but in Rails they have a couple of methods to normalize the environment variables.

For example, ENV["REQUEST_URI"] returns different values on different servers. Rails provides a method, "request_uri", on the request object that returns the same value on all servers.

I don't know if cgi.d already has support for something similar.

-- 
/Jacob Carlborg
February 09, 2012
On Wed, 08 Feb 2012 20:49:48 -0600,
"Robert Jacques" <sandford@jhu.edu> wrote:

> On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau <nospam@example.com> wrote:
> > On Tue, 07 Feb 2012 20:44:08 -0500,
> > "Jonathan M Davis" <jmdavisProg@gmx.com> wrote:
> >> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
> >> > On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
> [snip]
> >
> > Using ranges of dchar directly can be horribly inefficient in some
> > cases, you'll need at least some kind of buffered dchar range. Some
> > std.json replacement code tried to use only dchar ranges and had to
> > reassemble strings character by character using Appender. That sucks
> > especially if you're only interested in a small part of the data and
> > don't care about the rest.
> > So for pull/sax parsers: Use buffering, return strings (better:
> > w/d/char[]) as slices to that buffer. If the user needs to keep a
> > string, he can still copy it. (String decoding should also be done
> > on-demand only).
> 
> Speaking as the one proposing said JSON replacement, I'd like to point out that JSON strings != UTF strings: manual conversion is required some of the time. And I use appender as a dynamic buffer in exactly the manner you suggest. There's even an option to use a string cache to minimize total memory usage. (Hmm... that functionality should probably be refactored out and made into its own utility.) That said, I do end up doing a bunch of useless encodes and decodes, so I'm going to special-case those away and add slicing support for strings. wstrings and dstrings will still need to be converted, as currently Json values only accept strings and therefore Json tokens also only support strings. As a potential user of the sax/pull interface, would you prefer the extra clutter of special side channels for zero-copy wstrings and dstrings?

Regarding wstrings and dstrings: well, JSON seems to be UTF-8 in almost all cases, so it's not that important. But I think it should be possible to use templates to implement identical parsers for char/wchar/dchar strings, as sketched below.
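
A minimal sketch of that (the names are illustrative, not an actual API):

// One parser implementation, instantiated per character width.
struct JsonParser(Char)
    if (is(Char == char) || is(Char == wchar) || is(Char == dchar))
{
    const(Char)[] input; // string, wstring or dstring
    size_t pos;
    // ... the same parsing code compiles for all three widths ...
}

alias Utf8JsonParser  = JsonParser!char;
alias Utf16JsonParser = JsonParser!wchar;
alias Utf32JsonParser = JsonParser!dchar;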

Regarding the use of Appender: Long text ahead ;-)

I think pull parsers should really be as fast as possible and low-level. For easy-to-use high-level stuff there's always DOM, and a safe, high-level serialization API should be implemented on top of the pull parser as well. The serialization API would read only the requested data, skipping the rest:
----------------
struct Data
{
    string link;
}
auto data = unserialize!Data(json);
----------------

So in the pull parser we should avoid memory allocation whenever
possible; I think we can even avoid it completely:

I think dchar ranges are just the wrong input type for parsers; parsers
should use buffered ranges or streams (which would be basically the
same thing). We could then use a generic BufferedRange to wrap real
dchar ranges. This BufferedRange could use a static buffer, so
there's no need to allocate anything.
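
A rough sketch of the shape such a BufferedRange could take; the Source type and its read(ubyte[]) primitive are hypothetical stand-ins for whatever stream or range feeds the parser:

struct BufferedRange(Source)
{
    private Source src;
    private char[64 * 1024] buf; // fixed-size buffer, no GC allocation
    private char[] window;       // currently valid part of buf

    // Refill the buffer. Any slice taken from the previous window becomes
    // invalid, which is exactly the contract described above.
    void refill()
    {
        auto n = src.read(cast(ubyte[]) buf[]);
        window = buf[0 .. n];
    }

    // The parser hands out slices into this window.
    const(char)[] data() { return window; }
}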

The pull parser should return slices into the original string (if the
input is a string) or slices into the range/stream's buffer.
Of course, such a slice is only valid until the pull parser is called
again. The slice also wouldn't be decoded yet. And a sliced string could
only be as long as the buffer, but I don't think this is an issue: a
512KB buffer can already store 524288 characters.

If the user wants to keep a string, he should really do decodeJSONString(data).idup. There's a little more opportunity for optimization: as long as a decoded JSON string is never longer than the encoded one (it shouldn't be, since every escape sequence takes at least as many bytes as its decoded form), we could have a decodeJSONString function which overwrites the original buffer --> no memory allocation.

If that's not the case, decodeJSONString has to allocate iff the
decoded string differs from the original. So we need a function which
always returns the decoded string as a safe-to-keep copy, and a function
which returns the decoded string as a slice when the decoded string is
the same as the original.
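
A sketch of those two flavours (the names mirror the functions used in the example below; escape handling is reduced to a few cases for brevity):

import std.string : indexOf;

// Transient variant: returns the input slice when nothing needs decoding,
// otherwise a slice of a scratch buffer, valid only until the next call.
const(char)[] tempDecodeJSON(const(char)[] raw)
{
    if (raw.indexOf('\\') < 0)
        return raw;            // fast path: nothing to decode, no allocation

    static char[] scratch;     // reused thread-local scratch buffer
    scratch.length = 0;
    scratch.assumeSafeAppend();
    for (size_t i = 0; i < raw.length; ++i)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            ++i;
            switch (raw[i])
            {
                case 'n': scratch ~= '\n'; break;
                case 't': scratch ~= '\t'; break;
                case '"': scratch ~= '"';  break;
                default:  scratch ~= raw[i]; break; // \uXXXX etc. elided
            }
        }
        else
            scratch ~= raw[i];
    }
    return scratch;
}

// Safe-to-keep variant: always returns an independent copy.
string decodeJSON(const(char)[] raw)
{
    return tempDecodeJSON(raw).idup;
}

unittest
{
    assert(tempDecodeJSON(`plain`) == "plain"); // original slice returned
    assert(decodeJSON(`a\nb`) == "a\nb");       // escape decoded, copy made
}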

An example:

string json = `{
   "link":"http://www.google.com",
   "useless_data":"lorem ipsum",
   "more":{
      "not interested":"yes"
   }
}`;

Now I'm only interested in the link. It should be possible to parse that with zero memory allocations:

auto parser = Parser(json);
parser.popFront(); // prime the parser so front holds the first token
while(!parser.empty)
{
    if(parser.front.type == KEY
       && tempDecodeJSON(parser.front.value) == "link")
    {
        parser.popFront();
        assert(!parser.empty && parser.front.type == VALUE);
        return decodeJSON(parser.front.value); //Should return a slice
    }
    //Skip everything else;
    parser.popFront();
}

tempDecodeJSON returns a decoded string which (usually) isn't safe to store: it can/should be a slice into the internal buffer; here it's a slice into the original string, so it could be stored, but there's no guarantee. In this case, the call to tempDecodeJSON could even be left out, as we only search for "link", which doesn't need decoding.
February 09, 2012
On Wed, 08 Feb 2012 20:49:48 -0600,
"Robert Jacques" <sandford@jhu.edu> wrote:
> 
> Speaking as the one proposing said JSON replacement, I'd like to point out that JSON strings != UTF strings: manual conversion is required some of the time. And I use appender as a dynamic buffer in exactly the manner you suggest. There's even an option to use a string cache to minimize total memory usage. (Hmm... that functionality should probably be refactored out and made into its own utility.) That said, I do end up doing a bunch of useless encodes and decodes, so I'm going to special-case those away and add slicing support for strings. wstrings and dstrings will still need to be converted, as currently Json values only accept strings and therefore Json tokens also only support strings. As a potential user of the sax/pull interface, would you prefer the extra clutter of special side channels for zero-copy wstrings and dstrings?

BTW: Do you know DYAML?
https://github.com/kiith-sa/D-YAML

I think it has a pretty nice DOM implementation which doesn't require any changes to Phobos. As YAML is a superset of JSON, adapting it for std.json shouldn't be too hard. The code is Boost-licensed and well documented.

I think std.json would have better chances of being merged into Phobos if it didn't rely on changes to std.variant.
February 09, 2012
On Thu, 09 Feb 2012 05:13:52 -0600, Johannes Pfau <nospam@example.com> wrote:
> On Wed, 08 Feb 2012 20:49:48 -0600,
> "Robert Jacques" <sandford@jhu.edu> wrote:
>>
>> Speaking as the one proposing said Json replacement, I'd like to
>> point out that JSON strings != UTF strings: manual conversion is
>> required some of the time. And I use appender as a dynamic buffer in
>> exactly the manner you suggest. There's even an option to use a
>> string cache to minimize total memory usage. (Hmm... that
>> functionality should probably be refactored out and made into its
>> own utility) That said, I do end up doing a bunch of useless encodes
>> and decodes, so I'm going to special-case those away and add slicing
>> support for strings. wstrings and dstrings will still need to be
>> converted as currently Json values only accept strings and therefore
>> also Json tokens only support strings. As a potential user of the
>> sax/pull interface would you prefer the extra clutter of special side
>> channels for zero-copy wstrings and dstrings?
>
> BTW: Do you know DYAML?
> https://github.com/kiith-sa/D-YAML
>
> I think it has a pretty nice DOM implementation which doesn't require
> any changes to phobos. As YAML is a superset of JSON, adapting it for
> std.json shouldn't be too hard. The code is boost licensed and well
> documented.
>
> I think std.json would have better chances of being merged into phobos
> if it didn't rely on changes to std.variant.

I know about D-YAML, but haven't taken a deep look at it; it was developed long after I wrote my own JSON library. I did look into YAML before deciding to use JSON for my application; I just didn't need the extra features and implementing them would've taken extra dev time.

As for the reliance on changes to std.variant, this was a change *suggested* by Andrei. And while it is the slower route to go, I believe it is the correct software engineering choice; prior to the change I was implementing my own typed union (i.e. I had poorly reinvented std.variant). Actually, most of my initial work on Variant was to make its API just as good as my home-rolled JSON type. Furthermore, a quick check of the YAML code-base seems to indicate that, underneath the hood, Variant is being used. I'm actually a little curious about what prevented YAML from being expressed using std.variant directly, and whether those limitations can be removed.
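
For context, the self-referential pattern std.variant enables looks like this (this is the generic Algebraic pattern, not Robert's actual API):

import std.variant : Algebraic, This;

// A JSON value as an Algebraic over the possible JSON types;
// `This` stands in for the type being defined.
alias JsonValue = Algebraic!(typeof(null), bool, double, string,
                             This[], This[string]);

unittest
{
    JsonValue v = "hello";
    assert(v.get!string == "hello");
    v = 3.14;
    assert(v.get!double == 3.14);
}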

* The other thing slowing both std.variant and std.json down is my thesis writing :)
February 09, 2012
On Thursday, 9 February 2012 at 08:26:25 UTC, Jacob Carlborg wrote:
> For example, ENV["REQUEST_URI"] returns different values on different servers. Rails provides a method, "request_uri", on the request object that returns the same value on all servers.
>
> I don't know if cgi.d already has support for something similar.

Yeah, in cgi.d, you use Cgi.requestUri, which is an immutable
string, instead of using the environment variable directly.

  requestUri = getenv("REQUEST_URI");
  // Because IIS doesn't pass requestUri, we simulate it here if it's empty.
  if(requestUri.length == 0) {
      // IIS sometimes includes the script name as part of the path info - we don't want that
      if(pathInfo.length >= scriptName.length && (pathInfo[0 .. scriptName.length] == scriptName))
          pathInfo = pathInfo[scriptName.length .. $];

      requestUri = scriptName ~ pathInfo ~ (queryString.length ? ("?" ~ queryString) : "");

      // FIXME: this works for apache and iis... but what about others?
  }

That's in the cgi constructor. Somewhat ugly code, but I figure it's
better to have ugly code in the library than incompatibilities
in the user program!

The http constructor creates these variables from the raw headers.


Here's the ddoc:
http://arsdnet.net/web.d/cgi.html

If you search for "requestHeaders", you'll see all the related
members that follow. If you use those class members instead of direct
environment variables, you'll get maximum compatibility.
February 09, 2012
On 2012-02-09 15:56, Adam D. Ruppe wrote:
> On Thursday, 9 February 2012 at 08:26:25 UTC, Jacob Carlborg wrote:
>> For example, ENV["REQUEST_URI"] returns different values on different
>> servers. Rails provides a method, "request_uri", on the request object
>> that returns the same value on all servers.
>>
>> I don't know if cgi.d already has support for something similar.
>
> Yeah, in cgi.d, you use Cgi.requestUri, which is an immutable
> string, instead of using the environment variable directly.
>
> requestUri = getenv("REQUEST_URI");
> // Because IIS doesn't pass requestUri, we simulate it here if it's empty.
> if(requestUri.length == 0) {
>     // IIS sometimes includes the script name as part of the path info - we don't want that
>     if(pathInfo.length >= scriptName.length && (pathInfo[0 .. scriptName.length] == scriptName))
>         pathInfo = pathInfo[scriptName.length .. $];
>
>     requestUri = scriptName ~ pathInfo ~ (queryString.length ? ("?" ~ queryString) : "");
>
>     // FIXME: this works for apache and iis... but what about others?
> }
>
> That's in the cgi constructor. Somewhat ugly code, but I figure
> better to have ugly code in the library than incompatibilities
> in the user program!
>
> The http constructor creates these variables from the raw headers.
>
>
> Here's the ddoc:
> http://arsdnet.net/web.d/cgi.html
>
> If you search for "requestHeaders", you'll see all the stuff
> following. If you use those class members instead of direct
> environment variables, you'll get max compatibility.

Cool, you already thought of all of this it seems.

-- 
/Jacob Carlborg
February 09, 2012
For XML, template the parser on the char type so transcoding is unnecessary. Since JSON is UTF-8, I'd use char there, and at least for the event parser, don't proactively decode strings--let the user do this. In fact, don't proactively decode anything. Give me the option of getting a number via its string representation directly from the input buffer. Roughly, JSON events should be:

Enter object
Object key
Int value (as string)
Float value (as string)
Null
True
False
Etc.
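
One possible shape for that event stream, as a sketch (all names are illustrative):

// Event kinds produced by the pull parser.
enum JsonEvent
{
    objectStart,
    objectEnd,
    arrayStart,
    arrayEnd,
    key,     // object key, undecoded
    str,     // string value, undecoded
    number,  // int/float value "as string": no conversion performed
    null_,
    true_,
    false_,
}

// A token pairs the event kind with a raw slice into the input buffer,
// so string decoding and numeric conversion happen only on demand.
struct JsonToken
{
    JsonEvent kind;
    const(char)[] text;
}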

On Feb 8, 2012, at 6:49 PM, "Robert Jacques" <sandford@jhu.edu> wrote:

> On Wed, 08 Feb 2012 02:12:57 -0600, Johannes Pfau <nospam@example.com> wrote:
>> On Tue, 07 Feb 2012 20:44:08 -0500,
>> "Jonathan M Davis" <jmdavisProg@gmx.com> wrote:
>>> On Tuesday, February 07, 2012 00:56:40 Adam D. Ruppe wrote:
>>> > On Monday, 6 February 2012 at 23:47:08 UTC, Jonathan M Davis
> [snip]
>> 
>> Using ranges of dchar directly can be horribly inefficient in some
>> cases, you'll need at least some kind of buffered dchar range. Some
>> std.json replacement code tried to use only dchar ranges and had to
>> reassemble strings character by character using Appender. That sucks
>> especially if you're only interested in a small part of the data and
>> don't care about the rest.
>> So for pull/sax parsers: Use buffering, return strings (better:
>> w/d/char[]) as slices to that buffer. If the user needs to keep a
>> string, he can still copy it. (String decoding should also be done
>> on-demand only).
> 
> Speaking as the one proposing said JSON replacement, I'd like to point out that JSON strings != UTF strings: manual conversion is required some of the time. And I use appender as a dynamic buffer in exactly the manner you suggest. There's even an option to use a string cache to minimize total memory usage. (Hmm... that functionality should probably be refactored out and made into its own utility.) That said, I do end up doing a bunch of useless encodes and decodes, so I'm going to special-case those away and add slicing support for strings. wstrings and dstrings will still need to be converted, as currently Json values only accept strings and therefore Json tokens also only support strings. As a potential user of the sax/pull interface, would you prefer the extra clutter of special side channels for zero-copy wstrings and dstrings?