August 26, 2008
BCS wrote:
> Reply to Benji,
>> Well, for something like a DOM parser, it's pretty much impossible to
>> parse a file that won't fit into memory. But a SAX parser doesn't
>> actually create any objects. It just fires events while processing
>> XML data from a stream. A good SAX parser can operate without ever
>> allocating anything on the heap, leaving the consumer to create any
>> necessary objects from the parse process.
>>
>> --benji
>>
> 
> Interesting, I've worked with parsers* that function something like that but never thought of them in that way. OTOH I can think of only a very limited set of domains where this would be useful. If I needed to process that much data, I'd load it into a database and go from there.
> 
> *In fact my parser generator could be used that way.

In fact, that's one of the places where I've used this kind of parsing technique before.

I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks.

If I had to load whole CSV files into memory before parsing, it wouldn't work, because they can sometimes run to hundreds of megabytes. But the streaming parser takes up almost no memory at all.
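
Roughly, the core of it is a tiny state machine. Here's a sketch from memory (not the production code -- the names are made up, and CR/LF normalization, EOF flushing, and error recovery are elided). Note that the only memory that grows is the buffer holding the current field:

    enum { START, UNQUOTED, QUOTED, QUOTE_SEEN }

    struct CsvParser
    {
        int state = START;
        char[] field;   // the one growing buffer: the current field

        // Feed one chunk from the stream; the callbacks fire as each
        // field and record is completed.
        void feed(char[] chunk,
                  void delegate(char[]) onField,
                  void delegate() onRecord)
        {
            foreach (c; chunk)
            {
                switch (state)
                {
                case START:         // at the beginning of a field
                    if (c == '"')       state = QUOTED;
                    else if (c == ',')  emit(onField);
                    else if (c == '\n') { emit(onField); onRecord(); }
                    else { field ~= c; state = UNQUOTED; }
                    break;
                case UNQUOTED:
                    if (c == ',')       { emit(onField); state = START; }
                    else if (c == '\n') { emit(onField); onRecord(); state = START; }
                    else field ~= c;
                    break;
                case QUOTED:        // newlines are just data in here
                    if (c == '"') state = QUOTE_SEEN;
                    else field ~= c;
                    break;
                case QUOTE_SEEN:    // closing quote, or first half of ""
                    if (c == '"')       { field ~= '"'; state = QUOTED; }
                    else if (c == ',')  { emit(onField); state = START; }
                    else if (c == '\n') { emit(onField); onRecord(); state = START; }
                    break;          // anything else here is malformed; skipped
                default:
                    break;
                }
            }
        }

        private void emit(void delegate(char[]) onField)
        {
            onField(field);
            field.length = 0;
        }
    }

You just feed it whatever each read() returns; the callbacks can be any delegates, and nothing is ever buffered past the current field.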

--benji
August 26, 2008
Benji Smith Wrote:

> BCS wrote:
> > Reply to Benji,
> >> Well, for something like a DOM parser, it's pretty much impossible to parse a file that won't fit into memory. But a SAX parser doesn't actually create any objects. It just fires events while processing XML data from a stream. A good SAX parser can operate without ever allocating anything on the heap, leaving the consumer to create any necessary objects from the parse process.
> >>
> >> --benji
> >>
> > 
> > Interesting, I've worked with parsers* that function something like that but never thought of them in that way. OTOH I can think of only a very limited set of domains where this would be useful. If I needed to process that much data, I'd load it into a database and go from there.
> > 
> > *In fact my parser generator could be used that way.
> 
> In fact, that's one of the places where I've used this kind of parsing technique before.
> 
> I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks.
> 
> If I had to load whole CSV files into memory before parsing, it wouldn't work, because they can sometimes run to hundreds of megabytes. But the streaming parser takes up almost no memory at all.
> 
> --benji

sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack, stream parsing becomes very difficult.
August 26, 2008
"bearophile" <bearophileHUGS@lycos.com> wrote in message news:g8vmda$sd4$1@digitalmars.com...
> BCS:
>> If you must have that sort of interface, pick a different language, because D isn't intended to work that way.
>
> I suggest Benji try C# 3+; despite all the problems it has and the borg-like nature of such software, it will be used way more than D, and it has all the nice things Benji asks for.
>

(pet peeve) As much as there is that I like about C#, the lack of an IArithmetic interface or operator constraints tends to gimp its generics in a number of cases.
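
To illustrate (toy code, not from any real project): a D template can just use an operator and let each instantiation check it, while the equivalent C# 2/3 generic has nothing to constrain on:

    // Compiles for any T with a '+' (builtins, or user types via opAdd).
    T sum(T)(T[] xs)
    {
        T total = xs[0];             // assumes a non-empty array, for brevity
        foreach (x; xs[1 .. $])
            total = total + x;
        return total;
    }

    // sum!(int), sum!(real), sum!(SomeBigInt)... all fine in D. In C#
    // there's no "where T : IArithmetic" to write, so a generic Sum<T>
    // can't say "total + x" at all; you fall back to per-type overloads.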


August 26, 2008
superdan wrote:
>> No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in O(n^n) and still be correct.
> 
> stepanov has shown that for composable operations the complexity must be part of the specification. otherwise composition easily leads to high-order polynomials that fail to terminate in reasonable time. opIndex is an indexing operator expected to run in constant time, and algorithms rely on that. so no. opIndex running in o(n^n) is incorrect because it fails its spec.

Um... how would one "show" that? I'm not talking theoretical bullshit here, I'm talking real-world requirements. Some specs of operations (composable or not) list their time/memory complexity. Most do not. They're still usable. I agree that a standard library sort routine is one that *should* list its time complexity. My internal function for enumerating registry keys doesn't need to.

> here you hint you don't understand what i'm talking about indeed. none of java, c#, or tango define a[n] to run in o(n). they define named functions, which i'm perfectly fine with.

I guess I didn't understand what you were saying because you _never mentioned_ you were talking only about opIndex and not other functions. I don't see the difference between a[n] and a.get(n); the former is just a shorter syntax. The D spec certainly doesn't make any guarantees about the time/memory complexity of opIndex; it's up to the implementing class to do so. In fact, the D spec makes no time/memory complexity guarantees about sort for arbitrary user-defined types, either, so maybe you shouldn't use that.

> funny you should mention that. window manager in windows 3.1 worked exactly like that. users noticed that the more windows they opened, the longer it took to open a new window. with new systems and more memory people would have many windows. before long this became a big issue. windows 95 fixed that.
> 
> never misunderestimate scalability.

I don't know enough about GUI programming to say for sure, but that suggests a window manager shouldn't be written using linked lists. It doesn't suggest that getting a value from an arbitrary index in a linked list is useless (in fact, it shows the opposite -- that it works fine -- it just shows it's not scalable).
August 26, 2008
Jb wrote:
> Walter said "the hardware cannot predict where a virtual call will go".
> 
> It does in fact predict them, and speculatively execute them, and as pretty much any benchmark will show, it gets it right the vast majority of the time. (On x86, anyway.)
> 
> That's what I was saying. 

Looks like I keep falling behind on what modern CPUs are doing :-(

In any case, throughout all the revolutions in how CPUs work, there have been a few invariants that hold true well enough as an optimization guide:

1. fewer instructions ==> faster execution
2. fewer memory accesses ==> faster execution
3. fewer conditional branches ==> faster execution (see the sketch below)
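
For instance, a conditional can often be traded for straight-line arithmetic. A standard bit trick (with the usual caveat that a - b must not overflow):

    // Branchless min: nothing here for the branch predictor to miss.
    int minBranchless(int a, int b)
    {
        int d = a - b;
        return b + (d & (d >> 31));  // d >> 31 is -1 when a < b, else 0
    }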
August 26, 2008
bearophile wrote:
> Walter Bright:
>>> We've already observed that D assoc arrays are less performant
>>> than even Python maps, so the extra cost of lookup operations is
>>> unwelcome.
>> Every one of those benchmarks that purported to show that D AA's
>> were relatively slow turned out to be, on closer examination, D
>> running the garbage collector more often than Python does. It had
>> NOTHING to do with the AA's.
> 
> Really? I must have missed those conclusions then, despite reading
> all the posts on the subject. What solutions do you propose for the
> problem then? I recall that disabling the GC didn't improve the
> situation much. So the problem now becomes how to improve the D GC?

In my experience with such programs, disabling the collection cycles brought the speed up to par.
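
The pattern was simply to defer collections across the hot loop. A minimal sketch, assuming D1's std.gc interface (the names differ in other runtimes):

    import std.gc;

    // Build a big AA without collection cycles firing mid-loop.
    // Allocation still works; cycles are deferred until enable().
    int[char[]] countWords(char[][] words)
    {
        std.gc.disable();
        scope (exit) std.gc.enable();

        int[char[]] counts;
        foreach (w; words)
        {
            int* p = w in counts;
            if (p)
                (*p)++;
            else
                counts[w] = 1;
        }
        return counts;
    }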
August 26, 2008
superdan wrote:
> Benji Smith Wrote:
>>
>> I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks.
>>
>> If I had to load whole CSV files into memory before parsing, it wouldn't work, because they can sometimes run to hundreds of megabytes. But the streaming parser takes up almost no memory at all.
>>
>> --benji
> 
> sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack, stream parsing becomes very difficult.

Noooooooobody uses backtracking to parse.

Most of the time LL(k) token lookahead solves the problem. Sometimes you need a syntactic predicate or (rarely) a semantic predicate.
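
For example (contrived and hand-rolled -- the Tok and Lookahead names are made up): with two tokens of buffered lookahead you can pick the right production without ever rewinding the stream:

    enum Tok { Ident, Assign, LParen, Other, Eof }

    // A k-token window over some streaming lexer (stubbed as an array
    // here so the sketch is self-contained).
    struct Lookahead
    {
        Tok[] toks;
        size_t pos;

        Tok peek(size_t k) { return pos + k < toks.length ? toks[pos + k] : Tok.Eof; }
        Tok next()         { return peek(0) == Tok.Eof ? Tok.Eof : toks[pos++]; }
    }

    // LL(2): one extra peek distinguishes "x = ..." from "x(...)".
    // The stream only moves forward, so this works fine off a pipe.
    bool looksLikeAssignment(Lookahead la)
    {
        return la.peek(0) == Tok.Ident && la.peek(1) == Tok.Assign;
    }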

I've never even heard of a parser generator framework that supported backtracking.

--benji
August 26, 2008
Robert Fraser Wrote:

> superdan wrote:
> >> No, the SCALABLE opIndex runs in O(1). The CORRECT opIndex can run in O(n^n) and still be correct.
> > 
> > stepanov has shown that for composable operations the complexity must be part of the specification. otherwise composition easily leads to high-order polynomials that fail to terminate in reasonable time. opIndex is an indexing operator expected to run in constant time, and algorithms rely on that. so no. opIndex running in o(n^n) is incorrect because it fails its spec.
> 
> Um... how would one "show" that? I'm not talking theoretical bullshit here, I'm talking real-world requirements.

hey. hey. watch'em manners :) he's shown it by putting stl together.

> Some specs of operations (composable or not) list their time/memory complexity. Most do not. They're still usable. I agree that a standard library sort routine is one that *should* list its time complexity. My internal function for enumerating registry keys doesn't need to.

sure thing.

> > here you hint you don't understand what i'm talking about indeed. none of java, c#, or tango define a[n] to run in o(n). they define named functions, which i'm perfectly fine with.
> 
> I guess I didn't understand what you were saying because you _never mentioned_ you were talking only about opIndex and not other functions.

well then allow me to quote myself: "oopsies. houston we got a problem here. problem is all that pluggable sort business works only if it can count on a constant time opIndex. why? because sort has right in its spec that it takes o(n log n) time. if u pass LinkedList to it you obtain a nonsensical design that compiles but runs afoul of the spec. because with that opIndex sort will run in quadratic time and no amount of commentin' is gonna save it from that."

> I don't see the difference between a[n] and a.get(n); the former is just a shorter syntax.

wrong. the former is used by sort. the latter ain't.

> The D spec certainly doesn't make any guarantees about the time/memory complexity of opIndex; it's up to the implementing class to do so.

it don't indeed. it should. that's a problem with the spec.

> In fact, the D spec makes no time/memory complexity guarantees about sort for arbitrary user-defined types, either, so maybe you shouldn't use that.

makes guarantees in terms of the primitive operations used.
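
to make it concrete (toy code, names made up):

    // a list whose opIndex walks from the head. call sites look exactly
    // like array indexing, but every access costs o(n).
    class List(T)
    {
        private struct Node { T value; Node* next; }
        private Node* head;

        T opIndex(size_t i)
        {
            Node* n = head;
            while (i--)
                n = n.next;
            return n.value;
        }
    }

    // hand this to a sort spec'd as o(n log n) *element accesses* and
    // each access itself costs o(n): it compiles, it runs, it returns
    // sorted data -- and it quietly blows the complexity guarantee.

that's the whole point. the operator makes a promise the container can't keep.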

> > funny you should mention that. window manager in windows 3.1 worked exactly like that. users noticed that the more windows they opened, the longer it took to open a new window. with new systems and more memory people would have many windows. before long this became a big issue. windows 95 fixed that.
> > 
> > never misunderestimate scalability.
> 
> I don't know enough about GUI programming to say for sure, but that suggests a window manager shouldn't be written using linked lists. It doesn't suggest that getting a value from an arbitrary index in a linked list is useless (in fact, it shows the opposite -- that it works fine -- it just shows it's not scalable).

problem's too many people talk without knowin' enuff 'bout stuff. there's only a handful of subjects i know any about. and i try to not stray. when it comes to stuff i know, i'm amazed readin' here at how many just fudge their way around.
August 26, 2008
"Jb" <jb@nowhere.com> wrote in message news:g90mm6$2tk9$1@digitalmars.com...
>
> "Walter Bright" <newshound1@digitalmars.com> wrote in message news:g90iia$2jc4$3@digitalmars.com...
>> Yigal Chripun wrote:
>>> a) people here said that a virtual call will make it slow. How much slower? How much of an overhead is it on modern hardware, considering that this is a place where hardware manufacturers spend time on optimizations?
>>
>> Virtual function calls have been a problem for hardware optimization. Direct function calls can be speculatively executed, but not virtual ones, because the hardware cannot predict where they will go. This means virtual calls can be much slower than direct function calls.
>
> Modern x86 branch prediction treats indirect calls the same as conditional branches. They get a slot in the branch target buffer, so they do get speculatively executed. And if correctly predicted, it's only a couple of cycles more costly than direct calls.

Just curious: how "modern" do you mean by "modern" here?


August 26, 2008
Benji Smith Wrote:

> superdan wrote:
> > Benji Smith Wrote:
> >>
> >> I wrote a streaming CSV parser (which takes discipline to do correctly, since a double-quote enclosed field can legally contain arbitrary newline characters, and quotes are escaped by doubling). It provides a field callback and a record callback, so it's very handy for performing ETL tasks.
> >>
> >> If I had to load whole CSV files into memory before parsing, it wouldn't work, because they can sometimes run to hundreds of megabytes. But the streaming parser takes up almost no memory at all.
> >>
> >> --benji
> > 
> > sure it takes very little memory. i'll tell u how much memory u need in fact. it's the finite state needed by the fsa. u could do that because csv only needs finite state for parsing. soon as you need to backtrack, stream parsing becomes very difficult.
> 
> Noooooooobody uses backtracking to parse.

guess that makes perl regexes et al noooooooobody.

> Most of the time LL(k) token lookahead solves the problem. Sometimes you need a syntactic predicate or (rarely) a semantic predicate.
>
> I've never even heard of a parser generator framework that supported backtracking.

live & learn. keep lookin'. hint: try antlr.