Re: std.compress
June 05, 2013
On Wednesday, June 05, 2013 07:49:22 H. S. Teoh wrote:
> (Actually, std.algorithm currently sits at 11636 lines. I call BS on whoever claims to be able to "skim over" std.algorithm and "get a feel for how it works". Chances are your finger will get so tired of hitting PgDn about 2000 lines into the file that you won't even look at the rest. And most of the code is only superficially related to each other -- about 20 functions into the file you'd have lost track of all sense of how things fit together -- 'cos they *don't* really fit together! It's the epitome of why we *should* move to smaller modules, rather than the current giant monolithic ones.)

There's a significant difference between a 10,000+ line file and a file that's less than 1000 lines. Walter's proposed std.compress is less than 1000 lines long, and yet we're talking about splitting it up. That's going to lead to ridiculously small modules IMHO. There's a balance to be had here that's somewhere in between the two extremes.

- Jonathan M Davis
June 05, 2013
On Wednesday, 5 June 2013 at 17:10:38 UTC, Jonathan M Davis wrote:
> There's a balance to be had here that's
> somewhere in between the two extremes.

I think number of lines is the wrong thing to look at, at least as the first measure.

The question I'd have is: is this of use to a different yet kinda related concept? If yes, make it a module.

For example, in my code, the "struct Color" used to be defined straight in simpledisplay.d. But image.d had its own struct Color. I wanted to take image.d's Images and draw them on simpledisplay.d's Windows. But I didn't want either of those modules to depend on each other. So now Color is in a separate color.d and both use it.
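To make that concrete, here's a rough sketch of the resulting layout (module names from my code, but the members are placeholders for illustration, not the real definitions):

----color.d----
module color;

// The one shared type both modules agree on.
struct Color { ubyte r, g, b, a; }
----image.d----
module image;
import color;

class Image {
    // placeholder member
    Color getPixel(int x, int y) { return Color.init; }
}
----simpledisplay.d----
module simpledisplay;
import color;

class SimpleWindow {
    // placeholder member
    void drawPixel(int x, int y, Color c) { }
}
----snip----

image.d and simpledisplay.d now share only the tiny color.d dependency, so you can draw an Image onto a SimpleWindow without either module importing the other.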

Glancing at this code, the only not-strictly-related one is CircularBuffer. You could argue it should be separate just because it is of general use. But, is it needed for interoperability? Would two separate modules need to pass the same CircularBuffer type between them? If yes, separate it.

If no, then line count might be OK, because the question I'd ask at this point is: is it easy code? If so, the costs of adding dependencies IMO override the costs of not being DRY (WETness???).

CircularBuffer here isn't exactly rocket science. I don't think it passes the threshold to justify adding a dependency for it.

Whereas the compression algorithm itself does; if you were writing a gif reader or whatever and implemented the algorithm, I'd say split it into a module if some second module needs it too. (Before then? do the simplest thing that can possibly work and you ain't gonna need it, so I'd just keep it in the one file.)
June 05, 2013
On 6/5/13 1:10 PM, Jonathan M Davis wrote:
> On Wednesday, June 05, 2013 07:49:22 H. S. Teoh wrote:
>> (Actually, std.algorithm currently sits at 11636 lines. I call BS on
>> whoever claims to be able to "skim over" std.algorithm and "get a feel
>> for how it works". Chances are your finger will get so tired of hitting
>> PgDn about 2000 lines into the file that you won't even look at the
>> rest. And most of the code is only superficially related
>> -- about 20 functions into the file you'd have lost track of all sense
>> of how things fit together -- 'cos they *don't* really fit together!
>> It's the epitome of why we *should* move to smaller modules, rather than
>> the current giant monolithic ones.)
>
> There's a significant difference between a 10,000+ line file and a file that's
> less than 1000 lines. Walter's proposed std.compress is less than 1000 lines
> long, and yet we're talking about splitting it up. That's going to lead to
> ridiculously small modules IMHO. There's a balance to be had here that's
> somewhere in between the two extremes.

Walter's algo is among the simplest around. I expect any of the industry-standard algos to be significantly larger, most likely each worth a module.

I think it's worth discussing the interface much more than the approach to modularization.

Walter's algo traffics in InputRange!ubyte and offers an InputRange!ubyte. That makes sense in some situations, but often trafficking one ubyte at a time may be not only inefficient but also the wrong granularity. Consider:

auto data = File("input.txt").byChunk(4096).compress();

That won't work because byChunk deals in ubyte[], not ubyte. How do we fix this while keeping everybody efficient?

I talked to Walter, and during this work he figured out a lot of things about how ranges work and how they generate code. It turns out that the range equivalent of a tight loop is slightly less efficient with dmd because a range must keep its state together, which is harder to enregister than a bunch of automatic variables.

Right now we have joiner(), which given several ranges of T, offers a range of T. Developing along that idea, we should have two opposite functions: itemize() and collect().

itemize() takes a range of ranges of T and offers a range of T. For example, given a range of T[], offers a range of T.

collect() takes a range of T and offers a range of T[]. The number of items in each chunk can be a parameter.

With those in tow, we can set things up such that compress() and expand() traffic in ranges of ranges of ubyte (or simply ranges of ubyte[]), which ensures work at maximum speed. Then adapting to and from ranges of ubyte is a simple matter of chaining a call to itemize() or collect().
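A minimal sketch of the two (the names and shapes here are just this proposal, and collect leans on std.range's chunking plus std.array, an assumption about how it might be built rather than settled API):

import std.algorithm : joiner, map;
import std.array : array;
import std.range;

// itemize: given a range of ranges of T, offer a range of T.
// For plain flattening this is what joiner already does.
auto itemize(RoR)(RoR ror)
    if (isInputRange!RoR && isInputRange!(ElementType!RoR))
{
    return joiner(ror);
}

// collect: given a range of T, offer a range of T[] with n
// items per chunk (the last chunk may be shorter).
auto collect(R)(R r, size_t n)
    if (isInputRange!R)
{
    return r.chunks(n).map!(c => c.array);
}

With those, the example above could chain as File("input.txt").byChunk(4096).itemize.compress (a hypothetical composition, assuming compress takes a range of ubyte).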

Destroy.

I salute this code. It is concise, well engineered, well written, as general as it needs to be, uses the right features in the right places, and does real work. A perfect example to follow. The only thing I'd add is this convenience function:

void putBits(bool[] bits...) {
    foreach (b; bits) putBit(b);
}

That reduces a bunch of code, and leaves room for future optimizations inside putBits.
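E.g. call sites like this (invented for illustration) collapse nicely:

// before:
putBit(true); putBit(false); putBit(true); putBit(true);
// after:
putBits(true, false, true, true);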


Andrei
June 05, 2013
On Wednesday, 5 June 2013 at 17:36:05 UTC, Andrei Alexandrescu wrote:
> [...]
>
> I talked to Walter, and during this work he figured out a lot of things about how ranges work and how they generate code. It turns out that the range equivalent of a tight loop is slightly less efficient with dmd because a range must keep its state together, which is harder to enregister than a bunch of automatic variables.
>
> Right now we have joiner(), which given several ranges of T, offers a range of T. Developing along that idea, we should have two opposite functions: itemize() and collect().
>
> [...]
>
> The only thing I'd add is this convenience function:
>
> void putBits(bool[] bits...) {
>     foreach (b; bits) putBit(b);
> }
>
> That reduces a bunch of code, and leaves room for future optimizations inside putBits.

That makes a lot of sense about itemize/collect. With regard to the efficiency of ranges: assuming the optimiser can inline the range code, would it not turn into the exact same code as a loop using iterators? The length is needed in both cases, so I would have thought a good optimiser would produce equivalent code for both of them.

Also about "putBits": the problem I see with that is that DMD currently seems to fail at optimising variadics. Even with inlining and optimisation turned on, the generated code is very poor: http://forum.dlang.org/thread/bug-10085-3@http.d.puremagic.com%2Fissues%2F
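If that cost matters here, one possible workaround (just a sketch, not something I've measured against this code) is a compile-time variadic, which never materialises a runtime bool[]:

void putBits(Bits...)(Bits bits) {
    // foreach over a template argument tuple is unrolled during
    // compilation, so each putBit call is emitted directly.
    foreach (b; bits)
        putBit(b);
}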
June 05, 2013
On Wednesday, 5 June 2013 at 17:36:05 UTC, Andrei Alexandrescu wrote:
> [...]
>
> Right now we have joiner(), which given several ranges of T, offers a range of T. Developing along that idea, we should have two opposite functions: itemize() and collect().
>
> itemize() takes a range of ranges of T and offers a range of T. For example, given a range of T[], offers a range of T.
>
> collect() takes a range of T and offers a range of T[]. The number of items in each chunk can be a parameter.
>
> [...]

I like them.  How would itemize differ from joiner though? (apart from hopefully not converting narrow strings to wide strings like joiner currently seems to).
June 05, 2013
On the subject of std.algorithm being very large, I realised something. Once the import change that makes it possible to import sub-modules all at once is in, there will be some nice potential for splitting std.algorithm into sub-modules while making 'import std.algorithm' work exactly like it does now.
June 05, 2013
On Wednesday, 5 June 2013 at 17:36:05 UTC, Andrei Alexandrescu wrote:
> Right now we have joiner(), which given several ranges of T, offers a range of T. Developing along that idea, we should have two opposite functions: itemize() and collect().

collect() is sometimes used as the name for map() in some languages (e.g., Ruby, Smalltalk).  I don't think it's a problem but maybe a note in the documentation would be warranted to point people to map.  Alternatively it could be called something like bundle or parcel.
June 05, 2013
On Wednesday, 5 June 2013 at 18:25:45 UTC, w0rp wrote:
> On the subject of std.algorithm being very large, I realised something. Once the import change that makes it possible to import sub-modules all at once is in, there will be some nice potential for splitting std.algorithm into sub-modules while making 'import std.algorithm' work exactly like it does now.

You could probably just use the categories the documentation already defines.

std\algorithm\searching.d
std\algorithm\comparison.d
std\algorithm\iteration.d
std\algorithm\sorting.d
std\algorithm\set.d
std\algorithm\mutation.d
June 05, 2013
On Wednesday, 5 June 2013 at 17:36:05 UTC, Andrei Alexandrescu wrote:
> itemize() takes a range of ranges of T and offers a range of T. For example, given a range of T[], offers a range of T.
joiner?

> collect() takes a range of T and offers a range of T[]. The number of items in each chunk can be a parameter.
chunk? (or rather chunk(…).map!(a => a.array)).
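For concreteness, a self-contained sketch of that equivalence (using the chunks/map/array spellings in current std.range, std.algorithm, and std.array):

import std.algorithm : equal, map;
import std.array : array;
import std.range : chunks, iota;

void main() {
    // what collect(r, 4) would amount to:
    auto collected = iota(10).chunks(4).map!(a => a.array);
    assert(collected.equal([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]));
}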

> I salute this code. It is concise, well engineered, well written, just as general as it needs, uses the right features in the right places, and does real work. A perfect example to follow. The only thing I'd add is this convenience function:

It also doesn't utilize template constraints, reinvents isRandomAccessRange && hasSlicing under a poor name, uses C printf (!) in the examples, has random 2-3 letter variable names (dis, dip, di, si) all over the place, …

David
June 05, 2013
On Wed, Jun 05, 2013 at 08:29:51PM +0200, Brad Anderson wrote:
> On Wednesday, 5 June 2013 at 18:25:45 UTC, w0rp wrote:
> >On the subject of std.algorithm being very large, I realised something. Once the import change that makes it possible to import sub-modules all at once is in, there will be some nice potential for splitting std.algorithm into sub-modules while making 'import std.algorithm' work exactly like it does now.

Is that even necessary? We could just do this:

----std/algorithm.d----
module std.algorithm;
public import std.algorithm.search;
public import std.algorithm.compare;
public import std.algorithm.iteration;
public import std.algorithm.sort;
public import std.algorithm.set;
public import std.algorithm.mutation;
----snip----

Newer code, of course, would use the more specific imports.
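That is (module names as in the snippet above):

// old code keeps compiling unchanged:
import std.algorithm;        // public imports pull in every sub-module

// new code can depend on just what it uses:
import std.algorithm.sort;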


> You could probably just use the categories the documentation already defines.
> 
> std\algorithm\searching.d
> std\algorithm\comparison.d
> std\algorithm\iteration.d
> std\algorithm\sorting.d
> std\algorithm\set.d
> std\algorithm\mutation.d

Good idea!


T

-- 
There is no gravity. The earth sucks.