Recommendations on avoiding range pipeline type hell (page 2)

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Chris Piker
in reply to Adam D. Ruppe

Thanks to everyone who has replied.  You've given me a lot to think about, and since I'm not yet fluent in D it will take a bit to digest it all, though one thing is clear.

This community is one of the strong features of D.

I will mention it to others as a selling point.

Best,

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Chris Piker
in reply to Paul Backus

On Saturday, 15 May 2021 at 14:05:34 UTC, Paul Backus wrote:

> If you post your code (or at least a self-contained subset of it) someone can probably help you figure out where you're running into trouble. The error messages by themselves do not provide enough information--all I can say from them is, "you must be doing something wrong."

I just tacked on .array in the unittest and moved on for now, but for those who may be interested in the "equivalent but not equivalent" dmd error message mentioned above, the code is up on GitHub. To trigger the error message:

git clone git@github.com:das-developers/das2D.git
cd das2D
rdmd -unittest --main das2/range.d  # This works

In file das2/range.d, comment out lines 550 & 553 and uncomment lines 557 & 558 to get alternate definitions of coarse_recs and fine_recs then run rdmd again:

rdmd -unittest --main das2/range.d  # No longer works

In addition to the issue mentioned above, comments on any style issues, best practices or design choices are invited. By the way, the writeln calls in the unittests are just temporary.

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Jordan Wilson
in reply to Chris Piker

On Sunday, 16 May 2021 at 07:20:52 UTC, Chris Piker wrote:

> On Saturday, 15 May 2021 at 14:05:34 UTC, Paul Backus wrote:
>
> git clone git@github.com:das-developers/das2D.git
> cd das2D
> rdmd -unittest --main das2/range.d  # This works
>
> In file das2/range.d, comment out lines 550 & 553 and uncomment lines 557 & 558 to get alternate definitions of coarse_recs and fine_recs then run rdmd again:
>
> rdmd -unittest --main das2/range.d  # No longer works
>
> In addition to the issue mentioned above, comments on any style issues, best practices or design choices are invited. By the way, the writeln calls in the unittests are just temporary.

Essentially, dr_fine and dr_coarse are different types. For example:

echo 'import std; void main() { auto a = [1,"test"]; }' | dmd -run - # your error

Another example:

auto r = [iota(1,10).map!(a => a.to!int),iota(1,10).map!(a => a.to!int)]; // compile error

Using .array on both of the elements of r will compile.
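A minimal sketch of that difference, assuming a recent dmd and Phobos. The `static assert` checks that the array literal of two lazy pipelines is rejected, while the `.array` versions share the plain type `int[]`:

```d
import std.algorithm : equal, map;
import std.array : array;
import std.conv : to;
import std.range : iota;

void main()
{
    // Two textually identical pipelines still have distinct types, because
    // each lambda literal is its own symbol, so they cannot share an array.
    static assert(!__traits(compiles,
        [iota(1, 10).map!(a => a.to!int), iota(1, 10).map!(a => a.to!int)]));

    // Materializing each pipeline with .array yields plain int[] values,
    // which do have a common type.
    auto r = [iota(1, 10).map!(a => a.to!int).array,
              iota(1, 10).map!(a => a.to!int).array];
    static assert(is(typeof(r) == int[][]));
    assert(r[0].equal(r[1]));
}
```

Note that `.array` allocates and evaluates the whole range eagerly, so it only suits data that fits in memory.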

Thanks,

Jordan

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Chris Piker
in reply to Jordan Wilson

On Sunday, 16 May 2021 at 09:17:47 UTC, Jordan Wilson wrote:

> Another example:
>
> auto r = [iota(1,10).map!(a => a.to!int),iota(1,10).map!(a => a.to!int)]; // compile error

Hi Jordan

Nice succinct example. Thanks for looking at the code :)

So, honest question. Does it strike you as odd that the exact same range definition is considered to be two different types?

Maybe that's eminently reasonable to those with deep knowledge, but it seems crazy to a new D programmer. It breaks a general assumption about programming when copying and pasting a definition yields two things that aren't the same type. (except in rare cases like SQL where null != null.)

On a side note, I appreciate that .array solves the problem, but I'm writing pipelines that are supposed to work on arbitrarily long data sets (> 1.4 TB is not uncommon).

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by SealabJaster
in reply to Chris Piker

On Sunday, 16 May 2021 at 09:55:31 UTC, Chris Piker wrote:

It's due to a quirk with passing lambdas as template arguments. Each lambda is actually separated into its own function.

It's kind of hard to explain, but examine this code:

// runnable version: https://run.dlang.io/is/NbU3iT
struct S(alias Func)
{
    pragma(msg, __traits(identifier, Func));
}

int func(int a)
{
    return a*2;
}

void main()
{
    auto a = S!(a => a*2)();
    auto b = S!(a => a*2)();

    // Comment above. Then uncomment below for a working version.
    /*
    auto a = S!func();
    auto b = S!func();
    */

    pragma(msg, typeof(a));
    pragma(msg, typeof(b));
    a = b;
}

In its given state, this produces the following output:

__lambda1
__lambda3
S!((a) => a * 2)
S!((a) => a * 2)
onlineapp.d(24): Error: cannot implicitly convert expression `b` of type `onlineapp.main.S!((a) => a * 2)` to `onlineapp.main.S!((a) => a * 2)`

So while typeof(a) and typeof(b) look like they are the same, in actuality you can see that auto a uses __lambda1, whereas auto b uses __lambda3.

This means that, even though visually they should be equal, they are in fact two entirely separate types.

So if you had a nested Result struct, it'd look more like S!__lambda1.Result and S!__lambda3.Result, instead of just S!IDENTICAL_LAMBDA.Result.

Confusing, I know...

So if we change this to using a non-lambda function (doing the commenting/uncommenting as mentioned in the code) then we get successful output:

func
S!(func)
S!(func)
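The same fix carries over to real range pipelines. Here is a small sketch, where `twice` is a hypothetical named stand-in for the repeated `a => a*2` lambda and the ranges come from Phobos' `iota` and `map`:

```d
import std.algorithm : equal, map;
import std.range : iota;

// Hypothetical named stand-in for the repeated `a => a*2` lambda.
int twice(int a) { return a * 2; }

void main()
{
    auto a = iota(0, 5).map!twice;
    auto b = iota(5, 10).map!twice;

    // Both instantiations name the same symbol, so the types agree.
    static assert(is(typeof(a) == typeof(b)));

    auto both = [a, b];   // a shared array type works
    assert(both.length == 2);

    a = b;                // and so does assignment
    assert(a.equal([10, 12, 14, 16, 18]));
}
```

Both ranges stay lazy here; nothing is copied into an array of elements.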

p.s. I love that you can debug D within D.

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Jordan Wilson
in reply to Chris Piker

On Sunday, 16 May 2021 at 09:55:31 UTC, Chris Piker wrote:

> On Sunday, 16 May 2021 at 09:17:47 UTC, Jordan Wilson wrote:
>
>> Another example:
>>
>> auto r = [iota(1,10).map!(a => a.to!int),iota(1,10).map!(a => a.to!int)]; // compile error
>
> Hi Jordan
>
> Nice succinct example. Thanks for looking at the code :)
>
> So, honest question. Does it strike you as odd that the exact same range definition is considered to be two different types?
>
> On a side note, I appreciate that .array solves the problem, but I'm writing pipelines that are supposed to work on arbitrarily long data sets (> 1.4 TB is not uncommon).

There are those far more learned than me who could help explain. But in short, yes, it did take a little getting used to. I would recommend reading up on Voldemort types in D.

Ironically, use of Voldemort types and range-based programming is what helps me perform large data processing.

Jordan

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Patrick Schluter
in reply to Chris Piker

On Sunday, 16 May 2021 at 09:55:31 UTC, Chris Piker wrote:

> On Sunday, 16 May 2021 at 09:17:47 UTC, Jordan Wilson wrote:
>
>> Another example:
>>
>> auto r = [iota(1,10).map!(a => a.to!int),iota(1,10).map!(a => a.to!int)]; // compile error
>
> Hi Jordan
>
> Nice succinct example. Thanks for looking at the code :)
>
> So, honest question. Does it strike you as odd that the exact same range definition is considered to be two different types?

Even in C,
```
typedef struct {
    int a;
} type1;
```
and
```
typedef struct {
    int a;
} type2;
```

are two different types. The compiler will give an error if you pass one to a function expecting the other.

void fun(type1 v)
{
}

type2 x;

fun(x);  // gives error

See https://godbolt.org/z/eWenEW6q1

> On a side note, I appreciate that .array solves the problem, but I'm writing pipelines that are supposed to work on arbitrarily long data sets (> 1.4 TB is not uncommon).

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Chris Piker
in reply to SealabJaster

On Sunday, 16 May 2021 at 10:10:54 UTC, SealabJaster wrote:

> It's due to a quirk with passing lambdas as template arguments. Each lambda is actually separated into its own function.

Hey that was a very well laid out example.

Okay, I think the light is starting to dawn. So if I use lambdas as real arguments (not type arguments) then I'm just storing function pointers and I'm back to C land where I understand what's going on. Though I may lose some safety, I gain some sanity.

For example:

alias EXPLICIT_TYPE = int function (int);

struct Tplt(FT)
{
   FT f;
}

void main()
{
   auto a = Tplt!(EXPLICIT_TYPE)( a => a+3);
   auto b = Tplt!(EXPLICIT_TYPE)( a => a*2);
	
   a = b; // Lambdas as arguments instead of types works
}

Here's the non-lambda version of your example that helped me to understand what's going on, and how the function being called gets mashed into the type (even though typeid doesn't tell us that's what happens):

struct S(alias Func)
{
   pragma(msg, __traits(identifier, Func));
}

int func1(int a){ return a*2; }

int func2(int a){ return a*2; }

void main()
{
   auto a = S!func1();
   auto b = S!func2();

   pragma(msg, typeof(a));
   pragma(msg, typeof(b));
   a = b;
}

I'm going to go above my station and call this a bug in typeof/typeid. If the function name is part of the type, that should be stated explicitly to make the error messages more clear. We depend on those type names in compiler messages to understand what's going on.

Cheers,

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by Adam D. Ruppe
in reply to Chris Piker

On Sunday, 16 May 2021 at 12:54:19 UTC, Chris Piker wrote:

> a = b; // Lambdas as arguments instead of types works

Wait a sec, when you do the

auto a = S!(a => a*2)();

That's not actually passing a type. That's passing the (hidden) name of an on-the-spot-created function template as a compile time parameter. If it was just a type, you'd probably be ok!

It is that on-the-spot-created bit that trips you up. The compiler doesn't bother looking at the content of the lambda to see if it is repeated from something earlier. It just sees that shorthand syntax and blindly expands it to:

static int __some_internal_name_created_on_line_5(int a) {
   return a*2;
}

(even that isn't entirely true, because you didn't specify the type of a in your example, meaning the compiler actually passes

template __some_internal_name_created_on_line_5(typeof_a) {
   auto __some_internal_name_created_on_line_5(typeof_a a) {
      return a*2;
   }
}

and that template is instantiated with the type the range passes to the lambda - inside the range's implementation - to create the actual function that it ends up calling.

but that's not really important to the point since you'd get the same thing even if you did specify the types in this situation.)

Anyway, when you repeat it later, it makes another:

static int __some_internal_name_created_on_line_8(int a) {
   return a*2;
}

And passes that. Wanna know what's really nuts? If it is made in the context of another template, even being on the same line won't save you from duplicates. It creates a new copy of the lambda for each and every distinct context it sees. Same thing in a different object? Another function. Different line? Another function. Different template argument in the surrounding function? Yet another function.

At my day job, one little (a,b) => a < b sorting lambda at one point exploded into two gigabytes of identical generated functions in the compiler's memory, and over 100 MB in the generated object files. Simply moving it out to a top-level function eliminated all that bloat... most of us could barely believe such a little thing had such a profound impact.

It would be nice if the compiler could collapse those duplicates by itself, wouldn't it? But...

void main() {
    auto a = (int arg) => arg + 1;
    auto b = (int arg) => arg + 1;

    assert(a is b);
}

Should that assert pass? Are those two actually the same function? Right now it does NOT, but should it? That's a question for philosophers and theologians, way above my pay grade.

Then a practical worry, how does the compiler tell if two lambdas are actually identical? There's a surprising number of cases that look obvious to us, but aren't actually. Suppose it refers to a different local variable. Or ends up with a different type of argument. Or what if they came from separate compilation units? It is legitimately more complex than it seems at first glance.

I digress again... where was I?

Oh yeah, since it is passing an alias to the function to the range instead of the type, the fact that they're considered distinct entities - even if just because the implementation is lazy and considers that one was created on line 5 and one was created on line 8 to be an irreconcilable difference - means the range based on that alias now has its own distinct type.

Indeed, passing the lambda as a runtime arg fixes this to some extent since at least then the types match up. But there's still a bit of generated code bloat (not NEARLY as much! but still a bit).

For best results, declare your own function as close to top-level as you can with as many discrete types as you can, and give it a name. Don't expect the compiler to automatically factor things out for you. (For now. I still kinda hope the implementation can improve someday.)

This is obviously more of a hassle. Even with a runtime param you have to specify more than just a => a*2... minimally like (int a) => a*2.
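As a rough illustration of the "name it at top level" advice, here is a sketch using the sorting case. `lessThan` is a hypothetical name, and the only claim checked is that every use shares one instantiation type:

```d
import std.algorithm : sort;

// Top-level, fully typed comparison: one symbol, reused everywhere,
// instead of a fresh (a, b) => a < b lambda at every call site.
bool lessThan(int a, int b) { return a < b; }

void main()
{
    auto xs = [3, 1, 2];
    auto ys = [9, 7, 8];

    auto sx = xs.sort!lessThan();
    auto sy = ys.sort!lessThan();

    // Both calls instantiate sort with the same symbol, so the resulting
    // SortedRange types are identical.
    static assert(is(typeof(sx) == typeof(sy)));

    assert(xs == [1, 2, 3]);
    assert(ys == [7, 8, 9]);
}
```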

> struct S(alias Func)
> {
>    pragma(msg, __traits(identifier, Func));
> }
>
> int func1(int a){ return a*2; }
>
> int func2(int a){ return a*2; }
>
> void main()
> {
>    auto a = S!func1();
>    auto b = S!func2();
>
>    pragma(msg, typeof(a));
>    pragma(msg, typeof(b));
>    a = b;
> }
>
> I'm going to go above my station and call this a bug in typeof/typeid.

Wait, what's the bug there? The typeof DOES tell you they are separate.

Error: cannot implicitly convert expression b of type S!(func2) to S!(func1)

Just remember it isn't the function name per se, it is the symbol alias that S is taking. Which means it is different for each symbol passed... The alias in the parameter list tells you it is making a new type for each param. Same as if you did

struct S(int a) {}

S!1 would be a distinct type from S!2.
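A compact way to check this at compile time is sketched below. `ByValue` and `BySymbol` are hypothetical names standing in for the two kinds of S above:

```d
// Hypothetical templates: one takes a value parameter, one an alias.
struct ByValue(int n) {}
struct BySymbol(alias fn) {}

int func1(int a) { return a * 2; }
int func2(int a) { return a * 2; }

// Same argument, same type; different argument, different type.
static assert( is(ByValue!1 == ByValue!1));
static assert(!is(ByValue!1 == ByValue!2));

// An alias parameter behaves the same way: the symbol is the argument,
// so identical function bodies do not make the types equal.
static assert( is(BySymbol!func1 == BySymbol!func1));
static assert(!is(BySymbol!func1 == BySymbol!func2));

void main() {}
```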

May 16, 2021

Re: Recommendations on avoiding range pipeline type hell

Posted by H. S. Teoh
in reply to Chris Piker

On Sat, May 15, 2021 at 11:25:10AM +0000, Chris Piker via Digitalmars-d-learn wrote: [...]
> Basically the issue is that if one attempts to make a range based pipeline aka:
> 
> ```d
> auto mega_range = range1.range2!(lambda2).range3!(lambda3);
> ```
> Then the type definition of mega_range is something in the order of:
> 
> ```d
>   TYPE_range3!( TYPE_range2!( TYPE_range1, TYPE_lambda2 ), TYPE_lambda3 );
> ```
> So the type tree builds to infinity and the type of `range3` is very
> much determined by the lambda I gave to `range2`.  To me this seems
> kinda crazy.

Perhaps it's crazy, but as others have mentioned, these are Voldemort types; you're not *meant* to know what the concrete type is, merely that it satisfies the range API. It's sorta kinda like the compile-time functional analogue of a Java-style interface: you're not meant to know what the concrete derived type is, just that it implements that interface.

[...]
> But, loops are bad.  On the D blog I've seen knowledgeable people say all loops are bugs.

I wouldn't say all loops are bugs. If they were, why does D still have looping constructs? :D  But it's true that most loops should be refactored into functional-style components instead. Nested loops are especially evil if written carelessly or uncontrollably.

> But how do you get rid of them without descending into Type Hell(tm).

Generally, when using ranges you just let the compiler infer the type for you, usually with `auto`:

	auto mySuperLongPipeline = inputData
		.map!(...)
		.filter!(...)
		.splitter!(...)
		.joiner!(...)
		.whateverElseYouGot();
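A small runnable sketch of that idea, with a hypothetical helper `evenSquares`. The pipeline's type is never written out: the function's return type is inferred, and `typeof` recovers a name for it only where a declaration needs one:

```d
import std.algorithm : equal, filter, map, sum;
import std.range : iota;

// Hypothetical helper: the return type is inferred, never spelled out.
auto evenSquares(int limit)
{
    return iota(0, limit)
        .filter!(n => n % 2 == 0)
        .map!(n => n * n);
}

void main()
{
    auto pipeline = evenSquares(10);
    assert(pipeline.equal([0, 4, 16, 36, 64]));

    // If a declaration really needs the type, ask the compiler for it.
    alias PipelineType = typeof(pipeline);
    PipelineType copy = pipeline;
    assert(copy.sum == 120);
}
```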

> Is there any way to get some type erasure on the stack?

You can wrap a range in a heap-allocated OO object using the helpers in std.range.interfaces, e.g., .inputRangeObject.  Then you can use the interface as a handle to refer to the range.
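For instance, here is a minimal sketch, with `doubles` and `evens` as hypothetical helpers. Two pipelines with different concrete types are both handed out as `InputRange!int`, so they can be assigned to each other while staying lazy:

```d
import std.algorithm : equal, filter, map;
import std.range : iota;
import std.range.interfaces : InputRange, inputRangeObject;

// Hypothetical helpers: different pipelines, one declared return type.
InputRange!int doubles(int n)
{
    return inputRangeObject(iota(0, n).map!(x => x * 2));
}

InputRange!int evens(int n)
{
    return inputRangeObject(iota(0, n).filter!(x => x % 2 == 0));
}

void main()
{
    auto a = doubles(3);   // yields 0, 2, 4
    auto b = evens(5);     // yields 0, 2, 4

    // The concrete pipeline types differ, but the handles do not.
    static assert(is(typeof(a) == typeof(b)));

    a = b;                 // allowed: both are InputRange!int
    assert(a.equal([0, 2, 4]));
}
```

The trade-off is one heap allocation per wrapped range plus a virtual call per element, in exchange for a single short type name.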

Once I wrote a program almost entirely in a single pipeline.  It started from a math function, piped into an abstract 2D array (a generalization of ranges), filtered, transformed, mapped into a color scheme, superimposed on top of some rendered text, then piped to a pipeline-based implementation of PNG-generation code that produced a range of bytes in a PNG file that's then piped into std.stdio.File.bufferedWrite.  The resulting type of the main pipeline was so hilariously huge, that in an older version of dmd it produced a mangled symbol several *megabytes* long (by that I mean the *name* of the symbol was measured in MB), not to mention tickled several O(N^2) algorithms in dmd that caused it to explode in memory consumption and slow down to an unusable crawl.

The mangled symbol problem was shortly fixed, probably partly due to my complaint about it :-P -- kudos to Rainer for the fix!

Eventually I inserted this line into my code:

        .arrayObject    // Behold, type erasure!

(which is my equivalent of .inputRangeObject) and immediately observed a significant speedup in compilation time and reduction in executable size. :-D

The pipeline-based PNG emitter also leaves a lot to be desired in terms of runtime speed... if I were to do this again, I'd go for a traditional imperative-style PNG generator with hand-coded loops instead of the fancy pipeline-based one I wrote.

Pipelines are all good and everything, but sometimes you *really* just need a good ole traditional OO-style heap allocation and hand-written loop.  Don't pick a tool just because of the idealism behind it, I say, pick the tool best suited for the job.

T

-- 
Computers aren't intelligent; they only think they are.
