Thread overview
Re: toStringz note about keeping references
Oct 14, 2012
Jonathan M Davis
Oct 14, 2012
Andrej Mitrovic
Oct 14, 2012
Jonathan M Davis
Oct 14, 2012
Andrej Mitrovic
Oct 14, 2012
Ali Çehreli
Oct 16, 2012
Charles Hixson
Oct 14, 2012
Jonathan M Davis
Oct 15, 2012
Jacob Carlborg
Oct 15, 2012
Andrej Mitrovic
Oct 15, 2012
Jonathan M Davis
October 14, 2012
On Sunday, October 14, 2012 23:38:48 Andrej Mitrovic wrote:
> toStringz takes a string (immutable(char)[]), and the GC will not
> reclaim immutable data until app exit.

If the GC never collects immutable data which has no references to it until the app closes, then there's a serious problem. Immutability shouldn't factor into that at all. Anything and everything with no references to it any longer should be up for collection. Any other behavior would effectively result in huge memory leaks - especially if you're doing a lot with strings. Maybe you're seeing the behavior that you're seeing because the GC mistakenly thinks that something points to it (as happens sometimes - especially in 32-bit programs).

- Jonathan M Davis
October 14, 2012
On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> Anything and everything with no references to it any
> longer should be up for collection.

I think this is fuzzy territory and it's a good opportunity to properly document GC behavior.

For example, TDPL states that immutable data is always available, and that a user should treat such data as if it existed throughout the lifetime of the program.
October 14, 2012
On Monday, October 15, 2012 00:51:34 Andrej Mitrovic wrote:
> On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > Anything and everything with no references to it any
> > longer should be up for collection.
> 
> I think this is fuzzy territory and it's a good opportunity to properly document GC behavior.

I don't see how it could be fuzzy at all. It makes no sense whatsoever to keep _any_ data around once it has nothing referencing it. The constness of an object should have _zero_ effect on its scope or lifetime. immutable _does_ implicitly make a variable shared, but that has nothing to do with the object's lifetime. That just makes it so that it can be used across threads. Once no more threads reference it, it should be collected. Keeping the data around would simply result in the effective equivalent of leaked memory.

> For example, TDPL states that immutable data is always available, and that a user should treat such data as if it existed throughout the lifetime of the program.

I'd have to see exactly what TDPL says to comment on that accurately, but if it means that all immutable data is expected to be around for the entire lifetime of the program, then that's a huge problem. If it's something that was specifically allocated in ROM when the program started, then that makes sense (e.g. string literals on Linux), but nothing allocated on the normal GC heap should be kept alive simply because it's immutable. References to it must exist.

- Jonathan M Davis
October 14, 2012
On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> I'd have to see exactly what TDPL says to comment on that accurately

Maybe I've misread it. On Page 288 it says:

"An immutable value is cast in stone: as soon as it's been initialized, you may as well consider it has been burned forever into the memory storing it. It will never change throughout the execution of the program."

Perhaps what was missing is: "as long as there is a reference to that data".

I'd really like to know for sure if the GC implementation actually collects immutable data or not. I've always used toStringz in direct calls to C without caring about keeping a reference to the source string in D code. I'm sure others have used it like this as well. Maybe the only reason my apps which use C don't crash is because a GC cycle doesn't often run, and when it does run it doesn't collect the source string data (either on purpose or because of buggy behavior, or because the GC is imprecise).

Anyway this stuff is important for OOP wrappers of C/C++ libraries. If the string reference must be kept on the D side then this makes writing wrappers harder. For example, let's say you've had this type of wrapper:

import std.string : toStringz;

extern(C) void* get_Foo_obj();
extern(C) void* c_Foo_test(void* c_obj, const(char)* input);

class Foo
{
    // init the C object by calling a C function
    this() { c_Foo_obj = get_Foo_obj(); }

    void test(string input)
    {
        c_Foo_test(c_Foo_obj, toStringz(input));
    }

    void* c_Foo_obj;  // reference to C object
}

Should we always store a reference to 'input' to avoid GC collection? E.g.:

class Foo
{
    // init the C object by calling a C function
    this() { c_Foo_obj = get_Foo_obj(); }

    void test(string input)
    {
        input_ref = input;
        c_Foo_test(c_Foo_obj, toStringz(input));
    }

    string input_ref;  // keep it alive, C might use it after test() returns
    void* c_Foo_obj;  // reference to C object
}

And what about multiple calls? What if on each call to c_Foo_test() the C library stores each 'input' pointer internally? That would mean we have to keep an array of these pointers on the D side.

It's not known what the C library does without inspecting the source of the C library. So it becomes very difficult to write wrappers which are GC-safe.

There are wrappers out there that seem to expect the source won't be collected. For example GtkD also uses toStringz in calls to C without ever storing a reference to the input string.
October 14, 2012
On 10/14/2012 04:36 PM, Andrej Mitrovic wrote:
> On 10/15/12, Jonathan M Davis<jmdavisProg@gmx.com>  wrote:
>> I'd have to see exactly what TDPL says to comment on that accurately
>
> Maybe I've misread it. On Page 288 it says:
>
> "An immutable value is cast in stone: as soon as it's been
> initialized, you may as well
> consider it has been burned forever into the memory storing it. It
> will never change
> throughout the execution of the program."
>
> Perhaps what was missing is: "as long as there is a reference to that data".

Andrei must have written that with only D in mind, without any C interaction. When we consider only D, the statement is correct: if there are no more references, how can the application tell whether the data is gone or not?

> I'd really like to know for sure if the GC implementation actually
> collects immutable data or not.

It does. Should be easy to test with an infinite loop that generates immutable data.
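
A minimal sketch of such a test, using a bounded loop rather than an infinite one. The helper name usedAfterChurn and the padding text are made up for illustration, and GC.stats() is assumed to be available (it is part of core.memory in current druntime releases). If immutable data were never collected, the reported figure would grow roughly in proportion to the number of iterations.

```d
import core.memory : GC;
import std.conv : to;
import std.stdio : writeln;

// Allocates `n` short-lived immutable strings, forces a collection,
// and returns how many bytes the GC heap still counts as in use.
// (Hypothetical helper, not part of any library.)
size_t usedAfterChurn(size_t n)
{
    foreach (i; 0 .. n)
    {
        // A fresh GC allocation whose only reference dies with the iteration.
        immutable string s = i.to!string ~ " padding so each allocation is not tiny";
        assert(s.length > 0);
    }
    GC.collect();
    return GC.stats().usedSize;
}

void main()
{
    writeln("bytes in use after churn: ", usedAfterChurn(1_000_000));
}
```

Comparing the result for different values of n gives a rough idea of whether the unreferenced strings are being reclaimed.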

> I've always used toStringz in direct
> calls to C without caring about keeping a reference to the source
> string in D code. I'm sure others have used it like this as well.

It depends on whether the C-side keeps a copy of that pointer.

> Maybe the only reason my apps which use C don't crash is because a GC
> cycle doesn't often run, and when it does run it doesn't collect the
> source string data (either on purpose or because of buggy behavior, or
> because the GC is imprecise).
>
> Anyway this stuff is important for OOP wrappers of C/C++ libraries. If
> the string reference must kept on the D side then this makes writing
> wrappers harder. For example, let's say you've had this type of
> wrapper:
>
> extern(C) void* get_Foo_obj();
> extern(C) void* c_Foo_test(void* c_obj, const(char)* input);
>
> class Foo
> {
>      this() { c_Foo_obj = get_Foo_obj(); } // init c object by calling
> a C function
>
>      void test(string input)
>      {
>          c_Foo_test(c_Foo_obj, toStringz(input));
>      }
>
>      void* c_Foo_obj;  // reference to C object
> }
>
> Should we always store a reference to 'input' to avoid GC collection?

If the C function copies the pointer, yes.

> E.g.:
>
> class Foo
> {
>      this() { c_Foo_obj = get_Foo_obj(); } // init c object by calling
> a C function
>
>      void test(string input)
>      {
>          input_ref = input
>          c_Foo_test(c_Foo_obj, toStringz(input));
>      }
>
>      string input_ref;  // keep it alive, C might use it after test() returns

That's exactly what I do in a C++ library that wraps C types.

>      void* c_Foo_obj;  // reference to C object
> }
>
> And what about multiple calls? What if on each call to c_Foo_test()
> the C library stores each 'input' pointer internally? That would mean
> we have to keep an array of these pointers on the D side.

Again, that's exactly what I do in C++. :) There is a global container that keeps the objects alive.

> It's not know what the C library does without inspecting the source of
> the C library. So it becomes very difficult to write wrappers which
> are GC-safe.

Most functions document what they do with the input parameters. If not, it is usually obvious.

> There are wrappers out there that seem to expect the source won't be
> collected. For example GtkD also uses toStringz in calls to C without
> ever storing a reference to the input string.

Must be verified case-by-case.

Ali

October 14, 2012
On Monday, October 15, 2012 01:36:27 Andrej Mitrovic wrote:
> On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > I'd have to see exactly what TDPL says to comment on that accurately
> 
> Maybe I've misread it. On Page 288 it says:
> 
> "An immutable value is cast in stone: as soon as it's been
> initialized, you may as well
> consider it has been burned forever into the memory storing it. It
> will never change
> throughout the execution of the program."
> 
> Perhaps what was missing is: "as long as there is a reference to that data".

That says _nothing_ about collection. It's only saying that the value won't ever change. It's trying to highlight the difference between const and immutable. It would make _no_ sense for immutable data to not be collected when all references to it were gone.

> I'd really like to know for sure if the GC implementation actually collects immutable data or not.

I guarantee that if it doesn't, it's a bug. There are exceptions (e.g. string literals in Linux - because they go in ROM), and the GC isn't exactly enthusiastic about reclaiming memory, which means that stuff can hang around for quite a while, but normally, immutability should have no effect on the lifetime of an object.

> I've always used toStringz in direct
> calls to C without caring about keeping a reference to the source
> string in D code.

It's perfectly safe as long as the C function doesn't hold on to the pointer. If it does, then you could get screwed later on when that pointer gets used, and whether it works or not then becomes non-deterministic (since it depends on whether the GC has collected the memory or not and whether that memory has been reused) which could cause some really nasty bugs. That's why the note on toStringz is there in the first place. It would not surprise me at all if it's a common bug when interfacing with C that references are not kept around when they should be. I suspect that the main reason that it doesn't cause more issues is because most C functions don't keep pointers around.

> Anyway this stuff is important for OOP wrappers of C/C++ libraries. If the string reference must kept on the D side then this makes writing wrappers harder.

That's true, but to some extent, that's just life when dealing with interfacing with code outside of the GC's reach.

However, I believe that another option is to explicitly tell the GC not to collect a chunk of memory (glancing at core.memory, I suspect that removeRoot is the function to use for that, but I've never done it before, so I'm not well acquainted with the details). But if you want it to ever be collected, you'd need to make sure that it was readded to the GC again later, which could also complicate wrappers. It _is_ another option though if keeping a reference around in the D code is problematic.

> And what about multiple calls? What if on each call to c_Foo_test() the C library stores each 'input' pointer internally? That would mean we have to keep an array of these pointers on the D side.

Potentially, yes.
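
A rough sketch of what that could look like, under a few assumptions: c_register_name is a hypothetical stand-in for a C function that stores its argument, and NameRegistry is an invented wrapper class. One detail worth noting is that toStringz may return a newly allocated zero-terminated copy, so the sketch keeps the returned pointer rather than the original string.

```d
import std.string : toStringz;

// Hypothetical stand-in for a C library call that remembers its argument.
// A real library would keep the pointer in memory the D GC does not scan.
extern (C) void c_register_name(const(char)* name) nothrow @nogc
{
}

class NameRegistry
{
    // Every pointer handed to C is also kept here so the GC can see it.
    private const(char)*[] held;

    void add(string name)
    {
        // toStringz may allocate a copy, so the pointer it returns is
        // what must stay reachable from D, not necessarily `name`.
        auto p = toStringz(name);
        held ~= p;
        c_register_name(p);
    }

    size_t heldCount() const { return held.length; }
}

void main()
{
    auto r = new NameRegistry;
    r.add("alpha");
    r.add("beta");
    assert(r.heldCount == 2);
}
```

The strings stay alive for as long as the registry object itself is reachable, so the wrapper's lifetime has to cover the period in which the C side may use the pointers.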

> It's not know what the C library does without inspecting the source of the C library. So it becomes very difficult to write wrappers which are GC-safe.

Unfortunately, that's true. However, remember that in C, you normally have to manage your own memory, so if C functions aren't appropriately clear about who owns what memory or which pointers get kept, then they'll run into serious problems in pure C. So, in general, I would expect a C function to be fairly clear when it keeps a pointer around or gives you a pointer to memory that it allocated or controls. But it's fairly rare that C functions keep pointers around (that would mean using global variables which are generally rare), so in most cases, it's a non-issue.

> There are wrappers out there that seem to expect the source won't be collected. For example GtkD also uses toStringz in calls to C without ever storing a reference to the input string.

As long as the function doesn't keep any of the pointers that it's given, then it's fine. If it _does_ keep a pointer around, then it's a bug for the D code not to keep a reference around. But as I said, it's fairly rare for C code to do that, which is probably why this doesn't cause more issues. But the note on toStringz is there precisely because most people aren't going to think of that problem, and they need to be aware of it when using toStringz.

- Jonathan M Davis
October 15, 2012
On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> snip

Hmm ok, this sheds some light on things.

If a C function takes a const pointer and has no documentation about ownership then maybe it's a good guess to say it won't store that pointer anywhere and will only use it as a temporary?
October 15, 2012
On Monday, October 15, 2012 02:04:44 Andrej Mitrovic wrote:
> On 10/15/12, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> > snip
> 
> Hmm ok, this sheds some light on things.
> 
> If a C function takes a const pointer and has no documentation about ownership then maybe it's a good guess to say it won't store that pointer anywhere and will only use it as a temporary?

Generally speaking, yes. It's rare for them to keep pointers around, and if they do and don't tell you, then they're creating bugs in C code too. Most C functions (especially if you're talking OS functions) which keep memory that you pass to them or give you memory that you didn't pass to them will inform you about it in their documentation.

- Jonathan M Davis
October 15, 2012
On 2012-10-15 01:56, Jonathan M Davis wrote:

> However, I believe that another option is to explicitly tell  the GC not
> collect a chunk of memory (glancing at core.memory, I suspect that removeRoot
> is the function to use for that, but I've never done it before, so I'm not
> well acquainted with the details).

It's the other way around. The GC will collect anything not reachable from the roots. The correct way is to use "addRoot".
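
A small sketch of that approach, assuming the GC.addRoot and GC.removeRoot functions from core.memory. The helper names pinForC and unpin are invented for illustration.

```d
import core.memory : GC;
import core.stdc.string : strlen;
import std.string : toStringz;

// Converts `s` to a C string and registers the result as a GC root,
// so the block stays alive even with no other D reference to it.
// (Hypothetical helper.)
const(char)* pinForC(string s)
{
    auto p = toStringz(s);
    GC.addRoot(p);
    return p;
}

// Removes the root again so the memory can be collected once
// nothing else references it. (Hypothetical helper.)
void unpin(const(char)* p)
{
    GC.removeRoot(p);
}

void main()
{
    auto p = pinForC("hello".idup);
    GC.collect();               // the rooted block is still considered live
    assert(strlen(p) == 5);
    unpin(p);
}
```

Every pinForC needs a matching unpin once the C side is done with the pointer, otherwise the memory is never released.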

-- 
/Jacob Carlborg
October 16, 2012
On 10/14/2012 04:54 PM, Ali Çehreli wrote:
> On 10/14/2012 04:36 PM, Andrej Mitrovic wrote:
>  > On 10/15/12, Jonathan M Davis<jmdavisProg@gmx.com> wrote:
>  >> I'd have to see exactly what TDPL says to comment on that accurately
>  >
>  > Maybe I've misread it. On Page 288 it says:
>  >
>  > "An immutable value is cast in stone: as soon as it's been
>  > initialized, you may as well
>  > consider it has been burned forever into the memory storing it. It
>  > will never change
>  > throughout the execution of the program."
>  >
>  > Perhaps what was missing is: "as long as there is a reference to that
> data".
>
> Andrei must have written that only D in mind, without any C interaction.
> When we consider only D, then the statement is correct: If there is no
> more references, how can the application tell that the data is gone or not?
>
>  > I'd really like to know for sure if the GC implementation actually
>  > collects immutable data or not.
>
> It does. Should be easy to test with an infinite loop that generates
> immutable data.
>
>  > I've always used toStringz in direct
>  > calls to C without caring about keeping a reference to the source
>  > string in D code. I'm sure others have used it like this as well.
>
> It depends on whether the C-side keeps a copy of that pointer.
>
>  > Maybe the only reason my apps which use C don't crash is because a GC
>  > cycle doesn't often run, and when it does run it doesn't collect the
>  > source string data (either on purpose or because of buggy behavior, or
>  > because the GC is imprecise).
>  >
>  > Anyway this stuff is important for OOP wrappers of C/C++ libraries. If
>  > the string reference must kept on the D side then this makes writing
>  > wrappers harder. For example, let's say you've had this type of
>  > wrapper:
>  >
>  > extern(C) void* get_Foo_obj();
>  > extern(C) void* c_Foo_test(void* c_obj, const(char)* input);
>  >
>  > class Foo
>  > {
>  > this() { c_Foo_obj = get_Foo_obj(); } // init c object by calling
>  > a C function
>  >
>  > void test(string input)
>  > {
>  > c_Foo_test(c_Foo_obj, toStringz(input));
>  > }
>  >
>  > void* c_Foo_obj; // reference to C object
>  > }
>  >
>  > Should we always store a reference to 'input' to avoid GC collection?
>
> If the C function copies the pointer, yes.
>
>  > E.g.:
>  >
>  > class Foo
>  > {
>  > this() { c_Foo_obj = get_Foo_obj(); } // init c object by calling
>  > a C function
>  >
>  > void test(string input)
>  > {
>  > input_ref = input
>  > c_Foo_test(c_Foo_obj, toStringz(input));
>  > }
>  >
>  > string input_ref; // keep it alive, C might use it after test() returns
>
> That's exactly what I do in a C++ library that wraps C types.
>
>  > void* c_Foo_obj; // reference to C object
>  > }
>  >
>  > And what about multiple calls? What if on each call to c_Foo_test()
>  > the C library stores each 'input' pointer internally? That would mean
>  > we have to keep an array of these pointers on the D side.
>
> Again, that's exactly what I do in C++. :) There is a global container
> that keeps the objects alive.
>
>  > It's not know what the C library does without inspecting the source of
>  > the C library. So it becomes very difficult to write wrappers which
>  > are GC-safe.
>
> Most functions document what they do with the input parameters. If not,
> it is usually obvious.
>
>  > There are wrappers out there that seem to expect the source won't be
>  > collected. For example GtkD also uses toStringz in calls to C without
>  > ever storing a reference to the input string.
>
> Must be verified case-by-case.
>
> Ali
>
There's a problem with this kind of ad hoc solution... If the library version changes, it can break things without changing the interface.

I think the real answer is that there needs to be a C layer between the D program and the library that copies any immutable data that may need to be kept, and only passes pointers to the copy to the C library.
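
As a rough illustration of the copying idea, here is a sketch that does the copy on the D side with malloc instead of in a separate C layer (a substitution for the C layer described above, not the same thing). The helper name copyToCHeap is made up. The copy lives outside the GC heap, so the collector never touches it, and it must be freed by hand.

```d
import core.stdc.stdlib : free, malloc;
import core.stdc.string : memcpy, strlen;

// Returns a zero-terminated copy of `s` allocated with malloc,
// or null if the allocation fails. The caller owns the result
// and must release it with free(). (Hypothetical helper.)
char* copyToCHeap(string s)
{
    auto p = cast(char*) malloc(s.length + 1);
    if (p is null)
        return null;
    if (s.length)
        memcpy(p, s.ptr, s.length);
    p[s.length] = '\0';
    return p;
}

void main()
{
    auto p = copyToCHeap("abc");
    assert(p !is null && strlen(p) == 3);
    free(p);
}
```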

OTOH, a nicer solution would be if there were a way of marking in D that "This item should never be garbage collected". There are ways to do approximately that, but unfortunately the ones I'm recalling only work on class instances. That's a clumsy way to do things. What seems better is a wrapper around an item that holds a reference, so that data is marked held. This could later be released. This is actually just "keeping a reference", but it's a bit of syntactic sugar to make it easy. Call the pair of functions "hold" and "release" or some such. (Actually, hold would need to do a bit more than just keep a reference. It would also need to ensure that the data wasn't on the stack, which might mean you were working with a duplicate of the data, but since it's immutable, that shouldn't matter.)