Thread overview
Is garbage detection a thing?
Mark (Nov 29, 2020)
Daniel N (Nov 29, 2020)
Mark (Nov 29, 2020)
Mark (Nov 29, 2020)
Mark (Nov 29, 2020)
Kagamin (Dec 01, 2020)
Daniel N (Nov 29, 2020)
Kagamin (Nov 29, 2020)
Elronnd (Nov 29, 2020)
Mark (Nov 29, 2020)
Bastiaan Veelo (Nov 29, 2020)
Mark (Nov 29, 2020)
Mark (Nov 29, 2020)

November 29, 2020 (Mark)
Hi,

can I ask you something general? I don't know anyone else whom I could ask: I'm a hobbyist with no science degree or job in computing, and I don't know any other programmers.

I don't have a good understanding of why "garbage collection" is a big thing while "garbage detection" is, as far as I can tell, not a thing at all.

I want to get rid of undefined behavior. So I ask myself: what is it, actually? Most of the time it is corrupted heap memory, and my C++ programs giving me errors that I had thought were kind of impossible.

Now I could follow all the C++ guidelines and almost everything would be okay. But many people went in different directions; in 1995, for example, Java was released, and there you would use garbage collection.

What I don't understand is this: when there exist tools for C++ today (allocator APIs for debugging purposes, AddressSanitizer, or maybe also MPX) that can detect that your program tried to use a memory address that was already freed and invalidated, why did Java and other languages not stop there, but instead build a system that keeps every address alive for as long as it is used?
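
For illustration, a minimal sketch of the kind of bug in question; it is written in D but uses the C heap, so no GC is involved:

    import core.stdc.stdlib : malloc, free;
    import core.stdc.stdio : printf;

    void main()
    {
        int* p = cast(int*) malloc(int.sizeof);
        *p = 42;
        free(p);            // the address is now invalid
        printf("%d\n", *p); // classic use-after-free: undefined behavior,
                            // yet it usually "works" without any diagnostic
    }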

One very minor criticism that I have: with GC there can be "semantically old data" (a problematic term, sorry) that is still alive and valid, and the language gives me the feeling that this makes it a nice system. But the overall behavior isn't necessarily correct; it's just much better than a corrupted heap, which could soon make everything crash.

My bigger criticism is simply that compilers with garbage collection are big software (with big libraries) and tend to have defects in other parts. For example, two different such compilers recently gave me wrong line numbers in error messages.

And other people's criticism (not really mine) is that garbage collection increases memory use and can block threads when they access shared memory, or something like that.

So... I wonder where the languages are that only try to give this type of error: your faulty program has, at runtime, used memory which has already been freed. Not garbage collection; the compiled program just stops all execution and tells me so, and I go on with my manual memory management.

Now, from today's perspective, I could use Rust to create a very formal representation of my requirements and get a program that is very deterministic while using very few resources.

But I'd like to pretend there is no Rust (because the lifetimes and some other things make it a domain-specific language to some extent), and instead I would like to ask about the "runtime solution".
Why shouldn't it be a good thing? Has it been tried?

All I would *need* to do additionally is divide the project into two builds, as is done with C++: a debug build and a release build.

Then the debug build would use a virtual machine that uses type information from compilation for garbage detection, but not garbage collection.
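
A minimal sketch of what such detection-without-collection could look like follows; all names are invented, and a real implementation would instrument every pointer dereference instead of relying on explicit check calls:

    import core.stdc.stdlib : malloc, free, abort;
    import core.stdc.stdio : fprintf, stderr;

    bool[void*] freed; // addresses that have been handed back

    void* debugAlloc(size_t n)
    {
        void* p = malloc(n);
        freed.remove(p); // malloc may recycle a previously freed address
        return p;
    }

    void debugFree(void* p)
    {
        freed[p] = true;
        free(p);
    }

    // The instrumented debug build would call this before every dereference.
    void checkAccess(void* p)
    {
        if (p in freed)
        {
            fprintf(stderr, "use after free: %p\n", p);
            abort(); // stop all execution, as proposed
        }
    }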

And when I have tested all the runtime cases of my compiled software, which runs slowly but quite deterministically, I will go on and make the release build.

And if the release build (which is faster) does not behave deterministically, I would fix the "VM/Non-VM compiler" I'm talking about until the release build shows the same behavior.

I guess there is a way this approach could fail: timing may have an influence and make the VM behave differently from the non-VM build (e.g. native x64). And it's surely not easy to write a compiler that generates code which traces pointers while still leaving you much freedom to cast and alter pointers. In some ways it is doomed to fail, but there are language constructs that do work.

There have been C interpreters, iterators as pointer replacements, or other replacements. By the way, I know of CINT and safe-c, but I'm not happy with how these projects look from the outside.

If I had the education and persistence, I would try to build my own "safe-c", yet another one. But I think it's better to ask you why garbage detection isn't a popular thing. Does it exist at all as the core idea of some language (probably an improved C)?

Where are the flaws in my thinking?

I currently think that if I were serious about it (I'm not 100% sure), I should just find a C interpreter. CINT? Or that one academic compiler from five years ago? (I believe that compiler needs a special CPU.) To be honest, I have no clue. Just some "interpreter" that mimics pointers as closely as it can; later I would be free to port the code to Microsoft's C.

Or maybe I could use the safe subset of D? But I believe D uses garbage collection. I know nothing about it, sorry.

What I tried in the past few days was porting working Go code to C. I wanted the C code to stay Go-idiomatic, so I was looking for the common subset of Golang and C. Well, I used macros and had a few ideas, but this C style quickly failed. Really frustrating. But... I'm not planning to give up. ;)

Thanks a lot for reading, and sorry for a lot of text that is off-topic and is not related to D.
November 29, 2020 (Daniel N)
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
>
> Thanks a lot for reading, and sorry for a lot of text that is off-topic and is not related to D.

Sounds like what you want is ASan? You can use it with plain C or with D (LDC).
https://clang.llvm.org/docs/AddressSanitizer.html
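
For example, the use-after-free sketch from the first post, saved as bug.d (file names invented; a C version would go through clang), could be compiled and run like this, assuming a clang or LDC build with sanitizer support:

    clang -g -fsanitize=address bug.c    (plain C)
    ldc2 -g -fsanitize=address bug.d     (D with LDC)

On the bad read, ASan aborts the program and prints a heap-use-after-free report with stack traces for the allocation, the free, and the faulting access.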

November 29, 2020 (Kagamin)
Maybe Ada.
November 29, 2020 (Mark)
On Sunday, 29 November 2020 at 16:21:59 UTC, Daniel N wrote:
> On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
>>
>> Thanks a lot for reading, and sorry for a lot of text that is off-topic and is not related to D.
>
> Sounds like what you want is ASan? You can use it with plain C or with D (LDC).
> https://clang.llvm.org/docs/AddressSanitizer.html

I could use AddressSanitizer indirectly by using Go. But their compiler gave me wrong line numbers for errors, and I have not yet overcome this psychologically, to be honest. They have a fixed version, but it's a work in progress that already uses generics.

So I went on and saw that Visual C++ now features AddressSanitizer. It showed faulty behavior very soon, a false positive as far as I remember; it's in an experimental stage. /d2MPX is out of early development, but it's not a prominent feature.

I went on with Nim, which then also gave me wrong line numbers in error messages. I'm not miscounting; they really do. Both Golang and Nim gave me errors that triggered me. ;)

I began a little C compiler project based on c4, knowing that I would be very old by the time it was ever finished.

Actually I am looking for a good compiler for Windows or maybe macOS, and looking at JAI, too.

Maybe I should just install Linux. But... the drivers... my Thinkpad just doesn't like any Linux. I'm running out of ideas.

All I wanted to do in the first place was make some music.

Kind regards
November 29, 2020 (Mark)
> I could use AddressSanitizer indirectly by using Go. But their

Oh wait, it was ThreadSanitizer that Go uses, right? I misspoke.

I would probably use ASAN under Linux, because that is the right thing to do?

Looking at Ada now.
November 29, 2020 (Mark)
> Looking at Ada now.

I found that Ada is not good for me: it has no augmented assignment. It's just that I want DRY, because I use very verbose variable names, and in the past I had a real-world case (a game in Lua) where I became frustrated at having to repeat the names. I understand that NASA and the like will repeat their variable names. They get paid. ;)

Kind regards
November 29, 2020 (Daniel N)
On Sunday, 29 November 2020 at 16:35:26 UTC, Mark wrote:
>
> Maybe I should just install Linux. But... the drivers... my Thinkpad just doesn't like any Linux. I'm running out of ideas.
>
> All I wanted to do in the first place was make some music.
>
> Kind regards

You could try a Linux image in VirtualBox or VMware to more easily evaluate whether Linux + ASan matches your expectations or is another dead end.

Regards,
Daniel
November 29, 2020 (Elronnd)
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
> I don't have a good understanding of why "garbage collection" is a big thing while "garbage detection" is, as far as I can tell, not a thing at all.

Because it's just as expensive to do garbage detection as automatic garbage collection.  So if you're going to go to the work of detecting when something is garbage, it's basically free to collect it at that point.


> when there exist tools for C++ today (allocator APIs for debugging purposes, AddressSanitizer, or maybe also MPX) that can detect that your program tried to use a memory address that was already freed and invalidated,

Note that address sanitizer is significantly slower than most ‘real’ GCs (such as those used by Java and others).


> why did Java and other languages not stop there, but instead build a system that keeps every address alive for as long as it is used?

> Then the debug build would use a virtual machine that uses type information from compilation for garbage detection, but not garbage collection.

Address sanitizer does exactly what you propose here.  The problem is this:

Testing can prove only the presence of bugs, never their absence.  You may run your C++ program a thousand times with address sanitizer enabled and get no errors; yet your code may still be incorrect and contain memory errors.  Safety features in a language--like a GC--prevent an entire class of bugs definitively.


> One very minor criticism that I have: with GC there can be "semantically old data" (a problematic term, sorry) that is still alive and valid, and the language gives me the feeling that this makes it a nice system. But the overall behavior isn't necessarily correct; it's just much better than a corrupted heap, which could soon make everything crash.

The distinction here is _reachability_ vs _liveness_.

So, GC theory:

A _graph_ is a type of data structure.  Imagine you have a sheet of paper, and on the sheet of paper you have a bunch of dots.  There are lines connecting some of the dots.  In graph theory, the dots are called nodes, and the lines are edges.  We say that nodes A and B are _connected_ if there is an edge going between them.  We also say that A is _reachable_ from B if either A and B are connected, or A is connected to some C, where C is reachable from B.  Basically, if you can reach one point from another just by following lines, then each is reachable from the other.

A _directed_ graph is one in which the edges have directionality.  Imagine the lines have little arrows at the ends.  There may be an edge that goes A -> B; or there may be an edge that goes B -> A.  Or there may be both: A <-> B.  (Or they can be unconnected.)  In this case, to reach one node from another, you have to follow the arrows.  So it may be that, starting at A, you can reach B; but you can't go the other way round.

The _heap_ is all the objects you've ever created.  This includes the objects you allocate with 'new', as well as all the objects you allocate from the stack and all your global variables.  What's interesting is that we can think of the heap as a directed graph.  If object A contains a pointer to object B, we can think of that the same way as there being an edge going from node A to node B.

The _root set_ is some relatively small number of heap objects that are always available.  Generally, this is all the global variables and stack-allocated objects.  The name _reachable_ is given to any object which is reachable from one of the root set.

It is impossible for your program to access an unreachable object; there's no way to get a pointer to it in the first place.  So it is safe for the GC to free unreachable objects.
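
A toy sketch of that reachability computation, with all names invented (real collectors are far more involved):

    struct Node
    {
        bool marked;
        Node*[] edges; // outgoing pointers: the directed edges
    }

    // Mark everything that can be reached by following arrows from the roots.
    void mark(Node* n)
    {
        if (n is null || n.marked)
            return;
        n.marked = true;
        foreach (child; n.edges)
            mark(child);
    }

    // Anything the marking never touched is unreachable, and since the
    // program can never see it again, it is safe to reclaim.
    void sweep(Node*[] roots, Node*[] heap)
    {
        foreach (r; roots)
            mark(r);
        foreach (obj; heap)
        {
            if (!obj.marked)
            {
                // reclaim obj here
            }
        }
    }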

But we can also add another category of objects: _live_ vs _dead_ objects.  Live objects are ones which you're actually going to access at some point.  Dead objects are objects that you're never going to access again, even if they're reachable.  If a GC could detect which reachable objects were dead, it would be able to be more efficient and use less memory...hypothetically.

The reason this distinction is important, and the reason I bring up graph theory, is that liveness is impossible to prove.  Seriously: it's impossible, in the general case, for the GC to prove that an object is still alive.  Whereas it's trivial to prove reachability.

Now, it is true that there are some cases where an object is dead but still reachable.  The fact of the matter is that in most such cases, the object becomes unreachable shortly thereafter.  In the cases where it doesn't, it tends to be impractical to prove that the object is dead.  The extra work it would take to prove deadness in such cases, if it were even possible, would make it not a worthwhile optimization.
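
A concrete sketch of reachable-but-dead, with invented names:

    int[][string] cache; // a global, hence part of the root set

    void main()
    {
        cache["huge"] = new int[](1_000_000);
        // Suppose the program never reads cache["huge"] again.  The
        // array is now dead, but it is still reachable through the
        // global, so every collection must keep it alive.  Only the
        // programmer can know that it is dead, and express it:
        // cache.remove("huge"); // unreachable now, hence collectable
    }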


> And when I have tested all the runtime cases of my compiled software, which runs slowly but quite deterministically, I will go on and make the release build.
>
> And if the release build (which is faster) does not behave deterministically, I would fix the "VM/Non-VM compiler" I'm talking about until the release build shows the same behavior.
>
> I guess there is a way this approach could fail: timing may have an influence and make the VM behave differently from the non-VM build (e.g. native x64).

I don't know why you're so hung up on timing.  It's easy to write code which isn't sensitive to timing, as long as you don't use threads.  That doesn't mean it's possible to test it exhaustively; see the above note about testing.
November 29, 2020 (Bastiaan Veelo)
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
> Hi,
>
> can I ask you something general? I don't know anyone else whom I could ask: I'm a hobbyist with no science degree or job in computing, and I don't know any other programmers.
>
> I don't have a good understanding of why "garbage collection" is a big thing while "garbage detection" is, as far as I can tell, not a thing at all.

In order to detect garbage, you need extensive run-time instrumentation, the difficulties of which you have indicated yourself. In addition, detection depends on circumstance, which is an argument against the debug/release strategy you proposed: there is no guarantee that you'll find all problems in the debug build. Garbage collection also comes at a runtime cost, but strategies exist to minimise it, and in addition a GC enables valuable language features. One such strategy is to minimise allocations, which improves performance under any memory management scheme.

[...]
> What I don't understand is this: when there exist tools for C++ today (allocator APIs for debugging purposes, AddressSanitizer, or maybe also MPX) that can detect that your program tried to use a memory address that was already freed and invalidated,
>
> why did Java and other languages not stop there, but instead build a system that keeps every address alive for as long as it is used?

Elimination of memory problems is much more valuable than detection. Recovering from memory errors at run time is unreliable.

> One very minor criticism that I have: with GC there can be "semantically old data" (a problematic term, sorry) that is still alive and valid, and the language gives me the feeling that this makes it a nice system. But the overall behavior isn't necessarily correct; it's just much better than a corrupted heap, which could soon make everything crash.

At least in D, you can keep old data from hanging around for too long. See core.memory.
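
For example, a small sketch of nudging the collector through core.memory once you know a large allocation is no longer needed:

    import core.memory : GC;

    void main()
    {
        auto buf = new ubyte[](64 * 1024 * 1024);
        // ... use buf ...
        buf = null;    // drop the last reference: the data is unreachable
        GC.collect();  // collect now rather than "eventually"
        GC.minimize(); // return freed pages to the OS where possible
    }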

> Or maybe I could use the safe subset of D? But I believe D uses garbage collection. I know nothing about it, sorry.

@safe D is not a subset, and indeed it uses garbage collection. The fact is that there are very few domains where this is a problem. Not all garbage collectors are equal either, so if you think garbage collection is bad in one language, that may not directly apply in another. In D the garbage collector is even pluggable; various implementations exist. Have you seen the GC category on the blog? https://dlang.org/blog/2017/03/20/dont-fear-the-reaper/

BetterC, on the other hand, is a subset of D, and it does not use garbage collection.

You may be interested in current work being done in static analysis of manual memory management in D: https://youtu.be/XQHAIglE9CU

The advantage of D is that all options are open. This allows the following approach:
1) Start development without worrying about memory. Should collection cycles be noticeable:
2) Profile your program and make strategic optimisations (https://youtu.be/dRORNQIB2wA). If this is not enough:
3) Force explicit collection in idle moments. If you need to go further:
4) Completely eliminate collection in hot loops using @nogc and/or GC.disable (see the sketch below this list). When even this is not enough:
5) Try another GC implementation. And if you really need to:
6) Switch to manual memory management where it matters.
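
A minimal sketch of steps 3 and 4; the game-loop framing is invented for illustration:

    import core.memory : GC;

    void frame() @nogc
    {
        // @nogc is checked at compile time: any GC allocation in here
        // is an error, so this hot path cannot trigger a collection.
    }

    void mainLoop()
    {
        GC.disable();         // step 4: suspend automatic collections
        foreach (i; 0 .. 1_000)
            frame();
        GC.enable();
        GC.collect();         // step 3: collect explicitly while idle
    }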

This makes starting a project in D a safe choice, in multiple meanings of the word.

— Bastiaan.
November 29, 2020 (Mark)
> The reason this distinction is important, and the reason I bring up graph theory, is that liveness is impossible to prove.
>  Seriously: it's impossible, in the general case, for the GC to prove that an object is still alive.  Whereas it's trivial to prove reachability.

My motivation was actually just that I wanted a very small compiler with no libraries, because I've grown a bit tired of big things with small defects.

But liveness is something I would like to comment on: I don't want a compiler that tries to prove it for me. The reason is that if I did manual memory management and created a use-after-free bug, my personal world would still be fine. It's only for the industry that releasing banana software is a serious problem, no?

In the case of a use-after-free, I would say that my software needs both correct memory state and correct logic. The logic, so to speak, is what I actually wanted to implement in the first place. If the program is perfect and works in all situations, how could it manage that with bugs in its memory management? It can't.

So, I hadn't really thought about it often, but when Rust came out there was some trend in this direction, and my feeling was: I can follow it, it looks good, but can I still just use C? The way it would work is that I get correct out-of-bounds and use-after-free/double-free detection, plus race-condition detection if my software is supposed to be chaotic server or browser software (?), all of it at runtime. Given that it is true (?) that software cannot do a task correctly when it does part of it (the memory management) incorrectly.

More or less it was an idea to get what Rust and Swift try to do, but without all the language features. C with fewer language constructs would be nice. It's just that the processors are bad for what I'm trying to do. Intel offers MPX; it's good, I guess. But why does Visual C++ implement it like an Easter egg? Why have they now added ASan as an experimental feature that immediately fails? They do things I don't really like, because the industry needs it that way, but I don't. And at the high level there's bloat and fancy colors.

I just want a toolkit to create exactly what can be created, and if I do it wrong, it should fail hard. And I'm trying to arrange things so that I don't have to write assembly, if that is possible and sensible in my situation.

I'd say I have just not understood the whole thing, and maybe I should try a different hobby, or finally create the thing I'm looking for, which is turning out to be an endless story. It's just that a few hours ago I still hoped the holy grail existed.

I had found the solution to almost every problem in Golang. And then, after one year, my hobby just broke down because Go outputs wrong line numbers. A compiler that does that is no good. I'd rather quit my hobby than accept it.

Thanks a lot for your explanation! You helped me to understand it.
Really kind regards.