Reference Counting, or why would I need a Garbage Collector?This is maybe the oldest strategy to find unused resources. It is still widely used in C/C++ programs and OS code. This strategy does exactly what the name implies, it counts how many references a resource has. When it drops to zero it can be cleaned up, usually by the thread that decreased the counter to zero. It is a beautifully simple design. The resources a thread has to clean up before resuming are very small, meaning that there are no noticable "hiccups" for the user. It appears as if the program runs seamlessly. Sounds great, but sadly Reference Counting also has its downsides:
- Referemce Counting is rarely supported on the language level. The compiler doesn't help to enforce correct increase/decrease of the counters. Nothing stops the programmer from creating memory leaks.
- Free Space needs to be managed, typically by a Freelist that is able to merge neighboring free areas. It also takes time to allocate a new object, since the program has to go through the list and find an area big enough.
- Cyclic dependencies are hard to find. There are solutions to this problem, like weak pointers in C++, but the programmer has to identify the cyclic dependancy and resolve it manually, or the program has a memory leak.
- The Heap tends to fragment as time goes on. It is possible to make it less of a problem by allocating as much on the stack as possible, like you can in C++. But when you are developing a server application that should run 24/7, fragmentation can become a real issue.
Simple Mark and Sweep Garbage CollectorJava's Garbage Collectors are using a Mark and Sweep algorithm. Every time the GC kicks in, the JVM and all Applications running on it get suspended. After that it cleans the heap in two phases
- Mark Phase:The GC marks all objects that are reachable
- Sweep Phase:The GC sweeps all unreachable objects
- References on the Callstack (A [tt]stackVar = new A();[/tt] ...)
- Static, global object referenes
- very short lived references ( [tt]return new Integer(3);[/tt])
- Java allows through the JNI (Java Native Interface) the invocation of C/C++ code which is also able to allocate Java objects. Those are also part of the Root Set
Allocation, STW Phases and Other Bad HabitsBut how does it find a new space for an object? Sadly, we are still dependant on a Freelist for managing our free spaces. Even worse, now every object is allocated on the heap. Not only do we still have fragmentation issues, but we made them worse. If that isn't enough, now we have Stop The World (STW) phases, the time span from the GC kicking in to it giving control back to the actual program are very noticable. Which is pretty bad if you are the provider of a streaming service, or playing a video game. Note that the Mark and Sweep GC is doing the same work as Reference Counting, the only difference is the distribution of when and how much each strategy cleans up. But what do we get? The GC automatically solves cyclic dependancies, if a cluster of objects isn't reachable, it can get deleted no matter how many references they have internally. The GC also hides memory management from the programmer, it is one thing less he or she has to worry about. That sounds great, but the Mark and Sweep GC did so by making everything else worse. If you think this can't be worth it, then you are right! Just for legacy, the first experimental Garbage Collectors worked in this way. I don't know if they were ever thrown in a productive system in this form... I hope not. They were devastatingly impractical and required a reboot after five hours or so, because the program managed to fragment the heap to hell. The problem does not lie the actual Mark and Sweep algorithm itself. This is nice and tidy and is used in every Garbage Collector I know of. So let us analyze the problem we are trying to solve, how many objects are allocated and when exactly are they becoming garbage?
Object LifetimeTypically a program allocates some objects at the start which stay alive during the whole runtime. There are also many objects that become useless shortly after their creation. The following diagram shows what a typical program usually does. The Axis are a bit puzzling, but think of the x axis as the age of a byte and the y axis the number of bytes for each age. For now, ignore the dashed lines [attachment=15315:190245.gif] What does this mean? The average computer program tends to have three different classes of objects. By far the largest amount of objects die shortly after they were created. Then follows the middle class, still moderately in size and stays useful for a little while. At last, we have the old class. Not as many objects as the other classes, but they stay in memory for a long while, some even as long as the program runs. Now it becomes clear why the simple Mark and Sweep GC fragmented even worse. At the start of the program, short, middle and long lived objects get allocated at roughly the same time. After a second, a big amount of bytes became garbage already, but they are all spread out between medium and long lived objects. In C++ for example, short lived objects could at least sometimes be allocated on the stack. As you see, in order to make a "good" Garbage Collector, we need to handle the classes differently. It makes sense to regularly clean up the short lived objects and make space for new ones. On the other hand we should be able to let the long lived objects alone as long as possible, only few objects die when they are already old. It may not surprise you, but modern Collectors can do that! When Java only cleans up young objects, it calls it a Minor Collection and when it cleans up the young and the old objects, it calls it a Major/Full Collection. But how do we only search through young objects? And how can we take care of fragmentation?
The Generational GCSo this should be it, the solution to all our problems. It is still a Mark and Sweep GC, but far more sophisticated. This GC uses Minor and Major Collections and it splits up the Heap in five different Generations or Spaces. Again, these are the Java-specific names, other Garbage Collectors work similarly but use different names: [attachment=15344:GC.png]
- Eden Space: Here is where every object will be created. Every minor collection completely clears out the Eden Space.
- Survivor Space 1 and 2: Both spaces have to be exactly the same. As the name implies, the survivors of a minor collection get copied here. Why we exactly need two Survivor spaces gets clear when we talk about aging. After every Minor Collection, one Survivor Space is completely empty.
- Old Space (or Tenured Generation): When a object has lived long enough, it gets promoted to the Old Space. Here, it can only get cleaned out by a Major Collection.
- Permanent Space: No GC will be performed here, that is why it is called permanent. You are also unable to allocate here, this space is used by the JVM to store metadata and other implementation-specific stuff.
- Objects in the Survivor Spaces are getting older, their chances of becoming garbage have sunk dramatically.
- The Survivor Space is getting full.
OutOfMemoryExceptionNow the final question, when does the heap actually run out of memory? Obviously, we would like the heap to be full when, well duh, it is full, when we can't allocate any more space. Sadly, this isn't the case. Let us make a freak example: Every object you allocate is 1 MiB in size, you have 4 Mib in Eden, 2* 0.5 MiB for Survivors and 5 MiB for Old Space. In total, your heap could hold 10 objects. Your program needs 7 objects at once in the heap, it regularly allocates a new 1 Mib object. Before it does it, your program removes the reference to an old object. Can the program run without an OutOfMemoryException? The answer is a clear yes and no! In theory at thee time of the crash, the Eden Space is full with 4 objects. The Old Space contains 3 garbage and 2 objects alive. A survivor space can not hold even a single object. The Generational GC has to clean up the Eden Space completely, but there is no space for all 4 objects, even after a Major Collection. So the Garbage Collector has no other choice but to hoist the white flag, to throw an OutOfMemoryException. But it CAN work, the Oracle implementation for example has a smarter Eden Space, it doesn't clean up the whole Eden Space because it would throw an OutOfMemoryException. So it can work, but don't count on it.
Common MisconceptionsEspecially newer programmers seem to struggle with this. They hear about Garbage Collectors and the overhead. They think that good code therefore generates as less garbage as possible. They try to keep objects longer in memory than they should and maybe even dislike the Immutable Pattern. It is true that if you are using an object anyways later on, that you should probably reuse it, but don't keep objectsyou might use at some point in the future in memory. With modern garbage collectors, short lived objects produce much less overhead than long lived ones. The Collector might need to swap it around in the Survivor spaces or if you are unlucky, it might even get promoted. Then you really produced unnecessary overhead. So don't hesitate to throw away what you don't need, it will only matter if you are allocating and throwing away megabytes. If you are doing that, no matter if you are using a Garbage Collector or not, you should rethink your program logic.
PerformanceWhen I read people arguing about why C++ has the better performance than higher-level languages, I always read one sentence that says: "because it doesn't use a Garbage Collector". Which I find a bit irritating, because it is in general not true. Let us discuss Garbage Collectors and Performance. First, we should think about what we mean when we say performance, there are mainly four factors when we talk about memory management:
- Memory Footprint
- Stop the World Phases
- Allocation Overhead