The key difference between Java and C++ is not the speed of the generated code. It really is garbage collection. Garbage collection is a space-time tradeoff.
TL;DR - Garbage collectors need lots of RAM to run fast, but given enough RAM, they are as fast as malloc/free (BUT NOT FASTER).
The more space you are willing to devote to the garbage-collected heap, the faster it will run (since it will collect garbage less frequently). Given about 5x as much memory as you'd need with a C++ program, you will get about the same performance. Of course, if you need that memory for other reasons (i.e., caching data from disk / network, for large-scale in-memory computations, etc.), then that extra space comes at an extra cost. And if it would trigger paging, then obviously that is disastrous. Here's the paper and an excerpt from a Lambda the Ultimate discussion of a paper I wrote with my student Matthew Hertz in OOPSLA 2005 (paper and presentation attached).
Quantifying the performance of garbage collection vs. explicit memory management
Abstract:
Garbage collection yields numerous software engineering benefits, but its quantitative impact on performance remains elusive. One can compare the cost of conservative garbage collection to explicit memory management in C/C++ programs by linking in an appropriate collector. This kind of direct comparison is not possible for languages designed for garbage collection (e.g., Java), because programs in these languages naturally do not contain calls to free. Thus, the actual gap between the time and space performance of explicit memory management and precise, copying garbage collection remains unknown.
We introduce a novel experimental methodology that lets us quantify the performance of precise garbage collection versus explicit memory management. Our system allows us to treat unaltered Java programs as if they used explicit memory management by relying on oracles to insert calls to free. These oracles are generated from profile information gathered in earlier application runs. By executing inside an architecturally-detailed simulator, this "oracular" memory manager eliminates the effects of consulting an oracle while measuring the costs of calling malloc and free. We evaluate two different oracles: a liveness-based oracle that aggressively frees objects immediately after their last use, and a reachability-based oracle that conservatively frees objects just after they are last reachable. These oracles span the range of possible placement of explicit deallocation calls.We compare explicit memory management to both copying and non-copying garbage collectors across a range of benchmarks using the oracular memory manager, and present real (non-simulated) runs that lend further validity to our results. These results quantify the time-space tradeoff of garbage collection: with five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.
....
Quantifying the Performance of Garbage Collection vs. Explicit Memory Management, by Matthew Hertz and Emery D. Berger:
Overall, generational collectors can add up to 50% space overhead and 5-10% runtime overheads if we ignore virtual memory. Very reasonable given the software engineering benefits. However, factoring in virtual memory with its attendant faulting paints a very different picture in section 5.2:
Garbage collection yields numerous software engineering benefits, but its quantitative impact on performance remains elusive. One can measure the cost of conservative garbage collection relative to explicit memory management in C/C++ programs by linking in an appropriate collector. This kind of direct comparison is not possible for languages designed for garbage collection (e.g., Java), because programs in these languages naturally do not contain calls to free. Thus, the actual gap between the time and space performance of explicit memory management and precise, copying garbage collection remains unknown.
We take the first steps towards quantifying the performance of precise garbage collection versus explicit memory management. We present a novel experimental methodology that lets us treat unaltered Java programs as if they used explicit memory management. Our system generates exact object reachability information by processing heap traces with the Merlin algorithm [34, 35]. It then re-executes the program, invoking free on objects just before they become unreachable. Since this point is the latest that a programmer could explicitly free objects, our approach conservatively approximates explicit memory management. By executing inside an architecturally-detailed simulator, this “oracular” memory manager eliminates the effects of trace processing while measuring the costs of calling malloc and free.
We compare explicit memory management to both copying and non-copying garbage collectors across a range of benchmarks, and include real (non-simulated) runs that validate our results. These results quantify the time-space tradeoff of garbage collection: with five times as much memory, an Appel-style generational garbage collector with a non-copying mature space matches the performance of explicit memory management. With only three times as much memory, it runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.
These graphs show that, for reasonable ranges of available memory (but not enough to hold the entire application), both explicit memory managers substantially outperform all of the garbage collectors. For instance, pseudoJBB running with 63MB of available memory and the Lea allocator completes in 25 seconds. With the same amount of available memory and using GenMS, it takes more than ten times longer to complete (255 seconds). We see similar trends across the benchmark suite. The most pronounced case is javac: at 36MB with the Lea allocator, total execution time is 14 seconds, while with GenMS, total execution time is 211 seconds, over a 15-fold increase.
The culprit here is garbage collection activity, which visits far more pages than the application itself [60]