New is SLOW!

26 comments, last by iMalc 11 years, 4 months ago
Hello all.

I just ran the following two functions 100 times each:

void passing(int j)
{
    for (int i = 0; i < 1000000; ++i)
    {
        int f = 0;        // automatic (stack) variable
    }
}

void referencing(int* j)
{
    for (int i = 0; i < 1000000; ++i)
    {
        int* f = new int; // heap allocation on every iteration
        delete f;
    }
}

Here is the output:

passing: 2381801 ticks
referencing: 787614789 ticks

That is about 330x as slow! Is this an anomaly of me running Windows on a virtual machine, or are new/delete always this slow compared with local variable allocation? I also did not find that simply dereferencing j caused any significant slowdown in the code.

I wonder as I wander...

http://www.davesgameoflife.com

Well if you think about what is actually happening:

In the first case, the loop has essentially nothing to do: the local int will be assigned to a register, or at worst a slot on the stack. Either way, setting it to 0 costs next to nothing. Most likely your compiler isn't even bothering to set it, since it's unused, and may be removing the loop altogether as it does no work.

In the second case, each time you call 'new int' it goes through the runtime's heap allocator, which performs a fair amount of work to choose where to place the int, plus further logic to maintain the heap, and possibly a trip to the operating system to ask for more memory; delete does similar work in reverse.
A lot of this behaviour in C++ is implementation-dependent, but as a general rule:
- stack allocation requires incrementing a pointer by sizeof(WhatIWant)
- heap allocation requires synchronizing across all threads, walking some kind of list until a suitable chunk of memory is found, modifying the list accordingly, and returning the pointer.
Note that this applies to C++ only. On modern Java implementations for example, new is silly cheap. Also note that the Debug runtime of MSVC for example does a lot of extra work (like initializing the allocated memory with a special value and padding the allocated memory with another special value).
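To make that rule of thumb concrete, here is a minimal sketch of what stack-style allocation boils down to: a toy "bump" allocator that hands out memory just by incrementing an offset. (Illustrative only: no alignment handling, no per-object free, not thread-safe, and meant to live as a static/global object given its size.)

#include <cstddef>

// Toy bump allocator: the work of "allocating" is one pointer increment.
struct BumpArena
{
    char buffer[1 << 20];             // 1 MB backing store
    std::size_t offset = 0;

    void* allocate(std::size_t size)
    {
        if (offset + size > sizeof(buffer))
            return nullptr;           // out of space
        void* p = buffer + offset;
        offset += size;               // this increment is the entire cost
        return p;
    }

    void reset() { offset = 0; }      // "free" everything at once
};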

Also note that an optimizing compiler will probably be able to eliminate your passing function completely, since it can detect that nothing in there has a side effect. I would not trust the benchmark as written. Either the compiler is not optimizing (in which case it's probably the default Debug configuration in the case of MSVC, and as such completely useless for any benchmark) or the function and its containing loop were completely removed.
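If you want a loop like that to survive optimization, give it an observable side effect. A sketch of one cheap trick, writing each result to a volatile sink the optimizer cannot discard (the function and variable names here are just made up for illustration):

#include <chrono>
#include <cstdio>

volatile int sink;    // volatile: every write must actually happen

void passing_kept()
{
    for (int i = 0; i < 1000000; ++i)
    {
        int f = 0;
        sink = f;     // side effect keeps the loop from being removed
    }
}

int main()
{
    auto start = std::chrono::steady_clock::now();
    passing_kept();
    auto stop = std::chrono::steady_clock::now();
    std::printf("%lld us\n", (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
}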
I started programming in 1982 and didn't write anything with dynamic memory allocation until 1992. Dynamic memory allocation is a high-level construct that isn't obvious to implement in assembly, so it's no surprise that it's expensive. You should probably avoid using it in tight loops.
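If you do need dynamic memory near a tight loop, the usual fix is to allocate once outside it and reuse the storage. A sketch (the 64-element scratch buffer is an arbitrary example):

#include <vector>

void per_iteration(int n)          // heap allocation on every pass
{
    for (int i = 0; i < n; ++i)
    {
        std::vector<int> scratch(64);
        scratch[0] = i;            // ... work ...
    }
}

void hoisted(int n)                // one allocation, reused every pass
{
    std::vector<int> scratch(64);
    for (int i = 0; i < n; ++i)
    {
        scratch[0] = i;            // ... work ...
    }
}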
Your test cases are pretty useless and meaningless.

But your observation is essentially correct: everything else being equal, allocating on the stack is faster than allocating in the free store (or heap).

On modern Java implementations for example, new is silly cheap.


new might be cheap in Java compared to C++, but cleaning up isn't. (Excessive use of new in Java is almost as bad as it is in C++, since it gives the GC more work.)
[size="1"]I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
I didn't say it doesn't come with a price somewhere else. If Java's new were simply better, with no negative side effect at another point, everyone would stop using C++ and start using Java (well, not really). The important point up there was that new being expensive is something specific to C++. It comes with prices and benefits all over the place. Other languages, even if they have the exact same-looking new keyword, do it differently, and things come with different prices and benefits.

That is about 330x as slow! Is this an anomaly of me running Windows on a virtual machine, or are new/delete always this slow compared with local variable allocation?


There are two ways in which these results can be explained:

  • Allocation of local variables will typically be done on "the stack", though strictly speaking this is an architecture-specific implementation detail (I don't believe C++ mandates a stack explicitly, only "automatic storage"). Allocation of objects with new and delete will typically be done on "the heap" (technically, "the free store" according to the C++ standard), which normally involves function calls that look for "gaps" in virtual memory (crudely speaking).
  • A modern compiler will likely eliminate the first loop entirely. You might find a toolchain that could eliminate the second loop too, though this is less likely, as I guess it would have to be done as a global optimization to ensure new/delete haven't been overridden (see the sketch below).
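Incidentally, replacing the global operator new is also an easy way to see how many heap allocations a piece of code really makes. A sketch of such a counting override (a hypothetical diagnostic, not thread-safe):

#include <cstdio>
#include <cstdlib>
#include <new>

static unsigned long g_allocs = 0;   // hypothetical counter

void* operator new(std::size_t size)
{
    ++g_allocs;                      // count every heap allocation
    if (void* p = std::malloc(size))
        return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept
{
    std::free(p);
}

int main()
{
    for (int i = 0; i < 10; ++i) { int* f = new int; delete f; }
    std::printf("allocations: %lu\n", g_allocs);   // prints 10
}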



I also did not find that simply dereferencing j caused any significant slowdown in the code.

If you want a thorough understanding about what's going on here, a really good exercise is to find out how to look at the assembly code your compiler generates and see if you can follow what it has done. Reading it will necessarily involve understanding what registers are, what "the stack" means on x86 and a bit about the calling convention(s) used.
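For example (exact flags vary by compiler, but these are the common ones):

g++ -O2 -S main.cpp       GCC: writes the generated assembly to main.s
cl /O2 /FAs main.cpp      MSVC: writes assembly interleaved with source to main.asm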

There was quite a good series on altdevblogaday recently which you might be interested in. This is the best link I can find that includes them all:
http://www.altdevblogaday.com/?s=C%2B%2B+low+level+curriculum

Look for the "C++ low level curriculum" posts.
Apart from the already-mentioned obvious (the first function is a no-op), I don't consider 787 cycles for a general allocator slow. If you subtract the cycles for your for loop and for the non-inlineable library call, that's around 700-750 cycles altogether.

Actually, for a non-specialized allocator which has to deliver regardless of size and situation, and is running in debug mode (unless you use the world's worst compiler in existence, this is demonstrably a non-optimized build), that's a pretty awesome performance.

In comparison, a single cache miss is on the order of several hundred cycles. Thus, realistically, for non-trivial code (one that actually uses a million allocated memory locations), the overhead of the allocator wouldn't matter much.
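To see that cost for yourself, one way to provoke cache misses on purpose is to chase pointers through an array much larger than the last-level cache, shuffled into a single random cycle (Sattolo's algorithm) so the prefetcher can't help. A sketch; the array size here is just an assumption that comfortably exceeds any cache:

#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const std::size_t n = 16 * 1024 * 1024;   // 16M entries, far larger than any cache
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's algorithm: shuffle into one random cycle, so the chase
    // below visits every slot in an order the prefetcher can't guess.
    std::mt19937_64 rng(42);
    for (std::size_t i = n - 1; i > 0; --i)
    {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }

    auto start = std::chrono::steady_clock::now();
    std::size_t i = 0;
    for (std::size_t hop = 0; hop < n; ++hop)
        i = next[i];                          // nearly every hop misses cache
    auto stop = std::chrono::steady_clock::now();

    std::printf("%lld ms (ended at %zu)\n", (long long)
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count(),
        i);                                   // print i so the chase isn't optimized away
}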
Samoth,

I was actually trying to generate cache misses with this code. I have read all about them and would like to experience them for myself. I wanted to know whether it is better to pass in a pointer to a structure for an often-called piece of code, or to pass the structure by value to avoid a potential cache miss. Can you advise on which you think would be better in that case? I might not be getting cache misses because I have a 15MB L2 (L3?) cache, but I know not all systems have that.

I wonder as I wander...

http://www.davesgameoflife.com

This topic is closed to new replies.
