#1 Members - Reputation: 301
Posted 03 December 2012 - 07:00 AM
I just ran the following two functions 100 times each:
void passing(int j )
{
for(int i = 0; i < 1000000; ++i)
{
int f = 0;
}
}
void referrencing(int* j)
{
for(int i = 0; i < 1000000; ++i)
{
int* f = new int;
delete f;
}
}
here was the output:
passing: 2381801 ticks
referrencing: 787614789 ticks
That is almost 400x as slow! Is this an anomaly of me running windows on a virtual machine, or is new, delete always this slow as compared with local variable alloation? I also did not find that simply dereferencing j caused any significant slowdown in the code.
I wonder as I wander...
#2 Members - Reputation: 625
Posted 03 December 2012 - 07:09 AM
In the first case, it is in the worst scenario doing a loop that has nothing to do; the local int is going to have been assigned to a register and at worst a space on the stack. In either case setting it to 0 is like nothing at all. Most likely your compiler is not even bothering to set it since it's unused, and may even be removing the loop altogether as it's not doing any work.
In the second case, each time you call 'new int' it's calling the operating system to ask it for space and performing lots of computations to choose the best place to allocate the int, with perhaps further logic to maintain the heap, same with delete in reverse.
#3 Members - Reputation: 1406
Posted 03 December 2012 - 07:13 AM
- stack allocation requires incrementing a pointer by sizeof(WhatIWant)
- heap allocation requires synchronizing over all threads and walking some kind of list until a suitable chunk of memory is found, modifying the list accordingly, and returning the pointer.
Note that this applies to C++ only. On modern Java implementations for example, new is silly cheap. Also note that the Debug runtime of MSVC for example does a lot of extra work (like initializing the allocated memory with a special value and padding the allocated memory with another special value).
Also note that an optimizing compiler will probably be able to completely eliminate your passing function since it can detect that nothing in there has a side effect. I would not trust the benchmark as written. Either the compiler is not optimizing (in which case it's probably the default Debug configuration in case of MSVC and as such completely useless for any benchmark) or the function and containing loop was probably completely removed.
#4 Members - Reputation: 5899
Posted 03 December 2012 - 07:29 AM
#6 Members - Reputation: 3716
Posted 03 December 2012 - 07:42 AM
On modern Java implementations for example, new is silly cheap.
new might be cheap in Java compared to C++ but cleaning up isn't. (excessive use of new in Java is almost as bad as it is in C++ since it will give the GC more work)
The voices in my head may not be real, but they have some good ideas!
#7 Members - Reputation: 1406
Posted 03 December 2012 - 07:48 AM
#8 Members - Reputation: 2043
Posted 03 December 2012 - 08:21 AM
That is almost 400x as slow! Is this an anomaly of me running windows on a virtual machine, or is new, delete always this slow as compared with local variable alloation?
There are two ways in which these results can be explained:
- Allocation of local variables will typically be done on "the stack", though strictly speaking this is an architecture-specific implementation detail (I don't believe C++ mandates a stack explicitly, only "automatic storage"). Allocation of objects with new and delete will typically be done on "the heap" (technically, "the free store" according to the C++ standard), which normally involve function calls that look for "gaps" in virtual memory (crudely speaking).
- A modern compiler will likely eliminate the first loop entirely. You might find a tool-chain that could eliminate the second loop, too, though this is less likely as I guess it would have to be done as a global optimization to ensure new/delete haven't been overridden.
I also did not find that simply dereferencing j caused any significant slowdown in the code.
If you want a thorough understanding about what's going on here, a really good exercise is to find out how to look at the assembly code your compiler generates and see if you can follow what it has done. Reading it will necessarily involve understanding what registers are, what "the stack" means on x86 and a bit about the calling convention(s) used.
There was quite a good series on altdevblogaday recently which you might be interested in. This is the best link I can find that includes them all:
http://www.altdevblogaday.com/?s=C%2B%2B+low+level+curriculum
Look for the "C++ low level curriculum" posts.
Edited by edd, 03 December 2012 - 08:24 AM.
#9 Members - Reputation: 1971
Posted 03 December 2012 - 09:32 AM
Actually, for a non-specialized allocator which has to deliver regardless of size and situation, and is running in debug mode (unless you use the world's worst compiler in existence, this is demonstrably a non-optimized build), that's a pretty awesome performance.
In comparison, a single cache miss is on the order of several hundred cycles. Thus, realistically, for non-trivial code (one that actually uses a million allocated memory locations), the overhead of the allocator wouldn't matter much.
#10 Members - Reputation: 301
Posted 03 December 2012 - 12:56 PM
I was actually trying to generate cache misses with this code. I have read all about them and would like to experience them for myself. I wanted to know if it was better to send in a pointer to a structure for a often called piece of code, or to just pass the structure by value to avoid a potential cache miss. Can you advise on which you think would be better in that case? I might not be getting cache misses because I have a 15MB L2(3?) cache, but I know not all systems have that.
I wonder as I wander...
#11 Members - Reputation: 1971
Posted 03 December 2012 - 01:13 PM
Therefore, if you allocate/free/allocate/free/allocate an integer, you will very likely get the same address or one of a handful of addresses every time. Thus, cache misses are not going to happen.
To provoke cache misses, you need to either allocate one large block and index into that randomly, or do some ten thousand allocations without freeing, and dereference the returned pointers (preferrably in random order). Any access pattern that is either localized or an access pattern with a constant stride (e.g. accessing every 64th or 253th byte) will very likely not give a lot of cache misses due to either several accesses landing in the same cache line anyway or triggering the CPU's built-in prefetch mechanism to anticipate your requests.
Edited by samoth, 03 December 2012 - 01:14 PM.
#12 Members - Reputation: 2043
Posted 03 December 2012 - 01:28 PM
Without wanting to sound too disheartening, I wouldn't worry about this kind of stuff for the time being. Concentrate on writing solid code and understanding the language (for instance, free store vs automatic storage semantics is particularly important point in a value-based language such as C++).I was actually trying to generate cache misses with this code. I have read all about them and would like to experience them for myself.
If you really care about how much you're missing the cache, it's not something you want to be working out in your head, for the most part. You first want to write code sensibly and then use profiling tools to tell you where you might be able to improve it in terms of performance. Intel's PCM will give you really low-level information of this kind.
Neither. Typically use reference-to-const for non-primitive types. Tweak according to profile results. Again, IMHO getting the basics of the language down first is much more worth while.I wanted to know if it was better to send in a pointer to a structure for a often called piece of code,
Whether or not an object is on the stack or heap makes a difference, as does its size, and the amount of indirection inherent in its data structure. The full answer is extremely complicated.or to just pass the structure by value to avoid a potential cache miss.
If you insist on learning more about this stuff (and again, I don't think it's necessarily a good idea at this point in your education), a good starting point is this article which demonstrates cache effects on performance. Note how the expert himself failed to explain all the effects revealed by the data. You could try replicating the results yourself in C++:I might not be getting cache misses because I have a 15MB L2(3?) cache, but I know not all systems have that.
Edited by edd, 03 December 2012 - 01:30 PM.
#14 Members - Reputation: 301
Posted 03 December 2012 - 02:11 PM
edd: Actually my real goal is to implement an Observer scheme where I have, for example, DigitalButtonObservable, AnalogButtonObservable, 2DAxisObservable, etc., and then have a single InputPolyObserver class built into a vector in each UIController. By doing that I could, for example, map the jump of my playable character to three separate actions, i.e. Space Bar, X (360 controller) and maybe a mouse move for some Indie coolness. I know that it is not exactly logical to have the same action map to such disparate input methods, but I want to start with a completely generic setup for the engine code, then simplify when making the actual game. I really want to avoid having to code the input system twice, but rather script the input in any way I please, including mixing input types i.e. Digital vs. Analog.
The reason I can't pass by const ref in this case is that, for example, a digital button would return a bool for buttonDown and a mouse would return its int or float x, and y coords as a struct or whatnot. The only way I could think of to implement this as a single overload of a function in the abstract Observer class, named something such as GetObservableState(...), is to pass something like a void*, then requiring my client code to interpret the event, perhaps by also passing a type, such as an enum specifying whether the button was a digital press or an analog press i.e. DIGITAL_PRESS, ANALOG_PRESS.
I could then simply load up an observer with a function pointer for its code and link it with any button or combination of buttons from any controller, keyboard, or mouse, and the code would simply work as requested. Do you see another way to implement this?
I wonder as I wander...
#15 Moderator* - Reputation: 5411
Posted 03 December 2012 - 02:30 PM
It doesn't matter how fast the CPU is (when it comes to this cache/memory stuff); the bottleneck is the memory speed, not the CPU speed. Also, if there is a cache miss when using the object, there will be cache misses when copying the object as well. Just an FYI.Alvaro: I assumed that by copying the data into the stack then accessing it there, I would certainly not miss the cache. Because the CPU is fast this should be fast also.
#16 Moderators - Reputation: 5066
Posted 03 December 2012 - 02:38 PM
There is a lot of stuff going on in this thread. First you were talking about the allocator, then the cache, now it seems to be about design. Try to keep a thread coherent, and create additional threads if necessary for unrelated topics.
#18 Members - Reputation: 2043
Posted 03 December 2012 - 02:44 PM
But where are you copying it from? Any memory you read from there has to go through the cache hierarchy too. I don't want to start sounding like a broken record, but I'd say get the basics down first.Alvaro: I assumed that by copying the data into the stack then accessing it there, I would certainly not miss the cache. Because the CPU is fast this should be fast also.
I still don't see why you can't use references to const, or even which part of the whole system you're talking about here. There are surely a whole bunch of functions involved, any of which could take arguments by value, reference or pointer(?).The reason I can't pass by const ref in this case is that, for example, a digital button would return a bool for buttonDown and a mouse would return its int or float x, and y coords as a struct or whatnot.
It sounds like you're perhaps doing the observer pattern backwards. The thing being observed tells the things observing it what's going on. You really don't want to start falling back to void*.The only way I could think of to implement this as a single overload of a function in the abstract Observer class, named something such as GetObservableState(...), is to pass something like a void*, then requiring my client code to interpret the event, perhaps by also passing a type, such as an enum specifying whether the button was a digital press or an analog press i.e. DIGITAL_PRESS, ANALOG_PRESS.
void* and function pointers are all that's available in C. In C++, use virtual functions instead.I could then simply load up an observer with a function pointer
We're wandering well off-topic here (for the second time), and my suggestion may not fit what you're doing, but I'd go with something like this:Do you see another way to implement this?
template<typename State>
struct observer : uncopyable
{
virtual ~observer() { }
// called by an observable when a change is made.
virtual void on_update(const State &new_state) = 0;
};
template<typename State>
class observable : uncopyable
{
public:
// ...
// observers can be registered to receive notifications of change.
void add_observer(const share_ptr<observer<State> > &obs);
// called when something changes, calls on_update(new_state) for each
// registered observer.
void notify_observers(const State &new_state);
private:
// ...
};
So a mouse might have an "observable<mouse_position>", for example. Anything that wants to know when the mouse has moved registers an observer<mouse_position> implementation. This is a very noddy version of something like boost::signal<>.
#19 Members - Reputation: 301
Posted 03 December 2012 - 02:53 PM
I wonder as I wander...
#20 Moderator* - Reputation: 5411
Posted 03 December 2012 - 02:57 PM
Under each post, there's a button that says "Quote". Click it to quote someone. Alternatively, add [ quote ][ /quote ] (without spaces) tags in your post to create your own quote (but it won't display their name or post or timestamp if you don't click "Quote"). To get formatted code, use [ code ][ /code ] tags (again, without spaces)Thank you very much edd, that is very instructive. I never considered the template version. Can you tell me how I can quote others and post clean and colored code like you did?






