Extra function parameter impact

Started by
5 comments, last by Antheus 13 years, 9 months ago
Hello,

This might be a question for all among you who know assembler (and C++, ideally). I don't know much about what code a compiler generates, unfortunately.
My question is how big exactly the impact of one additional function parameter would be on the invocation efficiency of a function in C++.
Basically, what I have are a number of rather small functions which are called via function pointer. They are called _really_ often (billions of times per program execution). Right now they have only one parameter, an integer, while the remaining data is retrieved from global memory.
Unfortunately I need to expand the concept by either adding a more complex global memory layout or adding an additional function parameter (a pointer, most likely, to more specific data). I am uncertain just how bad the latter would be. The bigger functions probably wouldn't noticeably suffer but some of the smaller ones are less than ten lines long.
I know this sounds awfully paranoid but I'm yet at a fraction of the execution efficiency I'm aiming at, and every little bit counts.
Advertisement
Depends on the parameter. For simple cases I think it's just a few additional instruction to push/pop the parameter to/from the stack. I've certainly never had an issue with the number of function parameter. The only time I've ever had them show up as a hotspot is with some complex type I forgot to pass by const reference.

In general it wouldn't surprise me if the act of calling a function is more expensive than simple parameters (especially since I guess calling some remote function through a pointer could cause a cache miss?).


If the alternative is to get the same data some other way (eg from "global" memory through several layers of indirection) then you should also consider that doing so is going to be just as expensive if not more than a function parameter.
I don't think it will be too expensive. Fetching data from a global isn't free either, it can actually be less cache efficient, and cache misses are far more punishing than pushing a machine word onto the stack. You should keep in mind that using global state is far more likely to cause bugs.

Technically, what happens at the ASM layer depends on the calling convention.

Some common ones:
"stdcall"/"cdecl": pass arguments on the stack
"fastcall": pass two arguments by register, others via stack

You can manually specify the calling conventions (compiler specific) or you can choose to change the default calling convention. Be extremely careful if you do it manually, I believe you must take care to keep all your function pointers to have the same calling convention. Calling a function through a pointer with the wrong calling convention can have some not-fun results. This is especially important if these function pointers are callbacks that might not be written by you.

As always, the only way to be sure is to profile. It should be easy enough to rig up a #define so that you can globally change the calling convention of all your function pointers. Then you can test and see which approach works best.
Quote:They are called _really_ often (billions of times per program execution).

Ok - how often per second? If it's under 10 million/sec, the overhead likely won't matter.
If it's more than 10 million/sec, then the function bodies are too trivial, and there is simply too much overhead from calls rather than actual work being done.

Quote:Basically, what I have are a number of rather small functions which are called via function pointer.

If anything, the function pointer is the bottleneck. GCC doesn't inline it, MSVC often does not.

Parameters beyond that are irrelevant.

The bottleneck comes from polymorphic behavior. To improve upon this, for something that is called billions of times, split calls into groups. virtual call is simply incorrect design choice.

Rather than having one container of loose types, have n containers of specific types.

The next step in optimization is not to pass anything at all, nor call any functions. Simple example:
struct Foo {  virtual void bar(int, int, float);};for (...) {  stuff->bar(getX(), getY(), getZ());};


This can be trivially organized into pure POD style type:
struct SpecificFoo {  int x, y;  float z;};void bar(SpecificFoo & f);for (...) {  bar(*specificFooStuff);}
Here the bar() call will be inlined, there are no branches, no memory lookups, no dereferences and processing is sequential, no parameters are passed, not even 'this' - compiler will commonly mangle it into the for loop itself.

If there are multiple types, either sort the array by type and process each subsection using a separate for loop, or keep one container for each type.
Woah, lots to think about. Thanks for the quick replies :D

Ok, let's see.

@rip-off:
Quote:Some common ones:
"stdcall"/"cdecl": pass arguments on the stack
"fastcall": pass two arguments by register, others via stack

'fastcall' would probably be nice, assuming the 2xDWORD expands on 64bit compilations. Probably not. Since the pointer (or reference or whatever) would be QWORD size and the second of two required parameters, I'd be out of luck. Well, I'd change a little at least :)
Quote:You should keep in mind that using global state is far more likely to cause bugs.

I am aware that whatever global memory architecture I can cook up will be prone for bugs. I would prefer other solutions.

@Antheus:
Quote:Ok - how often per second? If it's under 10 million/sec, the overhead likely won't matter.
If it's more than 10 million/sec, then the function bodies are too trivial, and there is simply too much overhead from calls rather than actual work being done.

On average I'm somewhere around 1mil function invocations per second. Give or a take a little depending on the size of the function invoked. My current measuring techniques are a little fuzzy. At least they don't affect the result.

Edited: Nope, quite a bit more, but I'll have to rewrite a lot to get more specific results, atm. sorry :S

Quote:
If anything, the function pointer is the bottleneck. GCC doesn't inline it, MSVC often does not.

Yeah, this is probably not inlined. Thing is, these function pointers are passed to a thread pool which attempts to balance the work-load onto all available processor cores. So far that worked pretty nicely.

The problem is that the only alternative to passing crude function pointers is to implement worker thread functions specifically for each client function. Which means templates. I could whip up such a template class that takes the respective function pointer as a template parameter, thus maximizing the chance of inlining. Down-side is that I'd probably end up having fifty or more inactive threads laying around since I'd need an instance per function and each such instance would require its own set of threads - one per core. Is there actually any limit to how many threads one can create or any massive memory consumption per thread I am unaware of?
Then again, I am already producing memory consumption in the GBs; I guess whatever the threads can throw at me won't really bother me :P
Quote:
'fastcall' would probably be nice, assuming the 2xDWORD expands on 64bit compilations.

The registers expand to 64 bits, I don't see why they wouldn't use the additional bits. Further down on that page I linked it seems that 64 bit calling conventions use even more registers when passing additional parameters. Also, it seems there is the only one 64 bit calling convention per compiler, so there is less to worry about.

Quote:
The problem is that the only alternative to passing crude function pointers is to implement worker thread functions specifically for each client function. Which means templates. I could whip up such a template class that takes the respective function pointer as a template parameter, thus maximizing the chance of inlining. Down-side is that I'd probably end up having fifty or more inactive threads laying around since I'd need an instance per function and each such instance would require its own set of threads - one per core. Is there actually any limit to how many threads one can create or any massive memory consumption per thread I am unaware of?
Then again, I am already producing memory consumption in the GBs; I guess whatever the threads can throw at me won't really bother me :P

The solution Antheus was suggesting is to provide more batching. It depends on what the functions are doing etc, but you don't want to be passing millions of functions between threads, the intra-thread communication could become the bottleneck.

We would have to know a lot more about how these calls are generated - can they be predicted in advance, generated from user input, something else? Can you estimate the complexity of the function in advance?

If there is one main thread dispatching tasks, it could sort the incoming tasks into different "buckets", wait for either some bucket threshold or a timeout to pass before passing the bulk operations onto the worker threads. Latency will go up, but throughput should increase if you do it right.

Can you describe the high level goals of your system? Optimisation is all about knowing the specifics, you can't do much for the general case.
Quote:Original post by rip-off
The solution Antheus was suggesting is to provide more batching. It depends on what the functions are doing etc, but you don't want to be passing millions of functions between threads, the intra-thread communication could become the bottleneck.


Rather than passing function pointers to individual values, make function pointers operate on range, something like:
void foo(void * data, int begin, int end) {  SpecificType * values = (SpecificType*)data;  for (int i = begin; i < end; i++) actual_foo_logic(values);}


Next, before passing stuff to threads, sort the original array by function pointer. Pass whole sections of same-typed pointers.

Sorting can be done using parallel merge sort, perhaps even without final merge phase.
Ranges can be quickly determined using binary search. Pass whole range to workers, perhaps limited by some maximum. Existing parallel_for constructs can be used as well.

The above has the advantage of actual logic being inline expanded and performing everything locally.

Templating function pointers here doesn't help since GCC refuses to inline them but MSVC will. The simplest way to write a generic wrapper would be CRTP with virtual base class. That provides compile time expansion of for loop, but provides run-time polymorphism for dispatching.

The total number of threads is same as number of cores. Using some existing parallel_for is *much* more advisable since they take care of tricky details of load distribution.

This topic is closed to new replies.

Advertisement