

Macroscopic timing of pass-by-reference vs pass-by-value for const double, long unsigned int, float


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

7 replies to this topic

#1 taby   Members   -  Reputation: 285


Posted 18 June 2012 - 09:26 PM

Someone messaged me with some information on the microscopic details of the difference between passing a const double by reference vs by value. Thanks for that. I wondered what it would be like in terms of speed on the macroscopic side of things, so I timed (using QueryPerformanceCounter on Windows) a loop of 100,000 calls (by function pointer) to the two types of functions, repeated the process 10,000 times, and threw it all into a histogram of 100 bins.

The functions were:
double ref(const double &d) { return static_cast<double>(rand())*d - 1000.0; }
double val(const double d)  { return static_cast<double>(rand())*d - 1000.0; }

The function pointers were:
double (*fptr_ref)(const double &) = ref;
double (*fptr_val)(const double) = val;

The input/output variables were:
double in = 34, out = 0;

The inner loop was:
QueryPerformanceCounter(&start_ref);

for(size_t j = 0; j < 100000; j++)
	out = (*fptr_ref)(in);

QueryPerformanceCounter(&end_ref);

The pass-by-reference version took on average about 0.0076 seconds per 100,000 calls, and the pass-by-value version took on average about 0.0072 seconds per 100,000 calls. In other words, I found that the pass-by-value version was roughly 1.06 times faster, which is to be expected.
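For anyone who wants to repeat the double test on another platform, here is a minimal, portable sketch of the inner loop. It is not the original test program: std::chrono::steady_clock is swapped in for QueryPerformanceCounter, the helper name time_calls is made up for this example, and the volatile function pointer is an assumption about how to keep the compiler from inlining the calls.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

double ref(const double &d) { return static_cast<double>(std::rand()) * d - 1000.0; }
double val(const double d)  { return static_cast<double>(std::rand()) * d - 1000.0; }

// Hypothetical helper: time `calls` invocations made through a function
// pointer and return the elapsed time in seconds.
template <typename Fn>
double time_calls(Fn fn, std::size_t calls)
{
    assert(fn != nullptr);

    // A volatile pointer must be reloaded on every iteration, so the call
    // stays an indirect call instead of being inlined.
    Fn volatile fptr = fn;

    // A volatile result keeps the compiler from discarding the work.
    volatile double out = 0.0;
    const double in = 34.0;

    const auto start = std::chrono::steady_clock::now();

    for (std::size_t j = 0; j < calls; j++)
        out = fptr(in);

    const auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(end - start).count();
}
```

Calling time_calls(ref, 100000) and time_calls(val, 100000) repeatedly and collecting the returned values gives the raw samples for the kind of comparison described above. The absolute numbers will depend heavily on the compiler, its flags, and the machine.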

For long unsigned int, the pass-by-reference version took on average about 0.00578 seconds per 100,000 calls, and the pass-by-value version took on average about 0.00573 seconds per 100,000 calls. The pass-by-value version was roughly 1.008 times faster, which is to be expected.

The functions were:
long unsigned int ref(const long unsigned int &d) { return rand()*d - 1000; }
long unsigned int val(const long unsigned int d)  { return rand()*d - 1000; }

The function pointers and input/output types were unsigned long int.

Things changed for float, though, and I cannot explain it to myself. The pass-by-reference version took on average about 0.00816159 seconds per 100,000 calls, and the pass-by-value version took on average about 0.00838039 seconds per 100,000 calls. The pass-by-reference version was roughly 1.03 times faster, which was unexpected. Of course, repeating the entire test yielded similar results every time. The standard deviation is about one order of magnitude smaller than the average, which was also the case for double and long unsigned int, so it's not like all hell was breaking loose.

The functions were:
float ref(const float &d) { return static_cast<float>(rand())*d - 1000.0f; }
float val(const float d)  { return static_cast<float>(rand())*d - 1000.0f; }

The function pointers and input/output types were float.

This was all on Windows 7 32-bit, using MSVC++ 2010 Express in release mode.
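The averages, standard deviations and 100-bin histograms mentioned above can be computed along these lines. This is a sketch under assumptions, not the original analysis code: the names Summary, summarize and histogram are invented for the example, the standard deviation uses the n - 1 (sample) denominator, and the bins are equal-width between the smallest and largest sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Summary { double mean; double stddev; };

// Mean and sample standard deviation of a set of timings.
Summary summarize(const std::vector<double> &samples)
{
    if (samples.empty())
        return {0.0, 0.0};

    double sum = 0.0;
    for (double s : samples)
        sum += s;
    const double mean = sum / samples.size();

    double sq = 0.0;
    for (double s : samples)
        sq += (s - mean) * (s - mean);

    // n - 1 denominator; a single sample has no spread to report.
    const double stddev =
        samples.size() > 1 ? std::sqrt(sq / (samples.size() - 1)) : 0.0;

    return {mean, stddev};
}

// Equal-width histogram spanning [min, max] of the samples.
std::vector<std::size_t> histogram(const std::vector<double> &samples, std::size_t bins)
{
    assert(bins > 0);
    std::vector<std::size_t> counts(bins, 0);
    if (samples.empty())
        return counts;

    const auto mm = std::minmax_element(samples.begin(), samples.end());
    const double lo = *mm.first;
    const double hi = *mm.second;
    const double width = (hi - lo) / bins;

    for (double s : samples)
    {
        std::size_t b = width > 0.0 ? static_cast<std::size_t>((s - lo) / width) : 0;
        if (b >= bins)
            b = bins - 1; // the maximum value lands in the last bin
        ++counts[b];
    }
    return counts;
}
```

Feeding the 10,000 per-loop timings for each function into summarize and histogram(samples, 100) reproduces the kind of summary quoted in this post.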

Pass-by-reference was generally superior for composite types, which is to be expected.

Is there any obvious reason as to 1) why float would generally take longer than double, and 2) why pass-by-reference would be faster than pass-by-value for float?

Attached thumbnails: plot_double.jpg, plot_long_unsigned_int.jpg, plot_float.jpg

Edited by taby, 18 June 2012 - 09:42 PM.



#2 Hodgman   Moderators   -  Reputation: 14300


Posted 18 June 2012 - 09:35 PM

Is there any obvious reason as to 1) why float would generally take longer than double, and 2) why pass-by-reference would be faster than pass-by-value for float?

Yep, likely because MSVC simply sucks balls at producing float-based code on default 32-bit settings.
Seriously, it produces about 8x as many "push this register to the stack, pop it back into a register again" instructions as it does actual math instructions -- especially when you're not doing much math, and when you're mixing int/float math together, like in your example.

Try switching the compiler to "fast" float precision mode, and x64 code if you can. MSVC's float code is slow because it inserts extra instructions to ensure that your results do actually get rounded down to 32 bits constantly.
Inside the CPU, they might be stored as 80-bit floats, which gives slightly different results to the standard IEEE format -- MSVC, being helpful, inserts a "round to 32 bits" operation after every single math operation, just to ensure you're getting IEEE-compliant results (and then loads that rounded value back into an 80-bit register to do the next operation).

On the other hand, MSVC skips all this for doubles, and just gives you full 80-bit precision until you write the value out to RAM yourself (at which point it's rounded to 64 bits). On MSVC's default settings, this makes doubles appear to be a lot faster in general than floats, because MSVC *has* produced much more efficient assembly for doubles, whereas it uses really slow but "correct" assembly for floats.
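To see why rounding after every operation can change the answer at all, here is a minimal sketch. It only illustrates the arithmetic and is not what MSVC actually emits: double stands in for the wider x87 register, the volatile stores are an assumed way of forcing a round to 32 bits, and both function names are made up for this example.

```cpp
// Round to 32 bits after every operation: each volatile float store forces
// the intermediate result into float precision.
float sum_rounded_each_step(float a, float b, float c)
{
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

// Keep the intermediate in a wider type and round to 32 bits only once.
float sum_rounded_once(float a, float b, float c)
{
    const double t = static_cast<double>(a) + b + c;
    return static_cast<float>(t);
}
```

With a = 16777216.0f (2^24) and b = c = 1.0f, the first function returns 16777216 because each +1 is lost when the sum is rounded back to float, while the second returns 16777218. The per-operation rounding buys reproducible float results at the cost of the extra stores and loads.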

Your timings aren't very useful in general, because the rand/cast probably isn't representative of your typical function, which completely changes what you're measuring.
You don't need to time this anyway to see the theoretical performance difference between the two -- most articles I've seen on this simply examine the assembly and highlight inefficiencies where values are being bounced off of the cache for no reason (which won't matter much in your test, because the cache isn't under heavy load).

Edited by Hodgman, 18 June 2012 - 09:44 PM.


#3 taby   Members   -  Reputation: 285


Posted 18 June 2012 - 09:49 PM

Oh, I know that macroscopic timing really pisses people off, but it was undeniably fair and accurate enough to get the general idea: pass-by-reference is not worth it for double and long unsigned int, and um, MSVC++ sucks balls. Of course, the functions given here are not the only functions in all of existence.

gcc on Ubuntu runs the float test much faster by default. The switch -ffast-math had some effect, though slight. The average per test fluctuated quite dramatically, making it unclear as to which pass method was faster, so running many tests and gathering the many averages would give a bigger picture.

The timer code:
#include <time.h>
struct timespec start_time;

float get_curr_time(void)
{
	struct timespec end_time; // better as static?
	clock_gettime(CLOCK_REALTIME, &end_time);

	struct timespec diff; // better as static?

	if(end_time.tv_nsec - start_time.tv_nsec < 0)
	{
		diff.tv_sec = end_time.tv_sec - start_time.tv_sec - 1;
		diff.tv_nsec = 1000000000 + end_time.tv_nsec - start_time.tv_nsec; // 1,000,000,000 nanoseconds = 1 second
	}
	else
	{
		diff.tv_sec = end_time.tv_sec - start_time.tv_sec;
		diff.tv_nsec = end_time.tv_nsec - start_time.tv_nsec;
	}

	return static_cast<float>(diff.tv_sec) + static_cast<float>(diff.tv_nsec) / 1000000000.0f; // 1,000,000,000 nanoseconds = 1 second
}

...

int main(void)
{
	clock_gettime(CLOCK_REALTIME, &start_time);

	...

	start_ref = get_curr_time();

	for(size_t j = 0; j < iterations; j++)
		out = (*fptr_ref)(in);

	end_ref  = get_curr_time();

	...
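One caveat about the timer above, offered as a suggestion rather than a correction of the results: float has a 24-bit significand, so once roughly 100 seconds have elapsed since start_time, a float timestamp can only resolve steps of about 7.6 microseconds, which is not negligible next to loops of a few milliseconds. A sketch of an alternative that returns double and uses CLOCK_MONOTONIC (which is not affected by wall-clock adjustments) might look like this; the function names are made up for the example.

```cpp
#include <cassert>
#include <time.h>

// Seconds between two timestamps, computed in double. The signed nanosecond
// difference makes the explicit borrow branch unnecessary.
double elapsed_seconds(const timespec &start, const timespec &end)
{
    return static_cast<double>(end.tv_sec - start.tv_sec)
         + static_cast<double>(end.tv_nsec - start.tv_nsec) / 1e9;
}

// Seconds elapsed since `start`, which should also have been read from
// CLOCK_MONOTONIC.
double seconds_since(const timespec &start)
{
    timespec now;
    const int rc = clock_gettime(CLOCK_MONOTONIC, &now);
    assert(rc == 0);
    (void)rc; // silences the unused warning when asserts are compiled out
    return elapsed_seconds(start, now);
}
```

As for the "better as static?" comments: a local timespec is just a few bytes on the stack, so making it static is unlikely to help and would make the function non-reentrant.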


Edited by taby, 18 June 2012 - 11:40 PM.


#4 Álvaro   Members   -  Reputation: 6183


Posted 18 June 2012 - 11:54 PM

I don't think your test measures anything useful. If you are allowing the compiler to inline the functions, any difference in how you pass the values is gone, and I am not even sure what you are measuring. Even if you are correctly making the functions unavailable for inlining, rand() will take the majority of the time and obscure any real differences between the methods.

Passing a small native type by const reference feels wrong to me because it's not idiomatic, not because it might be slow. Passing by value is the simplest choice, what people expect to see in your code and probably what compilers are optimized for. If you do something different, there should be a good reason for it.
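As a small illustration of that convention (the Matrix4 type and both functions are made up for this example, not taken from anyone's code in this thread):

```cpp
// A composite type that is comparatively expensive to copy (128 bytes here).
struct Matrix4 { double m[16]; };

// Small built-in types: pass by value, which is what readers expect to see.
double scale(double x, double factor) { return x * factor; }

// Larger composite types: pass by const reference to avoid the copy.
double trace(const Matrix4 &a)
{
    return a.m[0] + a.m[5] + a.m[10] + a.m[15];
}
```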

#5 taby   Members   -  Reputation: 285


Posted 19 June 2012 - 02:44 PM

The functions were intentionally not inlined, and I pointed that out in the first post (yeah yeah, puns suck). Anyway, the time elapsed for the two functions stayed roughly in the same proportion as the number of iterations per inner loop was increased, and so yes I was spending the vast majority of time monitoring rand and for and * and -, but I was clearly not spending all of the time doing so. Perhaps it would have been easier to monitor if I had just returned the input.
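A sketch of that simpler variant, with functions that just return their input so that rand and the arithmetic drop out of the measurement. The names are made up for this example, and the volatile function pointer is an assumption about how to keep the calls from being inlined or removed; whether it is sufficient depends on the compiler.

```cpp
#include <cassert>
#include <cstddef>

double ref_identity(const double &d) { return d; }
double val_identity(const double d)  { return d; }

// Hypothetical helper: call fn `calls` times through a volatile pointer and
// return the last result, so that only the call overhead remains to be timed.
template <typename Fn>
double call_repeatedly(Fn fn, double in, std::size_t calls)
{
    assert(fn != nullptr);

    Fn volatile fptr = fn;     // reloaded every iteration, so calls stay indirect
    volatile double out = 0.0; // keeps the result from being optimized away

    for (std::size_t j = 0; j < calls; j++)
        out = fptr(in);

    return out;
}
```

Wrapping call_repeatedly(ref_identity, in, 100000) and call_repeatedly(val_identity, in, 100000) in the same timer calls as before would isolate the parameter-passing cost, though the loop overhead itself is still included.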

I was never disagreeing with the notion that passing in float, double, long unsigned int by reference is just not worth it. I do disagree with strawmen and a pathological aversion to statistics though. Please keep in mind that I'm trepidatious enough to have troubleshot the test by using several different outer loops before allowing myself to come to any conclusions.

Edited by taby, 19 June 2012 - 03:12 PM.


#6 Álvaro   Members   -  Reputation: 6183


Posted 19 June 2012 - 06:22 PM

I do disagree with strawmen and a pathological aversion to statistics though.


Quite the contrary, I think statistics are great. Gathering statistics about performance of a supposed optimization in your program is very useful. But synthetic performance tests are often not representative of what will happen in an actual program.

#7 Narf the Mouse   Members   -  Reputation: 312


Posted 19 June 2012 - 07:44 PM

*Long, long time ago, I did a synthetic benchmark on C# which showed that passing and multiplying Matrix4x4s by Ref<T> was a lot **faster than passing and multiplying by value. Excitedly, I hurried over to my game engine, to replace all my Matrix4x4s with Ref<Matrix4x4>. Thanks to the magic of OO, VSE's error-reporting and find-replace, this was not too arduous a chore. I then quickly booted up my engine's speed test, ran some profiling...

...And discovered absolutely no difference in speed.

Differences in synthetic benchmarks are often because the compiler takes a look, says "Yep, that code looks fast" and doesn't overly optimize. Meanwhile, I've found (more than once; mostly, but not always) that cool optimization you thought of? The compiler has already thought of it, or is doing something else that makes it unnecessary or even slower.

Or, even simpler, that optimization you made doesn't really have much to do with what's slowing down the code.

Fortunately, OO, VSE's error-reporting and find-replace meant it was easy to fix, too.

OTOH, I still lost most of a working day to that.

Was it worth it? Sure. If it had worked, I'd have seen some major speedups. However, even a synthetic benchmark that tests a specific problem is no match for a more real-world benchmark that tests the whole suite.

* About two weeks.
** Up to 57x, in fact.

#8 taby   Members   -  Reputation: 285


Posted 21 June 2012 - 01:14 PM

Ok, I'll keep everything outside of the function roughly the same and make the function more complicated. Speaking of synthetic benchmarks, I'm hopeful that we can agree that sometimes just a handful of isolated instructions is not all that is required for gauging performance in practice.

Edited by taby, 21 June 2012 - 01:21 PM.




