Stream Multiplication Performance with Functions

2 comments, last by jerrinx 11 years, 9 months ago
Hey guys,

I made a test program to check out the performance of stream multiplication using normal, virtual, and inline functions, and with the whole operation done in a single function.

These are the results when multiplying 1,000,000,000 floats.

Results


[CUtil] ## PROFILE Stream Product normal function : 6.663393837 sec(s) ##
[CUtil] ## PROFILE Stream Product virtual function : 6.608961085 sec(s) ##
[CUtil] ## PROFILE Stream Product inline function : 6.584697760 sec(s) ##
[CUtil] ## PROFILE Stream Product in function : 12.363450801 sec(s) ##

What I don't understand is why the stream product done in a single function takes twice as long!
Maybe it's just my setup, or something wrong with the code; I don't know.
Can somebody try this out?
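For anyone who wants to try reproducing this: the original test code wasn't posted, so here's a minimal sketch of what the four variants presumably look like (all names are my guesses, not the actual code):

```cpp
#include <cassert>
#include <cstddef>

float mulNormal(float a, float b) { return a * b; }        // "normal function"
inline float mulInline(float a, float b) { return a * b; } // "inline function"

struct MulBase {                                           // "virtual function"
    virtual float mul(float a, float b) const { return a * b; }
    virtual ~MulBase() {}
};

// Per-element multiply routed through a function call:
void productViaCall(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = mulNormal(a[i], b[i]);
}

// Whole stream multiplied directly in one function (the "in function" variant):
void productFused(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
}
```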

VS 2010 Express
Release Build
JerrinX
This test isn't going to tell you anything useful, at least not as posted. You don't seem to initialize your memory anywhere, so the operations could be doing anything, including arithmetic on denormals, which is far slower than arithmetic on normalized floats. You should set the memory up to contain known values beforehand to ensure you're doing consistent work in all of the tests.
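A sketch of the kind of setup being described here: fill the streams with known, normalized values before timing anything, so every variant does identical arithmetic (the helper name is illustrative, not from the original code):

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Initialize both input streams to known, normalized values before the timed
// loop. Uninitialized heap memory can decode to denormal floats, and denormal
// arithmetic is handled in microcode on x86, which is far slower.
std::vector<float> initializedProduct(std::size_t n) {
    std::vector<float> in1(n, 1.5f), in2(n, 2.0f), out(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in1[i] * in2[i];
    return out;
}
```

For reference, `std::numeric_limits<float>::denorm_min()` gives the smallest positive denormal; any bit pattern in that range below `numeric_limits<float>::min()` is the slow case.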

Also, even with that aside, you're not really testing what you think you are here. The compiler is almost certainly smart enough to elide a lot of the excess work you're doing, unroll loops, devirtualize function calls, and so on. You also need to consider far more than just one execution of the work load to account for things like cache warming, branch prediction, and so on. In short, constructing valid artificial benchmarks is extremely complex on modern hardware.
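One common pattern for getting more trustworthy numbers is an untimed warm-up pass plus a volatile sink so the optimizer can't elide the work. A sketch (note `std::chrono` is C++11, so on VS 2010 you'd substitute `QueryPerformanceCounter` or similar):

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Writing each result to a volatile sink keeps the optimizer from
// deciding the workload is dead code and removing it entirely.
volatile float g_sink = 0.0f;

// Run the workload once untimed (warms caches and branch predictors),
// then average the timed cost over several repetitions.
template <class Fn>
double timeSeconds(Fn fn, int reps) {
    g_sink = fn();  // warm-up pass, not timed
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        g_sink = fn();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return dt.count() / reps;
}
```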

Is there something specific you're trying to find out here?

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

If you want to gauge the performance difference, just look at the assembly of your different versions: if you see a call out to a function inside the loop, as opposed to the work running inline, you have a perf hit right there.

As ApochPiQ pointed out, the values in memory can have some heavy impact on the performance of your functionality.

But assembly and data aside, there's also the layout of your memory, and whether or not your data is being prefetched into the cache in time.
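Since this is for a particle system, the usual layout advice is a structure-of-arrays: each attribute sits contiguously, so the update loop is a linear scan the hardware prefetcher can follow. A sketch with illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays: each particle attribute is a contiguous float array,
// so iterating one attribute touches memory strictly sequentially.
struct ParticlesSoA {
    std::vector<float> x, vx;  // position and velocity, one axis shown
    explicit ParticlesSoA(std::size_t n) : x(n, 0.0f), vx(n, 1.0f) {}
};

void integrate(ParticlesSoA& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i)
        p.x[i] += p.vx[i] * dt;  // sequential, prefetch-friendly access
}
```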
Hey guys,

Thanks for the replies.

I wanted to make a particle system and wanted to test out the performance difference when we call a function to operate on an element vs doing it all together in a single function.

@ApochPiQ
The data set is output = input1 * input2.
So if there's an inconsistency in the multiplication, it's going to affect the other test cases too.
I thought loops could only be unrolled if you know the data set size beforehand. Correct me if I am wrong.

Maybe I need to call through an upcast base pointer to avoid devirtualisation.
Not sure about branch prediction and cache warming, though.
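If the goal is to guarantee a real virtual dispatch in the test, the usual trick is calling through a base reference or pointer so the concrete type is hidden at the call site. A sketch (names are illustrative):

```cpp
#include <cassert>

struct MulBase {
    virtual float mul(float a, float b) const = 0;
    virtual ~MulBase() {}
};

struct MulImpl : MulBase {
    virtual float mul(float a, float b) const { return a * b; }
};

// Taking the base by reference hides the concrete type from this function,
// so the call must go through the vtable unless the whole chain gets inlined.
float callThroughBase(const MulBase& m, float a, float b) {
    return m.mul(a, b);
}
```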

When I switched to debug mode, it gave me some sane results. It seems release mode shuffled the code around a bit.

Debug mode, 10,000,000-element data set


[CUtil] ## PROFILE Stream Product normal function : 0.549368595 sec(s) ##
[CUtil] ## PROFILE Stream Product virtual function : 0.582704152 sec(s) ##
[CUtil] ## PROFILE Stream Product inline function : 0.522523487 sec(s) ##
[CUtil] ## PROFILE Stream Product in function : 0.238292751 sec(s) ##

Release with optimisation turned off, 1,000,000,000-element data set

[CUtil] ## PROFILE Stream Product normal function : 19.569217771 sec(s) ##
[CUtil] ## PROFILE Stream Product virtual function : 22.762712440 sec(s) ##
[CUtil] ## PROFILE Stream Product inline function : 16.949578101 sec(s) ##
[CUtil] ## PROFILE Stream Product in function : 17.004290188 sec(s) ##

But then again, I would like to gauge the performance in release.
JerrinX

