[Solved] Macro to replace namespaced function

13 comments, last by Skiller 15 years, 6 months ago
Quote:Original post by Skiller
Quote:Original post by swiftcoder
Quote:Original post by Skiller
But in an unexpected result, the function and std::max *increased* the time taken when using constants, while the macro's time decreased to the point that the majority of it is probably just the time spent looping.
The macro with constants probably caused the entire loop to be optimised out, so those results are pretty much useless.

What's faster is faster, that's all there is to it. It's a good thing that it gets optimized out, and it's good to see how much faster it is in that case, so I don't understand how those results are useless. If constants were used in the code then the results clearly show that a macro is the fastest option, though obviously that would be a much rarer case, which is why I'm only really concerned with the common use case of using variables.
You partially missed my point, which was along the same lines as agi_shi's point. Since the compiler can immediately reduce float maxVal = maxDefine(0.1f, 0.2f); to float maxVal = 0.2f;, it can then deduce that the loop doesn't do anything, and remove it entirely.

This means that you are comparing zero invocations of the maxDefine against one million invocations of std::max, which tells you nothing - of course zero calls is faster than many calls! However, if your loop does something non-trivial, it may not be optimised out, at which point you can check the relative performance. You really have to check the resulting assembly code to make sure that your loops haven't disappeared completely.
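
As a rough illustration of a loop that is less likely to be removed, here is a minimal sketch. It assumes the inputs come from a buffer filled at run time and that the result of every call feeds into a returned sum. MAX_DEFINE, make_inputs, sum_max_macro, sum_max_std and time_seconds are hypothetical names standing in for the code in this thread, not the original test.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the thread's maxDefine macro.
#define MAX_DEFINE(a, b) (((a) > (b)) ? (a) : (b))

// Fill a buffer with pseudo-random values in [0, 1) using a simple
// linear congruential generator, so the operands are not literal constants.
std::vector<float> make_inputs(std::size_t count, unsigned seed) {
    std::vector<float> values(count);
    unsigned state = seed;
    for (float& v : values) {
        state = state * 1664525u + 1013904223u;
        v = static_cast<float>(state >> 8) / 16777216.0f;
    }
    return values;
}

// Sum of pairwise maxima using the macro. Returning the sum gives the
// loop an observable result, so it cannot simply be dropped as dead code.
float sum_max_macro(const std::vector<float>& v) {
    float sum = 0.0f;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        sum += MAX_DEFINE(v[i], v[i + 1]);
    return sum;
}

// Same loop using std::max.
float sum_max_std(const std::vector<float>& v) {
    float sum = 0.0f;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        sum += std::max(v[i], v[i + 1]);
    return sum;
}

// Time a single call to f(v) in seconds; the result is written to `out`
// so the caller can print or compare it.
template <typename F>
double time_seconds(F&& f, const std::vector<float>& v, float& out) {
    const auto start = std::chrono::steady_clock::now();
    out = f(v);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_seconds(sum_max_macro, inputs, result) and time_seconds(sum_max_std, inputs, result) on the same buffer, then printing both results, gives timings that at least refer to the same amount of work. It is still worth checking the generated assembly, since a compiler may transform the loop in other ways.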

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

Quote:Original post by Skiller
It gets compiled down to just this:
*** Source Snippet Removed ***

And what does the function version get compiled to?
I'd expect that to be the same. The compiler should be able to optimize everything down to a single assignment in both the macro and function cases. (Although your timing results hint that that's probably not the case)

Also, what is the time unit it prints out? 0.00483 seconds?

As said above, your test isn't worth much though, because the entire loop can be optimized away in both cases. You're not testing a million max calls, you're testing *one*.

Apart from that, I can think of two possible sources for the slowdown in the function case.
One is that you're passing the arguments by reference, which is generally a waste of time with small POD datatypes. (On the other hand, I'd expect the compiler to be able to optimize that away in such a simple function), and the second might be the floating-point precision which causes extra float<->double casts in the function case.
That should be visible if you take a look at the assembly output though.
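
To make the by-reference point concrete, here is a small sketch of the two signatures side by side. max_by_ref mirrors the shape of the Math::Max shown in this thread; max_by_val is a hypothetical alternative, not code from the original project.

```cpp
// Takes its arguments by const reference, like the thread's Math::Max.
// If the call is not inlined, the caller has to store each float to
// memory and pass its address.
inline float max_by_ref(const float& a, const float& b) {
    return (a > b) ? a : b;
}

// Takes its arguments by value. For a small type like float this lets
// the operands be passed directly instead of through pointers.
inline float max_by_val(float a, float b) {
    return (a > b) ? a : b;
}
```

Both return the same result. Whether there is any measurable difference depends on whether the compiler inlines the call, which is something only the assembly output will show.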
Quote:Original post by swiftcoder
You partially missed my point, which was along the same lines as agi_shi's point. Since the compiler can immediately reduce float maxVal = maxDefine(0.1f, 0.2f); to float maxVal = 0.2f;, it can then deduce that the loop doesn't do anything, and remove it entirely.

This means that you are comparing zero invocations of the maxDefine against one million invocations of std::max, which tells you nothing - of course zero calls is faster than many calls! However, if your loop does something non-trivial, it may not be optimised out, at which point you can check the relative performance. You really have to check the resulting assembly code to make sure that your loops haven't disappeared completely.


I'd checked the assembly and, as far as I can tell, everything for the loops is still there. The timing difference also supports the fact that it still runs the entire loop, so that's not the problem: increasing the number of loop iterations by 10 times made the results take 10 times longer.


Quote:Original post by Spoonbender
And what does the function version get compiled to?
I'd expect that to be the same. The compiler should be able to optimize everything down to a single assignment in both the macro and function cases. (Although your timing results hint that that's probably not the case)


The call to Math::Max disassembly:
float maxVal = Math::Max(0.01f, 0.02f);
0040D652  fld         dword ptr [__real@3ca3d70a (40F958h)]
0040D658  fstp        dword ptr [ebp-1BCh]
0040D65E  fld         dword ptr [__real@3c23d70a (40F954h)]
0040D664  fstp        dword ptr [ebp-1C0h]
0040D66A  lea         eax,[ebp-1BCh]
0040D670  push        eax
0040D671  lea         ecx,[ebp-1C0h]
0040D677  push        ecx
0040D678  call        Math::Max (401040h)
0040D67D  add         esp,8
0040D680  fstp        dword ptr [maxVal]


And the Math::Max function disassembly:
inline float Max(const float& value1, const float& value2){return ((value1 > value2) ? value1 : value2);}
00401040  push        ebp
00401041  mov         ebp,esp
00401043  push        ecx
00401044  mov         eax,dword ptr [value1]
00401047  fld         dword ptr [eax]
00401049  mov         ecx,dword ptr [value2]
0040104C  fld         dword ptr [ecx]
0040104E  fcompp
00401050  fnstsw      ax
00401052  test        ah,5
00401055  jp          Math::Max+21h (401061h)
00401057  mov         edx,dword ptr [value1]
0040105A  fld         dword ptr [edx]
0040105C  fstp        dword ptr [ebp-4]
0040105F  jmp         Math::Max+29h (401069h)
00401061  mov         eax,dword ptr [value2]
00401064  fld         dword ptr [eax]
00401066  fstp        dword ptr [ebp-4]
00401069  fld         dword ptr [ebp-4]
0040106C  mov         esp,ebp
0040106E  pop         ebp
0040106F  ret


A lot of extra work by the look of it, a far cry from the 2 instructions the macro did :(. The fact that it's doing a call is rather confusing though. As I've said, I'm not very familiar with assembly, so I'm not sure if it's supposed to be doing that if it inlines the function.

Quote:Original post by Spoonbender
Also, what is the time unit it prints out? 0.00483 seconds?


Yes, the results are in seconds. For reference I'm running an Intel Q9550 (45 nm Core 2 Quad) overclocked to 3.4 GHz (8.5 x 400 MHz).

Quote:Original post by Spoonbender
As said above, your test isn't worth much though, because the entire loop can be optimized away in both cases. You're not testing a million max calls, you're testing *one*.


Answered in response to swiftcoder.

Quote:Original post by Spoonbender
Apart from that, I can think of two possible sources for the slowdown in the function case.
One is that you're passing the arguments by reference, which is generally a waste of time with small POD datatypes. (On the other hand, I'd expect the compiler to be able to optimize that away in such a simple function), and the second might be the floating-point precision which causes extra float<->double casts in the function case.
That should be visible if you take a look at the assembly output though.


I'm not too familiar with assembly, I posted it earlier in the post though if you want to take a look. And I also tried passing by value instead of by reference but that yielded slightly worse performance, though it was insignificant enough that I can't be certain it's not just normal speed fluctuation from background processes.
-Skiller
Quote:A lot of extra work by the look of it, a far cry from the 2 instructions the macro did :(. The fact that it's doing a call is rather confusing though, as I've said I'm not very familiar with assembly so I'm not sure if it's supposed to be doing that if it inlines the function.

How can you be wanting to optimize without really understanding what's going on in the CPU? (i.e. which instructions do what) Voodoo and experimentation aren't the most effective way to go about things. Intel and AMD have freely available manuals that are worth reading :)

The "2 instructions" were just a load and a store, so not valid for comparison. The problem with the compiled form of Math::Max is that FNSTSW is quite slow and poorly predicted conditional branches are murder. Fortunately both can be avoided by enabling SSE in code generation options; you will know this has succeeded when you see a MAXSS instruction.
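
For reference, the MAXSS instruction can also be requested explicitly through the SSE intrinsics in <xmmintrin.h>. The sketch below assumes an x86 target with SSE available and falls back to a plain comparison elsewhere. max_scalar, max_sse and EXAMPLE_HAS_SSE are hypothetical names, and this is not the code the compiler switch would generate for Math::Max, only a way to see the instruction in isolation.

```cpp
// Detect SSE support on common compilers (GCC/Clang define __SSE__,
// MSVC defines _M_X64 or _M_IX86_FP).
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define EXAMPLE_HAS_SSE 1
#endif

// Plain scalar version, same comparison as the thread's Math::Max.
inline float max_scalar(float a, float b) {
    return (a > b) ? a : b;
}

#ifdef EXAMPLE_HAS_SSE
// _mm_max_ss maps to MAXSS: it compares the lowest float of each
// operand and yields the larger one without a conditional branch.
inline float max_sse(float a, float b) {
    return _mm_cvtss_f32(_mm_max_ss(_mm_set_ss(a), _mm_set_ss(b)));
}
#else
// Fallback for targets without SSE, so the example still compiles.
inline float max_sse(float a, float b) {
    return max_scalar(a, b);
}
#endif
```

In practice it is usually simpler to enable SSE code generation and keep the plain version, then confirm in the disassembly that MAXSS appears.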

As an aside, the Intel compiler is much better at optimization and can fairly cheaply be had by students. MSC still just plain sucks, so much that MS are throwing away their 20+ year old hacked-together compiler and replacing it with "Phoenix" (rising from the "ashes", heh).
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
Well, it seems I'm an idiot. I'd made the foolish assumption that the optimization compiler switches were on for release, but I just double-checked and they weren't, so they weren't on for the modder build either. Turned them on and times are still favoring macros, but the difference is insignificant and probably just coincidental anyway (0.00001 - 0.00005 seconds).

So now I can just abolish the use of macros for speed-ups :D


Now that that's sorted, does anyone know what the "warning C4748: /GS can not protect parameters and local variables from local buffer overrun because optimizations are disabled in function" that I now get is about and how to fix it?
Never mind, I just needed it on for all projects in the solution.
-Skiller
