Skiller

Unity [Solved] Macro to replace namespaced function


Hi, I'm rewriting my game engine at the moment (I made the stupid mistake of underestimating the difficulty of retrofitting thread safety) and decided to take the opportunity to tidy up my code by using namespaces instead of 2-3 letter prefixes for each system. I also want to speed up a few heavily used things, like my math library, but I've run into a problem: I want to use macros instead of functions in release builds because they are faster, but having the debug functions in namespaces is making things difficult :(. Is there a way to use macros to fake namespaces so that something like the following would work?
namespace Math
{
#ifdef _DEBUG
    static float Max(const float value1, const float value2) {return ((value1 > value2) ? value1 : value2);}
#else
    #define Max(max, val) (((max) < (val)) ? (val) : (max))
#endif
};


void SomeFunction()
{
    float max = Math::Max(0.1f, 0.2f);
}

Obviously it all works fine in debug, but in release it won't compile since macros aren't affected by namespaces :(.

BTW, before anyone says "inline functions are just as fast as macros": I have been profiling, and no matter what I do the macro version runs nearly 2 times faster than the best version of the function I could come up with, and in many situations as much as 10-100 times faster (like when using constants, which get compiled down to just an assignment). It is possible I'm doing something wrong that's preventing it from inlining properly; if so then awesome, I can just use the much nicer/safer function rather than macros, but I doubt that'll be the case :(.

Thanks

Edit: Solution is to enable the optimization compiler switches and use functions instead.

[Edited by - Skiller on October 12, 2008 5:30:15 AM]

No, it's not possible; macros aren't aware of namespaces in any way.

Also, have you been profiling with full optimizations enabled? I wouldn't be surprised if the macro version were that much faster in an unoptimized debug build, but seeing those kinds of results in a fully optimized build stripped of any debugging seems rather strange to me.

Also, you don't need to write your own Math::Max, it's already been done for you: std::max.
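
To illustrate the point about the preprocessor: a macro is replaced textually before the compiler knows anything about namespaces, so with `Max` defined as a macro, `Math::Max(a, b)` turns into `Math::(((a) < (b)) ? (b) : (a))`, which is not valid C++. A minimal sketch of the function-based alternative follows. It reuses the `Math::Max` name from the original post next to `std::max`; the `largerOfThree` helper is hypothetical and only there to show both calls.

```cpp
#include <algorithm> // std::max
#include <cassert>

namespace Math
{
    // An ordinary inline function takes part in namespace lookup,
    // which a macro cannot do. Small types such as float are passed by value.
    inline float Max(float value1, float value2)
    {
        return (value1 > value2) ? value1 : value2;
    }
}

// Hypothetical helper, only here to show both spellings being called.
inline float largerOfThree(float a, float b, float c)
{
    const float viaMath = Math::Max(a, b); // qualified call works for functions
    const float viaStd  = std::max(a, b);  // the standard library equivalent
    assert(viaMath == viaStd);
    return std::max(viaMath, c);
}
```

Calling `largerOfThree(1.0f, 3.0f, 2.0f)` goes through both versions and yields the largest of the three values.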

I'm gonna say it anyway: use the inline function. Yes, it is likely that you are doing something wrong if you get a 10-100 times difference, or you should get a decent compiler. For example, whole program optimization in later editions of Visual Studio should have no problem inlining that function call.
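
One reason the function is the safer choice, sketched below: a function-like macro pastes its argument text wherever the parameter appears, so an argument with a side effect can be evaluated twice. `maxDefine` is copied from the thread; `maxFunction` and the two counting helpers are illustrative names, not part of the original code.

```cpp
#include <cassert>

#define maxDefine(max, val) (((max) < (val)) ? (val) : (max))

inline int maxFunction(int value1, int value2)
{
    return (value1 > value2) ? value1 : value2;
}

// Returns how many times the counter was incremented by one macro "call".
inline int incrementsWithMacro()
{
    int counter = 5;
    int other = 3;
    // Expands to (((counter++) < (other)) ? (other) : (counter++)):
    // the comparison increments once, the selected branch increments again.
    int result = maxDefine(counter++, other);
    (void)result;
    return counter - 5;
}

// Returns how many times the counter was incremented by one function call.
inline int incrementsWithFunction()
{
    int counter = 5;
    int other = 3;
    // The argument expression is evaluated exactly once before the call.
    int result = maxFunction(counter++, other);
    (void)result;
    return counter - 5;
}
```

With these inputs the macro version increments the counter twice and the function version once, which is the kind of surprise the function avoids.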

Quote:
Original post by Skiller
BTW before anyone says "inline functions are just as fast as macros" I have been profiling and no matter what I do the macro version will run nearly 2 times faster than the best version of the function I could come up with and in many situations as much as 10-100 times faster (like when using constants which will get compiled down to just an assignment). Though it is possible I'm doing something wrong that's preventing it inlining properly and if so then awesome, I can just use the much nicer/safer function rather than macros but I doubt that'll be the case :(.

I'd like to see your test cases. Got an example where the macro version is faster than the inlined one?

I think your basic problem with any performance metrics you might be trying to use is that your function isn't actually an inline function. Inline functions use the inline keyword or get defined inside the body of a class definition. What you've got is a non-inline function declared as static. Static will make it link, but it'll also create a separate copy of the function in each and every source file that includes that function definition. This leads to code bloat, which can also adversely affect performance.

Here is the exact code I use for this. If you can spot why it might not be inlining properly then please let me know, as I would much prefer to use functions. BTW std::max is actually slower than the function I wrote, but only by an insignificant amount ;).

UtMath.h
#ifndef UT_MATH_H
#define UT_MATH_H


///////////////////////////////////////////////////////////////////////////////
/// Various math constants and functions.
///////////////////////////////////////////////////////////////////////////////
namespace Math
{
// Various other functions and constants not relevant here were edited out

#define maxDefine(max, val) (((max) < (val)) ? (val) : (max))

inline float Max(const float& value1, const float& value2){return ((value1 > value2) ? value1 : value2);}
};


#endif //UT_MATH_H



Main.cpp (just shoved in here as a temporary measure while profiling)
for (UInt32 pass = 0; pass < 10; ++pass)
{
    float val1 = pass * 0.01f;
    float val2 = pass * 0.02f;
    UtProfiler_BeginLoad("std::max");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = std::max(val1, val2);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Macro");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = maxDefine(val1, val2);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Function");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = Math::Max(val1, val2);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Function");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = Math::Max(val1, val2);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Macro");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = maxDefine(val1, val2);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("std::max");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = std::max(val1, val2);
    }
    UtProfiler_EndLoad();
}



I mixed up the order like that to see if it might have been a cache thing but results were still very consistent.
The profiler just calls QueryPerformanceCounter in UtProfiler_BeginLoad, then again in UtProfiler_EndLoad, and traces out the difference.
I'm using a modder build config which I created. It's a copy of the release build config but with a _MODDER #define set that turns on all the profiling, asserts, memory tracking and other tools I have in debug, but with compiler optimizations on. I'm also using the VS2008 compiler. In case you are wondering, I plan to ship the modder build so content modders will get a heap of extra info on any potential problems their content may cause; also, if people are getting crashes they can run the modder build and it'll probably throw an assert or trace out some extra info to help me track down the bug more easily :).
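
The begin/end timing described above can be sketched roughly as follows. This is not the actual UtProfiler code, which was not posted: it assumes a C++11 compiler and uses `std::chrono::steady_clock` as a portable stand-in for QueryPerformanceCounter, and `ScopedLoadTimer` is a hypothetical name.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the UtProfiler_BeginLoad / UtProfiler_EndLoad pair:
// the constructor records the start time, the destructor prints the elapsed time.
class ScopedLoadTimer
{
public:
    explicit ScopedLoadTimer(const char* label)
        : label_(label), start_(std::chrono::steady_clock::now())
    {
    }

    // Seconds elapsed since construction.
    double elapsedSeconds() const
    {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        return elapsed.count();
    }

    ~ScopedLoadTimer()
    {
        std::printf("Load complete: %s, Completed in: %.5f\n",
                    label_, elapsedSeconds());
    }

private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping each timed loop in its own scope with a `ScopedLoadTimer` at the top would print one line per loop in the same shape as the results below.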

Anyway the results:
Load complete: std::max, Completed in: 0.00483
Load complete: Macro, Completed in: 0.00215
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00441
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00442
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00225
Load complete: Function, Completed in: 0.00444
Load complete: Function, Completed in: 0.00439
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00222
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00440
Load complete: Macro, Completed in: 0.00234
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00219
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00441
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00460
Load complete: Macro, Completed in: 0.00221
Load complete: Function, Completed in: 0.00446
Load complete: Function, Completed in: 0.00440
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00452
Load complete: Macro, Completed in: 0.00212
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00442
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00234
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00440
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00219
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00440
Load complete: Macro, Completed in: 0.00235
Load complete: std::max, Completed in: 0.00449
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00227
Load complete: Function, Completed in: 0.00443
Load complete: Function, Completed in: 0.00441
Load complete: Macro, Completed in: 0.00233
Load complete: std::max, Completed in: 0.00447
Load complete: std::max, Completed in: 0.00451
Load complete: Macro, Completed in: 0.00212
Load complete: Function, Completed in: 0.00442
Load complete: Function, Completed in: 0.00442
Load complete: Macro, Completed in: 0.00233
Load complete: std::max, Completed in: 0.00447



Fairly consistently Macro > Function > std::max in terms of speed. If, as seems to be the case, there is no way to get macros to work with namespaces, I'll just use my function for both builds, since CPUs should be fast enough for it not to make too big an impact on frame rate and I'd rather have the type safety if I can get it. I'll also probably templatize the function if that doesn't have too big an impact on speed (which it shouldn't, as far as I'm aware).
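
A templatized version along the lines mentioned above might look like the sketch below. It is only one possible shape, not the code that ended up in the engine: it takes and returns by value, which assumes the types involved are cheap to copy.

```cpp
#include <cassert>

namespace Math
{
    // One definition covers float, int, double and so on.
    // Both arguments must deduce to the same type T, so mixing types
    // (for example an int and a float) is rejected at compile time.
    template <typename T>
    inline T Max(T value1, T value2)
    {
        return (value1 > value2) ? value1 : value2;
    }
}
```

`Math::Max(0.1f, 0.2f)` instantiates the float version and `Math::Max(1, 2)` the int version; since the template is defined in the header, the compiler can inline it just like the non-template function.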



Edit: I redid my profiling using the fast function, and yeah, my results aren't anywhere near the 10-100 times faster with constants that I stated earlier; those figures must have been for the old function I was using, which didn't seem to inline. But in an unexpected result, the function and std::max *increased* the time they took when using constants, while the macro decreased to the point that the majority of the time is probably the time spent looping.
Main.cpp using constants:
for (UInt32 pass = 0; pass < 10; ++pass)
{
    float val1 = pass * 0.01f;
    float val2 = pass * 0.02f;
    UtProfiler_BeginLoad("std::max");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = std::max(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Macro");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = maxDefine(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Function");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = Math::Max(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Function");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = Math::Max(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("Macro");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = maxDefine(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
    UtProfiler_BeginLoad("std::max");
    for (UInt32 i = 0; i < 1000000; ++i)
    {
        float maxVal = std::max(0.1f, 0.2f);
    }
    UtProfiler_EndLoad();
}


Results of using constants:
Load complete: std::max, Completed in: 0.00512
Load complete: Macro, Completed in: 0.00178
Load complete: Function, Completed in: 0.00580
Load complete: Function, Completed in: 0.00516
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00521
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00581
Load complete: Function, Completed in: 0.00517
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00523
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00568
Load complete: Function, Completed in: 0.00517
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00524
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00587
Load complete: Function, Completed in: 0.00516
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00516
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00555
Load complete: Function, Completed in: 0.00516
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00503
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00582
Load complete: Function, Completed in: 0.00517
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00568
Load complete: std::max, Completed in: 0.00522
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00569
Load complete: Function, Completed in: 0.00518
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00519
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00601
Load complete: Function, Completed in: 0.00517
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00568
Load complete: std::max, Completed in: 0.00521
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00567
Load complete: Function, Completed in: 0.00517
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567
Load complete: std::max, Completed in: 0.00519
Load complete: Macro, Completed in: 0.00177
Load complete: Function, Completed in: 0.00559
Load complete: Function, Completed in: 0.00516
Load complete: Macro, Completed in: 0.00177
Load complete: std::max, Completed in: 0.00567

Quote:
Original post by Skiller
But in an unexpected result, the function and std::max *increased* the time they took when using constants, while the macro decreased to the point that the majority of the time is probably the time spent looping.

The macro with constants probably caused the entire loop to be optimised out, so those results are pretty much useless.

Problem is that you used a constant, and the compiler is free to optimize that out:

maxDefine(1, 2)
// will become
1 > 2 ? 1 : 2
// will become
2

And given that you never even do anything with the float itself, the compiler should technically be free to get rid of the whole loop.
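
The same folding can be made explicit for a function. The sketch below assumes a C++11 or later compiler (newer than the one used in this thread) and marks a by-value `Math::Max` as `constexpr`, so the compiler itself confirms that the call with constants reduces to a constant. `kLargest` is a hypothetical name.

```cpp
namespace Math
{
    // constexpr (C++11 and later) allows the call to be evaluated at
    // compile time when both arguments are constants.
    constexpr float Max(float value1, float value2)
    {
        return (value1 > value2) ? value1 : value2;
    }
}

// Checked by the compiler: this expression needs no call at run time.
static_assert(Math::Max(0.1f, 0.2f) == 0.2f, "folded at compile time");

// Hypothetical constant initialised from the folded value.
constexpr float kLargest = Math::Max(0.1f, 0.2f);
```

With variable arguments the same function is still an ordinary inline candidate, so nothing is lost compared with the macro.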

Quote:
Original post by swiftcoder
Quote:
Original post by Skiller
But in an unexpected result, the function and std::max *increased* the time they took when using constants, while the macro decreased to the point that the majority of the time is probably the time spent looping.
The macro with constants probably caused the entire loop to be optimised out, so those results are pretty much useless.


What's faster is faster; that's all there is to it. It's a good thing that it gets optimized out, and it's good to see how much faster it is in that case, so I don't understand how those results are useless. If constants were used in the code then the results clearly show that a macro is the fastest option, though obviously it'd be much rarer for that to be the case, which is why I'm only really concerned with the common use case of using variables.

Quote:
Original post by agi_shi
Problem is that you used a constant, and the compiler is free to optimize that out:

maxDefine(1, 2)
// will become
1 > 2 ? 1 : 2
// will become
2

And given that you never even do anything with the float itself, the compiler should technically be free to get rid of the whole loop.


It gets compiled down to just this:
			float maxVal = maxDefine(0.01f, 0.02f);
0040D6E9 fld dword ptr [__real@3ca3d70a (40F958h)]
0040D6EF fstp dword ptr [maxVal]

I don't understand assembly, but that looks like it's probably just an assignment, which would support your theory, except for the loop being compiled out: the loop still runs.

But as I said, if it gets compiled out then that's the best possible outcome as far as speed goes, so that's not a problem; it just clearly shows macros are the best solution when using constants.



Anyway, I'm not concerned about use with constants. I just put it there for those who were interested and to clear up my erroneous statement that I was getting 10-100 times the speed with constants, which it's quite clear I'm not anymore since the function is much faster than when I last checked :).

Quote:
Original post by Skiller
Quote:
Original post by swiftcoder
Quote:
Original post by Skiller
But in an unexpected result, the function and std::max *increased* the time they took when using constants, while the macro decreased to the point that the majority of the time is probably the time spent looping.
The macro with constants probably caused the entire loop to be optimised out, so those results are pretty much useless.

What's faster is faster; that's all there is to it. It's a good thing that it gets optimized out, and it's good to see how much faster it is in that case, so I don't understand how those results are useless. If constants were used in the code then the results clearly show that a macro is the fastest option, though obviously it'd be much rarer for that to be the case, which is why I'm only really concerned with the common use case of using variables.

You partially missed my point, which was along the same lines as agi_shi's point. Since the compiler can immediately reduce float maxVal = maxDefine(0.1f, 0.2f); to float maxVal = 0.2f;, it can then deduce that the loop doesn't do anything, and remove it entirely.

This means that you are comparing zero invocations of maxDefine against one million invocations of std::max, which tells you nothing - of course zero calls are faster than many calls! However, if your loop does something non-trivial, it may not be optimised out, at which point you can check the relative performance. You really have to check the resulting assembly code to make sure that your loops haven't disappeared completely.
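
A loop that "does something non-trivial" could look like the sketch below: every iteration feeds into a value that is returned, and the inputs come from data the compiler cannot see through when the array is filled at run time. `MaxByValue` and `sumOfMax` are illustrative names, and the 0.5f threshold is arbitrary.

```cpp
#include <cstddef>

inline float MaxByValue(float value1, float value2)
{
    return (value1 > value2) ? value1 : value2;
}

// Adds up the larger of each element and a fixed threshold.
// Because the total is returned, the work in the loop is observable
// and cannot simply be thrown away as long as the caller uses the result.
inline float sumOfMax(const float* values, std::size_t count)
{
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
    {
        total += MaxByValue(values[i], 0.5f);
    }
    return total;
}
```

The caller still has to consume the result, for example by printing it after the timer stops; checking the generated assembly remains the only way to be sure what was actually measured.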

Quote:
Original post by Skiller
It gets compiled down to just this:
*** Source Snippet Removed ***

And what does the function version get compiled to?
I'd expect that to be the same. The compiler should be able to optimize everything down to a single assignment in both the macro and function cases. (Although your timing results hint that that's probably not the case)

Also, what is the time unit it prints out? 0.00483 seconds?

As said above, your test isn't worth much though, because the entire loop can be optimized away in both cases. You're not testing a million max calls, you're testing *one*.

Apart from that, I can think of two possible sources for the slowdown in the function case.
One is that you're passing the arguments by reference, which is generally a waste of time with small POD datatypes. (On the other hand, I'd expect the compiler to be able to optimize that away in such a simple function), and the second might be the floating-point precision which causes extra float<->double casts in the function case.
That should be visible if you take a look at the assembly output though.

Quote:
Original post by swiftcoder
You partially missed my point, which was along the same lines as agi_shi's point. Since the compiler can immediately reduce float maxVal = maxDefine(0.1f, 0.2f); to float maxVal = 0.2f;, it can then deduce that the loop doesn't do anything, and remove it entirely.

This means that you are comparing zero invocations of the maxDefine against one million invocations of std::max, which tells you nothing - of course zero calls is faster than many calls! However, if your loop does something non-trivial, it may not be optimised out, at which point you can check the relative performance. You really have to check the resulting assembly code to make sure that your loops haven't disappeared completely.


I'd checked the assembly and as far as I can tell everything for the loops is still there and the timing difference also supports the fact that it still runs the entire loop so that's not the problem. Also increasing the number of loops by 10 times made the results take 10 times longer.


Quote:
Original post by Spoonbender
And what does the function version get compiled to?
I'd expect that to be the same. The compiler should be able to optimize everything down to a single assignment in both the macro and function cases. (Although your timing results hint that that's probably not the case)


The call to Math::Max disassembly:
			float maxVal = Math::Max(0.01f, 0.02f);
0040D652 fld dword ptr [__real@3ca3d70a (40F958h)]
0040D658 fstp dword ptr [ebp-1BCh]
0040D65E fld dword ptr [__real@3c23d70a (40F954h)]
0040D664 fstp dword ptr [ebp-1C0h]
0040D66A lea eax,[ebp-1BCh]
0040D670 push eax
0040D671 lea ecx,[ebp-1C0h]
0040D677 push ecx
0040D678 call Math::Max (401040h)
0040D67D add esp,8
0040D680 fstp dword ptr [maxVal]



And the Math::Max function disassembly:
	inline float			Max(const float& value1, const float& value2){return ((value1 > value2) ? value1 : value2);}
00401040 push ebp
00401041 mov ebp,esp
00401043 push ecx
00401044 mov eax,dword ptr [value1]
00401047 fld dword ptr [eax]
00401049 mov ecx,dword ptr [value2]
0040104C fld dword ptr [ecx]
0040104E fcompp
00401050 fnstsw ax
00401052 test ah,5
00401055 jp Math::Max+21h (401061h)
00401057 mov edx,dword ptr [value1]
0040105A fld dword ptr [edx]
0040105C fstp dword ptr [ebp-4]
0040105F jmp Math::Max+29h (401069h)
00401061 mov eax,dword ptr [value2]
00401064 fld dword ptr [eax]
00401066 fstp dword ptr [ebp-4]
00401069 fld dword ptr [ebp-4]
0040106C mov esp,ebp
0040106E pop ebp
0040106F ret



A lot of extra work by the look of it, a far cry from the 2 instructions the macro did :(. The fact that it's doing a call is rather confusing though; as I've said, I'm not very familiar with assembly, so I'm not sure if it's supposed to be doing that if it inlines the function.

Quote:
Original post by Spoonbender
Also, what is the time unit it prints out? 0.00483 seconds?


Yes, the results are in seconds. For reference I'm running an Intel Q9550 (45nm Core 2 Quad) overclocked to 3.4 GHz (8.5 x 400 MHz).

Quote:
Original post by Spoonbender
As said above, your test isn't worth much though, because the entire loop can be optimized away in both cases. You're not testing a million max calls, you're testing *one*.


Answered in response to swiftcoder.

Quote:
Original post by Spoonbender
Apart from that, I can think of two possible sources for the slowdown in the function case.
One is that you're passing the arguments by reference, which is generally a waste of time with small POD datatypes. (On the other hand, I'd expect the compiler to be able to optimize that away in such a simple function), and the second might be the floating-point precision which causes extra float<->double casts in the function case.
That should be visible if you take a look at the assembly output though.


I'm not too familiar with assembly, I posted it earlier in the post though if you want to take a look. And I also tried passing by value instead of by reference but that yielded slightly worse performance, though it was insignificant enough that I can't be certain it's not just normal speed fluctuation from background processes.

Quote:
A lot of extra work by the look of it, a far cry from the 2 instructions the macro did :(. The fact that it's doing a call is rather confusing though; as I've said, I'm not very familiar with assembly, so I'm not sure if it's supposed to be doing that if it inlines the function.

How can you be wanting to optimize without really understanding what's going on in the CPU? (i.e. which instructions do what) Voodoo and experimentation aren't the most effective way to go about things. Intel and AMD have freely available manuals that are worth reading :)

The "2 instructions" were just a load and a store, so not valid for comparison. The problem with the compiled form of Math::Max is that FNSTSW is quite slow and poorly predicted conditional branches are murder. Fortunately both can be avoided by enabling SSE in code generation options; you will know this has succeeded when you see a MAXSS instruction.

As an aside, the Intel compiler is much better at optimization and can fairly cheaply be had by students. MSC still just plain sucks, so much that MS are throwing away their 20+ year old hacked-together compiler and replacing it with "Phoenix" (rising from the "ashes", heh).

Well, it seems I'm an idiot. I'd made the foolish assumption that the optimization compiler switches were on for release, but I just double-checked and they weren't, so they weren't on for the modder build either. I turned them on and the times still favor macros, but the difference is insignificant and probably just coincidental anyway (0.00001 - 0.00005 seconds).

So now I can just abolish the use of macros for speed-ups :D


Now that that's sorted, does anyone know what the "warning C4748: /GS can not protect parameters and local variables from local buffer overrun because optimizations are disabled in function" that I now get is about and how to fix it?
Never mind, I just needed it on for all projects in the solution.
