What is your longest C++ macro

It's like any cache. Macros increase code size, and each individual instance is distinct, so each one takes up new slots in the cache. A function can be stored once and referred to repeatedly. The trade-off is the memory used by repeating the code versus the overhead of stack management in the call.
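To make that concrete, here is a minimal sketch (not from this thread; the names are made up) of the two approaches. Every use of the macro stamps the expression out again at the call site, while the function can exist once and be called, or be inlined where the compiler judges it profitable:

// Hypothetical example: the same small computation as a macro and as an inline function.
#define LENGTH_SQ_MACRO(x, y, z) ((x) * (x) + (y) * (y) + (z) * (z))

// Emitted once; the compiler may still inline it at call sites it considers hot.
inline float length_sq(float x, float y, float z)
{
    return x * x + y * y + z * z;
}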
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

[quote name='jjd']
[quote name='Hodgman' timestamp='1323743418' post='4893360']
[quote name='jjd' timestamp='1323741620' post='4893351']I'd be really interested in finding out more about this topic. Is there a good source online that you would recommend?[/quote]
Wikipedia ;)

Modern compilers actually treat [font="Courier New"]inline[/font] as a hint, not a demand.
http://msdn.microsof...y/z8y1yy88.aspx
[quote]The insertion (called inline expansion or inlining) occurs only if the compiler's cost/benefit analysis shows it to be profitable. Inline expansion alleviates the function-call overhead at the potential cost of larger code size.[/quote]
So if you feel that some code is too small and used too often to be placed into a function, you can just put it into an inline function -- if the compiler agrees with you, it will inline that code (i.e. the same as if you'd used the OP's macro approach); otherwise, it will compile it as a regular function. In either case, the inline function is superior to the macro, as it's more maintainable, readable, debuggable, etc.
[/quote]

Sorry, I wasn't clear in my reply. I know the generic effect of the 'inline' keyword, i.e. that it's a hint, etc. I'm actually more interested in the details of how instruction cache design determines when inlining is beneficial or not.

-Josh
[/quote]

This gets hugely complicated these days, with multicore chips sharing various parts of the cache. But at the most basic level, the lowest-level instruction cache is relatively small, and if a function gets too large (especially if it contains a loop), the CPU has to fetch more code from a higher-level cache, which is a time-consuming operation. Normally this isn't a problem, because the pipelined nature of most CPUs today means code is requested before it is actually needed. But if you make a loop with too many instructions and external calls to fit in the code cache, you can give the CPU a tizzy fit quite easily. It loads up the start of the loop and starts executing it; oops, a call to another function, start loading that before the uops get there; hmm, had to evict the start of the loop to do it; keep processing; damn, more calls evict some older functions and more of the loop's start; keep going... Eventually, when execution comes back around to the top of the loop, nothing left in the code cache is usable: everything from the beginning of the loop has been evicted, there is too much code to fetch before the pipeline runs dry, and you get a big, nasty stall while the cache is reloaded with the start of the loop and the first functions it calls. (And that's ignoring the worse case of branch misprediction, where the cache started loading the code for *after* the loop instead of predicting a restart.)
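As a rough illustration of keeping a hot loop's code footprint small (a sketch only; handle_bad_sample() and process() are made-up names, not anything from this thread), a rarely taken path can be forced out of line so its code never gets inlined into, and bloats, the loop body:

#include <cstdio>

// Compiler-specific "do not inline" spelling.
#if defined(_MSC_VER)
    #define NOINLINE __declspec(noinline)
#else
    #define NOINLINE __attribute__((noinline))
#endif

// Cold path: big, slow, and almost never executed -- keep it out of the loop's code.
NOINLINE static void handle_bad_sample(int index, float value)
{
    std::fprintf(stderr, "bad sample %d: %f\n", index, value);
}

static float process(const float* samples, int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i)          // hot loop: keep this body tiny
    {
        if (samples[i] != samples[i])        // NaN check, rarely true
            handle_bad_sample(i, samples[i]);
        else
            sum += samples[i];
    }
    return sum;
}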


Under normal circumstances none of this causes actual stalls. Overuse of __forceinline and the equivalent attributes on GCC/Clang can cause it, though, because you are bypassing the compiler's smarts and claiming you know better. You are almost *NEVER* smarter than the compiler, outside of very specific cases where you know a particular (short) function is massively used and should always be inlined. Even then, the compiler usually gets it right, so you are just stating the obvious. If you find yourself using forced-inline functions, you should probably consider writing them in assembly instead: if they really are short and concise, that shouldn't be a problem, and rewriting one as a naked function can beat what the compiler produces. If you can't do that effectively, I strongly suggest not forcing the inline, because you are inhibiting the compiler's optimizations instead of benefiting from them.
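For reference, this is roughly how forced inlining is spelled on MSVC versus GCC/Clang (a sketch; lerp() is just a made-up stand-in for the rare "tiny and proven hot" case described above):

#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
#else
    #define FORCE_INLINE inline __attribute__((always_inline))
#endif

// Tiny and heavily used -- about the only kind of function worth forcing.
FORCE_INLINE float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}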


[quote name='jjd']
Sorry, I wasn't clear in my reply. I know the generic effect of the 'inline' keyword, i.e. that it's a hint, etc. I'm actually more interested in the details of how instruction cache design determines when inlining is beneficial or not.

-Josh
[/quote]


Yup, if you are a game developer and don't know about the CPU cache, now is the time to learn. Googling "cpu cache optimization" (without the quotes) will give you a lot of good topics.

For a quick start,
GDC 2003 — Memory Optimization
is a very, very good place to begin.

Though it was written 8 years ago, it's still a must-read until the current CPU architecture dies. :-)
I decided to learn more about CPU cache optimization after reading it.

https://www.kbasm.com -- My personal website

https://github.com/wqking/eventpp  eventpp -- C++ library for event dispatcher and callback list

https://github.com/cpgf/cpgf  cpgf library -- free C++ open source library for reflection, serialization, script binding, callbacks, and metadata for OpenGL, Box2D, SFML and Irrlicht.

The largest macro I use is 9 lines long and is used to begin the definition of a wrapper class that forwards function calls made to it into a thread-safe message queue. Essentially, it gave me an easy way to make my threading invisible, as long as I kept my interfaces asynchronous.

However, I would agree with the people here that a set of trigonometric computations is probably best extracted into a function. You can make an inline function every bit as performant as a macro (and likely better). Though I'd be rather surprised if you saw a significant drop in performance even from something as expensive as a virtual call, if you're already doing 27 lines' worth of trig.
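The OP's actual 27 lines of trig aren't shown in the thread, but as a hypothetical stand-in, this is the kind of extraction being suggested: one inline function, type-checked and debuggable, and still a candidate for inlining:

#include <cmath>

struct Vec2 { float x, y; };

// Rotate a point about a centre -- a made-up example of "a set of trigonometric
// computations" pulled into an inline function instead of a macro.
inline Vec2 rotate_about(Vec2 p, Vec2 centre, float radians)
{
    const float s = std::sin(radians);
    const float c = std::cos(radians);
    const float dx = p.x - centre.x;
    const float dy = p.y - centre.y;
    Vec2 r = { centre.x + dx * c - dy * s,
               centre.y + dx * s + dy * c };
    return r;
}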

[quote]
The largest macro I use is 9 lines long and is used to begin the definition of a wrapper class that forwards function calls made to it into a thread-safe message queue. Essentially, it gave me an easy way to make my threading invisible, as long as I kept my interfaces asynchronous.

However, I would agree with the people here that a set of trigonometric computations is probably best extracted into a function. You can make an inline function every bit as performant as a macro (and likely better). Though I'd be rather surprised if you saw a significant drop in performance even from something as expensive as a virtual call, if you're already doing 27 lines' worth of trig.
[/quote]


A macro causes explicitly inlined code to be generated. Function calls can actually be better, especially for maths functions, because you can guarantee that values stay within SSE2 registers when using that optimisation. The load-hit-store penalty from marshalling between normal floating point and SSE2 registers is far worse than the function call, for example.
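As a rough sketch of that point (an assumed example, not code from this thread): doing the whole calculation with SSE intrinsics keeps the intermediate values in XMM registers, instead of bouncing each result back out through memory as a scalar float.

#include <xmmintrin.h>

// Scale four floats from src and accumulate them into dst, all in SSE registers.
inline void scale_add4(float* dst, const float* src, float scale)
{
    __m128 s = _mm_set1_ps(scale);           // broadcast the scale factor
    __m128 v = _mm_loadu_ps(src);            // one load
    __m128 d = _mm_loadu_ps(dst);
    d = _mm_add_ps(d, _mm_mul_ps(v, s));     // all the maths stays in registers
    _mm_storeu_ps(dst, d);                   // one store
}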

Worked on titles: CMR:DiRT2, DiRT 3, DiRT: Showdown, GRID 2, theHunter, theHunter: Primal, Mad Max, Watch Dogs: Legion

I'd never use a macro in the C sense to inline a function for performance. I don't see any advantage of that over just having an inline function (possibly a template). If you don't like the "hint" character of inline, there is usually some compiler-specific way to force inlining. With high enough optimization settings, the compiler will also spot things like values that can be kept in registers, etc. I did some auto-vectorization with templates lately where most SSE intrinsics sat in their own inline function and the compiler literally optimized away EVERYTHING.
Cryptographers like to write macro-abused code, though; see for example this snippet from AES:
#define ROUND(i,d,s) \
d##0 = TE0(s##0) ^ TE1(s##1) ^ TE2(s##2) ^ TE3(s##3) ^ rk[4 * i]; \
d##1 = TE0(s##1) ^ TE1(s##2) ^ TE2(s##3) ^ TE3(s##0) ^ rk[4 * i + 1]; \
d##2 = TE0(s##2) ^ TE1(s##3) ^ TE2(s##0) ^ TE3(s##1) ^ rk[4 * i + 2]; \
d##3 = TE0(s##3) ^ TE1(s##0) ^ TE2(s##1) ^ TE3(s##2) ^ rk[4 * i + 3]
Note that each TE... is a macro itself, and ROUND is called several times. Now try and debug that.
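For comparison, here is a sketch of the same round written as an inline function (it still assumes the TE0..TE3 lookup macros and the rk round-key pointer from the snippet above are in scope, so it isn't standalone). It does the same work, but it shows up properly in a debugger and in error messages:

#include <cstdint>

inline void aes_round(uint32_t d[4], const uint32_t s[4], const uint32_t* rk, int i)
{
    // Same body as ROUND, with the state passed as arrays instead of being
    // glued together by token pasting.
    d[0] = TE0(s[0]) ^ TE1(s[1]) ^ TE2(s[2]) ^ TE3(s[3]) ^ rk[4 * i];
    d[1] = TE0(s[1]) ^ TE1(s[2]) ^ TE2(s[3]) ^ TE3(s[0]) ^ rk[4 * i + 1];
    d[2] = TE0(s[2]) ^ TE1(s[3]) ^ TE2(s[0]) ^ TE3(s[1]) ^ rk[4 * i + 2];
    d[3] = TE0(s[3]) ^ TE1(s[0]) ^ TE2(s[1]) ^ TE3(s[2]) ^ rk[4 * i + 3];
}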

It's a matter of style; there are people who write code like this (taken from the "russian range coder"), too:
while((low_ ^ low_ + range_) < TOP || range_ < BOT && ((range_ = (0 - low_) & BOT - 1), 1))
Without any doubt, the code is perfectly correct, and some people will even consider code like this "cool".

However, it's not obvious to the casual observer what's going on. Personally, I prefer that what code does is immediately obvious (to me and anyone else reading it). This includes, among other things, not having to read a line three times before understanding what it does, and error messages that point to the correct location in the source rather than being totally bogus. Macros tend to make code "ungraspable" and to generate bogus (or at least hard-to-pinpoint) errors.

Also, several times in the past I've found that writing clear and obvious code is not only much easier, but in fact generates faster code than a totally unreadable, hand-"optimized" version.
Thank you for all the responses, guys. A function it is.

N

This topic is closed to new replies.
