Template function won't inline


I'm writing a wrapper for various SSE and Altivec intrinsics and I've run into a problem when dealing with _mm_shuffle_ps(). This intrinsic requires an immediate argument to determine which elements from each vector are chosen.
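To make the role of the immediate concrete, here is a small scalar model of what the shuffle selects. It is only a sketch: `shuffle_mask` and `shuffle_model` are names made up for illustration (they are not part of any SSE header), and the bit layout is the one used by the `_MM_SHUFFLE` macro.

```cpp
#include <array>

// Same bit layout as _MM_SHUFFLE(z, y, x, w) from <xmmintrin.h>:
// four 2-bit lane indices packed into a single 8-bit immediate.
constexpr int shuffle_mask(int z, int y, int x, int w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of _mm_shuffle_ps(a, b, imm): the two low result lanes
// are picked from a, the two high result lanes are picked from b.
inline std::array<float, 4> shuffle_model(const std::array<float, 4>& a,
                                          const std::array<float, 4>& b,
                                          int imm) {
    return {{ a[imm & 3], a[(imm >> 2) & 3],
              b[(imm >> 4) & 3], b[(imm >> 6) & 3] }};
}

// Selecting lanes 0, 1, 2, 3 in order encodes as 0xE4.
static_assert(shuffle_mask(3, 2, 1, 0) == 0xE4, "identity selection");
```

With a = (0, 1, 2, 3) and b = (10, 11, 12, 13), `shuffle_model(a, b, shuffle_mask(0, 1, 2, 3))` gives (3, 2, 11, 10): the low two lanes come from a, the high two from b.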

Since this value is needed at compile time, I decided to use a template function:

[source]
template < int i1, int i2, int i3, int i4 >
FORCE_INLINE SIMDFloat4 shuffle( const SIMDFloat4& a, const SIMDFloat4& b )
{
    return SIMDFloat4( _mm_shuffle_ps( a, b, _MM_SHUFFLE( i4, i3, i2, i1 ) ) );
}
[/source]

This compiles and works just fine, but when I examine assembly code produced by using this function, it doesn't get inlined. This is obviously unacceptable for SSE code whose whole purpose is to be as fast as possible, especially considering that this method should reduce to only 1 or 2 instructions at most.

Does anyone have an idea why GCC would not inline this method, even though I've specified for it to always be inlined using compiler attributes? I don't want to have to use any macros if I don't have to...

Share on other sites
OK, well I figured out what is causing it to not be inlined. The method in question is declared as a friend in my SIMDFloat4 class. When I remove the friend declaration, the method is then inlined correctly.

Any clues why this is? The other methods that are declared as friends are inlined properly, though they are not template methods.

Share on other sites
Which version of GCC are you using? What flags are you passing? Can you create a minimal example?

Is this representative?
[source]
#include <iostream>

class Foo
{
public:
    Foo(int n) : n(n) { }

private:
    int n;

    // Friend function template defined inside the class body.
    template<int i>
    friend void frobnicate(const Foo & foo)
    {
        std::cout << (foo.n * i) << '\n';
    }
};

int main()
{
    Foo foo(42);
    frobnicate<13>(foo);
}
[/source]
My gcc version 4.6.1 inlines this without any special keywords at optimisation level 3. If I use __attribute__((always_inline)), it inlines the call even with no optimisations specified.

Share on other sites

[quote]This is obviously unacceptable for SSE code whose whole purpose is to be as fast as possible, especially considering that this method should reduce to only 1 or 2 instructions at most.[/quote]

Incorrect. The aim is to get your code decoded into microcode as quickly as possible, and then to keep it in the instruction cache for as long as possible. Possible guesstimations:

* Remove the explicit construction of SIMDFloat4 within the method. This is just dumb....

return SIMDFloat4( _mm_shuffle_ps( a, b, _MM_SHUFFLE( i4, i3, i2, i1 ) ) );

It implies SIMDFloat4 has a ctor that takes an __m128, so why on earth are you not just using that directly? (rather than inserting an additional copy ctor in there!?!?!). It can simply be:

return _mm_shuffle_ps( a, b, _MM_SHUFFLE( i4, i3, i2, i1 ) );

* Is SIMDFloat4 declared as a DLL exported class? If it is, there's nothing you can do to inline any of it because you have told the compiler to ALWAYS extract the methods from a DLL at runtime, and so it will never inline. DLL export always takes priority. (The same will be true for shared objects.)

* Is shuffle declared as part of a DLL exported class? (Or as part of a shared object?)

* Are you storing a function pointer to the inlined method somewhere?

Share on other sites
[quote]especially considering that this method should reduce to only 1 or 2 instructions at most.[/quote]

It might, in the ideal case.

But the number of registers is limited. The instruction cache is small. Passing by const reference instead of by value (remember that SSE types fit into a single register, so the usual heuristics don't apply) might require the compiler to keep variables in memory rather than in registers. There might be scheduling or pipelining conflicts. The returned value might not be aligned properly, so it requires more than a single assignment. Aliasing of input parameters might matter (not sure how the compiler approaches this for SSE). And if SIMDFloat4 is anything but a typedef for __m128, then a whole lot of other issues might arise.
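For comparison, a by-value variant of the shuffle might look like the sketch below. It assumes the arguments are raw `__m128` values rather than the wrapper class, and `shuffle_by_value` is a name invented here for illustration. Whether it actually generates better code than the const-reference version depends on the compiler and flags, so it is worth checking the assembly either way.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE

// Hypothetical by-value variant: each __m128 fits in one XMM register,
// so the arguments can be passed in registers instead of through memory.
template <int i1, int i2, int i3, int i4>
inline __m128 shuffle_by_value(__m128 a, __m128 b) {
    // Result lanes: a[i1], a[i2], b[i3], b[i4].
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(i4, i3, i2, i1));
}
```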

Experiments have been done on finding optimal instruction sequences. The longest sequences they managed were about 10 instructions or so; since it is an NP-class problem, anything longer simply took too long to compute. So unless you have a cluster and centuries of time, your compiler needs to use certain heuristics and emit a good guess at what the code should look like.

Share on other sites
Well it turns out that all I had to do was add the force-inline directive to the friend declaration and things were inlined without a hitch. I guess that GCC looks at the first declaration of a method when determining its attributes, rather than all of them.
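A minimal sketch of that arrangement, for anyone who runs into the same thing. It assumes FORCE_INLINE expands to GCC's `always_inline` attribute (the real macro isn't shown in this thread), and the class body, the `v_` member and `operator[]` are made up for illustration. The macro is repeated on the forward declaration, the friend declaration and the definition, so whichever declaration the compiler consults carries the attribute.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_shuffle_ps, _MM_SHUFFLE
#include <cassert>

// Assumed definition of the FORCE_INLINE macro used in this thread.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE __forceinline
#endif

class SIMDFloat4;

// Forward declaration: a non-defining friend declaration may only say
// `inline` if the function has already been declared inline.
template <int i1, int i2, int i3, int i4>
FORCE_INLINE SIMDFloat4 shuffle(const SIMDFloat4& a, const SIMDFloat4& b);

class SIMDFloat4 {
public:
    FORCE_INLINE SIMDFloat4(__m128 v) : v_(v) {}
    FORCE_INLINE SIMDFloat4(float x, float y, float z, float w)
        : v_(_mm_setr_ps(x, y, z, w)) {}

    // Illustrative lane accessor (goes through memory, not meant to be fast).
    FORCE_INLINE float operator[](int i) const {
        assert(i >= 0 && i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v_);
        return tmp[i];
    }

private:
    __m128 v_;

    // The friend declaration carries the force-inline macro as well.
    template <int i1, int i2, int i3, int i4>
    friend FORCE_INLINE SIMDFloat4 shuffle(const SIMDFloat4& a,
                                           const SIMDFloat4& b);
};

template <int i1, int i2, int i3, int i4>
FORCE_INLINE SIMDFloat4 shuffle(const SIMDFloat4& a, const SIMDFloat4& b) {
    // Result lanes: a[i1], a[i2], b[i3], b[i4].
    return SIMDFloat4(_mm_shuffle_ps(a.v_, b.v_, _MM_SHUFFLE(i4, i3, i2, i1)));
}
```

A call such as `shuffle<3, 2, 1, 0>(a, b)` then picks lanes 3 and 2 of a followed by lanes 1 and 0 of b. Checking the generated assembly is still the only way to confirm that the call was really inlined.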

[quote]
* Remove the explicit construction of SIMDFloat4 within the method. This is just dumb....

return SIMDFloat4( _mm_shuffle_ps( a, b, _MM_SHUFFLE( i4, i3, i2, i1 ) ) );

It implies SIMDFloat4 has a ctor that takes an __m128, so why on earth are you not just using that directly? (rather than inserting an additional copy ctor in there!?!?!). It can simply be:

return _mm_shuffle_ps( a, b, _MM_SHUFFLE( i4, i3, i2, i1 ) );

* Is SIMDFloat4 declared as a DLL exported class? If it is, there's nothing you can do to inline any of it because you have told the compiler to ALWAYS extract the methods from a DLL at runtime, and so it will never inline. DLL export always takes priority. (The same will be true for shared objects.)

* Is shuffle declared as part of a DLL exported class? (Or as part of a shared object?)

* Are you storing a function pointer to the inlined method somewhere?
[/quote]

None of these should be (and aren't) an issue. The class is a header-only template class and all of the methods/constructors are force-inlined. It is very advantageous to wrap the __m128 type in a class: it allows for much nicer syntax and automatic construction from various sources (a single float, a pointer to floats, 4 floats, etc.). I also get operator overloading, plus other methods for indexing vector components, etc. I also don't have to remember all of the intrinsic names for general use. Intel ships its own set of classes that do this with its compiler. Working with raw __m128 is probably more trouble than it's worth. A properly written wrapper can be just as fast.
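Roughly the kind of wrapper meant here, as a cut-down sketch. `Float4` and all of its members are illustrative names, not the actual SIMDFloat4 class, and force-inline attributes are left out to keep it short.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_* intrinsics used below
#include <cassert>

// Hypothetical, simplified stand-in for a wrapper class around __m128.
class Float4 {
public:
    Float4(__m128 v) : v_(v) {}                          // from a raw register
    explicit Float4(float s) : v_(_mm_set1_ps(s)) {}     // broadcast one float
    explicit Float4(const float* p) : v_(_mm_loadu_ps(p)) {}  // 4 floats, unaligned
    Float4(float x, float y, float z, float w)
        : v_(_mm_setr_ps(x, y, z, w)) {}                 // lanes in x, y, z, w order

    __m128 raw() const { return v_; }

    // Component access for convenience (goes through memory, not meant to be fast).
    float operator[](int i) const {
        assert(i >= 0 && i < 4);
        float tmp[4];
        _mm_storeu_ps(tmp, v_);
        return tmp[i];
    }

    // Operator overloads so call sites don't need the intrinsic names.
    friend Float4 operator+(Float4 a, Float4 b) {
        return Float4(_mm_add_ps(a.v_, b.v_));
    }
    friend Float4 operator*(Float4 a, Float4 b) {
        return Float4(_mm_mul_ps(a.v_, b.v_));
    }

private:
    __m128 v_;
};
```

With this, an expression like `a * Float4(2.0f) + Float4(1.0f)` reads like ordinary arithmetic while each operator maps to a single intrinsic.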
