Fast dot product algorithm

Started by
7 comments, last by johnb 18 years, 5 months ago
Hey guys, RIght now I am using Intl's VTune for profiling my collision detection system. Turns out that a bottle neck in my system is performing the dot product. VTunes says that it only has 20 percent parallel activity. The code is as follows :

inline float vertex::operator* (const vertex &rhs) {

	float temp = a[0] * rhs.a[0]; 
	float temp2 = a[1] * rhs.a[1];
	temp += temp2;
	float temp3 =  a[2] * rhs.a[2];
	temp += temp3;

	return temp;
}
I tried many many ways todo this but it seems I cannot raise the parallel activity up or the effencincy. Just wanted to know guys if there was any compiler optimizations that you guys know of. Also I do know of one optimization found in Micheal Abrash's Black book, but I'm afraid that will hurt the cross platform usability of my system once its finished, so thats why I askof a "code" optimization. Aiight guys any help I appreciate thanks.
Advertisement
I'm not sure if this is exactly what you're looking for, but it can't hurt looking at it.
*Update* turns out the additions are killing me.

So I changed the code to this.
inline float vertex::operator* (const vertex &rhs) {	float temp = x * rhs.x; 	float temp2 = y * rhs.y;	float temp3 =  z * rhs.z;		return (temp + temp2 + temp3);}


Which helps.
One thing that might be disabling optimizations that your compiler can do may be the way you have your return value computed via the +='s. Try changing the equation to something like:
  return a[0] * rhs.a[0] + a[1] * rhs.a[1] + a[2] * rhs.a[2];

And see if that helps any.

edit: looks like you noticed already :P
1. Don't tell the compiler how to do it's job.
Use: return a[0]*rhs.a[0] + a[1]*rhs.a[1] + a[2]*rhs.a[2];

2. Use SSE and if possible, use SSE3. SSE3 has a very sweet horizontal add opcode.

3. Try using __forceinline if the compiler supports it.

4. If you are using Intel compiler, turn the vectorization on, if not try manually vectorizing your code.

5. Avoid using the dot product that much :)

EDIT: Wow! So many replies already. When I wrote this, I thought I was the first. :)
It's not the adds that are at fault so much as the multiplications and the adds interacting. The problem is that your add instructions stall while their addends are computed. So by doing the three multiplications first, you've kicked off the third multiplication before starting the additions thus giving you the benefit that the first two multiplies will be farther along before they are used in the first add as well as having the last multiply be farther along before the second add.

Since your reorganization made a difference, I'm going to hazard a guess that you have optimizations off. Your compiler's optimizer should be able to see that one and reorganize the instructions. In addition, if you're building debug, your code isn't being inlined. This is your biggest gain since all the code will be sloshed around with other code to minimize the amound of stalling. Without inlining, you have no choice but to accept a lot of stalling waiting for intermediates to be computed.
Quote:Original post by ury
1. Don't tell the compiler how to do it's job.

...

3. Try using __forceinline if the compiler supports it.


LOL, I know where you're coming from with these but it's still kinda funny the way you phrased 1 & 3.

You can try SSE or SSE2 which may give you enough performace for your purposes, but really you need to figure out why you are calling the dot product so much. You should see if you can move it out of an inner loop in your program, or perhaps cache the results (if there are vectors do not change between frames)

EDIT: also, what compiler are you using? If possible it might be good to look at the disassembly for that function as well.
Quote:
LOL, I know where you're coming from with these but it's still kinda funny the way you phrased 1 & 3.


Hehe :) Just a little female logic. Nobody ever died from that, right? :)
One thing which helps the compiler, and sometimes helps you spot problems, is make both inputs const, i.e.

inline float vertex::operator* (const vertex &rhs) const {	float temp = x * rhs.x; 	float temp2 = y * rhs.y;	float temp3 =  z * rhs.z;		return (temp + temp2 + temp3);}


Also make sure you put this in a header or other file (e.g. '.inl') file that is included by all files that call it, otherwise the compiler won't be able to inline it.

Last (though this won't much help speed things up) putting 'inline' doesn't make much difference to generated code. With optimisations turned on a modern compiler generally decides for itself whether to inline. Inlining doesn't always make things faster - e.g. if it makes code bigger and less able to fit in the cache. The compiler is usually better able to judge this than the programmer.
John BlackburneProgrammer, The Pitbull Syndicate

This topic is closed to new replies.

Advertisement