
SSE question

This topic is 3677 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.



Hi, can someone please explain to me why the former code is faster than the latter?

extern "C" Vector maskVectorW; //(1,1,1,0)

inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, maskVectorW
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1 - normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, maskVectorW
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, maskVectorW
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

//LATER CODE :)
inline void GetBlinn(Vector& vLight, Vector* vertexVectors, float& cosh)
{
    __asm
    {
        mov edi, vertexVectors
        mov esi, vLight
        movaps xmm3, maskVectorW //put in xmm3 - that's the diff
        movaps xmm0, [edi+0x10]
        movaps xmm1, xmm0
        mulps xmm0, xmm0
        mulps xmm0, xmm3
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        rsqrtps xmm0, xmm0
        mulps xmm1, xmm0 //xmm1 - normalized tVector
        movaps xmm0, [esi]
        subps xmm0, xmm1 //xmm0 - vl vector
        movaps xmm1, xmm0
        mulps xmm1, xmm1
        mulps xmm1, xmm3
        haddps xmm1, xmm1
        haddps xmm1, xmm1 //vl length
        sqrtps xmm1, xmm1
        divps xmm0, xmm1 //xmm0 - HalfWay vector
        movaps xmm1, [edi+0x30]
        mulps xmm1, xmm3
        mulps xmm0, xmm1
        haddps xmm0, xmm0
        haddps xmm0, xmm0
        mov edi, cosh
        movss [edi], xmm0
    }
}

thanks, regards.
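
For readers who want to check what the routine computes, here is a scalar C++ sketch of the same data flow. It is not the poster's code: the Vec4 struct is a hypothetical stand-in for the thread's Vector type (assumed to be four packed floats, so offsets 0x10 and 0x30 are elements 1 and 3), and the helper names are made up for illustration. Results will differ slightly from the assembly, because RSQRTPS is only an approximation.

```cpp
#include <cmath>

// Hypothetical stand-in for the thread's Vector type: four packed floats.
struct Vec4 { float x, y, z, w; };

// Three-component dot product; w is ignored, as with maskVectorW = (1,1,1,0).
static float dot3(const Vec4& a, const Vec4& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Normalizes the xyz part and clears w.
static Vec4 normalize3(const Vec4& v) {
    float inv = 1.0f / std::sqrt(dot3(v, v));
    return { v.x * inv, v.y * inv, v.z * inv, 0.0f };
}

// Mirrors the data flow of the posted routine:
//   t    = normalize(vertexVectors[1])   (offset 0x10)
//   h    = normalize(vLight - t)         (the "HalfWay vector")
//   cosh = dot(h, vertexVectors[3])      (offset 0x30)
float getBlinnReference(const Vec4& vLight, const Vec4* vertexVectors) {
    Vec4 t  = normalize3(vertexVectors[1]);
    Vec4 vl = { vLight.x - t.x, vLight.y - t.y, vLight.z - t.z, 0.0f };
    Vec4 h  = normalize3(vl);
    return dot3(h, vertexVectors[3]);
}
```

A reference like this is mainly useful for comparing outputs when the assembly is rearranged.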

That's one mighty long dependency chain you have constructed there...

Work on a dependency chain gets done at the rate of instruction latency. The division instruction has the highest latency of all, which gives the CPU plenty of time to look ahead and try to accomplish more work.

Here:

.
.
divps xmm0, xmm1 //xmm0 - HalfWay vector
movaps xmm1, [edi+0x30]
mulps xmm1, maskVectorW
mulps xmm0, xmm1
.
.



The first two instructions following the division do not depend on its result and will likely have finished executing LONG before the division is done. So loading maskVectorW into a register ahead of this division cannot actually help performance, and will likely hurt it instead.

Division instructions have very high latency and always will. Division is a very complicated operation to perform, and CPUs have never been very efficient at it.

Nearly every time you write assembly code that uses a division of any kind (integer or float), your biggest optimization step (aside from eliminating it!) will be to hide as many non-dependent instructions within the division's latency as possible.

It is almost never the case that you will have enough work available to truly saturate the other execution units for the duration of the division. That's how slow it is. A CPU like the AMD64 can literally execute 123 operations while a DIVPS is taking place (its latency is 41 CPU cycles and the CPU can ideally perform 3 operations per cycle: 41 * 3 = 123).

One thing I noticed is that you're using RSQRTPS. From what I remember playing around with writing a raytracer, that instruction is not particularly accurate, to the point where I had noticeable visual artefacts.

I'm not sure how much of a difference it'll make to the lighting.

Regards
elFarto
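
A common way to recover accuracy from a rough reciprocal square root is one Newton-Raphson step. The sketch below is scalar C++ rather than SSE, and it does not call the hardware instruction: roughRsqrt is a hypothetical stand-in that injects a relative error on purpose, with the error size chosen by the caller as an assumption. It only illustrates how the refinement step shrinks the error; whether the extra multiplies are worth it in the routine above would have to be measured.

```cpp
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x):
//   y' = y * (1.5 - 0.5 * x * y * y)
// A relative error e in y becomes roughly 1.5 * e * e in y'.
float refineRsqrt(float x, float estimate) {
    return estimate * (1.5f - 0.5f * x * estimate * estimate);
}

// Hypothetical stand-in for a low-precision hardware estimate:
// the exact value with a deliberately injected relative error.
float roughRsqrt(float x, float relativeError) {
    return (1.0f / std::sqrt(x)) * (1.0f + relativeError);
}
```

For example, refineRsqrt(x, roughRsqrt(x, 1.0f / 4096.0f)) starts from an estimate that is good to about 12 bits and ends up close to full single precision.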

Quote:
Original post by DobarDabar2
Thanks guys!

@Rockoon1
sorry for cross-posting.
Can you point me to a good reference on dependency chains? Thanks.

Best regards.


The only reference you really need for that is AMD's or Intel's optimization manuals. They list the latencies in clock cycles of every instruction. And when you know that, you just have to look through your code.
If an instruction has a latency of 10 cycles, and produces a value in xmm0, then any subsequent instruction that reads from xmm0 will be delayed for 10 cycles anyway, so there's no point in putting them immediately after that one. Instead, it should be followed by instructions that don't depend on it, so they can fill in the gap.
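
The 10-cycle example above can be played through with a toy model. This is only a sketch under strong assumptions: the model is in-order and issues one instruction per cycle, each instruction reads at most one register, and the latencies are made-up numbers, not values from any manual. Real out-of-order CPUs do part of this reordering themselves within their lookahead window.

```cpp
#include <algorithm>
#include <vector>

// One instruction in a toy in-order, single-issue model.
struct Instr {
    int dest;     // register written
    int src;      // register read, or -1 if it reads none
    int latency;  // cycles until the result is available (hypothetical)
};

// Returns the cycle at which the last result becomes available.
int totalCycles(const std::vector<Instr>& program, int numRegs = 8) {
    std::vector<int> ready(numRegs, 0);  // cycle at which each register is valid
    int issue = 0;                       // earliest cycle the next instruction may issue
    int finish = 0;
    for (const Instr& in : program) {
        int start = issue;
        if (in.src >= 0) {
            start = std::max(start, ready[in.src]);  // stall until the operand is ready
        }
        ready[in.dest] = start + in.latency;
        finish = std::max(finish, ready[in.dest]);
        issue = start + 1;  // in order: later instructions queue behind this one
    }
    return finish;
}
```

With a 10-cycle producer, one dependent consumer and two independent 1-cycle instructions, placing the consumer directly after the producer gives 13 cycles in this model, while placing the two independent instructions in between gives 11.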

