Load-hit-store while decoding a normal?

1 comment, last by alek314, 9 years, 4 months ago

Hi!

I am trying to convert a normal vector packed into a DWORD to a float3, using the following code:

__m128i n_i;

n_i.m128i_i32[0] = n&0xff;

n_i.m128i_i32[1] = (n >> 8)&0xff;

n_i.m128i_i32[2] = (n >> 16)&0xff;

n_i.m128i_i32[3] = 0;

__m128 n_f = _mm_cvtepi32_ps(n_i);

...

Here is the assembly:

...

mov dword ptr [esp], ecx

mov dword ptr [esp+0x4], edx

mov dword ptr [esp+0x8], eax

mov dword ptr [esp+0xc], 0

movdqa xmm0, xmmword ptr [esp]

cvtdq2ps xmm0, xmm0

...

The "cvtdq2ps xmm0, xmm0" instruction shows a high CPI (1.65).

According to https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/ , the CPU cannot forward multiple smaller stores to one larger load. I wonder whether this is a load-hit-store or not.

The instruction pointer stalls at cvtdq2ps because it waits for movdqa to complete, and movdqa in turn waits for the pipeline to write out the previous four stores before it can read them back. It is the classic RAW (read-after-write) hazard.

Besides that, you should avoid accessing m128i_i32 directly. A platform can support SSE without defining m128i_i32; that member is really meant for debugging.

You can use _mm_load_si128 or _mm_loadu_si128 to load n directly into an SSE register, shuffle it into all 4 lanes (or 3), AND it with an SSE constant that holds your masks for R, G and B, and shift right. That keeps the data in registers and avoids the RAW hazard.
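For a single normal, a register-only decode might look like the sketch below. It is a variant of the idea above, not the exact shuffle/AND/shift sequence: SSE2 has no per-lane variable shift, so this version widens the bytes with unpack instructions instead. The function name decode_normal_bytes is hypothetical, and the code assumes n holds the three components in its low three bytes, as in the original post.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: expand the low three bytes of n into float lanes
   x, y, z (lane 3 is zero) without a round trip through memory. */
static inline __m128 decode_normal_bytes(uint32_t n)
{
    const __m128i zero = _mm_setzero_si128();

    /* movd: n goes into lane 0, the other lanes are zeroed. */
    __m128i v = _mm_cvtsi32_si128((int)n);

    /* Zero-extend bytes to 16-bit words, then words to 32-bit ints.
       After this, the four lanes hold byte 0, byte 1, byte 2, byte 3. */
    v = _mm_unpacklo_epi8(v, zero);
    v = _mm_unpacklo_epi16(v, zero);

    /* Clear lane 3 so the result matches the original code (w = 0). */
    v = _mm_and_si128(v, _mm_set_epi32(0, -1, -1, -1));

    /* cvtdq2ps: integer lanes to float lanes. */
    return _mm_cvtepi32_ps(v);
}
```

Since nothing is stored to the stack before the conversion, there is no wide load that has to wait on several narrow stores.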

I now use _mm_loadu_si128 and process 4 normals at a time, and the stall is gone. Thank you!
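The final code was not posted, so the following is only a guess at what processing four normals at a time could look like. It assumes the packed normals sit contiguously in memory and that a structure-of-arrays result (one vector of x values, one of y, one of z) is acceptable. The name decode_normals_x4 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: decode four packed normals at once.
   Lane i of *x, *y, *z receives the components of packed[i]. */
static inline void decode_normals_x4(const uint32_t *packed,
                                     __m128 *x, __m128 *y, __m128 *z)
{
    const __m128i mask = _mm_set1_epi32(0xff);

    /* One unaligned 16-byte load brings in four DWORDs. */
    __m128i n = _mm_loadu_si128((const __m128i *)packed);

    /* The same shift applies to every lane, so plain SSE2 shifts suffice. */
    *x = _mm_cvtepi32_ps(_mm_and_si128(n, mask));
    *y = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(n, 8), mask));
    *z = _mm_cvtepi32_ps(_mm_and_si128(_mm_srli_epi32(n, 16), mask));
}
```

Because each component uses the same shift count in all four lanes, no per-lane shuffle is needed, and the only memory access is the single 16-byte load of the source data.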
