Load-hit-store while decoding a normal?

Started by alek314?? December 18, 2014 12:08 PM

1 comment, last by alek314?? 9 years, 4 months ago

alek314??

398

Author

December 18, 2014 12:08 PM

Hi!

I am trying to convert a normal vector packed in a DWORD to a float3, using the following code:

__m128i n_i;

n_i.m128i_i32[0] = n&0xff;

n_i.m128i_i32[1] = (n >> 8)&0xff;

n_i.m128i_i32[2] = (n >> 16)&0xff;

n_i.m128i_i32[3] = 0;

__m128 n_f = _mm_cvtepi32_ps(n_i);

...

Here is the assembly:

...

mov dword ptr [esp], ecx

mov dword ptr [esp+0x4], edx

mov dword ptr [esp+0x8], eax

mov dword ptr [esp+0xc], 0

movdqa xmm0, xmmword ptr [esp]

cvtdq2ps xmm0, xmm0

...

The "cvtdq2ps xmm0, xmm0" instruction shows a high CPI (1.65).

According to https://fgiesen.wordpress.com/2013/03/04/speculatively-speaking/ , the CPU cannot forward multiple smaller stores to one larger load. I wonder whether this is a load-hit-store or not.

Krypt0n

4,769

December 18, 2014 03:30 PM

The instruction pointer stalls at cvtdq2ps because it waits for movdqa to complete, and movdqa waits for the pipeline to write out the previous stores before it can read them back. It is the classic RAW (read-after-write) hazard.

Besides that, you should avoid accessing m128i_i32. A platform can support SSE without defining m128i_i32 (m128i_i32 is there mostly for debugging).
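A minimal sketch of the portability point: building and reading lanes through intrinsics only, without touching m128i_i32 (a member of MSVC's __m128i union that GCC and Clang do not define). The helper names make_lanes and read_lanes are made up for illustration. This only addresses portability; what code the compiler generates for _mm_set_epi32 is up to the compiler, so it is not claimed to remove the stall by itself.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: build the integer vector from scalars with an
   intrinsic instead of writing to the m128i_i32 union member. */
static inline __m128i make_lanes(uint32_t n)
{
    /* _mm_set_epi32 takes lanes from high to low: (lane3, lane2, lane1, lane0). */
    return _mm_set_epi32(0,
                         (int)((n >> 16) & 0xff),
                         (int)((n >> 8) & 0xff),
                         (int)(n & 0xff));
}

/* Hypothetical helper: read lanes back (for debugging, say) by storing
   the vector to a plain array instead of reading m128i_i32. */
static inline void read_lanes(__m128i v, int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}
```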

You can use _mm_load_si128 or _mm_loadu_si128 to load n directly into an SSE register, shuffle it into all 4 lanes (or 3), AND it with an SSE constant that holds your masks for R, G and B, and shift right. That way you avoid the RAW hazard.
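A rough sketch of that idea for a single normal, with two substitutions that are mine and not from the post. Since n is a scalar here, it uses _mm_cvtsi32_si128 to move it into the register rather than a 128-bit load. And since SSE2 has no shift with a different count per lane, the "shift right" step is done as a multiply by a power of two after the conversion to float, which is exact for these values. The name decode_normal_bytes is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unpack the low three bytes of n into float lanes
   0..2 (lane 3 = 0) without writing the lanes to the stack first. */
static inline __m128 decode_normal_bytes(uint32_t n)
{
    /* Move n into lane 0 of an SSE register. */
    __m128i v = _mm_cvtsi32_si128((int)n);

    /* Broadcast lane 0 into all four lanes. */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));

    /* Keep a different byte per lane: bits 0..7, 8..15, 16..23, nothing. */
    const __m128i mask = _mm_set_epi32(0, 0x00ff0000, 0x0000ff00, 0x000000ff);
    v = _mm_and_si128(v, mask);

    /* Convert to float, then scale each lane down by 1, 2^8 and 2^16.
       This stands in for the per-lane right shift. */
    const __m128 scale = _mm_set_ps(0.0f, 1.0f / 65536.0f, 1.0f / 256.0f, 1.0f);
    return _mm_mul_ps(_mm_cvtepi32_ps(v), scale);
}
```

The result holds the raw 0..255 byte values as floats, the same as the original code produces before whatever follows it.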


alek314??

398

Author

December 20, 2014 09:57 AM

I now use _mm_loadu_si128 and process 4 normals at a time, and the stall is gone. Thank you!
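The final code was not posted, so the following is only a guess at what processing 4 normals at a time could look like. It assumes the packed normals sit in a contiguous array and that the output goes to separate x, y and z float arrays; both the layout and the name decode_normals_x4 are assumptions. With four DWORDs in one register, every lane needs the same shift, so plain SSE2 shifts are enough.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode packed normals four at a time into separate
   x/y/z arrays. Only full groups of four are processed; a tail of fewer
   than four elements is left untouched. */
static void decode_normals_x4(const uint32_t *packed, size_t count,
                              float *x, float *y, float *z)
{
    const __m128i byte_mask = _mm_set1_epi32(0xff);

    for (size_t i = 0; i + 4 <= count; i += 4) {
        /* One unaligned 128-bit load brings in four packed normals. */
        __m128i v = _mm_loadu_si128((const __m128i *)(packed + i));

        /* Same shift in every lane, then mask down to one byte. */
        __m128i xi = _mm_and_si128(v, byte_mask);
        __m128i yi = _mm_and_si128(_mm_srli_epi32(v, 8), byte_mask);
        __m128i zi = _mm_and_si128(_mm_srli_epi32(v, 16), byte_mask);

        /* Convert to float and store four components of each axis. */
        _mm_storeu_ps(x + i, _mm_cvtepi32_ps(xi));
        _mm_storeu_ps(y + i, _mm_cvtepi32_ps(yi));
        _mm_storeu_ps(z + i, _mm_cvtepi32_ps(zi));
    }
}
```

The values never go through narrow stores followed by a wide load, which is the pattern in the assembly from the first post.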
