Can this asm code be better?

Started by
13 comments, last by Nils Pipenbrinck 15 years, 4 months ago
There is an SSE3 shuffle instruction, pshufb, which shuffles 16 bytes at once. So, 3 shuffles per 48 bytes. This should be fast enough :P
Advertisement
Nice question. Here is my version.

inline void myConvert_From_BGR_To_RGB_asm(BYTE *pBuf, UINT BufSize){  // scale from bytes to pixels.  BufSize /= 3;  _asm  {    pushad    mov edi, pBuf;         mov esi, BufSize        shr esi, 2            // loop-counterRestart:    mov   eax, [edi]      // load 4 pixels    mov   ebx, [edi+4]    mov   ecx, [edi+8]    add   edi, 12         // increase pointer    // conversion of 4 pixels:    bswap eax    ror   ebx, 8    xchg  al, bl    xchg  cl, bh    ror   eax, 8    bswap ecx    rol   ebx, 8    rol   ecx, 8    sub   esi, 1          // decrement loop counter    mov   [edi-12], eax   // store 4 pixels    mov   [edi-8],  ebx    mov   [edi-4],  ecx    jnz   Restart    popad  }}


Works with 4 pixels in parallel. SSE pshufb would be faster, but this one still beats the C-version by roughly factor two on my P4 machine.

It would be interesting to see how this code performs on a Core or Pentium-M because all the bit-rotates slow the code down on the P4.

Quote:Original post by eq
Without diving into SSE code, prefetching and loop unrolling, this is what I came up with:
*** Source Snippet Removed ***
There's probably better ways, but I couldb't spend more than 5 minutes on this...

Edit: spelling


Hint:

Partial register writes can result in false dependency stalls. What ideally would be considered two independent operations are actually a dependency chain.

This is where the CPU cannot leverage register renaming in order to begin a new operation before a previous operation has completed, "falsely."

While neither AL and AH are dependent, EAX is dependent upon both of them. So in code such as:

mov al, ..
mov ah, ..

The entire register 'eax' is essentially 'locked' until the first move is completed, so that second mov stalls out waiting for an accurate value of eax before an execution unit can proceed to modify it.

While the x86 guarantees that the two memory reads will always happen in-order, regardless of any dependency, the second read can begin 1 cycle after the first begins if there is no dependency, otherwise the second cannot begin until the first instruction is completely retired (3+ cycle latency on mov instructions)
Quote:Original post by Nils Pipenbrinck
It would be interesting to see how this code performs on a Core or Pentium-M because all the bit-rotates slow the code down on the P4.


It looks to me like only the 1st rotate has any meaningful stallout, and that will also happen on a core2. The latency of the others should be mostly hidden within other independent instructions.

That entire section looks like performance trouble:

ror ebx, 8
xchg al, bl
xchg cl, bh
ror eax, 8

The 2nd and 3rd instructions depends upon the 1st
The 3rd and 4th instructions depends upon the 2nd

I'm not sure how to eliminate the trouble effectively considering how tight the register situation is.

You are right that the sse byte swizzle is far more suited to this problem.. but it requires SSSE3, which rules out many fairly new processors (thats SSSE3, not SSE3)

This whole thing would be much more efficient if the OP was using 32-bit pixels instead of 24-bit pixels. If performance is a concern, he should probably consider it if he can.
Hi Rockoon,

I know - the byte-exchanges are potential problem spots. In theory it would be a good idea to put a prefetch in front of them. That way the memory controller could do something while the ALU processes the slow exchanges.

However, in practice it does not make any difference (tried it).

I've just benchmarked the same code with the entire bit-twiddeling bswap/rol/ror/xchg block commented out, and the performance does not change at all.

My guess is that the code is already memory bound and that the ALU work is completely hidden by the memory latency.

This topic is closed to new replies.

Advertisement