Sign in to follow this  

Can this asm code be better?

This topic is 3293 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Hi, im trying to find the fastest way to do this code in asm: // This code is used to swap the red and blue component of an image // stored in a raw BYTE buffer BYTE b = 0; for(int i = 0; i < BufSize; i += 3){ b = pBuf[0]; pBuf[0] = pBuf[2]; pBuf[2] = b; } So far, i got that, that run 3.25x faster than the above code:
inline void Convert_From_BGR_To_RGB(BYTE *pBuf, UINT BufSize)
{
	_asm{
		// Save registers
		pushad;
	
		// Init. registers
		xor eax, eax;      // Used to store our red and blue byte
		mov ebx, pBuf;     // Store the address of our buffer
		xor ecx, ecx;      // Counter
		mov edx, BufSize;  // Store the buffer size

		// Check the next 3 Bytes until we're done...
		Restart:

		// Move the 2 Bytes from memory in the registers
		mov al, byte ptr [ebx+ecx];
		add ecx, 2;
		mov ah, byte ptr [ebx+ecx];

		// Swap the 2 Bytes
		mov byte ptr [ebx+ecx], al;
		sub ecx, 2;
		mov byte ptr [ebx+ecx], ah;
		add ecx, 3;

		// Are we done?
		cmp ecx, edx;
		jb  Restart;
	
		// Restore registers
		popad;
	}
}
I can live with that, but if anyone got an idea on how to make this faster, let me know!

Share this post


Link to post
Share on other sites
I had to do this a little while ago. Here is my code:

void swapRedAndBlue(unsigned char * p_pBitmap, int p_Width, int p_Height)
{
for (int i = 0; i < p_Width * p_Height * 3; i += 3)
{
std::swap(p_pBitmap[i], p_pBitmap[i + 2]);
}

// below is some assembly that swaps the red and blue components
// if this function is a bottle neck, then use the assembly to swap
// It does the exact same thing as the C++ above
/*
int counter = p_Width * p_Height;
__asm
{
mov ebx, p_pBitmap
mov ecx, counter
label:
mov al, [ebx + 0]
mov ah, [ebx + 2]
mov [ebx + 2], al
mov [ebx + 0], ah

add ebx, 3
dec ecx
jnz label
}
*/

}


The asm is from the AVI playback NeHe lesson

Personally I think the C++ portion of my code is much cleaner, as well as the asm.

Share this post


Link to post
Share on other sites
Quote:
Original post by Vortez
No, it's the equivalent function i've wrote using inline asm in visual studio. Im just wondering if this is the best code to swap those 2 bytes.


Well, it's just that when I tried to compile your code myself, I get the following (from Visual Studio 2008, 32-bit build, time in seconds for a 10KB buffer):

avg. asm: 0.00643581
avg. C++: 0.00616662

So the C++ version is slightly faster than your ASM version (with full optimizations turned on, of course).

Here's the complete source code:


#include <stdio.h>
#include <tchar.h>
#include <windows.h>
#include <iostream>

inline void Convert_From_BGR_To_RGB_asm(BYTE *pBuf, UINT BufSize)
{
_asm{
// Save registers
pushad;

// Init. registers
xor eax, eax; // Used to store our red and blue byte
mov ebx, pBuf; // Store the address of our buffer
xor ecx, ecx; // Counter
mov edx, BufSize; // Store the buffer size

// Check the next 3 Bytes until we're done...
Restart:

// Move the 2 Bytes from memory in the registers
mov al, byte ptr [ebx+ecx];
add ecx, 2;
mov ah, byte ptr [ebx+ecx];

// Swap the 2 Bytes
mov byte ptr [ebx+ecx], al;
sub ecx, 2;
mov byte ptr [ebx+ecx], ah;
add ecx, 3;

// Are we done?
cmp ecx, edx;
jb Restart;

// Restore registers
popad;
}
}

inline void Convert_From_BGR_To_RGB_cpp(BYTE *pBuf, UINT BufSize)
{
BYTE b = 0;
for(int i = 0; i < BufSize; i += 3)
{
b = pBuf[i];
pBuf[i] = pBuf[i + 2];
pBuf[i + 2] = b;
}
}

int _tmain(int argc, _TCHAR* argv[])
{
int n = 1024 * 1024 * 10;
BYTE *buffer = new BYTE[n];
for(int i = 0; i < n; i++)
{
// generate some non-zero data in there
buffer[i] = i % 255;
}

LARGE_INTEGER freq;
::QueryPerformanceFrequency(&freq);

LARGE_INTEGER start, end;

double avg_asm = 0.0;
double avg_cpp = 0.0;

for(int i = 0; i < 10; i++)
{
::QueryPerformanceCounter(&start);
Convert_From_BGR_To_RGB_asm(buffer, n);
::QueryPerformanceCounter(&end);

double assem = (double) (end.QuadPart - start.QuadPart) / (double) freq.QuadPart;

::QueryPerformanceCounter(&start);
Convert_From_BGR_To_RGB_cpp(buffer, n);
::QueryPerformanceCounter(&end);

double cpp = (double) (end.QuadPart - start.QuadPart) / (double) freq.QuadPart;

avg_asm += assem;
avg_cpp += cpp;
std::cout << "asm version: " << assem << std::endl;
std::cout << "C++ version: " << cpp << std::endl;
}

std::cout << "avg. asm: " << (avg_asm / 10.0) << std::endl;
std::cout << "avg. C++: " << (avg_cpp / 10.0) << std::endl;
}



This is just what I came up with in 5 minutes, I might've missed something obvious...

In any case, the compiler is really good at optimizing these little snippets of code. There's (usually) no point converting them to assembly yourself.

Share this post


Link to post
Share on other sites
Hint:

x86 addressing modes include REG+REG+immediate

This will remove 2 of your twiddlings with ECX

Additionally, the only register you need to save and restore there is EBX

Share this post


Link to post
Share on other sites
I'd read 4 bytes of data into a register. Swap the data using this register and write 4 bytes back. If your array is properly aligned at the 4 byte boundary, then that should save you 2 expensive memory accesses per pixel.

Secondly, make your counter a counter that counts to zero and use the jz/jnz jump after decrementing it, saving you a compare operation.

Thirdly, use a branch prediction hint (don't know the syntax for x86) to tell the jump that it is more likely to jump than not.

Finally, if you can do this with SIMD instructions, it'll be a lot faster!

Hope that helps!

Share this post


Link to post
Share on other sites
Without diving into SSE code, prefetching and loop unrolling, this is what I came up with:

inline void Convert_From_BGR_To_RGB_asm_eq(BYTE *pBuf, UINT BufSize)
{
_asm
{
pusha
mov edx, pBuf
mov ecx, BufSize
add edx, ecx
neg ecx
Restart: mov ebx, ecx
add ebx, edx
mov al, byte ptr [ebx]
mov ah, byte ptr [ebx+2]
add ecx, 3
mov byte ptr [ebx], ah
mov byte ptr [ebx+2], al
jne Restart
popa
}
}



There's probably better ways, but I couldb't spend more than 5 minutes on this...

Edit: spelling

Share this post


Link to post
Share on other sites
I'd unroll it a bit and process 4 pixels per iteration. That'll let you do 3 32-bit reads and writes per 4 pixels. It'd look something like this (untested):


unsigned int *piBuf = (unsigned int*)pBuf;
int j=0;
for(int i = 0; i < BufSize; i += 12)
{
unsigned int i0 = piBuf[j]; //rgbR
unsigned int i1 = piBuf[j+1]; //GBrg
unsigned int i2 = piBuf[j+2]; //bRGB

piBuf[j] = (i0 & 0xff0000) | ((i0 & 0xff00) << 16) | ((i0 & 0xff000000) >> 16) | ((i1 & 0xff0000) >> 16);
piBuf[j+1] = (i1 & 0xff0000ff) | ((i0 & 0xff) << 16) | ((i2 & 0xff000000) >> 16);
piBuf[j+2] = (i2 & 0xff00) | ((i1 & 0xff00) << 16) | ((i2 & 0xff0000) >> 16) | ((i2 & 0xff) << 16);

j += 3;
}




Note that you'd also need to handle the last few bytes with the original code if the data isn't a multiple of 12 bytes (4 pixels).

An SSE2 version should be somewhat quicker still. You'd need to unroll some more to get to 48 bytes per iteration, unfortunately on x86 I don't think there's a shuffle instruction that treats the 128-bit register as sixteen 8-bit integers so you'd still have to use masks and shifts.

[Edited by - Adam_42 on December 8, 2008 8:07:15 AM]

Share this post


Link to post
Share on other sites
Nice question. Here is my version.


inline void myConvert_From_BGR_To_RGB_asm(BYTE *pBuf, UINT BufSize)
{
// scale from bytes to pixels.
BufSize /= 3;

_asm
{
pushad

mov edi, pBuf;
mov esi, BufSize
shr esi, 2 // loop-counter

Restart:
mov eax, [edi] // load 4 pixels
mov ebx, [edi+4]
mov ecx, [edi+8]
add edi, 12 // increase pointer

// conversion of 4 pixels:
bswap eax
ror ebx, 8
xchg al, bl
xchg cl, bh
ror eax, 8
bswap ecx
rol ebx, 8
rol ecx, 8

sub esi, 1 // decrement loop counter
mov [edi-12], eax // store 4 pixels
mov [edi-8], ebx
mov [edi-4], ecx
jnz Restart

popad
}
}


Works with 4 pixels in parallel. SSE pshufb would be faster, but this one still beats the C-version by roughly factor two on my P4 machine.

It would be interesting to see how this code performs on a Core or Pentium-M because all the bit-rotates slow the code down on the P4.

Share this post


Link to post
Share on other sites
Quote:
Original post by eq
Without diving into SSE code, prefetching and loop unrolling, this is what I came up with:
*** Source Snippet Removed ***
There's probably better ways, but I couldb't spend more than 5 minutes on this...

Edit: spelling


Hint:

Partial register writes can result in false dependency stalls. What ideally would be considered two independent operations are actually a dependency chain.

This is where the CPU cannot leverage register renaming in order to begin a new operation before a previous operation has completed, "falsely."

While neither AL and AH are dependent, EAX is dependent upon both of them. So in code such as:

mov al, ..
mov ah, ..

The entire register 'eax' is essentially 'locked' until the first move is completed, so that second mov stalls out waiting for an accurate value of eax before an execution unit can proceed to modify it.

While the x86 guarantees that the two memory reads will always happen in-order, regardless of any dependency, the second read can begin 1 cycle after the first begins if there is no dependency, otherwise the second cannot begin until the first instruction is completely retired (3+ cycle latency on mov instructions)

Share this post


Link to post
Share on other sites
Quote:
Original post by Nils Pipenbrinck
It would be interesting to see how this code performs on a Core or Pentium-M because all the bit-rotates slow the code down on the P4.


It looks to me like only the 1st rotate has any meaningful stallout, and that will also happen on a core2. The latency of the others should be mostly hidden within other independent instructions.

That entire section looks like performance trouble:

ror ebx, 8
xchg al, bl
xchg cl, bh
ror eax, 8

The 2nd and 3rd instructions depends upon the 1st
The 3rd and 4th instructions depends upon the 2nd

I'm not sure how to eliminate the trouble effectively considering how tight the register situation is.

You are right that the sse byte swizzle is far more suited to this problem.. but it requires SSSE3, which rules out many fairly new processors (thats SSSE3, not SSE3)

This whole thing would be much more efficient if the OP was using 32-bit pixels instead of 24-bit pixels. If performance is a concern, he should probably consider it if he can.

Share this post


Link to post
Share on other sites
Hi Rockoon,

I know - the byte-exchanges are potential problem spots. In theory it would be a good idea to put a prefetch in front of them. That way the memory controller could do something while the ALU processes the slow exchanges.

However, in practice it does not make any difference (tried it).

I've just benchmarked the same code with the entire bit-twiddeling bswap/rol/ror/xchg block commented out, and the performance does not change at all.

My guess is that the code is already memory bound and that the ALU work is completely hidden by the memory latency.

Share this post


Link to post
Share on other sites

This topic is 3293 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this