Jump to content
  • Advertisement

Archived

This topic is now archived and is closed to further replies.

Promit

Tip: Writing assembly-friendly C

This topic is 5192 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

A while ago, I rewrote a section of C++ that swaps BGR to RGB in a bitmap in assembly. It ran noticeably faster, and looking at the listing revealed that the assembly version was much tighter and smaller. Since the code still has to compile on non-x86 targets, I left the C++ loop in when not compiled on x86. But I was playing with the loop and looking at the listing today, and what I found was pretty interesting. Depending on how you write your code, the assembly output can be very different. Let''s look at the original loop:
	unsigned char temp;
	for( int i = 0; i < m_Size * 3; i += 3 )
	{
		temp = RawData[i];
		RawData[i] = RawData[i+2];
		RawData[i+2] = temp;
	}
It''s pretty self explanatory; RawData is the bitmap data in BGR, and m_Size is the number of pixels in the texture. The assembly listing, however, is pretty long:
; 220  : 	unsigned char temp;
; 221  : 	for( int i = 0; i < m_Size * 3; i += 3 )

	mov	eax, DWORD PTR [esi+12]
	add	esp, 24					; 00000018H
	lea	ecx, DWORD PTR [eax+eax*2]
	test	ecx, ecx
	jbe	SHORT $L33435
	mov	ebp, -2					; fffffffeH
	lea	eax, DWORD PTR [ebx+2]
	sub	ebp, ebx
$L33433:

; 222  : 	{
; 223  : 		temp = RawData[i];

	mov	cl, BYTE PTR [eax-2]

; 224  : 		RawData[i] = RawData[i+2];

	mov	dl, BYTE PTR [eax]
	mov	BYTE PTR [eax-2], dl

; 225  : 		RawData[i+2] = temp;

	mov	BYTE PTR [eax], cl
	mov	ecx, DWORD PTR [esi+12]
	add	eax, 3
	lea	edx, DWORD PTR [eax+ebp]
	lea	ecx, DWORD PTR [ecx+ecx*2]
	cmp	edx, ecx
	jb	SHORT $L33433
$L33435:

; 226  : 	}
I''m not even going to bother explaining that; there''s a ton of stuff going on, and it''s not exactly relevant. Suffice to say that it''s way longer than it needs to be -- 18 instructions. So now I''m wondering how I can restructure this slightly. The first thing I do is add a couple variables and mark them with the ''register'' keyword, to try to discourage the amount of memory interaction above. Here''s the new version of the C:
	register unsigned char temp;
	register int i = 0;
	register unsigned char* data = RawData;

	for( ; i < m_Size * 3; i += 3 )
	{
		temp = data[i];
		data[i] = data[i+2];
		data[i+2] = temp;

	}
And the resulting assembly:
; 220  : 	register unsigned char temp;
; 221  : 	register int i = 0;
; 222  : 	register unsigned char* data = RawData;
; 223  : 	for( ; i < m_Size * 3; i += 3 )

	mov	eax, DWORD PTR [esi+12]
	add	esp, 24					; 00000018H
	lea	ecx, DWORD PTR [eax+eax*2]
	test	ecx, ecx
	jbe	SHORT $L33436
	mov	ebp, -2					; fffffffeH
	lea	eax, DWORD PTR [ebx+2]
	sub	ebp, ebx
$L33434:

; 224  : 	{
; 225  : 		temp = data[i];

	mov	cl, BYTE PTR [eax-2]

; 226  : 		data[i] = data[i+2];

	mov	dl, BYTE PTR [eax]
	mov	BYTE PTR [eax-2], dl

; 227  : 		data[i+2] = temp;

	mov	BYTE PTR [eax], cl
	mov	ecx, DWORD PTR [esi+12]
	add	eax, 3
	lea	edx, DWORD PTR [eax+ebp]
	lea	ecx, DWORD PTR [ecx+ecx*2]
	cmp	edx, ecx
	jb	SHORT $L33434
$L33436:

; 228  : 	}
Well, it got shorter, but let''s see if we can do better... What I tried next was iterating backwards and incrementing by only 1 each time. Instead of moving i so much, I move the pointer to the data as well.
	register unsigned char temp;
	register int i = m_Size;
	register unsigned char* reg = RawData;
	while( i > 0 )
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	}
We''re now down to using a while loop, and iterating backwards. Why did I change it like this? Well, the first important thing is that we''re now checking i against 0 instead of m_Size. Also, decrementing i is the last step. This is because a decrement call (dec in x86) will set the zero bit on the CPU if it hits 0, and we can jump back to the top of the loop based on that bit alone. Also, we''re moving reg by 3, which allows us to use constant indices into reg and avoid accessing another variable. So what does the assembly look like?
; 223  : 	while( i > 0 )

	test	ecx, ecx
	jle	SHORT $L33436

; 166  : 	}
; 167  : 
; 168  : 	fread( &Info, sizeof(Info), 1, pBitmap );

	mov	ebp, ecx
$L33435:

; 224  : 	{
; 225  : 		temp = reg[0];

	mov	cl, BYTE PTR [eax]

; 226  : 		reg[0] = reg[2];

	mov	dl, BYTE PTR [eax+2]
	mov	BYTE PTR [eax], dl

; 227  : 		reg[2] = temp;

	mov	BYTE PTR [eax+2], cl

; 228  : 		reg += 3;

	add	eax, 3
	dec	ebp
	jne	SHORT $L33435
$L33436:

; 229  : 		--i;
; 230  : 	}
Now we''ve got some really pretty short code here. I noticed, however, that there was an extra test and jump (jle) at the beginning of the loop, since the test condition is at the beginning and if the test fails, it has to jump to the end. I removed this by using do-while, neatly moving the test to the end:
	register unsigned char temp;
	register int i = m_Size;
	register unsigned char* reg = RawData;
	do
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	} while( i > 0 );
And of course its assembly:
; 221  : 	register unsigned char temp;
; 222  : 	register unsigned char* reg = RawData;

	mov	eax, ebx
$L33433:

; 222  : 	do
; 223  : 	{
; 224  : 		temp = RawData[0];

	mov	cl, BYTE PTR [edi]

; 225  : 		RawData[0] = RawData[2];

	mov	dl, BYTE PTR [edi+2]
	mov	BYTE PTR [edi], dl

; 226  : 		RawData[2] = temp;

	mov	BYTE PTR [edi+2], cl

; 227  : 		RawData += 3;

	add	edi, 3

; 228  : 		--i;

	dec	eax

; 229  : 	} while( i > 0 );
	test	eax, eax
	jg	SHORT $L33433
Now, after having rearranged the code and giving the compiler a few hints, we''re down to just a handful of instructions (9 to be exact). The only thing left bothering me was the test/jg pair at the end of the loop. It should only take a single instruction to loop back; jump if the 0 bit isn''t set. However, this isn''t what I wrote! I told it to check i > 0. Changing this last line to i != 0 yielded:
	register int i = m_Size;
	register unsigned char temp;
	register unsigned char* reg = RawData;
	do
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	} while( i != 0 );

And its assembly output:
; 221  : 	register unsigned char temp;
; 222  : 	register unsigned char* reg = RawData;

	mov	eax, ebx
$L33434:

; 223  : 	do
; 224  : 	{
; 225  : 		temp = reg[0];

	mov	cl, BYTE PTR [eax]

; 226  : 		reg[0] = reg[2];

	mov	dl, BYTE PTR [eax+2]
	mov	BYTE PTR [eax], dl

; 227  : 		reg[2] = temp;

	mov	BYTE PTR [eax+2], cl

; 228  : 		reg += 3;

	add	eax, 3

; 229  : 		--i;

	dec	ebp

; 230  : 	} while( i != 0 );

	jne	SHORT $L33434
And voila! We are down to a mere 8 instructions, the same length as the handwritten assembly I "stole" from NeHe to do this a few weeks back. More importantly, it''s still written in C++, not assembly, so it maintains all of its portability across architectures and doesn''t look totally meaningless to someone unfamiliar with assembly. To be fair, the new C++ block is somewhat less obvious in purpose than the original, but a quick comment is all it takes to help someone else see what is going on. The point of this whole example is basically that just because it''s a high level language doesn''t mean you can do whatever the hell you want and let the compiler try and sort it out into fast assembly. You have to write your code to give it hints about what you''re trying to do and how you want it done, and this is hard to do unless you actually know assembly. The concepts of handwritten assembly are important, even if actually writing the stuff isn''t. This new C++ code runs at over double the speed of the original loop, and actually saves a lot of load time if you have a significant number of bitmaps. Now I''m beginning to think this should''ve been an article, but it''s too late for that, isn''t it

Share this post


Link to post
Share on other sites
Advertisement
Hi!

Thx for the information. I always wondered if the things you did to the code would really have impact on the assembly code.
Only one question left: Which compiler did you use and would other compilers give you the same results?

pink panther

Share this post


Link to post
Share on other sites
Visual C++, and I haven''t tested other compilers but I imagine the results would be similar.

Share this post


Link to post
Share on other sites
tip: get a new compiler
I fed vc++ 7.1 your original code, and this is what it produced:

for( int i = 0; i < n * 3; i += 3 )
00401350 test esi,esi
00401352 jle main+74h (401374h)
00401354 lea edx,[esi-1]
00401357 mov eax,0AAAAAAABh
0040135C mul eax,edx
0040135E shr edx,1
00401360 lea ecx,[ebp+2]
00401363 inc edx
//LOOP BODY START

{
temp = RawData[i];
00401364 mov al,byte ptr [ecx-2]
RawData[i] = RawData[i+2];
00401367 mov bl,byte ptr [ecx]
00401369 mov byte ptr [ecx-2],bl
RawData[i+2] = temp;
0040136C mov byte ptr [ecx],al
0040136E add ecx,3
00401371 dec edx
00401372 jne main+64h (401364h)
}
//LOOP BODY END


The loop bodies are basically identical. Sure, you did shave off 7 instructions from the initialization - but that code only executes once.
Conclusion: unless you''re l337, you probably won''t beat modern compilers =)

Share this post


Link to post
Share on other sites
Did any of you tried compiling with .NET 7.1/7''s optimization enabled?



"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"

Share this post


Link to post
Share on other sites
quote:
Original post by CPPMaster Poppet
Did any of you tried compiling with .NET 7.1/7''s optimization enabled?



quote:
Original post by sjelkjd I fed vc++ 7.1 your original code

Share this post


Link to post
Share on other sites
the "register" keyword is ignored by all new compilers.

What version are you using? AFAIK even VC++ 6.0 will ignore it ...

[edited by - noVum on March 6, 2004 4:13:00 PM]

Share this post


Link to post
Share on other sites
All fine & good - may be a different compiler will make a better job of the code, but the thing that bothers me most is that your ''optimised'' code does not do exactly the same thing as your original code.

You are now depending on m_Size being at least 1; in other words the old code handles m_Size = 0 gracefully, where as this code will read from a (presuably if it has size 0) invalid source and try to write it to and invalid destination - ie your code has introduced a potential bug. Probably the thing to do would be add an explicit if test around the loop (obviously more code). I''d also suggest that having i!=0 as the non-termination condition is slightly dangerous - what happens if m_Size starts off as a negative value ? In this instance you might be able to get your single instruction by changing the while clause to include a pre-decrement operator on the ''i'' rather than being in a loop...


register int i = m_Size;
if(i > 0) {
register unsigned char temp;
register unsigned char* reg = RawData;
do
{
temp = reg[0];
reg[0] = reg[2];
reg[2] = temp;
reg += 3;
} while( --i > 0 );
};

Share this post


Link to post
Share on other sites
Assuming this is in a function:


void __fastcall SwapRGB(DWORD size, DWORD* raw)
{
__asm
{
begin: xchg al, byte ptr[edx]
xchg [edx+2], al
xchg al, byte ptr[edx]
add edx, 3
loopne begin
}
}


5 lines. Moral of the story? A mediocre ASM programmer can still outdo a compiler nine times out of ten no matter how you write your code.

Share this post


Link to post
Share on other sites
quote:
5 lines. Moral of the story? A mediocre ASM programmer can still outdo a compiler nine times out of ten no matter how you write your code.

Except that any fule kno that you don''t use xchg unless you *really* want to pay for synchronization, because if the indirect operands aren''t in cache then you incur the locking penalty.

mov reg, mem is almost invariably better than xchg. xchg may be shorter, but that''s the only benefit it provides.

Share this post


Link to post
Share on other sites

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!