• Advertisement

Archived

This topic is now archived and is closed to further replies.

Tip: Writing assembly-friendly C

This topic is 5094 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

A while ago, I rewrote a section of C++ that swaps BGR to RGB in a bitmap in assembly. It ran noticeably faster, and looking at the listing revealed that the assembly version was much tighter and smaller. Since the code still has to compile on non-x86 targets, I left the C++ loop in when not compiled on x86. But I was playing with the loop and looking at the listing today, and what I found was pretty interesting. Depending on how you write your code, the assembly output can be very different. Let''s look at the original loop:
	unsigned char temp;
	for( int i = 0; i < m_Size * 3; i += 3 )
	{
		temp = RawData[i];
		RawData[i] = RawData[i+2];
		RawData[i+2] = temp;
	}
It''s pretty self explanatory; RawData is the bitmap data in BGR, and m_Size is the number of pixels in the texture. The assembly listing, however, is pretty long:
; 220  : 	unsigned char temp;
; 221  : 	for( int i = 0; i < m_Size * 3; i += 3 )

	mov	eax, DWORD PTR [esi+12]
	add	esp, 24					; 00000018H
	lea	ecx, DWORD PTR [eax+eax*2]
	test	ecx, ecx
	jbe	SHORT $L33435
	mov	ebp, -2					; fffffffeH
	lea	eax, DWORD PTR [ebx+2]
	sub	ebp, ebx
$L33433:

; 222  : 	{
; 223  : 		temp = RawData[i];

	mov	cl, BYTE PTR [eax-2]

; 224  : 		RawData[i] = RawData[i+2];

	mov	dl, BYTE PTR [eax]
	mov	BYTE PTR [eax-2], dl

; 225  : 		RawData[i+2] = temp;

	mov	BYTE PTR [eax], cl
	mov	ecx, DWORD PTR [esi+12]
	add	eax, 3
	lea	edx, DWORD PTR [eax+ebp]
	lea	ecx, DWORD PTR [ecx+ecx*2]
	cmp	edx, ecx
	jb	SHORT $L33433
$L33435:

; 226  : 	}
I''m not even going to bother explaining that; there''s a ton of stuff going on, and it''s not exactly relevant. Suffice to say that it''s way longer than it needs to be -- 18 instructions. So now I''m wondering how I can restructure this slightly. The first thing I do is add a couple variables and mark them with the ''register'' keyword, to try to discourage the amount of memory interaction above. Here''s the new version of the C:
	register unsigned char temp;
	register int i = 0;
	register unsigned char* data = RawData;

	for( ; i < m_Size * 3; i += 3 )
	{
		temp = data[i];
		data[i] = data[i+2];
		data[i+2] = temp;

	}
And the resulting assembly:
; 220  : 	register unsigned char temp;
; 221  : 	register int i = 0;
; 222  : 	register unsigned char* data = RawData;
; 223  : 	for( ; i < m_Size * 3; i += 3 )

	mov	eax, DWORD PTR [esi+12]
	add	esp, 24					; 00000018H
	lea	ecx, DWORD PTR [eax+eax*2]
	test	ecx, ecx
	jbe	SHORT $L33436
	mov	ebp, -2					; fffffffeH
	lea	eax, DWORD PTR [ebx+2]
	sub	ebp, ebx
$L33434:

; 224  : 	{
; 225  : 		temp = data[i];

	mov	cl, BYTE PTR [eax-2]

; 226  : 		data[i] = data[i+2];

	mov	dl, BYTE PTR [eax]
	mov	BYTE PTR [eax-2], dl

; 227  : 		data[i+2] = temp;

	mov	BYTE PTR [eax], cl
	mov	ecx, DWORD PTR [esi+12]
	add	eax, 3
	lea	edx, DWORD PTR [eax+ebp]
	lea	ecx, DWORD PTR [ecx+ecx*2]
	cmp	edx, ecx
	jb	SHORT $L33434
$L33436:

; 228  : 	}
Well, it got shorter, but let''s see if we can do better... What I tried next was iterating backwards and incrementing by only 1 each time. Instead of moving i so much, I move the pointer to the data as well.
	register unsigned char temp;
	register int i = m_Size;
	register unsigned char* reg = RawData;
	while( i > 0 )
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	}
We''re now down to using a while loop, and iterating backwards. Why did I change it like this? Well, the first important thing is that we''re now checking i against 0 instead of m_Size. Also, decrementing i is the last step. This is because a decrement call (dec in x86) will set the zero bit on the CPU if it hits 0, and we can jump back to the top of the loop based on that bit alone. Also, we''re moving reg by 3, which allows us to use constant indices into reg and avoid accessing another variable. So what does the assembly look like?
; 223  : 	while( i > 0 )

	test	ecx, ecx
	jle	SHORT $L33436

; 166  : 	}
; 167  : 
; 168  : 	fread( &Info, sizeof(Info), 1, pBitmap );

	mov	ebp, ecx
$L33435:

; 224  : 	{
; 225  : 		temp = reg[0];

	mov	cl, BYTE PTR [eax]

; 226  : 		reg[0] = reg[2];

	mov	dl, BYTE PTR [eax+2]
	mov	BYTE PTR [eax], dl

; 227  : 		reg[2] = temp;

	mov	BYTE PTR [eax+2], cl

; 228  : 		reg += 3;

	add	eax, 3
	dec	ebp
	jne	SHORT $L33435
$L33436:

; 229  : 		--i;
; 230  : 	}
Now we''ve got some really pretty short code here. I noticed, however, that there was an extra test and jump (jle) at the beginning of the loop, since the test condition is at the beginning and if the test fails, it has to jump to the end. I removed this by using do-while, neatly moving the test to the end:
	register unsigned char temp;
	register int i = m_Size;
	register unsigned char* reg = RawData;
	do
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	} while( i > 0 );
And of course its assembly:
; 221  : 	register unsigned char temp;
; 222  : 	register unsigned char* reg = RawData;

	mov	eax, ebx
$L33433:

; 222  : 	do
; 223  : 	{
; 224  : 		temp = RawData[0];

	mov	cl, BYTE PTR [edi]

; 225  : 		RawData[0] = RawData[2];

	mov	dl, BYTE PTR [edi+2]
	mov	BYTE PTR [edi], dl

; 226  : 		RawData[2] = temp;

	mov	BYTE PTR [edi+2], cl

; 227  : 		RawData += 3;

	add	edi, 3

; 228  : 		--i;

	dec	eax

; 229  : 	} while( i > 0 );
	test	eax, eax
	jg	SHORT $L33433
Now, after having rearranged the code and giving the compiler a few hints, we''re down to just a handful of instructions (9 to be exact). The only thing left bothering me was the test/jg pair at the end of the loop. It should only take a single instruction to loop back; jump if the 0 bit isn''t set. However, this isn''t what I wrote! I told it to check i > 0. Changing this last line to i != 0 yielded:
	register int i = m_Size;
	register unsigned char temp;
	register unsigned char* reg = RawData;
	do
	{
		temp = reg[0];
		reg[0] = reg[2];
		reg[2] = temp;
		reg += 3;
		--i;
	} while( i != 0 );

And its assembly output:
; 221  : 	register unsigned char temp;
; 222  : 	register unsigned char* reg = RawData;

	mov	eax, ebx
$L33434:

; 223  : 	do
; 224  : 	{
; 225  : 		temp = reg[0];

	mov	cl, BYTE PTR [eax]

; 226  : 		reg[0] = reg[2];

	mov	dl, BYTE PTR [eax+2]
	mov	BYTE PTR [eax], dl

; 227  : 		reg[2] = temp;

	mov	BYTE PTR [eax+2], cl

; 228  : 		reg += 3;

	add	eax, 3

; 229  : 		--i;

	dec	ebp

; 230  : 	} while( i != 0 );

	jne	SHORT $L33434
And voila! We are down to a mere 8 instructions, the same length as the handwritten assembly I "stole" from NeHe to do this a few weeks back. More importantly, it''s still written in C++, not assembly, so it maintains all of its portability across architectures and doesn''t look totally meaningless to someone unfamiliar with assembly. To be fair, the new C++ block is somewhat less obvious in purpose than the original, but a quick comment is all it takes to help someone else see what is going on. The point of this whole example is basically that just because it''s a high level language doesn''t mean you can do whatever the hell you want and let the compiler try and sort it out into fast assembly. You have to write your code to give it hints about what you''re trying to do and how you want it done, and this is hard to do unless you actually know assembly. The concepts of handwritten assembly are important, even if actually writing the stuff isn''t. This new C++ code runs at over double the speed of the original loop, and actually saves a lot of load time if you have a significant number of bitmaps. Now I''m beginning to think this should''ve been an article, but it''s too late for that, isn''t it

Share this post


Link to post
Share on other sites
Advertisement
Hi!

Thx for the information. I always wondered if the things you did to the code would really have impact on the assembly code.
Only one question left: Which compiler did you use and would other compilers give you the same results?

pink panther

Share this post


Link to post
Share on other sites
Visual C++, and I haven''t tested other compilers but I imagine the results would be similar.

Share this post


Link to post
Share on other sites
tip: get a new compiler
I fed vc++ 7.1 your original code, and this is what it produced:

for( int i = 0; i < n * 3; i += 3 )
00401350 test esi,esi
00401352 jle main+74h (401374h)
00401354 lea edx,[esi-1]
00401357 mov eax,0AAAAAAABh
0040135C mul eax,edx
0040135E shr edx,1
00401360 lea ecx,[ebp+2]
00401363 inc edx
//LOOP BODY START

{
temp = RawData[i];
00401364 mov al,byte ptr [ecx-2]
RawData[i] = RawData[i+2];
00401367 mov bl,byte ptr [ecx]
00401369 mov byte ptr [ecx-2],bl
RawData[i+2] = temp;
0040136C mov byte ptr [ecx],al
0040136E add ecx,3
00401371 dec edx
00401372 jne main+64h (401364h)
}
//LOOP BODY END


The loop bodies are basically identical. Sure, you did shave off 7 instructions from the initialization - but that code only executes once.
Conclusion: unless you''re l337, you probably won''t beat modern compilers =)

Share this post


Link to post
Share on other sites
Did any of you tried compiling with .NET 7.1/7''s optimization enabled?



"C lets you shoot yourself in the foot rather easily. C++ allows you to reuse the bullet!"

Share this post


Link to post
Share on other sites
quote:
Original post by CPPMaster Poppet
Did any of you tried compiling with .NET 7.1/7''s optimization enabled?



quote:
Original post by sjelkjd I fed vc++ 7.1 your original code

Share this post


Link to post
Share on other sites
the "register" keyword is ignored by all new compilers.

What version are you using? AFAIK even VC++ 6.0 will ignore it ...

[edited by - noVum on March 6, 2004 4:13:00 PM]

Share this post


Link to post
Share on other sites
All fine & good - may be a different compiler will make a better job of the code, but the thing that bothers me most is that your ''optimised'' code does not do exactly the same thing as your original code.

You are now depending on m_Size being at least 1; in other words the old code handles m_Size = 0 gracefully, where as this code will read from a (presuably if it has size 0) invalid source and try to write it to and invalid destination - ie your code has introduced a potential bug. Probably the thing to do would be add an explicit if test around the loop (obviously more code). I''d also suggest that having i!=0 as the non-termination condition is slightly dangerous - what happens if m_Size starts off as a negative value ? In this instance you might be able to get your single instruction by changing the while clause to include a pre-decrement operator on the ''i'' rather than being in a loop...


register int i = m_Size;
if(i > 0) {
register unsigned char temp;
register unsigned char* reg = RawData;
do
{
temp = reg[0];
reg[0] = reg[2];
reg[2] = temp;
reg += 3;
} while( --i > 0 );
};

Share this post


Link to post
Share on other sites
Assuming this is in a function:


void __fastcall SwapRGB(DWORD size, DWORD* raw)
{
__asm
{
begin: xchg al, byte ptr[edx]
xchg [edx+2], al
xchg al, byte ptr[edx]
add edx, 3
loopne begin
}
}


5 lines. Moral of the story? A mediocre ASM programmer can still outdo a compiler nine times out of ten no matter how you write your code.

Share this post


Link to post
Share on other sites
quote:
5 lines. Moral of the story? A mediocre ASM programmer can still outdo a compiler nine times out of ten no matter how you write your code.

Except that any fule kno that you don''t use xchg unless you *really* want to pay for synchronization, because if the indirect operands aren''t in cache then you incur the locking penalty.

mov reg, mem is almost invariably better than xchg. xchg may be shorter, but that''s the only benefit it provides.

Share this post


Link to post
Share on other sites

void __fastcall SwapRGB(DWORD size, DWORD* raw)
{
__asm
{
begin: mov al, byte ptr[edx]
mov [edx], byte ptr[edx+2]
mov [edx+2], al
add edx, 3
loopne begin
}
}


Still five lines with the mov instead.

Share this post


Link to post
Share on other sites
quote:
Original post by DukeAtreides076
Still five lines with the mov instead.



Five lines that won''t compile.

Share this post


Link to post
Share on other sites

void __fastcall SwapRGB(DWORD size, DWORD* raw)
{
__asm
{
begin: mov al, byte ptr[edx]
mov ah, byte ptr[edx+2]
mov [edx], ah
mov [edx+2], al
add edx, 3
loopne begin
}
}


Happy now? you can tell I havent used asm in a while.

Share this post


Link to post
Share on other sites
Bravo DukeAtreides076. Your code is small, compact, and makes sense. The way that Promit had to degenerate his C++ code to optimize the produced assembly made it a little more difficult to understand. I think in a real world situation this added speed improvement may not be entirely worth the effort. But it''s very cool to look into.
~Mo

Share this post


Link to post
Share on other sites
Duke put it well here.. a real eye-opener, because I am a real expert at writing inefficient loops!

I could not understand how loopne would work, when I cannot see the 2 parameters used (even though I assumed that somehow edx was the address to the memory block). ecx tricked me here, but after reading up on __fastcall, everything made sense..

Share this post


Link to post
Share on other sites
quote:
Original post by DukeAtreides076
Happy now? you can tell I havent used asm in a while.

Not really. loopne is takes quite a few clock cycles to execute. It''s still less efficient than dec/jne pair that sjelkjd showed that MSVC 7.1 generated given the original code. Also you have issues doing two successive mov into the same register, even if you try loading into different parts of the register, this will still most likely generate a pipeline stall, so the second pair of load/stores should be moved to the bl register. Implement both those changes and you get exactly the same loop body as in skelkjd''s post.

Share this post


Link to post
Share on other sites
Oh, one other thing: Promit, if you''re going to investigate the assembly output that your compiler produces do it with the /FA flag, not /FAs. If you use /FAs it disables many of the optimizations it normally does in order to maintain the correlation between assembly and source code.

Share this post


Link to post
Share on other sites
quote:
Oh, one other thing: Promit, if you''re going to investigate the assembly output that your compiler produces do it with the /FA flag, not /FAs. If you use /FAs it disables many of the optimizations it normally does in order to maintain the correlation between assembly and source code.

Never heard that claimed before, and I see no evidence for it myself.

Share this post


Link to post
Share on other sites
Ok, so I was using VC6--clearly 7 is capable of doing better. I still have to get around to testing g++ though.

quote:

You are now depending on m_Size being at least 1; in other words the old code handles m_Size = 0 gracefully, where as this code will read from a (presuably if it has size 0) invalid source and try to write it to and invalid destination - ie your code has introduced a potential bug. Probably the thing to do would be add an explicit if test around the loop (obviously more code). I''d also suggest that having i!=0 as the non-termination condition is slightly dangerous - what happens if m_Size starts off as a negative value ? In this instance you might be able to get your single instruction by changing the while clause to include a pre-decrement operator on the ''i'' rather than being in a loop...



All of those circumstances have been dealt with just before the shown snippet of code; m_Size is guaranteed to be greater than 0.

Duke: your version is interesting, but loopne takes 5-6 cycles, whereas the dec/jne version takes only 2. So you haven''t managed to beat my C, not to mention the pipeline stall you''re likely to invoke, as mentioned before. And besides, this will still compile and run efficiently on non-x86 targets, which was part of my initial goal (missed that part?).

quote:

the "register" keyword is ignored by all new compilers.

What version are you using? AFAIK even VC++ 6.0 will ignore it ..


VC6 will ignore it in the case that /Oe is set, in which case it will make its own choices about registers. I didn''t try it, but my guess is those variables would''ve ended up in registers anyway.

To be honest, the code isn''t *that* incomprehensible. It''s not as obvious as before, but it''s pretty easy to understand if you don''t just glance at it. And this is completely cross-compiler cross-CPU-architecture code. The real point here is that you should think about how you write things like loops, especially in speed critical code, since how you write your code affects how things look when they get compiled, and that can matter.

Share this post


Link to post
Share on other sites
No. You should NOT. Your code should be readable.

EVERY new compiler will optimize such kind of code without the need of "register" or other hacks.

Share this post


Link to post
Share on other sites
actually, you SHOULD. while compilers are getting better all the time, they still cannot optimize as well as an intermediate person. they may be better at doing simple things like loops or other control structures, but what about complex algorithms that search data structures or do some type of sorting? in critical sections of code, readability should suffer at the expense of speed, albeit to an extent... which is why i have a comment almost every one or two lines in my asm blocks.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
You need to be careful about how you optimise, in this example Promit has optimised his code for the MSVC++ 6 compiler on an x86 platform - this may well make the code worse for another compiler/platform. Basically, optimise the algorithm, not the detail - often compilers can manage that to some extent - don''t prematurely optimise the code....

Share this post


Link to post
Share on other sites
Does someone have InterC to try it with full optimisations on that?
IntelC will do some funny mmx/sse stuff ,i think.

Even without mmx or sse,it''s almost always much faster to load 3 32 bit registers and do 4 triples at _one_ iteration.

We can transform
1234 5678 9abc
into
3216 5498 7cba . (321 654 987 cba)

Let data are storen in eax,ebx,ecx.
So
1234 5678 9abc
bswap eax //1
4321 5678 9abc
xchg al,bh //2
6321 5478 9abc
ror eax,8 //3
3216 5478 9abc
bswap ecx //4
3216 5478 cba9
rol ecx,8 //5
3216 5478 9cba
ror ebx,8 //6
3216 4785 9cba
xchg bh,cl //7
3216 4985 7cba
rol ebx,8 //8
3216 5498 7cba


And finally,my nice code that does 4 triples in one iteration:
// write initialisation yourself,if you want.
// esi=-m_size*3
// edx=byte ptr source + esi
// edi=byte ptr destination + esi
// destination may be equal to source

loop1:
// if you really want, experimentally you can
// find better(for pairing)place for that commands.

mov eax,[esi+edx]
mov ebx,[esi+edx+4]
mov ecx,[esi+edx+8]
bswap eax //1
xchg al,bh //2
ror eax,8 //3
bswap ecx //4
rol ecx,8 //5
ror ebx,8 //6
xchg bh,cl //7
rol ebx,8 //8

// and that too
mov [esi+edi],eax
mov [esi+edi+4],ebx
mov [esi+edi+8],ecx
add esi,12
jnc loop1
// when esi crosses zero,there''s CF=1 and no jump.ESI are initialised to negative value.
Funny that i don''t need any extra commands or extra counter.

8 pairing one-clock commands for swapping 4 triples and 6 more commands to store to ram,and 2 more for loop. Probably about 12 cycles all together.
(will test it.)

And that''s without SSE. Have no time to write sse version :-)

Regards,

Dmytry Lavrov.

Share this post


Link to post
Share on other sites
quote:
actually, you SHOULD. while compilers are getting better all the time, they still cannot optimize as well as an intermediate person.

No. You should optimize the algorithm, not such details.
A compiler that cannot detect that these variables can reside in registers is not worth working with.

Share this post


Link to post
Share on other sites

  • Advertisement