I thought that...


I've been working on an additive blit for a sprite engine, and soon realized that the complexity of the algorithm was O(n^2). I thought to myself that a faster way had to exist, and MMX came up as a valid solution (I think everyone with a semi-modern computer has MMX capabilities). I implemented a bit of MMX code, but the gain in FPS is really poor (only about 2 fps). I thought MMX was supposed to accelerate code in the sense that it can do 8 operations at the same time (well... same time... you know what I mean). Am I doing something wrong, or is MMX just not a great tool?

To give an example, the following shows the pixel-adding part of the old additive blit and then the MMX version. Note that ATable is a 256x256 array containing every possible alpha-blended value (so I don't have to calculate it each time).
//the header file
struct Pixel32
{
unsigned char rgbBlue;
unsigned char rgbGreen;
unsigned char rgbRed;
unsigned char alpha;
};

//side effect Pixel32 addition in first Parameter
void BlitRGBQUADEx(Pixel32 *a, Pixel32 *b, int SourceAlpha)
{
	//CLng(.Red) + m_bytLookup(m_lngSourceAlpha, m_pxlSource.Red
	a->rgbRed = ClipBytesC(a->rgbRed,ATable[SourceAlpha][b->rgbRed]);
	a->rgbGreen = ClipBytesC(a->rgbGreen,ATable[SourceAlpha][b->rgbGreen]);
	a->rgbBlue = ClipBytesC(a->rgbBlue,ATable[SourceAlpha][b->rgbBlue]);
} 

//summation of two pixels
unsigned char ClipBytesC(unsigned char x, unsigned char y)
{
	/////MAKE A HABIT OF USING BRACES WITH IF's AND SUCH!!
	int tmp = x+y;
	if (tmp>255)
	{
		return 255;
	}
	else
	{
		return tmp;
	}
}
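
The lookup table is only described above, not shown, so here is a rough sketch of how such a table could be filled. The formula (component scaled by alpha/255) and the names BuildATable and ScaleByAlpha are assumptions made for illustration; the real ATable may be computed differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ATable: entry [alpha][c] holds the colour
   component c scaled by alpha/255 (an assumed formula). */
static uint8_t ATable[256][256];

/* Fill the table once at start-up so no multiply is needed per pixel. */
void BuildATable(void)
{
    for (int alpha = 0; alpha < 256; ++alpha) {
        for (int c = 0; c < 256; ++c) {
            ATable[alpha][c] = (uint8_t)((alpha * c) / 255);
        }
    }
}

/* Look up a pre-scaled component; alpha must already be in 0..255. */
uint8_t ScaleByAlpha(int alpha, uint8_t c)
{
    assert(alpha >= 0 && alpha <= 255);
    return ATable[alpha][c];
}
```

The table costs 64 KB, and each component becomes one memory read instead of a multiply and a divide.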
 
Now, it looks like this (with MMX):
void TransAlphaMMX(Pixel32 *dest, Pixel32 *src, int ALPHA)
{
	//int cap = width * height; //calculate the number of iterations to go through
	//int tmpcal;
	//int numNext;

	//if (ALPHA<0) //prevent out of bounds error
	//	ALPHA = 0;
	//else if(ALPHA>255) 
	//	ALPHA = 255;
	PointerReturn pr;
	pr.p1 = dest;
	pr.p2 = src;
	pr.Alpha = ALPHA;
	int r,g,b,br,bg,bb; //red green blue, backred, backgreen, backblue
	
	r = dest->rgbRed; g = dest->rgbGreen; b = dest->rgbBlue;
	br = ATable[ALPHA][src->rgbRed]; bg = ATable[ALPHA][src->rgbGreen]; bb = ATable[ALPHA][src->rgbBlue];


	_asm
	{
		push edi	;Save off to restore later
		push esi	;Save off to restore later
		
		//SPAN_RUN_565: movq mm7,[edi]	; Copy the 8 bytes pointed to by edi into mm7
		//movq mm6,[esi]				; Copy the 8 bytes pointed to by esi into mm6

		;place dest pixels
		movq mm0, r
		movq mm1, g
		movq mm2, b

		;place source alpha pixels
		movq mm3, br
		movq mm4, bg
		movq mm5, bb

		;add pixels, dest = dest + source
		paddw mm0, mm3 ;red
		paddw mm1, mm4 ;green
		paddw mm2, mm5 ;blue

		;move transformed pixels to r,g,b
		movq r, mm0
		movq g, mm1
		movq b, mm2

		emms		;Clean up the MMX registers

		//cmp eax,0					; If eax = 0 we have set the flag
		//jg SPAN_RUN_565				; if flag is zero finish else loop back for more

		pop esi		;Restore esi
		pop edi		;Restore edi
		
	}
	//store the transformed pixels
	dest->rgbRed = r; dest->rgbGreen = g; dest->rgbBlue = b;
	//return ALPHA;
	return pr;
}
 
Thanks

My signature used to suck. But it's much better now.

emms is a very slow instruction; try calling it only once, after you have done all the blends. You won't see an 8x speed improvement. You're probably going to see more like 1.5x, if you optimize it well.
Also, AMD CPUs have femms (fast emms); I'm not sure whether newer Intel processors support it as well.

This is how I'd do it:

Well, you seem to be just adding two arrays, right?

With SSE you can add 4 components at a time. [Don't confuse this as being twice as fast as MMX.] Let's say you have 1,000 RGB values that you want to add. This is what you'd do:

So, 1,000 RGB values (RGB = 3*4 = 12 bytes for one RGB value), which means our two arrays are 12,000 bytes long. SSE can do 4 adds in one "operation", and there are 3,000 components in total, so we need to loop 750 times (3,000 / 4).

loop 750 times
{
do 2 moves to load 2 registers, one from src array, one from dest
add 2 registers
write the result back to memory
increment pointers
}

tada, there you go.

Plus, I couldn't really see how you were taking advantage of MMX. That code loads one value into one register and another value into another register, and tries to add them. What I think you want to do is load a bunch of values into one register and a bunch of values into another register, and call add on them.

Here's what I use: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26568.pdf

Look up these instructions:
MOVDQU [page 151]
PADD [page 213]

Also, try not to traverse 2 arrays that are a multiple of 4k apart on AMD64s, because I think they'll map to the same place in cache... and that's not good.
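
For anyone who wants to see the loop above in compilable form, here is a minimal sketch using SSE2 intrinsics in C instead of raw assembly. It assumes 4-byte unsigned integer components (matching the "3*4 = 12 bytes" layout), and the names add_components and HAVE_SSE2 are made up for this example. When SSE2 is not available at compile time, it falls back to a plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* dest[i] += src[i] for n 32-bit components (wrapping add, like PADDD). */
void add_components(uint32_t *dest, const uint32_t *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    size_t i = 0;
#if HAVE_SSE2
    /* Four components per iteration: 3,000 components -> 750 iterations. */
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dest + i)); /* MOVDQU */
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));  /* MOVDQU */
        a = _mm_add_epi32(a, b);                                   /* PADDD  */
        _mm_storeu_si128((__m128i *)(dest + i), a);                /* write back */
    }
#endif
    /* Scalar loop handles the tail, or everything when SSE2 is absent. */
    for (; i < n; ++i) {
        dest[i] += src[i];
    }
}
```

The "increment pointers" step from the pseudocode is the `i += 4` here; the unaligned load and store are used so the arrays need no particular alignment.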

quote:
Original post by ngill
Well, you seem to be just adding two arrays, right?

That's correct. It is actually a matrix, but iterating through the pointer makes it act as an array.

quote:
Original post by ngill
With SSE you can add 4 components at a time. [Don't confuse this as being twice as fast as MMX.]

loop 750 times
{
do 2 moves to load 2 registers, one from src array, one from dest
add 2 registers
write the result back to memory
increment pointers
}

tada, there you go.



Two things. First, I'm not quite sure how to code the pseudocode above because I am not at all familiar with SSE. I'm also not quite sure what you mean by "do two moves to load 2 registers" and "add 2 registers" (call me stupid, but I just don't seem to get it). Second, couldn't the same process be done with MMX? The idea in using MMX was to obtain a performance boost (at least a reasonably considerable one) without forcing users to have a newer machine. In other words, my entire game was written to work on older machines, and I know that MMX has been around for quite some time. I understand that SSE is indeed faster, but I am not sure it has quite as long a history as MMX. Am I correct in believing this, or is SSE on most semi-modern computers nowadays? If SSE has existed for as long as MMX has and most semi-modern computers have it, then how would I write the code in SSE (or even in MMX, since you say that I'm doing it the slow way)?

Thanks a million!

Guest Anonymous Poster
This is a fast MMX additive blitter, doing 2 pixels (8 color components) per iteration. It can be done even faster.

s is a pointer to the source, d is a pointer to the destination, and len is the length of the array in pixels. Preferably, s and d should be 8-byte aligned.

if (len>3) {
__asm {
mov esi,s
mov edi,d
mov ebx,len
mov ecx,len
mov edx,ecx
shr ecx,1
and ebx,3
mov len,ebx
addblitloop:
movq mm0,[esi] //load 2 pixels
movq mm1,[edi] //load 2 pixels
paddusb mm0,mm1 //add them (all 8 bytes at once)
movq [edi],mm0 //store 2 pixels
add esi,8 //advance the source and destination pointers
add edi,8
sub edx,2
dec ecx
jnz addblitloop

cmp edx,0
jz skipfinishadd
movd mm0,[esi]
movd mm1,[edi]
paddusb mm0,mm1
movd [edi],mm0
skipfinishadd:
emms
}
}

This isn't 100% optimized, but you should be getting the idea.
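
If it helps to check the assembly against something simpler, the following is a plain C sketch of what paddusb does to each byte: an unsigned add that clamps at 255 instead of wrapping. The function names add_sat_u8 and add_blit_reference are invented here, and the sketch assumes 4 bytes per pixel as in Pixel32. It is meant as a slow reference to compare results with, not as a replacement for the MMX loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saturating add of one byte: the per-byte operation of PADDUSB. */
static uint8_t add_sat_u8(uint8_t x, uint8_t y)
{
    unsigned sum = (unsigned)x + y;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Additive blit over len pixels of 4 bytes each, one byte at a time.
   Like the MMX loop, this adds all four bytes, including alpha. */
void add_blit_reference(uint8_t *d, const uint8_t *s, size_t len)
{
    assert(d != NULL && s != NULL);
    for (size_t i = 0; i < len * 4; ++i) {
        d[i] = add_sat_u8(d[i], s[i]);
    }
}
```

Running both versions over the same buffers and comparing the output is a quick way to catch mistakes in the loop counters or the odd-pixel tail.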

quote:
Original post by ageny6
Two things. First, I'm not quite sure how to code the pseudo code above because I am not familiar at all with SSE stuff.


It's REALLY EASY, just google for a tutorial on SSE.

quote:

Also, I'm not quite sure what you mean by "do two moves to load 2 registers" and "add 2 registers" (call me stupid, but I just don't seem to get it). Second, couldn't the same process be done with MMX?




Check out the pdf link in my earlier post, I even gave you the page numbers of the instructions. It's all there bro.

If you really want to learn, do this: write it once in MMX. Write it again in SSE. Then write some code to test which one is faster. If you wrote your code correctly, I wouldn't be surprised if SSE2 is faster. Then, if it's available, use the SSE2 code path on a newer machine, and MMX on an older machine.

"SSE2 code can always be written to perform as well as MMX. Frequently better, especially with unrolling, but need to understand the machine [if you want to optimize ALOT]"
--Ranganathan Sudhakar [an expert in athlon64 architecture]

[edited by - ngill on March 26, 2004 3:45:31 PM]

Nice! Thanks a lot

I stumbled upon a program called quexal, which abstracts assembly code into a nicer interface. I do, however, have to pay to get the full version.

Does anyone know whether the free shareware version is worthwhile? Is there a better application out there? Is there a free application of the sort?

Thanks

I'd pay to learn SSE, not to run away from it.

If managing registers scares you, try intrinsics [which handle register allocation and scheduling for you].

But intrinsics performance sucks unless you have VS 8.0 =[. I don't know how it is with other compilers.
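
To give an idea of what the intrinsics route looks like, here is a small sketch of an additive blit written with SSE2 intrinsics. The compiler picks the registers; you only name the operations. The function name add_blit_intrinsics and the USE_SSE2 macro are made up for this example, and the code falls back to a scalar loop when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* d[i] = min(255, d[i] + s[i]) over n_bytes bytes (4 bytes per pixel). */
void add_blit_intrinsics(uint8_t *d, const uint8_t *s, size_t n_bytes)
{
    assert(d != NULL && s != NULL);
    size_t i = 0;
#if USE_SSE2
    /* 16 bytes (4 pixels) per iteration; no register names anywhere. */
    for (; i + 16 <= n_bytes; i += 16) {
        __m128i dst = _mm_loadu_si128((const __m128i *)(d + i));
        __m128i src = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), _mm_adds_epu8(dst, src));
    }
#endif
    /* Leftover bytes, or the whole buffer when SSE2 is unavailable. */
    for (; i < n_bytes; ++i) {
        unsigned sum = (unsigned)d[i] + s[i];
        d[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Because this uses SSE registers rather than MMX ones, no emms is needed afterwards. How well it performs still depends on the compiler, so it is worth timing against the hand-written version.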

Guest Anonymous Poster
Do it with bit shifts:


int blend(int surfaceColor, int newColor , int amt ) {
return 0xff000000 |
( ( ( amt * ( ( ( newColor ) & 0xff00ff ) -
( ( surfaceColor ) & 0xff00ff ) ) >> 8 ) +
( ( surfaceColor ) & 0xff00ff ) ) & 0xff00ff ) |
( ( amt * ( ( ( newColor ) & 0x00ff00 ) -
( ( surfaceColor ) & 0x00ff00 ) ) >> 8 ) +
( ( surfaceColor ) & 0x00ff00 ) ) & 0x00ff00 ;
}
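
Two notes for anyone trying this. The function above is an alpha blend (it moves surfaceColor toward newColor by amt/256), not an additive blit. Also, with signed int the product amt * difference can exceed the int range, so the sketch below restates the same expression with unsigned 32-bit arithmetic, where wraparound is well defined and the masks discard the overflowed bits. The name blend_u32 is made up here, and amt is assumed to be in 0..255.

```c
#include <assert.h>
#include <stdint.h>

/* Blend newColor over surfaceColor; amt is 0..255 (0 keeps the surface).
   Red and blue share one multiply, green gets the other. */
uint32_t blend_u32(uint32_t surfaceColor, uint32_t newColor, uint32_t amt)
{
    assert(amt <= 255u);
    const uint32_t rb_mask = 0x00ff00ffu;
    const uint32_t g_mask  = 0x0000ff00u;
    uint32_t s_rb = surfaceColor & rb_mask;
    uint32_t s_g  = surfaceColor & g_mask;
    /* Unsigned wraparound stands in for the negative differences. */
    uint32_t rb = (((amt * ((newColor & rb_mask) - s_rb)) >> 8) + s_rb) & rb_mask;
    uint32_t g  = (((amt * ((newColor & g_mask) - s_g)) >> 8) + s_g) & g_mask;
    return 0xff000000u | rb | g;
}
```

Since the shift divides by 256 rather than 255, amt = 255 gets very close to newColor but does not always reach it exactly.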

Problem,

I am passing the pointer of the array as a parameter (look at TransAlphaMMX method in source code given on first post). So when I do:


void TransAlphaMMX(Pixel32 *dest, Pixel32 *src, int ALPHA)
{
movq mm0, dest;
movq mm1, src;
paddusb mm0,mm1;
}

It doesn't work, and worst of all, it messes up my pointers and the entire algorithm crashes hopelessly to the ground. What is a man to do?

What I am thinking is that the pointers are added together, and not the data that the pointers point to. Am I correct to believe this, or should I stop banging my head on the computer screen?

P.S. I know I haven't packed the data in the same mmx register, but the small bit of code above is just an example.


[edited by - ageny6 on March 29, 2004 5:56:44 PM]

[edited by - ageny6 on March 29, 2004 6:16:50 PM]

Why? It doesn't seem like it's going to be faster.

Remember, just because it's in ASM/SSE/whatever doesn't mean it's going to be fast.

quote:
Original post by Anonymous Poster
Do it with bit shifts:





Yep, you're right, it's adding the pointer values together.
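
The usual way around that is to load the bytes stored at the address (in the inline assembly, that means moving the pointer into a general register first and using it in brackets, as the loop posted earlier does with [esi] and [edi]), and then to store the result back through the destination pointer. Here is a hedged sketch of the same idea in C with SSE2 intrinsics; AddTwoPixels and USE_SSE2 are names invented for this example, and a scalar fallback is used when SSE2 is not enabled.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* Same layout as the Pixel32 struct in the first post. */
typedef struct {
    unsigned char rgbBlue, rgbGreen, rgbRed, alpha;
} Pixel32;

/* Saturating add of two adjacent pixels: dest[0..1] += src[0..1]. */
void AddTwoPixels(Pixel32 *dest, const Pixel32 *src)
{
    assert(dest != NULL && src != NULL);
#if USE_SSE2
    /* Load the 8 bytes stored AT each address, not the addresses. */
    __m128i d = _mm_loadl_epi64((const __m128i *)dest);
    __m128i s = _mm_loadl_epi64((const __m128i *)src);
    d = _mm_adds_epu8(d, s);
    /* Store through the pointer; without this nothing is written. */
    _mm_storel_epi64((__m128i *)dest, d);
#else
    unsigned char *db = (unsigned char *)dest;
    const unsigned char *sb = (const unsigned char *)src;
    for (int i = 0; i < 8; ++i) {
        unsigned sum = (unsigned)db[i] + sb[i];
        db[i] = (unsigned char)(sum > 255u ? 255u : sum);
    }
#endif
}
```

Note that the snippet in the question also never writes mm0 back to memory, so even with the loads fixed it would need a store at the end.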

Then what is a man to do???

My signature used to suck. But it's much better and more accurate now.
