# I thought that...

I've been working on an additive blit for a sprite engine, and soon realized that the complexity of the algorithm was O(n^2). I thought to myself that a faster way had to exist, and MMX came up as a valid solution (I think everyone with a semi-modern computer has MMX capabilities). I implemented a bit of MMX code, but the gain in FPS is really poor (only about 2 fps). I thought MMX was supposed to accelerate code in the sense that it can do 8 operations at the same time (well... same time... you know what I mean). Am I doing something wrong, or is MMX just not a great tool? To give an example, the following shows the pixel-adding part of the old additive blit, followed by the MMX version. Note that ATable is a 256x256 array containing every possible alpha-blend value (so I don't have to calculate it all the time).
//the header file
struct Pixel32
{
    unsigned char rgbBlue;
    unsigned char rgbGreen;
    unsigned char rgbRed;
    unsigned char alpha;
};

//declared up front because BlitRGBQUADEx calls it before its definition
unsigned char ClipBytesC(unsigned char x, unsigned char y);

//side effect Pixel32 addition in first Parameter
void BlitRGBQUADEx(Pixel32 *a, Pixel32 *b, int SourceAlpha)
{
    //CLng(.Red) + m_bytLookup(m_lngSourceAlpha, m_pxlSource.Red
    a->rgbRed = ClipBytesC(a->rgbRed, ATable[SourceAlpha][b->rgbRed]);
    a->rgbGreen = ClipBytesC(a->rgbGreen, ATable[SourceAlpha][b->rgbGreen]);
    a->rgbBlue = ClipBytesC(a->rgbBlue, ATable[SourceAlpha][b->rgbBlue]);
}

//summation of two pixels, clamped to 255
unsigned char ClipBytesC(unsigned char x, unsigned char y)
{
    /////MAKE A HABIT OF USING BRACES WITH IF's AND SUCH!!
    int tmp = x + y;
    if (tmp > 255)
    {
        return 255;
    }
    else
    {
        return (unsigned char)tmp;
    }
}
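The post does not show how ATable is filled, so here is a minimal sketch of one way such a table could be built. The formula `value * alpha / 255` and the names `BuildATable`, `ClipAdd`, and `BlendChannel` are assumptions for illustration, not the poster's actual code.

```c
#include <assert.h>

/* Hypothetical layout: ATable[alpha][value] = value scaled by alpha/255. */
unsigned char ATable[256][256];

/* Fill the lookup table once at startup (assumed formula, see above). */
void BuildATable(void)
{
    for (int alpha = 0; alpha < 256; ++alpha) {
        for (int value = 0; value < 256; ++value) {
            ATable[alpha][value] = (unsigned char)((value * alpha) / 255);
        }
    }
}

/* Saturating add of two bytes, same contract as ClipBytesC in the post. */
unsigned char ClipAdd(unsigned char x, unsigned char y)
{
    int sum = x + y;
    return (unsigned char)(sum > 255 ? 255 : sum);
}

/* One channel of the additive blit: dest + (src scaled by alpha), clamped. */
unsigned char BlendChannel(unsigned char dest, unsigned char src, int alpha)
{
    assert(alpha >= 0 && alpha <= 255); /* guards the table index */
    return ClipAdd(dest, ATable[alpha][src]);
}
```

With a table like this, the per-pixel cost is three lookups and three clamped adds, which is the scalar baseline any SIMD version has to beat.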

Now, it looks like this (with MMX):
void TransAlphaMMX(Pixel32 *dest, Pixel32 *src, int ALPHA)
{
    //int cap = width * height; //calculate the number of iterations to go through
    //int tmpcal;
    //int numNext;

    //if (ALPHA<0) //prevent out of bounds error
    //	ALPHA = 0;
    //else if(ALPHA>255)
    //	ALPHA = 255;
    PointerReturn pr;
    pr.p1 = dest;
    pr.p2 = src;
    pr.Alpha = ALPHA;
    int r,g,b,br,bg,bb; //red green blue, backred, backgreen, backblue

    r = dest->rgbRed; g = dest->rgbGreen; b = dest->rgbBlue;
    br = ATable[ALPHA][src->rgbRed]; bg = ATable[ALPHA][src->rgbGreen]; bb = ATable[ALPHA][src->rgbBlue];

    _asm
    {
        push edi    ;Save off to restore later
        push esi    ;Save off to restore later

        //SPAN_RUN_565: movq mm7,[edi]	; Copy the 8 bytes pointed to by edi into mm7
        //movq mm6,[esi]				; Copy the 8 bytes pointed to by esi into mm6

        ;place dest pixels
        movq mm0, r
        movq mm1, g
        movq mm2, b

        ;place source alpha pixels
        movq mm3, br
        movq mm4, bg
        movq mm5, bb

        ;add pixels, dest = dest + source

        ;move transformed pixels to r,g,b
        movq r, mm0
        movq g, mm1
        movq b, mm2

        emms        ;Clean up the MMX registers

        //cmp eax,0					; If eax = 0 we have set the flag
        //jg SPAN_RUN_565				; if flag is zero finish else loop back for more

        pop esi     ;Restore esi
        pop edi     ;Restore edi

    }
    //store transformed pixels
    dest->rgbRed = r; dest->rgbGreen = g; dest->rgbBlue = b;
    //return ALPHA;
    return pr;
}

Thanks

My signature used to suck. But it's much better now.

##### Share on other sites
emms is a very slow instruction; try calling it only once, after you have done all the blends. You won't see an 8x speed improvement; you're probably going to see more like 1.5x, if you optimize it well.
Also, AMD CPUs have femms (fast emms); I'm not sure whether newer Intel processors support it as well.

##### Share on other sites
This is how I'd do it:

Well, you seem to be just adding two arrays, right?

With SSE you can add 4 components at a time. [Don't confuse this with being twice as fast as MMX.] Let's say you have 1,000 RGB values that you want to add; this is what you'd do:

1,000 RGB values (RGB = 3*4 = 12 bytes for one RGB value), so our two arrays are 12,000 bytes long, or 3,000 components each. SSE can do 4 adds in one "operation", so we need to loop 3,000 / 4 = 750 times.

loop 750 times
{
do 2 moves to load 2 registers, one from src array, one from dest
add the 2 registers
write the result back to memory
increment pointers
}

Plus, I couldn't really see how you were taking advantage of MMX. That code loads one value into one register and another value into another register, and tries to add them. What I think you want to do is load a bunch of values into one register and a bunch of values into another register, and call add on them.

Here's what I use: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26568.pdf

Look up these instructions:
MOVDQU [page 151]

Also, try not to traverse 2 arrays that are a multiple of 4k apart on AMD64s, because I think they'll map to the same place in cache... and that's not good.
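The load / add / store loop described in this post can be sketched with compiler intrinsics instead of hand-written assembly. This is only an illustration under stated assumptions: components are taken to be 32-bit integers (matching the 4-bytes-per-component arithmetic above), `AddInt32Arrays` and `HAVE_SSE2` are hypothetical names, and `_mm_loadu_si128` is the intrinsic that corresponds to MOVDQU. A scalar loop handles leftovers and non-SSE2 builds.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* dest[i] += src[i] for count 32-bit components
   (1,000 RGB values = 3,000 components = 750 SSE2 iterations). */
void AddInt32Arrays(int32_t *dest, const int32_t *src, size_t count)
{
    size_t i = 0;
#ifdef HAVE_SSE2
    for (; i + 4 <= count; i += 4) {
        /* two unaligned 16-byte loads (MOVDQU), one from each array */
        __m128i d = _mm_loadu_si128((const __m128i *)(dest + i));
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        /* four 32-bit adds in one instruction (PADDD), then write back */
        _mm_storeu_si128((__m128i *)(dest + i), _mm_add_epi32(d, s));
    }
#endif
    /* scalar tail, and the whole job when SSE2 is not compiled in */
    for (; i < count; ++i) {
        dest[i] += src[i];
    }
}
```

Advancing `i` by 4 plays the role of "increment pointers" in the pseudocode; the unaligned load is the safe default when the arrays are not known to be 16-byte aligned.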

##### Share on other sites
quote:
Original post by ngill
Well, you seem to be just adding two arrays, right?

That's correct. It is actually a matrix, but iterating through the pointer makes it act as an array.

quote:
Original post by ngill
With SSE you can add 4 components at a time. [Don't confuse this with being twice as fast as MMX.]

loop 750 times
{
do 2 moves to load 2 registers, one from src array, one from dest
add the 2 registers
write the result back to memory
increment pointers
}

Two things. First, I'm not quite sure how to code the pseudocode above because I am not familiar at all with SSE. Also, I'm not quite sure what you mean by "do two moves to load 2 registers" and "add 2 registers" (call me stupid, but I just don't seem to get it). Second, couldn't the same process be done with MMX? The idea in using MMX was to obtain a performance boost (at least a considerable one) without forcing users to have a newer machine. In other words, my entire game was written to work with older machines, and I know that MMX has been around for quite some time. I understand that SSE is indeed faster, but I am not sure it has quite as long a history as MMX. Am I correct in believing this, or is SSE on most semi-modern computers nowadays? If SSE has existed for as long as MMX has and most semi-modern computers have it, then how would I write the code in SSE (or even in MMX, since you say that I'm doing it the slow way)?

Thanks a million!


##### Share on other sites
This is a fast MMX additive blitter, doing 2 pixels (8 color components) per iteration. It can be done even faster.

s is a pointer to the source, d is a pointer to the destination, and len is the length of the array in pixels. Preferably, s and d should be 8-byte aligned.

if (len>3) {
__asm {
mov esi,s
mov edi,d
mov ebx,len
mov ecx,len
mov edx,ecx
shr ecx,1
and ebx,3
mov len,ebx
movq [edi],mm0 //store 2 pixels
sub edx,2
dec ecx

cmp edx,0
movd mm0,[esi]
movd mm1,[edi]
movd [edi],mm0
emms
}
}

This isn't 100% optimized, but you should get the idea.
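For readers who find the assembly listing hard to follow, here is a rough sketch of the same idea (2 pixels per iteration, saturating byte add, a single EMMS at the end) written with MMX intrinsics. It is not a transcription of the listing above: the function name `AdditiveBlitMMX` and the macro `HAVE_MMX` are made up for illustration, all four bytes of each pixel are added (including alpha), and a scalar loop covers an odd trailing pixel and builds without MMX.

```c
#include <stddef.h>
#include <string.h>

#if defined(__MMX__) || defined(_M_IX86)
#include <mmintrin.h>
#define HAVE_MMX 1
#endif

/* d[i] = min(255, d[i] + s[i]) for every byte; len is in 32-bit pixels. */
void AdditiveBlitMMX(unsigned char *d, const unsigned char *s, size_t len)
{
    size_t bytes = len * 4;
    size_t i = 0;
#ifdef HAVE_MMX
    for (; i + 8 <= bytes; i += 8) {
        __m64 src, dst;
        memcpy(&src, s + i, 8);        /* load 2 source pixels */
        memcpy(&dst, d + i, 8);        /* load 2 destination pixels */
        dst = _mm_adds_pu8(dst, src);  /* PADDUSB: 8 saturating byte adds */
        memcpy(d + i, &dst, 8);        /* store 2 pixels */
    }
    _mm_empty();                       /* EMMS once, after the whole span */
#endif
    /* scalar tail: odd trailing pixel, or everything without MMX */
    for (; i < bytes; ++i) {
        unsigned sum = (unsigned)d[i] + s[i];
        d[i] = (unsigned char)(sum > 255 ? 255 : sum);
    }
}
```

Because the saturating add clamps in hardware, no ClipBytesC-style branch is needed inside the loop, and EMMS is paid once per call rather than once per pixel.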

##### Share on other sites
quote:
Original post by ageny6
Two things. First, I'm not quite sure how to code the pseudo code above because I am not familiar at all with SSE stuff.

It's REALLY EASY, just google for a tutorial on SSE.

quote:

Also, I'm not quite sure what you mean by "do two moves to load 2 registers" and "add 2 registers" (call me stupid, but I just don't seem to get it). Second, couldn't the same process be done with MMX?

Check out the pdf link in my earlier post, I even gave you the page numbers of the instructions. It's all there bro.

If you really want to learn, do this: write it once in MMX. Write it again in SSE. Then write some code to test which one is faster. If you wrote your code correctly, I wouldn't be surprised if SSE2 is faster. Then, if it's available, use the SSE2 code path on a newer machine, and MMX on an older machine.

"SSE2 code can always be written to perform as well as MMX. Frequently better, especially with unrolling, but need to understand the machine [if you want to optimize ALOT]"
--Ranganathan Sudhakar [an expert in Athlon 64 architecture]
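Picking a code path at run time, as suggested above, could look roughly like the sketch below. Several things here are assumptions: the names `BlitFn`, `BlitScalar`, `BlitSSE2`, and `ChooseBlit` are invented, a plain C loop stands in for the MMX fallback, and the CPU check relies on the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, so other compilers would need their own detection (for example a CPUID wrapper).

```c
#include <stddef.h>

typedef void (*BlitFn)(unsigned char *d, const unsigned char *s, size_t bytes);

/* Portable fallback: per-byte saturating add (stand-in for the MMX path). */
void BlitScalar(unsigned char *d, const unsigned char *s, size_t bytes)
{
    for (size_t i = 0; i < bytes; ++i) {
        unsigned sum = (unsigned)d[i] + s[i];
        d[i] = (unsigned char)(sum > 255 ? 255 : sum);
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1

/* SSE2 path: 16 bytes (4 pixels) per iteration with PADDUSB. */
__attribute__((target("sse2")))
void BlitSSE2(unsigned char *d, const unsigned char *s, size_t bytes)
{
    size_t i = 0;
    for (; i + 16 <= bytes; i += 16) {
        __m128i dv = _mm_loadu_si128((const __m128i *)(d + i));
        __m128i sv = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), _mm_adds_epu8(dv, sv));
    }
    BlitScalar(d + i, s + i, bytes - i); /* leftover bytes */
}
#endif

/* Decide once at startup which implementation to call. */
BlitFn ChooseBlit(void)
{
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) {
        return BlitSSE2;
    }
#endif
    return BlitScalar;
}
```

A caller would store the result of `ChooseBlit()` in a function pointer once and use it for every span, which keeps the CPU check out of the inner loop and makes it easy to time both paths against each other.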

[edited by - ngill on March 26, 2004 3:45:31 PM]

##### Share on other sites
Nice! Thanks a lot

I stumbled upon a program called quexal, which abstracts assembly code into a nicer interface. I do, however, have to pay to get the full version.

Does anyone know whether the free shareware version is worthwhile? Is there a better application out there? Is there a free application of the sort?

Thanks


##### Share on other sites
I'd pay to learn SSE, not to run away from it.

If managing registers scares you, try intrinsics [which handle register allocation and scheduling for you].

But intrinsics performance sucks unless you have VS 8.0 =[. Don't know how it is with other compilers.

##### Share on other sites
Do it with bit shifts:

int blend(int surfaceColor, int newColor, int amt)
{
    return 0xff000000 |
        ((((amt * ((newColor & 0xff00ff) - (surfaceColor & 0xff00ff))) >> 8) +
            (surfaceColor & 0xff00ff)) & 0xff00ff) |
        ((((amt * ((newColor & 0x00ff00) - (surfaceColor & 0x00ff00))) >> 8) +
            (surfaceColor & 0x00ff00)) & 0x00ff00);
}
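The function above is an alpha blend (interpolation) rather than an additive blit. A similar masks-and-shifts approach can also do the clamped add from the original question on all four bytes of a pixel at once, with no SIMD instructions. The sketch below is a swapped-in technique, not something posted in this thread, and the name `AddSaturateU8x4` is made up.

```c
#include <stdint.h>

/* Per-byte saturating add of two packed 32-bit pixels in plain C. */
uint32_t AddSaturateU8x4(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of each byte; the result cannot carry into
       the neighbouring byte because 0x7f + 0x7f = 0xfe. */
    uint32_t low = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);

    /* Fold in the top bit of each byte: per-byte sum modulo 256. */
    uint32_t sum = low ^ ((a ^ b) & 0x80808080u);

    /* Carry out of each byte = majority(a7, b7, carry into bit 7). */
    uint32_t carry = ((a & b) | ((a | b) & low)) & 0x80808080u;

    /* Turn each carry bit into a 0xff mask and force those bytes to 255. */
    uint32_t mask = (carry >> 7) * 0xffu;
    return sum | mask;
}
```

This handles one pixel per call, so it is slower than the MMX or SSE2 routes, but it runs on any CPU and avoids the per-channel branch in ClipBytesC.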
