Software blitting (SSE and ASM)

8 comments, last by l0calh05t 15 years, 9 months ago
For a long time I have been reading about and researching SSE and other ways to speed up time-consuming functions. Unfortunately there is not much information about SSE, or at least not that I can find. I haven't gotten very far, but I'm already stuck on one thing: I get a memory write error. EDIT: As of now, I'm just trying to copy the texture into the screen pointer by using the SSE registers.

	int	i = height;

	do
	{
		_asm
		{
			// Move the destination pointer into the EDI register
			mov EDI, screenDataPnt

			// Move the sprite pointer into the ESI register
			mov ESI, textureDataPnt

			// Move 8 pixels into the 8 SSE registers
			// Using + 8 to create an offset from the pointer, as the
			// information we are accessing is 8 bits * 4 (RGBA)
			// Each SSE register will now have 32 bits in it, which is less than the 128
			// that is possible, but to keep it simpler, there will only be one pixel in each register
			MOVUPS XMM0, [ESI]   
			MOVUPS XMM1, [ESI + 32]  
			MOVUPS XMM2, [ESI + 64]  
			MOVUPS XMM3, [ESI + 96]  
			MOVUPS XMM4, [ESI + 128] 
			MOVUPS XMM5, [ESI + 160]  
			MOVUPS XMM6, [ESI + 192]  
			MOVUPS XMM7, [ESI + 224]  

			MOVUPS [EDI], XMM0
			MOVUPS [EDI + 32], XMM1
			MOVUPS [EDI + 64], XMM2
			MOVUPS [EDI + 96], XMM3
			MOVUPS [EDI + 128], XMM4
			MOVUPS [EDI + 160], XMM5
			MOVUPS [EDI + 192], XMM6
			MOVUPS [EDI + 224], XMM7
		}

		screenDataPnt += endScreenOffset;
		textureDataPnt += endTextureOffset;

	}while (--i > 0);


Could anyone tell me what I am doing wrong? Thanks
Each SSE register is 16 bytes wide, not 32 bytes as your code implies. That is, each register holds just four 32-bit pixels.
Also, this is not a particularly efficient way of moving data. You want to at least align the destination pointer and use MOVAPS/MOVDQA for the stores. Otherwise plain memcpy() will probably be faster.

edit: I reread the comments and it seems like you've confused the SS (scalar single-precision) instructions, which process individual floats, with the PS (packed single-precision) instructions, which process four floats in parallel. Furthermore, the memory offsets in the loops are in *bytes* (four per pixel) rather than bits (32 per pixel). So what you really want is to use MOVSS with offsets 0, 4, 8, 12, 16, ... instead.
I second the recommendation for memcpy instead of this. If you must use assembly, then it's probably faster to use REP MOVSD than to do what you're doing now.

Richard "Superpig" Fine - saving pigs from untimely fates - Microsoft DirectX MVP 2006/2007/2008/2009
"Shaders are not meant to do everything. Of course you can try to use it for everything, but it's like playing football using cabbage." - MickeyMouse

Hey, thanks a lot for helping me out.
EDIT:
Removed question, solved it.

The reason I want to use SSE (apart from trying to learn it) is that I have to perform several calculations on each pixel.
Is SSE a valid option then?

What I am trying to do is convert this to SSE:
screenDataPnt[0] = screenDataPnt[0] + (( alpha * (blue - screenDataPnt[0] )) >> 8);
screenDataPnt[1] = screenDataPnt[1] + (( alpha * (green - screenDataPnt[1] )) >> 8);
screenDataPnt[2] = screenDataPnt[2] + (( alpha * (red - screenDataPnt[2] )) >> 8);
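
For reference, the three per-channel updates above can be written as one scalar routine, which is handy as a baseline to check any SSE version against. This is only a sketch: the name blendPixel is hypothetical, and it assumes 4-byte pixels stored in B, G, R, A order with alpha taken from the texture pixel.

```cpp
#include <cstdint>

// blendPixel is a hypothetical helper name.
// Assumption: each pixel is 4 bytes in B, G, R, A order and the blend
// factor is the texture pixel's alpha byte.
inline void blendPixel(std::uint8_t* screen, const std::uint8_t* texture)
{
    const int alpha = texture[3];
    for (int c = 0; c < 3; ++c) {   // B, G, R; the screen's alpha byte is left alone
        const int s = screen[c];
        const int t = texture[c];
        // (t - s) can be negative, so the arithmetic is done in int.
        // The >> 8 relies on an arithmetic right shift of negative values
        // (guaranteed since C++20, and what mainstream compilers do anyway).
        screen[c] = static_cast<std::uint8_t>(s + ((alpha * (t - s)) >> 8));
    }
}
```

Note that alpha is at most 255, so a fully opaque texture pixel never quite replaces the screen pixel with this formula.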


Again, thanks

[Edited by - h3ro on July 5, 2008 1:44:57 PM]
when you turn SSE on, VC++ replaces memcpy with MOVDQA if possible. so yeah, use memcpy. and when you don't, make sure to use aligned moves. (obviously your buffers will have to be aligned)

anyways, that operation should be possible to implement with SSE2. but i'd recommend using intrinsics instead of inline assembler.
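
As a rough illustration of the intrinsics route, here is a sketch of a plain row copy that moves four 32-bit pixels per iteration. The name copyRow and its parameters are hypothetical. It uses unaligned loads and stores (MOVDQU) so it works on any pointer; with 16-byte-aligned buffers, _mm_load_si128 and _mm_store_si128 (MOVDQA) could be used instead.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>
#include <cstring>

// copyRow is a hypothetical helper: copies `pixels` 32-bit pixels from src to dst.
inline void copyRow(std::uint8_t* dst, const std::uint8_t* src, std::size_t pixels)
{
    std::size_t i = 0;

    // Four pixels (16 bytes) per iteration through one XMM register.
    for (; i + 4 <= pixels; i += 4) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i * 4));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i * 4), v);
    }

    // Copy the remaining 0 to 3 pixels one at a time.
    for (; i < pixels; ++i) {
        std::memcpy(dst + i * 4, src + i * 4, 4);
    }
}
```

For a pure copy like this, memcpy is still the simpler choice; the loop is mainly a skeleton to hang per-pixel maths on.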
I have been re-thinking some of what I am trying to do.

As I am going to access both screen pixels and texture pixels all the time, I need to load them both into the SSE registers in order to get the speed gain SSE offers (right? Please let me know if any of my logic anywhere is wrong).

So,
XMM0 to XMM3 are going to be texturePixels
XMM4 to XMM7 are going to be screenPixels

Then at the end, I swap XMM0 to XMM3 with XMM4 to XMM7

But as each register holds a whole pixel (an int laid out as BBBBGGGGRRRRAAAA), how do I access each colour of a pixel? I know I can use bit logic to mask out the values I want, but I have to store that masked value in a new register. And by doing that I can't keep all the information in the SSE registers.

So, how would I go about converting the following code into SSE?
Final Screen Blue = screenBlue + ((textureAlpha * (textureBlue - screenBlue)) >> 8)

I know that in assembly you have to do it one instruction at a time, so what I really need help to understand is how to efficiently access the different pieces of information.
i think something like this should work:

movdqa -> load 4 texture pixels into xmm register
movdqa -> load 4 screen pixels into an xmm register
pextrb -> extract alpha bytes
psubusb -> subtract screen values from texture values (saturating, dunno if that's what you want)
pshufb -> distribute bytes to have a byte space between one another (will also need a copy to another xmm register)
pmulhuw -> perform multiplication (only stores the upper bytes so no shift needed)
pshufb+pblendvb -> mix both xmm registers again so the resulting bytes are next to one another again
paddusb -> add result to screen pixel register (saturating)
movdqa -> store result to screen

edit:
don't use a whole xmm register for a 32 bit pixel. you can load 4 at a time (128 bits), so you should.
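
The outline above can be approximated with SSE2 intrinsics alone. The sketch below is not a literal translation: it swaps pshufb/pblendvb for unpack and pack instructions, and it rewrites screen + ((alpha * (texture - screen)) >> 8) as (screen * (256 - alpha) + texture * alpha) >> 8, which gives the same result but stays inside unsigned 16-bit range. The name blend4 is hypothetical, and the code assumes 4-byte B, G, R, A pixels with alpha taken from the texture.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// blend4 is a hypothetical helper: blends four texture pixels over four
// screen pixels in place. Assumption: B, G, R, A byte order per pixel.
inline void blend4(std::uint8_t* screen, const std::uint8_t* texture)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(texture));
    const __m128i dst = _mm_loadu_si128(reinterpret_cast<const __m128i*>(screen));

    // Widen every byte to a 16-bit lane: two pixels per register.
    const __m128i srcLo = _mm_unpacklo_epi8(src, zero);
    const __m128i srcHi = _mm_unpackhi_epi8(src, zero);
    const __m128i dstLo = _mm_unpacklo_epi8(dst, zero);
    const __m128i dstHi = _mm_unpackhi_epi8(dst, zero);

    // Broadcast each pixel's alpha (lanes 3 and 7) across that pixel's lanes.
    const __m128i aLo = _mm_shufflehi_epi16(
        _mm_shufflelo_epi16(srcLo, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));
    const __m128i aHi = _mm_shufflehi_epi16(
        _mm_shufflelo_epi16(srcHi, _MM_SHUFFLE(3, 3, 3, 3)), _MM_SHUFFLE(3, 3, 3, 3));

    // (dst * (256 - a) + src * a) >> 8; the sum is at most 255 * 256.
    const __m128i k256 = _mm_set1_epi16(256);
    const __m128i outLo = _mm_srli_epi16(
        _mm_add_epi16(_mm_mullo_epi16(dstLo, _mm_sub_epi16(k256, aLo)),
                      _mm_mullo_epi16(srcLo, aLo)), 8);
    const __m128i outHi = _mm_srli_epi16(
        _mm_add_epi16(_mm_mullo_epi16(dstHi, _mm_sub_epi16(k256, aHi)),
                      _mm_mullo_epi16(srcHi, aHi)), 8);

    // Narrow back to bytes, then put the screen's original alpha bytes back.
    __m128i out = _mm_packus_epi16(outLo, outHi);
    const __m128i alphaMask = _mm_set1_epi32(static_cast<int>(0xFF000000u));
    out = _mm_or_si128(_mm_andnot_si128(alphaMask, out), _mm_and_si128(alphaMask, dst));

    _mm_storeu_si128(reinterpret_cast<__m128i*>(screen), out);
}
```

Widening to 16 bits halves the pixels handled per multiply, but it keeps the maths exact and needs nothing newer than SSE2.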
>>don't use a whole xmm register for a 32 bit pixel. you can load 4 at a time (128 bits), so you should.
I know, but I have never used SSE before, so I decided to get it working with one pixel in each register first.

Thanks for the rest of the code, I'll try it out as soon as I get home.
OK, now I'm a bit confused, again.

I have the following code, that works. It blits the image to the screen (no alpha blending so far), but I am not really sure I understand what I did to get it working.

My original data is stored in an array where the format is like this, where each element is a byte:
textureDataPnt[0] is pixel_1 Blue
textureDataPnt[1] is pixel_1 Green
textureDataPnt[2] is pixel_1 Red
textureDataPnt[3] is pixel_1 alpha
textureDataPnt[4] is pixel_2 Blue
...

So when I load them into the SSE registers with 4 as an offset, am I not just loading one color for each pixel?
MOVSS XMM0, [ESI] // textureDataPnt[0] <-- Pixel_1 Blue
MOVSS XMM1, [ESI + 4] // textureDataPnt[0+4] <-- Pixel_2 Blue
MOVSS XMM2, [ESI + 8] // textureDataPnt[0+8] <-- Pixel_3 Blue
MOVSS XMM3, [ESI + 12] // textureDataPnt[0+12] <-- Pixel_4 Blue

Is that correct? If so, why does the following code work?

	int dividedWidth = width / 4;
	int j = dividedWidth;

	//-------------------------------------------------------
	// SSE BLITTING CODE!
	//-------------------------------------------------------
	//int	i = height;
	for (int i = 0; i < height; i++)
	{
		// dividedWidth is the width of the texture / 4
		for (int j = 0; j < dividedWidth; j++)
		{
			_asm
			{
				// Move the destination pointer into the EDI register
				mov EDI, screenDataPnt

				// Move the sprite pointer into the ESI register
				mov ESI, textureDataPnt

				// Move 4 pixels into 4 of the 8 SSE registers
				// Each variable is an int (32 bits) and an SSE register can store 128 bits, but for now there
				// will only be one pixel in each register, to make it simpler to work with.
				// 4 bytes is 32 bits, hence the offset value
				// Colour mode is BGRA with 32-bit colour
				MOVSS  XMM0, [ESI]
				MOVSS  XMM1, [ESI + 4]
				MOVSS  XMM2, [ESI + 8]
				MOVSS  XMM3, [ESI + 12]

				// Move 4 screen pixels
				MOVSS  XMM4, [EDI + 0]
				MOVSS  XMM5, [EDI + 4]
				MOVSS  XMM6, [EDI + 8]
				MOVSS  XMM7, [EDI + 12]

				// Get the value of Blue
				// final screenBlue = Blue
				// final screenBlue = - screenBlue
				// final screenBlue = * alpha
				// final screenBlue = + screenBlue
				// final screenBlue =  >> 8
				// PSUBSW  XMM0, XMM4		// -
				// PMULLW  XMM0, AlphaVal	// *

				// Move the pixels into the screen pointer
				MOVSS  [EDI], XMM0
				MOVSS  [EDI + 4], XMM1
				MOVSS  [EDI + 8], XMM2
				MOVSS  [EDI + 12], XMM3
			}

			screenDataPnt += 16;
			textureDataPnt += 16;
		}

		// (ScreenWidth - textureWidth) * number of bytes per pixel
		//	640         -      64		*     4
		screenDataPnt += 2304;
	}
}


How do I load more than one variable into each register?
How do I put textureDataPnt[0] to textureDataPnt[3] into one SSE register?

And, why does this not work?
MOVSS XMM4, XMM0
MOVSS XMM5, XMM1
MOVSS XMM6, XMM2
MOVSS XMM7, XMM3

Thanks for reading :)
movss means loading a single float. to load four floats at once, you need to use movaps or movups (or movdqa/movdqu for integer data like pixels). does that answer your question?

oh, and I mentioned it before, but I'll say it again anyways: use intrinsics.
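
To make the difference concrete, here is a small intrinsics sketch contrasting a 4-byte load with a 16-byte load. The helper names are hypothetical. A 4-byte load (what MOVSS or MOVD does) already brings in all four bytes of one 32-bit pixel and zeroes the rest of the register, while a 16-byte load (MOVDQU here) brings in four whole pixels.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>

// Hypothetical helper: loads 4 bytes (one 32-bit pixel) into the low lane
// of an XMM register; the upper 12 bytes of the register are zeroed.
inline __m128i loadOnePixel(const std::uint8_t* p)
{
    int bits;
    std::memcpy(&bits, p, sizeof bits);
    return _mm_cvtsi32_si128(bits);
}

// Hypothetical helper: loads 16 bytes (four 32-bit pixels) into one register.
inline __m128i loadFourPixels(const std::uint8_t* p)
{
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// Hypothetical helper: writes all 16 bytes of a register to memory for inspection.
inline void storeRegister(std::uint8_t* out, __m128i v)
{
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Dumping both registers with storeRegister shows bytes 0 to 3 of the source in the first case and bytes 0 to 15 in the second.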

