h3ro

Software blitting (SSE and ASM)

For a long time I have been reading about and researching SSE and other ways to speed up time-consuming functions. Unfortunately there is not much information about SSE, or at least not that I can find. I haven't gotten very far, but I'm already stuck on one thing: I get a memory write error.
EDIT: As of now, I'm just trying to copy the texture into the screen pointer by using the SSE registers.
	int	i = height;

	do
	{
		_asm
		{
			// Move the destination pointer into the EDI register
			mov EDI, screenDataPnt

			// Move the sprite pointer into the ESI register
			mov ESI, textureDataPnt

			// Move 8 pixels into the 8 SSE registers
			// Using + 8 to create an offset from the pointer, as the
			// information we are accessing is 8 bits * 4 (RGBA)
			// Each SSE register will now have 32 bits in it, which is less than the 128
			// that is possible, but to keep it simpler, there will only be one pixel in each register
			MOVUPS XMM0, [ESI]   
			MOVUPS XMM1, [ESI + 32]  
			MOVUPS XMM2, [ESI + 64]  
			MOVUPS XMM3, [ESI + 96]  
			MOVUPS XMM4, [ESI + 128] 
			MOVUPS XMM5, [ESI + 160]  
			MOVUPS XMM6, [ESI + 192]  
			MOVUPS XMM7, [ESI + 224]  

			MOVUPS [EDI], XMM0
			MOVUPS [EDI + 32], XMM1
			MOVUPS [EDI + 64], XMM2
			MOVUPS [EDI + 96], XMM3
			MOVUPS [EDI + 128], XMM4
			MOVUPS [EDI + 160], XMM5
			MOVUPS [EDI + 192], XMM6
			MOVUPS [EDI + 224], XMM7
		}

		screenDataPnt += endScreenOffset;
		textureDataPnt += endTextureOffset;

	}while (--i > 0);


Could anyone tell me what I am doing wrong? Thanks

Each SSE register is 16 bytes wide rather than 32 bytes, as your code implies. That is, each register holds just four 32-bit pixels.
Also, this is not a particularly efficient way of moving data; you want to at least align the destination pointer and use MOVAPS/MOVDQA for the stores. Otherwise plain memcpy() will probably be faster.

edit: I reread the comments and it seems like you've confused the SS (scalar single-precision) instructions, which process individual floats, with the PS (packed single-precision) instructions, which process four floats in parallel. Furthermore, the memory offsets in the loops are in *bytes* (four per pixel) rather than bits (32 per pixel). So what you really want is to use MOVSS from offsets 0, 4, 8, 12, 16, ... instead.
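
To make the sizes concrete, here is a minimal sketch of a straight copy using SSE2 intrinsics rather than inline assembly. It assumes 32-bit BGRA pixels as in the original post; the helper name copy_pixels_sse2 and its parameters are made up for illustration and are not from the thread. Each 16-byte register moves four pixels, so the byte offset advances by 16 per step.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Copy pixelCount 32-bit pixels from src to dst.
   copy_pixels_sse2 is a hypothetical helper name used only in this sketch. */
static void copy_pixels_sse2(uint8_t *dst, const uint8_t *src, size_t pixelCount)
{
    size_t byteCount = pixelCount * 4;  /* 4 bytes per pixel */
    size_t i = 0;

    assert(dst != NULL && src != NULL);

    /* One XMM register is 16 bytes, so each step moves 4 whole pixels
       and the offset grows by 16 bytes (not 32). Unaligned loads and
       stores are used because nothing is assumed about the pointers. */
    for (; i + 16 <= byteCount; i += 16) {
        __m128i fourPixels = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), fourPixels);
    }

    /* Copy whatever is left (fewer than 4 pixels) byte by byte. */
    for (; i < byteCount; ++i) {
        dst[i] = src[i];
    }
}
```

Calling copy_pixels_sse2(screenRow, textureRow, 64) would copy one 64-pixel row; the caller still has to advance both pointers by their own row pitch between rows.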

Hey, thanks a lot for helping me out.
EDIT:
Removed question, solved it.

The reason I want to use SSE (apart from trying to learn it) is that I have to perform several calculations on each pixel.
Is SSE a valid option then?

What I am trying to do is convert this into SSE:

screenDataPnt[0] = screenDataPnt[0] + (( alpha * (blue - screenDataPnt[0] )) >> 8);
screenDataPnt[1] = screenDataPnt[1] + (( alpha * (green - screenDataPnt[1] )) >> 8);
screenDataPnt[2] = screenDataPnt[2] + (( alpha * (red - screenDataPnt[2] )) >> 8);


Again, thanks

[Edited by - h3ro on July 5, 2008 1:44:57 PM]

when you turn SSE on, VC++ replaces memcpy with MOVDQA if possible. so yeah, use memcpy. and when you don't, make sure to use aligned moves. (obviously your buffers will have to be aligned)

anyways, that operation should be possible to implement with SSE2. but i'd recommend using intrinsics instead of inline assembler.
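
As a rough illustration of the aligned-move advice, the sketch below checks both pointers for 16-byte alignment and only then uses the aligned load/store intrinsics (which correspond to MOVDQA); otherwise it falls back to the unaligned ones (MOVDQU). The names is_aligned_16 and copy_bytes_16 are hypothetical, and the byte count is assumed to be a multiple of 16.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero when p sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copy byteCount bytes, 16 at a time.
   Assumption for this sketch: byteCount is a multiple of 16. */
static void copy_bytes_16(uint8_t *dst, const uint8_t *src, size_t byteCount)
{
    size_t i;

    assert(byteCount % 16 == 0);

    if (is_aligned_16(dst) && is_aligned_16(src)) {
        /* Aligned path: these intrinsics fault on misaligned addresses,
           which is why the check above is required. */
        for (i = 0; i < byteCount; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(src + i));
            _mm_store_si128((__m128i *)(dst + i), v);
        }
    } else {
        /* Unaligned path: always safe, possibly slower on older CPUs. */
        for (i = 0; i < byteCount; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
            _mm_storeu_si128((__m128i *)(dst + i), v);
        }
    }
}
```

Using an aligned move on a misaligned address raises an exception, which is one common source of crashes in hand-written SSE code.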

I have been re-thinking some of what I am trying to do.

As I am going to access both screen pixels and texture pixels all the time, I need to load them both into the SSE registers in order to get the speed gain SSE offers (right? Please let me know if any of my logic anywhere is wrong).

So,
XMM0 to XMM3 are going to be texturePixels
XMM4 to XMM7 are going to be screenPixels

Then at the end, I swap XMM0 to XMM3 with XMM4 to XMM7

But as each register holds a whole pixel (int where BBBBGGGGRRRRAAAA), how do I access each colour of a pixel? I know I can use bit logic to mask out the values I want, but I have to store that masked value in a new register. And by doing that I can't keep all the information in the SSE registers.

So, how would I go about converting the following code into SSE?
Final Screen Blue = screenBlue + ((textureAlpha * (textureBlue - screenBlue)) >> 8)

I know that in assembly you have to do it one instruction at a time, so what I really need help understanding is how to efficiently access the different pieces of information.

i think something like this should work:

movdqa -> load 4 texture pixels into xmm register
movdqa -> load 4 screen pixels into an xmm register
pextrb -> extract alpha bytes
psubusb -> subtract screen values from texture values (saturating, dunno if that's what you want)
pshufb -> distribute bytes to have a byte space between one another (will also need a copy to another xmm register)
pmulhuw -> perform multiplication (only stores the upper bytes so no shift needed)
pshufb+pblendvb -> mix both xmm registers again so the resulting bytes are next to one another again
paddusb -> add result to screen pixel register (saturating)
movdqa -> store result to screen

edit:
don't use a whole xmm register for a 32 bit pixel. you can load 4 at a time (128 bits), so you should.
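
For comparison, here is a sketch of the same blend written with intrinsics. It deliberately swaps in a different technique from the instruction list above: PSHUFB is SSSE3 and PEXTRB/PBLENDVB are SSE4.1, so this version stays within SSE2 by widening bytes to 16-bit lanes with unpack instructions. It also computes (src * alpha + dst * (256 - alpha)) >> 8, which is algebraically the same as dst + ((alpha * (src - dst)) >> 8) when the shift rounds down, but never goes negative, so no saturating subtract is needed. The function names are hypothetical; the sketch assumes BGRA byte order with alpha in byte 3, a pixel count that is a multiple of 4, and that the screen's own fourth byte should be left untouched.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Blend two pixels whose 8 channels were already widened to 16-bit lanes:
   lanes are [B0 G0 R0 A0 B1 G1 R1 A1]. Hypothetical helper for this sketch. */
static __m128i blend_two_pixels_16(__m128i src16, __m128i dst16)
{
    /* Broadcast each pixel's alpha (lane 3 and lane 7) across its 4 lanes. */
    __m128i alpha = _mm_shufflelo_epi16(src16, _MM_SHUFFLE(3, 3, 3, 3));
    alpha = _mm_shufflehi_epi16(alpha, _MM_SHUFFLE(3, 3, 3, 3));

    __m128i invAlpha = _mm_sub_epi16(_mm_set1_epi16(256), alpha);

    /* src*alpha + dst*(256 - alpha) is at most 255*256 = 65280,
       so the sum fits in an unsigned 16-bit lane without overflow. */
    __m128i sum = _mm_add_epi16(_mm_mullo_epi16(src16, alpha),
                                _mm_mullo_epi16(dst16, invAlpha));

    return _mm_srli_epi16(sum, 8);  /* the ">> 8" from the scalar formula */
}

/* Alpha-blend texture over screen, 4 pixels per iteration.
   Assumption for this sketch: pixelCount is a multiple of 4. */
static void blend_pixels_sse2(uint8_t *screen, const uint8_t *texture, size_t pixelCount)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i colourMask = _mm_set1_epi32(0x00FFFFFF);  /* bytes 0..2 */
    size_t i;

    assert(pixelCount % 4 == 0);

    for (i = 0; i < pixelCount; i += 4) {
        __m128i src = _mm_loadu_si128((const __m128i *)(texture + i * 4));
        __m128i dst = _mm_loadu_si128((const __m128i *)(screen + i * 4));

        /* Widen bytes to 16 bits: low half holds pixels 0-1, high half 2-3. */
        __m128i lo = blend_two_pixels_16(_mm_unpacklo_epi8(src, zero),
                                         _mm_unpacklo_epi8(dst, zero));
        __m128i hi = blend_two_pixels_16(_mm_unpackhi_epi8(src, zero),
                                         _mm_unpackhi_epi8(dst, zero));

        /* Narrow back to bytes; every value is already in 0..255. */
        __m128i blended = _mm_packus_epi16(lo, hi);

        /* Take B, G, R from the blend and keep the screen's fourth byte. */
        __m128i result = _mm_or_si128(_mm_and_si128(blended, colourMask),
                                      _mm_andnot_si128(colourMask, dst));
        _mm_storeu_si128((__m128i *)(screen + i * 4), result);
    }
}
```

The unpack/pack pair plays the role that the pshufb steps play in the list above: it makes room for 16-bit products and then squeezes the results back into adjacent bytes.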

>>don't use a whole xmm register for a 32 bit pixel. you can load 4 at a time (128 bits), so you should.
I know, but I have never used SSE before, so I decided to get it working with one pixel in each register first.

Thanks for the rest of the code, I'll try it out as soon as I get home.

OK, now I'm a bit confused, again.

I have the following code, that works. It blits the image to the screen (no alpha blending so far), but I am not really sure I understand what I did to get it working.

My original data is stored in an array where the format is like this, where each element is a byte:
textureDataPnt[0] is pixel_1 Blue
textureDataPnt[1] is pixel_1 Green
textureDataPnt[2] is pixel_1 Red
textureDataPnt[3] is pixel_1 alpha
textureDataPnt[4] is pixel_2 Blue
...

So when I load them into the SSE registers with 4 as an offset, am I not just loading one colour for each pixel?
MOVSS XMM0, [ESI] // textureDataPnt[0] <-- Pixel_1 Blue
MOVSS XMM1, [ESI + 4] // textureDataPnt[0+4] <-- Pixel_2 Blue
MOVSS XMM2, [ESI + 8] // textureDataPnt[0+8] <-- Pixel_3 Blue
MOVSS XMM3, [ESI + 12] // textureDataPnt[0+12] <-- Pixel_4 Blue

Is that correct? If so, why does the following code work?


int dividedWidth = width / 4;
int j = dividedWidth;

//-------------------------------------------------------
// SSE BLITTING CODE!
//-------------------------------------------------------

//int i = height;

for (int i = 0; i < height; i++)
{
	// dividedWidth is the width of the texture / 4
	for (int j = 0; j < dividedWidth; j++)
	{
		_asm
		{
			// Move the destination pointer into the EDI register
			mov EDI, screenDataPnt

			// Move the sprite pointer into the ESI register
			mov ESI, textureDataPnt

			// Move 4 pixels into 4 of the 8 SSE registers
			// Each variable is an int (32 bits) and the SSE register can store 128 bits, but for now there
			// will only be one pixel in each register, to make it simpler to work with.
			// 4 bytes is 32 bits, hence the offset value

			// Colour mode is BGRA with 32-bit colour
			MOVSS XMM0, [ESI]
			MOVSS XMM1, [ESI + 4]
			MOVSS XMM2, [ESI + 8]
			MOVSS XMM3, [ESI + 12]

			// Move 4 screen pixels
			MOVSS XMM4, [EDI + 0]
			MOVSS XMM5, [EDI + 4]
			MOVSS XMM6, [EDI + 8]
			MOVSS XMM7, [EDI + 12]

			// Get the value of Blue
			// final screenBlue = Blue
			// final screenBlue = - screenBlue
			// final screenBlue = * alpha
			// final screenBlue = + screenBlue
			// final screenBlue = >> 8

			// PSUBSW XMM0, XMM4 // -
			// PMULLW XMM0, AlphaVal // *

			// Move the pixels into the screen pointer
			MOVSS [EDI], XMM0
			MOVSS [EDI + 4], XMM1
			MOVSS [EDI + 8], XMM2
			MOVSS [EDI + 12], XMM3
		}

		screenDataPnt += 16;
		textureDataPnt += 16;
	}

	// (screenWidth - textureWidth) * bytes per pixel
	// (640 - 64) * 4 = 2304
	screenDataPnt += 2304;

}
}





How do I load more than one variable into each register?
How do I put textureDataPnt[0] to textureDataPnt[3] into one SSE register?

And, why does this not work?
MOVSS XMM4, XMM0
MOVSS XMM5, XMM1
MOVSS XMM6, XMM2
MOVSS XMM7, XMM3

Thanks for reading:)
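
A note on the questions above, with a small sketch. With the byte layout described in this post, a 4-byte load at offset 4 reads textureDataPnt[4] through textureDataPnt[7], which is all four channels of pixel 2 rather than only its blue byte; that is consistent with the blit working. A 16-byte load puts four whole pixels into one register. Separately, the register-to-register form of MOVSS copies only the low 32 bits and leaves the rest of the destination register unchanged, whereas MOVAPS or MOVDQA copies all 128 bits; whether that is the cause of the problem with the last snippet cannot be told from the post. The helper names below are hypothetical and the code uses SSE2 intrinsics instead of inline assembly.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Load one 4-byte pixel into the low 32 bits of a register and zero the
   rest, similar to what a MOVSS or MOVD load from memory does. */
static __m128i load_one_pixel(const uint8_t *p)
{
    int32_t packed;

    assert(p != NULL);
    memcpy(&packed, p, sizeof packed);  /* safe unaligned 4-byte read */
    return _mm_cvtsi32_si128(packed);
}

/* Load four consecutive pixels (16 bytes) into one register. */
static __m128i load_four_pixels(const uint8_t *p)
{
    assert(p != NULL);
    return _mm_loadu_si128((const __m128i *)p);
}

/* Write all 16 bytes of a register out so each channel can be inspected. */
static void store_bytes(uint8_t out[16], __m128i v)
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

Passing textureDataPnt + 4 to load_one_pixel and storing the result would show bytes 0 to 3 holding the blue, green, red and alpha of pixel 2, with the remaining 12 bytes zero.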
