# Any MMX gurus?



Hey guys, I'll admit that I'm being lazy here, but sometimes it's easier just to ask someone who already knows the answer than to wade through pages and pages of documentation. From what I've already read about MMX, I think I may have found a use for it in my own code, and I'm hoping that someone out there can come up with a quick solution. I have a 32-bit pixel, and I want to multiply each component by a 32-bit integer and then divide the result by another 32-bit integer:
int x = ...;
int y = ...;

unsigned char new_r = (unsigned char)((old_r * x) / y);
unsigned char new_g = (unsigned char)((old_g * x) / y);
unsigned char new_b = (unsigned char)((old_b * x) / y);
unsigned char new_a = (unsigned char)((old_a * x) / y);

If I could somehow load two pixels into an __m64 variable, is there some way that I can perform these operations all at the same time? Cheers

##### Share on other sites
I am working on something very much like this.

What I am trying to do is speed up my alpha blitting code by using SSE.
BTW, why are you using MMX and not SSE?

One of the problems is how to arrange the data so that it is easy to do the needed operations.

What I did (using SSE) is to load each colour into a separate register:
XMM0 = red part of pixel 1 to 4
XMM1 = blue part of pixel 1 to 4
XMM2 = green part of pixel 1 to 4
XMM3 = alpha part of pixel 1 to 4

Then do the math on that.

But there are probably better ways. I'll let you know if I find something clever, as I am still working on my code.

##### Share on other sites
Another way would be to "load1" the integer to multiply with into another register for the multiply. That way you don't need to SoA your data. Most programmers are much more comfortable with AoS, and many algorithms and most hardware are more comfortable with it or are implemented that way, too. For example, you probably won't find a video card that stores the red, green, blue, and alpha channels separately, and you probably won't have much luck trying to modify a texture that way. That's one reason why MMX/SSE sucks so much except for maybe 0.01% of all applications. </rant>

Anyway, you'll have a hard time doing that integer division.
To my knowledge, there is no such thing as integer division in either MMX or SSE(2, 3,...). You might use shifts for power-of-two, or do some multiplicative inverse trickery. Or, you might convert to float/double and back, but either solution pretty much makes using MMX/SSE in the first place absurd performance-wise. :-(
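As a rough scalar sketch of the "multiplicative inverse" idea mentioned above: the division is done once to build a 16.16 fixed-point factor, and each component then needs only a multiply and a shift. The helper names `make_scale_16_16` and `scale_component` are made up for illustration, and the sketch assumes 0 < y, x <= y and x < 65536 so that nothing overflows 32 bits. The result can come out one lower than the exact `(old * x) / y`.

```c
#include <stdint.h>

/* Hypothetical helper: build a 16.16 fixed-point factor approximating x / y.
   Assumes 0 < y, x <= y and x < 65536 (no overflow in 32 bits). */
static uint32_t make_scale_16_16(uint32_t x, uint32_t y)
{
    return (x << 16) / y;   /* the only division, done once */
}

/* Hypothetical helper: scale one 8-bit component with a multiply and a shift.
   May be one lower than the exact (old_value * x) / y because the factor
   was rounded down. */
static uint8_t scale_component(uint8_t old_value, uint32_t scale)
{
    return (uint8_t)(((uint32_t)old_value * scale) >> 16);
}
```

Whether the off-by-one matters depends on the use case; for blending pixels it is usually hard to see.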

SoA?

##### Share on other sites
SoA = Structure of Arrays
AoS = Array of Structures

So SoA would look like this

struct Pixels
{
unsigned char red[128];
unsigned char green[128];
unsigned char blue[128];
};

Whereas AoS would be

struct Pixel
{
unsigned char red;
unsigned char green;
unsigned char blue;
};

Pixel pixels[128];

Hope this makes sense.

##### Share on other sites
The first optimization I'd look at with that code is not MMX - it's converting that multiply and divide into a multiply and shift. You probably also only need 16 bits of accuracy since you're working with 8-bit values.

Something like:

// Calculate this once. Assumes x * 256 doesn't overflow.
const unsigned int mul = (x * 256) / y;

// Do this per pixel.
unsigned char new_r = (unsigned char)((old_r * mul) >> 8);

After that optimization the conversion to MMX / SSE2 instructions should be much simpler. However you'll have to do the calculation with 16-bit values so you won't fit 2 pixels into one MMX register.
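A minimal sketch of what the SSE2 version of the multiply-and-shift might look like, assuming the precomputed factor `mul` is at most 256 (that is, x <= y) so each 16-bit product fits. The function name `scale4_pixels_sse2` is hypothetical. It widens 16 bytes (4 RGBA pixels) to 16-bit lanes, multiplies, shifts right by 8, and packs back to bytes:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: scales 4 RGBA pixels (16 bytes) in place by mul / 256.
   Assumes mul <= 256, so each product (at most 255 * 256) fits in 16 bits. */
static void scale4_pixels_sse2(uint8_t *pixels, uint16_t mul)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i m = _mm_set1_epi16((short)mul);
    __m128i px = _mm_loadu_si128((const __m128i *)pixels);

    /* Widen the 16 bytes into two vectors of eight 16-bit values each. */
    __m128i lo = _mm_unpacklo_epi8(px, zero);
    __m128i hi = _mm_unpackhi_epi8(px, zero);

    /* (component * mul) >> 8, done on eight components at a time. */
    lo = _mm_srli_epi16(_mm_mullo_epi16(lo, m), 8);
    hi = _mm_srli_epi16(_mm_mullo_epi16(hi, m), 8);

    /* Narrow back to 16 bytes and store. */
    _mm_storeu_si128((__m128i *)pixels, _mm_packus_epi16(lo, hi));
}
```

The unaligned load and store are used here only to keep the sketch free of alignment assumptions; with 16-byte aligned pixel data the aligned variants could be used instead.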

Also SOA vs AOS won't make any difference here - you're doing the same calculation on every byte of the data.

##### Share on other sites
Quote:
 Original post by Adam_42
 However you'll have to do the calculation with 16-bit values so you won't fit 2 pixels into one MMX register.

Though if you use an XMM register you can process 16 uchars (4 pixels) in one go.

##### Share on other sites
Quote:
 though if you use an XMM register you can process 16 uchar's (4 pixels) in one go.....

From my understanding, the SSE instructions work in blocks of 4 or 2, meaning that if you want to use, say, _mm_mul_(), you would have to have your data arranged this way:
2 64bit data
4 32bit data

Anything else needs to be padded so it fits that system. Again, I might be wrong, as I'm still trying to learn this myself. If someone could verify this, that would be great (until someone does, assume it's wrong...)

BTW:
RobTheBloke from cgtalk.com?

##### Share on other sites
It can also work with 8 shorts or 16 bytes, there are SSE2 instructions for doing so.

##### Share on other sites
Quote:
 Original post by AndyPandyV2
 It can also work with 8 shorts or 16 bytes, there are SSE2 instructions for doing so.

Can you please give a few examples of instructions that do that, or a link? I have been looking for something like that for a long time, but have not found anything.
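For reference, two SSE2 intrinsics of the kind being asked about are `_mm_adds_epu8` (PADDUSB, a saturating add on 16 unsigned bytes) and `_mm_mullo_epi16` (PMULLW, the low 16 bits of eight 16-bit products), both declared in `<emmintrin.h>`. A small sketch follows; the wrapper names `add_sat_16_bytes` and `mul_8_shorts` are invented for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical wrapper: saturating add of 16 unsigned bytes (PADDUSB). */
static void add_sat_16_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}

/* Hypothetical wrapper: eight 16-bit multiplies, keeping the low 16 bits
   of each product (PMULLW). */
static void mul_8_shorts(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi16(va, vb));
}
```

Other intrinsics in the same header follow the same naming pattern, with the `epi8`/`epu8` suffix for 16 bytes and `epi16` for 8 shorts.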
