# C++: Rotating and scaling a bitmask - slow!

This topic is 4216 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

Hey, I'm trying to write a function to generate a scaled, rotated bitmask from a source bitmask (which is not scaled or rotated at all). This is for collision detection, so it's done by the CPU, since it uses MMX to detect collisions as well. The problem is the intense maths done for every pixel to generate a rotated bitmask is incredibly slow. I'm struggling to get above 35fps generating a bitmask every frame for a 200x200 mask! This is no good, I want my platformer guy being able to run around on rotating platforms and stuff. It runs fast enough to get away with when its only scaling, but rotating really bottlenecks it. Here's my code. I know it's pretty bad, but I can't think how to speed it up. The transformations HAVE to be done on a pixel per pixel basis. Any thoughts?
// Coords of centre of image.
float cX = ...;
float cY = ...;

// Loop every Y pixel
for (int y = 0; y < pxHeight; y++)
{
bEight = 0;
count = 0;

// Loop every X pixel
for (int x = 0; x < pxWidth; x++)
{

// Transform this pixel (x,y) to a pixel on the source mask

float fx = x;				// Convert the x/y in to floats
float fy = y;

float dx = cX - fx;			// Difference between this pixel and the centre
float dy = cY - fy;
float r = sqrt(dx*dx + dy*dy);		// Distance/radius from centre to this pixel
float a = atan2(dy, dx);		// Angle from centre to this pixel

// Rotate pixel around centre
float fSrcX = cX + cos(a + angle) * r;
float fSrcY = cY + sin(a + angle) * r;

// Scale pixel
fSrcX *= scaleX;
fSrcY *= scaleY;

// Convert back to a rounded integral
int iSrcX = fSrcX + 0.5f;
int iSrcY = fSrcY + 0.5f;

BYTE b;

// Bounds check: due to slack space in the bounding box, avoid pasting in static
if (iSrcX >= 0 && iSrcY >= 0 && iSrcX < src->width && iSrcY < src->height) {

// Extract a single bit from the source mask
b = srcbits[iSrcY * srcPitch + (iSrcX/8)];
b &= 1 << (7 - (iSrcX % 8));
b >>= (7 - (iSrcX % 8));
}
else
// Out of bounds bit
b = 0;

// Write to dest mask a byte at a time
bEight = (bEight << 1) + b;
count++;
if(count == 8)
{
destbits[y * destPitch + (x/8)] = bEight;
count = 0;
bEight = 0;
}
}//for
}//for



##### Share on other sites
You will find that the bottleneck in your code is most likely the floating point arithmetic, especially the parts that deal with the trig functions. Have you considered working in fixed point arithmetic? Then perhaps pre-calculating some values and use a look-up table?

- sqrt(), atan2(), cos() and sin() are very expensive.
- float to int conversions are also expensive.
- Integer division and modulo can be sometimes a performance hit, such as iSrcX/8 or iSrcX%8, but compilers sometimes substitute those with faster "bit trick" equivalents (see below).

If the divisor is a 2^n number and the dividend is a positive number, you can use bit tricks to speed things up a little. For example, divisor is 8 (= 2^3), then:

a / 8 == a >> 3
a % 8 == a & (8-1)
a * 8 == a << 3

##### Share on other sites
Avoid the trig and use vectors and rotation matrices instead. That will increase the speed and additionally it will allow for a variety of optimizations, especially in the two dimensional case you're dealing with. In fact, rotation and scaling in two dimensions can be represented with just two numbers. Video game systems like the super nintendo used this representation (and then had specialized hardware for performing the multiplications).

Another possibility is to allow only a finite number of rotations. You can precompute rotations for, say, 16 different evenly spaced angles at intervals of 22.5 degrees.

##### Share on other sites
Vorpy, can you fill me in on some of the math behind that? I only know how to achieve it via the old fashioned trig way. Oh and I gather the D3DX matrices are optimised - can you do it using those too?

##### Share on other sites
The slow down comes in when you convert your float to an int. This is why fix point math might give you more speed.

##### Share on other sites
You can do much better if you treat this as line-drawing. Basically, you figure out the slope of the rows of the rotated bitmap. Then you extract each row from the source bitmap as if you were drawing a line across it. Do that for every row in the rotated bitmap. Read about Bresenham's line algorithm. Using this algorithm, you only have to do trigonometry or rotation math once to find the origin and slopes. Then for each pixel, you just do couple adds, a compare, and a subtract to the get its location in the unrotated image.

Normally, the quality of an unfiltered rotation is not very high, but since you are just using this for collision detection, that shouldn't matter.

Quote:
 Original post by TachikomaIf the divisor is a 2^n number and the dividend is a positive number, you can use bit tricks to speed things up a little. For example, divisor is 8 (= 2^3), then: a / 8 == a >> 3a % 8 == a & (8-1)a * 8 == a << 3

This might have been a good trick 10 or 15 years ago, but not any longer. The compiler will do this for you, so there is no reason to obfuscate your code like this.

##### Share on other sites
I guess I can try to simplify this a bit...

First of all, direct3d matrices are fine for 3d transformations, and you could use them for 2 dimensions, but I'm not so sure how much of an improvement it would be. There's a lot of multiplication involved in there and most of it would be unnecessary.

Two dimensions isn't so bad. You have your rotation angle theta. Now you want to rotate each pixel around the center by the angle theta. Your current method is to calculate the angle and distance of the pixel from the center, add theta to the angle, and convert it back. It works, but often when you do a trigonometric conversion back and forth like that there's a more direct way that skips the trig altogether. In this case, a lot more can be skipped.

I guess for two dimensions I'll just skip the rotation matrix stuff...let's say the displacement from the origin is (x,y). After rotating by angle t, the displacement is (x*cos(t)-y*sin(t), x*sin(t)+y*cos(t)).

Since the rotation angle doesn't change, the values for cos(t) and sin(t) can be calculated before the loop.

There are even more savings to be had because of how you're rotating values on a grid. On each iteration of the inner loop, you're incrementing x by 1. This is the same as simply taking the result from the previous iteration and adding cos(t) to the x coordinate and sin(t) to the y coordinate. A similiar change can be made to the outer loop, where the y coordinate is incremented. Now the rotations require no trig or multiplications inside the loop.

We can also get the scaling operation at no extra cost. Instead of calculating cos(t) and sin(t), calculate cos(t)*scale and sin(t)*scale and use those values instead. Since you're transforming from the destination matrix to the source matrix, the angle and scale are inverted before doing these calculations. Angle is inverted by negating it, and scale is inverted by calculating its reciprical.

The idea from Bresenham's algorithm could be used to replace the floating point stuff in the loops with integer calculations. I'm not sure how much this would help, if at all, on modern processors, especially if it isn't optimized. It can also be tricky to implement. On the other hand I think fixed point could be done with relatively little trouble, but once again it is unclear if this would really be faster than floating point since it's mostly just replacing float->int conversions with some bit shifting. According to some web sites it sounds like that could give a real boost in performance.

[Edited by - Vorpy on March 9, 2007 9:05:51 PM]

##### Share on other sites
Woah... you guys are GOOD! Got it from 30fps to 240-290fps with those techniques!

Cheers guys :D

------
Hi CT!

##### Share on other sites
Fixed-point arithmetic is not faster than floating point arithmetic on modern processors. In fact, it is not uncommon for fixed-point to be slower since it involves heavy use of renormalization shifts, which cost extra cycles and instructions.

The practical use of fixed-point math in consumer-PC hardware ended many years ago.

1. 1
2. 2
3. 3
4. 4
Rutin
12
5. 5

• 12
• 18
• 10
• 14
• 10
• ### Forum Statistics

• Total Topics
632662
• Total Posts
3007702
• ### Who's Online (See full list)

There are no registered users currently online

×