AshleysBrain

C++: Rotating and scaling a bitmask - slow!


Hey, I'm trying to write a function that generates a scaled, rotated bitmask from a source bitmask (which is neither scaled nor rotated). It's for collision detection, so it has to run on the CPU, since the collision testing itself uses MMX. The problem is that the heavy maths done for every pixel to generate a rotated bitmask is incredibly slow. I'm struggling to get above 35fps when regenerating a 200x200 mask every frame! That's no good; I want my platformer guy to be able to run around on rotating platforms and the like. It runs fast enough to get away with when it's only scaling, but rotating really bottlenecks it.

Here's my code. I know it's pretty bad, but I can't think how to speed it up. The transformations HAVE to be done on a pixel-by-pixel basis. Any thoughts?
// Coords of centre of image.
float cX = ...;
float cY = ...;

BYTE bEight;	// Accumulates 8 bits before writing out a byte
int count;

// Loop every Y pixel
for (int y = 0; y < pxHeight; y++)
{
	bEight = 0;
	count = 0;

	// Loop every X pixel
	for (int x = 0; x < pxWidth; x++)
	{
		// Transform this pixel (x,y) to a pixel on the source mask

		float fx = (float)x;			// Convert the x/y to floats
		float fy = (float)y;

		float dx = cX - fx;			// Difference between this pixel and the centre
		float dy = cY - fy;
		float r = sqrt(dx*dx + dy*dy);		// Distance/radius from centre to this pixel
		float a = atan2(dy, dx);		// Angle from centre to this pixel

		// Rotate pixel around centre
		float fSrcX = cX + cos(a + angle) * r;
		float fSrcY = cY + sin(a + angle) * r;

		// Scale pixel
		fSrcX *= scaleX;
		fSrcY *= scaleY;

		// Round to the nearest integer
		int iSrcX = (int)(fSrcX + 0.5f);
		int iSrcY = (int)(fSrcY + 0.5f);

		BYTE b;

		// Bounds check: due to slack space in the bounding box, avoid pasting in static
		if (iSrcX >= 0 && iSrcY >= 0 && iSrcX < src->width && iSrcY < src->height)
		{
			// Extract a single bit from the source mask
			b = srcbits[iSrcY * srcPitch + (iSrcX / 8)];
			b &= 1 << (7 - (iSrcX % 8));
			b >>= 7 - (iSrcX % 8);
		}
		else
		{
			// Out of bounds bit
			b = 0;
		}

		// Write to the dest mask a byte at a time
		bEight = (bEight << 1) + b;
		count++;
		if (count == 8)
		{
			destbits[y * destPitch + (x / 8)] = bEight;
			count = 0;
			bEight = 0;
		}
	}
}


---
You will find that the bottleneck in your code is most likely the floating-point arithmetic, especially the parts that deal with the trig functions. Have you considered working in fixed-point arithmetic? Then perhaps pre-calculate some values and use a look-up table?

- sqrt(), atan2(), cos() and sin() are very expensive.
- float to int conversions are also expensive.
- Integer division and modulo can sometimes be a performance hit, such as iSrcX / 8 or iSrcX % 8, but compilers can often substitute faster "bit trick" equivalents (see below).

If the divisor is a power of two (2^n) and the dividend is non-negative, you can use bit tricks to speed things up a little. For example, if the divisor is 8 (= 2^3), then:

a / 8 == a >> 3
a % 8 == a & (8-1)
a * 8 == a << 3
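A quick way to convince yourself of these identities (a minimal sketch; it only holds for non-negative values, as noted above):

```cpp
#include <cassert>

// For a non-negative dividend and a power-of-two divisor (here 8 = 2^3),
// the shift/mask forms are exact equivalents of /, % and *.
void CheckBitTricks()
{
    for (int a = 0; a < 1000; ++a) {
        assert(a / 8 == a >> 3);
        assert(a % 8 == (a & 7));
        assert(a * 8 == a << 3);
    }
}
```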

---
Avoid the trig and use vectors and rotation matrices instead. That will increase the speed, and it additionally allows for a variety of optimizations, especially in the two-dimensional case you're dealing with. In fact, rotation and scaling in two dimensions can be represented with just two numbers. Video game systems like the Super Nintendo used this representation (and then had specialized hardware for performing the multiplications).

Another possibility is to allow only a finite number of rotations. You can precompute rotations for, say, 16 different evenly spaced angles at intervals of 22.5 degrees.
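A minimal sketch of that precomputation idea, assuming radians and a table built once at load time (the function and struct names are illustrative, not from the original code):

```cpp
#include <cmath>
#include <vector>

// Precompute sin/cos for 16 evenly spaced angles (22.5 degree steps), so the
// per-frame code picks the nearest table entry instead of calling trig.
const int NUM_ANGLES = 16;
const float TWO_PI = 6.28318530718f;

struct Rotation { float cosA, sinA; };

std::vector<Rotation> BuildRotationTable()
{
    std::vector<Rotation> table(NUM_ANGLES);
    for (int i = 0; i < NUM_ANGLES; ++i) {
        float a = i * TWO_PI / NUM_ANGLES;
        table[i] = { std::cos(a), std::sin(a) };
    }
    return table;
}

// Snap an arbitrary angle (radians, any sign) to the nearest table index.
int NearestIndex(float angle)
{
    int i = (int)(angle * NUM_ANGLES / TWO_PI + 0.5f);
    return ((i % NUM_ANGLES) + NUM_ANGLES) % NUM_ANGLES;
}
```

You could then also precompute the 16 rotated masks themselves, trading memory for the per-pixel work entirely.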

---
Vorpy, can you fill me in on some of the math behind that? I only know how to achieve it the old-fashioned trig way. Oh, and I gather the D3DX matrices are optimised: can you do it using those too?

---
You can do much better if you treat this as line-drawing. Basically, you figure out the slope of the rows of the rotated bitmap, then extract each row from the source bitmap as if you were drawing a line across it. Do that for every row in the rotated bitmap. Read about Bresenham's line algorithm. Using this approach, you only have to do the trigonometry or rotation math once, to find the origin and the slopes. Then for each pixel you just do a couple of adds, a compare, and a subtract to get its location in the unrotated image.

Normally, the quality of an unfiltered rotation is not very high, but since you are just using this for collision detection, that shouldn't matter.
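A hedged sketch of the row-stepping idea, using 16.16 fixed-point DDA stepping rather than textbook Bresenham (all names are illustrative, not from the original code):

```cpp
#include <cstdint>

// Walk one destination row, stepping a source position by a constant
// (stepX, stepY) per pixel. The inner loop is only adds, shifts and masks;
// no trig per pixel. Positions are 16.16 fixed point.
void ExtractRow(const unsigned char* srcbits, int srcPitch,
                int srcW, int srcH,
                int32_t srcX, int32_t srcY,      // 16.16 start position
                int32_t stepX, int32_t stepY,    // 16.16 per-pixel step
                unsigned char* destRow, int destW)
{
    unsigned char bEight = 0;
    int count = 0;
    for (int x = 0; x < destW; ++x) {
        int ix = srcX >> 16;                     // back to integer pixels
        int iy = srcY >> 16;
        unsigned char b = 0;
        if (ix >= 0 && iy >= 0 && ix < srcW && iy < srcH) {
            // Extract one bit from the source mask, as in the original code
            b = (srcbits[iy * srcPitch + (ix >> 3)] >> (7 - (ix & 7))) & 1;
        }
        bEight = (unsigned char)((bEight << 1) | b);
        if (++count == 8) {
            destRow[x >> 3] = bEight;            // write a byte at a time
            count = 0;
            bEight = 0;
        }
        srcX += stepX;
        srcY += stepY;
    }
}
```

The caller would compute the start position and steps once per row from the rotation angle and scale, then call this for each destination row.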

Quote:
Original post by Tachikoma
If the divisor is a 2^n number and the dividend is a positive number, you can use bit tricks to speed things up a little. For example, divisor is 8 (= 2^3), then:

a / 8 == a >> 3
a % 8 == a & (8-1)
a * 8 == a << 3

This might have been a good trick 10 or 15 years ago, but not any longer. The compiler will do this for you, so there is no reason to obfuscate your code like this.

---
I guess I can try to simplify this a bit...

First of all, direct3d matrices are fine for 3d transformations, and you could use them for 2 dimensions, but I'm not so sure how much of an improvement it would be. There's a lot of multiplication involved in there and most of it would be unnecessary.

Two dimensions isn't so bad. You have your rotation angle theta. Now you want to rotate each pixel around the center by the angle theta. Your current method is to calculate the angle and distance of the pixel from the center, add theta to the angle, and convert it back. It works, but often when you do a trigonometric conversion back and forth like that there's a more direct way that skips the trig altogether. In this case, a lot more can be skipped.

I guess for two dimensions I'll just skip the rotation matrix stuff...let's say the displacement from the origin is (x,y). After rotating by angle t, the displacement is (x*cos(t)-y*sin(t), x*sin(t)+y*cos(t)).

Since the rotation angle doesn't change, the values for cos(t) and sin(t) can be calculated before the loop.

There are even more savings to be had because of how you're rotating values on a grid. On each iteration of the inner loop, you're incrementing x by 1. This is the same as simply taking the result from the previous iteration and adding cos(t) to the x coordinate and sin(t) to the y coordinate. A similar change can be made to the outer loop, where the y coordinate is incremented. Now the rotations require no trig or multiplications inside the loop.

We can also get the scaling operation at no extra cost. Instead of calculating cos(t) and sin(t), calculate cos(t)*scale and sin(t)*scale and use those values instead. Since you're transforming from the destination mask to the source mask, the angle and scale must be inverted before doing these calculations: the angle is inverted by negating it, and the scale is inverted by taking its reciprocal.
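Putting the above together, here is a hedged sketch of the incremental dest-to-src mapping (function and variable names are mine, not from the original code; it returns the source coordinates rather than sampling a mask, to keep it self-contained):

```cpp
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// All trig happens once, before the loops; the loops contain only adds.
// Returns the source-space position for every destination pixel, row-major.
std::vector<Vec2> SourcePositions(int pxWidth, int pxHeight,
                                  float cX, float cY,
                                  float angle, float scaleX, float scaleY)
{
    // Invert the transform: negate the angle, take reciprocal scales.
    float c = std::cos(-angle), s = std::sin(-angle);
    float invSX = 1.0f / scaleX, invSY = 1.0f / scaleY;

    // Source-space step for +1 in dest x and for +1 in dest y.
    Vec2 stepX = {  c * invSX, s * invSY };
    Vec2 stepY = { -s * invSX, c * invSY };

    // Source position of destination pixel (0,0), rotating about the centre.
    Vec2 row = { cX - cX * stepX.x - cY * stepY.x,
                 cY - cX * stepX.y - cY * stepY.y };

    std::vector<Vec2> out;
    out.reserve(pxWidth * pxHeight);
    for (int y = 0; y < pxHeight; ++y) {
        Vec2 p = row;
        for (int x = 0; x < pxWidth; ++x) {
            out.push_back(p);          // here you'd sample the source mask
            p.x += stepX.x;
            p.y += stepX.y;
        }
        row.x += stepY.x;
        row.y += stepY.y;
    }
    return out;
}
```

With angle 0 and scale 1 this reduces to the identity mapping, which makes it easy to sanity-check.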

The idea from Bresenham's algorithm could be used to replace the floating-point stuff in the loops with integer calculations. I'm not sure how much this would help, if at all, on modern processors, especially if it isn't optimized, and it can be tricky to implement. On the other hand, I think fixed point could be done with relatively little trouble, but once again it is unclear whether this would really be faster than floating point, since it mostly just replaces float-to-int conversions with some bit shifting. According to some websites, it sounds like that could give a real boost in performance.

[Edited by - Vorpy on March 9, 2007 9:05:51 PM]

---
Fixed-point arithmetic is not faster than floating point arithmetic on modern processors. In fact, it is not uncommon for fixed-point to be slower since it involves heavy use of renormalization shifts, which cost extra cycles and instructions.

The practical use of fixed-point math in consumer-PC hardware ended many years ago.

---
The only use of fixed point in this case would be for additions, and as far as I can recall those are about as fast as floating-point additions. The main win is avoiding the conversion from a float to an int, which becomes a simple bit-shift operation. I also googled some ways to do a faster float-to-int conversion, but what could possibly compete with a shift?
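For concreteness, a minimal 16.16 fixed-point sketch of what's being described (names are illustrative):

```cpp
#include <cstdint>

// 16.16 fixed point: 16 integer bits, 16 fractional bits.
// The float-to-int step becomes an arithmetic shift, and the per-pixel
// accumulation stays an ordinary integer add.
typedef int32_t fixed;

inline fixed ToFixed(float f) { return (fixed)(f * 65536.0f); }
inline int   ToInt(fixed v)   { return v >> 16; }   // truncates toward -inf
```

Stepping a source coordinate then looks like `pos += step; int x = ToInt(pos);` with no float-to-int conversion in the loop.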

---
A fixed-point add is just an integer add; on any CPU younger than five years or so, there's not really an appreciable difference between an integer add and a floating-point add.

Eliminating the float/int conversion would certainly add some speed, although it isn't clear how much. Obviously the original performance problem was algorithmic in nature, not low-level. Any further gains would likely have to come from careful profiling and tuning. In any case IMHO the small gains of eliminating the conversion are probably not worth the cost in code complexity for moving to fixed-point.

Removing inner conditionals from the loop, optimizing the stepping algorithm (Bresenham is not the fastest known linear step algorithm although I can't recall the names of any alternatives at the moment), and possibly exploiting SIMD vectorization would probably all give better payoff. (In particular, since this is a repeated application of a linear step through a rectangular space, it should be possible to do, say, 4 rows at a time via SIMD.)

---
Quote:
Original post by ApochPiQ
The practical use of fixed-point math in consumer-PC hardware ended many years ago.


Sorry, I have to disagree with this, to a point, at least for a P4 CPU (is that still modern?).

I recently wrote some image manipulation functions that originally used floating point and found them way too slow. So I did a comparative test using fixed point and discovered it performed over twice as fast.

So in general my experience has shown that fixed point is faster than floating point. However, I then switched much of the function over to simple look-up tables, and that blew both fixed point and floating point out of the water, but that's a completely different optimisation technique ;)

Having said that, I'd expect SIMD to be as fast as, if not faster than, fixed point, but I didn't bother investigating it at the time, since the look-up table was clearly the most efficient method at the algorithm level, albeit at a slight cost in memory.

Ultimately I feel it's always good to test different approaches if you have the time. If you can write and support SIMD, it is probably going to be as good as, if not better than, fixed point, but fixed point is likely to always be faster than floating point. However, there may be alternative optimisations at the algorithm level that can easily beat any gains from switching from floating point to fixed point or SIMD, and those should also be profiled.

---
A few things.

First of all, without sharing both the initial and rewritten code, your anecdotal evidence is less than worthless. For all we know you "accidentally" introduced an algorithmic speedup during the float/fixed changeover and that's the source of your results. There's also a fairly well-known performance hit to using "magic" values (INF, NAN, and so on) on the P4 processor series in particular, so if your code involved any of that, look no further for your performance explanation.

Secondly, anecdotal evidence without even profiling stats to back it up is less than worthless anyway, since we're talking about highly situational micro-optimizations here.

Third, it seems to me like you don't know what SIMD means. SIMD stands for Single Instruction Multiple Data. Assuming your algorithm can be reasonably implemented to handle it, SIMD allows you to do, say, four calculations on different data at one time - in the same time it would have taken to do a single calculation. There are of course some small bits of overhead and not all code can be vectorized, but in general, well-written SIMD code will absolutely stomp all over non-vectorized fixed point.
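As a toy illustration of the "four at once" idea (an SSE sketch under my own naming, not anyone's actual collision code): four source x-coordinates are advanced with a single packed add per iteration.

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// Advance four consecutive x-coordinates by 4*step in one packed add.
// A real masking loop would keep xs live across iterations and sample
// four source pixels per step.
void StepFour(float startX, float step, float* out /* 4 floats */)
{
    // Lanes hold startX, startX+step, startX+2*step, startX+3*step
    // (_mm_set_ps takes arguments from highest lane to lowest).
    __m128 xs = _mm_set_ps(startX + 3 * step, startX + 2 * step,
                           startX + step, startX);
    __m128 stride = _mm_set1_ps(4.0f * step);
    xs = _mm_add_ps(xs, stride);     // all four lanes advance at once
    _mm_storeu_ps(out, xs);
}
```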

Finally, it's an inescapable fact that FP pipelines in modern CPUs are just as fast as the integer pipelines. Both Intel and AMD's official optimization references show floating-point operations running neck-and-neck with integer operations in general; it's even possible for FP throughput to be higher than integer throughput if the code is pipelined right (the compiler makes a really huge difference here). The bottom line is that unless you're doing a huge amount of integer/real swapping (i.e. introducing float conversions), chances are just doing it in floating point will be more than fast enough.

---
Quote:

float r = sqrt(dx*dx + dy*dy); // Distance/radius from centre to this pixel
float a = atan2(dy, dx); // Angle from centre to this pixel
// Rotate pixel around centre
float fSrcX = cX + cos(a + angle) * r;
float fSrcY = cY + sin(a + angle) * r;


in an inner loop? 40,000 times? I rest my case.

Transform matrices. Yeah... or some funky, hacky 'Doom-style' bitmap manipulation.
