Jump to content
  • Advertisement
dst

2D Old school fill algorithm (looking for optimization tricks)

Recommended Posts

Posted (edited)

I'm reviving some of my graphics programming abilities by playing around with some old HW of mine and trying to do graphics programming without any external libraries. I'm getting stuck implementing a fast enough fill routine to to draw overlaps of objects. I'm letting the method figure out which is on which side, so the corner coordinates aren't sorted and might move outside of the image and needs to be clipped. The result is therefore:

void fill(int x0, int y0, int x1, int y1, int color) {
  if (x1 < x0) std::swap(x0, x1);
  if (y1 < y0) std::swap(y0, y1);
  if (x1 < 0 || y1 < 0 || x1 >= 1024 || y1 >= 768) return;

  x0 = x1 < 0 ? 0 : x0;
  x1 = x1 >= 1024 ? 1023 : x1;
  y0 = y0 < 0 ? 0 : y0;
  y1 = y1 >= 768 ? 765 : y1;

  // Rest of code... 
}

The problem is that the inputs are pretty randomly ordered etc. so the branches aren't well predicted. I assume that there must be some neat tricks for optimizing this chunk of code. I've tried various tricks that compile to SAR for the clamping, but if I convert the code line by line to bit fiddlings the code gets significantly longer and actually executes worse. I assume this is a pretty well known method to implement efficiently and there must be some old tricks for speeding it up significantly?

Edited by dst

Share this post


Link to post
Share on other sites
Advertisement
Posted (edited)

I have profiled it extensively and it's spending most of its time right here. The rest is just a set of tight loops in increasing memory order, so the CPU runs it quite well. This is why I'm asking. I've also played around with the same code on a few years old HW and it shows the same issues (I wanted to see what VTune has to say).

Edited by dst

Share this post


Link to post
Share on other sites
Posted (edited)

Yeah, it was an abbreviation of my actual code, so should have checked that I wrote it down correctly. I've managed to cut down the runtime of an inner loop by 20% with lots of bit fiddling. My current problem is port congestion in this function, which is caused by emitting tons of bit fiddling instructions. Some other obvious issues remain like the sequence of code. For example, if I want to do:

if (x1 < x0) std::swap(x0, x1);
int b = x0 < 0;

Technically, I could get rid of the delay from the swap by doing the equivalent:

int b = (x0 < 0) | (x1 < 0);
if (x1 < x0) std::swap(x0, x1);

However, this adds unnecessary instructions. It would be cool to somehow compute the boolean with the lesser amount of instructions above, but without the penalty of the extra instructions in the version below.

Edited by dst

Share this post


Link to post
Share on other sites

There are some tricks for handle for outside point without branches

// x1 is in (0..1023) y1 is in (0..767)
b=x1|y1|(~(x1-1023))|(~(y1-767));
if (b<0) return;
         

There are similar trick for clamp, but if you test every pixel for fill, there are better algos

Share this post


Link to post
Share on other sites
2 minutes ago, pabloreda said:

There are some tricks for handle for outside point without branches


// x1 is in (0..1023) y1 is in (0..767)
b=x1|y1|(~(x1-1023))|(~(y1-767));
if (b<0) return;
         

There are similar trick for clamp, but if you test every pixel for fill, there are better algos

The problem I still need to do the swap before calculating indices into the fill buffer, since I need to multiply the right y indices with the width of the buffer. The swap seems to stall the CPU. If I switch it to a branchless version, then it blocks speculation. If I put a branch there, then speculation is better, but I get mispredictions...

Share this post


Link to post
Share on other sites

Why are you microoptimizing a fragment of code run only once per fill? That is to say: a constant cost? Does the rest of your fill algorithm, the "rest of code..." you've abbreviated, consume only some tens of clock cycles total?

Also: all that bit-twiddling may well be what's preventing the compiler from guessing what you're trying to do. The compiler is smarter and faster than you, when you give it all the information it requires. Just replace all that nonsense with this and move on:

void fill(int x0, int y0, int x1, int y1, int color) {
  if (x1 < x0) std::swap(x0, x1);
  if (y1 < y0) std::swap(y0, y1);
  if (x1 < 0 || y1 < 0 || x1 >= 1024 || y1 >= 768) return;

  x0 = std::max(x0, 0);
  x1 = std::min(x1, 1023);
  y0 = std::max(y0, 0);
  y1 = std::min(y1, 767);

  // Rest of code...
}

 

Share this post


Link to post
Share on other sites

Three possibilities, in order of likelihood:

  1. This is not really your bottleneck.
  2. You are calling this function way too often.
  3. Your compiler is doing something really crazy, like failing to inline std::swap.

For option 3, look at the generated assembly.  For option 2, add a counter each time this function is called and see how often it is called per second.  For option 1, look at the rest of your program and fix that.

I could probably shave quite a bit of time off this function by optimizing at the assembly level, but it would almost certainly be a complete waste of time to do so.

Share this post


Link to post
Share on other sites

You already know x1/y1 to be within bounds (the third if statement..) so no need to check for that again further down.. That's two lines less.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

GameDev.net is your game development community. Create an account for your GameDev Portfolio and participate in the largest developer community in the games industry.

Sign me up!