Old school fill algorithm (looking for optimization tricks)

Started by
12 comments, last by FordPerfect 5 years, 1 month ago

I'm reviving some of my graphics programming abilities by playing around with some old HW of mine and trying to do graphics programming without any external libraries. I'm getting stuck implementing a fast enough fill routine to to draw overlaps of objects. I'm letting the method figure out which is on which side, so the corner coordinates aren't sorted and might move outside of the image and needs to be clipped. The result is therefore:


void fill(int x0, int y0, int x1, int y1, int color) {
  if (x1 < x0) std::swap(x0, x1);
  if (y1 < y0) std::swap(y0, y1);
  if (x1 < 0 || y1 < 0 || x1 >= 1024 || y1 >= 768) return;

  x0 = x1 < 0 ? 0 : x0;
  x1 = x1 >= 1024 ? 1023 : x1;
  y0 = y0 < 0 ? 0 : y0;
  y1 = y1 >= 768 ? 765 : y1;

  // Rest of code... 
}

The problem is that the inputs are pretty randomly ordered etc. so the branches aren't well predicted. I assume that there must be some neat tricks for optimizing this chunk of code. I've tried various tricks that compile to SAR for the clamping, but if I convert the code line by line to bit fiddlings the code gets significantly longer and actually executes worse. I assume this is a pretty well known method to implement efficiently and there must be some old tricks for speeding it up significantly?

Advertisement

You may want to start with measuring where the performance loss happens before you start changing code.

I have profiled it extensively and it's spending most of its time right here. The rest is just a set of tight loops in increasing memory order, so the CPU runs it quite well. This is why I'm asking. I've also played around with the same code on a few years old HW and it shows the same issues (I wanted to see what VTune has to say).

I have trouble believing that this is your bottleneck.  Also, I see two obvious errors in your code.

Yeah, it was an abbreviation of my actual code, so should have checked that I wrote it down correctly. I've managed to cut down the runtime of an inner loop by 20% with lots of bit fiddling. My current problem is port congestion in this function, which is caused by emitting tons of bit fiddling instructions. Some other obvious issues remain like the sequence of code. For example, if I want to do:


if (x1 < x0) std::swap(x0, x1);
int b = x0 < 0;

Technically, I could get rid of the delay from the swap by doing the equivalent:


int b = (x0 < 0) | (x1 < 0);
if (x1 < x0) std::swap(x0, x1);

However, this adds unnecessary instructions. It would be cool to somehow compute the boolean with the lesser amount of instructions above, but without the penalty of the extra instructions in the version below.

There are some tricks for handle for outside point without branches


// x1 is in (0..1023) y1 is in (0..767)
b=x1|y1|(~(x1-1023))|(~(y1-767));
if (b<0) return;
         

There are similar trick for clamp, but if you test every pixel for fill, there are better algos

2 minutes ago, pabloreda said:

There are some tricks for handle for outside point without branches



// x1 is in (0..1023) y1 is in (0..767)
b=x1|y1|(~(x1-1023))|(~(y1-767));
if (b<0) return;
         

There are similar trick for clamp, but if you test every pixel for fill, there are better algos

The problem I still need to do the swap before calculating indices into the fill buffer, since I need to multiply the right y indices with the width of the buffer. The swap seems to stall the CPU. If I switch it to a branchless version, then it blocks speculation. If I put a branch there, then speculation is better, but I get mispredictions...

Why are you microoptimizing a fragment of code run only once per fill? That is to say: a constant cost? Does the rest of your fill algorithm, the "rest of code..." you've abbreviated, consume only some tens of clock cycles total?

Also: all that bit-twiddling may well be what's preventing the compiler from guessing what you're trying to do. The compiler is smarter and faster than you, when you give it all the information it requires. Just replace all that nonsense with this and move on:


void fill(int x0, int y0, int x1, int y1, int color) {
  if (x1 < x0) std::swap(x0, x1);
  if (y1 < y0) std::swap(y0, y1);
  if (x1 < 0 || y1 < 0 || x1 >= 1024 || y1 >= 768) return;

  x0 = std::max(x0, 0);
  x1 = std::min(x1, 1023);
  y0 = std::max(y0, 0);
  y1 = std::min(y1, 767);

  // Rest of code...
}

 

RIP GameDev.net: launched 2 unusably-broken forum engines in as many years, and now has ceased operating as a forum at all, happy to remain naught but an advertising platform with an attached social media presense, headed by a staff who by their own admission have no idea what their userbase wants or expects.Here's to the good times; shame they exist in the past.

Three possibilities, in order of likelihood:

  1. This is not really your bottleneck.
  2. You are calling this function way too often.
  3. Your compiler is doing something really crazy, like failing to inline std::swap.

For option 3, look at the generated assembly.  For option 2, add a counter each time this function is called and see how often it is called per second.  For option 1, look at the rest of your program and fix that.

I could probably shave quite a bit of time off this function by optimizing at the assembly level, but it would almost certainly be a complete waste of time to do so.

You already know x1/y1 to be within bounds (the third if statement..) so no need to check for that again further down.. That's two lines less.

.:vinterberg:.

This topic is closed to new replies.

Advertisement