# Cache misses??


## Recommended Posts

Hey all, here is some profiled code from the main inner loop of my gouraud triangle renderer (the numbers in parentheses are the profiler's per-line percentages):

```cpp
//------------
(3)  for (int XPos = Left; XPos <= Right; XPos++)
     {
         // Test against z buffer and draw
(4)      if (*ZBuf > ZVal)
         {
(1)          *Buf = (RVal & 0xff0000) | ((GVal & 0xff0000) >> 8) | (BVal >> 16);
             *ZBuf = ZVal;
         }
(5)      Buf++;
(4)      ZBuf++;
(16)     ZVal += DeltaZ;
(2)      RVal += DeltaR;
(2)      GVal += DeltaG;
(2)      BVal += DeltaB;
     }
//------------
```

The majority of the time spent rendering these scanlines seems to go to incrementing variables... does this have something to do with cache misses? When I reorder the instructions, it's always some unlikely line or variable that gets the bad end of the deal. How can I increase the performance of this routine? It seems the bottleneck is my ignorance of memory cache misses. Or is there not a whole lot I can do here...

##### Share on other sites
posted more legibly

```cpp
(3)  for (int XPos = Left; XPos <= Right; XPos++)
     {
         // Test against z buffer and draw
(4)      if (*ZBuf > ZVal)
         {
(1)          *Buf = (RVal & 0xff0000) | ((GVal & 0xff0000) >> 8) | (BVal >> 16);
             *ZBuf = ZVal;
         }
(5)      Buf++;
(4)      ZBuf++;
(16)     ZVal += DeltaZ;
(2)      RVal += DeltaR;
(2)      GVal += DeltaG;
(2)      BVal += DeltaB;
     }
```

I don't see anything off the top of my head. Have you already made algorithmic optimizations to improve the performance of your app: call this block less, structure the code differently so it isn't necessary at all, etc.? Micro-optimizations like this will generally not net you more than 1% overall. Algorithmic optimizations are where you want to start; it's not unheard of to double or triple your performance by reducing the big-O complexity of various parts of your app.

Honestly, the best thing to do at this level is to take a look at the assembly generated by the compiler and figure out what it's doing. It could be that it's just generating some weird code.

-me

##### Share on other sites
Heh...thanks. Hadn't looked at the FAQ yet.

##### Share on other sites
Yeah, that code block is the bottom triangle half, and there is an identical top triangle half code block. Combined they take about 60-70% of the processing time, and it just seems weird that like 16-20% of that is something like 'ZVal += DeltaZ;'

Yes, algorithmically the triangle renderer should hardly have any overdraw (only when the z-sort from front to back doesn't work properly... you know the scenarios). There aren't any cases where it is called unnecessarily. I mean, I'm sure there are other things I can do to limit it slightly more, but I *believe* that would add more overhead than it would fix.

##### Share on other sites
If you're drawing a lot of things on top of each other and sorting from front to back you could maybe try something like:

```cpp
         // Test against z buffer and draw
(4)      if (*ZBuf > ZVal)
         {
(1)          *Buf = ((RVal + DeltaR*D) & 0xff0000) | (((GVal + DeltaG*D) & 0xff0000) >> 8) | ((BVal + DeltaB*D) >> 16);
             *ZBuf = ZVal;
         }
(5)      Buf++;
(4)      ZBuf++;
(16)     ZVal += DeltaZ;
         D++;
     }
```

It's also possible to calculate the pixel buffer address inside the IF using the z buffer address so you don't need that increment either. If you don't have much overlapping then this won't help at all and would probably be slower.

##### Share on other sites
Oh yeah. I got a 50+% speed increase. Using the z buffer to calculate the pixel buffer address didn't help or hurt, but calculating the deltas when needed helped immensely. All the overlapping polygons are sort of a necessity that can't be avoided so yeah that was a perfect idea. Thanks much.

-Scott

##### Share on other sites
Ahhh to clarify. Stonemonkey's excellent idea gave me a 25% boost...I got an additional 25% boost by allocating the zbuffer and pixel buffer in the same block of memory!

##### Share on other sites
Which profiler did you use? Which CPU? Which compiler and which settings?

Modern processors can't really be profiled on a per-instruction basis. An addition only costs one clock cycle, but there might be other reasons why the CPU stalls a little at that instruction. You'd have to look at the assembly code. There are only 6 freely available integer registers on an x86 CPU, so it's quite possible that right at these increments it has to write registers back to memory and load other values into the registers.

You can gain -a lot- of performance by using MMX and SSE here.

##### Share on other sites
I expect your compiler is doing a fine job of this anyway, but have you tried using the prefix increment operator rather than the postfix? In theory the prefix increment can be faster, as it avoids a temporary, but you won't know unless you profile it or look at the assembly. E.g. ++Buf instead of Buf++.

##### Share on other sites
I don't know how much it'd gain you here, but you could try rewriting your if statement into using the ternary (?:) operator instead.
That typically compiles to a conditional move rather than a branch, so it *might* boost performance and allow the compiler to better optimize the code (optimizing across branches is a pain)

And as said above, MMX/SSE might be able to help you too. (increment four vars in one go, and also relieve register pressure a bit)

##### Share on other sites
Ummm I used LTProf with a Pentium 4 in VS2005 and just default "Release" optimizations.

I think the next easiest optimization is definitely using SSE. I don't want to use MMX though because I don't think it is supposed to be officially supported much longer with more and more 64 bit processors coming out, and we have to support/take advantage of 64 bit processors probably by the end of the year. This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project - so I'm gonna hold off unless the extra speed boost is needed. (I hope I'm correct in the above statements and not sounding like a completely ignorant person).

Yeah the assembly source looks like there is a lot of switching registers with variables, but 2005 seems to optimize it OK.

But just in case...anyone know an especially good source for a quick SSE tutorial? Or will a simple google search suffice.

Thanks.

##### Share on other sites
Quote:
 Original post by dmatter: ...but have you tried using the prefix increment operator rather than the postfix? In theory the prefix increment can be faster as it avoids temporary allocation...

Any compiler that claims to be an optimizing compiler will generate exactly the same code for prefix or postfix increment (at least for simple integers). Basically, unless you truly need prefix increment, postfix increment is just fine for all your incrementing needs (and reads a tiny bit more fluently - although that's a matter of taste).

Anyway, I strongly suggest to never change high-level code to get low-level code the way you want. It's a hack. If you need the assembly to be what you expect, write assembly. ;-)

##### Share on other sites
Quote:
 Original post by popsoftheyear: Ummm I used LTProf with a Pentium 4 in VS2005 and just default "Release" optimizations.

I'm not familiar with LTProf, but I assume it works quite similarly to VTune. VS2005 in release mode will indeed optimize that code very well. A Pentium 4 is notorious for unpredictable behavior though.
Quote:
 I don't want to use MMX though because I don't think it is supposed to be officially supported much longer with more and more 64 bit processors coming out...

MMX is 64-bit processing. And it's definitely still supported for x86-64. The only disadvantage is that you only get 8 MMX registers while you get 16 SSE registers in x86-64. Anyway, if you want to write just one version that works for x86-32 as well just stick to MMX and SSE. Besides, MMX instructions are often shorter than their 128-bit SSE equivalents.
Quote:
 ...and we have to support/take advantage of 64 bit processors probably by the end of the year.

There's not a whole lot to take advantage of really.
Quote:
 This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project - so I'm gonna hold off unless the extra speed boost is needed.

You can also generate the code dynamically inside C++ with SoftWire, which also supports 64-bit now.
Quote:
 But just in case...anyone know an especially good source for a quick SSE tutorial?

I love Tommesani's documents as a quick reference. If you're already comfortable with x86 assembly, that should get you started quickly. Get the Intel reference manuals as well though; they are more cryptic but contain all the correct details.

##### Share on other sites
Thanks for both those references, I'm eager to get the chance to check them out.

Maybe things have changed and the article is old but this came from http://msdn2.microsoft.com/en-us/library/bb147385.aspx

Quote:
 The x87, MMX, and 3dNow! instruction sets are deprecated in 64-bit modes. The instruction sets are still present for backwards compatibility for 32-bit mode, but to avoid compatibility issues in the future, their use in current and future projects is discouraged.

-Scott

##### Share on other sites
Quote:
 Original post by popsoftheyear: I think the next easiest optimization is definitely using SSE.

Try getting rid of the branch first. Shouldn't take more than 5 minutes to do.

Quote:
 This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project

Nah, use the compiler's SSE intrinsics. That's generally a better idea than ASM. (Among other things, it enables the compiler to know what you're doing, so it can better do register allocation and instruction reordering.) Plus, it works with both 32- and 64-bit targets.

But you're probably right about avoiding MMX.

Quote:
 But just in case...anyone know an especially good source for a quick SSE tutorial? Or will a simple google search suffice.

All I used when I last had to code SSE stuff was simply AMD's instruction reference, and the compiler's documentation on the intrinsics available.
Then it's pretty straightforward to code.

##### Share on other sites
Maybe I'm being slow but how would you go about removing the branch?

##### Share on other sites
No I was wondering too... something along these lines compiles fine, but I didn't really see any change.

```cpp
*ZBuf > ZVal
    ? Buf[D] = ((RVal + DeltaR*D) & 0xff0000) | (((GVal + DeltaG*D) & 0xff0000) >> 8) | ((BVal + DeltaB*D) >> 16),
      *ZBuf = ZVal
    : __noop;
```

Good idea though. (I hope this code formats right?)

##### Share on other sites
You can get rid of a branch with a conditional move instruction. popsoftheyear, you can check that the compiler is emitting one of the CMOV instructions when you use a ternary, but it might not since it's not a simple move.

##### Share on other sites
Maybe I don't get it but it seems to me the conditional move would only be better if I was calculating the color value every time....otherwise doesn't the code NEED to jump in order to skip that calculation when it doesn't pass the Z test??

##### Share on other sites
Quote:
 Original post by popsoftheyear: Maybe I don't get it but it seems to me the conditional move would only be better if I was calculating the color value every time....otherwise doesn't the code NEED to jump in order to skip that calculation when it doesn't pass the Z test??

Calculating the color every time might be faster if it allows you to avoid a branch or a cache miss.

##### Share on other sites
Well, normally it wouldn't be much of a problem because current processors speculate on the result of the comparison and start executing one of the branches, but that particular comparison looks pretty hard to predict, and a mis-prediction costs you a lot. You might be much better off calculating the value at *Buf and conditionally storing it if the comparison passes. Even though you would be doing the calculation for hidden pixels, it might be a lot cheaper than a branch and possibly many mis-predictions. Going off the last bit of code posted, try something like this:

```cpp
BufVal = ((RVal + DeltaR*D) & 0xff0000) | (((GVal + DeltaG*D) & 0xff0000) >> 8) | ((BVal + DeltaB*D) >> 16);

if (*ZBuf > ZVal)
{
    *Buf  = BufVal;
    *ZBuf = ZVal;
}

Buf++;
ZBuf++;
ZVal += DeltaZ;
D++;
```

...or one of the ternary variations, hopefully the compiler will emit CMOVs if it's not already doing that. If not, try doing it yourself.

##### Share on other sites
Quote:
 Original post by popsoftheyear: Maybe I don't get it but it seems to me the conditional move would only be better if I was calculating the color value every time....otherwise doesn't the code NEED to jump in order to skip that calculation when it doesn't pass the Z test??

Hard to tell. Keep in mind that a modern CPU can have *a lot* of instructions in flight (pipelined and superscalar and out-of-order means it has a lot of flexibility), so it's not necessarily a problem to have to compute this every time, if it means avoiding a branch (which would normally screw with all the above).

As long as the code is nice and sequential, the CPU might be able to crunch through everything with no stalls, which might be faster than a branch which saves a few instructions, but also inserts stalls in the pipeline, and limits instruction reordering.

It's worth a try, anyway. Hard to predict exactly how today's CPU perform with any given code snippet, so only way to find out is to test it... [grin]

I've seen code where this trick gave a huge (30-40%) performance boost though. If you have a branch in a tight loop, it *can* have a large performance impact. And of course, in other cases, it'll be useless, or even slow you down.

As for how to avoid the branch, I'd do like this:
```cpp
*Buf  = (*ZBuf > ZVal) ? ((RVal & 0xff0000) | ((GVal & 0xff0000) >> 8) | (BVal >> 16)) : *Buf;
*ZBuf = (*ZBuf > ZVal) ? ZVal : *ZBuf;
```

(Maybe lift *ZBuf and *Buf out into intermediate non-pointer variables to avoid any aliasing problems. I'd expect the compiler to be able to do the same in this case, but you never know.)

##### Share on other sites
Since you are using pointers, how about helping your compiler with the restrict keyword, if supported?

##### Share on other sites
I'm using VS2003 for now (until I get home) and it is definitely not generating cmov instructions. I'll try again on 2005 later, otherwise do it by hand. I am quite interested to find out the impact though (right now it is making it slower for sure however).

'restrict' also doesn't seem to be supported by 2003...but I only took a couple seconds to see.

It looks like I have a combination of SSE intrinsics and conditional move ideas to try...

##### Share on other sites
Quote:
 Original post by Spoonbender: Among other things, it enables the compiler to know what you're doing, so it can better do register allocation and instruction reordering

That's overrated. It might help when inserting just one or very few instructions, but with a full pixel loop written in assembly the programmer can still do a better job at register allocation.