Cache misses??


Hey all, here is some profiled code from the main inner loop of my Gouraud triangle renderer:

//------------
(3)   for (int XPos = Left; XPos <= Right; XPos++)
      {
          // Test against z buffer and draw
(4)       if (*ZBuf > ZVal)
          {
(1)           *Buf = (RVal & 0xff0000) | ((GVal & 0xff0000) >> 8) | (BVal >> 16);
              *ZBuf = ZVal;
          }
(5)       Buf++;
(4)       ZBuf++;
(16)      ZVal += DeltaZ;
(2)       RVal += DeltaR;
(2)       GVal += DeltaG;
(2)       BVal += DeltaB;
      }
//------------

Most of the time spent rendering these scanlines seems to go to incrementing variables...does this have something to do with cache misses? When I reorder the instructions, it's always some unlikely line or variable that receives the bad end of the deal. How can I increase the performance of this routine...as it seems that the bottleneck is my ignorance of cache misses? Or is there not a whole lot I can do here....

Share on other sites
posted more legibly

(3)   for (int XPos = Left; XPos <= Right; XPos++)
      {
          // Test against z buffer and draw
(4)       if (*ZBuf > ZVal)
          {
(1)           *Buf = (RVal & 0xff0000) | ((GVal & 0xff0000) >> 8) | (BVal >> 16);
              *ZBuf = ZVal;
          }
(5)       Buf++;
(4)       ZBuf++;
(16)      ZVal += DeltaZ;
(2)       RVal += DeltaR;
(2)       GVal += DeltaG;
(2)       BVal += DeltaB;
      }

I don't see anything off the top of my head. Have you already made algorithmic optimizations to improve the performance of your app: calling this block less often, structuring the code differently so it isn't necessary at all, etc.? Micro-optimizations like this will generally not net you more than a 1% overall performance boost. Algorithmic optimizations are where you want to start; it's not unheard of to double or triple your performance by reducing the big-O complexity of various parts of your app.

Honestly, the best thing to do at this level is to take a look at the assembly generated by the compiler and figure out what it's doing. It could be that it's just generating some weird code.

-me

Share on other sites
Heh...thanks. Hadn't looked at the FAQ yet.

Share on other sites
Yeah, that code block is the bottom triangle half, and there is an identical top triangle half code block. Combined they take about 60-70% of the processing time, and it just seems weird that like 16-20% of that is something like 'ZVal += DeltaZ;'

Yes, algorithmically the triangle renderer should hardly even have any overdraw to it (only when the z-sort from front to back doesn't work properly...you know the scenarios). There aren't any cases where it is called unnecessarily...I mean I'm sure there are other things I can do to limit it slightly more...but I *believe* that will add more overhead than it will fix.

Share on other sites
If you're drawing a lot of things on top of each other and sorting from front to back you could maybe try something like:

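          // D starts at 0 at the left edge of the scanline;
          // RVal, GVal and BVal keep their start-of-scanline values.
          // Test against z buffer and draw
(4)       if (*ZBuf > ZVal)
          {
(1)           *Buf = ((RVal+DeltaR*D) & 0xff0000) | (((GVal+DeltaG*D) & 0xff0000) >> 8) | ((BVal+DeltaB*D) >> 16);
              *ZBuf = ZVal;
          }
(5)       Buf++;
(4)       ZBuf++;
(16)      ZVal += DeltaZ;
          D++;
      }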

It's also possible to calculate the pixel buffer address inside the IF using the z buffer address so you don't need that increment either. If you don't have much overlapping then this won't help at all and would probably be slower.
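
To make the suggestion concrete, here is a minimal, self-contained sketch that puts the original per-pixel stepping next to the deferred version. The `Span` struct, the function names, the `int32_t` 16.16 fixed-point types and the indexing by `x` (instead of `Buf++`/`ZBuf++`) are assumptions for illustration, not the original poster's code.

```cpp
#include <cstdint>
#include <vector>

// Pack 16.16 fixed-point colour channels into 0x00RRGGBB, as in the loop above.
inline uint32_t packColor(int32_t r, int32_t g, int32_t b) {
    return static_cast<uint32_t>((r & 0xff0000) | ((g & 0xff0000) >> 8) | (b >> 16));
}

// Hypothetical container for one scanline's setup values.
struct Span {
    int left, right;   // inclusive pixel range
    int32_t z, dz;     // depth at `left` and its per-pixel step
    int32_t r, dr;     // red at `left` and its per-pixel step
    int32_t g, dg;     // green
    int32_t b, db;     // blue
};

// Original scheme: every interpolant is stepped for every pixel.
void drawSpanIncremental(const Span& s, uint32_t* buf, int32_t* zbuf) {
    int32_t z = s.z, r = s.r, g = s.g, b = s.b;
    for (int x = s.left; x <= s.right; ++x) {
        if (zbuf[x] > z) {
            buf[x] = packColor(r, g, b);
            zbuf[x] = z;
        }
        z += s.dz; r += s.dr; g += s.dg; b += s.db;
    }
}

// Deferred scheme: only z is stepped per pixel. The colour is evaluated
// from the start value plus d steps, and only when the depth test passes.
void drawSpanDeferred(const Span& s, uint32_t* buf, int32_t* zbuf) {
    int32_t z = s.z;
    for (int x = s.left, d = 0; x <= s.right; ++x, ++d) {
        if (zbuf[x] > z) {
            buf[x] = packColor(s.r + s.dr * d, s.g + s.dg * d, s.b + s.db * d);
            zbuf[x] = z;
        }
        z += s.dz;
    }
}
```

The trade-off is three multiplies per drawn pixel against three adds per visited pixel, so the deferred form should only pay off when most pixels fail the depth test, which matches the caveat above.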

Share on other sites
Oh yeah. I got a 50+% speed increase. Using the z buffer to calculate the pixel buffer address didn't help or hurt, but calculating the deltas when needed helped immensely. All the overlapping polygons are sort of a necessity that can't be avoided so yeah that was a perfect idea. Thanks much.

-Scott

Share on other sites
Ahhh to clarify. Stonemonkey's excellent idea gave me a 25% boost...I got an additional 25% boost by allocating the zbuffer and pixel buffer in the same block of memory!
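
For readers wondering what a shared allocation might look like, here is a rough sketch. The `FrameBuffers` struct, the z-first layout and the unsigned 32-bit depth type are all assumptions; the thread does not show the actual layout or explain why it helped. With both buffers in one block, the pixel address can also be derived from the z-buffer address by a fixed offset, which is the other idea mentioned earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame storage: a single allocation that holds the z-buffer
// followed by the pixel buffer, each with width * height 32-bit entries.
struct FrameBuffers {
    std::size_t count;              // entries per buffer
    std::vector<uint32_t> storage;  // [0, count) depth, [count, 2 * count) colour

    FrameBuffers(std::size_t width, std::size_t height)
        : count(width * height), storage(2 * count, 0) {}

    uint32_t* zbuf()   { return storage.data(); }
    uint32_t* pixels() { return storage.data() + count; }

    // The pixel belonging to a z-buffer entry sits exactly `count` entries later.
    uint32_t* pixelFor(uint32_t* zptr) { return zptr + count; }
};

// Depth-tested write that needs only the z-buffer pointer.
// Returns true when the pixel was drawn.
inline bool plot(FrameBuffers& fb, uint32_t* zptr, uint32_t z, uint32_t color) {
    if (*zptr > z) {
        *fb.pixelFor(zptr) = color;
        *zptr = z;
        return true;
    }
    return false;
}
```

Whether this layout is faster than two separate allocations depends on the cache and allocator behaviour of the machine in question, so it is worth measuring both.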

Share on other sites
Which profiler did you use? Which CPU? Which compiler and which settings?

Modern processors can't really be profiled on a per-instruction basis. An addition only costs one clock cycle, but there might be other reasons why the CPU stalls a little at that instruction, and a sampling profiler can end up charging a stall to a nearby instruction rather than the one that caused it. You'd have to look at the assembly code. There are only about 6 freely available integer registers on a 32-bit x86 CPU, so it's quite possible that right at these increments it has to write registers back to memory and load other values into the registers.

You can gain -a lot- of performance by using MMX and SSE here.

Share on other sites
I expect your compiler is doing a fine job of this anyway, but have you tried using the prefix increment operator rather than the postfix, e.g. ++Buf instead of Buf++? In theory the prefix increment can be faster as it avoids creating a temporary, but you won't know unless you profile it or look at the assembly.

Share on other sites
I don't know how much it'd gain you here, but you could try rewriting your if statement to use the ternary (?:) operator instead.
That often compiles to a conditional move rather than a branch, so it *might* boost performance and allow the compiler to better optimize the code (optimizing across branches is a pain).
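
A small sketch of what that rewrite could look like, with hypothetical function names and assumed `int32_t`/`uint32_t` buffer types. Nothing guarantees that a compiler emits a conditional move for the second form; the generated assembly has to be checked.

```cpp
#include <cstdint>

// Branching form, as in the original loop body.
inline void plotBranch(uint32_t* buf, int32_t* zbuf, int32_t z, uint32_t color) {
    if (*zbuf > z) {
        *buf = color;
        *zbuf = z;
    }
}

// Select form: both stores always happen, and the ternary picks either
// the new value or the value that was already there.
inline void plotSelect(uint32_t* buf, int32_t* zbuf, int32_t z, uint32_t color) {
    const bool pass = *zbuf > z;
    *buf  = pass ? color : *buf;
    *zbuf = pass ? z : *zbuf;
}
```

Note the cost: the select form reads and writes both buffers for every pixel, including the ones that fail the depth test, so it may well be slower when most pixels are rejected.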

And as said above, MMX/SSE might be able to help you too. (increment four vars in one go, and also relieve register pressure a bit)
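
As a rough illustration of "increment four vars in one go", the sketch below keeps Z, R, G and B in one SSE2 register and steps them with a single add. It assumes an x86 or x86-64 target with SSE2 and 32-bit fixed-point values; the `Interpolants` struct and helper names are made up for the example.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// Hypothetical holder for the four interpolants of one scanline,
// packed as 32-bit lanes [Z, R, G, B] (lane 0 first).
struct Interpolants {
    __m128i value;  // current ZVal, RVal, GVal, BVal
    __m128i delta;  // DeltaZ, DeltaR, DeltaG, DeltaB
};

inline Interpolants makeInterpolants(int32_t z, int32_t r, int32_t g, int32_t b,
                                     int32_t dz, int32_t dr, int32_t dg, int32_t db) {
    // _mm_set_epi32 takes its arguments from the highest lane down to lane 0.
    return Interpolants{_mm_set_epi32(b, g, r, z), _mm_set_epi32(db, dg, dr, dz)};
}

// One packed add replaces the four separate "+=" statements.
inline void step(Interpolants& it) {
    it.value = _mm_add_epi32(it.value, it.delta);
}

// Copy the four lanes out to memory as Z, R, G, B.
inline void store(const Interpolants& it, int32_t out[4]) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), it.value);
}
```

The depth compare and colour packing still need the individual lanes, so this alone may not gain much; processing several pixels per iteration is where SIMD would more plausibly pay off.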
