Cache misses??

Graphics and GPU Programming

Started by popsoftheyear May 21, 2007 12:41 PM

34 comments, last by popsoftheyear 16 years, 10 months ago

2,195

Author

May 22, 2007 11:32 AM

Ummm I used LTProf with a Pentium 4 in VS2005 and just default "Release" optimizations.

I think the next easiest optimization is definitely using SSE. I don't want to use MMX though because I don't think it is supposed to be officially supported much longer with more and more 64 bit processors coming out, and we have to support/take advantage of 64 bit processors probably by the end of the year. This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project - so I'm gonna hold off unless the extra speed boost is needed. (I hope I'm correct in the above statements and not sounding like a completely ignorant person).

Yeah, the assembly output looks like there is a lot of shuffling of variables between registers, but VS2005 seems to optimize it OK.

But just in case...anyone know an especially good source for a quick SSE tutorial? Or will a simple google search suffice.

Thanks.

C0D1F1ED

453

May 22, 2007 12:24 PM

Quote:Original post by dmatter
...but have you tried using the prefix increment operator rather than the postfix? In theory the prefix increment can be faster as it avoids temporary allocation...

Any compiler that claims to be an optimizing compiler will generate exactly the same code for prefix or postfix increment (at least for simple integers). Basically, unless you truly need prefix increment, postfix increment is just fine for all your incrementing needs (and reads a tiny bit more fluently, although that's a matter of taste).

Anyway, I strongly suggest never changing high-level code just to get the low-level code you want. It's a hack. If you need the assembly to be what you expect, write assembly. ;-)

C0D1F1ED

453

May 22, 2007 12:43 PM

Quote:Original post by popsoftheyear
Ummm I used LTProf with a Pentium 4 in VS2005 and just default "Release" optimizations.

I'm not familiar with LTProf, but I assume it works much like VTune. VS2005 in release mode will indeed optimize that code very well. A Pentium 4 is notorious for having unpredictable behavior though.

Quote:I don't want to use MMX though because I don't think it is supposed to be officially supported much longer with more and more 64 bit processors coming out...

MMX is 64-bit processing. And it's definitely still supported for x86-64. The only disadvantage is that you only get 8 MMX registers while you get 16 SSE registers in x86-64. Anyway, if you want to write just one version that works for x86-32 as well, just stick to MMX and SSE. Besides, MMX instructions are often shorter than their 128-bit SSE equivalents.

Quote:...and we have to support/take advantage of 64 bit processors probably by the end of the year.

There's not a whole lot to take advantage of really.

Quote:This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project - so I'm gonna hold off unless the extra speed boost is needed.

You can also generate the code dynamically inside C++ with SoftWire, which also supports 64-bit now.

Quote:But just in case...anyone know an especially good source for a quick SSE tutorial?

I love Tommesani's documents as a quick reference. If you're already comfortable with x86 assembly that should get you started quickly. Get the Intel reference manuals as well though. They are more cryptic but contain all the correct details.

popsoftheyear

2,195

Author

May 22, 2007 01:01 PM

Thanks for both those references, I'm eager to get the chance to check them out.

Maybe things have changed and the article is old but this came from http://msdn2.microsoft.com/en-us/library/bb147385.aspx

Quote:The x87, MMX, and 3dNow! instruction sets are deprecated in 64-bit modes. The instructions sets are still present for backwards compatibility for 32-bit mode, but to avoid compatibility issues in the future, their use in current and future projects is discouraged.

-Scott

Spoonbender

1,258

May 22, 2007 01:07 PM

Quote:Original post by popsoftheyear
I think the next easiest optimization is definitely using SSE.

Try getting rid of the branch first. Shouldn't take more than 5 minutes to do.

Quote:
This also means I'll have to code the SSE stuff in an external asm source file to link it with a VS2005 64 bit project

Nah, use the compiler's SSE intrinsics. That's generally a better idea than ASM. (Among other things, it lets the compiler know what you're doing, so it can do a better job of register allocation and instruction reordering.) Plus, it works with both 32 and 64 bit.

But you're probably right about avoiding MMX.

Quote:But just in case...anyone know an especially good source for a quick SSE tutorial? Or will a simple google search suffice.

All I used when I last had to code SSE stuff was simply AMD's instruction reference, and the compiler's documentation on the intrinsics available.
Then it's pretty straightforward to code.
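A minimal sketch of the intrinsics route, assuming the interpolated values are kept as floats (the code in this thread appears to use integer fixed point instead). The helper name `interpolate4` and the `HAVE_SSE` macro are made up for illustration; the intrinsics themselves come from `<xmmintrin.h>`, and the same source builds as 32-bit or 64-bit with no external asm file.

```cpp
// Use SSE when the compiler targets it; otherwise fall back to scalar code.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_add_ps, ...)
#define HAVE_SSE 1      // hypothetical macro, used only in this sketch
#else
#define HAVE_SSE 0
#endif

// Hypothetical helper: writes start + delta * (d0 + i) into out[i] for i = 0..3,
// i.e. four consecutive interpolation steps at once.
inline void interpolate4(float start, float delta, float d0, float* out) {
#if HAVE_SSE
    // Four consecutive step indices, lowest element first.
    __m128 steps = _mm_setr_ps(d0, d0 + 1.0f, d0 + 2.0f, d0 + 3.0f);
    // start + delta * steps, computed on all four lanes.
    __m128 result = _mm_add_ps(_mm_set1_ps(start),
                               _mm_mul_ps(_mm_set1_ps(delta), steps));
    // Unaligned store, so out does not need 16-byte alignment.
    _mm_storeu_ps(out, result);
#else
    for (int i = 0; i < 4; ++i) {
        out[i] = start + delta * (d0 + static_cast<float>(i));
    }
#endif
}
```

Calling `interpolate4(1.0f, 0.5f, 0.0f, out)` fills `out` with 1.0, 1.5, 2.0, 2.5. The compiler picks the registers and schedules the instructions itself.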

Stonemonkey

142

May 22, 2007 01:38 PM

Maybe I'm being slow but how would you go about removing the branch?

popsoftheyear

2,195

Author

May 22, 2007 02:05 PM

No I was wondering too... something along these lines compiles fine, but I didn't really see any change.

*ZBuf > ZVal ? Buf[D] = ((RVal+DeltaR*D) & 0xff0000) | (((GVal+DeltaG*D) & 0xff0000) >> 8) | ((BVal+DeltaB*D) >> 16), *ZBuf = ZVal : __noop;

Good idea though. (I hope this code formats right?)

outRider

852

May 22, 2007 02:23 PM

You can get rid of a branch with a conditional move instruction. popsoftheyear, you can check that the compiler is emitting one of the CMOV instructions when you use a ternary, but it might not since it's not a simple move.
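A rough sketch of the "simple move" form of the ternary: both arms are plain values that have already been computed, so the expression is a pure select with no side effects. The function names and the 32-bit integer types are assumptions, not taken from the original code. Whether a compiler actually emits CMOV here depends on the compiler and its flags, so the disassembly is still worth checking.

```cpp
#include <cstdint>

// Hypothetical helper: keeps the new color only if the stored depth is
// greater than the new depth (the same test as *ZBuf > ZVal in the thread).
inline std::uint32_t select_color(std::uint32_t storedZ, std::uint32_t newZ,
                                  std::uint32_t storedColor, std::uint32_t newColor) {
    return (storedZ > newZ) ? newColor : storedColor;
}

// Hypothetical helper: the matching depth update, again as a pure select.
inline std::uint32_t select_depth(std::uint32_t storedZ, std::uint32_t newZ) {
    return (storedZ > newZ) ? newZ : storedZ;
}
```

The caller then writes both results back unconditionally, e.g. `Buf[D] = select_color(*ZBuf, ZVal, Buf[D], color); *ZBuf = select_depth(*ZBuf, ZVal);`, which means the color has to be computed for every pixel.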

popsoftheyear

2,195

Author

May 22, 2007 02:27 PM

Maybe I don't get it but it seems to me the conditional move would only be better if I was calculating the color value every time....otherwise doesn't the code NEED to jump in order to skip that calculation when it doesn't pass the Z test??

Steadtler

220

May 22, 2007 02:45 PM

Quote:Original post by popsoftheyear
Maybe I don't get it but it seems to me the conditional move would only be better if I was calculating the color value every time....otherwise doesn't the code NEED to jump in order to skip that calculation when it doesn't pass the Z test??

Calculating the color every time might be faster if it allows you to avoid a branch or a cache miss.
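To make the trade-off concrete, here is a sketch of both versions of the inner loop. The `Span` struct, the function names, the unsigned 32-bit depth and the per-pixel depth step are all assumptions made for illustration; only the color-packing expression follows the code posted earlier in the thread. The branchless version computes the color for every pixel and uses an all-ones or all-zeros mask to decide what gets stored.

```cpp
#include <cstddef>
#include <cstdint>

// Packs 16.16 fixed-point channels the same way as the expression in the thread.
inline std::uint32_t pack_color(std::uint32_t r, std::uint32_t g, std::uint32_t b) {
    return (r & 0xff0000u) | ((g & 0xff0000u) >> 8) | (b >> 16);
}

// Hypothetical span description (not from the original code).
struct Span {
    std::uint32_t r, g, b;     // start values, 16.16 fixed point
    std::uint32_t dr, dg, db;  // per-pixel deltas
    std::uint32_t z, dz;       // start depth and per-pixel depth step
};

// Reference version: skips the color math when the pixel fails the Z test.
inline void draw_span_branchy(const Span& s, std::uint32_t* color,
                              std::uint32_t* zbuf, std::size_t n) {
    std::uint32_t z = s.z;
    for (std::size_t d = 0; d < n; ++d, z += s.dz) {
        if (zbuf[d] > z) {
            const std::uint32_t k = static_cast<std::uint32_t>(d);
            color[d] = pack_color(s.r + s.dr * k, s.g + s.dg * k, s.b + s.db * k);
            zbuf[d] = z;
        }
    }
}

// Branchless version: always computes the color, then selects with a mask.
inline void draw_span_branchless(const Span& s, std::uint32_t* color,
                                 std::uint32_t* zbuf, std::size_t n) {
    std::uint32_t z = s.z;
    for (std::size_t d = 0; d < n; ++d, z += s.dz) {
        const std::uint32_t k = static_cast<std::uint32_t>(d);
        const std::uint32_t c =
            pack_color(s.r + s.dr * k, s.g + s.dg * k, s.b + s.db * k);
        // All ones when the pixel passes the Z test, all zeros otherwise.
        const std::uint32_t mask = 0u - static_cast<std::uint32_t>(zbuf[d] > z);
        color[d] = (c & mask) | (color[d] & ~mask);
        zbuf[d] = (z & mask) | (zbuf[d] & ~mask);
    }
}
```

Both functions leave the buffers in the same state. The branchless one does more arithmetic and writes every pixel, so whether it wins depends on how unpredictable the Z test is for the scene; it needs to be profiled rather than assumed.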

This topic is closed to new replies.
