x86 PC, is it worth optimizing for cpu cache?

[ Please keep the topic to CPU cache only; there's no need for suggestions like "optimize the algorithm first", that should be another topic. ]

Hi cpu optimizing ninjas,

What I need to do in one of my hobby C++ projects is to process small amounts of data (hundreds or thousands of bytes) in a lot of iterations (maybe up to 10K or 100K) each time, such as changing some bits in some bytes, etc. And that whole process may happen a lot of times too, maybe around 100K. 10K * 100K is driving me to think about optimizing...
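To give a rough idea, the inner work looks something like this (just a simplified sketch with made-up names and constants, not my actual code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified sketch of the workload: change some bits in some bytes of a small
// buffer, and repeat that a huge number of times.
void processBuffer(std::vector<std::uint8_t> & buffer)
{
    for(std::size_t i = 0; i < buffer.size(); ++i) {
        buffer[i] ^= 0x5a; // e.g. flipping some bits in each byte
    }
}

int main()
{
    std::vector<std::uint8_t> buffer(1024); // hundreds or thousands of bytes

    const int iterationCount = 10000; // "up to 10K or 100K" iterations each time
    const int passCount = 100;        // kept small here; the real count may be ~100K

    for(int pass = 0; pass < passCount; ++pass) {
        for(int i = 0; i < iterationCount; ++i) {
            processBuffer(buffer);
        }
    }

    return 0;
}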

Question 1: if you have any similar experience optimizing for CPU cache misses on x86 PCs, can you share the results? Was the performance improvement significant?
I strongly suspect so, but some real-life experience would give me more confidence.

Question 2: if you can recommend any good guides on optimizing for the CPU cache in C++, that would help me get started quickly. I'm googling and gdnet-ing as well, but my current CPU knowledge is still stuck in the 80386 era.
So far I've found this post quite good to read:
http://www.gamedev.net/topic/542247-cache-optimisations-for-beginners/

https://www.kbasm.com -- My personal website

https://github.com/wqking/eventpp  eventpp -- C++ library for event dispatcher and callback list

https://github.com/cpgf/cpgf  cpgf library -- free C++ open source library for reflection, serialization, script binding, callbacks, and metadata for OpenGL, Box2D, SFML and Irrlicht.

1) IMHO, cache-usage optimisation is THE main low-level optimisation technique these days. CPU cycles are cheap, but memory is horribly slow in modern computers. I don't care about how many CPU cycles an algorithm takes; I only care about which parts of RAM it's accessing.

As an example:
At work I've been rewriting our renderer lately. The old renderer was not optimised for cache at all, but with the new one, I've been thinking about the cache constantly. Every time I write a structure or allocate some memory, I consider how, when and why that data will be used by the CPU.

We knew that the old renderer was slow, but there were no real 'bottlenecks' in it -- when you profiled it, there wasn't an obvious part that needed to be optimised. It was just slow everywhere due to constant cache misses.

The old renderer took over 8ms of time to process a sample level, whereas the new re-written renderer takes 0.6ms to process the same level. That's more than a 10x speed-up, mostly due to caring about memory!

2) It's hard to take existing code and optimise it for good cache usage. Usually you'll have to rewrite your data structures (see the rough sketch after these links):
http://research.scee...ing_GCAP_09.pdf
http://gamesfromwith...oriented-design
http://bitsquid.blog...a-oriented.html
http://www.slideshar...oriented-design
http://www.slideshar...ata-orientation
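As a trivial illustration of what rewriting your data structures can look like (the struct names and fields below are made up for the example, not from any real codebase), compare an "array of structures" layout with a "structure of arrays" layout:

#include <cstddef>
#include <string>
#include <vector>

// Array of structures: updating positions drags the cold data (name, editorData)
// through the cache as well, because it sits next to the hot fields in memory.
struct ObjectAoS
{
    float x, y, z;
    float vx, vy, vz;
    std::string name;      // cold data, rarely touched in the hot loop
    char editorData[256];  // more cold data
};

void updateAoS(std::vector<ObjectAoS> & objects, float dt)
{
    for(std::size_t i = 0; i < objects.size(); ++i) {
        objects[i].x += objects[i].vx * dt;
        objects[i].y += objects[i].vy * dt;
        objects[i].z += objects[i].vz * dt;
    }
}

// Structure of arrays: the hot data is packed contiguously, so every cache line
// fetched by the update loop is full of exactly the floats it is about to use.
struct ObjectsSoA
{
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<std::string> name; // cold data kept out of the hot arrays
};

void updateSoA(ObjectsSoA & objects, float dt)
{
    for(std::size_t i = 0; i < objects.x.size(); ++i) {
        objects.x[i] += objects.vx[i] * dt;
        objects.y[i] += objects.vy[i] * dt;
        objects.z[i] += objects.vz[i] * dt;
    }
}

int main()
{
    ObjectsSoA objects;
    objects.x.resize(1000); objects.y.resize(1000); objects.z.resize(1000);
    objects.vx.resize(1000, 1.0f); objects.vy.resize(1000, 1.0f); objects.vz.resize(1000, 1.0f);
    updateSoA(objects, 0.016f);
    return 0;
}

The loop itself is the same in both versions; the difference is purely in how the data is laid out in memory, which is where most of the cache behaviour comes from.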

The 8 ms -> 0.6 ms result is already a good enough reason for me to keep CPU cache optimization in mind.
I will try to tweak my OOP-only data structures and code to be a little more DOP (data-oriented).
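Something like this is what I have in mind (just a rough sketch; the Chunk structure, names and sizes are made up, not my actual code):

#include <cstddef>
#include <cstdint>
#include <vector>

// OOP-only style: each Chunk object is allocated separately on the heap, so
// iterating the vector chases pointers all over memory and misses the cache a lot.
struct Chunk
{
    std::vector<std::uint8_t> bytes;
    // ... other members ...
};

void processOop(std::vector<Chunk *> & chunks)
{
    for(std::size_t i = 0; i < chunks.size(); ++i) {
        for(std::size_t k = 0; k < chunks[i]->bytes.size(); ++k) {
            chunks[i]->bytes[k] ^= 0x5a; // change some bits
        }
    }
}

// More DOP style: all the bytes live in one contiguous buffer and are walked
// in order, so the hardware prefetcher can stream them through the cache.
void processDop(std::vector<std::uint8_t> & allBytes)
{
    for(std::size_t k = 0; k < allBytes.size(); ++k) {
        allBytes[k] ^= 0x5a; // same bit changes, contiguous memory
    }
}

int main()
{
    std::vector<std::uint8_t> allBytes(100 * 1024);
    processDop(allBytes);
    return 0;
}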

Thanks for the timing data!



I also found this presentation quite good:
http://www.research.scea.com/research/pdfs/GDC2003_Memory_Optimization_18Mar03.pdf


