Lode

Cache Concept


Recommended Posts

In a course about computer architecture there was a chapter about the cache concept today, and the professor talked a bit about possible source code optimizations to use the cache at its best: declare variables you often use at the same time together, so the compiler will put them next to each other in memory and they'll be cached together; declare functions that call each other often together, so the program code itself can also be cached as much as possible; and, when working on large blocks of memory, do all operations on small chunks, chunk by chunk, so those chunks can stay cached too (see the sketch below). With optimizations like these you can get up to a 10x speed increase and 100x lower power consumption (important when developing for GSM phones, PDAs and the like). I'm just wondering, do you game developers keep this stuff in mind all the time while coding games?
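To make the chunk-by-chunk advice concrete, here's a minimal sketch of what I understood the professor to mean (the chunk size and the two operations are invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative chunk size: 4096 floats = 16 KB, small enough that one
// chunk fits comfortably in the L1 data cache.
const std::size_t CHUNK = 4096;

// Two passes' worth of work done chunk by chunk: the second loop hits
// data the first loop just pulled into the cache, instead of re-fetching
// the whole buffer from main memory on a second full pass.
void process(std::vector<float>& data)
{
    for (std::size_t base = 0; base < data.size(); base += CHUNK)
    {
        const std::size_t end = std::min(base + CHUNK, data.size());

        for (std::size_t i = base; i < end; ++i)  // first operation: loads the chunk
            data[i] *= 2.0f;

        for (std::size_t i = base; i < end; ++i)  // second operation: cache hits
            data[i] += 1.0f;
    }
}
```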

The only thing I have to say is what I learned from Effective C++ by Scott Meyers, and that was NOT to group variable declarations, because 1) it just makes one part of the function (the initialization) slower than the rest, when you should try to keep it even, and 2) it decreases readability; declare a variable only at the point where you first need it.
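For reference, this is roughly the kind of thing Meyers means (a minimal sketch; the functions and names are invented):

```cpp
#include <string>

// Grouped up front: 'name' is constructed (and paid for) even when we
// return early and never use it.
std::string grouped(bool ok)
{
    std::string name;       // cost incurred here, possibly for nothing
    if (!ok) return "";
    name = "player";
    return name;
}

// Declared when needed: no wasted construction, and the reader sees the
// variable right where it matters.
std::string atPointOfUse(bool ok)
{
    if (!ok) return "";
    std::string name = "player";
    return name;
}
```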

my $0.02

I don't know about embedded systems, but desktop CPUs have separate caches for data and instructions, so the advice seems quite odd.
This kind of nano-optimisation is also up to the compiler.

On portable devices this approach isn't even feasible, because most mobile phones run Java apps and you have no influence on how the VM actually handles the bytecode.

Optimisations like that can only be verified with one single method: profiling!

Quote:
Original post by Lode
I'm just wondering, do you game developers keep this stuff in mind all the time while coding games?
No. That's what we have intelligent compilers for.

If you're curious about this stuff, I encourage you to turn on your compiler's map file output so you can see where it put everything and how that corresponds to the original source. You can see things like how much inlining took place, where functions are located, and so on.

There are usually separate data and instruction caches at the L1 level, and usually a unified L2 cache. Data alignment is the minimum you want to think about for cache-friendly code and data, and it is also pretty much necessary with SSE (I don't remember the alignment constraints of 3DNow! at the moment). The other thing is choosing the right set of data structures; their layout is optimal only in certain cases, as it is rare if not impossible to get the most memory-friendly layout for all cases. Basically what I'm saying is: for each data structure you have a set of operations that will be performed, both on the data structure itself and on the data encapsulated within it. If you want to attempt to optimize the layout, optimize it for your most speed-critical functions; if there is a conflict, see if you can come up with a compromise that doesn't favor one layout over the other but gives good performance for both.
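To make the alignment point concrete, here is a minimal sketch (using C++11's alignas; compilers of this era had vendor-specific equivalents such as __declspec(align(16))). SSE's aligned load/store intrinsics require 16-byte-aligned addresses, so a vector type meant for them is typically declared like this:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// 16-byte alignment so that _mm_load_ps / _mm_store_ps, which require
// aligned addresses, can be used on it; an unaligned address would fault.
struct alignas(16) Vec4
{
    float x, y, z, w;
};

void scale(Vec4& v, float s)
{
    __m128 vec    = _mm_load_ps(&v.x);            // aligned 16-byte load
    __m128 factor = _mm_set1_ps(s);               // broadcast s to all four lanes
    _mm_store_ps(&v.x, _mm_mul_ps(vec, factor));  // aligned store back
}
```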

It is also a good idea to see which functions are suitable for lazy evaluation, so that, if you can implement it with low overhead, you can batch work to ensure good cache utilization. Make use of 'intrinsic' data when possible: if you have an array of static data, see if you can arrange the initial order of the data so that the address of, or index into, the array is used for something else. Also intrinsic in the sense that a third data value can sometimes be derived from other data, and extracted at low cost.
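A minimal sketch of what I mean by intrinsic data (the table and numbers are invented): order a static table so that the array index is the ID, and you never have to store or fetch an ID field per record.

```cpp
// Static weapon table, ordered so the array index IS the weapon ID:
// no 'id' field needs to be stored in (or fetched from) each record.
struct WeaponInfo
{
    float damage;
    float range;
};

// Index 0 = fists, 1 = pistol, 2 = rifle (order chosen on purpose).
static const WeaponInfo weapons[] =
{
    { 1.0f,  1.5f },   // 0: fists
    { 6.0f, 30.0f },   // 1: pistol
    { 9.0f, 80.0f },   // 2: rifle
};

// An entity only stores the small index; everything else is derived
// from it at low cost.
float damageOf(int weaponId)
{
    return weapons[weaponId].damage;
}
```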

Get a good grasp of the caching concepts (cache lines, associativity, and hit/miss latency), and remember that when a modern processor takes a cache miss it fills an entire cache line, which is aligned to the cache-line size. So if the data in question is not in the first transfer on the memory bus, you will be dealing not only with access latency but with transfer latency as well. That latency is based on the memory clock, which on the PC platform is usually a fraction of the CPU speed.

To give a quantitative example, as well as a model for understanding why this can be rather important (this is pretty much what happens on all PCs since the 486):
100 MHz memory, 3-cycle initial latency, 64-bit bus width, burst-mode transfers of 8 cycles (remember these are memory clock cycles).
1000 MHz CPU, 64-byte cache line.
A data access misses the cache, and the requested piece of data is aligned to the middle of the cache line.

(The thing to note is that one memory clock cycle is 10 CPU clock cycles.)
0. The instruction stream reaches and tries to execute the read instruction. Out-of-order execution and pending memory accesses with strict ordering may cause additional latency, but most CPUs also prefetch, so this might even be "negative". Call it LAT0.
1. The CPU tries the cache and misses; this is the cache-miss latency. Call it LAT1.
2. The CPU sends the address over the local bus to the north bridge. Call it LAT2.
3. The north bridge's bus arbitrator checks whether the memory bus is currently busy and arbitrates access between the CPU, AGP, and PCI bus masters, including other pending memory accesses queued from the CPU. Arbitration can be round robin or some more advanced algorithm, so this can vary. Call it LAT3.
4. The north bridge initiates a memory-access burst cycle and waits for the first data to come back. This is the quoted 3-cycle latency. Call it LAT4.
5a. The first data starts getting returned to the north bridge and now has to be streamed back to the CPU, assuming the front-side bus and the memory bus are the same width (if not, there is another slight latency). Some basic math here: a 64-bit memory bus moves 8 bytes per transfer, so 8 transfers move the full cache line. This is where the burst-mode number comes in: without burst transfers you would need LAT4 for every single bus transfer, while with burst mode you pay LAT4 once for a run of sequential addresses (SDRAM and DDR support burst lengths of 1, 2, 4, and 8). So instead of (3+1)*8 cycles to transfer 64 bytes, you only need 3+8 cycles. (This isn't entirely accurate, since things like precharge and bus turnaround time can make it take longer; writes also have slightly different timings.) If you actually got through all that, call it LAT5a, because only the first transfer is hitting the north bridge.
5b. Three more transfers and we are up to the data I mentioned at the beginning of the example (the data aligned midway into the cache line). Call it LAT5b.
6. The data you requested has been streamed back to the CPU. Call it LAT6.
7. The CPU has to route the data to an execution unit so the operation can be performed on it. Call it LAT7.

Oh man, I have a headache... I hope you appreciate this. lol

Alright, we have a weighted sum here:

LAT_total = LAT0 + LAT1 + LAT2 + ...

Remember that you have different clock rates, so:
LAT0, LAT1, and LAT7 operate at the CPU clock speed.
LAT2 and LAT6 operate at the FSB speed (I'll leave it to Google or someone else to explain that, I'm getting lazy at this point; I will point out, though, that LAT6 varies both with where the data sits in the cache-line-aligned transfer and with a constant, since there is a fixed delay before data received at the north bridge can actually be sent on to the CPU, but memory-bus and FSB transfers can overlap since they happen concurrently).
LAT4 and LAT5 happen at the memory bus speed.
LAT3 I don't really know; it varies both with pending bus-master transfers and with the arbitration method.

Essentially, convert all clocks to CPU-relative terms (in my example, 3 memory cycles = 30 CPU cycles), add 'em up, and voila...
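To actually add them up for the example numbers (a back-of-the-envelope sketch that only counts the memory-bus part, since the other latencies were left unspecified):

```cpp
// Converting the example's numbers to CPU cycles (1 memory cycle = 10 CPU
// cycles at 1000 MHz CPU / 100 MHz memory), counting only the memory-bus
// portion (LAT4 + LAT5); LAT0-LAT3, LAT6 and LAT7 are left out because
// the post leaves them unspecified.
const int cpuPerMemCycle = 1000 / 100;  // 10 CPU cycles per memory cycle
const int initialLatency = 3;           // memory cycles before first data (LAT4)
const int burstTransfers = 64 / 8;      // 64-byte line over a 64-bit bus = 8

// First word of the line:  (3 + 1) mem cycles =  40 CPU cycles.
// Our mid-line word:       (3 + 4) mem cycles =  70 CPU cycles.
// Whole line, burst mode:  (3 + 8) mem cycles = 110 CPU cycles,
// versus (3 + 1) * 8 = 32 mem cycles = 320 CPU cycles without bursting.
const int wholeLine   = (initialLatency + burstTransfers) * cpuPerMemCycle;      // 110
const int noBurstLine = (initialLatency + 1) * burstTransfers * cpuPerMemCycle;  // 320
```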

Hence you will not be taking all of this into account when programming; it was given just so you understand what is really involved in accessing memory. Since the post is already this long, I might as well mention that the Athlon 64 has the memory controller built into the CPU, essentially getting rid of LAT2 and LAT6 and shortening other latencies, thanks to further optimizations that are possible when everything sits on the same chip.

enjoy lol

Guest Anonymous Poster
I work at a small studio making games for the PS2 and GC.

On the GameCube, you can still get a decently performing game without worrying overly much about cache performance or utilization.

On the PS2, though, if you want your game to run fast, you need to worry about how you use the cache. So you might not think about it every day, but you do think about it when you design new systems for the game.

Quote:
Original post by Lode
...the professor talked a bit about possible source code optimizations to use the cache at its best...


Realize that you shouldn't do this unless you can demonstrate the need through profiling or benchmarking.

If you write good programs, then performing the optimization later will not be difficult. If you go out of your way to optimize cache usage before you have demonstrated the need to do so, you will likely do nothing more than make your code harder to read, create bugs, decrease optimization opportunities for the compiler, and decrease your own future optimization opportunities.

In addition, optimizing cache usage is not an entirely portable optimization. It may increase execution speed on one line of CPUs and decrease it on another.

Someone once said that all optimisations are an exercise in caching.
I think you'd usually find that your average program already has a good proportion of its data cached. For example, data on the stack in a function you have just entered is often already cached, because it was cached when the same portion of the stack was used in some previous function call.

Hand optimisations like those mentioned in this thread might make the difference between something being in a particular cache 98% of the time as opposed to 90% of the time. That doesn't make enough difference to warrant spending time on it before getting your app working correctly.

Unless you're writing a software 3D engine, there's not really any point in thinking twice about this subject.
