Raytracer memory layout

I've been playing around for a bit, deciding on the best structure for my raytracer. I've been annoyed a lot by having to choose between elegance and speed, for example in how to store vertices, etc. I've now come to the following conclusion: I have a Mesh class, which holds all primitives in high-level form. From this I then build a separate kd-tree class, where I drop all notions of high-levelness. It consists of a tree of nodes, with each leaf node holding a pointer to ONE chunk of data, which contains all the needed info without having to dereference any more pointers. I realize I'll be wasting some memory because a lot of vertices will appear twice, as will a lot of primitives, since they may appear in multiple leaves. However, I've decided this isn't really important, as I'm tracing cubic patches with low memory requirements anyway. Fast access is much more important.

However, I have a question about the way cache memory behaves. If I dereference the pointer to my chunk of data, does that mean the entire chunk is read into cache memory? Unless it's too big, of course, but I doubt that will be a problem with 4-5 patches; let's say a chunk is 1 KB at the very most. So by dereferencing this pointer once, I should then have fast access to all my needed data, right? Is this going to work? Or is it smarter to keep pointers to all the individual primitives and dereference them one at a time?
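As a concrete sketch of that layout (the field names and sizes below are my own assumptions, not from the post), the leaf's single pointer would target one self-contained block holding flattened, duplicated copies of everything an intersection test needs:

    #include <cstdint>

    // Hypothetical flattened leaf record: vertex data is copied in,
    // so intersecting a patch never dereferences another pointer.
    struct LeafPatch {
        float    control[4][4][3];  // cubic patch control points, duplicated per leaf
        uint32_t materialId;
    };

    // One contiguous allocation per leaf; 'patches' really holds
    // patchCount entries (the classic trailing-array idiom).
    struct LeafChunk {
        uint32_t  patchCount;
        LeafPatch patches[1];
    };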
The more I think about it, the more I realize how clueless I am regarding cache behavior.

Does anyone have a link to a good tutorial explaining the basics?
Three basic things to remember with the cache:

- Do things in blocks of 32 bits. This means things like using struct padding settings in the compiler, using ints instead of chars for data storage, and so on. In general, nicely arranged 32-bit chunks will neatly fit inside a single cache line as well (you may need to do some research to find out the size of a single line as I don't recall that off the top of my head).

- If you access a block of memory, stay within a few dozen bytes for as long as possible. Nested linked lists are evil here (of the form a->next->b->next->c->next->d->x_coord, etc.) and arrays are your friend; see the sketch after this list. In general, do as much as you can in a flat data space before moving on to other data.

- You (presumably) work in a multitasking environment. You can never guarantee that more than a small amount of your personal data will be in the cache, because if the OS kernel context switches, the cache may get flushed and refilled with other data. When execution returns to your thread, you have to reload part of the cache in this case. So it is generally best not to rely too much on specific cache behavior, because you can only guarantee that behavior will work in a true single-task environment.
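A tiny sketch of the first two points in code (the names and sizes below are illustrative assumptions, not from the post): a compact, fixed-size node kept in one flat array and reached by index instead of by chasing individually allocated pointers.

    #include <cstdint>
    #include <vector>

    // Compact 8-byte node: one cache line (typically 32 or 64 bytes)
    // holds several of them, so visiting a neighbour usually touches
    // a line that is already loaded.
    struct Node {
        float    value;       // payload
        uint32_t childIndex;  // index into the flat array, not a raw pointer
    };
    static_assert(sizeof(Node) == 8, "keep nodes compact");

    // Linear sweep over one flat allocation: the friendliest access
    // pattern the cache and prefetcher can hope for.
    float walk(const std::vector<Node>& nodes) {
        float sum = 0.0f;
        for (const Node& n : nodes)
            sum += n.value;
        return sum;
    }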


The main problem with ray tracing and cache thrashing is that even with basic aids like using arrays instead of linked list-style structures, you fundamentally have to access disconnected blocks of memory at intervals the CPU caching system is not likely to understand or predict nicely (which further increases the chances of a context switch introducing a cache flush of data you wanted). There are things you can do to help with this, but they only help so far.

I'm personally working on rebuilding my raytracer code to go from library-based memory management and a linked-list structure to an internally-managed memory system and pure array-based structure. I'm not sure exactly what this will do for performance, but I'm fairly confident that it will make huge improvements in areas where memory access plays a much bigger role than raw computation.
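For what it's worth, one common shape for that kind of internally-managed memory is a simple bump arena: one big block handed out in sequential slices, so allocation order becomes memory order. A rough sketch under that assumption (the class is illustrative, not the actual rewrite described above):

    #include <cstddef>
    #include <vector>

    // Trivial bump arena: objects allocated through it sit back-to-back
    // in one block, so traversal order can match memory order.
    class Arena {
        std::vector<char> storage;
        std::size_t used = 0;
    public:
        explicit Arena(std::size_t bytes) : storage(bytes) {}

        // align must be a power of two.
        void* alloc(std::size_t bytes, std::size_t align) {
            std::size_t p = (used + align - 1) & ~(align - 1);
            if (p + bytes > storage.size()) return nullptr;  // out of space
            used = p + bytes;
            return storage.data() + p;
        }
    };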

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Thank you.

I just finished doing some reading on the subject, so I'm a little more educated now, but there are still some things that don't make sense to me.

It's common practice when using kd-trees to keep the nodes 8 bytes each, so that one node fits in one cache line, or so I've been told. Cache-line size seems to vary from chip to chip, but I suppose 8 bytes is common on today's chips. That's fine and dandy, but then I don't really get where the speedup comes from.
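For reference, the compact 8-byte node usually meant here (in the style popularized by Wald's kd-tree ray tracing work) packs a node into two 32-bit words. The exact packing below follows common descriptions and is an assumption, not something from this thread:

    #include <cstdint>

    // 8-byte kd-tree node: eight of these share one 64-byte cache line,
    // so a traversal step often lands on an already-loaded line.
    struct KdNode {
        union {
            float    split;      // interior node: splitting plane position
            uint32_t primCount;  // leaf: number of primitives referenced
        };
        // Low 2 bits: split axis (0,1,2) or 3 = leaf.
        // Remaining bits: offset to the children / primitive list.
        uint32_t flagsAndOffset;
    };
    static_assert(sizeof(KdNode) == 8, "eight nodes per 64-byte line");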

I request my tree node, and all that gets cached is this one single node? No problem, that's fine for this purpose, although I'm a bit disappointed that it only halves my RAM accesses. But what about my leaf-data chunks of up to 1 KB? When I dereference one, only its first 8 bytes get cached? What do I even have the chunks for, then? 1 KB / 8 is still 128 separate RAM accesses.

Is there no way to say, 'put this block of memory into cache'? Because if it really works the way I understand it now, caching will only halve my RAM queries unless my whole model fits into cache. That isn't unthinkable for trivial scenes, but you can't count on it for complex scenes.

Or does the benefit of the cache only kick in across multiple coherent rays? Coherent rays are of course likely to access the same nodes and primitives, although I can see that benefit getting trashed quickly as recursion depth goes up.

But anyway: so there isn't really a point in putting my primitive data into one big chunk, since a single primitive is bigger than 8 bytes anyway?
Hmm, it seems it's possible to manually set your cache-line size; however, Google is reluctant to give me the details. Besides, I'm not really sure that's something I want to be messing with.

However, wouldn't it be nice to have a keyword you could put in front of a struct, specifying that you want the struct loaded into cache completely every time you dereference it? Or maybe a keyword/symbol to force complete caching of a dereferenced object.

That would be nice, IMHO, although I suppose it wouldn't be used that often.
Generally there are only two things to keep in mind to be cache friendly in more high level code (like C ;):

* The smaller your data type is, the more you can fit in the cache. This is why making tree nodes smaller has a big effect. It's not that being 8 bytes large is specifically useful as much as being smaller in general is good. This has a very large effect if it means you can fit all of your data set in the cache.

* Access memory linearly. Simple really, if you do this, the data you want is likely already in the cache. Individually allocated small tree nodes with lots of pointer indirection=bad, all nodes in a big array with some sort of locality=good.

If you really want to get fancy you can try to prefetch data, i.e. read the first byte of a chunk of data you need, do something useful for a while to hide the latency, and then access the data, hoping it has already been fetched from memory. I doubt you can get consistent behaviour from the compiler when trying to coax it into doing this, though, so it might be easier to go to full ASM at that point. Besides, according to SoftWire ASM guru Nicolas Capens, modern processors prefetch pretty well on their own anyway.
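In intrinsic form, that idea is available on x86 through _mm_prefetch from <xmmintrin.h> (SSE), which hints the hardware to pull a cache line in early. The loop below is an illustrative pattern with a guessed prefetch distance, not code from this thread:

    #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

    // Sum a flat array while hinting the cache one 64-byte line ahead
    // (16 floats). The distance is a tuning knob, not a magic number.
    float sumWithPrefetch(const float* data, int count) {
        float sum = 0.0f;
        for (int i = 0; i < count; ++i) {
            if (i + 16 < count)
                _mm_prefetch(reinterpret_cast<const char*>(data + i + 16),
                             _MM_HINT_T0);
            sum += data[i];  // the useful work that hides the fetch latency
        }
        return sum;
    }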
Thanks for your reply.

Quote:Original post by GameCat
* Access memory linearly. Simple really, if you do this, the data you want is likely already in the cache. Individually allocated small tree nodes with lots of pointer indirection=bad, all nodes in a big array with some sort of locality=good.

Yeah, that's what I used to think.

But what difference does it make if my cache line is a meager 8* bytes? I might as well grab my child nodes from the other side of memory: they weren't prefetched along with another node anyway, since only one node fits in a cache line.

*(That's what I've heard, but then again I've heard lots of wildly different things. Does anyone know where I can look it up? It's a Centrino processor.)

Going by that logic, it really doesn't matter a damn where I place my nodes in memory, and certainly not for big structs like a complete cubic patch of 120+ bytes.

Now, I did stumble across some ASM instructions that let you prefetch a block of memory (I don't recall whether that was just one cache line or an arbitrarily sized block), but the language I'm using doesn't seem to expose them (nor does it have a decent instruction reference to look them up, goddammit).
Cache lines are not as small as eight bytes; they're more like 32 or 64 bytes (edit: the Pentium-M has 64-byte lines). Besides, accessing memory linearly also helps with prefetching. And there are several levels of cache, so even if you don't hit the one closest to the CPU, you might still not miss completely and end up waiting hundreds of cycles for an actual memory access.
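(To put numbers on the earlier worry: with 64-byte lines, a 1 KB leaf chunk is 1024 / 64 = 16 lines, not 128 separate fetches, and since the chunk is contiguous, a linear pass over it is exactly the pattern a hardware prefetcher streams best. A 120-byte patch padded to 128 bytes spans exactly two lines.)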
Have a look at these three articles.
I haven't tried it myself, but it sounds logical :)

http://www.gamedev.net/reference/articles/article1187.asp
http://www.gamedev.net/reference/articles/article1097.asp
http://www.8ung.at/basiror/theironcross.html
Quote:Original post by GameCat
Cache lines are not as small as eight bytes; they're more like 32 or 64 bytes (edit: the Pentium-M has 64-byte lines). Besides, accessing memory linearly also helps with prefetching. And there are several levels of cache, so even if you don't hit the one closest to the CPU, you might still not miss completely and end up waiting hundreds of cycles for an actual memory access.

Thanks a lot!

So lumping multiple patches together into one structure won't help me, since they're 120+ bytes each, but it would be wise to constrain each one to 128 bytes, so it fits in exactly two cache lines.
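A quick way to pin that constraint down in code (modern C++ shown; the payload is a placeholder for the real patch fields):

    #include <cstddef>

    // Pad and align each patch to exactly two 64-byte cache lines, so an
    // array of them stays line-aligned and no patch straddles a third line.
    struct alignas(64) Patch {
        char payload[120];  // stand-in for the ~120 bytes of patch data
        char pad[8];        // explicit padding up to 128
    };
    static_assert(sizeof(Patch) == 128, "must fill exactly two cache lines");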

I'm going to try and put this into practice.

