Sorting/structuring renderables and cache locality

Started by
13 comments, last by Lemmi 9 years, 7 months ago
Hi, so I'm building a graphics engine for fun, and I've been thinking about how to approach renderable sorting for the different passes (I'm doing deferred rendering). I'd heard about how you can make huge gains by sorting everything so that access is linear for each pass. The problem for me comes when I want to re-use the same renderables for several different passes during the same frame.


First of all I want to start off by saying that my knowledge of how the modern CPU cache actually works is very rudimentary, so I'm mostly going off assumptions here, please do correct me if I am wrong at any point. Also don't hesitate to ask for clarifications if I'm making no sense.


My current idea would be to keep a large, preallocated buffer where I store all the renderables (transforms, meshes, bundled with material and texture handles, flyweight pattern style) that got through culling each frame update.


Then I would keep different index/handle "lists"(not necessarily an actual list) -- one list per render pass -- with handles or direct indices to the renderable array.

This way I can access the same renderable from several different passes. I don't have to copy or move the renderables around. I'd just send in a pointer to the renderables array and then for each pass access all the relevant renderables through the index lists. This would essentially mean that I never sort the actual renderables array, only sorting the index lists for things like depth, translucency (depending on what pass).

Now comes my question, would this be inefficient because I'd be essentially randomly accessing different indices in the big renderable array? The cache would have no real good way to predict where I'd be accessing next, so I'd probably be getting tons of cache misses.


I just feel that despite this, it's a flexible and hopefully workable approach.


How do real, good engines deal with this sort of thing? Should I just not bother thinking about how the cache handles it?
Advertisement


Should I just not bother thinking about how the cache handles it?

basically, yes.

smacks of premature optimization.

for true cache friendly data, you'd need to create actual sorted lists for each pass (using data redundancy to eliminate skipping around while reading data), not just sorted indices to an unorderd list.

and you'd need to make sure they didn't reference other data elsewhere during processing to avoid cache misses - probably easier said than done with graphics calls.

Norm Barrows

Rockland Software Productions

"Building PC games since 1989"

rocklandsoftware.net

PLAY CAVEMAN NOW!

http://rocklandsoftware.net/beta.php

smacks of premature optimization.

Yes! Okay. Thank you for this advice. I'm fully aware that premature optimization is wrong, but last time I wrote a "graphics engine", I spent 1½ years of regretting that I had made a bunch of stupid design mistakes that would be really hard to fix, so this time around I rather overthink than underthink! :)


Also, I'm a dummy. By data redundancy, do you mean that you actually have several instances of the same object in different places for better data locality?

Also, I'm a dummy. By data redundancy, do you mean that you actually have several instances of the same object in different places for better data locality?


Maybe not whole objects but just the relevant pieces of data. Each array you are sorting/using should contain a struct of all relevant data, or you need to split things up into multiple arrays and keep them properly paired up when sorting.

You can also minimize the size of your arrays. Skipping randomly around a small array is much better than skipping around a bigger one. If you keep all your shaders, textures, etc. in their own arrays and then can generate small indices for each, you can pack all these indices into a single 32-bit or 64-bit number. Combine that with instance data (transform, etc.) and sort just that. Your sort is faster (less crap in the array you're sorting) even with std::sort or the like, you can use radix sort since your key is a simple integer (faster than quicksort usually, but not general-purpose, hence why it's not default), and then you just need to reference back to those small resource arrays.

You can even pack some resources into single GPU resources. Multiple textures can be stored in one texture array, multiple objects can be stored in the same vertex and index buffers (so long as you keep the start and count for each object separately), etc. This both reduces the number of GPU objects your keys need to be able to hold but also improve rendering by vastly improving your ability to batch draw calls (as you have to change state and bound objects less often).

Sean Middleditch – Game Systems Engineer – Join my team!

I like the idea about packing different (small) indices into a 64bit structure. Regarding all the shaders, textures and such, I was currently thinking about using the flyweight pattern, which is pretty much what you're describing, I think. It's also what I used last time and it worked fine.

Am I to understand that you suggest I'd do something like this?:

struct Renderable

{

InstanceData transform; //either pos+quat or a 4x4 matrix, I guess. possibly other things? Locally stored copy, right? because we want optimal data locality

long key; //or potentially even a 128bit structure

}

struct RenderPass

{

vector<Renderable>

vector<shaderIndices> //or shader pointers

}

And then each render pass would:

1) Sort each renderable based on, for example, their transform's camera depth

2) Applying each shader to each renderable, where you'd fetch from localized and sorted mesh/texture/material arrays somewhere else, using the handles that you'd extract from the renderable 'key' variable.

Or you can sort all renderables once per frame based on their transform camera depth and THEN insert them by going from the back to the front and pushing them back into every render pass that they've been flagged for. That's probably better.

I guess that'd mean that the actual entity would also need another set of flags then, so that the renderer knows which passes I want to insert it into.

Why would I sort by the key and not the transform? To optimize texture/resource usage and bundling render calls? How would I go about doing that on the fly? Rebuilding and merging vertex buffers every frame and doing some sort of semi-instancing? Or are you meaning just sorting them by what textures/materials they use so that I can send those resources once and then render several meshes without changing anything? I can't remember if that was possible to do in directX. I'm going to be using openGL, btw, if that is in any way relevant and helps the discussion.

Edit: Oh, yeah. Of course. You want to sort by the keys because then you know that you'd constantly be accessing the closest meshes/textures/materials every time you move to the next renderable.

Btw are you implying that I should sort by both transform and key? Say, I first sort by keys, and then again within all renderables that have the identical key(unlikely) sort again by transforms?

Btw are you implying that I should sort by both transform and key? Say, I first sort by keys, and then again within all renderables that have the identical key(unlikely) sort again by transforms?


Yes.

You'll want your draw calls ordered by material (shader+textures+constants for example) and then the instances inside that in a rough front to back order (z sort based on the centre point of a sphere would probably do it) which will get you good z coverage (for non-translucent objects; stuff which blends generally need to be drawn back to front).

However, you can build all this into a single sort key where the top 32bit hold details about pass, material id and other stuff and the bottom 32bits contain the depth information allowing you to sort on a single thing in one pass.

However, you can build all this into a single sort key where the top 32bit hold details about pass, material id and other stuff and the bottom 32bits contain the depth information allowing you to sort on a single thing in one pass.

Nice.

It is not particularly hard. Avoid fragmenting your data access patterns. Keep objects in small, 64-byte chunks and ensure the chunks are properly aligned.

Your example lists a matrix (64 bytes) and potentially other nested structures. All total you are looking at what, 80 bytes? 92 bytes?Is there a way you can reduce what you store?

For example, often it is cheaper to store coordinates and rotation angles, totaling 24 bytes, rather than a full matrix. Doing that requires doing the matrix work somewhere else, and if you can find a way to push that off to the GPU it can result in a bunch of nearly-free operations limited mostly by cache load speed.

Another useful tidbit for cache friendly systems -- such as particle systems -- is to add a 'deleted' or 'dead' flag, that means to skip that block.

The KISS version of cache friendliness is to keep the data you are iterating over exactly 64 bytes, properly aligned, with each 64-byte block being accessed sequentially. You can jump around a little bit within each block, such as accessing .x, .y, and .z. Marking an entity as 'deleted' or 'dead' means no processing on that block and you can continue to the next on the sequence. As long as you have the 64-byte blocks with obvious sequential loops, the compiler's optimizer and the processor's internals can usually do the rest.

Dead flags, while they might seem like a good idea, are something that requires some thought however and it is important to always make sure the work you are skipping is worth the cost of the compare and potentially mispredicted jump as well as the extra cache space you are taking up. This one flag has bulked your data up by 4bytes.

In some cases you might be better off keeping the 'dead' flag separate from the objects themselves; that way a single cache line can contain information about the alive state of 16 objects and keep the size of your larger structure down meaning more cache wins and less overall trips to memory.

In other situations killing and sorting might be a better win; if you can execute your main loop with the assumption that everything is alive it makes the logic easier there and a second sweep can be performed to kill any dead elements and compact your data structures.

I built an experimental particle system on this principle with the 'simulate and draw' phases assuming all particles are alive and then a 'kill' phase which would walk the (separate) life time information and do a 'mark and compact' scheme on the particle data. So if you had 1000 particles you would walk up until the first 'dead' particle and remember that point and continue walking until you hit the next 'alive' particle and note that, then you'd walk again until the next 'dead' particle (or end) at which point you would copy the data (which was in separate memory locations per component) for the 'alive' block down so all particles were alive up until that point. Repeat from step two until end of particles.

Particle data was just data so a nice fast update/draw phase, the 'are you dead?' phase touched the minimal amount as life time information was in it's own place in memory. The copy down might have had some cost but it wouldn't have been paid often and was a trade off vs fast update and it was a linear access so would have been prefetcher friendly.

(This was a few years ago but recently an article surfaced which backed up my idea where someone tested the cost of sorting a huge array + access vs jumping about in memory a bit (including pools and the like); short version - sort + linear access won.)

But, as with all things, it requires some thought; don't go blindly throwing in 'dead' flags but at the same time don't blindly throw the idea out either.

Know your data.
Know your processing.

But if you use the method you mentioned earlier, of a large 64 bit sort key, couldn't the 'alive' variable be the topmost bit, so it's already included in the sorting pass? That way you wouldn't have to sort by depth and transparency and aliveness, but it's one sort - and once sorted, you can locate the dividing point between dead and alive before you draw, so you aren't branching during the draw itself.

This topic is closed to new replies.

Advertisement