Lemmi

Sorting/structuring renderables and cache locality


Hi, so I'm building a graphics engine for fun, and I've been thinking about how to approach renderable sorting for the different passes (I'm doing deferred rendering). I'd heard that you can make huge gains by sorting everything so that access is linear for each pass. The problem for me comes when I want to re-use the same renderables for several different passes during the same frame.


First of all I want to start off by saying that my knowledge of how the modern CPU cache actually works is very rudimentary, so I'm mostly going off assumptions here, please do correct me if I am wrong at any point. Also don't hesitate to ask for clarifications if I'm making no sense.


My current idea would be to keep a large, preallocated buffer where I store all the renderables (transforms, meshes, bundled with material and texture handles, flyweight pattern style) that got through culling each frame update.


Then I would keep different index/handle "lists" (not necessarily an actual list) -- one list per render pass -- with handles or direct indices into the renderable array.

This way I can access the same renderable from several different passes, and I don't have to copy or move the renderables around. I'd just pass in a pointer to the renderable array and then, for each pass, access all the relevant renderables through the index lists. This essentially means I never sort the actual renderable array, only the index lists, for things like depth and translucency (depending on the pass).
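A minimal sketch of that layout, with hypothetical names (the `Renderable` fields and pass names are my illustrative assumptions, not anything settled in this thread):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical renderable record; a real one would bundle mesh/material handles etc.
struct Renderable
{
    float    transform[16];   // world matrix
    uint32_t materialHandle;  // flyweight-style handle into a material array
};

// One flat, preallocated renderable array per frame, plus per-pass index lists.
// Only the index lists get sorted; the renderables themselves never move.
struct FrameRenderables
{
    std::vector<Renderable> renderables;        // filled once per frame after culling
    std::vector<uint32_t>   gbufferPassIndices; // sorted by material/state for the G-buffer pass
    std::vector<uint32_t>   translucentIndices; // sorted back to front for blending
};
```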

Now comes my question, would this be inefficient because I'd be essentially randomly accessing different indices in the big renderable array? The cache would have no real good way to predict where I'd be accessing next, so I'd probably be getting tons of cache misses.


I just feel that despite this, it's a flexible and hopefully workable approach.


How do real, good engines deal with this sort of thing? Should I just not bother thinking about how the cache handles it?


Should I just not bother thinking about how the cache handles it?

 

basically, yes. 

 

smacks of premature optimization. 

 

for truly cache-friendly data, you'd need to create actual sorted lists for each pass (using data redundancy to eliminate skipping around while reading data), not just sorted indices into an unordered list.

 

and you'd need to make sure they didn't reference other data elsewhere during processing to avoid cache misses - probably easier said than done with graphics calls.


smacks of premature optimization.

Yes! Okay. Thank you for this advice. I'm fully aware that premature optimization is wrong, but the last time I wrote a "graphics engine", I spent 1½ years regretting a bunch of stupid design mistakes that were really hard to fix, so this time around I'd rather overthink than underthink! :)


Also, I'm a dummy. By data redundancy, do you mean that you actually have several instances of the same object in different places for better data locality?


Also, I'm a dummy. By data redundancy, do you mean that you actually have several instances of the same object in different places for better data locality?


Maybe not whole objects but just the relevant pieces of data. Each array you are sorting/using should contain a struct of all relevant data, or you need to split things up into multiple arrays and keep them properly paired up when sorting.

You can also minimize the size of your arrays. Skipping randomly around a small array is much better than skipping around a bigger one. If you keep all your shaders, textures, etc. in their own arrays and generate small indices for each, you can pack all those indices into a single 32-bit or 64-bit number. Combine that with instance data (transform, etc.) and sort just that. Your sort is faster (less crap in the array you're sorting) even with std::sort or the like; since your key is a simple integer you can use radix sort (usually faster than quicksort, but not general-purpose, hence why it's not the default), and then you just reference back into those small resource arrays.
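Since the key is a plain integer, the radix sort mentioned above is straightforward. Here's a sketch of a least-significant-byte radix sort over (key, index) pairs; pairing each key with a renderable index is my own assumption about how you'd reference back into the resource arrays:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Least-significant-digit radix sort over 64-bit keys, one byte per pass.
// Sorts (key, renderableIndex) pairs; stable, O(8 * n) overall.
void radixSortKeys(std::vector<std::pair<uint64_t, uint32_t>>& items)
{
    std::vector<std::pair<uint64_t, uint32_t>> scratch(items.size());
    for (int shift = 0; shift < 64; shift += 8)
    {
        std::array<size_t, 257> count{}; // histogram for this byte of the key
        for (const auto& it : items)
            ++count[((it.first >> shift) & 0xff) + 1];
        for (int i = 0; i < 256; ++i)
            count[i + 1] += count[i];    // prefix sums -> output offsets
        for (const auto& it : items)     // stable scatter into scratch
            scratch[count[(it.first >> shift) & 0xff]++] = it;
        items.swap(scratch);
    }
}
```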

You can even pack some resources into single GPU resources. Multiple textures can be stored in one texture array, multiple objects can be stored in the same vertex and index buffers (so long as you keep the start and count for each object separately), etc. This not only reduces the number of GPU objects your keys need to be able to hold, but also improves rendering by vastly improving your ability to batch draw calls (as you have to change state and bound objects less often).


I like the idea of packing different (small) indices into a 64-bit structure. Regarding all the shaders, textures and such, I'm currently thinking about using the flyweight pattern, which is pretty much what you're describing, I think. It's also what I used last time, and it worked fine.

 

 

Am I to understand that you're suggesting I do something like this?

 

 

struct Renderable
{
    InstanceData transform; // either pos+quat or a 4x4 matrix, I guess. Possibly other things? A locally stored copy, right? Because we want optimal data locality
    uint64_t key;           // or potentially even a 128-bit structure
};

struct RenderPass
{
    vector<Renderable> renderables;
    vector<uint32_t> shaderIndices; // or shader pointers
};

 

 

And then each render pass would:

1) Sort each renderable based on, for example, its transform's camera depth

2) Apply each shader to each renderable, fetching from localized, sorted mesh/texture/material arrays somewhere else, using the handles extracted from the renderable's 'key' variable.

 

 

Or I could sort all renderables once per frame based on their camera depth, and THEN insert them, going from back to front, into every render pass they've been flagged for. That's probably better.

 

 

I guess that means the actual entity would also need another set of flags, so that the renderer knows which passes I want it inserted into.

 

 

Why would I sort by the key and not the transform? To optimize texture/resource usage and batch render calls? How would I go about doing that on the fly? Rebuilding and merging vertex buffers every frame and doing some sort of semi-instancing? Or do you mean just sorting them by what textures/materials they use, so that I can bind those resources once and then render several meshes without changing anything? I can't remember if that was possible in DirectX. I'm going to be using OpenGL, btw, if that is in any way relevant to the discussion.

 

Edit: Oh, yeah. Of course. You want to sort by the keys because then you know you'll constantly be accessing the closest meshes/textures/materials every time you move to the next renderable.

 

Btw, are you implying that I should sort by both transform and key? Say, I first sort by keys, and then within all renderables that have an identical key (unlikely) sort again by transforms?

Edited by Lemmi


Btw, are you implying that I should sort by both transform and key? Say, I first sort by keys, and then within all renderables that have an identical key (unlikely) sort again by transforms?


Yes.

You'll want your draw calls ordered by material (shader+textures+constants, for example) and then the instances inside that in a rough front-to-back order (a z-sort based on the centre point of a bounding sphere would probably do it), which will get you good z coverage (for non-translucent objects; stuff that blends generally needs to be drawn back to front).

However, you can build all this into a single sort key where the top 32 bits hold details about pass, material id and other stuff, and the bottom 32 bits contain the depth information, allowing you to sort on a single thing in one pass.
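A sketch of such a key; the exact bit layout below (pass in the top byte, 24-bit material id, 32-bit quantized depth) is my own illustrative assumption, not something specified in the thread:

```cpp
#include <cstdint>

// Assumed layout: [63:56] pass | [55:32] material id | [31:0] quantized depth.
// Sorting the resulting integers groups by pass, then material, then near-to-far depth.
uint64_t makeSortKey(uint8_t pass, uint32_t materialId, float viewDepth, float maxDepth)
{
    // Quantize depth into the low 32 bits, clamped to [0, maxDepth].
    float t = viewDepth / maxDepth;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    uint32_t depthBits = uint32_t(t * 4294967295.0);
    return (uint64_t(pass) << 56) | (uint64_t(materialId & 0xffffffu) << 32) | depthBits;
}
```

With this layout a single integer comparison orders by pass first, then material, then depth, so one sort pass handles everything.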


However, you can build all this into a single sort key where the top 32 bits hold details about pass, material id and other stuff, and the bottom 32 bits contain the depth information, allowing you to sort on a single thing in one pass.

 

Nice.


It is not particularly hard. Avoid fragmenting your data access patterns. Keep objects in small, 64-byte chunks and ensure the chunks are properly aligned.  

 

Your example lists a matrix (64 bytes) and potentially other nested structures. In total you are looking at what, 80 bytes? 92 bytes? Is there a way you can reduce what you store?

 

For example, often it is cheaper to store coordinates and rotation angles, totaling 24 bytes, rather than a full matrix. Doing that requires doing the matrix work somewhere else, and if you can find a way to push that off to the GPU it can result in a bunch of nearly-free operations limited mostly by cache load speed.

 

Another useful tidbit for cache-friendly systems -- such as particle systems -- is to add a 'deleted' or 'dead' flag that means "skip this block".

 

The KISS version of cache friendliness is to keep the data you are iterating over in exactly 64-byte, properly aligned blocks, with each 64-byte block accessed sequentially. You can jump around a little within each block, such as accessing .x, .y, and .z. Marking an entity as 'deleted' or 'dead' means no processing on that block, and you continue to the next in the sequence. As long as you have the 64-byte blocks with obvious sequential loops, the compiler's optimizer and the processor's internals can usually do the rest.
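As a concrete sketch of that KISS version, a cache-line-sized record could look like this; the particular fields are my own illustrative choice, following the coordinates-plus-angles suggestion above:

```cpp
#include <cstdint>

// One renderable record sized and aligned to exactly one 64-byte cache line.
struct alignas(64) PackedRenderable
{
    uint64_t sortKey;      //  8 bytes: packed pass/material/depth key
    float    position[3];  // 12 bytes
    float    rotation[3];  // 12 bytes: angles; the matrix is built on the GPU
    float    scale;        //  4 bytes
    uint32_t flags;        //  4 bytes: includes a 'dead' bit to skip this block
    uint8_t  pad[24];      // pad out to exactly 64 bytes
};
static_assert(sizeof(PackedRenderable) == 64, "must be exactly one cache line");
```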

Edited by frob

Dead flags, while they might seem like a good idea, require some thought, however: always make sure the work you are skipping is worth the cost of the compare and the potentially mispredicted jump, as well as the extra cache space you are taking up. That one flag has bulked your data up by 4 bytes.

In some cases you might be better off keeping the 'dead' flag separate from the objects themselves; that way a single cache line can hold the alive state of 16 objects and the size of your larger structure stays down, meaning more cache wins and fewer overall trips to memory.

In other situations, killing and sorting might be the better win; if you can execute your main loop with the assumption that everything is alive, it makes the logic easier there, and a second sweep can be performed to kill any dead elements and compact your data structures.

I built an experimental particle system on this principle, with the 'simulate and draw' phases assuming all particles are alive, followed by a 'kill' phase which would walk the (separate) lifetime information and do a 'mark and compact' scheme on the particle data. So if you had 1000 particles, you would walk until the first 'dead' particle and remember that point, continue walking until you hit the next 'alive' particle and note that, then walk again until the next 'dead' particle (or the end), at which point you would copy the data for the 'alive' block down (each component being in its own memory location) so that all particles were alive up until that point. Repeat from step two until the end of the particles.

Particle data was just data, so the update/draw phase was nice and fast; the 'are you dead?' phase touched the minimal amount of memory, as lifetime information was in its own place in memory. The copy-down might have had some cost, but it wouldn't have been paid often, it was a trade-off for the fast update, and it was a linear access, so it would have been prefetcher friendly.
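A simplified sketch of that kill phase. This version compacts element by element rather than copying whole alive runs as described above, but the end state is the same: lifetimes live in their own array, and after the sweep everything below the returned count is alive. All names here are my own assumptions:

```cpp
#include <cstddef>
#include <vector>

// SoA particle storage: lifetimes separate from the payload components.
struct Particles
{
    std::vector<float> lifetime;             // <= 0 means dead
    std::vector<float> posX, posY, posZ;     // payload, one array per component
};

// Walk the lifetime array, copying alive elements down over dead ones.
// Returns the new particle count; everything below it is alive.
size_t compactDead(Particles& p)
{
    size_t write = 0;
    for (size_t read = 0; read < p.lifetime.size(); ++read)
    {
        if (p.lifetime[read] <= 0.0f)
            continue; // dead: skip this slot
        if (write != read)
        {
            p.lifetime[write] = p.lifetime[read];
            p.posX[write] = p.posX[read];
            p.posY[write] = p.posY[read];
            p.posZ[write] = p.posZ[read];
        }
        ++write;
    }
    p.lifetime.resize(write);
    p.posX.resize(write);
    p.posY.resize(write);
    p.posZ.resize(write);
    return write;
}
```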

(This was a few years ago, but recently an article surfaced which backed up my idea: someone tested the cost of sorting a huge array plus linear access vs. jumping about in memory (including pools and the like); short version - sort + linear access won.)

But, as with all things, it requires some thought; don't go blindly throwing in 'dead' flags but at the same time don't blindly throw the idea out either.

Know your data.
Know your processing.


But if you use the method you mentioned earlier, with a large 64-bit sort key, couldn't the 'alive' flag be the topmost bit, so it's already included in the sorting pass? That way you wouldn't have to sort by depth and transparency and aliveness separately - it's one sort - and once sorted, you can locate the dividing point between dead and alive before you draw, so you aren't branching during the draw itself.

That assumes one draw call per object. Besides, by the time you've got to the 'sort draw calls' stage you'll have already done a lot of dead-object removal, so you should never see a 'dead' object in the draw-call lists you sort.

At the highest 'game' level you'd be tracking the game entity with which any attached renderables (1 or more draw calls) are associated; when these die, the renderer never sees them.

Vis-culling per "camera", again above renderer submission, takes care of visible objects for a given scene.

Only once you get beyond vis-culling do you start breaking renderables down into their draw-call components and start sorting them and routing them to the correct passes for a scene.


Note that sort order ought to be based on binding times.

 

Slowest to fastest, these appear to be (someone correct me if I'm wrong):

 

1. texture

2. mesh

3. material

4. constants

5. transforms

6. other flags

 

So for no alpha blend, sort on texture, then mesh, then material, then constants (perhaps?), then near to far; then draw your instances with your various transforms, setting other flags on the fly as needed (using a state manager, of course).

 

For alpha blend it's the same, but sort far to near.
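Expressed as a comparator over a hypothetical draw-item struct (the field names are my own, and this assumes the binding-cost ordering listed above), the opaque ordering would look something like:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical draw-item record; field names are illustrative.
struct DrawItem
{
    uint32_t texture, mesh, material; // handles into resource arrays
    float    depth;                   // view-space depth
};

// Sort opaque items slowest-binding-first (texture, mesh, material), then near to far.
void sortOpaque(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.texture != b.texture)   return a.texture < b.texture;
        if (a.mesh != b.mesh)         return a.mesh < b.mesh;
        if (a.material != b.material) return a.material < b.material;
        return a.depth < b.depth;     // near to far for opaque geometry
    });
}
```

For the alpha-blended list you'd flip only the final comparison to sort far to near.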

 

I personally don't sort on constants, as I'm not using shader code in my current project; one of the shader coders here can tell you whether it's worth it. It may depend on the shader model used - as I recall, constant bind times were slow in some early shader models - but I defer to those here with more shader experience who may be able to elaborate on that point...

Edited by Norman Barrows


Okay! I had no internet access for about a week (feels like a year). This is all super good advice and I will definitely take it into consideration moving forward. However, it just occurred to me: I'm unsure how I want to produce and store scene depth during the pre-render transform pass.

 

I'm quite sure I can't just use the normal transforms as they are, because they're probably in world space or the object's own space.

 

Should I, as an extra step, transform everything by the camera's view matrix to produce the depth from the camera's point of view, and then store that and use it for sorting?

 

My intent is to store my depth buffer linearly, the way MJP describes in his excellent tutorials.

 

See, last time I did something like this, I did no sorting at all; I just did all transforms in the vertex shader, where I'd multiply every vertex by WVP or whatever was needed, and let the depth buffer sort out visibility, painter's-algorithm style.

 

To roughly sketch out what I'm imagining here:

 

Game's pre-render update pass:

 

for (each renderable)
{
    // EITHER THIS:
    // transform the renderable's position into camera view space and take the depth from the camera's POV
    float renderableDepth = (camera.viewMatrix * renderable.position).z;

    // OR THIS:
    // simply calculate a rough distance between camera and renderable. When we've reached this point,
    // we've already culled away all the objects outside the camera frustum, so it should be pretty OK?
    float renderableDepth = vector3Distance(renderable.position, camera.position); // returns a length as a single float

    // No matter the method, we'd finally quantize the depth and encode it in the low bits of the flags variable.
    // (Don't pay too much attention to how I pack it. I can never remember bit-shifting syntax without looking it up.)
    renderable.flags |= uint32_t(renderableDepth) & 0x0000ffff;
}

 

Then later on:

void Sort(all the renderables)
{
    for (each renderable)
    {
        // Sort based on... the flag? Just straight up sort on whichever value is lowest,
        // so that the lowest value also indicates the first textures/materials/meshes?
        // Perhaps first compare on textures, then meshes etc., as was suggested above?
    }

    for (each renderable)
    {
        // Then, after it's been sorted once by the first 32 bits of the flag
        // (where we'd possibly store all those things), we sort again by depth?
    }
}

 

I'm sorry for being so dense. ;)

 

I am also of course aware that I'll have to profile this to see if I even gain anything at all by sorting, but I sort of just want to try making a system of this sort either way, as I think it could be very useful to understand the techniques.

Edited by Lemmi

For distance you'll want the second version; however, instead of working out the distance, stick with the squared distance - it's cheaper to calculate, as it doesn't need the square root operation, and it does the same job.
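For example (a trivial sketch; since the square root is monotonic for non-negative values, comparing squared distances produces the same sort order):

```cpp
// Squared distance between two points; skips the sqrt but sorts identically,
// because x -> sqrt(x) preserves ordering for non-negative x.
float distanceSquared(const float a[3], const float b[3])
{
    float dx = a[0] - b[0];
    float dy = a[1] - b[1];
    float dz = a[2] - b[2];
    return dx * dx + dy * dy + dz * dz;
}
```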


For distance you'll want the second version; however, instead of working out the distance, stick with the squared distance - it's cheaper to calculate, as it doesn't need the square root operation, and it does the same job.

You're right and I agree! I'll also go ahead and assume that my thoughts surrounding the sorting approach are at least somewhat on the right track. I'll carefully re-read all the posts before acting.

