Optimum Programmable Shader Flow

Started by WedgeMan
8 comments, last by WedgeMan 13 years, 4 months ago
Hi All,

I'm making a 3D rendering engine from scratch. I'm under no illusions here, I know it won't be used for a triple A title or anything like that, but it is the best way to learn how it all works under the hood.

I've written a fixed function pipeline engine before in software (CPU) and now I want to write a programmable pipeline with vertex and fragment shaders.

Having never worked with a GPU before, I'm curious about a fundamental architecture decision.

Say I have 3D objects in my scene. Each object has a collection of vertices, indices, and UVs, plus a set of textures to describe how that object is supposed to look.

In my main render loop, should I be iterating through all of my visible objects (we'll assume object culling has already happened) and telling each object to render itself?

i.e. pass the device into the object, upload the vertex and index buffers to the GPU, upload the shader and the constants/parameters etc., then render the object and move on to the next object.

While this approach seems to be flexible and easy to conceptualize, it worries me that speed will be a major issue and that marshaling the individual vertex/index buffers and shaders to the GPU will be inefficient.

The second option then is the reverse, where I loop through all of my visible objects and acquire their vertices/indices and store them in one giant buffer. I then upload all the vertices in one batch.

The problem with that is that different objects have different materials attached, and thus different shaders, and I'm not sure what the flow would be to ensure that the correct triangles are associated with the right shaders.

Can anyone help me out with the ideal flow here?

Definitely appreciated.

Thanks!
Heya,

One thing that's really important is to minimize state changes (which is also true in fixed function pipeline).

Because of this, the higher end rendering engines will sort objects based on their rendering properties so that objects using the same shaders will get rendered at the same time.

There's a rough list of the order in which things get more expensive. I don't remember what it is, but hopefully someone else can chime in (and also give you info about other things you should be watching out for or doing differently).

For instance, changing parameters on a shader is less expensive than changing the shader itself, so you would sort the list accordingly.
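
As a rough illustration (not from the post above), here is a hedged C++ sketch of sorting draw submissions by a packed state key, with the most expensive state changes in the most significant bits. The DrawItem fields and bit layout are assumptions you would tune to your own engine:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record; the fields are assumptions, not a fixed API.
struct DrawItem
{
    uint16_t shaderId;    // most expensive to change, so it dominates the key
    uint16_t materialId;  // textures + constants, cheaper than a shader switch
    uint32_t meshId;      // vertex/index buffer binding
    // ... transform, submesh range, etc.

    // Pack the state into one sortable key: shader first, then material, then mesh.
    uint64_t SortKey() const
    {
        return (uint64_t(shaderId) << 48) |
               (uint64_t(materialId) << 32) |
               uint64_t(meshId);
    }
};

void SortForMinimalStateChanges(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b)
              { return a.SortKey() < b.SortKey(); });
}
```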

Another thing is that some people render their scene once to the Z buffer only, then render again to the screen using an equal comparison for the Z func.

What this does is make it so pixel shaders only run for visible pixels. Rendering only to the Z buffer is quite a fast operation, so it's a good way to make sure you aren't taking the time to shade a pixel and then just overwriting it with another pixel.

That would only matter if your game had a lot of pixel over-draw though.
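
Here's a hedged D3D9-style sketch of that depth pre-pass (one possible API among several), assuming an already-created IDirect3DDevice9* and hypothetical DrawOpaqueGeometry/DrawShadedGeometry helpers:

```cpp
#include <d3d9.h>

// Hypothetical helpers standing in for your own draw submission.
void DrawOpaqueGeometry(IDirect3DDevice9* device);
void DrawShadedGeometry(IDirect3DDevice9* device);

void RenderWithDepthPrePass(IDirect3DDevice9* device)
{
    // Pass 1: lay down depth only. Disable color writes, keep the normal Z test.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
    DrawOpaqueGeometry(device);

    // Pass 2: full shading, but only where the depth matches the pre-pass result,
    // so the pixel shader runs once per visible pixel.
    device->SetRenderState(D3DRS_COLORWRITEENABLE,
        D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
        D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    device->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
    DrawShadedGeometry(device);
}
```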

And my last piece of advice is to avoid branching and looping in your shaders (if statements and loops).

When you hit an if statement, many cards will compute BOTH paths of the if and throw out the one they don't need (or exhibit other wasteful/slow behavior, including poor branch prediction).

Hope this helps some!!
Quote: Original post by WedgeMan

i.e. pass the device into the object, upload the vertex and index buffers to the GPU, upload the shader and the constants/parameters etc., then render the object and move on to the next object.

While this approach seems to be flexible and easy to conceptualize, it worries me that speed will be a major issue and that marshaling the individual vertex/index buffers and shaders to the GPU will be inefficient.

The second option then is the reverse, where I loop through all of my visible objects and acquire their vertices/indices and store them in one giant buffer. I then upload all the vertices in one batch.
Thanks!


I'm in the same situation, in that I've done DX9 fixed-pipeline stuff before, but I'm now trying to get my head around DX10 and shaders.

For the quoted part, from what I can gather: you create the vertex and index buffers when the object is created, which stores them in the graphics card's memory, and rendering is then just a case of switching between them; you don't need to 'upload' them each frame. Pointers to the buffers are stored by the object, and the draw routine iterates through the objects, querying them for the location of their vertex and index buffers. As long as you aren't making something like GTA IV, you don't have to worry about releasing buffers to free memory (except on program termination, of course).

The way I've done shaders is to have a shaderManager class which returns a pointer to a shader when requested by an object's initialisation routine, which the object then stores. The manager keeps track of the shaders it has already loaded, so if the same one gets requested again it just returns a pointer to the existing one rather than loading it again.
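
A minimal C++ sketch of that caching idea, assuming a hypothetical Shader wrapper type and LoadShaderFromFile loader; the names are illustrative, not from the post:

```cpp
#include <map>
#include <memory>
#include <string>

struct Shader { /* your engine's shader wrapper: bytecode, API handles, etc. */ };

// Hypothetical loader; implement with your graphics API of choice.
std::unique_ptr<Shader> LoadShaderFromFile(const std::string& path);

class ShaderManager
{
public:
    // Returns a cached shader if the file was loaded before, otherwise loads it once.
    Shader* GetShader(const std::string& path)
    {
        auto it = m_shaders.find(path);
        if (it != m_shaders.end())
            return it->second.get();

        std::unique_ptr<Shader> shader = LoadShaderFromFile(path);
        Shader* raw = shader.get();
        m_shaders.emplace(path, std::move(shader));
        return raw;
    }

private:
    std::map<std::string, std::unique_ptr<Shader>> m_shaders;
};
```
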
You don't upload vertex/index data every time you draw static geometry, because that would be wasteful. Instead when loading objects, you create GPU resources (vertex + index buffers) using your API and then fill those resources with your mesh data. Then at runtime when you draw something, you call commands specifying that you want to use a particular vertex + index buffer and then issue a Draw command.
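
To make that concrete, here is a hedged D3D9-flavored sketch (one possible API; the thread spans DX9 and DX10) of creating the buffers once at load time and only binding and drawing them at runtime. The Vertex layout, Mesh struct, and helper names are assumptions, not anything from the posts above:

```cpp
#include <d3d9.h>
#include <cstring>
#include <vector>

struct Vertex { float x, y, z; float u, v; };   // assumed layout
const DWORD kVertexFVF = D3DFVF_XYZ | D3DFVF_TEX1;

struct Mesh
{
    IDirect3DVertexBuffer9* vb = nullptr;
    IDirect3DIndexBuffer9*  ib = nullptr;
    UINT vertexCount = 0;
    UINT triangleCount = 0;
};

// Load time: create the GPU resources once and copy the mesh data into them.
bool CreateMesh(IDirect3DDevice9* device,
                const std::vector<Vertex>& vertices,
                const std::vector<WORD>& indices,
                Mesh& out)
{
    const UINT vbSize = UINT(vertices.size() * sizeof(Vertex));
    const UINT ibSize = UINT(indices.size() * sizeof(WORD));

    if (FAILED(device->CreateVertexBuffer(vbSize, D3DUSAGE_WRITEONLY, kVertexFVF,
                                          D3DPOOL_MANAGED, &out.vb, nullptr)))
        return false;
    if (FAILED(device->CreateIndexBuffer(ibSize, D3DUSAGE_WRITEONLY, D3DFMT_INDEX16,
                                         D3DPOOL_MANAGED, &out.ib, nullptr)))
        return false;

    void* data = nullptr;
    out.vb->Lock(0, vbSize, &data, 0);
    std::memcpy(data, vertices.data(), vbSize);
    out.vb->Unlock();

    out.ib->Lock(0, ibSize, &data, 0);
    std::memcpy(data, indices.data(), ibSize);
    out.ib->Unlock();

    out.vertexCount = UINT(vertices.size());
    out.triangleCount = UINT(indices.size() / 3);
    return true;
}

// Each frame: just bind the existing buffers and issue the draw call.
void DrawMesh(IDirect3DDevice9* device, const Mesh& mesh)
{
    device->SetFVF(kVertexFVF);
    device->SetStreamSource(0, mesh.vb, 0, sizeof(Vertex));
    device->SetIndices(mesh.ib);
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                 mesh.vertexCount, 0, mesh.triangleCount);
}
```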

Materials + meshes is more of an organizational thing. The simplest setup would be to have a 1:1 relation between materials and meshes. Then you only need one vertex/index buffer per mesh, and you just set your material parameters (shaders, textures, render states) before drawing each mesh and issue the Draw call. If multiple materials per mesh is desirable, it's still possible to do it and keep one vertex/index buffer per mesh. Typically you just organize your data such that each material is a group of contiguous indices, and then you draw your mesh in pieces with multiple Draw calls by specifying the start index and the number of primitives.
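
And a hedged continuation of that sketch for the multiple-materials case: each material owns a contiguous range of indices in the shared index buffer, and the mesh is drawn in pieces with one Draw call per range. SubMesh and ApplyMaterial are hypothetical names; the Mesh/Vertex types are reused from the sketch above:

```cpp
#include <d3d9.h>
#include <vector>

// Each submesh covers a contiguous run of indices in the shared index buffer.
struct SubMesh
{
    UINT materialId;
    UINT startIndex;      // first index of this material's range
    UINT primitiveCount;  // number of triangles in the range
};

// Hypothetical: sets the shaders, textures, and render states for a material.
void ApplyMaterial(IDirect3DDevice9* device, UINT materialId);

void DrawMeshByMaterial(IDirect3DDevice9* device, const Mesh& mesh,
                        const std::vector<SubMesh>& subMeshes)
{
    device->SetFVF(kVertexFVF);
    device->SetStreamSource(0, mesh.vb, 0, sizeof(Vertex));
    device->SetIndices(mesh.ib);

    // One vertex/index buffer, several Draw calls: one per material range.
    for (const SubMesh& sub : subMeshes)
    {
        ApplyMaterial(device, sub.materialId);
        device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                     mesh.vertexCount,
                                     sub.startIndex, sub.primitiveCount);
    }
}
```
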
Like MJP said, treat "static" meshes (i.e. geometry that doesn't change, deform, or animate at the vertex level) the same way you treat textures: load the vertex and index buffers onto the GPU at load time and just bind them before you render.

If you want optimal batching, you need an intermediate layer between the objects' render() method and the actual rendering: instead of having the objects render themselves, make them simply add entries to a render list. An entry to the render list needs only a pointer to the vertex/index buffer to be rendered, a transform matrix and the material (which holds textures, shaders and constants). When drawing objects with multiple meshes/materials, just have it submit more items to the render list.

You can also have different types of entries, or different lists for different types (at the very least: opaque and translucent). After the render lists are generated, you sort them. Translucent stuff needs to be sorted by distance (you can store the "sort pivot" on the render list entry), while opaque items are sorted by material and vertex buffers. This allows batching to work even if your objects use multiple materials: all meshes using the same material will be rendered together, no matter which object they came from.
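
A hedged C++ sketch of one possible shape for such a render list; the RenderEntry fields, the packed material key, and the split into two lists are assumptions about how you might lay it out:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Material;         // your material: shaders, textures, constants
struct GeometryBuffers;  // your vertex/index buffer pair
struct Matrix4 { float m[16]; };

struct RenderEntry
{
    const GeometryBuffers* geometry;
    const Material*        material;
    Matrix4                transform;
    uint32_t               materialKey; // e.g. shader id + texture id packed together
    float                  sortDepth;   // "sort pivot": distance from the camera
};

struct RenderLists
{
    std::vector<RenderEntry> opaque;
    std::vector<RenderEntry> translucent;

    void Sort()
    {
        // Opaque: group by material/buffers so identical state is rendered together.
        std::sort(opaque.begin(), opaque.end(),
                  [](const RenderEntry& a, const RenderEntry& b)
                  { return a.materialKey < b.materialKey; });

        // Translucent: back-to-front by distance so blending composes correctly.
        std::sort(translucent.begin(), translucent.end(),
                  [](const RenderEntry& a, const RenderEntry& b)
                  { return a.sortDepth > b.sortDepth; });
    }
};
```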

For better performance, use a fast compact allocator for the render list entries (since you'll be allocating tons of them every frame) so iterating through the list is as fast and cache-friendly as possible.

If you use a thread-safe queue/list, you can submit entries to the render list from multiple threads.
Hi all,

Thanks for the responses, this has been super helpful.

I believe I understand the approach now, but obviously the only way to know for sure is to do it and see the results so I'll go do that now.

The one other question I had was about an Uber Shader.

In talking with quite a few people about switching to a programmable pipeline and avoiding the switching of shaders, the concept of an Uber Shader has come up.

Essentially it's a shader that handles nearly everything you'd want a shader to do, with default values of zero if you don't need them. This would have more overhead in terms of calculations that amount to nothing, but probably not as much as switching the shader program on the GPU.

Is this a generally frowned upon or encouraged practice?
What I'm doing now is have models create their meshes at load time and store a reference for later use. When it's time to render, they (or the client program) must handle state management themselves. This resulted in a nice, simple, flexible renderer, but the client has a fair amount of work to do (well, not that much, but enough).

I plan on doing something similar to Manoel in the next iteration. Rather than drawing immediately, the renderer will, when it's ready to draw, sort the command queue into different buckets depending on the render strategy required. If a mesh is used multiple times, the render commands will be concatenated into instanced DIP calls. The renderer will then begin its scene and draw from the optimized command queue.

As for the list atrix mentioned, I found this a while back:

API Call                               Average cost (# of cycles)
SetVertexDeclaration                   6500 - 11250
SetVertexShader                        3000 - 12100
SetPixelShader                         6300 -  7000
SetRenderTarget                        6000 -  6250
SetPixelShaderConstant (1 constant)    1500 -  9000
SetVertexShaderConstant (1 constant)   1000 -  2700
SetStreamSource                        3700 -  5800
SetIndices                              900 -  5600
SetTexture                             2500 -  3100
DrawIndexedPrimitive                   1200 -  1400
DrawPrimitive                          1050 -  1150
ZFUNC                                   510 -   700
ZENABLE                                 700 -  3900
ZWRITEENABLE                            520 -   680
Quote: Original post by WedgeMan
The one other question I had was about an Uber Shader.

In talking with quite a few people about switching to a programmable pipeline and avoiding the switching of shaders, the concept of an Uber Shader has come up.

Essentially it's a shader that handles nearly everything you'd want a shader to do, with default values of zero if you don't need them. This would have more overhead in terms of calculations that amount to nothing, but probably not as much as switching the shader program on the GPU.

Is this a generally frowned upon or encouraged practice?

That's not really an "ubershader". If you use a gigantic shader that always processes all of your material effects but "disables" some of them by zeroing their output values, even simple materials will be as expensive to render as your most complex ones.

An ubershader is a way of generating several different shaders from one single shader file, by compiling the same shader using different #define statements so you switch parts of the shader on/off depending on each material's needs.
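
For illustration, a hedged D3D9-era sketch of compiling one variant of such an ubershader with D3DXCompileShaderFromFile and a per-material list of #defines; the uber.fx filename, entry point, macro names, and ps_3_0 target are all assumptions:

```cpp
#include <d3d9.h>
#include <d3dx9.h>

// Compile one pixel shader variant of "uber.fx" with a material-specific set of
// #defines, e.g. { {"USE_NORMAL_MAP","1"}, {"USE_SPECULAR","1"}, {NULL,NULL} }.
IDirect3DPixelShader9* CompileUberVariant(IDirect3DDevice9* device,
                                          const D3DXMACRO* defines)
{
    ID3DXBuffer* code = nullptr;
    ID3DXBuffer* errors = nullptr;

    HRESULT hr = D3DXCompileShaderFromFile(
        "uber.fx",       // single source file shared by all variants
        defines,         // NULL-terminated list of #defines for this material
        nullptr,         // no #include handler
        "PSMain",        // entry point (assumed name)
        "ps_3_0",        // target profile
        0, &code, &errors, nullptr);

    if (FAILED(hr))
    {
        if (errors) errors->Release();
        return nullptr;
    }

    IDirect3DPixelShader9* shader = nullptr;
    device->CreatePixelShader(static_cast<const DWORD*>(code->GetBufferPointer()),
                              &shader);
    code->Release();
    return shader;
}
```
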
One thing to remember when dealing with performance best practices is that it is a game with trump cards:

1) local observed results trump internet observed results (*)
2) observed results (of any kind) trump theoretical performance (**)

(*) this presumes that you have access to all the hardware your product/project will run on
(**) this presumes that you are releasing a project/product on a deadline. Over a long enough time line, theory usually wins... sometimes. :)

With that in mind, test, test, test and re-test. The best practice for growing as a rendering programmer (or any performance centric programmer) is to learn how to test performance accurately.

...

Now (ahem) my gut (wrong 50% of the time, and trumped two ways) says that the uber-shader method will be slower. Most modern GPUs handle state changes pretty well, though it is still something to look at and wrangle if needed. The major GPU vendors (nVidia, ATI/AMD, Intel) all have different nuances that affect shader performance, so it can't be an apples-to-apples comparison.

One last nugget of thought: You have control over your state changes. You can minimize them across _all_ GPUs through architecture and proper data planning/management.

You don't have control over your hardware (even programmable hardware). The long list of vendor-specific, GPU-accelerated features all change the performance characteristics.

This is enough, in my book, to go with a potentially sub-optimal solution of reducing state changes through architecture and data manipulation.

Will your shader's zeroed contributions cost you nothing? Maybe. What if you use branching? Maybe. Will your uber shader run acceptably on 75% of the cards and completely bomb on the other 25%? Probably.

Beware optimization by addition (bigger shaders will save the rendering engine! a complex silver-bullet solution will work!), and favor optimization by subtraction (fewer state changes for all).

-Wandering Bort

Well this has been exceptionally useful, thank you all for your comments.

Things are progressing along nicely and I should be able to start stress testing at a geometry level soon.

This topic is closed to new replies.
