Rendering design thoughts

Quote:Original post by JorenJoestar
In a multithreaded environment... how would you manage culling? Is a simple double-buffered list an effective solution? Working on next-frame visibility... can that be a good solution?

Also... there can be different bounding hierarchies based on the type of objects you want to draw (static meshes, terrain, dynamic objects...), hence the need for an abstract way of thinking about it.
A mesh, model or whatever could contain visibility information.


You don't need to abstract or centralize those concepts. I'd say keep your architecture simple and flexible, and the features will come in easily.

For example, having visibility information in a mesh is usually not great for cache locality, as you don't need the mesh data while computing visibility. Also, a mesh is too abstract a concept to hang visibility on; as you pointed out, you might use meshes for many things.

I like to have rendering entities that manage the entire rendering of a given feature, for example the players in a soccer game. The entity will know what data to load, how to perform culling and so on.

Then, if that functionality is needed across entities, you'll abstract it into a shared service. Working in an MT environment is not complicated this way. You could simply run the evaluation of the "prerender" pass of each entity in parallel. If you have shared services, you update them in parallel before their users (the entities again).
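As a rough sketch of that idea (all the names here, RenderEntity, PreRender, Camera, are made up for illustration, assuming C++17 parallel algorithms):

#include <algorithm>
#include <execution>
#include <vector>

struct Camera;  // whatever view/frustum data the game provides

// Hypothetical interface: each entity owns the full rendering of one feature
// (e.g. "the soccer players"), including its own culling and data loading.
struct RenderEntity {
    virtual ~RenderEntity() = default;
    virtual void PreRender(const Camera& cam) = 0;  // culling, LOD selection, data prep
};

void PreRenderAll(std::vector<RenderEntity*>& entities, const Camera& cam) {
    // Shared services are assumed to be updated before this call, so each
    // entity's prerender pass only reads shared data and can run in parallel.
    std::for_each(std::execution::par, entities.begin(), entities.end(),
                  [&](RenderEntity* e) { e->PreRender(cam); });
}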

We tend to look at the bigger picture too much when designing engines. For most games it's not worth it. You want the ability to run some concepts in parallel, but not to formalize what those things are, because each game requires different technologies. Only if you're making an FPS or a similarly artist-driven game will you go towards something generic. And anyway, I always like to be able to opt out and code my specific rendering stuff when needed.
Quote:Original post by JorenJoestar
The only problem I found in this list of command types (there are more commands, like begin/end scene, clear buffers...) is with lock/unlock commands, or any command that needs to know information about other resources.


Eh. In practice you'll face many more problems :)

First of all, DirectX and OpenGL are state machines. So recording a command buffer means you have to fix an assumption about the state of the machine at the beginning of the recording, and keep that assumption valid when you play the recorded stuff. That might lead to unnecessary state setting, in order to reset the state to a known default each time before you play. Also, choosing that default is not always easy.
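A tiny illustration of the issue, with a made-up state block and device (not any real API):

// Hypothetical abstraction; real engines track far more state than this.
struct RenderState {
    bool depthTest  = true;
    bool alphaBlend = false;
    int  cullMode   = 1;   // e.g. back-face culling
};

struct CommandBuffer { /* recorded commands, assumed to start from the default state */ };

struct Device {
    void ApplyState(const RenderState& s) { (void)s; /* set every state, possibly redundantly */ }
    void Execute(const CommandBuffer& cb) { (void)cb; /* play back the recorded commands */ }
};

// The recording fixed an assumption: the machine is in kDefaultState when it starts.
static const RenderState kDefaultState{};

void PlayRecordedBuffer(Device& dev, const CommandBuffer& cb) {
    dev.ApplyState(kDefaultState);  // reset, even if it means redundant state setting
    dev.Execute(cb);
}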

And then, when you start doing it in practice, you'll notice even more problems: depending on the platform you're targeting, not all commands can be recorded.

In the end it's not impossible, far from it. But it is not so convenient.
Quote:Original post by JorenJoestar
Quote:Original post by kenpex
Most of the time you don't want to record DirectX command buffers directly, because then you can't sort render calls, meaning that each thread has to work on parts of the scene that are independent of the others. Often that's not the case.


That's OK, I don't want to... but I do want to let different threads add commands to a (maybe per-thread) queue that is then merged and sorted!
At what level of granularity are you creating commands?
Say we have a very basic design with an abstract renderer and some implementations (DirectX9, OpenGL...): do you provide atomic operations that become the commands?

Quote:Original post by kenpex
If you want to go for DirectX command recording (or, anyway, recording native render calls), then you probably at least want worker threads that prepare all the rendering data in parallel; the data then becomes read-only, and the command-recording threads access that read-only data to create the different scene segments.

This is exactly the implementation made by Gamebryo (with plenty of source code, examples and papers), but it is something I don't like very much.
API abstraction is essential for me; I want to create something that can handle different APIs!

Quote:Original post by kenpex
Also, when you are thinking about such an architecture, pay attention to cache misses and branch mispredictions. Some source I've seen in this thread is naive in that respect.


This is a REALLY GOOD point.
The source posted is only there to give the idea, but do you have suggestions for dealing with cache misses and branch mispredictions?
I feel that using function pointers to speed up command execution (command id = index into an array of function pointers...) could be a nuclear bomb for branch prediction.
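For reference, a minimal sketch of the kind of dispatch described here (command id indexing an array of function pointers); the names are only illustrative, and whether the indirect call actually hurts depends on how the command stream is laid out:

#include <cstdint>
#include <vector>

struct Command {
    uint8_t  id;     // index into the dispatch table
    uint32_t data;   // index/handle into command-specific data (illustrative only)
};

using CommandFn = void (*)(const Command&);

void ExecSetTexture(const Command&) { /* API-specific code */ }
void ExecSetMesh   (const Command&) { /* API-specific code */ }
void ExecDraw      (const Command&) { /* API-specific code */ }

// id -> function; the indirect call is the branch the predictor has to guess.
static const CommandFn kDispatch[] = { ExecSetTexture, ExecSetMesh, ExecDraw };

void Execute(const std::vector<Command>& commands) {
    for (const Command& c : commands)
        kDispatch[c.id](c);
}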

What are your thoughts about code execution? Would you like to explain your thoughts about your design further?

Thanks!


Eh, I can't really say much because it depends on your application. You have multiple choices and no "right way"; there are always trade-offs. In general it's not true that using command buffers to record native commands makes you API dependent: you could simply abstract the recording API and issue the native commands from a device abstraction layer that you probably already have. So even if you're recording native stuff, you can still be platform independent.

To me the major drawback of using command buffers directly is that you can't sort them. So I prefer to have another layer: record some drawing-primitive information, then sort it, then generate command buffers in parallel, then play them.

I've already explained one scheme to do that on my blog. From what I can see it's good; the only drawback I see is that you are basically working with handles all the time, so you pay for the indirection on the cache when going from the handles to the resources...

An alternative to avoid that is to record abstracted commands that embed pointers to the native resources. In my engine test, a draw command is a short bit string made of handles, which is both the command and its sorting key.

I.e. a command is, for example:
framebuffer handle... texture handle... mesh handle
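As a sketch, assuming small handle ranges (the bit widths are arbitrary), the packed integer is at once the command and the value you sort by:

#include <cstdint>

// One 64-bit value is both the draw command and its sort key.
// The bit layout is illustrative: high bits sort first (framebuffer, then
// texture/material, then mesh), so sorting groups the expensive state changes.
inline uint64_t MakeDrawKey(uint32_t framebuffer,  // e.g. 8 bits
                            uint32_t texture,      // e.g. 16 bits
                            uint32_t mesh)         // e.g. 16 bits
{
    return (uint64_t(framebuffer & 0xFF)   << 56) |
           (uint64_t(texture     & 0xFFFF) << 40) |
           (uint64_t(mesh        & 0xFFFF) << 24);
}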

An alternative is to record commands/pointers + a sort key for all of them. That takes more space, but avoids the indirection. To do the same draw, you'll record something like

settexture...pointer + sortkey
setmesh...pointer + sortkey

If the sort is stable, then you can rearrange your recorded abstracted commands (something you can't do with the native ones) without paying any extra cache misses. The downside is that your record buffer can be longer (more misses!), and the whole thing is less abstracted (which could be a good thing!).

Notice that in this scheme all the sort keys can be stored in a separate array, which makes sense as they're only used in the sorting pass. Also, you could still cull redundant commands when recording, making sure your recorded stream doesn't get too big. Deriving the right sort keys can be a bit of a problem though.
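A sketch of this second scheme, keys kept in their own array and a stable sort used to reorder the abstracted commands (container and field names are assumptions):

#include <algorithm>
#include <cstdint>
#include <vector>

struct AbstractCommand { void* resource; uint8_t type; };  // settexture, setmesh, draw...

struct KeyedIndex {
    uint64_t sortKey;  // derived from render target / material / depth, etc.
    uint32_t index;    // position of the command in the recorded stream
};

void SortCommands(std::vector<AbstractCommand>& commands,
                  std::vector<KeyedIndex>& keys)   // keys live in their own array
{
    // A stable sort keeps the relative order of commands with equal keys,
    // which is what lets set/draw pairs stay together after reordering.
    std::stable_sort(keys.begin(), keys.end(),
                     [](const KeyedIndex& a, const KeyedIndex& b) {
                         return a.sortKey < b.sortKey;
                     });

    std::vector<AbstractCommand> reordered;
    reordered.reserve(commands.size());
    for (const KeyedIndex& k : keys)
        reordered.push_back(commands[k.index]);
    commands.swap(reordered);
}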
Quote:Original post by JorenJoestar
Is a simple double-buffered list an effective solution? Working on next-frame visibility... can that be a good solution?


How much latency can you afford? I'd say that if your game runs at 30 Hz you probably don't want to add much; if it's at 60 Hz it can be just fine. In general this choice should depend on the game and not be fixed in the engine. The engine should only provide a way of doing MT work that is not just directly using threads. The engine should provide services; the rendering should depend on the specific game.
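For reference, a minimal sketch of the double-buffered visibility idea from the question, assuming one frame of latency is acceptable (names are illustrative):

#include <cstdint>
#include <vector>

// Two visibility lists: the renderer consumes the one computed last frame
// while a worker fills the other for the next frame. One frame of latency.
struct VisibilityBuffers {
    std::vector<uint32_t> lists[2];   // indices of visible objects
    int readIndex = 0;

    std::vector<uint32_t>&       writeList()       { return lists[readIndex ^ 1]; }
    const std::vector<uint32_t>& readList()  const { return lists[readIndex]; }

    void swap() { readIndex ^= 1; }   // called once per frame, at a sync point
};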
Quote:Original post by kenpex
Eh. In practice you'll face many more problems :)

First of all, DirectX and OpenGL are state machines. So recording a command buffer means you have to fix an assumption about the state of the machine at the beginning of the recording, and keep that assumption valid when you play the recorded stuff. That might lead to unnecessary state setting, in order to reset the state to a known default each time before you play. Also, choosing that default is not always easy.

And then, when you start doing it in practice, you'll notice even more problems: depending on the platform you're targeting, not all commands can be recorded.

In the end it's not impossible, far from it. But it is not so convenient.


Yes, this is an option I don't like very much; I prefer (as you wrote on your blog) an abstraction layer with sortable drawing commands, where the commands then call API-dependent code.
Looking at Emergent's presentation, it is clear that methods that retrieve information, and locks, are not possible as low-level commands (the d3d device recorder), but I think that in an abstract command configuration this is possible.
I definitely have to try it out! Testing is the best test!



Quote:Original post by kenpex
Eh, I can't really say much because it depends on your application. You have multiple choices and no "right way"; there are always trade-offs. In general it's not true that using command buffers to record native commands makes you API dependent: you could simply abstract the recording API and issue the native commands from a device abstraction layer that you probably already have. So even if you're recording native stuff, you can still be platform independent.

To me the major drawback of using command buffers directly is that you can't sort them. So I prefer to have another layer: record some drawing-primitive information, then sort it, then generate command buffers in parallel, then play them.

I've already explained one scheme to do that on my blog. From what I can see it's good; the only drawback I see is that you are basically working with handles all the time, so you pay for the indirection on the cache when going from the handles to the resources...

An alternative to avoid that is to record abstracted commands that embed pointers to the native resources. In my engine test, a draw command is a short bit string made of handles, which is both the command and its sorting key.

I.e. a command is, for example:
framebuffer handle... texture handle... mesh handle

An alternative is to record commands/pointers + a sort key for all of them. That takes more space, but avoids the indirection. To do the same draw, you'll record something like

settexture...pointer + sortkey
setmesh...pointer + sortkey

If the sort is stable, then you can rearrange your recorded abstracted commands (something you can't do with the native ones) without paying any extra cache misses. The downside is that your record buffer can be longer (more misses!), and the whole thing is less abstracted (which could be a good thing!).

Notice that in this scheme all the sort keys can be stored in a separate array, which makes sense as they're only used in the sorting pass. Also, you could still cull redundant commands when recording, making sure your recorded stream doesn't get too big. Deriving the right sort keys can be a bit of a problem though.


Maybe you can wrap the API-dependent resources and allocate the wrappers with a custom memory manager, so you can play with pointer logic (base + id) to get a direct pointer to the wrapper.
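A sketch of that wrapping idea, assuming the wrappers are small and allocated out of one contiguous pool so a small index recorded in a command can stand in for the pointer (everything here is hypothetical):

#include <cstdint>
#include <vector>

// Small wrapper around an API-specific resource (IDirect3DTexture9*, GLuint, ...).
struct TextureWrapper {
    void* nativeResource;
    // ...sampler state, size, format, etc.
};

// All wrappers live in one pool, so a 16/24-bit index recorded in a command
// can be turned back into a direct pointer with base + id.
// Note: pointers are only valid until the pool grows (or use a fixed-capacity pool).
struct TexturePool {
    std::vector<TextureWrapper> items;

    uint32_t add(const TextureWrapper& w) {
        items.push_back(w);
        return uint32_t(items.size() - 1);
    }
    TextureWrapper* resolve(uint32_t id) { return &items[id]; }  // base + id
};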


Quote:Original post by kenpex
How much latency can you afford? I'd say that if your game runs at 30 Hz you probably don't want to add much; if it's at 60 Hz it can be just fine. In general this choice should depend on the game and not be fixed in the engine. The engine should only provide a way of doing MT work that is not just directly using threads. The engine should provide services; the rendering should depend on the specific game.


Yeah, my feeling is to have different worker threads with tasks assigned to them by a scheduler, just like Bad Company 2 or Capcom's MT Framework (and many others). Flexibility is fundamental to adapt to different situations!

---------------------------------------http://badfoolprototype.blogspot.com/
(I'll continue with the brainstorming...)

What commands are possible?

How can you describe a scene?


Questions are important because your brain will always find an answer. (NLP)

I'll begin with the description of a scene: this is a top-down approach.
Take a scene in which you have:
- Shadow mapping (VSM, CSM...)
- Refraction effects (water, ice, glass?)
- Reflection effects
- Opaque and transparent geometries
- Particles
- Static meshes, skinned meshes, morphed meshes
- PostProcess effects (HDR, Motion Blur, DOF, Bloom, SSAO, SSGI)
- Dynamic lights and shadows
- Static lightmaps
- Radiosity Normal Mapping
- More ???

The scene, depending on light and geometry complexity, can be handled in one of the following ways:
- Forward Rendering
- Deferred Rendering
- Light Prepass Rendering
- Deferred Light Rendering (as in STALKER, Deferred Rendering for opaque geometries and forward pass for transparent...) (I don't know if this is the correct name)

Basically, all the effects above are combinations of the following basic operations (a small sketch of these bricks as command types follows the two lists below):
- Render to texture;
- Render geometry;
- Set geometry;
- Set texture;
- Set shader and params;
- Set render states;

The possible rendering situations are:
- Render a list of objects with the same shader (like shadows...)
- Render a list of objects with their material
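A minimal sketch of those bricks written down as command types, plus the two submission situations as helper declarations (the enum values and all the names are only illustrative):

#include <cstdint>
#include <vector>

// The "bricks": every effect above decomposes into a handful of command types.
enum class CommandType : uint8_t {
    SetRenderTarget,   // render to texture
    SetGeometry,       // vertex + index buffers
    SetTexture,
    SetShader,         // shader + params
    SetRenderState,
    Draw
};

// The two submission situations, as hypothetical helpers:
// (a) draw a list of objects with one shared shader (e.g. a shadow pass),
// (b) draw a list of objects each with its own material.
struct Object;
struct Shader;
void DrawWithSharedShader(const std::vector<const Object*>& objects, const Shader& s);
void DrawWithOwnMaterial (const std::vector<const Object*>& objects);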

The real issue with all these effects is the DEPENDENCIES between the intermediate steps.
You can imagine the rendering as a Petri net in which each step needs resources created by previous steps.
Consider for example an in-game scene from Gears Of War 2.
With a FORWARD APPROACH, it is only a matter of:
- Creating the shadow maps, rendering a list of objects to texture with the same shader;
- Rendering the opaque geometries with their materials;
- Rendering the transparent geometries with their materials;
- Rendering the post-process effects;

If I remember correctly, Unreal Engine 3 uses a light buffer that accumulates the shadow and light contributions, so it is more of a deferred approach.
With a DEFERRED APPROACH, using a technique like deferred shadow maps for the shadows:
- Render the G-Buffer;
- Render the light/shadow buffer, adding each contribution;
- Render the post-process effects;

Maybe it seems too simple, but in practice the rendering becomes just a combination of those bricks; the important thing is the ordering, and if you keep the dependencies in mind you can easily understand which rendering step comes before which.

If you consider that transparent objects (like water, glass, distortion effects) need to distort the opaque geometries (and the other transparent ones, so sorting is always necessary), you first have to render the opaque objects and then the transparent ones, in order.
Shadows need to apply the same shader to all geometries; post-processes need at least a fullscreen quad and a shader, and maybe some other information (like motion blur and eye adaptation do).
Rendering becomes a dependency graph based on steps and textures.
This is how I thought about rendering, and with that in mind I developed the Stage/Pipeline model: really simple, but extensible and really EXPRESSIVE.
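As a sketch of that dependency idea (not the actual Stage/Pipeline code, just an assumption about its shape): each stage declares which textures it reads and writes, and the ordering, or a full graph, falls out of matching outputs to inputs.

#include <string>
#include <vector>

// Each stage declares the textures it consumes and produces; dependencies
// between stages fall out of matching outputs to inputs (steps + textures).
struct Stage {
    std::string name;                  // "ShadowMap", "GBuffer", "Lighting", "PostFX"...
    std::vector<std::string> reads;    // textures required before this stage can run
    std::vector<std::string> writes;   // textures this stage produces
};

// True if 'later' consumes something 'earlier' produces, i.e. there is an edge.
bool DependsOn(const Stage& later, const Stage& earlier) {
    for (const std::string& in : later.reads)
        for (const std::string& out : earlier.writes)
            if (in == out) return true;
    return false;
}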

I think there are many other, better ways to render, and I know that hard-coding is the fastest way to render, but I think the real power of this approach is understanding the MENTALITY and the DEPENDENCIES behind any rendering.


What do you think about it?
---------------------------------------http://badfoolprototype.blogspot.com/
Quote:Original post by JorenJoestar

Maybe you can wrap the API-dependent resources and allocate the wrappers with a custom memory manager, so you can play with pointer logic (base + id) to get a direct pointer to the wrapper.


Good idea, I thought about that, but it won't work. Textures and meshes are big, so even if you redirect their allocation to a linear pool and subtract the base address, you'll still have a large gap between them, and in the end I suspect you won't save many bits (compared to storing a full 32-bit pointer).
Other thoughts on the subject.
I'm thinking about a more abstract architecture, like command + command data.
Inside the command there is information like the render target, material/z-order bits and an index to a data structure allocated in a pool.

The commands are not that many; the main distinction is between set and draw commands, and the set commands have different types, like set geometry (which can include vertex buffer and index buffer), set material (shader + params) and set render target.
The point is that rendering really can be described by simple commands like those.
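A sketch of that command + command data split (bit widths and field names are assumptions):

#include <cstdint>

// The command itself is small and sortable: render target and material/z bits
// drive the ordering, while the payload index points into a separate data pool.
struct Command {
    uint32_t sortBits;    // render target, material id, z/depth bits...
    uint8_t  type;        // SetGeometry, SetMaterial, SetRenderTarget, Draw
    uint32_t dataIndex;   // index of the command data in a pool
};

// Command data, pool-allocated; it holds pointers to the real API resources.
struct GeometryData {
    void* vertexBuffer;   // e.g. IDirect3DVertexBuffer9* on the D3D9 device
    void* indexBuffer;
};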

[Shadow map example]
- SetRenderTarget (render target: shadow)
- For each shadow-casting object:
  - set geometry info
    - set vertex buffer
    - set index buffer
  - set material info
    - set shader
    - set shader params
  - draw
- SetRenderTarget (render target: main)
- For each visible object:
  - set geometry info
  - set material info
  - draw

Geometry information can be contained inside a structure holding pointers to the REAL resources (D3D buffers, shaders), and commands are handled by API-specific devices.
Obviously all this flow is generated by sorting the commands based on render targets and other parameters.

Another fundamental thing is that each command has its own "do" method, which uses API-specific code to set resources or draw.
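A sketch of the "do" method idea, with the API-specific work behind a per-API device (all the names here are made up):

#include <cstdint>

struct GeometryData {
    void* vertexBuffer;   // pointer to the real API resource
    void* indexBuffer;
};

// API-specific devices (D3D9, OpenGL...) implement the actual work.
struct RenderDevice {
    virtual ~RenderDevice() = default;
    virtual void BindGeometry(const GeometryData& g) = 0;
    virtual void DrawIndexed(uint32_t indexCount) = 0;
};

struct DrawCommand {
    const GeometryData* geometry;
    uint32_t indexCount;

    // The command's "do" method: the API-specific code lives behind the device.
    void Do(RenderDevice& device) const {
        device.BindGeometry(*geometry);
        device.DrawIndexed(indexCount);
    }
};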


About parallelism.
Based on the Stage design I wrote about before, each stage can be considered a macro task. This macro task can then be subdivided into more fine-grained tasks.
E.g. ShadowStageTask: to draw to a shadow map we set the current render target, then for each visible, shadow-casting object we set the shadow shader and draw all the geometries using it.
There must be something that knows the information needed to create the command key in the proper manner: something that knows the stage, for example.
Also consider using a parallel for to subdivide the loop over the shadow-casting objects.
And then you create a "setstage" command, a "setshader" command and many "draw" commands, with the stage bits filled in by the current stage.

The rendering itself can be thought of as:
for each stage
    draw stage

Here we can also use a parallel for, and within each stage apply the same mentality.
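A sketch of that split, using C++17 parallel algorithms as a stand-in for a real task scheduler (the Stage interface here is only illustrative): each stage is a macro task whose command generation can run in parallel, while submission stays ordered.

#include <algorithm>
#include <execution>
#include <vector>

struct Stage {
    virtual ~Stage() = default;
    virtual void BuildCommands() = 0;   // e.g. ShadowStageTask: set target, emit draws
    virtual void Submit() = 0;          // play the recorded commands on the device
};

void RenderFrame(std::vector<Stage*>& stages) {
    // Macro tasks: build each stage's command list in parallel...
    std::for_each(std::execution::par, stages.begin(), stages.end(),
                  [](Stage* s) { s->BuildCommands(); });

    // ...then submit in the order dictated by the stage dependencies.
    for (Stage* s : stages)
        s->Submit();
}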


What do you think about it?

[Edited by - JorenJoestar on February 26, 2010 6:12:36 AM]
---------------------------------------http://badfoolprototype.blogspot.com/

This topic is closed to new replies.
