
To Thread or Not to Thread


obhi
I recently finished my prototype "multithreaded" renderer. The renderer is based on OpenGL 4.1/DirectX 11, and much of it (render state management, shader-to-stream binding, etc.) is more akin to the DirectX 11 architecture, which sits very well with the OpenGL implementation (I have not implemented the DirectX part yet). But during various profiling sessions I found some results which have forced me to reevaluate what needs to be threaded.

Before I get to that, let me briefly explain the architecture I chose for this renderer.
In general, a subsystem should be self-contained: physics, rendering, AI, etc. are subsystems, and a renderer can further be divided into culling + render queue processing.
My goal was to implement the culling and the render queue processing in separate threads. That said, there were still dependencies and constraints between these two systems. The primary constraint was that once a cull request is generated, execution of the previous render batch queue needs to be fully done, along with the back-to-front buffer flip (by which I mean the framebuffer). Synchronization between the render data and the scene data was done by double buffering every piece of data: there would be two render queues, and each shader parameter would have two memory instances. The update + cull thread would write to an area not being read by the render thread at that instant; the shader parameters had write-only access from the update thread and read-only access from the render thread. After the render queue was fully generated for flushing, a synchronization step flipped all the shader parameters that were updated in the last cull + update iteration (a minimal sketch of this double buffering follows the diagram below). I also did not plan to update the vertex and index buffers from the update thread (most such updates can be shifted to the geometry shader anyway). So the render and cull threads can be laid out as:

Update: [--- cull + update ---|--- sync and submit---- ]
Render: [--- render -----------|---------- wait --------------]
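
To make the double buffering concrete, here is a minimal sketch (illustrative, not my actual code; the flip is assumed to run at the sync point while both threads are idle):

[source]
#include <atomic>

// Two instances of each shader parameter: the update thread writes one
// slot while the render thread reads the other; a sync step flips them.
template <typename T>
struct DoubleBuffered
{
    T slots[2];
    std::atomic<int> readIndex{0};   // slot the render thread reads

    // Update thread: write-only, always to the slot not being read.
    void write(const T& value) { slots[1 - readIndex.load()] = value; }

    // Render thread: read-only.
    const T& read() const { return slots[readIndex.load()]; }

    // Sync step: called once per frame while both threads are idle.
    void flip() { readIndex.store(1 - readIndex.load()); }
};
[/source]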

The 'wait' would be rather short, BUT it is bad design. Using a condition/critical section pair for both threads meant that the render thread, after finishing one render step, had to wait during the sync phase, which would most likely drop into kernel space (along with a thread preemption). I knew it was a bad design, but I was hoping it would still be better than a single-threaded design where the update thread has to wait for all render API calls. I am also aware that there are lock-step algorithms, etc. to alleviate the waiting, but during the profiling phase I observed a few things:

1. The render API (OpenGL/DirectX) already uses command buffers to queue up the render API calls. Considering this, it is apparent that what any parallel cull + draw architecture does is already being achieved inside the renderer API.
2. While looking into the thread allocations, I observed that the ATI 5770 OpenGL driver allocated 3 threads, which makes sense considering the command buffer is filled rather asynchronously and execution can be done on another thread.
3. I was only getting about a 20% boost on average, and using a very cumbersome design to achieve it, one which also added a lot more constraints to the whole system put together.

All this made me conclude that culling and rendering, being interdependent, should rather be done serially.

So I am asking all the folks out there: how good an approach do you think app + cull + draw parallelism is, given today's render hardware and drivers? Would a design of complete systems all running asynchronously in parallel with message passing be a better one? By complete systems I mean systems with no apparent exclusive interdependency.


Thanks for reading,
obhi

landlocked
Take what I'm about to write with a grain of salt. I haven't worked with the types of systems you describe; the render process and such has typically been handled for me, as I haven't gotten into the belly of the popular APIs and driven them manually the way you're doing. That said, I've been a programmer for the better part of a decade now, and I have experience writing parallel operations.

Keep in mind the individual limits of the computer that will execute your code. You should detect how many cores the machine has, whether they live in the CPU or the GPU, be that 1 core or 100; naturally, if there is only 1 core you can't parallelize anything. Your thread-handling code should then spawn an equal number of threads, with a messenger process that handles communication between them. In your example of app code, rendering and culling, you'd ideally want a quad-core processor if you want to parallelize each of those, using the 4th core either for your message-handling code or for splitting your app code across two threads. This assumes, of course, that rendering only refers to writing frames to the screen; if your rendering process also handles physics and other things, you'd probably want to split rendering across two threads instead of the app code. Basically, whichever code is "heaviest" should take the extra core.

All your code should talk to your cross-thread messenger to issue commands and coordinate communication; the messenger will probably live on your main thread, which is also where you will probably handle input and output messages. If you want to get fancy you could cannibalize the downtime on other threads for particularly intensive operations, but that demands a delicate balance and takes extensive care and profiling to get right, or else you'll create more problems than you solve.
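
For the detection step, something like this minimal sketch (std::thread::hardware_concurrency only reports CPU hardware threads; counting GPU cores needs a vendor API, so that part is out of scope here):

[source]
#include <thread>
#include <vector>

int main()
{
    // May return 0 if the count cannot be determined; fall back to 1.
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;

    // Spawn one worker per core beyond the main thread, which acts
    // as the messenger/coordinator described above.
    std::vector<std::thread> workers;
    for (unsigned i = 1; i < cores; ++i)
        workers.emplace_back([]{ /* pull tasks from a shared queue */ });

    for (auto& w : workers)
        w.join();
}
[/source]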

Above all though, always profile before you change anything, as you did.

Hope I helped.

_the_phantom_
You are (both) thinking at the wrong level of abstraction; threads are for doing work, but 'tasks' are the level at which you should be coding your work.

For example: if you run on a 4-core machine then you would spin up 4 threads and then spread the resulting workload across those threads as required (paying attention to dependencies, of course).

Given today's systems, the 'best' method is to break everything down into independent tasks, with groups of tasks that do the same type of work running at the same time.

So, to use the simplified example:

[update tasks] -> [cull tasks] -> [render list construction tasks] -> [update + render task] -> [cull tasks] -> [render list construction task] -> etc

Note that there is a 'render list construction' set of tasks and a 'render' task running at the same time as the update tasks. The reason for this is that while you can parallelise construction of a (sorted) list of objects you want to render (be it via DX11's Command Lists or just constructing a list of data which you later go over*), you have to submit all your render work from a single thread, so you need that single task to deal with that final submission.

A good place to start looking at systems like this is Intel's Threading Building Blocks, which implements a work-stealing, task-based system.
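
To sketch the grouped flow in code; this is illustrative only (run_phase and the task vectors are made-up names, not TBB's API), and a real scheduler would keep a persistent pool rather than respawning threads per phase:

[source]
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Run one group of independent tasks to completion before returning,
// so groups execute in sequence: [update] -> [cull] -> [list build] -> ...
void run_phase(const std::vector<std::function<void()>>& tasks, unsigned threads)
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&]{
            for (size_t i = next++; i < tasks.size(); i = next++)
                tasks[i]();          // each worker grabs the next free task
        });
    for (auto& th : pool)
        th.join();                   // joining is the phase barrier
}
[/source]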

-------------------------------------
*the reason you'd want to do this is that doing all your code <--> DX/(user mode) driver transitions in one go, without anything else going on, gains you a win from the instruction cache; and having all your per-object/draw-call render data together helps the data cache as well, provided you do it right (so arrays/std::vector, not a linked list)

landlocked
I come from a .NET background, so Task and Thread are virtually the same deal to me. ;)

If I said 'task' anywhere, you can use the two interchangeably for what I wrote.

obhi
[quote name='phantom' timestamp='1305563196' post='4811486']
You are (both) thinking at the wrong level of abstraction; threads are for doing work, but 'tasks' are the level at which you should be coding your work.

For example: if you run on a 4-core machine then you would spin up 4 threads and then spread the resulting workload across those threads as required (paying attention to dependencies, of course).

Given today's systems, the 'best' method is to break everything down into independent tasks, with groups of tasks that do the same type of work running at the same time.

So, to use the simplified example:

[update tasks] -> [cull tasks] -> [render list construction tasks] -> [update + render task] -> [cull tasks] -> [render list construction task] -> etc

Note that there is a 'render list construction' set of tasks and a 'render' task running at the same time as the update tasks. The reason for this is that while you can parallelise construction of a (sorted) list of objects you want to render (be it via DX11's Command Lists or just constructing a list of data which you later go over*), you have to submit all your render work from a single thread, so you need that single task to deal with that final submission.

A good place to start looking at systems like this is Intel's Threading Building Blocks, which implements a work-stealing, task-based system.

-------------------------------------
*the reason you'd want to do this is that doing all your code <--> DX/(user mode) driver transitions in one go, without anything else going on, gains you a win from the instruction cache; and having all your per-object/draw-call render data together helps the data cache as well, provided you do it right (so arrays/std::vector, not a linked list)
[/quote]

Here are a few things I forgot to mention in my first post. I used a scheduler and task-based system for the renderer, where the scheduler posted tasks to available threads. I also had to make sure that certain tasks were bound to a specific thread (or channel) which could execute them; this was necessary for the render thread, which had to be the same for every render job. Apart from that, the scheduler allocated threads by checking the hardware threads available, or could be forced to spawn more threads through a configuration file. My tests, however, were done on a dual core with 2 threads spawned by the scheduler.
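
To give an idea of the channel binding, a rough sketch (illustrative, not my actual scheduler; the queues would be mutex-protected in practice):

[source]
#include <deque>
#include <vector>

enum { kAnyChannel = -1 };

struct Task
{
    void (*run)(void* data);
    void* data;
    int   channel;   // kAnyChannel, or a fixed worker index (e.g. the render thread)
};

struct Scheduler
{
    std::vector<std::deque<Task>> perChannel;   // one queue per worker thread
    std::deque<Task>              shared;       // any worker may take these

    void post(const Task& t)
    {
        if (t.channel == kAnyChannel)
            shared.push_back(t);
        else
            perChannel[t.channel].push_back(t); // e.g. render jobs pinned to one thread
    }
};
[/source]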

I had previously looked into Nullstein, available on the Intel site, which did implement task stealing, and by coincidence I looked into the TBB manual today while at work :P, so I am aware of that approach. However, my concern is dependencies.

Granted, a message-passing mechanism can be implemented such that very little synchronization is required, but the memory footprint involved could be very high. Taking your example, how do you run 'update' in parallel with 'render list generation'? Update could be overwriting object positions and rotations required by the render list generation (or the render task, for that matter). Would you rather have independent instances of these objects for each of these tasks and sync them during a sync phase (via message passing or whatever)? The point is, since these tasks are so data-dependent on one another, what could really be parallel here?

Thanks for the replies guys,
obhi

_the_phantom_
You don't run 'update' and 'render list' generation at the same time.

In the flow I gave above, each block is a group of tasks, and each group runs in sequence.

So you complete all your 'update' tasks and then move on to all your 'render list' tasks and so on.

'Updates' can be done at the same time by maintaining two sets of state: an internal state which is being updated, and a 'public' state which others can read if required.

In fact, given that, the flow should really read:

[update tasks] -> [object Sync tasks] -> [cull tasks] -> [render list construction tasks] -> [update + render task] -> [object Sync tasks] -> [cull tasks] -> [render list construction task] -> etc

Where the sync tasks mirror ONLY the public data (positions etc). While this adds some memory overhead, it removes the need to introduce any locks, eliminating a whole class of problems that come with them :)
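
A minimal sketch of that internal/public split (field names are illustrative):

[source]
#include <vector>

struct Transform { float pos[3]; float rot[4]; };

struct GameObject
{
    Transform internal;   // written by the update tasks this frame
    Transform pub;        // read-only for the cull / list-construction tasks
};

// The sync tasks run after all update tasks and before any cull task,
// so nothing reads 'pub' while it is being copied; no locks needed.
void syncTask(std::vector<GameObject>& objects)
{
    for (auto& o : objects)
        o.pub = o.internal;
}
[/source]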

This system also removes the need to 'lock' your render tasks to a physical/logical thread for OpenGL context reasons; as only one render 'task' is ever executing, that task can have some code at its start and end which obtains and releases the context from the thread.

[source]
void RenderTask()
{
    ObtainContext();    // bind the GL context to this thread (e.g. via *MakeCurrent)

    // Render work done here: loop over the list created by the
    // list-construction tasks, setting states and issuing draws.

    ReleaseContext();   // unbind, so the next render task can run on any thread
}
[/source]

obhi
That is a very neat idea. I looked into AnandTech's interview with Epic on the potential areas for multithreading. From what I understand, it boils down to this:

For a range of mostly independent objects, a parallel update incurs no synchronization cost and scales well. Keeping a separate state per object, where syncing would otherwise be an issue, will also yield good results, as long as the memory usage is kept in mind. Given there are a lot of areas to look into when splitting up a task, parallel_for would be a good starting point.
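
For instance, a minimal TBB sketch of a per-object parallel update (GameObject and its update method are just placeholders):

[source]
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

struct GameObject
{
    float x = 0.0f;
    void update(float dt) { x += dt; }   // objects are independent of each other
};

void parallelUpdate(std::vector<GameObject>& objects, float dt)
{
    // TBB splits the range across its worker threads and work-steals
    // to balance the load; no per-object synchronization is needed.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, objects.size()),
        [&](const tbb::blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
                objects[i].update(dt);
        });
}
[/source]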

I can see the light now!

Thank you very much.
obhi
