Multi-threading

Hi all. Is it possible to multi-thread (2 threads) an OpenGL application and link some of the data of the other thread to OpenGL's thread, and vice versa? If yes, how do I go about linking the two threads? Thanks.
The easiest way is to have a bunch of shared buffers, protected by a few critical sections, accessed in ring order.
In Win32 this is CRITICAL_SECTION; a few of the calls are EnterCriticalSection, LeaveCriticalSection, TryEnterCriticalSection... search for them on MSDN.
When I multithread I often find I need events. Check out CreateEvent, SetEvent and WaitForMultipleObjects.
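For example, a minimal sketch of one shared slot (untested; the buffer layout and names are made up, and I use WaitForSingleObject for brevity):

#include <windows.h>
#include <string.h>

// Hypothetical shared slot: one buffer plus a "data ready" event.
struct SharedSlot {
    CRITICAL_SECTION lock;  // protects frameData
    HANDLE dataReady;       // auto-reset event, signalled by the producer
    float frameData[256];   // whatever the two threads exchange
};

void InitSlot(SharedSlot* s) {
    InitializeCriticalSection(&s->lock);
    s->dataReady = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, starts unsignalled
}

// Computation thread: publish a finished result.
void Publish(SharedSlot* s, const float* src) {
    EnterCriticalSection(&s->lock);
    memcpy(s->frameData, src, sizeof(s->frameData));
    LeaveCriticalSection(&s->lock);
    SetEvent(s->dataReady);  // wake the GL thread
}

// GL thread: wait for fresh data, then copy it out.
void Consume(SharedSlot* s, float* dst) {
    WaitForSingleObject(s->dataReady, INFINITE);
    EnterCriticalSection(&s->lock);
    memcpy(dst, s->frameData, sizeof(s->frameData));
    LeaveCriticalSection(&s->lock);
}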

Avoid merging multiple threads into one GL stream, however: some drivers won't like it very much, and the benefit is usually minimal compared to application-level threading.

Avoid having a "computation" and a "rendering" thread: the number of synchronizations required will likely degenerate your system into a fully sequential one. Even in a carefully designed system, the lag introduced can build up considerably.
There are a few cases in which this is actually useful, but again, your first priority should be threading the application at the data level. This has been shown to give the highest speedups in nearly all cases.

Previously "Krohm"

I was considering having a rendering and a computation thread running concurrently. The rendering thread would never modify any data that the computation thread needed to modify (the rendering thread would read it, not write it), and I thought that I would therefore need very few thread syncs to make this work. Am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops, but you say there is something wrong with this method. Could you elaborate?
Quote: Original post by Steve132
I was considering having a rendering and a computation thread running concurrently. The rendering thread would never modify any data that the computation thread needed to modify (the rendering thread would read it, not write it), and I thought that I would therefore need very few thread syncs to make this work. Am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops, but you say there is something wrong with this method. Could you elaborate?


The majority of the rendering work is done on the GPU (unless you are using a software renderer), so the benefit of a separate thread for rendering is lost.

What you really want to do is make the high-level render calls (OpenGL/DX), then update the game data, and finally swap the buffers.

The reason for this is that the buffer swap will (normally) wait for a vertical retrace, putting the CPU in an idle state until the GPU is finished.

By doing the game logic between the render calls and the buffer swap, you ensure that the CPU does something while the GPU is working.

The traditional method (used by software renderers) would, if used with hardware rendering, result in this:

        logic   rCalls  render  swap
cpu:  -working-working-nothing-nothing-
gpu:  -nothing-working-working-working-


vs

        rCalls  logic   swap
cpu:  -working-working-working-
gpu:  -working-working-working-   (the GPU is rendering the whole time)
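
In code, the second ordering is simply (a sketch; the two helper functions are placeholders):

#include <windows.h>

void IssueRenderCalls();  // placeholder: submit the GL draw calls
void UpdateGameLogic();   // placeholder: AI, physics, game state

// One frame, ordered so the CPU works while the GPU renders.
void Frame(HDC dc) {
    IssueRenderCalls();   // the GPU starts chewing on the submitted commands
    UpdateGameLogic();    // CPU work overlaps the GPU's rendering
    SwapBuffers(dc);      // only now block (on vsync) until the GPU is done
}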


If you want to take advantage of multiple CPUs, the best option is to split off parts of the logic while keeping the main flow of the game loop sequential. Things such as physics and AI are often well suited to a task-based model (commercial physics engines such as AGEIA's PhysX and Havok Hydracore do this for you).

Thus it would look something like this:
    /--AI--\    /--Physics--rCalls--\
---<        >--<                     >--OtherLogic--swap-
    \--AI--/    \--Physics----------/

Using multiple threads for AI is reasonably easy.
It's task-based: each thread grabs an actor from a queue, processes it, then grabs a new one. The only thing that needs to be synchronized is queue access (you don't want both threads trying to grab the same actor, for example).
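Something like this, for example (an untested sketch; Actor and UpdateActorAI are placeholders):

#include <windows.h>
#include <vector>

struct Actor { /* ... */ };
void UpdateActorAI(Actor* a);            // assumed to exist elsewhere

static std::vector<Actor*> g_queue;      // actors still waiting for their update
static CRITICAL_SECTION    g_queueLock;  // the only synchronization needed
                                         // (InitializeCriticalSection once at startup)

// Pop one actor, or NULL when the queue is empty.
static Actor* GrabActor() {
    Actor* a = NULL;
    EnterCriticalSection(&g_queueLock);
    if (!g_queue.empty()) {
        a = g_queue.back();
        g_queue.pop_back();
    }
    LeaveCriticalSection(&g_queueLock);
    return a;
}

// Worker thread body: grab, process, repeat until the queue runs dry.
static DWORD WINAPI AiWorker(LPVOID) {
    while (Actor* a = GrabActor())
        UpdateActorAI(a);
    return 0;
}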
Physics is a bit harder, but luckily there are shrink-wrapped solutions for that already.

The time required to make the render calls is so small, though, that the synchronization overhead is likely to reduce performance; thus a separate render thread is worthless at best (software renderers are an exception).

Another reasonable way to use multiple threads is for on-the-fly loading of data (useful for large worlds, etc.) and similar time-independent tasks. This is useful even on single-CPU systems, since it greatly simplifies things.
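A rough sketch of such a loader thread (untested; the request queue and LoadResourceFromDisk are made up):

#include <windows.h>
#include <deque>
#include <string>

void LoadResourceFromDisk(const char* file);  // placeholder for the actual I/O

static std::deque<std::string> g_requests;    // pending file names
static CRITICAL_SECTION        g_reqLock;     // init once at startup
static HANDLE                  g_workPending; // auto-reset event, from CreateEvent(NULL, FALSE, FALSE, NULL)

// Game thread: queue a file and wake the loader.
void RequestLoad(const std::string& file) {
    EnterCriticalSection(&g_reqLock);
    g_requests.push_back(file);
    LeaveCriticalSection(&g_reqLock);
    SetEvent(g_workPending);
}

// Loader thread: sleep until work arrives, then drain the queue.
DWORD WINAPI LoaderThread(LPVOID) {
    for (;;) {
        WaitForSingleObject(g_workPending, INFINITE);
        for (;;) {
            std::string file;
            EnterCriticalSection(&g_reqLock);
            if (g_requests.empty()) { LeaveCriticalSection(&g_reqLock); break; }
            file = g_requests.front();
            g_requests.pop_front();
            LeaveCriticalSection(&g_reqLock);
            LoadResourceFromDisk(file.c_str());  // the slow part, off the game thread
        }
    }
}  // (never returns; a real loader would also watch a quit event)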

[Edited by - SimonForsman on October 4, 2006 8:44:54 AM]
I don't suffer from insanity, I'm enjoying every minute of it.
The voices in my head may not be real, but they have some good ideas!
Let me say one thing first: I haven't had much experience with multithreading, so I could be a bit wrong.

Anyway, what I think Krohm is saying is that any data processing you do while rendering will stall one thread or the other multiple times, to the point that you might not gain much from multithreading. Of course, this depends a lot on your implementation.

The real problem is that the rendering thread has to be horribly linear: everything has to be done in the correct order, and everything depends on what happened before it. This means you just can't split up the rendering thread without running into problems. (The irony is that computer graphics is both the greatest champion of parallel computing and its greatest opponent.)
Now, while the rendering thread is horribly linear, it is also pretty easy to predict what data it will need, so you can (if you want to) precalculate the data up to a frame in advance.
I hope this gives you a few ideas.
I would tend to disagree with the guys who are telling you that a thread for the renderer is not a good idea.

It's being touted as the way to go by Microsoft in their recent Gamefest presentations and I'd tend to agree with them.

An independent thread for rendering gives you much more time to make efficient use of the GPU; it allows for more detailed sorting of objects and will allow you to do more in the game.

Things which aren't reliant on game processing can be put into the renderer and processed there too. A particle system, for example, can be built to update inside the renderer; this would allow for more detailed and better-looking particle systems.

It will also allow support for more objects in the world, as there is a greater opportunity to do culling on a separate processor at a more detailed level.

As nice as data-level parallelism is, you also have to remember the overhead introduced by context switches. Having more threads than you have cores is not free, and doesn't always give you a performance benefit. Setting up all your independent physics objects to be processed at once when you only have two or four cores is going to kill your performance; the same can be said for AI updates, particle updates and the like.

You also need to watch for SMT cores rather than truly independent cores. Putting two CPU-intensive threads on the same core will also slow you down, as they will be fighting over the same CPU cache, ultimately slowing your frame rate.

While it's all good to say that you should be threading at the data level, it brings many problems with synchronisation, race conditions and SMT core sharing that need to be designed around carefully.

But like you say, if you are just splitting the game loop and the rendering loop, then you will only need a couple of syncs to switch double-buffered render lists. This would allow you to write a detailed and expensive renderer and have a good game loop for updating physics etc.
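A minimal sketch of that double-buffered swap (the RenderList type and helpers are hypothetical):

#include <windows.h>
#include <vector>

struct RenderCommand { /* ... */ };
typedef std::vector<RenderCommand> RenderList;
void SubmitToGPU(const RenderCommand& rc);  // placeholder for the actual draw calls

static RenderList       g_lists[2];
static int              g_writeIndex = 0;   // list the game thread is filling
static CRITICAL_SECTION g_swapLock;         // init once at startup; held only around the swap/draw

// Game thread, once per update, after filling g_lists[g_writeIndex]:
void PublishFrame() {
    EnterCriticalSection(&g_swapLock);
    g_writeIndex = 1 - g_writeIndex;        // hand the finished list to the renderer
    g_lists[g_writeIndex].clear();          // recycle the old list for the next update
    LeaveCriticalSection(&g_swapLock);
}

// Render thread, once per frame:
void DrawFrame() {
    EnterCriticalSection(&g_swapLock);
    const RenderList& toDraw = g_lists[1 - g_writeIndex];  // the finished list
    for (size_t i = 0; i < toDraw.size(); ++i)
        SubmitToGPU(toDraw[i]);
    LeaveCriticalSection(&g_swapLock);
    // (A real renderer would swap the list out and unlock before drawing.)
}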

It's worth getting the Gamefest presentations from MSDN and doing a bit of research into what they have to say.
I would like to draw a distinction here between DirectX and OpenGL.

DirectX is based on batching; calling the render routines also involves the CPU, but you (re)gain speed when uploading your data in big batches.

On the other hand, OpenGL and VBOs do not take that much time to upload data to the graphics hardware. GL "collects" the data first and then passes it off; it is less CPU-intensive.

So I would say that, since you are using OpenGL, your synchronization code will take more time to execute than you are going to gain by multithreading (this is only true for single-CPU systems).
On the other hand, I have heard of speed gains of about 5-10 percent when using DirectX graphics with multiple threads.

But, as SimonForsman suggested, you will get a greater speed improvement by parallelizing things like AI and physics. Maybe you want to use the "fork-join" model to avoid synchronization issues, and you might like to have a look at the OpenMP multithreading standard, which is implemented by many compilers (g++, MS Visual Studio 2005, etc.).
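For example, a fork-join update over an actor array might look like this with OpenMP (a sketch; compile with /openmp or -fopenmp, and the actor type and update function are placeholders):

#include <omp.h>

struct Actor { /* ... */ };
void UpdateActorAI(Actor* a);  // assumed to exist elsewhere

// Fork: the iterations are split across the available cores.
// Join: the threads implicitly synchronize at the end of the loop.
void UpdateAllActors(Actor* actors, int count) {
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        UpdateActorAI(&actors[i]);
}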
Quote: Posted by Steve132
and I thought that I would therefore need very few thread syncs to make this work. Am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops.

This could work (I'm still not sure) if all your operations are atomic, but that's unlikely. It also depends on how complex your app is: whether you're going to pull a lot of data to/from the GPU, how often you rethink your rendering strategies, how likely your data structures are to become inconsistent... I personally would not trust this too much, but I encourage you to go that way and prove me wrong. I would really appreciate it if you could post a way to do this in a "complex" app. It takes a lot of strategy, and I would likely mess it up easily, but that's my opinion.

Say, for example, you have three "command buffers" which you refresh every 10 render loops. That's just fine: every 10 render loops, you enter a critical section (which probably won't block), then signal the other thread to use the successive buffer (which is now in a consistent state). That could work and requires very few syncs. But, for example, collisions (at least against the player camera) must be handled on a per-frame basis, so you must... somehow... lock at least your MVP matrices, and that's just one example!

EDIT:
I forgot to note that you can also perform some simple tasks with interlocked operations. Nothing really fancy, but they're listed with all the other synchronization functions on MSDN:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/synchronization_functions.asp
Interlocked operations are FAST, so they're a natural fit for the above example.
/EDIT
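
For instance, publishing the "current buffer" index from the example above could be a single interlocked store (a sketch; the triple-buffer layout is made up):

#include <windows.h>

struct CommandBuffer { /* ... */ };
static CommandBuffer g_buffers[3];
static volatile LONG g_readable = 0;  // index of the last consistent buffer

// Producer: finish writing buffer `next`, then publish it atomically.
void PublishBuffer(LONG next) {
    InterlockedExchange(&g_readable, next);  // atomic store, acts as a barrier
}

// Consumer: atomically read which buffer is safe to use this frame.
CommandBuffer* AcquireBuffer() {
    LONG idx = InterlockedCompareExchange(&g_readable, 0, 0);  // atomic-read trick
    return &g_buffers[idx];
}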

Quote: Posted by lc_overlord
Anyway, what I think Krohm is saying is that any data processing you do while rendering will stall one thread or the other multiple times, to the point that you might not gain much from multithreading. Of course, this depends a lot on your implementation.

The real problem is that the rendering thread has to be horribly linear: everything has to be done in the correct order, and everything depends on what happened before it. This means you just can't split up the rendering thread without running into problems. (The irony is that computer graphics is both the greatest champion of parallel computing and its greatest opponent.)
Now, while the rendering thread is horribly linear, it is also pretty easy to predict what data it will need, so you can (if you want to) precalculate the data up to a frame in advance.
I hope this gives you a few ideas.

Yes, this is quite similar to what I tried to say... I was thinking of another aspect of it. You raise a very important issue, but I will try again to explain why I don't like thread "pipelining" so much.

Both "parallelizing" and "pipelining" can yield very good performance increases.
What has been observed, however, is that to get near-optimal synchronization you need to buffer the commands passed from one stage to the next.

Now, this leads to two cases: on slow systems the delay can be noticeable because of the longer tick, which is unfortunate. By contrast, on faster systems the delay is OK, but maybe those systems don't need the extra horsepower.
Having 1 or 2 frames of lag is often tolerated, but those frames are added on top of the API's own buffering and can easily get out of control. I'm sure you can design something which avoids all of those issues; it depends on how much you want to invest in it.
You can obviously fight this by using smaller buffers (say a one-buffer ring instead of two or three), but this may get to the point of serializing both threads. Even in that case there could be a gain, but it would be abysmal compared to a fully parallel solution.

Quote: Posted by bonus
It's being touted as the way to go by Microsoft in their recent Gamefest presentations and I'd tend to agree with them.

You also need to watch for SMT cores rather than truly independent cores. Putting two CPU-intensive threads on the same core will also slow you down, as they will be fighting over the same CPU cache, ultimately slowing your frame rate.

Keep in mind that Direct3D takes much more effort on the CPU side and does have some issues with multithreading. If you don't specify the D3DCREATE_MULTITHREADED flag you may get into bad trouble... but specifying it may easily cost you a bit of performance. It can also happen that it works perfectly without it (who knows); I believe those metrics are really application-dependent. It definitely makes sense to move this critical path out of D3D (which then runs in standard mode) and sync with another thread yourself, because then you have control of, and knowledge about, when the sync happens.
Whether this makes sense in OpenGL, I don't know.

Intel says you can put CPU-intensive threads on the same cache because they'll get warm-cache benefits. It really takes more than a few lines to cover this properly.

I began reading the Gamefest presentations a few days ago. I still have to read them all, but so far the only interesting thing I have found is the sparse texture hash. Wow, that rocked! The sphere-based ambient occlusion also looks nice, but I'm not sure it's worth it. By the way, what's your feeling on those two?

[Edited by - Krohm on October 4, 2006 10:21:06 AM]

Previously "Krohm"

Quote: Original post by Krohm
Keep in mind that Direct3D takes much more effort on the CPU side and does have some issues with multithreading. If you don't specify the D3DCREATE_MULTITHREADED flag you may get into bad trouble... but specifying it may easily cost you a bit of performance. It can also happen that it works perfectly without it (who knows); I believe those metrics are really application-dependent. It definitely makes sense to move this critical path out of D3D (which then runs in standard mode) and sync with another thread yourself, because then you have control of, and knowledge about, when the sync happens.
Whether this makes sense in OpenGL, I don't know.

Intel says you can put CPU-intensive threads on the same cache because they'll get warm-cache benefits. It really takes more than a few lines to cover this properly.

I began reading the Gamefest presentations a few days ago. I still have to read them all, but so far the only interesting thing I have found is the sparse texture hash. Wow, that rocked! The sphere-based ambient occlusion also looks nice, but I'm not sure it's worth it. By the way, what's your feeling on those two?


Apparently using D3DCREATE_MULTITHREADED is a big performance hassle too, as it causes D3D to synchronise on every call into it. The way to go is to set D3D up in one thread and have only that thread call it, which is probably another reason why they recommend a rendering thread for D3D apps.

I haven't looked at those other two yet, as I'm currently looking at the feasibility of doing a multithreaded engine; other nice optimisations and code can come once the base systems are worked out (i.e. data sharing between the game logic thread and the rendering thread). I'm looking at locking the frame rate to 60fps (a console outlook), so the system doesn't need to be written to run as fast as humanly possible, but should be written to keep the GPU busy for as much of a 60fps time slice as possible.

So my outlook was definitely aimed more at D3D than OpenGL, as that's what I'm currently looking at, but I'd still be sceptical of the benefits of forcing context switches through data parallelism in things like physics systems. Maybe splitting the list across the number of idle cores you have available would be good, but how many systems actually have idle cores at any given time? For professional release games you'd be much better off sticking major systems such as networking and resource caching into their own threads, as they get major benefits from running compression algorithms and from reducing the latency between asking for something and getting it.

Once you have the bases covered for reducing the latency from loading and communication, then worry about threading things like physics, if you have enough (i.e. any) cores left which won't be impacted by context switches.
Quote: Original post by bonus
I'd still be sceptical of the benefits of forcing context switches through data parallelism in things like physics systems.

The cost of the switches is far outweighed by the extreme performance increase in most cases. The cost itself is pretty low anyway, when you consider that ring-0 transitions are essentially context switches with extra fat added. I'm not really convinced you should use this as a design metric.
Quote: Original post by bonus
Maybe splitting the list across the number of idle cores you have available would be good, but how many systems actually have idle cores at any given time?

You're right that you cannot really know whether a core is idle or not, but assuming your application is the only "performance-hungry" one will give you enough of a clue. If the user then launches another performance-heavy app, that's his/her business.
EDIT:
Thinking about it again, maybe Win32 does provide some performance counters to estimate this. I'm not sure, however.
/EDIT
Quote: Original post by bonus
For professional release games you'd be much better off sticking major systems such as networking and resource caching into their own threads, as they get major benefits from running compression algorithms and from reducing the latency between asking for something and getting it.

Although I agree that threading networking or streaming is definitely useful, I don't see how the huge network latency (or HD latency, for that matter) could be usefully reduced. Threading is not magic. Also keep in mind that streaming, for example, will kick in just a few times per second (an order of magnitude less often than physics integration). Theoretically you don't even need threading if you use non-blocking I/O (but I like threading there too, because it insulates the ugly parts).
Quote: Original post by bonus
Once you have the bases covered for reducing the latency from loading and communication, then worry about threading things like physics, if you have enough (i.e. any) cores left which won't be impacted by context switches.

I suggest another approach:
after you have something running and benchmarked against the target machine, optimize the performance paths, then care about the details.
I say context switches are a detail because if you run physics at 30fps (usually it's 20 or even 12fps), you realize the period is somewhat longer than the normal OS time slice, so the default scheduling is already a good solution.

In short, I'm likely misunderstanding your points.

[Edited by - Krohm on October 7, 2006 3:33:20 AM]

Previously "Krohm"
