shela

Multi-threading

Hi all, is it possible to multi-thread (2 threads) an OpenGL application and link some of the data of the other thread to OpenGL's thread and vice versa? If yes, how do I go about linking the 2 threads? Thanks.

The easiest way is to have a bunch of shared buffers protected by a few critical sections to be accessed in ring order.
In win32 this is CRITICAL_SECTION: a few calls are EnterCriticalSection, LeaveCriticalSection, TryEnterCriticalSection... search them on MSDN.
When I multithread I often find I need events. Check out CreateEvent, SetEvent and WaitForMultipleObjects.
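
A minimal sketch of the shared-buffer ring described above, under some assumptions: it uses portable C++ (std::mutex and std::condition_variable) as stand-ins for CRITICAL_SECTION and the win32 event calls, and the names BufferRing and runRingDemo are made up for illustration. It is not the only way to lay this out.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical ring of shared buffers. std::mutex stands in for a win32
// CRITICAL_SECTION and std::condition_variable for an event object.
class BufferRing {
public:
    // Producer side: wait for a free slot, fill it, then signal the consumer.
    void push(std::vector<float> data) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return count_ < kSlots; });
        slots_[(head_ + count_) % kSlots] = std::move(data);
        ++count_;
        notEmpty_.notify_one();
    }

    // Consumer side: wait for a filled slot and take it in ring order.
    std::vector<float> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return count_ > 0; });
        std::vector<float> data = std::move(slots_[head_]);
        head_ = (head_ + 1) % kSlots;
        --count_;
        notFull_.notify_one();
        return data;
    }

private:
    static constexpr std::size_t kSlots = 3;
    std::array<std::vector<float>, kSlots> slots_;
    std::size_t head_ = 0;   // index of the oldest filled slot
    std::size_t count_ = 0;  // number of filled slots
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
};

// Runs one producer thread against the calling thread as consumer and
// returns the sum of everything received.
float runRingDemo(int frames) {
    assert(frames >= 0);
    BufferRing ring;
    std::thread producer([&ring, frames] {
        for (int i = 0; i < frames; ++i)
            ring.push(std::vector<float>(4, static_cast<float>(i)));
    });
    float sum = 0.0f;
    for (int i = 0; i < frames; ++i)
        for (float v : ring.pop()) sum += v;
    producer.join();
    return sum;
}
```

Calling runRingDemo(10) pushes ten small buffers through a three-slot ring; the producer blocks only when all three slots are full and the consumer only when all are empty.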

Avoid merging multiple threads in a GL stream however: some drivers won't like it very much and the benefit is usually very minimal when compared to application threading.

Avoid having a "computation" and a "rendering" thread: the number of synchronizations required will likely degenerate your system into a fully sequential one. Even on a carefully designed system, the lag introduced could build up considerably.
There are a few cases in which this is actually useful but again, your first priority should be threading the application on the data level. This has been proven to give the highest speedups in nearly all cases.

I was considering having a rendering and computation thread running concurrently....the rendering thread would never modify any data that the computation thread needed to modify, (the rendering thread would read it, not write it)...and I thought that I would therefore need very few thread syncs in order to make this work...am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops..but you say there is something wrong with this method...do you think you could elaborate?

Quote:
Original post by Steve132
I was considering having a rendering and computation thread running concurrently....the rendering thread would never modify any data that the computation thread needed to modify, (the rendering thread would read it, not write it)...and I thought that I would therefore need very few thread syncs in order to make this work...am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops..but you say there is something wrong with this method...do you think you could elaborate?


the majority of the rendering work is done on the GPU (unless you are using a software renderer), thus the benefit of a separate thread for rendering is lost,

what you really want to do is make the highlevel render calls (opengl/dx), then modify the data, and finally swap the buffers.

the reason for this is that the bufferswap will wait for a vertical retrace (normally) thus putting the cpu in an idle state until the gpu is finished,

by doing the game logic between the render calls and the bufferswap you ensure that the cpu does something while the gpu is working.

the traditional method (used by software renderers) would, if used with hardware rendering, result in this

     logic   rCalls  render  swap
cpu:-working-working-nothing-nothing-
gpu:-nothing-working-working-working-

vs

             render
     rCalls  +logic  swap
cpu:-working-working-working-
gpu:-working-working-working-
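
As a rough sketch of the second ordering, here is the frame written out as code. Frame, submitRenderCalls, updateLogic and swapBuffers are hypothetical stand-ins (a real program would issue OpenGL calls and the platform's buffer swap there); the only point is the order of the three steps.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins that only record the order in which they are called.
struct Frame {
    std::vector<std::string> trace;
    void submitRenderCalls() { trace.push_back("rCalls"); }  // queue work for the GPU
    void updateLogic()       { trace.push_back("logic"); }   // CPU work while the GPU draws
    void swapBuffers()       { trace.push_back("swap"); }    // may block until the retrace
};

// One pass of the loop: render calls first, logic in the middle, swap last.
std::vector<std::string> runFrame() {
    Frame f;
    f.submitRenderCalls();
    f.updateLogic();
    f.swapBuffers();
    return f.trace;
}
```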


if you want to take advantage of multiple cpus the best option is to split parts of the logic while keeping the main flow of the game loop sequential, things such as physics and AI are often well suited for a task based model. (commercial physics engines such as AGEIA and Havok HydraCore do this for you).

thus it would look something like this


        /--AI--\   /--Physics--\
rCalls--        ---             --OtherLogic--swap-
        \--AI--/   \--Physics--/

using multiple threads for AI is reasonably easy.
task based, each thread grabs an actor from a queue, processes it then grabs a new one, the only thing that needs to be synchronized is queue access (you don't want both threads to try to grab the same actor for example).
physics is a bit harder but luckily there are shrinkwrapped solutions for that already.
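
the actor queue described above could be sketched roughly like this, assuming portable C++ threads rather than win32 ones. Actor, ActorQueue and updateAI are invented names for illustration and think() stands in for real AI work; the only synchronized step is grabbing the next actor.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical actor type; think() stands in for real AI work.
struct Actor {
    int id = 0;
    bool processed = false;
    void think() { processed = true; }
};

// Hands out actors one at a time. Only the queue index is shared state,
// so it is the only thing guarded by the lock.
class ActorQueue {
public:
    explicit ActorQueue(std::vector<Actor>& actors) : actors_(actors) {}

    // Returns the next unprocessed actor, or nullptr when the queue is empty.
    Actor* grab() {
        std::lock_guard<std::mutex> lock(mutex_);
        return next_ < actors_.size() ? &actors_[next_++] : nullptr;
    }

private:
    std::vector<Actor>& actors_;
    std::size_t next_ = 0;
    std::mutex mutex_;
};

// Processes every actor using `workers` threads and returns how many were processed.
std::size_t updateAI(std::vector<Actor>& actors, unsigned workers) {
    assert(workers > 0);
    ActorQueue queue(actors);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([&queue] {
            while (Actor* a = queue.grab()) a->think();
        });
    for (std::thread& t : pool) t.join();  // back to the sequential game loop

    std::size_t done = 0;
    for (const Actor& a : actors) done += a.processed ? 1 : 0;
    return done;
}
```

no two threads can get the same actor because the index is only advanced while the lock is held, and the join at the end keeps the main flow of the loop sequential.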

the time required to make the rendercalls is so small though that the synchronization overhead is likely to reduce performance, thus a separate render thread is worthless at best. (software renderers are an exception)

another reasonable way to use multiple threads is for on the fly loading of data (useful for large worlds etc) and similar time independent tasks, this is useful even on single cpu systems since it greatly simplifies things.

[Edited by - SimonForsman on October 4, 2006 8:44:54 AM]

Let me say one thing first, i haven't had much experience with multithreading so i could be just a bit wrong.

Anyway, what I think Krohm is saying is that all the data processing you do while rendering will stall one or the other thread multiple times, to the point that you might not gain that much from multithreading; of course this depends a lot on your implementation.

The real problem is that the rendering thread has to be horribly linear, everything has to be done in the correct order and everything depends on what happened before it, this means that you just can't split up the rendering thread without running into problems (The irony is that computer graphics is both the greatest champion of parallel computing and its greatest opponent).
Now while the rendering thread is horribly linear it is also pretty easy to predict what data it will need thus you can (if you want to) precalculate the data up to a frame in advance.
I hope that this gives you a few ideas.

I would tend to disagree with the guys who are telling you that a thread for the renderer is not a good idea.

It's being touted as the way to go by Microsoft in their recent Gamefest presentations and I'd tend to agree with them.

An independent thread for rendering gives you much more time to make efficient use of the GPU, it will allow for a more detailed sorting of objects and will allow you to do more in the game.

Things which aren't reliant on game processing can be put into the renderer and processed here too. A particle system for example can be built to update inside the renderer, this would allow for more detailed and better looking particle systems.

It will also allow the support for more objects in the world as there is a greater opportunity to do culling on a separate processor at a more detailed level.

As nice as the data level parallelism is you're also going to have to remember the overheads caused by context switches. Having more threads than you do cores is not free, and doesn't always give you a performance benefit. Setting up all your independent physics objects to be processed at once if you only have two or four cores is going to kill your performance, the same can be said for AI updates, particle updates and the like.

You also need to watch for SMT cores rather than truly independent cores. Putting two CPU intensive threads on the same core will also slow you down as they will be fighting to share the same CPU cache, and ultimately slow your frame rate.

While it's all good to say that you should be threading at the data level, it brings many problems with synchronisation, race conditions and SMT core sharing that need to be designed efficiently.

But like you say, if you are just splitting the game loop and the rendering loop then you will only need a couple of syncs to switch double buffered render lists. This would allow you to write a detailed and expensive renderer and have a good game loop for updating physics etc.
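
To give an idea of what switching double buffered render lists can look like, here is a rough sketch in portable C++. DrawCmd, RenderLists and runDoubleBufferDemo are hypothetical names, and a real render list would hold far more than an id and a coordinate. The game thread may publish faster than the renderer draws, in which case older lists are simply dropped.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical draw command; a real render list would hold meshes, states, etc.
struct DrawCmd { int objectId; float x; };

// Two render lists: the game thread fills the back list while the render
// thread reads the front one. The only synchronised step is the swap.
class RenderLists {
public:
    // Game thread: publish a finished list for the renderer.
    void publish(std::vector<DrawCmd> list) {
        std::lock_guard<std::mutex> lock(mutex_);
        back_ = std::move(list);
        fresh_ = true;
    }

    // Render thread: pick up the newest list if one was published.
    // Returns true when `front` was replaced.
    bool acquire(std::vector<DrawCmd>& front) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!fresh_) return false;
        std::swap(front, back_);
        fresh_ = false;
        return true;
    }

private:
    std::vector<DrawCmd> back_;
    bool fresh_ = false;
    std::mutex mutex_;
};

// The game thread publishes `frames` lists; the calling thread plays renderer
// and returns the x value of the last list it picked up.
float runDoubleBufferDemo(int frames) {
    RenderLists lists;
    std::thread game([&lists, frames] {
        for (int f = 1; f <= frames; ++f)
            lists.publish({DrawCmd{0, static_cast<float>(f)}});
    });
    std::vector<DrawCmd> front;
    float lastX = 0.0f;
    while (lastX < static_cast<float>(frames)) {
        if (lists.acquire(front)) {
            assert(!front.empty());
            lastX = front[0].x;  // "draw" the front list
        } else {
            std::this_thread::yield();
        }
    }
    game.join();
    return lastX;
}
```

The lock is held only for the swap itself, so neither thread waits while the other is building or drawing a list.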

It's worth getting the Gamefest presentations from MSDN and doing a bit of research into what they have to say.

Guest Anonymous Poster
i would like to make a difference here between directx and opengl.

directx is based on batching, calling the render-routines also involves the cpu. but you have a speed (re)gain when uploading your data in big batches.

on the other hand, opengl and the vbos do not take that much time to upload data to the graphics hardware. ogl "collects" the data first, and then passes it off, it is less cpu-intensive.

so i would say since you are using opengl, your synchronizing code will take more time to execute than you are going to gain when using multithreading.
(this is only true for single-cpu systems).
on the other hand, i heard of speed gains of about 5-10 percent when using direct graphics and multiple threads.

But, as SimonForsman suggested, it will be a greater speed improvement when you optimize things like ai and physics. maybe you want to use the "fork-join" model to avoid synchronizing issues. and maybe you would like to have a look at the OpenMP multithreading standard which is implemented by many compilers (g++, ms visual studio 2005, etc.).
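
a small sketch of the fork-join idea, written with plain C++ threads instead of OpenMP so it builds without extra compiler flags. the function name integrate and the position/velocity update are made up for the example; with OpenMP the same loop would typically be a single "#pragma omp parallel for" over i.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Fork-join sketch: split the range into one contiguous chunk per thread,
// run them all, then join before anyone reads the results.
void integrate(std::vector<float>& positions, const std::vector<float>& velocities,
               float dt, unsigned threads) {
    assert(threads > 0 && positions.size() == velocities.size());
    const std::size_t n = positions.size();
    const std::size_t chunk = (n + threads - 1) / threads;  // ceiling division
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {           // the "fork" half
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        pool.emplace_back([&positions, &velocities, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                positions[i] += velocities[i] * dt;
        });
    }
    for (std::thread& th : pool) th.join();            // the "join" half
}
```

each thread touches only its own slice, so no locks are needed; the join is the single synchronization point.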

Quote:
Posted by Steve132
and I thought that I would therefore need very few thread syncs in order to make this work...am I incorrect? Would this be a bad idea? I figured I could get away with directly syncing the threads once every 5-10 rendering loops..

This could work (I'm still not sure) if all your operations are atomic but that's unlikely. It also depends on how complex your app is: whether you're going to pull a lot of data to/from the GPU or not, how much you rethink your rendering strategies, how likely your data structures are to be inconsistent... I personally would not trust this too much but I encourage you to go that way and prove me wrong. I would really appreciate it if you could post a way to do this on a "complex" app. It really takes a lot of strategy to do that and I would likely mess it up easily but that's my opinion.

Say for example you have three "command buffers" which you refresh every 10 render loops. That's just fine: every 10 render loops, you take a critical section (which probably won't block) then signal the other thread to use the successive buffer (which is now in a consistent state). That could work and requires very few syncs but for example, collisions (at least for the player camera) must be handled on a per-frame basis so you must... somehow... lock at least your MVP and that's just an example!

EDIT:
I forgot to note you can also perform some simple tasks with interlocked operations. Nothing really fancy, but they're listed with all the other synchronization functions on MSDN
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/synchronization_functions.asp
Interlocked operations are FAST, so they're a natural fit for the above example.
/EDIT
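
A possible sketch of the three-buffer hand-off done with nothing but an atomic exchange. It uses std::atomic::exchange as a portable stand-in for InterlockedExchange, assumes exactly one writer and one reader thread, and the names TripleBuffer and runTripleBufferDemo are invented for the example. Each slot holds a single int to keep it short; a command buffer would go there instead.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <thread>

// Triple buffer for one writer and one reader. std::atomic::exchange is used
// as a portable stand-in for win32 InterlockedExchange.
class TripleBuffer {
public:
    // Writer thread only: fill the back slot, then trade it for the middle one.
    void write(int value) {
        slots_[back_] = value;
        back_ = middle_.exchange(back_ | kFresh, std::memory_order_acq_rel) & kIndexMask;
    }

    // Reader thread only: returns true and updates `out` if a newer value exists.
    bool read(int& out) {
        if (!(middle_.load(std::memory_order_acquire) & kFresh)) return false;
        front_ = middle_.exchange(front_, std::memory_order_acq_rel) & kIndexMask;
        out = slots_[front_];
        return true;
    }

private:
    static constexpr unsigned kFresh = 4;      // flag bit: middle slot holds unread data
    static constexpr unsigned kIndexMask = 3;  // low bits: slot index 0..2
    std::array<int, 3> slots_{};
    unsigned back_ = 0;                // owned by the writer
    unsigned front_ = 1;               // owned by the reader
    std::atomic<unsigned> middle_{2};  // shared hand-off slot
};

// The writer publishes 1..frames; the calling thread reads until it sees the last one.
int runTripleBufferDemo(int frames) {
    assert(frames >= 0);
    TripleBuffer buffer;
    std::thread writer([&buffer, frames] {
        for (int f = 1; f <= frames; ++f) buffer.write(f);
    });
    int latest = 0;
    while (latest < frames) {
        if (!buffer.read(latest)) std::this_thread::yield();
    }
    writer.join();
    return latest;
}
```

Neither side ever blocks: the writer always has a slot to fill and the reader always keeps a consistent one, at the price of the reader possibly skipping intermediate values.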

Quote:
Posted by lc_overlord
Anyway, what I think Krohm is saying is that all the data processing you do while rendering will stall one or the other thread multiple times, to the point that you might not gain that much from multithreading; of course this depends a lot on your implementation.

The real problem is that the rendering thread has to be horribly linear, everything has to be done in the correct order and everything depends on what happened before it, this means that you just can't split up the rendering thread without running into problems (The irony is that computer graphics is both the greatest champion of parallel computing and its greatest opponent).
Now while the rendering thread is horribly linear it is also pretty easy to predict what data it will need thus you can (if you want to) precalculate the data up to a frame in advance.
I hope that this gives you a few ideas.

Yes, this is quite similar to what I tried to say... I was thinking at another aspect of it. You expose a very important issue but I will try again to explain why I don't like thread "pipelining" so much.

Both "parallelizing" and "pipelining" can yield very good performance gains.
What has been observed however is that to have near-optimal sync access you need to buffer the commands from one stage to the other.

Now, this leads to two cases: on slow systems, the delay can be noticeable because of the longer tick, which is unfortunate. By contrast, on faster systems, the delay is ok but maybe those systems don't need the extra horsepower.
It is often tolerated to have 1 or 2 frames of lag but those are added to the API's own buffering and can easily go out of control. I'm sure you can design something which avoids all of those issues, it depends on how much you want to invest in it.
You can obviously fight this by using smaller buffers (say a one-buffer ring instead of two or three) but this may come to the point of serializing both threads. Even in this case there could be a gain but it would be abysmal when compared to a fully parallel solution.

Quote:
Posted by bonus
It's being touted as the way to go by Microsoft in their recent Gamefest presentations and I'd tend to agree with them.

You also need to watch for SMT cores rather than truly independent cores. Putting two CPU intensive threads on the same core will also slow you down as they will be fighting to share the same CPU cache, and ultimately slow your frame rate.

Keep in mind that Direct3D takes much more effort on the CPU side and does have some issues with multithreading. If you don't specify the D3DCREATE_MULTITHREADED flag you may have bad trouble... but this may easily cost you a bit of perf. It can also happen that it works perfectly without it (who knows), I believe those metrics to be really application dependent. It definitely makes sense to move this critical path out of D3D (which runs standard) and sync to another thread because you have control and knowledge on when this sync happens.
If this makes sense in OpenGL I don't know.

Intel says you can put CPU intensive threads on the same cache because they'll have warm cache benefits. It really takes more than a few lines on this.

I have begun reading gamefest presentations a few days ago. I still have to read them all but up to now, the only thing interesting I found was the sparse textures hash. Wow, that rocked! The sphere-based ambient occlusion also looks nice, but I'm not sure it's worth it. By the way, what's your feeling on those two?

[Edited by - Krohm on October 4, 2006 10:21:06 AM]

Quote:
Original post by Krohm
Keep in mind that Direct3D takes much more effort on the CPU side and does have some issues with multithreading. If you don't specify the D3DCREATE_MULTITHREADED flag you may have bad trouble... but this may easily cost you a bit of perf. It can also happen that it works perfectly without it (who knows), I believe those metrics to be really application dependent. It definitely makes sense to move this critical path out of D3D (which runs standard) and sync to another thread because you have control and knowledge on when this sync happens.
If this makes sense in OpenGL I don't know.

Intel says you can put CPU intensive threads on the same cache because they'll have warm cache benefits. It really takes more than a few lines on this.

I have begun reading gamefest presentations a few days ago. I still have to read them all but up to now, the only thing interesting I found was the sparse textures hash. Wow, that rocked! The sphere-based ambient occlusion also looks nice, but I'm not sure it's worth it. By the way, what's your feeling on those two?


Apparently using the D3DCREATE_MULTITHREADED flag is a big performance hassle too as it causes D3D to synchronise on every call to it. The way to go is to set D3D up in one thread and have only that thread call it, which is probably another reason why they recommend a rendering thread for D3D apps.

I haven't looked at those other two yet as I'm currently looking at the feasibility of doing a multithreaded engine, other nice optimisations and code can come once the base systems are looked into (i.e. data sharing between the game logic thread and the rendering thread). I'm looking at locking the frame rate to 60fps (console outlook) so the system doesn't need to be written to run as fast as humanly possible but should be written to keep the GPU busy for as much of a 60fps time slice as possible.

So my outlook on things was definitely aimed more at D3D than OpenGL as that's what I'm currently looking at but I'd still be sceptical of the benefits of forcing context switches by data parallelism in things like physics systems. Maybe splitting the list across the number of idle cores you have available would be good, but how many systems actually have idle cores at any given time? For professional release games you'd be much better off looking at sticking major systems such as networking and resource caching into their own thread as they get major benefits from running compression algorithms and reducing the latency between asking for something and getting it.

Once you have the bases covered for reducing the latency in the application from loading or communicating then worry about threading things like physics, if you have enough (i.e. any) cores left which won't be impacted by context switches.

Quote:
Original post by bonus
I'd still be sceptical of the benefits of forcing context switches by data parallelism in things like physics systems.

The cost of switches is by far regained by the large performance gain in most cases. The cost itself is however pretty low when you consider that ring0 transitions are actually context switches with extra fat added. I'm not really convinced you should use this as a design metric.
Quote:
Original post by bonus
Maybe splitting the list across the number of idle cores you have available would be good, but how many systems actually have idle cores at any given time?

You're right, you cannot really know if a core is idle or not but assuming your application is the only one "performance hungry" will give you enough clues. If the user then launches another performance app, it's his/her business.
EDIT:
Thinking again on it, maybe win32 does provide some performance metric to estimate this. I'm not sure however.
/EDIT
Quote:
Original post by bonus
For professional release games you'd be much better off looking at sticking major systems such as networking and resource caching into their own thread as they get major benefits from running compression algorithms and reducing the latency between asking for something and getting it.

Although I agree threading networking or streaming is definitely useful, I don't see how the insane amount of network latency (or HD latency, for that matter) could be usefully reduced. Threading is not magic. Also keep in mind that streaming for example will kick in just a few times per second (an order of magnitude less than physics integration). Theoretically you don't even need threading if you use non-blocking I/O (but I also like threading there because it insulates the ugly things).
Quote:
Original post by bonus
Once you have the bases covered for reducing the latency in the application from loading or communicating then worry about threading things like physics, if you have enough (i.e. any) cores left which won't be impacted by context switches.

I suggest another thing:
After you have something running and benchmarked against the target machine, optimize performance paths, then care about the details.
I say the context switches are detail because if you take physics at 30fps (usually it's 20 or even 12fps) you realize this is somewhat longer than the normal OS time slice, which is already a good solution.

In short, I'm likely misunderstanding your points.

[Edited by - Krohm on October 7, 2006 3:33:20 AM]

Thank you very much everyone...

Haha but all these solutions look difficult to me.

Because I have never done multithreading before.

I found a site which had an abstract thread but I don't know how to use it.
http://www.leunen.com/fclt/threadcls.html

When I compile it with my main thread, I get this error:
"Projectest error LNK2019: unresolved external symbol "public: __thiscall mlLib::WinThread::WinThread(void)" (??0WinThread@mlLib@@QAE@XZ) referenced in function "public: __thiscall MyThread::MyThread(void)" (??0MyThread@@QAE@XZ)" and "Projectest error LNK2019: unresolved external symbol "public: virtual __thiscall mlLib::WinThread::~WinThread(void)" (??1WinThread@mlLib@@UAE@XZ) referenced in function "public: virtual __thiscall MyThread::~MyThread(void)" (??1MyThread@@UAE@XZ)"


Can someone help me by writing a file reading thread which will keep reading from a file and then update some of the variables (i.e. GLfloat) belonging to the "owner" of the file reading thread? The main thread will be an OpenGL thread which keeps rendering and changes the position of the object based on the variables updated by the file reading thread.

Please help because I have tried to do multithreading for weeks and I am nowhere near there. :(

If there is someone really kind enough to help me, please send me the code to my email: shelai@gmail.com

Thanks to everyone.

By the way, I am using visual studio.

Quote:
Haha but all these solutions look difficult to me.
Because I have never done multithreading before.
I found a site which had an abstract thread but I don't know how to use it.

They could look like that, but I hope you'll find it easier after reading this.
Point 1: do not use wrappers, unless you wrote them yourself.
When you use threading, you're really saying "I 4/\/\ r0xx0|2" so you really must get your hands dirty with the good ol' win32 calls.
As you'll see, the calls are quite easy and trying to wrap them completely would probably be overkill.
So, in other words, you cannot go around saying you're multithreading and then expect some magic lib to do your work efficiently. Threading is low-level stuff and you have to know what's happening behind the scenes for the sake of efficiency. If you don't want to know, you can use wrappers but if something goes awry, you were the one aiming at the foot.

Quote:
By the way, I am using visual studio.

Point 2: you need to get your hands dirty so make sure you're compiling native code and using the latest MS Platform SDK.
VC2005 Express, freely available from MS, is not set up to build win32 programs by default.
Canonical URL: native win32 code with VC2005 Express

After you have everything set up and running, Point 3: it's crunch time.
You'll have your "main" doing something... and the other thread doing something else, but enclosed where? Threads in win32 are special functions (you can also use standard C threads but I simply feel better with the native win32 ones).
Canonical URL: Process and thread functions for win32 (but you'll be much happier with the platform documentation which comes with the PSDK).

Starting a thread takes calling CreateThread with the right parameters. Quick reference:

  1. lpThreadAttributes is NULL.

  2. dwStackSize is 0.

  3. lpStartAddress is a pointer to your routine. All it takes is a "special" function without the function call operator - that is, without the (), name only.

  4. lpParameter is a pointer to an arbitrary buffer. It'll likely take a few casts (yes, ugly, unsafe C casts) so you may need a bit of sugar to digest this one. I usually pass object pointers here for example. In short, all the parameters the thread needs should be reachable from there.

  5. dwCreationFlags specifies whether the thread should run immediately or not (maybe other less useful things). I usually set this to 0.

  6. lpThreadId is a thread identifier I don't really like... I usually set it to NULL but you may want to keep it. I find the returned HANDLE to be much more powerful.


What about your "special" thread function? Nothing really special. It just needs to be a stdcall (WINAPI) returning 32bit uint (DWORD) and taking a void* (LPVOID).

If you want to put your thread in an object, you have to use a static function (which will become stdcall instead of thiscall). You should then pass the this pointer as the LPVOID parameter. When the thread starts, it'll take the object pointer from its private stack and live with it.
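
A rough sketch of the static-function trick, written with portable types so it compiles on any platform: unsigned long stands in for DWORD, void* for LPVOID, and std::thread stands in for CreateThread. Worker, threadEntry and runWorker are made-up names. On win32 the entry point would be declared DWORD WINAPI and started with CreateThread(NULL, 0, &Worker::threadEntry, &w, 0, NULL).

```cpp
#include <cassert>
#include <thread>

class Worker {
public:
    // Static member: there is no hidden `this`, so it can be used as a plain
    // function pointer. The object arrives through the void* parameter.
    static unsigned long threadEntry(void* param) {
        assert(param != nullptr);
        Worker* self = static_cast<Worker*>(param);  // recover the object
        self->run();
        return 0;
    }

    int result() const { return result_; }

private:
    void run() { result_ = 42; }  // the real per-object work would go here
    int result_ = 0;
};

// Starts the worker on its own thread, waits for it and returns its result.
int runWorker() {
    Worker w;
    std::thread t(&Worker::threadEntry, static_cast<void*>(&w));
    t.join();  // the win32 version would wait on the returned HANDLE instead
    return w.result();
}
```

The object has to outlive the thread, which is why runWorker joins before w goes out of scope.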

Take care: everything in the thread is private to it so local variables don't mess each other up. Global variables however mean trouble: they do mess each other up. Use Thread-Local-Storage for thread-local global variables or find a way to manage the issue (I find this easier).
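
To show the thread-local idea without the win32 TLS calls, here is a small sketch using the C++11 thread_local keyword as a stand-in. The names privateCounter, countLocally and threadsKeepSeparateCounters are only for illustration.

```cpp
#include <cassert>
#include <thread>

// A "global" with one independent copy per thread, so no lock is needed.
thread_local int privateCounter = 0;

// Bumps the calling thread's own copy and returns its final value.
int countLocally(int steps) {
    for (int i = 0; i < steps; ++i) ++privateCounter;
    return privateCounter;
}

// Runs countLocally on two threads and checks that neither saw the other's
// increments and that the calling thread's copy was never touched.
bool threadsKeepSeparateCounters() {
    int a = 0, b = 0;
    std::thread t1([&a] { a = countLocally(1000); });
    std::thread t2([&b] { b = countLocally(2000); });
    t1.join();
    t2.join();
    return a == 1000 && b == 2000 && privateCounter == 0;
}
```

With an ordinary global instead of thread_local, the two loops would race on the same variable and the totals would be unpredictable.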

Test that the threads do work. For example, make 'em printf something to the console or create and fill two different files... or a single file with 0 and 1.

After this has proven to work, you need to work out the communications. This takes the functions named above: CRITICAL_SECTION objects, CreateEvent, WaitForMultipleObjects[Ex]...
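
To make the communication part concrete for the file-reading case, here is a minimal sketch. It is portable C++ rather than win32 (std::mutex in place of a CRITICAL_SECTION, std::thread in place of CreateThread), it reads from a string stream so it runs without a real file, and SharedPosition, readPositions and runReaderDemo are hypothetical names. The floats could be GLfloat, which is a typedef for float in the OpenGL headers.

```cpp
#include <atomic>
#include <cassert>
#include <istream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

// Position shared between the reader thread and the render loop.
class SharedPosition {
public:
    void set(float x, float y, float z) {
        std::lock_guard<std::mutex> lock(mutex_);
        x_ = x; y_ = y; z_ = z;
    }
    void get(float& x, float& y, float& z) {
        std::lock_guard<std::mutex> lock(mutex_);
        x = x_; y = y_; z = z_;
    }
private:
    std::mutex mutex_;
    float x_ = 0, y_ = 0, z_ = 0;
};

// Reader thread body: parse "x y z" triples until the stream ends.
void readPositions(std::istream& in, SharedPosition& pos, std::atomic<bool>& done) {
    float x, y, z;
    while (in >> x >> y >> z) pos.set(x, y, z);
    done = true;
}

// Stand-in for the render loop: poll the position once per "frame" until the
// reader is finished, then return the final x.
float runReaderDemo(const std::string& text) {
    std::istringstream in(text);  // a std::ifstream would work the same way
    SharedPosition pos;
    std::atomic<bool> done{false};
    std::thread reader([&] { readPositions(in, pos, done); });
    float x = 0, y = 0, z = 0;
    while (!done) {
        pos.get(x, y, z);           // a real loop would draw the object here
        std::this_thread::yield();
    }
    reader.join();
    pos.get(x, y, z);               // pick up the last update
    return x;
}
```

All three coordinates are copied under one lock so the render loop never sees a half-updated position.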

