
Multithreading in games


20 replies to this topic

#1 Ripiz   Members   -  Reputation: 529


Posted 04 September 2012 - 02:52 AM

I've implemented a fixed time step based on L. Spiro's article (http://lspiroengine.com/?p=378); such an approach forces you to separate updating and rendering completely. I thought this would be a good place to split them into 2 threads:
  • update thread: input, physics, AI, etc.; doesn't touch rendering at all;
  • render thread: only reads shared data, never writes it; takes care of rendering only.

However, there was another thread about multithreading where some people said this is a bad approach because of multiple dependencies, and that tasks (e.g. resource loading) are better. Does this still apply even though the threads are quite separate and the dependencies aren't very high (the GUI seems to have the heaviest dependency)?

Thank you in advance.


#2 Hodgman   Moderators   -  Reputation: 31021


Posted 04 September 2012 - 03:21 AM

rendering thread; not allowed to write shared data at all, only reads it

Assuming no synchronisation, this is still a race condition. You need two copies of your shared data: the update thread reads the previous state from B and writes a new game state to A, while the render thread also reads the previous game state from B and draws it. When both are finished, you swap the A/B pointers and start both threads going again.
There's a Microsoft presentation called "Multicore Programming, Two Years Later" that explains this technique quite well, but all my links are dead.

N.B. this design only scales up to dual-core CPUs, and is only of any use if your CPU usage is fairly well split between "update" and "render" tasks.
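A minimal sketch of this A/B swap (all names and the `GameState` contents are hypothetical; a real engine would park both threads at a frame boundary rather than spawn new threads each frame):

```cpp
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal game state; a real engine would hold entities, physics, etc.
struct GameState {
    std::vector<float> positions;
};

// Run `frames` iterations of the two-thread A/B scheme and return the first
// position of the state the renderer would see afterwards.
float run_double_buffer_demo(int frames) {
    GameState bufA{std::vector<float>(4, 0.0f)};
    GameState bufB{std::vector<float>(4, 0.0f)};
    GameState* writeState = &bufA;  // update thread writes here
    GameState* readState  = &bufB;  // render thread reads here

    for (int f = 0; f < frames; ++f) {
        // Update thread: builds the new state from the previous one.
        std::thread update([&] {
            for (std::size_t i = 0; i < writeState->positions.size(); ++i)
                writeState->positions[i] = readState->positions[i] + 1.0f;
        });
        // Render thread: only reads the previous state.
        float rendered = 0.0f;
        std::thread render([&] { rendered = readState->positions[0]; });

        update.join();
        render.join();                     // both done: buffers are quiescent
        std::swap(writeState, readState);  // hand the new state to the renderer
        (void)rendered;
    }
    return readState->positions[0];
}
```

The swap happens only when both threads have joined, which is what makes the race-free "render reads, update writes" split possible.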

Edited by Hodgman, 04 September 2012 - 03:23 AM.


#3 L. Spiro   Crossbones+   -  Reputation: 13986


Posted 04 September 2012 - 05:37 AM

This type of decoupling is different from multi-threaded rendering. Both logic and rendering happen on the same thread, so it is implicit that, when rendering, nothing else is touching the vertex buffers, object states, etc.

Actual multi-threaded rendering is done by keeping a synchronous command buffer of your own design, which takes render commands from the game thread and executes them in order on the render thread.
Any resources, such as vertex and index buffers, that need to be modified on the game thread while there is a chance they are being used on the render thread need to be double-buffered, as Hodgman mentioned.
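A rough sketch of such a command buffer (assuming `std::function` commands for brevity; the description above implies a more compact custom command format, and `swapBuffers` would be called while both threads are parked at the frame boundary):

```cpp
#include <functional>
#include <utility>
#include <vector>

// The game thread records commands for frame N while the render thread plays
// back frame N-1's list. swapBuffers() runs at the sync point between frames.
class CommandBuffer {
public:
    // Game thread: queue a render command for the next frame.
    void record(std::function<void()> cmd) { recording_.push_back(std::move(cmd)); }

    // Called once per frame while both threads are parked at the sync point.
    void swapBuffers() {
        std::swap(recording_, playing_);
        recording_.clear();
    }

    // Render thread: replay last frame's commands strictly in order.
    void execute() {
        for (auto& cmd : playing_) cmd();
        playing_.clear();
    }

private:
    std::vector<std::function<void()>> recording_;
    std::vector<std::function<void()>> playing_;
};
```

A real engine would store a flat opcode-plus-arguments stream instead of `std::function`s, both for cache friendliness and to avoid per-command allocations.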


L. Spiro

Edited by L. Spiro, 04 September 2012 - 05:46 AM.

It is amazing how often people try to be unique, and yet they are always trying to make others be like them. - L. Spiro 2011
I spent most of my life learning the courage it takes to go out and get what I want. Now that I have it, I am not sure exactly what it is that I want. - L. Spiro 2013
I went to my local Subway once to find some guy yelling at the staff. When someone finally came to take my order and asked, “May I help you?”, I replied, “Yeah, I’ll have one asshole to go.”
L. Spiro Engine: http://lspiroengine.com
L. Spiro Engine Forums: http://lspiroengine.com/forums

#4 Ripiz   Members   -  Reputation: 529


Posted 05 September 2012 - 10:53 AM

Hm... Thank you.
That does have a point, but I wonder where I could create tasks to make use of at least 2 cores. Resource loading doesn't happen every frame.

#5 zfvesoljc   Members   -  Reputation: 440


Posted 06 September 2012 - 12:58 AM

Hm... Thank you.
That does have a point, but I wonder where I could create tasks to make use of at least 2 cores. Resource loading doesn't happen every frame.


Wherever you have N items to update. E.g. 200 particle effects can be split into 4 tasks, where each one updates 50 particle effects.
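As a sketch, splitting an update across chunks might look like this (a real engine would submit the chunks as tasks to a pool rather than spawn raw threads each frame; `parallel_update` and the integer items standing in for particle effects are illustrative only):

```cpp
#include <thread>
#include <vector>

// Update `items` in parallel by splitting the range into `taskCount` chunks,
// one std::thread per chunk. Chunks are disjoint, so no locking is needed.
void parallel_update(std::vector<int>& items, int taskCount) {
    std::vector<std::thread> workers;
    const std::size_t n = items.size();
    for (int t = 0; t < taskCount; ++t) {
        std::size_t begin = n * t / taskCount;
        std::size_t end   = n * (t + 1) / taskCount;
        workers.emplace_back([&items, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                items[i] += 1;  // stand-in for "update one particle effect"
        });
    }
    for (auto& w : workers) w.join();
}
```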

#6 L. Spiro   Crossbones+   -  Reputation: 13986


Posted 06 September 2012 - 05:57 AM

Resource-loading is easy enough to handle via thread pools. You have to make a decision as to what to display during loading, however. It could just be a loading screen, or it could be like in Unreal Engine where low-quality textures are loaded first (and can cause a stall, although it is rare that it does) and as higher-quality texture data is loaded things suddenly gain more detail.

This can be extended to dynamically flush and reload resources behind the scenes, to allow loading an indefinite number of resources; your engine would have to decide how to page them in and out.

If you are looking for other ways to utilize 2 cores, again you can just do the multi-threaded rendering technique with a custom command buffer.
Input should also be on its own thread (on Windows it has to be the main thread, since input is sent as window messages), since that is the only reliable way to time-stamp the input values you get. This is necessary for smooth processing of input data.
My recommended setup for Windows would be:
Main thread dedicated to input.
2nd thread runs game logic.
3rd thread does rendering.
4th thread does sound processing (and runs on a lower priority, often sleeping).

These 4 threads are dedicated (always running).
Then the thread pool can be used to send more threads out to handle resource loading etc.


L. Spiro

#7 web383   Members   -  Reputation: 787


Posted 06 September 2012 - 12:18 PM

Input should also be on its own thread (on Windows it has to be the main thread, since input is sent as window messages), since that is the only reliable way to time-stamp the input values you get. This is necessary for smooth processing of input data.


L. Spiro, why don't you make use of GetMessageTime() in the window procedure? This seems easier than handling input on a separate thread. Is it not accurate enough? What sort of precision between keystrokes do you typically look for?

#8 L. Spiro   Crossbones+   -  Reputation: 13986


Posted 06 September 2012 - 05:48 PM

I personally prefer to handle all of my times in microseconds rather than milliseconds, but milliseconds are acceptable for input timestamps.
If you also handle the other issues related to using GetMessageTime() then it is an acceptable alternative.


L. Spiro

#9 slicer4ever   Crossbones+   -  Reputation: 3945


Posted 06 September 2012 - 06:47 PM

rendering thread; not allowed to write shared data at all, only reads it

Assuming no synchronisation, this is still a race condition. You need two copies of your shared data: the update thread reads the previous state from B and writes a new game state to A, while the render thread also reads the previous game state from B and draws it. When both are finished, you swap the A/B pointers and start both threads going again.
There's a Microsoft presentation called "Multicore Programming, Two Years Later" that explains this technique quite well, but all my links are dead.

N.B. this design only scales up to dual-core CPUs, and is only of any use if your CPU usage is fairly well split between "update" and "render" tasks.


To add to this, Hodgman's method (from my understanding) means you can theoretically stall the update thread while waiting for the render thread to complete its job. Another approach is to use one set of data and two draw buffers. The update thread continuously modifies the data set; when it's done, it checks whether a buffer is ready to be written into (if not, the frame is essentially dropped). If the buffer is ready, it writes the data and marks it as swappable (this flag is what determines whether the buffer is writable). The renderer then comes along, swaps the draw buffers if the flag is set, clears the swappable flag, and continues drawing the same data until it sees the swappable flag again.

In essence, this doesn't tie your two threads together at all, and you can still do time-stepping code without worrying about potential live-locks.

This of course only works for two threads; any more would require some other method of synchronisation, such as thread pools.
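Sketched in code, the flag-based scheme might look like this (names are hypothetical; the memory orderings assume exactly one update thread and one render thread, as in the description above):

```cpp
#include <array>
#include <atomic>
#include <utility>

// One writable buffer, one drawable buffer, and an atomic "swappable" flag.
// If the renderer hasn't consumed the last published frame yet, the updater
// simply drops this frame's publish instead of blocking.
struct FrameExchange {
    std::array<float, 4> buffers[2] = {};
    int writeIndex = 0;                  // indices only change while the
    int readIndex  = 1;                  // updater sees swappable == false
    std::atomic<bool> swappable{false};

    // Update side: returns false if the frame had to be dropped.
    bool publish(const std::array<float, 4>& state) {
        if (swappable.load(std::memory_order_acquire))
            return false;                // renderer hasn't swapped yet: drop
        buffers[writeIndex] = state;
        swappable.store(true, std::memory_order_release);
        return true;
    }

    // Render side: swap if a fresh frame is ready, then draw readIndex.
    // Keeps redrawing the old frame if nothing new was published.
    const std::array<float, 4>& acquireForDraw() {
        if (swappable.load(std::memory_order_acquire)) {
            std::swap(writeIndex, readIndex);
            swappable.store(false, std::memory_order_release);
        }
        return buffers[readIndex];
    }
};
```

The flag doubles as the hand-off: the renderer only touches the indices while it is set, and the updater only touches them while it is clear, so neither thread ever blocks the other.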

Edited by slicer4ever, 06 September 2012 - 06:52 PM.

Check out https://www.facebook.com/LiquidGames for some great games made by me on the Playstation Mobile market.

#10 Servant of the Lord   Crossbones+   -  Reputation: 20308


Posted 06 September 2012 - 07:59 PM

My recommended setup for Windows would be:
Main thread dedicated to input.
2nd thread runs game logic.
3rd thread does rendering.
4th thread does sound processing (and runs on a lower priority, often sleeping).

These 4 threads are dedicated (always running).
Then the thread pool can be used to send more threads out to handle resource loading etc.

Does "Networking" count as input, and must it be done on the main thread, or is that a fifth thread?

Further, you say your "setup for Windows" - aside from specific platforms (like game consoles) where the hardware is always known ahead of time and can be optimized for, is there some reason to lay out the threads differently on Linux or Mac OS X?
It's perfectly fine to abbreviate my username to 'Servant' rather than copy+pasting it all the time.
All glory be to the Man at the right hand... On David's throne the King will reign, and the Government will rest upon His shoulders. All the earth will see the salvation of God.
Of Stranger Flames - [indie turn-based rpg set in a para-historical French colony] | Indie RPG development journal

[Fly with me on Twitter] [Google+] [My broken website]

[Need web hosting? I personally like A Small Orange]


#11 L. Spiro   Crossbones+   -  Reputation: 13986


Posted 07 September 2012 - 04:20 AM

Networking would be on another thread.
Quad-core systems are fairly standard today, so if we assume 4 cores, my recommended layout would be:

Sound takes medium resources and networking takes low-to-medium, so these could share a core.
Input takes few resources and the game thread should only be firing once every 30 milliseconds or so, so these could share a core.
Rendering would then have its own core.
This leaves one core entirely free for whatever else you need, especially the thread pool and resource loading.

I singled out Windows just because I guessed it would apply to him or her, but it would work just as well on Linux and Mac OS X.


L. Spiro

#12 Hodgman   Moderators   -  Reputation: 31021


Posted 07 September 2012 - 05:32 AM

My personal setup for a quad-core PC is:
3 main threads running game/sim/render tasks via SPMD. Tasks that must use a specific thread (e.g. DX9 calls) can be masked out e.g. if( CurrentThreadIndex()==0 )
1 low priority / oft-sleeping thread for long-running background tasks (e.g. decompression jobs)
+other low priority / oft-sleeping threads that are created by middleware (e.g. audio buffering).
Background loading and networking via asynchronous commands, not threads.

#13 web383   Members   -  Reputation: 787


Posted 07 September 2012 - 09:51 AM

How are you telling Windows to run a thread on a certain core? Can we assume the OS will automatically distribute your threads across cores for you in an appropriate manner? Does this information show up in the task manager/resource monitor for verification?

#14 Hodgman   Moderators   -  Reputation: 31021


Posted 07 September 2012 - 10:36 AM

Yeah, you can assume that Windows will do a decent job of distributing your threads over the cores.
You can override its decisions with the SetThreadAffinityMask function, but this can be harmful to performance if you're not as clever as Windows -- e.g. maybe the user is encoding a video in the background, which is fully using up one core; Windows knows that but your game doesn't.
I provide an option in my config file that's off by default; if it's enabled, it specifically binds the threads to cores -- users can turn this option on at their own risk.

#15 pmvstrm   Members   -  Reputation: 122


Posted 07 September 2012 - 10:54 AM

Hm... Thank you.
That does have a point, but I wonder where I could create tasks to make use of at least 2 cores. Resource loading doesn't happen every frame.


Just check out the open-source version of Intel Threading Building Blocks (TBB). It runs on VC++/GCC in 32/64 bit; you simply have to redistribute tbb.dll with your installer or compile it in statically.

http://threadingbuildingblocks.org/ver.php?fid=188

Use the tbb41_20120718oss_win.zip file and compile it with Visual C/C++; it builds in just under 2 minutes.
There are a lot of samples in the package showing how task-based multicore operations can be defined.

But to be honest, this is nothing compared with OpenCL or CUDA. However, there is no open-source implementation of those out there right now, so you have to make sure the specific closed-source CL driver is present, which unfortunately comes bundled only with the specific video adapter. Bottom line: for CL you need at least an ATI or NVIDIA card installed, plus the vendor-specific driver for each piece of graphics hardware. My advice: stay with TBB and OpenGL 3.2, and use shaders where possible; it's less of a headache.

Peter

#16 pmvstrm   Members   -  Reputation: 122


Posted 07 September 2012 - 11:05 AM

Hmm,

Quad-core systems are fairly standard today so if we assume 4 cores, my recommended layout would be:


Intel in 2009 talked about the possibility of integrating 1024 cores on a chip. There will be a lot more cores in a relatively short time, so any code should in general be prepared to scale across a huge number of cores. The OpenCL, DirectCompute and CUDA general-purpose units are also growing every day; modern graphics cards already ship with over 2000 shader cores, far beyond the power of today's multicore CPUs. I think that is important. The other trend is cloud-based, server-side rendering with small mobile devices in an interconnect scenario.

Sound takes medium resources and network takes medium-to-low–medium, so these could be on a core together.


For networking you have to deal with lag, so you cannot run such an important part in an async thread - but synchronising threads and locks is a first-class performance killer. I think using Intel TBB's task-based approach can be easier than dealing with platform-specific native thread subsystems (POSIX/Mac/Win threads).

Peter

#17 phantom   Moderators   -  Reputation: 7398


Posted 07 September 2012 - 11:42 AM

But be honest. This is nothing if you compare it with OpenCL or CUDA


Nor is it meant to do the same thing.

Despite what NV might want you to believe, the GPU isn't "BEST AT EVERYTHING!!!!!!", and the CPU still has plenty of work to do when it comes to things you need the result back from quickly.

Using the GPU is good when you aren't too worried about the latency involved in getting the data back, but it's not the be-all and end-all of parallel development.

#18 pmvstrm   Members   -  Reputation: 122


Posted 07 September 2012 - 03:48 PM

Not if you use CUDA or OpenCL. In this scope any CPU/GPU is only a core. In most cases you are not forced to transfer data from memory to GPU memory and vice versa; you can write an OpenCL kernel which accesses video RAM directly, without waiting on the CPU cores or RAM. Anyway: if the future is no longer just multicore but general-purpose cores for everything, then in a few years 16 cores per CPU will be standard, like quad-core is today. GPUs today have 500 up to 2000 cores and counting. We will have to deal with thousands of cores in the future, and code which is limited to a single core is not future-ready.

#19 phantom   Moderators   -  Reputation: 7398


Posted 07 September 2012 - 05:38 PM

Yes, even if you use CUDA or OpenCL (more so if you use CUDA, as you are locked to NV hardware and can't even fall back to the CPU).

Not all workloads are going to parallelise well onto a GPU and use it effectively; at that point you need alternative solutions.

GPUs are good at highly parallel workloads where you can get good occupancy and don't need to worry about the latency involved. However, there is a point of diminishing returns when it comes to the occupancy issue: if you start issuing too little work, the GPU starts to stall out waiting around for memory, and your thousands of cores go to waste. Dispatching less than 64 threads' worth of work on a modern GPU is going to bite you in the efficiency stakes. GPUs also don't deal well with branching: with 64 threads all moving in lock step, you need to ensure branch coherency is good or you'll start wasting time and resources. If you had an 'if...else' block on a GPU where both paths are approximately equal in cost, all it would take for your GPU code to run slowly would be one thread going down the 'else' path and doubling your run time.

CPUs, on the other hand, are very good at low-latency, branchy code where you have a few diverging paths you can take. While OpenCL can deal with this, it isn't always going to be the best way of dealing with the problem, which is where libraries such as TBB and MS's TPL come into play. Expressing a parallel 'for' loop is trivial in TBB/TPL; not so in OpenCL.

As for The Future, right now AMD has the right plan: a mixed approach where a CPU has both conventional cores and ALU arrays (GPUs, in other words) which can take on the workloads they each do well. The conventional core race has hit a wall - notice how we haven't increased core counts recently? (I bought a 4C/8T i7 back in 2008; just recently I bought a 4C/8T Ivy Bridge i7.) The future is mixed cores, and even with OpenCL around you need to place your workload and pick your API accordingly.

So once again: they do not do the same thing. The GPU isn't best. Don't depend on increasing core counts to fix performance issues. There is no 'one API to rule them all' when it comes to this kind of work.

#20 Hodgman   Moderators   -  Reputation: 31021


Posted 07 September 2012 - 09:54 PM

N.B. GPU 'cores' aren't the same as CPU cores. GPU manufacturers greatly exaggerate their numbers by multiplying by hardware threads and SIMD width.
Using the same definition, some quad core CPUs would actually have 32 cores...

The behavior is also completely different. A CPU core that supports 2 hardware threads is called 'hyperthreaded' and gives a small performance boost by switching threads during CPU stalls. GPU cores, on the other hand, are designed to run the same kernel many times with known stall points, saving the state of execution to a huge register bank, which allows for hundreds of hardware threads per core (hence the inflated numbers). Also, the number of "hardware threads" per physical core isn't fixed like on a CPU; it can vary depending on how many registers the current kernel requires to store its state.
It's a very different design, which makes it hard to do an apples-to-apples comparison.

If I wrote software to cleverly round-robin a batch of kernel executions on a CPU core by switching them in/out of L1 the way a GPU does, then I'd again get to multiply that '32' by another constant. Using their definitions, I could make a quad-core CPU actually be a 512-core GPU...

Edited by Hodgman, 08 September 2012 - 03:45 AM.




