Looking into how to parallelize my game engine


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

20 replies to this topic

#1 Narf the Mouse   Members   -  Reputation: 318

Posted 20 June 2012 - 09:45 AM

Making my game engine parallel is next on the list: this version is still at an early stage, so it seems a good idea to do that before I go any further. I know of the following methods (C#), with examples:

1) Build command queues for drawing objects in parallel, then execute the command queues single-threaded (as I understand it, drawing to the GPU from multiple threads doesn't work too well). Downsides: command-object overhead; in my test it was not actually faster for untextured, unlit cubes (not conclusive, but worrisome).
2) One thread per "process"; that is, a DrawThread, a PhysicsThread, an AIThread, and so on. Downsides: not sure it is always possible to have as many processes as hardware threads; synchronization between threads.

Looking for:
Information on those methods.
Information on other methods.
Tutorials.

Thanks.

#2 frob   Moderators   -  Reputation: 22800

Posted 20 June 2012 - 10:43 AM

There are many types of parallelization.

You can make work asynchronous. Examples are I/O where you give a buffer for read, make an asynchronous disk call, and get a callback when the buffer has been read. Libraries that already work asynchronously include rendering libraries like OpenGL and Direct3D, and audio libraries.
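
The asynchronous I/O pattern described above can be sketched roughly as follows. This is a minimal illustration in Python rather than a real engine API: `read_file_async`, `on_loaded`, and the pool size are hypothetical names and values chosen for the example.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable

# One shared pool for blocking disk reads (the size is an arbitrary assumption).
_io_pool = ThreadPoolExecutor(max_workers=2)


def read_file_async(path: str, on_loaded: Callable[[bytes], None]) -> Future:
    """Start reading `path` in the background and call `on_loaded(data)` when done."""

    def _read() -> bytes:
        with open(path, "rb") as f:
            return f.read()

    future = _io_pool.submit(_read)
    # The callback normally runs on the worker thread, not on the caller's thread.
    future.add_done_callback(lambda fut: on_loaded(fut.result()))
    return future
```

The caller keeps running after `read_file_async` returns; anything the callback touches must therefore be safe to access from another thread.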

You can parallelize algorithms across processors. Examples of this would be searching where you give each processor a block of memory to process. Graphics cards do this internally on a massive scale. Some physics engines do this.

You can use parallel operations on a single processor. Generally this means SIMD instructions (MMX, SSE, etc.). Combined with careful cache usage, this is a common way to get incredible throughput on particle systems.



Putting render, physics, and AI each into their own thread can help a little, but it generally introduces more complexity than it is worth.

Your graphics library is already doing asynchronous rendering for you, so there is no need to add another threading layer.

Asynchronous rendering calls are the default. Asynchronous audio calls are the default. You can make asynchronous I/O calls your default.



When it comes to breaking work across multiple processors, there are many excellent books on the subject. The best free online resource I've found is "Designing and Building Parallel Programs", although it is focused on large-scale computational tasks. The best method I've found among books is called the "PCAM" model. P = Partition: split your work into the smallest possible chunks. C = Communication: find the communication patterns between the chunks of work. A = Agglomeration: turn those chunks of work into logical groups. M = Mapping: map those logical groups of work onto actual per-thread code. This method is covered in the DBPP online book.

As an example, take a trivially-parallel task of searching:

Partition = the smallest chunk of work is comparing a single item
Communication = each chunk must read its item from memory and return a result
Agglomeration = to minimize communication, allocate a contiguous block of the search space based on memory access speed and compute time
Mapping = each of the n processors gets 1/n of the items to search
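
As a rough illustration of the four steps above, here is a sketch of a chunked parallel search in Python. The function name `parallel_find` and the default worker count are assumptions made for the example. In standard CPython the global interpreter lock keeps threads from speeding up a pure-Python loop like this, so treat it as a picture of the decomposition, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence


def parallel_find(items: Sequence[int], target: int, workers: int = 4) -> Optional[int]:
    """Return the lowest index of `target` in `items`, or None if it is absent."""
    if not items:
        return None
    # Agglomeration + mapping: one contiguous slice of the search space per worker.
    chunk = (len(items) + workers - 1) // workers

    def search(start: int) -> Optional[int]:
        # Partition: the smallest unit of work is a single comparison.
        end = min(start + chunk, len(items))
        for i in range(start, end):
            if items[i] == target:
                return i
        return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Communication: each worker reads its slice and returns one result.
        results = list(pool.map(search, range(0, len(items), chunk)))
    hits = [r for r in results if r is not None]
    return min(hits) if hits else None
```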

Another example is producer/consumer or master/worker pools. Again, you must partition the problem into chunks of work for the worker threads, establish what needs to be communicated, and figure out how to minimize communication. Mapping is automatic: the master comes up with blobs of work and each worker thread takes the next one as soon as it finishes its current work.
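
A master/worker pool of this kind might look roughly like the sketch below. `run_worker_pool` and the `STOP` sentinel are hypothetical names, and the sketch assumes that jobs do not raise exceptions.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_worker_pool(jobs: Iterable[Callable[[], int]], workers: int = 4) -> List[int]:
    """The master puts jobs on a shared queue; each worker takes the next one when free."""
    todo: queue.Queue = queue.Queue()
    results: List[int] = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def worker() -> None:
        while True:
            job = todo.get()
            if job is stop:
                break
            value = job()
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for job in jobs:      # the master produces the blobs of work
        todo.put(job)
    for _ in threads:     # one sentinel per worker
        todo.put(stop)
    for t in threads:
        t.join()
    return results        # completion order, not submission order
```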



It is mostly engine work that is amenable to parallelization. To be a good candidate for parallel processing, a task needs to involve a lot of the same thing. Game code tends to be one-off work, such as reacting to a player bumping a consumable object, a single mouse click, or a single received network packet. That is NOT lots of the same thing, and therefore NOT a good candidate for parallel processing.

Edited by frob, 20 June 2012 - 10:45 AM.


#3 Narf the Mouse   Members   -  Reputation: 318

Posted 20 June 2012 - 11:11 AM

Some of that I knew already, but the way you put it is interesting.

I'm trying to support DirectX 9 (via SlimDX) and up; I have a factory pattern that should be able to do that with minimal overhead. DirectX 9, at least, doesn't seem very parallel in its draw calls: as I understand it, it glitches if you make a draw call from a thread other than the one the device was created on. That seems to be one of the main problems I'll have to overcome if I want to make my engine parallel.

#4 Ravyne   GDNet+   -  Reputation: 8187

Posted 20 June 2012 - 11:11 AM

Synchronization is a hard problem, especially in games where so many interactions happen all the time -- for example, a bullet which is a particle run by the physics engine collides with a wall that's owned by the physics system, and the collision fires events that spawn sparks (particles), places a bullet-hole decal (graphics), a ricochet sound effect (audio), and might also cause physics or entity updates if the wall is actually damaged.

There definitely are some broader systems that can be coarsely decoupled if you do it carefully, but I think that most games today are exploiting parallelism using task-based job-queues -- in other words, certain work that can be accomplished in a largely independent fashion is made into a nice little package, and there's a queue that drops these tasks onto the available cores based on priority and dependencies. Streaming resources from disk is a good example of a task that can be handled this way.
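
A task queue driven by priority and dependencies could be sketched as follows. This is an illustration, not how any particular engine does it: `JobGraph` is a hypothetical class, a lower number means higher priority, and the sketch assumes the dependency graph is acyclic, every named dependency exists, and jobs do not raise.

```python
import heapq
import threading
from typing import Callable, Dict, List, Sequence, Tuple


class JobGraph:
    """Jobs with priorities and dependencies, run by a small pool of threads."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Tuple[int, Callable[[], None], List[str]]] = {}

    def add(self, name: str, fn: Callable[[], None],
            priority: int = 0, after: Sequence[str] = ()) -> None:
        self._jobs[name] = (priority, fn, list(after))

    def run(self, workers: int = 4) -> None:
        cond = threading.Condition()
        unmet = {name: set(job[2]) for name, job in self._jobs.items()}
        ready = [(self._jobs[n][0], n) for n, deps in unmet.items() if not deps]
        heapq.heapify(ready)          # min-heap: lowest priority number first
        remaining = len(self._jobs)   # jobs not yet finished

        def worker() -> None:
            nonlocal remaining
            while True:
                with cond:
                    while not ready and remaining > 0:
                        cond.wait()   # jobs are in flight; wait for one to finish
                    if remaining == 0:
                        return
                    _, name = heapq.heappop(ready)
                self._jobs[name][1]()  # run the job outside the lock
                with cond:
                    remaining -= 1
                    for other, deps in unmet.items():
                        if name in deps:
                            deps.remove(name)
                            if not deps:  # last dependency satisfied
                                heapq.heappush(ready, (self._jobs[other][0], other))
                    cond.notify_all()

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

With a single worker the order is fully determined by priority and dependencies; with several workers only the dependency order is guaranteed.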

You can also segment and duplicate data that represents dependencies to decrease coupling. For example, if you store your collision world in a spatial structure (an octree or what-have-you) and your particle system needs that data, it doesn't necessarily need the whole world, just the nodes (or even just the primitives) that the particles might intersect. The more you can eliminate, or at least decouple, communication (especially two-way or circular communication), the better your chances of exploiting persistent parallelism.
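
To make the "copy only what the particles might touch" idea concrete, here is a small sketch. It uses a uniform 2D grid as a stand-in for the octree, and `CELL_SIZE`, `build_grid`, and `snapshot_for` are all hypothetical names chosen for the example.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Box = Tuple[float, float, float, float]  # min_x, min_y, max_x, max_y

CELL_SIZE = 10.0  # hypothetical grid resolution


def cells_overlapping(box: Box) -> List[Cell]:
    """Grid cells touched by an axis-aligned box."""
    min_x, min_y, max_x, max_y = box
    x0, x1 = int(min_x // CELL_SIZE), int(max_x // CELL_SIZE)
    y0, y1 = int(min_y // CELL_SIZE), int(max_y // CELL_SIZE)
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


def build_grid(primitives: List[Box]) -> Dict[Cell, List[Box]]:
    """Collision world owned by the physics system."""
    grid: Dict[Cell, List[Box]] = {}
    for prim in primitives:
        for cell in cells_overlapping(prim):
            grid.setdefault(cell, []).append(prim)
    return grid


def snapshot_for(grid: Dict[Cell, List[Box]], region: Box) -> List[Box]:
    """Private list of only the primitives in cells the region touches."""
    seen = set()
    out: List[Box] = []
    for cell in cells_overlapping(region):
        for prim in grid.get(cell, ()):
            if prim not in seen:   # a primitive can span several cells
                seen.add(prim)
                out.append(prim)
    return out
```

The particle system can then work on its snapshot without holding any lock on the collision world.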

#5 Narf the Mouse   Members   -  Reputation: 318

Posted 20 June 2012 - 12:24 PM

That's two votes for executing tasks in parallel, which is also the way I'd prefer to do it.

Interesting stuff. Keep it coming. :)

#6 M6dEEp   Members   -  Reputation: 898

Posted 20 June 2012 - 03:55 PM

http://www.gameenginegems.net/geg1.php

Chapter 21 talks about multi threaded object models

http://www.gameenginegems.net/geg2.php

Chapter 29 talks about thread communication techniques

Also, the books are just awesome in general and are worth picking up if you are working on your own custom engine. Just thought I'd point them out :D

#7 Narf the Mouse   Members   -  Reputation: 318

Posted 20 June 2012 - 04:39 PM

Downside: I was completely wrong about the DirectX 9 device not being able to receive commands from multiple threads.
Upside: I was completely wrong about the DirectX 9 device not being able to receive commands from multiple threads.
Downside: In my testbed, no matter how lightly or heavily the CPU is loaded, single-threaded drawing is faster.

Thanks; unfinances are one reason I haven't picked up the Game Engine Gems books.

#8 phantom   Moderators   -  Reputation: 7596

Posted 20 June 2012 - 05:04 PM

If you are just throwing commands at the D3D9 device from multiple threads then yes, single-threaded is going to end up faster, as the driver/runtime internally has to do a lot of state maintenance and locking; you effectively go single-threaded but with more overhead.

The correct way to do this is to use multiple threads/tasks to assemble a sorted, ordered command queue using your own structures - this command queue is then processed by a main thread/task while everything else gets on with setting up for the next frame. In this case 'sorted and ordered' means doing state sorting and any other work to ensure your rendering does as few state changes as it can.

The engine used for Civ5 (LORE) uses this method on its DX9 path, and they saw a speed-up from better cache behavior: they were quickly re-hitting DX code rather than doing 'a bit of their code' then 'a bit of DX code', which involves jumping around memory and losing data from the cache. You might not see the same speed-up in .NET as they did in C++, but you should still see some improvement if you do it right.

In short:
- multiple threads/tasks to setup a command queue
- one thread to process this queue
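
The two points above can be sketched as follows. This is an illustration under assumptions, not LORE's actual design: `DrawCommand`, `build_frame`, and `submit` are hypothetical names, the scene is a list of plain dicts, and `device_draw` stands in for whatever call actually talks to the graphics device.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class DrawCommand:
    shader: int    # state we want to change as rarely as possible
    texture: int
    mesh: str


def build_commands(objects: Sequence[dict]) -> List[DrawCommand]:
    """Worker side: turn a slice of the scene into commands. No device access here."""
    return [DrawCommand(o["shader"], o["texture"], o["mesh"]) for o in objects]


def build_frame(scene: Sequence[dict], workers: int = 4) -> List[DrawCommand]:
    """Build commands on several threads, then merge and state-sort them."""
    chunk = max(1, (len(scene) + workers - 1) // workers)
    slices = [scene[i:i + chunk] for i in range(0, len(scene), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_worker = list(pool.map(build_commands, slices))
    merged = [cmd for cmds in per_worker for cmd in cmds]
    merged.sort(key=lambda c: (c.shader, c.texture))  # group identical state together
    return merged


def submit(commands: Sequence[DrawCommand],
           device_draw: Callable[[DrawCommand], None]) -> int:
    """Main thread only: replay the queue and count the state changes it needed."""
    changes = 0
    current = None
    for cmd in commands:
        state = (cmd.shader, cmd.texture)
        if state != current:
            changes += 1
            current = state
        device_draw(cmd)
    return changes
```

While the main thread is inside `submit`, the worker threads are free to start building the next frame's commands.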

#9 Narf the Mouse   Members   -  Reputation: 318

Posted 20 June 2012 - 05:09 PM

How would you suggest setting up the command-queuing system?

Edit: To clarify:

1) How would you implement getting commands for the queue? For example, should every function which interacts with the device return a Command object?
2) How would you process the commands? For example, drop them off in a central point?

Edited by Narf the Mouse, 20 June 2012 - 05:19 PM.


#10 Wyrframe   Members   -  Reputation: 733

Posted 21 June 2012 - 09:06 AM

Assuming you're talking about drawing commands, see: http://realtimecollisiondetection.net/blog/?p=86
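
One common way to make draw commands cheap to sort is to pack the sort criteria into a single integer key. The sketch below shows the idea only; the fields and bit widths are assumptions made for illustration and are not taken from the linked post.

```python
# Field widths are illustrative assumptions, not a recommended layout.
LAYER_BITS = 4
TRANSLUCENT_BITS = 1
DEPTH_BITS = 24
MATERIAL_BITS = 16


def _pack(key: int, value: int, bits: int) -> int:
    """Append `value` to the low end of `key`, checking that it fits in `bits`."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"value {value} does not fit in {bits} bits")
    return (key << bits) | value


def make_sort_key(layer: int, translucent: bool, depth: int, material: int) -> int:
    """Most significant field first, so a plain integer sort orders by
    layer, then translucency, then depth, then material."""
    key = _pack(0, layer, LAYER_BITS)
    key = _pack(key, int(translucent), TRANSLUCENT_BITS)
    key = _pack(key, depth, DEPTH_BITS)
    key = _pack(key, material, MATERIAL_BITS)
    return key
```

Sorting a list of `(key, command)` pairs by the key then yields the submission order without any per-field comparisons.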


#11 Narf the Mouse   Members   -  Reputation: 318

Posted 21 June 2012 - 09:42 AM


Thanks; that does tell me about sorting commands. But that's not quite what I asked.

Although, my question will probably be answered when I get to that section of Design Patterns. :)

#12 dougbinks   Members   -  Reputation: 492

Posted 21 June 2012 - 01:31 PM

I would take a look at this sample from Intel for some help, along with the supporting articles: http://software.intel.com/en-us/articles/vcsource-samples-tasking-update/

Full disclosure: although I'm a games industry veteran I also worked for Intel for a few years helping developers make their code more parallel amongst other things, so I'm likely a bit biased towards their solutions. Having said that, these techniques are in several shipping games. The tasking (also often called Jobs) methodology is pretty much universally used in games these days.

The linked article has a list of useful references. Although the code uses a DirectX 11 interface, the overall architecture is suitable for OpenGL and DirectX 9 / 10 if you batch rendering up for execution in the main thread. See http://beautifulpixels.blogspot.ch/2008/07/parallel-rendering-with-directx-command.html for some insight (and code!) about one way to do that.

#13 Narf the Mouse   Members   -  Reputation: 318

Posted 21 June 2012 - 02:29 PM

"ContactDialog.rc(10): fatal error RC1015: cannot open include file 'afxres.h'."

"Don't have permission to modify solution file." - Permissions clearly show full control.

#14 dougbinks   Members   -  Reputation: 492

Posted 22 June 2012 - 03:14 AM

It works for me when I compile the solution:

TaskingUpdate_source\TaskingUpdate\DX11MultiThreadedAnimation\TaskingGameEngine\DX11MultiThreadedAnimation\DX11MultiThreadedAnimation_2010.sln

Which solution did you try, and what version of the compiler (VS full, or express) are you using?

#15 Shannon Barber   Moderators   -  Reputation: 1390

Posted 22 June 2012 - 06:43 AM

I would start by looking into OpenCL.
At a minimum get an idea of how they designed things.

If you survey the field, most of the effort in parallelization is centered around "big problems" that can tolerate significant overhead because the gains from parallel processing are massive. When you start trying to parallelize ~1 ms tasks, the overhead probably cannot be ignored.

The first thing I did was separate the Windows message-pump thread from the game engine.
The primary thread runs the message pump and the game runs on a second thread. In windowed mode I use a 'monitor' synchronization mechanism to lock the game thread while the window is resized. (I use OpenGL, and switching the OpenGL context between threads, even just unlocking and locking it, is expensive, so you can't do it every frame.)
This has several nice effects; for example, it lets me continue rendering while the window is being dragged.
There are things you can do to lock this down and get decent behavior without a second thread for the game, but for tools (which need a GUI) this was a nice touch.
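
The split between the message-pump thread and the game thread might be sketched like this. It is a toy model under assumptions: `GameLoop` is a hypothetical class, a plain lock plays the role of the 'monitor', and the frame body and the short sleep are stand-ins for real update, render, and vsync work.

```python
import threading
import time


class GameLoop:
    """Game thread that a message-pump thread can hold off during a resize."""

    def __init__(self) -> None:
        self._frame_lock = threading.Lock()   # plays the role of the 'monitor'
        self._running = threading.Event()
        self._size = (800, 600)
        self.frames_rendered = 0
        self.last_rendered_size = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._running.set()
        self._thread.start()

    def stop(self) -> None:
        self._running.clear()
        self._thread.join()

    def _run(self) -> None:
        while self._running.is_set():
            with self._frame_lock:            # blocks while a resize holds the lock
                self.last_rendered_size = self._size  # stand-in for update + render
                self.frames_rendered += 1
            time.sleep(0.001)                 # stand-in for waiting on vsync

    def resize(self, width: int, height: int) -> None:
        """Called from the message-pump thread."""
        with self._frame_lock:
            self._size = (width, height)
```

Because the lock is released between frames, dragging or other pump activity that does not call `resize` never stalls rendering.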

Any time you are chasing down a tree, where the recursion branches two or more ways, it can be parallelized. Examples: scene culling, physics/collisions, and boid effects.

The physics part is where I think you get the most gain from parallelization. The collision detection and the resultant forces/movement pass can be parallelized. You need a sync point after each pass; then you can submit a new batch of tasks for the next pass. I submit a job for the root of the tree and let it recurse and spawn more jobs. This is wasteful at the top of the tree but pays off at the bottom. Figuring out the optimal place to spawn new jobs is... difficult.

Each time the scene culler determines it needs to chase down two (or more) branches each branch can be submitted as a culling job to the thread-pool.
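
A culler that submits each branch as its own job could be sketched as below. `Node`, its `visible` flag (a stand-in for a real frustum test), and `cull_parallel` are hypothetical. Jobs never wait on their children, which avoids starving the pool; a shared counter of outstanding jobs provides the sync point instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    visible: bool                       # stand-in for a frustum test on the node's bounds
    children: List["Node"] = field(default_factory=list)


def cull_parallel(root: Node, workers: int = 4) -> List[str]:
    """Return the names of visible nodes; culled subtrees are never visited."""
    visible: List[str] = []
    lock = threading.Lock()
    pending = 1                         # jobs submitted but not yet finished
    done = threading.Event()

    with ThreadPoolExecutor(max_workers=workers) as pool:

        def cull(node: Node) -> None:
            nonlocal pending
            if node.visible:
                with lock:
                    visible.append(node.name)
                    pending += len(node.children)   # count children before submitting
                for child in node.children:
                    pool.submit(cull, child)        # each branch becomes its own job
            with lock:
                pending -= 1
                if pending == 0:
                    done.set()

        pool.submit(cull, root)
        done.wait()                     # sync point: the whole tree has been processed
    return sorted(visible)
```

As noted above, spawning one job per node is wasteful for small subtrees; a real culler would likely switch to plain recursion below some depth.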


The thread pool is owned by my "core", which also owns each subsystem, so you could submit a scene-culling task and a collision task and they would process in parallel, but I don't do this. I want the scene culling to finish so I can start submitting geometry to OpenGL and get the GPU working, then start calculating physics. The order is: kick off scene culling; sync point, waiting for culling to finish; reduce the thread-pool size by 1 (OGL needs to run on the core thread; this might be less awkward with D3D); kick off physics; run opaque OGL shaders on the culled geometry; once that is complete, bump the thread-pool size back up; then sync that physics pass (physics passes repeat for a while).
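
The frame ordering described above can be sketched roughly as follows. `run_frame` and its parameters are hypothetical, and the step of shrinking and regrowing the thread pool is not modeled; the sketch only shows the two sync points and the overlap of rendering with physics.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_frame(pool: ThreadPoolExecutor,
              cull_jobs: Sequence[Callable[[], List[str]]],
              physics_jobs: Sequence[Callable[[], int]],
              render_opaque: Callable[[List[str]], None]) -> List[int]:
    # 1. Kick off scene culling, then wait for all of it (first sync point).
    cull_futures = [pool.submit(job) for job in cull_jobs]
    visible: List[str] = []
    for fut in cull_futures:
        visible.extend(fut.result())

    # 2. Kick off the physics pass on the pool...
    physics_futures = [pool.submit(job) for job in physics_jobs]

    # 3. ...while this (core) thread submits the culled geometry for rendering.
    render_opaque(visible)

    # 4. Second sync point: wait for the physics pass to finish.
    return [fut.result() for fut in physics_futures]
```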

Now I can run the opaque shader of the dynamic objects, then all mirrors, then all translucents.
I have my graphics engine set up to accept 'shader fragments'. It hash-sorts them by shader priority (and then by shader), and the priority determines the order in which things are shaded (the order submitted for rendering). The priorities are set to minimize state changes in the OGL pipeline.

If I did it again I think I would try adding an affinity mask to Jobs so I could force a job to execute on a particular thread. Then OGL & D3D rendering would work the same way through the thread-pool and I wouldn't have 'special steps' in the core process.
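
One way the affinity idea might look is sketched below. It is a simplification under assumptions: `AffinityPool` is a hypothetical class, and a job is pinned to a single worker index rather than carrying a full mask of allowed threads. Each job receives the index of the worker that ran it.

```python
import threading
from collections import deque
from typing import Callable, Optional

ANY_THREAD = None  # no affinity: any worker may take the job


class AffinityPool:
    """Thread pool whose jobs may be pinned to one particular worker."""

    def __init__(self, workers: int) -> None:
        self._cond = threading.Condition()
        self._shared: deque = deque()
        self._pinned = [deque() for _ in range(workers)]
        self._closing = False
        self._threads = [threading.Thread(target=self._work, args=(i,))
                         for i in range(workers)]
        for t in self._threads:
            t.start()

    def submit(self, job: Callable[[int], None],
               affinity: Optional[int] = ANY_THREAD) -> None:
        with self._cond:
            target = self._shared if affinity is ANY_THREAD else self._pinned[affinity]
            target.append(job)
            self._cond.notify_all()

    def _work(self, index: int) -> None:
        mine = self._pinned[index]
        while True:
            with self._cond:
                while not mine and not self._shared and not self._closing:
                    self._cond.wait()
                if mine:                 # pinned jobs take precedence
                    job = mine.popleft()
                elif self._shared:
                    job = self._shared.popleft()
                else:
                    return               # closing, and nothing left this worker may run
            job(index)

    def close(self) -> None:
        """Let workers drain the queues, then join them."""
        with self._cond:
            self._closing = True
            self._cond.notify_all()
        for t in self._threads:
            t.join()
```

With something like this, rendering jobs could all be pinned to the worker that owns the graphics context, so rendering goes through the same pool as everything else.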

Edited by Shannon Barber, 22 June 2012 - 07:18 AM.

#16 Narf the Mouse   Members   -  Reputation: 318

Posted 22 June 2012 - 02:56 PM

I used Visual C++ Express 2010 and Visual C++ Express 10 (from the options) - Probably the same thing.

@Shannon Barber: Thanks; that gives me some more ideas for my engine.


This thread details a specific problem I'm having with parallel code:

http://www.gamedev.n...-parallel-code/

Edited by Narf the Mouse, 22 June 2012 - 02:56 PM.


#17 M6dEEp   Members   -  Reputation: 898

Posted 22 June 2012 - 03:48 PM

Last night this thread piqued my interest in the topic, so I went digging in my Game Engine Gems books for some info and found some really relevant stuff. The presentation linked below was cited at the end of the Camera Centric Engine Design for Multithreaded Rendering chapter. It is a very good intro to the command queuing stuff. Also, I didn't know this, but Civ 5's engine LORE stands for Low Overhead Rendering Engine, and there is a presentation floating around that details how they implemented this technique for DX 9 and DX 11. When I get on my desktop I can post the link to it for your viewing pleasure.

SIGGRAPH presentation: http://developer.amd...ingSiggraph.pdf and Source
Firaxis LORE presentation

#18 dougbinks   Members   -  Reputation: 492

Posted 25 June 2012 - 10:12 AM

@Narf the Mouse - tried to reproduce your problem with Visual C++ Express, and couldn't, so I'm not sure how to help.

FYI Civ 5 was multithreaded with the help of the original author of the Multithreaded Animation sample I mentioned, you can see him and Dan Baker chat about their work here:

There's also an article on this here: http://software.intel.com/en-us/articles/sid-meiers-civilization-v-finds-the-graphics-sweet-spot/

#19 Narf the Mouse   Members   -  Reputation: 318

Posted 26 June 2012 - 12:30 PM

Thanks, everyone. I've looked through/bookmarked the resources; unfortunately, Intel vTune is a bit out of my price range right now, especially if I'm going to get the ANTS Profiler.

However, your help has already allowed me some significant parallelization.

#20 dougbinks   Members   -  Reputation: 492

Posted 26 June 2012 - 12:58 PM

You might want to look into Intel GPA and its CPU profiling capabilities (which require some code mark-up, but that's usually fairly straightforward); it's free of charge. AMD CodeAnalyst is also free, and has some great capabilities these days http://developer.amd...es/default.aspx. Although some of its abilities only work on AMD systems, I've used the timer profiling on Intel.

If you have Game Programming Gems 3 or Best of Game Programming Gems, the Real-Time Hierarchical Profiling article is very decent, though it needs some changes to track multiple threads. It can be combined with the mark-up you need for Intel GPA so you can get offline and real-time viewing of the data.
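
For a sense of what per-thread hierarchical profiling involves, here is a small sketch. It is not the article's implementation: `HierarchicalProfiler` and its `scope` method are hypothetical, and the only idea shown is keeping one stack of open scopes per thread so that nested timings from different threads do not mix.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class HierarchicalProfiler:
    """Accumulates time per (thread name, 'outer/inner') scope path."""

    def __init__(self) -> None:
        self._local = threading.local()   # each thread gets its own scope stack
        self._lock = threading.Lock()
        self.totals = defaultdict(float)  # (thread name, path) -> seconds
        self.calls = defaultdict(int)     # (thread name, path) -> call count

    @contextmanager
    def scope(self, name: str):
        stack = getattr(self._local, "stack", None)
        if stack is None:
            stack = self._local.stack = []
        stack.append(name)
        path = "/".join(stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            stack.pop()
            key = (threading.current_thread().name, path)
            with self._lock:
                self.totals[key] += elapsed
                self.calls[key] += 1
```

Usage is a nested `with prof.scope("frame"):` then `with prof.scope("physics"):`, which records the inner time under the path `frame/physics` for the calling thread.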

XPerf is also free and great for a wide range of profiling functions, though somewhat complex. See http://www.altdevblogaday.com/2012/06/20/wpaxperf-trace-analysis-reimagined/ and linked articles.

Edited by dougbinks, 26 June 2012 - 12:59 PM.




