Every game I've worked on for current-gen and previous-gen consoles (and PC in the past 5-10 years) has used a job system at its core. None of these games could have shipped without it -- they would've blown the CPU budget and missed 30Hz / 60Hz.
This model doesn't associate particular systems with particular threads. Instead you break the processing of every system down into small tasks (jobs), and let every thread process every job. Physics can then use N threads, Rendering can use N threads, AI can use N threads -- so you make full use of a Cell CPU with a single dual-threaded PowerPC plus 6 SPUs (8 threads, two instruction sets!), a tri-core dual-threaded PowerPC (6 threads), a pair of quad-core AMD modules (8 threads), or a hyper-threaded hexa-core PC (12 threads)...
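The core of the idea can be sketched in a few dozen lines of C++. This is a toy version under obvious simplifications -- the class name is made up, and it uses one mutex-guarded shared queue, where a real engine would use lock-free per-thread queues with work stealing (see the Molecular Matters series linked below):

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy job system: every worker thread pulls from one shared queue, so any
// system (physics, AI, rendering) can go as wide as the core count.
class JobSystem {
public:
    explicit JobSystem(unsigned threadCount) {
        for (unsigned i = 0; i < threadCount; ++i)
            workers.emplace_back([this] { WorkerLoop(); });
    }
    ~JobSystem() {  // drain the queue, then join the workers
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
        for (auto& t : workers) t.join();
    }
    void Push(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m); jobs.push(std::move(job)); }
        cv.notify_one();
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return quit || !jobs.empty(); });
                if (jobs.empty()) return;  // only reached when quitting
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();
        }
    }
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::function<void()>> jobs;
    std::vector<std::thread> workers;
    bool quit = false;
};
```

Physics, rendering, and AI would all `Push` their work into this one pool, rather than each owning a thread.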
Yep, this has been much more important for console devs, as consoles have shitty CPUs! But, there's also a shitload more performance in a modern PC just going to waste if you're still writing single-threaded games.
See:
http://fabiensanglard.net/doom3_bfg/threading.php
https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/
Some recent presentations:
The Last of Us: Remastered:
http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
http://www.benicourt.com/blender/wp-content/uploads/2015/03/parallelizing_the_naughty_dog_engine_using_fibers.pdf
Destiny:
http://advances.realtimerendering.com/destiny/gdc_2015/
it seems that very little of the basic tasks in a game are parallel in nature.
That's because you haven't practiced it yet. Pretty much everything in your game is capable of using multiple cores!
If you want an example, try a functional language like Erlang for a while, and see how data dependencies can be broken down at quite a fine-grained level to expose parallelism without even trying.
ideally, each basic task would have one or more processors dedicated to it (how about one processor per entity? <g>) , update would update some drawing data for render from time to time and set a flag that it had posted new data for exchange (to be passed to render). same idea with input passing info to update, or perhaps handling the input directly. render and audio would just hum along, checking for new data to display or play.
This is one of the early models that people used when the Xbox360/PS3 were thrown at us, with their shittily-performing-yet-numerous-cores.
In my experience, it's largely died off in favor of job systems. The dedicated thread-per-system with message passing model does still see some use in specialized areas, such as high-frequency input handling (e.g. a 500Hz thread that polls input devices), audio output, filesystem interactions, or network interactions... but for gameplay / simulation / rendering it's not popular any more.
in render, only one CPU can talk to the GPU at a time
Actually calling D3D/GL functions is only one small part of the renderer -- scene traversal, sorting, etc. can all be done across multiple cores. Moreover, in D3D11, you can perform resource management tasks on any thread, and create multiple command buffers, to record draw/state commands on many threads (before a single "main" thread submits those command buffers to the GPU). D3D12/Vulkan go even further in this direction.
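In shape, that looks like the following sketch (with the GPU and command buffers mocked as strings -- in real D3D11 you'd record on deferred contexts and submit the resulting command lists on the immediate context):

```cpp
#include <string>
#include <thread>
#include <vector>

// N threads record commands into their own buffers in parallel; one thread
// then submits the buffers to the "GPU" in a fixed order, so the final
// command stream is deterministic even though recording was parallel.
using Command = std::string;
using CommandBuffer = std::vector<Command>;

CommandBuffer RecordSlice(int sliceIndex, int drawsPerSlice) {
    CommandBuffer cb;
    for (int i = 0; i < drawsPerSlice; ++i)
        cb.push_back("draw " + std::to_string(sliceIndex * drawsPerSlice + i));
    return cb;
}

std::vector<Command> RenderFrame(int numThreads, int drawsPerSlice) {
    std::vector<CommandBuffer> buffers(numThreads);
    std::vector<std::thread> recorders;
    for (int t = 0; t < numThreads; ++t)
        recorders.emplace_back([&buffers, t, drawsPerSlice] {
            buffers[t] = RecordSlice(t, drawsPerSlice);  // parallel recording
        });
    for (auto& t : recorders) t.join();

    std::vector<Command> submitted;  // single submission thread preserves order
    for (const auto& cb : buffers)
        submitted.insert(submitted.end(), cb.begin(), cb.end());
    return submitted;
}
```

Only the final submission loop is the single-threaded "talk to the GPU" part; everything before it went wide.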
but it seems that in both render and movement you must at some point merge back to a single chokepoint thread, to marshal all your parallel results together and apply them.
With a job system, you're constantly "going wide" and then having dependent tasks wait on those results.
Note that to do this you don't use mutexes/locks/semaphores/etc very much (or god forbid: volatile) -- the idea that you should be using to schedule access to data is a directed-acyclic-graph of jobs and something akin to dataflow programming / stream processing / flow based programming.
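A bare-bones version of that DAG scheduling idea: each job carries a count of unfinished dependencies, and finishing a job decrements its dependents' counts. It's executed serially here for clarity -- in a real job system every worker thread would be popping from the ready queue, so independent jobs run in parallel, and the counters would be atomics:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Dataflow-style job graph: a job becomes ready when its dependency count
// hits zero. No mutexes around the *data* -- the graph itself guarantees
// that a job only runs after everything it reads from has been written.
struct Job {
    std::function<void()> work;
    int pendingDeps = 0;             // how many jobs must finish first
    std::vector<size_t> dependents;  // jobs unlocked when this one finishes
};

void RunGraph(std::vector<Job>& jobs) {
    std::queue<size_t> ready;
    for (size_t i = 0; i < jobs.size(); ++i)
        if (jobs[i].pendingDeps == 0) ready.push(i);
    while (!ready.empty()) {
        size_t i = ready.front();
        ready.pop();
        jobs[i].work();
        for (size_t d : jobs[i].dependents)
            if (--jobs[d].pendingDeps == 0) ready.push(d);
    }
}
```

Scheduling access to data this way -- rather than sprinkling locks through the systems -- is the heart of the dataflow/stream-processing framing.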
Some middleware has started supporting this model too. PhysX allows you to plug your engine's job system into its work dispatcher, so that it will automatically spread all of its physics calculations across all of your threads.
The ISPC language supports a basic task model via its launch and sync keywords, which can likewise be hooked into your engine's job system:
e.g. instead of a single-threaded loop:
void DoStuff( uniform uint i );
...
for( uniform uint i=0; i!=numTasks; ++i )
DoStuff(i);
You can easily write a many-core parallel version as:
void DoStuff( uniform uint i );
inline task void DoStuff_Task() { DoStuff(taskIndex); }//wrapper, pass magic taskIndex as i
...
launch[numTasks] DoStuff_Task();//go wide, running across up to numTasks threads
sync;//wait for the tasks to finish
Note that the above doesn't actually create any new threads. In my engine, one thread is created per core at startup, and those threads are always running, waiting for jobs to enter their queues. The above launch statement will push jobs into those queues and wake up any idle cores to get to work. That sync statement will block the ispc code until the launched jobs have all completed -- the thread that was running the ispc code will also be hijacked by the job system, and will execute job code until that point in time!
There are also libraries such as TBB, and compiler extensions such as OpenMP, that add these kinds of constructs to C++, or you can write them yourself easily enough :wink:
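For instance, the OpenMP equivalent of the ispc launch/sync pair above is a single pragma -- the loop is split across threads and there's an implicit "sync" at the end of the loop (compile with -fopenmp on GCC/Clang or /openmp on MSVC; without it the pragma is ignored and the loop just runs serially):

```cpp
#include <vector>

// Each iteration is a task; OpenMP distributes them across its thread pool
// and the function doesn't return until every iteration has completed.
void DoStuffParallel(std::vector<int>& results) {
    const int numTasks = static_cast<int>(results.size());
    #pragma omp parallel for
    for (int i = 0; i < numTasks; ++i)
        results[i] = i * i;  // stand-in for DoStuff(i)
}
```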