what good are cores?


can anyone give me an example of a real world game that uses multiple cores for mission critical code, where multi-threading was obviously the best way to do it?

multi-threading just for multi-threading's sake doesn't count, and neither does using additional threads for non-mission critical code like smoke particle systems.

i'm talking about true parallel processing of the basic tasks of render, input, and update.

ideally, each basic task would have one or more processors dedicated to it (how about one processor per entity? <g>). update would update some drawing data for render from time to time and set a flag that it had posted new data for exchange (to be passed to render). same idea with input passing info to update, or perhaps handling the input directly. render and audio would just hum along, checking for new data to display or play.

but are we at that point yet where it really would be better from a performance standpoint? forget about the additional implementation costs of multi-threading for a moment - which are non-trivial.

more importantly, do we even need the gains it could bring? right now the typical bottleneck is one processor feeding the GPU (and memory access of course). we'd still have one core feeding the gpu - perhaps with data from multiple cores.

so, what good are cores (at this point in time)? can they do anything truly useful? or just more BS bling bling chrome on high end machines?

parallelism only lends itself to processes that are inherently parallel in nature. it seems that very little of the basic tasks in a game are parallel in nature. update in parallel and dealing with collisions wouldn't be easy. in render, only one CPU can talk to the GPU at a time, so it's not really parallel either. aspects of scene composition maybe, but it seems that in both render and movement you must at some point merge back to a single chokepoint thread, to marshal all your parallel results together and apply them.


Norm Barrows

Rockland Software Productions

"Building PC games since 1989"

rocklandsoftware.net

PLAY CAVEMAN NOW!

http://rocklandsoftware.net/beta.php


Every game I've worked on for current-gen and previous-gen consoles (and PC in the past 5-10 years) has used a job system at its core. None of these games could have shipped without it -- they would've been over the CPU budget and unable to hit 30Hz / 60Hz.

This model doesn't associate particular systems with particular threads. Instead you break the processing of every system down into small tasks (jobs), and let any thread process any job. Physics can then use N threads, Rendering can use N threads, AI can use N threads -- so you make full use of a Cell CPU with a single dual-threaded PowerPC plus 6 SPUs (8 threads, two instruction sets!), a tri-core dual-threaded PowerPC (6 threads), a dual quad-core AMD CPU-pair (8 threads), or a hyper-threaded hexa-core PC (12 threads)...
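
As a rough illustration of that model (a minimal sketch, not any particular engine's implementation): one always-running worker thread per core, all pulling jobs from a shared queue. Real engines use lock-free or work-stealing queues; a mutex keeps this example short.


#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem {
public:
    JobSystem() {
        // One always-running worker per hardware thread.
        unsigned n = std::thread::hardware_concurrency();
        for( unsigned i = 0; i < n; ++i )
            workers.emplace_back( [this]{ WorkerLoop(); } );
    }
    ~JobSystem() {
        { std::lock_guard<std::mutex> lock(mutex); quit = true; }
        wake.notify_all();
        for( auto& t : workers ) t.join();
    }
    void Push( std::function<void()> job ) {
        { std::lock_guard<std::mutex> lock(mutex); jobs.push(std::move(job)); }
        wake.notify_one();
    }
private:
    void WorkerLoop() {
        for(;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex);
                wake.wait( lock, [this]{ return quit || !jobs.empty(); } );
                if( quit && jobs.empty() ) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job(); // any worker runs any system's job: physics, AI, rendering...
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex mutex;
    std::condition_variable wake;
    bool quit = false;
};

Physics, AI and rendering then each chop their frame's work into many small Push() calls, rather than each owning a whole thread.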

Yep, this has been much more important for console devs, as consoles have shitty CPUs! But there's also a shitload more performance in a modern PC just going to waste if you're still writing single-threaded games.

See:
http://fabiensanglard.net/doom3_bfg/threading.php
https://blog.molecular-matters.com/2015/08/24/job-system-2-0-lock-free-work-stealing-part-1-basics/

Some recent presentations:

The Last of Us: Remastered:
http://www.gdcvault.com/play/1022186/Parallelizing-the-Naughty-Dog-Engine
http://www.benicourt.com/blender/wp-content/uploads/2015/03/parallelizing_the_naughty_dog_engine_using_fibers.pdf

Destiny:
http://advances.realtimerendering.com/destiny/gdc_2015/

it seems that very little of the basic tasks in a game are parallel in nature.

That's because you haven't practiced it yet. Pretty much everything in your game is capable of using multiple cores!
If you want an example, try a functional language like Erlang for a while, and see how data dependencies can be broken down at quite a fine-grained level to expose parallelism without even trying.

ideally, each basic task would have one or more processors dedicated to it (how about one processor per entity? <g>). update would update some drawing data for render from time to time and set a flag that it had posted new data for exchange (to be passed to render). same idea with input passing info to update, or perhaps handling the input directly. render and audio would just hum along, checking for new data to display or play.

This is one of the early models that people used when the Xbox360/PS3 were thrown at us, with their shittily-performing-yet-numerous cores.
In my experience, it's largely died off in favor of job systems. The dedicated thread-per-system with message passing model does still see some use in specialized areas, such as high-frequency input handling (e.g. a 500Hz thread that polls input devices), audio output, filesystem interactions, or network interactions... but for gameplay / simulation / rendering it's not popular any more.
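
For contrast, a minimal sketch of that dedicated-thread model (all names hypothetical): an input thread polling at ~500Hz and handing events to the update thread through a locked queue.


#include <atomic>
#include <chrono>
#include <mutex>
#include <queue>
#include <thread>

struct InputEvent { int device, code; float value; };

static std::mutex g_inputMutex;
static std::queue<InputEvent> g_inputEvents;   // input thread -> update thread
static std::atomic<bool> g_running( true );

InputEvent PollDevice() { return InputEvent{0, 0, 0.0f}; } // stub: real code reads the hardware

void InputThread() { // runs at ~500Hz, far faster than the frame rate
    while( g_running ) {
        InputEvent e = PollDevice();
        { std::lock_guard<std::mutex> lock(g_inputMutex); g_inputEvents.push(e); }
        std::this_thread::sleep_for( std::chrono::milliseconds(2) );
    }
}

void DrainInput() { // called once per frame by the update thread
    std::lock_guard<std::mutex> lock(g_inputMutex);
    while( !g_inputEvents.empty() ) {
        InputEvent e = g_inputEvents.front(); g_inputEvents.pop();
        // ...apply e to the simulation...
    }
}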

in render, only one CPU can talk to the GPU at a time

Actually calling D3D/GL functions is only one small part of the renderer -- scene traversal, sorting, etc. can all be done across multiple cores. Moreover, in D3D11, you can perform resource management tasks on any thread, and create multiple command buffers, to record draw/state commands on many threads (before a single "main" thread submits those command buffers to the GPU). D3D12/Vulkan head further in this direction.

but it seems that in both render and movement you must at some point merge back to a single chokepoint thread, to marshal all your parallel results together and apply them.

With a job system, you're "going wide" and then having dependent tasks wait on those results, constantly.
Note that to do this you don't use mutexes/locks/semaphores/etc. very much (or, god forbid, volatile) -- the idea you should be using to schedule access to data is a directed acyclic graph (DAG) of jobs, something akin to dataflow programming / stream processing / flow-based programming.
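
As a tiny sketch of that go-wide-then-join idea (hypothetical; PushJob stands in for whatever schedules a job onto any worker): an atomic counter forms the dependency edge in the DAG, and the last parallel job to finish schedules the dependent one.


#include <atomic>
#include <functional>
#include <memory>

void PushJob( std::function<void()> job ); // assumed: schedule on any worker thread

// Run parallelWork(0..n-1) across all cores, then run dependentWork once they
// have all finished -- no mutex around the join, just an atomic countdown.
void GoWideThenJoin( int n, std::function<void(int)> parallelWork,
                     std::function<void()> dependentWork ) {
    auto remaining = std::make_shared<std::atomic<int>>( n );
    for( int i = 0; i < n; ++i ) {
        PushJob( [=] {
            parallelWork( i );                  // the "wide" part
            if( remaining->fetch_sub(1) == 1 )  // last one through...
                PushJob( dependentWork );       // ...unblocks the merge job
        } );
    }
}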

Some middleware has started supporting this model too. PhysX allows you to plug your engine's job system into its work dispatcher, so that it will automatically spread all of its physics calculations across all of your threads.

The ISPC language supports a basic task model with the launch and sync keywords, which can likewise be hooked into your engine's job system:
e.g. instead of a single-threaded loop:


void DoStuff( uniform uint i );
...
for( uniform uint i=0; i!=numTasks; ++i )
  DoStuff(i);

You can easily write a many-core parallel version as:


void DoStuff( uniform uint i );
inline task void DoStuff_Task() { DoStuff(taskIndex); }//wrapper, pass magic taskIndex as i
...
launch[numTasks]  DoStuff_Task();//go wide, running across up to numTasks threads
sync;//wait for the tasks to finish

Note that the above doesn't actually create any new threads. In my engine, one thread per core is created at startup, and those threads are always running, waiting for jobs to enter their queues. The above launch statement will push jobs into those queues and wake up any idle cores to get to work. That sync statement will block the ispc code until the launched jobs have all completed -- the thread that was running the ispc code will also be hijacked by the job system, and will execute job code until that point in time!
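
A sketch of what that hijacking can look like under the hood (hypothetical; TryRunOneJob stands in for popping one job off the queue and running it): while waiting, the thread keeps executing queued jobs instead of sleeping.


#include <atomic>
#include <memory>
#include <thread>

bool TryRunOneJob(); // assumed: run one queued job; returns false if queue is empty

// A "helping" wait: instead of blocking until the counter hits zero, the
// waiting thread drains jobs itself, so no core sits idle during a sync.
void WaitForJobs( const std::shared_ptr<std::atomic<int>>& jobsRemaining ) {
    while( jobsRemaining->load() != 0 ) {
        if( !TryRunOneJob() )
            std::this_thread::yield(); // nothing queued right now; let others run
    }
}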

There are also libraries such as TBB, and compiler extensions such as OpenMP, that add these kinds of constructs to C++ -- or you can write them yourself easily enough :wink:
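
For instance, a rough OpenMP equivalent of the launch/sync pair above (a sketch; DoStuff is the same hypothetical per-task function, and you'd compile with -fopenmp or similar):


#include <cstdint>

void DoStuff( uint32_t i ); // same hypothetical per-task function as above

void DoStuffParallel( int numTasks ) {
    #pragma omp parallel for // farms iterations out across a pool of threads
    for( int i = 0; i < numTasks; ++i )
        DoStuff( (uint32_t)i );
} // implicit barrier here: every iteration has completed, like ispc's sync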

Hodgman: Moreover, in D3D11, you can perform resource management tasks on any thread, and create multiple command buffers, to record draw/state commands on many threads (before a single "main" thread submits those command buffers to the GPU). D3D12/Vulkan head further in this direction.

Can you give a link so I can read more? I would love to be able to manage texture resources in a separate thread...

The ID3D11Device interface is thread-safe -- it's got all the routines for creating/destroying resources. If you need Map or UpdateSubresource, you can use them on a secondary ID3D11DeviceContext created from ID3D11Device::CreateDeferredContext and then have the main thread execute the command buffer generated from that context. Alternatively, you can call Map on the main context, pass the pointer to a secondary thread, have it write data into the resource, then have the main thread call Unmap.

so, what good are cores (at this point in time)? can they do anything truly useful? or just more BS bling bling chrome on high end machines?


Honestly, the fact that you think this means you are a good 15 years behind the curve right now - threads have been a pretty big deal for some time and far from 'bling bling'.

update in parallel and dealing with collisions wouldn't be easy


Just because something isn't easy, doesn't mean it isn't worth doing.
Honestly, the fact that you think this means you are a good 15 years behind the curve right now - threads have been a pretty big deal for some time and far from 'bling bling'.

Threads and cores are two different things imho; having hundreds of threads doesn't imply you need many cores.

I believe it's actually pretty hard to usefully use more than 1-2 cores full time.

Hodgman:

I'm currently creating the texture and mapping it on the main thread, then launching another thread that has a pointer to the mapped data and manipulates it there. Then, when that thread tells my program it is done, the main thread unmaps the texture and I can use it in my render.

This method works fine, but there is still a slight dip in fps. If I could do everything in another thread and then just tell the main thread when it can use the texture, there should be no noticeable dip in fps (theoretically). I've spent nearly the last two hours looking up how to do this. The methods I came up with all failed except one, and that one is ugly and produces even more of a dip in fps than my original one. It also causes my screen to blink during the update!

I would love to see an example of how it is done. Even pseudo-code would be great.
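
A minimal sketch of the deferred-context approach Hodgman describes (hypothetical names, error handling omitted; pTexture is assumed to be a D3D11_USAGE_DYNAMIC texture):


#include <d3d11.h>

// Worker thread: record the texture update into a command list via a
// deferred context. WRITE_DISCARD is the only Map allowed on a deferred
// context, and only for DYNAMIC resources.
void WorkerThreadUpdate( ID3D11Device* pDevice, ID3D11Texture2D* pTexture,
                         ID3D11CommandList** ppCmdListOut ) {
    ID3D11DeviceContext* pDeferred = nullptr;
    pDevice->CreateDeferredContext( 0, &pDeferred );

    D3D11_MAPPED_SUBRESOURCE mapped;
    pDeferred->Map( pTexture, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped );
    // ...write texels into mapped.pData, one row at a time, using mapped.RowPitch...
    pDeferred->Unmap( pTexture, 0 );

    pDeferred->FinishCommandList( FALSE, ppCmdListOut ); // record only; nothing executes yet
    pDeferred->Release();
}

// Main thread: once the worker signals completion, submit its command list.
void MainThreadSubmit( ID3D11DeviceContext* pImmediate, ID3D11CommandList* pCmdList ) {
    pImmediate->ExecuteCommandList( pCmdList, FALSE ); // now the update actually happens
    pCmdList->Release();
}

The worker never touches the immediate context, so the only cross-thread hand-off is the finished command list -- which should avoid the fps dip from stalling the main thread inside Map.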

In addition to what Hodgman said... any game that renders anything or plays sounds, any such game, uses several cores. Plus, any game that uses asynchronous file I/O. If you just open a device context, a thread is created. You don't see it, but it's there. It will work off the render commands that you submit or it will mix the sounds that you play. Especially in processing sound, running stuff asynchronously (preferably on another core) is very much real life, and very mission critical. Let the sound card's buffer get empty once, and the user will immediately and inevitably hear it. This must never happen. Process draw calls on the same core, the same thread even? Sure, it's possible. Welcome to 1995. But you're not getting anywhere close to 201x performance.

Threads and cores are two different things imho; having hundreds of threads doesn't imply you need many cores.

I believe it's actually pretty hard to usefully use more than 1-2 cores full time.

I have observed this situation quite often in a preprocessing tool using OpenMP:

Do a very simple thing like building a mip-map level in parallel: speedup is 1.5... very disappointing.

Do a complex thing like ray tracing: speedup is 4... yep - that's the number of cores.

My conclusion is that memory bandwidth limits hurt the mip-map generation.

I assume it would be faster to do mips and tracing at the same time, so the memory bandwidth limit is hidden behind the tracing calculations.

Are there any known approaches where a job system tries to do this automatically with some info like job.m_bandWidthCost?

I've never heard of something like that.
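
For what it's worth, the mip-map case above is the textbook bandwidth-bound loop: a 2x2 box filter does only a few adds per output pixel but reads four input bytes for each, so it saturates memory long before it saturates four cores. A sketch (assuming a single-channel 8-bit image):


#include <cstdint>

// Parallel 2x2 box-filter downsample, as in the mip-map measurement above:
// ~3 adds and a divide per output pixel, but 4 bytes read and 1 written --
// memory-bound, so it scales poorly, unlike ray tracing, which does far
// more math per byte loaded.
void BuildMipLevel( const uint8_t* src, int srcW, int srcH,
                    uint8_t* dst /* (srcW/2) x (srcH/2) */ ) {
    int dstW = srcW / 2, dstH = srcH / 2;
    #pragma omp parallel for
    for( int y = 0; y < dstH; ++y ) {
        for( int x = 0; x < dstW; ++x ) {
            int sum = src[(2*y  )*srcW + 2*x] + src[(2*y  )*srcW + 2*x+1]
                    + src[(2*y+1)*srcW + 2*x] + src[(2*y+1)*srcW + 2*x+1];
            dst[y*dstW + x] = (uint8_t)(sum / 4);
        }
    }
}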
