Designing an efficient multithreading architecture

Started by
14 comments, last by Jason Z 10 years, 4 months ago

I'm designing an engine for my big 3D game project, and I want to make sure everything is very scalable to processors with many cores, without having more threads than necessary spawned at a time.

This is the architecture I'm considering right now in terms of different threads:

  • 1 scheduling thread
    • Runs the main loop
    • Spawns work and I/O jobs
    • Sends draw/compute calls to the GPU
  • 1 I/O thread
    • Blocks on calls to fread and fwrite
    • Spawns work jobs for decoding
  • 1 sound thread
    • Runs from within OpenAL or satisfies SDL audio callbacks
  • n worker threads, where n = ncores - 3
    • Run serial work jobs (embarrassingly parallel jobs will be run on the GPU)

While this makes sense to me for processors like Intel i7's or AMD FX's which generally have more than 4 (logical) cores, for a 4-core processor like an i5, there is only one worker thread.
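The arithmetic behind that concern can be sketched as follows. This is only an illustration: `worker_count` and `detected_worker_count` are hypothetical helper names, the three reserved threads are the scheduler, I/O, and sound threads from the list above, and the fallback of 4 cores is an assumption for when the core count cannot be detected.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: workers left after reserving `dedicated` threads
// (scheduler, I/O, and sound in the plan above). Clamped to at least 1 so
// that low-core machines still get a worker.
unsigned worker_count(unsigned logical_cores, unsigned dedicated = 3) {
    if (logical_cores <= dedicated) return 1;
    return logical_cores - dedicated;
}

// hardware_concurrency() is only a hint and may return 0 when the value
// cannot be determined; the fallback of 4 is an assumption.
unsigned detected_worker_count() {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 4;
    return worker_count(cores);
}
```

With 4 logical cores this gives a single worker, which is the case the question is about; with 8 it gives 5.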

Should the scheduling thread also be able to run work jobs? If so, is it safe enough to have any thread be able to send draw/compute calls to the GPU (using OpenGL 4)?

"So there you have it, ladies and gentlemen: the only API I’ve ever used that requires both elevated privileges and a dedicated user thread just to copy a block of structures from the kernel to the user." - Casey Muratori

boreal.aggydaggy.com


You probably know a lot more than me, but it may be worth pointing out that every time I see someone ask about multithreading, the overwhelming reply is "Don't!" Unless you really know about all the pitfalls it can bring, it can be more trouble than it's worth.

However, maybe you do know your stuff, in which case good luck :)

Oh, I know what I'm getting myself into. I know about data races and the other issues that come up across threads, so I'm going to minimize the number of times that threads need to synchronize, and use locks properly when I need to.

If you have multiple OpenGL contexts (one for each thread, with sharing of objects set up between them) you will be able to make OpenGL calls from several threads simultaneously, but you must ensure yourself that you're not, e.g., updating the same OpenGL object in one thread while rendering with it in another. I also wouldn't be surprised if you discover more driver bugs that way, compared to single-threaded use, but I don't have personal experience of that.

If your jobs are the kind of "update culling", "animate n entities", definitely run them also in the main thread, especially if your frame processing cannot proceed further without completing them first.

If you haven't already, take a look at how Doom 3 BFG handles threads: http://fabiensanglard.net/doom3_bfg/index.php

Do you think it's better to run jobs on the main thread or instead spawn an extra worker thread and let the main thread sleep while there are jobs still pending?

I believe that is best answered by profiling, but each time a thread goes to sleep, it may not wake up as promptly as you'd want due to OS scheduling. That goes for both the main thread and the workers. Therefore my gut feeling is against the extra worker thread.

Creating threads has a cost. Why pay it when you don't have to?

I'd suggest having the number of worker threads be equal to the number of cores, or maybe even double that. (Forum member Frob has said that in his experience, 2 times the number of virtual cores is usually a good way to go.)

The reason you want as many worker threads as cores, even though you have other threads, is that the worker threads probably have a different workload than the task-specific threads (or whatever the word for that is). Those threads may occasionally sleep or block on an I/O signal, and so underuse a core. You want your worker threads to saturate the remaining resources, so it's better to have too many worker threads some of the time than too few at any time. You can, however, give task-specific threads higher priority.

fread/fwrite are blocking wrappers around the OS's internal non-blocking file-system API.

Instead of using a non-blocking API, wrapped in a blocking API, wrapped in a thread to make it non-blocking, you should just use the native OS APIs ;)

You still might want an "IO" thread for running decompression, or alternatively just treat decompression as a regular job for your worker threads.

The threads spawned by your audio middleware probably spend most of their time sleeping, so I wouldn't allocate them an entire core. I'd probably ignore the sound threads when figuring out how many worker threads to spawn.

I personally use my "main thread" as a worker thread too. Whenever any of my threads has nothing to do (e.g. it has to wait for the results of another thread before it can continue), then they make themselves useful by popping jobs from the job queue and doing some work. I basically have a "WaitFor..." busy loop, that continually checks if the condition has been met to exit, else tries to run a job, else after enough tries with no jobs in the queue it yields or sleeps.
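A loop of that shape might look roughly like the sketch below. It is not the poster's actual code: `JobQueue`, `wait_for`, the mutex-protected deque, and the spin thresholds are all assumptions made for illustration, and a real engine would likely use a more elaborate queue.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical minimal job queue guarded by a mutex.
class JobQueue {
public:
    void push(std::function<void()> job) {
        std::lock_guard<std::mutex> lock(mutex_);
        jobs_.push_back(std::move(job));
    }

    // Runs one job if available; returns false if the queue was empty.
    bool try_run_one() {
        std::function<void()> job;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (jobs_.empty()) return false;
            job = std::move(jobs_.front());
            jobs_.pop_front();
        }
        job();  // executed outside the lock
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> jobs_;
};

// Blocks the calling thread until done() returns true, running queued jobs
// in the meantime. After `spin_limit` consecutive empty polls it yields, and
// after four times that many it sleeps briefly (both thresholds are guesses).
template <typename Pred>
void wait_for(JobQueue& queue, Pred done, int spin_limit = 64) {
    int idle_polls = 0;
    while (!done()) {
        if (queue.try_run_one()) {
            idle_polls = 0;  // did useful work, start counting again
        } else if (++idle_polls < spin_limit) {
            // busy-wait: just re-check the condition
        } else if (idle_polls < 4 * spin_limit) {
            std::this_thread::yield();
        } else {
            std::this_thread::sleep_for(std::chrono::microseconds(100));
        }
    }
}
```

Any thread, including the main thread, can call `wait_for` with a predicate such as "this frame's culling results are ready" and will drain jobs instead of idling.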

Regarding the GPU, on PC it's probably still the fastest choice to just have one thread as the dedicated GPU thread. Other threads can perform rendering work -- e.g. doing frustum culling, building render queues, performing state sorting or redundant state-change removal, etc... but only one thread actually draws things.

Multi-threaded drawing is possible, but AFAIK, the drivers do not perform very well at the moment.
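To make the split between rendering work and actual drawing concrete, here is a rough sketch of state sorting and redundant state-change removal. `DrawCommand`, `sort_by_state`, `submit`, and `SubmitStats` are hypothetical names, and the comments only mark where real API calls would go; no OpenGL is called.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical render command; the fields are illustrative only.
struct DrawCommand {
    std::uint32_t shader;
    std::uint32_t texture;
    std::uint32_t mesh;
};

// Safe to run on any worker thread: orders the queue so that commands
// sharing a shader (and then a texture) end up next to each other.
void sort_by_state(std::vector<DrawCommand>& queue) {
    std::sort(queue.begin(), queue.end(),
              [](const DrawCommand& a, const DrawCommand& b) {
                  if (a.shader != b.shader) return a.shader < b.shader;
                  return a.texture < b.texture;
              });
}

struct SubmitStats {
    int shader_binds = 0;
    int texture_binds = 0;
    int draws = 0;
};

// Meant to run only on the dedicated GPU thread. Skips binds whose state is
// already current and reports how many state changes were really needed.
SubmitStats submit(const std::vector<DrawCommand>& queue) {
    SubmitStats stats;
    bool first = true;
    std::uint32_t bound_shader = 0;
    std::uint32_t bound_texture = 0;
    for (const DrawCommand& cmd : queue) {
        if (first || cmd.shader != bound_shader) {
            bound_shader = cmd.shader;    // a real shader bind would go here
            ++stats.shader_binds;
        }
        if (first || cmd.texture != bound_texture) {
            bound_texture = cmd.texture;  // a real texture bind would go here
            ++stats.texture_binds;
        }
        first = false;
        ++stats.draws;                    // a real draw call would go here
    }
    return stats;
}
```

Worker threads can build and sort per-frame queues in parallel, while only the thread that owns the graphics context calls `submit`.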

Regarding embarrassingly parallel jobs -- in order to move these off to the GPU, you also need the consumers of those jobs to be ok with extremely long latencies. It's not possible to get short CPU->GPU->CPU latencies on PC without destroying overall performance.

I would distribute them as follows:

  • Core 1—Several resource-light threads.
    • Sound—Ticks only a few times per second to keep sound buffers filled, requiring very few resources.
    • Input (keyboard/mouse/etc.)—High-priority but mostly sleeping until a button is pressed, requiring very few resources.
    • Network thread—Medium priority but still mostly just waiting for events, which generally amount to at-most 20-per-second in heavy times.
    • 1 low-priority worker thread for background loading etc.—With sound, input, and networking all mostly in a sleep or wait state (and only waking to do very quick tasks before going back to waiting), there is still enough core left for a low-priority worker for any kind of task that is not very time-sensitive.
  • Core 2—Logic.
    • Game thread—Reads queued inputs, performs game logic, performs frustum culling, sorting, and submits render commands.
    • 1 worker thread—The game thread will have the heaviest load when it needs to update game logic, which is typically bound to be as infrequent as possible (and in racing games in which it can be more than 100 times per second, the load is balanced anyway so that not much logic actually takes place) and on the down-time the game thread simply interpolates object positions for re-submission to the render thread. There is typically enough CPU left over for a worker thread. It can either be constantly running at a medium-low priority or it can be the same priority and forced awake when the game thread is waiting and forced to wait when the game thread is awake.
  • Core 3—Rendering and worker scheduling.
    • Render thread—Sends render commands to the GPU.
    • Worker scheduler—Runs when the render thread is waiting, waits when the render thread is awakened. It takes very little time to read over a list of requested tasks and awaken worker threads.
  • Cores 4 to Total-3—Anything else that needs to be done.
    • 1 worker thread—Extra cores used for extra work. File loading, decoding, decompressing, whatever. Can change depending on the game. Each thread is high-priority, but waits until a job is there for it to do.

With this plan you still have multiple workers on 4-core systems, and the main 2 components (rendering and game logic) each basically have their own cores—they share it with a worker thread but that thread leaves them alone while they are active (though this should be handled with care on the game thread, since it doesn’t necessarily have to have down-time).

Also, don’t “spawn” threads, awaken them. They should already exist and just be idling in a waiting state, waiting for an event to set them in motion.

And a “wait” state is not a “sleep” state.
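One common way to get threads that already exist and are woken rather than spawned is a pool whose workers block on a condition variable. The sketch below is an assumption about how that could look, not L. Spiro's implementation; `WorkerPool` and its members are hypothetical names. The workers wait on an event (a notification) instead of sleeping for a fixed time, which is one reading of the wait/sleep distinction above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool: threads are created once and then wait for jobs.
class WorkerPool {
public:
    explicit WorkerPool(unsigned count) {
        for (unsigned i = 0; i < count; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Lets workers finish the remaining jobs, then joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }

    // Queues a job and wakes one waiting worker; no thread is created here.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Blocks until there is a job or the pool is shutting down.
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;  // declared last: started after the rest exists
};
```

The pool would be created once at startup, so the cost of thread creation is paid a single time rather than per job.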

L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid

This topic is closed to new replies.
