Multithreading my game engine -- slower than expected performance

5 comments, last by venzon 15 years, 9 months ago
My game engine, vastly simplified, does two things: it simulates the game world, then renders it. (I'm counting input handling and some other subsystems as part of the "simulate" step.) Based on this forum discussion, I decided to try moving the rendering to a second thread. Unfortunately, on my dual-core CPU the multithreaded mode performs worse than the single-threaded mode, and by a fair margin. Let me explain in a little more detail what's going on. Here's what the execution looks like in single-threaded mode:
|---sim---|---sync---|-------render-------|
and repeat. Note: the "sync" represents the time spent sending the simulation updates to the rendering system. Here's what the execution looks like in multithreaded mode:
main thread:   |---sim---|---wait---|---sync---|
render thread: |-------render-------|---wait---|
and repeat. The main thread's simulation step finishes much faster than the rendering, so it has to wait for the render to finish before it can start a sync (sending data to the render system). While the main thread is doing a sync, the render thread has to wait.

Performance data indicates that the single-threaded mode uses 100% of one core, as expected, and I get about 65 FPS. In multithreaded mode, each thread uses a little over 50% of its core, spending the rest of the time waiting as per the diagram above, and I get about 45 FPS.

I was expecting the multithreaded version to do a fair amount of waiting, because right now my simulation of the game world isn't very CPU intensive and my sync step is poorly optimized and takes longer than it should. I expected the multithreaded version to be maybe marginally faster than the single-threaded version, and then, as I added complexity to the sim and sped up the sync, to really start outpacing it. However, at this point the multithreaded version is so much slower that I'm wondering whether I've done something terribly, terribly wrong in the way I architected it.

So, I thought I'd ask the gamedev folks: am I doing something fundamentally wrong with my multithreaded architecture? Here it is again:
main thread:   |---sim---|---wait---|---sync---|
render thread: |-------render-------|---wait---|
I'm implementing the waiting with SDL (libsdl.org) semaphores, on a dual-core 32-bit Linux system. One more data point: if I make the simulation much simpler and reduce the scene to something very simple (which reduces sync and render times), I can get upwards of 400 FPS out of the multithreaded mode and maybe 600 FPS from single-threaded mode. Thanks.
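For reference, the lock-step handshake in the diagram can be sketched with two counting semaphores. This is only an illustration under assumptions: it uses a small hand-rolled `Semaphore` class built on the C++ standard library as a stand-in for the SDL semaphores mentioned in the post, and `run_lockstep`, `render_done`, and `sync_done` are made-up names, not the engine's actual code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (stand-in for the SDL semaphores in the post).
class Semaphore {
public:
    explicit Semaphore(int initial) : count_(initial) {}
    void post() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// Runs `frames` iterations of sim -> wait -> sync on the calling thread while
// a render thread runs render -> wait. Returns the states the renderer "drew".
std::vector<int> run_lockstep(int frames) {
    assert(frames >= 0);
    Semaphore render_done(0);  // render thread -> main: safe to sync now
    Semaphore sync_done(0);    // main -> render thread: new data is ready
    int sim_state = 0;         // touched only by the main thread
    int render_state = 0;      // read while rendering, written during sync
    std::vector<int> drawn;

    std::thread renderer([&] {
        for (int i = 0; i < frames; ++i) {
            drawn.push_back(render_state);  // "render"
            render_done.post();
            sync_done.wait();               // idle while the main thread syncs
        }
    });

    for (int i = 0; i < frames; ++i) {
        ++sim_state;               // "sim"
        render_done.wait();        // idle until the current frame is finished
        render_state = sim_state;  // "sync"
        sync_done.post();
    }
    renderer.join();
    return drawn;
}
```

In this layout each thread is blocked for part of every frame, which matches the roughly 50% core usage reported above: neither side can run while the other holds the shared render data.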
Do any of your steps do anything extra in multithreaded mode that they don't need to do in single-threaded mode (other than the semaphores)?

It doesn't seem like semaphores alone would cause that much penalty, unless you're accidentally setting up a situation where the sim step holds onto a semaphore that the render step wants (or vice versa):

|---sim----|-------wait--------|---sync---|
|---wait---|------render-------|---wait---|

or maybe

|---sync---|------wait---------|---sim----|
|---wait---|-----render--------|---wait---|

I'll try to quickly sketch out the "right" way to do a multithreaded game engine:

Game logic runs continuously, pushing deltas to a message queue after each step.
Renderer runs continuously, grabbing and applying deltas before each frame.

This requires the game logic to use an entirely independent data set, and be completely decoupled from the renderer -- which is as it should be. The game logic will determine how objects in the game world move around, then send any changes to the renderer, which rearranges the scene as necessary.

No waiting! Unless you want to cap ticks per second for whatever reason. The data structure you use for IPC obviously needs to be threadsafe, which can be accomplished by any number of techniques. The lockless queue (see Google) is probably your friend in this case.
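As a rough sketch of the kind of queue meant here, a single-producer/single-consumer ring buffer can be written with two atomic indices and no locks. Everything in it is an assumption for illustration: the `Delta` struct, the `SpscQueue` name, and the fixed capacity are invented, and it is only safe with exactly one pushing thread (game logic) and one popping thread (renderer).

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical delta message: "object `id` moved to (x, y)".
struct Delta {
    int id;
    float x, y;
};

// Single-producer / single-consumer ring buffer. Only the game-logic thread
// may call push(), and only the render thread may call pop().
template <typename T, std::size_t Capacity>
class SpscQueue {
    static_assert(Capacity >= 2, "one slot is always left empty");
public:
    // Returns false when the queue is full; the caller decides what to do.
    bool push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;
        buffer_[tail] = item;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }
    // Returns false when there is nothing to read.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = buffer_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // next slot to read
    std::atomic<std::size_t> tail_{0};  // next slot to write
};
```

The renderer would call `pop` in a loop before each frame until it returns false, applying each `Delta` to its own copy of the scene. A queue declared with capacity N holds at most N - 1 items, because one slot is kept empty to tell "full" apart from "empty".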

For details, take a look at this thread on the OGRE forums, and pay careful attention to what xavier says:
http://www.ogre3d.org/phpBB2/viewtopic.php?t=26496

It eventually gets pretty deep into the implementation details, including preallocation and reuse of message objects.
Quote: Do any of your steps do anything extra in multithreaded mode that they don't need to do in single-threaded mode (other than the semaphores)?


No, even the syncing uses the same code.

Quote: It doesn't seem like semaphores alone would cause that much penalty


I agree. The high performance that I can get with no sim and very little data to sync or render indicates to me that the semaphores themselves probably add little overhead.

I'll look at my code a little more closely this afternoon to make sure I don't have a simple error somewhere that's causing excessive waiting beyond what I showed in the ASCII diagram.
Quote: Original post by venzon

main thread:   |---sim---|---wait---|---sync---|
render thread: |-------render-------|---wait---|


The way I'm implementing the waiting is with SDL (libsdl.org) semaphores. I'm on a dual core linux 32-bit system. Also, just another data point, if I make the simulation much simpler and reduce the scene to something very simple (which reduces sync and render times), I can get upwards of 400 FPS out of the multithreaded mode and maybe 600 FPS from single threaded mode.


Why are you waiting? The execution should look something like this:
main thread:   |-sim1-|-sim2-|-sim3-|-sim4-|-sim5-|-sim6|
render thread: |-------render0-------|-------render3-------|
Simply put, the renderer takes the latest complete simulation step and renders that.

You may need to duplicate the state: one copy that's being simulated and another that's being rendered.

Whether you pass the data between threads or use read-only shared state is a matter of choice.

Of course, it's perfectly possible you have a trivial coding error.
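One way to read the "renderer takes the latest complete simulation step" idea is a small hand-off slot that always holds the newest finished state. The sketch below assumes a mutex held only for the brief hand-off (not a lock-free scheme), and `WorldState`, `LatestState`, `publish`, and `fetch` are hypothetical names chosen for illustration.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical snapshot of everything the renderer needs for one frame.
struct WorldState {
    int step = 0;                  // which simulation step produced this
    std::vector<float> positions;  // placeholder for real scene data
};

// Holds the most recent completed simulation step. The lock is held only for
// the hand-off itself, never for a whole sim step or a whole frame.
class LatestState {
public:
    // Sim thread: hand over a finished step, replacing any unrendered one.
    void publish(WorldState finished) {
        std::lock_guard<std::mutex> lock(m_);
        latest_ = std::move(finished);
        fresh_ = true;
    }
    // Render thread: take the newest step if one arrived since the last call.
    // Returns false (and leaves `out` untouched) when nothing new is waiting.
    bool fetch(WorldState& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (!fresh_) return false;
        out = std::move(latest_);
        fresh_ = false;
        return true;
    }
private:
    std::mutex m_;
    WorldState latest_;
    bool fresh_ = false;
};
```

With this shape, steps that finish while a frame is still being drawn are simply overwritten, so after sim1 through sim3 complete during one render, the next frame draws sim3, as in the diagram. When `fetch` returns false the renderer can redraw its previous state.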
A quick update. After restarting my system I realized that my earlier multithreaded performance data was invalid: a background task was running on one of the cores, which wasn't as idle as I thought. Whoops. Now I see 90% utilization on one core and 30% on the other in multithreaded mode, with 70 FPS. Single-threaded mode still shows high 60s. This is more in line with my expectations, since the rendering at the moment takes much longer than the simulation or the sync. Put pseudo-mathematically, I expect the time per rendered frame with my current architecture to be (assuming Tsim < Trender):

Tsingle = Tsim + Tsync + Trender
Tmulti = Tsync + Trender

so:

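Tsingle - Tmulti = Tsim

In other words, with this architecture the multithreaded mode only saves the (currently small) simulation time per frame.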
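To make the two frame-time expressions concrete, here is a tiny model of them. The timings used in the note after the code are made-up numbers, not measurements from this thread, and using `max()` to cover the Tsim >= Trender case is an assumption that goes beyond what the post states.

```cpp
#include <algorithm>
#include <cassert>

// All times are in milliseconds per frame.
struct StepTimes {
    double sim, sync, render;
};

// Single-threaded: the three steps run back to back.
double frame_time_single(const StepTimes& t) {
    return t.sim + t.sync + t.render;
}

// Lock-step two-thread layout: sim overlaps with render, then both threads
// meet for the sync. max() extends the post's formula to Tsim >= Trender;
// that extension is an assumption of this sketch.
double frame_time_lockstep(const StepTimes& t) {
    return std::max(t.sim, t.render) + t.sync;
}

double fps(double frame_time_ms) {
    assert(frame_time_ms > 0.0);
    return 1000.0 / frame_time_ms;
}
```

For example, with hypothetical timings of 1 ms sim, 2 ms sync, and 12 ms render, the model gives 15 ms per frame single-threaded (about 66.7 FPS) and 14 ms per frame in lock-step mode (about 71.4 FPS).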

Antheus and drakostar, a note on my simulation: I use a fixed timestep of 10 ms (game time). Each sim step will do multiple 10 ms updates until the game time matches wall clock time (which doesn't take very long because the sim is quick). After that it sits and waits until the render finishes so it can send it the latest data, then it repeats and does more updates until the game time matches wall clock time again. So, the fixed timestep of the sim means it will be doing waiting in one form or another. But, I think I understand the essence of your points, which is that I can get better performance by eliminating that sync portion that ties up both threads and running things continuously. I'll definitely look into that.
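The catch-up behaviour described above (run 10 ms ticks until game time matches wall clock time) might look roughly like this. The `Simulation` struct and `catch_up` function are hypothetical stand-ins, and the wall clock is passed in as a plain millisecond count so the sketch does not depend on any particular timer API.

```cpp
#include <cassert>

// Hypothetical stand-in for the real simulation.
struct Simulation {
    long long game_time_ms = 0;
    int updates = 0;
    void update_fixed(int dt_ms) {  // one tick of physics, AI, input, ...
        game_time_ms += dt_ms;
        ++updates;
    }
};

const int kTimestepMs = 10;  // fixed timestep from the post

// Runs as many 10 ms ticks as fit before `wall_time_ms` and returns how many
// ran. Any remainder shorter than one tick is left for the next call.
int catch_up(Simulation& sim, long long wall_time_ms) {
    assert(wall_time_ms >= sim.game_time_ms);
    int ticks = 0;
    while (sim.game_time_ms + kTimestepMs <= wall_time_ms) {
        sim.update_fixed(kTimestepMs);
        ++ticks;
    }
    return ticks;
}
```

Called once per loop iteration, this returns 0 whenever less than 10 ms of wall time has passed since the last tick, which is the point where the main thread has nothing left to simulate and ends up waiting.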
Alright, I added a second buffer to the render thread and eliminated the lock-step sync, so now it looks like this:

main thread:   |---sim---|---sync---|---wait---|
render thread: |-------------render------------|

and repeat.

My processor usage is up to:
100% core usage from the render thread
34% core usage from the main thread

This is nifty because now I should be able to add considerable complexity to the simulation side without affecting the performance in multithreaded mode at all. Thanks for the help, guys!

As a side note, my game is a racing simulation (vdrift.net) so I have a handy way to scale simulation complexity that won't tick off single-threaded users: allow racing against more AI cars if you have more cores!
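A possible shape for "more AI cars if you have more cores" is sketched below. The baseline, per-core increment, and cap are invented numbers for illustration only and are not taken from VDrift; the only library call used is `std::thread::hardware_concurrency()`, which may return 0 when the core count cannot be determined.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical policy: a baseline field of opponents on one core, a few more
// for every additional core, up to a fixed cap.
unsigned max_ai_cars(unsigned cores) {
    const unsigned kBase = 3, kPerExtraCore = 4, kCap = 15;
    if (cores == 0) cores = 1;  // treat "unknown" as a single core
    return std::min(kCap, kBase + kPerExtraCore * (cores - 1));
}

// Convenience wrapper that asks the standard library for the core count.
unsigned max_ai_cars_for_this_machine() {
    return max_ai_cars(std::thread::hardware_concurrency());
}
```

Keeping the policy in a function that takes the core count as a parameter makes it easy to check the limits without needing machines with different CPUs.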
