Sign in to follow this  

Vsync messing with my job system

Recommended Posts

Hey guys,

 

I have been investigating a weird slowdown in my project that turned out to be Vsync causing my job system to perform poorly. I am not sure why this is happening though, so I ask for ideas.

 

I have a simulation running where particle systems are updated on the job system. The job system is fairly standard, it uses a pool of threads and gives those threads jobs to do. In addition, you can add dependencies to a job and have the main thread wait until the job is complete. In the case with the particle systems causing problems, each particle system's update function is thrown on the job system and the main thread waits for all of the update jobs to complete before continuing. In practice this is implemented with Win32 events and the main thread is using WaitForSingleObject() to wait for the depencency linked to all the update jobs.

 

This works fine when the frame rate is unlocked, and it even works fine when the main thread is limiting the framerate through code (by spinning in the beginning of the frame until the desired delta time has been reached). However, when turning on VScync to limit the framerate (in this case to 60FPS) the WaitForSingleObject() call starts taking anywhere between 1ms-500ms. With VSync off I can happily throw hundreds of particle systems on the job system and it performs fine. With VSync on the multithreaded update of particle systems causes the framerate to vary wildly (as previously mentioned). However, if I update the particle systems serially on the main thread the system works farily ok (although a bit slower since it does not benefit from the job system) even with VSync on.

 

So in short, having Vsync on makes the job system dependency waiting perform very poorly. Any idea what might cause this? Literally the only different in my engine between having VSync on/off is passing 1 or 0 as the first parameter to IDXGISwapChain::Present(). I am using DX11 for rendering and the swap effect has been set to DXGI_SWAP_EFFECT_DISCARD. The tests have been done in window mode. With VSync on I presume the call to Present() sleeps until it is time to present the content on screen. Is it possible that the machine does some weird context switching because of this, causing my threads to jump around and cause this behavior?

 

Thanks!

Share this post


Link to post
Share on other sites

You can throttle your frames or you can enable v-sync. Don't do both or you'll get drawing issues. Also, if it's causing lag then you're throttling incorrectly.

 

You really don't need to throttle the drawing in any case. Throttle your logic updates (search: "fix your timestep") but just draw as often as the machine can and just interpolate forward based on the time elapsed since the last logic update. You can enable v-sync in that case and it should work perfectly.

Share this post


Link to post
Share on other sites

Thanks for the tips, however, they don't help me in this case.

 

First of all I obviously don't have VSync and the manual frame rate limiter on at the same time, it's an either-or. In my tech there are good reasons for limiting the actual drawing rate, eg. in the level editor which is a WinForms app I am using the native engine to draw straight into a .net panel control using the raw handle and on many slightly older machines the WinForms app message pump gets flooded unless I limit the frame rate to 30 or 60.

 

I am using a fixed time step for the updates (two actually, a physics update rate and a tick rate for server gameplay and network logic). This is all fine and dandy, and besides the point I think.

 

Any idea why enabling VSync would mess up the WaitForSingleObject() duration?

Share this post


Link to post
Share on other sites
Are the jobs themselves doing anything that would have to wait for the next presentation interval?

I would try to find a thread profiler to see if you can spot anything obvious. AMD CodeXL might work, but I haven't actually used it for thread profiling yet.

Share this post


Link to post
Share on other sites

No they are running a method full of math functions, basically only updating the particle systems and not touching any other engine subsystems such as the renderer.

 

I am currently rewriting some dodgy parts of the job system (something I have put off for years already) to be able to debug it more easily. I don't particularly count on it fixing the problem though, so keep the suggestions coming!

Share this post


Link to post
Share on other sites
Posted (edited)
Have you tried to make the minimum program that behaves like that?
I mean, create only one thread, have vsync active and check if WaitForSingleObject does that weird thing.

Have you tried in another window version? Edited by nesdavid

Share this post


Link to post
Share on other sites

How do you propose having only one thread but still using WaitForSingleObject()? I am using the function explicitly for thread synchronization.

 

As for the windows version, the behavior is the same on Win7 and Win10.

Share this post


Link to post
Share on other sites
I see, sorry. Yes, I have tried with only one worker thread and it did seem like the wait duration did not vary as much as with the normal six worker thread setup.

I should have the job system refactor done later today or tomorrow and I should be able to debug it better. Cheers!

Share this post


Link to post
Share on other sites

six worker thread
So your game uses 7 threads? Plus probably at least 1 internal D3D thread, GPU driver thread, etc...

What kind of CPU do you have does it have 8 physical cores? Or 4 physical cores with hyper-threading, so pretending to be 8?

Do the job-threads do any kind of synchronization? What's their behavior when the job queue is empty?

Share this post


Link to post
Share on other sites

No dedicated D3D thread. There is a dedicated job system scheduler thread though. If you have Game Programming Gems 7, the job system follows the general idea found in there, except I have rewritten the whole system twice now. The synchronization is largely the same as in GPG7 too. The worker threads wait for work using events. The scheduler waits for things to schedule using an event. When a job is given to the job system the scheduler wakes up and signals the worker threads that there is work. The first worker thread to fire its "I am ready" signal gets the work. After finishing the worker goes back to waiting for more work etc. In other words there is no idle spinning or anything like that.


Oh, and regarding the thread count. I can lower the worker thread count to two but there is still significant (5-8 ms) extra waiting when VSync is on. This is running on an i7 with four cores and HT.

Share this post


Link to post
Share on other sites

Ok so let me ask you this, what happens when I turn on VSync?

 

I assume the call to Present() blocks until the contents have been presented on screen? Does it wait for some OS signal until that happens (similar to how I am waiting for events to sync my threads) or does it spin, or something else completely?

 

Cheers!

Share this post


Link to post
Share on other sites
Posted (edited)

To me it sounds like you're first waiting for jobs to finish using WaitForSingleObject(), then waiting again for vertical sync.

 

Just a thought I got:

WaitForSingleObject() causes a delay, which makes the scanline reach the end of the frame and start a new frame, and if you don't wait using WaitFor..() you would have Present()'ed on the current frame, but now you've skipped a frame, thus giving you more delay when reaching your Present() because it will have to wait almost an entire frame?

 

If I would do anything, it would be moving the particle system to the GPU, freeing up CPU for other stuff :)

Edited by vinterberg

Share this post


Link to post
Share on other sites

To me it sounds like you're first waiting for jobs to finish using WaitForSingleObject(), then waiting again for vertical sync.

 

Just a thought I got:

WaitForSingleObject() causes a delay, which makes the scanline reach the end of the frame and start a new frame, and if you don't wait using WaitFor..() you would have Present()'ed on the current frame, but now you've skipped a frame, thus giving you more delay when reaching your Present() because it will have to wait almost an entire frame?

 

If I would do anything, it would be moving the particle system to the GPU, freeing up CPU for other stuff :)

 

That's an interesting idea. However, I doubt it is that because, in the current scene I am using to test this, without VSync or the frame rate limiter I get an average of 400 FPS, so it hardly seems like the WaitFor call takes too long in that case. Turning Vsync on somehow makes either the PS update jobs themselves or the wait call that waits for all of them to complete to take significantly longer than without VSync. I still suspect that whatever Present() does when VSync is on is somehow messing with my threads causing the jobs to take longer or perhaps I have a subtle bug in the system which makes the thread synchronization code misbehave.

Share this post


Link to post
Share on other sites

Can you capture the value reported by QueryPerformanceCounter()

  • when each thread completes
  • when all threads are completed
  • when the WaitForSingleObject() wakes

and the value returned by QueryPerformanceFrequency()

Share this post


Link to post
Share on other sites

Here are a few samples, from four subsequent frames:

 

At end of update job, counter: 3398397919089, frequency: 3903985
At end of update job, counter: 3398397919109, frequency: 3903985
At end of update job, counter: 3398397919398, frequency: 3903985
At end of update job, counter: 3398397920061, frequency: 3903985
At end of update job, counter: 3398397920084, frequency: 3903985
At end of update job, counter: 3398397920152, frequency: 3903985
At end of update job, counter: 3398397924222, frequency: 3903985
At end of update job, counter: 3398397927228, frequency: 3903985
At end of update job, counter: 3398397930381, frequency: 3903985
When all jobs complete, counter: 3398397949064, frequency: 3903985
After WaitForSingleObject returns, counter: 3398397951995, frequency: 3903985
 
At end of update job, counter: 3398397972122, frequency: 3903985
At end of update job, counter: 3398397972658, frequency: 3903985
At end of update job, counter: 3398397972829, frequency: 3903985
At end of update job, counter: 3398397972892, frequency: 3903985
At end of update job, counter: 3398397972966, frequency: 3903985
At end of update job, counter: 3398397973016, frequency: 3903985
At end of update job, counter: 3398397975788, frequency: 3903985
At end of update job, counter: 3398397978940, frequency: 3903985
At end of update job, counter: 3398397982192, frequency: 3903985
When all jobs complete, counter: 3398398004404, frequency: 3903985
After WaitForSingleObject returns, counter: 3398398007546, frequency: 3903985
 
At end of update job, counter: 3398398022567, frequency: 3903985
At end of update job, counter: 3398398022585, frequency: 3903985
At end of update job, counter: 3398398022904, frequency: 3903985
At end of update job, counter: 3398398023087, frequency: 3903985
At end of update job, counter: 3398398023174, frequency: 3903985
At end of update job, counter: 3398398023177, frequency: 3903985
At end of update job, counter: 3398398028116, frequency: 3903985
At end of update job, counter: 3398398031368, frequency: 3903985
At end of update job, counter: 3398398035454, frequency: 3903985
When all jobs complete, counter: 3398398055518, frequency: 3903985
After WaitForSingleObject returns, counter: 3398398058745, frequency: 3903985
 
At end of update job, counter: 3398398078325, frequency: 3903985
At end of update job, counter: 3398398078329, frequency: 3903985
At end of update job, counter: 3398398078349, frequency: 3903985
At end of update job, counter: 3398398078382, frequency: 3903985
At end of update job, counter: 3398398078616, frequency: 3903985
At end of update job, counter: 3398398078634, frequency: 3903985
At end of update job, counter: 3398398081917, frequency: 3903985
At end of update job, counter: 3398398084962, frequency: 3903985
At end of update job, counter: 3398398088323, frequency: 3903985
When all jobs complete, counter: 3398398110501, frequency: 3903985
After WaitForSingleObject returns, counter: 3398398113692, frequency: 3903985
 
The "At end of update job" rows are added as the last thing in the job itself, ie. called by the job thread. The "When all jobs complete" row is added by the last job in the group, ie. also called by the job thread. The "After WaitForSingleObject returns" is added by the main thread.
 
Does this tell you anything?

Share this post


Link to post
Share on other sites

Okay, I am done with the job system refactor now, but it did not yet provide me with any new insights into what the problem might be. My next order of business is to improve my profiler so that I can get a better picture of where the time goes with/without VSync.

 

I also noticed that the update() method of the video playback library I am using (libTheoraPlayer) starts taking longer with VSync on as well. The library creates a single worker thread that it uses to decode the video playback, so I wonder if this is somehow related to what I am seeing with my own worker threads. Also, by completely disabling the video playback library (and thus creating one less worker thread) the extra waiting time with VSync goes down a little (just like it does if I create fewer own my own worker threads), so it definitely seems like there is some weird thread scheduling issue going on.

 

Do you generally set thread affinity or preferred cores for your threads on PC? I am talking about using SetThreadAffinityMask() and/or SetThreadIdealProcessor() to make sure the threads go on different cores. I am currently not doing any of this since I used to restrict my main thread to core 0 in the past and that obviously did not go well with many instances of the game. Should I assign cores for my threads on PC or let the OS decide where things should go?

Share this post


Link to post
Share on other sites

sounds like further profiling is necessary to determine where the time is going. get elapsed ticks should be sufficient for that, or you can use profilers.

 

me personally, i'd suspect a bunch of jobs that all have to finish, then render, then wait for vsync to present.

 

you'll need to time things and turn things on and off until you find it.  it may be tempting to guess, or "throw code at the problem", but you really need to know for sure whats wrong, and that its fixed correctly.

Share this post


Link to post
Share on other sites

Does this tell you anything?

After WaitForSingleObject returns, counter: 3398397951995(870494.6233131018), frequency: 3903985
After WaitForSingleObject returns, counter: 3398398007546(870494.6375424085), frequency: 3903985 (14.2293067 ms since prev.)
After WaitForSingleObject returns, counter: 3398398058745(870494.650656957),  frequency: 3903985 (13.1145485 ms since prev.)
After WaitForSingleObject returns, counter: 3398398113692(870494.6647315499), frequency: 3903985 (14.0745929 ms since prev.)

Either your monitor is running at 70Hz or vsync isn't on.

Share this post


Link to post
Share on other sites

You are absolutely right. I did so many things yesterday at the same time I must have messed up and ran the tests without VSync. My apologies. Here are new samples with VSync:

 

After WaitForSingleObject returns, counter: 3648415846943, frequency: 3903985
After WaitForSingleObject returns, counter: 3648415911650, frequency: 3903985
After WaitForSingleObject returns, counter: 3648415976977, frequency: 3903985
After WaitForSingleObject returns, counter: 3648416041226, frequency: 3903985
After WaitForSingleObject returns, counter: 3648416106776, frequency: 3903985
 
However, if it is just the frame time you want I already have ways of getting that (using QPC and QPF indirectly). I wasn't sure what you were after. Was there something in particular you were looking for in the timing samples? Something I should look out for?

Share this post


Link to post
Share on other sites

Yes, it is 60 fps. And with VSync on most of that time goes to waiting for WaitForSingleObject(), waiting for the particle system jobs to complete. With my manual framerate limiter (limiting to the same 60fps), the call to WaitForSingleObject() takes less than a ms, and so most of the time goes into idle spinning in my main thread to limit the fps.

 

What I still don't understand is why the exact same number of jobs for the exact same number of particle systems finish in under half a ms when VSync is off but it can take up to 15ms when VSync is on (usually less though, which is why you saw 60fps in my timing samples). I don't see how the main thread sleeping (ie Vsync) can affect a number of worker threads in such as way, unless it is something with the OS scheduling the threads on different cores since the main thread is sleeping. Also, I don't know if VSync causes the actual jobs on the background threads to take longer, or if the WaitForSingleObject() call simply does not return when all jobs have finished.

 

Anyway, I will continue with my next task which is to improve my profiler.

Share this post


Link to post
Share on other sites
Posted (edited)

I feel like I am explaining the same things over and over again, so let me just make it perfectly clear what the problem is and where I am in my investigations:

 

With VSync off and manual framerate limiter on (all fine):

  • The framerate is a consistent 60FPS
  • The particle system update jobs finish quickly
  • Thus, the main thread WaitForSingleObject() call that wait for the threads also finishes quickly.
  • Because of this most of the time of the frame is spent idle spinning in the main thread to manually limit (or throttle) the framerate.

Without any framerate limiting (also fine):

  • Mostly the same as above, the jobs also finish quickly and the frame only takes around 2-3 ms.

With VSync on (not fine):

  • In most of the game the framerate is fine, locked to 60FPS because of my 60Hz monitor
  • (The remainder of the time of the frame is spent in Present() waiting for the vertical sync)
  • Whenever there are particle systems updated on the job system, in separate worker threads, the framerate starts varying wildly.
  • My profiler shows that the time is spent in the main thread call to WaitForSingleObject() which waits for a job group. The group signal is fired when all pending jobs (ie. the particle system jobs) have completed. <- THIS IS THE PROBLEM!!!
  • If I turn off background updating of the particle systems and instead update them serially on the main thread the performance is hampered because of the extra main thread load but I still hit a solid 60FPS.

So what I would expect with VSync is that it would behave much like with the manual framerate limit, that the PS update jobs would finish quickly and that we would hit a solid 60FPS, spending most of the frame inside the call to Present(). I don't see how changing how the framerate is limited (going from having the main thread idle spinning in my own code to sleeping(?) inside Present()) makes my background jobs take a hundred times longer. The only thing that comes to mind is that sleeping in Present() (which I assume is what happens when you turn VSync on) somehow affects the thread scheduling so that jobs are put on different cores than they otherwise would. And that is why I asked about thread affinities etc.

 

Does this make sense? And thank you for all the help thus far!

Edited by GuyWithBeard

Share this post


Link to post
Share on other sites

No dedicated D3D thread.

I meant that D3D itself probably creates a thread. So you've likely got 10+ threads running on your 4/8 core PC. That should be fine as long as they're all idling properly, but would act like you describe if they're busy waiting instead of idling.
 
Perhaps the D3D thread busy-waits during vsync, and your threads are also busy-waiting, causing CPU oversubscription? I know you said that your threads aren't so that's a long shot, but oversubscription is the only explanation I can come up with.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this