The GPU comes too late for the party...

Well, I have an issue with OpenGL rendering under Windows 10 on an NVIDIA laptop (Intel HD4000 + NVIDIA 650M). It seems as if the GPU only starts to process a frame's command queue after I call SwapBuffers in the next frame. Example (the millisecond numbers do not represent the real data):

Frame A starts at 100
..
Fill command queue for A until 150
..
Frame A ends at 200

Frame B starts at 200
call SwapBuffers
-> GPU starts processing Frame A now until 250
-> CPU-thread is blocked(!) until 251
Fill command queue for B until 300
Frame B ends at 350

Frame C starts at 350
SwapBuffers
-> GPU starts processing Frame B now until 400
-> CPU-thread is blocked(!) until 401
Fill command queue for C until 450
Frame C ends at 500

The GPU is not active for my game for around 50 ms per frame, forcing the CPU main thread to idle for 50 ms too. Why does OpenGL wait so long before submitting its command queue to the GPU? Any tips or ideas?
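For reference, the frame loop is basically structured like this (a simplified sketch; UpdateGame and RenderFrame are just placeholders for my actual code):

```cpp
#include <windows.h>

// Placeholders for the real game code.
void UpdateGame();
void RenderFrame();

void MainLoop(HDC hDC, bool& running)
{
    while (running)
    {
        SwapBuffers(hDC);   // start of frame N: blocks ~50 ms here while frame N-1 executes on the GPU
        UpdateGame();       // CPU-side game update
        RenderFrame();      // fill the GL command queue for frame N
    }
}
```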

Some additional information:
- VSync is turned off (by the game; I also tested enabling/disabling it in the drivers).
- No such problem with an NVIDIA GPU on a desktop PC under Windows Vista.
- No such problem with the Intel HD4000 on the same laptop.
- I'm not using glFinish or glFlush.
- High-performance counter for the CPU, double-buffered glQueryCounter for the GPU.
- Profiling is based on recording the CPU time while submitting a timestamp query to the GPU.

> Profiling is based on recording the CPU time while submitting a timestamp query to the GPU.

How does this work? If the query is a command in the command buffer, how do you match the CPU time of submission to the time that the GPU executes it?

Or... how do you match up the GPU timestamps to the CPU's timeline?

At the start of each frame I take both a CPU and a GPU timestamp. For the GPU I use glGetInteger64v(GL_TIMESTAMP, ...). Then during the frame I submit multiple glQueryCounter(id, GL_TIMESTAMP) calls and save the CPU time alongside each one.

With one frame of delay I read the timestamp back from the GPU using glGetQueryObjectui64v(..., GL_QUERY_RESULT, ...), after checking that it is really available with glGetQueryObjectiv(..., GL_QUERY_RESULT_AVAILABLE, ...).
I'm using a double-buffering approach and do not wait for the query to become available, but I output an error if it is not ready yet.

To compare them I scale both down to microseconds. Since the two clocks are not 100% in sync, I scale the GPU time since frame start by (CPU frame duration)/(GPU frame duration) [~2-5%]. That's it; I put all times in relation to the first frame where profiling starts, and I only profile for ~10 frames to limit inaccuracy.
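Stripped down, the setup looks roughly like this (a simplified sketch; the FrameQueries struct, the fixed marker count and the helper names are just illustrative, GL entry points are assumed to be loaded via GLEW or similar, and the clock-drift scaling described above is left out):

```cpp
#include <GL/glew.h>
#include <chrono>
#include <cstdio>

struct FrameQueries
{
    GLuint  ids[16]       = {};  // glQueryCounter query objects issued during the frame
    double  cpuTimes[16]  = {};  // CPU time (us) at the moment each query was submitted
    int     count         = 0;
    GLint64 gpuFrameStart = 0;   // glGetInteger64v(GL_TIMESTAMP) at frame start
    double  cpuFrameStart = 0;
};

static FrameQueries gQueries[2]; // double buffer: write frame N, read back frame N-1
static int gWrite = 0;

static double CpuMicroseconds()
{
    using namespace std::chrono;
    return duration<double, std::micro>(steady_clock::now().time_since_epoch()).count();
}

void BeginFrameProfile()
{
    FrameQueries& q = gQueries[gWrite];
    q.count = 0;
    glGetInteger64v(GL_TIMESTAMP, &q.gpuFrameStart); // current GPU clock, does not stall
    q.cpuFrameStart = CpuMicroseconds();
}

void ProfileMarker()
{
    FrameQueries& q = gQueries[gWrite];
    if (q.count >= 16) return;
    if (q.ids[q.count] == 0) glGenQueries(1, &q.ids[q.count]);
    glQueryCounter(q.ids[q.count], GL_TIMESTAMP); // GPU records its clock when it reaches this command
    q.cpuTimes[q.count] = CpuMicroseconds();      // CPU time at submission
    ++q.count;
}

void EndFrameProfile()
{
    // Read back last frame's queries; after one frame of delay they should be ready.
    FrameQueries& prev = gQueries[gWrite ^ 1];
    for (int i = 0; i < prev.count; ++i)
    {
        GLint available = 0;
        glGetQueryObjectiv(prev.ids[i], GL_QUERY_RESULT_AVAILABLE, &available);
        if (!available) { std::fprintf(stderr, "query %d not ready\n", i); continue; }

        GLuint64 gpuTime = 0;
        glGetQueryObjectui64v(prev.ids[i], GL_QUERY_RESULT, &gpuTime);

        double gpuSinceStartUs = (gpuTime - prev.gpuFrameStart) / 1000.0; // ns -> us
        double cpuSinceStartUs = prev.cpuTimes[i] - prev.cpuFrameStart;
        std::printf("marker %d: submitted at CPU +%.1f us, executed at GPU +%.1f us\n",
                    i, cpuSinceStartUs, gpuSinceStartUs);
    }
    gWrite ^= 1;
}
```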

Even with some tolerance, it seems that the main thread only continues processing roughly ~1 ms after the GPU has finished the previous frame, and this holds over several frames.

This is an NVIDIA Optimus system, and you're not the only one to have observed this degree of extra latency with this technology; see e.g. https://www.reddit.com/r/oculus/comments/30lcpo/my_testing_shows_nvidia_optimus_on_880m_adds/


Yes, it is an Optimus system. Here is the profiling data from my desktop and laptop; the large pink part is really just the span between the timestamps taken before and after calling SwapBuffers(hDC):

[profiling screenshots]
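The pink measurement itself is nothing fancier than this (a sketch; in the real code the two counter values go into the profiler instead of being printed):

```cpp
#include <windows.h>
#include <cstdio>

// Sketch of how the swap is bracketed; hDC is the window's device context.
void TimedSwap(HDC hDC)
{
    LARGE_INTEGER freq, before, after;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&before);
    SwapBuffers(hDC);                // nearly the whole GPU frame appears to execute in here
    QueryPerformanceCounter(&after);

    double swapMs = 1000.0 * double(after.QuadPart - before.QuadPart) / double(freq.QuadPart);
    std::printf("SwapBuffers: %.2f ms\n", swapMs);
}
```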
On Optimus systems, the Intel card is always the one hooked to the monitor.

If you call get() query family for getting timestamp information, you're forcing the NV driver to ask Intel drivers to get information (i.e. have we presented yet?).

Does this problem go away if you never call glQueryCounter, glGetQueryObjectiv & glGetInteger64v(GL_TIMESTAMP)?

Calling glFlush could help you force the NV drivers to start processing sooner, but it's likely the driver will just ignore that call (since it's often abused by dumb programmers).
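Something along these lines (just a sketch; Render and DoCpuWork are placeholders for whatever your frame does):

```cpp
#include <windows.h>
#include <GL/gl.h>

// Placeholders for the application's own work.
void Render();     // issues all GL commands for the frame
void DoCpuWork();  // CPU work that could overlap with GPU execution

void Frame(HDC hDC)
{
    Render();
    glFlush();        // hint the driver to start executing now instead of waiting for the swap
    DoCpuWork();
    SwapBuffers(hDC);
}
```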

I also recall NV driver config in their control panel having a Triple Buffer option that only affected OpenGL. Try disabling it.

Edit: Also if it's Windows 7, try disabling Aero. Also check with GPUView (also check this out).

> On Optimus systems, the Intel card is always the one hooked to the monitor.
> If you call get() query family for getting timestamp information, you're forcing the NV driver to ask Intel drivers to get information (i.e. have we presented yet?).

Even if it asks the Intel driver, it doesn't seem to slow things down, considering that I submit hundreds of queries per frame. And why should it delay the rendering?


> Does this problem go away if you never call glQueryCounter, glGetQueryObjectiv & glGetInteger64v(GL_TIMESTAMP)?

I will test it out.


> Calling glFlush could help you force the NV drivers to start processing sooner, but it's likely the driver will just ignore that call (since it's often abused by dumb programmers).

I will test this one out too.


> I also recall NV driver config in their control panel having a Triple Buffer option that only affected OpenGL. Try disabling it.

Played around with it to no effect.


> Edit: Also if it's Windows 7, try disabling Aero. Also check with GPUView (also check this out).

It is not a matter of performance, but a matter of delay and synchronisation. All rendering on the GPU falls more or less exactly within the SwapBuffers call, as if the driver waits too long before submitting the command queue and eventually gets forced to by the call to SwapBuffers.

I have Windows 10 btw. I looked into GPUView, but it seems to track only DirectX events.

The point we're making is that on an Optimus system, you run in one of the following two configurations:

1) Using the Intel card exclusively for everything, or,

2) Using the NVIDIA card for the main rendering, after which the framebuffer is transferred to the Intel card and presented to the screen.

Page 14 of the Optimus Whitepaper explains this (my emphasis):

When using non-taxing applications to accomplish basic tasks, like checking email or creating a document, Optimus recognizes that the workload does not require the power of the GPU. As a result, Optimus completely shuts off the GPU (and associated PCIe lanes) to provide the highest possible efficiency and battery life. In this case, illustrated in Figure 5, the IGP will be used for all processing duties and will also act as a display controller to output the frames to the display.

As soon as applications that can benefit from the power of the GPU are invoked, like watching Flash video, gaming, or converting video from one format to another using CUDA, Optimus instantly enables the GPU. As shown in the figure above, the GPU handles all processing duties and the IGP is only used as a display controller to render the GPU's output to the display.

So in the second (NVIDIA/Intel) configuration there will always be some extra latency while the framebuffer is being transferred to the IGP for presenting, and this is normal for the way that Optimus operates. The first (Intel only) configuration on the other hand avoids the latency but at the cost of running on the weaker GPU. Sometimes - depending on what kind of rendering workload you're doing and/or where your bottlenecks are - that might even translate into higher overall performance. The nature of Optimus means that, unfortunately, there is no pure NVIDIA-only configuration.


> On Optimus systems, the Intel card is always the one hooked to the monitor.
> If you call get() query family for getting timestamp information, you're forcing the NV driver to ask Intel drivers to get information (i.e. have we presented yet?).

> Even if it asks the Intel driver, it doesn't seem to slow things down, considering that I submit hundreds of queries per frame. And why should it delay the rendering?

Because the NV driver needs to wait on the Intel driver. In a perfect world the NV driver would be asynchronous if you're asynchronous as well. But in the real world, where Optimus devices can't even get VSync straight, I wouldn't count on that. At all. There are several layers of interface communication (NV stack, DirectX, WDDM, and Intel stack), and a roundtrip is going to be a deep hell where a synchronous path is likely the only one implemented.

> Edit: Also if it's Windows 7, try disabling Aero. Also check with GPUView (also check this out).

> It is not a matter of performance, but a matter of delay and synchronisation. All rendering on the GPU falls more or less exactly within the SwapBuffers call, as if the driver waits too long before submitting the command queue and eventually gets forced to by the call to SwapBuffers.

Aero adds an extra layer of latency. Anyway, not the problem since you're on Windows 10.

> I have Windows 10 btw. I looked into GPUView, but it seems to track only DirectX events.

GPUView tracks GPU events such as DMA transfers, page commits, memory eviction, and screen presentation. All of these are API agnostic and thus work with both DirectX and OpenGL (I've successfully used GPUView with OpenGL apps). IIRC it also supports some DX-only events, but that's not really relevant for an OGL app.

