Render, Update, then flip the pancake?

Started by
25 comments, last by Etnu 19 years, 8 months ago
Quote:Original post by Etnu
Any command can force the command queue to flush, once it's full.

Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.
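To illustrate what I mean (just a rough sketch; g_pDevice and the objects array are made-up names, and I'm assuming the stream source, indices and FVF are already set): instead of one DrawIndexedPrimitive per object, pack objects that share a material into one vertex/index buffer range and draw them with a single call.

// Rough illustration only. Assumes g_pDevice is a valid IDirect3DDevice9*,
// SetStreamSource / SetIndices / SetFVF have already been called, and
// "objects" is a made-up array describing each object's buffer range.

// Many small draws: per-call runtime and driver overhead adds up.
for (UINT i = 0; i < numObjects; ++i)
{
    g_pDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                                    objects[i].baseVertex,
                                    0,
                                    objects[i].numVerts,
                                    objects[i].startIndex,
                                    objects[i].numTris);
}

// One batched draw: the same geometry packed into a single buffer range.
g_pDevice->DrawIndexedPrimitive(D3DPT_TRIANGLELIST,
                                0,            // BaseVertexIndex
                                0,            // MinVertexIndex
                                totalVerts,   // NumVertices
                                0,            // StartIndex
                                totalTris);   // PrimitiveCount

Fewer, fatter calls means fewer chances of filling the command queue in the first place.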
Quote:Original post by Etnu
Microsoft's own documentation specifically points to the swap as being one of the most intensive tasks that can be done.

Can you give a link? This is the first time I've heard of this being a problem.
Quote:Original post by Etnu
The setup you described guarantees nothing, unless you wrote the driver yourself and wrote your code to know exactly when the optimal time to do work will be.

Not according to this presentation. There's a lot more information scattered on NVidia and ATI sites. I'll try to find more links tonight or tomorrow.
Quote:Original post by Etnu
You can most certainly gain from a rendering thread in a seperate loop, as it's the only way to be 100% sure that the thread is not wasting clock cycles.

Sorry, I still don't see the benefit. The thread isn't wasting which clock cycles? CPU or GPU? If your GPU pipeline stalls, it doesn't really matter if you use a separate thread to issue commands: you have to wait until the queued work has been rendered before you can continue filling the pipeline up. Can you clarify the benefit of a separate thread?
Quote:Original post by Etnu
Read the SDK documentation if you don't believe me on this one. It's clearly outlined there, better than I could possibly explain, that there is no way to be sure of when the card is busy and when it's not.

There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't sitting idle if you spend enough time profiling and ironing out the bottlenecks.
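One way to check this in practice (a rough sketch only; pDevice is assumed to be a valid IDirect3DDevice9*, the headers are already included, and error checking is omitted) is to drop an event query at the end of the frame and measure how long the CPU has to spin before the GPU catches up:

// Sketch: measure how long the CPU waits on the GPU at the end of a frame.
IDirect3DQuery9* pEventQuery = NULL;
pDevice->CreateQuery(D3DQUERYTYPE_EVENT, &pEventQuery);

// ... issue all the draw calls for this frame ...

pEventQuery->Issue(D3DISSUE_END);

LARGE_INTEGER freq, start, stop;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&start);

// GetData returns S_FALSE until the GPU reaches the event;
// D3DGETDATA_FLUSH makes sure the queued commands are actually submitted.
while (pEventQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
{
    // CPU is spinning here, i.e. you are GPU-bound for this frame.
}

QueryPerformanceCounter(&stop);
double msWaitedOnGpu =
    1000.0 * (stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;

// If msWaitedOnGpu is ~0 frame after frame, the GPU finished before the
// CPU did, i.e. the GPU was the one sitting idle.

pEventQuery->Release();

It's not a guarantee, but it tells you which side was waiting on which for that frame.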

[Edited by - CoffeeMug on August 4, 2004 8:22:17 AM]
Quote:Original post by CoffeeMug
Yeah, but that shouldn't normally happen. If it does, you're not using the API correctly. Normally, if you're overflowing the queue you should increase the batch sizes and decrease the number of commands.


That's not necessarily true; it's quite possible on modern hardware for the command queue to flush 2-3 times in a single iteration of the render loop, even with optimal batching. Of course, that does depend on your data.

Quote:
Can you give a link? This is the first time I hear about this being a problem.


2k4 SDK Docs -> DirectX Graphics -> Advanced Topics -> Accurately Profiling Direct3D API Calls.

Quote:Sorry, I still don't see the benefit. The thread isn't wasting which clock cycles? CPU or GPU? If your GPU pipeline stalls, it doesn't really matter if you use a separate thread to issue commands: you have to wait until the queued work has been rendered before you can continue filling the pipeline up. Can you clarify the benefit of a separate thread?


Simple; you never have to worry about what the GPU is doing, and can have as high a resolution as you'd like within your physics / input / sound code, completely independent of your rendering loop. The fact of the matter is that there will always be lost cycles when dealing with D3D calls, because, again, you may encounter things like flushes happening in the middle of the loop. A profiler will quickly show you that this happens.
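Roughly what I have in mind (a bare-bones sketch; UpdateInput, UpdatePhysics, UpdateSound and RenderFrame are placeholders, and real code needs proper synchronization around any shared state):

// Bare-bones sketch of a decoupled update thread (Win32).
#include <windows.h>

// Placeholders, defined elsewhere in your engine.
void UpdateInput();
void UpdatePhysics();
void UpdateSound();
void RenderFrame();   // all the D3D calls plus Present()

volatile LONG g_running = 1;

DWORD WINAPI UpdateThreadProc(LPVOID)
{
    while (g_running)
    {
        UpdateInput();
        UpdatePhysics();
        UpdateSound();
        // This loop ticks at its own rate, no matter how long the render
        // thread spends blocked inside D3D calls or the flip.
    }
    return 0;
}

// Somewhere in startup code:
//   HANDLE hUpdate = CreateThread(NULL, 0, UpdateThreadProc, NULL, 0, NULL);
//
// Main thread does nothing but render:
//   while (g_running) { RenderFrame(); }
//
//   WaitForSingleObject(hUpdate, INFINITE);
//   CloseHandle(hUpdate);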

Quote:
There is no way to guarantee the GPU isn't waiting for the CPU. However, "guarantee" is a really strong word. You can be reasonably sure the GPU isn't sitting idle if you spend enough time profiling and ironing out the bottlenecks.


Yes, if you spend a lot of time piddling around with your timing and writing code to handle it, you can be "reasonably" sure. Alternatively, you can spend 2 hours and be 100% sure it's not waiting.

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, if the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:
DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
DrawPrimitive = 947,950


So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz
Quote:Original post by Mezz
Yes, modern hardware can flush the command queue multiple times a frame. This isn't a bad thing, because it allows the GPU to get a start actually doing some work. However, if the user-kernel mode swap is around 5000 cycles, is it really a performance bottleneck?

I read the documentation Etnu pointed at and found this:

Quote:
DrawPrimitive = kernel-transition + driver work + user-transition + runtime work
DrawPrimitive = 5000 + 935,000 + 2750 + 5000 + 900
DrawPrimitive = 947,950


So, while it's true that around 10,000 cycles are spent on those transitions, that's only about 1% of the total work being done.
I don't see how this could ever be a problem unless your transitions outweighed the work being done by the driver by some considerable amount, but in that case you're doing something drastically wrong anyway.

Is this why you're proposing multithreading, for cases where this flush occurs and you can regain performance by going to the other thread and doing some work there?

-Mezz


Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

Quote:Original post by Etnu
Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and on hyperthreaded CPUs, although to a much lesser extent).

EDIT: Also, the point in that paper about the time taken by the mode transition is with regard to accurate CPU profiling and *not* parallelisation with the GPU.

[Edited by - joanusdmentia on August 5, 2004 5:43:28 AM]
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Quote:Original post by joanusdmentia
Quote:Original post by Etnu
Right, but you're waiting for the driver to process all those commands. A separate thread could easily be executing (especially on multiprocessor / hyperthreaded systems).

But even if the call to the driver is on a different thread, the driver is eating up CPU cycles, so you're waiting no matter which thread the call is made on. The only place this will help is on a multiple-CPU system (and on hyperthreaded CPUs, although to a much lesser extent).


No, you've still got normal OS multitasking going on, which means that while the worker threads may run somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, so updates will continue to happen even while the main thread is waiting for a call to return.

Quote:

EDIT: Also, the point in that paper about the time taken by the mode transition is with regard to accurate CPU profiling and *not* parallelisation with the GPU.


Yes, but the cost of calls and how D3D works is still the same; they were talking about accurately measuring the amount of time a specific call takes, but they "accidentally" pointed out where code execution slows down.
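The general idea behind that kind of measurement is something like this (just a sketch, not the doc's actual code; device and query setup plus error handling are omitted): bracket the call under test with command-buffer flushes so the measurement isn't polluted by previously queued work.

// pDevice is an IDirect3DDevice9*, pQuery a D3DQUERYTYPE_EVENT query.
void FlushAndWait(IDirect3DQuery9* pQuery)
{
    pQuery->Issue(D3DISSUE_END);
    while (pQuery->GetData(NULL, 0, D3DGETDATA_FLUSH) == S_FALSE)
        ;   // spin until everything queued so far has been consumed
}

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

FlushAndWait(pQuery);                       // drain earlier work
QueryPerformanceCounter(&t0);

pDevice->DrawPrimitive(D3DPT_TRIANGLELIST, 0, primCount);   // call under test

FlushAndWait(pQuery);                       // force it through the driver
QueryPerformanceCounter(&t1);

double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

Do that and you see exactly where the time goes, which is the "accidental" part I was talking about.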

---------------------------Hello, and Welcome to some arbitrary temporal location in the space-time continuum.

I'm still not sure I follow you, Etnu - regardless of what thread the work is done in, the work still has to be done, so overall won't it just take the same amount of time?

I can only think of one example in which allowing another thread working time would be beneficial, and that's data streaming (music/large worlds).

Unless you want to be doing 'update' type work (AI/physics etc.) while in the middle of rendering, which I'm not sure is a good idea (unless absolutely everything is separated, with no data shared whatsoever).

-Mezz
Quote:Original post by Etnu
No, you've still got normal OS multitasking going on, which means that while the worker threads may run somewhat slower when a heavy load is on the main thread, they'll still get SOME timeslice, so updates will continue to happen even while the main thread is waiting for a call to return.

I don't see how this would increase parallelism with the GPU, though. When the driver flushes, it's CPU work transferring the commands to the GPU; the CPU isn't sitting idly waiting for the GPU (or am I wrong on this?). So the only thing you'd be achieving is artificially increasing your FPS while slightly reducing the number of times the scene is updated per second. Something along the lines of this:
        |      |
        |      |   render frame N
update  |     \|/
frame   |   -------
 N+1    |      |
        |      |   render frame N
       \|/    \|/

Sure, your render thread isn't waiting for the next frame to be updated, but you're just redrawing the same frame again.
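To put it another way (sketch only, the event and state names are made up): the render thread either blocks until the update thread signals a fresh frame, at which point it's waiting again anyway, or it doesn't block and just redraws stale state.

// Sketch of the two options. g_hFrameReady and g_frontState are made up.
#include <windows.h>

HANDLE g_hFrameReady;   // auto-reset event: CreateEvent(NULL, FALSE, FALSE, NULL)

// Update thread, once frame N+1's state is finished:
//     SetEvent(g_hFrameReady);

// Render thread:
for (;;)
{
    // Option A: wait for a fresh frame, so the render thread blocks after all.
    WaitForSingleObject(g_hFrameReady, INFINITE);

    // Option B: skip the wait, and you just re-render the same frame N,
    // exactly as in the diagram above.

    RenderFrame(g_frontState);   // placeholder
}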
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V
Calculations() first, then Render():

  Average render time:       9.249394 ms
  Average calculations time: 0.203390 ms

Render() first, then Calculations():

  Average render time:       9.175182 ms
  Average calculations time: 0.014599 ms


Take note of the average calculations time.

This isn't just a one-off, I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?
"Learn as though you would never be able to master it,
hold it as though you would be in fear of losing it" - Confucius
Quote:Original post by red_sodium
This isn't just a one-off, I tried the test several times under the same conditions. The interesting thing is that Render() always has the flip at the end of it, so this may or may not be relevant to this topic. Can anyone explain why my calculations are 10x faster if I Render() first?

You're doing something wrong when timing. :)
If your calculations are doing exactly the same work (and are presumably entirely on the CPU) then they should take exactly the same time (give or take variance due to task switching, etc.) no matter where you do your rendering, and even if there is *no* rendering. What we are talking about is the CPU time spent in the rendering code.

EDIT: Stupid me, that calculation timing is in milliseconds, not seconds :)
Ignore the difference, it's bugger all. Try doing some real calculations and then do your test again.
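If it helps, this is roughly the timing scaffold I'd use to check it (sketch only; Calculations() and Render() are your own functions, and the frame count is arbitrary):

// Average both sections over many frames with QueryPerformanceCounter
// so one-off scheduler hiccups don't dominate the numbers.
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

double calcTotalMs = 0.0, renderTotalMs = 0.0;
const int kFrames = 1000;

for (int i = 0; i < kFrames; ++i)
{
    QueryPerformanceCounter(&t0);
    Calculations();
    QueryPerformanceCounter(&t1);
    calcTotalMs += 1000.0 * (t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

    QueryPerformanceCounter(&t0);
    Render();    // includes the flip/Present at the end
    QueryPerformanceCounter(&t1);
    renderTotalMs += 1000.0 * (t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
}

// Averages are calcTotalMs / kFrames and renderTotalMs / kFrames.
// At the 0.01-0.2 ms level the difference you saw is down in the noise.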
"Voilà! In view, a humble vaudevillian veteran, cast vicariously as both victim and villain by the vicissitudes of Fate. This visage, no mere veneer of vanity, is a vestige of the vox populi, now vacant, vanished. However, this valorous visitation of a bygone vexation stands vivified, and has vowed to vanquish these venal and virulent vermin vanguarding vice and vouchsafing the violently vicious and voracious violation of volition. The only verdict is vengeance; a vendetta held as a votive, not in vain, for the value and veracity of such shall one day vindicate the vigilant and the virtuous. Verily, this vichyssoise of verbiage veers most verbose, so let me simply add that it's my very good honor to meet you and you may call me V.".....V

