Performance questions with post-processing render-to-texture

7 comments, last by JeffRS 8 years, 7 months ago

Hi all, I'm new to GameDev and new-ish to modern graphics programming so you'll have to forgive any extreme ignorance on my part.

I'm in the process of adding various post-processing effects to a rendering engine I am working on. I am using OpenGL and have set up a couple of framebuffers with multiple texture attachments, which I use to render multiple passes of screen-space effects (DoF, SSAO, bloom, etc.).

Currently I am trying to optimise the post-processing effects as much as possible, and after doing some rudimentary profiling I was surprised to see how much time is taken by what I thought should be very simple and fast tasks, primarily generating mipmaps and up/downscaling images.

Much of what I have read about post processing talks about getting a significant speed increase by processing only a half or quarter size screen image where suitable. That makes sense, but the cost of the downscaling and upscaling passes themselves never seems to be mentioned, and my testing shows a significant cost that often almost cancels out the benefit.

I am aware that it is much faster for the GPU to run ALU instructions than to read and write textures. Is this difference on modern GPUs significant enough to cause the performance I am seeing?

Some examples:

0.595ms to generate mipmaps for one fullscreen texture ( using glGenerateMipmap(GL_TEXTURE_2D) ).

0.268ms to read a fullscreen texture and render it to another texture at 1/4 size.

In contrast, my SSAO takes almost exactly 1ms running at fullscreen: multiple reads per screen pixel of a 32-bit depth buffer, then reading a fullscreen texture, applying the AO result to it, and rendering it out to a new fullscreen texture. Why would it cost almost 1/3 of a fullscreen AO effect just to render a texture at 1/4 size, where the shader is literally one texture fetch and an output? It is just not making sense to me.
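For context, a downscale pass like the one described is just a single fetch and write per destination pixel. A minimal sketch of a roughly equivalent downscale using glBlitFramebuffer instead of a fullscreen draw (the FBO names and dimensions are placeholders, not from my actual code):

```cpp
// Hypothetical downscale via glBlitFramebuffer instead of a fullscreen draw.
// fullResFbo / smallFbo are assumed to be existing FBOs whose colour
// attachments are (srcW x srcH) and (dstW x dstH) respectively.
glBindFramebuffer(GL_READ_FRAMEBUFFER, fullResFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, smallFbo);
glBlitFramebuffer(0, 0, srcW, srcH,                // source rectangle
                  0, 0, dstW, dstH,                // destination rectangle
                  GL_COLOR_BUFFER_BIT, GL_LINEAR); // bilinear filter while scaling
```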

I was hoping to convert my SSAO to run at 1/2 screen size, and I also hoped to convert my bloom from a single 1/4-sized blur to a combination of multiple screen sizes as I have seen in many examples. At this point the cost of writing and reading so many textures seems to be very high; around 3ms in total is being spent not actually processing anything other than rescaling.

I can't help but feel I must be doing something fundamentally wrong. Many papers, books and tutorials talk about using many render passes and many variously scaled images as if it were commonplace and trivial. Many bloom techniques I have read about use 1/2, 1/4, 1/8 and 1/16 size images blurred and combined. I know for a fact that many modern game engines do many post-processing passes and rescales for screen-space effects, and I find it hard to believe they are spending several ms just on reading and writing multiple textures.

Would using Compute shaders to resize textures be faster than fragment shaders?
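For reference, a compute-based resize would be dispatched roughly like the sketch below. The program and texture names are hypothetical, and the compute shader itself is assumed to sample the source and write one destination texel per invocation with an 8x8 local size declared in GLSL.

```cpp
// Minimal sketch of dispatching a hypothetical compute-shader downscale.
// 'downscaleProgram' is assumed to sample 'srcTex' and write one texel of
// 'dstTex' (GL_RGBA8) per invocation, with local_size 8x8 declared in GLSL.
glUseProgram(downscaleProgram);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, srcTex);
glBindImageTexture(0, dstTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8);
glDispatchCompute((dstW + 7) / 8, (dstH + 7) / 8, 1);
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT); // make writes visible to later fetches
```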

What is the typical "budget" of processing time for various common effects? That is to say, assuming many of the post effects are screen-space and have a relatively fixed cost per screen size regardless of the scene, what percentage of a theoretical 16.6ms total rendering budget would be allocated to bloom, DoF, SSAO, etc.?

Those GPU times sound pretty high relative to your SSAO cost. How are you measuring the timings? What GPU are you running on? What resolution are you running at?

The budget allocated to post-processing will vary depending on the game, what kind of effects it's using, what hardware it's running on, etc. For the PS4 game that I worked on, we budgeted around 3.0-3.25ms for post FX, which was close to 10% of a 33.3ms frame. I think around 0.85ms of that was DOF, 0.65-0.75ms or so for lens flares, 0.1-0.2ms for bloom, 0.3-0.5ms for motion blur, and maybe 0.85 ms for the final step which combined tone mapping, exposure, bloom/flare composite, chromatic aberration, film grain, lens distortion, vignette, and color correction. But of course this was after lots and lots of optimization for our target hardware.

The information on post processing times is great. I don't expect to come close to what a commercial studio can achieve in terms of quality and optimisation but it is just good to have something to aim for.

Yes, the SSAO is cheap; it's only 8 samples in a small radius, which is OK for what I am doing. I am testing on an NVIDIA GTX 750, which is a modern but relatively modestly powered GPU. I am aiming for 60fps at 1280x720 and 30fps at 1920x1080.

So I have a lot more questions, but your question about timings made me double-check, and it seems my measurements are not accurate; there is no point going any further until I find a more accurate timing method.

I am using queries with GL_TIMESTAMP. After your question about timing I did some tests and found anywhere from 0.080-0.150ms of delay even when simply making two timestamps one after the other and comparing the difference. This is obviously skewing all my timing results by roughly +0.100ms, which of course explains why the very fast passes seem to be taking twice as long.

I am using glFinish() before making each timestamp; this is required to make sure all previous calls have completed before checking the time. Without it the times are wildly all over the place. However, it seems the glFinish() call itself is at least partly causing problems, as the actual fps drops significantly just from calling glFinish() after each shader. I did some testing, rewrote the timing code in a number of ways, and found I could easily skew the timing by +0.100ms or so depending on when the queries were read back. For example, if I make a timestamp and read it back immediately, then make another timestamp and read that back, the difference is 0.080ms greater than if I make two timestamps in a row and then read both queries back. I can't imagine what could cause two timestamps back to back, with no other code or OpenGL calls in between, to return up to 0.150ms of difference.
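To illustrate, the two read-back patterns I compared look roughly like this (a simplified sketch; qA and qB are GLuint query names from glGenQueries and tA, tB are GLuint64 results, all placeholders):

```cpp
// Pattern A: read each timestamp back immediately. Each GL_QUERY_RESULT read
// stalls the CPU until the GPU has produced that result.
glQueryCounter(qA, GL_TIMESTAMP);
glGetQueryObjectui64v(qA, GL_QUERY_RESULT, &tA);
glQueryCounter(qB, GL_TIMESTAMP);
glGetQueryObjectui64v(qB, GL_QUERY_RESULT, &tB);

// Pattern B: issue both timestamps first, then read both back. Only the reads
// can stall, so the two timestamps land much closer together.
glQueryCounter(qA, GL_TIMESTAMP);
glQueryCounter(qB, GL_TIMESTAMP);
glGetQueryObjectui64v(qA, GL_QUERY_RESULT, &tA);
glGetQueryObjectui64v(qB, GL_QUERY_RESULT, &tB);
```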

I realise I am trying to measure extremely precise timings, but various documentation and anecdotal stories lead me to believe it is possible to get correct timings using this method, which I believe is also similar to the way it is done in DirectX.

Perhaps some experienced people could suggest a reliable way of measuring GPU timings in OpenGL, or point me to a source of information on how to do it correctly.

I am using glFinish() before making each timestamp,

That will absolutely ruin performance and invalidate any real-worldness of the collected data.

The CPU and GPU are, in typical conditions, out of sync by about one whole frame's worth of commands.
By using glFinish to forcibly synchronize them, you're introducing a lot of overhead and uncertainty.

You should be able to issue two timestamp events, and then at least one frame later, read back the two results.
This will give you two timestamps with a lot less overhead/uncertainty. There should be no reason to use glFinish at all.

[edit] Just to be clear, I'm talking about glQueryCounter/glGetQueryObject, which put a command into the GPU's command queue instructing it to record a timestamp, and then let the CPU read that timestamp later.
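A minimal sketch of that pattern (variable names are placeholders):

```cpp
// Once at startup:
GLuint timeQueries[2];
glGenQueries(2, timeQueries);

// During the frame: bracket the work to be measured.
glQueryCounter(timeQueries[0], GL_TIMESTAMP);
// ... draw calls for the pass being measured ...
glQueryCounter(timeQueries[1], GL_TIMESTAMP);

// At least one frame later: read the results back.
GLuint64 start = 0, end = 0;
glGetQueryObjectui64v(timeQueries[0], GL_QUERY_RESULT, &start);
glGetQueryObjectui64v(timeQueries[1], GL_QUERY_RESULT, &end);
double ms = double(end - start) / 1.0e6; // GL timestamps are in nanoseconds
```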

Thank you for the helpful reply. Given your comments about glFinish, I re-read where I found that information, and it seems there was some confusion on my part from not understanding the documentation on OpenGL queries, along with some false information posted on the OpenGL forums about having to sync the CPU and GPU.

I have again rewritten my timing code and have been trying to closely follow the explanations and examples from the books OpenGL SuperBible and OpenGL Insights. I have tried using both GL_TIMESTAMP and glBeginQuery/glEndQuery and I am getting stable and repeatable results from both. I am now issuing queries at the start and end of each shader pass and retrieving those queries 3 frames (3 SwapBuffers) later. There is no longer anything causing the CPU to wait, and the results are stable and repeatable. However, I am still getting around 0.04-0.06ms of time where no OpenGL functions are being called. Both books make the point that the queries are inserted into the pipeline after the previous OpenGL call, but there is no guarantee that the last OpenGL function has 100% completed, due to the parallel nature of the way the pipeline is executed. The small incorrect times I am getting seem to be the result of this.
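For reference, the glBeginQuery/glEndQuery form times a span with GL_TIME_ELAPSED; a minimal sketch (the query name is a placeholder, assumed to come from glGenQueries):

```cpp
// Time one post-processing pass with a GL_TIME_ELAPSED query.
glBeginQuery(GL_TIME_ELAPSED, elapsedQuery);
// ... draw calls for one post-processing pass ...
glEndQuery(GL_TIME_ELAPSED);

// A few frames later:
GLuint64 ns = 0;
glGetQueryObjectui64v(elapsedQuery, GL_QUERY_RESULT, &ns); // elapsed time in nanoseconds
```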

Again, both books mention using glFenceSync/glWaitSync to try to perfectly align the timing queries. I tried an example of this from one of the books, but it only gave me worse results. At this point the inaccuracies are known and seem stable, so I could use the timings I am getting and make a small mental adjustment for the error when I read the results. However, it would be nice if someone could confirm whether there is a way to get more precise timing, or whether the errors I am getting are normal and to be expected.

However, I am still getting around 0.04-0.06ms of time where no OpenGL functions are being called.

If it's any help, I didn't encounter that. On my GTX 750 Ti, executing two timestamps one after the other gives a ~1us difference, which seems to be the precision of the NV driver (the raw difference values from glGetQueryObject fluctuate between 0, 992, 1024 and 1057 nanoseconds). For comparison, timing glClear(depth) on a 1920x1080 framebuffer gives me ~10us (0.01ms). All my draw-call timings correspond to what NVIDIA Nsight shows me, without any noticeable differences.

Yes, that is very helpful. If you can achieve accurate timing with the same video card and OpenGL, then there must still be a problem with my setup. I did spend some time with Nsight to try to compare against my measured timings; some things seemed close and others were very different and made no sense at all.

Could you just confirm for me which method you are using for your queries? Are you inserting GL_TIMESTAMP queries or using glBeginQuery/glEndQuery? Are you doing anything to try to sync the time queries with the OpenGL calls you are trying to measure? Are you retrieving the queries every frame, or waiting a frame or two before reading them back?

I will post some of my code later when I get the chance. Perhaps there is something wrong that other people can notice.

Could you just confirm for me which method you are using for your queries? Are you inserting GL_TIMESTAMP queries or using glBeginQuery/glEndQuery?

I'm using GL_TIMESTAMP before and after the section I want to query.

Are you doing anything to try to sync the time queries with the OpenGL calls you are trying to measure?

Nope, no syncing at all.

Are you retrieving the queries every frame or waiting a frame or 2 before reading them back?

Nothing fancy: for each query I keep a ring buffer of query objects spanning 3 frames. On each frame I issue the query for frame N, then check the availability of the query result for frame N-2 with glGetQueryObject(...GL_QUERY_RESULT_AVAILABLE...) and retrieve it with GL_QUERY_RESULT.
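In code it's roughly this (a simplified sketch, not my exact implementation; all names are placeholders):

```cpp
// Ring buffer of timestamp queries across 3 frames in flight.
const int kFrames = 3;
GLuint ring[kFrames][2];                     // [frame slot][start/end timestamp]
// glGenQueries(kFrames * 2, &ring[0][0]);   // done once at startup

int w = frameNumber % kFrames;               // frame N: issue this frame's queries
glQueryCounter(ring[w][0], GL_TIMESTAMP);
// ... GPU work being measured ...
glQueryCounter(ring[w][1], GL_TIMESTAMP);

if (frameNumber >= 2) {                      // results for frame N-2 may be ready
    int r = (frameNumber + 1) % kFrames;     // slot that was written two frames ago
    GLint available = 0;
    glGetQueryObjectiv(ring[r][1], GL_QUERY_RESULT_AVAILABLE, &available);
    if (available) {
        GLuint64 t0 = 0, t1 = 0;
        glGetQueryObjectui64v(ring[r][0], GL_QUERY_RESULT, &t0);
        glGetQueryObjectui64v(ring[r][1], GL_QUERY_RESULT, &t1);
        double ms = double(t1 - t0) / 1.0e6; // nanoseconds -> milliseconds
    }
}
```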

Just to follow up on this, I actually fixed the problem. Thanks again to everyone who responded, each of you helped in different ways and I appreciate your patience.

This topic is closed to new replies.
