Is it normal for 4x MSAA (with custom resolve) to cost ~3ms vs no MSAA?


Greetings,

I have tone mapping working in my app, and I'd like to add multisampling. So that I can perform tone mapping before the MSAA resolve, I'm doing my own custom resolve by a) binding the multisampled texture and rendering a full-screen quad, and b) using texelFetch to grab the individual samples, tone mapping them, and then blending them.
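Boiled down, the resolve pass I'm describing looks something like this (just a sketch; the names are made up and the Reinhard curve is a stand-in for my actual operator):

#version 150

// Custom resolve over a 4x multisampled HDR color buffer.
uniform sampler2DMS uSceneMS;

out vec4 fragColor;

// Placeholder Reinhard curve standing in for the real tone mapper.
vec3 toneMap(vec3 hdr)
{
    return hdr / (hdr + vec3(1.0));
}

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);
    vec3 sum = vec3(0.0);

    // Tone map each subsample first, then average, so the blend
    // happens on display-referred values instead of linear HDR.
    for (int i = 0; i < 4; ++i)
        sum += toneMap(texelFetch(uSceneMS, coord, i).rgb);

    fragColor = vec4(sum * 0.25, 1.0);
}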

When I do this, it looks great, but it takes a big wet bite out of my render times (about 3ms). I don't think it's the tone mapping operator, because even if I only grab one sample instead of all 4, the performance seems about the same. This is for a 1920x1080 buffer on my GTX 680.

If 3ms is typical, then I'm okay with it, but since it's a good 20% of my frame time at 60Hz, I want to make sure it's not excessive. Obviously, I can't afford to give every effect 3ms. I'm using OpenSceneGraph, so it can be difficult to tell exactly what's going on in the underlying OpenGL, but I'm currently running it through gDEBugger to see what I can find. If you can think of anything I should be looking for, let me know.


My advice would be to do some profiling and measure exactly which calls cost the most time.

Then you might be able to improve parts without compromising too much on the visible end result.

Honestly, I have no idea if 3ms is normal, but it sounds like quite a bit.


In my experience, doing a custom resolve can be significantly slower than the driver-provided resolve. On certain hardware I'm familiar with there are various reasons for this, but I don't think I can share them due to NDA. In general, however, it's safe to assume that the driver can make use of hardware-specific details to accelerate the process, and that it may have to do extra work in order to provide you with the raw subsample data so that you can read it in a shader.

Thanks, guys.

If it legitimately does take ~3ms, is that a price you would pay for doing (say) tone mapping before the resolve?

If your platform offers the ability to configure graphics settings, sure. I think a lot of folks don't mind the *option* to burn GPU horsepower should their hardware afford them the opportunity. For lower-spec systems, post-process-based antialiasing systems could offer a reasonable, low-cost(!) alternative.


I have a little more data. Today, I set it up to run normal (non-explicit) MSAA that just blits the multisampled texture to a regular single-sample texture and then renders that texture to the back buffer, so that I could do a performance comparison. With a scene of perhaps medium-low complexity, simple forward-rendered shaders, and tone mapping, but no other post-processing effects, I get the following render times:

No MSAA: 1.89ms

4x MSAA (standard): 2.67ms

4x MSAA (explicit): 4.69ms

Again, this is at 1080p with a GTX 680.

So, that's about a 0.77ms cost for plain MSAA and a 2.81ms cost for explicit MSAA, which means the explicit resolve is costing me an extra ~2ms. Oddly, this difference goes down slightly (to 1.79ms) if I reduce the scene complexity a bit. This is somewhat alarming, because I don't see how scene complexity can affect the MSAA resolve (I'm blending all four samples regardless of whether the pixel is on an edge or not; maybe the default resolve does something smarter).

So, I don't know. I don't have much of a choice, it seems. If I want to do tone mapping and gamma correction correctly, explicit multisampling seems to be the way to go. The best I can do is profile and make sure that I'm optimizing my app/shader code. *shrug*


You can always just ditch MSAA. Give alternative AA techniques a look; personally, I'm a fan of Crytek's implementation of SMAA: http://www.crytek.com/download/Sousa_Graphics_Gems_CryENGINE3.pdf

It shouldn't take much more than your standard MSAA results on a 680, and it looks about as good, I'd say. Plus, if you're using deferred rendering, you don't have to worry about fiddling with transparencies, as it's essentially all post-processing.

Just to be sure: depending on what you plan to add to your target scene, you might be looking at optimization too early. If you turn on v-sync, anything up to 16.67ms will do fine. If that's the case, add more nice stuff; that's also more energizing than hunting for optimizations :)



Many GPUs make heavy use of compression in order to reduce the bandwidth needed to render to an MSAA render target, and also to resolve it. A naive 4x MSAA implementation would require 4x the bandwidth for writing to a render target, and then NumPixels * (4 + 1) bandwidth for the resolve. GPUs will instead take advantage of the fact that most pixels won't have 4 unique values in them (since they're usually covered by a single triangle), and in that case only "write" 1 pixel value along with some metadata indicating how many subsamples are needed. When it comes time to resolve, the metadata can be used to determine whether multiple subsamples need to be read and blended, which provides a fast path for the majority of pixels.

As your scene complexity goes up, you'll have more pixels where triangles only have partial coverage, which leads to more bandwidth being used during both rendering and resolve. In your particular case, doing a custom resolve may also involve a "decompression" step where the GPU has to ensure that all of the subsamples are actually written to memory so that you can read them in a shader. That decompression step may also utilize the compression to save bandwidth, and so it too will be affected by your scene complexity.
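As a rough sanity check on the numbers (assuming an RGBA16F target, i.e. 8 bytes per pixel): a naive resolve at 1920x1080 moves about 1920 * 1080 * (4 + 1) * 8 ≈ 83 MB per frame. Against the GTX 680's roughly 192 GB/s of memory bandwidth, that works out to something on the order of 0.4ms for the resolve alone, before any decompression pass or shader cost.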

That makes a lot of sense, thank you. It might be worth playing with some edge detection, so that I'm only doing the full blend on edge pixels. I'm not sure that the performance savings will be worth the overhead, but it will be a fun experiment, anyway.
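Something like this, maybe (untested sketch, reusing the uSceneMS uniform and toneMap helper from my snippet above; note that it still fetches all four samples, so it would only save tone-mapping math, not bandwidth):

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // Fetch all four subsamples and note whether any of them differ.
    vec3 s[4];
    bool edge = false;
    for (int i = 0; i < 4; ++i)
    {
        s[i] = texelFetch(uSceneMS, coord, i).rgb;
        edge = edge || any(notEqual(s[i], s[0]));
    }

    vec3 result;
    if (!edge)
    {
        // Interior pixel: all samples agree, so one tone-map evaluation is enough.
        result = toneMap(s[0]);
    }
    else
    {
        // Edge pixel: tone map every sample, then average.
        result = vec3(0.0);
        for (int i = 0; i < 4; ++i)
            result += toneMap(s[i]);
        result *= 0.25;
    }

    fragColor = vec4(result, 1.0);
}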

