How do you resolve MSAA


So, I've taken a deep dive into my MSAA work, and I'm still thinking about how to properly resolve it.

Previously (and for years) I've just looped over N samples and resolved, but naturally this doesn't need to be done everywhere - it just wastes computation on samples which don't need to be resolved. Here is a mask of which pixels actually have differing sample values (and which have identical ones):

Fig. 01 - Which pixels actually have more than 1 different sample in MSAA

While that is still quite a few, in the case of Sponza (Crytek version) we're talking about a fraction of pixels requiring a full resolve. Since I resolve after all processing, in the post-tonemap phase, doing all the phases over N samples is going to be computationally intensive (this may count double when calculating GI per-sample). The user can set the number of MSAA samples (1, 2, 4, 8).
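To illustrate, building such a mask could look roughly like this (an untested sketch - the names and the compile-time SAMPLE_COUNT are placeholders, not my actual code):

```hlsl
// Untested sketch: flag pixels whose MSAA samples differ.
#define SAMPLE_COUNT 4

Texture2DMS<float4> gColorMS; // multisampled shaded color
RWTexture2D<uint>   gMask;    // 1 = samples differ, 0 = all identical

[numthreads(8, 8, 1)]
void BuildMask(uint3 dtid : SV_DispatchThreadID)
{
    // Assumes the dispatch exactly covers the image.
    float4 first = gColorMS.Load(dtid.xy, 0);
    uint differs = 0;
    for (uint s = 1; s < SAMPLE_COUNT; ++s)
        differs |= any(gColorMS.Load(dtid.xy, s) != first) ? 1u : 0u;
    gMask[dtid.xy] = differs;
}
```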

So, what are my options (throw in more if you can think of any):

  1. Do processing per-sample everywhere (the simplest, but wastes a LOT of resources)
  2. Store this mask - and then (a sketch of this option follows the list):
    1. For pixels in the mask, resolve over ALL samples
    2. For pixels outside the mask, resolve only the 1st sample
  3. Store this mask - and then:
    1. Process per-tile (a tile-based approach - but what would be a good tile size for this? 32x32 might be too big, 16x16 might still be too big):
      1. If there is any masked pixel within the tile, resolve ALL samples
      2. If there is no masked pixel within the tile, resolve only the 1st sample
  4. A variant of 3, but apply 2 in case there is any masked pixel within the tile
  5. A variant of 2, but execute masked/unmasked pixels separately as 2 separate dispatches? That would require some additional data though (some packing/unpacking of the image) … which could be quite a nightmare to manage
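For option 2, I mean roughly the following (an untested sketch with placeholder names; the per-sample shading phases that would run before this are omitted):

```hlsl
// Untested sketch of option 2: branch per pixel on the mask.
#define SAMPLE_COUNT 4

Texture2DMS<float4> gColorMS;
Texture2D<uint>     gMask;     // from the mask pass above
RWTexture2D<float4> gResolved;

[numthreads(8, 8, 1)]
void ResolveMasked(uint3 dtid : SV_DispatchThreadID)
{
    if (gMask[dtid.xy] == 0)
    {
        // All samples identical: the 1st sample IS the resolved value.
        gResolved[dtid.xy] = gColorMS.Load(dtid.xy, 0);
    }
    else
    {
        // Differing samples: average all of them (plain box filter).
        float4 sum = 0.0;
        for (uint s = 0; s < SAMPLE_COUNT; ++s)
            sum += gColorMS.Load(dtid.xy, s);
        gResolved[dtid.xy] = sum / SAMPLE_COUNT;
    }
}
```

Note that the branch here is per-pixel, which is exactly the warp-divergence concern I raise for option 2 below.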

The full pipeline is quite heavy on pixel-level processing (it can and will cast cones or rays per-sample/per-pixel).

Now, the problem with 3 is that some tiles are going to have N times more samples to process. Yet it could actually end up being better than 2, because in 2 the whole warp (thread group) executing the current batch of pixels will wait for the slowest one. This is just speculation from ray tracers I've worked with in the past - but could a persistent-threads approach help here? That might be overkill though.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com


Did you consider a compaction step? After that, only indices to unique samples would remain. And their spatial order could still come from tiles (instead of from scan lines over the whole screen), so your cones would remain coherent.
Unfortunately I don't know how long a prefix sum over the whole screen might take, but on a modern GPU I guess it's surely less than a millisecond. I think it would be a win for you.

Though, everyone seems to agree that MSAA isn't practical by modern standards, and TAA is the only way. Why do you do this? VR support?

JoeJ said:
Though, everyone seems to agree that MSAA isn't practical by modern standards, and TAA is the only way. Why do you do this? VR support?

This question was something I wanted to avoid - and no, it is not primarily VR support. My personal opinion on visual quality (and it's going to be highly subjective) is that every single post-processing AA approach tends to blur and lower the overall quality of the image. I don't like the introduced artifacts, problems and/or blurriness - and generally I just prefer No-AA at that point (of course, with high-DPI 4K displays, No-AA isn't nearly as big a problem as it was in the past).

So, this being said - I like to give the user options (but feasible ones!): allow them to use either MSAA and/or TAA (or other MLAA/FXAA/… variants, but those generally tend to be too blurry) or No-AA. I do know there is going to be a performance hit for MSAA; the goal is to minimize that hit.

JoeJ said:
Did you consider a compaction step?

I did - I used something similar for a ray tracer back in the GeForce 7xx generation. It could be quite viable, but I would have to implement and benchmark it. Speaking of which - I could simply store all samples in a large structured buffer and then, during resolve, map them back to the output pixels with weighting factors.
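For that structured buffer, the element layout I have in mind is something like this (just a sketch, not an actual format from the engine):

```hlsl
// Just a sketch of the layout idea, not the actual format.
struct ResolveSample
{
    uint   pixelIndex; // y * width + x of the destination pixel
    float3 color;      // shaded sample value
    float  weight;     // weighting factor for the resolve (e.g. 1/N)
};
StructuredBuffer<ResolveSample> gAllSamples; // filled by the compaction step
```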

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
This question was something I wanted to avoid

We shouldn't, I think. I expected you just don't like TAA. Many people complain about related artifacts and blurring. So I want to know how many, to estimate the size of this problem.
Ideally, there would be statistics about these things. We're limited to personal opinions and assumptions, but have no real data. (Edit: Personally I don't notice ghosting, and I even like the blurring.)

Vilem Otte said:
I did - I used something similar for a ray tracer back in the GeForce 7xx generation. It could be quite viable, but I would have to implement and benchmark it. Speaking of which - I could simply store all samples in a large structured buffer and then, during resolve, map them back to the output pixels with weighting factors.

It's the only real solution to the problem. Your other ideas sound more like damage reduction.
Unfortunately it's some work. But after that, you could tell me the cost of such big reductions, because I've been curious about that for years. :D

Though, a full reduction wouldn't be needed. You could do it only per tile and workgroup, then append the result to global memory with a single atomic. It's pointless to do a global prefix sum over millions of pixels, so I was thinking about this wrongly initially.
Now I'm convinced the cost is pretty negligible. I do such local prefix sums all the time and everywhere; they're fast. And with subgroup optimizations they should be even faster.
You could also bin the pixel normals while doing this to improve tracing coherence as well, eventually.
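Roughly the pattern I mean, as an untested sketch (all names invented; groupshared atomics stand in for a real prefix sum, so ordering within a tile is arbitrary, and WavePrefixCountBits() would be the subgroup-optimized variant):

```hlsl
// Untested sketch: per-tile compaction with one global atomic per tile.
Texture2D<uint>           gMask;          // 1 = pixel needs the full resolve path
RWStructuredBuffer<uint2> gComplexPixels; // global compacted pixel list
RWStructuredBuffer<uint>  gComplexCount;  // single uint, cleared to 0 each frame

groupshared uint gsTileCount; // complex pixels in this 16x16 tile
groupshared uint gsTileBase;  // range reserved for this tile in the global list

[numthreads(16, 16, 1)]
void CompactTile(uint3 dtid : SV_DispatchThreadID, uint gidx : SV_GroupIndex)
{
    if (gidx == 0) gsTileCount = 0;
    GroupMemoryBarrierWithGroupSync();

    bool complex = gMask[dtid.xy] != 0;
    uint localIdx = 0;
    if (complex)
        InterlockedAdd(gsTileCount, 1, localIdx); // local compaction index
    GroupMemoryBarrierWithGroupSync();

    // The single global atomic: one append per tile, not per pixel.
    if (gidx == 0)
    {
        uint base;
        InterlockedAdd(gComplexCount[0], gsTileCount, base);
        gsTileBase = base;
    }
    GroupMemoryBarrierWithGroupSync();

    if (complex)
        gComplexPixels[gsTileBase + localIdx] = dtid.xy;
}
```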

JoeJ said:
It's the only real solution to the problem. Your other ideas sound more like damage reduction.

That's exactly what I thought. I've tried just looping from 0 to N in a full-screen pixel shader - and while that may help in some scenarios, mostly it doesn't (as even a single pixel having multiple samples in a warp will reduce the performance of the whole warp).

JoeJ said:
Though, a full reduction wouldn't be needed.

That's a fair point - if I split the image into TileSize*TileSize tiles, I can determine whether a tile has exactly 1 sample per pixel or more. In the case of just 1 sample per pixel I don't need to do any reduction; I just process per-pixel without digging into samples at all. Therefore only tiles that have multiple samples per pixel need to be processed in a complex way. This being said, I could simply run a reduction of TileSize*TileSize*N samples to build the buffer of samples. One advantage could be that a reduction at a smaller scale might be faster… another is that when the user disables MSAA, I just run tiled processing in a single way (no changes).

So… thinking about where to begin (I'm implementing this in a toy app first, before integrating it into the main editor and runtime - just to be more flexible). The next step is to switch the pipeline to do this (this is more of a 'thinking out loud' part):

  1. Render multiple multisampled textures as it does now
  2. Clear 2 buffers containing the tile structure
  3. Fill the tile buffers in a compute shader using the outputs from 1 (the first buffer holding tiles that have 1 sample per pixel, the second holding tiles that have at least one pixel with >1 sample). This will require atomics when pushing to the tile buffers, I guess - thereby also counting the number of tiles in each buffer
  4. For each tile in the second buffer:
    1. Create a compact buffer that will hold the samples, each with its pixel coordinate/index; at this point I will also know the total number of samples in the tile
    2. Run processing on the samples (I could pre-multiply them by the weighting factor here, so during resolve I could just sum the values)
  5. For each tile in the first buffer:
    1. Run processing on the pixels (effectively using only the 1st sample)
  6. Draw the first buffer and just display
  7. Draw the second buffer, which will have to include the resolve

If I read this correctly, I will need a barrier after 2, 3, 4.1 and 5. Edit: Thinking about it, I could just draw/dispatch the tiles indirectly for the first and second buffer - that way I won't need to read back the counters from 3 at all. I also won't need to read back the counters in 4.1. So I can still collect statistics (which I will probably want, to be sure I didn't make any mistake), but I can read those out at any point, ideally after rendering for the frame has finished. A rough sketch of step 3 with the indirect arguments follows.
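This is untested and all names are placeholders; it assumes one 16x16 thread group per tile, and that the y/z components of both indirect argument triples are pre-initialized to 1 (both x components to 0):

```hlsl
// Untested sketch of step 3: classify tiles and build indirect args.
cbuffer TileConstants { uint gNumTilesX; };

Texture2D<uint>          gMask;          // per-pixel "samples differ" mask
RWStructuredBuffer<uint> gSimpleTiles;   // tiles with 1 sample per pixel
RWStructuredBuffer<uint> gComplexTiles;  // tiles with any multi-sample pixel
RWByteAddressBuffer      gIndirectArgs;  // bytes 0-11 simple, 12-23 complex

groupshared uint gsAnyComplex;

[numthreads(16, 16, 1)]
void ClassifyTiles(uint3 dtid : SV_DispatchThreadID, uint3 gid : SV_GroupID,
                   uint gidx : SV_GroupIndex)
{
    if (gidx == 0) gsAnyComplex = 0;
    GroupMemoryBarrierWithGroupSync();

    if (gMask[dtid.xy] != 0)
        InterlockedOr(gsAnyComplex, 1);
    GroupMemoryBarrierWithGroupSync();

    // One thread per tile pushes the tile index and bumps the matching
    // indirect group count - no CPU readback of the counters needed.
    if (gidx == 0)
    {
        uint tileId = gid.y * gNumTilesX + gid.x;
        uint slot;
        if (gsAnyComplex != 0)
        {
            gIndirectArgs.InterlockedAdd(12, 1, slot);
            gComplexTiles[slot] = tileId;
        }
        else
        {
            gIndirectArgs.InterlockedAdd(0, 1, slot);
            gSimpleTiles[slot] = tileId;
        }
    }
}
```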

Now, the overhead might be quite high, and with no extensive processing in the toy app the difference in timing might look scary. So I may need to add some heavy processing to see whether it paid off or not.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

I didn't get the section about the barriers, but it sounds good to me otherwise. Just to confirm.

Btw, with such an approach MSAA+deferred just works? That's interesting.

Not sure if having a separate path for all-1-sample tiles is worth it. It depends on geometry and scene; it helps only with the best case, not with the worst.

Did you try doing a cone trace only for the sample with the highest weight and using the result for all samples? It probably brings back aliasing, but maybe not that badly?

JoeJ said:
Did you try doing a cone trace only for the sample with the highest weight and using the result for all samples? It probably brings back aliasing, but maybe not that badly?

No. While at locations with little to no GI (under direct light) you do get antialiasing (the GI contribution is aliased, but it is not that visible due to the low contribution), when you look at an area with mostly indirect contribution (see images), the difference is visible:

Fig. 01 - 4x AA on direct light processing, no AA on GI processing
Fig. 02 - 4x AA on both - direct light processing and GI processing

I tried to capture about the same spot. I still think the voxel resolution also hurts the quality (I'm still not happy with it; I've been considering abandoning it for ages. I'm still thinking where to go with that one. I've got about half a dozen (or more) GI implementations in toy projects, and I'm not happy with any single one of them). If you look at the edges in the first image, aliasing is clearly visible (it uses 4x for the G-Buffer and direct lighting, 1x for VXGI). In the second image, where the edges are nice and smooth, it uses 4x for the G-Buffer and direct lighting and 4x for VXGI. All resolving is done in the post-tonemapping step - in both cases. The fps will be lower, as it uses a debug build of the whole engine (for an obvious reason - I just switched the startup project to the editor to be able to show it in an easy way).

I'm still messing with the MSAA approach - I got way too busy in the past few days (Christmas incoming) to get through it, but I think I'm on a good track. Still no idea how performant it will be.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

Vilem Otte said:
I'm still thinking where to go with that one. I've got about half a dozen (or more) GI implementations in toy projects, and I'm not happy with any single one of them.

The industry-wide struggle with GI should come to my advantage. That's the plan at least… :)

The problem I see with voxel or volume probe grid approaches is the missing accuracy about the surface. For high quality we need probes at the surface, and a good approximation of the surface.
If you lack one of these, the solution is eventually only good enough for indirect bounces.
Now that we have HW RT, you could do what A4 did: trace the first bounce for accuracy, but replace the concept of ray paths with probe lookups. That's not bad for now, I think, mainly because you can keep the RT part optional on low end.

But your images don't look bad. You get bounce light, which is the primary goal, but that's where VCT ends. It misses the beauty of subtle gradients and some details, but there is no way to get this from coarse voxels alone.
Screen-space GI could surely help with this, but it isn't cheap either, of course. In the above image I would criticize mainly a feel of 'over-darkening AO'. E.g. there is a dark soft shadow under the vase. Because the base of the vase is so bright, I would expect more of it bouncing to the floor. SSGI could address this, and because it models diffuse effects, typical screen-space artifacts are not as harsh as with SSAO or SSR.

Regarding the journey of trying many GI solutions, there were 2 of them which still stick with me:

Michael Bunnell's anti-radiosity idea (partially discussed in GPU Gems 2, but there is a presentation around discussing the GI application). This gives very high quality, beautiful results, and is still much faster than any recent solutions. The problems caused by the visibility approximation are still unsolved, though. His proposal was to divide interior scenes into rooms, preventing leaking from one room to the other. But this sucks. I tried ray tracing a low-poly occluder scene representation to cull occluded surfels, but this did not mix nicely with the otherwise beautiful results, and this was also where I learned that RT is really slow.
Still, the anti-radiosity idea is remarkable. Maybe there is a way…

Crytek's idea to replace expensive gathering with cheap diffusion/propagation. This also avoids the need to calculate the expensive visibility term. But voxelized blockers are just not a good scene representation, and SH2 can't represent light of different colors flowing in opposing directions. So that became a blurry mess for me. It seems limited to modeling volumetric lighting.

It feels like there was promising progress back around 2010, but no real solution was found. Then no progress at all until recently, when HW RT came up. SDFs became popular too.
But looking at Lumen or Portal RTX, I feel like they are still groping in the dark. Both are way too expensive.

So, I finally had a bit of time to take a look at it today. I have a bug in at least one thing (see the bottom-most row of tiles - they miss about half the tile for ⅔ of the image, but I can figure that out… ImGui docking/resizing makes a bit of a mess in there).

This is how the average Sponza view looks with 16x16 tiles - this one doesn't use the sample buffers yet, but does the hard resolve. Each red tile must be resolved, while a standard one doesn't need to be. With more complex geometry (like the lion's head) there will be no tiles without the requirement for sample buffers.

Now it's time to start digging into the actual sample buffers.

EDIT: Indeed, my resize didn't take place when I docked/undocked the window; overall, for this purpose it doesn't make a difference. Still, anything with reasonable geometry requires sample buffers. I'm about mid-way through them, but might have results soon. Then I'll need to attach a GPU profiler and measure how much it really costs.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

The errors in this one are legendary.

This one is definitely correct. Except for not correctly initializing the groupshared counters which I used for indexing.

Anyway, I think I've got it. The resolve is a pain though - I think it could be optimized better. The largest part of the pain is the tiles that need to be resolved. I'll clean up the code and share it here - plus give it another bit of my brain power to analyze how it could be optimized. This is the first time I've managed to do it successfully.

Antialiased Sponza… that was the goal.

Now, in detail - the next image shows:

  • Standard-colored tiles are resolved by simple pass-through (no resolve)
  • Colored tiles contain 2 different pixel statuses:
    • Green pixels are not resolved; just the first sample is always used
    • Red pixels are resolved - they require atomic (weighted) addition of multiple samples to the resulting value (a sketch of this follows below)
This pretty much sums up which tiles are the least expensive and which ones are the most expensive.
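As a sketch of that atomic weighted addition (untested, placeholder names, reusing the ResolveSample layout I sketched earlier): plain HLSL only guarantees integer atomics, so this stand-in accumulates in 16.16 fixed point, which is fine post-tonemap where values are roughly in [0,1]. A final pass would divide by 65536.0 and write the float color out.

```hlsl
// Untested sketch: one thread per (pixel, sample) entry from the
// compacted sample buffer, accumulating weighted color via integer atomics.
struct ResolveSample { uint pixelIndex; float3 color; float weight; };

cbuffer ResolveConstants { uint gSampleEntryCount; };

StructuredBuffer<ResolveSample> gSamples; // compacted (pixel, sample) entries
RWByteAddressBuffer gAccum;               // 3 uints (RGB) per pixel, cleared to 0

[numthreads(64, 1, 1)]
void ResolveRedPixels(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= gSampleEntryCount)
        return;

    ResolveSample s = gSamples[dtid.x];
    uint base = s.pixelIndex * 12; // 3 channels * 4 bytes

    // Weighted contribution in 16.16 fixed point.
    uint3 c = uint3(saturate(s.color) * s.weight * 65536.0 + 0.5);
    uint prev;
    gAccum.InterlockedAdd(base + 0, c.r, prev);
    gAccum.InterlockedAdd(base + 4, c.g, prev);
    gAccum.InterlockedAdd(base + 8, c.b, prev);
}
```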

So, I decided to attach the profiler and do a proper release build - and this is the result:

This one is with 8x MSAA; resolve and reduction tend to be sub-1 ms… on a 4K display, with a Radeon RX 6800.

This being said - I believe it can be improved further (many of the edges seem redundant - especially the ones inside surfaces where UV is continuous, like inside the cloth parts, or on the undersides of the arches). Reducing the number of pixels to resolve would improve this further.

It can also be further optimized, as ALL tiles are currently reduced - this is the point I'm a bit skeptical about: vegetation (I'm doing alpha-to-coverage, which gives me proper transparency "for free" in this case) and complex meshes work against this.

EDIT: This being said - I believe it is still going to be worth it. The only downside I can think of is some post-processing effects that require lookups into neighboring pixels… although I think I might have a solution for those.

My current blog on programming, linux and stuff - http://gameprogrammerdiary.blogspot.com

