MSAA issues


dongsaturn

Hi guys,

 

I have read some articles [A Quick Overview of MSAA] [Rasterization Rules] about MSAA and rasterization, and tried to understand how MSAA works.

The articles are great and helpful, but I still have a small question. In [Rasterization Rules] they say each pixel is shaded only once, at the pixel center, with the interpolated attributes:

 

For a triangle, a coverage test is performed for each sample location (not for a pixel center). If more than one sample location is covered, a pixel shader runs once with attributes interpolated at the pixel center. The result is stored (replicated) for each covered sample location in the pixel that passes the depth/stencil test.

 

 

Also mentioned in the [A Quick Overview of MSAA]:

 

Instead, the pixel shader is executed only once for each pixel where the triangle covers at least one subsample. Or in other words, it is executed once for each pixel where the coverage mask is non-zero.

 

Please consider this situation: with 4xMSAA, in the rasterization stage we have 2 triangles that each cover 2 sub-pixel samples (which pass the depth/stencil test) within the same pixel.

 

Are the pixel-center attributes interpolated from the 4 covered sub-samples, or are they the same as in non-MSAA mode (interpolated from the triangle position)?

Are only the interpolated pixel-center attributes output to the pixel shader in per-pixel frequency shading mode?

Are the covered sub-samples output to the pixel shader instead of the pixel center in per-sample frequency shading mode (DX10.1)?

 

Update:

Another question: I found the MSAA implementation in the Deferred Shading example is a little different. Instead of shading only once at the pixel center with interpolated attributes, they use an edge detection algorithm to decide whether to shade per sub-pixel sample or per pixel.

Why not just use the conventional approach in deferred shading? It's faster.

 

Thank you very much.

backstep
Hey, since nobody seems to be touching this, and I also had these questions about the deferred shading sample, I'll try to explain my understanding (again). Apologies in advance for the quality of the explanation.

First of all, given your example of forward rendering with 4xMSAA, where a single pixel has two triangles each covering 2 of the 4 sub-samples...
 

Are the pixel-center attributes interpolated from the 4 covered sub-samples, or are they the same as in non-MSAA mode (interpolated from the triangle position)?


I believe by default with MSAA it's the pixel center attributes used in the pixel shader (same as non-MSAA), regardless of which sub samples were covered. So with your example pixel, the pixel was shaded twice, once for each triangle with partial coverage of the pixel. Both pixel shaders ran as if they covered the whole pixel the same as in non-MSAA mode. The result of each pixel shader was written to the pair of the 4 subsamples that each triangle covered.

In the Rasterization Rules link you posted, there is mention of centroid sampling, which will instead use attributes interpolated closer to the covered samples rather than always using the pixel center. I think that to use it you need to specify the interpolation method in the pixel shader input struct (for each member), as detailed here - http://msdn.microsoft.com/en-us/library/windows/desktop/bb509668(v=vs.85).aspx .
 

Are only the interpolated pixel-center attributes output to the pixel shader in per-pixel frequency shading mode?


This seems like a similar question to the first one? I'm sorry if I misunderstand it. The pixel shader input is interpolated from the pixel center attributes by default, unless you specify a different method of interpolation in the pixel shader input struct.
 

Are the covered sub-samples output to the pixel shader instead of the pixel center in per-sample frequency shading mode (DX10.1)?


I assume you mean the "sample" interpolation modifier available to shader model 4.1? So far as I understand, this is equivalent to super sampling, in that the pixel shader will run for every sub-sample that is covered by a triangle, and yes each sub-sample is shaded using the interpolated sub-sample position rather than the pixel center, but only for those attributes (PS input struct members) you specifically declare with the sample modifier. Once you declare a single input struct member with sample modifier, all shaded sub-samples will invoke the pixel shader, but those input members without a sample modifier will be interpolated to the pixel center for all sub-samples of that pixel.
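For illustration, here's a minimal sketch of how those interpolation modifiers look on a pixel shader input struct (the member names are just illustrative, not from any particular sample):

struct PSInput
{
    float4 position       : SV_Position;
    float3 normal         : NORMAL;     // default: interpolated at the pixel center
    centroid float2 uv    : TEXCOORD0;  // pulled inside the covered samples on partially covered pixels
    sample float3 viewDir : TEXCOORD1;  // SM 4.1+: interpolated per sample, forces per-sample PS invocation
};

As described above, declaring even one member with the sample modifier is enough to make the shader run once per covered sample; the members without it still interpolate to the pixel center.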
 

Another question: I found the MSAA implementation in the Deferred Shading example is a little different. Instead of shading only once at the pixel center with interpolated attributes, they use an edge detection algorithm to decide whether to shade per sub-pixel sample or per pixel.
Why not just use the conventional approach in deferred shading? It's faster.


If you think about how conventional MSAA works during rasterization, it can share a single shading result amongst multiple sub-samples because it knows which triangle covers which sub-samples, and so, which sub-samples can share per-pixel attributes and share pixel shading cost. It has access to per-pixel attributes via interpolation of the vertex attributes.

Compare that to a deferred renderer, which only really knows about the post-rasterization attributes of each sub-sample, and has no access to per-pixel attributes. It doesn't know which sub-samples were covered by which triangle, and therefore doesn't know which sub-samples can share the same pixel attributes. Basically the deferred renderer has no knowledge of coverage or access to pixel center attributes.

It's been about a month since I looked at the tiled deferred example, but I think the edge detection looks for large discontinuities within each pixel's sub-sample attributes (normals and depth?) to try to guess whether they originate from the same triangle/surface. When no edge is detected it just splats shading for the first sample of the pixel in the gbuffer to all samples for the pixel in the output buffer.
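As a rough sketch of that edge test (this is just my guess at its shape, not the sample's exact code; the SurfaceData layout and the thresholds are assumptions):

// Compare each sub-sample's gbuffer attributes against sample 0. A large
// discontinuity in view depth or normal suggests the samples straddle an edge.
bool RequiresPerSampleShading(SurfaceData surface[MSAA_SAMPLES])
{
    const float maxZDelta    = 0.01f;  // assumed thresholds
    const float minNormalDot = 0.99f;

    bool perSample = false;
    [unroll] for (uint i = 1; i < MSAA_SAMPLES; ++i)
    {
        if (abs(surface[i].viewZ - surface[0].viewZ) > maxZDelta ||
            dot(surface[i].normal, surface[0].normal) < minNormalDot)
        {
            perSample = true;
        }
    }
    return perSample;
}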

Let's repeat your forward 4xMSAA example where two triangles each cover 2 of the 4 sub-samples for a pixel, but this time in the deferred renderer with a 4xMSAA gbuffer. It detects at least one of the sub-samples has sufficiently different attributes from another sub-sample, within the same pixel, and flags the pixel for per-sample shading. It immediately shades sub-sample 0 for the pixel and then later shades sub-samples 1-3 for the pixel that was flagged, right?

So for the same edge pixel, deferred 4xMSAA shades four times, once for each sub-sample, where forward 4xMSAA only had to shade twice to write to all four samples. But how would you have the deferred renderer detect when sub-samples belong to the same surface? And how would you interpolate the attributes for a similar effect to per-pixel shading shared between sub-samples? (since deferred cannot access the pixel center attributes, only the available sub-sample attributes in the gbuffer)

My naive guess is you'd first need to bucket sub-samples during edge detection, so in our example samples 0 and 1 belong to triangle one, while samples 2 and 3 look to belong to triangle two. For 4xMSAA your edge detection function would need to return the equivalent of a 4x4 matrix of bits to show which triangle/surface each sub-sample belongs to (perhaps returned as a packed int), as in the sketch below. Then you'd have to interpolate the gbuffer attributes of sub-samples that share the same detected surface/triangle, then shade once for each detected surface/triangle, then write each shading result to the relevant pair of output sub-samples.
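Something like this, maybe (a sketch only; SameSurface is a hypothetical depth/normal comparison, and 2 bits of surface ID per sample is enough for 4xMSAA):

// Classify each sub-sample against the earlier samples in the pixel; samples
// that match share a surface ID and could later share one shading result.
uint BucketSamples(SurfaceData s[4])
{
    uint packedIds   = 0;  // bits [2i+1:2i] hold the surface ID of sample i
    uint numSurfaces = 1;  // sample 0 always defines surface 0
    [unroll] for (uint i = 1; i < 4; ++i)
    {
        uint id = numSurfaces;  // assume a new surface until a match is found
        [unroll] for (uint j = 0; j < i; ++j)
        {
            if (id == numSurfaces && SameSurface(s[i], s[j]))
                id = (packedIds >> (2 * j)) & 0x3;  // reuse the matched sample's ID
        }
        if (id == numSurfaces)
            ++numSurfaces;
        packedIds |= id << (2 * i);
    }
    return packedIds;
}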

So compared to the default, you'd be doing the same number of gbuffer reads and the same number of output sample writes, plus extra work bucketing the samples to surfaces and then interpolating gbuffer attributes per-surface, but less work shading each sample (shading once per-surface rather than once per-sample). I would suppose that how much faster it is depends upon whether the sample shading or the gbuffer reads are the limiting factor in deferred MSAA.

Unless you have a really expensive lighting shader (BRDF), I'm not certain you'd gain a whole lot from attempting to reproduce conventional MSAA's shared-shading-cost of sub-samples. I'm also not certain whether it would affect AA quality in a noticeable (bad) way, since the pixel resolve would be working with less information than before.

You could gain more if you skip interpolating the gbuffer attributes and just use the values belonging to the first sub-sample for each detected surface within a pixel. That would reduce your gbuffer reads and the computation cost. However, I expect the skipped gbuffer interpolation really would degrade the AA quality.

I should also point out I'm not that far from a beginner myself, and there might be other effects related to memory latency and access patterns, that're beyond my understanding. I also might have missed something important.

Also, I might just be wrong! It could turn out to be way faster and just as good quality as per-sample shading; perhaps one of us should try to implement it and find out for certain. :)

dongsaturn

Thanks for your reply. I really appreciate it.
 

I believe by default with MSAA it's the pixel center attributes used in the pixel shader (same as non-MSAA), regardless of which sub samples were covered. So with your example pixel, the pixel was shaded twice, once for each triangle with partial coverage of the pixel. Both pixel shaders ran as if they covered the whole pixel the same as in non-MSAA mode. The result of each pixel shader was written to the pair of the 4 subsamples that each triangle covered.

 

It seems reasonable.

 

Compare that to a deferred renderer, which only really knows about the post-rasterization attributes of each sub-sample, and has no access to per-pixel attributes. It doesn't know which sub-samples were covered by which triangle, and therefore doesn't know which sub-samples can share the same pixel attributes. Basically the deferred renderer has no knowledge of coverage or access to pixel center attributes.

 

SV_Coverage: indicates to the pixel shader which sub-samples were covered during the raster stage.
We can use SV_Coverage to output the coverage info to the GBuffer. But in the CryEngine3 Graphics Gems presentation they suggest: avoid the default SV_COVERAGE, since it results in redundant processing on regions not requiring MSAA!
And it seems that they used a similar approach to the Deferred Shading example to create the coverage mask, but they did not output any extra info (like the positionZGrad) in the GBuffer!
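For reference, writing it out is just a matter of taking SV_Coverage as a pixel shader input (input requires SM 5.0; SM 4.1 only exposes it as an output) and storing it in a spare GBuffer channel. A minimal sketch, with an assumed uint-format render target and made-up struct members:

struct GBufferOutput
{
    float4 normalSpecular : SV_Target0;
    uint   coverage       : SV_Target1;  // assumes a spare uint-format target
};

GBufferOutput GBufferPS(PSInput input, uint coverage : SV_Coverage)
{
    GBufferOutput output;
    output.normalSpecular = float4(normalize(input.normal), input.specularPower);
    output.coverage       = coverage;  // e.g. 0xF at 4xMSAA when all samples are covered
    return output;
}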

 

My naive guess is you'd first need to bucket sub-samples during edge detection, so in our example samples 0 and 1 belong to triangle one, while samples 2 and 3 look to belong to triangle two. For 4xMSAA your edge detection function would need to return the equivalent of a 4x4 matrix of bits to show which triangle/surface each sub-sample belongs to (perhaps returned as a packed int). Then you'd have to interpolate the gbuffer attributes of sub-samples that share the same detected surface/triangle, then shade once for each detected surface/triangle, then write each shading result to the relevant pair of output sub-samples.

 

I realize that in the Deferred Shading example, they defer the scheduling of per-sample-shaded pixels until after sample 0 has been shaded across the whole tile. This allows better SIMD packing and scheduling. I did some tests, and it's much faster than shading the other sub-samples immediately after sample 0.

So if we have to use the interpolated attributes of sub-samples covered by the same surface/triangle, we need to output them to a temporary global buffer first to be able to defer the per-sample shading! That seems to be a big problem, and could be slower.
backstep
Hey, thanks for the discussion too, I still have plenty to learn. :)

I wasn't aware you could use SV_Coverage reliably for opaque geometry. I've only come across it for alpha tested surfaces, and I've not done a whole lot with those yet. So thanks for the information.

If I remember correctly, I think there's a #define in one of the hlsl files for the tiled deferred example that allows you to immediately shade the other sub-samples with sample 0, rather than deferring them to shade in a second pass. I think it was called DEFER_PER_SAMPLE and you can just set it to 0. Like you said, I remember disabling it and finding performance dropped.

This weekend I'll try to implement your idea with the original intel example app, deferring the sub-sample interpolation to the per-sample pass (after all sample 0's complete) and then shading once per detected surface. I'll post back about how it works out.

Also, since you're working with compute shaders, I thought you might not know that the new Visual Studio 2013 that was released yesterday has excellent new support for compute shader debugging. You can select a threadgroup and a thread, and then step through the shader execution just as you would when debugging a pixel shader. Previously I was using nSight and debugging on the GPU, and as useful as that was, the new VS2013 debugging is far more usable and responsive.

dongsaturn


Also, since you're working with compute shaders, I thought you might not know that the new Visual Studio 2013 that was released yesterday has excellent new support for compute shader debugging. You can select a threadgroup and a thread, and then step through the shader execution just as you would when debugging a pixel shader. Previously I was using nSight and debugging on the GPU, and as useful as that was, the new VS2013 debugging is far more usable and responsive.

 

Thanks for the reminder, I will give it a try. :D

backstep
I spent this afternoon looking at that intel tiled deferred example again, and seeing if it was possible to implement your idea. The good news is it was workable, and it only took editing a handful of files. I'll attach them to this post in a zip file, you can just drop them into the project folder and rebuild the project to try it. I commented the code pretty heavily so it should make sense to read.

The bad news is I couldn't get it any faster than the default method of shading all sub-samples for edge pixels (it was only ~2-4% slower though). There's a couple of issues that I'm guessing are the reason sharing the shading output between samples isn't faster, but I'll explain how I implemented it before digging into that.

For reference, I always ran it at 1920x1080 with 4x MSAA, 1024 lights.

The first thing was associating sub-samples with each other, or rather associating sub-samples that share the same surface/triangle. Usually sub-samples are only tested against sample 0, and if any one of them doesn't appear to be from the same surface as sample 0, the pixel is added to an array of unsigned ints in thread-group-shared memory. When that array is processed in a second pass by the threadgroup, all 3 remaining sub-samples in an edge pixel get shaded, each sample by a different thread. The TGSM uint array items just hold the viewport coordinates of the pixel packed as two 16-bit uints.
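For anyone following along, the rough shape of that two-pass scheduling looks like this (heavily simplified from the sample's actual code; PackCoords, UnpackCoords, and ShadeAndWriteSample are stand-in names, and groupIndex is assumed to be SV_GroupIndex):

groupshared uint sNumPerSamplePixels;  // zeroed by one thread before this runs
groupshared uint sPerSamplePixels[COMPUTE_SHADER_TILE_GROUP_SIZE];

// Pass 1: each thread shades sample 0 of its pixel, and appends edge pixels
// to the shared list for the second pass.
if (RequiresPerSampleShading(surfaceSamples))
{
    uint listIndex;
    InterlockedAdd(sNumPerSamplePixels, 1, listIndex);
    sPerSamplePixels[listIndex] = PackCoords(pixelCoords);  // 2x 16-bit coords
}
GroupMemoryBarrierWithGroupSync();

// Pass 2: the whole threadgroup works through the list, one sub-sample per
// thread, so the expensive per-sample work stays densely packed across warps.
uint workCount = sNumPerSamplePixels * (MSAA_SAMPLES - 1);
for (uint item = groupIndex; item < workCount; item += COMPUTE_SHADER_TILE_GROUP_SIZE)
{
    uint2 coords     = UnpackCoords(sPerSamplePixels[item / (MSAA_SAMPLES - 1)]);
    uint sampleIndex = 1 + item % (MSAA_SAMPLES - 1);
    ShadeAndWriteSample(coords, sampleIndex);
}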

It doesn't really need 16 bits per axis for pixel coords, so I cut that down to 12 bits per axis, still leaving a max resolution of 4k by 4k (computeshadertile.hlsl line:46). Anyhow, that opens up 8 remaining bits to pack the sub-sample associations into. The sub-sample test against sample 0 (the function is called RequiresPerSampleShading) usually returns a bool to signify whether any sub-samples didn't match sample0's surface; instead I rewrote it (poorly) to pack the sub-sample association into the first 9 bits of a uint (see gbuffer.hlsl line:120). The packing is a bit branchy but I tried to keep it to a minimum. It's all commented so I won't explain it here. I will paste in how the bit packing is laid out:
 
// perSampleField's first 9 bits describe which samples share a surface and so can be shaded once but written to multiple output samples.
// The layout is as follows:
//
//   shaded sample:      sample0    sample1    sample2    sample3
//   samples to write:   1,2,3      1,2,3      2,3        3
//   bits:               0,1,2      3,4,5      6,7        8
//
// Sample0 is always written from sample0's shading, so no 4th bit is needed for sample0. If sample1 is a unique surface, its shading will
// always be written for sample1, and could be written for samples 2 and 3. Similarly, if sample2 is a unique surface from samples 0 and 1,
// then sample2 will always use its own shading, and only sample3 might also use it. If sample3 is a unique surface from all the other samples,
// then only sample3 can possibly use sample3's shading result (since for sample3 to be unique, all previous samples must also be unique).
//
// Here is an example bitfield, where samples 0 and 2 are the same surface, but samples 1 and 3 are unique surfaces:
//
//   bits:     0   1   2   3   4   5   6   7   8
//   values:   0   1   0   1   0   0   0   0   1
//
// So when sample0 is shaded, its result is written to samples 0 and 2. When sample1 is shaded it's only written to sample1, and when sample3
// is shaded it's only written to sample3.

The actual cost of sorting the samples and packing them into the bitfield isn't that bad. It only pushed the frametime from 8.18ms to 8.26ms.

So that's 9 bits of data and 8 bits of free space. Luckily sample0 is always shaded and written in the first pass by the threadgroup, so once perSampleField is returned, sample0 is shaded and written to output like normal, except any other samples associated with its surface are now also written too. That is, any samples flagged in bits 0 to 2.

That means the three bits 0-2 are no longer needed, so the bitfield can be shifted once to the right. Now it's only 8 bits long, with two unused bits at the beginning. There is a reason to keep those first 2 bits available. Normally when sub-samples need to be shaded in the threadgroup's second pass, you can just add the pixel to the uint array in TGSM that acts as a list, because all three remaining samples need to be shaded. However, now we only want to shade the unique samples within a pixel listed for sub-sample shading. Each pixel that requires per-sample shading may now have only one or two unique samples, and that poses a scheduling problem. You can't split each pixel between every 3 threads like before, since you'd end up with idle threads when only one or two samples out of every 3 from a pixel are unique and need shading.

My solution was to use those empty first 2 bits of the sample mask bitfield to flag the sample to be shaded, and so add individual samples to the TGSM array, rather than entire pixels. So for each sample in a pixel that actually needs shading, the pixel coords and the sample mask are added to the TGSM array, along with a flag to say which sample to shade. For example, to submit a pixel with the sample bits set for sample1 (shared with sample2) and sample3, it's added to the TGSM array twice, once for each unique sample that needs shading. The first two bits say which sample to shade, and the next 6 bits are the sample mask that describes which samples need to be written with which sample's shading result. In this example the pixel is added with sample1 flagged for shading and masked for writing out to samples 1 and 2, and then the pixel is added again with sample3 flagged for shading and masked for writing to sample3. The only difference between samples submitted from the same pixel is the first two bits, while the sample mask and the pixel coords are the same.
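In other words, each TGSM entry ends up packed something like this (the exact bit positions here are my own illustration; only the field sizes match what I described above):

// [1:0]   index of the sample this entry will shade (1, 2, or 3)
// [7:2]   mask of samples that should receive the shading result
// [19:8]  pixel x coordinate (12 bits)
// [31:20] pixel y coordinate (12 bits)
uint PackSampleWorkItem(uint2 coords, uint sampleToShade, uint writeMask)
{
    return  (sampleToShade & 0x3)
          | ((writeMask    & 0x3F)  << 2)
          | ((coords.x     & 0xFFF) << 8)
          | ((coords.y     & 0xFFF) << 20);
}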

This solves a problem in that no threads will be given samples that don't require shading, since the list now only contains unique samples we want shaded. However, it also causes a problem since you end up writing and reading from TGSM a bit more. On a couple of occasions while I was testing, even 4xMSAA overran the 256-item array length on a tile/threadgroup. I added a check to avoid it in the future, and there's no visible degradation from it, but it is technically a bit of a hack.

You can also just extend the TGSM array; I found doubling it to 512 samples added roughly 0.20ms to the frametime. Since the hacky fix had no noticeable drawback at 4xMSAA, I left that in the uploaded files.

So bits 0 and 1 are just a packed uint of the sample to be shaded (1, 2, or 3), while bits 2-7 describe the other samples' associations to the sample being shaded, so that once it's shaded they can also be written to output along with it. The details on adding samples instead of pixels to the TGSM array are in ComputeTileShader.hlsl around line 220 onward. Similarly, the details on shading the sample once and writing that to multiple output samples are in the same file around line 324. Likewise it's all commented so I won't explain the details here; it's just a little bit of bit shifting and setting/clearing, and using intrinsic functions like firstbithigh and firstbitlow to parse the values out. You can toggle between shading unique-samples or the original all-samples-per-pixel using a new #define in ShaderDefines.h called UNIQUE_SAMPLES.
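The consumption side of a work item then looks roughly like this (simplified: here mask bit n just means "write output sample n", whereas the attached code stores the shifted association field described earlier; ShadeSample and WriteSample are stand-in names):

// Unpack a work item, shade its flagged sample once, then hand the result to
// every output sample flagged in the mask, clearing bits as we go.
uint item        = sPerSamplePixels[listIndex];
uint sampleIndex = item & 0x3;                    // bits 0-1: sample to shade
uint writeMask   = (item >> 2) & 0x3F;            // bits 2-7: samples to write
uint2 coords     = uint2((item >> 8) & 0xFFF, (item >> 20) & 0xFFF);
float3 lit       = ShadeSample(coords, sampleIndex);

[loop] while (writeMask != 0)
{
    uint outSample = firstbitlow(writeMask);      // lowest flagged sample
    WriteSample(coords, outSample, lit);          // write one output sample
    writeMask &= writeMask - 1;                   // clear that bit
}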

There's also a COMPLEX_BRDF toggle in ShaderDefines.h. When it's enabled, the compute shader uses a more involved BRDF than the default simple Phong shading (the new BRDF is Oren-Nayar diffuse with Cook-Torrance specular, which I think I got from this very forum!). I thought it was worth testing to see if saving the shading time associated with a more expensive BRDF would offset the increased TGSM usage. It didn't. :)

Here were my actual frametime results, on an Nvidia 560 Ti, and like I mentioned, 1920x1080, 1024 lights, 4xMSAA, and built with the Release target:

8.18ms - original unmodified intel app (newly compiled though).
8.26ms - sample surface comparison and uint packed return, original pixel-only TGSM array.
8.41ms - sample surface comparison and uint packed return, unique-sample-only TGSM array.
9.81ms - sample surface comparison and uint packed return, original pixel-only TGSM array, complex BRDF.
9.98ms - sample surface comparison and uint packed return, unique-sample-only TGSM array, complex BRDF.

As you can see, whatever gains were made reducing the number of shaded samples, more was lost in the extra TGSM usage. I also suspect the export cost when writing samples is possibly higher now. With the original pixel-only TGSM array used in the second pass for edge pixel samples, each thread shades one sample and writes one sample. With the unique-samples style TGSM array, each thread shades one sample still, but may write two or three samples afterward. I think it just makes the thread scheduling a bit 'lumpy' and uneven perhaps.

It's still pretty close in speed to the original method though, and much much faster than disabling defer_per_sample, so it's definitely a usable technique. I'm just not sure what would make it worth using over the normal implementation.

Anyhow, the code is attached; extract it to the original project directory. It's hard to see how to optimize the technique. Ideally it'd be good to reduce the TGSM usage to per-pixel once more, but I don't see how to efficiently schedule sample shading for that when each pixel could have an undetermined number of samples to shade. Maybe you'll see some optimizations I missed, or a bug! :)

Oh, I never tried to interpolate the gbuffer attributes for samples that share surfaces. It's a bunch of extra reads and ops for what would appear to be a minor benefit, since the multisample resolve on the output buffer already seems indistinguishable from the original app's output. Also, if you enable SAMPLE_COLORS in shaderdefines.h you can use the #if blocks in computetileshader_color.hlsl to check the sample outputs. Otherwise it's identical to the main compute shader file (I moved the #if blocks out for readability).

[attachment=18462:deferred_shading_uniqueSamples.zip]

dongsaturn

It's wonderful! I really appreciate it.

There is a small mistake in computeshadertile.hlsl at line 33.

It should be:

groupshared uint sPerSamplePixels[COMPUTE_SHADER_TILE_GROUP_SIZE * 3];

I guess you forgot to change it. :)

Sorry I missed the code for preventing overrun.

 


I also suspect the export cost when writing samples is possibly higher now. With the original pixel-only TGSM array used in the second pass for edge pixel samples, each thread shades one sample and writes one sample. With the unique-samples style TGSM array, each thread shades one sample still, but may write two or three samples afterward. I think it just makes the thread scheduling a bit 'lumpy' and uneven perhaps.

 

I think the divergent branching from writing a varying number of samples also contributes.

Threads within a single warp take different paths, and those different execution paths are serialized.

