I spent this afternoon looking at that Intel tiled deferred example again to see if it was possible to implement your idea. The good news is it was workable, and it only took editing a handful of files. I'll attach them to this post in a zip file; you can just drop them into the project folder and rebuild the project to try it. I commented the code pretty heavily so it should make sense to read.
The bad news is I couldn't get it any faster than the default method of shading all sub-samples for edge pixels (it was only ~2-4% slower though). There are a couple of issues that I'm guessing are the reason sharing the shading output between samples isn't faster, but I'll explain how I implemented it before digging into that.
For reference, I always ran it at 1920x1080 with 4x MSAA, 1024 lights.
The first thing was associating sub-samples with each other, or rather associating sub-samples that share the same surface/triangle. Usually sub-samples are only tested against sample 0, and if any one of them doesn't appear to be from the same surface as sample 0, the pixel is added to an array of unsigned ints in thread-group shared memory (TGSM). When that array is processed in a second pass by the threadgroup, all 3 remaining sub-samples in an edge pixel get shaded, each sample by a different thread. The TGSM uint array items just hold the viewport coordinates of the pixel packed as 2x 16-bit uints.
It doesn't really need 16 bits per axis for pixel coords, so I cut that down to 12 bits per axis, still leaving a max resolution of 4096x4096 (computeshadertile.hlsl line:46). Anyhow, that opens up 8 remaining bits to pack the sub-sample associations into. The sub-sample test against sample 0 (the function is called RequiresPerSampleShading) usually returns a bool to signify whether any sub-samples didn't match sample0's surface. Instead I rewrote it (poorly) to pack the sub-sample associations into the first 9 bits of a uint (see gbuffer.hlsl line:120). The packing is a bit branchy but I tried to keep it to a minimum. It's all commented so I won't explain it here. I will paste in how the bit packing is laid out:
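To make the original scheme concrete, here is a rough CPU-side C++ model of it (not the shader code itself). The function names, and which 16-bit half holds x versus y, are my own assumptions for illustration; only the "2x 16-bit coords, three remaining samples per listed pixel, one sample per thread" idea comes from the sample.

```cpp
#include <cstdint>

// Hypothetical CPU-side model of the original per-pixel list.
// All names here are illustrative, not taken from the sample's source.
constexpr uint32_t kMsaaSamples = 4;

// Pack viewport coordinates as two 16-bit halves of one uint.
// Putting x in the low half is an assumption.
inline uint32_t PackCoords16(uint32_t x, uint32_t y) {
    return (y << 16) | (x & 0xFFFFu);
}
inline uint32_t UnpackX16(uint32_t packed) { return packed & 0xFFFFu; }
inline uint32_t UnpackY16(uint32_t packed) { return packed >> 16; }

struct SampleJob {
    uint32_t listIndex;    // which listed edge pixel
    uint32_t sampleIndex;  // which sub-sample of that pixel (1..3)
};

// Second pass: every listed pixel contributes (kMsaaSamples - 1) jobs,
// so work item i maps to pixel i / 3 and sample (i % 3) + 1.
inline SampleJob JobForWorkItem(uint32_t i) {
    return { i / (kMsaaSamples - 1), i % (kMsaaSamples - 1) + 1 };
}
```

With this mapping every thread always has exactly one sample to shade, which is what makes the per-pixel list easy to schedule.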
//perSampleField's first 9 bits describe which samples share a surface and so can be shaded once but written to multiple output samples.
//the layout is as follows:
//   shaded sample:     sample0   sample1   sample2   sample3
//   samples to write:  1,2,3     1,2,3     2,3       3
//   bits:              0,1,2     3,4,5     6,7       8
//sample0 is always written from sample0's shading so no 4th bit is needed for sample0. If sample1 is a unique surface, its shading will
//always be written for sample1, and could be written for samples 2 and 3. Similarly if sample2 is a unique surface from samples 0 and 1,
//then sample2 will always use its own shading, and only sample3 might also use it. If sample3 is a unique surface from all the other samples
//then only sample3 can possibly use sample3's shading result (since sample3 is the last sample, no later sample exists to share it).
//here is an example bitfield, where samples 0 and 2 are the same surface, but samples 1 and 3 are unique surfaces.
//   bits:    0 1 2 3 4 5 6 7 8
//   values:  0 1 0 1 0 0 0 0 1
//So when sample0 is shaded, its result is written to samples 0 and 2. When sample1 is shaded it's only written to sample1, and when sample3
//is shaded it's only written to sample3.
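As a rough illustration of that layout, here is a CPU-side C++ sketch that builds the 9-bit field. It is not the branchy HLSL from gbuffer.hlsl: the integer surfaceId array is a hypothetical stand-in for the real surface comparison, and the function names are made up for this example.

```cpp
#include <cstdint>

// Bit that records "shadedSample's result is also written to writeSample".
// Layout per the comment above: sample0 -> bits 0..2, sample1 -> bits 3..5,
// sample2 -> bits 6..7, sample3 -> bit 8.
inline uint32_t AssociationBit(uint32_t shadedSample, uint32_t writeSample) {
    constexpr uint32_t base[4]  = {0, 3, 6, 8};  // first bit of each group
    constexpr uint32_t first[4] = {1, 1, 2, 3};  // first sample each group can write
    return 1u << (base[shadedSample] + (writeSample - first[shadedSample]));
}

// surfaceId[] is a hypothetical stand-in: equal ids mean "same surface".
inline uint32_t BuildPerSampleField(const uint32_t surfaceId[4]) {
    uint32_t field = 0;
    for (uint32_t s = 1; s < 4; ++s) {
        // Find the lowest-numbered sample that shares s's surface.
        uint32_t rep = s;
        for (uint32_t t = 0; t < s; ++t) {
            if (surfaceId[t] == surfaceId[s]) { rep = t; break; }
        }
        // rep == s means s is unique and writes its own result.
        field |= AssociationBit(rep, s);
    }
    return field;
}
```

For the example in the comment (samples 0 and 2 on one surface, 1 and 3 unique) this sets bits 1, 3 and 8, matching the values row.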
The actual cost of sorting the samples and packing them into the bitfield isn't that bad. It only pushed the frametime from 8.18ms to 8.26ms.
So that's 9 bits of data and 8 bits of free space. Luckily sample0 is always shaded and written in the first pass by the threadgroup, so once perSampleField is returned, sample0 is shaded and written to output like normal, except that any other samples associated with its surface are now written too, that is, any samples flagged in bits 0 to 2.
That means the three bits 0-2 are no longer needed, so the bitfield can be shifted right by one bit. Now it's only 8 bits long, with two unused bits at the beginning. There is a reason to keep those first 2 bits available. Normally when sub-samples need to be shaded in the threadgroup's second pass, you can just add the pixel to the uint array in TGSM that acts as a list, because all three remaining samples need to be shaded. However, now we only want to shade the unique samples within a pixel listed for sub-sample shading. Each pixel that requires per-sample shading may now have only one or two unique samples, and that poses a scheduling problem. You can't split each pixel across 3 threads like before, since you'd end up with idle threads whenever only one or two of a pixel's 3 remaining samples are unique and need shading.
My solution was to use those empty first 2 bits of this sample mask bitfield to flag the sample to be shaded, and so add individual samples to the TGSM array, rather than entire pixels. So for each sample in a pixel that actually needs shading, the pixel coords and the sample mask are added to the TGSM array, along with a flag to say which sample to shade. For example, to submit a pixel with the sample bits set for sample1 (shared with sample2) and sample3, it's added to the TGSM array twice, once for each unique sample that needs shading. The first two bits say which sample to shade, and the next 6 bits are the sample mask that describes which samples need to be written with which sample's shading result. In this example the pixel is added with sample1 flagged for shading and masked for writing out to samples 1 and 2, and then the pixel is added again with sample3 flagged for shading and masked for writing to sample3. The only difference between samples submitted from the same pixel is the first two bits, while the sample mask and the pixel coords are the same.
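Here is a rough CPU-side C++ sketch of that step, turning one pixel's perSampleField into per-sample list entries. The order of the fields inside the uint (mask in the low byte, then 12 bits of x, then 12 bits of y) is an assumption for illustration, as are the function names; the real code is in the attached shader.

```cpp
#include <cstdint>
#include <vector>

// Assumed entry layout (the bit order within the uint is a guess):
//   bits 0..1   sample to shade
//   bits 2..7   write mask (old bits 3..8 of perSampleField)
//   bits 8..19  x, bits 20..31 y (12 bits per axis)
inline uint32_t PackEntry(uint32_t x, uint32_t y, uint32_t field8) {
    return (field8 & 0xFFu) | ((x & 0xFFFu) << 8) | ((y & 0xFFFu) << 20);
}

inline std::vector<uint32_t> EntriesForPixel(uint32_t x, uint32_t y,
                                             uint32_t perSampleField) {
    // sample0's bits are already consumed: shift right by one,
    // then clear the two low bits so they can hold the sample index.
    const uint32_t mask = (perSampleField >> 1) & 0xFCu;
    // After the shift, a sample's "writes itself" bit marks it as unique.
    const uint32_t ownBit[4] = {0u, 1u << 2, 1u << 5, 1u << 7};
    std::vector<uint32_t> entries;
    for (uint32_t s = 1; s < 4; ++s) {
        if (mask & ownBit[s]) {
            entries.push_back(PackEntry(x, y, mask | s));
        }
    }
    return entries;
}
```

For the example above (sample1 shared with sample2, sample3 unique) this produces two entries that differ only in their low two bits, and a pixel whose samples all match sample0 produces none.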
This solves a problem in that no threads will be given samples that don't require shading, since the list now only contains unique samples we want shaded. However, it also causes a problem since you end up writing and reading from TGSM a bit more. On a couple of occasions while I was testing, even 4xMSAA overran the 256 item array length on a tile/threadgroup. I added a check to avoid it in future, and there's no visible degradation from it, but it is technically a bit of a hack.
You can also just extend the TGSM array; I found doubling it to 512 samples added roughly 0.20ms to the frametime. Since the hacky fix had no noticeable drawback at 4xMSAA, I left that in the uploaded files.
So bits 0 and 1 are just a packed uint of the sample to be shaded (1, 2, or 3), while bits 2-7 describe the other samples' associations to the sample being shaded, so that once it's shaded they can also be written to output along with it. The details on adding samples instead of pixels to the TGSM array are in ComputeTileShader.hlsl around line 220 onward. Similarly the details on shading the sample once and writing that to multiple output samples are in the same file around line 324. Likewise it's all commented so I won't explain the details here. It's just a little bit of bit shifting and setting/clearing, and using intrinsic functions like firstbithigh and firstbitlow to parse the values out. You can toggle between shading unique samples or the original all-samples-per-pixel using a new #define in ShaderDefines.h called UNIQUE_SAMPLES.
There's also a COMPLEX_BRDF toggle in ShaderDefines.h. When it's enabled the compute shader uses a more involved BRDF than the default simple Phong shading (the new BRDF is Oren-Nayar diffuse with Cook-Torrance specular that I think I got from this very forum!). I thought it was worth testing to see if saving the shading time associated with a more expensive BRDF would offset the increased TGSM usage. It didn't.
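To show the decode side of that, here is a rough CPU-side C++ sketch that pulls the sample index and the set of output samples out of the low 8 bits of an entry. It uses plain shifts and masks rather than firstbithigh/firstbitlow, so it is not how the attached shader does it, and the function names are made up for this example.

```cpp
#include <cstdint>

// Low 8 bits of a list entry, as described above:
//   bits 0..1  sample to shade (1, 2 or 3)
//   bits 2..4  sample1's result is written to samples 1,2,3
//   bits 5..6  sample2's result is written to samples 2,3
//   bit  7     sample3's result is written to sample 3
inline uint32_t SampleToShade(uint32_t entry) { return entry & 0x3u; }

// Returns a 4-bit coverage value: bit n set means output sample n
// receives the shading result of SampleToShade(entry).
inline uint32_t WriteCoverage(uint32_t entry) {
    switch (SampleToShade(entry)) {
        case 1: return ((entry >> 2) & 0x7u) << 1;
        case 2: return ((entry >> 5) & 0x3u) << 2;
        case 3: return ((entry >> 7) & 0x1u) << 3;
        default: return 0u;  // sample0 is handled in the first pass
    }
}
```

A thread would shade SampleToShade(entry) once and then loop over the set bits of WriteCoverage(entry) to write the result to each output sample.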
Here are my actual frametime results on an NVIDIA GTX 560 Ti at, like I mentioned, 1920x1080, 1024 lights, 4xMSAA, built with the Release target:
8.18ms - original unmodified Intel app (newly compiled though).
8.26ms - sample surface comparison and uint packed return, original pixel-only TGSM array.
8.41ms - sample surface comparison and uint packed return, unique-sample-only TGSM array.
9.81ms - sample surface comparison and uint packed return, original pixel-only TGSM array, complex BRDF.
9.98ms - sample surface comparison and uint packed return, unique-sample-only TGSM array, complex BRDF.
As you can see, whatever gains were made reducing the number of shaded samples, more was lost in the extra TGSM usage. I also suspect the export cost when writing samples is possibly higher now. With the original pixel-only TGSM array used in the second pass for edge pixel samples, each thread shades one sample and writes one sample. With the unique-samples style TGSM array, each thread shades one sample still, but may write two or three samples afterward. I think it just makes the thread scheduling a bit 'lumpy' and uneven perhaps.
It's still pretty close in speed to the original method though, and much much faster than disabling defer_per_sample, so it's definitely a usable technique. I'm just not sure what would make it worth using over the normal implementation.
Anyhow, the code is attached; extract it to the original project directory. It's hard to see how to optimize the technique. Ideally it'd be good to reduce the TGSM usage to per-pixel once more, but I don't see how to efficiently schedule sample shading for that when each pixel could have an undetermined number of samples to shade. Maybe you'll see some optimizations I missed, or a bug!
Oh, I never tried to interpolate the gbuffer attributes for samples that share surfaces. It's a bunch of extra reads and ops for what would appear to be a minor benefit, since the multisample resolve on the output buffer already seems indistinguishable from the original app's output. Also, if you enable SAMPLE_COLORS in ShaderDefines.h you can use the #if blocks in computetileshader_color.hlsl to check the sample outputs. Otherwise it's identical to the main compute shader file (I moved the #if blocks out for readability).