SSAA and ResolveSubresource

15 comments, last by matt77hias 6 years, 5 months ago

Does there exist something similar to ID3D11DeviceContext::ResolveSubresource for a super-sampled instead of a multi-sampled texture?

Or does one need to write a custom shader for this?

As a side note, is there actually a difference between a super-sampled and a multi-sampled texture from the perspective of a compute shader?

🧙


Conceptually, an SSAA surface ( base resolution × multiplier, 1 sample per pixel ) and an MSAA surface ( base resolution, × sample count ) are identical. At the hardware level, an MSAA surface may use different memory swizzling and tiling, drop fragments, etc.
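To illustrate the point about the compute-shader perspective, here is a minimal sketch ( texture names are hypothetical ) of how the two kinds of surface look when you read them; the only real difference is the resource type and how you address a sample:

```hlsl
// A super-sampled surface is just a larger ordinary texture:
// (width * N) x (height * N), one sample per texel.
Texture2D<float4>   g_ssaa_texture;
// A multi-sampled surface keeps the base resolution but stores
// N samples per texel.
Texture2DMS<float4> g_msaa_texture;

float4 FetchSSAA(uint2 location) {
    return g_ssaa_texture[location];
}

float4 FetchMSAA(uint2 location, uint sample_index) {
    // HLSL exposes MSAA samples via the .sample indexer (or Load).
    return g_msaa_texture.sample[sample_index][location];
}
```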

At the drawing stage, things are obviously different. SSAA is not a distinct feature; you are just rendering at a larger resolution. With MSAA, you get only one fragment per pixel ( unless you use the per-sample semantic ), and the sample locations are not on an aligned grid. ( So to get a better SSAA, you may want to try that as well; I can't comment on the performance difference, though. )

 

As for ResolveSubresource, does anyone still use it? Most of the time you need per-fragment treatment anyway; tone mapping is a good example. I can't vouch for all hardware, but on modern GPUs this API is usually implemented as a compute pass nowadays; there is no fixed-function path for it.

 

And because ResolveSubresource is dedicated to MSAA, if you want to do SSAA you will have to write a custom downsampling pass for it.

1 hour ago, galop1n said:

So to get a better SSAA, you may want to try that as well; I can't comment on the performance difference, though.

What do you mean?

🧙

39 minutes ago, matt77hias said:

What do you mean?

The reason MSAA fragments are not organized on a regular grid is to improve the visual quality of antialiased edges, and possibly to antialias more of them where it matters ( perfectly horizontal and perfectly vertical edges don't suffer from aliasing as much ).

 

If you do your SSAA by using MSAA and a pixel shader running per sample via SV_SampleIndex, then you get those nice custom sample locations. You can even customize them with vendor extensions, or with the DX12 Fall Creators Update API.
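For reference, a minimal sketch of what such a per-sample pixel shader could look like ( ShadeSample is a hypothetical helper standing in for your actual shading ). Merely declaring an SV_SampleIndex input forces the shader to execute once per sample instead of once per pixel, which turns an MSAA target into a super-sampled one at the hardware's ( non-grid ) sample locations:

```hlsl
// Hypothetical per-sample pixel shader. The SV_SampleIndex input
// causes one invocation per sample rather than per pixel.
float4 ShadeSample(float2 position, uint sample_index); // assumed helper

float4 PS(float4 position     : SV_Position,
          uint   sample_index : SV_SampleIndex) : SV_Target {
    // Shade this individual sample at its (non-grid) location.
    return ShadeSample(position.xy, sample_index);
}
```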

38 minutes ago, galop1n said:

The reason MSAA fragments are not organized on a regular grid is to improve the visual quality of antialiased edges, and possibly to antialias more of them where it matters ( perfectly horizontal and perfectly vertical edges don't suffer from aliasing as much ).

Thanks for stressing. Wasn't taking this into account.

39 minutes ago, galop1n said:

If you do your SSAA by using MSAA and a pixel shader running per sample via SV_SampleIndex, then you get those nice custom sample locations.

Is that an SV_Position you get from GetSamplePosition? MSDN is a bit vague:  https://msdn.microsoft.com/en-us/library/windows/desktop/bb944004(v=vs.85).aspx

 

The big downside of using this is that it will multiply the number of pixel shader invocations :(

🧙

So resolving MSAA is basically something like this:


// Multi-sampled inputs, single-sample outputs.
Texture2DMS<float4> g_input_image_texture;
Texture2DMS<float3> g_input_normal_texture;
Texture2DMS<float>  g_input_depth_texture;
RWTexture2D<float4> g_output_image_texture;
RWTexture2D<float3> g_output_normal_texture;
RWTexture2D<float>  g_output_depth_texture;

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CS(uint3 thread_id : SV_DispatchThreadID) {

    const uint2 location = thread_id.xy;

    uint2 output_dim;
    uint nb_samples;
    g_input_image_texture.GetDimensions(output_dim.x, output_dim.y, nb_samples);
    if (any(location >= output_dim)) {
        return;
    }

    // Resolve the (multi-sampled) radiance, normal and depth.
    float4 ldr_sum    = 0.0f;
    float3 normal_sum = 0.0f;
    float  depth      = 0.0f;
    for (uint i = 0; i < nb_samples; ++i) {

        const float4 hdr = g_input_image_texture.sample[i][location];
        ldr_sum += saturate(TONE_MAP_COMPONENT(hdr));

        normal_sum += g_input_normal_texture.sample[i][location];

        // Non-inverted Z-buffer:
        // depth = min(depth, g_input_depth_texture.sample[i][location]);
        depth = max(depth, g_input_depth_texture.sample[i][location]);
    }

    const float inv_nb_samples = 1.0f / nb_samples;

    // Store the resolved radiance.
    g_output_image_texture[location]  = INVERSE_TONE_MAP_COMPONENT(ldr_sum * inv_nb_samples);
    // Store the resolved normal.
    g_output_normal_texture[location] = normalize(normal_sum);
    // Store the resolved depth.
    g_output_depth_texture[location]  = depth;
}

And resolving SSAA is something like this:


// Super-sampled inputs, single-sample outputs.
Texture2D<float4>   g_input_image_texture;
Texture2D<float3>   g_input_normal_texture;
Texture2D<float>    g_input_depth_texture;
RWTexture2D<float4> g_output_image_texture;
RWTexture2D<float3> g_output_normal_texture;
RWTexture2D<float>  g_output_depth_texture;

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CS(uint3 thread_id : SV_DispatchThreadID) {

    const uint2 output_location = thread_id.xy;

    uint2 output_dim;
    g_output_image_texture.GetDimensions(output_dim.x, output_dim.y);
    if (any(output_location >= output_dim)) {
        return;
    }

    uint2 input_dim;
    g_input_image_texture.GetDimensions(input_dim.x, input_dim.y);

    const uint2 nb_samples     = input_dim / output_dim;
    const uint2 input_location = output_location * nb_samples;

    // Resolve the (super-sampled) radiance, normal and depth.
    float4 ldr_sum    = 0.0f;
    float3 normal_sum = 0.0f;
    float  depth      = 0.0f;
    for (uint i = 0; i < nb_samples.x; ++i) {
        for (uint j = 0; j < nb_samples.y; ++j) {

            const uint2 location = input_location + uint2(i,j);

            const float4 hdr = g_input_image_texture[location];
            ldr_sum += saturate(TONE_MAP_COMPONENT(hdr));

            normal_sum += g_input_normal_texture[location];

            // Non-inverted Z-buffer:
            // depth = min(depth, g_input_depth_texture[location]);
            depth = max(depth, g_input_depth_texture[location]);
        }
    }

    const float inv_nb_samples = 1.0f / (nb_samples.x * nb_samples.y);

    // Store the resolved radiance.
    g_output_image_texture[output_location]  = INVERSE_TONE_MAP_COMPONENT(ldr_sum * inv_nb_samples);
    // Store the resolved normal.
    g_output_normal_texture[output_location] = normalize(normal_sum);
    // Store the resolved depth.
    g_output_depth_texture[output_location]  = depth;
}

 

🧙

That's it, yes. For performance, you may want to change the thread group so that each thread processes one fragment, and use lane swizzling or group shared memory to gather the results.

12 hours ago, galop1n said:

For performance, you may want to change the thread group so that each thread processes one fragment, and use lane swizzling or group shared memory to gather the results.

But currently no sharing or synchronization is needed within a thread group? Every texel is processed independently by one thread?

🧙

8 hours ago, matt77hias said:

But currently no sharing or synchronization is needed within a thread group? Every texel is processed independently by one thread?

Because you are using GetDimensions ( equivalent to a texture fetch ), your loop is not unrolled, so each fragment will leave the shader waiting for a texture fetch result, possibly causing stalls. It is not optimal.

If the loop were unrolled thanks to statically known dimensions, it would trade register pressure instead, which could also lead to suboptimal results.

In your case, you want one thread per source pixel, not one thread per destination pixel. That way, you can parallelize the texture fetches across more threads, and the tone-mapping instructions as well.

You can then use Shader Model 6 to do lane swizzling, or use group shared memory with interlocked operations ( drivers are usually good at turning those into lane swizzles ) to gather results across threads before writing.

 

This should be faster if done properly.
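As a concrete illustration of the one-thread-per-source-pixel layout, here is a sketch of a 2×2 SSAA image resolve using group shared memory ( resource names are hypothetical; TONE_MAP_COMPONENT and INVERSE_TONE_MAP_COMPONENT are the macros from the snippets above ). Each thread fetches and tone-maps one input texel in parallel; one thread per 2×2 quad then averages and writes the output pixel:

```hlsl
Texture2D<float4>   g_input_image_texture;   // super-sampled input
RWTexture2D<float4> g_output_image_texture;  // resolved output

groupshared float4 gs_ldr[16][16];

[numthreads(16, 16, 1)]
void CS(uint3 thread_id       : SV_DispatchThreadID,
        uint3 group_thread_id : SV_GroupThreadID) {

    // One thread per *input* texel: fetch and tone map in parallel.
    const float4 hdr = g_input_image_texture[thread_id.xy];
    gs_ldr[group_thread_id.y][group_thread_id.x]
        = saturate(TONE_MAP_COMPONENT(hdr));

    GroupMemoryBarrierWithGroupSync();

    // The top-left thread of each 2x2 quad averages its quad and
    // writes one output pixel.
    if ((group_thread_id.x & 1) == 0 && (group_thread_id.y & 1) == 0) {
        const float4 ldr
            = 0.25f * (gs_ldr[group_thread_id.y    ][group_thread_id.x    ]
                     + gs_ldr[group_thread_id.y    ][group_thread_id.x + 1]
                     + gs_ldr[group_thread_id.y + 1][group_thread_id.x    ]
                     + gs_ldr[group_thread_id.y + 1][group_thread_id.x + 1]);
        g_output_image_texture[thread_id.xy / 2]
            = INVERSE_TONE_MAP_COMPONENT(ldr);
    }
}
```

On Shader Model 6, the barrier and shared memory could be replaced by wave intrinsics such as WaveActiveSum within a quad-sized lane group.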

6 minutes ago, galop1n said:

If the loop were unrolled thanks to statically known dimensions, it would trade register pressure instead, which could also lead to suboptimal results.

I see, but that would mean a separate shader for every supported SSAA and MSAA sample-count combination.

8 minutes ago, galop1n said:

You can then use Shader Model 6 to do lane swizzling, or use group shared memory with interlocked operations ( drivers are usually good at turning those into lane swizzles ) to gather results across threads before writing.

Windows 10 :(

10 minutes ago, galop1n said:

That way, you can parallelize the texture fetches across more threads, and the tone-mapping instructions as well.

That is indeed a good, but still general ( i.e. one shader ) optimization. I will add it.

🧙

This topic is closed to new replies.
