matt77hias

SSAA and ResolveSubresource


Is there something similar to ID3D11DeviceContext::ResolveSubresource for a super-sampled instead of a multi-sampled texture?

Or does one need to write a custom shader for this?

As a side note, is there actually a difference between a super-sampled and a multi-sampled texture from the perspective of a compute shader?


Conceptually, an SSAA surface (resolution x multiplier, with 1 sample per pixel) and an MSAA surface (resolution x sample count) are identical. At the hardware level, an MSAA surface may use different memory swizzling and tiling, drop fragments, etc.

At the drawing stage, things are obviously different. SSAA is not a distinct feature; you are just rendering at a larger resolution. With MSAA, only one fragment is shaded per pixel (unless you use the per-sample semantic), and the sample locations are not on an aligned grid. (So for a better SSAA, you may want to try that as well; I can't comment on the performance difference, though.)

 

As for ResolveSubresource, does anyone still use it? Most of the time you need to do some per-fragment processing anyway; tone mapping is a good example. I can't vouch for all hardware, but on modern GPUs this API is usually implemented as a compute pass nowadays; there is no fixed-function pass for it.
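
To make the tone mapping point concrete, here is a minimal CPU-side sketch in C++ (not shader code) that compares a plain average of the samples with an average taken in tone-mapped space. It assumes that a built-in resolve amounts to a plain average. The Reinhard operator and the names ToneMap, InverseToneMap, ResolveLinear and ResolveToneMapped are assumptions for illustration; the thread does not say which operator is actually used.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tone map operator (Reinhard); an assumption, not taken from the thread.
inline float ToneMap(float hdr) { return hdr / (1.0f + hdr); }

// Inverse of ToneMap, valid for ldr in [0, 1).
inline float InverseToneMap(float ldr) { return ldr / (1.0f - ldr); }

// Plain average of the HDR samples.
inline float ResolveLinear(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) {
        sum += s;
    }
    return sum / static_cast<float>(samples.size());
}

// Average in tone-mapped space, then map the result back to HDR.
inline float ResolveToneMapped(const std::vector<float>& samples) {
    assert(!samples.empty());
    float sum = 0.0f;
    for (float s : samples) {
        sum += ToneMap(s);
    }
    return InverseToneMap(sum / static_cast<float>(samples.size()));
}
```

For an edge pixel with the two HDR samples 0 and 100, ResolveLinear gives 50, which still tone maps to about 0.98 (nearly white, so the edge stays aliased), while ResolveToneMapped gives about 0.98 in HDR, which tone maps to about 0.5.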

 

And because ResolveSubresource is dedicated to MSAA, you will have to write a custom downsampling pass if you want to do SSAA.

1 hour ago, galop1n said:

So for a better SSAA, you may want to try that as well; I can't comment on the performance difference, though.

What do you mean?

39 minutes ago, matt77hias said:

What do you mean?

MSAA fragments are not organized on a regular grid in order to improve the look of antialiased edges, and possibly to antialias more of them where it matters (perfectly horizontal and perfectly vertical edges also don't suffer from aliasing as much).

 

If you do your SSAA by using MSAA with a pixel shader that uses SV_SampleIndex, you get those nice custom sample locations. You can even customize them with vendor extensions, or with DX12 and the Fall Creators Update API.
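
A small CPU-side C++ sketch of why the non-aligned locations help: it compares an ordered 2x2 grid (what rendering at twice the resolution gives you) with a rotated 4-sample pattern by counting the distinct sample columns and rows. The offsets in kRotatedGrid are assumed to match the D3D standard 4x pattern, in 1/16 pixel units from the pixel center; check them against the documentation before relying on them. All names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Sub-pixel offset from the pixel center, in 1/16 pixel units.
struct Offset { int x; int y; };

// Ordered 2x2 grid: the sample layout of 2x2 SSAA.
const std::vector<Offset> kOrderedGrid = {{-4, -4}, {4, -4}, {-4, 4}, {4, 4}};

// Rotated grid: ASSUMED standard 4x MSAA pattern, verify against the API docs.
const std::vector<Offset> kRotatedGrid = {{-2, -6}, {6, -2}, {-6, 2}, {2, 6}};

// Number of distinct horizontal positions among the samples.
inline int DistinctColumns(const std::vector<Offset>& pattern) {
    assert(!pattern.empty());
    std::set<int> xs;
    for (const Offset& o : pattern) {
        xs.insert(o.x);
    }
    return static_cast<int>(xs.size());
}

// Number of distinct vertical positions among the samples.
inline int DistinctRows(const std::vector<Offset>& pattern) {
    assert(!pattern.empty());
    std::set<int> ys;
    for (const Offset& o : pattern) {
        ys.insert(o.y);
    }
    return static_cast<int>(ys.size());
}
```

A perfectly vertical edge moving across the pixel can produce only 3 coverage levels with the ordered grid (0, 2 or 4 samples covered), but 5 with the rotated one (0 to 4 samples), since every sample sits in its own column. The same holds for horizontal edges and rows.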

38 minutes ago, galop1n said:

MSAA fragments are not organized on a regular grid in order to improve the look of antialiased edges, and possibly to antialias more of them where it matters (perfectly horizontal and perfectly vertical edges also don't suffer from aliasing as much).

Thanks for stressing that. I wasn't taking this into account.

39 minutes ago, galop1n said:

If you do your SSAA by using MSAA with a pixel shader that uses SV_SampleIndex, you get those nice custom sample locations.

Is that an SV_Position you get from GetSamplePosition? MSDN is a bit vague:  https://msdn.microsoft.com/en-us/library/windows/desktop/bb944004(v=vs.85).aspx

 

The big downside of using this is that it will double the number of pixel shaders :(


So resolving MSAA is basically something like this:

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CS(uint3 thread_id : SV_DispatchThreadID) {

    const uint2 location = thread_id.xy;

    uint2 output_dim;
    uint nb_samples;
    g_input_image_texture.GetDimensions(output_dim.x, output_dim.y, nb_samples);
    if (any(location >= output_dim)) {
        return;
    }

    // Resolve the (multi-sampled) radiance, normal and depth.
    float4 ldr_sum    = 0.0f;
    float3 normal_sum = 0.0f;
    float  depth      = 0.0f;
    for (uint i = 0; i < nb_samples; ++i) {

        const float4 hdr = g_input_image_texture.sample[i][location];
        ldr_sum += saturate(TONE_MAP_COMPONENT(hdr));

        normal_sum += g_input_normal_texture.sample[i][location];

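        // Inverted Z-buffer: the nearest sample has the largest depth.
        // For a non-inverted Z-buffer, initialize depth to 1.0f and use:
        // depth = min(depth, g_input_depth_texture.sample[i][location]);
        depth = max(depth, g_input_depth_texture.sample[i][location]);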
    }

    const float inv_nb_samples = 1.0f / nb_samples;

    // Store the resolved radiance.
    g_output_image_texture[location]  = INVERSE_TONE_MAP_COMPONENT(ldr_sum * inv_nb_samples);
    // Store the resolved normal.
    g_output_normal_texture[location] = normalize(normal_sum);
    // Store the resolved depth.
    g_output_depth_texture[location]  = depth;
}

And resolving SSAA is something like this:

[numthreads(GROUP_SIZE, GROUP_SIZE, 1)]
void CS(uint3 thread_id : SV_DispatchThreadID) {

    const uint2 output_location = thread_id.xy;

    uint2 output_dim;
    g_output_image_texture.GetDimensions(output_dim.x, output_dim.y);
    if (any(output_location >= output_dim)) {
        return;
    }

    uint2 input_dim;
    g_input_image_texture.GetDimensions(input_dim.x, input_dim.y);

    const uint2 nb_samples     = input_dim / output_dim;
    const uint2 input_location = output_location * nb_samples;

    // Resolve the (super-sampled) radiance, normal and depth.
    float4 ldr_sum    = 0.0f;
    float3 normal_sum = 0.0f;
    float  depth      = 0.0f;
    for (uint i = 0; i < nb_samples.x; ++i) {
        for (uint j = 0; j < nb_samples.y; ++j) {

            const uint2 location = input_location + uint2(i,j);

            const float4 hdr = g_input_image_texture[location];
            ldr_sum += saturate(TONE_MAP_COMPONENT(hdr));

            normal_sum += g_input_normal_texture[location];

            // Inverted Z-buffer: the nearest sample has the largest depth.
            // For a non-inverted Z-buffer, initialize depth to 1.0f and use:
            // depth = min(depth, g_input_depth_texture[location]);
            depth = max(depth, g_input_depth_texture[location]);
        }
    }

    const float inv_nb_samples = 1.0f / (nb_samples.x * nb_samples.y);

    // Store the resolved radiance.
    g_output_image_texture[output_location]  = INVERSE_TONE_MAP_COMPONENT(ldr_sum * inv_nb_samples);
    // Store the resolved normal.
    g_output_normal_texture[output_location] = normalize(normal_sum);
    // Store the resolved depth.
    g_output_depth_texture[output_location]  = depth;
}
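
For checking the SSAA shader against known values, a CPU reference can help. Below is a minimal single-channel C++ sketch of the same index mapping (input_location = output_location * nb_samples), with an Average mode for color-like data and a Max mode matching the depth resolve. The Image type, the Resolve enum and ResolveSSAA are hypothetical helpers, not part of the engine discussed in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-channel image stored row by row (hypothetical helper type).
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;

    Image(std::size_t w, std::size_t h) : width(w), height(h), texels(w * h, 0.0f) {}

    float& At(std::size_t x, std::size_t y) { return texels[y * width + x]; }
    float At(std::size_t x, std::size_t y) const { return texels[y * width + x]; }
};

enum class Resolve { Average, Max };

// Downsamples input to out_w x out_h texels, one block of nx x ny texels per output texel.
inline Image ResolveSSAA(const Image& input, std::size_t out_w, std::size_t out_h, Resolve mode) {
    assert(out_w > 0 && out_h > 0);
    // Unlike the shader, refuse input sizes that are not a multiple of the output size.
    assert(input.width % out_w == 0 && input.height % out_h == 0);

    const std::size_t nx = input.width / out_w;
    const std::size_t ny = input.height / out_h;
    assert(nx > 0 && ny > 0);

    Image output(out_w, out_h);
    for (std::size_t y = 0; y < out_h; ++y) {
        for (std::size_t x = 0; x < out_w; ++x) {
            float sum = 0.0f;
            float max_value = 0.0f;  // same start value as depth in the shader
            for (std::size_t j = 0; j < ny; ++j) {
                for (std::size_t i = 0; i < nx; ++i) {
                    const float v = input.At(x * nx + i, y * ny + j);
                    sum += v;
                    max_value = std::max(max_value, v);
                }
            }
            output.At(x, y) = (mode == Resolve::Average)
                ? sum / static_cast<float>(nx * ny)
                : max_value;
        }
    }
    return output;
}
```

Note that the integer division input_dim / output_dim in the shader silently ignores leftover texels when the input size is not a multiple of the output size; the sketch asserts on that instead.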

 


That's it, yes. For performance, you may want to change the thread group layout so that each thread processes one fragment, and use lane swizzling or group shared memory to gather the results.

12 hours ago, galop1n said:

For performance, you may want to change the thread group to have one fragment processed per thread and use lane swizzling or group shared memory to gather the results.

But currently no sharing or sync is needed within a group of threads? Every texel is processed independently by 1 thread?

8 hours ago, matt77hias said:

But currently no sharing or sync is needed within a group of threads? Every texel is processed independently by 1 thread?

Because you are using GetDimensions (equivalent to a texture fetch), your loop is not unrolled, so each fragment makes the shader wait for a texture fetch result, possibly leading to stalls. That is not optimal.

If the loop were unrolled because of statically known dimensions, you would trade that for register pressure, which could also lead to suboptimal results.

In your case, you want one thread per source pixel, not one thread per destination pixel. That way, you can parallelize the texture fetches and the tone mapping instructions across more threads.

You can then use shader model 6 to do lane swizzling, or use group shared memory with interlocked operations (drivers are usually good at turning that into lane swizzling), to gather the results across threads before writing.

This should perform better if done properly.
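
The gather step can be sketched on the CPU as a tree reduction. The C++ function below simulates one thread group sequentially: the vector stands in for group shared memory with one value per source texel, and each pass of the outer loop corresponds to one step between two group barriers (GroupMemoryBarrierWithGroupSync in HLSL). This is only one assumed way to organize the gather, not necessarily what is meant above, and GroupReduceSum is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for a thread group summing its per-thread values.
// shared: one (for example already tone-mapped) value per thread; size must be a power of two.
inline float GroupReduceSum(std::vector<float> shared) {
    const std::size_t n = shared.size();
    assert(n > 0 && (n & (n - 1)) == 0);

    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On the GPU, all threads with index < stride would do this in parallel,
        // followed by a group barrier before the next, smaller stride.
        // Thread t only reads slot t + stride, which no active thread writes,
        // so running the threads one after another gives the same result.
        for (std::size_t t = 0; t < stride; ++t) {
            shared[t] += shared[t + stride];
        }
    }
    return shared[0];  // thread 0 would scale this and write the output texel
}
```

With one thread per source texel, thread 0 of each group of samples would multiply the returned sum by the inverse sample count, apply the inverse tone map and store the result, which replaces the per-texel loop of the shaders above.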

6 minutes ago, galop1n said:

If the loop were unrolled because of statically known dimensions, you would trade that for register pressure, which could also lead to suboptimal results.

I see, but that would mean a separate shader for every supported SSAA and MSAA combination.
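
As a CPU-side analogy in C++, with templates standing in for shader permutations (an assumption about how such variants would be built, not something stated in the thread): the sample count is a compile-time constant in ResolveStatic, so the compiler is free to unroll the loop, and ResolveDynamic selects one instantiation per supported count at run time. Both names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sample count known at compile time: one instantiation per supported count.
template <std::size_t SampleCount>
inline float ResolveStatic(const float* samples) {
    static_assert(SampleCount > 0, "at least one sample is required");
    assert(samples != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < SampleCount; ++i) {
        sum += samples[i];
    }
    return sum / static_cast<float>(SampleCount);
}

// Run-time selection of the variant, similar to picking a precompiled shader.
inline float ResolveDynamic(const float* samples, std::size_t sample_count) {
    switch (sample_count) {
        case 1: return ResolveStatic<1>(samples);
        case 2: return ResolveStatic<2>(samples);
        case 4: return ResolveStatic<4>(samples);
        case 8: return ResolveStatic<8>(samples);
        default:
            assert(false && "unsupported sample count");
            return 0.0f;
    }
}
```

The number of variants grows with every supported sample count, which is exactly the cost raised here.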

8 minutes ago, galop1n said:

You can then use shader model 6 to do lane swizzling, or use group shared memory with interlocked operations (drivers are usually good at turning that into lane swizzling), to gather the results across threads before writing.

That requires Windows 10 :(

10 minutes ago, galop1n said:

That way, you can parallelize the texture fetches and the tone mapping instructions across more threads.

That is indeed a good but still general (i.e. one shader) optimization. Will add it.
