DX11: How do I downsample a texture with DirectX 11?

Hello!

I have a texture in 4K resolution and I need to downsample it to get a 1x1 result texture.

I know there are intermediate downsampling steps before reaching the 1x1 texture, but how does downsampling work, and how do I have to write my pixel shader to downsample my texture?

For each slice, just use a simple pixel-copy pixel shader, offset your UVs by 0.5 texel, and sample with linear filtering. Continue this down to the 1x1 texture. That way, you let the hardware filtering do the job of averaging the pixels at nearly no cost, since you're sampling in between texels :)


11 hours ago, vinterberg said:

For each slice, just use a simple pixel-copy pixel shader, offset your UVs by 0.5 texel, and sample with linear filtering. Continue this down to the 1x1 texture. That way, you let the hardware filtering do the job of averaging the pixels at nearly no cost, since you're sampling in between texels :)


So you mean the pixel shader would look something like this:

pixelOut = pixelIn;


// CPU side (pseudocode): one pass per slice, from the largest texture down to the smallest
for (slice = nslices - 1; slice >= 1; --slice)
{
	setrendertarget(texture[slice - 1]);				// render to the half-size texture
	texwidth = texture[slice].width;					// the double-size texture we want to downsample
	texheight = texture[slice].height;
	float u_offset = (1.0f / texwidth) / 2.0f;			// half a texel of the source texture
	float v_offset = (1.0f / texheight) / 2.0f;
	pixelshader->setUVoffset(u_offset, v_offset);
	pixelshader->run();									// draw a fullscreen quad
}

// Pixel shader body (pseudocode): a single linearly filtered fetch
pixel_out = texture_sample(texture_input, vertexshader_UV + UVoffset);
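
To make the arithmetic of the chain concrete, here is a minimal CPU-side sketch in C++ (not D3D11 code) of what the repeated passes compute. It assumes a square, power-of-two, single-channel image, and the names `Image`, `downsampleHalf` and `reduceToOnePixel` are made up for illustration. Each pass replaces every 2x2 block with its mean, which is the value a linearly filtered fetch centered between four texels returns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a single-channel texture, stored row-major.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;

    float at(std::size_t x, std::size_t y) const { return texels[y * width + x]; }
};

// One downsampling pass: every destination pixel is the mean of a 2x2 source block.
Image downsampleHalf(const Image& src) {
    assert(src.width % 2 == 0 && src.height % 2 == 0);
    Image dst{src.width / 2, src.height / 2, {}};
    dst.texels.resize(dst.width * dst.height);
    for (std::size_t y = 0; y < dst.height; ++y) {
        for (std::size_t x = 0; x < dst.width; ++x) {
            dst.texels[y * dst.width + x] =
                0.25f * (src.at(2 * x, 2 * y) + src.at(2 * x + 1, 2 * y) +
                         src.at(2 * x, 2 * y + 1) + src.at(2 * x + 1, 2 * y + 1));
        }
    }
    return dst;
}

// Repeats the pass until a single pixel is left.
// Assumption: the image is square with a power-of-two size.
float reduceToOnePixel(Image img) {
    assert(img.width == img.height);
    while (img.width > 1) {
        img = downsampleHalf(img);
    }
    return img.texels[0];
}
```

For a power-of-two image the final pixel equals the mean of all source pixels. A real 4K frame (for example 3840x2160) is not a power of two, so the chain would need extra handling for odd sizes that this sketch leaves out.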


13 hours ago, belfegor said:

I thought we didn't need a pixel offset since DX11?

We don't, but here it's set so deliberately to use bilinear filtering for downsampling.

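Aside: You can use the API to do this, though you won't have control over the filtering: ID3D11DeviceContext::GenerateMips. Note the mandatory creation flags for the texture: the D3D11_RESOURCE_MISC_GENERATE_MIPS misc flag, together with both the D3D11_BIND_RENDER_TARGET and D3D11_BIND_SHADER_RESOURCE bind flags.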

I am confused about what to believe now.

For example, I am looking at MJP's shadow sample project, where he downsamples/scales a texture. There are no "pixel offsets" applied, just a bilinear filter:


quad verts

QuadVertex verts[4] =
{
        { XMFLOAT4(1, 1, 1, 1), XMFLOAT2(1, 0) },
        { XMFLOAT4(1, -1, 1, 1), XMFLOAT2(1, 1) },
        { XMFLOAT4(-1, -1, 1, 1), XMFLOAT2(0, 1) },
        { XMFLOAT4(-1, 1, 1, 1), XMFLOAT2(0, 0) }
};


quad vertex shader

VSOutput QuadVS(in VSInput input)
{
    VSOutput output;

    // Just pass it along
    output.PositionCS = input.PositionCS;
    output.TexCoord = input.TexCoord;

    return output;
}


// Uses hw bilinear filtering for upscaling or downscaling
float4 Scale(in PSInput input) : SV_Target
{
    return InputTexture0.Sample(LinearSampler, input.TexCoord);
}
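
One way to check the offset question without a GPU is to reproduce the bilinear fetch on the CPU. The sketch below is plain C++, not D3D11 code, and assumes the D3D10+ convention that texel (i, j) has its center at UV ((i + 0.5) / width, (j + 0.5) / height), with clamp addressing. `Texture`, `sampleBilinear` and `downsampledPixel` are hypothetical names. Under these assumptions, the interpolated 0..1 quad UV at the center of a half-size render target pixel already lands exactly between four source texels, which would explain why the sample above needs no extra offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a single-channel texture with clamp addressing.
struct Texture {
    int width;
    int height;
    std::vector<float> texels;

    float load(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Bilinear fetch, assuming texel centers sit at (i + 0.5) / size in UV space.
float sampleBilinear(const Texture& t, float u, float v) {
    const float x = u * t.width - 0.5f;   // position in texel-center space
    const float y = v * t.height - 0.5f;
    const int x0 = static_cast<int>(std::floor(x));
    const int y0 = static_cast<int>(std::floor(y));
    const float fx = x - x0;              // horizontal blend weight
    const float fy = y - y0;              // vertical blend weight
    const float top = t.load(x0, y0) + (t.load(x0 + 1, y0) - t.load(x0, y0)) * fx;
    const float bottom =
        t.load(x0, y0 + 1) + (t.load(x0 + 1, y0 + 1) - t.load(x0, y0 + 1)) * fx;
    return top + (bottom - top) * fy;
}

// Value written to pixel (px, py) of a half-size render target when the quad
// UVs run from 0 to 1 and are evaluated at the destination pixel center.
float downsampledPixel(const Texture& src, int px, int py) {
    const int dstWidth = src.width / 2;
    const int dstHeight = src.height / 2;
    const float u = (px + 0.5f) / dstWidth;
    const float v = (py + 0.5f) / dstHeight;
    return sampleBilinear(src, u, v);
}
```

With a 4x4 texture holding 0..15 in row-major order, `downsampledPixel(src, 0, 0)` gives 2.5, the mean of texels 0, 1, 4 and 5. Adding a further half source texel to that UV (0.375 instead of 0.25 on both axes) lands on the center of texel (1, 1) and returns 5 instead, so in this model the extra offset shifts the result rather than averaging it.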


Are you interested only in the 1x1 version, or do you need the whole chain? In short, are you computing the average exposure for exposure adaptation, or something else?

If you are only interested in the 1x1 result, as I understand your question, you should forget about the pixel shader. A compute shader that loops over the image and keeps the running average in groupshared memory or in registers will beat the bandwidth cost of writing and reading full surfaces, and you also get rid of the expensive pipeline flush between the reduction passes (caused by going from RTV to SRV).

If you are interested in the full chain, compute can also come out ahead. For example, you can again save on reads by having one compute shader generate 3 mips in a single run, producing the extra 2 in groupshared memory from what it read for the first reduction.

Also forget about the legacy GenerateMips: it is not a hardware feature and usually does a suboptimal job compared to a hand-crafted solution.


3 hours ago, theScore said:

I am only interested in the 1x1 version. Do you have a better method than downsampling for getting the 1x1 texture?

Below is a possible implementation. I don't claim it is the fastest, but it shows the logic clearly and is quite easy to understand. Only profiling and tweaking the group count and the way the texture is traversed will lead to the optimum, but it should already be very fast.

This is just two dispatches with one small intermediate texture of w/8 by 1 pixels. The first pass computes one average per 8-pixel-wide column and writes the value to the intermediate resource; the second pass then computes the average of the columns.

Each pass first computes a local average in each thread, then averages those values across the group through groupshared storage, and finally writes the result if it is the first thread in the group.

There may be errors in the code, since I did not test it, but it should be quite close.

EDIT: On hold. The missing float atomics on PC make this a little harder to implement than on PS4/Xbox One; it needs some adjustment, and I will fix that later :(


// I assume the original image has dimensions that are multiples of 8, for clarity
// you will create a texture of dimension [w/8, 1] of type float with uav/srv binding, call it Columns
// you will create a texture of dimension [1,1] of type float with uav/srv binding, call it Result

// At runtime :
// SetCompute 1
// Set Columns to U0
// Set SourceImage to T0
// Dispatch( width / 8, 1, 1);
// SetCompute 2
// Set Columns to T0
// Set Result to U0
// Dispatch( 1, 1, 1 );
// Voilà

// Common.hlsli
float Lum( float3 rgb ) { return dot(rgb,float3(0.25,0.60,0.15)); }

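// Pass1.hlsl
#include "Common.hlsli"
Texture2D<float3> sourceImage : register(t0);
RWTexture2D<float> columns : register(u0);

// HLSL on PC has no float atomic add (see the EDIT above), so the per-thread
// values are summed through a groupshared array instead of an interlocked add.
groupshared float partialSums[64];

[numthreads(8, 8, 1)]
void main(uint2 GTid : SV_GroupThreadID, uint gidx : SV_GroupIndex, uint2 Gid : SV_GroupID) {
	uint2 dim;
	sourceImage.GetDimensions(dim.x, dim.y);

	// Each thread averages its own pixel of every 8x8 tile in this 8-pixel-wide column.
	uint rowCount = dim.y / 8;
	float tmp = 0.f;
	for (uint row = 0; row < rowCount; ++row)
		tmp += Lum(sourceImage[ GTid + uint2(Gid.x,row) * 8 ]) / float(rowCount); // this uses operator[]; you can try a sampler + SampleLevel to hit half-pixel UVs here.

	partialSums[gidx] = tmp / 64.f;
	GroupMemoryBarrierWithGroupSync(); // all 64 slots are written

	// Tree reduction: halve the number of active threads each step.
	for (uint stride = 32; stride > 0; stride >>= 1) {
		if (gidx < stride)
			partialSums[gidx] += partialSums[gidx + stride];
		GroupMemoryBarrierWithGroupSync();
	}

	if (gidx == 0)
		columns[uint2(Gid.x, 0)] = partialSums[0];
}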

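// Pass2.hlsl
#include "Common.hlsli"
Texture2D<float> columns : register(t0);
RWTexture2D<float> average : register(u0);

// Same groupshared reduction as pass 1, since float atomics are not available on PC.
groupshared float partialSums[64];

[numthreads(64, 1, 1)]
void main(uint gidx : SV_GroupIndex) {
	uint2 dim;
	columns.GetDimensions(dim.x, dim.y);

	// Each thread sums every 64th column average, starting at its own index.
	float tmp = 0.f;
	for (uint col = gidx; col < dim.x; col += 64)
		tmp += columns[uint2(col, 0)];

	partialSums[gidx] = tmp;
	GroupMemoryBarrierWithGroupSync(); // all 64 slots are written

	for (uint stride = 32; stride > 0; stride >>= 1) {
		if (gidx < stride)
			partialSums[gidx] += partialSums[gidx + stride];
		GroupMemoryBarrierWithGroupSync();
	}

	if (gidx == 0)
		average[uint2(0, 0)] = partialSums[0] / dim.x;
}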
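
As a reference to compare GPU output against, the two-pass logic described above can be mirrored on the CPU. This is only a sketch in plain C++, assuming the luminance values were already computed and the width is a multiple of 8; `averageColumnStrips` and `averageOfStrips` are hypothetical names, not part of any API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1 equivalent: one average per 8-pixel-wide column strip.
// `lum` holds width * height luminance values in row-major order.
std::vector<float> averageColumnStrips(const std::vector<float>& lum,
                                       std::size_t width, std::size_t height) {
    assert(width % 8 == 0 && height > 0 && lum.size() == width * height);
    std::vector<float> strips(width / 8, 0.0f);
    for (std::size_t s = 0; s < strips.size(); ++s) {
        double sum = 0.0;  // accumulate in double to limit rounding error
        for (std::size_t y = 0; y < height; ++y) {
            for (std::size_t x = s * 8; x < s * 8 + 8; ++x) {
                sum += lum[y * width + x];
            }
        }
        strips[s] = static_cast<float>(sum / (8.0 * static_cast<double>(height)));
    }
    return strips;
}

// Pass 2 equivalent: the average of the strip averages.
float averageOfStrips(const std::vector<float>& strips) {
    assert(!strips.empty());
    double sum = 0.0;
    for (float value : strips) {
        sum += value;
    }
    return static_cast<float>(sum / static_cast<double>(strips.size()));
}
```

Because every strip covers the same number of pixels, the average of the strip averages equals the average over the whole image, which is why the intermediate w/8 by 1 texture loses nothing.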


