Summed Area Table for variance Shadow Mapping, again...

0 comments, last by Tsus 11 years, 6 months ago
Hello guys,

For a couple of weeks I have been trying to implement a summed area table (SAT) for my shadow map, as described in GPU Gems 3. I found a lot of articles and discussions about it here and via Google, but I can't find an answer to my problem.
I'm using DirectX 11 with DXUT and have written an HLSL shader which generates a SAT image from the shadow map.

C++:
[source lang="cpp"]RenderableTexture* renderSAT::createSAT(ID3D11DeviceContext* pd3dContext, RenderableTexture* Input)
{
    shaderEffect->GetVariableByName("Samples")->AsScalar()->SetInt(SATRenderSamples);
    pd3dContext->RSSetViewports(1, &m_Viewport);
    m_original = Input;

    // Horizontal pass
    for (int i = 1; i < imgWidth; i *= SATRenderSamples)
    {
        int PassOffset[2] = { i, 0 };

        int done = i / SATRenderSamples;
        D3D11_RECT region = { done, 0, imgWidth, imgHeight };
        // Only the first pass reads the original shadow map
        m_sourceEV->SetResource(i == 1 ? m_original->GetShaderResource() : m_src->GetShaderResource());
        shaderEffect->GetVariableByName("PassOffset")->AsVector()->SetIntVector(PassOffset);
        // Avoids warning
        pd3dContext->OMSetRenderTargets(0, NULL, NULL);
        SATRenderTechnique->GetPassByIndex(0)->Apply(0, pd3dContext);

        ID3D11RenderTargetView* pRTV = m_dest->GetRenderTarget();
        pd3dContext->OMSetRenderTargets(1, &pRTV, NULL);
        pd3dContext->RSSetScissorRects(1, &region);
        pd3dContext->IASetInputLayout(NULL);
        pd3dContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_POINTLIST);
        SATRenderTechnique->GetPassByIndex(0)->Apply(0, pd3dContext);
        pd3dContext->Draw(1, 0);
        // Swap pointers (ping pong)
        std::swap(m_dest, m_src);
    }

    // Vertical pass
    for (int i = 1; i < imgHeight; i *= SATRenderPasses)
    {
        int PassOffset[2] = { 0, i };

        int done = i / SATRenderSamples;
        D3D11_RECT region = { 0, done, imgWidth, imgHeight };
        ID3D11ShaderResourceView* source = m_src->GetShaderResource();
        m_sourceEV->SetResource(source);
        shaderEffect->GetVariableByName("PassOffset")->AsVector()->SetIntVector(PassOffset);
        // Avoids warning
        pd3dContext->OMSetRenderTargets(0, NULL, NULL);
        SATRenderTechnique->GetPassByIndex(0)->Apply(0, pd3dContext);

        ID3D11RenderTargetView* pRTV = m_dest->GetRenderTarget();
        pd3dContext->OMSetRenderTargets(1, &pRTV, NULL);
        pd3dContext->RSSetScissorRects(1, &region);
        pd3dContext->IASetInputLayout(NULL);
        pd3dContext->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_POINTLIST);
        SATRenderTechnique->GetPassByIndex(0)->Apply(0, pd3dContext);
        pd3dContext->Draw(1, 0);
        // Swap pointers (ping pong)
        std::swap(m_dest, m_src);
    }
    return m_dest;
}
[/source]
And the pixel shader (it runs on a full-screen quad):
[source lang="cpp"]float4 generateSAT(Vertex In) : SV_Target0
{
    float2 dimension;
    Source.GetDimensions(dimension.x, dimension.y);
    float2 currentOffset = float2(PassOffset) / dimension; // (1, 0)/dimension or (0, 1)/dimension
    float4 result;
    for (int i = 0; i < Samples; i++)
    {
        result += Source.SampleLevel(samBorderPoint, In.Tex - i * currentOffset, 0);
    }
    return result;
}
[/source]

But when I sample the SAT image with a 1x1 filter, I get this beautiful image, even though the image used for the generation is displayed correctly.
[attachment=11469:Unbenannt.PNG]
Left: SAT with 1x1 filter. Right: original shadow map.
For drawing on screen I use a border color of float4(0,0,0,1).
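For anyone comparing against a known-good result: a 1x1 filter on a correct SAT should return the original texel, so a small CPU reference can help to narrow down where the GPU output goes wrong. The following is only a minimal sketch, not the code from this thread. It assumes an inclusive SAT and out-of-range reads that return 0, and the `Image` struct, `buildSat` and `boxSum` are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a single-channel, row-major float image.
struct Image {
    int width;
    int height;
    std::vector<float> data;

    // Out-of-range reads return 0 (assumption: a border value of zero).
    float at(int x, int y) const {
        if (x < 0 || y < 0 || x >= width || y >= height) return 0.0f;
        return data[y * width + x];
    }
};

// Inclusive SAT: sat(x, y) is the sum of src over [0..x] x [0..y].
Image buildSat(const Image& src) {
    assert(src.data.size() == static_cast<std::size_t>(src.width) * src.height);
    Image sat{src.width, src.height, std::vector<float>(src.data.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            // The left, upper and upper-left entries are already filled in.
            sat.data[y * src.width + x] = src.at(x, y) + sat.at(x - 1, y)
                                        + sat.at(x, y - 1) - sat.at(x - 1, y - 1);
        }
    }
    return sat;
}

// Sum of the source over the inclusive rectangle [x0..x1] x [y0..y1],
// computed from four SAT lookups.
float boxSum(const Image& sat, int x0, int y0, int x1, int y1) {
    return sat.at(x1, y1) - sat.at(x0 - 1, y1)
         - sat.at(x1, y0 - 1) + sat.at(x0 - 1, y0 - 1);
}
```

With this, `boxSum(sat, x, y, x, y)` is the 1x1 filter. If a read-back of the GPU SAT differs from `buildSat` only in certain rows or columns, that points to the passes (or scissor regions) that touch them.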

I think it looks as if the scissor rect doesn't fit, but I can't imagine why.
Does anyone know what I could do to solve this problem?
Hi,

The way you compute your scan is quite expensive. I don't see the bug in your code, but I would approach it differently anyway. You could compute the complete SAT with only two compute shaders by using one vertical and one horizontal scan pass, as described by Harris et al.
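To illustrate what the two passes compute, here is a minimal CPU sketch in C++ (sequential, so it shows only the result and not the parallel algorithm). It assumes a single-channel, row-major image and inclusive sums; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Horizontal pass: inclusive prefix sum along every row, in place.
void scanRowsInPlace(std::vector<float>& img, int width, int height) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; ++y)
        for (int x = 1; x < width; ++x)
            img[y * width + x] += img[y * width + x - 1];
}

// Vertical pass: inclusive prefix sum along every column, in place.
void scanColumnsInPlace(std::vector<float>& img, int width, int height) {
    assert(img.size() == static_cast<std::size_t>(width) * height);
    for (int y = 1; y < height; ++y)
        for (int x = 0; x < width; ++x)
            img[y * width + x] += img[(y - 1) * width + x];
}

// One horizontal scan followed by one vertical scan yields the SAT.
std::vector<float> summedAreaTable(std::vector<float> img, int width, int height) {
    scanRowsInPlace(img, width, height);
    scanColumnsInPlace(img, width, height);
    return img;
}
```

The order of the two passes does not matter for the result, since the sums along x and along y are independent of each other.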

Have a look at NVIDIA's GPU Computing SDK. It contains a sample of the parallel scan (from Harris) in CUDA and OpenCL. Translating it to DirectCompute should be straightforward.

Here is a naive scan implementation I used in CUDA to implement a SAT filter. I'm sure you can translate it to DirectCompute. (It's also good practice, since a lot of useful code is currently only written in CUDA.)

Naive parallel scan (no up and down sweep):

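__global__ void scan_naive(float *g_odata, float *g_idata)
{
    // Shared memory for the scan: two buffers of N floats (double buffering)
    __shared__ float temp[2*N];
    int thid = threadIdx.x;   // pixel index within the row
    int bid = blockIdx.x;     // row index
    int pout = 0;
    int pin = 1;
    // Cache the computational window in shared memory.
    // The input is shifted right by one element, so the result is an exclusive scan.
    temp[pout*N + thid] = (thid > 0) ? g_idata[bid * N + thid-1] : 0;
    for (int offset = 1; offset < N; offset *= 2)
    {
        // Swap the roles of the two buffers
        pout = 1 - pout;
        pin = 1 - pout;
        __syncthreads();
        // Copy the own element, then add the element 'offset' positions to the left
        temp[pout*N+thid] = temp[pin*N+thid];
        if (thid >= offset)
            temp[pout*N+thid] += temp[pin*N+thid - offset];
    }
    __syncthreads();
    // Write the scanned row back to global memory
    g_odata[bid * N + thid] = temp[pout*N+thid];
}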

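N is the image width, a compile-time constant. The kernel expects one thread block per image row and N threads per block.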
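To check a DirectCompute port against known values, the loop above can be emulated on the CPU. This is only a sketch for a single row, with a made-up function name. The inner loop over `t` stands in for the threads of one block, and the two vectors stand in for the two halves of the shared memory array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the naive double-buffered scan for one row.
// Like the kernel, it produces an exclusive scan: element t holds the
// sum of all inputs before position t.
std::vector<float> scanNaiveCpu(const std::vector<float>& in) {
    const std::size_t n = in.size();
    std::vector<float> buf[2] = {std::vector<float>(n), std::vector<float>(n)};
    int pout = 0;
    int pin = 1;

    // Load the input shifted right by one element.
    for (std::size_t t = 0; t < n; ++t)
        buf[pout][t] = (t > 0) ? in[t - 1] : 0.0f;

    for (std::size_t offset = 1; offset < n; offset *= 2) {
        // Swap the roles of the buffers, then let every "thread" t run.
        pout = 1 - pout;
        pin = 1 - pout;
        for (std::size_t t = 0; t < n; ++t) {
            buf[pout][t] = buf[pin][t];
            if (t >= offset)
                buf[pout][t] += buf[pin][t - offset];
        }
    }
    return buf[pout];
}
```

Because the scan is exclusive, each entry excludes its own pixel. A SAT built from it is shifted by one texel compared to an inclusive one, which matters when computing the corner lookups of a filter.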

And here is a simple kernel that transposes an image, so that the same scan code can be used for both the horizontal and the vertical pass (assuming width == height).

__global__ void transpose(float *g_odata, float *g_idata)
{
    int thid = threadIdx.x;
    int bid = blockIdx.x;
    g_odata[bid * N + thid] = g_idata[thid * N + bid];
}
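The order of operations is then: scan the rows, transpose, scan the rows again, transpose back. A minimal CPU sketch of that sequence in C++ follows. The function names are made up for illustration, the image is assumed to be square and single-channel, and the row scan is an inclusive one for simplicity (the kernel above is exclusive).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Transpose of a square n x n row-major image.
std::vector<float> transposeSquare(const std::vector<float>& in, int n) {
    assert(in.size() == static_cast<std::size_t>(n) * n);
    std::vector<float> out(in.size());
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            out[row * n + col] = in[col * n + row];
    return out;
}

// Inclusive prefix sum along every row, in place.
void inclusiveScanRows(std::vector<float>& img, int n) {
    assert(img.size() == static_cast<std::size_t>(n) * n);
    for (int row = 0; row < n; ++row)
        for (int col = 1; col < n; ++col)
            img[row * n + col] += img[row * n + col - 1];
}

// SAT using only a row scan and a transpose.
std::vector<float> satViaTranspose(std::vector<float> img, int n) {
    inclusiveScanRows(img, n);      // horizontal pass
    img = transposeSquare(img, n);  // columns become rows
    inclusiveScanRows(img, n);      // vertical pass, run as a row scan
    return transposeSquare(img, n); // back to the original orientation
}
```

The second transpose can be skipped if the lookup code swaps x and y instead.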


Also, consider using the Load intrinsic when you read exact texels. It bypasses texture filtering and is therefore a bit faster.

Best regards!
