Depth pre-pass: worth it?

19 comments, last by lipsryme 11 years ago

I'm currently building my new deferred-shading-based renderer and have been testing a depth pre-pass for opaque geometry.

I haven't implemented the GBuffer pass yet (rendering to the backbuffer atm), so I guess the benefit will be greater later on. I was testing instancing with this for 50k textured cubes (full input = Pos/UV/Normal/Tangent/BiTangent) with anisotropic filtering on. Here are my results:

Without depth pre pass:

actual rasterized geometry pass = ~ 2.7ms

With depth pre pass:

pre pass = ~ 2.0ms

actual rasterized geometry pass = ~ 2.4ms

So in total that's ~4.4ms with the pre pass versus ~2.7ms without, i.e. it's costing me more than a third of my performance.
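Spelling out the arithmetic behind those timings, as a minimal sketch (the struct and function names are made up for illustration; the numbers plugged in afterwards are the ones measured above):

```cpp
#include <cassert>

// Timings in milliseconds for one frame's opaque geometry.
struct PassTimings
{
    double mainPassAloneMs;     // geometry pass without a pre-pass
    double prePassMs;           // depth-only pass
    double mainPassAfterPreMs;  // geometry pass with the depth buffer primed
};

// Total cost of the opaque geometry when the pre-pass is enabled.
double TotalWithPrePassMs(const PassTimings& t)
{
    return t.prePassMs + t.mainPassAfterPreMs;
}

// Positive result: the pre-pass costs more than it saves.
double PrePassNetCostMs(const PassTimings& t)
{
    return TotalWithPrePassMs(t) - t.mainPassAloneMs;
}

// The pre-pass only pays off once the main pass gets cheaper by more than
// the pre-pass itself costs.
bool PrePassPaysOff(const PassTimings& t)
{
    assert(t.prePassMs >= 0.0);
    return PrePassNetCostMs(t) < 0.0;
}
```

With 2.7, 2.0 and 2.4 plugged in, the total is 4.4ms and the net cost is 1.7ms, roughly 63% more time than skipping the pre-pass. The main pass would have to drop below 0.7ms for the pre-pass to break even at its current cost.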

I'm already doing a very lightweight pre pass.

I've split the vertex information so I can transfer only the vertex position data.

Unfortunately, this means that I'm setting multiple vertex buffers for the actual rasterization pass... could this be the culprit?

I've set the pixel shader to NULL and disabled color writes.

And the shader itself only transforms vertex positions:


#pragma pack_matrix( row_major )



// Single big buffer to store instance transforms
Buffer<float4> InstanceTransformBuffer : register(t0);


// Constant buffers
cbuffer InstanceTransformsAccessBuffer : register(b0)
{
	float startIndex : packoffset(c0.x);
	float elementsPerInstance : packoffset(c0.y);

	float4x4 ViewProjection : packoffset(c1);
};


struct VSI
{
	float4 Position		: POSITION;
	uint InstanceID		: SV_InstanceID;
};


struct VSO
{
	float4 Position : SV_POSITION;
};




float4x4 GetInstanceTransform(uint instID, uint offset)
{
	uint BufferOffset = instID * elementsPerInstance + startIndex + offset;

	float4 c0 = InstanceTransformBuffer.Load(BufferOffset + 0);
	float4 c1 = InstanceTransformBuffer.Load(BufferOffset + 1);
	float4 c2 = InstanceTransformBuffer.Load(BufferOffset + 2);
	float4 c3 = float4(0.0f, 0.0f, 0.0f, 1.0f);

	float4x4 _World = { c0.x, c1.x, c2.x, c3.x,
						c0.y, c1.y, c2.y, c3.y,
						c0.z, c1.z, c2.z, c3.z,
						c0.w, c1.w, c2.w, c3.w };

	return _World;
}





VSO VS(VSI input)
{
	VSO output = (VSO)0;

	float4x4 World = GetInstanceTransform(input.InstanceID, 0);
	float4x4 WVP = mul(World, ViewProjection);
	output.Position = mul(input.Position, WVP);

	return output;
}

My render function looks like this:


void RendererD3D11::RenderGBuffer(const unsigned int drawcalls,
								  const unsigned int* culledSceneIDs)
{
	// Get instance description
	InstanceGroupDescription* instanceGroup = this->contentManager->GetPtrToOpaqueInstanceGroupDesc(drawcalls);

	unsigned int numInstances = 0; // Keeps track of how many instances we actually want to draw of this group
	if(instanceGroup->entityType == SceneList::Primitive)
	{
		D3D11_MAPPED_SUBRESOURCE instanceBufferProperties;

		// Map the instance transform buffer so it can be written to
		this->deviceContext->Map(this->instanceTransformBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &instanceBufferProperties);

		// Get a pointer to the data in the instance transform buffer.
		XMFLOAT4* pInstanceData = (XMFLOAT4*)instanceBufferProperties.pData;


		// Go through each sceneInstance inside this instance group
		for(size_t i = 0; i < instanceGroup->instanceSceneIDSize; ++i)
		{
			if(culledSceneIDs[instanceGroup->instanceSceneIDs[i]] != 0)
			{
				// Get ScenePrimitiveDescription
				ScenePrimitiveDescription* scenePrimitive = &this->sceneManager->GetCurrentScene()->GetDesc()->primitives[instanceGroup->instanceSceneIDs[i]];

				// Update buffer: store the first three rows of the transposed world matrix
				XMMATRIX worldTransform = XMMatrixTranspose(XMLoadFloat4x4(&scenePrimitive->worldTransform));
				for(int u = 0; u < 3; u++)
				{
					XMStoreFloat4(&pInstanceData[(numInstances * 3) + u], worldTransform.r[u]);
				}

				// This instance should be drawn since it was not culled.
				numInstances++;	
			}
		}

		// Unmap the instance transform buffer
		this->deviceContext->Unmap(this->instanceTransformBuffer, 0);

		D3D11_MAPPED_SUBRESOURCE mappedResourceProperties;

		// Lock the constant buffer so it can be written to
		this->deviceContext->Map(this->cbInstanceTransformAccessBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResourceProperties);

		// Get a pointer to the data in the constant buffer.
		cbsAccessInstanceTransforms* pData = (cbsAccessInstanceTransforms*)mappedResourceProperties.pData;

		// Copy the matrices into the constant buffer
		XMStoreFloat4x4(&pData->ViewProjection, mainCamera->ViewProjectionMatrix());

		pData->elementsPerInstance = 3;
		pData->startIndex = 0;
		pData->padding = XMFLOAT2(0.0f, 0.0f);


		// Unlock the constant buffer
		this->deviceContext->Unmap(this->cbInstanceTransformAccessBuffer, 0);

		// Set Constant buffer
		this->deviceContext->VSSetConstantBuffers(0, 1, &this->cbInstanceTransformAccessBuffer);

		// Set Shader Resource View for instance transforms
		ID3D11ShaderResourceView* resource = this->instanceTransformBuffer_SRV;
		this->deviceContext->VSSetShaderResources(0, 1, &resource);

		// Set BlendState
		this->renderStateContext.SetBlendState(RenderStateDesc::ColorWriteDisabled, &blendStates, deviceContext);

		// Set DepthStencilState
		this->renderStateContext.SetDepthStencilState(RenderStateDesc::DepthWriteEnabled, &this->depthStencilStates, this->deviceContext);

		// Depth Pre-Pass
		GenericShader* depthPrePass_shader = this->contentManager->GetShader(ShaderFile::DepthPrePass, this->device);
		if(depthPrePass_shader)
		{
			this->deviceContext->VSSetShader(depthPrePass_shader->VS, 0, 0);
			this->deviceContext->PSSetShader(NULL, 0, 0);
		}

		ID3D11DepthStencilView* main_DSV = this->mainDSV;
		deviceContext->OMSetRenderTargets(0, NULL, main_DSV);

		// Instanced Draw Call (DepthPrePass)
		if(numInstances > 0)
		{
			this->contentManager->GetPrimitiveFromPool(instanceGroup->groupID)->Draw(this->depthOnlyInputLayout, this->device,
																					 this->deviceContext, numInstances, true);
		}

		// Get ScenePrimitiveDescription
		ScenePrimitiveDescription* scenePrimitive = &this->sceneManager->GetCurrentScene()->GetDesc()->primitives[instanceGroup->instanceSceneIDs[0]];

		// Set DiffuseMap
		ID3D11ShaderResourceView* diffuseMap_SRV = this->contentManager->GetTextureFromPool(scenePrimitive->material.diffuseMap.ID)->GetResource();
		this->deviceContext->PSSetShaderResources(1, 1, &diffuseMap_SRV);

		// Set NormalMap
		ID3D11ShaderResourceView* normalMap_SRV = this->contentManager->GetTextureFromPool(scenePrimitive->material.normalMap.ID)->GetResource();
		this->deviceContext->PSSetShaderResources(2, 1, &normalMap_SRV);

		// Set SpecularMap
		ID3D11ShaderResourceView* specularMap_SRV = this->contentManager->GetTextureFromPool(scenePrimitive->material.specularMap.ID)->GetResource();
		this->deviceContext->PSSetShaderResources(3, 1, &specularMap_SRV);

		// Set SamplerStates
		this->renderStateContext.SetSamplerState(RenderStateDesc::Anisotropic, &this->samplerstates, this->deviceContext, 0);
		this->renderStateContext.SetSamplerState(RenderStateDesc::Linear, &this->samplerstates, this->deviceContext, 1);

		// Set BlendState
		this->renderStateContext.SetBlendState(RenderStateDesc::BlendDisabled, &blendStates, deviceContext);

		// Set DepthStencilState
		this->renderStateContext.SetDepthStencilState(RenderStateDesc::DepthEnabled, &this->depthStencilStates, this->deviceContext);

		// Set FillMode
		if(this->isWireframe)
		{
			this->renderStateContext.SetRasterizerState(RenderStateDesc::Wireframe, &rasterizerStates, deviceContext);
		}
		else
		{
			this->renderStateContext.SetRasterizerState(RenderStateDesc::BackFaceCull, &rasterizerStates, deviceContext);
		}

		// Set GBuffer shader
		GenericShader* gbuffer_shader = this->contentManager->GetShader(ShaderFile::GBuffer, this->device);
		if(gbuffer_shader)
		{
			this->deviceContext->VSSetShader(gbuffer_shader->VS, 0, 0);
			this->deviceContext->PSSetShader(gbuffer_shader->PS, 0, 0);
		}

		ID3D11RenderTargetView* backBuffer_RTV = this->backBufferRTV;
		deviceContext->OMSetRenderTargets(1, &backBuffer_RTV, main_DSV);

		// Instanced Draw Call (GBuffer)
		if(numInstances > 0)
		{
			this->contentManager->GetPrimitiveFromPool(instanceGroup->groupID)->Draw(this->defaultInputLayout, this->device,
																					 this->deviceContext, numInstances, false);
		}
	}
	else
	{
		// StaticMesh in here...
	}

}

As with anything graphics related, this is usually going to be a hit-and-miss subject. A depth pre-pass tends to give you the best results when you have a lot of overdraw and complex shaders. But if your scenes are fairly well culled and have low overdraw, it can be a loss of performance because the extra pass eats up memory bandwidth. You are typically fairly safe to keep it in there, as bandwidth is not normally a problem on modern cards, the mobile variants usually excepted. I'd actually hook it up as a switch if at all possible and just test later.

As to the buffer sends, that can of course be a culprit. The specific thing I remember from doing this a while back was the very notable performance gain we achieved by breaking the positional portion away from the other attributes using multiple streams. It looks like you already have that, so I can't really say much beyond the generalization above.

Alright I'm gonna keep it and from time to time test if it's worth it.

By the way, is it necessary to set the RenderTarget to NULL if you have no pixel shader set and color writes disabled?

Alright I'm gonna keep it and from time to time test if it's worth it.

That's always been my approach, don't throw things out till proven with "real" content if they should go away.

By the way, is it necessary to set the RenderTarget to NULL if you have no pixel shader set and color writes disabled?

I wish I could tell you, but this is getting into specifics that I just don't remember much of. Sorry, hopefully some DX guru will pass by and drop a dollop of knowledge in this area. :)

A great graphics programmer once said "a z-prepass is a day-to-day decision, not a lifestyle choice". I would recommend making it easy to turn on and off, and constantly profile to see if it's worth it for your current scene/shaders/renderer configuration/resolution/etc.
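One way to keep the pre-pass easy to switch on and off is to build the list of opaque passes from a settings flag each frame. The following is only a sketch: `RenderSettings`, `PassDesc` and `BuildOpaquePasses` are hypothetical names, not part of the renderer posted above, and turning depth writes off in the main pass assumes both passes produce identical clip-space positions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical settings block; in practice the flag would be bound to a debug
// key or console variable so it can be flipped while a profiler is attached.
struct RenderSettings
{
    bool depthPrePassEnabled = true;
};

// Hypothetical description of one opaque pass.
struct PassDesc
{
    std::string name;
    bool depthWrite;
    bool colorWrite;
};

// Returns the opaque passes for this frame in submission order.
std::vector<PassDesc> BuildOpaquePasses(const RenderSettings& settings)
{
    std::vector<PassDesc> passes;
    if (settings.depthPrePassEnabled)
    {
        // Depth only: writes depth, leaves the color targets untouched.
        passes.push_back(PassDesc{"DepthPrePass", true, false});
    }

    // With a primed depth buffer the main pass only needs to test depth.
    // Without the pre-pass it has to write depth itself.
    passes.push_back(PassDesc{"GBuffer", !settings.depthPrePassEnabled, true});

    // The last pass is always the one that produces color output.
    assert(passes.back().colorWrite);
    return passes;
}
```

Because the flag is read every frame, the same scene can be profiled with and without the pre-pass without restarting the application.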

A depth pre-pass can make sense in cases where you've got potentially lots of overdraw and where you can't get reasonable front-to-back sorting for your opaque geometry; in other words where the overhead of not doing it is greater than the overhead of doing it. It's definitely not a general-case solution.
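For comparison, the front-to-back sorting mentioned here can be as small as the sketch below. `DrawItem` and `viewDepth` are assumed names; the depth would typically come out of the culling step.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-draw record; viewDepth is the distance from the camera
// along its forward axis.
struct DrawItem
{
    unsigned int sceneID;
    float viewDepth;
};

// Orders two draws nearest-first.
bool NearerFirst(const DrawItem& a, const DrawItem& b)
{
    return a.viewDepth < b.viewDepth;
}

// Sorts opaque draws nearest-first so the depth test can reject hidden
// pixels early without a separate depth pass.
void SortFrontToBack(std::vector<DrawItem>& items)
{
    // stable_sort keeps the submission order of draws at equal depth.
    std::stable_sort(items.begin(), items.end(), NearerFirst);
    assert(std::is_sorted(items.begin(), items.end(), NearerFirst));
}
```

With instancing, sorting whole instance groups by their nearest member is a coarser but cheaper variant of the same idea.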

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

If you have some memory to spare, then it's worth building a position-only VB and IB for each mesh. As well as storing just the position component of the original mesh, it can usually contain far fewer vertices.

Think of the case of cubes with hard faces. This requires 24 vertices in total (6 faces * 4 vertices). However, the position-only mesh requires just 8 vertices. This cuts down on the bandwidth for vertex fetching, needs fewer transforms and makes better use of the post-transform cache.
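A rough sketch of how such a position-only mesh could be derived from the full one, assuming vertices are merged only when their positions match exactly. `Float3`, `PositionOnlyMesh` and `BuildPositionOnlyMesh` are illustrative names, not an existing API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct Float3 { float x, y, z; };

// Position-only geometry for the depth pre-pass.
struct PositionOnlyMesh
{
    std::vector<Float3> positions;
    std::vector<std::uint32_t> indices;
};

// Welds vertices that share an exact position. 'positions' holds the position
// of every vertex of the full mesh, 'indices' is the full mesh's index buffer.
PositionOnlyMesh BuildPositionOnlyMesh(const std::vector<Float3>& positions,
                                       const std::vector<std::uint32_t>& indices)
{
    PositionOnlyMesh out;
    std::map<std::tuple<float, float, float>, std::uint32_t> seen;
    std::vector<std::uint32_t> remap(positions.size());

    for (std::size_t i = 0; i < positions.size(); ++i)
    {
        const Float3& p = positions[i];
        const auto key = std::make_tuple(p.x, p.y, p.z);
        auto it = seen.find(key);
        if (it == seen.end())
        {
            // First time this position shows up: give it a new slot.
            const auto slot = static_cast<std::uint32_t>(out.positions.size());
            it = seen.emplace(key, slot).first;
            out.positions.push_back(p);
        }
        remap[i] = it->second;
    }

    // Rewrite the index buffer to point at the welded vertices.
    out.indices.reserve(indices.size());
    for (std::uint32_t index : indices)
    {
        assert(index < remap.size());
        out.indices.push_back(remap[index]);
    }
    return out;
}
```

For a hard-edged cube this turns 24 vertices into 8 while the index count stays at 36. Running a vertex cache optimiser over the result afterwards is still worthwhile.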

With a separate position-only VB and IB, the original mesh can then stay in a standard interleaved format.

For the above, don't forget to optimise for the post-transform and pre-transform caches, and use a position-only declaration and a VS that only transforms position.

Secondly, when doing your frustum culling pass, use a different frustum for the prepass with a much closer far plane, so that far fewer meshes are accepted for drawing. There is little gain in drawing meshes that are far away: firstly, they are likely to cover few pixels on the screen; secondly, they are less likely to occlude many pixels; and thirdly, HiZ buffers tend to really lose precision at far distances. It is not uncommon to set the far plane as close as, say, 150m.

You can also cull meshes from the prepass using some heuristics. For example, don't even bother considering meshes which are unlikely to cover many screen pixels.
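Both ideas (a closer far plane for the prepass and a screen-coverage heuristic) could be combined into one filter over the already-culled meshes, roughly like this. All names are hypothetical, the pixel threshold is an arbitrary starting value, and the projected size is a bounding-sphere approximation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bounding info produced by the regular frustum culling pass.
struct MeshBounds
{
    unsigned int sceneID;
    float viewDepth;  // distance along the camera's forward axis, in metres
    float radius;     // bounding-sphere radius, in metres
};

struct PrePassCullSettings
{
    float farPlane = 150.0f;          // much closer than the main far plane
    float minScreenRadiusPx = 16.0f;  // assumed threshold, tune per scene
    float projScale = 1.0f;           // 0.5 * viewportHeight / tan(0.5 * fovY)
};

// Approximate on-screen radius, in pixels, of a bounding sphere.
float ScreenRadiusPx(const MeshBounds& b, float projScale)
{
    assert(b.viewDepth > 0.0f);
    return b.radius * projScale / b.viewDepth;
}

// Picks the meshes worth drawing in the depth pre-pass.
std::vector<unsigned int> SelectPrePassMeshes(const std::vector<MeshBounds>& visible,
                                              const PrePassCullSettings& s)
{
    std::vector<unsigned int> selected;
    for (const MeshBounds& b : visible)
    {
        const float nearestDepth = b.viewDepth - b.radius;

        // Entirely beyond the shortened far plane: skip.
        if (nearestDepth > s.farPlane)
            continue;

        // Too small on screen to occlude much: skip. Meshes that reach the
        // camera plane (nearestDepth <= 0) are always kept.
        if (nearestDepth > 0.0f && ScreenRadiusPx(b, s.projScale) < s.minScreenRadiusPx)
            continue;

        selected.push_back(b.sceneID);
    }
    return selected;
}
```

Everything that is rejected here still gets drawn in the base pass; it just does not contribute to the primed depth buffer.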

I usually find it a gain not to draw alpha-tested objects in the prepass, but to ensure they get drawn first in the base pass. That way they benefit from the opaque objects already in the depth buffer from the prepass, and they update the depth buffer before the opaque objects are drawn in the base pass.
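That ordering could be planned like this; it is a sketch with made-up types (`BasePassDraw`, `FramePlan`), not code from an actual engine.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical draw record.
struct BasePassDraw
{
    unsigned int sceneID;
    bool alphaTested;
};

// Scene IDs to submit in each pass, in order.
struct FramePlan
{
    std::vector<unsigned int> prePass;
    std::vector<unsigned int> basePass;
};

bool IsAlphaTested(const BasePassDraw& d)
{
    return d.alphaTested;
}

// Takes the draw list by value so the caller's order is left untouched.
FramePlan PlanOpaquePasses(std::vector<BasePassDraw> draws)
{
    // Move alpha-tested draws to the front; relative order is preserved
    // within both groups.
    const auto firstOpaque =
        std::stable_partition(draws.begin(), draws.end(), IsAlphaTested);

    FramePlan plan;

    // Pre-pass: fully opaque draws only.
    for (auto it = firstOpaque; it != draws.end(); ++it)
        plan.prePass.push_back(it->sceneID);

    // Base pass: alpha-tested draws first, then the opaque ones.
    for (const BasePassDraw& d : draws)
        plan.basePass.push_back(d.sceneID);

    assert(plan.basePass.size() == draws.size());
    return plan;
}
```

The alpha-tested draws need depth writes enabled in the base pass, since they were not part of the prepass.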

I'll add to this: in your depth pre-pass, do not under any circumstances output depth from your fragment/pixel shader. Output any arbitrary colour, and use glColorMask/D3DRS_COLORWRITEENABLE/ID3D10BlendState/ID3D11BlendState to control whether or not you write to the color buffer instead. Beware of APIs where the color write mask affects whether or not the color buffer is cleared. If you want everything to start off black in order to accumulate light, it may be more efficient to clear to black and disable color writes than to enable color writes and output black from your shader.


For a depth prepass you shouldn't even be using a pixel/fragment shader at all unless you're rendering something that's alpha-tested.

I've split the vertex information so I can transfer only the vertex position data.
Unfortunately, this means that I'm setting multiple vertex buffers for the actual rasterization pass... could this be the culprit?

My tests with a z-only pass are old and relate to D3D9 hardware. The main problem for me was the draw calls. On that hardware, using as many draw calls for z as I used for standard rendering was nonsense. I'm surprised, however, that this is still the case: it seemed like D3D10+ was going to be more efficient at dispatching draw calls.

Are you doing some kind of batch merging?

Previously "Krohm"
