What good is hundreds of lights with out hundreds of shadows?

Started by fries · 18 comments, last by JoeJ 5 years ago

I thought people here would be interested in this: I've just written a blog post about an algorithm I've been working on to efficiently render hundreds of shadow maps.

https://worldoffries.wordpress.com/2019/03/26/what-good-is-hundreds-of-lights-with-out-hundreds-of-shadows/

Let me know what you think :)


Already liked, but here's a bump for sheer coolness, and so it doesn't just sink down without other interested people seeing it.

Thanks!

I've added tri-axis cube map filtering, which gives really nice soft shadows:

 

This is really cool! I recently went back and re-examined the ISM (imperfect shadow maps) technique, so this is inspirational for me too :D 

Seeing as you've got the entire scene description accessible at once (so that you can generate these shadow maps in bulk), you could also use it to directly ray-trace your shadows instead of using shadow mapping at all. Performance would probably be terrible, but it would also be a neat tech demo. Instead of performing a shadow-map lookup during lighting, you could traverse the BVH for intersections with the surface-to-light ray.
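The ray-traced alternative described above can be sketched on the CPU. This is a hypothetical reference, not code from the project: spheres stand in for both bounding volumes and leaf occluders, and `shadow_ray_occluded` / `segment_hits_sphere` are made-up names.

```python
import math

# Hypothetical sphere-BVH shadow query, written as a CPU reference sketch.
# A node is (center, radius, children); leaves have children == None and
# their sphere is treated as a solid occluder.

def segment_hits_sphere(origin, direction, t_max, center, radius):
    """True if the segment origin + t*direction, t in [0, t_max], overlaps the sphere."""
    oc = [c - o for c, o in zip(center, origin)]
    t = max(0.0, min(t_max, sum(a * b for a, b in zip(oc, direction))))
    closest = [o + t * d for o, d in zip(origin, direction)]
    dist2 = sum((c - p) ** 2 for c, p in zip(center, closest))
    return dist2 <= radius * radius

def shadow_ray_occluded(point, light, nodes, root=0):
    """Traverse the BVH with an explicit stack; any leaf hit means 'in shadow'."""
    direction = [l - p for l, p in zip(light, point)]
    t_max = math.sqrt(sum(d * d for d in direction))
    direction = [d / t_max for d in direction]
    stack = [root]
    while stack:
        center, radius, children = nodes[stack.pop()]
        if not segment_hits_sphere(point, direction, t_max, center, radius):
            continue
        if children is None:
            return True  # hit a leaf occluder
        stack.extend(children)
    return False
```

With a root bounding two leaf occluders, a ray straight through one leaf reports shadowed, while a ray well off to the side does not.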

I tried ray tracing first, but the performance was not great. I then tried ray tracing the shadow maps, and the performance was worse. This method seems to outperform the ray tracing by quite a bit.

Both ray tracing attempts were much simpler to implement and only took a day or two each. OTOH, this splatting algorithm was a really complicated beast to implement and took about three weeks of evenings, but it was worth it.

I'm glad you like it!

 

One other thought that I had (I'm sure I'm not the first) was that ray tracing could probably be improved by making a ray scheduler that is similar to the Node-View traversal. Something like a Node-Ray, and each iteration, each thread either intersects the ray with a triangle in leaf nodes, or pushes two more Node-Rays for the next iteration. This would considerably reduce thread divergence.
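The Node-Ray scheduler idea might look something like the following CPU sketch (hypothetical data layout, not the project's code): each iteration consumes a flat frontier of (node, ray) pairs and emits child pairs for the next iteration, mirroring the Node-View expansion so that neighbouring threads stay on the same tree level.

```python
# CPU sketch of a breadth-first Node-Ray work list. nodes is a flat array of
# (center, radius, children-or-None); rays are (origin, unit_direction, t_max).

def seg_hits(origin, direction, t_max, center, radius):
    # Closest point on the segment to the sphere center, clamped to [0, t_max].
    oc = [c - o for c, o in zip(center, origin)]
    t = max(0.0, min(t_max, sum(a * b for a, b in zip(oc, direction))))
    dist2 = sum((c - (o + t * d)) ** 2 for c, o, d in zip(center, origin, direction))
    return dist2 <= radius * radius

def trace_breadth_first(nodes, rays):
    hit = [False] * len(rays)
    frontier = [(0, i) for i in range(len(rays))]   # Node-Ray pairs, all at the root
    while frontier:
        next_frontier = []
        for node_index, ray_index in frontier:
            if hit[ray_index]:
                continue                            # this ray is already shadowed
            center, radius, children = nodes[node_index]
            origin, direction, t_max = rays[ray_index]
            if not seg_hits(origin, direction, t_max, center, radius):
                continue
            if children is None:
                hit[ray_index] = True               # leaf occluder reached
            else:
                next_frontier.extend((c, ray_index) for c in children)
        frontier = next_frontier
    return hit
```

On a GPU the `next_frontier` list would be an appended buffer consumed by the next indirect dispatch, just like the Node-View buffer.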

Great work! 

I work on something pretty similar for diffuse GI, but...

2 hours ago, fries said:

One other thought that I had (I'm sure I'm not the first) was that ray tracing could probably be improved by making a ray scheduler that is similar to the Node-View traversal. Something like a Node-Ray, and each iteration, each thread either intersects the ray with a triangle in leaf nodes, or pushes two more Node-Rays for the next iteration. This would considerably reduce thread divergence.

... now I want to add specular reflections, and I need to increase accuracy for direct lighting from small light sources, so I actually have the same thoughts as you. I already use ray tracing for GI, but this algorithm is not efficient for hard shadows or sharp reflections. Also, I don't use triangles but a sample hierarchy where the occlusion is defined by a disc per sample.

 

I have thought about these ideas: 

Generate imperfect shadow maps of direct lighting (or imperfect reflective shadow maps to support specular too): likely too inaccurate.

The traversal idea you mentioned above (if I understand you right): here you need to store ray and BVH information in global memory for each iteration step. I think that's too slow / bandwidth-heavy, and I'm unsure how to bound memory requirements for a breadth-first traversal stack for 2 million rays at all. What do you think about this problem?

Classical stackless traversal using a skip pointer: the ray stays in registers and there's no need to ever write to global memory, but each thread traverses a different path through the BVH. So it's horribly cache-inefficient, which is the reason we had little ray tracing in games (until recently). 

Binning ray batches to BVH branches: This is what I'll try soon. The idea is to divide the BVH into branches that fit into LDS. The traversal logic is the same classic one as above, but each workgroup traverses all rays that potentially intersect its branch, so you load scene data less often and it can be shared between multiple rays. If a ray skips out of the branch, it gets binned to another branch. After each step, sort and compact the rays into lists binned to the next BVH branches. After some iterations the workgroups no longer get enough rays, and a fallback to classical per-thread traversal becomes faster.

Binning rays to a coarse grid: Here I would again load a BVH branch into LDS per workgroup and brute-force all nearby rays collected from the grid. This would require loading the whole scene only once. But I think binning rays to a regular grid is too expensive: a prefix sum over 2 million rays alone will take a millisecond, and we would likely want to segment the rays so they can be associated with loose grid cells and gain some front-to-back-order advantage. I'm still thinking about this, but I guess it won't work well.
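The sort-and-compact step in the branch-binning idea could be sketched like this (hypothetical layout; `compact_rays_by_branch` is a made-up helper): a histogram of rays per branch, an exclusive prefix sum for the offsets, then a scatter, i.e. the standard stream-compaction pattern a GPU prefix-sum pass would implement.

```python
# Given the branch each ray was binned to, build one contiguous ray-index list
# per branch. Returns (offsets, counts, compacted): branch b's rays live at
# compacted[offsets[b] : offsets[b] + counts[b]], in original ray order.

def compact_rays_by_branch(ray_bins, num_branches):
    # Histogram pass: count rays per branch.
    counts = [0] * num_branches
    for b in ray_bins:
        counts[b] += 1
    # Exclusive prefix sum -> starting offset of each branch's ray list.
    offsets = [0] * num_branches
    running = 0
    for i, c in enumerate(counts):
        offsets[i] = running
        running += c
    # Scatter pass: place each ray index into its branch's slice.
    cursor = list(offsets)
    compacted = [0] * len(ray_bins)
    for ray_index, b in enumerate(ray_bins):
        compacted[cursor[b]] = ray_index
        cursor[b] += 1
    return offsets, counts, compacted
```

On the GPU the histogram and scatter would use atomics or a segmented scan, but the data flow is the same.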

 

When working with triangles, those 'BVH branch to LDS' ideas become less interesting, because a triangle has more data than a disc (I may even get away with treating mine as spheres). But fp16 could eventually help a lot with triangles?

Btw, for me it was a big win to use an MBVH instead of a binary tree. I support a maximum of 8 children, and at the surface this reduces to 4 (basically one disc per lightmap texel, so the hierarchy looks like mip maps). This way the tree requires far fewer levels, which is a big win if you have barriers between tree-level dispatches. The Many LoDs paper also mentions a 'lazy update' optimization. I had implemented it, but I saw it is a win only for binary trees; with 8 children, just rebuilding the top levels from scratch was as fast as using the lazy update.
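As a quick check of the depth argument, assuming one dispatch plus barrier per tree level:

```python
import math

# Levels needed for a tree of a given branching factor over num_leaves leaves.
# With a barrier-separated dispatch per level, fewer levels = fewer stalls:
# 1M leaves -> 20 levels binary, but only 7 levels with 8-wide nodes.

def levels(num_leaves, branching):
    return math.ceil(math.log(num_leaves, branching))
```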

 

 

For that, I would have a ray buffer, and the Node-Ray object would store a ray index, a node index, and optionally an object index. If memory is a concern, you could do a subset of the rays at a time.

That would be just 64 bits, but for 2,000,000 rays, each traversing maybe 1000 nodes, it would be 16 GB. A Radeon VII would spend the whole frame of a 60 fps target just reading this memory once, if I do the math right (not sure). And then all you have is indices, not the data itself or any processing / writing.
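For reference, the arithmetic behind the 16 GB figure, taking ~1 TB/s as a ballpark for Radeon VII's peak memory bandwidth (an assumption for this estimate):

```python
# 64-bit Node-Ray record (ray index + node index), 2 million rays,
# ~1000 node visits per ray.
bytes_per_record = 8
rays = 2_000_000
nodes_visited_per_ray = 1_000
traffic_bytes = bytes_per_record * rays * nodes_visited_per_ray   # 16 GB

# At ~1 TB/s peak bandwidth, reading that once costs ~16 ms,
# i.e. roughly a whole frame at a 60 fps (16.7 ms) target.
bandwidth_bytes_per_s = 1e12
milliseconds = traffic_bytes / bandwidth_bytes_per_s * 1000.0
```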

I'm not sure if it's possible to beat classical traversal at all, but I'll try... :|

 

Yeah, that might be starting to push it a little bit.

On another note, can anyone with some compute shader experience give me some hints as to what might be causing this annoying flicker?

It seems to get worse with the group size. Most of my testing was with groups of 64 threads doing the traversal, and the flicker is there, but only occasionally; with 256 threads in a group, it's absolutely insane. I'm guessing it's something to do with the way I push triangles or Node-Views into the buffer.


/////////////////////////////////////////////////////////////////////////////////////
// ManyPSMs

struct sGPUNodeView
{
//	float4x4 mViewObjectMatrix;
	uint mNodeIndex;
	uint mViewIndex;
	uint mObjectInstanceIndex;
	float mViewRadius;
};

struct sRingData
{
	uint mReadIndex;
	uint mWriteIndex;
	uint mPadding[2];
};

struct sGPUPackedMeshBVHNode
{
	float4 mSphere;
	uint4 mData;
	
	bool HasObjectInstance()
	{
		return mData.x & 1;
	}
	
	bool HasTriangle()
	{
		return mData.x & 2;
	}
	
	uint3 GetTriangleIndices()
	{
		return mData.yzw;
	}
	
	uint2 GetChildIndices()
	{
		return mData.yz;
	}
	
	uint GetObjectInstance()
	{
		return mData.w;
	}
	
	float4 GetSphere()
	{
		return mSphere;
	}
};

struct sPointSplatData
{
	float3 mPosition;
	uint mViewIndex;
	uint mObjectInstanceIndex;
	uint mPadding[3];
};

struct sTriangleSplatData
{
	uint mIndices[3];
	uint mViewIndex;
	uint mObjectInstanceIndex;
	uint mPadding[3];
};


StructuredBuffer<sGPUNodeView> NodeViewBuffer;									// Input Node-View Buffer
RWStructuredBuffer<sGPUNodeView> OutNodeViewBuffer;			// Output Node-View Buffer
StructuredBuffer<sGPUPackedMeshBVHNode> BVHNodeBuffer;							// BVH Node Buffer
StructuredBuffer<float4x4> ViewBuffer;
Buffer<uint> ViewResolutionBuffer;
StructuredBuffer<float4x4> ObjectInstanceBuffer;
RWStructuredBuffer<sPointSplatData> PointSplatBuffer;			// Output points, not really used at the moment
RWStructuredBuffer<sTriangleSplatData> TriangleSplatBuffer;	// Output Triangles
RWBuffer<uint> IndirectDispatchArgs;							// Indirect args for the next dispatch
Buffer<uint> IterationDataRead;													// Stores the number of nodes to process for the current iteration
RWBuffer<uint> IterationDataWrite;												// Stores the number of nodes to process for the next iteration


struct cbProcessNodeViewsParams
{
	uint mIteration;
};

CBUFFER(ProcessNodeViewsParams);

static const int gLogTotalNumThreads = 8;
static const uint gTotalNumThreads = 1 << gLogTotalNumThreads;
groupshared uint gNumAppendNodeViews[gTotalNumThreads];							// Used in parallel reduction to figure out how many items were added to OutNodeViewBuffer

// Append item to OutNodeViewBuffer
uint AppendNodeView(uint NodeIndex, uint ViewIndex, float ViewRadius, uint ObjectInstanceIndex)
{
	uint index = 0;
	if (NodeIndex != 0xffffffff)
	{		
		index = OutNodeViewBuffer.IncrementCounter();

		sGPUNodeView nodeView = (sGPUNodeView)0;
		nodeView.mNodeIndex = NodeIndex;
		nodeView.mViewIndex = ViewIndex;
		nodeView.mViewRadius = ViewRadius;
		nodeView.mObjectInstanceIndex = ObjectInstanceIndex;
		OutNodeViewBuffer[index] = nodeView;
		
		index++; // We are interested in the max count, so that would be max index + 1
	}
	
	return index;
}


float TruncatedConeSphereTest(sConeData cone, float4 sphere, float minT, float maxT, out float a)
{
	// the sphere is partially included (return -1)
	// the sphere is totally included (return 1)
	// cull the sphere (return 0)
	
	float3 V = sphere.xyz - cone.apex_location;
	a = dot(V, cone.direction_normal);
	float dotProduct_V_V = dot(V, V);
	
	if (a > maxT + sphere.w || dotProduct_V_V > (maxT + sphere.w)*(maxT + sphere.w) || -a + minT > sphere.w)
	{
		return 0.0;
	}
	
	a = max(minT,a);
	
	float x = cone.mCos * sqrt( dotProduct_V_V - a*a ) - a * cone.mSin;
	if (abs(x) > sphere.w)
	{
		if (x < 0.0)
		{ 
			return 1;
		}
		else
		{
			return 0;
		}
	}
	else 
	{
		return -1;
	}
}	

bool ShouldCullNodeView(sGPUNodeView nodeView, sGPUPackedMeshBVHNode node, float4x4 view)
{
	// The node sphere is transformed into view space below, so the cone apex
	// sits at the origin looking down +Z.
	float3 position = float3(0.0, 0.0, 0.0);
	float3 direction = float3(0.0, 0.0, 1.0);
	
	sConeData cone;
	cone.apex_location = position;
	cone.direction_normal = direction;
	cone.mSin = sin(3.1415 * 1.414213 * 0.5); // half-angle: 90 degrees widened by sqrt(2) (~127 degrees) to stay conservative
	cone.mCos = cos(3.1415 * 1.414213 * 0.5);
	
	float4 sphere = node.GetSphere();
	sphere.xyz = mul(view, float4(sphere.xyz, 1.0)).xyz;
	
	float a;
	float result = TruncatedConeSphereTest(cone, sphere, 0.1, nodeView.mViewRadius, a);
	return result == 0.0;
}

// This appends the node to either the point splat output buffer or the triangle output buffer
// Point splats are not used yet
void DrawNodeView(sGPUNodeView nodeView, sGPUPackedMeshBVHNode node, float4x4 view)
{
	if (node.HasTriangle())
	{
		uint index = TriangleSplatBuffer.IncrementCounter();
		
		sTriangleSplatData splat = (sTriangleSplatData)0;
		uint3 indices = node.GetTriangleIndices();
		splat.mIndices[0] = indices.x;
		splat.mIndices[1] = indices.y;
		splat.mIndices[2] = indices.z;
		splat.mViewIndex = nodeView.mViewIndex;
		splat.mObjectInstanceIndex = nodeView.mObjectInstanceIndex;
		TriangleSplatBuffer[index] = splat;
	}
	else
	{
		uint index = PointSplatBuffer.IncrementCounter();

		sPointSplatData splat = (sPointSplatData)0;
		splat.mPosition = node.GetSphere().xyz;
		splat.mViewIndex = nodeView.mViewIndex;
		splat.mObjectInstanceIndex = nodeView.mObjectInstanceIndex;
		PointSplatBuffer[index] = splat;
	}
}

// Approximate the node's screen space area
// https://iquilezles.org/www/articles/sphereproj/sphereproj.htm
// Sphere = Sphere position and radius
// View = Camera world to camera matrix
// fle = Camera Focal Length
float2 ApproximateProjectedArea(float4 Sphere, float4x4 View, float fle)
{
    // transform to camera space
    float3  o = mul(View, float4(Sphere.xyz,1.0)).xyz;

    float r2 = Sphere.w*Sphere.w;
    float z2 = o.z*o.z;
    float l2 = dot(o,o);

//    return -3.14159*fle*fle*r2*sqrt(abs((l2-r2)/(r2-z2)))/(r2-z2);
	return float2(
		fle*sqrt(-r2*(r2-l2)/((l2-z2)*(r2-z2)*(r2-z2))),
		fle*sqrt(-r2*(r2-l2)/((l2-z2)*(r2-z2)*(r2-l2)))
		);
}

// Should this Node-View be drawn or further processed?
bool ShouldDrawNodeView(sGPUNodeView nodeView, sGPUPackedMeshBVHNode node, float4x4 view, float Resolution)
{
	// Nodes with triangles must be drawn
	if (node.HasTriangle())
	{
		return true;
	}
	else
	{	
//		return false;
		
		// If the node is too small, draw it and stop processing its children
		float FOV = 90.0 / 32.0; // 90 degree FOV because we're rendering cube shadow maps
		float focalPoint = 0.5 / tan(FOV * 3.1415 / 180.0);
		float2 approximateProjectedArea = ApproximateProjectedArea(node.GetSphere(), view, focalPoint) * Resolution * 2.0;

		return approximateProjectedArea.x * approximateProjectedArea.y <= 1.0;
	}
}

[numthreads(gTotalNumThreads, 1, 1)]
void ProcessNodeViews_main(uint3 dispatchThreadID : SV_DispatchThreadID, uint GroupIndex : SV_GroupIndex, uint GroupID : SV_GroupID)
{
	gNumAppendNodeViews[GroupIndex] = 0;
	
	// Don't process more nodes than are in the node-view buffer for this iteration
	if (dispatchThreadID.x < IterationDataWrite[gProcessNodeViewsParams.mIteration * 4])
	{		
		sGPUNodeView nodeView = NodeViewBuffer[dispatchThreadID.x];
		if (nodeView.mViewIndex != 0xffffffff) // this would be an error condition, and shouldn't really happen
		{
			// Retrieve the node and view
			sGPUPackedMeshBVHNode node = BVHNodeBuffer[nodeView.mNodeIndex];
			float4x4 view = ViewBuffer[nodeView.mViewIndex];
			if (nodeView.mObjectInstanceIndex != 0xffffffff)
			{
				float4x4 object = ObjectInstanceBuffer[nodeView.mObjectInstanceIndex];
				view = mul(view, object);
			}
			
			if (!ShouldCullNodeView(nodeView, node, view))
			{	
				float resolution = asfloat(ViewResolutionBuffer[nodeView.mViewIndex]);
				if (ShouldDrawNodeView(nodeView, node, view, resolution))
				{
					DrawNodeView(nodeView, node, view);
				}
				else
				{		
					if (node.HasObjectInstance())
					{
						uint objectInstanceIndex = node.GetObjectInstance();
						nodeView.mObjectInstanceIndex = objectInstanceIndex;
					}
						
					// Append the child node-views to the OutNodeViewBuffer
					uint2 children = node.GetChildIndices();
					uint maxNodes1 = AppendNodeView(children.x, nodeView.mViewIndex, nodeView.mViewRadius, nodeView.mObjectInstanceIndex);
					uint maxNodes2 = AppendNodeView(children.y, nodeView.mViewIndex, nodeView.mViewRadius, nodeView.mObjectInstanceIndex);
					
					// Store the maximum item number that is stored in the OutNodeViewBuffer
					gNumAppendNodeViews[GroupIndex] = max(maxNodes1, maxNodes2);
				}
			}
		}
	}
			
	// Sync so every thread's write to gNumAppendNodeViews is visible before the
	// reduction reads a neighbouring slot (without this there is a race that
	// gets worse with larger group sizes).
	GroupMemoryBarrierWithGroupSync();

	// Parallel reduction of maximum item number in OutNodeViewBuffer
	[unroll(gLogTotalNumThreads)]
	for(uint s = gTotalNumThreads / 2; s > 0; s >>= 1)
	{
		if(GroupIndex < s)
		{
			gNumAppendNodeViews[GroupIndex] = max(gNumAppendNodeViews[GroupIndex], gNumAppendNodeViews[GroupIndex + s]);
		}
		
		GroupMemoryBarrierWithGroupSync();
	}

	// Have the first thread write out the dispatch args, and number of nodes for the next iteration
	if(GroupIndex == 0)
	{
		uint numAppended = gNumAppendNodeViews[0];
		InterlockedMax(IndirectDispatchArgs[gProcessNodeViewsParams.mIteration * 4 + 4], (numAppended + gTotalNumThreads - 1) >> gLogTotalNumThreads);
		InterlockedMax(IterationDataWrite[gProcessNodeViewsParams.mIteration * 4 + 4], numAppended);
	}
}

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

There's also something not quite right with the screen-space area approximation (it seems to return results that are way too large), and the cone-sphere test could probably be better, but these issues are insignificant compared to the flicker.
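One way to sanity-check a sphere-projection routine is the on-axis case, where the tangent cone gives an exact closed form that any approximation should agree with. The helpers below are hypothetical reference code, not part of the shader:

```python
import math

# For a sphere of radius r centred on the view axis at distance d, the tangent
# cone touches the sphere at angle asin(r/d), so the projected radius on the
# image plane at focal length fle is fle * tan(asin(r/d)) = fle*r / sqrt(d^2 - r^2).

def projected_radius_on_axis(r, d, fle):
    return fle * r / math.sqrt(d * d - r * r)

def projected_area_on_axis(r, d, fle):
    pr = projected_radius_on_axis(r, d, fle)
    return math.pi * pr * pr
```

Off-axis the projection is an ellipse rather than a disc, but a routine that is "way too large" should already disagree with this on-axis reference.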

AppGameLoader_Win_2019-03-28_22-47-10-605.mp4

Note that the shadow maps are dynamically allocated in the atlas, so they move around a lot; this is not the cause of the flicker. If you pause on certain frames, you can see triangles missing from certain meshes in the shadow map.

This topic is closed to new replies.
