How does hardware instancing work on the pixel shader level?

Started by
11 comments, last by Dingleberry 7 years, 10 months ago

I am in the process of updating my shadow mapping routines and came across an issue with my point lights. Basically I'm supplying a ByteAddressBuffer to my pixel shader that contains a list of light sources (actually it's an index table into a separate light data buffer) that are close enough to affect each mesh instance being rendered.
Having a for-loop iterate over this list works well when using a comparison sampler to query shadow maps. However, it yields divergent gradient compilation errors when used with mipmapped shadow maps. I get why this happens and have been thinking about ways around it.
I had an idea to expand the light table to account for every light source, even those not affecting certain meshes (which seems wasteful but I suppose there won't be that many light sources at once anyway). In this way I could loop over all point lights for every pixel, and then do a dynamic branch based on whether the current light index is referenced by the currently drawn instance. However, I then got to thinking, how are hardware instances really drawn? Is each instance drawn by a different thread group or could I end up with multiple instances in the same group? If that can happen I don't see any way to guarantee non-divergent gradients within pixel blocks. Is there any, or should I rethink this problem from the ground up?

Thankful for any advice.


Don't worry about instance divergence in pixel shaders. Pixels are computed in 2x2 quads on NVIDIA, and something similar on AMD I believe. https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

The divergence to worry about is within that small block of pixels. Things like small triangles or high frequency data will hurt performance. Other things like branching based on position probably won't, as the world space positions of two neighbouring pixels are very likely to be correlated. Of course, they might not be, for example when looking through a chain link fence.

Hm yes... unfortunately it still doesn't seem to work, which suggests that the HLSL compiler can't guarantee there won't be any such divergence, and so it refuses to compile my shader.

Here's my original light application loop:


/*
 * Process spot lights.
 * «spotLightData.x» contains an offset into the spot light table at which a set of uint indices
 *                   into the SpotLight buffer detailing what light sources affect the current mesh
 *                   instance begin.
 * «spotLightData.y» contains the number of spot light sources affecting the current mesh instance.
 */
for(n = 0; n < spotLightData.y; n++) {
	lightId = SpotLightTable.Load(spotLightData.x + (n * 4));
	sLightContrib contrib = ComputeSpotLightContribution(SpotLight[lightId], V, P.xyz, N);
	total.diffuse  += contrib.diffuse;
	total.specular += contrib.specular;
}

The above won't compile when ComputeSpotLightContribution uses SampleGrad / similar to sample a shadow map, raising error X4014:

Cannot have divergent gradient operations inside flow control.

The only possible source of such divergence here, as far as I can tell anyway, is that different instances may index into different light sources based on the light table lookup.

So I tried to change it to the following to ensure that all instances take the same flow path, and limit the actually processed lights with an if-branch instead:


/* Process spot lights */
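uint tableIndex = 0;
// Guard the initial load so that an instance with no spot lights never matches any index
lightId = (spotLightData.y > 0) ? SpotLightTable.Load(spotLightData.x + (tableIndex * 4)) : 0xffffffff;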
[loop]
for(n = 0; n < NumSpotLights; n++) { // NumSpotLights is the total number of elements in the SpotLight buffer
	[branch]
	if(n == lightId) {
		sLightContrib contrib = ComputeSpotLightContribution(SpotLight[n], V, P.xyz, N);

		total.diffuse  += contrib.diffuse;
		total.specular += contrib.specular;

		// Look for next light table index if applicable
		if(++tableIndex < spotLightData.y)
			lightId = SpotLightTable.Load(spotLightData.x + (tableIndex * 4));
		else
			lightId = 0xffffffff;
	}
}

By the way, the light table is always organized so that the light source indices appear in increasing order.

Unfortunately this approach fails with the very same error message as above.

I still think there should be some way to accomplish this without having to break the instance draw calls apart and constantly change cbuffer data, or is there not?

You're going to continue to get the "Cannot have divergent gradient operations inside flow control" error any time you ask it to sample a texture using something like "Sample", which needs the adjacent threads' texture coordinates so that it can calculate the derivatives and sample the correct mip level.

Using SampleLevel ("Here's the mip level I want...") or SampleGrad ("Here, I've calculated my own derivatives") should mean you don't get the divergent gradient operations compile error. SampleGrad should not cause this error.

Can you provide a fuller, compilable shader?

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group
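
For reference, the three sampling calls contrasted in the reply above look roughly like this. It is a minimal sketch: the resource names (ShadowMapSketch, LinearSamplerSketch) and the function are hypothetical and not taken from the original shader, and the three results are only added together so that all three calls stay live.

```hlsl
// Hypothetical resources, named for illustration only.
Texture2D    ShadowMapSketch     : register(t0);
SamplerState LinearSamplerSketch : register(s0);

float4 SampleThreeWays(float2 uv, float2 uvDdx, float2 uvDdy)
{
	// Implicit derivatives: the mip level is derived from the uv values of the
	// neighbouring pixels in the 2x2 quad, so this call is the one that
	// must not sit inside divergent flow control.
	float4 a = ShadowMapSketch.Sample(LinearSamplerSketch, uv);

	// Explicit mip level: no derivatives are needed at all.
	float4 b = ShadowMapSketch.SampleLevel(LinearSamplerSketch, uv, 0.0f);

	// Explicit derivatives: the caller supplies them, so nothing is
	// computed across the quad at the point of the call.
	float4 c = ShadowMapSketch.SampleGrad(LinearSamplerSketch, uv, uvDdx, uvDdy);

	return a + b + c;
}
```

Note that SampleGrad only avoids the problem if the derivatives passed to it were themselves obtained outside the divergent flow control.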

Maybe he put the ddx and ddy calls inside the loop?

Well, I figured he was perhaps calculating the gradients without using ddx/ddy, but I suppose yes, calling ddx and ddy and then trying to call SampleGrad doesn't achieve very much beyond giving you access to those values. In fact, it's probably slower than just calling Sample if you don't actually need the gradients for anything other than providing them to SampleGrad. A whole shader would give us the answer.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Aye, there are indeed ddx/ddy calls inside the function being called in the branch; moving them into the outer loop has solved the issue :)

The thing that threw me off here was that a similar approach for directional lights was working just fine having those intrinsics called from inside the flow control. I can only assume that the compiler is fine with a loop that has a constant buffer member determining its upper bound, but not when that bound is retrieved from a shader resource. Furthermore both of my directional light paths have the same ddx/ddy calculations, but they're still inside both separate branches (yes I should move them out of there!) and that apparently works, so it would seem the compiler must detect and be fine with this too, which I wouldn't have suspected.
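
A rough sketch of the kind of restructuring described above: the ddx/ddy calls are made once in uniform control flow and the results are passed into the per-light code, which then uses SampleGrad. Every name here (sSpotLightSketch, SpotLightSketch, ShadowTermSketch, and so on) is hypothetical, and the projection skips the perspective divide for brevity, so this shows the structure only, not the original poster's shadow code.

```hlsl
// Hypothetical light layout and resources, for illustration only.
struct sSpotLightSketch { float4x4 worldToShadow; };

Texture2D                          ShadowMapSketch     : register(t0);
ByteAddressBuffer                  SpotLightTable      : register(t1);
StructuredBuffer<sSpotLightSketch> SpotLightSketch     : register(t2);
SamplerState                       ShadowSamplerSketch : register(s0);

// Receives precomputed derivatives instead of calling ddx/ddy itself.
float ShadowTermSketch(sSpotLightSketch light, float3 P, float3 dPdx, float3 dPdy)
{
	// Assumption: a plain affine projection, with no perspective divide.
	float2 uv    = mul(float4(P,    1.0f), light.worldToShadow).xy;
	float2 uvDdx = mul(float4(dPdx, 0.0f), light.worldToShadow).xy;
	float2 uvDdy = mul(float4(dPdy, 0.0f), light.worldToShadow).xy;
	return ShadowMapSketch.SampleGrad(ShadowSamplerSketch, uv, uvDdx, uvDdy).r;
}

float SumShadowTermsSketch(float3 P, uint tableOffset, uint lightCount)
{
	// Derivatives are taken once, before any per-instance flow control.
	float3 dPdx = ddx(P);
	float3 dPdy = ddy(P);

	float sum = 0.0f;
	[loop]
	for (uint n = 0; n < lightCount; n++) {
		// The trip count and light index may differ between instances,
		// but no derivative instruction is issued inside the loop.
		uint lightId = SpotLightTable.Load(tableOffset + (n * 4));
		sum += ShadowTermSketch(SpotLightSketch[lightId], P, dPdx, dPdy);
	}
	return sum;
}
```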

"In fact, it's probably slower than just calling Sample if you don't actually need the gradients for anything other than providing them to SampleGrad."

The derivatives aren't calculated on the sampled texture coordinates so that will sadly not work.

What are you using the derivatives for? It seems strange for spot light contribution.

Chances are some of what you think are loops or branches as you've written them in HLSL may actually get flattened out by the compiler such that there isn't a loop or branch there at all. If the compiler can prove that a branch is 'coherent' among all threads in the wave/warp then it's not a problem to have gradient operations inside the control flow.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group
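
To illustrate the point about coherent and flattened branches, here is a small sketch of two cases where a gradient operation inside an if would typically be expected to compile. The names (PerFrameSketch, UseDetailTexture, DetailMapSketch) are hypothetical, and whether a particular compiler version accepts each case is an assumption to verify rather than a guarantee.

```hlsl
// Hypothetical constant buffer and resources, for illustration only.
cbuffer PerFrameSketch : register(b0)
{
	uint UseDetailTexture;
};

Texture2D    DetailMapSketch     : register(t0);
SamplerState LinearSamplerSketch : register(s0);

float4 UniformBranchSketch(float2 uv)
{
	float4 c = float4(0.0f, 0.0f, 0.0f, 0.0f);
	// The condition comes from a constant buffer, so it has the same value
	// for every pixel in the draw and the branch cannot diverge within a quad.
	[branch]
	if (UseDetailTexture != 0)
		c = DetailMapSketch.Sample(LinearSamplerSketch, uv);
	return c;
}

float4 FlattenedBranchSketch(float2 uv, float mask)
{
	float4 c = float4(0.0f, 0.0f, 0.0f, 0.0f);
	// [flatten] asks the compiler to evaluate both sides and select the
	// result, so the Sample call ends up outside any real flow control.
	[flatten]
	if (mask > 0.5f)
		c = DetailMapSketch.Sample(LinearSamplerSketch, uv);
	return c;
}
```

By contrast, [branch] on a condition that varies per pixel requests a real dynamic branch, which is the situation in which the gradient error is raised rather than avoided.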

They used the [branch] annotation, so that should force the compiler to prevent divergent gradient operations, shouldn't it?
