Need ideas about 'Baking' instances into vertex shader (no GS)

Started by
17 comments, last by Mr_Fox 7 years, 3 months ago

If all you need to do is render cubes, then you don't need an IB or VB at all.

A 14-vertex triangle-strip cube in the vertex shader:

b = 1 << i;
x = (0x287a & b) != 0
y = (0x02af & b) != 0
z = (0x31e3 & b) != 0

Here i is SV_VertexID (apply a modulo 14 to i to draw multiple cubes in the same draw).

Then set the position, rotation and scale by pulling each cube's transform from a constant buffer or SRV indexed by SV_VertexID / 14.

No index buffer, no vertex buffer, no GS, and no instancing. Only raw ALU and one SRV holding each cube's transform.
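
The recipe above could be written out as a complete vertex shader roughly as follows. This is only a sketch: the resource names (gCubeWorld, gViewProj), the register slots, and the entry point name CubeVS are hypothetical, and it assumes column-vector mul order and a unit cube re-centered on the origin.

```hlsl
// Hypothetical resources: one world matrix per cube and a view-projection matrix.
StructuredBuffer<float4x4> gCubeWorld : register(t0);

cbuffer PerFrame : register(b0)
{
    float4x4 gViewProj;
};

static const uint kStripVerts = 14; // vertices in one cube strip

float4 CubeVS(uint vertexId : SV_VertexID) : SV_Position
{
    uint cube   = vertexId / kStripVerts; // which cube this vertex belongs to
    uint corner = vertexId % kStripVerts; // position inside the 14-vertex strip

    // Bit `corner` of each mask is this vertex's coordinate (0 or 1) on one axis.
    uint3 bits = (uint3(0x287a, 0x02af, 0x31e3) >> corner) & 1;

    // Move the unit cube from [0,1] to [-0.5,0.5] before applying the transform.
    float4 localPos = float4(float3(bits) - 0.5f, 1.0f);

    float4 worldPos = mul(gCubeWorld[cube], localPos);
    return mul(gViewProj, worldPos);
}
```

It would be driven by a single non-indexed draw of 14 * N vertices with triangle-strip topology and no vertex buffer bound. Decoded, the masks visit the corners in the order (0,1,1), (1,1,1), (0,1,0), (1,1,0), (1,0,0), (1,1,1), (1,0,1), (0,1,1), (0,0,1), (0,1,0), (0,0,0), (1,0,0), (0,0,1), (1,0,1), which gives 12 triangles, two per face.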

Hmm... however, my wireframe case won't work with pure VS line-strip instancing, since it seems we cannot break a triangle strip or line strip in the VS (or can we?).

With the magic numbers above, one continuous strip connects separate cubes. In case someone needs them, I worked out the numbers that add a degenerate triangle at the beginning and end of each cube, so the strip connecting one cube to the next is just a line. Here are the numbers:

A 16-vertex triangle-strip cube in the vertex shader:

b = 1 << i;
x = (0xd0f4 & b) != 0
y = (0x055f & b) != 0
z = (0xe3c7 & b) != 0
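
For reference, the 16-bit numbers are the 14-bit ones with the first and last vertex of the strip repeated. A small helper along these lines reproduces them (PadStripMask is a hypothetical name, not something from the thread):

```hlsl
// Pads a 14-bit strip mask to 16 bits by repeating its first and last bits,
// i.e. by duplicating vertex 0 and vertex 13 of the cube strip.
uint PadStripMask(uint mask14)
{
    uint first = mask14 & 1u;         // coordinate bit of vertex 0
    uint last  = (mask14 >> 13) & 1u; // coordinate bit of vertex 13
    return first | (mask14 << 1) | (last << 15);
}

// PadStripMask(0x287a) == 0xd0f4
// PadStripMask(0x02af) == 0x055f
// PadStripMask(0x31e3) == 0xe3c7
```

With the duplicates in place, every triangle that bridges two consecutive cubes contains a repeated vertex and therefore has zero area. One side effect to keep in mind: the extra leading vertex shifts each real triangle by one strip position, which flips its winding relative to the 14-vertex strip, so the cull mode may need adjusting if back-face culling is enabled.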

However, when I benchmarked it, it was actually slightly slower than instancing!

I am so confused...

In case anyone is wondering, here is the shader code for both the instanced and non-instanced versions:


//=======================================================================
// Instanced version (SV_InstanceID plus a cube vertex buffer)
//=======================================================================
#include "TSDFVolume.inl"
#include "TSDFVolume.hlsli"
#include "CalibData.inl"

Texture3D<int> tex_srvRenderBlockVol : register(t1);

void main(uint uInstanceID : SV_InstanceID, in float4 f4Pos : POSITION,
    out float4 f4ProjPos : SV_POSITION, out float2 f2Depths : NORMAL0)
{
    uint3 u3Idx = MakeU3Idx(uInstanceID,
        vParam.u3VoxelReso / vParam.uVoxelRenderBlockRatio);
    f4ProjPos = float4(0.f, 0.f, 0.f, 1.f);
    f2Depths = float2(0.f, 0.f);
    // check whether it is occupied 
    if (tex_srvRenderBlockVol[u3Idx] != 0) {
        float3 f3BrickOffset = 
            u3Idx * vParam.uVoxelRenderBlockRatio * vParam.fVoxelSize -
            (vParam.u3VoxelReso >> 1) * vParam.fVoxelSize;
        f4Pos.xyz = (f4Pos.xyz + 0.5f) * vParam.fVoxelSize *
            vParam.uVoxelRenderBlockRatio + f3BrickOffset;
#if FOR_VCAMERA
        f4ProjPos = mul(mProjView, f4Pos);
        float fVecLength = length(mul(mView, f4Pos).xyz);
#endif // FOR_VCAMERA
#if FOR_SENSOR
        float4 f4Temp = mul(mDepthView, f4Pos);
        float fz = -f4Temp.z;
        float2 f2HalfReso = i2DepthReso >> 1;
        float2 f2xy = (f4Temp.xy / fz * DEPTH_F + DEPTH_C
            - f2HalfReso) / f2HalfReso;
        f4ProjPos = float4(f2xy, 1.f, 1.f);
        float fVecLength = length(f4Temp.xyz);
#endif // FOR_SENSOR
        f2Depths = float2(fVecLength, -fVecLength);
    }
}

//=======================================================================
// Non-instanced version (cube corners decoded from SV_VertexID)
//=======================================================================
#include "TSDFVolume.inl"
#include "TSDFVolume.hlsli"
#include "CalibData.inl"

#define TRISTRIPSIZE 16
#define MAGICFORX 0xd0f4
#define MAGICFORY 0x055f
#define MAGICFORZ 0xe3c7
//#define TRISTRIPSIZE 14
//#define MAGICFORX 0x287a
//#define MAGICFORY 0x02af
//#define MAGICFORZ 0x31e3
Texture3D<int> tex_srvRenderBlockVol : register(t1);

void main(uint uVertID : SV_VertexID,
    out float4 f4ProjPos : SV_POSITION, out float2 f2Depths : NORMAL0)
{
    uint3 u3Idx = MakeU3Idx(uVertID / TRISTRIPSIZE,
        vParam.u3VoxelReso / vParam.uVoxelRenderBlockRatio);
    f4ProjPos = float4(0.f, 0.f, 0.f, 1.f);
    f2Depths = float2(0.f, 0.f);
    // check whether it is occupied 
    if (tex_srvRenderBlockVol[u3Idx] != 0) {
        uint uMask = 1 << (uVertID % TRISTRIPSIZE);
        uint3 u3Pos = (uint3(MAGICFORX, MAGICFORY, MAGICFORZ) & uMask) != 0;
        float4 f4Pos = float4(u3Pos, 1.f);
        float3 f3BrickOffset = 
            u3Idx * vParam.uVoxelRenderBlockRatio * vParam.fVoxelSize -
            (vParam.u3VoxelReso >> 1) * vParam.fVoxelSize;
        f4Pos.xyz = f4Pos.xyz * vParam.fVoxelSize *
            vParam.uVoxelRenderBlockRatio + f3BrickOffset;
#if FOR_VCAMERA
        f4ProjPos = mul(mProjView, f4Pos);
        float fVecLength = length(mul(mView, f4Pos).xyz);
#endif // FOR_VCAMERA
#if FOR_SENSOR
        float4 f4Temp = mul(mDepthView, f4Pos);
        float fz = -f4Temp.z;
        float2 f2HalfReso = i2DepthReso >> 1;
        float2 f2xy = (f4Temp.xy / fz * DEPTH_F + DEPTH_C
            - f2HalfReso) / f2HalfReso;
        f4ProjPos = float4(f2xy, 1.f, 1.f);
        float fVecLength = length(f4Temp.xyz);
#endif // FOR_SENSOR
        f2Depths = float2(fVecLength, -fVecLength);
    }
}

I do get a D3D12 debug layer warning, though:

D3D12 WARNING: ID3D12CommandList::DrawInstanced: Vertex Buffer at the input vertex slot 0 is not big enough for what the Draw*() call expects to traverse. This is OK, as reading off the end of the Buffer is defined to return 0. However the developer probably did not intend to make use of this behavior. [ EXECUTION WARNING #210: COMMAND_LIST_DRAW_VERTEX_BUFFER_TOO_SMALL]
But I think it shouldn't affect performance... Any idea why this is slightly slower than instancing? (My test machine uses a GTX 680M, and it is slower for every cube count I tried.)

You should set SV_Position to NaN when the voxel is off. That instructs the GPU to cull the triangle, which is probably faster than a million zero-area triangles at the center of the screen.

Do not measure performance with the debug layer enabled (it is unclear whether you are doing so or not).
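
A minimal sketch of the NaN suggestion, assuming (as the reply states) that the GPU discards any triangle with a NaN vertex position. The names QuietNaN, gOccupancy and CullDemoVS are hypothetical, and so is the block layout. The value 0x7FC00000 is the IEEE 754 single-precision quiet-NaN bit pattern, and asfloat reinterprets it without doing any invalid arithmetic.

```hlsl
// Hypothetical occupancy volume: non-zero means the voxel block is active.
Texture3D<int> gOccupancy : register(t0);

// Builds a quiet NaN from its bit pattern (exponent all ones, top mantissa bit set).
float QuietNaN()
{
    return asfloat(0x7FC00000);
}

float4 CullDemoVS(uint vertexId : SV_VertexID) : SV_Position
{
    // Hypothetical layout: 16 strip vertices per block, blocks laid out along x.
    uint3 block = uint3(vertexId / 16, 0, 0);

    if (gOccupancy[block] == 0)
    {
        // Inactive block: write NaN so triangles touching this vertex can be discarded.
        float n = QuietNaN();
        return float4(n, n, n, n);
    }

    // Active block: placeholder position; a real shader would transform the cube corner here.
    return float4(0.0f, 0.0f, 0.0f, 1.0f);
}
```

Whether this actually beats emitting zero-area triangles depends on the hardware, so it is worth measuring both paths.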

Thanks, but how do I set SV_Position to NaN in the VS? I tried f4ProjPos = float4(NaN, NaN, NaN, NaN) and similar, but it doesn't work, and I can't find a related article online...

I measured performance with the debug layer off. After getting that strange result, I turned the debug layer on to see whether it would reveal anything interesting...

Question: Are you using one DrawPrimitive per cube? Because you're supposed to use just one DrawPrimitive call for all cubes (multiply the vertexCount by the number of cubes to render).

Edit: Also indexing out of bounds to tex_srvRenderBlockVol is likely having a performance penalty, not to mention it implies either your math is incorrect, your buffer is too small, or you're rendering more cubes than you intended.

Edit 2: If after fixing everything the instancing version still outperforms the one with vertex tricks, then it's likely you're ALU bound, in which case I recommend creating a vertex buffer with 36 vertices and index them via [SV_VertexID % 36];

Wow, that's brilliant! Using bits to do these things. Hmm... with a little modification, it seems this method could work for lots of simple geometry: 16-bit magic numbers cover any triangle strip of up to 16 vertices (more magic numbers may be needed when the coordinates are not just 0 or 1). How did you come up with that idea?
Thanks

The idea of using SV_VertexID is often called dynamic/manual vertex pulling, and it has been around at least as long as the "Vertex Shader Tricks" presentation.

As for the magic numbers themselves, I'm afraid I can't take the credit. They were published by a dev on Twitter. Sadly I don't remember (and didn't save) the Twitter handle of whoever posted them.

Also indexing out of bounds to tex_srvRenderBlockVol is likely having a performance penalty, not to mention it implies either your math is incorrect, your buffer is too small, or you're rendering more cubes than you intended.

In my case the VS never indexes out of bounds in tex_srvRenderBlockVol. The debug layer complains that my vertex buffer is too small because I also have an option to use DrawIndexedInstanced, for which the VB holds 8 vertices, and in the non-instanced case I was too lazy to unbind the VB, IB and IL. I fixed that later, and it didn't help with performance.

Also, I was wondering: even if my VS is ALU bound, how could it be slower than instancing? The method you mentioned avoids VB and IB memory fetches entirely and only adds a few bit operations. Someone also told me that since I only have 8 vertices per instance, my GPU is underutilized (supposedly 32/64 vertices are needed to avoid idle VS threads), so I'm confused about why it is slower...

As for the [SV_VertexID % 36] approach, I suspect it would be even slower, since in my case I can't pre-transform the vertices, and each cube would invoke the VS 36 times instead of only 16. I may be wrong, though...

Any idea how to 'kill' a vertex in the VS? In most cases about 80% of the cubes are inactive and degenerate into a single point, but it would be good to get rid of them entirely...

You can use this to generate the NaN. As for getting rid of the unwanted cubes altogether: if you accept the price of added complexity, you can run a compute pass that outputs the list of visible cubes, plus a compute pass that fills an indirect draw argument buffer, and then trigger an ExecuteIndirect with it. The vertex shader needs an extra indirection to read the cube locations, but you cut all the waste that way.


float nan() {
	#pragma warning(push)
	#pragma warning(disable:4118)
	return sqrt(-1);
	#pragma warning(pop)
}
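
The compaction idea could look roughly like the compute shader below. This is a sketch under several assumptions, not the poster's implementation: it folds the two passes into one by bumping the vertex count directly in the argument buffer, it assumes that buffer is reset to {0, 1, 0, 0} before each dispatch, and all names (gBlockVol, gActiveList, gDrawArgs, gBlockReso, CompactCS) as well as the linear index order are hypothetical. The byte offsets follow the D3D12_DRAW_ARGUMENTS layout, where VertexCountPerInstance comes first.

```hlsl
Texture3D<int>           gBlockVol   : register(t0); // non-zero = active block
RWStructuredBuffer<uint> gActiveList : register(u0); // packed linear indices of active blocks
RWByteAddressBuffer      gDrawArgs   : register(u1); // VertexCountPerInstance, InstanceCount,
                                                     // StartVertexLocation, StartInstanceLocation

cbuffer CompactParams : register(b0)
{
    uint3 gBlockReso; // hypothetical: number of blocks along each axis
};

static const uint kStripVerts = 16; // vertices per cube in the padded strip

[numthreads(4, 4, 4)]
void CompactCS(uint3 id : SV_DispatchThreadID)
{
    // Skip threads outside the volume.
    if (any(id >= gBlockReso))
        return;

    // Skip inactive blocks so they never reach the vertex shader.
    if (gBlockVol[id] == 0)
        return;

    // Reserve one cube's worth of vertices; the previous count gives this block's slot.
    uint oldVertexCount;
    gDrawArgs.InterlockedAdd(0, kStripVerts, oldVertexCount);

    uint linearIdx = id.x + gBlockReso.x * (id.y + gBlockReso.y * id.z);
    gActiveList[oldVertexCount / kStripVerts] = linearIdx;
}
```

The vertex shader would then read gActiveList[SV_VertexID / 16] to find its block instead of deriving the block index from the vertex ID directly, which is the extra indirection mentioned above. The order of entries in the list is not deterministic, which does not matter for opaque cubes.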

Hi Matias, just curious: will this trick disable all vertex reuse? It seems the GPU may not recognize how vertices are shared... so compared with the instanced method, this one does 14 vertex transforms per cube, while the instanced method (using an IB) does only 8. Is that probably where the perf difference comes from? (Which would mean I am ALU bound after all.)

Please correct me if I got something wrong

Thanks

This topic is closed to new replies.