# DX11 Single big shader buffer for hardware instancing

I saw this in the Battlefield 3 presentation (http://dice.se/wp-content/uploads/GDC11_DX11inBF3_Public.pdf), on page 30, where they show how they pass each instance's transformation to the shader using a single big buffer, plus a constant buffer of indices used to index that buffer at the right position via the SV_InstanceID semantic. Can someone tell me what exactly this buffer is?

This part here:

Buffer<float4> instanceVectorBuffer : register(t0);


I'm not too familiar with buffers in DirectX 11 beyond constant buffers...

Isn't register(t0) for texture slots? Also, they seem to use 3x4 floats for their transformations, so does that mean they construct their transformation matrix inside the shader from translation, rotation and scale vectors?

And lastly, how would I set such a buffer through the DX11 API?

##### Share on other sites

The MSDN documentation is very lacking in explaining these concepts...

IIRC:

The t registers are similar to the c registers, but they're designed for random memory access patterns (like texture lookups) rather than constant memory access patterns (e.g. it's assumed that every pixel will read the same constants, but may read different texels). You can bind both textures and DX11 buffer objects to the t registers and then load values from them in the shader.

You bind values to the t registers using the *SSetShaderResources family of calls (VSSetShaderResources, PSSetShaderResources, and so on), which take "shader resource views". You can create a view of a texture, which is the common case, but you can also create a view of a buffer.

As mentioned above, you should prefer this method of binding buffers to t registers when you're going to be performing random lookups into the buffer. If every pixel/vertex/etc is going to read the same values from the buffer, then you should bind it to a c register using the usual method.

Also they seem to use 3x4 floats for their transformations, so does that mean they construct their transformation matrix inside the shader using translation, rotation and scale vectors ?

No, in a regular transformation matrix that's been constructed from a translation + rotation + scale, the 4th row/column (depending on your conventions) will always be [0,0,0,1], so you can hard-code that value in the shader to save some space in the buffer.

I'm still not as experienced with DX11 as I am with DX9, so I hope that's correct

Edited by Hodgman

##### Share on other sites

Yes, I meant 3 x float4 (sorry for the confusing syntax).

Alright thanks for the info.

EDIT: By the way this buffer then should be dynamic, correct? Since I need to put in or remove elements during rendering.

Edited by lipsryme

##### Share on other sites

Yeah by the looks of it, they'd potentially be updating the entire buffer every frame.

##### Share on other sites

EDIT: Nevermind got it fixed...my shader wasn't up to date when I ran it

Edited by lipsryme

##### Share on other sites

The advantage of a buffer over a cbuffer is size (128 MB vs 64 KB, although in a typical scenario 2-4 MB should be enough), which lets you store a whole frame's worth of data in the buffer. It also lets you fill the buffer once, keeping buffer updates to a minimum. The minor inconvenience is that you can store only float data (you can probably get around this with some bit manipulation).
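To put those sizes in perspective, here is a rough sketch of the arithmetic, assuming each instance takes three float4 elements (a 4x3 matrix) and nothing else. The names `MaxInstances`, `kCBufferBytes` and `kBigBufferBytes` are made up for this illustration, and the 4 MB figure is just the upper end of the "typical" range mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// One float4 element is four 32-bit floats.
constexpr std::size_t kBytesPerFloat4 = 4 * sizeof(float);

// Number of instances that fit in `bufferBytes` when each instance
// occupies `float4sPerInstance` consecutive float4 elements.
constexpr std::size_t MaxInstances(std::size_t bufferBytes,
                                   std::size_t float4sPerInstance)
{
    return bufferBytes / (kBytesPerFloat4 * float4sPerInstance);
}

// 64 KB constant buffer versus a hypothetical 4 MB instance buffer.
constexpr std::size_t kCBufferBytes   = 64 * 1024;
constexpr std::size_t kBigBufferBytes = 4 * 1024 * 1024;
```

With three float4s per instance, `MaxInstances(kCBufferBytes, 3)` is 1365 and `MaxInstances(kBigBufferBytes, 3)` is 87381, so even a modest buffer holds far more instances than one cbuffer can.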

The buffer is practically a texture, and you'd bind it as a vertex shader resource.

[source]
float4x4 GetInstanceMatrix(uint InstID, uint Offset)
{
    // InstID              = instance ID (SV_InstanceID)
    // ElementsPerInstance = number of float4s per drawn instance (typically 3 for a single mesh if 4x3 matrices are used)
    // StartIndex          = location of the first float4 of the first instance inside the buffer
    // Offset              = extra offset (used to retrieve bone matrices)
    uint BufferOffset = InstID * ElementsPerInstance + StartIndex + Offset;

    float4 r0 = InstanceData.Load(BufferOffset + 0);
    float4 r1 = InstanceData.Load(BufferOffset + 1);
    float4 r2 = InstanceData.Load(BufferOffset + 2);
    float4 r3 = float4(0.0f, 0.0f, 0.0f, 1.0f);

    return float4x4(r0, r1, r2, r3);
}

[/source]

Inside the vertex shader you may use the above code to retrieve the transform matrix for each vertex.
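If it helps to check the indexing on the CPU side, the offset computation from the shader can be mirrored in a few lines of C++. This is only a sketch: `InstanceElementIndex` is a name invented here, and the parameters simply follow the ones in the HLSL function above.

```cpp
#include <cassert>
#include <cstdint>

// CPU-side mirror of the shader's offset computation. Returns the index of
// the first float4 that belongs to instance `instID`.
constexpr std::uint32_t InstanceElementIndex(std::uint32_t instID,
                                             std::uint32_t elementsPerInstance,
                                             std::uint32_t startIndex,
                                             std::uint32_t offset)
{
    return instID * elementsPerInstance + startIndex + offset;
}
```

For example, with 3 elements per instance and a start index of 10, instance 2 begins at element 16, and its three rows are elements 16, 17 and 18.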

Cheers!

##### Share on other sites

The advantage of a buffer over cbuffer is the size (128mb vs 64kb, although in typical scenario 2-4mb should be enough)

Another important advantage of a tbuffer over a cbuffer is that cbuffers suffer from constant waterfalling, which makes them a horrible fit for scalable hardware-accelerated vertex skinning (indexing the constant buffer with a different index for each vertex).

When you're not suffering constant waterfalling, cbuffers can be faster though, for well.... constant data.

##### Share on other sites

Ok, this may sound stupid, but what about just using regular instancing? Has this method been benchmarked against regular instancing with a second vertex buffer?

##### Share on other sites

Ok, this may sound stupid, but what about just using regular instancing? Has this method been benchmarked against regular instancing with a second vertex buffer?

If by regular instancing you mean using a second vertex buffer, I remember reading some years ago that one of the big companies wrote that shaders using a second vertex stream take a performance hit. It had something to do with the fact that the vertex declaration structure gets pretty big.

I think another problem with classic instancing is that you need to define a vertex declaration for each vertex shader type whose parameters differ. Normally, of course, you would just have a transform matrix as instance data for the vertex shader, but what if you also need some other per-instance parameters, such as a color, an inverse world matrix or anything else that makes one instance different from the others? When using a second vertex stream you'll need a different vertex declaration for each case.

With the generic buffer you need just one vertex declaration, since the vertex data won't change. It is then the job of the vertex shader to extract the desired data from the generic buffer. Of course, you need to make sure that the buffer contains the expected data. Also, since the buffer can be pretty big, you'll be uploading data to the graphics card much less often, which reduces API calls. The program-side logic also gets simpler when you don't need to worry so much about "will the data fit in the buffer or not, do I have to make several draw calls instead of one".

Of course, under D3D11 instancing is just another parameter that you can access in the vertex shader (InstanceID), and after that you may read or generate the instancing data in the best way you can imagine, though vertex streams can't be randomly accessed.
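One way to picture the "fill once, draw many" idea is a CPU-side staging array that every draw call appends to, remembering where its data starts. The sketch below is only an illustration under that assumption: `InstanceStaging`, `AppendBatch` and `Float4` are hypothetical names, not anything from D3D11 or from the presentation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Float4 { float x, y, z, w; };

// Hypothetical CPU-side staging area for the single big buffer.
class InstanceStaging {
public:
    // Appends the data of one batch (all instances of one draw call) and
    // returns the index of its first float4, i.e. the start index that the
    // shader needs for that draw.
    std::uint32_t AppendBatch(const std::vector<Float4>& batchData)
    {
        const std::uint32_t startIndex =
            static_cast<std::uint32_t>(data_.size());
        data_.insert(data_.end(), batchData.begin(), batchData.end());
        return startIndex;
    }

    // Everything appended so far, ready to be copied to the GPU in one go.
    const std::vector<Float4>& Data() const { return data_; }

private:
    std::vector<Float4> data_;
};
```

Each batch is free to use its own number of float4s per instance (three for a matrix, four if a color is added, and so on), because the vertex shader only needs the start index and the per-instance element count to find its data.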

Cheers!

##### Share on other sites

So..I've created the buffer like this:

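    D3D11_BUFFER_DESC instanceBufferDesc = {};

    instanceBufferDesc.Usage = D3D11_USAGE_DYNAMIC;
    instanceBufferDesc.ByteWidth = sizeof(XMFLOAT4); // room for a single float4 only
    // A dynamic buffer needs at least one bind flag; this one is read through a shader resource view.
    instanceBufferDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    instanceBufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    instanceBufferDesc.MiscFlags = 0;
    instanceBufferDesc.StructureByteStride = 0;

    hr = this->device->CreateBuffer(&instanceBufferDesc, NULL, &this->instanceTransformBuffer);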


Is that correct so far?

Now I'm a little clueless on how to create the shader resource view from this.
Any ideas? Do I set the shader resource view description to NULL?

This seems to work without errors:
	D3D11_SHADER_RESOURCE_VIEW_DESC shaderResourceDesc;



Now, if the above is correct, how can I basically push float4s onto the buffer?

Edited by lipsryme

##### Share on other sites

Use DeviceContext->Map(...) with D3D11_MAP_WRITE_DISCARD to get a pointer to an empty buffer.

Fill in the data (with memcpy, for example) and then call Unmap. Don't write outside of the buffer! It will cause memory corruption and may even hang your computer.

It doesn't really make sense to use the buffer for a single float4, but I assume that's a test. I used DXGI_FORMAT_R32G32B32A32_FLOAT as the buffer format, since handling things as float4 probably makes a bit more sense. You'll need at least 3x4 floats to store a typical matrix, but you'll probably want to allocate some megabytes of storage (within reason, of course).
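To make the "don't write outside of the buffer" point concrete, here is a small sketch of a bounds-checked copy. It assumes the destination pointer is the one handed back by Map (the pData member of D3D11_MAPPED_SUBRESOURCE), but the code itself is plain C++ and works on any writable memory. `UploadInstanceData` and `Float4` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Float4 { float x, y, z, w; };

// Copies `src` into `mapped` only if it fits into `capacityBytes`.
// `mapped` stands in for the pointer obtained from Map; `capacityBytes`
// should be the ByteWidth the buffer was created with.
bool UploadInstanceData(void* mapped, std::size_t capacityBytes,
                        const std::vector<Float4>& src)
{
    const std::size_t bytes = src.size() * sizeof(Float4);
    if (mapped == nullptr || bytes > capacityBytes)
        return false;  // refuse to write past the end of the buffer
    if (bytes > 0)
        std::memcpy(mapped, src.data(), bytes);
    return true;
}
```

In a real renderer this call would sit between Map and Unmap; when it returns false, the caller could split the work into several draws or recreate the buffer with a larger size.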

Cheers!

Edited by kauna

##### Share on other sites

Got it working perfectly thanks !

So the only way is to define a constant max size for this buffer inside the code...?

Worst case would be 3 x numInstances which could add up to quite a lot.

By the way, I'm currently assigning each instance its 3 float4s inside a loop. I assume it would be more efficient to add them to a structure which I then copy once using memcpy, or is that negligible?

I'm only doing Map/Unmap once, but I'm filling the data like this:

    // Update buffer
    for (int u = 0; u < 3; u++)
    {
        XMStoreFloat4(&pInstanceData[(numInstances * 3) + u], worldTransform.r[u]);
    }

This happens in the loop that goes through every instance and counts them, so that the instanced draw call can be done at the end.

Also, am I correct in thinking that I am restricted to one specific material (texture and properties from the const buffer) for every unique instance group?

Update: I just read about putting them inside a Texture2DArray... so for specific material properties, would it make sense to also store them inside a buffer like I did with the transforms? Might be overkill for e.g. booleans, but still...

Edited by lipsryme

##### Share on other sites

Wait, can I ask something: what does your draw call look like? I mean, do you call DrawIndexedInstanced with only 1 buffer? Or with 1 vertex buffer and 1 empty instance buffer, or with the ID inside the instance buffer data?

##### Share on other sites

Yes, I only use a vertex buffer. I'm not using an instance buffer at all; the per-instance transformations come from a float4 buffer that holds them.
They're then accessed inside the shader using the instance ID.

Edited by lipsryme

##### Share on other sites

I only use a vertex buffer yes, not using instance buffer at all but giving it per instance transformations via a float4 buffer that holds these.
They're then accessed inside the shader using the instanceID.

Hm, I tried to do the same as you to see how it compares to normal instancing, but nothing leaves the vertex shader. The shader debugger shows that the sphere goes in, yet nothing gets rasterized. It can't be the view matrix, since it works with normal rendering, so it has to be something to do with the large buffer object. Did you ever experience such an issue?

##### Share on other sites

Hmm I do remember having some issues that no pixel shader was being executed....can't remember exactly what it was though...

The input layout should look something like this:

    D3D11_INPUT_ELEMENT_DESC lo[] =
    {
        {"POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0},
        {"TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 1, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0},
        {"NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 2, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0},
        {"TANGENT",  0, DXGI_FORMAT_R32G32B32_FLOAT, 3, D3D11_APPEND_ALIGNED_ELEMENT, D3D11_INPUT_PER_VERTEX_DATA, 0},
    };

The fourth parameter is important here, as this is the vertex buffer slot.

Also try to make sure the float4s are combined in the correct order.

I've had a lot of problems with this. In the end I had to transpose the matrix first and then pass it to the buffer, because otherwise you'd lose the 4th row.

If you pass your 3 rows something like this:

    cbufferData.data[i] = worldMatrix.r[i];

you're passing only 3 vectors of the matrix, and you will lose important data (with this layout the translation sits in the 4th row) if you pass the rows instead of the columns.

That's why I changed it over to columns, and it worked:

    float4x4 GetInstanceTransform(uint instID, uint offset)
    {
        uint BufferOffset = instID * elementsPerInstance + startIndex + offset;

        float4 c0 = InstanceDataBuffer.Load(BufferOffset + 0);
        float4 c1 = InstanceDataBuffer.Load(BufferOffset + 1);
        float4 c2 = InstanceDataBuffer.Load(BufferOffset + 2);
        float4 c3 = float4(0.0f, 0.0f, 0.0f, 1.0f);

        float4x4 _World = { c0.x, c1.x, c2.x, c3.x,
                            c0.y, c1.y, c2.y, c3.y,
                            c0.z, c1.z, c2.z, c3.z,
                            c0.w, c1.w, c2.w, c3.w };

        return _World;
    }

I don't see a way around that...but maybe someone does ?
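For what it's worth, the round trip can be checked on the CPU with a small sketch. It assumes a row-vector convention where the translation sits in the 4th row, so the constant part is the 4th column; `Matrix4`, `PackAffine` and `UnpackAffine` are names invented here, and plain arrays stand in for the DirectXMath types.

```cpp
#include <array>
#include <cassert>

struct Float4 { float x, y, z, w; };

// m[row][col], row-vector convention: translation lives in row 3.
using Matrix4 = std::array<std::array<float, 4>, 4>;

// Stores columns 0..2 of an affine matrix as three float4s.
// Column 3 is assumed to be (0, 0, 0, 1) and is dropped.
std::array<Float4, 3> PackAffine(const Matrix4& m)
{
    std::array<Float4, 3> out{};
    for (int c = 0; c < 3; ++c)
        out[c] = Float4{ m[0][c], m[1][c], m[2][c], m[3][c] };
    return out;
}

// Rebuilds the full matrix the same way the shader above does:
// each stored float4 becomes a column, and the last column is constant.
Matrix4 UnpackAffine(const std::array<Float4, 3>& p)
{
    Matrix4 m{};
    for (int c = 0; c < 3; ++c) {
        m[0][c] = p[c].x;
        m[1][c] = p[c].y;
        m[2][c] = p[c].z;
        m[3][c] = p[c].w;
    }
    m[0][3] = 0.0f;
    m[1][3] = 0.0f;
    m[2][3] = 0.0f;
    m[3][3] = 1.0f;
    return m;
}
```

Packing the columns is the same as transposing first and then storing the first three rows, which is what the workaround above amounts to; the translation ends up in the w components of the three stored vectors.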

Edited by lipsryme

##### Share on other sites

That's why I changed it over to columns and it worked

oh I see, cause in your other post you specify #pragma pack_matrix( row_major )

I'll try to get it to work tonight when i can to see what happens

##### Share on other sites

Yes but that only applies to float4x4 values coming from the cbuffer. It doesn't apply to what you use inside the vertex shader and also not for float4's.
