DrawCalls as a struct


For a while I've been trying to correctly implement a struct that holds all the data needed to execute a single draw call. I thought I'd reinvented the wheel, but it turns out others are doing the same thing, and from what I *hear* they are doing it better. To put you in context, here is what I've got:


struct DrawCall
{

// Buffer, Texture and SamplerState ARE JUST POINTERS - LITERALLY
typedef std::vector<Pair<unsigned, Buffer>> BoundCBuffersContainer;
typedef std::vector<Pair<unsigned, Texture>> BoundTexturesContainer;
typedef std::vector<Pair<unsigned, SamplerState>> BoundSamplersContainer;

std::vector<BoundUniform> m_uniforms; // quick access to the 0th cbuffer in D3D or the uniforms in OpenGL; holds the uniform name (as a hashed int), type and byte offset into m_uniformData
std::vector<char> m_uniformData;

ShadingProgram  m_shadingProg; // POINTER
Buffer                             m_vertexBuffers[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS]; // AN ARRAY OF POINTERS, GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS is ~8
uint32                             m_vbOffsets[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS];
uint32                             m_vbStrides[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS];
VertexDeclIndex                    m_vertDeclIndex;
PrimitiveTopology::Enum            m_primTopology;
Buffer                             m_indexBuffer; // POINTER
UniformType::Enum                  m_indexBufferFormat;
uint32                             m_indexBufferByteOffset;
BoundCBuffersContainer             m_boundCbuffers; // STD::VECTOR OF POINTERS
BoundTexturesContainer             m_boundTextures; // STD::VECTOR OF POINTERS
BoundSamplersContainer             m_boundSamplers; // STD::VECTOR OF POINTERS
FrameTarget                        m_frameTarget;  //POINTER (a set of render targets + depth buffer)
Viewport                           m_viewport;
RasterizerState                    m_rasterState; // POINTER
DepthStencilState                  m_depthStencilState; // POINTER
BlendState                         m_blendState; // POINTER
AABox2i                            m_scissorRect; // This will be used only if useScissors is true in the rasterizer state.
DrawExecDesc m_drawExec; // linear or indexed draw, start vertex/index, num primitives, etc.


DrawCall();
~DrawCall() = default;

// Just setter for the values above.
void setProgram(ShadingProgram pShadingProgram);
void setVB(const int slot, Buffer pBuffer, const uint32 byteOffset, const uint32 stride);
void setVBDeclIndex(const VertexDeclIndex idx);
void setPrimitiveTopology(const PrimitiveTopology::Enum pt);
void setIB(Buffer pBuffer, const UniformType::Enum format, const uint32 byteOffset);
void setCBuffer(const unsigned nameStrIdx, Buffer cbuffer); // nameStrIdx is just a hash (actually retrieved by std::map<std::string, int>)
void setTexture(const unsigned nameStrIdx, Texture texture, const bool bindSampler = true); // nameStrIdx is just a hash (actually retrieved by std::map<std::string, int>)
void setSampler(const unsigned nameStrIdx, SamplerState sampler);// nameStrIdx is just a hash (actually retrieved by std::map<std::string, int>)
void setFrameTarget(FrameTarget frameTarget);
void setRenderState(RasterizerState rasterState, DepthStencilState depthStencilState, BlendState blendState = BlendState());
void setScissorsRect(const AABox2i& rect);
void setViewport(const Viewport& viewport);
void draw(const uint32 numVerts, const uint32 startVert, const uint32 numInstances = 1);
void drawIndexed(const uint32 numIndices, const uint32 startIndex, const uint32 startVert, const uint32 numInstances = 1);

// This function checks the validity of the bound resources
// and returns true if the draw call is valid-ish...
// [CAUTION] this function is NOT complete.
bool validateDrawCall() const;
};

So far so good...
This struct could later be passed to a "Context" that executes it. My problems are:

- Under x64 the struct is 400 bytes long.

- The struct allocates (via std::vector) in order to hold the bound data. Two quick ways to fix this are:

- use static arrays (this increases the overall size of the struct, but avoids the dynamic allocations)

- reuse a DrawCall struct when possible (which is pretty often), but this basically kills the whole idea of having draw calls as plain structs

Technically, reusing a DrawCall isn't that bad, but I have a gut feeling that it's not right. All suggestions are welcome.

My other issue is the way I bind resources (textures, constant buffers, sampler states and "regular uniforms" - I use the latter because they are WAY easier than constant buffers).
I also like the idea of binding something by string name / string hash, but that lookup is a bit costly, so I think something like slots, as in D3D, might be better. Any suggestions here?

So, any suggestions on how I can improve this?

I did a talk on this last year, page 50 has a very coarse view of my draw call struct: http://tiny.cc/gpuinterface
DrawItem: 32-80 bytes (contains pointers to 0/1 IA configs, and 0-8 resource lists).
Input assembler config: 20-128 bytes
Resource list: 2-256 bytes

Use IDs over pointers. e.g. A pointer is 64 bits. A resource list ID is 16 bits. A blend state ID is 6 bits. Pack all your render state IDs together.
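To illustrate the kind of packing meant here (the sizes and names below are just an example, not Hodgman's actual layout), all of the small state IDs can share a single 32-bit word on typical compilers:

#include <cstdint>

// Example only: four render-state IDs packed into one 32-bit word instead of
// four 64-bit pointers.
struct RenderStateIds
{
    uint32_t blendState   : 6;   // index into a table of up to 64 blend states
    uint32_t rasterState  : 6;   // index into a table of up to 64 rasterizer states
    uint32_t depthStencil : 6;   // index into a table of up to 64 depth-stencil states
    uint32_t resourceList : 14;  // index into a table of up to 16384 resource lists
};
static_assert(sizeof(RenderStateIds) == sizeof(uint32_t), "all state IDs fit in one word");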

Don't store resource pointers/IDs paired with binding slots.
I store a compacted array of draw-call resource bindings, with no slot IDs and no gaps in the array. The shader program itself then contains a small structure that specifies that compacted resource #x should be bound to binding slot #y.
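A rough sketch of that idea (the types and the bindTextureToSlot call are placeholders, not the actual implementation):

// The draw item stores resources back-to-back with no slot numbers; the shader
// program carries a tiny remap table from compacted index to API binding slot.
struct ShaderResourceRemap
{
    uint8_t numResources;
    uint8_t apiSlot[16];   // apiSlot[x] = binding slot for compacted resource #x
};

void bindCompacted(const ShaderResourceRemap& remap, const Texture* compacted) // array of texture handles
{
    for (uint8_t x = 0; x < remap.numResources; ++x)
        bindTextureToSlot(remap.apiSlot[x], compacted[x]);   // back-end specific bind call
}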

Don't use regular uniforms. Use uniform buffers.

Don't bind individual resources. Instead of binding textures individually, I bind resource-lists, which are an array of texture bindings. This dramatically reduces the state inside a draw item (just like going from raw uniforms to uniform buffers).
Resource lists also mean that if a user wants to edit a material, then they can modify a resource list without being required to recreate their draw calls.
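A minimal sketch of what such a resource list could look like (illustrative types only, using the thread's pointer-like Texture handle):

// The draw item stores one small ID per list instead of one binding per texture;
// editing a list affects every draw item that references it, without rebuilding them.
struct ResourceList
{
    uint8_t count;
    Texture textures[15];   // fixed upper bound keeps the list a flat, copyable blob
};

struct DrawItemResources
{
    uint16_t resourceListIds[8];   // up to 8 lists = 16 bytes, instead of a pointer per texture
};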

Don't use std::vector / multiple allocations per draw call. Measure the total required space of your draw call and allocate that much memory up-front. Place all arrays inside this single allocation.
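One possible way to lay that out (the allocator and header names are made up for the example; Buffer/Texture/SamplerState are the pointer-sized handles from the first post):

// Compute the total size first, make one allocation, then carve the arrays out of it.
const size_t total = sizeof(DrawCallHeader)
                   + numCBuffers * sizeof(Buffer)
                   + numTextures * sizeof(Texture)
                   + numSamplers * sizeof(SamplerState);
char* block = static_cast<char*>(frameAllocator.allocate(total));   // e.g. a per-frame linear arena

DrawCallHeader* header   = reinterpret_cast<DrawCallHeader*>(block);
Buffer*         cbuffers = reinterpret_cast<Buffer*>(block + sizeof(DrawCallHeader));
Texture*        textures = reinterpret_cast<Texture*>(cbuffers + numCBuffers);
// ...samplers follow the textures; the header stores the three counts so the
// back-end can walk the block without any per-draw std::vector.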

Break the input assembler state (vertex buffer bindings) out into its own object, as this data is very likely shared between multiple draw calls.
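For example (reusing the types from the struct in the first post), the IA state could be pulled out into something like:

// Shared between all draw calls that render the same mesh.
struct InputAssemblerConfig
{
    Buffer                  vertexBuffers[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS]; // pointer-like handles
    uint32                  vbOffsets[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS];
    uint32                  vbStrides[GraphicsCaps::NUM_VERTEX_BUFFER_SLOTS];
    Buffer                  indexBuffer;
    VertexDeclIndex         vertDeclIndex;
    PrimitiveTopology::Enum primTopology;
};
// The DrawCall then only stores an InputAssemblerConfig* (or a 16-bit ID into a pool).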

Make draw-calls immutable and reusable. If you're going to go to the effort of really compressing these things, then don't recreate them every frame. Rendering from a collection of pre-created draw calls is super fast!

Break the frame target / viewport data out into a separate structure, which must be submitted alongside a draw call, e.g.
void Context::Execute( const Target&, const Draw* draws, uint numDraws )

Thanks for the great explanation Hodgman.

I still have some conceptual questions that I cannot resolve.

Currently I do not have a game, but a small set of scenes that I render via my API, and usually I have only one constant buffer that is constantly bound and that I update before every draw call.
A while ago I did a measurement (if I did it right), and it turned out that having a constantly bound cbuffer which I update() is way faster than having multiple cbuffers and just binding them (not including the updates). Because of that, I cannot understand your concept of cbuffers: to me they are not like the usual texture (which is immutable); to me cbuffers are constantly updated, and that's why I currently store "bare" uniforms in my draw call.

Could you go a little more in depth (or explain what I could be doing wrong) about that cbuffer thing and how you use them?

The thing with shader constants is that they typically change at different frequencies depending on what the data is, and some are reused across many draw calls within the scene, e.g.
- View / projection matrices typically only change once per frame.
- Material constants are the same across many draw calls that use the same material.
- Model matrix changes per draw call.

The goal with constant buffers is to split your data up such that you only re-upload constants to the GPU when they actually change. For example, if you have a separate view & projection matrix per draw call, that's 128 bytes per draw call you're uploading repeatedly. It doesn't sound like much, but that number starts scaling very quickly as you have more data and more draw calls. The cost of switching cbuffers is typically much lower than the CPU cost of recommitting the data into a single constantly bound cbuffer - depending on what exactly your scene construction looks like.
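As a purely illustrative split (the struct and math type names are made up), the CPU-side data could be grouped by how often it changes, with each group backing its own cbuffer:

// Uploaded once per frame.
struct PerFrameConstants    { Mat44 view; Mat44 proj; Vec3 cameraPos; float pad; };
// Uploaded only when a material is edited.
struct PerMaterialConstants { Vec4 baseColor; float roughness; float pad[3]; };
// The only buffer that has to be re-uploaded for every draw.
struct PerDrawConstants     { Mat44 world; };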


The cost of switching cbuffers is typically much lower than the CPU cost of recommitting the data into a single constantly bound cbuffer - depending on what exactly your scene construction looks like.

This is exactly the opposite of what I've measured. At least with my constant buffer which is pretty small in bytes.
My measurements are kind of in sync with https://developer.nvidia.com/content/constant-buffers-without-constant-pain-0 if I understand this paper correctly?

OFFTOPIC:

How can I "ping" Hodgman, I'm not sure that he was able to see that this thread has been updated.


That's why I mentioned that it scales quickly based on the amount of data in the cbuffer and the number of draw calls. Even if you just split it into two constant buffers that are always bound - e.g. the constants that only change once per frame in one, and everything else in the other - you would be transferring less data to the GPU while the GPU is still doing the same work. But this stuff only matters when upload bandwidth is your performance bottleneck.
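To put some entirely made-up numbers on that: with 5000 draw calls and a 192-byte cbuffer per draw, re-uploading everything per draw is roughly 5000 × 192 ≈ 0.96 MB per frame. If 128 of those bytes are the shared view/projection matrices, moving them into a once-per-frame buffer leaves 64 bytes per draw, roughly 5000 × 64 ≈ 0.32 MB per frame plus a single 128-byte upload.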

Regarding your offtopic, if you quote their post they get a notification.

A while ago I did a measurement (if I did it right), and it turned out that having a constantly bound cbuffer which I update() is way faster than having multiple cbuffers and just binding them (not including the updates). Because of that, I cannot understand your concept of cbuffers: to me they are not like the usual texture (which is immutable); to me cbuffers are constantly updated, and that's why I currently store "bare" uniforms in my draw call.
Which API did you measure that on?

You can provide a cbuffer abstraction to the DrawItem system, but then use a completely different back-end implementation.

e.g. on D3D9, there are no cbuffers, so the DrawItem has a CBufferID, which is associated with some memory that came from malloc, where the user has placed their constants, and the back-end copies these constants into D3D9's constant registers.

Or on D3D12, I have a single massive "constant-ring" for streaming per-frame/dynamic constants to the GPU. The back-end copies the user's constants into this ring-buffer every frame, and they're overwritten again next frame.
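A very rough sketch of such a ring (not the actual implementation; real code also has to fence against the GPU before reusing memory):

#include <cstddef>
#include <cstdint>

// Bump-pointer allocator over persistently mapped upload memory. Offsets are
// aligned to 256 bytes, as D3D12 requires for constant buffer views.
struct ConstantRing
{
    uint8_t* base;       // start of the mapped upload-heap region
    size_t   capacity;
    size_t   head = 0;

    // Returns the offset to copy the constants to; wraps when the end is reached.
    size_t allocate(size_t size)
    {
        const size_t aligned = (size + 255) & ~size_t(255);
        if (head + aligned > capacity)
            head = 0;            // simplified: assumes the GPU is done with this region
        const size_t offset = head;
        head += aligned;
        return offset;
    }
};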

In both cases, the DrawItem still just stores a CBufferID as if it's using cbuffers/UBO's.

If you've got OpenGL profiling data that shows that raw uniforms are indeed better than UBOs, then your OpenGL back-end can still use raw uniforms, even if the DrawItem system is using a UBO-like abstraction.

Also - what Digital said. You generally want to split up your constants by update-frequency / data-source. Constants from your camera system are updated rarely and shared between many draw calls. UBO's let you send this data to the GPU once and then simply bind a pointer to each draw-call. On the other hand, traditional GL uniforms require you to repeatedly send the data to the GPU for each draw call - so in the general case, they don't scale as well.

That said, I'm sure that on NVidia's GL drivers (the GL optimization kings), traditional uniforms probably do perform amazingly well in certain situations.

My measurements are kind of in sync with https://developer.nvidia.com/content/constant-buffers-without-constant-pain-0 if I understand this paper correctly?
It doesn't sound like it. They show SetConstantBuffers to be cheaper than updating a constant buffer.

If you're using multiple constant buffers correctly, then you don't need to update them for each draw call.

e.g.

* your per-material buffers and your per-object buffers for static objects get updated once (on load).

* your camera buffer gets updated once per frame

* your per-object buffers for dynamic objects get updated once per draw

So per frame you call UpdateSubresource once for the camera, and once for each dynamic object (unless you're using instancing/etc -- then it's likely once or less per group of mesh types)...

So when rendering static objects, you only need to call SetConstantBuffers, as all your cbuffers already contain the right data.

And when rendering dynamic (non-instanced) objects, you call UpdateSubresource to copy the dynamic data into a cbuffer, and then call SetConstantBuffers.

They then also describe an advanced feature that's available in D3D11.1 and OpenGL4 that lets you reduce the number of calls to UpdateSubresource dramatically.
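In D3D11 terms, that pattern looks roughly like the following (the ctx/Object names and the three-slot layout are just for illustration):

// Once per frame: refresh the camera constants.
ctx->UpdateSubresource(cameraCB, 0, nullptr, &cameraConstants, 0, 0);

// Static objects: every cbuffer already holds the right data, so only bind.
for (const Object& obj : staticObjects)
{
    ID3D11Buffer* cbs[3] = { cameraCB, obj.materialCB, obj.objectCB };
    ctx->VSSetConstantBuffers(0, 3, cbs);
    // ... IA setup + DrawIndexed
}

// Dynamic (non-instanced) objects: update the small per-object buffer, then bind.
for (const Object& obj : dynamicObjects)
{
    ctx->UpdateSubresource(obj.objectCB, 0, nullptr, &obj.constants, 0, 0);
    ID3D11Buffer* cbs[3] = { cameraCB, obj.materialCB, obj.objectCB };
    ctx->VSSetConstantBuffers(0, 3, cbs);
    // ... IA setup + DrawIndexed
}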


Which API did you measure that on?

Direct3D 11. Under OpenGL I do have uniform buffers implemented; the problem is that I'm writing the shading language myself (it gets translated to HLSL/GLSL) and I haven't implemented that feature there yet, so yeah, I'm going to drop these "regular" uniforms in the future.

Maybe I should remeasure and think of a more complex scene.

Otherwise, to back up my decision: as far as I know a constant buffer can contain up to 4096 float4 variables, which is 64 KB. My scenes currently have very little cbuffer data (world, view, proj, a few colors, a few colors for lights and a few point-light positions... nothing fancy).

There is no running away from updating at least one constant buffer, because of the "world" matrix (not to mention animated materials, which I currently do not have, but which are a possibility).
So I did a bit of measuring, and it turns out that if I use only one cbuffer, the cost of updating one variable is the same as updating multiple variables (note again that my cbuffers aren't that big), and if I keep that cbuffer constantly bound I don't pay for binding another buffer.
So yeah, that's the logic behind that decision.

I'm holding off on that cbuffer thing; I really want to hear what you guys have experienced first.
Otherwise, I've adopted the "StateGroup" technique and it flows really nicely with everything else so far. I'm now able to cache those StateGroups and reuse them between draw calls.
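Roughly, a StateGroup along these lines (a simplified sketch with made-up names, not my exact code) is just a small, shareable bundle of state that several draw calls can reference:

// Unset (null) fields fall through to the next group in the stack.
struct StateGroup
{
    RasterizerState   rasterState;        // pointer-like handles, as in the DrawCall struct
    DepthStencilState depthStencilState;
    BlendState        blendState;
};

// A draw references a short stack of groups, applied in precedence order:
// per-draw overrides first, then material defaults, then global defaults.
struct Draw
{
    const StateGroup* groups[3];
    DrawExecDesc      exec;
};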


EDIT:

My concept will not scale well if an object has multiple materials, which isn't that rare (well, it depends... like everything in this world).

Or on D3D12, I have a single massive "constant-ring" for streaming per-frame/dynamic constants to the GPU. The back-end copies the user's constants into this ring-buffer every frame, and they're overwritten again next frame.

I'll add that this is really handy. Even if for some awkward reason it runs slower than plain glUniforms, it's so much easier to manage. Instead of typing a bunch of different glUniformBleh calls for different types, juggling the names, the indices and the fact that they work per shader, with UBOs you get one fat buffer, write bytes into it, and just index into it inside the shader.
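In GL 3.3 terms that workflow is basically the following (binding point 0 and the PerDraw block name are just examples; note that glBindBufferRange offsets must respect GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT):

// Create one fat uniform buffer up front.
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, 64 * 1024, nullptr, GL_DYNAMIC_DRAW);

// Once per program: tie the named uniform block to binding point 0.
GLuint blockIndex = glGetUniformBlockIndex(program, "PerDraw");
glUniformBlockBinding(program, blockIndex, 0);

// Per draw: write the bytes, then point binding 0 at the right range.
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferSubData(GL_UNIFORM_BUFFER, drawOffset, sizeof(PerDrawConstants), &constants);
glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo, drawOffset, sizeof(PerDrawConstants));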

Quoting that nVidia article:


  1. Don’t update a subset of a larger constant buffer, as this increases the accumulated memory size more than necessary, and will cause the driver to reach the renaming limit more quickly. This piece of advice doesn’t apply when you are using the DX11.1 features that allow for partial constant buffer updates.

Partial updates have been possible in GL since UBOs were introduced, AFAIK (3.1, although I'm currently using them with GL 3.3). So it's a feature you can take advantage of on a lot of hardware (+/- driver bugs, as always :P ).

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator

This topic is closed to new replies.
