Seabolt

What are your opinions on DX12/Vulkan/Mantle?

Recommended Posts

Ameise    1148

 

Use the baseInstance parameter of glDraw*BaseVertexBaseInstance. gl_InstanceID will still be zero-based, but you can use an instanced vertex element to overcome this problem (or use an extension that exposes an extra GLSL variable with the value of baseInstance).

And what if you're drawing two different meshes? ie, not instancing a single mesh.

 

 

 

How are you implementing your constant buffers? From what you've written as your #3b, it sounds like you're packing multiple materials'/objects' constants into a single large constant buffer, and perhaps indexing out of it in your draws? IIRC, that's supported only in D3D11.1+, as there is no *SSetConstantBuffer function that takes offsets until then.

I have no idea about D3D11, but it probably isn't even necessary. Just update the entire buffer in one call. The buffer is defined as an array of structs; index into it to fetch the one that corresponds to the current thing being drawn.

 

 

So, he's just saying 'for the next n draws, here are the constants', and then sets indices (somehow? Not sure how he'd track that without also updating a constant. Atomic integers?) to say 'access struct n in the huge constant buffer'?

Honestly, I'd rather update smaller buffers with finer granularity as I wouldn't be stalling on one large copy.

Share this post


Link to post
Share on other sites
Matias Goldberg    9580

How are you implementing your constant buffers? From what you've written as your #3b, it sounds like you're packing multiple materials'/objects' constants into a single large constant buffer, and perhaps indexing out of it in your draws?

Yes
 

IIRC, that's supported only in D3D11.1+, as there is no *SSetConstantBuffer function that takes offsets until then.

That's one way of doing it, and if you do it that way, you're correct. We don't use D3D11.1 functionality; but since OpenGL does support setting constant buffers by offsets, we take advantage of that to further reduce splitting some batches of draw calls.
 

Otherwise, if you aren't using constant buffers with offsets, how are you avoiding having to set things like object transforms and the like? If you are, how are you handling targets below D3D11.1?

By treating all your draws as instanced draws (even if they're just one instance) and using StartInstanceLocation.
I attach a "drawId" R32_UINT vertex buffer (instanced buffer) which is filled with 0, 1, 2, 3, 4 ... 4095 (basically, we can't batch more than 4096 draws together in the same call; that limit is not arbitrary: 4096 * 4 floats per vector = 64kb; aka the const buffer limit).
Hence the "drawId" vertex attribute will always contain the value I want as long as it is in range [0; 4096) and thus index whatever I want correctly.

This is the most compatible way of doing it which works with both D3D11 and OpenGL. There is a GL4 extension that exposes the keywords gl_DrawIDARB & gl_BaseInstanceARB which allows me to do the same without having to use an instanced vertex buffer (thus gaining some performance bits in memory fetching; though I don't know if it's noticeable since the vertex buffer is really small and doesn't consume much bandwidth; also the 4096 draws per call limit can be lifted thanks to this extension). Edited by Matias Goldberg
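The arithmetic behind that 4096-draw limit, and the contents of the "drawId" buffer, can be sketched in plain C++ (an illustration only, not Ogre's actual code; the 64 KB figure is the minimum guaranteed constant/uniform buffer size):

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Minimum guaranteed constant/uniform buffer size (D3D11 & GL): 64 KB.
constexpr std::size_t kConstBufferBytes = 64 * 1024;
// One float4 (16 bytes) of per-draw data per slot.
constexpr std::size_t kBytesPerDraw = 4 * sizeof(float);
// 64 KB / 16 B = 4096 draws per batch: the limit is not arbitrary.
constexpr std::size_t kMaxDrawsPerBatch = kConstBufferBytes / kBytesPerDraw;

// The instanced "drawId" vertex buffer: simply 0, 1, 2, ..., 4095.
std::vector<std::uint32_t> makeDrawIdBuffer() {
    std::vector<std::uint32_t> ids(kMaxDrawsPerBatch);
    std::iota(ids.begin(), ids.end(), 0u);
    return ids;
}
```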

Matias Goldberg    9580

1 is a valid value for the instance count.

Of course, but the idea is to batch up data inside the constant/uniform buffers and use the instance ID for indexing. There's no sense doing it if you can only index one thing (i.e., you end up with what I am doing: one glDraw and glUniform1i call per mesh drawn).

Doing this for just one instance is completely valid. If you do it the way you said, although valid, your API overhead will go through the roof, especially if you have a lot of different meshes. Edited by Matias Goldberg

agleed    1013

Otherwise, if you aren't using constant buffers with offsets, how are you avoiding having to set things like object transforms and the like? If you are, how are you handling targets below D3D11.1?

By treating all your draws as instanced draws (even if they're just one instance) and use StartInstanceLocation.

 

 

And you have no noticeable problems with that? A year and a half ago or so I did some quick tests where I just rendered (in OpenGL) all my objects using normal draw calls vs rendering them all using instancing with instance count = 1, and it had some truly horrendous CPU overhead. The profiler showed that the GPU fell asleep, but the CPU for some reason took a lot longer for everything. If I remember right, for about 700 total draw calls (Crytek Sponza geometry + shadow pass), I saved something like 3 or 4 ms by switching back to normal draw calls for everything (on an i7 3770K and GTX 770). Granted, the setup was suboptimal at best: I sorted by shaders and textures used and nothing else, every mesh was in its own VB, etc. Maybe that was the reason and there's a much smaller instancing overhead otherwise?

Edited by agleed

mhagain    13430

And you have no noticeable problems with that? A year and a half ago or so I did some quick tests where I just rendered (in OpenGL) all my objects using normal draw calls vs rendering them all using instancing with instance count = 1, and it had some truly horrendous CPU overhead. The profiler showed that the GPU fell asleep, but the CPU for some reason took a lot longer for everything. If I remember right, for about 700 total draw calls (Crytek Sponza geometry + shadow pass), I saved something like 3 or 4 ms by switching back to normal draw calls for everything (on an i7 3770K and GTX 770). Granted, the setup was suboptimal at best: I sorted by shaders and textures used and nothing else, every mesh was in its own VB, etc. Maybe that was the reason and there's a much smaller instancing overhead otherwise?

 

This would depend on how you update the per-instance buffer.

 

If you have a small buffer - with space for only one instance - and you do a separate buffer update for each instance, then OpenGL is going to perform horribly (D3D won't).  If you have a large buffer with space for all your instances, but you update them all together, then it should run well.

 

The overhead isn't instancing, it's OpenGL's buffer objects API.
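The two update patterns can be sketched like this; `uploadToGpu` is a hypothetical stand-in for whatever update path is used (glBufferSubData, a mapped range, etc.), and the point is only to count API round-trips:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct InstanceData { float transform[16]; };

// Stand-in for the actual GPU upload (glBufferSubData, persistent map, ...).
// Returns 1 so callers can count how many upload calls were issued.
inline int uploadToGpu(const void*, std::size_t) { return 1; }

// Slow pattern: one tiny buffer update per instance -> N API round-trips.
int updatePerInstance(const std::vector<InstanceData>& instances) {
    int calls = 0;
    for (const auto& inst : instances)
        calls += uploadToGpu(&inst, sizeof(inst));
    return calls;
}

// Fast pattern: pack everything into one large buffer, update once.
int updateBatched(const std::vector<InstanceData>& instances) {
    return uploadToGpu(instances.data(),
                       instances.size() * sizeof(InstanceData));
}
```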

Matias Goldberg    9580

And you have no noticeable problems with that?

Nope, we are not.

every mesh was in its own VB

There's your problem. Every time you had to switch to the next mesh, you had to respecify the VAO state.
You could be hitting the slow path by doing that per mesh + using instancing. The driver may have been able to detect that the VAO only switched buffers with the non-instanced calls, but decided to respecify the whole vertex data when using instancing.
You should keep all your meshes in the same Buffer Object, or at least have very few Buffer Objects.

Also, you obviously compared an instanced version without indexing into one single buffer vs normal draw calls.
You should compare instanced version + indexing into one single buffer vs normal draw calls.
If there is higher overhead from using instancing, it is more than negated by using indexes into a single buffer.

TheChubu    9454

Doing this for just one instance is completely valid. If you do it the way you said, although valid, your API overhead will go through the roof, especially if you have a lot of different meshes.
Then with the instanced method, how would you handle drawing different meshes?

 

ie, as I see it you'd have two ways of doing it:

  1. Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetch transform in index 0. Repeat for every single mesh.
  2. Update transform UBO with all the transforms that can fit, then issue glDraw*Instanced call, repeat this draw call increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned since instance ID is always 0).

So you always end up with one draw call per each different mesh. Thing that differs is UBO updating scheme (no scheme in first one, batching scheme in the second one).
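The bookkeeping behind scheme 2 might look roughly like this (hypothetical names; `baseInstance` is what would be passed as the draw call's base instance ID):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DrawRecord { int meshId; std::size_t baseInstance; };

// Pack one transform per mesh into the UBO; when the UBO is full,
// a real renderer would flush/rebind and start counting from zero again.
std::vector<DrawRecord> planDraws(const std::vector<int>& meshIds,
                                  std::size_t transformsPerUbo) {
    std::vector<DrawRecord> draws;
    std::size_t slot = 0;
    for (int mesh : meshIds) {
        if (slot == transformsPerUbo)
            slot = 0;  // out of transforms: next UBO range starts here
        draws.push_back({mesh, slot++});
    }
    return draws;
}
```

Either way there is still one draw call per distinct mesh; only the UBO update pattern changes, which is exactly the point being made above.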

vlj    1070

I'm starting to get worried by rumors that Google may have its own low-level API too. This would basically mean one API per OS, which breaks the purpose of Vulkan in the first place...

vlj    1070

 

Then with the instanced method, how would you handle drawing different meshes?

 

ie, as I see it you'd have two ways of doing it:

  1. Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetch transform in index 0. Repeat for every single mesh.
  2. Update transform UBO with all the transforms that can fit, then issue glDraw*Instanced call, repeat this draw call increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned since instance ID is always 0).

So you always end up with one draw call per each different mesh. Thing that differs is UBO updating scheme (no scheme in first one, batching scheme in the second one).

 

 

glMultiDrawIndirect basically iterates a glDraw*Instanced call over all elements of the bound indirect draw command buffer.

Hodgman    51324

ie, as I see it you'd have two ways of doing it:

  • Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetch transform in index 0. Repeat for every single mesh.
  • Update transform UBO with all the transforms that can fit, then issue glDraw*Instanced call, repeat this draw call increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned since instance ID is always 0).
So you always end up with one draw call per each different mesh. Thing that differs is UBO updating scheme (no scheme in first one, batching scheme in the second one).
The CPU cost of a draw call depends on the state changes that preceded it.
Apparently setting the base instance ID state is much cheaper than binding a new UBO, which makes sense, as there's a tonne of resource management code that has to run behind the scenes whenever you bind any resource, especially if it's an orphaned resource.

Also, yes, updating one large UBO is going to be much cheaper than updating thousands of small ones. Especially if you use persistent unsynchronized updates.

On the GPU side, draw calls are free. What costs is context/segment switches. If two draw-calls use the same "context", the GPU bundles them together, avoiding stalls.
Certain state changes "roll the context"/"begin a segment"/etc, which means the next draw can't overlap with the previous one.
It would be interesting to find out where base-instance-id state and UBO bindings stand in regards to context rolls on different GPUs...

mhagain    13430

 

Doing this for just one instance is completely valid. If you do it the way you said, although valid, your API overhead will go through the roof, especially if you have a lot of different meshes.
Then with the instanced method, how would you handle drawing different meshes?

 

ie, as I see it you'd have two ways of doing it:

  1. Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetch transform in index 0. Repeat for every single mesh.
  2. Update transform UBO with all the transforms that can fit, then issue glDraw*Instanced call, repeat this draw call increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned since instance ID is always 0).

So you always end up with one draw call per each different mesh. Thing that differs is UBO updating scheme (no scheme in first one, batching scheme in the second one).

 

 

glDrawElementsInstancedBaseInstance - one big per-instance buffer containing all of your instances.  It doesn't have to be a UBO; if the per-instance data is small enough (which it will be if it's just a transform) it can be a VBO and specified in your VAO using glVertexAttribDivisor.  Use the baseinstance parameter of your draw call to specify which instance you're currently drawing.  That will index into your per-instance buffer effectively for free.  gl_InstanceID remains 0-based but in this case it doesn't matter because you're not using it.  You're not touching uniforms, you're not updating state between draws, the only thing that changes is the parameters to your draw calls.
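The fetch rule described here can be modeled in a few lines: with glVertexAttribDivisor(attr, 1), instance i of a draw issued with base instance b reads element b + i of the per-instance buffer, so a single-instance draw simply reads element b. This is a simulation of that rule, not GL code:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// With a divisor of 1, the per-instance attribute seen by instance
// `instanceId` (gl_InstanceID, which stays 0-based) of a draw with base
// instance `baseInstance` is element (baseInstance + instanceId).
template <std::size_t N>
float fetchPerInstance(const std::array<float, N>& perInstanceBuf,
                       std::size_t baseInstance,
                       std::size_t instanceId) {
    return perInstanceBuf[baseInstance + instanceId];
}
```

So a run of single-instance draws that only vary baseInstance walks down the per-instance buffer one element per draw, with no state changes in between.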

TheChubu    9454

I'm not arguing about the impact of the UBO updating/binding; we pretty much established we are trying to do UBO update batching. Nor am I talking about OpenGL 4.x features, as I've mentioned in a previous post (i.e., no indirect draw calls). Strictly GL 3 here.

 

What I am asking is how much different is doing a glDraw*Instanced call changing the instance ID offset vs doing a normal glDraw* call while calling glUniform1i for the indices.

 

Matias said the draw-glUniform1i combination would be detrimental if you have many different meshes (we're still limited by one different mesh -> one draw call), but I'm trying to figure out if it would be that much worse than doing a glDraw*Instanced call and changing the base instance ID, fetching the real index from an instance attribute buffer, since you still have one draw call per different mesh.

 

State isn't changing beyond the glUniform1i call, which requires no binding. If it matters, the same could be accomplished with glVertexAttrib*i, passing the index that way before each draw call (which is what the nVidia guys used for their scene graph presentations in 2013 and 2014).

 

In short, for different meshes, we still have one draw call per mesh, we still can batch UBO updates, but there are several ways to get the needed index to the shader (glUniform*i, glVertexAttrib*i, or glDraw*Instanced with indices in instanced attribute).

 

EDIT: Reading mhagain's post:

 

It doesn't have to be a UBO
  Of course, but Matias said it was a constant buffer, so I'm assuming he is using UBOs.

 

Then again, what you're saying here is basically: don't use UBOs, put it all in instanced attributes? It sounds good, but see this presentation for example; there is a small section on UBO updating and indexing inside the shader. Maybe nVidia hardware doesn't work that well with instanced drawing.

 

EDIT2: Fucking quote blocks and fucking editor for fucks sake I fucking hate it.

Edited by TheChubu

vlj    1070


This is the most compatible way of doing it which works with both D3D11 and OpenGL. There is a GL4 extension that exposes the keywords gl_DrawIDARB & gl_BaseInstanceARB which allows me to do the same without having to use an instanced vertex buffer (thus gaining some performance bits in memory fetching; though I don't know if it's noticeable since the vertex buffer is really small and doesn't consume much bandwidth; also the 4096 draws per call limit can be lifted thanks to this extension).

 

 

Not sure if it helps performance-wise; the AZDO slides say that gl_DrawID often cripples performance.

http://fr.slideshare.net/CassEveritt/approaching-zero-driver-overhead (slide 33)

 

There is no word on gl_BaseInstance though.

Agony    3452

I'm starting to get worried by rumors that Google may have its own low-level API too. This would basically mean one API per OS, which breaks the purpose of Vulkan in the first place...

Wouldn't that actually justify the purpose of Vulkan?  One cross-platform API that will work regardless of the OS, more or less the same purpose as with OpenGL.  This would be in contrast to it simply being intended to fill the platform gap, providing a high performance 3D graphics API on platforms that don't have a native one.  I'd be highly disappointed if the latter purpose were all that the Vulkan designers ever hoped to achieve.

Ameise    1148

 

How are you implementing your constant buffers? From what you've written as your #3b, it sounds like you're packing multiple materials'/objects' constants into a single large constant buffer, and perhaps indexing out of it in your draws?

Yes
 

IIRC, that's supported only in D3D11.1+, as there is no *SSetConstantBuffer function that takes offsets until then.

That's one way of doing it, and if you do it that way, you're correct. We don't use D3D11.1 functionality; but since OpenGL does support setting constant buffers by offsets, we take advantage of that to further reduce splitting some batches of draw calls.
 

Otherwise, if you aren't using constant buffers with offsets, how are you avoiding having to set things like object transforms and the like? If you are, how are you handling targets below D3D11.1?

By treating all your draws as instanced draws (even if they're just one instance) and using StartInstanceLocation.
I attach a "drawId" R32_UINT vertex buffer (instanced buffer) which is filled with 0, 1, 2, 3, 4 ... 4095 (basically, we can't batch more than 4096 draws together in the same call; that limit is not arbitrary: 4096 * 4 floats per vector = 64kb; aka the const buffer limit).
Hence the "drawId" vertex attribute will always contain the value I want as long as it is in range [0; 4096) and thus index whatever I want correctly.

This is the most compatible way of doing it which works with both D3D11 and OpenGL. There is a GL4 extension that exposes the keywords gl_DrawIDARB & gl_BaseInstanceARB which allows me to do the same without having to use an instanced vertex buffer (thus gaining some performance bits in memory fetching; though I don't know if it's noticeable since the vertex buffer is really small and doesn't consume much bandwidth; also the 4096 draws per call limit can be lifted thanks to this extension).

 

So, let me reword it to make sure I understand what you're doing. You have a single per-instance "vertex buffer" that just has a sequence of integers from 0 to 4095. Your draws specify which instance offset they are, and because of that, the per-instance ID you get matches the overall ID of the draw. You then use that ID to access the constants for that draw from a 64K constant buffer.

Right?

Do you have any performance issues with it? I'd be wary of requiring a 64K copy to GPU memory before draws. Is it faster if you use smaller batches?

Also importantly, how scalable is this to next-gen draw bundles? Would you need to use indirect parameters to inject the right instance offset for your draws in the bundles?

I suppose the main benefit here is that you have reduced the number of actual API calls dramatically to actually perform draws, though with the next-gen APIs, isn't the benefit of that going to be somewhat mitigated by the fact that the actual draws themselves can be completely encapsulated in prebuilt bundles?

Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.

Edited by Ameise

microlee    534


Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.

AFAIK, it's the former: the draws are pushed onto the GPU's command buffer as such.


 

vlj    1070
On Radeon (beginning with the HD 6950) and GeForce, multi-draw indirect is a hardware feature. On Intel (up to Haswell at least; I don't know about future chips) it is emulated by a loop in the driver.

Killeak    269

 

I starting to be worried by rumors that Google may have its own low level api too. This would basically mean one API per OS which break the purpose of Vulkan in the first place...

Wouldn't that actually justify the purpose of Vulkan?  One cross-platform API that will work regardless of the OS, more or less the same purpose as with OpenGL.  This would be in contrast to it simply being intended to fill the platform gap, providing a high performance 3D graphics API on platforms that don't have a native one.  I'd be highly disappointed if the latter purpose were all that the Vulkan designers ever hoped to achieve.

 

The problem is that Google is the one that controls Android, and it's not an open environment like Windows for the user (at least for most users), in the sense that you can't install drivers like you do on Windows; the drivers come with the device and its updates (which, for worse, in most cases are under control of the carriers).

 

Sure, if you install CyanogenMod or some other custom version you can do whatever you want, but normal users are stuck with what comes with the device, which means that if Google decides not to implement Vulkan, you can't use Vulkan on Android and you are forced to use their API (so it's worse than MS with DX vs OpenGL, since on Windows at least you can always install a driver that implements the latest OpenGL).

 

The thing is, these days Google is the new MS and Android the new Windows; they have the biggest portion of the market and they are in a position where they can do whatever they want.

This is a worst-case scenario. I don't think Google will do this, but it's a possibility that worries me.
 

But to be honest, I don't care if I have to implement one or two more APIs, since we already support a bunch (DX11, OpenGL 3.x, 4.x, ES 2.x, ES 3.x, and we have an early implementation for DX12 and want to add support for PS4 as well). I prefer supporting multiple strong and solid APIs to a bad one (OpenGL, I'm looking at you).
 
Vulkan seems great but I still have reservations. The good thing is that it's just like D3D12, so porting should be very easy (the only thing that I need now is an HLSL compiler to SPIR-V).
Edited by Killeak

mhagain    13430

On Radeon (beginning with the HD 6950) and GeForce, multi-draw indirect is a hardware feature. On Intel (up to Haswell at least; I don't know about future chips) it is emulated by a loop in the driver.

 

One would hope that Intel at least validates once only rather than for each draw call in the loop.

Matias Goldberg    9580

**Sigh**

If anyone has lots of questions, you can just compile and try Ogre 2.1, then dissect its source code to see how we're handling it. It's Open Source after all.
Doing what I'm saying is not impossible, otherwise we wouldn't be doing it.

To answer TheChubu's question, glDrawElementsInstancedBaseVertexBaseInstance has THREE key parameters:

  • baseInstance: With this I can send an arbitrary index as I explained, which I can use to index whatever I want from a constant (UBO) or texture (TBO) buffer. I can even perform multiple indirections (retrieve an index from an UBO using the index from baseInstance)
  • baseVertex: With this I can store as many meshes as I want in the same buffer object, and select them individually by providing the offset location to the start of the mesh I want to render. With this, I don't need to alter state at all (unless the vertex format changes). The meshes don't even need to be contiguous in memory; they just need to be in the same Buffer Object and aligned to the vertex size.
  • indices: With this I can store as many meshes' index data as I want in the same buffer object, and select them individually by providing the offset location to the start of the index data. Remember to keep alignment to 4 bytes. Bonus points: You can keep the vertex and index data in the same buffer object.

 

The DX11 equivalent of this is DrawIndexedInstanced, and the analogous parameters are StartInstanceLocation, BaseVertexLocation & StartIndexLocation respectively.

We treat all of our draws with these functions.

 

The DX11 function works on DX10 hardware just fine. glDrawElementsInstancedBaseVertexBaseInstance was introduced in GL 4.2; however it is available to GL3 hardware via extension. The most notable remark is that OS X doesn't support this extension, at the time of writing.

 

The end result is that we just map the buffer(s) once; write all the data in sequence; bind these buffers and then issue a lot of consecutive glDrawElementsInstancedBaseVertexBaseInstance / DrawIndexedInstanced calls without any other API calls in between.

We only need to perform additional API calls when:

  • We need to bind a different buffer / buffer section (i.e. we've exhausted the 64kb limit)
  • We need to change state (shaders, vertex format, blending modes, rasterizer states; we keep them sorted to reduce this)
  • We're using more than one mesh pool (pool = a buffer where we store all our meshes together), and the next mesh is stored in another pool (we sort by pools though, in order to reduce this switching).
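Under those constraints, the submission loop reduces to something like the following sketch (hypothetical structs; each DrawCmd maps straight onto the parameters of glDrawElementsInstancedBaseVertexBaseInstance / DrawIndexedInstanced):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Where a mesh lives inside the shared pool (vertex + index buffer).
struct MeshLocation {
    std::uint32_t baseVertex;   // first vertex of the mesh in the pool
    std::uint32_t firstIndex;   // first index of the mesh in the pool
    std::uint32_t indexCount;
};

// One record per draw: everything varies via draw-call parameters only.
struct DrawCmd {
    std::uint32_t indexCount, firstIndex, baseVertex, baseInstance;
};

// One command per mesh, instance count 1, baseInstance = drawId, which
// indexes the per-draw constants written sequentially into the mapped
// buffer. No state changes between consecutive commands.
std::vector<DrawCmd> buildCommands(const std::vector<MeshLocation>& meshes) {
    std::vector<DrawCmd> cmds;
    std::uint32_t drawId = 0;
    for (const auto& m : meshes)
        cmds.push_back({m.indexCount, m.firstIndex, m.baseVertex, drawId++});
    return cmds;
}
```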

Ameise    1148

 

**Sigh**

If anyone has lots of questions, you can just compile and try Ogre 2.1, then disect its source code to see how we're handling it. It's Open Source after all.
Doing what I'm saying is not impossible, otherwise we wouldn't be doing it.

To answer The Chubu's question, glDrawElementsInstancedBaseVertexBaseInstance has THREE key parameters:

  • baseInstance: With this I can send an arbitrary index as I explained, which I can use to index whatever I want from a constant (UBO) or texture (TBO) buffer. I can even perform multiple indirections (retrieve an index from an UBO using the index from baseInstance)
  • baseVertex: With this I can store as many meshes as I want in the same buffer object, and select them individually by providing the offset location to the start of the mesh I want to render. With this, I don't need to alter state at all (unless the vertex format changes). The meshes don't even need to be contiguous in memory; they just need to be in the same Buffer Object and aligned to the vertex size.
  • indices: With this I can store as many meshes' index data as I want in the same buffer object, and select them individually by providing the offset location to the start of the index data. Remember to keep alignment to 4 bytes. Bonus points: You can keep the vertex and index data in the same buffer object.

 

The DX11 equivalent of this is DrawIndexedInstanced, and the analogous parameters are StartInstanceLocation, BaseVertexLocation & StartIndexLocation respectively.

We treat all of our draws with these functions.

 

The DX11 function works on DX10 hardware just fine. glDrawElementsInstancedBaseVertexBaseInstance was introduced in GL 4.2; however it is available to GL3 hardware via extension. The most notable remark is that OS X doesn't support this extension, at the time of writing.

 

The end result is that we just map the buffer(s) once; write all the data in sequence; bind these buffers and then issue a lot of consecutive glDrawElementsInstancedBaseVertexBaseInstance / DrawIndexedInstanced calls without any other API calls in between.

We only need to perform additional API calls when:

  • We need to bind a different buffer / buffer section (i.e. we've exhausted the 64kb limit)
  • We need to change state (shaders, vertex format, blending modes, rasterizer states; we keep them sorted to reduce this)
  • We're using more than one mesh pool (pool = a buffer where we store all our meshes together), and the next mesh is stored in another pool (we sort by pools though, in order to reduce this switching).

 

 

To be fair, I understood what you were doing after my first post - I just wanted to know if you were experiencing any stalling from filling your buffer prior to a large batch of draws, and how you were going to handle a Vulkan/D3D12 transition with bundles and the like.

 

Also, I've never really liked the 'just look at the source' thing. I have no familiarity with Ogre, and the last time I tried to build it, it was relatively difficult and frustrating. It would probably take me quite some time not only to understand what you're doing, but why, whereas spending the 10 minutes in conversation would be more productive (and I get the benefit of talking to a fine chap such as yourself :) ).

Matias Goldberg    9580

So, let me reword it to make sure I understand what you're doing. You have a single per-instance "vertex buffer" that just has a sequence of integers from 0 to 4095. Your draws specify which instance offset they are, and because of that, the per-instance ID you get matches the overall ID of the draw. You then use that ID to access the constants for that draw from a 64K constant buffer.

Right?

Yes.
 

Do you have any performance issues with it? I'd be wary of requiring a 64K copy to GPU memory before draws. Is it faster if you use smaller batches?

No. "Batch, batch, batch" is still relevant today because GPUs are brute-force feats of engineering. They like processing things in batches.
I don't understand your "requiring a 64K copy to GPU memory before draws" part though.

 

Also importantly, how scalable is this to next-gen draw bundles? Would you need to use indirect parameters to inject the right instance offset for your draws in the bundles?

I suppose the main benefit here is that you have reduced the number of actual API calls dramatically to actually perform draws, though with the next-gen APIs, isn't the benefit of that going to be somewhat mitigated by the fact that the actual draws themselves can be completely encapsulated in prebuilt bundles?

Let me make something clear: this is how Vulkan and D3D12 approach rendering. You will be doing this. These new APIs let us do things that right now are somewhat hacky or difficult (in OpenGL 4, the problem is doing multithreading well; on D3D11, the problem is not having NO_OVERWRITE on constant buffers unless you're on D3D11.1; also, needing an instanced vertex buffer to get arbitrary instance IDs is a hack...), and perform much less hazard tracking.
Bundles allow you to record calls to reduce validation overhead and maximize reuse, but DX12/Vulkan can't hide GPU state changes, as Hodgman explained.
 

Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.

You're missing the point. The point of indirect rendering is that the draw call data is stored in GPU memory and can be filled from multiple cores in parallel, or even from a compute shader. Whether dedicated hardware runs through the indirect buffer during an MDI call is secondary (which is true for AMD's GCN, btw).

Matias Goldberg    9580

To be fair, I understood what you were doing after my first post - I just wanted to know if you were experiencing any stalling from filling your buffer prior to a large batch of draws, and how you were going to handle a Vulkan/D3D12 transition with bundles and the like.

I don't understand why you think there would be a stall.

Let's assume we draw exactly 4096 meshes (it doesn't matter whether they're different or not).
This will consume the whole 64 KiB buffer.

If I split this into smaller batches, I will perform 4096 draw calls and use 16 bytes per draw. In the end I will still have used 64 KiB, but with much more binding in between.

If I draw fewer, say 16 meshes, then I will write 256 bytes from the CPU to the GPU. Then two things may happen depending on the RenderSystem:

  1. GL/Vulkan/D3D12: We bind 256 bytes, and the GPU loads exactly 256 bytes into its L1/L2 caches (or constant register file).
  2. D3D11: We bind 64 KiB (since D3D11 binds the whole buffer's size) but use only 256 bytes. The remaining 65,280 bytes may be loaded by the GPU, but they are never read by the shader and are filled with garbage.

The most "wasteful" here is D3D11. However, reading the extra 65,280 bytes is trivial for any GPU and is certainly not going to be a bottleneck worth worrying about.

Ameise    1148

 

To be fair, I understood what you were doing after my first post - I just wanted to know if you were experiencing any stalling from filling your buffer prior to a large batch of draws, and how you were going to handle a Vulkan/D3D12 transition with bundles and the like.

I don't understand why you think there would be a stall.

Let's assume we draw exactly 4096 meshes (it doesn't matter whether they're different or not).
This will consume the whole 64 KiB buffer.

If I split this into smaller batches, I will perform 4096 draw calls and use 16 bytes per draw. In the end I will still have used 64 KiB, but with much more binding in between.

If I draw fewer, say 16 meshes, then I will write 256 bytes from the CPU to the GPU. Then two things may happen depending on the RenderSystem:

  1. GL/Vulkan/D3D12: We bind 256 bytes, and the GPU loads exactly 256 bytes into its L1/L2 caches (or constant register file).
  2. D3D11: We bind 64 KiB (since D3D11 binds the whole buffer's size) but use only 256 bytes. The remaining 65,280 bytes may be loaded by the GPU, but they are never read by the shader and are filled with garbage.

The most "wasteful" here is D3D11. However, reading the extra 65,280 bytes is trivial for any GPU and is certainly not going to be a bottleneck worth worrying about.

 

I was thinking that since you batched 4,096 draw calls, before the first call even got submitted to the GPU you'd have to copy your 64 KiB of data into GPU memory. Until that copy completed, the GPU might not be doing anything at all. However, I am quite possibly vastly underestimating the speed of such a copy (which is probably on the order of microseconds).

Hodgman    51324

I was thinking that since you batched 4,096 draw calls, before the first call even got submitted to the GPU you'd have to copy your 64 KiB of data into GPU memory. Until that copy completed, the GPU might not be doing anything at all. However, I am quite possibly vastly underestimating the speed of such a copy (which is probably on the order of microseconds).

If you do it properly, copying the data is completely asynchronous, in that it only involves CPU work; the GPU is unaware of it.
Also, the GPU is almost always one or more frames behind the CPU, i.e. the OS/driver is buffering all your draw calls for one or more frames.
Pausing draw submission to copy some data, run gameplay code, etc. will not starve the GPU, because there's this massive frame-long buffer of work. As long as you submit a frame's worth of work once per frame, the GPU will never starve.

