What are your opinions on DX12/Vulkan/Mantle?

**Sigh**

If you have lots of questions, you can just compile and try Ogre 2.1, then dissect its source code to see how we're handling it. It's open source, after all.
What I'm describing is not impossible; otherwise we wouldn't be doing it.

To answer The Chubu's question, glDrawElementsInstancedBaseVertexBaseInstance has three key parameters (a short usage sketch follows the list):

  • baseInstance: With this I can send an arbitrary index, as I explained, which I can use to index whatever I want from a constant (UBO) or texture (TBO) buffer. I can even perform multiple indirections (retrieve an index from a UBO using the index from baseInstance).
  • baseVertex: With this I can store as many meshes as I want in the same buffer object and select them individually by providing the offset to the start of the mesh I want to render. With this, I don't need to alter state at all (unless the vertex format changes). The meshes don't even need to be contiguous in memory; they just need to be in the same buffer object and aligned to the vertex size.
  • indices: With this I can store as many meshes' index data as I want in the same buffer object, and select them individually by providing the offset to the start of the index data. Remember to keep 4-byte alignment. Bonus points: you can keep the vertex and index data in the same buffer object.
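
To make the parameter roles concrete, here is a minimal sketch of how consecutive draws out of one shared buffer could look. All names (MeshEntry, drawAll, sharedVao) are hypothetical illustrations, not Ogre's actual code; it assumes a GL 4.2+ context (or ARB_base_instance) with the shared vertex/index buffers already attached to the VAO.

```cpp
#include <vector>
#include <GL/glew.h>   // or any other GL loader

// Hypothetical bookkeeping for one mesh living inside the shared buffers.
struct MeshEntry
{
    GLsizei numIndices;
    size_t  indexOffsetBytes;   // 'indices': byte offset into the shared index buffer (4-byte aligned)
    GLint   baseVertex;         // 'baseVertex': where this mesh's vertices start in the shared VBO
};

void drawAll( GLuint sharedVao, const std::vector<MeshEntry> &meshes )
{
    glBindVertexArray( sharedVao );   // one bind for the whole batch

    for( GLuint drawId = 0; drawId < meshes.size(); ++drawId )
    {
        const MeshEntry &mesh = meshes[drawId];
        glDrawElementsInstancedBaseVertexBaseInstance(
            GL_TRIANGLES,
            mesh.numIndices,
            GL_UNSIGNED_SHORT,
            reinterpret_cast<const void*>( mesh.indexOffsetBytes ),
            1,                  // instance count: 1 unless we're actually instancing
            mesh.baseVertex,
            drawId );           // 'baseInstance': the arbitrary per-draw index
    }
}
```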

The DX11 equivalent of this is DrawIndexedInstanced, and the analogous parameters are StartInstanceLocation, BaseVertexLocation and StartIndexLocation, respectively.

We issue all of our draws with these functions.

The DX11 function works on DX10 hardware just fine. glDrawElementsInstancedBaseVertexBaseInstance was introduced in GL 4.2; however, it is available to GL3 hardware via extension. The most notable caveat is that OS X doesn't support this extension at the time of writing.

The end result is that we just map the buffer(s) once, write all the data in sequence, bind these buffers, and then issue a lot of consecutive glDrawElementsInstancedBaseVertexBaseInstance / DrawIndexedInstanced calls without any other API calls in between.

We only need to perform additional API calls when (see the sketch after this list):

  • We need to bind a different buffer / buffer section (i.e. we've exhausted the 64 KB limit)
  • We need to change state (shaders, vertex format, blending modes, rasterizer states; we keep them sorted to reduce this)
  • We're using more than one mesh pool (pool = a buffer where we store all our meshes together), and the next mesh is stored in another pool (we sort by pools though, in order to reduce this switching).
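
Roughly, the submission loop then only touches state when one of those conditions hits. Here is a sketch of that loop under the same assumptions as before; QueuedDraw, submit and the rest are made-up names, not Ogre's actual classes, and renderQueue is assumed to already be sorted by shader, pool and constant-buffer section.

```cpp
#include <vector>
#include <GL/glew.h>   // or any other GL loader

// Hypothetical record for one queued draw after sorting.
struct QueuedDraw
{
    GLuint   program, vao;       // sort keys: shader, mesh pool / vertex format
    GLintptr uboOffset;          // which 64 KB section of the constant buffer
    GLsizei  numIndices;
    size_t   indexOffsetBytes;
    GLint    baseVertex;
    GLuint   drawId;             // becomes baseInstance
};

void submit( const std::vector<QueuedDraw> &renderQueue, GLuint constBuffer )
{
    GLuint   lastProgram = 0, lastVao = 0;
    GLintptr lastUboOffset = -1;

    for( const QueuedDraw &q : renderQueue )
    {
        if( q.program != lastProgram )        // state change: shaders
        { glUseProgram( q.program ); lastProgram = q.program; }

        if( q.vao != lastVao )                // different pool / vertex format
        { glBindVertexArray( q.vao ); lastVao = q.vao; }

        if( q.uboOffset != lastUboOffset )    // exhausted the 64 KB section
        {
            glBindBufferRange( GL_UNIFORM_BUFFER, 0, constBuffer,
                               q.uboOffset, 64 * 1024 );
            lastUboOffset = q.uboOffset;
        }

        glDrawElementsInstancedBaseVertexBaseInstance(
            GL_TRIANGLES, q.numIndices, GL_UNSIGNED_SHORT,
            reinterpret_cast<const void*>( q.indexOffsetBytes ),
            1, q.baseVertex, q.drawId );
    }
}
```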

To be fair, I understood what you were doing after my first post - I just wanted to know if you were experiencing any stalling from filling your buffer prior to a large batch of draws, and how you were going to handle a Vulkan/D3D12 transition with bundles and the like.

Also, I've never really liked the 'just look at the source' thing. I have no familiarity with Ogre, and the last time I tried to build it, it was relatively difficult and frustrating. It would probably take me quite some time not only to understand what you're doing but why, whereas spending ten minutes in conversation is more productive (and I get the benefit of talking to a fine chap such as yourself :) ).

So, let me reword it to make sure I understand what you're doing. You have a single per-instance "vertex buffer" that just holds a sequence of integers from 0 to 4095. Each draw specifies its instance offset, so the per-instance ID the shader reads matches the overall index of that draw. You then use that ID to fetch that draw's constants from a 64K constant buffer.

Right?

Yes.
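
For anyone following along, here is a minimal sketch of how such a per-instance ID buffer could be set up in plain GL. The attribute slot (15) and all names are arbitrary assumptions, not Ogre's actual code; the matching vertex shader would declare an integer input on that slot and use it to index the per-draw constants.

```cpp
#include <vector>
#include <cstdint>
#include <GL/glew.h>   // or any other GL loader

// Builds the 0..4095 "drawId" buffer and hooks it up as a per-instance
// integer attribute. With instance count 1 and baseInstance = N, the vertex
// shader's attribute then reads the value N.
GLuint createDrawIdBuffer()
{
    std::vector<uint32_t> drawIds( 4096 );
    for( uint32_t i = 0; i < drawIds.size(); ++i )
        drawIds[i] = i;

    GLuint drawIdBuffer;
    glGenBuffers( 1, &drawIdBuffer );
    glBindBuffer( GL_ARRAY_BUFFER, drawIdBuffer );
    glBufferData( GL_ARRAY_BUFFER, drawIds.size() * sizeof(uint32_t),
                  drawIds.data(), GL_STATIC_DRAW );

    glVertexAttribIPointer( 15, 1, GL_UNSIGNED_INT, 0, nullptr ); // integer attribute
    glVertexAttribDivisor( 15, 1 );                               // advance once per instance
    glEnableVertexAttribArray( 15 );

    return drawIdBuffer;
}
```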

Do you have any performance issues with it? I'd be wary of requiring a 64K copy to GPU memory before draws. Is it faster if you use smaller batches?

No. "Batch, batch, batch" is still relevant today because GPUs are brute force powers of engineering. They like processing things in batch.
I don't understand your "requiring a 64K copy to GPU memory before draws" part though.


Also importantly, how scalable is this to next-gen draw bundles? Would you need to use indirect parameters to inject the right instance offset for your draws in the bundles?

I suppose the main benefit here is that you have reduced the number of actual API calls dramatically to actually perform draws, though with the next-gen APIs, isn't the benefit of that going to be somewhat mitigated by the fact that the actual draws themselves can be completely encapsulated in prebuilt bundles?

Let me make something clear: this is how Vulkan and D3D12 approach rendering. You will be doing this. These new APIs let us do cleanly what right now is somewhat hacky or difficult (in OpenGL4 the problem is doing multithreading well; on D3D11 the problem is not having NO_OVERWRITE on constant buffers unless you're on D3D11.1; also, needing an instanced vertex buffer to get arbitrary instance IDs is a hack...), and they perform much less hazard tracking.
Bundles allow you to record calls to reduce validation overhead and maximize reuse, but DX12/Vulkan can't hide GPU state changes, as Hodgman explained.

Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.

You're missing the point. The point of indirect rendering is that the draw call data is stored in GPU memory and can be filled from multiple CPU cores in parallel, or even from a compute shader. Whether dedicated hardware walks through the indirect buffer during an MDI call is secondary (which it does on AMD's GCN, btw).
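
For concreteness, this is what that looks like in GL terms with multi-draw indirect (MDI): the command struct layout is defined by the spec, the buffer holding the commands can be filled by several CPU threads or by a compute shader, and a single call then consumes all of it. A minimal sketch, not tied to any particular engine:

```cpp
#include <GL/glew.h>   // or any other GL loader

// Layout of one indirect draw command, as defined by the GL spec for
// glMultiDrawElementsIndirect. One of these per draw lives in GPU memory.
struct DrawElementsIndirectCommand
{
    GLuint count;           // index count
    GLuint instanceCount;   // usually 1
    GLuint firstIndex;      // offset into the shared index buffer, in indices
    GLuint baseVertex;
    GLuint baseInstance;    // same per-draw ID trick as before
};

// Issues every queued draw with a single API call; 'indirectBuffer' is assumed
// to already contain 'numDraws' tightly packed commands.
void submitIndirect( GLuint indirectBuffer, GLsizei numDraws )
{
    glBindBuffer( GL_DRAW_INDIRECT_BUFFER, indirectBuffer );
    glMultiDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_SHORT,
                                 nullptr,    // commands start at byte offset 0
                                 numDraws,
                                 0 );        // 0 stride = tightly packed
}
```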

To be fair, I understood what you were doing after my first post - I just wanted to know if you were experiencing any stalling from filling your buffer prior to a large batch of draws, and how you were going to handle a Vulkan/D3D12 transition with bundles and the like.

I don't understand why you think there would be a stall.

Let's assume we draw exactly 4096 meshes (doesn't matter if they're different or not).
This will consume the whole 64kb buffer.

If I split this into smaller batches, I will still perform 4096 draw calls and use 16 bytes per draw. In the end I will still have used 64 KB, but with much more binding in between.

If I draw fewer, say, 16 meshes, then I will write 256 bytes from the CPU to the GPU. Then two things may happen depending on the RenderSystem:

  1. GL/Vulkan/D3D12: We bind 256 bytes, and the GPU loads exactly 256 bytes into its L1/L2 caches (or constant register file).
  2. D3D11: We bind 64 KB (since D3D11 binds the whole buffer's size). We will be using 256 bytes. The remaining 65,280 bytes may be loaded by the GPU, but they are not read by the shader and are filled with garbage.

The most "wasteful" here is D3D11. However, reading the extra 65.280 bytes is a joke for any GPU; and is certainly not going to be a bottleneck to worry about.

I was thinking that in the case where you batch 4,096 draw calls, before the calls even get submitted to the GPU you'd have to perform a copy of your 64 KiB of data to GPU memory. Until that copy is complete, the GPU may not be doing anything at all. However, I am quite possibly vastly underestimating the speed of such a copy (which is probably on the order of microseconds).

If you do it properly, copying data is completely asynchronous, in that it only involves CPU work - the GPU is unaware.
Also, the GPU is almost always one or more frames behind the CPU - i.e. the OS/driver are buffering all your draw calls for one or more frames.
Stopping issuing draws to copy some data, run gameplay code, etc. will not starve the GPU, because there's this massive frame-long buffer of work. As long as you submit a frame's worth of work once per frame, the GPU will never starve.
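
One common way to "do it properly" in GL 4.4+ (a general technique, not necessarily what Hodgman's or Ogre's code does) is a persistently mapped ring buffer guarded by fences, so the CPU writes one section while the GPU consumes another and neither waits on the other in the steady state:

```cpp
#include <GL/glew.h>   // or any other GL loader

// Persistently mapped, triple-buffered constant buffer. The CPU fills section
// (frame % 3) while the GPU is still reading older sections; the fence only
// makes us wait if the GPU has fallen 3+ frames behind.
const GLsizeiptr kSectionSize = 64 * 1024;
const int        kNumSections = 3;

GLuint buffer;
char  *mappedPtr;
GLsync fences[kNumSections] = {};

void createRingBuffer()
{
    glGenBuffers( 1, &buffer );
    glBindBuffer( GL_UNIFORM_BUFFER, buffer );
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;
    glBufferStorage( GL_UNIFORM_BUFFER, kSectionSize * kNumSections, nullptr, flags );
    mappedPtr = static_cast<char*>(
        glMapBufferRange( GL_UNIFORM_BUFFER, 0,
                          kSectionSize * kNumSections, flags ) );
}

void *beginFrameSection( int frameIdx )   // call before writing this frame's constants
{
    const int section = frameIdx % kNumSections;
    if( fences[section] )
    {
        glClientWaitSync( fences[section], GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1) );
        glDeleteSync( fences[section] );
        fences[section] = nullptr;
    }
    return mappedPtr + section * kSectionSize;
}

void endFrameSection( int frameIdx )      // call after submitting the frame's draws
{
    fences[frameIdx % kNumSections] = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
}
```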

I think I'm too used to working on CPU-bound applications to ever actually experience this :)

Well, in a CPU-bound situation the GPU will starve every frame until you can manage to get your CPU frame times below your GPU frame times.

As for the copy, say we're lucky enough to have a 20 Gbps bus - that's ~2.33 GiB/s, or ~2.38 MiB per millisecond, or ~2.44 KiB per microsecond!
So, 64KiB could be transferred in ~26 microseconds.

On the other hand, if you have to do a GL/D3D map/unmap operation, that's probably 300 microseconds of driver overhead!
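
A quick back-of-the-envelope check of those numbers (the 20 Gbps figure is just the hypothetical bus from above):

```cpp
#include <cstdio>

int main()
{
    const double bytesPerSec    = 20e9 / 8.0;                               // 20 Gbps bus
    const double gibPerSec      = bytesPerSec / (1024.0 * 1024.0 * 1024.0); // ~2.33 GiB/s
    const double kibPerMicrosec = bytesPerSec / 1024.0 / 1e6;               // ~2.44 KiB/us
    const double microsecFor64k = 64.0 / kibPerMicrosec;                    // ~26 us

    std::printf( "%.2f GiB/s, %.2f KiB/us, %.1f us to move 64 KiB\n",
                 gibPerSec, kibPerMicrosec, microsecFor64k );
    return 0;
}
```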

Yup, on certain projects I've certainly seen map/unmap operations build up.

This is a different (and personal) codebase from what I usually work on (which are clients'), so I'm trying to "do things right" - I suspect I'm a bit 'polluted' by other people's codebases that didn't necessarily work well. Forgive my questions if they seem ignorant - I haven't worked on an actual modern, well-performing codebase :(.

I'm pretty sure it's very hard to saturate a PCIe x16 bus unless you're doing some very serious graphics (i.e., think Crysis 14 or something) or plain stupid things (i.e., re-uploading all textures every frame). If there is a stall from the application's POV, it will probably be caused by a driver synchronization point (in which case you'd have to rework how you are doing things) and/or pure API overhead (in which case you'd need to minimize API calls).

Thanks Mathias for answering my questions :D

EDIT: For fuck's sake, this editor and fucking quote blocks. It's fucking broken. BROKEN, YOU HEAR ME!? BROKEN! NOT FROZEN! BROKEN!

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator
