Is MultiDrawIndirect/ExecuteIndirect really worth it?

Started by
3 comments, last by Hodgman 6 years, 10 months ago

Hi Folks,

I am currently designing a new rendering pipeline and I have a serious question about the mechanism behind MultiDrawIndirect/ExecuteIndirect...

Say that, for the purpose of efficient culling, I split all my geometries into batches and store these batches into a large buffer containing N batches.

A batch simply contains a pointer to the triangle StartIndex and an integer defining the triangle count (there can be from 64 to 512 triangles per batch).

Let's assume that I want to use MultiDrawIndirect to draw these N batches in a single draw call...

The draw arguments buffer would thus look somewhat like that:

----- batch 0 -----
vertex count: 89
start index: 0

----- batch 1 -----
vertex count: 394
start index: 89

----- batch 2 -----
vertex count: 145
start index: 483

...
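
For reference, here is a minimal C++ sketch of how an argument buffer like the one above could be filled on the CPU. The command struct follows the five-field layout that OpenGL documents for glMultiDrawElementsIndirect (D3D12_DRAW_INDEXED_ARGUMENTS uses the same field order). Everything else is an assumption made for illustration: the names Batch and buildCommands are hypothetical, the "vertex count" from the listing is treated as the index count of an indexed draw, and the fields the post does not mention (instanceCount, baseVertex, baseInstance) are simply set to 1, 0 and 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field order as documented for OpenGL's DrawElementsIndirectCommand.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices in this draw
    std::uint32_t instanceCount; // 1 = draw once, 0 = skip this draw
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // added to every index before vertex fetch
    std::uint32_t baseInstance;  // first instance ID
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "five tightly packed 32-bit fields");

// Hypothetical batch record: only the index count is stored,
// the start index is derived with a running sum.
struct Batch {
    std::uint32_t indexCount;
};

// Builds one indirect command per batch, packed back to back.
std::vector<DrawElementsIndirectCommand>
buildCommands(const std::vector<Batch>& batches) {
    std::vector<DrawElementsIndirectCommand> cmds;
    cmds.reserve(batches.size());
    std::uint32_t firstIndex = 0;
    for (const Batch& b : batches) {
        cmds.push_back({b.indexCount, 1u, firstIndex, 0, 0u});
        firstIndex += b.indexCount; // next batch starts where this one ends
    }
    return cmds;
}
```

Calling buildCommands with index counts 89, 394 and 145 yields start indices 0, 89 and 483, matching the listing. The resulting array would then be uploaded to the buffer that the multi-draw call reads its arguments from.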

Will the graphics hardware execute these draw calls in serial, i.e. draw batch 0 first, then batch 1, and so on?

If the batches are executed in serial (batch 0 must be finalised before batch 1 starts), then I reckon that, given the size of the batches, some of my SMs (streaming multiprocessors) will remain idle the whole time...

That's my assumption ATM, so an alternative I came up with is to use a single DrawIndirect (one vertex shader invocation per batch) and use the tessellation shader to dynamically inject the geometry of each batch. That way all the batches would be executed simultaneously and my SM occupancy would stay at 100%.

I know that the tessellation shader has a certain overhead, but maybe that would still be worth it if it means full SM usage...

Are there any GPU gurus out there who could tell me if I am too far off from reality?

Direct3D 11 doesn't have multidraw.

That said, the purpose of multidraw is not to reduce or parallelize GPU loads; it's to reduce driver overhead on the CPU side. See for example: https://docs.nvidia.com/gameworks/content/gameworkslibrary/graphicssamples/opengl_samples/multidrawindirectsample.htm

In general, even if the frame rate rises only slightly when MultiDrawIndirect is selected (e.g. if the scene is fill-limited), the CPU time should be seen to fall significantly when using MultiDrawIndirect. The main goal of MultiDrawIndirect and other AZDO features is to reduce driver overhead; while they may not always increase frame rate directly, they can generally reduce the CPU load, providing more CPU "headroom" for the application itself.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Yeah, multi-draw is just a CPU-side optimization: instead of calling the draw API many times, you call it fewer times.
Draw-indirect is then a further optimization that lets you move the draw-call setup tasks (e.g. culling) from the CPU side to the GPU side.
i.e. you'd use these features to deliberately harm your GPU frame-time in order to save CPU frame-time ;)
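
To make "moving the draw-call setup to the GPU" a bit more concrete, here is a CPU-side C++ sketch of the logic a culling compute shader might run: one test per batch against the frustum planes, appending a draw command only for visible batches. This is an illustration under stated assumptions, not anyone's actual pipeline: all type and function names (Plane, Sphere, CullBatch, cullBatches) are hypothetical, bounding spheres are assumed as the culling volume, and on a real GPU the append would be done with an atomic counter whose value then feeds the draw count (e.g. via glMultiDrawElementsIndirectCount or the count buffer of ExecuteIndirect).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Plane  { float nx, ny, nz, d; };   // a point p is inside when n.p + d >= 0
struct Sphere { float x, y, z, radius; };

struct DrawCommand {                      // indexed indirect argument layout
    std::uint32_t count;
    std::uint32_t instanceCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
    std::uint32_t baseInstance;
};

struct CullBatch {                        // hypothetical per-batch record
    Sphere        bounds;
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
};

// A sphere is rejected as soon as it lies entirely behind one plane.
bool sphereVisible(const Sphere& s, const std::vector<Plane>& frustum) {
    for (const Plane& p : frustum) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return false;
    }
    return true;
}

// Emits a compacted command list containing only the visible batches.
std::vector<DrawCommand> cullBatches(const std::vector<CullBatch>& batches,
                                     const std::vector<Plane>& frustum) {
    std::vector<DrawCommand> visible;
    for (const CullBatch& b : batches) {
        if (sphereVisible(b.bounds, frustum)) {
            visible.push_back({b.indexCount, 1u, b.firstIndex, 0, 0u});
        }
    }
    return visible;
}
```

The GPU pays for the culling pass and the indirect argument fetch, while the CPU no longer touches individual batches, which is the trade-off described above.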

Direct3D 11 doesn't have multidraw.

AMD and NVidia both support it via extensions:
NvAPI_D3D11_MultiDrawInstancedIndirect / NvAPI_D3D11_MultiDrawIndexedInstancedIndirect
agsDriverExtensions_MultiDrawInstancedIndirect / agsDriverExtensions_MultiDrawIndexedInstancedIndirect

If the batches are executed in serial (batch 0 must be finalised before batch 1 starts), then I reckon that, given the size of the batches, some of my SMs (streaming multiprocessors) will remain idle the whole time... That's my assumption ATM, so an alternative I came up with is to use a single DrawIndirect (one vertex shader invocation per batch) and use the tessellation shader to dynamically inject the geometry of each batch. That way all the batches would be executed simultaneously and my SM occupancy would stay at 100%.

D3D and OpenGL specify that a GPU must behave as if each draw is processed in serial, each waiting for the previous. They also state that the GPU must behave as if each triangle is processed in serial, each waiting for the previous.
GPUs are very good at processing draws/triangles in parallel, while still behaving as if the work was done in serial.

Ok, it seems that you guys did not get the general gist of my question, so I'm going to make it simpler.

Let's take one million triangles, for instance. In both cases the scene is drawn in a single draw call...

Case 1: MultiDrawInstanced with 100,000 draws of 10 triangles each

Case 2: DrawInstanced with 100,000 instances of 10 triangles each

Is case 1 as fast as case 2?

My assumption is that because the draws are performed in serial in case 1, there must be a bit of synchronisation between draws, whereas in case 2 everything can be batched together, so there is no need to synchronise the batches...

My assumption is that because the draws are performed in serial in case 1, there must be a bit of synchronisation between draws, whereas in case 2 everything can be batched together, so there is no need to synchronise the batches...

I mentioned above that, according to the specifications, individual triangles must be drawn in serial, so you should also be assuming that there is a bit of synchronization between every triangle, in which case, the synchronization between draws would be irrelevant!

GPUs in practice implement this not as serial synchronization, but as arbitration of memory writes, which allows triangles to be computed in parallel, yet still interact with memory in the same way that they would if they had been processed in serial.
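
A toy C++ model of that idea, purely for intuition: it does not describe how any particular GPU implements ordering, and all names (Write, resolveSerial, resolveArbitrated) are hypothetical. Writes to a framebuffer are produced by primitives in arbitrary order, but each pixel only accepts a write from a later primitive than the one it currently holds, so the final image equals what strictly serial processing would produce. The sketch assumes plain overwrites (no blending, no depth test) and at most one write per primitive per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Write {
    std::uint32_t primitiveId; // submission order of the triangle
    std::uint32_t pixel;       // destination pixel index
    std::uint32_t color;       // value to store
};

// Reference behaviour: apply writes strictly in submission order,
// so the last write to a pixel wins.
std::vector<std::uint32_t> resolveSerial(const std::vector<Write>& writes,
                                         std::size_t pixelCount) {
    std::vector<std::uint32_t> fb(pixelCount, 0);
    for (const Write& w : writes) fb[w.pixel] = w.color;
    return fb;
}

// Toy "parallel" behaviour: writes may arrive in any order, but a pixel
// only accepts a write from a primitive newer than its current owner.
std::vector<std::uint32_t> resolveArbitrated(const std::vector<Write>& writes,
                                             std::size_t pixelCount) {
    std::vector<std::uint32_t> fb(pixelCount, 0);
    std::vector<std::int64_t> owner(pixelCount, -1); // -1 = never written
    for (const Write& w : writes) {
        if (static_cast<std::int64_t>(w.primitiveId) > owner[w.pixel]) {
            owner[w.pixel] = w.primitiveId;
            fb[w.pixel] = w.color;
        }
    }
    return fb;
}
```

Feeding resolveArbitrated the same writes in a shuffled order gives the same framebuffer as resolveSerial on the ordered list, which is the "behaves as if serial" property without any waiting between primitives.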

GPU designs differ widely so it's hard to make generalisations.
10 years ago, GPUs would execute draws in serial, which made small draws especially painful. These days, a modern AMD GPU can process somewhere around 8 draws in parallel (while still managing to behave as if it were doing them in serial). I assume NVidia has similar capabilities. Large draws are still ideal, but small ones aren't as deadly as they used to be.

FWIW though, you shouldn't even use instancing in your example - it's slower than a non-instanced draw using pseudo-instancing until you reach around 500 vertices per instance. Every GPU design has strange quirks and performance characteristics like this, so the only way to know is to measure.

Also, multi-draw is way more powerful than instancing -- instancing requires every instance to use the same set of indices and the same number of triangles, whereas multi-draw allows every draw to use a completely different set of indices/triangles and a different number of instances. On some platforms, it even allows every draw in the group to have different resource (texture) bindings!

If you don't need that kind of power and instancing works as a solution, then you should probably just stick with the less powerful tool. You can still use multi-draw-indirect to kick off a single instanced draw call :)

For your problem, you may want to use multi-draw-indirect to launch a single non-instanced draw-call with the triangle count equal to the total triangles for all of your batches, and then use manual vertex fetching to get the right vertex data for each batch.
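
A CPU-side C++ sketch of the lookup such a vertex shader could perform, under assumptions that are not in the post: the single draw covers the summed vertex count of all batches, and each invocation maps its global vertex ID (gl_VertexID / SV_VertexID) back to a batch and a local vertex through a prefix-sum table. The names BatchLookup, makeLookup and locate are hypothetical. On the GPU the table would sit in a storage/structured buffer and the search would be written in shader code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct BatchLookup {
    // starts[i] = global vertex ID where batch i begins;
    // starts.back() = total vertex count of the merged draw.
    std::vector<std::uint32_t> starts;
};

// Builds the prefix-sum table from per-batch vertex counts.
BatchLookup makeLookup(const std::vector<std::uint32_t>& vertexCounts) {
    BatchLookup table;
    table.starts.reserve(vertexCounts.size() + 1);
    std::uint32_t running = 0;
    table.starts.push_back(running);
    for (std::uint32_t c : vertexCounts) {
        running += c;
        table.starts.push_back(running);
    }
    return table;
}

struct Located {
    std::uint32_t batch;       // which batch the vertex belongs to
    std::uint32_t localVertex; // vertex index within that batch
};

// Maps a global vertex ID to (batch, local vertex) with a binary search.
Located locate(const BatchLookup& table, std::uint32_t vertexId) {
    assert(vertexId < table.starts.back());
    // First start strictly greater than vertexId; the batch is the one before it.
    auto it = std::upper_bound(table.starts.begin(), table.starts.end(), vertexId);
    std::uint32_t batch = static_cast<std::uint32_t>(it - table.starts.begin()) - 1;
    return {batch, vertexId - table.starts[batch]};
}
```

With the batch sizes from the first post (89, 394, 145), vertex ID 89 resolves to batch 1, local vertex 0. The shader would then use the batch index to fetch per-batch data (transform, material) and the local vertex to fetch positions manually.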
