What are your opinions on DX12/Vulkan/Mantle?


ie, as I see it you'd have two ways of doing it:

  1. Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetching the transform at index 0. Repeat for every single mesh.
  2. Update the transform UBO with all the transforms that can fit, then issue a glDraw*Instanced call, repeating that draw call and increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned, since the instance ID is always 0).

So you always end up with one draw call per different mesh. The thing that differs is the UBO updating scheme (no scheme in the first one, a batching scheme in the second one).
The CPU cost of a draw call depends on the state changes that preceded it.
Apparently setting the base instance ID state is much cheaper than binding a new UBO, which makes sense, as there's a tonne of resource management code that has to run behind the scenes whenever you bind any resource, especially if it's an orphaned resource.

Also, yes, updating one large UBO is going to be much cheaper than updating thousands of small ones. Especially if you use persistent unsynchronized updates.
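For concreteness, here's a minimal sketch of a persistently mapped transform UBO, assuming GL 4.4 / ARB_buffer_storage and a GLEW-style loader (the buffer size and names are just illustrative; on strict GL3 the closest equivalent is glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT plus manual orphaning):

#include <GL/glew.h>
#include <string.h>

#define TRANSFORM_UBO_SIZE (64 * 1024) /* one 64 KB batch of transforms */

static GLuint ubo;
static void *ubo_ptr;

void create_persistent_ubo(void)
{
    const GLbitfield flags = GL_MAP_WRITE_BIT |
                             GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    /* Immutable storage that stays mapped for the application's lifetime. */
    glBufferStorage(GL_UNIFORM_BUFFER, TRANSFORM_UBO_SIZE, NULL, flags);
    ubo_ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, TRANSFORM_UBO_SIZE, flags);
}

/* "Unsynchronized" means synchronization is our job: the caller must use a
 * fence (glFenceSync/glClientWaitSync) to be sure the GPU is done reading
 * this region before overwriting it. */
void upload_transforms(const float *matrices, size_t bytes)
{
    memcpy(ubo_ptr, matrices, bytes);
}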

On the GPU side, the draw calls themselves are essentially free. What costs is context/segment switches. If two draw calls use the same "context", the GPU bundles them together, avoiding stalls.
Certain state changes "roll the context"/"begin a segment"/etc, which means the next draw can't overlap with the previous one.
It would be interesting to find out where base-instance-id state and UBO bindings stand in regards to context rolls on different GPUs...

Doing this for just one instance is completely valid. But if you do it the way you said, your API overhead will go through the roof, especially if you have a lot of different meshes.
Then with the instanced method, how would you handle drawing different meshes?

ie, as I see it you'd have two ways of doing it:

  1. Update the mesh transform, then issue a glDraw*Instanced call with a single instance, always fetching the transform at index 0. Repeat for every single mesh.
  2. Update the transform UBO with all the transforms that can fit, then issue a glDraw*Instanced call, repeating that draw call and increasing the base instance ID by one for every single mesh until you run out of transforms in the UBO (doing the instanced index buffer trick you mentioned, since the instance ID is always 0).

So you always end up with one draw call per different mesh. The thing that differs is the UBO updating scheme (no scheme in the first one, a batching scheme in the second one).

glDrawElementsInstancedBaseInstance - one big per-instance buffer containing all of your instances. It doesn't have to be a UBO; if the per-instance data is small enough (which it will be if it's just a transform) it can be a VBO and specified in your VAO using glVertexAttribDivisor. Use the baseinstance parameter of your draw call to specify which instance you're currently drawing. That will index into your per-instance buffer effectively for free. gl_InstanceID remains 0-based but in this case it doesn't matter because you're not using it. You're not touching uniforms, you're not updating state between draws, the only thing that changes is the parameters to your draw calls.
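To make that concrete, a minimal sketch of the draw loop described above, assuming GL 4.2 / ARB_base_instance and a GLEW-style loader; the Mesh struct, attribute location, and buffer names are illustrative, not something from this thread:

#include <GL/glew.h>
#include <stdint.h>

#define ATTR_INSTANCE_XFORM 4 /* assumed attribute location */

typedef struct {
    GLsizei indexCount;
    uintptr_t indexOffset; /* byte offset into the bound index buffer */
} Mesh;

void draw_meshes(const Mesh *meshes, unsigned meshCount, GLuint perInstanceVBO)
{
    /* One big per-instance VBO: one vec4 of per-mesh data, advanced once
     * per instance (divisor 1) instead of once per vertex. */
    glBindBuffer(GL_ARRAY_BUFFER, perInstanceVBO);
    glVertexAttribPointer(ATTR_INSTANCE_XFORM, 4, GL_FLOAT, GL_FALSE, 0, (void *)0);
    glVertexAttribDivisor(ATTR_INSTANCE_XFORM, 1);
    glEnableVertexAttribArray(ATTR_INSTANCE_XFORM);

    /* No state changes between draws; only baseinstance moves, selecting
     * row i of the per-instance buffer (gl_InstanceID stays 0, as noted). */
    for (unsigned i = 0; i < meshCount; ++i) {
        glDrawElementsInstancedBaseInstance(GL_TRIANGLES,
                                            meshes[i].indexCount,
                                            GL_UNSIGNED_SHORT,
                                            (const void *)meshes[i].indexOffset,
                                            1,  /* a single instance */
                                            i); /* baseinstance */
    }
}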

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I'm not arguing about the impact of the UBO updating/binding; we pretty much established that we're trying to do UBO update batching. Nor am I talking about OpenGL 4.x features, as I've mentioned in a previous post (ie, no indirect draw calls). Strictly GL 3 here.

What I am asking is how different a glDraw*Instanced call that changes the base instance offset is from a normal glDraw* call that calls glUniform1i for the index.

Mathias said the draw + glUniform1i combination would be detrimental if you have many different meshes (we're still limited to one draw call per different mesh), but I'm trying to figure out if it would be that much worse than doing a glDraw*Instanced call and changing the base instance ID, fetching the real index from an instanced attribute buffer, since you still have one draw call per different mesh.

State isn't changing beyond the glUniform1i call, which requires no binding. If it matters, the same could be accomplished with glVertexAttrib*i, passing the index that way before each draw call (which is what the nVidia guys used for their scene graph presentations in 2013 and 2014).

In short, for different meshes we still have one draw call per mesh and we can still batch UBO updates, but there are several ways to get the needed index to the shader (glUniform*i, glVertexAttrib*i, or glDraw*Instanced with the indices in an instanced attribute); see the sketch below.
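A minimal sketch of the glVertexAttrib*i variant, reusing the illustrative Mesh layout from the earlier sketch: the per-draw index travels as the "current" value of a disabled generic integer attribute, so nothing is bound or uploaded between draws. ATTR_DRAW_INDEX is an assumed location, and the vertex shader would declare a matching "in int drawIndex;" and use it to index the UBO array:

#define ATTR_DRAW_INDEX 5 /* assumed attribute location */

void draw_meshes_attrib_index(const Mesh *meshes, unsigned meshCount)
{
    /* Disabled array -> the shader reads the attribute's current value. */
    glDisableVertexAttribArray(ATTR_DRAW_INDEX);
    for (unsigned i = 0; i < meshCount; ++i) {
        glVertexAttribI1i(ATTR_DRAW_INDEX, (GLint)i); /* index into the batched UBO */
        glDrawElements(GL_TRIANGLES, meshes[i].indexCount, GL_UNSIGNED_SHORT,
                       (const void *)meshes[i].indexOffset);
    }
}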

EDIT: Reading mhagain's post:

It doesn't have to be a UBO
Of course, but Mathias said it was a constant buffer, so I'm assuming he is using UBOs.

Then again, what you're saying here is basically: don't use UBOs, put it all in instanced attributes? It sounds good, but see this presentation for example; there is a small section on UBO updating and indexing inside the shader. Maybe nVidia hardware doesn't work that well with instanced drawing.

EDIT2: Fucking quote blocks and fucking editor for fucks sake I fucking hate it.

"I AM ZE EMPRAH OPENGL 3.3 THE CORE, I DEMAND FROM THEE ZE SHADERZ AND MATRIXEZ"

My journals: dustArtemis ECS framework and Making a Terrain Generator



This is the most compatible way of doing it which works with both D3D11 and OpenGL. There is a GL4 extension that exposes the keywords gl_DrawIDARB & gl_BaseInstanceARB which allows me to do the same without having to use an instanced vertex buffer (thus gaining some performance bits in memory fetching; though I don't know if it's noticeable since the vertex buffer is really small and doesn't consume much bandwidth; also the 4096 draws per call limit can be lifted thanks to this extension).

Not sure if it helps performance-wise; the AZDO slides say that gl_DrawID often cripples performance:

http://fr.slideshare.net/CassEveritt/approaching-zero-driver-overhead (slide 33)

There is no word on gl_BaseInstance though.
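For reference, a sketch of what the shader side of that extension path might look like (GLSL using ARB_shader_draw_parameters, shown here as a C string; the uniform block layout is a made-up example, not the actual code being discussed):

static const char *vs_src =
    "#version 400 core\n"
    "#extension GL_ARB_shader_draw_parameters : require\n"
    /* 1024 mat4s * 64 bytes = 64 KB, the usual UBO size limit. */
    "layout(std140) uniform Transforms { mat4 world[1024]; };\n"
    "in vec4 position;\n"
    "void main() {\n"
    /* gl_DrawIDARB numbers the sub-draw within a multi-draw, which is
     * what makes the instanced drawId vertex buffer unnecessary. */
    "    gl_Position = world[gl_DrawIDARB] * position;\n"
    "}\n";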

I'm starting to be worried by rumors that Google may have its own low-level API too. This would basically mean one API per OS, which breaks the purpose of Vulkan in the first place...

Wouldn't that actually justify the purpose of Vulkan? One cross-platform API that will work regardless of the OS, more or less the same purpose as with OpenGL. This would be in contrast to it simply being intended to fill the platform gap, providing a high performance 3D graphics API on platforms that don't have a native one. I'd be highly disappointed if the latter purpose were all that the Vulkan designers ever hoped to achieve.

"We should have a great fewer disputes in the world if words were taken for what they are, the signs of our ideas only, and not for things themselves." - John Locke

How are you implementing your constant buffers? From what you've written as your #3b, it sounds like you're packing multiple materials'/objects' constants into a single large constant buffer, and perhaps indexing out of it in your draws?

Yes

IIRC, that's supported only in D3D11.1+, as there was no *SSetConstantBuffers variant that takes offsets (the *SSetConstantBuffers1 functions) until then.

That's one way of doing it, and doing it that way, you're correct. We don't use the D3D11.1 functionality, though; since OpenGL does support setting constant buffers by offset, we take advantage of that to further reduce splitting some batches of draw calls.
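Presumably that means glBindBufferRange on the GL side; a minimal sketch, with the binding point, names, and sizes as assumptions:

void bind_batch_range(GLuint bigUBO, GLuint bindingPoint,
                      GLintptr batchOffset, GLsizeiptr batchBytes)
{
    GLint align = 0;
    /* Batch offsets must be laid out as multiples of this alignment
     * (commonly 256 bytes); in real code, query it once at startup. */
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);

    /* Rebinding a different slice of one large UBO is cheaper than
     * binding a different buffer object for every batch. */
    glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, bigUBO,
                      batchOffset, batchBytes);
}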

Otherwise, if you aren't using constant buffers with offsets, how are you avoiding having to set things like object transforms and the like? If you are, how are you handling targets below D3D11.1?

By treating all your draws as instanced draws (even if they're just one instance) and using StartInstanceLocation.
I attach a "drawId" R32_UINT vertex buffer (instanced buffer) which is filled with 0, 1, 2, 3, 4 ... 4095 (basically, we can't batch more than 4096 draws together in the same call; that limit is not arbitrary: 4096 * 4 floats per vector * 4 bytes per float = 64 KB, aka the const buffer limit).
Hence the "drawId" vertex attribute will always contain the value I want as long as it is in range [0; 4096) and thus index whatever I want correctly.

This is the most compatible way of doing it which works with both D3D11 and OpenGL. There is a GL4 extension that exposes the keywords gl_DrawIDARB & gl_BaseInstanceARB which allows me to do the same without having to use an instanced vertex buffer (thus gaining some performance bits in memory fetching; though I don't know if it's noticeable since the vertex buffer is really small and doesn't consume much bandwidth; also the 4096 draws per call limit can be lifted thanks to this extension).
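A sketch of building that drawId buffer, assuming GL 3.3 for glVertexAttribIPointer/glVertexAttribDivisor; the attribute location parameter and function name are illustrative:

#define MAX_DRAWS_PER_BATCH 4096 /* 4096 * 4 floats * 4 bytes = 64 KB */

GLuint create_draw_id_vbo(GLuint attrDrawId)
{
    static GLuint ids[MAX_DRAWS_PER_BATCH];
    GLuint vbo;

    for (GLuint i = 0; i < MAX_DRAWS_PER_BATCH; ++i)
        ids[i] = i; /* 0, 1, 2 ... 4095; written once, never touched again */

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(ids), ids, GL_STATIC_DRAW);

    /* Integer attribute stepped once per instance; baseinstance (GL) or
     * StartInstanceLocation (D3D11) then selects which ID a draw reads. */
    glVertexAttribIPointer(attrDrawId, 1, GL_UNSIGNED_INT, 0, (void *)0);
    glVertexAttribDivisor(attrDrawId, 1);
    glEnableVertexAttribArray(attrDrawId);
    return vbo;
}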

So, let me reword it to make sure I understand what you're doing. You have a single per-instance "vertex buffer" that just contains a sequence of integers from 0 to 4095. Each draw specifies its instance offset, so the per-instance ID the shader receives matches the overall index of that draw. You then use that ID to access that draw's constants from a 64K constant buffer.

Right?

Do you have any performance issues with it? I'd be wary of requiring a 64K copy to GPU memory before draws. Is it faster if you use smaller batches?

Also importantly, how scalable is this to next-gen draw bundles? Would you need to use indirect parameters to inject the right instance offset for your draws in the bundles?

I suppose the main benefit here is that you have reduced the number of actual API calls dramatically to actually perform draws, though with the next-gen APIs, isn't the benefit of that going to be somewhat mitigated by the fact that the actual draws themselves can be completely encapsulated in prebuilt bundles?

Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.


Edit: I suppose the core question here is: are the multidraws (indirect or not) actually pushed as such onto the GPU's command buffer (that is, is there a 'multidraw' command) or is the driver extracting the draws and inserting them individually (or batched, whatever) onto the command buffer? If the former, then it would certainly be faster. If the latter, I imagine a ton of draw bundles would be faster.

AFAIK, it's the former: the draws are pushed into the command buffer as an actual multi-draw command.


On Radeon (beginning with the HD 6950) and on GeForce, multi-draw indirect is a hardware feature. On Intel (up to Haswell at least; don't know about future chips) it is emulated by a loop in the driver.
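In GL terms, the question is whether glMultiDrawElementsIndirect (GL 4.3) reaches the GPU as a single packet; the submission itself is one call over an array of command structs. A sketch (the command layout is the one fixed by the GL spec; the buffer and count parameters are assumptions):

typedef struct {
    GLuint count;         /* indices in this sub-draw */
    GLuint instanceCount;
    GLuint firstIndex;
    GLint  baseVertex;
    GLuint baseInstance;
} DrawElementsIndirectCommand;

void submit_batch(GLuint indirectBuffer, GLsizei drawCount)
{
    /* indirectBuffer holds drawCount tightly packed commands. */
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT,
                                (const void *)0, /* byte offset into buffer */
                                drawCount,
                                0); /* stride 0 = tightly packed */
}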

I'm starting to be worried by rumors that Google may have its own low-level API too. This would basically mean one API per OS, which breaks the purpose of Vulkan in the first place...

Wouldn't that actually justify the purpose of Vulkan? One cross-platform API that will work regardless of the OS, more or less the same purpose as with OpenGL. This would be in contrast to it simply being intended to fill the platform gap, providing a high performance 3D graphics API on platforms that don't have a native one. I'd be highly disappointed if the latter purpose were all that the Vulkan designers ever hoped to achieve.

The problem is that Google is the one that controls Android, and it is not an open environment like Windows for the user (at least for most users), in the sense that you can't install drivers like you do on Windows; the drivers come with the device and its updates (which, for worse, in most cases are under the control of the carriers).

Sure, if you install CyanogenMod or some other custom version you can do whatever you want, but normal users are stuck with what comes with the device, which means that if Google decides not to implement Vulkan, you can't use Vulkan on Android and you are forced to use their API (so it's worse than MS with DX vs OpenGL, since on Windows you can at least always install a driver that implements the latest OpenGL).

The thing is, these days Google is the new MS and Android the new Windows; they have the biggest portion of the market and they are in a position where they can do whatever they want.

This is the worst-case scenario; I don't think Google will do this, but it's a possibility that worries me.

But to be honest, I don't care if I have to implement one or two more APIs, since we already support a bunch (DX11, OpenGL 3.x, 4.x, ES 2.x, ES 3.x, and we have an early implementation of DX12 and want to add support for PS4 as well). I prefer having to support multiple strong and solid APIs over one bad one (OpenGL, I'm looking at you).
Vulkan seems great, but I still have reservations. The good thing is that it's just like D3D12, so porting should be very easy (the only thing I need now is an HLSL-to-SPIR-V compiler).

On Radeon (beginning with the HD 6950) and on GeForce, multi-draw indirect is a hardware feature. On Intel (up to Haswell at least; don't know about future chips) it is emulated by a loop in the driver.

One would hope that Intel at least validates only once rather than once per draw call in the loop.


