Member Since 15 Nov 2010
Offline Last Active Jul 20 2014 07:09 PM

Posts I've Made

In Topic: SOLVED: Compute shader atomic shared variable problem

26 June 2014 - 06:44 PM

Yes, I missed the part of the specification that says the variables you pass into the atomic functions (in my case minDepth and maxDepth) are inout, not just in. It all made sense once I got that. >_>

In Topic: Using the ARB_multi_draw_indirect command

23 May 2014 - 08:15 AM


then an additional ~8 ms is spent blocking on buffer swap (= waiting for the driver to complete the queued commands, e.g. C code (or something) in the driver)
As a rule of thumb, if you're blocking on swap/flip/present, you're either waiting for a vblank if you're vsyncing, or you're waiting for the GPU to catch up.

Measurements of "GPU load" are very misleading. You can be bottlenecked by the GPU without seeing it report 100% load...



Threaded optimization off:

    51 FPS

    Render time: 18.806 ms
    Swap time: 0.277 ms
    Frame time: 19.678 ms (also includes some UI rendering)
    GPU load: ~60%

Threaded optimization on: 

    61 FPS

    Render time: 9.185 ms
    Swap time: 5.739 ms
    Frame time: 15.727 ms
    GPU load: ~71%


I did not change a single line in my program. Turning threaded optimization on simply moves the cost of the render calls to the driver's server thread (see the slides I posted above), and if the server thread lags behind, the application blocks on the buffer swap instead.

In Topic: Using the ARB_multi_draw_indirect command

22 May 2014 - 08:18 PM

Are you sure that you're not simply optimizing prematurely? What exactly does your situation look like? Have you identified the bottleneck?


Slightly off-topic: I was inspired by your post and decided to try out glMultiDrawElementsIndirect(), since I had identified a part of my engine where I simply called glDrawElementsInstancedBaseVertex() in a loop. This was for shadow rendering, so no texture switches were required. Depending on how many types of tiles were visible, around 20 draw calls were issued in a row, which I replaced with a single glMultiDrawElementsIndirect() call. That left my code with 3 different modes, depending on OpenGL support.



OGL3: Although all the instance data for all draw calls is packed into the same VBO, the vertex attribute pointer needs to be updated before each draw call so that it reads the correct subset of instances from that buffer.

			glVertexAttribPointer(instancePositionLocation, 3, GL_FLOAT, false, 0, baseInstance * 12);
			glDrawElementsInstancedBaseVertex(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex);

ARB_base_instance: If ARB_base_instance is supported, I can simply pass in a base instance instead of modifying the instance data pointer, removing the last state change from the mesh rendering loop:

glDrawElementsInstancedBaseVertexBaseInstance(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, baseIndex*2, numInstances, baseVertex, baseInstance);

ARB_multi_draw_indirect: If ARB_multi_draw_indirect is supported, I can pack together the above data into an array (an IntBuffer in my case since I'm using Java, hence the weird code), and draw them all with a single draw call:

//In the mesh "rendering" loop

//After the loop:
ARBMultiDrawIndirect.glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, multiDrawBuffer, multiDrawCount, 0);
multiDrawCount = 0;
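
The body of the loop is not shown above. As a rough sketch only (not the code from my engine), this is one way the per-mesh data could be packed, assuming the standard DrawElementsIndirectCommand layout from the OpenGL specification: five tightly packed 32-bit values (count, instanceCount, firstIndex, baseVertex, baseInstance), which matches the stride of 0 passed above. The class IndirectCommandPacker and its methods are hypothetical names made up for illustration. Note that firstIndex is counted in indices, not bytes, so the baseIndex*2 byte offset used in the other two modes would become plain baseIndex here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical helper for illustration; not part of LWJGL or the original engine.
final class IndirectCommandPacker {
    // DrawElementsIndirectCommand: count, instanceCount, firstIndex,
    // baseVertex, baseInstance (five 32-bit values, tightly packed).
    static final int INTS_PER_COMMAND = 5;

    private final IntBuffer buffer;
    private int count = 0;

    IndirectCommandPacker(int maxCommands) {
        // A direct, native-order buffer, since that is what native GL bindings
        // generally expect when a buffer is passed in from Java.
        buffer = ByteBuffer.allocateDirect(maxCommands * INTS_PER_COMMAND * 4)
                .order(ByteOrder.nativeOrder())
                .asIntBuffer();
    }

    // Called once per mesh inside the "rendering" loop instead of a draw call.
    void add(int numIndices, int numInstances, int firstIndex, int baseVertex, int baseInstance) {
        buffer.put(numIndices)
              .put(numInstances)
              .put(firstIndex)    // in indices, not bytes
              .put(baseVertex)
              .put(baseInstance);
        count++;
    }

    // Number of commands queued so far (the multiDrawCount above).
    int count() {
        return count;
    }

    // Flips the buffer so it can be read from the start by the draw call.
    IntBuffer finish() {
        buffer.flip();
        return buffer;
    }

    // Prepares the packer for the next batch.
    void reset() {
        buffer.clear();
        count = 0;
    }
}
```

After the loop, the result of finish() and count() would take the place of multiDrawBuffer and multiDrawCount in the call above, followed by reset().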



OGL3: 56 FPS

ARB_base_instance: 56 FPS (seems like the overhead of glVertexAttribPointer() is extremely low)

ARB_multi_draw_indirect: 62 FPS


The scene used was a purposely CPU-intensive one, with 1944 shadow maps being rendered (most simply had no shadow casters that passed frustum culling). The resolution was intentionally kept very low, and the GPU load was at around 69-71%. My Java code was NOT the bottleneck; my OpenGL commands take approximately 8.5 ms to execute, and then an additional ~8 ms is spent blocking on buffer swap (that is, waiting for the driver to complete the queued commands, e.g. C code or something similar in the driver). My conclusion is that glMultiDrawElementsIndirect() significantly reduced the load on the driver thread, even when batching together just 10-20 draw calls into each glMultiDrawElementsIndirect() command.

In Topic: Using the ARB_multi_draw_indirect command

22 May 2014 - 04:59 PM

You won't gain anything by simply replacing each glDrawElements() call with a glMultiDrawElementsIndirect() call. The whole point of glMultiDrawElementsIndirect() is to allow you to upload everything you need for all your draw calls to the GPU (using uniform buffers, texture buffers, bindless textures, sparse textures, etc.) and then replace ALL your glDrawElements() calls with a single glMultiDrawElementsIndirect() call. As far as I know, glMultiDrawElementsIndirect() is not faster than glDrawElements() when simply used as a one-for-one replacement for the latter.


I strongly recommend you take a look at this presentation, http://www.slideshare.net/CassEveritt/beyond-porting, which explains really well both the problems and how to solve them.

In Topic: Geometry Shader for a quad or not ?

18 May 2014 - 07:41 AM

I have also seen cases where instancing is slower than geometry shaders.