
Speeding up BufferData and Batching draws



I'm working on a voxel engine and have some of the basics done, with everything drawing decently fast. However, there are two issues I'm having trouble finding a solution for:

 

First is stutters during chunk generation. I do chunk generation on a separate thread, and once the vertices are ready, I tell the main thread to pull them into VBOs. This sounds like it should work fine, but the pulling-into-VBOs part (glBufferData) causes a slight stutter, even though I'm uploading only about 2000 vertices per frame on average (each vertex is also packed into a single int using the Int2101010Rev format, which yielded a massive performance increase). I tried persistent mapped buffers, but they cause the framerate to drop to 60, which is unacceptable considering I have nothing but dot-product lighting going on. I tried the coherent bit and the unsynchronized bit (separately), but neither had much of an effect. So my first question is: how can I reduce those stutters to the point that they are no longer noticeable?
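For reference, the upload path boils down to roughly the following, sketched in C for clarity (the project itself is C#; the names here are placeholders, not the actual code from the repo):

#include <GL/glew.h>   // assuming an OpenGL 3.3+ context with loaded function pointers (e.g. via GLEW)
#include <stdint.h>
#include <stddef.h>

// Runs on the main thread once the worker thread signals that a chunk's
// mesh is ready. 'packedVerts' holds the Int2101010Rev-packed vertices.
void UploadChunk(GLuint vbo, const uint32_t *packedVerts, size_t count)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Reallocates and uploads the whole chunk mesh in one call; this is the
    // glBufferData that shows up as a stutter when chunks finish generating.
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)(count * sizeof(uint32_t)),
                 packedVerts, GL_STATIC_DRAW);
}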

 

Second is increasing the size of my vertex batches. While profiling in CodeXL, I see that my batches are far too small compared to the recommended size of ~40k vertices; mine are closer to the 1k mark (about 20% of my vertices sit in about 50% of my batches). Increasing that should help with performance and would also reduce draw calls, but I'm unsure how to do that, considering the way I generate my data:

1. Fill an array with the voxel ids

2. Loop over the array and build a mesh using 'greedy' meshing, generating faces relative to the chunk

3. Calculate the normal of the face and place it in a list appropriately

4. Finally, combine the vertex information for all the faces into one list, keeping track of which normal is at which offset

5. Put this data into a VBO

6. In the vertex shader, use the normal grouping information passed in as uniforms, along with gl_VertexID, to determine which normal to pass along; a world transform then positions the chunk properly (roughly sketched after this list).
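To make step 6 concrete, the per-vertex normal lookup amounts to something like the shader below. This is only a rough sketch of what I described, assuming the usual six axis-aligned face normals; the uniform names are made up, not the ones in the repo:

static const char *chunkVertSrc =
    "#version 330 core\n"
    "layout(location = 0) in vec4 packedPos;       // Int2101010Rev attribute\n"
    "uniform mat4 worldViewProj;                   // per-chunk world transform * view-projection\n"
    "uniform vec3 faceNormals[6];                  // one normal per group\n"
    "uniform int  groupStartVertex[6];             // first vertex of each group, ascending\n"
    "flat out vec3 vNormal;\n"
    "void main() {\n"
    "    int group = 0;\n"
    "    for (int i = 0; i < 6; i++)               // pick the group gl_VertexID falls into\n"
    "        if (gl_VertexID >= groupStartVertex[i]) group = i;\n"
    "    vNormal = faceNormals[group];\n"
    "    gl_Position = worldViewProj * vec4(packedPos.xyz, 1.0);\n"
    "}\n";

This is the dependency that ties each draw call to one chunk: both groupStartVertex and worldViewProj change per draw.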

 

As you can see, because of this, I can't simply combine draw calls, since at least two things depend on the current call. The normal can be worked around by batching in groups of normals; however, the world transform remains an issue. I think I might be able to do something using MultiDrawIndirect, but my problem with that is that I have never understood how I might determine which draw I'm in on hardware that doesn't support gl_DrawID. I have read about the trick of using the base instance (my understanding of it is sketched below), but I don't see how that's correct, since a vertex attribute isn't considered dynamically uniform and thus won't be usable as an index into a UBO.
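For context, my understanding of that base-instance trick is roughly the following, CPU side only and sketched in C (the names are mine, and I may well be misreading how it's meant to work):

#include <GL/glew.h>   // assuming loaded GL function pointers, e.g. via GLEW

// Standard layout of a MultiDrawElementsIndirect command.
typedef struct {
    GLuint count;          // index count for this chunk's sub-draw
    GLuint instanceCount;  // 1, no real instancing
    GLuint firstIndex;
    GLuint baseVertex;
    GLuint baseInstance;   // repurposed as a per-draw ID
} DrawElementsIndirectCommand;

// A separate buffer holds 0, 1, 2, ... and is bound as an integer vertex
// attribute with glVertexAttribDivisor(drawIdAttrib, 1). Because baseInstance
// offsets instanced attribute fetches, each sub-draw then reads its own index
// through that attribute and could use it to look up its world transform.
void FillCommands(DrawElementsIndirectCommand *cmds, const GLuint *indexCounts,
                  const GLuint *firstIndices, GLuint drawCount)
{
    for (GLuint i = 0; i < drawCount; i++) {
        cmds[i].count         = indexCounts[i];
        cmds[i].instanceCount = 1;
        cmds[i].firstIndex    = firstIndices[i];
        cmds[i].baseVertex    = 0;
        cmds[i].baseInstance  = i;  // the gl_DrawID substitute
    }
    // Later: glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, 0, drawCount, 0);
}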

So my second question is: how do I increase the size of my batches and reduce my draw calls?

 

The relevant code (mainly Chunk.cs and BlockManager.cs) is at https://github.com/himanshugoel2797/Messier-Game/tree/BruteForce/Messier/Engine

The master branch has the persistent-mapping-based approach implemented.


 

Quote: "First is stutters during chunk generation. I do chunk generation on a separate thread, and once the vertices are ready, I tell the main thread to pull them into VBOs. This sounds like it should work fine, but the pulling-into-VBOs part (glBufferData) ..."

 

Maybe you could allocate one big buffer up front and then call glBufferSubData for faster uploads?
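Something like this is what I mean; just a sketch, with the buffer size and names picked arbitrarily:

#include <GL/glew.h>   // assuming loaded GL function pointers, e.g. via GLEW
#include <stdint.h>

// At startup: allocate one large buffer once, with no initial data.
GLuint CreateBigVbo(void)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, 16 * 1024 * 1024, NULL, GL_DYNAMIC_DRAW); // 16 MB, arbitrary
    return vbo;
}

// Per chunk: write into a sub-range instead of reallocating with glBufferData.
void UploadChunkRange(GLuint vbo, GLintptr offsetBytes,
                      GLsizeiptr sizeBytes, const uint32_t *packedVerts)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, offsetBytes, sizeBytes, packedVerts);
}

This avoids having the driver allocate new storage on every upload.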

First off, you should check out gDEBugger (http://www.gremedy.com/), a free OpenGL profiler, which could help you detect your bottleneck.

The next thing to check is the upload of your data. Are you uploading it and accessing it immediately in the same frame? This could stall your rendering pipeline. One approach to decouple this is to use multiple upload buffers (via map/unmap), upload the data into them, and copy it into your final buffer afterwards:


Frame i:
update buffer A (CPU, worker thread, work on mapped buffer)
upload buffer B (DMA)
copy buffer C to render buffer (GPU, really fast)
Frame i+1:
update buffer C
upload buffer A
copy buffer B
...


You will need to have more buffers and use fences for synchronisation, but it will prevent your rendering pipeline from stalling.
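In (very simplified) C, the staging and fencing part of that scheme could look like this; a sketch only, with three staging buffers and made-up names, and assuming the staging buffers were allocated with glBufferData beforehand:

#include <GL/glew.h>   // assuming loaded GL function pointers, e.g. via GLEW
#include <string.h>

#define NUM_STAGING 3

typedef struct {
    GLuint buf;    // pre-allocated staging buffer object
    GLsync fence;  // signaled once the GPU finished the last copy from it
} StagingBuffer;

// Block (briefly) if the GPU is still reading from this staging buffer.
static void WaitForStaging(StagingBuffer *s)
{
    if (s->fence) {
        glClientWaitSync(s->fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull); // wait up to 1 s
        glDeleteSync(s->fence);
        s->fence = 0;
    }
}

// Write this frame's data into staging buffer (frame % NUM_STAGING),
// then do a GPU-side copy into the buffer actually used for rendering.
void UploadViaStaging(StagingBuffer *staging, unsigned frame,
                      GLuint renderVbo, GLintptr dstOffset,
                      const void *data, GLsizeiptr size)
{
    StagingBuffer *s = &staging[frame % NUM_STAGING];
    WaitForStaging(s);

    glBindBuffer(GL_COPY_READ_BUFFER, s->buf);
    void *ptr = glMapBufferRange(GL_COPY_READ_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
    memcpy(ptr, data, size);
    glUnmapBuffer(GL_COPY_READ_BUFFER);

    glBindBuffer(GL_COPY_WRITE_BUFFER, renderVbo);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, dstOffset, size);

    // Fence so we know when this staging buffer can safely be reused.
    s->fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

As in the frame schedule above, the CPU write can happen on a worker thread through the mapped pointer; the key point is that the copy into the render buffer is a GPU-side glCopyBufferSubData, so rendering does not wait on the CPU upload.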
