Instancebasevertex Perf Hit

Started by
1 comment, last by Matias Goldberg 7 years, 8 months ago

I have a model loaded into a VBO with an index buffer, bound through a VAO. The shader I'm testing with uses a uniform buffer holding an array of 4x4 matrices and an array of vec4 colors, 256 of each. Both arrays are indexed by gl_InstanceID.
When I create the buffer with the STATIC hint, using glBufferData to size it once and glBufferSubData to upload to it once, everything is fine: it runs as expected, with 256 randomly rotated models in a grid layout.
Switching the VBO to glBufferStorage with 3x storage (~6 MB for a ~2 MB model), created with the persistent/write/coherent flags and mapped once with persistent/write/coherent/dynamic_storage, causes performance to plummet.
If I limit the instance count in the bad perf case to 32 (of the possible 256), it runs alright.
Whether or not I write to the 3x buffer every frame (with or without fence syncs), simply using buffer storage with instancing destroys perf, whereas I see no problems with a single STATIC buffer filled via glBufferData/glBufferSubData.
I've checked with Nsight, and all I can see is that in the buffer storage case the single glDrawElementsInstancedBaseVertex call takes approx. 4x longer than in the STATIC glBufferData case.
Nothing changes with the uniform buffer: same data, same updates. The funny thing is that I use this bufferStorage/fenceSync/3x-size setup all over the place, but it only causes perf problems with this single glDrawElementsInstancedBaseVertex call.

Other info:
- I'm not exceeding the uniform buffer size limit
- it's safe to assume the fence syncs on the 3x buffer are not blocking
- running on an NV 670 2 GB
- Nsight reports no errors in graphics debugging, only that the call takes 4x longer

Am I missing something, or have I just hit a slow path in the driver? Any ideas? My last idea is that the 3x buffer with base vertex is taking me to some unaligned address. Could that be it?
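One way to rule the base-vertex guess in or out is to check the arithmetic for each copy of the model. The sketch below is only an illustration: `RegionLayout` and `region_layout` are hypothetical names, and it assumes each copy starts at `region_bytes * region_index`, that vertices have one fixed stride, and that the attribute pointers start at offset 0. It does not say anything about driver-internal alignment; it only tells you whether a base vertex can address the start of a copy exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Layout of one copy of the model inside the 3x buffer (hypothetical type). */
typedef struct {
    size_t byte_offset;  /* where this copy starts in the buffer */
    int    base_vertex;  /* candidate value for the basevertex argument */
    int    exact;        /* 1 if byte_offset is a whole number of vertices */
} RegionLayout;

/*
 * Assumption: copies are packed back to back, region_bytes apart.
 * The draw call adds basevertex to every index, so the copy can only be
 * addressed exactly when its byte offset is a multiple of the stride.
 */
RegionLayout region_layout(size_t region_bytes, size_t vertex_stride,
                           unsigned region_index)
{
    assert(vertex_stride > 0);

    RegionLayout r;
    r.byte_offset = region_bytes * region_index;
    r.base_vertex = (int)(r.byte_offset / vertex_stride);
    r.exact       = (r.byte_offset % vertex_stride) == 0;
    return r;
}
```

If `exact` comes back 0 for any copy, the base vertex would land partway into a vertex. In that case, padding each copy to a multiple of the stride, or moving the offset into the attribute pointer instead of the base vertex, are two possible ways around it.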


"creating with persistent/write/coherent flags"

That's asking the driver to allocate the buffer in system RAM (e.g. malloc) rather than to allocate it in GPU RAM. Your shader, when reading from the buffer, will be reading system memory via the PCIe bus.
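That's fine for data that the GPU will read once (e.g. when copying into a GPU-resident buffer), but it is not good for data that is randomly accessed by shader code. With instancing, the same vertex data is likely fetched again for every instance, which would explain why the cost drops when you limit the instance count.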

You probably don't want to be using the coherent bit.

  1. Avoid the coherent bit. When you modify a buffer, use glFlushMappedBufferRange to notify the driver which regions are dirty (the mapping must include GL_MAP_FLUSH_EXPLICIT_BIT). Make sure you merge your flushes: don't call it 7 times for 7 contiguous chunks; call it once with one large range, just before submitting your draw calls.
  2. The persistent bit will cause the driver to keep the data in host-visible memory (either system RAM or slower VRAM). This is bad.
  3. Don't use the write bit. It will prevent the driver from keeping the buffer in device-only memory.
  4. The correct way is to create two buffers: one in device-only memory, and another with the persistent+write bits. You write to the latter from the CPU, then copy the data to the former using glCopyBufferSubData (it's like a GPU->GPU memcpy). The second buffer is commonly referred to as "the staging buffer" because it acts as an intermediary stash between CPU and GPU. Once you're done, you can destroy the staging buffer or keep it around to reuse for another transfer.
  5. Ignore points 2, 3 & 4 for dynamic buffers (i.e. data that is regenerated every frame on the CPU and sent to the GPU). In this case, just write to a persistently mapped buffer directly.
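The "merge your flushes" advice in point 1 can be sketched as plain bookkeeping, independent of any GL call. This is a minimal illustration, not the poster's code: `DirtyRange` and `merge_dirty_ranges` are hypothetical names, and the function assumes the ranges are already sorted by offset.

```c
#include <assert.h>
#include <stddef.h>

/* A dirty byte range inside a persistently mapped buffer (hypothetical type). */
typedef struct {
    size_t offset;  /* relative to the start of the mapped range */
    size_t length;
} DirtyRange;

/*
 * Coalesce ranges that touch or overlap, in place.
 * Assumption: `ranges` is sorted by ascending offset.
 * Returns how many merged ranges remain at the front of the array.
 */
size_t merge_dirty_ranges(DirtyRange *ranges, size_t count)
{
    assert(ranges != NULL || count == 0);
    if (count == 0)
        return 0;

    size_t out = 0;  /* index of the range currently being grown */
    for (size_t i = 1; i < count; ++i) {
        size_t end = ranges[out].offset + ranges[out].length;
        if (ranges[i].offset <= end) {
            /* Contiguous or overlapping: extend the current range if needed. */
            size_t new_end = ranges[i].offset + ranges[i].length;
            if (new_end > end)
                ranges[out].length = new_end - ranges[out].offset;
        } else {
            /* Gap: start a new merged range. */
            ranges[++out] = ranges[i];
        }
    }
    return out + 1;
}
```

Each range that survives the merge would then be passed once to the flush entry point (glFlushMappedBufferRange in core OpenGL, which takes an offset relative to the start of the mapping) right before the draw calls are submitted. Seven contiguous writes therefore collapse into one flush.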

