Custom view matrices reduce FPS phenomenally

16 comments, last by JohnnyCode 9 years, 12 months ago

whereas the GPU will need to do it per vertex.


If your graphics driver is any good, it will only do the multiplication once. If your driver can't perform this optimization, switch GPU vendors.

Who's going to tell that to your users after you release Turbo Wombat IV and it sells 20 million copies, but runs slow for 10 million of them? You?

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.


If you modify a model matrix, don't update your projection matrix just for shits and giggles; that's inefficient and a huge performance drop. Imagine how many times per second you are doing that!

Speaking of optimization: NV drivers do not transfer uniform values if they haven't changed. That is probably not the case for buffers. Of course, buffers are transferred to graphics card memory only just before they are actually used, so frequent changes before drawing should not affect performance significantly, especially because a uniform block is a small amount of data and the calls communicate only with the driver's memory space in main memory.

Also, instead of working out the model view proj matrix on the CPU, do the multiplication in your shader. GPUs are far better at matrix multiplication in almost any situation.

I have to strongly disagree with this statement. Model/view/projection matrix calculation is far better done on the CPU side. In scientific visualization, where precision is important, the CPU (when I say this I mean Intel, because I'm not familiar with AMD's architecture) can generate matrices some ten orders of magnitude more precise than the GPU. I don't even know what a number that large is called. :) Transformations cumulatively generate errors; if double precision is not used, the transformation cannot be accurate enough. Furthermore, transcendental functions are calculated only in single precision on the GPU. CUDA and similar APIs emulate double precision for such functions, but in OpenGL there is no emulation of transcendental functions.

I agree that hardware-implemented transcendental functions are enormously fast; no CPU can compete with GPUs in that field. Just a single clock interval for a function call! And although the number of SFUs (as they are called) is smaller than the number of SP units, the pipeline usually hides the latency of waiting for an SFU. But, as I already said, that level of accuracy cannot be achieved.
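
To make the cumulative-error point concrete, here is a small, self-contained C++ sketch (my illustration, not from the thread) that composes a million tiny rotations in float and in double and measures how far the result drifts from a pure rotation, whose determinant should stay exactly 1. The float result drifts far more than the double one:

```cpp
// Illustrative only: accumulate 1,000,000 small 2D rotations in float vs.
// double and measure how far the product drifts from a pure rotation
// (whose determinant should stay exactly 1).
#include <cmath>
#include <cstdio>

template <typename T>
T determinantAfterManyRotations(int steps) {
    const T tau   = T(6.283185307179586); // one full turn, split into tiny steps
    const T angle = tau / T(steps);
    const T c = std::cos(angle), s = std::sin(angle);
    T m00 = 1, m01 = 0, m10 = 0, m11 = 1; // identity
    for (int i = 0; i < steps; ++i) {
        // multiply by the small rotation: M = R * M
        T n00 = c * m00 - s * m10, n01 = c * m01 - s * m11;
        T n10 = s * m00 + c * m10, n11 = s * m01 + c * m11;
        m00 = n00; m01 = n01; m10 = n10; m11 = n11;
    }
    return m00 * m11 - m01 * m10; // determinant, ideally exactly 1
}

int main() {
    const int steps = 1000000;
    std::printf("float : det-1 = %g\n", double(determinantAfterManyRotations<float>(steps)) - 1.0);
    std::printf("double: det-1 = %g\n", determinantAfterManyRotations<double>(steps) - 1.0);
}
```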

As far as I have understood it, Uniform Buffer Objects were created exactly for the need of bulk-updating multiple uniforms in a single call. Submitting a single UBO that contains a mere five matrices should be a trivial workload. Refactoring that into multiple UBOs, e.g. so that one contains model matrices and the other contains projection matrices as was suggested above, sounds like a heavy anti-optimization - don't do that! (unless profiling suggests that two UBO uploads are faster than one in this case :o)

Or perhaps the discussion has conflated plain uniforms updated via glUniformMatrix4fv (without UBOs) with UBOs themselves. If you are not using UBOs and are manually updating uniform matrices with glUniformMatrix4fv, then there is benefit in optimizing to not redundantly set matrices that haven't changed.
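
A minimal sketch of that kind of redundancy filter (my illustration, not code from the thread; assumes GL headers and a function loader are already set up):

```cpp
#include <cstring>
#include <unordered_map>

// Cache the last matrix written to each uniform location so unchanged
// matrices never hit the driver. (A real version would also key on the
// program object, since uniform locations are per-program.)
struct CachedMat4 { float m[16]; bool valid = false; };
static std::unordered_map<GLint, CachedMat4> g_uniformCache;

void SetUniformMat4Cached(GLint location, const float* matrix) {
    CachedMat4& entry = g_uniformCache[location];
    if (entry.valid && std::memcmp(entry.m, matrix, sizeof entry.m) == 0)
        return; // value unchanged since the last call: skip the GL call entirely
    std::memcpy(entry.m, matrix, sizeof entry.m);
    entry.valid = true;
    glUniformMatrix4fv(location, 1, GL_FALSE, matrix);
}
```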

Hodgman's suggestion is the sanest here:

- Stop measuring FPS; start measuring milliseconds instead. This will give a better sense of the actual difference in workload (see the worked numbers after this list).

- Use a CPU profiler on both the old code and the new code to compare where the extra time is being spent. E.g. AMD CodeAnalyst is good (it works on non-AMD CPUs as well). If it turns out not to be a CPU-side slowdown (the profiles are identical), then use e.g. NVIDIA Parallel Nsight or AMD CodeXL to debug and profile the GPU side.
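
To put numbers on the milliseconds point: frame time is 1000/FPS, so a drop from 300 FPS to 280 FPS is 3.33 ms → 3.57 ms, only about 0.24 ms of extra work per frame, while a drop from 60 FPS to 40 FPS is 16.7 ms → 25 ms, a whole 8.3 ms of extra work. The same "20 FPS lost" can mean wildly different changes in workload depending on where you start.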

I can pretty much guarantee where the slowdown is. It's not in the matrix multiplication, it's not in binding UBOs to the pipeline. The OP is doing a separate UBO update for each object drawn. That's potentially tens, hundreds or thousands of UBO updates per frame.

The slowdown is in GL's buffer object API, because you just can't make this kind of high-frequency update and still maintain performance when using it. Any profiling is just going to show a huge amount of time in the driver waiting for buffer object API calls to finish, waiting on CPU/GPU synchronization, and waiting on GL client/server synchronization.

The solution is to not use small UBOs and to not update per object. Instead, create a single UBO large enough to hold all objects, figure out the data that needs updating ahead of time, do one big UBO update per frame (preferably via glBufferSubData), then a bunch of glBindBufferRange calls per object. That runs fast, and in the absence of persistent mapping it's the only way to get performance out of UBOs.
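
A minimal sketch of that pattern (my code, not the thread's; assumes a GL 3.1+ context with function loading already done, and illustrative names like PerObjectData):

```cpp
#include <cstring>
#include <vector>

struct PerObjectData {
    float modelViewProj[16]; // whatever per-object uniforms you need
};

static GLuint ubo          = 0;
static size_t objectStride = 0;

void CreateBigUBO(size_t maxObjects) {
    // Each object's slice must start on an offset that glBindBufferRange accepts.
    GLint alignment = 0;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &alignment);
    objectStride = (sizeof(PerObjectData) + alignment - 1) / alignment * alignment;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER, maxObjects * objectStride, nullptr, GL_DYNAMIC_DRAW);
}

void UploadAllObjects(const std::vector<PerObjectData>& objects) {
    // Pack every object's data into one CPU-side staging block...
    std::vector<unsigned char> staging(objects.size() * objectStride);
    for (size_t i = 0; i < objects.size(); ++i)
        std::memcpy(&staging[i * objectStride], &objects[i], sizeof(PerObjectData));
    // ...then do a single buffer update for the whole frame.
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, staging.size(), staging.data());
}

void DrawObject(size_t i) {
    // Cheap per-object call: just point uniform binding 0 at this object's slice.
    glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo, i * objectStride, sizeof(PerObjectData));
    // ... issue the draw call for object i here ...
}
```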

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Also, instead of working out the model view proj matrix on the CPU, do the multiplication in your shader.

Never perform matrix multiplication in a shader. All matrices that will be used in the shader should already be precomputed on the CPU.


L. Spiro

I restore Nintendo 64 video-game OSTs into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


Also, instead of working out the model view proj matrix on the CPU, do the multiplication in your shader.

Never perform matrix multiplication in a shader. All matrices that will be used in the shader should already be precomputed on the CPU.


L. Spiro

Except for skinning; but I agree with L. Spiro, because you have to keep in mind that this mul will be done for each vertex or each pixel.

There are cases where it can be a good idea to keep the view-projection and world matrices separate. Say you've got 10k static objects: if you merge those transforms, the CPU has to perform 10k world*viewProj multiplications and upload the 10k resulting matrices every frame. If they're kept separate, the CPU only has to upload the new viewProj matrix and doesn't have to touch any per-object data at all (but of course the GPU now has to do 10k*numVerts matrix concatenations instead).
The "right" decision depends entirely on the game (and target hardware).

upgrading my engine to use custom view matrices instead of the OpenGL gl_ModelView and gl_Projection which are

Were you setting any other uniforms in the old deprecated scenario?

How many uniform writes do you do per frame, roughly? (batch complexity)

This topic is closed to new replies.
