Matrix Calculation Efficiency

Hi Guys,

At present, I send the W, V, & P matrices to the shader where they are multiplied within the shader to position vertices.

Would it be more efficient to pre-multiply these on the CPU and then pass the result to the shader?
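
For illustration, here is a minimal HLSL sketch of the two options being compared (the cbuffer names and layout are made up for the example; row-vector convention assumed):

// Option 1: upload W, V and P separately; multiply them per vertex.
cbuffer PerObject { float4x4 World; };
cbuffer PerFrame  { float4x4 View; float4x4 Proj; };

float4 VS_Separate(float4 pos : POSITION) : SV_Position
{
    // Three matrix multiplies for every vertex drawn.
    return mul(mul(mul(pos, World), View), Proj);
}

// Option 2: pre-multiply WVP = W*V*P once on the CPU; one multiply per vertex.
cbuffer PerObjectWVP { float4x4 WorldViewProj; };

float4 VS_Combined(float4 pos : POSITION) : SV_Position
{
    return mul(pos, WorldViewProj);
}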

Thanks in advance :)

Do not prematurely optimize things; you might end up having to switch to the other method later. Profile and test things; that is what will make the best determination. There are very, very few steadfast rules about this stuff. It is highly dependent on what you're doing code-wise, the data you're pumping through the CPU/GPU, and so on.

"Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety." --Benjamin Franklin

It's my premature optimisation that is allowing me to render so much in the first place.

I was just wondering what the normal practice was.

Simple answer: yes. Doing the multiplication once ahead of time, in order to avoid doing it hundreds of thousands of times (once per vertex), is obviously a good idea.

However, there may be cases where uploading a single WVP matrix introduces its own problems too!

For example, let's say we have a scene with 1000 static objects in it and a moving camera.

Each frame, we have to calculate VP = V*P, and then perform 1000 WVP = W * VP calculations, and upload the 1000 resulting WVP matrices to the GPU.

If instead we sent W and VP to the GPU separately, then we could pre-upload the 1000 W matrices one time in advance, and then upload a single VP matrix per frame... which means that the CPU will be doing 1000x less matrix/upload work in the second situation, but the GPU will be doing Nx more matrix multiplications, where N is the number of vertices drawn.

The right choice there would depend on the exact size of the CPU/GPU costs incurred/saved, and how close to your GPU/CPU processing budgets you are.
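
As a sketch of that second arrangement (buffer names are made up), the world matrix lives in a per-object buffer that static objects never need to re-upload, while only the combined VP changes per frame:

cbuffer PerObject : register(b0) { float4x4 World; };    // uploaded once for static objects
cbuffer PerFrame  : register(b1) { float4x4 ViewProj; }; // V*P, re-uploaded each frame

float4 VS(float4 pos : POSITION) : SV_Position
{
    // Two multiplies per vertex instead of one, in exchange for
    // uploading only a single matrix per frame on the CPU side.
    return mul(mul(pos, World), ViewProj);
}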

Yes. Multiplying once outside is the way to go. If it's something static like rendering landscape, then yes. It's a bit more tricky if it's your game entities. In that case you need to weigh up instancing for translation and orientation of objects vs updating the matrix on the fly each draw call.

For static objects, yes. For dynamic objects in low numbers, yes. It gets murkier when you start dealing with a lot of objects.

Indie game developer - Game WIP

Strafe (Working Title) - Currently in need of another developer and modeler/graphic artist (professional & amateur artists welcome)

Insane Software Facebook

Thanks guys!

In my case just about all of the geometry will be pre-transformed in my 3D package. So there won't be any additional rotations, scaling, etc. to do either.

Thanks for the advice.

Yes.

And no, no, no, no, no: this is not premature optimization, it's engineering for efficiency; they're not the same thing, and don't listen to anyone who tells you different.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

I have a similar question about fine-grained performance measurement:

Imagine I have two loops in a Geometry Shader with known compile-time constants:


for (int x = 0; x < 4; ++x) {
    for (int y = 0; y < 3; ++y) {
        // ...
        DoStuff();
    }
}

This code in release mode gives me "Approximately 22 instruction slots used" (the VS compiler outputs this info).

If I place [unroll] before each loop, I get "Approximately 89 instruction slots used".
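
For reference, that unrolled variant is just the same loop with the attribute added at each level:

[unroll]
for (int x = 0; x < 4; ++x) {
    [unroll]
    for (int y = 0; y < 3; ++y) {
        // ...
        DoStuff();
    }
}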

Right now I can measure time in NSight's "Events" window with nanosecond precision and can't see a performance gain between the shaders.

Is there a finer-grained way to measure the difference?

The question is similar, because measuring the performance difference of such optimizations (2 matrices vs 1, unroll vs no unroll) requires a suitable tool.

If you can't see any perf difference it might just be because you're bottlenecked elsewhere; e.g. you might be CPU-bound.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

If you can't see any perf difference it might just be because you're bottlenecked elsewhere; e.g. you might be CPU-bound.

No, I am not CPU bound at all.

This code calculates 4 Shadow Maps in one pass, which is faster than 4 separate calls (I can see the difference in NSight, because it is significant, like a 50-200% win depending on quality settings).
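
For context, here is a minimal sketch of what that kind of single-pass setup can look like (the matrix array, struct names, and render-target-array output are assumptions for the example, not necessarily this exact shader):

cbuffer Lights { float4x4 LightViewProj[4]; }; // one matrix per shadow map

struct GSOut
{
    float4 pos : SV_Position;
    uint   rt  : SV_RenderTargetArrayIndex; // selects the shadow map slice
};

[maxvertexcount(12)]
void GS(triangle float4 verts[3] : SV_Position, inout TriangleStream<GSOut> stream)
{
    // Emit the triangle once per shadow map instead of once per pass.
    for (int slice = 0; slice < 4; ++slice)
    {
        for (int v = 0; v < 3; ++v)
        {
            GSOut o;
            o.pos = mul(verts[v], LightViewProj[slice]);
            o.rt  = (uint)slice;
            stream.Append(o);
        }
        stream.RestartStrip();
    }
}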

This is a macro-optimization.

But passing [unroll], or 1 vs 2 matrices, is a micro-optimization, which might still give me something.

And with the current tools I am aware of, I can't detect it =(

One option is to count instructions.

But as I understand it:

1. Each instruction has its own cost, so just summing them up is not a good idea.

2. NSight's measurements on the same scene, with the same shader, vary by about 0.2% between passes.

So I keep searching for a tool that will give me the ability to measure micro-optimization performance.

The main reason for that: find (and measure) a good practice once, and after that apply it elsewhere without unnecessary code bloat caused by unmeasured speculation.
