On the CPU side, are you using SSE2 (at minimum) for matrix multiplication?
I actually have no SSE code in my math library; I should add it one day.
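For reference, here is a minimal sketch of what an SSE version of a 4x4 multiply might look like. `Mat4` and `Mat4Multiply` are hypothetical names, not the poster's `CMatrix4`; the sketch assumes row-major storage, 16-byte alignment, and an x86 target. It only needs SSE1 intrinsics, which every SSE2-capable CPU supports.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical stand-in for CMatrix4: 16 floats, row-major, 16-byte aligned
// so that the aligned load/store intrinsics below are valid.
struct alignas(16) Mat4 {
    float m[16];
};

// Computes out = a * b. The result is written through a pointer instead of
// being returned by value from operator*.
inline void Mat4Multiply(const Mat4* a, const Mat4* b, Mat4* out) {
    // Load the four rows of b once; they are reused for every output row.
    const __m128 b0 = _mm_load_ps(b->m + 0);
    const __m128 b1 = _mm_load_ps(b->m + 4);
    const __m128 b2 = _mm_load_ps(b->m + 8);
    const __m128 b3 = _mm_load_ps(b->m + 12);

    for (int i = 0; i < 4; ++i) {
        const float* row = a->m + 4 * i;
        // Row i of the result is a linear combination of the rows of b,
        // weighted by the four elements of row i of a.
        __m128 acc = _mm_mul_ps(_mm_set1_ps(row[0]), b0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[1]), b1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[2]), b2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[3]), b3));
        _mm_store_ps(out->m + 4 * i, acc);
    }
}
```

Whether this actually beats the scalar version depends on the compiler and on how the matrices are laid out in memory, so it is worth profiling both.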
Can you clarify if you're CPU-bound or GPU-bound?
It sounds like you are CPU-bound (calculating the bone matrices?) - but some folks are offering shader optimizations, which won't make a difference if you're being limited on the CPU.
The final transform matrices are computed on the CPU; the only GPU part is the vertex shader I showed.
I profiled with Very Sleepy for 1 minute, and the matrix operator* is the heaviest function in the list:
CMatrix4::operator* = 2.43s (exclusive)
The second on the list is QuaternionSlerp:
QuaternionSlerp = 1.57s (exclusive)
I think QuaternionSlerp could be replaced by QuaternionNLerp alone. Inside it I already have a check that falls back to NLerp:
const float CosPhi = QuaternionDot( q1, NewQ2 );
if( CosPhi > ( 1.0f - 0.001f ) )
but since the angle is small most of the time, this check could be removed and QuaternionNLerp used directly.
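As an illustration of the NLerp-only path, here is a minimal sketch. `Quat`, `QuatDot` and `QuatNLerp` are hypothetical names rather than the poster's actual functions, and the inputs are assumed to be unit quaternions. NLerp follows the same arc as Slerp but not at constant angular speed, which is usually hard to notice when the angle between keyframes is small.

```cpp
#include <cmath>

// Hypothetical quaternion type, assumed to hold unit quaternions.
struct Quat {
    float x, y, z, w;
};

inline float QuatDot(const Quat& a, const Quat& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Normalized linear interpolation between q1 and q2 for t in [0, 1].
inline Quat QuatNLerp(const Quat& q1, const Quat& q2, float t) {
    // q and -q represent the same rotation; flip q2 when needed so the
    // interpolation takes the shorter arc.
    const float sign = (QuatDot(q1, q2) < 0.0f) ? -1.0f : 1.0f;
    const float s = 1.0f - t;
    const float u = t * sign;

    Quat r{s * q1.x + u * q2.x,
           s * q1.y + u * q2.y,
           s * q1.z + u * q2.z,
           s * q1.w + u * q2.w};

    // Renormalize: a linear blend of two unit quaternions is shorter than 1.
    const float invLen = 1.0f / std::sqrt(QuatDot(r, r));
    r.x *= invLen;
    r.y *= invLen;
    r.z *= invLen;
    r.w *= invLen;
    return r;
}
```

This trades a few trigonometric calls per bone for one square root, so it should show up lower in the profile than the Slerp version, but that is worth confirming by measuring.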
My current performance, on a map of 250,000 triangles with 10 animated characters of 50 bones each, with textures and directional lighting, is 530 FPS.
How many calls were there to CMatrix4::operator*?
It's taking up 4.05% of your CPU time, so it almost sounds like... *drumroll, please* you should be offloading a lot of that work to the GPU through the use of vertex shaders.
Can you walk us step-by-step through the process you use to render a single entity once?
I don't think that's where the heaviest code is; the way I send the data for a skinned mesh to the GPU is nothing fancy:
1) Bind VertexBuffer/IndexBuffer
2) Bind VertexShader
3) Update the constant buffer; in this step I compute InverseBindPose*FinalTransform[ i ].
4) Loop over the material subsets
5) Check if we have a material
6) Bind PixelShader
7) Update constant buffer/textures
8) Draw the subset
530FPS! What makes you think you have a performance problem? What are your performance goals?
The matrix operations you listed are taking up 4% of your CPU time. Assuming that's all in the bone calculations, and you were able to successfully divide the work onto 4 cores, it would now be taking 1% of your CPU time. So the CPU time for a frame is now 97% of what it used to be (i.e. of questionable benefit considering the added complexity).
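Spelled out, with p = 0.04 as the fraction of frame time spent in the matrix code and n = 4 cores, and optimistically assuming the work splits perfectly with no threading overhead:

```latex
T_{\text{new}} = (1 - p) + \frac{p}{n} = 0.96 + \frac{0.04}{4} = 0.97,
\qquad
\text{speedup} = \frac{1}{T_{\text{new}}} = \frac{1}{0.97} \approx 1.03
```

So even in the best case the frame gets about 3% faster.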
If you're really worried your CPU bone matrix calculations are a performance issue, make a build where you can turn them off at will (i.e. just not update them each frame). Does it affect performance?
I just wanted to try some threading to see how it works, so I brought up the idea, but I don't know where the best place is to add threading in an animation system.
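One possible place to experiment, assuming each character's skeleton can be updated independently of the others, is to split the characters across worker threads. The sketch below is only an illustration: `Character`, `UpdateAnimation` and `UpdateAllCharacters` are hypothetical names, and the placeholder body stands in for the real keyframe sampling and bone matrix work.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an animated character.
struct Character {
    float time = 0.0f;
    int updates = 0;

    // Placeholder for sampling keyframes and building the bone matrices.
    // It only touches this character's own data, so no locking is needed.
    void UpdateAnimation(float dt) {
        time += dt;
        ++updates;
    }
};

// Updates every character once, giving each worker a contiguous chunk.
inline void UpdateAllCharacters(std::vector<Character>& chars, float dt,
                                unsigned workerCount) {
    workerCount = std::max(1u, workerCount);
    const std::size_t n = chars.size();
    const std::size_t chunk = (n + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(n, w * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no characters left for this worker
        workers.emplace_back([&chars, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                chars[i].UpdateAnimation(dt);
            }
        });
    }
    // Wait for all workers before the frame continues to rendering.
    for (std::thread& t : workers) t.join();
}
```

Creating threads every frame has a cost of its own, so a real version would more likely keep a persistent pool of workers; with only 10 characters the overhead may well outweigh the gain.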
Besides threading, using SSE2 and replacing operator* with a function that takes a pointer to the output matrix could gain performance too.
I should also mention that operator* is used in Actor::Update, so a boolean needs to be added to skip the transform update when it isn't needed.
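The boolean idea is usually called a dirty flag. A minimal sketch, assuming a hypothetical `Actor` class (the real `Actor::Update` is not shown in this thread) with a placeholder where the matrix products would go:

```cpp
// Hypothetical position type.
struct Vec3 {
    float x, y, z;
};

// Hypothetical actor that rebuilds its world matrix only after a change.
class Actor {
public:
    void SetPosition(const Vec3& p) {
        position_ = p;
        dirty_ = true;  // the cached transform no longer matches
    }

    // Returns true if the transform was rebuilt on this call.
    bool Update() {
        if (!dirty_) return false;  // nothing changed: skip the matrix work
        RebuildWorldMatrix();
        dirty_ = false;
        return true;
    }

    int RebuildCount() const { return rebuilds_; }

private:
    // Placeholder for the operator* calls that build the world matrix.
    void RebuildWorldMatrix() { ++rebuilds_; }

    Vec3 position_{0.0f, 0.0f, 0.0f};
    bool dirty_ = true;  // build the transform once on the first update
    int rebuilds_ = 0;
};
```

Every setter that affects the transform (rotation, scale, parent changes) would need to set the flag as well, otherwise the cached matrix goes stale.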