1.) Situation A: We have sub-meshes. This is usually the case if the model has more than a single material and the rendering system cannot blend materials, so the model is divided into sub-meshes where each sub-mesh has its own material. Further, each sub-mesh has its own local placement **S** relative to the model. If we still name the model's placement in the world **M**, then the local-to-world matrix **W** of the sub-mesh is given by

**W** := **M** * **S**
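A minimal sketch of situation A with plain 4×4 matrices and column vectors, as in the text. The helper names (`mat_mul`, `translation`, `transform_point`) are made up for illustration:

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    """4x4 translation matrix (column-vector convention)."""
    return [[1, 0, 0, x],
            [0, 1, 0, y],
            [0, 0, 1, z],
            [0, 0, 0, 1]]

def transform_point(m, p):
    """Apply a 4x4 transform to a 3D point."""
    v = [p[0], p[1], p[2], 1]
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))

M = translation(10, 0, 0)  # the model's placement in the world
S = translation(1, 0, 0)   # the sub-mesh's placement relative to the model
W = mat_mul(M, S)          # local-to-world matrix of the sub-mesh

print(transform_point(W, (0, 0, 0)))  # → (11, 0, 0)
```

The sub-mesh origin, 1 unit off the model origin, ends up at 11 units in the world because the model itself sits at 10.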

2.) Situation B: We have sub-meshes as above, but the sub-meshes are statically transformed, so that **S** is already applied to the vertices during pre-processing. The computation at runtime is then just

**W** := **M**

3.) Situation C: We have models composed of sub-models via parenting, e.g. rigid models of arms and legs structured in a skeleton, or rigid models of turrets on a tank. So a sub-model has a parent model, and the parent model may itself be a sub-model with another parent model, up until the main model is reached. So each sub-model has its own local placement **S**_{i} relative to its parent. Corresponding to the depth of parenting, we get a chain of transformations like so:

**W** := **M** * **S**_{n} * **S**_{n-1} * … * **S**_{0}

where **S**_{0} belongs to the innermost sub-model (the one that is not itself a parent), and **S**_{n} belongs to the outermost sub-model (the one whose direct parent is the main model).
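Situation C can be sketched as a fold over the parent chain, ordered from outermost to innermost so it matches **W** := **M** * **S**_{n} * … * **S**_{0}. Helper names and the tank/turret/barrel placements are made up for illustration:

```python
# Composing the chain M * S_n * ... * S_0 (column-vector convention).
from functools import reduce

def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    return [[1, 0, 0, x],
            [0, 1, 0, y],
            [0, 0, 1, z],
            [0, 0, 0, 1]]

def transform_point(m, p):
    v = [p[0], p[1], p[2], 1]
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))

M = translation(5, 0, 0)        # the tank's placement in the world
chain = [translation(0, 2, 0),  # S_n: the turret relative to the tank
         translation(0, 0, 1)]  # S_0: the barrel relative to the turret
W = reduce(mat_mul, chain, M)   # W := M * S_n * ... * S_0

print(transform_point(W, (0, 0, 0)))  # → (5, 2, 1)
```

The barrel's local origin picks up the offsets of every ancestor on the way up to the world.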

4.) View transformation: Now that our models are given relative to the global "world" space via **W**, suppose we have placed a camera with transform **C** relative to the world. Because we see in view space and not in world space, we need the inverse of **C**, and hence obtain the view transform **V**:

**V** := **C**^{-1}
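If **C** is a rigid transform (rotation plus translation, no scale), its inverse is cheap: transpose the rotation block and rotate the negated translation, i.e. [R | t]^{-1} = [R^T | -R^T t]. A sketch under that assumption (for a non-rigid **C** a general 4×4 inverse would be needed instead):

```python
def rigid_inverse(c):
    """Invert a rigid 4x4 transform [R | t] as [R^T | -R^T t]."""
    r_t = [[c[j][i] for j in range(3)] for i in range(3)]  # R transposed
    t = [c[0][3], c[1][3], c[2][3]]                        # translation part
    neg = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[0] + [neg[0]],
            r_t[1] + [neg[1]],
            r_t[2] + [neg[2]],
            [0, 0, 0, 1]]

# Camera placed 5 units along +z, no rotation:
C = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 5],
     [0, 0, 0, 1]]
V = rigid_inverse(C)  # V := C^{-1}

# The camera's own position maps to the view-space origin:
p = [0, 0, 5, 1]
print([sum(V[i][k] * p[k] for k in range(4)) for i in range(3)])  # → [0, 0, 0]
```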

5.) The projection **P** is applied in view space and yields normalized device co-ordinates (after the perspective divide).

6.) All together, the transform so far looks like

**P** * **V** * **W**

(still using column vectors) to come from a model local space into device co-ordinates.

Herein **P** perhaps never changes during the entire runtime of the game, **V** usually changes from frame to frame (leaving aside things like portals and mirrors), and **W** changes from model to model within a frame. This can be exploited by computing **P** * **V** once per frame and re-using it throughout that frame. So the question is what to do with **W**.
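The per-frame caching described above can be sketched like so; `draw` is a stub standing in for the real render call, and all names are made up for illustration:

```python
def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def identity():
    return [[1 if i == j else 0 for j in range(4)] for i in range(4)]

drawn = []
def draw(mesh, mvp):
    """Stub render call: just record what would be drawn."""
    drawn.append((mesh, mvp))

def render_frame(P, V, models):
    """models: list of (W, mesh) pairs, one per render call."""
    PV = mat_mul(P, V)        # computed once per frame
    for W, mesh in models:
        PVW = mat_mul(PV, W)  # one matrix product per render call
        draw(mesh, PVW)

render_frame(identity(), identity(),
             [(identity(), "tank"), (identity(), "turret")])
```

The point is simply that the frame-constant product **P** * **V** is hoisted out of the per-model loop.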

7.) Comparing situation C with situation A shows that A looks like C with just a single level of parenting. Situation A would simply allow for setting the constant **M** once per model, and applying the varying **S** per sub-mesh (i.e. per render call). Using parentheses to express what I mean:

( ( **P** * **V** ) * **M** ) * **S**

But thinking about the performance of render calls, where switching textures, shaders, and so on all have their costs, frequently switching materials in particular may be a no-go, so render calls with the same material should be batched. Obviously this contradicts the simple re-use of **M** as shown above. Instead, we get a situation where each sub-mesh's own **W** is computed on the CPU, and the render call gets

( **P** * **V** ) * **W**

BTW, this is also the typical way to deal with parenting, partly for the same reasons.

Another drawback of using both **M** and **S** is that either you supply both matrices also for stand-alone models (for which **M** alone would be sufficient), or else you have to double your set of shaders.

So in the end it is usual to deal with all 3 situations in a common way: compute the MODEL matrix on the CPU and send it to the shader. (Just a suggestion, of course.) ;)