1.) Situation A: We have sub-meshes. This is usually the case if the model has more than a single material and the rendering system cannot blend materials, so the model is divided into sub-meshes where each sub-mesh has its own material. Further, each sub-mesh has its own local placement S relative to the model. If we still name the model's placement in the world M, then the local-to-world matrix W of the sub-mesh is given by
W := M * S
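The composition W := M * S can be sketched in a few lines. This is a minimal illustration, assuming row-major storage and the column-vector convention used throughout this answer; the helper names (`mat_mul`, `translation`) are made up for the example.

```python
# Minimal 4x4 matrix helpers (row-major storage, column-vector convention).
def mat_mul(a, b):
    """Return the 4x4 product a * b."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def translation(x, y, z):
    """Homogeneous translation matrix."""
    return [[1, 0, 0, x],
            [0, 1, 0, y],
            [0, 0, 1, z],
            [0, 0, 0, 1]]

# Model placed in the world, sub-mesh placed relative to the model:
M = translation(10, 0, 0)   # model -> world
S = translation(0, 2, 0)    # sub-mesh -> model
W = mat_mul(M, S)           # sub-mesh -> world, i.e. W := M * S
# The sub-mesh origin ends up at (10, 2, 0) in world space.
```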
2.) Situation B: We have sub-meshes as above, but the sub-meshes are statically transformed, so that S is already applied to the vertices during pre-processing. The computation at runtime is then just
W := M
3.) Situation C: We have models composed by parenting sub-models, e.g. rigid models of arms and legs structured in a skeleton, or rigid models of turrets on a tank. So a sub-model has a parent model, and that parent may itself be a sub-model with another parent, up until the main model is reached. Hence each sub-model has its own local placement Si relative to its parent. Corresponding to the depth of parenting, we get a chain of transformations like so:
W := M * Sn * Sn-1 * … * S0
where S0 belongs to the innermost sub-model (the one that is not itself a parent of any further sub-model), and Sn to the outermost sub-model (the one whose parent is the main model itself). With column vectors, S0 is the first matrix applied to a vertex.
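Walking such a parent chain can be sketched as follows. Again a minimal illustration with assumed helper names; the chain is passed from outermost (Sn) to innermost (S0), matching the multiplication order of the formula above.

```python
# Computing W := M * Sn * ... * S0 by walking the parent chain.
def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def world_matrix(M, chain):
    """chain lists the local placements from outermost (Sn) to leaf (S0)."""
    W = M
    for S in chain:
        W = mat_mul(W, S)
    return W

M  = translation(5, 0, 0)    # tank -> world
S1 = translation(0, 1, 0)    # turret -> tank
S0 = translation(2, 0, 0)    # barrel -> turret
W = world_matrix(M, [S1, S0])
# Barrel origin: (5 + 2, 0 + 1, 0) = (7, 1, 0) in world space.
```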
4.) View transformation: With our models now given relative to the global "world" space due to W, we place a camera with transform C relative to the world. Because we render in view space and not in world space, we need the inverse of C, which yields the view transform V:
V := C^-1
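Since a camera placement is normally a rigid transform (rotation R plus translation t), the inverse can be computed directly as [R^T | -R^T t] instead of running a general 4x4 inversion. A sketch under the same assumptions as above (row-major storage, column vectors); `rigid_inverse` is a made-up name:

```python
# Inverse of a rigid transform C = [R | t]: C^-1 = [R^T | -R^T t].
def rigid_inverse(C):
    R_T = [[C[c][r] for c in range(3)] for r in range(3)]   # transpose of R
    t = [C[r][3] for r in range(3)]
    neg_RT_t = [-sum(R_T[r][k] * t[k] for k in range(3)) for r in range(3)]
    V = [[*R_T[r], neg_RT_t[r]] for r in range(3)]
    V.append([0, 0, 0, 1])
    return V

# Camera placed at (0, 0, 5), no rotation:
C = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 5], [0, 0, 0, 1]]
V = rigid_inverse(C)
# V shifts the world by -5 along z, so the camera ends up at the origin.
```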
5.) The projection P is applied in view space and yields normalized device co-ordinates.
6.) All together, the transform so far looks like
P * V * W
(still using column vectors) to go from a model's local space into device co-ordinates.
Herein P perhaps never changes during the entire runtime of the game, V usually changes from frame to frame (leaving things like portals and mirrors aside), and W changes from model to model within a frame. This can be exploited by computing P * V once per frame and re-using it again and again during that frame. So the question is what to do with W.
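The per-frame caching just described can be sketched like so (same assumed matrix helpers as before; `render_frame` is a made-up name standing in for whatever the render loop actually looks like):

```python
# Compute P * V once per frame, then combine it with each model's W per draw.
def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def identity():
    return [[1 if r == c else 0 for c in range(4)] for r in range(4)]

def translation(x, y, z):
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

def render_frame(P, V, world_matrices):
    PV = mat_mul(P, V)                               # once per frame
    return [mat_mul(PV, W) for W in world_matrices]  # once per draw call

P = identity()                  # placeholder projection
V = translation(0, 0, -5)       # world -> view
Ws = [translation(1, 0, 0), translation(2, 0, 0)]
mvps = render_frame(P, V, Ws)
# Each entry is (P * V) * W, ready to be uploaded for its draw call.
```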
7.) Comparing situation C with situation A shows that A looks like C with just a single level. Situation A would simply allow setting the constant M once per model and applying the varying S per sub-mesh (i.e. per render call). Using parentheses to express what I mean:
( ( P * V ) * M ) * S
But thinking about the performance of render calls, where switching textures, shaders, and whatever else has its costs, frequently switching materials in particular may be a no-go, so render calls with the same material are to be batched. Obviously this contradicts the simple re-use of M as shown above. Instead, we get a situation where each sub-mesh's own W is computed on the CPU, and the render call gets
( P * V ) * W
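The batching argument can be made concrete with a small sketch: group the draws by material so each material is bound only once, with each sub-mesh carrying its own pre-computed W. The function name and data layout are made up for illustration.

```python
# Group (material_id, W) pairs so each material needs to be bound only once.
from collections import defaultdict

def batch_by_material(submeshes):
    """submeshes: iterable of (material_id, world_matrix) pairs."""
    batches = defaultdict(list)
    for material, W in submeshes:
        batches[material].append(W)
    return batches

draws = [("stone", "W0"), ("metal", "W1"), ("stone", "W2")]
batches = batch_by_material(draws)
# Two material switches instead of three: stone -> [W0, W2], metal -> [W1].
```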
BTW, this is also the typical way of dealing with parenting, partly for the same reasons.
Another drawback of using both M and S is that you either have to supply both matrices even for stand-alone models (for which M alone would be sufficient), or else double your set of shaders.
So in the end it is usual to handle all 3 situations in a common way: compute the MODEL matrix W on the CPU and send it to the shader. (Just a suggestion.) ;)