So as an example, I choose the animation "running melee" and then I go to the blend tree which defines "running melee" as the two animations "running" and "melee" and mixes them (or in my case overwrites the needed nodes with the melee animation)? I might be getting that wrong, but how would the blend tree know then which frame the running is on and which frame the melee is on? Maybe that's a bad example on my part...
Let us say there are the animations idleStand, moveWalk, moveRun, leanLeft, leanRight, meleeAttack.
The first thing to do is to use a variable forwardSpeed to blend the animations idleStand, moveWalk, and moveRun, so that e.g. for a speed greater than or equal to 0 and less than 3 the animations idleStand and moveWalk are blended, and for a speed greater than or equal to 3 and up to 6 (the maximum speed) the animations moveWalk and moveRun are blended. The blend weights are then computed e.g. like so:
w_idleStand = ( forwardSpeed < 3 ) ? ( 1 - forwardSpeed / 3 ) : ( 0 )
w_moveWalk = ( forwardSpeed < 3 ) ? ( forwardSpeed / 3 ) : ( 1 - ( forwardSpeed - 3 ) / 3 )
w_moveRun = ( forwardSpeed < 3 ) ? ( 0 ) : ( ( forwardSpeed - 3 ) / 3 )
This would be the job of the "move forward" branch of the animation tree.
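The three weight formulas above can be sketched as one small function. This is only an illustration: the function name, the default thresholds as parameters, and the clamping of the input are my assumptions, not something the formulas themselves prescribe.

```python
def locomotion_weights(forward_speed, walk_speed=3.0, max_speed=6.0):
    """Return (w_idleStand, w_moveWalk, w_moveRun) for a given forward speed."""
    # Assumption: clamp the input, since the formulas expect 0 <= speed <= max_speed.
    s = min(max(forward_speed, 0.0), max_speed)
    if s < walk_speed:
        # Lower range: blend idleStand with moveWalk.
        t = s / walk_speed
        return (1.0 - t, t, 0.0)
    # Upper range: blend moveWalk with moveRun.
    t = (s - walk_speed) / (max_speed - walk_speed)
    return (0.0, 1.0 - t, t)
```

For any speed in range the three weights sum to 1, and at most two of them are non-zero at a time.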
The next step is to use a variable lean to mix the result from above with either leanLeft or leanRight. This may be implemented as an additive animation, or perhaps by linear blending with appropriate weights.
With the above set-up the entire skeleton is animated (e.g. the upper body and the arms are animated accordingly, too). Now when the meleeAttack animation comes into play, it should override what the above animation defines for the upper body and arms. This is done by defining a new layer for meleeAttack. This layer works on the upper body and the arms only, not on the legs. One possibility to implement the layering is to give each part (or perhaps each bone) a contingent of 100%, i.e. a budget of influence that is still available. If a layer writes animation data to a part / bone, then the contingent is reduced by a defined amount. Any layer below is allowed to mix in only as much as the remaining contingent allows.
An example: At the beginning of computing the pose, the contingent is initialized with 100%. The top layer with the meleeAttack is not playing, so the contingent is not reduced. The lower layer where the forward movement is handled has 100% control over the entire body. Later on the meleeAttack is triggered. The upper layer takes control of the upper body and arms, sets them into an appropriate pose, and reduces the contingent by, say, 95%. So the lower layer with the forward movement is able to influence the upper body and arms with just 5%. For a total override a layer would reduce the contingent by 100%, of course.
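A minimal sketch of that contingent bookkeeping, under some loud assumptions: a pose is just a dict from bone name to a single number (a stand-in for a real joint transform), layers are processed from top to bottom, and a layer takes whichever is smaller of its claim and the remaining contingent. The names compose_layers, claim, and the bone names are hypothetical.

```python
def compose_layers(layers, bones):
    """Combine layers given top-first as (pose, claim) pairs.

    pose maps bone name -> value for the bones the layer animates;
    claim is the fraction (0..1) by which the layer reduces the contingent.
    """
    remaining = {bone: 1.0 for bone in bones}   # contingent starts at 100%
    result = {bone: 0.0 for bone in bones}
    for pose, claim in layers:
        for bone, value in pose.items():
            # Assumption: a layer can never take more than what is left.
            weight = min(claim, remaining[bone])
            result[bone] += weight * value
            remaining[bone] -= weight
    return result


bones = ["spine", "leftLeg"]
melee = ({"spine": 40.0}, 0.95)                        # upper body only
locomotion = ({"spine": 10.0, "leftLeg": 20.0}, 1.0)   # whole body

attacking = compose_layers([melee, locomotion], bones)
not_attacking = compose_layers([locomotion], bones)
```

While attacking, the spine ends up as 95% melee plus 5% locomotion, and the leg is still fully driven by the locomotion layer because the melee layer never writes to it.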
Seen as an animation tree, we have 3 leaf nodes with the animations idleStand, moveWalk, and moveRun. These 3 nodes are the leaves of a one-axis linear blend node. The output of that blend node is the forward movement. Further we have 3 leaf nodes for the animations leanLeft, leanRight, and Identity. These nodes are the leaves of a blend node which effectively selects one of its children. Both the forward movement and leaning nodes are children of an additive blend node. Further we have a leaf node with meleeAttack. Both the output of the additive blend node and the meleeAttack node are children of a layering node parametrized as described above.
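To make the tree structure concrete, here is a rough sketch of it. Everything in it is an assumption for illustration only: the class names, poses reduced to a dict of bone name to one number, the made-up clip values, and the lean node being realized as a one-axis blend over leanLeft / Identity / leanRight (so a fractional lean mixes with Identity instead of strictly selecting one child).

```python
class Clip:
    """Leaf node holding a fixed pose (bone name -> value)."""
    def __init__(self, pose):
        self.pose = pose
    def evaluate(self, params):
        return dict(self.pose)

class Blend1D:
    """One-axis linear blend between the two neighbouring children."""
    def __init__(self, param, points):
        self.param = param
        self.points = points            # (threshold, node), ascending
    def evaluate(self, params):
        x = params[self.param]
        lo_t, lo = self.points[0]
        if x <= lo_t:
            return lo.evaluate(params)
        for hi_t, hi in self.points[1:]:
            if x <= hi_t:
                t = (x - lo_t) / (hi_t - lo_t)
                a, b = lo.evaluate(params), hi.evaluate(params)
                return {k: (1 - t) * a[k] + t * b[k] for k in a}
            lo_t, lo = hi_t, hi
        return lo.evaluate(params)

class Additive:
    """Adds the delta child's pose on top of the base child's pose."""
    def __init__(self, base, delta):
        self.base, self.delta = base, delta
    def evaluate(self, params):
        pose = self.base.evaluate(params)
        for bone, offset in self.delta.evaluate(params).items():
            pose[bone] = pose.get(bone, 0.0) + offset
        return pose

class Layer:
    """Upper child overrides the lower one on the bones it animates."""
    def __init__(self, param, upper, lower, claim):
        self.param, self.upper, self.lower, self.claim = param, upper, lower, claim
    def evaluate(self, params):
        pose = self.lower.evaluate(params)
        if params[self.param]:
            for bone, value in self.upper.evaluate(params).items():
                pose[bone] = self.claim * value + (1 - self.claim) * pose[bone]
        return pose

# Made-up clip data, two "bones" only.
idle_stand = Clip({"spine": 0.0, "leg": 0.0})
move_walk = Clip({"spine": 2.0, "leg": 10.0})
move_run = Clip({"spine": 6.0, "leg": 30.0})
lean_left = Clip({"spine": -8.0, "leg": 0.0})
identity = Clip({"spine": 0.0, "leg": 0.0})
lean_right = Clip({"spine": 8.0, "leg": 0.0})
melee_attack = Clip({"spine": 40.0})    # upper body only

forward = Blend1D("forwardSpeed", [(0, idle_stand), (3, move_walk), (6, move_run)])
leaning = Blend1D("lean", [(-1, lean_left), (0, identity), (1, lean_right)])
tree = Layer("attack", melee_attack, Additive(forward, leaning), claim=0.95)

params = {"forwardSpeed": 1.5, "lean": -0.25, "attack": True}
pose = tree.evaluate(params)
```

Note that nothing here picks an animation by name. The caller only sets forwardSpeed, lean, and attack, and the tree works out the mix on every evaluation.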
As can be seen from this description, animations are not chosen directly. Instead, the animation parameters are set according to the situation. E.g. the forwardSpeed variable is set to 1.5, the leaning variable to -0.25, and the attack variable to true. These settings are driven from the player controller in the case of the PC or from the animation controller in the case of an NPC.
EDIT: BTW, all this does not mean that a state machine cannot be useful. Just notice that a state machine is good for discrete distinctions (the states), while the mechanisms described above easily allow for continuous mixing.