Mitigate the cost of glUseProgram() by sorting objects with the same shaders such that they can be drawn without calling glUseProgram() (render queues).
Branches inside shaders are almost free as long as all pixels in a block take the same branch. Still there is a small cost for branching, and it never makes sense to keep a branch if you have to swap shaders anyway (such that the branch is always taken in one and never in the other). For example, opaque and translucent objects should always be permutated and without a branch indicating "alpha or not".
A branch is costly if in a block some pixels will take one path and others the other. In that case each branch waits for the other, meaning all pixels ran both branches.
When to permutate is decided based on the balance between these costs.
BTW, if one block of the branch (or even branch chain) requires high register usage, it will reduce the number of warps in flight, which can have an overall negative impact on performance (your super simple "sub-shader" may be running with the same register allocation as your super complex one).
So as a rule of thumb, it is often better to group your complex shaders together in one uber shader and your simpler shaders in another - so slightly less uber ;)