Somebody correct me if I'm wrong, but if every thread in a shader takes the same branch, it costs nearly (not entirely) the same as if the changes were compiled in with defines. If you are rendering an object with a specific set of branches that every thread will take, it may not be a big deal. If threads take different branches you will eat the cost of all branches.
Branching on constants is nice, because there's no chance of divergence. Divergence is when some threads within a warp/wavefront take one side of the branch, and some take another. It's bad because you end up paying the cost of executing both sides of the branch. When branching on a constant everybody takes the same path, so you only pay some (typically small) cost for checking the constant and executing the branch instructions.
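To make the distinction concrete, here is a minimal HLSL sketch of the two kinds of branch. The cbuffer name, the entry point name, and the per-pixel condition are all made up for illustration. Keep in mind that a compiler may flatten branches this small into selects, so treat this as a picture of the idea rather than a statement about the code that actually gets generated.

```hlsl
// Hypothetical constant buffer; the value is the same for every thread in a draw.
cbuffer PerObject
{
    bool EnableLights;
};

// Hypothetical pixel shader entry point.
float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float3 color = float3(0.5, 0.5, 0.5);

    // Uniform branch: the condition comes from a constant, so every thread
    // in a warp/wavefront goes the same way and only one side is executed.
    if (EnableLights)
    {
        color *= 2.0;
    }

    // Potentially divergent branch: the condition varies per pixel, so threads
    // in the same warp/wavefront can disagree and both sides may be paid for.
    if (uv.x > 0.5)
    {
        color.r = 1.0;
    }
    else
    {
        color.b = 1.0;
    }

    return float4(color, 1.0);
}
```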
As for whether it's the same cost as compiling an entirely different shader permutation, it may depend on what's in the branch as well as the particulars of the hardware. One potential issue with branching on constants is register pressure. Many GPUs have static register allocation, which means that they compute the maximum number of registers needed by a particular shader program at compile time and then make sure that the maximum is always available when the shader is actually executed. Typically the register file has a fixed size and is shared among multiple warps/wavefronts, which means that if a shader needs lots of registers then fewer warps/wavefronts can be in flight simultaneously. GPUs like to hide latency by swapping out warps/wavefronts, so having fewer in flight limits their ability to hide latency from memory access. So let's say that you have something like this:
// EnableLights comes from a constant buffer
if(EnableLights)
{
    DoFancyLightingCodeThatUsesLotsOfRegisters();
}
By branching you can avoid the direct cost of computing the lighting, but you won't be able to avoid the indirect cost that may occur from increased register pressure. However, if you were to use a preprocessor macro instead of a branch and compile 2 permutations, then the permutation with lighting disabled can potentially use fewer registers and have greater occupancy. But again, this depends quite a bit on the specifics of the hardware as well as your shader code, so you don't want to generalize about this too much. In many cases branching on a constant might have the exact same performance as creating a second shader permutation, or might even be faster due to CPU overhead from switching shaders.
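For comparison, the macro-based version of the snippet above might look roughly like this. ENABLE_LIGHTS is a hypothetical macro that you would define per permutation when compiling (for example through the compiler's define option), and the lighting function body is only a stand-in for the expensive code. With ENABLE_LIGHTS set to 0 the lighting code is never compiled into the shader, so it can't contribute to that permutation's register count.

```hlsl
// Hypothetical macro: compile once with ENABLE_LIGHTS=1 and once with ENABLE_LIGHTS=0.
#ifndef ENABLE_LIGHTS
#define ENABLE_LIGHTS 0
#endif

// Stand-in for the expensive lighting code from the example above.
float3 DoFancyLightingCodeThatUsesLotsOfRegisters(float3 normal)
{
    return saturate(dot(normal, float3(0.0, 1.0, 0.0))).xxx;
}

// Hypothetical helper called from the pixel shader.
float3 ShadeSurface(float3 albedo, float3 normal)
{
    float3 color = albedo;

#if ENABLE_LIGHTS
    // Only present in the "lights on" permutation.
    color *= DoFancyLightingCodeThatUsesLotsOfRegisters(normal);
#endif

    return color;
}
```

Whether the second permutation actually ends up with better occupancy is something you'd have to check with your hardware vendor's shader analysis tools.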