Doing 12 samples software-style, does that mean 12 texture lookups?
You can use branching to your benefit. First do a quick test with 4 lookups. If all of them are inside the shadow, or all of them are outside, you can mark the pixel as shadowed/unshadowed right away. If you get a mixed result, just look up X more shadow texels. The benefit comes from how GPUs work (at least some of them, it depends on the hardware). GPUs often group their processing units into wavefronts, cells, or whatever they are called, which process multiple inputs in lockstep (every unit does the same work at the same time). This is why branching sometimes hurts: if some units of a group need to do something else, the rest of the group has to wait, and the total processing time of the whole group goes up.
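To make the two-stage idea concrete, here is a minimal CPU-side sketch in Python (not shader code). The names shadow_factor, in_shadow, QUICK and EXTRA are made up for illustration, and the 4 + 16 tap layout is just an assumption picked to match the numbers in this thread. A real shader would do the same with a dynamic branch around the extra lookups.

```python
def shadow_factor(in_shadow, quick_offsets, extra_offsets):
    """Return (shadow factor in [0, 1], number of lookups performed).

    in_shadow(dx, dy) is a hypothetical stand-in for one shadow map
    lookup plus depth compare; it returns True if that tap is shadowed.
    """
    # Stage 1: cheap test with only a few taps.
    quick = [in_shadow(dx, dy) for dx, dy in quick_offsets]
    lookups = len(quick)
    if all(quick):
        return 1.0, lookups  # fully shadowed, early out
    if not any(quick):
        return 0.0, lookups  # fully lit, early out

    # Stage 2: mixed result, so we are near a shadow edge. Take more taps.
    extra = [in_shadow(dx, dy) for dx, dy in extra_offsets]
    lookups += len(extra)
    samples = quick + extra
    return sum(samples) / len(samples), lookups


# Assumed tap patterns: 4 outer corner taps, then a 4x4 grid inside them.
QUICK = [(-2.0, -2.0), (2.0, -2.0), (-2.0, 2.0), (2.0, 2.0)]
EXTRA = [(x - 1.5, y - 1.5) for y in range(4) for x in range(4)]
```

With a vertical shadow edge through the pixel (in_shadow = lambda dx, dy: dx < 0) this takes all 20 lookups, while a pixel that is fully lit or fully shadowed stops after 4.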
Nevertheless, in our case we want to cut, let's say, 20 texel accesses down to just 4 (sometimes). If just one unit of a wavefront hits a mixed shadow result, all pixels of that wavefront take the same (worst-case) time. But if all units need only the 4 texel accesses, you suddenly save a lot of processing time for this wavefront. A pixel wavefront has, e.g., an 8x8 block size, though this really depends on the hardware architecture. And the probability that an 8x8 block is completely inside or outside the shadow is quite high. Instead of, let's say, 20 texel accesses in the worst case, you suddenly have an average of ~12 texel accesses (assuming 50% of the wavefronts need the extra lookups => 4 + 16*50%).
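The arithmetic can be checked with a small cost model, again as a rough Python sketch under stated assumptions: pixels are grouped into 8x8 blocks, a block pays for the extra lookups as soon as a single pixel in it is mixed, and only texel accesses are counted. The function names are hypothetical, and real hardware scheduling is more complicated than this.

```python
def expected_lookups(mixed_fraction, quick=4, extra=16):
    """Average lookups per pixel if this fraction of wavefronts is mixed."""
    return quick + extra * mixed_fraction


def average_lookups_per_pixel(mixed, block=8, quick=4, extra=16):
    """mixed: 2D list of bools, True where a pixel's quick test is mixed.

    Every pixel in a block pays the worst-case cost of that block,
    which models the lockstep execution of a wavefront.
    """
    height, width = len(mixed), len(mixed[0])
    total = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            rows = [row[bx:bx + block] for row in mixed[by:by + block]]
            block_mixed = any(any(row) for row in rows)
            pixels = sum(len(row) for row in rows)
            cost = quick + (extra if block_mixed else 0)
            total += cost * pixels
    return total / (width * height)


# 16x16 pixels with a shadow edge running down the leftmost column:
# the two left 8x8 blocks are mixed, the two right ones are not.
edge = [[x == 0 for x in range(16)] for _ in range(16)]
```

Here expected_lookups(0.5) gives 12.0, and average_lookups_per_pixel(edge) gives the same value because half of the blocks contain the edge. An 8x8 block with a single mixed pixel costs the full 20 lookups for every pixel in it.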
If you work on a console, you should have access to detailed GPU architecture information and can tune for it accordingly. If you work on the PC with unknown hardware, just build in an option that lets the user choose the shadow smoothness (from 1 sample up to 48 samples), so they can decide for themselves which is best (performance vs quality).