Carmack's: 6.16 clocks, precision: 1.75e-3
sqrt_ss(x): 5.92 clocks, precision: 8.4e-8
mulss(rsqrtss(x)): 2.08 clocks, precision: 3e-4
0x1FBD1DF5: 2.32 clocks, precision: 4.3e-2
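For reference, here is roughly what I assume the last two rows correspond to in C (function names are mine; the magic constant is taken straight from the table):

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* mulss(rsqrtss(x)): sqrt(x) approximated as x * rsqrt(x) */
static inline float sqrt_via_rsqrt(float x) {
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}

/* 0x1FBD1DF5: halve the exponent by shifting the bits right and
   adding a magic bias; this is the psrld paddd trick in scalar form */
static inline float sqrt_via_bits(float x) {
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x1FBD1DF5 + (i >> 1);
    memcpy(&x, &i, sizeof i);
    return x;
}
```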
Though I think we all agree that the sqrt itself is rather irrelevant when the actual CPU cycle counts are this close to 1, it is certainly interesting, and I ran some tests as well.
For one thing, I think you're measuring loads and stores, as I managed to reproduce almost exactly those numbers (also on Haswell), and the ~2 clocks for rsqrt were identical to a plain mem0 = mem1 copy (unless your test compensates for the load/store time?).
I glanced at the code, and it looked as if the load/store time was included in the timing.
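If it is, one way to factor it out would be to subtract a plain-copy baseline. A minimal sketch of what I mean, using __rdtsc rather than whatever the original harness does (a real harness would also stop the compiler from optimizing the loops away):

```c
#include <math.h>
#include <stdint.h>
#include <x86intrin.h>

/* Hypothetical: estimate per-element sqrt cost by subtracting the
   cost of a bare mem0 = mem1 copy loop from a copy-plus-sqrt loop,
   so load/store time isn't attributed to the sqrt itself. */
double sqrt_clocks_minus_copy(float *mem0, const float *mem1, int n) {
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < n; ++i) mem0[i] = mem1[i];        /* baseline: copy only */
    uint64_t t1 = __rdtsc();
    for (int i = 0; i < n; ++i) mem0[i] = sqrtf(mem1[i]); /* copy plus sqrt */
    uint64_t t2 = __rdtsc();
    return (double)((t2 - t1) - (t1 - t0)) / n;           /* sqrt cost alone */
}
```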
I guess that, in order to make anything like this meaningful at all, we have to assume we're at a point in a program where we already have our float value in an XMM register and want to measure only the time it takes to get its sqrt into another XMM register; in other words, we measure latency.
However, if we have the sqrt in a loop and the compiler is smart, it might order operations so that throughput is what matters. The loop in your test does independent sqrts, so the second sqrt doesn't wait for the first; it therefore measures throughput but hides latency.
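To make the distinction concrete: a dependent chain exposes latency, while independent operations expose throughput. A sketch with intrinsics (function names are mine):

```c
#include <xmmintrin.h>

/* Latency: each sqrtss consumes the previous result, so the loop
   can only run one sqrt per latency period. */
float sqrt_latency_chain(float x, int n) {
    __m128 v = _mm_set_ss(x);
    for (int i = 0; i < n; ++i)
        v = _mm_sqrt_ss(v);            /* must wait for the previous sqrt */
    return _mm_cvtss_f32(v);
}

/* Throughput: the sqrts take independent inputs, so the pipeline can
   overlap them; the cheap add chain shouldn't become the bottleneck. */
float sqrt_throughput(const float *xs, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; ++i)
        acc = _mm_add_ss(acc, _mm_sqrt_ss(_mm_set_ss(xs[i])));
    return _mm_cvtss_f32(acc);
}
```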
I tried it in ASM, and for throughput got something like sqrtss ~= 7, mulss rsqrtss ~= 2.5, psrld paddd ~= 1.5.
When measuring latency, however, I got sqrtss ~= 10, mulss rsqrtss ~= 10, psrld paddd ~= 2. Looking at Intel's specifications, this matches pretty closely; mulss, for example, has a throughput of 1 cycle but a latency of 4 cycles. sqrtss is actually listed at 13 cycles of latency (if I managed to look at the right column), which probably means my test wasn't constructed well enough to hide such a long latency.
psrld has a latency of 1 and a throughput of 1, whereas paddd has a latency of 1 but a throughput of 0.5, so it can actually do two adds per cycle.
If you have a long sequence of floating-point calculations, at some point want to turn one of your floats into the sqrt of itself at minimal latency, and don't care much about precision, psrld paddd is probably the way to go.
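In SIMD form that is literally the whole function; a sketch assuming the same 0x1FBD1DF5 constant from the table above:

```c
#include <emmintrin.h>  /* SSE2 */

/* Approximate sqrt of four floats at once: psrld halves the exponent,
   paddd adds the magic bias. Two cheap integer instructions, no use
   of the floating-point divider at all. */
static inline __m128 sqrt4_psrld_paddd(__m128 x) {
    __m128i i = _mm_srli_epi32(_mm_castps_si128(x), 1);  /* psrld */
    i = _mm_add_epi32(i, _mm_set1_epi32(0x1FBD1DF5));    /* paddd */
    return _mm_castsi128_ps(i);
}
```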
Also note that the doc linked by Chris_F actually talks about GPUs, doing sqrts in a pixel shader for example, while we're discussing using the same methods on a completely different architecture. I would guess these things matter a lot more on GPUs nowadays, as we're much more often at a point where a few instructions can actually make a difference in performance, and latency is already hidden as well as possible by the GPU, so by the time it starts to matter, parallelism has already been maximized and can't be improved further.
On CPUs I would guess this case is rarer and much harder to predict in general code, and depending on the code around the sqrt, different methods can give the best results.
I think a script that brute-force tested every sqrt in a program individually with all of these methods would find that performance benefits from one method in some cases and from another in others.