Are square roots still really that evil?

12 comments, last by alvaro 10 years ago
Was just wondering: a lot of topics, replies and articles state that you should avoid using square roots in your equations 'at all costs'. Of course, 'all costs' in my opinion depends on what you have to do to avoid the sqrt.

With today's hardware, is it still reasonable to believe that 3 to 6 multiplications and value assignments are cheaper than 1 sqrt? (Of course profiling would tell, but I'm just curious about experience and opinions.)

The examples I'm talking about are mainly distance comparisons, e.g. on the CPU (point-to-point distance, point-in-sphere check etc.) but also on the GPU side (e.g. for light attenuation).

Crealysm game & engine development: http://www.crealysm.com

Looking for a passionate, disciplined and structured producer? PM me

Well, distance comparisons don't "require" square roots as the square root part of the Euclidean metric does not change the order of comparisons (since the square root function is strictly increasing). Similarly, for light attenuation, you typically don't need the distance but the distance squared, so what is the point of calculating the square root just to square the result immediately after? Unless you need the actual distance/radius at some point, I don't see what you gain by doing the computation.

So, where do the "3 to 6 multiplications and value assignments" come in when doing distance comparisons or distance squared computations? To me it just seems like taking the square root is straight up a waste of energy here. If you have a specific situation where avoiding doing a square root requires some extra work, please mention it, because the examples you give don't really seem relevant to your question.
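To make the ordering argument concrete, here is a minimal sketch of a nearest-point search that never takes a square root. The `Vec3` struct and the names `DistanceSq` and `NearestIndex` are made up for illustration (a stand-in for whatever vector type you use), not part of any library.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for D3DXVECTOR3

// Squared Euclidean distance: three subtractions, three multiplies, two adds.
inline float DistanceSq(const Vec3& a, const Vec3& b)
{
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Index of the candidate closest to target (0 if candidates is empty).
// sqrt is strictly increasing, so comparing squared distances picks
// the same winner as comparing real distances.
inline std::size_t NearestIndex(const Vec3& target, const std::vector<Vec3>& candidates)
{
    std::size_t best = 0;
    float bestSq = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const float dSq = DistanceSq(target, candidates[i]);
        if (dSq < bestSq) {
            bestSq = dSq;
            best = i;
        }
    }
    return best;
}
```

If the actual distance to the winner is needed afterwards, a single sqrt of the best squared distance at the end is enough.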

“If I understand the standard right it is legal and safe to do this but the resulting value could be anything.”

Point in sphere and distance checks by themselves can be done by comparing the squared distance to the squared radius, thereby avoiding sqrt.

When you actually need sqrt, it's not very evil on newer desktop processors. But other instructions have gotten faster too, so sqrt can still be slow relative to them.

There are also special instructions on many newer processors for calculating them. One reference I found put sqrt for a single float in SSE at 19 clock cycles, while the 1 / sqrt instruction, which is only an approximation accurate to some number of bits, takes only 3 cycles. So if the approximation is good enough for your purposes, it would probably be the fastest way.

It's not that evil, but the fastest code is the code that you never have to run. So if you don't need to do a sqrt, then don't do it.

Normalization involves a sqrt, so since normalization is used so much, you can safely assume that GPUs are optimized for it.

Of course, you should prefer the built-in normalize function over writing your own normalization, since it may be implemented directly in hardware.

As the others said, for distance comparisons etc., where you can get away without doing a sqrt, do so.
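Just to show where the square root hides in a normalization, here is a CPU-side sketch. `Vec3`, `Dot` and `Normalize` are hypothetical names for illustration; in a shader you would call the built-in instead of writing this by hand.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };  // hypothetical vector type

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Scales v to unit length. A zero vector is returned unchanged
// to avoid dividing by zero.
inline Vec3 Normalize(const Vec3& v)
{
    const float lenSq = Dot(v, v);
    if (lenSq == 0.0f) {
        return v;
    }
    // One sqrt and one divide, then three multiplies.
    const float invLen = 1.0f / std::sqrt(lenSq);
    return Vec3{v.x * invLen, v.y * invLen, v.z * invLen};
}
```

The `1.0f / std::sqrt(lenSq)` part is exactly the reciprocal square root, which is why hardware with a fast rsqrt can make normalization cheap.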

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

Latencies in cycles, straight from the Intel Intrinsics Guide:

_mm256_add_ps (add 8 pairs of floats): 3
_mm256_mul_ps (multiply 8 pairs of floats): 5
_mm256_rcp_ps (compute approx. reciprocals of 8 floats): 7
_mm256_rsqrt_ps (compute approx. reciprocals of square roots of 8 floats): 7
_mm256_div_ps (divide 8 pairs of floats): 29
_mm256_sqrt_ps (compute square roots of 8 floats): 29

For reference (from the Intel VTune performance analysis guide), an L1 load is about 4 cycles, an L2 load about 10 cycles, an L3 load ranges from tens to hundreds of cycles, and a RAM access takes forever. This is assuming no TLB miss.

Today, precise square root has the same latency as precise division.

So essentially what others already said: don't do work you don't have to do, but other than that, sqrt has gotten pretty fast, and if it is used sparingly, most of the latency will probably get pipelined away.

GPUs have a sqrt instruction which is a single instruction (edit: it's not just 1 cycle I think), so the x1*x1 + y1*y1 ... > x2*x2 + y2*y2 ... comparison can actually end up being slower than just doing sqrt(vecn(...), vecn(...)).

CPUs have a similar instruction as well, but it's (afaik) SIMD only. However, it wouldn't surprise me if compilers implemented std::sqrt simply by calling that SIMD sqrt.

edit: yes, seems I was wrong. Thanks for correcting me.

Thanks, this gives a good view of what (not to) do.

I'll keep in mind that every sqrt (or anything else :)) that isn't really necessary, shouldn't be done at all.

Two examples:

1. My CoordToCoord distance function:


float CoordToCoordDist(const D3DXVECTOR3 pv1, const D3DXVECTOR3 pv2)
{
	return sqrt(pow(pv1.x - pv2.x, 2) + pow(pv1.y - pv2.y, 2) + pow(pv1.z - pv2.z, 2));
}

How would you do that without the sqrt?

2. Point in sphere.

I currently save the radius of my bounding spheres as a 'normal' radius. I take the CoordToCoord distance from the world center of the sphere to the point I'm checking and compare this distance to the radius. That would basically be solved if the above CoordToCoord distance function returned the squared distance. In that case I could initially take the squared radius of the sphere and save that (and when updating also keep the squared radius).

Note: in my shaders I don't use any sqrt at the moment; I'll look into how I do my attenuation.
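The "save the squared radius and keep it updated" idea could look roughly like this. `BoundingSphere`, `SetRadius` and `Contains` are hypothetical names for the sake of the sketch, and `Vec3` stands in for D3DXVECTOR3.

```cpp
struct Vec3 { float x, y, z; };  // hypothetical stand-in for D3DXVECTOR3

class BoundingSphere  // hypothetical type, for illustration only
{
public:
    BoundingSphere(const Vec3& center, float radius)
        : center_(center), radius_(radius), radiusSq_(radius * radius) {}

    // Keep the cached squared radius in sync whenever the radius changes.
    void SetRadius(float radius)
    {
        radius_ = radius;
        radiusSq_ = radius * radius;
    }

    float Radius() const { return radius_; }

    // Point-in-sphere test on squared quantities, no sqrt involved.
    bool Contains(const Vec3& p) const
    {
        const float dx = p.x - center_.x;
        const float dy = p.y - center_.y;
        const float dz = p.z - center_.z;
        return dx * dx + dy * dy + dz * dz <= radiusSq_;
    }

private:
    Vec3 center_;
    float radius_;
    float radiusSq_;
};
```

Caching saves one multiply per test, which is minor; the real win is dropping the sqrt from the distance computation.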

Of course there are some normalizations in my VS/PS, which I think are needed (and cannot be done without a square root).

GPUs have an approximate rsqrt and an approximate rcp function similar to the SIMD counterparts in my last post. They do not have a vector->length function which performs the squaring and adding of the components in addition to the sqrt in one instruction. So everything that was said for the CPU pretty much also holds for the GPU.

Are you sure that pow(a, 2) is reduced to a*a and not exp(log(a)*2), which is significantly more expensive?

Also, sqrt and pow are the double-precision functions. The float versions are sqrtf and powf; if you want the compiler to pick the overload based on the argument types, use std::sqrt and std::pow.
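A small sketch of both points, assuming a C++11 or later compiler. `Square` is a hypothetical helper, not a standard function; the `static_assert` lines only check which overload `std::sqrt` and `std::pow` select.

```cpp
#include <cmath>
#include <type_traits>

// Explicit squaring: always a single multiply, no reliance on the
// compiler recognising pow(a, 2).
template <typename T>
constexpr T Square(T v)
{
    return v * v;
}

// std::sqrt and std::pow are overloaded, so float arguments stay float.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) returns float");
static_assert(std::is_same<decltype(std::sqrt(1.0)), double>::value,
              "std::sqrt(double) returns double");
static_assert(std::is_same<decltype(std::pow(1.0f, 2.0f)), float>::value,
              "std::pow(float, float) returns float");
```

With this, `pow(pv1.x - pv2.x, 2)` becomes `Square(pv1.x - pv2.x)`, and nothing gets silently promoted to double.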

Even if you don't store the squared radius, computing it is faster than computing the square root. You should have a sqrLength or a dot function so you get the squared distance as sqrDistance = (vec1-vec2).SqrLength(); Then you check (sqrDistance < radius*radius). No need for sqrt.

Edit: And you should pass vectors by reference, not by value!
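Putting those suggestions together, a sketch of what the distance function and the sphere check might become. `Vec3` is a hypothetical stand-in for D3DXVECTOR3, and `CoordToCoordDistSq` and `PointInSphere` are illustrative names. The comparison assumes radius is non-negative.

```cpp
struct Vec3  // hypothetical stand-in for D3DXVECTOR3
{
    float x, y, z;

    // Squared length: dot product of the vector with itself.
    float SqrLength() const { return x * x + y * y + z * z; }
};

inline Vec3 operator-(const Vec3& a, const Vec3& b)
{
    return Vec3{a.x - b.x, a.y - b.y, a.z - b.z};
}

// Squared point-to-point distance, arguments passed by const reference.
inline float CoordToCoordDistSq(const Vec3& a, const Vec3& b)
{
    return (a - b).SqrLength();
}

// Point-in-sphere check without any sqrt (assumes radius >= 0).
inline bool PointInSphere(const Vec3& point, const Vec3& center, float radius)
{
    return CoordToCoordDistSq(point, center) < radius * radius;
}
```

When the real distance is needed (for display, say), take `std::sqrt` of the squared result at that one call site instead of inside the helper.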

GPUs have a sqrt instruction which is a single instruction (edit: it's not just 1 cycle I think), so the x1*x1 + y1*y1 ... > x2*x2 + y2*y2 ... comparison can actually end up being slower than just doing sqrt(vecn(...), vecn(...)).

...good job they've also got a dot product instruction then...!

(dot (v1, v1) > dot (v2, v2))

This topic is closed to new replies.
