# Will MinGW optimize ...

Will MinGW optimize this?

    struct float3 { float x, y, z; };

    float3 cross(float3 a, float3 b)
    {
        float3 out;

        out.x = a.y * b.z - a.z * b.y;
        out.y = a.z * b.x - a.x * b.z;
        out.z = a.x * b.y - a.y * b.x;

        return out;
    }

So is this the same as if I passed the arguments through pointers, or not? (I would very much like to write it this way, by value, if MinGW optimizes it, but I'm not sure.)

##### Share on other sites

It is not the same (although it will work the same, viewed from outside). With some luck the compiler is smart enough to elide the extra copies, but formally you are explicitly asking for a and b being copied. out can probably be NRVOed away.

What is the hindrance to passing constant references? Both arguments are read-only, why do you need to explicitly copy them?

(note that I didn't bother to check whether the cross product formula is correct at all)

Edited by samoth

##### Share on other sites

> It is not the same (although it will work the same, viewed from outside). With some luck the compiler is smart enough to elide the extra copies, but formally you are explicitly asking for a and b being copied. out can probably be NRVOed away.
>
> What is the hindrance to passing constant references? Both arguments are read-only, why do you need to explicitly copy them?
>
> (note that I didn't bother to check whether the cross product formula is correct at all)

The formula should be correct.

For me it is just much easier to work with such 3-float structs when passing them in and getting them back in this by-value notation.

So I wonder whether this is noticeably slower or not (and by how much). I will probably use this form for convenience everywhere except in the very hottest loops.

Passing with & or * is dirtier to read, test, etc.

PS: Were you maybe looking at the asm output for this? If someone does, could you post it here (ideally for two versions, the second being cross(float3*,float3*,float3*), to compare)? I could probably test it myself, but I can't right now, and it is always nicer to discuss it.

Edited by fir

##### Share on other sites

> so i wonder if this is noticeably slower or not (and how much)

You can test it very easily by compiling a minimum test program with -save-temps -fverbose-asm, which will give you the generated code with annotations that make it easy to understand what's going on.

Just be sure that you do not initialize the input variables with constants (so the compiler cannot fold the whole computation away at compile time), and be sure to consume the computed value (by printing it or by returning it from main), or the compiler may realize that your statement does nothing and optimize it out altogether.
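A minimal sketch of such a test file, for anyone who wants to try it. The names cross_val, cross_ref, cross_ptr and parse_float3 are made up for this example, and the noinline attribute is GCC/MinGW-specific; it is only there so that each variant stays a separate function in the generated .s file instead of being inlined into the caller.

```cpp
#include <cstdlib>   // std::strtof

struct float3 { float x, y, z; };

// GCC/MinGW-specific: keep each variant as its own function in the asm output.
#define NOINLINE __attribute__((noinline))

// By value, as in the first post.
NOINLINE float3 cross_val(float3 a, float3 b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

// By const reference.
NOINLINE float3 cross_ref(const float3& a, const float3& b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

// Three-pointer form (out must not point at a or b).
NOINLINE void cross_ptr(float3* out, const float3* a, const float3* b)
{
    out->x = a->y * b->z - a->z * b->y;
    out->y = a->z * b->x - a->x * b->z;
    out->z = a->x * b->y - a->y * b->x;
}

// Hypothetical helper: builds a float3 from three strings (e.g. argv entries),
// so the inputs are only known at run time and cannot be constant-folded.
float3 parse_float3(const char* const* s)
{
    float3 v;
    v.x = std::strtof(s[0], nullptr);
    v.y = std::strtof(s[1], nullptr);
    v.z = std::strtof(s[2], nullptr);
    return v;
}
```

To complete the program, add a main(int argc, char** argv) that feeds parse_float3(argv + 1) and parse_float3(argv + 4) into each variant and prints the results. Compiling that with g++ -O2 -save-temps -fverbose-asm leaves an annotated .s file next to the source, where the three functions can be compared.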

The difference with MinGW-gcc-4.8.1 and my default set of build options (which includes -O2) is 3 extra movl instructions to registers before doing a movss from register. The const reference version uses movss directly. So that will be something like 3-4 extra cycles per cross product (negligible, but on the other hand totally unnecessary).

##### Share on other sites

> passing with & * is dirtier to read test etc

I don't see how using references would be dirtier.

    struct float3 { float x, y, z; };

    float3 cross(const float3& a, const float3& b)
    {
        float3 out;

        out.x = a.y * b.z - a.z * b.y;
        out.y = a.z * b.x - a.x * b.z;
        out.z = a.x * b.y - a.y * b.x;

        return out;
    }

This will eliminate two (unnecessary) copies, and the function signature also tells you that the two input parameters are read-only, so you can depend on their values staying the same after the function call. Notice how the function body didn't change at all. You'd use the function the same way you did before, too.
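To illustrate the "same way as before" part, here is a small sketch. The helper triple_cross is made up for this example. A const reference also binds to a temporary, so nested calls keep working without any & or * at the call site:

```cpp
struct float3 { float x, y, z; };

float3 cross(const float3& a, const float3& b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

// Hypothetical helper computing a x (b x c). The inner call returns a
// temporary float3, which binds directly to the const reference parameter.
float3 triple_cross(const float3& a, const float3& b, const float3& c)
{
    return cross(a, cross(b, c));
}
```

The call sites look exactly like the by-value version; only the signature differs.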

##### Share on other sites

No, all of this has to do with calling conventions.

Passing them as floats...

If you compile as 32-bit with the default cdecl convention, the values are passed on the stack. The return value is passed through ST(0), the top of the x87 register stack.

If you compile as 64-bit, the floating point parameters (up to four) are passed in XMM0 through XMM3, with additional values stored on the stack. The return value is passed through XMM0.

If you pass them as pointers, the pointers themselves are passed on the stack in 32-bit cdecl, and in integer registers (RCX, RDX, R8, R9 for the first four arguments) in the 64-bit Windows convention. Either way the pointers must be followed and the values loaded into the appropriate floating point registers. The return value is passed through ST(0) or XMM0 depending on the 32/64-bit option, as above.

So if your values were already available in appropriate registers, passing them directly is probably best. If you believe they were probably not loaded into memory then passing as pointers may be best.

If this were anything less than a low-level matrix operation I would say this is an unnecessary micro-optimization. As it is, I would say it probably (not certainly) still is one. If the cost of accessing the CPU cache is a concern, then you had better have already eliminated every major issue that shows up in your profiler.

##### Share on other sites

> so i wonder if this is noticeably slower or not (and how much) probably i will be using this form for convenience except the ultraspinning loop

Depends on the hardware and if the values are already in cache memory.

On 3rd generation i7 the timing would average about 4 cycles for an L1 cache hit, but since the OOO core can reschedule things, the cost of the other loads will become zero as one load operation can be scheduled per cycle.

Depending on how far out the data is, L2 cache is ~10 cycles, L3 on another core is ~65 cycles, and data paged out to disk can cost millions of cycles; hopefully the data isn't that far out.

So assuming the data is in L1, the cost difference is about 1.5 nanoseconds per function call. Note that the overhead of the function call itself for setup and teardown is about 7 nanoseconds.

Of course, if that number of nanoseconds was relevant you would have already rewritten your cross product with SSE which only takes 8 operations and executes much faster than the code in your first post.
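For reference, a sketch of one common SSE formulation using intrinsics. This is not necessarily the version meant above, and this particular one comes out at six operations (three shuffles, two multiplies, one subtract). The names cross_sse and cross_via_sse are made up, and the lane layout (x in lane 0, fourth lane unused) is an assumption. On 32-bit MinGW it needs -msse.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (_mm_shuffle_ps etc.)

// Assumption: lanes hold (x, y, z, w) with x in lane 0; w is ignored.
inline __m128 cross_sse(__m128 a, __m128 b)
{
    // Rotate lanes to (y, z, x, w).
    __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));

    // c = a * b_yzx - a_yzx * b, which is the cross product in (z, x, y) order.
    __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));

    // Rotate once more to get back to (x, y, z) order.
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}

struct float3 { float x, y, z; };

// Convenience wrapper for checking results against the scalar version.
// The packing and unpacking here costs more than the cross product itself,
// so real code would keep its vectors in __m128 form.
float3 cross_via_sse(const float3& a, const float3& b)
{
    __m128 va = _mm_set_ps(0.0f, a.z, a.y, a.x);  // last argument is lane 0
    __m128 vb = _mm_set_ps(0.0f, b.z, b.y, b.x);
    float r[4];
    _mm_storeu_ps(r, cross_sse(va, vb));
    float3 out = { r[0], r[1], r[2] };
    return out;
}
```

Whether this actually beats the scalar code depends on how the data is stored and on what the surrounding code does, so it is worth timing before committing to it.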

##### Share on other sites

> What is the hindrance to passing constant references? Both arguments are read-only, why do you need to explicitly copy them?

Note that certain _other_ compilers have issues with their optimizers when using const references in some cases. You can't reason about how a compiler will optimize anything. The moral of the story is to never, ever take performance advice from the Internet and to always just _check yourself_ and _measure_ any optimization.
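A rough sketch of what "measure" could look like, using std::chrono. The helper ns_per_call and the two variant names are made up for this example, and a loop like this is only a crude micro-benchmark: the compiler may inline the calls, so the generated asm is still worth a look.

```cpp
#include <chrono>

struct float3 { float x, y, z; };

float3 cross_val(float3 a, float3 b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

float3 cross_ref(const float3& a, const float3& b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

// Hypothetical helper: runs `iters` calls of f and returns the average
// wall-clock time per call in nanoseconds.
template <typename F>
double ns_per_call(F f, float3 a, float3 b, int iters)
{
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        a = f(a, b);  // feed the result back in so the calls form a chain
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    // Consume the result so the loop cannot be discarded as dead code.
    volatile float sink = a.x + a.y + a.z;
    (void)sink;

    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Calling ns_per_call(cross_val, a, b, n) and ns_per_call(cross_ref, a, b, n) with the same inputs and a large n gives two numbers to compare. Using a unit-length b perpendicular to a keeps the values from growing over the iterations.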

##### Share on other sites

What's a bit disappointing is that the auto-vectorizer doesn't convert this code to two 2x-shuffle-multiply followed by a subtraction; instead, MinGW uses scalar SSE math.

Not even if you add a fourth pass-through component to make it "clearer" that it's 4-vector stuff... alas, maybe in a few years from now.

Then again, the few cross products that you usually do wouldn't even matter if they were 50 times slower...

##### Share on other sites
Passing by value is likely to be faster here. As frob explained above, values are passed via registers, so accessing them can be faster. In addition, passing by value prevents the possibility that the two parameters alias each other, which allows more flexibility for the compiler to reorder the body.

But I would expect that in most cases this simple function is likely to be inlined, which makes the point moot; once the function is inlined, the analyzer can trace the data directly, ignore the pointers, and reorder the body with other code in the function.
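A small sketch of why aliasing restricts the compiler. It uses an assumed three-pointer signature with an output parameter (the names cross_val and cross_ptr are made up for this example). Because out may point at the same object as a or b, every store has to be treated as possibly changing what the later loads see, so loads cannot be freely moved above stores:

```cpp
struct float3 { float x, y, z; };

// By value: a and b are private copies, nothing can alias them.
float3 cross_val(float3 a, float3 b)
{
    float3 out;
    out.x = a.y * b.z - a.z * b.y;
    out.y = a.z * b.x - a.x * b.z;
    out.z = a.x * b.y - a.y * b.x;
    return out;
}

// Hypothetical pointer form. If out == a, the store to out->x changes a->x
// before the second line reads it, and the compiler must preserve exactly
// that behavior.
void cross_ptr(float3* out, const float3* a, const float3* b)
{
    out->x = a->y * b->z - a->z * b->y;
    out->y = a->z * b->x - a->x * b->z;
    out->z = a->x * b->y - a->y * b->x;
}
```

With a = (1, 2, 3) and b = (4, 5, 6), a = cross_val(a, b) gives (-3, 6, -3), while cross_ptr(&a, &a, &b) gives (-3, 30, -135), because a.x and a.y are overwritten before they are read.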
