Yes I'm an optimization freak


I just wanted to know: is the operation 3 / 2 or the operation 3 * .5 faster on an Intel-based processor? Thanks for any suggestions! (Oh, and please don't post anything about "it doesn't matter" unless it really doesn't matter.)

Edit: To expand on that, is the operation 2 * (4 / 5) faster than 2 * .8?

Edit: Assume all values are stored as floats.

Well, considering today's technology, you could presume it takes maybe 1/1207833498789th of the processor's time, maybe less, lol... but considering the circumstances, 2 * .8 would be faster, because it only has to do ONE mathematical operation, whereas 2 * (4/5) has to do two.

Not sure, but I think this is faster than both:

3 >> 1;

Of course that won't work with floats, but if you're that focused on optimization, there have to be sacrifices.

If you have something like this:

3 * (4/5)

And everything is constants, not variables, then the compiler "should" figure it out at compile time.

At run time, doing 4/5 would take longer than .8 because of the extra instruction, I think.

It doesn't matter. Every single change you suggest can be trivially performed by the compiler if needed. So, it doesn't matter. Write code that is easy to understand, and let people with more information than you worry about the details.

CM

Quote:
Original post by geekalert
(Oh, and please don't post anything about "it doesn't matter" unless it really doesn't matter.)


It really doesn't matter.

Yes, as stated above, the compiler will do LOW-LEVEL optimization; the thing that you as a programmer must be concerned with is HIGH-LEVEL optimization. For example:

while (i < 300)
    i++; // <-- the compiler won't optimize this, obvious as the inefficiency is

as opposed to

i = 300;

This is a really simple example, but hopefully you get what I mean.

Quote:
Original post by raptorstrike
Yes, as stated above, the compiler will do LOW-LEVEL optimization; the thing that you as a programmer must be concerned with is HIGH-LEVEL optimization. For example:

while (i < 300)
    i++; // <-- the compiler won't optimize this, obvious as the inefficiency is

as opposed to

i = 300;

This is a really simple example, but hopefully you get what I mean.

This is even a bad example - the compiler will reduce that to i = 300
(tested under the VC++ 2003 toolkit).

Cheers,
Pat.

PS: Micro-optimisation is pointless unless profiler output suggests it.

Back to the topic. If you really have floats and not constants, there is a HUGE difference in performance! Obviously the calculation with three numbers is the slowest, and multiplication should be faster than division.

I wrote a little test app doing each of your calculations 1 billion times, here is the result:

3/2=1,5 took 3662791ns
3*0,5=1,5 took 1371114ns
2*(4/5)=1,6 took 7788465ns
2*0,8=1,6 took 1366573ns


And the source code (C#, but test it in C++ or assembler if you want to...):

// Project: TestAddMultPerformance, File: Program.cs
// Namespace: TestAddMultPerformance, Class: Program
// Path: C:\code\TestAddMultPerformance, Author: Abi
// Code lines: 107, Size of file: 2,96 KB
// Creation date: 30.12.2005 07:16
// Last modified: 30.12.2005 07:25
// Generated with Commenter by abi.exDream.com

#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;
#endregion

namespace TestAddMultPerformance
{
    /// <summary>
    /// Program
    /// </summary>
    class Program
    {
        #region Performance counters and getting ns time
        /// <summary>
        /// Query performance (high resolution) timer frequency
        /// </summary>
        /// <param name="lpFrequency">current frequency</param>
        [System.Security.SuppressUnmanagedCodeSecurity]
        [DllImport("Kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool QueryPerformanceFrequency(
            out long lpFrequency);

        /// <summary>
        /// Query performance (high resolution) timer counter
        /// </summary>
        /// <param name="lpCounter">current counter value</param>
        [System.Security.SuppressUnmanagedCodeSecurity]
        [DllImport("Kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool QueryPerformanceCounter(
            out long lpCounter);

        /// <summary>
        /// Get current performance timer frequency
        /// (using QueryPerformanceFrequency)
        /// </summary>
        public static long GetPerformanceFrequency()
        {
            long l;
            QueryPerformanceFrequency(out l);
            return l;
        } // GetPerformanceFrequency()

        /// <summary>
        /// Get current performance timer counter value
        /// (using QueryPerformanceCounter)
        /// </summary>
        public static long GetPerformanceCounter()
        {
            long l;
            QueryPerformanceCounter(out l);
            return l;
        } // GetPerformanceCounter()

        /// <summary>
        /// Remember the frequency
        /// </summary>
        public static long performanceFrequency = GetPerformanceFrequency();

        /// <summary>
        /// Convert performance counter value to ns.
        /// </summary>
        /// <param name="perfCounter">Counter difference from 2 values</param>
        static public int ConvertToNs(long perfCounter)
        {
            return (int)(perfCounter * 1000000 / performanceFrequency);
        } // ConvertToNs(perfCounter)

        /// <summary>
        /// Convert performance counter value difference
        /// (perfCounter2-perfCounter1) to ns.
        /// </summary>
        static public int ConvertToNs(long perfCounter1, long perfCounter2)
        {
            return (int)((perfCounter2 - perfCounter1) *
                1000000 / performanceFrequency);
        } // ConvertToNs(perfCounter1, perfCounter2)
        #endregion

        static void Main(string[] args)
        {
            // Declare everything here in case this eats up cycles
            float ret = 0.0f;
            float value1 = 3;
            float value2 = 2;
            float value3 = 1 / value2;
            float value4 = 4;
            float value5 = 5;
            float value6 = value4 / value5;
            long perfCounterBefore, perfCounterAfter;

            // Do 1 billion iterations.
            int numberOfIterations = 1000 * 1000 * 1000;

            // Run the first loop once untimed to make sure we don't count any JIT time.
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 / value2;
            } // for
            perfCounterAfter = GetPerformanceCounter();

            Console.WriteLine("Dummy test to init JIT: " +
                value1 + "/" + value2 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test1
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 / value2;
            } // for
            perfCounterAfter = GetPerformanceCounter();

            Console.WriteLine(value1 + "/" + value2 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test2
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 * value3;
            } // for
            perfCounterAfter = GetPerformanceCounter();

            Console.WriteLine(value1 + "*" + value3 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test3
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value2 * (value4 / value5);
            } // for
            perfCounterAfter = GetPerformanceCounter();

            Console.WriteLine(value2 + "*(" + value4 + "/" + value5 + ")=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test4
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value2 * value6;
            } // for
            perfCounterAfter = GetPerformanceCounter();

            Console.WriteLine(value2 + "*" + value6 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            Console.ReadLine();
        } // Main(args)
    } // class Program
} // namespace TestAddMultPerformance


And obviously this is nothing to worry about when coding normal algorithms, but it never hurts to know these things :)

Quote:
Original post by raptorstrike
Yes, as stated above, the compiler will do LOW-LEVEL optimization; the thing that you as a programmer must be concerned with is HIGH-LEVEL optimization. For example:

while (i < 300)
    i++; // <-- the compiler won't optimize this, obvious as the inefficiency is

as opposed to

i = 300;

This is a really simple example, but hopefully you get what I mean.

While your point is both valid and important, I'll bet that if you check, a good compiler does indeed optimize that.

CM

Quote:
Original post by abnormal
...

The fact that those tests took any time at all indicates that things weren't being optimized correctly...you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.

CM

Quote:
Original post by Conner McCloud
Quote:
Original post by abnormal
...

The fact that those tests took any time at all indicates that things weren't being optimized correctly...you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.

CM

To verify this, here's the C++ program and its output (compiled using the VC++ 2003 toolkit with /O2):

#include <iostream>
#include <windows.h>

namespace {

class Freq {
    LARGE_INTEGER Value;
public:
    Freq() {
        QueryPerformanceFrequency(&Value);
    }
    operator unsigned long long () const { return Value.QuadPart; }
};

unsigned long long ConvertToNs(LARGE_INTEGER const & before, LARGE_INTEGER const & after) {
    static Freq freq;
    unsigned long long result = ((after.QuadPart - before.QuadPart) * 1000000) / freq;
    return result;
}

} // namespace

int main()
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3.0f;
    float value2 = 2.0f;
    float value3 = 1.0f / value2;
    float value4 = 4.0f;
    float value5 = 5.0f;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;

    // Do 1 billion iterations.
    unsigned long const numberOfIterations = 1000 * 1000 * 1000;

    // Run the first loop once untimed (mirrors the C# version's JIT warm-up).
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    std::cout << "Dummy test to init JIT: " <<
        value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    std::cout << value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    std::cout << value1 << "*" << value3 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    std::cout << value2 << "*(" << value4 << "/" << value5 << ")=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    std::cout << value2 << "*" << value6 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    char c;
    std::cin >> c;

    return 0;
}


Output:

C:\Temp>cl /O2 /EHsc t.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

t.cpp
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation. All rights reserved.

/out:t.exe
t.obj

C:\Temp>t
Dummy test to init JIT: 3/2=1.5 took 0ns
3/2=1.5 took 0ns
3*0.5=1.5 took 1ns
2*(4/5)=1.6 took 0ns
2*0.8=1.6 took 0ns

So much for artificial benchmarks. It's almost 2006, people; don't think compilers are as stupid as they were in 1989...

Cheers,
Pat

The answer you are probably looking for is this: on a Pentium 4 a single-precision floating-point divide takes approximately 10 - 15 times as many cycles as a floating-point multiply.

Keep in mind that without any context, the answer is pretty useless. The difference between a multiply instruction and a divide instruction is only about 10 billionths of a second, so unless your code does nothing but divide numbers by a constant over and over billions of times, there is no way to know if it makes a difference which way you do it. There could be other factors that have a much bigger impact.

So now that you know the answer, you can forget it because it will probably never make a difference.

As a side note, your examples are poor because generally the compiler will precompute operations on constants. Some compilers will convert division by a constant to multiplication by its reciprocal. For the ones that don't, you should convert it yourself -- if you care.

I think the lesson here is: don't use C# if you're concerned with the micro-optimization of constant floating-point operations.

No, I designed them not to be optimized out (which would happen in C# too if they were constants).

In C++ the same stuff works too (similar results in debug mode, a little slower though), but you are right. In release mode all the calculations get eliminated down to one and I can't get the performance values anymore. However, this wasn't a contest to find the fastest way to calculate four numbers, but to see which of the cases with RANDOM floats would be the fastest.

These are the C++ results (debug mode; release mode doesn't work):

3.000000/2.000000=1.500000 took 3799649
3.000000*0.500000=1.500000 took 2811231
2.000000*(4.000000/5.000000)=1.600000 took 7944695
2.000000*0.800000=1.600000 took 3315934


Here is the c++ version

// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:06
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in C#, just in C++.
// Note: When using release mode the compiler will optimize out all the
// calculations and unnecessary code. For this reason it does not make sense
// to use release mode here (it's just a performance test, for God's sake).
//

#include "stdafx.h"
#include "windows.h"

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);

    // Do 1 billion iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    LONGLONG perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %lld\n", value1, value2, ret,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %lld\n", value1, value3, ret,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %lld\n", value2, value4, value5, ret,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %lld\n", value2, value6, ret,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Wait for a key press before closing the console window.
    getchar();

    return 0;
} // _tmain(int argc, _TCHAR* argv[])

Quote:
Original post by abnormal

// Note: When using release mode the compiler will optimize out all the
// calculations and unnecessary code. For this reason it does not make sense
// to use release mode here (it's just a performance test, for God's sake).


I'm sorry to say that, but are you serious?
What the heck are release mode and an optimising compiler good for if you actually don't want it to optimise your code, just to prove that you could do said optimisation manually...

This "logic" just escapes me.

Cheers,
Pat.


I agree with you, darookie. But abnormal meant that if we want to examine exactly what each statement does in high-level code, we should do it in debug mode (with all checks turned off, of course). That will give us an accurate picture of how fast or slow an expression is at a low level. Turning optimizations on helps us determine the average performance of a release build, which is unsuitable when we're examining things from a low-level viewpoint.

Can you read? I just wanted to test whether 3/2 or 3*0.5f is faster, done 1 billion times. Again: this is not about making sense. Performance tests are always flawed and make no sense at all. I just wanted to compare floating-point multiplications and divisions (as the thread creator asked). I'm well aware of the fact that every test will produce different results, and that in real-life situations many other factors also come into play (as some people here have already pointed out).

Since it was very hard to do with C++ optimizations turned on in release mode, I tried to leave that alone. I guess the only way to do it anyway is with assembler (and hey, people always say assembler is fast, hehe).

Anyway, here are the results of the same test with C++ and assembler (release mode this time):

3.000000/2.000000=1.500000 took 3663257ns
3.000000*0.500000=1.500000 took 1366358ns
2.000000*(4.000000/5.000000)=1.600000 took 7738814ns
2.000000*0.800000=1.600000 took 1366403ns


And the code:

// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:42
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in C#, just in C++.
// Update: This version uses assembler inside the loops to keep the
// compiler from optimizing everything away.
//

#include "stdafx.h"
#include "windows.h"

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float returnValue = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);

    // Do 1 billion iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 / value2;
        // Do it the assembler way (maybe that doesn't get optimized out)
        __asm
        {
            fld dword ptr [value1]
            fdiv dword ptr [value2]
            fstp dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    LONGLONG perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %lldns\n", value1, value2, returnValue,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 * value3;
        __asm
        {
            fld dword ptr [value1]
            fmul dword ptr [value3]
            fstp dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %lldns\n", value1, value3, returnValue,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * (value4 / value5);
        __asm
        {
            fld dword ptr [value4]
            fdiv dword ptr [value5]
            fmul dword ptr [value2]
            fstp dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %lldns\n", value2, value4, value5, returnValue,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * value6;
        __asm
        {
            fld dword ptr [value2]
            fmul dword ptr [value6]
            fstp dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);

    perfDifference = perfCounterAfter.QuadPart - perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %lldns\n", value2, value6, returnValue,
        perfDifference * 1000000 / performanceFrequency.QuadPart);

    // Wait for a key press before closing the console window.
    getchar();

    return 0;
} // _tmain(int argc, _TCHAR* argv[])

Quote:
Original post by Skeleton_V@T
if we want to examine exactly what each statement does in high level code

High-level code doesn't do anything. It cannot be executed; the CPU can't execute C++ directly. That might sound like sophistry, but it isn't. Even in debug mode, you're profiling low-level code which was generated from the high-level code.

In debug mode, statements are not reordered, functions are not inlined... basically, things are done to make step-by-step debugging easier. That does NOT include doing everything in the most obvious way possible. The expressions "a << 1" and "a * 2", for instance, will produce the same object code whether you're in Release or Debug mode. There is simply NO WAY to force the compiler not to mess with your arithmetic, because the compiler knows that no one in his right mind would want the slower version.

If you want to profile different opcodes, profile in assembler. If you want useful results, profile in Release mode. There is no reason to profile in Debug mode.

Share this post


Link to post
Share on other sites
Quote:
Original post by abnormal
Can you read? I just wanted to test if 3/2 or 3*0.5f is faster, doing 1 bio times.
[snip]

Thank you. I actually can read. And nobody in this thread said that a division is faster than multiplication on x86 hardware.

Some (including me) pointed out that artificial benchmarks just to prove this given fact are pointless, and that the compiler will optimise these things anyway (which it actually did), and thus forced you to go to great lengths to create a pointless test case just to prove something that nobody was actually arguing about.

Now what I was trying to say is that without any profiling (of release build!) code, such micro-optimisations are just a waste of time. The 20-80 rule applies and chances are that it's not a bunch of "wasted" divisions or multiplications in some inner loop that represent the 20% of code your CPU spends 80% of its processing time in.

Most probably it's the loop itself that can be optimised by changing the algorithm, the data flow and whatnot.

I just hope you enjoyed typing all that code and words, I sure did [smile].

Best regards,
Pat.


The idiocy of all this aside, you can get your "performance test" above to work in release mode by assigning randomly generated values to the variables (for example, float value1 = (float)rand();) and displaying the calculated values.

Quote:
Original post by Sneftel
Quote:
Original post by Skeleton_V@T
if we want to examine exactly what each statement does in high level code

High-level code doesn't do anything. It cannot be executed; the CPU can't execute C++ directly. That might sound like sophistry, but it isn't. Even in debug mode, you're profiling low-level code which was generated from the high-level code. In debug mode, statements are not reordered, functions are not inlined... basically, things are done to make step-by-step debugging easier. That does NOT include doing everything in the most obvious way possible. The expressions "a << 1" and "a * 2", for instance, will produce the same object code whether you're in Release or Debug mode. There is simply NO WAY to force the compiler not to mess with your arithmetic, because the compiler knows that no one in his right mind would want the slower version. If you want to profile different opcodes, profile in assembler. If you want useful results, profile in Release mode. There is no reason to profile in Debug mode.


In that case I'll be pretty sure which code will produce higher performance, so no profiling is required. ^_^

To geekalert: If you're really hungry for optimization, the processor manufacturers' documentation is best suited for you. Take a look at your target CPU's documentation for details; you'll see there are some other issues that need more attention - the Intel P4, for example.

Pipeline stalls and cache misses cause many, many times the slowdown of complex instructions (such as divide) on modern processors.

On both AMD and Intel processors, the first step when optimizing down at the instruction level (besides vectorizing using SSE/SSE2) is making sure you don't stall the pipeline or miss the cache. A good profiler can help you figure out where you are causing these things, which can be nearly impossible to do by hand because of superscalar, out-of-order execution and micro-op fusion.

So I would not worry about which instruction takes longer, because if you don't do this properly, your 5-clock savings (on a modern P4 a divide is about 14 clocks and a multiply about 9, or something in that range) will be meaningless if you have to wait 50 clocks for cache misses.

These are not Intel 8088s; computer architecture has advanced so much since then!

A few guidelines:

1: Most instructions simpler than division can be issued in a single cycle, although with a few cycles of latency. So in many cases it doesn't matter, as long as the subsequent instructions don't depend on the result.

2: The above is not true for division. Division is slow, and is typically *not* pipelined. If you perform a division, you will (at least on an Athlon 64, and I assume a Pentium would do the same) stall the multiplication unit until it's done, preventing you from doing multiplies or divisions. So yes, transforming a division into a multiply might be worth it if it's performed sufficiently often.

3: As long as you're dealing with constants, it doesn't make a scrap of difference. The compiler will optimize it away.

4: None of this will make a measurable difference unless it's done, say, a million times per second or so - at the very least, 100,000 times per second. Any less than that, and you won't be able to measure the difference.

5: Floating-point math is typically not optimized very much. For example, the compiler won't transform a division into a multiplication, because that could alter the result due to precision loss. (Again, if we're dealing with constants, this doesn't matter, as the compiler will figure it out at compile time.)

6: If you really want to know all this, head to Intel's or AMD's website and download their optimization manuals. They specify *exactly* how slow every operation is, and give you a ton of advice on top of that.
