
# Yes I'm an optimization freak


## Recommended Posts

I just wanted to know: is the operation 3 / 2 or the operation 3 * .5 faster on an Intel-based processor? Thanks for any suggestions! (Oh and please don't post anything about "it doesn't matter" unless it really doesn't matter.)

Edit: To expand on that, is the operation 2 * (4 / 5) faster than 2 * .8?

Edit: Assume all values are stored as floats.

---

Well, considering today's technology, you could presume that takes 1/1207833498789 of the processor's time, maybe smaller... but considering the circumstances it would be faster doing 2 * .8, because it only has to do ONE mathematical operation, whereas 2 * (4/5) has to do two.

---

Not sure, but I think this is faster than both:

`3 >> 1`

Of course that won't work with floats, but if you're that focused on optimization, there have to be sacrifices.

---

If you have something like this:

`3 * (4/5)`

and everything is constants, not variables, then the compiler *should* figure it out at compile time.

At run time, doing 4/5 would take longer than .8 because of the extra instruction, I think.

---

It doesn't matter. Every single change you suggest can be trivially performed by the compiler if needed. So, it doesn't matter. Write code that is easy to understand, and let people with more information than you worry about the details.

CM

---

Quote:
> Original post by geekalert:
> (Oh and please don't post anything about "it doesn't matter" unless it really doesn't matter)

It really doesn't matter.

---

Yes, as stated above the compiler will do LOW LEVEL optimization; the thing that you as a programmer must be concerned with is high level optimization. For example:

```cpp
while (i < 300)
    i++; // <-- the compiler won't optimize this, obvious as the inefficiency is
```

as opposed to

```cpp
i = 300;
```

This is a really simple example, but hopefully you get what I mean.

---

Quote:
> Original post by raptorstrike:
> yes as stated above the compiler will do LOW LEVEL optimization, the thing that you as a programmer must be concerned with is high level optimization. For example `while (i < 300) i++;` (the compiler won't optimize this, obvious as the inefficiency is) as opposed to `i = 300;` ... this is a really simple example but hopefully you get what I mean

This is even a bad example; the compiler will reduce that to i = 300.
(Tested under the VC++ 2003 toolkit.)

Cheers,
Pat.

PS: Micro-optimisation is pointless unless the profiler output suggests doing it.

---

Back to the topic. If you really have float variables and not constants, there is a HUGE difference in performance! Obviously the calculation with 3 numbers is the slowest, and multiplication should be faster than division.

I wrote a little test app doing each of your calculations 1 billion times, here is the result:

```
3/2=1,5     took 3662791ns
3*0,5=1,5   took 1371114ns
2*(4/5)=1,6 took 7788465ns
2*0,8=1,6   took 1366573ns
```

And the source code (C#, but test it in C++ or assembler if you want to...):

```csharp
// Project: TestAddMultPerformance, File: Program.cs
// Namespace: TestAddMultPerformance, Class: Program
// Path: C:\code\TestAddMultPerformance, Author: Abi
// Code lines: 107, Size of file: 2,96 KB
// Creation date: 30.12.2005 07:16
// Last modified: 30.12.2005 07:25
// Generated with Commenter by abi.exDream.com
#region Using directives
using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;
#endregion

namespace TestAddMultPerformance
{
    /// <summary>
    /// Program
    /// </summary>
    class Program
    {
        #region Performance counters and getting ns time
        /// <summary>
        /// Query performance (high resolution) timer frequency
        /// </summary>
        /// <param name="lpFrequency">current frequency</param>
        [System.Security.SuppressUnmanagedCodeSecurity]
        [DllImport("Kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool QueryPerformanceFrequency(
            out long lpFrequency);

        /// <summary>
        /// Query performance (high resolution) timer counter
        /// </summary>
        /// <param name="lpCounter">current counter value</param>
        [System.Security.SuppressUnmanagedCodeSecurity]
        [DllImport("Kernel32.dll")]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern bool QueryPerformanceCounter(
            out long lpCounter);

        /// <summary>
        /// Get current performance timer frequency
        /// (using QueryPerformanceFrequency)
        /// </summary>
        public static long GetPerformanceFrequency()
        {
            long l;
            QueryPerformanceFrequency(out l);
            return l;
        } // GetPerformanceFrequency()

        /// <summary>
        /// Get current performance timer counter value
        /// (using QueryPerformanceCounter)
        /// </summary>
        public static long GetPerformanceCounter()
        {
            long l;
            QueryPerformanceCounter(out l);
            return l;
        } // GetPerformanceCounter()

        /// <summary>
        /// Remember the frequency
        /// </summary>
        public static long performanceFrequency = GetPerformanceFrequency();

        /// <summary>
        /// Convert performance counter value to ns.
        /// </summary>
        /// <param name="perfCounter">Counter difference from 2 values</param>
        static public int ConvertToNs(long perfCounter)
        {
            return (int)(perfCounter * 1000000 / performanceFrequency);
        } // ConvertToNs(perfCounter)

        /// <summary>
        /// Convert performance counter value difference
        /// (perfCounter2-perfCounter1) to ns.
        /// </summary>
        static public int ConvertToNs(long perfCounter1, long perfCounter2)
        {
            return (int)((perfCounter2 - perfCounter1) *
                1000000 / performanceFrequency);
        } // ConvertToNs(perfCounter1, perfCounter2)
        #endregion

        static void Main(string[] args)
        {
            // Declare everything here in case this eats up cycles
            float ret = 0.0f;
            float value1 = 3;
            float value2 = 2;
            float value3 = 1 / value2;
            float value4 = 4;
            float value5 = 5;
            float value6 = value4 / value5;
            long perfCounterBefore, perfCounterAfter;
            // Do 1 bio iterations.
            int numberOfIterations = 1000 * 1000 * 1000;

            // Just call every method once to make sure we don't count any JIT time.
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 / value2;
            } // for
            perfCounterAfter = GetPerformanceCounter();
            Console.WriteLine("Dummy test to init JIT: " +
                value1 + "/" + value2 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test1
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 / value2;
            } // for
            perfCounterAfter = GetPerformanceCounter();
            Console.WriteLine(value1 + "/" + value2 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test2
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value1 * value3;
            } // for
            perfCounterAfter = GetPerformanceCounter();
            Console.WriteLine(value1 + "*" + value3 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test3
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value2 * (value4 / value5);
            } // for
            perfCounterAfter = GetPerformanceCounter();
            Console.WriteLine(value2 + "*(" + value4 + "/" + value5 + ")=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            // Test4
            perfCounterBefore = GetPerformanceCounter();
            for (int i = 0; i < numberOfIterations; i++)
            {
                ret = value2 * value6;
            } // for
            perfCounterAfter = GetPerformanceCounter();
            Console.WriteLine(value2 + "*" + value6 + "=" + ret + " took " +
                ConvertToNs(perfCounterBefore, perfCounterAfter) + "ns");

            Console.ReadLine();
        } // Main(args)
    } // class Program
} // namespace TestAddMultPerformance
```

And obviously this is nothing to consider when coding normal algorithms, but it never hurts to know these things :)

---

Quote:
> Original post by raptorstrike:
> yes as stated above the compiler will do LOW LEVEL optimization, the thing that you as a programmer must be concerned with is high level optimization. For example `while (i < 300) i++;` (the compiler won't optimize this, obvious as the inefficiency is) as opposed to `i = 300;` ... this is a really simple example but hopefully you get what I mean

While your point is both valid and important, I'll bet that if you check, a good compiler does indeed optimize that.

CM

---

Wow, you learn something new every day. OK, well, I'm glad my point is still valid... sorta.

---

Quote:
> Original post by abnormal: ...

The fact that those tests took any time at all indicates that things weren't being optimized correctly...you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.

CM

---

Quote:
> Original post by Conner McCloud:
>> Original post by abnormal: ...
>
> The fact that those tests took any time at all indicates that things weren't being optimized correctly... you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.
>
> CM

To verify this, here's the C++ program and its output (compiled using the VC++ 2003 toolkit with /O2):

```cpp
#include <iostream>
#include <windows.h>

namespace {
    class Freq {
        LARGE_INTEGER Value;
    public:
        Freq() {
            QueryPerformanceFrequency( &Value );
        }
        operator unsigned long long () const { return Value.QuadPart; }
    };

    unsigned long long ConvertToNs( LARGE_INTEGER const & before, LARGE_INTEGER const & after) {
        static Freq freq;
        unsigned long long result = ((after.QuadPart - before.QuadPart) * 1000000) / freq;
        return result;
    }
}

int main()
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3.0f;
    float value2 = 2.0f;
    float value3 = 1.0f / value2;
    float value4 = 4.0f;
    float value5 = 5.0f;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    // Do 1 bio iterations.
    unsigned long const numberOfIterations = 1000 * 1000 * 1000;

    // Just call every method once to make sure we don't count any JIT time.
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << "Dummy test to init JIT: " <<
        value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value1 << "*" << value3 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value2 << "*(" << value4 << "/" << value5 << ")=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value2 << "*" << value6 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    char c;
    std::cin >> c;
    return 0;
}
```

Output:

```
C:\Temp>cl /O2 /EHsc t.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

t.cpp
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:t.exe
t.obj

C:\Temp>t
Dummy test to init JIT: 3/2=1.5 took 0ns
3/2=1.5 took 0ns
3*0.5=1.5 took 1ns
2*(4/5)=1.6 took 0ns
2*0.8=1.6 took 0ns
```

So much for artificial benchmarks. This is almost 2006, people; don't think compilers are as stupid as they were in 1989...

Cheers,
Pat

---

The answer you are probably looking for is this: on a Pentium 4 a single-precision floating-point divide takes approximately 10 - 15 times as many cycles as a floating-point multiply.

Keep in mind that without any context, the answer is pretty useless. The difference between a multiply instruction and a divide instruction is only about 10 billionths of a second, so unless your code does nothing but divide numbers by a constant over and over billions of times, there is no way to know if it makes a difference which way you do it. There could be other factors that have a much bigger impact.

So now that you know the answer, you can forget it because it will probably never make a difference.

As a side note, your examples are poor because generally the compiler will precompute operations on constants. Some compilers will convert division by a constant to multiplication by its reciprocal. For the ones that don't, you should convert it yourself -- if you care.

---

I think the lesson here is: don't use C# if you're concerned with the micro-optimization of constant floating-point operations.

---

No, I designed them not to be optimized out (which would happen in C# too if they were constants).

In C++ the same stuff works too (in debug mode, similar results, a little bit slower though), but you are right: in release mode all the equations get eliminated to one and I can't get the performance values anymore. However, this wasn't a contest to find the fastest way to calculate 4 numbers, but to see which of the cases with RANDOM floats would be the fastest.

These are the C++ results (debug mode; release mode doesn't work):

```
3.000000/2.000000=1.500000            took 3799649
3.000000*0.500000=1.500000            took 2811231
2.000000*(4.000000/5.000000)=1.600000 took 7944695
2.000000*0.800000=1.600000            took 3315934
```

Here is the C++ version:

```cpp
// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:06
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in c#, just in c++.
// Note: When using Release mode the compiler will optimize out all equations
// and unnescessary code. For this reason it does not make sense to use the
// release mode here (its just a performance test for gods sake).
//
#include "stdafx.h"
#include "windows.h"

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);
    // Do 1 bio iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    LONGLONG perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %d\n", value1, value2, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %d\n", value1, value3, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %d\n", value2, value4, value5, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %d\n", value2, value6, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    char* temp = new char[100];
    scanf(temp);
    return 0;
} // _tmain(int argc, _TCHAR* argv[])
```

---

Quote:
> Original post by abnormal:
> // Note: When using Release mode the compiler will optimize out all equations
> // and unnescessary code. For this reason it does not make sense to use the
> // release mode here (its just a performance test for gods sake).

I'm sorry to say that, but are you serious?
What the heck is release mode and an optimising compiler good for if you actually don't want it to optimise your code, just to prove that you could do said optimisation manually...

This "logic" just escapes me.

Cheers,
Pat.

---

I agree with you, darookie. But abnormal meant that if we want to examine exactly what each statement does in high-level code, we should do it in debug mode (with all checks turned off, of course). That will give us the correct result about how fast or slow an expression is at the low level. Turning optimizations on helps us determine the average performance of a release build, which is unsuitable when we're examining things from a low-level viewpoint.

---

Can you read? I just wanted to test whether 3/2 or 3*0.5f is faster, doing it 1 billion times. Again: this is not about making sense. Performance tests are always flawed and make no sense at all. I just want to compare floating point multiplications and divisions (as the thread creator asked). I'm well aware of the fact that every test will produce different results, and in real-life situations many other factors also come into play (as some people here already pointed out).

Since it was very hard to do with C++ optimizations turned on in release mode, I tried to leave that alone. I guess the only way to do it anyway is with assembler (and hey, people always say assembler is fast, hehe).

Anyway, the same test with C++ and assembler (release mode this time); the results:

```
3.000000/2.000000=1.500000            took 3663257ns
3.000000*0.500000=1.500000            took 1366358ns
2.000000*(4.000000/5.000000)=1.600000 took 7738814ns
2.000000*0.800000=1.600000            took 1366403ns
```

And the code:

```cpp
// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:42
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in c#, just in c++.
// Update: This version uses assembler inside the loops to force the
// compiler not to cut everything out.
//
#include "stdafx.h"
#include "windows.h"

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float returnValue = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);
    // Do 1 bio iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 / value2;
        // Do it the assembler way (maybe that doesn't get optimized out)
        __asm
        {
            fld         dword ptr [value1]
            fdiv        dword ptr [value2]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    LONGLONG perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %dns\n", value1, value2, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 * value3;
        __asm
        {
            fld         dword ptr [value1]
            fmul        dword ptr [value3]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %dns\n", value1, value3, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * (value4 / value5);
        __asm
        {
            fld         dword ptr [value4]
            fdiv        dword ptr [value5]
            fmul        dword ptr [value2]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %dns\n", value2, value4, value5, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * value6;
        __asm
        {
            fld         dword ptr [value2]
            fmul        dword ptr [value6]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %dns\n", value2, value6, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    char* temp = new char[100];
    scanf(temp);
    return 0;
} // _tmain(int argc, _TCHAR* argv[])
```

---

Quote:
> Original post by Skeleton_V@T:
> if we want to examine exactly what each statement does in high level code

High level code doesn't do anything. It cannot be executed; the CPU can't execute C++ directly. That might sound like sophistry, but it isn't. Even in debug mode, you're profiling low-level code which was generated from the high level code. In debug mode, statements are not reordered, functions are not inlined... basically, things are done to make step-by-step debugging easier. That does NOT include doing everything in the most obvious way possible. The expressions "a << 1" and "a * 2", for instance, will produce the same object code whether you're in Release or Debug mode. There is simply NO WAY to force the compiler to not mess with your arithmetic, because the compiler knows that no one in his right mind would want the slower version.

If you want to profile different opcodes, profile in assembler. If you want useful results, profile in Release mode. There is no reason to profile in Debug mode.

---

Quote:
> Original post by abnormal:
> Can you read? I just wanted to test if 3/2 or 3*0.5f is faster, doing 1 bio times. [snip]

Thank you. I actually can read. And nobody in this thread said that a division is faster than multiplication on x86 hardware.

Some (including me) pointed out that artificial benchmarks just to prove this given fact are pointless, and that the compiler will optimise these things anyway (which it actually did), thus forcing you to go to great lengths to create a pointless test case just to prove something that nobody was actually arguing about.

Now what I was trying to say is that without any profiling (of release build!) code, such micro-optimisations are just a waste of time. The 20-80 rule applies and chances are that it's not a bunch of "wasted" divisions or multiplications in some inner loop that represent the 20% of code your CPU spends 80% of its processing time in.

Most probably it's the loop itself that can be optimised by changing the algorithm, the data flow and whatnot.

I just hope you enjoyed typing all that code and words, I sure did [smile].

Best regards,
Pat.

---

The idiocy of all this aside, you can get your above "performance test" to work in release mode by assigning randomly generated values to the variables (for example, `float value1 = (float)rand();`) and displaying the calculated values.

---

Quote:
> Original post by Sneftel:
>> Original post by Skeleton_V@T: if we want to examine exactly what each statement does in high level code
>
> High level code doesn't do anything. It cannot be executed. The CPU can't execute C++ directly. [snip] If you want to profile different opcodes, profile in assembler. If you want useful results, profile in Release mode. There is no reason to profile in Debug mode.

In that case I'll be pretty sure which code will produce higher performance, so no profiling is required. ^_^

To geekalert: if you're really hungry for optimization, the processor manufacturers' documentation is best suited for you. Take a look at your target CPU's documentation for details; you'll see there are some other issues that need more attention (the Intel P4, for example).

---

Pipeline stalls and cache misses cause many, many times the slowdown of complex instructions (such as divide) on modern processors.

On both AMD and Intel processors, the first step when optimizing down at the instruction level (besides vectorizing using SSE/SSE2) is making sure you don't stall the pipeline or miss cache. A good profiler can help you figure out where you are causing these things, which can be nearly impossible to do by hand because of superscalar, out-of-order execution and micro-op fusion.

So I would not worry about which instruction takes longer, because if you don't do it properly, your 5 clock savings (on a modern P4 a divide is about 14 cycles, a multiply about 9, or something in that range) will be meaningless if you have to wait 50 clocks for cache misses.

These are not Intel 8088s; computer architecture has advanced so much since then!

---

A few guidelines:
1: Most instructions simpler than division can be done in a single cycle, although with some cycles' latency. So in many cases, it doesn't matter, as long as there's no dependency on it in the subsequent instructions.

2: The above is not true for division. Division is slow, and is typically *not* pipelined. If you perform a division, you will (at least on Athlon 64, and I assume a Pentium would do the same) stall the multiplication unit until it's done, preventing you from doing multiplies or divisions. So yes, transforming a division into a multiply might be worth it if it's performed sufficiently often.

3: As long as you're dealing with constants, it doesn't make a scrap of difference. The compiler will optimize it away.

4: None of this will make a measurable difference if it isn't done, say, a million times per second or so. At the very least, 100,000 times/second. Any less than that, and you won't be able to measure the difference.

5: Floating point math is typically not optimized very much. For example, the compiler won't transform a division into a multiplication, because that could alter the result due to precision loss. (Again, if we're dealing with constants, this doesn't matter, as the compiler will figure it out at compile time.)

6: If you really want to know all this, head to Intel/AMD's website and download their optimization manuals. They specify *exactly* how slow every operation is, and give you a ton of advice on top of it.

