Yes I'm an optimization freak

181

December 30, 2005 12:33 AM

Wow, you learn something new every day. OK, well, I'm glad my point is still valid... sorta

____________________________"This just in, 9 out of 10 americans agree that 1 out of 10 americans will disagree with the other 9"- Colin Mochrie

Conner McCloud

1,135

December 30, 2005 12:35 AM

Quote:Original post by abnormal
...

The fact that those tests took any time at all indicates that things weren't being optimized correctly...you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.

CM

darookie

1,441

December 30, 2005 12:56 AM

Quote:Original post by Conner McCloud
Quote:Original post by abnormal
...

The fact that those tests took any time at all indicates that things weren't being optimized correctly...you were performing the same constant calculation over and over again, and immediately throwing away the result. In a release build under VC++, each of those loops would be thrown out entirely.

CM

To verify this, here's the C++ program and its output (compiled using the VC++ 2003 toolkit with /O2):

#include <iostream>
#include <windows.h>

namespace {
    class Freq {
        LARGE_INTEGER Value;
    public:
        Freq() {
            QueryPerformanceFrequency( &Value );
        }
        operator unsigned long long () const { return Value.QuadPart; }
    };

    // Note: ticks * 1,000,000 / frequency is a time in microseconds,
    // although the function name and the output label say "ns".
    unsigned long long ConvertToNs( LARGE_INTEGER const & before, LARGE_INTEGER const & after) {
        static Freq freq;
        unsigned long long result = ((after.QuadPart - before.QuadPart) * 1000000) / freq;
        return result;
    }
}

int main()
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3.0f;
    float value2 = 2.0f;
    float value3 = 1.0f / value2;
    float value4 = 4.0f;
    float value5 = 5.0f;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;

    // Do 1 billion iterations.
    unsigned long const numberOfIterations = 1000 * 1000 * 1000;

    // Just call every method once to make sure we don't count any JIT time.
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << "Dummy test to init JIT: " <<
        value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test1
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value1 << "/" << value2 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value1 << "*" << value3 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value2 << "*(" << value4 << "/" << value5 << ")=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (unsigned long i = 0; i < numberOfIterations; ++i)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    std::cout << value2 << "*" << value6 << "=" << ret << " took " <<
        ConvertToNs(perfCounterBefore, perfCounterAfter) << "ns\n";

    // Wait for a key press before the console window closes.
    char c;
    std::cin >> c;
    return 0;
}

Output:

C:\Temp>cl /O2 /EHsc t.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Copyright (C) Microsoft Corporation 1984-2002. All rights reserved.

t.cpp
Microsoft (R) Incremental Linker Version 7.10.3077
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:t.exe
t.obj

C:\Temp>t
Dummy test to init JIT: 3/2=1.5 took 0ns
3/2=1.5 took 0ns
3*0.5=1.5 took 1ns
2*(4/5)=1.6 took 0ns
2*0.8=1.6 took 0ns

So much for artificial benchmarks. It's almost 2006, people; don't assume compilers are as stupid as they were in 1989...

Cheers,
Pat

JohnBolton

1,373

December 30, 2005 01:03 AM

The answer you are probably looking for is this: on a Pentium 4 a single-precision floating-point divide takes approximately 10 - 15 times as many cycles as a floating-point multiply.

Keep in mind that without any context, the answer is pretty useless. The difference between a multiply instruction and a divide instruction is only about 10 billionths of a second, so unless your code does nothing but divide numbers by a constant over and over billions of times, there is no way to know if it makes a difference which way you do it. There could be other factors that have a much bigger impact.

So now that you know the answer, you can forget it because it will probably never make a difference.

As a side note, your examples are poor because generally the compiler will precompute operations on constants. Some compilers will convert division by a constant to multiplication by its reciprocal. For the ones that don't, you should convert it yourself -- if you care.
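As a rough sketch of the manual conversion mentioned above: compute the reciprocal once, then multiply inside the loop. The function name `divide_all` is made up for illustration and does not come from any library. Keep in mind that x * (1/c) can differ from x / c in the last bit unless c is a power of two, which is one reason a compiler may not make this change on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from any library): divides every element by
// `divisor` using one divide in total and one multiply per element.
inline void divide_all(std::vector<float>& values, float divisor) {
    assert(divisor != 0.0f);                  // precondition: no division by zero
    const float reciprocal = 1.0f / divisor;  // the only divide
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i] *= reciprocal;              // multiply instead of divide
    }
}
```

Calling `divide_all(v, 2.0f)` on a vector holding 3 and 8 leaves 1.5 and 4 in it. Whether this is measurably faster than a plain divide depends on the surrounding code, as noted above.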

John Bolton
Locomotive Games (THQ)
Current Project: Destroy All Humans (Wii). IN STORES NOW!

NickW

321

December 30, 2005 01:04 AM

I think the lesson here is: don't use C# if you're concerned with the micro-optimization of constant floating-point operations.

abnormal

223

December 30, 2005 01:13 AM

No, I designed them not to be optimized out (which would happen in c# too if they were constants).

In C++ the same stuff works too (similar results in debug mode, just a little bit slower), but you are right. In release mode all the equations get eliminated down to one and I can't get the performance values anymore. However, this wasn't a contest to find the fastest way to calc 4 numbers, but to see which of the cases with RANDOM floats would be the fastest.

These are the C++ results (debug mode, release mode doesn't work):

3.000000/2.000000=1.500000 took 3799649
3.000000*0.500000=1.500000 took 2811231
2.000000*(4.000000/5.000000)=1.600000 took 7944695
2.000000*0.800000=1.600000 took 3315934

Here is the c++ version

// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:06
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in c#, just in c++.
// Note: When using Release mode the compiler will optimize out all equations
// and unnecessary code. For this reason it does not make sense to use the
// release mode here (it's just a performance test for god's sake).
//

#include "stdafx.h"
#include <stdio.h>
#include <windows.h>

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float ret = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);

    // Do 1 billion iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    // Note: ticks * 1000000 / frequency is a time in microseconds.
    // %I64d is the MSVC format for a 64-bit integer (the original used %d).
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 / value2;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    LONGLONG perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %I64d\n", value1, value2, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value1 * value3;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %I64d\n", value1, value3, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * (value4 / value5);
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %I64d\n", value2, value4, value5, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        ret = value2 * value6;
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %I64d\n", value2, value6, ret,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Wait for a key press (the original passed an uninitialized buffer
    // to scanf as the format string, which is undefined behaviour).
    getchar();
    return 0;
} // _tmain(int argc, _TCHAR* argv[])

Microsoft DirectX MVP. My Blog: abi.exdream.com

darookie

1,441

December 30, 2005 01:25 AM

Quote:Original post by abnormal

// Note: When using Release mode the compiler will optimize out all equations
// and unnecessary code. For this reason it does not make sense to use the
// release mode here (it's just a performance test for god's sake).

I'm sorry to say this, but are you serious?
What the heck are release mode and an optimising compiler good for if you actually don't want them to optimise your code, just to prove that you could do said optimisation manually...

This "logic" just escapes me.

Cheers,
Pat.

Skeleton_V@T

512

December 30, 2005 01:38 AM

I agree with you, darookie. But abnormal meant that if we want to examine exactly what each statement does in high level code, we should do it in debug mode (with all checks turned off, of course). That will give us the correct result for how fast or slow an expression is at the low level. Turning optimizations on helps us determine the average performance of a release build, which is unsuitable when we're examining things from a low-level viewpoint.

--> The great thing about Object Oriented code is that it can make small, simple problems look like large, complex ones <--

abnormal

223

December 30, 2005 01:42 AM

Can you read? I just wanted to test whether 3/2 or 3*0.5f is faster, doing it 1 billion times. Again: this is not about making sense. Performance tests are always flawed and make no sense at all. I just want to compare floating-point multiplications and divisions (as the thread creator asked). I'm well aware of the fact that every test will produce different results and that in real-life situations many other factors also come into play (as some ppl here already pointed out).

Since it was very hard to do with C++ optimizations turned on in release mode, I left it alone. I guess the only way to do it anyway is with assembler (and hey, ppl always say assembler is fast, hehe).

Anyways, the same test with c++ and assembler (release mode this time), the results:

3.000000/2.000000=1.500000            took 3663257ns
3.000000*0.500000=1.500000            took 1366358ns
2.000000*(4.000000/5.000000)=1.600000 took 7738814ns
2.000000*0.800000=1.600000            took 1366403ns

And the code:

// Project: TestAddMultPerformanceCpp, File: TestAddMultPerformanceCpp.cpp
// Path: c:\code\TestAddMultPerformanceCpp, Author: Abi
// Code lines: 78, Size of file: 2,23 KB
// Creation date: 30.12.2005 07:54
// Last modified: 30.12.2005 08:42
// Generated with Commenter by abi.exDream.com

// TestAddMultPerformanceCpp.cpp : Same stuff as in c#, just in c++.
// Update: This version uses assembler inside the loops to force the
// compiler not to cut everything out.
//

#include "stdafx.h"
#include <stdio.h>
#include <windows.h>

int _tmain(int argc, _TCHAR* argv[])
{
    // Declare everything here in case this eats up cycles
    float returnValue = 0.0f;
    float value1 = 3;
    float value2 = 2;
    float value3 = 1 / value2;
    float value4 = 4;
    float value5 = 5;
    float value6 = value4 / value5;
    LARGE_INTEGER perfCounterBefore, perfCounterAfter;
    LARGE_INTEGER performanceFrequency;
    QueryPerformanceFrequency(&performanceFrequency);

    // Do 1 billion iterations.
    int numberOfIterations = 1000 * 1000 * 1000;

    // Test1
    // Note: ticks * 1000000 / frequency is a time in microseconds, although
    // the output labels it "ns". %I64d is the MSVC format for a 64-bit
    // integer (the original used %d).
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 / value2;
        // Do it the assembler way (maybe that doesn't get optimized out)
        __asm
        {
            fld         dword ptr [value1]
            fdiv        dword ptr [value2]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    LONGLONG perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f/%f=%f took %I64dns\n", value1, value2, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test2
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value1 * value3;
        __asm
        {
            fld         dword ptr [value1]
            fmul        dword ptr [value3]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %I64dns\n", value1, value3, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test3
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * (value4 / value5);
        __asm
        {
            fld         dword ptr [value4]
            fdiv        dword ptr [value5]
            fmul        dword ptr [value2]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*(%f/%f)=%f took %I64dns\n", value2, value4, value5, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Test4
    QueryPerformanceCounter(&perfCounterBefore);
    for (int i = 0; i < numberOfIterations; i++)
    {
        //ret = value2 * value6;
        __asm
        {
            fld         dword ptr [value2]
            fmul        dword ptr [value6]
            fstp        dword ptr [returnValue]
        } // __asm
    } // for
    QueryPerformanceCounter(&perfCounterAfter);
    perfDifference = perfCounterAfter.QuadPart-perfCounterBefore.QuadPart;
    printf("%f*%f=%f took %I64dns\n", value2, value6, returnValue,
        perfDifference*1000000/performanceFrequency.QuadPart);

    // Wait for a key press (the original passed an uninitialized buffer
    // to scanf as the format string, which is undefined behaviour).
    getchar();
    return 0;
} // _tmain(int argc, _TCHAR* argv[])
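A note on units: the timing code in this thread multiplies the tick difference by 1,000,000 before dividing by the counter frequency, which yields microseconds rather than the "ns" printed next to it (roughly 3.6 million of them for a billion divides, so about 3.6 seconds). The following is a minimal sketch of both conversions in portable C++. The function names are made up for illustration, and the frequency is assumed to be a positive ticks-per-second value such as the one QueryPerformanceFrequency reports.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not part of the Windows API).

// Converts a tick count to microseconds: ticks * 10^6 / frequency.
inline std::int64_t ticks_to_microseconds(std::int64_t ticks,
                                          std::int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    const std::int64_t per_second = 1000000;
    return ticks * per_second / ticks_per_second;
}

// Converts a tick count to nanoseconds: ticks * 10^9 / frequency.
// Whole seconds and the remainder are converted separately so that
// long measurements do not overflow the 64-bit intermediate product.
inline std::int64_t ticks_to_nanoseconds(std::int64_t ticks,
                                         std::int64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    const std::int64_t per_second = 1000000000;
    const std::int64_t whole = ticks / ticks_per_second;
    const std::int64_t rest = ticks % ticks_per_second;
    return whole * per_second + rest * per_second / ticks_per_second;
}
```

With a hypothetical frequency of 1,000 ticks per second, one tick converts to 1,000 microseconds or 1,000,000 nanoseconds.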

Microsoft DirectX MVP. My Blog: abi.exdream.com

Sneftel

1,788

December 30, 2005 01:45 AM

Quote:Original post by Skeleton_V@T
if we want to examine exactly what each statement does in high level code

High level code doesn't do anything. It cannot be executed. The CPU can't execute C++ directly. That might sound like sophistry, but it isn't. Even in debug mode, you're profiling low-level code which was generated from the high level code. In debug mode, statements are not reordered, functions are not inlined... basically, things are done to make step-by-step debugging easier. That does NOT include doing everything in the most obvious way possible. The expressions "a << 1" and "a*2", for instance, will produce the same object code whether you're in Release or Debug mode. There is simply NO WAY to force the compiler to not mess with your arithmetic, because the compiler knows that anyone in his right mind would want the faster version. If you want to profile different opcodes, profile in assembler. If you want useful results, profile in Release mode. There is no reason to profile in Debug mode.
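For anyone who does want numbers from an optimized build, here is one possible sketch of a harness that survives Release mode. It is not anyone's code from this thread. The names `TimedResult`, `make_inputs` and `time_loop` are invented for illustration, and it uses standard C++11 `std::chrono` instead of QueryPerformanceCounter. The two ideas are the ones raised above: feed the loop values that are not compile-time constants, and actually use the result so the loop cannot be thrown away.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical result type: elapsed time plus a checksum of all results.
struct TimedResult {
    double nanoseconds;  // wall-clock time for the whole loop
    float checksum;      // sum of every result; returning it keeps the loop alive
};

// Fills a vector with pseudo-random floats between 1 and 2, so the
// operands are not constants the compiler can fold away.
inline std::vector<float> make_inputs(std::size_t count, unsigned seed) {
    assert(count > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(1.0f, 2.0f);
    std::vector<float> values(count);
    for (float& v : values) v = dist(rng);
    return values;
}

// Applies op(a[i], b[i]) to every pair and times the whole loop.
template <typename Op>
TimedResult time_loop(const std::vector<float>& a,
                      const std::vector<float>& b, Op op) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += op(a[i], b[i]);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return TimedResult{elapsed.count(), sum};
}
```

A comparison could then call `time_loop(a, b, [](float x, float y) { return x / y; })` and the same with `x * y` against a vector of precomputed reciprocals, printing both checksums. The measured time still includes the memory loads and the addition, so it only hints at the relative cost of the two instructions.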


This topic is closed to new replies.
