
## Recommended Posts

Operator overloading is an essential feature of C++. At least I use it extensively; it really makes the code simple and easy to read. But I had not realised how great a cost it can come at before I ran some speed tests. Imagine you have a class with three members, and adding two objects of this class is defined as adding the corresponding members in quadrature. There are many ways to do this. The simplest would be to overload the + operator, making the code easy to write and to read. So let us compare it to a couple of other approaches that don't look as nice and are harder to read.
class Foo
{
public:
    Foo(const double a = 0.0, const double b = 0.0, const double t = 1.0)
    {
        alpha = a;
        beta = b;
        theta = t;
    }

    const Foo & operator = (const Foo &R)
    {
        alpha = R.alpha;
        beta = R.beta;
        theta = R.theta;

        return *this;
    }

    Foo operator + (const Foo &R) const
    {
        return Foo(sqrt(alpha*alpha + R.alpha*R.alpha),
                   sqrt(beta*beta + R.beta*R.beta),
                   sqrt(theta*theta + R.theta*R.theta));
    }

    const Foo & operator += (const Foo &R)
    {
        alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        beta = sqrt(beta*beta + R.beta*R.beta);
        theta = sqrt(theta*theta + R.theta*R.theta);

        return *this;
    }

    const Foo & sum(const Foo &R, Foo &result)
    {
        result.alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        result.beta = sqrt(beta*beta + R.beta*R.beta);
        result.theta = sqrt(theta*theta + R.theta*R.theta);

        return result;
    }

    ~Foo(void) { }

    double alpha;
    double beta;
    double theta;
};

The weakness of overloading the + operator is obvious: first a local Foo object has to be created, then returned by value. The overloaded += operator avoids this, but it is cumbersome to use, requiring two statements in the code. As a third option we don't use operator overloading at all. Timing the various methods gave the following results:
uba = foo + bar;         ->  35077ms, uba = 5.17344, 8.1232, 8.08741
uba = foo; uba += bar;   ->    938ms, uba = 5.17344, 8.1232, 8.08741
foo.sum(bar, uba);       ->    939ms, uba = 5.17344, 8.1232, 8.08741

Quite an amazing difference, isn't it? I would have thought that the compiler would be able to optimize the first one, but apparently it did not. Now, my question is this: is there a way of obtaining the simplicity of the statement "uba = foo + bar;" together with the speed of "foo.sum(bar, uba);"? Here is the testing code, by the way (compiled in release mode):
#include "stdafx.h"
#include <windows.h>   // timeGetTime, timeBeginPeriod (link against winmm.lib)
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>

using namespace std;

// frand was not shown in the original post; this matches the definition
// used in the repro further down the thread.
float frand(float min, float max)
{
    return min + ((max - min) * rand() / RAND_MAX);
}

class Foo
{
public:
    Foo(const double a = 0.0, const double b = 0.0, const double t = 1.0)
    {
        alpha = a;
        beta = b;
        theta = t;
    }

    const Foo & operator = (const Foo &R)
    {
        alpha = R.alpha;
        beta = R.beta;
        theta = R.theta;

        return *this;
    }

    const Foo & operator += (const Foo &R)
    {
        alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        beta = sqrt(beta*beta + R.beta*R.beta);
        theta = sqrt(theta*theta + R.theta*R.theta);

        return *this;
    }

    Foo operator + (const Foo &R) const
    {
        return Foo(sqrt(alpha*alpha + R.alpha*R.alpha),
                   sqrt(beta*beta + R.beta*R.beta),
                   sqrt(theta*theta + R.theta*R.theta));
    }

    const Foo & sum(const Foo &R, Foo &result)
    {
        result.alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        result.beta = sqrt(beta*beta + R.beta*R.beta);
        result.theta = sqrt(theta*theta + R.theta*R.theta);

        return result;
    }

    ~Foo(void) { }

    double alpha;
    double beta;
    double theta;
};

int _tmain(int argc, _TCHAR* argv[])
{
    Foo foo(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0));
    Foo bar(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0));
    Foo uba = foo + bar;

    timeBeginPeriod(1);

    ofstream savf("result.txt", ios::out);

    int ta, tb;

    __int64 counts = 1000000000;

    ta = timeGetTime();

    for(__int64 i = 0; i < counts; i++)
        uba = foo + bar;

    tb = timeGetTime();
    savf << "uba = foo + bar;         ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";
    cout << "uba = foo + bar;         ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";

    ta = timeGetTime();

    for(__int64 i = 0; i < counts; i++)
    {
        uba = foo;
        uba += bar;
    }

    tb = timeGetTime();
    savf << "uba = foo; uba += bar;   ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";
    cout << "uba = foo; uba += bar;   ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";

    ta = timeGetTime();

    for(__int64 i = 0; i < counts; i++)
        foo.sum(bar, uba);

    tb = timeGetTime();
    savf << "foo.sum(bar, uba);       ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";
    cout << "foo.sum(bar, uba);       ->  " << setw(5) << tb - ta << "ms, uba = " << uba.alpha << ", " << uba.beta << ", " << uba.theta << "\n";

    cin.ignore(1);

    return 0;
}


Edit: Don't ask me why I used theta instead of gamma, I really have no idea.

##### Share on other sites
Did you check the assembly to make sure all calls are inlined? If they aren't, try putting __forceinline before operator+.

##### Share on other sites
Quote:
 Quite an amazing difference, isnt it? I would have thought that the compiler would be able to optimize the first one, but apparently it did not.

Such a difference means the test case is flawed. My guess is that the first case is the only one that actually gets executed; the others are probably evaluated at compile time, and the time shown is just the cost of the test loop overhead.

In the first loop, the entire expression gets evaluated on each pass.

In the following two loops, the expression gets evaluated exactly once; after that, only the loop counter is incremented.

The difference here is unrelated to performance of operators.
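To illustrate the point, here is a minimal sketch (using a hypothetical two-component Vec, not the thread's Foo) of the usual remedy: make each iteration depend on the loop counter and the previous result, so the compiler cannot hoist a loop-invariant expression out of the loop and must do the work every pass.

```cpp
#include <cmath>

// Hypothetical minimal type mirroring the quadrature-sum idea.
struct Vec {
    double a, b;
    Vec operator+(const Vec& r) const {
        return Vec{std::sqrt(a * a + r.a * r.a),
                   std::sqrt(b * b + r.b * r.b)};
    }
};

double benchmark_body(long iterations) {
    Vec acc{0.0, 0.0};
    for (long i = 0; i < iterations; ++i) {
        // Each pass reads the previous result and a counter-dependent
        // input, so the addition cannot be evaluated once and reused.
        Vec step{static_cast<double>(i % 7), 1.0};
        acc = acc + step;
    }
    return acc.a + acc.b;  // use the result so the loop is not dead code
}
```

A loop shaped like this measures the operator itself rather than the loop counter, at the cost of also timing the data dependency chain.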

##### Share on other sites
They are inlined; at least, the results were almost identical (within a few ms) after adding __forceinline before all the member functions.

Quote:
Original post by Antheus
Such a difference means the test case is flawed. My guess would be that the first case is the only one that gets executed, the others are probably evaluated during compile time, and the time shown is just the cost of test loop overhead.

How would you suggest the test method be improved?

##### Share on other sites
Obviously the operator+ version does more work and has a different usage. (Using += instead doesn't help a lot if you do need the sum in a different object.)

However, the time difference looks suspiciously large, so there must be something wrong with the benchmark. (With GCC, the difference is about 15%, not orders of magnitude.)

##### Share on other sites
Quote:
Original post by Jesper T
How would you suggest the test method be improved?

#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>
#include <iomanip>
#include <cmath>

using namespace std;

class Foo
{
public:
    Foo(const double a = 0.0, const double b = 0.0, const double t = 1.0)
    {
        alpha = a;
        beta = b;
        theta = t;
    }

    const Foo & operator = (const Foo &R)
    {
        alpha = R.alpha;
        beta = R.beta;
        theta = R.theta;
        return *this;
    }

    const Foo & operator += (const Foo &R)
    {
        alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        beta = sqrt(beta*beta + R.beta*R.beta);
        theta = sqrt(theta*theta + R.theta*R.theta);
        return *this;
    }

    Foo operator + (const Foo &R) const
    {
        return Foo(sqrt(alpha*alpha + R.alpha*R.alpha),
                   sqrt(beta*beta + R.beta*R.beta),
                   sqrt(theta*theta + R.theta*R.theta));
    }

    const Foo & sum(const Foo &R, Foo &result)
    {
        result.alpha = sqrt(alpha*alpha + R.alpha*R.alpha);
        result.beta = sqrt(beta*beta + R.beta*R.beta);
        result.theta = sqrt(theta*theta + R.theta*R.theta);
        return result;
    }

    ~Foo(void) { }

    double alpha;
    double beta;
    double theta;
};

float frand(float min, float max)
{
    return min + ((max-min) * rand() / RAND_MAX);
}

struct Tester
{
    virtual void test() = 0;
    virtual void print() = 0;
};

struct Baseline : Tester
{
    Baseline() { }
    virtual void test() { }
    virtual void print() { }
};

struct Test1 : Tester
{
    Test1()
        : foo(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
        , bar(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
    {
    }
    virtual void test()
    {
        uba = foo + bar;
    }
    virtual void print()
    {
        std::cout << uba.alpha << ", " << uba.beta << ", " << uba.theta << std::endl;
    }
    Foo uba;
    Foo foo;
    Foo bar;
};

struct Test2 : Tester
{
    Test2()
        : foo(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
        , bar(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
    {
    }
    virtual void test()
    {
        uba = foo;
        uba += bar;
    }
    virtual void print()
    {
        std::cout << uba.alpha << ", " << uba.beta << ", " << uba.theta << std::endl;
    }
    Foo uba;
    Foo foo;
    Foo bar;
};

struct Test3 : Tester
{
    Test3()
        : foo(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
        , bar(frand(0.0, 10.0), frand(0.0, 10.0), frand(0.0, 10.0))
    {
    }
    virtual void test()
    {
        foo.sum(bar, uba);
    }
    virtual void print()
    {
        std::cout << uba.alpha << ", " << uba.beta << ", " << uba.theta << std::endl;
    }
    Foo uba;
    Foo foo;
    Foo bar;
};

int run(Tester * test, const char * name, long long counts, int baseline = 0)
{
    int ta = timeGetTime();
    for(__int64 i = 0; i < counts; i++)
        test->test();
    int tb = timeGetTime();
    int delta = tb - ta - baseline;
    cout << name << setw(5) << delta << "ms" << std::endl;
    return delta;
}

int main(int argc, char* argv[])
{
    srand(argc);
    timeBeginPeriod(1);

    __int64 counts = 10000000;

    Baseline basetest;
    // calibrate the cost of NOP virtual function call + loop counter
    int baseline = run(&basetest, "NOP virtual call", counts);

    Test1 test1;
    run(&test1, "uba = foo + bar;", counts, baseline);

    Test2 test2;
    run(&test2, "uba = foo; uba += foo;", counts, baseline);

    Test3 test3;
    run(&test3, "foo.sum(bar, uba);", counts, baseline);
}

The above is a mess of copy-paste code. Feel free to improve the style as needed. The important points:
- confuse the compiler via virtual functions enough to take potential optimizations out of the equation
- calibrate the base cost and subtract it from measurements
- have each test case in separate class to prevent global optimizations

Quote:
 /fp:precise
 NOP virtual call 32ms
 uba = foo + bar; 534ms
 uba = foo; uba += foo; 490ms
 foo.sum(bar, uba); 457ms

Quote:
 /fp:fast
 NOP virtual call 31ms
 uba = foo + bar; 320ms
 uba = foo; uba += foo; 321ms
 foo.sum(bar, uba); 316ms

Quote:
 /arch:SSE2
 NOP virtual call 31ms
 uba = foo + bar; 386ms
 uba = foo; uba += foo; 349ms
 foo.sum(bar, uba); 345ms

Now you can also begin to study isolated case-by-case generated assembly to see why things are different.

Note: seed isn't reset, but I don't believe it should matter (don't have time or interest to mess with it further right now)

##### Share on other sites
Changing the test to do a += equivalent operation changed things quite a bit:

uba = uba + bar;         ->  34903ms, uba = 61128.2, 178221, 395.683
uba += bar;              ->  32618ms, uba = 61128.2, 178221, 395.683
bar.sum(uba, uba);       ->  33224ms, uba = 61128.2, 178221, 395.683

Now there is a recursive relation, making it harder to optimize. So it seems you were right, Antheus. The compiler played a trick on me :s

Edit: Ah, I see you came to similar results. Thanks for the effort ;)

##### Share on other sites
The temporary object created by binary operators like + can, for certain kinds of classes, be eliminated through a technique called "expression templates". However, it is pretty advanced stuff.
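To give a flavour of the idea, here is a hand-rolled sketch (not from any library; the names AddExpr and the accessor methods are made up, and the Foo here is a simplified stand-in): "a + b" builds a lightweight expression object instead of a Foo temporary, and the square roots are only computed when the result is assigned.

```cpp
#include <cmath>

// Lightweight expression node: holds references, computes on demand.
template <typename L, typename R>
struct AddExpr {
    const L& l;
    const R& r;
    double alpha() const { return std::sqrt(l.alpha()*l.alpha() + r.alpha()*r.alpha()); }
    double beta()  const { return std::sqrt(l.beta()*l.beta()   + r.beta()*r.beta()); }
    double theta() const { return std::sqrt(l.theta()*l.theta() + r.theta()*r.theta()); }
};

struct Foo {
    double a, b, t;
    Foo(double a_ = 0.0, double b_ = 0.0, double t_ = 1.0) : a(a_), b(b_), t(t_) {}
    double alpha() const { return a; }
    double beta()  const { return b; }
    double theta() const { return t; }

    // Assigning from an expression evaluates each component directly
    // into this object; no intermediate Foo is ever constructed.
    template <typename L, typename R>
    Foo& operator=(const AddExpr<L, R>& e) {
        a = e.alpha(); b = e.beta(); t = e.theta();
        return *this;
    }
};

inline AddExpr<Foo, Foo> operator+(const Foo& l, const Foo& r) { return {l, r}; }

// Chaining: (f + g) + h nests expressions, still with no Foo temporaries.
// Caveat: the expression holds references, so it must be consumed in the
// same statement that builds it.
template <typename L, typename R>
AddExpr<AddExpr<L, R>, Foo> operator+(const AddExpr<L, R>& l, const Foo& r) { return {l, r}; }
```

This is only a sketch of the mechanism; real expression-template libraries add many refinements (value semantics for small operands, aliasing checks, and so on).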

Have you considered implementing operator + in terms of operator += ? It might be worth checking how much it affects performance, but it shouldn't be by much:
	friend Foo operator + (Foo L, const Foo &R)
	{
		return L += R;
	}

##### Share on other sites
Quote:
Original post by iMalc
Have you considered implementing operator + in terms of operator += ?

That's the idiomatic way. Binary operators implemented that way do not have to be friends. They also can take advantage of NRVO to reduce copying. An optimizing compiler can do amazing stuff with inline functions.
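To make that concrete, here is a minimal sketch of the idiomatic pattern (using a hypothetical Quad type rather than the thread's Foo): += does the work as a member, and a non-member, non-friend + takes its left operand by value and reuses it.

```cpp
#include <cmath>

struct Quad {
    double alpha, beta, theta;

    // The compound operator does the actual work in place.
    Quad& operator+=(const Quad& r) {
        alpha = std::sqrt(alpha * alpha + r.alpha * r.alpha);
        beta  = std::sqrt(beta  * beta  + r.beta  * r.beta);
        theta = std::sqrt(theta * theta + r.theta * r.theta);
        return *this;
    }
};

// Non-member, non-friend: it only uses the public interface.
// Taking l by value lets the compiler elide the copy when the
// argument is a temporary; returning l is at worst a cheap move.
inline Quad operator+(Quad l, const Quad& r) {
    l += r;
    return l;
}
```

This keeps one definition of the arithmetic, so + and += cannot drift out of sync, and an optimizing compiler can typically inline the whole chain.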
