Surprised by multi-threaded app results

I wanted to experiment with threading a simple task: take a large number of floats and multiply each one by some number. So I wrote the following code, which runs the task a number of times, increasing the number of threads each iteration.

The work is divided by simply splitting up the list of floats, so

1 thread has to do the entire list
2 threads have half the list each
3 threads have a third of the list each
and so on

I know that there are some division/remainder issues, and that there is no cleanup performed, but I wanted to keep it simple.

Seeing as I have a quad core, I expected the results to be best around the 4-thread mark and then get gradually worse as the number of threads increased. But that's not what I got at all, and I was wondering why?

#include <windows.h>
#include <iostream>

// timeGetTime lives in winmm.lib
#pragma comment(lib, "winmm.lib")

using namespace std;

const int NumFloats = 12000000;
float* pFloatArray = 0;

struct Task
{
	float* pFloatArray;
	int Count;

	Task( float* _pFloatArray, int _Count ) : pFloatArray(_pFloatArray), Count(_Count) {}
};

void Func( Task* pTask )
{
	// Multiply every float in this task's slice by 5
	for( int i = 0; i < pTask->Count; ++i )
	{
		pTask->pFloatArray[i] *= 5;
	}
}

void ThreadTask( int NumThreads )
{
	DWORD t1 = timeGetTime();

	HANDLE* Handles = new HANDLE[NumThreads];
	for( int j = 0; j < NumThreads; ++j )
	{
		int Offset = j*(NumFloats/NumThreads);
		Task* ptask = new Task( pFloatArray + Offset, NumFloats/NumThreads );

		// Func's signature doesn't match LPTHREAD_START_ROUTINE; the cast forces it through
		Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
	}

	// Note: WaitForMultipleObjects can wait on at most MAXIMUM_WAIT_OBJECTS (64) handles per call
	WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );

	DWORD t2 = timeGetTime();
	cout << "Threads " << NumThreads << " Time : " << t2 - t1 << " Rem " << NumFloats%NumThreads << endl;
}

int main()
{
	pFloatArray = new float[NumFloats];

	ThreadTask( 1 );
	ThreadTask( 2 );
	ThreadTask( 3 );
	ThreadTask( 4 );
	ThreadTask( 6 );
	ThreadTask( 8 );
	ThreadTask( 12 );
	ThreadTask( 16 );
	ThreadTask( 24 );
	ThreadTask( 30 );
	ThreadTask( 40 );
	ThreadTask( 50 );
	ThreadTask( 60 );
	ThreadTask( 80 );
	ThreadTask( 100 );
	ThreadTask( 120 );
	ThreadTask( 140 );
	ThreadTask( 180 );
	ThreadTask( 220 );
	ThreadTask( 250 );
	ThreadTask( 300 );
	ThreadTask( 350 );
	ThreadTask( 400 );
	ThreadTask( 450 );
	ThreadTask( 500 );

	cin.get();
}


Results
Threads  : 1,  Time(ms) : 38,  Rem 0
Threads  : 2,  Time(ms) : 19,  Rem 0
Threads  : 3,  Time(ms) : 19,  Rem 0
Threads  : 4,  Time(ms) : 20,  Rem 0
Threads  : 6,  Time(ms) : 21,  Rem 0
Threads  : 8,  Time(ms) : 19,  Rem 0
Threads  : 12,  Time(ms) : 21,  Rem 0
Threads  : 16,  Time(ms) : 19,  Rem 0
Threads  : 24,  Time(ms) : 21,  Rem 0
Threads  : 30,  Time(ms) : 22,  Rem 0
Threads  : 40,  Time(ms) : 21,  Rem 0
Threads  : 50,  Time(ms) : 22,  Rem 0
Threads  : 60,  Time(ms) : 22,  Rem 0
Threads  : 80,  Time(ms) : 15,  Rem 0
Threads  : 100,  Time(ms) : 17,  Rem 0
Threads  : 120,  Time(ms) : 20,  Rem 0
Threads  : 140,  Time(ms) : 24,  Rem 40
Threads  : 180,  Time(ms) : 21,  Rem 120
Threads  : 220,  Time(ms) : 28,  Rem 100
Threads  : 250,  Time(ms) : 33,  Rem 0
Threads  : 300,  Time(ms) : 37,  Rem 0
Threads  : 350,  Time(ms) : 40,  Rem 250
Threads  : 400,  Time(ms) : 46,  Rem 0
Threads  : 450,  Time(ms) : 52,  Rem 300
Threads  : 500,  Time(ms) : 57,  Rem 0


Are you sure you have a quad core processor and not a dual core with hyperthreading?
Your tasks are far too short for the timings to be useful. If I recall correctly, timeGetTime() only has a resolution of about 20 milliseconds, which makes any result much under 100 milliseconds meaningless.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]
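On the core-count question above: a quick way to check is to ask Windows how many logical processors it sees. A minimal sketch (not from the original posts); note that dwNumberOfProcessors counts logical processors, so a dual core with hyperthreading reports 4 just like a true quad core:

#include <windows.h>
#include <iostream>

int main()
{
	SYSTEM_INFO Info;
	GetSystemInfo( &Info );

	// Logical processor count: hyperthreaded cores are counted separately,
	// so this alone cannot distinguish 4 physical cores from 2 cores + HT.
	std::cout << "Logical processors: " << Info.dwNumberOfProcessors << std::endl;

	return 0;
}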

That's interesting. I am researching that too. What you should definitely do is turn off optimizations! Your compiler may optimize away the calculation and compile a single result into your tasks.
Several issues. As has been pointed out, your tasks are too short and simplistic. Perhaps try an array of structs with several values, say an array of vector4s filled with random values, and do a normalisation on each one?

You're also running each set of tasks only once, when you should really run them a reasonable number of times (10 to 1000, let's say) and then average the times. This helps isolate the results of each run from other tasks your CPU is performing.

As to why you get faster results with more threads: first, you're using a low-resolution timer, so your timing may not be accurate. Secondly, when running more threads the OS can sometimes schedule execution better than with one thread per core. If you run one thread per core, a thread may stall due to a cache miss, poor pipelining, or some other issue, and there is nothing that thread or core can do but wait for the issue to pass. If you run two threads per core, that stall might be hidden by the OS or the core switching to the other thread.

So, make your test more complicated to give each task more work, then repeat and average the runs to get more accurate timings, and you should have a better idea of what it's doing. You may also want to try several different types of tests and data structures to see how the results vary; division, for example, often can't be pipelined, but multiplication can.
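To illustrate that last point, here is a minimal sketch (not from the original posts) of a normalise written with a single divide and three multiplies, instead of three dependent divides:

#include <cmath>

struct Vector { float x, y, z; };

// Compute the reciprocal length once, then scale with three independent
// multiplies, which pipeline far better than three divides would.
void NormaliseFast( Vector* pVec )
{
	float Length = std::sqrt( pVec->x*pVec->x + pVec->y*pVec->y + pVec->z*pVec->z );
	float InvLength = 1.0f / Length;	// assumes Length != 0

	pVec->x *= InvLength;
	pVec->y *= InvLength;
	pVec->z *= InvLength;
}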

Andy

"Ars longa, vita brevis, occasio praeceps, experimentum periculosum, iudicium difficile"

"Life is short, [the] craft long, opportunity fleeting, experiment treacherous, judgement difficult."

Quote: Original post by Zweistein2
That's interesting. I am researching that too. What you should definitely do is turn off optimizations! Your compiler may optimize away the calculation and compile a single result into your tasks.


Negative.

This advice is similar to suggesting that in order to see how fast your car can go, you should siphon all the gas out of the tank, because higher octane fuel might just affect your engine performance.
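If the worry is the optimizer eliding the work, the better fix is to keep optimizations on and consume the results so they cannot be thrown away. A minimal sketch (the checksum idea is illustrative, not part of the original code):

#include <iostream>

// Summing the processed array and printing the sum gives the compiler an
// observable use of every element, so the multiply loop cannot be elided.
float Checksum( const float* pArray, int Count )
{
	float Sum = 0.0f;
	for( int i = 0; i < Count; ++i )
		Sum += pArray[i];
	return Sum;
}

// For example, at the end of main(), after the timed runs:
//   std::cout << "Checksum: " << Checksum( pFloatArray, NumFloats ) << std::endl;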

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Increase NumFloats by a factor of 10 or 100. See if it affects the results.
Well, I made the changes as suggested, and now I get even stranger results.

Number of vectors : 10,000,000
Number of runs : 10

Threads  : 1,  Avg Time(ms) : 7232,  Rem 0
Threads  : 2,  Avg Time(ms) : 3851,  Rem 0
Threads  : 3,  Avg Time(ms) : 2582,  Rem 1
Threads  : 4,  Avg Time(ms) : 2046,  Rem 0
Threads  : 6,  Avg Time(ms) : 2012,  Rem 4
Threads  : 8,  Avg Time(ms) : 2152,  Rem 0
Threads  : 12,  Avg Time(ms) : 2092,  Rem 4
Threads  : 16,  Avg Time(ms) : 2062,  Rem 0
Threads  : 24,  Avg Time(ms) : 2038,  Rem 16
Threads  : 30,  Avg Time(ms) : 2023,  Rem 10
Threads  : 40,  Avg Time(ms) : 1973,  Rem 0
Threads  : 50,  Avg Time(ms) : 1973,  Rem 0
Threads  : 60,  Avg Time(ms) : 1962,  Rem 40
Threads  : 80,  Avg Time(ms) : 50,  Rem 0
Threads  : 100,  Avg Time(ms) : 206,  Rem 0
Threads  : 120,  Avg Time(ms) : 25,  Rem 40
Threads  : 140,  Avg Time(ms) : 26,  Rem 80
Threads  : 180,  Avg Time(ms) : 42,  Rem 100
Threads  : 220,  Avg Time(ms) : 47,  Rem 120
Threads  : 250,  Avg Time(ms) : 71,  Rem 0
Threads  : 300,  Avg Time(ms) : 80,  Rem 100
Threads  : 350,  Avg Time(ms) : 152,  Rem 150
Threads  : 400,  Avg Time(ms) : 393,  Rem 0
Threads  : 450,  Avg Time(ms) : 88,  Rem 100
Threads  : 500,  Avg Time(ms) : 90,  Rem 0


#include <windows.h>
#include <cmath>
#include <iostream>

// timeGetTime lives in winmm.lib
#pragma comment(lib, "winmm.lib")

using namespace std;

struct Vector
{
	float x,y,z;
};

const int NumVectors = 10 * 1000000;
const int NumRuns = 10;
Vector* pVectorArray = 0;

void Normalise( Vector* pVec )
{
	float Length = sqrt( pVec->x*pVec->x + pVec->y*pVec->y + pVec->z*pVec->z );
	pVec->x /= Length;
	pVec->y /= Length;
	pVec->z /= Length;
}

struct Task
{
	Vector* pVectorArray;
	int Count;

	Task( Vector* _pVectorArray, int _Count ) : pVectorArray(_pVectorArray), Count(_Count) {}
};

void Func( Task* pTask )
{
	// Normalise every vector in this task's slice
	for( int i = 0; i < pTask->Count; ++i )
	{
		Normalise( &pTask->pVectorArray[i] );
	}
}

void ThreadTask( int NumThreads )
{
	DWORD Total = 0;

	for( int i = 0; i < NumRuns; ++i )
	{
		DWORD t1 = timeGetTime();

		HANDLE* Handles = new HANDLE[NumThreads];
		for( int j = 0; j < NumThreads; ++j )
		{
			int Offset = j*(NumVectors/NumThreads);
			Task* ptask = new Task( pVectorArray + Offset, NumVectors/NumThreads );
			Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
		}
		WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );

		DWORD t2 = timeGetTime();
		Total += (t2 - t1);
	}

	cout << "Threads  : " << NumThreads << ",  Avg Time(ms) : " << Total / NumRuns << ",  Rem " << NumVectors % NumThreads << endl;
}

int main()
{
	pVectorArray = new Vector[NumVectors];

	ThreadTask( 1 );
	ThreadTask( 2 );
	ThreadTask( 3 );
	ThreadTask( 4 );
	ThreadTask( 6 );
	ThreadTask( 8 );
	ThreadTask( 12 );
	ThreadTask( 16 );
	ThreadTask( 24 );
	ThreadTask( 30 );
	ThreadTask( 40 );
	ThreadTask( 50 );
	ThreadTask( 60 );
	ThreadTask( 80 );
	ThreadTask( 100 );
	ThreadTask( 120 );
	ThreadTask( 140 );
	ThreadTask( 180 );
	ThreadTask( 220 );
	ThreadTask( 250 );
	ThreadTask( 300 );
	ThreadTask( 350 );
	ThreadTask( 400 );
	ThreadTask( 450 );
	ThreadTask( 500 );

	cin.get();
}
Quote:Threads : 1, Avg Time(ms) : 7232, Rem 0
Threads : 2, Avg Time(ms) : 3851, Rem 0
Threads : 3, Avg Time(ms) : 2582, Rem 1
Threads : 4, Avg Time(ms) : 2046, Rem 0

This part is OK: 4 cores is ~3.6 times faster.


The incorrectly short times occur because t2 - t1 = 0: the run completes faster than timeGetTime can measure.
Whilst that might be true, I don't think it's the issue in this case, as I was watching the experiment and saw the last tasks flying through much, much quicker.

Here is an update with a higher-precision timer, and the results are the same.

Threads  : 1,  Avg Time : 7.24659,  Rem 0
Threads  : 2,  Avg Time : 3.85804,  Rem 0
Threads  : 3,  Avg Time : 2.57677,  Rem 1
Threads  : 4,  Avg Time : 2.06466,  Rem 0
Threads  : 6,  Avg Time : 2.01013,  Rem 4
Threads  : 8,  Avg Time : 2.0157,  Rem 0
Threads  : 12,  Avg Time : 1.98388,  Rem 4
Threads  : 16,  Avg Time : 1.97695,  Rem 0
Threads  : 24,  Avg Time : 1.97691,  Rem 16
Threads  : 30,  Avg Time : 1.98669,  Rem 10
Threads  : 40,  Avg Time : 1.96183,  Rem 0
Threads  : 50,  Avg Time : 1.97918,  Rem 0
Threads  : 60,  Avg Time : 1.97647,  Rem 40
Threads  : 80,  Avg Time : 0.0099639,  Rem 0
Threads  : 100,  Avg Time : 0.099033,  Rem 0
Threads  : 120,  Avg Time : 0.0292294,  Rem 40
Threads  : 140,  Avg Time : 0.0338388,  Rem 80
Threads  : 180,  Avg Time : 0.0358496,  Rem 100
Threads  : 220,  Avg Time : 0.0540778,  Rem 120
Threads  : 250,  Avg Time : 0.0660725,  Rem 0
Threads  : 300,  Avg Time : 0.0563562,  Rem 100
Threads  : 350,  Avg Time : 0.0662096,  Rem 150
Threads  : 400,  Avg Time : 0.0825936,  Rem 0
Threads  : 450,  Avg Time : 0.116231,  Rem 100
Threads  : 500,  Avg Time : 0.107333,  Rem 0


#include <windows.h>
#include <cmath>
#include <iostream>

using namespace std;

double GetHighPrecisionTime()
{
	static __int64 sTimeStamp;
	static bool sInitialized = false;
	static double sf = 0.0;

	if( !sInitialized )
	{
		// Query the counter frequency once and cache the seconds-per-tick factor
		__int64 l_TicksPerSec;
		::QueryPerformanceFrequency((LARGE_INTEGER*)&l_TicksPerSec);
		sf = 1.0 / (double)l_TicksPerSec;
		sInitialized = true;
	}

	QueryPerformanceCounter((LARGE_INTEGER*)&sTimeStamp);
	return sTimeStamp * sf;
}

struct Vector
{
	float x,y,z;
};

const int NumVectors = 10 * 1000000;
const int NumRuns = 10;
Vector* pVectorArray = 0;

void Normalise( Vector* pVec )
{
	float Length = sqrt( pVec->x*pVec->x + pVec->y*pVec->y + pVec->z*pVec->z );
	pVec->x /= Length;
	pVec->y /= Length;
	pVec->z /= Length;
}

struct Task
{
	Vector* pVectorArray;
	int Count;

	Task( Vector* _pVectorArray, int _Count ) : pVectorArray(_pVectorArray), Count(_Count) {}
};

void Func( Task* pTask )
{
	// Normalise every vector in this task's slice
	for( int i = 0; i < pTask->Count; ++i )
	{
		Normalise( &pTask->pVectorArray[i] );
	}
}

void ThreadTask( int NumThreads )
{
	double Total = 0;

	for( int i = 0; i < NumRuns; ++i )
	{
		double t1 = GetHighPrecisionTime();

		HANDLE* Handles = new HANDLE[NumThreads];
		for( int j = 0; j < NumThreads; ++j )
		{
			int Offset = j*(NumVectors/NumThreads);
			Task* ptask = new Task( pVectorArray + Offset, NumVectors/NumThreads );
			Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
		}
		WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );

		double t2 = GetHighPrecisionTime();
		Total += (t2 - t1);
	}

	cout << "Threads  : " << NumThreads << ",  Avg Time : " << Total / NumRuns << ",  Rem " << NumVectors % NumThreads << endl;
}

int main()
{
	pVectorArray = new Vector[NumVectors];

	ThreadTask( 1 );
	ThreadTask( 2 );
	ThreadTask( 3 );
	ThreadTask( 4 );
	ThreadTask( 6 );
	ThreadTask( 8 );
	ThreadTask( 12 );
	ThreadTask( 16 );
	ThreadTask( 24 );
	ThreadTask( 30 );
	ThreadTask( 40 );
	ThreadTask( 50 );
	ThreadTask( 60 );
	ThreadTask( 80 );
	ThreadTask( 100 );
	ThreadTask( 120 );
	ThreadTask( 140 );
	ThreadTask( 180 );
	ThreadTask( 220 );
	ThreadTask( 250 );
	ThreadTask( 300 );
	ThreadTask( 350 );
	ThreadTask( 400 );
	ThreadTask( 450 );
	ThreadTask( 500 );

	cin.get();
}
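One check worth adding to the code above (it was never raised in the thread): WaitForMultipleObjects can wait on at most MAXIMUM_WAIT_OBJECTS (64) handles and fails immediately with WAIT_FAILED when given more, so for 80+ threads the timed section may only be measuring thread creation, not the work. A minimal sketch of verifying the return value inside ThreadTask:

	DWORD Result = WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );
	if( Result == WAIT_FAILED )
	{
		// Happens whenever NumThreads > MAXIMUM_WAIT_OBJECTS (64); the threads
		// are still running, but the wait returns at once and t2 - t1 is tiny.
		cerr << "WaitForMultipleObjects failed, error " << GetLastError() << endl;
	}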


