hick18

Surprised by multi threaded app results


I wanted to experiment with threading a simple task: taking a large number of floats and multiplying each by some number. So I wrote the following code, which runs the task a number of times, increasing the number of threads each iteration.

The work is divided by simply splitting up the list of floats, so

1 thread has to do the entire list
2 threads have half the list each
3 threads have a third of the list each
and so on

I know that there are some division/remainder issues, and that there is no cleanup performed, but I wanted to keep it simple.
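
(For completeness, one simple way to handle the remainder would be to give the last thread the leftover elements; this is just a sketch, not in my test code below:)

// Sketch (not in the test code): give the last thread the leftovers,
// so nothing is dropped when NumFloats % NumThreads != 0.
int Chunk = NumFloats / NumThreads;
for( int j = 0; j < NumThreads; ++j )
{
    int Offset = j * Chunk;
    int Count  = (j == NumThreads - 1) ? NumFloats - Offset : Chunk;
    // ...launch a thread over [Offset, Offset + Count)...
}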

Seeing as I have a quad core, I expected the results to be optimal around the 4-thread mark and then to get gradually worse as the number of threads increased. But that's not what I got at all, and I was wondering why.


#include <windows.h>
#include <mmsystem.h>   // timeGetTime (link winmm.lib)
#include <iostream>
using namespace std;

const int NumFloats = 12000000;
float* pFloatArray = 0;

struct Task
{
float* pFloatArray;
int Count;

Task( float* _pfloatArray, int _Count ) : pFloatArray(_pfloatArray), Count(_Count) {}
};

void Func( Task* pTask )
{
for( int i = 0; i < pTask->Count; ++i )
{
pTask->pFloatArray[i] *= 5;
}
}

void ThreadTask( int NumThreads )
{
DWORD t1 = timeGetTime();
HANDLE* Handles = new HANDLE[NumThreads];
for( int j = 0; j < NumThreads; ++j )
{
int Offset = j*(NumFloats/NumThreads);
Task* ptask = new Task( pFloatArray + Offset, NumFloats/NumThreads );
Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
}
WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );
DWORD t2 = timeGetTime();
cout << "Threads : " << NumThreads << ", Time(ms) : " << t2 - t1 << ", Rem " << NumFloats % NumThreads << endl;

}

int main()
{
pFloatArray = new float[NumFloats];

ThreadTask( 1 );
ThreadTask( 2 );
ThreadTask( 3 );
ThreadTask( 4 );
ThreadTask( 6 );
ThreadTask( 8 );
ThreadTask( 12 );
ThreadTask( 16 );
ThreadTask( 24 );
ThreadTask( 30 );
ThreadTask( 40 );
ThreadTask( 50 );
ThreadTask( 60 );
ThreadTask( 80 );
ThreadTask( 100 );
ThreadTask( 120 );
ThreadTask( 140 );
ThreadTask( 180 );
ThreadTask( 220 );
ThreadTask( 250 );
ThreadTask( 300 );
ThreadTask( 350 );
ThreadTask( 400 );
ThreadTask( 450 );
ThreadTask( 500 );

cin.get();
}
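
(Side note: CreateThread actually expects an entry point of the form DWORD WINAPI Func(LPVOID); casting Func as above is technically undefined behaviour even if it appears to work. A conforming sketch would be:)

// Conforming CreateThread entry point (sketch).
DWORD WINAPI ThreadFunc( LPVOID pParam )
{
    Task* pTask = static_cast<Task*>( pParam );
    for( int i = 0; i < pTask->Count; ++i )
        pTask->pFloatArray[i] *= 5;
    return 0;
}

// ...then no cast is needed at the call site:
// Handles[j] = CreateThread( 0, 0, ThreadFunc, ptask, 0, 0 );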




Results

Threads : 1, Time(ms) : 38, Rem 0
Threads : 2, Time(ms) : 19, Rem 0
Threads : 3, Time(ms) : 19, Rem 0
Threads : 4, Time(ms) : 20, Rem 0
Threads : 6, Time(ms) : 21, Rem 0
Threads : 8, Time(ms) : 19, Rem 0
Threads : 12, Time(ms) : 21, Rem 0
Threads : 16, Time(ms) : 19, Rem 0
Threads : 24, Time(ms) : 21, Rem 0
Threads : 30, Time(ms) : 22, Rem 0
Threads : 40, Time(ms) : 21, Rem 0
Threads : 50, Time(ms) : 22, Rem 0
Threads : 60, Time(ms) : 22, Rem 0
Threads : 80, Time(ms) : 15, Rem 0
Threads : 100, Time(ms) : 17, Rem 0
Threads : 120, Time(ms) : 20, Rem 0
Threads : 140, Time(ms) : 24, Rem 40
Threads : 180, Time(ms) : 21, Rem 120
Threads : 220, Time(ms) : 28, Rem 100
Threads : 250, Time(ms) : 33, Rem 0
Threads : 300, Time(ms) : 37, Rem 0
Threads : 350, Time(ms) : 40, Rem 250
Threads : 400, Time(ms) : 46, Rem 0
Threads : 450, Time(ms) : 52, Rem 300
Threads : 500, Time(ms) : 57, Rem 0








Are you sure you have a quad core processor and not a dual core with hyperthreading?

Your tasks are far too short for the timings to be useful. If I recall correctly, timeGetTime() only has a resolution of about 20 milliseconds, which makes any results much under 100 milliseconds meaningless.
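
(If you want to check what it is on your machine, the multimedia timer API can report and temporarily raise the resolution; a minimal sketch, assuming winmm.lib is linked:)

// Sketch: query and temporarily raise the timeGetTime resolution.
#include <windows.h>
#include <mmsystem.h>   // link winmm.lib
#include <iostream>

int main()
{
    TIMECAPS tc;
    timeGetDevCaps( &tc, sizeof(tc) );
    std::cout << "Min timer period: " << tc.wPeriodMin << " ms" << std::endl;

    timeBeginPeriod( tc.wPeriodMin );   // raise resolution while timing
    // ...timed code here...
    timeEndPeriod( tc.wPeriodMin );     // must be paired with timeBeginPeriod
    return 0;
}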

That's interesting. I am researching that too. One thing you should definitely do is turn off optimizations! Your compiler may optimize the calculation away and compile a single result into your tasks.

Several issues. As has been pointed out, your tasks are too short and simplistic. Perhaps try an array of structs with several values, say an array of vector4s filled with random values, and do a normalisation on each one?

You're also running each set of tasks only once, when you should really run them a reasonable number of times (say 10 to 1000) and then average the times. This helps isolate the results of each run from other tasks your CPU is performing.

As to why you get faster results with more threads: first, you're using a low-resolution timer, so your timing may not be accurate. Secondly, when running more threads the OS can sometimes schedule execution better than with one thread per core. If you run one thread per core, a thread may stall due to a cache miss, poor pipelining, or some other issue, and there is nothing that thread or core can do but wait for it to pass. If you're running two threads per core, that stall might be hidden by the OS or the core switching to the other thread.

So, make your test more complicated to give each task more work, then repeat and average the results to get more accurate timings; then you should have a better idea of what it's doing. You may also want to try several different types of tests and data structures to see how the results vary; division, for example, often can't be pipelined, but multiplication can.

Andy

Quote:
Original post by Zweistein2
That's interesting. I am researching that too. One thing you should definitely do is turn off optimizations! Your compiler may optimize the calculation away and compile a single result into your tasks.


Negative.

This advice is similar to suggesting that in order to see how fast your car can go, you should siphon all the gas out of the tank, because higher octane fuel might just affect your engine performance.
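
If the worry is the optimizer deleting the work, the fix is to keep optimizations on and make the result observable instead, for example by printing a checksum of the output. A minimal sketch:

// Sketch: prevent dead-code elimination without turning optimizations off,
// by making the computed results observable.
#include <iostream>

int main()
{
    const int N = 1000000;
    float* data = new float[N];
    for( int i = 0; i < N; ++i ) data[i] = 1.0f;

    for( int i = 0; i < N; ++i ) data[i] *= 5;      // the timed work

    float checksum = 0.0f;
    for( int i = 0; i < N; ++i ) checksum += data[i];
    std::cout << checksum << std::endl;  // result is used, so the loop can't be removed
    delete[] data;
    return 0;
}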

Well, I made the changes as suggested, and now I get even stranger results:

Number of vectors : 10,000,000
Number of runs : 10


Threads : 1, Avg Time(ms) : 7232, Rem 0
Threads : 2, Avg Time(ms) : 3851, Rem 0
Threads : 3, Avg Time(ms) : 2582, Rem 1
Threads : 4, Avg Time(ms) : 2046, Rem 0
Threads : 6, Avg Time(ms) : 2012, Rem 4
Threads : 8, Avg Time(ms) : 2152, Rem 0
Threads : 12, Avg Time(ms) : 2092, Rem 4
Threads : 16, Avg Time(ms) : 2062, Rem 0
Threads : 24, Avg Time(ms) : 2038, Rem 16
Threads : 30, Avg Time(ms) : 2023, Rem 10
Threads : 40, Avg Time(ms) : 1973, Rem 0
Threads : 50, Avg Time(ms) : 1973, Rem 0
Threads : 60, Avg Time(ms) : 1962, Rem 40
Threads : 80, Avg Time(ms) : 50, Rem 0
Threads : 100, Avg Time(ms) : 206, Rem 0
Threads : 120, Avg Time(ms) : 25, Rem 40
Threads : 140, Avg Time(ms) : 26, Rem 80
Threads : 180, Avg Time(ms) : 42, Rem 100
Threads : 220, Avg Time(ms) : 47, Rem 120
Threads : 250, Avg Time(ms) : 71, Rem 0
Threads : 300, Avg Time(ms) : 80, Rem 100
Threads : 350, Avg Time(ms) : 152, Rem 150
Threads : 400, Avg Time(ms) : 393, Rem 0
Threads : 450, Avg Time(ms) : 88, Rem 100
Threads : 500, Avg Time(ms) : 90, Rem 0





#include <windows.h>
#include <mmsystem.h>   // timeGetTime (link winmm.lib)
#include <cmath>        // sqrt
#include <iostream>
using namespace std;

struct Vector
{
float x,y,z;
};

const int NumVectors = 10 * 1000000;
const int NumRuns = 10;
Vector* pVectorArray = 0;

void Normalise( Vector* pVec )
{
float Length = sqrt( pVec->x*pVec->x + pVec->y*pVec->y + pVec->z*pVec->z );
pVec->x /= Length;
pVec->y /= Length;
pVec->z /= Length;
}

struct Task
{
Vector* pVectorArray;
int Count;

Task( Vector* _pVectorArray, int _Count ) : pVectorArray(_pVectorArray), Count(_Count) {}
};

void Func( Task* pTask )
{
for( int i = 0; i < pTask->Count; ++i )
{
Normalise( &pTask->pVectorArray[i] );
}
}

void ThreadTask( int NumThreads )
{
DWORD Total = 0;

for( int i = 0; i < NumRuns; ++i )
{
DWORD t1 = timeGetTime();
HANDLE* Handles = new HANDLE[NumThreads];
for( int j = 0; j < NumThreads; ++j )
{
int Offset = j*(NumVectors/NumThreads);
Task* ptask = new Task( pVectorArray + Offset, NumVectors/NumThreads );
Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
}
WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );
DWORD t2 = timeGetTime();
Total += (t2 - t1);
}

cout << "Threads : " << NumThreads << ", Avg Time(ms) : " << Total / NumRuns << ", Rem " << NumVectors % NumThreads << endl;

}

int main()
{
pVectorArray = new Vector[NumVectors];

ThreadTask( 1 );
ThreadTask( 2 );
ThreadTask( 3 );
ThreadTask( 4 );
ThreadTask( 6 );
ThreadTask( 8 );
ThreadTask( 12 );
ThreadTask( 16 );
ThreadTask( 24 );
ThreadTask( 30 );
ThreadTask( 40 );
ThreadTask( 50 );
ThreadTask( 60 );
ThreadTask( 80 );
ThreadTask( 100 );
ThreadTask( 120 );
ThreadTask( 140 );
ThreadTask( 180 );
ThreadTask( 220 );
ThreadTask( 250 );
ThreadTask( 300 );
ThreadTask( 350 );
ThreadTask( 400 );
ThreadTask( 450 );
ThreadTask( 500 );

cin.get();
}


Quote:
Threads : 1, Avg Time(ms) : 7232, Rem 0
Threads : 2, Avg Time(ms) : 3851, Rem 0
Threads : 3, Avg Time(ms) : 2582, Rem 1
Threads : 4, Avg Time(ms) : 2046, Rem 0

This part is OK: 4 cores is about 3.5 times faster.


The incorrectly short times occur because t2 - t1 = 0; those runs complete faster than timeGetTime can measure.
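
One more thing worth checking (an educated guess, not something I've measured here): WaitForMultipleObjects can only wait on up to MAXIMUM_WAIT_OBJECTS (64) handles per call and fails immediately when given more, which would also explain why the timings collapse above 60 threads. Waiting in chunks would look something like this (sketch):

// Sketch: wait on an arbitrary number of handles by chunking the array,
// since WaitForMultipleObjects is limited to MAXIMUM_WAIT_OBJECTS (64).
void WaitAll( HANDLE* Handles, int Count )
{
    for( int i = 0; i < Count; i += MAXIMUM_WAIT_OBJECTS )
    {
        int n = min( Count - i, (int)MAXIMUM_WAIT_OBJECTS );
        WaitForMultipleObjects( n, Handles + i, TRUE, INFINITE );
    }
}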

Whilst that might be true, I don't think it's the issue in this case, as I was watching the experiment and saw the last tasks flying through much, much quicker.

Here is an update with a higher-precision timer (times are now in seconds), and the results are the same.


Threads : 1, Avg Time : 7.24659, Rem 0
Threads : 2, Avg Time : 3.85804, Rem 0
Threads : 3, Avg Time : 2.57677, Rem 1
Threads : 4, Avg Time : 2.06466, Rem 0
Threads : 6, Avg Time : 2.01013, Rem 4
Threads : 8, Avg Time : 2.0157, Rem 0
Threads : 12, Avg Time : 1.98388, Rem 4
Threads : 16, Avg Time : 1.97695, Rem 0
Threads : 24, Avg Time : 1.97691, Rem 16
Threads : 30, Avg Time : 1.98669, Rem 10
Threads : 40, Avg Time : 1.96183, Rem 0
Threads : 50, Avg Time : 1.97918, Rem 0
Threads : 60, Avg Time : 1.97647, Rem 40
Threads : 80, Avg Time : 0.0099639, Rem 0
Threads : 100, Avg Time : 0.099033, Rem 0
Threads : 120, Avg Time : 0.0292294, Rem 40
Threads : 140, Avg Time : 0.0338388, Rem 80
Threads : 180, Avg Time : 0.0358496, Rem 100
Threads : 220, Avg Time : 0.0540778, Rem 120
Threads : 250, Avg Time : 0.0660725, Rem 0
Threads : 300, Avg Time : 0.0563562, Rem 100
Threads : 350, Avg Time : 0.0662096, Rem 150
Threads : 400, Avg Time : 0.0825936, Rem 0
Threads : 450, Avg Time : 0.116231, Rem 100
Threads : 500, Avg Time : 0.107333, Rem 0




#include <windows.h>
#include <cmath>        // sqrt
#include <iostream>
using namespace std;

double GetHighPrecisionTime()
{
static __int64 sTimeStamp;
static bool sInitialized = false;
static double sf = 0.0;

if( !sInitialized )
{
__int64 l_TicksPerSec;
::QueryPerformanceFrequency((LARGE_INTEGER*)&l_TicksPerSec);
sf = 1.0 /(double)l_TicksPerSec;
sInitialized = true;
}

QueryPerformanceCounter((LARGE_INTEGER*)&sTimeStamp);
return sTimeStamp * sf;
}

struct Vector
{
float x,y,z;
};

const int NumVectors = 10 * 1000000;
const int NumRuns = 10;
Vector* pVectorArray = 0;

void Normalise( Vector* pVec )
{
float Length = sqrt( pVec->x*pVec->x + pVec->y*pVec->y + pVec->z*pVec->z );
pVec->x /= Length;
pVec->y /= Length;
pVec->z /= Length;
}

struct Task
{
Vector* pVectorArray;
int Count;

Task( Vector* _pVectorArray, int _Count ) : pVectorArray(_pVectorArray), Count(_Count) {}
};

void Func( Task* pTask )
{
for( int i = 0; i < pTask->Count; ++i )
{
Normalise( &pTask->pVectorArray[i] );
}
}

void ThreadTask( int NumThreads )
{
double Total = 0;

for( int i = 0; i < NumRuns; ++i )
{
double t1 = GetHighPrecisionTime();
HANDLE* Handles = new HANDLE[NumThreads];
for( int j = 0; j < NumThreads; ++j )
{
int Offset = j*(NumVectors/NumThreads);
Task* ptask = new Task( pVectorArray + Offset, NumVectors/NumThreads );
Handles[j] = CreateThread( 0, 0, (LPTHREAD_START_ROUTINE)&Func, ptask, 0, 0 );
}
WaitForMultipleObjects( NumThreads, Handles, true, INFINITE );
double t2 = GetHighPrecisionTime();
Total += (t2 - t1);
}

cout << "Threads : " << NumThreads << ", Avg Time : " << Total / NumRuns << ", Rem " << NumVectors % NumThreads << endl;

}

int main()
{
pVectorArray = new Vector[NumVectors];

ThreadTask( 1 );
ThreadTask( 2 );
ThreadTask( 3 );
ThreadTask( 4 );
ThreadTask( 6 );
ThreadTask( 8 );
ThreadTask( 12 );
ThreadTask( 16 );
ThreadTask( 24 );
ThreadTask( 30 );
ThreadTask( 40 );
ThreadTask( 50 );
ThreadTask( 60 );
ThreadTask( 80 );
ThreadTask( 100 );
ThreadTask( 120 );
ThreadTask( 140 );
ThreadTask( 180 );
ThreadTask( 220 );
ThreadTask( 250 );
ThreadTask( 300 );
ThreadTask( 350 );
ThreadTask( 400 );
ThreadTask( 450 );
ThreadTask( 500 );

cin.get();
}


