Sign in to follow this  

Multithreaded slower (despite Hyper-Threading)!

This topic is 4746 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

UPDATE: The post below is obsolete. I have a Pentium 4 with Hyper-Threading (SMP) so I made one part of my code multi-threaded to get a speed-up. Two threads are created, each of which evaluates roughly equal amounts of data. In Linux this gave me a nice ~50% increase in framerate, but in Windows it actually decreases -- single-threaded, it's 17 fps, but multi-threaded, it's 11 fps. The threads do not write to the same memory, though each thread does read from memory that the other thread writes to, without any sort of mutex (the order of the read/write is irrelevant). I don't know if that somehow causes a slowdown. My code for creating the threads looks like: [removed] AIThread is the function, ThreadData is the parameter passed, and SimTime is initialized to zero, and set to some value when the thread finishes (at the end of the AIThread function). So the while loop merely loops forever, doing nothing, until both SimTimes are nonzero, indicating both threads have finished. Does anyone know why I could be getting such a drastic framerate decrease? [Edited by - CGameProgrammer on December 17, 2004 2:54:12 AM]

Share this post


Link to post
Share on other sites
Do you have any kind of synchronization ?

I'm actually surprized at your Linux result. 50% is a lot, especially for hyper threading. Are you sure the number is correct ?

Don't forget that HT is fake parallelism. You still only have a single CPU, so your two threads are going to fight for CPU/memory/IO access. Plus there is a cost in context switching threads.. especially in Windows, it seems :)

I do have a real dual core machine (dual AMD 2400). If you want i can test your program. The results should be a lot different.

Y.

Share this post


Link to post
Share on other sites
Windows doesn't support multiple CPUs - you need to get a multiple CPU version of the OS (NT only IIRC and I'm not sure it even supports HT). You pay a license per physical CPU, so a dual CPU in one chip counts as one. So, in Windows, your multithreaded app was running on one CPU so it was never going to be quicker.

Skizz

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Forgive me my ignorance of threading apps but I believe you have a busy loop here. The proggie will spent lots of time in that while loop, whereas it's probably not so critical to be checking those SimTime variables billions of times per second. Try adding an OS sleep call within the loop.

When you run it single threaded, you do the work of the 2 threads and nothing more, and now you have three threads. So in single threaded mode, it takes 1/17=0.0588 seconds per frame, that is 0.0588/2*3 seconds per frame for three threads that get even time. And 1/(0.0588/2*3)=11, the framerate you are getting in Windows when multithreaded..

On the other hand, this might be BS, but I'm convinced!

Share this post


Link to post
Share on other sites
Quote:
Original post by Skizz
Windows doesn't support multiple CPUs - you need to get a multiple CPU version of the OS (NT only IIRC and I'm not sure it even supports HT). You pay a license per physical CPU, so a dual CPU in one chip counts as one. So, in Windows, your multithreaded app was running on one CPU so it was never going to be quicker.


That's only true for a certain licence agreements with Windows Server on high-grade server machines, certainly not true for workstation editions.

NT, Win2K and XP Pro handle multiple CPUs and HT processors fine, right out of the box. XP Home, however, has been crippled and only supports one CPU (and HT is disabled).

Share this post


Link to post
Share on other sites
Ysaneya: I have no synchronization of any sort. Yes I'm aware of how Hyper-Threading works, and my framerate had increased by roughly 50%, though my performance in Linux was bad to begin with for some reason (partly ATI-related). Anyway, I have uploaded a new GDomin.zip. Press 'm' to toggle multithreading on or off (it's off by default). Note that you will see no difference unless you have a bunch of ships flying around, so select alot of ships and right-click somewhere on screen so they'll go there (don't forget to press Space first so you can select stuff). Let me know if it crashes; the MT conversion is not totally finished, but it should be stable. I can run the program forever and it never crashes. Also, ignore the workload statistics; they're wrong... the framerate is correct however.

Oh, I should probably mention the game will run much slower than it did with the first release; this is not due to multithreading, but due to improved collision avoidance. It's slower though, but there are still ways to speed it up.

Skizz: Windows XP Professional supports multiple CPUs; I can open the Task Manager and see the workload on each virtual CPU.

Anonymous: Interesting... very interesting. What you say makes alot of sense. I did it that weird way because I couldn't find a good way in Windows of waiting for a thread to finish; the available functions seemed weird. Linux just has a nice simple pthread_join function. Ah well.

Share this post


Link to post
Share on other sites
Thanks darookie. I haven't looked at mutexes yet, but I used WaitForSingleObject (and then switched to WaitForMultipleObjects which is better for me) and I no longer get a framerate decrease when being multithreaded, so the AP was correct. But I don't get a noticeable framerate increase either.

I've now updated GDomin.zip with the 'fixed' version, Ysaneya, so I'd be interested to here your results. I probably need some sort of synchronization, though I'm not sure how that works... I get using a mutex variable to prevent conflicting writes, but no memory is written to by both threads.

Share this post


Link to post
Share on other sites
Works. No crash.. but i don't see any framerate difference between ST and MT. However i do see that the CPU usage in the task manager changes a lot. Between 20 and 50% in ST, and 50 to 80% in MT.

Y.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
It seems very odd that you get 50% boost in Linux. When you do two sequentical things as multithreaded, the time will be 100% + "relative time spent for switching between threads". Hyperthreading will only make the "relative time spent for switching between threads" lower -- Not negative! (At least that's how I remember hyperthreading) The multithreaded version should always be slower, but not necessarily much.

One possibility is that in Linux you have several computation-intensive threads running on background and all threads are same priority. So by doubling your own app's active thread count you will make Linux spend more time executing your app instead of other background threads. But the better solution would be to do it single-threaded and just increase that thread's priority.

Share this post


Link to post
Share on other sites
Quote:
Original post by CGameProgrammer
I did it that weird way because I couldn't find a good way in Windows of waiting for a thread to finish; the available functions seemed weird.

You can synchronize on the thread handle. When the thread exits its handle will be signalled so you can call WaitForSingleObject() or WaitForMultipleObjects() on the actual Threads[] variables. Your windows implementation is extremely badly written. Assuming your other two threads are completely CPU bound and have nothing to wait on, your original thread will take up 33% of the process' CPU time. I'm surprised you even see a small speedup, I'd expect this code to run longer than the single threaded version.

Share this post


Link to post
Share on other sites
Quote:
Original post by Anonymous Poster
It seems very odd that you get 50% boost in Linux. When you do two sequentical things as multithreaded, the time will be 100% + "relative time spent for switching between threads". Hyperthreading will only make the "relative time spent for switching between threads" lower -- Not negative! (At least that's how I remember hyperthreading) The multithreaded version should always be slower, but not necessarily much.

Well, I think it could give a (small) boost in performance. The thing about HT is that when one thread is stalling, waiting for memory, flushing the pipeline or whatever else, another can use the execution units. So in effect, you get a more effective use of the cpu. Since you get a lot of wasted idle time on a P4, HT can give you performance improvements. I doubt you'd get a 50% boost though. Sounds like there were something else holding back your singlethreaded version for some reason. Not sure if it speeds up the time for switching threads, but I'm pretty sure it allows on thread to work when another would otherwise stall the cpu.

However, HT isn't some magic performance-wizard. It's more of a hotfix for the horrible design of the Pentium 4... ;)
It only boosts performance because the cpu has so much wasted idle time normally.

Quote:

One possibility is that in Linux you have several computation-intensive threads running on background and all threads are same priority. So by doubling your own app's active thread count you will make Linux spend more time executing your app instead of other background threads. But the better solution would be to do it single-threaded and just increase that thread's priority.

Not sure how it works on Linux, but does a process get more cpu time if it has more threads? I thought it just split the process' timeslot between the threads, without actually allocating more time.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
i have a P4 3Ghz Northwood with HT, 512MB ram and a Radeon 9500 Pro 1280, WinXP
SP2.

MT runs @ 12fps, ST @ 25fps.

1) id say your probably using you MT methods incorrectly, are you piplining or using parallellization (I think this method would suit AI)?

2) dont use Mutexes' unless you need cross PROCESS synchronization (hint: you don't) mutexes will use around 600 cycles to sync, critical sections are much more light weight.

3) you're wasting your main process thread, put it to sleep or do your rendering preperation in it?

4) there are guides to increasing performace on HT systems on intel's website, use them :)

http://www.intel.com/technology/itj/2002/volume06issue01/art01_hyper/p01_abstract.htm
there are a few more, but intels site is so bad and im really tired.

Cheers,
-Dan

Share this post


Link to post
Share on other sites
Hmm, even I can see that you need a sleep in that while loop in your main function.

That first AP post was quite insightful.

50% speed boost might seem high for the linux version, but I've always understood that Windows' threading isn't particularly grand.
I think that an extra thread in linux may actually get you more total CPU time for your program (equal amount per thread of equal priority perhaps).
Whereas under windows your single thread already took most of the processor time, and your second one meant that you had to divide it up.

Share this post


Link to post
Share on other sites
Ysaneya: I have a theory why it's doing that, which I'm working on fixing. I think the problem is the two threads probably have wildly different workloads. They both evaluate the same number of ships, but ships that aren't moving are evaluated much quicker than moving ships, so they could be doing 16 moving ships in one thread and 16 stationary ones in the other. I'm changing the code to fix this and should have an updated build Wednesday.

Share this post


Link to post
Share on other sites
You need to use a semaphore to syncronize the execution of your worker threads.
They should only run when there's work for them to do.

Creating and destroying threads is expensive. I'd create a thread pool (with 2 threads) and bump the semphore by 2 when it's time to do the physics calculations. Then wait for the semaphore count to return to 0; each worker thread should reduce the count when it's done with its calclations. The main thread could make the first work split and configure some data structure with the part of the world each thread needs to handle. This ensures everything runs smoothly. You have to make special OS calls that put your threads to sleep until they are told it's time to wake up. The worker threads need to sleep until the semaphore is incremented by the main thread and the main thread needs to sleep until the count returns to 0. You might need to use two semaphores to make that happen; one to queue the the worker threads and another to tell the main thread when they are done.

Now with hyper-threading, I would not expect a performance increase with two busy threads. Hyper-threading works best if the competing threads are waiting for data most of the time.

Share this post


Link to post
Share on other sites
CGameProgrammer, I think your current version of the game doesnt work right. An older version (maybe a week?) used to tell me the %'s for each catagory as well as run at a great speed. Now I get 0% for everything, except for Misc-AI which is -1% and the FPS are extremely low. Do you have a older version I could take a look at? Thanks.

Share this post


Link to post
Share on other sites
Neat idea, Magmai. So I create two threads and a semaphore (initialized to 0) initially, and the threads sleep until the semaphore is above 0, and then they decrement it and do their calculations. So when it's time for the calculations, the main thread just sets the semaphore to 2 and waits for it to be 0 again. That makes alot of sense.

EDIT: I started to code this, but I am confused on how to implement it in Windows. On initialization, I create the two threads, and I call:

ThreadSignal = CreateSemaphore( NULL, 0, 2, NULL );


So that creates a semaphore with an initial value of 0 (and max of 2). The body of the thread function looks like this:


while( 1 )
{
WaitForSingleObject( ThreadSignal, INFINITE );

// Do AI
}


So that should wait for ThreadSignal to be greater than zero, then decrement it and proceed with the AI. The main thread does this when it's time to run the AI:

ReleaseSemaphore( ThreadSignal, 2, NULL );


Setting the semaphore to 2, thus allowing the threads to run. But my problem is the main thread needs to wait for the semaphore to be 0, and I don't know how to do that. Also, it seems like it could be theoretically possible for a single thread to decrement the semaphore twice before the other thread ever gets a chance to look at it. So I don't know how the semaphore code is supposed to work.

Drew_Benton: I do not, but it's worrying that you're having problems... all statistics are reported correctly for me. I assume you downloaded it after I made my lastest post? Oh, also it would be useful to know what OS you're using.

[Edited by - CGameProgrammer on December 17, 2004 3:52:31 AM]

Share this post


Link to post
Share on other sites
Quote:
Original post by Ysaneya
A little improvement now. Framerate is 18 ST and 22 MT when the 3 groups fight. 60 when all's idle (vsync *is* disabled).

Y.

That's a smaller change than I would have expected, though it's possible it's "correct". For me, the AI is about 60% of the workload (50% collision, 10% sim), so if it takes 10 seconds (bear with me) to render a frame, with 6 seconds for the AI, and that changed to 3 seconds, it would be 7 seconds per frame, or 10/7 = 42% improvement. That would make 18 FPS change to 25. Maybe the AI is a smaller portion of the workload for you.

Share this post


Link to post
Share on other sites
What you say would be true if you had two threads, having the same workload, working in paralell. I'm suspecting it's not the case.

What probably happens is that your first thread has a lot more work to do than your second one, because the first thread does a lot more things than A.I.: rendering, collisions, logic, input, etc.. while your second thread only does A.I. The second thread is probably sleeping a bit while the first one orders it to start the A.I. work.

Y.

Share this post


Link to post
Share on other sites
No no no, you misunderstand how the threads work. The main process spawns two threads and then sleeps until those threads have finished. The child threads do AI and nothing more.

The way the workload was split between the two threads has changed a number of times, but with the last build I posted, the way I do it is I have 32 ships evaluated each frame, but the threads do not necessarily do 16 ships each. In pseudocode, the thread function is this:

while(1)
{
EnterCriticalSection
Index = SharedData->Counter
SharedData->Counter ++
LeaveCriticalSection

Ship[Index].DoAI( );

if( Index == SharedData->LastShip )
break;
}

Counter could initially be 0, for example, and LastShip could be 31. So each thread evaluates ships until a specified number of them have been evaluated. This code ensures the threads do roughly the same amount of work, even if that means one thread does 24 ships and the other does 8.

Share this post


Link to post
Share on other sites
But it doesn't matter if you have a parent thread and then two threads for AI, the problem remains.

Your two AI threads (now balanced) are working in parallel. But they're not doing that continuously, they are interrupted when the parent is doing its own stuff. So you have the two processors busy for a short amount of time (when doing the AI), but the rest of the time only one CPU is really working. So you cannot expect your framerate to double. Maybe you can check exactly how much time the A.I. threads are taking, and how much the parent thread is using.

Y.

Share this post


Link to post
Share on other sites

This topic is 4746 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this