andhow

Is context switching really this fast?


Our game uses multiple threads to handle slow or blocking operations, and it works very well at simplifying our development, but I was worried that the overhead introduced by the extra context switching might be secretly eating our precious frame time. I wrote a tiny test to see how much slower it would be to do something once in X threads rather than X times in one thread. The results were surprising: not counting thread creation and destruction time, which is quite expensive, the threaded version was as fast or faster.

I'm almost positive my test must be incorrect, but it's pretty simple and I can't find any problems. Assuming the test is correct, the only rationalization of the results I can think of is that having a bunch of threads grabs up every ounce of CPU time, more so than Windows gives to a single thread. I posted the full compiling code below. Is the test wrong? Does anyone have a better explanation? Thanks.
#include <windows.h>
#include <process.h>
#include <cmath>
#include <iostream>
using namespace std;

static const int NumThreads = 20;

LARGE_INTEGER g_freq, g_bef;
volatile long g_go = 0, g_count = 0;

void lot_o_math()   // just waste a reliable amount of time
{
  float x = 1;
  for(size_t i=0; i<1000000; ++i)
    x = sin(x);
}

unsigned int __stdcall do_math(void*)
{
  while(!g_go);
  lot_o_math();
  // use the increment's return value so only the true last thread reports
  if (InterlockedIncrement(&g_count) == NumThreads) // if last thread, display timing results
  {
    LARGE_INTEGER aft;
    QueryPerformanceCounter(&aft);
    cout << "time: " << double(aft.QuadPart-g_bef.QuadPart)/g_freq.QuadPart << endl;
    InterlockedIncrement(&g_count); // signal main that the report is done
  }
  return 0;
}

int main()
{
  LARGE_INTEGER bef, aft;
  QueryPerformanceFrequency(&g_freq);

  // single threaded
  QueryPerformanceCounter(&bef);
  for(size_t i=0; i<NumThreads; ++i)
    lot_o_math(); 
  QueryPerformanceCounter(&aft);
  cout << "time: " << double(aft.QuadPart-bef.QuadPart)/g_freq.QuadPart << endl;

  // create a bunch of threads (busy waiting on g_go)
  for(size_t i=0; i<NumThreads; ++i)
    CloseHandle((HANDLE)_beginthreadex(0, 0, do_math, 0, 0, 0));

  // record start time and release the threads
  QueryPerformanceCounter(&g_bef);
  InterlockedIncrement(&g_go);
  
  // wait until all threads are finished
  while(g_count < NumThreads+1)
    Sleep(1000);

  cout << "Enter a key" << endl;
  char c;
  cin >> c;
}

Do you have multiple processors, or hyper-threading turned on? If so, then it would definitely run faster using multiple threads.

Just beware that your routine 'lot_o_math' isn't reliable benchmark code. You must use the result somehow, although it's good that you're feeding each result back into the next iteration. However, the world's smartest compilers may discover that x is not used in any way after calling sin, so omitting the loop entirely would produce the same result, only faster.
The best way to make sure your benchmark is valid is to cout the result, and to look through your code and check that leaving out any part of it would cause a different answer to be output.

Don't worry about the slowness of context switching. It most likely won't be your bottleneck. Also, threading doesn't usually simplify development, as it introduces the extra need to use mutexes & critical sections, etc. Threading simply serves a purpose - to make things happen in parallel.

Quote:
Original post by iMalc
Don't worry about the slowness of context switching. It most likely won't be your bottleneck. Also, threading doesn't usually simplify development, as it introduces the extra need to use mutexes & critical sections, etc. Threading simply serves a purpose - to make things happen in parallel.


It most likely won't - unless the OP gets thread happy. It's all in the design. If every entity ends up having its own processing thread, there could be hundreds of threads, if not thousands, in an RTS. If there's only a handful, threads will almost certainly have no noticeable performance issues. As you point out, it could open up some nasty synchronization issues.

Unless you have thousands of threads, context switching is minimal overhead. There is much more of a problem with synchronization and locking: if you don't do that properly, your program can slow to a crawl or deadlock.

Consider that the primary reason for a thread to stall is a hit to memory or the L2 cache. These latencies can be a lot of cycles (on the order of tens, or even hundreds for main memory). Remember also that an OS can schedule a new thread into that gap in constant time. Also, realize that typical code run on a Pentium 4 spends a massive amount of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even on a single CPU, multiple threads can do better, context switches and all.

Now, the need to synchronize can be a difficult one, cutting sharply into your performance benefits. Lockless structures, or simply threads that don't need to sync, can be invaluable for threaded performance.

You probably know this, but there is a huge difference between thread switching (same memory context) and process switching (switching address spaces).

EDIT: oops please ignore the following lines :) sorry!
switching between threads isn't much more than pushing the IP (instruction pointer) and stack pointer and popping two others - almost the same performance hit as calling a function.

<-- sorry, I forgot the rest of the registers; it's been a while since I learnt inner CPU workings! (I try to stay as top-level as I can)

Iftah.

[Edited by - Iftah on October 7, 2005 11:53:02 PM]

Quote:
Original post by Iftah
switching between threads isn't much more than pushing the IP (instruction pointer) and stack pointer and popping two others - almost the same performance hit as calling a function.


You need to save all the registers, which includes FPU/MMX and SSE states. That's the main thing.

Well, heck, if you want to simplify a thread switch to saying all you need to do is save and restore two CPU registers, you might as well simplify process switching to saying all you need to do is restore the CPU state and load a new page table. That's "all there is to it!"

Looking for performance bottlenecks in things you can't control (e.g., how long it takes for threads to swap out) is not the right way to approach optimization, and neither is attacking a problem before you even know it exists. Just make intelligent design choices, and if performance becomes an issue later in development, deal with it then; otherwise, the only bottleneck you're going to have is a development stall when you drive yourself batty trying to find ways to cut corners that are already round.

MSVC will definitely remove lot_o_math() as dead code.
Be sure to profile release builds after you return the result from lot_o_math and print it out.


You have to save the entire register state, including the FPU, when you switch threads. To switch processes there might be only one more register to restore, for the page table; the likelihood of a cache miss goes way up, though.

Guest Anonymous Poster
Thanks for all the comments, everyone. This and further testing has definitely made up my mind that context switching (within a process) is Fast Enough (TM).

Quote:
Also, threading doesn't usually simplify development as it introduces the extra need to use mutexes & critical sections etc.

If not designed properly, I imagine that situation might come up. Our game uses a sort of job/assembly-line abstraction which makes race conditions pretty simple to eliminate at design time and deadlocks/high contention/priority inversion nonexistent.

Quote:
MSVC will definetly remove lot_o_math() as dead-code.

It doesn't if you turn off optimizations. I can step through the assembly and verify it's doing what I want it to be doing. I thought cout would be inappropriate because it is synchronized, and this is testing concurrency...

After more testing and looking through Windows Internals (great book), I realized why there was no perf hit for context switching: the OS is not scheduling any more frequently! There may be X threads active, but there are no more context switches per unit time than if there were one active thread. A better way to test context-switching overhead would be to make several SwitchToThread calls per scheduling quantum, so that I was actually increasing the number of context switches.

Quote:
Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or the L2 cache. These latencies can be a lot of cycles (on the order of tens, or even hundreds for main memory). Remember also that an OS can schedule a new thread into that gap in constant time. Also, realize that typical code run on a Pentium 4 spends a massive amount of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even on a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


Quote:
Original post by Omaha
Looking for performance bottlenecks in things you can't control (e.g., how long it takes for threads to swap out) is not the right way to approach optimization, and neither is attacking a problem before you even know it exists. Just make intelligent design choices, and if performance becomes an issue later in development, deal with it then; otherwise, the only bottleneck you're going to have is a development stall when you drive yourself batty trying to find ways to cut corners that are already round.


I disagree completely. It's very important to find performance bottlenecks you can't control at the beginning. If you have no control over something, you'd better plan for it from the start, because you won't be able to change it later. Those are the kinds of changes that can cause a massive rewrite and a monumental waste of time.

In the real world of this example, the thread situation will most likely not be an issue; however, if the OP is planning on using 1000 threads, he should do some research first to determine whether or not it will become an issue before going further down the road.

Quote:
Original post by Troll
Quote:
Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or the L2 cache. These latencies can be a lot of cycles (on the order of tens, or even hundreds for main memory). Remember also that an OS can schedule a new thread into that gap in constant time. Also, realize that typical code run on a Pentium 4 spends a massive amount of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even on a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


That's true, but you can find a good balance where the threads are simply scheduled in between each other as they stall on cache misses. The P4 in particular (not sure about Athlons or the P3) can have a huge number of outstanding cache misses at once while still scheduling new threads.

Quote:
Original post by Anonymous Poster
Thanks for all the comments, everyone. This and further testing has definitely made up my mind that context switching (within a process) is Fast Enough (TM).

Quote:
Also, threading doesn't usually simplify development, as it introduces the extra need to use mutexes & critical sections, etc.

If not designed properly, I imagine that situation might come up. Our game uses a sort of job/assembly-line abstraction which makes race conditions pretty simple to eliminate at design time and deadlocks/high contention/priority inversion nonexistent.
That's good to hear, then. It's nice when it's easy to do threading.
Quote:
MSVC will definetly remove lot_o_math() as dead-code.

It doesn't if you turn off optimizations. I can step through the assembly and verify it's doing what I want it to be doing. I thought cout would be inappropriate because it is synchronized, and this is testing concurrency...
You'll find a number of people unhappy with a benchmark being run without optimisations. However, in this case you're actually measuring the performance of the OS, so I suppose you can be let off.
cout can have flags set to turn the sync off, but that probably isn't necessary either. All you need to do is cout a single number when the run has finished; even after you've output the timing results is okay, so long as the value eventually output depends upon all of the code being benchmarked.
Quote:


After more testing and looking through Windows Internals (great book), I realized why there was no perf hit for context switching: the OS is not scheduling any more frequently! There may be X threads active, but there are no more context switches per unit time than if there were one active thread. A better way to test context-switching overhead would be to make several SwitchToThread calls per scheduling quantum, so that I was actually increasing the number of context switches.
True, although I imagine you won't need to do that within your program, so whatever you discover won't be so important.

Quote:
True, although I imagine you won't need to do that within your program, so whatever you discover won't be so important.

We do, actually, though not for the same reason that everyone seems to use (and complain about) Sleep. We run our main thread at normal priority and the AI and resource-loading threads at below-normal and lowest priority, respectively. This is great because a spike in AI pathfinding or resource loading doesn't touch the FPS; however, to avoid starving those threads, at the end of every main-thread frame we Sleep(1) to give up the rest of the timeslice.

Quote:
Original post by Troll
Quote:
Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or the L2 cache. These latencies can be a lot of cycles (on the order of tens, or even hundreds for main memory). Remember also that an OS can schedule a new thread into that gap in constant time. Also, realize that typical code run on a Pentium 4 spends a massive amount of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even on a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


So it is diminishing returns, then. If every other thread misses the cache, you completely remove the benefit of caching. And on-chip cache is probably the single most important thing on modern CPUs.

Using threads smartly can really help an application. But blindly using them to improve performance or "responsiveness" is useless and will harm your program significantly in the end.

I once heard a computer science professor state something to the tune of: computers are a parallel technology, and trying to write a serially-run program won't work. Well, duh - computers are (largely) serial to the programmer. Imagining them as magical parallel-executing boxes doesn't help worth a darn.
