Is context switching really this fast?

14 comments, last by etothex 18 years, 6 months ago
Our game uses multiple threads to handle slow or blocking operations, and it works very well at simplifying our development, but I was worried that the overhead introduced by the extra context switching might be secretly eating our precious frame time. I wrote a tiny test to see how much slower it would be to do something once in X threads rather than X times in one thread.

The results were surprising: not counting thread creation and destruction time, which is quite expensive, the threaded version was as fast or faster. I am almost positive my test must be incorrect, but it's pretty simple and I can't find any problems. Assuming the test is correct, the only rationalization of the results I can think of is that having a bunch of threads grabs up every ounce of CPU time, more so than Windows gives to a single thread.

I posted the full compiling code below. Is the test wrong? Does anyone have any better explanations? Thanks.
#include <windows.h>
#include <process.h>
#include <cmath>
#include <iostream>
using namespace std;

static const int NumThreads = 20;

LARGE_INTEGER g_freq, g_bef;
volatile long g_go = 0, g_count = 0;

void lot_o_math()   // just waste a reliable amount of time
{
  float x = 1;
  for(size_t i=0; i<1000000; ++i)
    x = sin(x);
}

unsigned int __stdcall do_math(void*)
{
  while(!g_go);
  lot_o_math();
  // InterlockedIncrement returns the new value, so exactly one thread
  // (the last to finish) sees NumThreads and prints the timing results
  if (InterlockedIncrement(&g_count) == NumThreads)
  {
    LARGE_INTEGER aft;
    QueryPerformanceCounter(&aft);
    cout << "time: " << double(aft.QuadPart-g_bef.QuadPart)/g_freq.QuadPart << endl;
    InterlockedIncrement(&g_count); // signal main that the results are printed
  }
  return 0;
}

int main()
{
  LARGE_INTEGER bef, aft;
  QueryPerformanceFrequency(&g_freq);

  // single threaded
  QueryPerformanceCounter(&bef);
  for(size_t i=0; i<NumThreads; ++i)
    lot_o_math(); 
  QueryPerformanceCounter(&aft);
  cout << "time: " << double(aft.QuadPart-bef.QuadPart)/g_freq.QuadPart << endl;

  // create a bunch of threads (busy waiting on g_go)
  for(size_t i=0; i<NumThreads; ++i)
    CloseHandle((HANDLE)_beginthreadex(0, 0, do_math, 0, 0, 0));

  // record start time and release the threads
  QueryPerformanceCounter(&g_bef);
  InterlockedIncrement(&g_go);
  
  // wait until all threads are finished
  while(g_count < NumThreads+1)
    Sleep(1000);

  cout << "Enter a key" << endl;
  char c;
  cin >> c;
}
And how!
Do you have multiple processors, or hyper-threading turned on? If so, then it would definitely run faster using multiple threads.

Just beware that your routine 'lot_o_math' isn't reliable benchmark code. You must use the result somehow. It's good that you're feeding each result back into the next iteration, but the world's smartest compilers may discover that x is never used after the call to sin, so omitting the loop entirely would produce the same observable result, only faster.
The best way to ensure your benchmark is valid is to cout the result, and to look through your code to ensure that leaving out any part of it would cause it to output a different answer.

Don't worry about the slowness of context switching. It most likely won't be your bottleneck. Also, threading doesn't usually simplify development, as it introduces the extra need for mutexes, critical sections, etc. Threading simply serves a purpose: to make things happen in parallel.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Quote:Original post by iMalc
Don't worry about the slowness of context switching. It most likely won't be your bottleneck. Also, threading doesn't usually simplify development, as it introduces the extra need for mutexes, critical sections, etc. Threading simply serves a purpose: to make things happen in parallel.


It most likely won't - unless the OP gets thread happy. It's all in the design. If every entity ends up having its own processing thread, there could be hundreds of threads, if not thousands, in an RTS. If there's only a handful, threads will almost certainly have no noticeable performance issues. As you point out, it could open up some nasty synchronization issues.
Unless you have 1000s of threads, context switching is minimal overhead. There is much more of a problem with synchronization and locking. If you don't do that properly, your program can slow to a crawl or deadlock.
Consider that the primary reason for a thread to stall is a miss to the L2 cache or main memory. These latencies run to tens or even hundreds of cycles for main memory. Remember also that an OS can schedule a new thread into that gap in constant time, and that typical code run on a Pentium 4 spends massive amounts of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that the threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even for a single CPU, multiple threads can do better, context switches and all.

Now, the need to synchronize can be a difficult one, cutting sharply into those performance benefits. Lockless structures, or simply threads that don't need to sync, can be invaluable for threaded performance.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
You probably know this, but there is a huge difference between thread switching (same memory context) and process switching (swapping memory).

EDIT: oops please ignore the following lines :) sorry!
switching between threads isn't much more than pushing the IP (instruction pointer) and stack pointer and popping two others - almost the same performance hit as calling a function.

<-- sorry, i forgot the rest of the registers, been a while since i learnt inner cpu workings! (i try to stay as top level as i can)

Iftah.

[Edited by - Iftah on October 7, 2005 11:53:02 PM]
Quote:Original post by Iftah
switching between threads isn't much more than pushing the IP (instruction pointer) and stack pointer and popping two others - almost the same performance hit as calling a function.


You need to save all the registers, which includes FPU/MMX and SSE states. That's the main thing.
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Well, heck, if you want to simplify a thread switch to saying all you need to do is save-and-restore two CPU registers you might as well simplify process switching to saying all you need to do is restore the CPU state and load a new pagetable. That's "all there is to it!"

Looking for performance bottlenecks in things that you can't control (e.g. how long it takes for threads to swap out) is not the right way to attack optimization, and neither is attacking it before you even know there's a problem. Just make intelligent design choices, and if performance becomes an issue later in development, deal with it then. Otherwise the only bottleneck you're going to have is a development stall when you drive yourself batty trying to find ways to cut corners that are already round.
MSVC will definitely remove lot_o_math() as dead code.
Be sure to profile release builds after you return the result from lot_o_math and print it out.


You have to save the entire register state, including the FPU, when you switch threads. To switch processes there might be only one more register to restore, for the memory table, but the likelihood of a cache miss goes way up.

- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara
Thanks for all the comments everyone. This and further testing has definitely made up my mind that context switching (within a process) is Fast Enough (TM).

Quote:Also, threading doesn't usually simplify development as it introduces the extra need to use mutexes & critical sections etc.

If not designed properly, I imagine that situation might come up. Our game uses a sort of job/assembly line abstraction which makes race conditions pretty simple to eliminate at design time and deadlocks/high contention/priority inversion non existent.

Quote:MSVC will definitely remove lot_o_math() as dead code.

It doesn't if you turn off optimizations. I can step through the assembly and verify it's doing what I want it to be doing. I thought cout would be inappropriate because it is synchronized, and this is testing concurrency...

After more testing and looking over Windows Internals (great book), I realized why there was no perf hit for context switching: the scheduler is not switching any more frequently! There may be X threads active, but there are no more context switches per unit time than if there were 1 active thread. A better way to test context-switching overhead would be to make several SwitchToThread calls per scheduling quantum, so that I was actually increasing the number of context switches.
