Is context switching really this fast?

14 comments, last by etothex 18 years, 6 months ago
Quote:Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or L2 cache. These latencies can be rather a lot of cycles (on the order of tens or even hundreds for main memory). Remember also that an OS can schedule a new thread in that gap in constant time. Also, realize that typical code run on a Pentium 4 spends massive amounts of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that the threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even for a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


Quote:Original post by Omaha
Looking for performance bottlenecks in things that you can't control (e.g. how long it takes for threads to swap out) is not the right way to attack optimization, and neither is attacking it before you even know there's a problem. Just make intelligent design choices, and if performance becomes an issue later in development, deal with it then; otherwise the only bottleneck you're going to have is a development stall when you drive yourself batty trying to find ways to cut corners that are already round.


I disagree completely. It's very important to find performance bottlenecks you can't control at the beginning. If you have no control over something, you better plan for it from the beginning because you will be unable to change it later. It's those kinds of changes that can cause a massive re-write and a monumental waste of time.

In the real world of this example, the thread situation will most likely not be an issue; however, if the OP is planning on using 1000 threads, he should do some research first to determine whether it will become an issue before going further down the road.
Quote:Original post by Troll
Quote:Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or L2 cache. These latencies can be rather a lot of cycles (on the order of tens or even hundreds for main memory). Remember also that an OS can schedule a new thread in that gap in constant time. Also, realize that typical code run on a Pentium 4 spends massive amounts of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that the threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even for a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


That's true, but you can find a good balance where the threads are simply scheduled in between each other as they stall on cache misses. The P4 in particular (not sure about Athlons or P3) can have a huge number of outstanding cache misses at once, while still scheduling new threads.
Quote:Original post by Anonymous Poster
Thanks for all the comments, everyone. This and further testing have definitely made up my mind that context switching (within a process) is Fast Enough (TM).

Quote:Also, threading doesn't usually simplify development as it introduces the extra need to use mutexes & critical sections etc.

If not designed properly, I imagine that situation might come up. Our game uses a sort of job/assembly-line abstraction, which makes race conditions pretty simple to eliminate at design time and deadlocks/high contention/priority inversion nonexistent.
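For what it's worth, here is a minimal sketch of what such a job/assembly-line handoff might look like. The JobQueue/Job names and the choice of a Win32 critical section are my own assumptions for illustration, not the actual game's code:

#include <windows.h>
#include <queue>

// A job is just a function pointer plus a context pointer.
typedef void (*JobFn)(void* userData);
struct Job { JobFn fn; void* userData; };

class JobQueue
{
public:
    JobQueue()  { InitializeCriticalSection(&cs_); }
    ~JobQueue() { DeleteCriticalSection(&cs_); }

    // Producer (e.g. the main thread) pushes work for a worker thread.
    void Push(const Job& job)
    {
        EnterCriticalSection(&cs_);
        jobs_.push(job);
        LeaveCriticalSection(&cs_);
    }

    // Consumer pops one job; returns false when the queue is empty.
    bool TryPop(Job& out)
    {
        EnterCriticalSection(&cs_);
        bool hasJob = !jobs_.empty();
        if (hasJob) { out = jobs_.front(); jobs_.pop(); }
        LeaveCriticalSection(&cs_);
        return hasJob;
    }

private:
    CRITICAL_SECTION cs_;
    std::queue<Job> jobs_;
};

Because all shared state flows through the queue, the only lock is around the push/pop itself, which is where the "eliminate race conditions at design time" property comes from.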
That's good to hear, then. It's nice when it's easy to do threading.
Quote:

Quote:MSVC will definitely remove lot_o_math() as dead-code.

It doesn't if you turn off optimizations. I can step through the assembly and verify it's doing what I want it to be doing. I thought cout would be inappropriate because it is synchronized, and this is testing concurrency...
You'll find a number of people not happy with a benchmark being run without optimisations on. However, in this case you're actually measuring the performance of the OS, so I suppose you can be let off.
cout can have flags set to turn the sync off. However, that probably isn't necessary either. All you need to do is cout a single number when the thing has finished; even doing so after you've output the timing results is okay, so long as the value eventually output depends upon all of the code being benchmarked.
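As a rough sketch of that suggestion (lot_o_math here is just a stand-in workload, not the OP's actual function, and the QueryPerformanceCounter timing is an assumption): the point is that the final cout depends on every iteration of the benchmarked code, so the optimizer can't discard it even with optimizations enabled:

#include <windows.h>
#include <iostream>

double lot_o_math()              // placeholder workload
{
    double x = 1.0;
    for (int i = 1; i < 1000000; ++i)
        x += 1.0 / (x * i);
    return x;
}

int main()
{
    std::ios_base::sync_with_stdio(false);   // optional: decouple iostreams from C stdio

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    double sum = 0.0;
    for (int i = 0; i < 100; ++i)
        sum += lot_o_math();     // the result feeds the output below

    QueryPerformanceCounter(&end);

    // Print the timing first, then the checksum the whole benchmark depends on.
    std::cout << (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart
              << " ms (checksum " << sum << ")\n";
    return 0;
}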
Quote:

After more testing and looking over Windows Internals (great book), I realized why there was no perf hit for context switching: it is not scheduling any more frequently! There may be X threads active, but there are no more context switches per unit time than if there were one active thread. A better way to test context switching overhead would be to have several SwitchToThread calls per scheduling quantum, so that I was actually increasing the number of context switches.
True, although I imagine you won't need to do that within your program, so whatever you discover won't be so important.
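For anyone who does want to force extra switches, something like this would do it. The thread count and iteration count are arbitrary, and this is only a sketch of the measurement, not a rigorous benchmark:

#include <windows.h>
#include <iostream>

DWORD WINAPI Worker(LPVOID)
{
    for (int i = 0; i < 100000; ++i)
        SwitchToThread();        // yield the rest of the timeslice if another ready thread exists
    return 0;
}

int main()
{
    const int kThreads = 4;
    HANDLE threads[kThreads];

    DWORD start = GetTickCount();
    for (int i = 0; i < kThreads; ++i)
        threads[i] = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForMultipleObjects(kThreads, threads, TRUE, INFINITE);
    std::cout << "elapsed: " << GetTickCount() - start << " ms\n";

    for (int i = 0; i < kThreads; ++i)
        CloseHandle(threads[i]);
    return 0;
}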
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
As dated as it is, this still might provide some insight, Win2K Quantums
"I thought what I'd do was, I'd pretend I was one of those deaf-mutes." - the Laughing Man
Quote:True, although I imagine you won't need to do that within your program, so whatever you discover won't be so important.

We do, actually, though not for the same reason that everyone seems to use (and complain about) Sleep. We run our main thread at normal priority and the AI and resource-loading threads at below-normal and lowest priority, respectively. This is great because a spike in AI pathfinding or resource loading doesn't touch the FPS; however, to avoid starving these threads, at the end of every main-thread frame we Sleep(1) to give up the rest of the timeslice.
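Roughly what that setup might look like, as a sketch only. AiThread and LoaderThread are hypothetical stand-ins for the real thread procedures, and the loop body is elided:

#include <windows.h>

DWORD WINAPI AiThread(LPVOID)     { /* pathfinding work */ return 0; }
DWORD WINAPI LoaderThread(LPVOID) { /* resource streaming */ return 0; }

int main()
{
    HANDLE ai     = CreateThread(NULL, 0, AiThread, NULL, 0, NULL);
    HANDLE loader = CreateThread(NULL, 0, LoaderThread, NULL, 0, NULL);

    SetThreadPriority(ai,     THREAD_PRIORITY_BELOW_NORMAL);
    SetThreadPriority(loader, THREAD_PRIORITY_LOWEST);

    bool running = true;
    while (running)
    {
        // ... simulate and render one frame on the normal-priority main thread ...

        Sleep(1);           // give up the rest of the timeslice so the low-priority threads aren't starved
        running = false;    // placeholder exit so the sketch terminates
    }

    WaitForSingleObject(ai, INFINITE);
    WaitForSingleObject(loader, INFINITE);
    CloseHandle(ai);
    CloseHandle(loader);
    return 0;
}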
And how!
Quote:Original post by Troll
Quote:Original post by Promit
Consider that the primary reason for a thread to stall is a hit to memory or L2 cache. These latencies can be rather a lot of cycles (on the order of tens or even hundreds for main memory). Remember also that an OS can schedule a new thread in that gap in constant time. Also, realize that typical code run on a Pentium 4 spends massive amounts of time (20%-50%) waiting for L2 or main memory. Combine all these factors and what you get is that the threads can very efficiently fill in the stalls. This has diminishing returns, obviously, so for any given code, CPU, main memory, etc. you'll find various optimal points. But even for a single CPU, multiple threads can do better, context switches and all.


But as the number of threads grows, the likelihood of cache misses grows as each thread is hitting memory.


So it is diminishing returns, then. If every other thread misses the cache, then you completely remove the benefit of the cache. And on-chip cache is probably the single most important thing on modern CPUs.

Using threads smartly can really help an application. But blindly using them to improve performance or "responsiveness" is completely useless and will harm your program significantly in the end.

I once heard a computer science professor state something to the tune of: computers are a parallel technology, and trying to write a serially-run program won't work. Well, duh, computers are (largely) serial to the programmer. Imagining them as magical parallel-executing boxes doesn't help worth a darn.

