Multithreading using wait/notification mechanism



We have what we call a "safe sleep" class that picks either Sleep(0) or Sleep(1) based on an internal counter. You just keep calling Sleep() on the class in your tight loop, and every X calls it actually calls Sleep(1) instead of Sleep(0). This helps keep responsiveness relatively high in things like spinlocks where we don't expect to be spinning for very long, but also allows lower-priority threads to run on occasion in case one of those is the one holding the lock (thus avoiding deadlock due to thread starvation).
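For illustration, a minimal sketch of that idea (the class, member names and the default interval are hypothetical, not the actual engine code):

// Sketch of a "safe sleep" helper: every Nth call sleeps for 1 ms so lower-priority
// threads (possibly the lock holder) get scheduled; otherwise Sleep(0) only yields
// to ready threads of equal or higher priority.
#include <windows.h>

class SafeSleep
{
public:
    explicit SafeSleep(unsigned interval = 16) : mInterval(interval), mCount(0) {}

    void Sleep()
    {
        if (++mCount % mInterval == 0)
            ::Sleep(1);   // give up the rest of the time slice
        else
            ::Sleep(0);   // just offer the slice to other ready threads
    }

private:
    unsigned mInterval;
    unsigned mCount;
};

// Usage in a spinlock-style wait:
//   SafeSleep backoff;
//   while (!TryAcquire(lock))
//       backoff.Sleep();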

This kind of "stuttered" sleep, similar to what Hodgman suggested, is something I will look at. I did some profiling yesterday and almost 40% of the profiled time is in the Sleep function. But this is certainly a result of the high number of (builder) threads I talked about in the post above (answer to AllEightUp), that is, currently 12 threads on a 4-core CPU.


This leads to another question. E.g. on a 4-core machine I use 1 main thread, 3 job workers and up to 8 builder threads.


The general "rule of thumb" I've seen batted around for best threading performance is 1 thread per core, hyperthreaded cores included. This is because context switching between threads on the same core is very expensive, so the fewer threads you have, the fewer context switches need to be done.
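As a small illustration (not from the original posts), the logical core count, hyperthreaded cores included, can be queried portably like this:

#include <cstdio>
#include <thread>

int main()
{
    // Logical processors; hyperthreaded cores count individually.
    unsigned logicalCores = std::thread::hardware_concurrency();
    // Rule-of-thumb sizing: one worker per logical core, minus one for the main thread.
    unsigned workerCount = (logicalCores > 1) ? logicalCores - 1 : 1;
    std::printf("%u logical cores -> %u worker threads\n", logicalCores, workerCount);
    return 0;
}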

My intention is that the builder threads fill the unused capacity when a core is not involved in another task (main/job). Thinking about it, what would be a good "configuration" for these builder threads?

Should I restrict them to #cores - 1 threads, so that the main thread is unhindered all the time? Can I use 8 threads if I set their priority quite low, or would this result in a system-wide reduction of priority (that is, other applications would get more CPU time than my running game)? How can I tell the system that the main and job threads should be most important when they are active? Priority again?


I don't think you can tell Windows what cores to run threads on. Consoles like the Playstation and Xbox will let you, however (they'll also tell you what cores are reserved or partially reserved for OS tasks).

I think you're going to get better performance if you pay attention to what jobs you are running in your job pools and when, rather than trying to micromanage where threads run anyway. You generally don't want tasks that write to the same shared data running at the same time, to reduce lock contention. (Multiple readers are fine, of course.)

Of course, you should be using your profiler to find where lock contention is happening and where you have various CPU cores idling while waiting for work - but I assume you're doing that already :)

Edit: Thread priority is something you can change, however. For Windows, you use the SetThreadPriority function. The remarks on that page and the scheduling priorities page explain how priority affects scheduling on a Windows system. If you are not on Windows, you'll want to read your OS's documentation.
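A minimal, hypothetical example of using SetThreadPriority to de-prioritize a builder thread on Windows (the thread function and handle names are made up for illustration):

#include <windows.h>

// Hypothetical builder thread entry point; it lowers its own priority so job
// workers and the main thread win the scheduler more often.
DWORD WINAPI BuilderThreadProc(LPVOID)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
    // ... build data ...
    return 0;
}

int main()
{
    HANDLE builder = CreateThread(nullptr, 0, BuilderThreadProc, nullptr, 0, nullptr);
    // Alternatively, from outside the thread: SetThreadPriority(builder, THREAD_PRIORITY_LOWEST);
    WaitForSingleObject(builder, INFINITE);
    CloseHandle(builder);
    return 0;
}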

AllEightUp, on 20 May 2015 - 10:45 AM, said:

Often my overall time goes 'down', primarily because, while I reduce contention as much as possible via partitioning, there is still a certain amount of atomic operation contention which is reduced by dropping worker threads.
I have a similar approach, that is, a high degree of partitioning and atomic sync. E.g. I have two thread pools, one for the job scheduler and one pool of "data builders"; the builders only communicate atomically with the main thread and use a really lax spinlock (10 ms sleep).

This leads to another question. E.g. on a 4-core machine I use 1 main thread, 3 job workers and up to 8 builder threads. The builder threads create data and are quite independent of the rest of the game (read-only access to game data, no further sync, errors are accepted). On the other hand, the job workers are only active for a defined duration during a frame and should be as fast as possible. The main thread does all the sync management, takes over the role of a job worker, and does some important, non-concurrent work in the shadow of the GPU. That is, when the GPU has been fed and the job scheduler is done, I use the time the GPU needs to finish its work to do all the single-threaded game logic working on a large, unshared game world.

My intention is that the builder threads fill the unused capacity when a core is not involved in another task (main/job). Thinking about it, what would be a good "configuration" for these builder threads?

Should I restrict them to #cores - 1 threads, so that the main thread is unhindered all the time? Can I use 8 threads if I set their priority quite low, or would this result in a system-wide reduction of priority (that is, other applications would get more CPU time than my running game)? How can I tell the system that the main and job threads should be most important when they are active? Priority again?


This sounds very much like the division I use, though I kept it generic and called it asynchronous task based processing (aka: Thread Pool with Work Queue) and the n-core horizontal distribution system. The problem was always that those two different things tend to fight each other at the most inopportune times. For instance, let's take texture requests. I call them async tasks because they may spend several frames loading data, so I don't want them tied into the main loop distribution system. But things are never as simple as they seem on the surface, so let me describe the steps involved just so some items are clear:

1. Load (or pull from a metadata cache) the texture header so we have size, mip count etc information. (IO bound or potentially memory bound.)
2. Using #1 information, send a request to the rendering thread to give me an appropriate buffer to store the texture in. The buffer may be a transient chunk of memory the renderer will feed to the Gfx api or it may be the real resource memory block, not up to me to decide on this side of things. (Frame latency bound, i.e. once a frame I pump a message queue and push back results.)
3. Load the actual data. (IO bound again.)
4. For fun, assume it is an 8k x 8k backdrop image and for some reason it is on disk as a jpg, decompress and push it into the buffer. Potentially build mips, maybe even do runtime dxt compression etc. (CPU bound.)
5. Send the 'complete' back to the renderer and fill in a resource handle with a reference to the real resource. (Frame latency and atomic contention bound.)

Breaking down the steps here, you might see the issue that turns up. If I use a fixed set of async task threads and several texture requests come in at the same time, or over a couple of frames (a common case, since you load a model and then it says I use x/y/z textures), I'm going to be oversubscribed at different points in those steps. With a good asynchronous file system behind the scenes, the IO-bound bits won't be too horrible, but the bits involving #4 can be a real bugger because you can't predict when that work will need to be done, and oversubscribing your CPUs can be worse than just running it all single threaded because you drag down everything and start wasting major CPU on simple OS overhead instead of running your code.

Filling in the empty spaces in the CPU time is exactly why I have the two different solutions and it is also how the OS tends to fulfill various internal items such as async file IO etc. But the nature of the work involved in step #4 (assume it's jpg decompression for example) effectively tricks the OS into doing the wrong things sometimes. It will see the n-core threads as occasional work and prioritize them lower in favor of the async thread doing this big chunk of texture processing. Which means that for the frames that the async task is actively using CPU time, you have effectively lost one of your cores to this thread that the OS won't want to interrupt. What happens is the following (bad ascii art):

1: 11111xxxxxxx1111111xx111xx1111
2: 2222xxxxxxxx2222222244444x2222
3: 3333344444443333333xxxxxxx33xx4444444444
4: aaaaaaaaaaaaaaaaaaaaaaaa....

I.e. Cores are vertical, and threads 1-4 are your n-core per frame workers with 'a' as some busy asynchronous task. The n-core distribution gets fubar'd because thread 4 ends up pushed into the other three 'spurious' threads while 'a' ends up with all of a single core since it is not spurious at this point.

Now, you might think thread priorities or affinity could solve this; go ahead and play with it, but time your results, because you will find that, other than on consoles, using those items generally (almost always) makes things much worse.

So, going back to my initial point that you need to look at the top level of things first, the solution for this in my system was more heuristics. I ended up with a solution in which, at the start of the n-core portion, I maintain the CPU utilization trend and a really simple concept of 'available resources' and then bid things per frame in terms of whether I will allow asynchronous tasks to start working. What I mean by this is that I may call a hardware thread .9 units of possible work and add up all the available workers to have, say, 4 times .9 = 3.6 CPU units available. Based on prior frames, I subtract out n-core usage and any running asynchronous tasks. The remainder is what I use to determine if an asynchronous task will be allowed to start.

This is not a perfect solution and of course this is a very simplified description. The idea though is if you fix the asynchronous request count, you can easily oversubscribe the cores and cause much worse issues than I tried to outline above. I like the bidding approach myself, it is generally easy to do and the simple concepts lend themselves to weighting different tasks such that expensive things can be 1.1 weight and simple things could be .5 weight etc.

My fear would be that reading from disk with 8 threads could slow disk access to a crawl.
Suppose you read 8 or more large files from those 8 threads. That could overfill the disk access queue very fast. The OS may try to share access to the disk equally, and the HDD would randomly and slowly seek back and forth between 8 far-apart places on the disk, always just reading in a bit of each file and then getting a bit from the next file.
Then you might wait x times longer, as the seeks take away time that's not used for reading in data. Having to wait for 8 partially read files will also make you buffer 8 times more data, so you might have read more bytes than 7 files contain but still be waiting for the first file to finish.
Then, if you also filled the memory up too much, you might get a few page faults in your main thread and have to wait ages to access the swap file, as the disk queue is full.

I would feel safer having a thread per disk doing only IO and putting the decoding work into the thread pool with the 4 threads you already have, possibly at a lower priority.
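A rough sketch of that split (one blocking-IO thread per disk feeding a decode queue consumed by the existing worker pool); ReadWholeFile and Decode are hypothetical stand-ins for the real loading and decompression code:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct LoadedFile { std::string name; std::vector<char> bytes; };

std::vector<char> ReadWholeFile(const std::string& name); // hypothetical blocking read
void Decode(const LoadedFile& file);                      // hypothetical CPU-bound decode

std::queue<LoadedFile> gDecodeQueue;
std::mutex              gQueueMutex;
std::condition_variable gQueueCv;

// One thread per disk: read files sequentially so the HDD head doesn't thrash.
void IoThread(const std::vector<std::string>& files)
{
    for (const std::string& name : files)
    {
        LoadedFile file{ name, ReadWholeFile(name) };
        {
            std::lock_guard<std::mutex> lock(gQueueMutex);
            gDecodeQueue.push(std::move(file));
        }
        gQueueCv.notify_one(); // wake one decode worker
    }
}

// Runs on the existing worker pool (possibly at lower priority): CPU-bound decoding only.
void DecodeWorker()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(gQueueMutex);
        gQueueCv.wait(lock, [] { return !gDecodeQueue.empty(); });
        LoadedFile file = std::move(gDecodeQueue.front());
        gDecodeQueue.pop();
        lock.unlock();
        Decode(file);
    }
}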

My fear would be that reading from disk with 8 threads could slow disk access to a crawl.

This is why I mentioned having a 'good' async IO subsystem; such an item should prevent such things. On the other hand, the mentioned bidding solution works just as well with file IO as it does with the CPU. I use the bidding item to deal with the occasions when I'm going to access an API which does IO itself, outside of my control. The bidding idea means you can correct for such things pretty easily per platform, feature, whatever may change. Generally speaking, you can use a bid solution for memory pressure, CPU cost, disk utilization and even network IO. There may be better solutions, but I like it for simplicity and ease of tweaking.

This sounds very much like the division I use, though I kept it generic and called it asynchronous task based processing (aka: Thread Pool with Work Queue) and the n-core horizontal distribution system. The problem was always that those two different things tend to fight each other at the most inopportune times. For instance, let's take texture requests. I call them async tasks because they may spend several frames loading data, so I don't want them tied into the main loop distribution system. But things are never as simple as they seem on the surface, so let me describe the steps involved just so some items are clear:

1. Load (or pull from a metadata cache) the texture header so we have size, mip count etc information. (IO bound or potentially memory bound.)
2. Using #1 information, send a request to the rendering thread to give me an appropriate buffer to store the texture in. The buffer may be a transient chunk of memory the renderer will feed to the Gfx api or it may be the real resource memory block, not up to me to decide on this side of things. (Frame latency bound, i.e. once a frame I pump a message queue and push back results.)
3. Load the actual data. (IO bound again.)
4. For fun, assume it is an 8k x 8k backdrop image and for some reason it is on disk as a jpg, decompress and push it into the buffer. Potentially build mips, maybe even do runtime dxt compression etc. (CPU bound.)
5. Send the 'complete' back to the renderer and fill in a resource handle with a reference to the real resource. (Frame latency and atomic contention bound.)

Well, it seems that we indeed use similar systems. My level terrain texture is 16k x 16k, but it is generated on the fly instead of being loaded. It sounds like a task for the GPU, but I mix 8 different textures per terrain element (8x8 pixels) out of 256 textures, so for the moment it is easier for me to do it on the CPU. Nevertheless, my approach is:

1. The rendering system sends a request to the builder queue/list to fetch a terrain texture chunk.

2. The texture manager controls the texture builders; depending on the camera position, it takes a request from the list and sends it to a free builder, which works asynchronously.

3. The builder creates a new texture and writes it to an OGL buffer.

4. The texture manager controls the OGL buffer swapping.

5. The builder worker is a simple state machine; its state is controlled by atomic operations only.

My system, due to the heavy CPU usage, is even more prone to blocking the job threads.
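To make step 5 concrete, a minimal sketch of such an atomically driven state machine might look like this (the state names and methods are made up for illustration, not the actual builder code):

#include <atomic>

// Hypothetical builder state machine; the only communication between the
// texture manager (main thread) and the builder thread is this atomic state.
enum BuilderState { Idle, Requested, Building, Ready };

struct TextureBuilder
{
    std::atomic<int> state{ Idle };

    // Main thread: hand a request to a free builder.
    bool Request()
    {
        int expected = Idle;
        return state.compare_exchange_strong(expected, Requested);
    }

    // Builder thread: do the work when a request is pending.
    void Run()
    {
        int expected = Requested;
        if (!state.compare_exchange_strong(expected, Building))
            return;
        // ... generate the terrain texture chunk and write it into the GL buffer ...
        state.store(Ready); // main thread sees Ready and swaps the buffers
    }

    // Main thread: after swapping the buffer, return the builder to the pool.
    void Acknowledge() { state.store(Idle); }
};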


Filling in the empty spaces in the CPU time is exactly why I have the two different solutions and it is also how the OS tends to fulfill various internal items such as async file IO etc. But the nature of the work involved in step #4 (assume it's jpg decompression for example) effectively tricks the OS into doing the wrong things sometimes. It will see the n-core threads as occasional work and prioritize them lower in favor of the async thread doing this big chunk of texture processing. Which means that for the frames that the async task is actively using CPU time, you have effectively lost one of your cores to this thread that the OS won't want to interrupt. What happens is the following (bad ascii art):

1: 11111xxxxxxx1111111xx111xx1111
2: 2222xxxxxxxx2222222244444x2222
3: 3333344444443333333xxxxxxx33xx4444444444
4: aaaaaaaaaaaaaaaaaaaaaaaa....

I.e. Cores are vertical, and threads 1-4 are your n-core per frame workers with 'a' as some busy asynchronous task. The n-core distribution gets fubar'd because thread 4 ends up pushed into the other three 'spurious' threads while 'a' ends up with all of a single core since it is not spurious at this point.

Hmm... This will for sure happen in my case too. Just a thought: wouldn't it help to give the async thread a lower priority and call SwitchToThread or similar frequently to break up the run? Or some global, atomic state which signals that the job workers are busy, causing the async threads to pause for some time, a load balancer, or would this kill the benefit?
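As a sketch of that "global, atomic state" idea (purely illustrative and, as the reply below notes, it may not actually help):

#include <atomic>
#include <windows.h>

// Job workers advertise that they are busy; builder/async threads back off
// between small chunks of work while any job worker is running.
std::atomic<int> gBusyJobWorkers{ 0 };

void RunJobsForThisFrame()            // called by each job worker
{
    gBusyJobWorkers.fetch_add(1, std::memory_order_relaxed);
    // ... process this frame's jobs ...
    gBusyJobWorkers.fetch_sub(1, std::memory_order_relaxed);
}

void BuilderBackoff()                 // called by async/builder threads between work chunks
{
    while (gBusyJobWorkers.load(std::memory_order_relaxed) > 0)
        Sleep(1);                     // or SwitchToThread(); trades latency for fewer collisions
}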


So, going back to my initial point that you need to look at the top level of things first, the solution for this in my system was more heuristics. I ended up with a solution in which, at the start of the n-core portion, I maintain the CPU utilization trend and a really simple concept of 'available resources' and then bid things per frame in terms of whether I will allow asynchronous tasks to start working. What I mean by this is that I may call a hardware thread .9 units of possible work and add up all the available workers to have, say, 4 times .9 = 3.6 CPU units available. Based on prior frames, I subtract out n-core usage and any running asynchronous tasks. The remainder is what I use to determine if an asynchronous task will be allowed to start.

So you have some kind of synchronisation between your threads, a load balancing? Something like this:

If you have, for each core, a worker and a sister thread for the async work, couldn't these two threads use a simple (virtual) core sharing:


#include <atomic>
#include <windows.h>

// 1. Global state
// Only modified by job workers (atomic operations only).
std::atomic<long> virtualCoreSharing[numberOfWorkerThreads]; // all start at 0

// 2. Job threads (#cores - 1, plus the main thread)
// Job worker, starting:
virtualCoreSharing[jobWorker.getVirtualCoreIndex()].store(1);

// .. process jobs ..

// Done, give control back to the sister thread on this virtual core:
virtualCoreSharing[jobWorker.getVirtualCoreIndex()].store(0);


// 3. Texture threads (#cores - 1, no main thread, because the main thread should be busy all the time)
// Execution:
for (int line = 0; line < textureHeight; ++line)
{
    // Spin-lock style check; an if instead of a while loop, to avoid starving this thread.
    if (virtualCoreSharing[textureWorker.getVirtualCoreIndex()].load() == 1)
    {
        // Let the sister (job) thread have more CPU time.
        Sleep(0); // or higher
    }

    // .. process a single line or data block ..
}


I think you're going to get better performance if you pay attention to what jobs you are running in your job pools and when, rather than trying to micromanage where threads run anyway. You generally don't want tasks that write to the same shared data running at the same time, to reduce lock contention. (Multiple readers are fine, of course.)

That is no issue; I don't have shared data that I write to concurrently. Everything is either isolated data or double buffered.


This is because context switching between threads on the same core is very expensive

Is this still the case on modern systems? I've learned that, at least on Windows, a lot of performance optimization has been done for multithreading. And I'm getting really jealous when reading all the stuff about how much control you have on consoles...

This is because context switching between threads on the same core is very expensive

Is this still the case on modern systems? I've learned that, at least on Windows, a lot of performance optimization has been done for multithreading. And I'm getting really jealous when reading all the stuff about how much control you have on consoles...


Context switching not only requires the OS to do all the work of swapping out one thread's data for another, but also means you've just flushed away all your caches on that CPU as well.

Oversubscription can make sense if your thread is otherwise idle or waiting on other resources (i.e. you can run two threads on one core if one thread is waiting for data from disk or the network, for example), but otherwise it generally just adds additional runtime for no benefit.

Hmm... This will for sure happen in my case too. Just a thought: wouldn't it help to give the async thread a lower priority and call SwitchToThread or similar frequently to break up the run? Or some global, atomic state which signals that the job workers are busy, causing the async threads to pause for some time, a load balancer, or would this kill the benefit?


In the async work queue side of things, I don't always have control over interrupting the underlying code unless I hack it in, which I'd prefer to avoid. Worse, inserting such things doesn't really fix anything, as the async thread then simply ends up getting mixed into the other n-core work, similar to what happened with thread 4 in the ugly little diagram. As to the priorities, I was able to work around many of the issues using thread priorities, but as soon as I went to another OS I had to go through the whole rebalance again due to them not having the same behaviors. Heck, even going to a different machine with the same exact OS didn't tend to work; it's just so touchy and uncontrollable that I immediately rolled back that work and went looking for a better solution.

AllEightUp, on 21 May 2015 - 08:25 AM, said:
So, going back to my initial point that you need to look at the top level of things first, the solution for this in my system was more heuristics. I ended up with a solution in which, at the start of the n-core portion, I maintain the CPU utilization trend and a really simple concept of 'available resources' and then bid things per frame in terms of whether I will allow asynchronous tasks to start working. What I mean by this is that I may call a hardware thread .9 units of possible work and add up all the available workers to have, say, 4 times .9 = 3.6 CPU units available. Based on prior frames, I subtract out n-core usage and any running asynchronous tasks. The remainder is what I use to determine if an asynchronous task will be allowed to start.

So you have some kind of synchronisation between your threads, a load balancing ? Something like this :
if you have for each core a worker and a sister thread for the async work, couldn't these two threads use a simple (virtual) core sharing:


I just keep two pools of threads available in wait states. So long as you don't go crazy and have, say, 50 n-core and 50 async threads, having threads ready to go in the engine but not used is not going to affect things in any notable way. The behaviors of how each pool is used are very different, though. Typically speaking, I let any async task run start to finish without trying to micro-manage what it is doing, as mentioned above. Once a frame I do something along the following lines:

Compute available resources:
  // Simple 'guess' as to the maximum possible processing the machine can handle.
  float totalCpu = (Hardware::GetCoreCount() - 1) * 0.9f
                 - (Hardware::IsHyperthreaded() ? (Hardware::GetCoreCount() - 1) / 2 * 0.1f : 0.0f);
  // Subtract out the amount of weight given to currently running async tasks.
  // The value is adjusted when async tasks are started/completed.
  totalCpu -= asyncTotalCost;
  // Figure out a good guess as to the number of n-core threads I want to wake this frame.
  <various heuristics to come up with 'ncoreCount'>
  // Subtract n-core work from totalCpu.
  totalCpu -= ncoreCount;  // Note I'm effectively doubling the headroom allowed when I measure cores as 0.9f units.

  // Kick off n-core processing this frame.
  mNCoreThreadSync.Release(ncoreCount);

  // Decide to kick off any new async tasks if desired.

At this point, you look at totalCpu and see if anything in the async pending task queue is less than or equal in cost to the remaining totalCpu resources; if so, go ahead and start the async task to use up a core. I do various checks before starting an async task, and generally speaking I modify the CPU weight such that at least one async task is always allowed to be running (I can't just block them, but I can slow them down a bit). Typically, unless there is a really light load on the machine or the system is running on some beast box like the mentioned 24-core, I don't end up with more than one running async task at a time, except for various blocking IO tasks that have near 0.0f weights.

All the code is relatively straightforward and pretty simple; I don't like complicated solutions for most things if avoidable. I think the n-core worker function is probably the most complicated bit after the real version of the above heuristics. Of course, the heuristics are simple math with very limited logic, so other than adding a bunch of different metrics, more accurate cost calculations, etc., it's very straightforward. Overall, other than some tricky detail bits, there is nothing really that fancy involved.

So, that virtual core bit? Nope, I don't mess with it, the whole thing is kept as simple as possible. The benefit to the simplicity is that I've probably rewritten the system from scratch 10 times in the last 3-4 years and other than progressively getting better performance & balancing, the rest of my code rarely cares about the changes.

I decided to profile some of my busy loop stuff and found some interesting results.

All done on Windows 8.1 Pro + Core i7-4770K @ 3.5GHz.

YieldProcessor - 0µs
SwitchToThread - 11µs
SleepEx( 0, TRUE ) - 9µs
SleepEx( 1, TRUE ) - 1790µs
WaitForSingleObjectEx( x, 1, TRUE ) - 1070µs
WaitForSingleObjectEx( x, 0, TRUE ) - 2µs
ReleaseSemaphore - 2µs

YieldProcessor is a NOP, so it's not surprising to be less than a microsecond :)
I was surprised that SwitchToThread is slightly higher than a Sleep(0).
Sleep(1) seems as if it's waiting the 1 ms, but then waking on the next scheduling tick (I was using timeBeginPeriod(1))... but actually Win8 is tickless, so I'm not sure about this extra delay...
WaitFor...(1) on the other hand seems much more likely to wake up after almost exactly 1 ms (only 70µs over!). There must be some microsecond-accurate timer interrupt at the OS level, or something...
Signalling a semaphore or checking if it's been signaled yet, is extremely cheap - you can signal it 16k times per frame at 30Hz!
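For reference, a minimal sketch of how numbers like these could be measured with QueryPerformanceCounter (assumes Windows and linking against winmm.lib for timeBeginPeriod; this is not the actual benchmark code from the post):

#include <windows.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib") // timeBeginPeriod/timeEndPeriod (MSVC)

// Average cost of one call in microseconds.
template <typename F>
double MeasureMicroseconds(F&& call, int iterations = 1000)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (int i = 0; i < iterations; ++i)
        call();
    QueryPerformanceCounter(&end);
    return (end.QuadPart - start.QuadPart) * 1000000.0 / freq.QuadPart / iterations;
}

int main()
{
    timeBeginPeriod(1); // 1 ms scheduler granularity, as in the post
    std::printf("SwitchToThread:  %.1f us\n", MeasureMicroseconds([] { SwitchToThread(); }));
    std::printf("SleepEx(0,TRUE): %.1f us\n", MeasureMicroseconds([] { SleepEx(0, TRUE); }));
    std::printf("SleepEx(1,TRUE): %.1f us\n", MeasureMicroseconds([] { SleepEx(1, TRUE); }, 100));
    timeEndPeriod(1);
    return 0;
}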
