AllEightUp, on 20 May 2015 - 10:45 AM, said:
Often my overall time goes 'down', primarily because, while I reduce contention as much as possible via partitioning, there is still a certain amount of atomic operation contention which is reduced by dropping worker threads.
I have a similar approach, that is, a high degree of partitioning and atomic sync. E.g. I have two thread pools, one for the job scheduler and one pool of "data builders"; the builders only communicate atomically with the main thread and use a really lax spinlock (10 ms sleep).
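For concreteness, here is a minimal sketch of what I mean by that lax handoff, assuming one single-producer/single-consumer slot per builder; BuildSlot, BuiltData and the function names are made up for illustration:

    #include <atomic>
    #include <chrono>
    #include <thread>

    struct BuiltData;  // whatever a builder produces

    // Hypothetical single-slot handoff between one builder and the main thread.
    struct BuildSlot {
        std::atomic<bool> ready{false};
        BuiltData*        data = nullptr;  // owned by the builder until 'ready' flips
    };

    // Builder side: publish the result, then wait lazily (the 10 ms sleep
    // keeps the spin cheap) until the main thread has consumed it.
    void publish(BuildSlot& slot, BuiltData* result) {
        slot.data = result;
        slot.ready.store(true, std::memory_order_release);
        while (slot.ready.load(std::memory_order_acquire))
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }

    // Main-thread side: poll once per frame, take the data if it is there.
    BuiltData* tryConsume(BuildSlot& slot) {
        if (!slot.ready.load(std::memory_order_acquire))
            return nullptr;
        BuiltData* d = slot.data;
        slot.ready.store(false, std::memory_order_release);
        return d;
    }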
This leads to another question. E.g. on a 4-core machine I use 1 main thread, 3 job workers and up to 8 builder threads. The builder threads create data and are quite independent of the rest of the game (read-only access to game data, no further sync, errors are accepted). The job workers, on the other hand, are only active for a defined duration during a frame and should be as fast as possible. The main thread does all the sync management, takes over the role of a job worker, and does some important, non-concurrent work in the shadow of the GPU. That is, once the GPU has been fed and the job scheduler is done, I use the time the GPU needs to finish its work to run all the single-threaded game logic on a large, unshared game world.
My intention is that the builder threads soak up the unused capacity whenever a core is not involved in another task (main/job). Thinking about it, what would be a good "configuration" for these builder threads?
Should I restrict them to #cores-1 threads so that the main thread is unhindered all the time? Can I use 8 threads if I set their priority quite low, or would this result in a system-wide reduction of priority (that is, other applications getting more CPU time than my running game)? And how can I tell the system that the main and job threads should be the most important when active, again via priority?
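To be clear about the kind of per-thread knob I mean (Windows call shown; pthreads has pthread_setschedparam for the same idea), something like:

    #include <windows.h>
    #include <thread>

    // Hypothetical: drop only a builder thread below normal priority.
    // This is a per-thread setting (MSVC: native_handle() is a HANDLE),
    // so the main/job threads are untouched by it.
    void makeBuilderLowPriority(std::thread& builder) {
        SetThreadPriority(builder.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);
    }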
This sounds very much like the division I use, though I kept it generic and called it asynchronous task based processing (aka a thread pool with a work queue) plus the n-core horizontal distribution system. The problem was always that those two different things tend to fight each other at the most inopportune times. For instance, let's take texture requests. I call them async tasks because they may spend several frames loading data, so I don't want them tied into the main loop distribution system. But things are never as simple as they seem on the surface, so let me describe the steps involved just so some items are clear (a rough code sketch follows the list):
1. Load (or pull from a metadata cache) the texture header so we have size, mip count, etc. (IO bound or potentially memory bound.)
2. Using #1 information, send a request to the rendering thread to give me an appropriate buffer to store the texture in. The buffer may be a transient chunk of memory the renderer will feed to the Gfx api or it may be the real resource memory block, not up to me to decide on this side of things. (Frame latency bound, i.e. once a frame I pump a message queue and push back results.)
3. Load the actual data. (IO bound again.)
4. For fun, assume it is an 8k x 8k backdrop image and for some reason it is on disk as a jpg, decompress and push it into the buffer. Potentially build mips, maybe even do runtime dxt compression etc. (CPU bound.)
5. Send the 'complete' back to the renderer and fill in a resource handle with a reference to the real resource. (Frame latency and atomic contention bound.)
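Roughly, in code, with stub helpers standing in for the real engine/IO bits (all names here are mine, purely for illustration):

    #include <cstdint>
    #include <cstdlib>

    struct TextureHeader { uint32_t width = 0, height = 0, mipCount = 0; };

    // Stubs standing in for real engine code, tagged with the step they represent.
    TextureHeader loadHeader(const char*)       { return {8192, 8192, 14}; }                     // 1: IO bound
    void* requestBuffer(const TextureHeader& h) { return std::malloc(h.width * h.height * 4); }  // 2: frame-latency bound in a real engine
    void  loadPixels(const char*, void*)        {}                                               // 3: IO bound
    void  decodeAndBuildMips(void*)             {}                                               // 4: CPU bound, the dangerous step
    void  signalComplete(void* buf)             { std::free(buf); }                              // 5: frame latency + atomic contention

    // One async texture request walking the five steps above.
    void asyncTextureTask(const char* path) {
        TextureHeader hdr = loadHeader(path);  // 1
        void* buf = requestBuffer(hdr);        // 2: in reality this waits on the render thread
        loadPixels(path, buf);                 // 3
        decodeAndBuildMips(buf);               // 4: this is where a core disappears
        signalComplete(buf);                   // 5
    }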
Breaking down the steps here, you might see the issue that turns up. If I use a fixed set of async task threads and several texture requests come in at the same time, or over a couple of frames (a common case, since you load a model and then it says it uses x/y/z textures), I'm going to be oversubscribed at different points in those steps. With a good asynchronous file system behind the scenes, the IO-bound bits won't be too horrible, but the bits involving #4 can be a real bugger because you can't predict when that work will need to be done, and oversubscribing your CPUs can be worse than just running it all single threaded, because you drag down everything and start wasting major CPU on simple OS overhead instead of running your code.
Filling in the empty spaces in the CPU time is exactly why I have the two different solutions, and it is also how the OS tends to fulfill various internal items such as async file IO. But the nature of the work involved in step #4 (assume it's jpg decompression, for example) effectively tricks the OS into doing the wrong thing sometimes. It will see the n-core threads as occasional work and prioritize them lower in favor of the async thread doing this big chunk of texture processing. Which means that for the frames where the async task is actively using CPU time, you have effectively lost one of your cores to a thread that the OS won't want to interrupt. What happens is the following (bad ascii art):
1: 11111xxxxxxx1111111xx111xx1111
2: 2222xxxxxxxx2222222244444x2222
3: 3333344444443333333xxxxxxx33xx4444444444
4: aaaaaaaaaaaaaaaaaaaaaaaa....
I.e. cores are listed vertically, threads 1-4 are your n-core per-frame workers, and 'a' is some busy asynchronous task. The n-core distribution gets fubar'd because thread 4 ends up squeezed in with the other three 'sporadic' threads while 'a' ends up owning an entire core, since it is not sporadic at this point.
Now, you might think thread priorities or affinity could solve this. Go ahead and play with it, but time your results, because you will find that, other than on consoles, using those items generally (almost always) makes things much worse.
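For reference, these are the knobs in question, Windows flavour (affinity mask bit n = logical core n); measure before trusting either one:

    #include <windows.h>

    // Hypothetical experiment: pin a worker to one core and bump its priority.
    void pinAndBoost(HANDLE thread, unsigned core) {
        SetThreadAffinityMask(thread, static_cast<DWORD_PTR>(1) << core);
        SetThreadPriority(thread, THREAD_PRIORITY_ABOVE_NORMAL);
    }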
So, going back to my initial point that you need to look at the top level of things first, the solution in my system was more heuristics. I ended up with a solution where, at the start of the n-core portion, I maintain the CPU utilization trend and a really simple notion of 'available resources', and then bid things per frame to decide whether I will allow asynchronous tasks to start working. What I mean by this is that I might call a hardware thread 0.9 units of possible work and add up all the available workers to get, say, 4 times 0.9 = 3.6 CPU units available. Based on prior frames, I subtract out n-core usage and any running asynchronous tasks. The remainder is what I use to determine whether an asynchronous task will be allowed to start.
This is not a perfect solution, and of course this is a very simplified description. The idea, though, is that if you fix the asynchronous request count, you can easily oversubscribe the cores and cause much worse issues than I tried to outline above. I like the bidding approach myself; it is generally easy to do, and the simple concept lends itself to weighting different tasks such that expensive things can be 1.1 weight and simple things can be 0.5 weight, etc.
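A minimal sketch of that bid, using the numbers from above with made-up names (one 'unit' = one fully busy worker; the 0.9 and the task weights are tunables, not fixed values from my system):

    // Hypothetical per-frame bid: budget = workers * 0.9 units, minus what
    // the n-core work and the already-running async tasks consumed.
    struct AsyncBidder {
        int    workerCount = 4;
        double nCoreUsage  = 0.0;  // measured over prior frames, in units
        double asyncUsage  = 0.0;  // units claimed by running async tasks

        double available() const {
            return workerCount * 0.9 - nCoreUsage - asyncUsage;
        }

        // weight: e.g. 1.1 for jpg decompression, 0.5 for something cheap
        bool tryStart(double weight) {
            if (weight > available())
                return false;      // task waits for a later frame
            asyncUsage += weight;  // claim the budget
            return true;
        }

        void onTaskFinished(double weight) { asyncUsage -= weight; }
    };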