I profiled WaitForSingleObject when I wrote the scheduler and it came back with a latency of around 1-10 ms, while my Sleep(0) peaked at 1 ms. On Fedora 22 I profiled usleep and got the same behaviour, but pthread signals are much faster than WaitForSingleObject.

I smell something fishy. I work with code that relies on sub-millisecond responses from inter-thread signals on Windows, and it runs just fine across a very large number of users. If you're doing your threading right, you can count on the kernel to wake a relevant thread in a few microseconds, not 10 ms. It sounds like you mistakenly measured the process quantum, not actual thread context switching. See (anecdotally) here for similar experiences.
I measured the whole time a job takes from enqueuing until it's done, and recorded the min and max over a couple of seconds.
Which means the common background noise from other processes on the system had an influence whenever they blocked my process.
This is a concurrency smell. Trying to outwit the kernel with thread priorities and core affinities is almost always counterproductive unless you're extremely good.
I never used thread priorities; I tried affinity because I saw it in a couple of sources, and it did the job.
This discussion fired me up to take a deeper look into why it had that effect, and I found some information.
The official optimization guide from AMD (starting at page 351) talks only about NUMA systems, meaning multiple CPU sockets, and they recommend using affinity only if the system is not under high load; otherwise the system scheduler can do a better overall balancing across all processes and their threads.
Which means a server with 2 CPUs that only runs your game benefits from using affinity.
The reason is pollution of the L3 cache, plus the fact that the system scheduler moves threads across CPUs, which results in copying memory because the CPUs don't share the same local memory.
I also found a recent article about this topic which explains the reasons behind this finding.
Using affinity on a single CPU with multiple cores performs worse if your threads do completely random memory accesses or run different code.
If a thread always runs on the same core, it's more likely that the L1-L3 caches stay hot and, more importantly, that all the CPU optimizations like the instruction pipeline, branch prediction, indirect pointer access and so on kick in.
If you let the system decide where to run your threads, it will run them on whichever core is free, and this pollutes the caches and defeats those optimization mechanisms.
I found exactly the same in the OpenMP optimization guide, but they recommend allowing the user to set the number of threads per logical processor while the code assigns the threads by itself. The reason is that different CPUs run best with a different number of threads per core.
Note that system profiling with Task Manager is bush league at best and probably just flat wrong in the common case. You should be giving evidence with ETW perf counters or similar tools if you want to convince anyone.
I used CPU-Z to get the core speed, Task Manager to track the "System Idle" process, and CodeAnalyst IBS to profile my CPU.