Basic multithread question

14 comments, last by Shannon Barber 7 years, 7 months ago

I profiled WaitForSingleObject when I wrote the scheduler and it came in with a latency of around 1-10 ms, and my Sleep(0) peaked at 1 ms. On Fedora 22 I profiled usleep and got the same behaviour, but the pthread signals are much faster compared to WaitForSingleObject.
I smell something fishy. I work with code that relies on sub-millisecond responses from inter-thread signals on Windows and it runs just fine across a very large number of users. If you're doing your threading right you can count on the kernel to wake a relevant thread in a few microseconds, not 10ms. It sounds like you mistakenly measured the process quantum, not actual thread context switching. See (anecdotally) here for similar experiences.
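For reference, a minimal sketch of how one might measure the actual signal-to-wake latency (rather than whole-job time) on Windows, using an auto-reset event and QueryPerformanceCounter. The iteration count and the Sleep(1) pacing are arbitrary assumptions, and the shared timestamp is not strictly synchronized; this is a rough measurement, not a benchmark harness:

```cpp
#include <windows.h>
#include <cstdio>

static HANDLE g_event;
static LARGE_INTEGER g_signaled;   // timestamp taken just before SetEvent (not strictly synchronized)

DWORD WINAPI Waiter(LPVOID)
{
    LARGE_INTEGER woke, freq;
    QueryPerformanceFrequency(&freq);
    for (int i = 0; i < 1000; ++i)
    {
        WaitForSingleObject(g_event, INFINITE);   // sleep until signaled
        QueryPerformanceCounter(&woke);
        double us = (woke.QuadPart - g_signaled.QuadPart) * 1e6 / freq.QuadPart;
        printf("wake latency: %.1f us\n", us);
    }
    return 0;
}

int main()
{
    g_event = CreateEvent(nullptr, FALSE, FALSE, nullptr);  // auto-reset event
    HANDLE thread = CreateThread(nullptr, 0, Waiter, nullptr, 0, nullptr);
    for (int i = 0; i < 1000; ++i)
    {
        Sleep(1);                                 // give the waiter time to block
        QueryPerformanceCounter(&g_signaled);
        SetEvent(g_event);                        // wake the waiter
    }
    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);
    CloseHandle(g_event);
}
```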

I measured the whole time a job takes from enqueuing it until it's done, and tracked the min and max times over a couple of seconds.

Which means the common background noise from other processes on the system had an influence whenever they blocked my process.

This is a concurrency smell. Trying to outwit the kernel with thread priorities and core affinities is almost always counterproductive unless you're extremely good.

I never used thread priorities, and I tried affinity because I saw it in a couple of sources and it did the job.

This discussion fired me up to take a deeper look into why it had that effect, and I found some information.

The official optimization guide from AMD (starting at page 351) talks only about NUMA systems, which means multiple CPUs, and they recommend using affinity if the system is not under high load; otherwise the system scheduler can do a better overall job of balancing across all processes and their threads.

Which means a server with 2 CPUs that only runs your game benefits from using affinity.

The reason is pollution of the L3 cache: the system scheduler moves threads across CPUs, which results in copying memory, because the CPUs don't share the same local memory.
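For reference, a sketch of what pinning a thread might look like on Windows with SetThreadAffinityMask; the core index is just an example value, and on a NUMA box you would pick cores on the same node:

```cpp
#include <windows.h>

// Pin the calling thread to one logical processor (index is a hypothetical choice).
void PinCurrentThreadToCore(unsigned coreIndex)
{
    DWORD_PTR mask = DWORD_PTR(1) << coreIndex;
    // Returns the previous mask on success, 0 on failure.
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

void WorkerMain()
{
    PinCurrentThreadToCore(2);   // e.g. keep this worker on core 2
    // ... worker loop: caches stay warm because the thread never migrates ...
}
```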

I also found a recent article about this topic which explains the reasons behind these findings.

Using affinity on a single CPU with multiple cores works worse if your threads do completely random memory access or run different code.

If the thread always works on the same core, then it's more likely that the L1-L3 caches are hot and, more importantly, all the CPU optimizations like the instruction pipeline, branch prediction, indirect pointer access and so on kick in.

If you let the system decide where to run your threads, it will run them on whichever core is free, and this pollutes the caches and defeats those optimization mechanisms.

I found exactly the same in the OpenMP optimization guide, but they recommend letting the user set the number of threads per logical processor while the code assigns the threads by itself. The reason is that different CPUs run best with different numbers of threads per core.
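A sketch of what exposing that knob might look like; threadsPerCore is a hypothetical user setting taken from the command line, not something from the guide:

```cpp
#include <omp.h>
#include <thread>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    // Hypothetical user-configurable knob: threads per logical processor.
    int threadsPerCore = (argc > 1) ? atoi(argv[1]) : 1;
    int logicalCores   = (int)std::thread::hardware_concurrency();

    omp_set_num_threads(logicalCores * threadsPerCore);

    #pragma omp parallel
    {
        // OpenMP decides which core each thread runs on;
        // the user only tunes how many threads exist per core.
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
}
```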

Note that system profiling with Task Manager is bush league at best and probably just flat wrong in the common case. You should be giving evidence with ETW perf counters or similar tools if you want to convince anyone.

I used CPU-Z to get the core speed, Task Manager to track the "System Idle" process, and CodeAnalyst IBS to profile my CPU.

Desktop CPUs run all the time and use the same amount of power, because if you don't use the CPU, the kernel will.

What happens is that your kernel takes the CPU time to do housekeeping work, and on top of that it runs NOPs if there is nothing to do.

The "house keeping" work a kernel does is neglectable. Maybe sometimes you have some stupid "services" run in the background that e.g. create index lists of filenames. But the kernels (Be it Windows or Linux) them self do not do this kind of stuff.

x86 (since the 8086, to be accurate) uses the HLT instruction to put the CPU/core into a halt state, and via IRQ/INT the CPU/core wakes up again. So there is no need for NOP busy loops.

I looked up HLT on Wikipedia, and it says it was used in Windows NT and newer.

Windows 95 and 98 still used a NOP loop; I didn't know this until now.

It seems I have to move a book from my shelf into the trash can because it's out of date :(

It seems I have to move a book from my shelf into the trash can because it's out of date :(
Start a collection of retro books :D

x86 (since the 8086, to be accurate) uses the HLT instruction to put the CPU/core into a halt state, and via IRQ/INT the CPU/core wakes up again. So there is no need for NOP busy loops.
FWIW, if you're actually writing a busy wait / spin loop for x86 these days, then you really must put a very specific NOP instruction in the loop body (the one emitted by _mm_pause/YieldProcessor), which not only signals that hyperthreading should kick in immediately, but also that the CPU should enter a low power state until the loop condition is met and de-pipeline the entire loop body on the fly.

For highly contended locks, it can be useful to spin for some nanoseconds before actually putting the thread to sleep via the kernel, as it's likely that this short pause is enough for the lock to become available.
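A rough sketch of that spin-then-sleep pattern; the spin count is an arbitrary assumption, and a real implementation would tune it (or back off) rather than hard-code it:

```cpp
#include <mutex>
#include <immintrin.h>   // _mm_pause

std::mutex g_lock;

// Hypothetical helper: spin briefly with _mm_pause before blocking in the kernel.
void AcquireContendedLock()
{
    const int kSpinCount = 4000;             // arbitrary example value
    for (int i = 0; i < kSpinCount; ++i)
    {
        if (g_lock.try_lock())
            return;                          // got it without a kernel round trip
        _mm_pause();                         // hint: we are in a spin-wait loop
    }
    g_lock.lock();                           // give up spinning, sleep in the kernel
}

void ReleaseContendedLock()
{
    g_lock.unlock();
}
```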

Typically you make your threads sleep on a condition variable. The main thread prepares a task, queues it in some shared container, and then signals the condition variable to wake up a single worker thread, or all worker threads, sleeping on that condition variable.
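A minimal sketch of that pattern with std::condition_variable; the names and the std::function task type are illustrative:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex                         g_mutex;
std::condition_variable           g_cv;
std::queue<std::function<void()>> g_tasks;

void WorkerLoop()
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(g_mutex);
        // Sleep until a task is available; the predicate guards against spurious wakeups.
        g_cv.wait(lock, [] { return !g_tasks.empty(); });
        auto task = std::move(g_tasks.front());
        g_tasks.pop();
        lock.unlock();
        task();                              // run the task outside the lock
    }
}

void Enqueue(std::function<void()> task)
{
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        g_tasks.push(std::move(task));
    }
    g_cv.notify_one();   // or notify_all() to wake every worker
}
```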

So the class you have advised will check its condition only sparsely, to save CPU, but I guess that will take around 1 ms.

If we were running on i386s at 33 MHz it might take 1 ms.

A realistic expectation today would be 750 to 1500 ns.

Condition variables are the *nix analog to Windows' Event notifications. They are slightly different but can serve many of the same purposes.
They are a kernel synchronization primitive, not a user-code technique.

Spin-locking is also available when you have an SMP kernel.

- The trade-off between price and quality does not exist in Japan. Rather, the idea that high quality brings on cost reduction is widely accepted.-- Tajima & Matsubara

This topic is closed to new replies.
