
## Recommended Posts

Hi Guys,

I have an application where I need to have four threads running in the background to process data per frame.

I have found that creating and destroying the threads every frame is costly, so I am leaving the threads running. That works fine, but it thrashes the CPU constantly.

Is it possible to make the CPU idle when the extra processing isn't required?

I have added Sleep(0) but am wondering if there is anything additional I can do, or is that about it?

##### Share on other sites
Nice! That looks like exactly what I am looking for!

Thanks heaps :)

##### Share on other sites

Depending on the workload you expect on the threads, you can use different techniques to get the best behaviour:

1. With the assembly instruction `nop` you can wait for N cycles on the core (the number of cycles depends on the CPU).
2. With Sleep/usleep(0) your thread signals the scheduler that it has nothing to do; if another thread needs to run on this core, it can take the rest of the time slice.
3. With Sleep/usleep(1) your thread signals the scheduler that it has nothing to do, and another thread can take this and upcoming time slices.
4. With Sleep/usleep(N) for N > 1, the thread tells the scheduler that it wants the core again in N ms, which can be the next fitting time slice, or the middle of a time slice if the previous thread used Sleep(0).
5. The condition/mutex/signal solutions are platform specific and differ in their behaviour.

Commonly you implement a worker function with 1 to 4 in a loop, because 5 performs badly (high latency) if you push a lot of jobs to your worker thread.

If you get a task, you expect an upcoming one, so you first spin on a few nop operations, then wait a little longer (Sleep(0)), then longer (Sleep(1)), then longer (Sleep(15)), until you go back to the outer loop.

When you get a task, you reset to the shortest wait period.

You should prefer a lock-free queue, either single-producer single-consumer (SPSC: the main thread pushes to a dedicated worker) or single-producer multiple-consumer (SPMC: every worker pushes to its local queue but can steal from other workers if its own queue is empty).

```cpp
// Backoff loop: escalate from busy-waiting (nop) to Sleep() as
// consecutive dequeue attempts fail; reset on success.
int failCount = 0;
while (!exitProcess)
{
    Job job;
    bool gotJob = false;

    switch (failCount)
    {
    case 6: Sleep(235);                       // long idle
        // fall through
    case 5: Sleep(15);
        gotJob = inQueue.Dequeue(job);
        break;

    case 4: Sleep(1);
        gotJob = inQueue.Dequeue(job);
        break;

    case 3: Sleep(0);
        gotJob = inQueue.Dequeue(job);
        break;

    case 2: for (int i = 0; i < 10; ++i) nop();
        // fall through
    case 1: for (int i = 0; i < 5; ++i) nop();
        // fall through
    case 0:
        gotJob = inQueue.Dequeue(job);
        break;
    }

    // If you use an SPMC queue, try stealing from another worker.
    if (!gotJob)
        gotJob = WorkerPool::GetInstance().StealJob(job);

    if (gotJob)
    {
        failCount = 0;                        // reset to the shortest wait
        job();
    }
    else if (failCount < 6)
    {
        ++failCount;
    }
}
```

##### Share on other sites

Typically you make your threads sleep on a condition variable. The main thread prepares a task, queues it in some shared container, and then signals the condition variable to wake up a single worker thread, or all worker threads, sleeping on that condition variable.

First things first: I know DarkRonin is struggling with threads, and since he hopes to use those four threads for a per-frame load, what you have advised likely has a serious culprit relative to his intent.

In particular, "and then signals condition variable to wake up single or all worker threads" is something that can take quite a long time just to wake the thread, whereas:

while (a == true) {}   // a being some outer variable

is code that will utilize a single core at 100% (if you compile without optimizations).

So the mechanism you have advised will check its condition only sparsely, to save CPU, but I guess the granularity will be around 1 ms (or some setting to tweak; precision and timings under 1 ms are a rare ability, and consider that DarkRonin wishes to save CPU time as well).

So I guess this mechanism is not meant for waking/freeing threads for per-frame loads, since at 40 FPS you have only 25 ms of total time per frame in the main thread.

I really cannot tell up front whether this approach will outperform or underperform the thread create/destroy alternative, which DarkRonin already finds costly.

##### Share on other sites
Depending on your workloads, you can also try using Intel Threading Building Blocks (TBB) for this same set of problems; then you have to care less about OS-level thread operations. It won't solve your problems magically (you still have to handle data races and synchronization), but at least you will do less OS-specific work.

##### Share on other sites

There's a lot of poor quality information in this thread.

Condition variables (or other synchronization methods) are your best bet. They tell the OS to use virtually zero (CPU) resources on your thread until it is time for it to do work. Any good multithreading-aware kernel will schedule your thread so that it wakes up as soon as it can after being signalled.

The latency of synchronization is trivial and not worth freaking out about. Basing your threading models on Sleeps and fairy dust is a recipe for frustrating bugs and non-scalable architectures.

You're right about the usage of a low-latency scheduler for this kind of job; of course you can use the much easier STL condition variable API.

I work with a low-latency scheduler at work, and that was the reason I posted one.

This kind of scheduler is also used by Intel TBB and Microsoft .NET, if you can trust this book.

In my private framework I currently use a Sleep(0), and after the first failure a Sleep(1), and the debugger shows me that the Sleep(1) does a WaitHandle call internally.

I profiled WaitForSingleObject when I wrote the scheduler and it came in with a latency around 1-10 ms, while my Sleep(0) peaked at 1 ms.

On Fedora 22 I profiled usleep and got the same behaviour, but the pthread signals are much faster compared to WaitForSingleObject.

The numbers were even worse before I pinned the threads to cores.

It seems you expect to use less CPU power if the process monitor shows 0-1% CPU usage for your process rather than 25% or even 100%.

Depending on what kind of CPU and OS you are on, this is not right.

Desktop CPUs run all the time and use the same amount of power, because if you don't use the CPU, the kernel will: it takes the CPU time to do housekeeping work, and on top of that runs nop if there is nothing to do.

If possible, it will lower the frequency of the CPU to use less energy, and in that case your power supply will draw less power (an FX-8350 can run anywhere between 1.4 and 4.2 GHz).

You can run tools like CPU-Z to monitor the CPU frequency while running your work-scheduler code, to find a solution that draws less power.

If the core speed goes down, the same code will show up as higher process usage, which makes this number treacherous.

On my 8-core FX, the core speed jumps between 1.4 GHz and 4.2 GHz; if I enqueue many tasks, my CPU usage goes up to 100% on all cores and it runs at 4.2 GHz per core.

If I lower the job rate to one every 5 ms, usage goes down to 1-10% and the clock speed jumps between 1.4 GHz and 4.2 GHz, which means my Sleep(1) has the same effect as using the Windows wait mechanism with respect to throttling the core speed. If I send one job every second, the process manager shows 0% with a short peak of 100%.

Reiterating:

Sleep() is a solution for a different problem. Don't use it here. Sleeps tend to operate on large time slices, far larger than you want here, and they'll make your game stutter.

Busy waits and spinlocks, which you see in the { while(no value) nop(); } solution, consume tons of CPU time and normally harm processing overall. Don't use them, because they tend to make performance plummet.

Use the platform-specific wait primitive, such as WaitForSingleObject() or pthread_cond_wait() or similar. These let the OS completely remove the thread from the run queue until the event is triggered; then the thread will immediately return to the running state.

As already mentioned, Sleep/usleep(0) and Sleep/usleep(1) are specialized and deliver lower latency than WaitForSingleObject.

You should also read my example more carefully: it uses 5 nop calls, which on my CPU are 5 cycles with 0 cycles of latency, then 20 nops, and only then falls back to the sleep mechanism.

This is only for the case where the producer thread has pushed a new job and the copy of the data into the cache is still in flight.

An L1 access takes under 10 cycles, an L2 cache access can take around 10-30 cycles, and an L3 access around 100 cycles (these numbers vary between CPU generations).

This nop stuff runs for a couple of nanoseconds and doesn't consume tons of CPU time.

After the 3rd failed dequeue you already idle in the microsecond range, and at the 5th failure you idle in the millisecond range and stay there.


##### Share on other sites

> I profiled WaitForSingleObject when I wrote the scheduler and it came in with a latency around 1-10 ms, while my Sleep(0) peaked at 1 ms.
> On Fedora 22 I profiled usleep and got the same behaviour, but the pthread signals are much faster compared to WaitForSingleObject.

I smell something fishy. I work with code that relies on sub-millisecond responses from inter-thread signals on Windows and it runs just fine across a very large number of users. If you're doing your threading right you can count on the kernel to wake a relevant thread in a few microseconds, not 10ms. It sounds like you mistakenly measured the process quantum, not actual thread context switching.

See (anecdotally) here for similar experiences.

> The numbers were even worse before I assigned the threads to the cores.

This is a concurrency smell. Trying to outwit the kernel with thread priorities and core affinities is almost always counterproductive unless you're extremely good.

> It seems that you expect that you use less CPU power if the process monitor shows 0-1% of CPU usage on your process than 25% or even 100%.

Complete and total straw-man, I never said anything of the kind. I said you use less CPU resources by which I meant actual scheduled CPU clock time.

> Desktop CPUs run all the time and use the same amount of power because if you don't use the CPU the kernel will.
>
> If possible it will lower the frequency of the CPU to use less energy, and in this case your power supply will draw less power (an FX-8350 can work between 1.4-4.2 GHz).
> You can run tools like CPU-Z to monitor the CPU frequency and run your work-scheduler code to find a solution which draws less power.
> If the core speed goes down, the same code will result in higher process usage, which makes this number treacherous.
>
> On my 8-core FX the core speed jumps between 1.4 GHz and 4.2 GHz; if I enqueue many tasks, my CPU usage goes up to 100% for all cores and it runs at 4.2 GHz per core.

Note that system profiling with Task Manager is bush league at best and probably just flat wrong in the common case. You should be giving evidence with ETW perf counters or similar tools if you want to convince anyone.

In any case, micro-optimization is dangerous and hard, and concurrent micro-optimization is at least an order of magnitude harder. Even if your numbers suggested that sleeping/nopping is a good idea for general use, you're going to lose a lot of ground to solutions that aren't so fragile.
