
Fast thread synchronisation (WIN32)


Hi again. In an application I'm writing, I sometimes have to load new resources on the fly (a specific user command arrives at run-time). Unfortunately, I have to keep a stable 50 fps, so what I want to do is load resources in a separate thread. A standard model, I guess (a sketch of it follows this post):

- a queue (Q1) for requests, and a queue (Q2) for completed requests,
- the active thread adds new requests to Q1 (synchro here) whenever it wants to,
- the active thread gets and handles completed requests from Q2 (synchro here) at an appropriate time (in between frames, for example),
- the loading thread gets requests from Q1 (synchro here) as they arrive (woken up if needed), then performs the loading,
- the loading thread returns completed requests to Q2 (synchro here); if there are more requests it keeps loading in a loop, and if there are none it sleeps.

A few questions/problems:

1. How should I go about the synchronisation issues (under WIN32), especially the sleeping/waking problem? I know what synchronisation is about, I just can't find my way around the structures WIN32 gives me - all I've got is CRITICAL_SECTION.
2. Performance is a big issue here. Could I use other synchronisation objects/functions, or should I stick with CRITICAL_SECTION only?
3. I tried to use InitializeCriticalSectionAndSpinCount, but it wouldn't compile, while InitializeCriticalSection does. It's a plain "identifier not found" error. Sure, I don't have SMP (but the target machine will), could that be the problem (doubt it, but what do I know)?

So, a bunch of noobish questions for you ;) ANY ideas appreciated, even "this sucks, man! (because...)"

Cheers.
~def
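[Editor's sketch] A minimal version of the two-queue model described above, using CRITICAL_SECTIONs and an auto-reset event. The Request struct, its fields and the helper names are made up for illustration, and the actual loading call is only a comment:

#include <windows.h>

/* Hypothetical request record -- the real payload is up to you. */
typedef struct Request
{
    struct Request *next;
    const char     *fileName;   /* what to load */
    void           *result;     /* filled in by the loading thread */
} Request;

typedef struct
{
    CRITICAL_SECTION lock;      /* guards head/tail */
    Request *head, *tail;
} RequestQueue;

static RequestQueue g_pending;      /* Q1: main thread -> loader */
static RequestQueue g_completed;    /* Q2: loader -> main thread */
static HANDLE g_wakeLoader;         /* auto-reset event: "Q1 is non-empty" */
static volatile LONG g_quit = 0;    /* set to 1 to shut the loader down */

static void Push(RequestQueue *q, Request *r)
{
    EnterCriticalSection(&q->lock);
    r->next = NULL;
    if(q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
    LeaveCriticalSection(&q->lock);
}

static Request *Pop(RequestQueue *q)
{
    Request *r;
    EnterCriticalSection(&q->lock);
    r = q->head;
    if(r)
    {
        q->head = r->next;
        if(!q->head) q->tail = NULL;
    }
    LeaveCriticalSection(&q->lock);
    return r;
}

/* Loading thread: sleep on the event, drain Q1, push results to Q2, repeat. */
static DWORD WINAPI LoaderThread(LPVOID param)
{
    Request *r;
    (void)param;
    while(!g_quit)
    {
        WaitForSingleObject(g_wakeLoader, INFINITE);
        while((r = Pop(&g_pending)) != NULL)
        {
            /* r->result = LoadFile(r->fileName);   <- your actual loading code */
            Push(&g_completed, r);
        }
    }
    return 0;
}

/* Call once at startup (error handling omitted). */
static HANDLE StartLoader(void)
{
    InitializeCriticalSection(&g_pending.lock);
    InitializeCriticalSection(&g_completed.lock);
    g_wakeLoader = CreateEvent(NULL, FALSE, FALSE, NULL);   /* auto-reset, non-signaled */
    return CreateThread(NULL, 0, LoaderThread, NULL, 0, NULL);
}

The main thread would Push() into g_pending and SetEvent(g_wakeLoader) whenever a request comes in, and Pop() from g_completed between frames. Only the queue operations themselves happen inside the critical sections, so the locks are held very briefly.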

Synchronization Functions

If you want the lightest-weight synchronization possible, I believe what you want is InterlockedExchange.

You could use it to implement a mutex-like entity like so:

typedef LONG volatile InterThreadMutex;

/* Returns TRUE if the mutex was free and we just took it, FALSE if another thread holds it. */
BOOL TryAcquire(InterThreadMutex *TheMutex)
{
    /* Atomically set the flag to 1 and look at what it was before. */
    if(InterlockedExchange(TheMutex, 1) == 0)
    {
        return TRUE;
    }
    return FALSE;
}

/* Spin (yielding the rest of the time slice) until we get the mutex. */
void WaitAcquire(InterThreadMutex *TheMutex)
{
    while(!TryAcquire(TheMutex))
    {
        Sleep(0);
    }
}

/* Atomically reset the flag so another thread can take the mutex. */
void Release(InterThreadMutex *TheMutex)
{
    InterlockedExchange(TheMutex, 0);
}

(Written in Win32-API-style C; it would be much cleaner as C++.)
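[Editor's sketch] For example, the spin mutex above could guard a request queue like this; Request, g_requestQueue and AddRequest are placeholder names, not anything from the posts:

static InterThreadMutex g_queueMutex = 0;   /* 0 = unlocked, 1 = locked */

void AddRequest(Request *r)
{
    WaitAcquire(&g_queueMutex);
    /* ... push r onto g_requestQueue ... */
    Release(&g_queueMutex);
}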

Identify the resource used by the different threads and use a critical section to guard it. I used the term resource in a similar sense to how Relisoft does: Resources and their Ownership. Reviewing that page, I see that it even uses CriticalSections as part of its discussion.

Q1. Win32 synch objects are intended to be referenced by handles; the critical section structure is meant to be opaque. CriticalSections only work within a single process. Mutexes operate the same way a CS does, except they can also work across processes. Semaphores and Events provide additional specialised features.

Q2. Stick with CS.

Q3. InitializeCriticalSectionAndSpinCount: to compile an application that uses this function, define _WIN32_WINNT as 0x0403 or later.
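[Editor's sketch] In other words, something along these lines; the 4000 spin count is only an illustrative value to be tuned by measurement:

#define _WIN32_WINNT 0x0403   /* must appear before <windows.h> is included */
#include <windows.h>

static CRITICAL_SECTION g_cs;

void InitLock(void)
{
    /* The spin count is ignored on single-processor machines,
       so this degrades to a plain critical section there. */
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}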

You need to tell the windows headers that it's ok if your code doesn't run on Win95 or WinNT 4.0 (pre-SP3). InitializeCriticalSectionAndSpinCount only exists on Win98 and NT4sp3+. Do that with "#define _WIN32_WINNT 0x0403" (or higher) before including windows.h.

Critical sections are pretty good, particularly with carefully chosen spin counts. Always keep in mind that having a spin count on a single-proc system will only hurt you. Choosing the best spin count is something of a black art and tends to be very sensitive to the particular system the code is running on.

The takeaway is to not simply write them off as so many people do.

You might also consider an SList (the interlocked singly linked list API). It's only supported on XP and later, though.

If you insist on rolling your own, google for "lock-free queue" or something similar, and please keep in mind that it's very easy to screw this stuff up and end up worse off than if you'd just used the standard stuff.
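[Editor's sketch] For what it's worth, the SList API looks roughly like this; WorkItem and its fileName field are made-up examples, and error checking is omitted:

#include <windows.h>

typedef struct
{
    SLIST_ENTRY entry;     /* must be the first member (or use CONTAINING_RECORD) */
    const char *fileName;  /* hypothetical payload */
} WorkItem;

static SLIST_HEADER g_workList;

void InitWorkList(void)
{
    InitializeSListHead(&g_workList);
}

void PushWork(WorkItem *item)
{
    /* Items must be aligned to MEMORY_ALLOCATION_ALIGNMENT,
       e.g. allocated with _aligned_malloc. */
    InterlockedPushEntrySList(&g_workList, &item->entry);
}

WorkItem *PopWork(void)
{
    /* Returns NULL when the list is empty. Note that an SList is LIFO,
       not the FIFO ordering a request queue might want. */
    return (WorkItem *)InterlockedPopEntrySList(&g_workList);
}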

Anon Mike: You could use code similar to that in (Example) Getting Hardware Information to get the number of processors and avoid setting a spin count on single-processor machines. Of course, you should be careful when doing such things, as there's no real way to tell how things will change in the future. The 100-core single-chip machines of tomorrow might report only a single CPU =-|
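[Editor's sketch] A minimal version of that idea, assuming GetSystemInfo is what the linked example uses under the hood:

#include <windows.h>

DWORD GetProcessorCount(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors;
}

/* Then, for example, only bother with a spin count on multi-processor boxes: */
/* DWORD spinCount = (GetProcessorCount() > 1) ? 4000 : 0; */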

Thanks for all your help.

InterlockedExchange (and family) looks nice, but I would need to reimplement the spin-wait, which is already built into CS and, I guess, is about as effective as it can be.

As for the sleeping part, I went with Extrarius' original suggestion (in the code snippet) of simply Sleep'ing for a fixed amount of time (2 seconds[!]).

Also, I think that using an SList would introduce unnecessary synchronisation, since I have to do all the queue operations and some additional checks inside my own critical section anyway. So I'm using a similar structure, only plain and home-made.


It seems to be working, but it really brings the main thread to its knees, causing a lot of frames to be lost (I have to keep a stable 50 fps). I'll try lowering the resource thread's priority and see if that helps.
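[Editor's sketch] Lowering the priority is a single call; hLoaderThread here stands for whatever handle CreateThread returned for the loading thread:

/* Run the loader below normal priority so it doesn't steal time from rendering. */
SetThreadPriority(hLoaderThread, THREAD_PRIORITY_BELOW_NORMAL);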

Quote:
Original post by Extrarius
If you're going to use critical sections, use WaitForSingleObject to acquire them.

No, you need to use EnterCriticalSection to acquire a critical section. WaitForSingleObject is for synchronization objects that require kernel intervention, like mutexes, events, etc. Half the point of using a critical section in the first place is to avoid the overhead of calling into the kernel.
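[Editor's sketch] That is, the usual pattern is simply this, where g_cs is whatever critical section guards the shared data, initialised once with InitializeCriticalSection:

EnterCriticalSection(&g_cs);
/* ... read or modify the shared queue ... */
LeaveCriticalSection(&g_cs);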

I can't imagine why you think SLists are too heavyweight.

To repeat, at least try the standard objects before dismissing them.

Quote:
Original post by Anon Mike
I can't imagine why you think SLists are too heavyweight.


Because it synchronizes every time I use it. And when I'm using it, I'm already inside some kind of critical section (in this implementation, that is; in some other it would be a different situation), and I'm accessing it several times in a row.

Heavyweight may be too big a word here. I only meant that it's unnecessary.


As for WaitForSingleObject, I'm using it to wait for the thread to end, after I set a volatile bool saying "thread, quit now, please".
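[Editor's sketch] A typical shape for that shutdown, assuming hLoaderThread is the handle CreateThread returned and g_quit is the volatile flag the loading thread polls:

g_quit = TRUE;                                 /* tell the loader to finish its loop */
WaitForSingleObject(hLoaderThread, INFINITE);  /* block until the thread function returns
                                                  (this can take up to one sleep interval) */
CloseHandle(hLoaderThread);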

--

I also tried lowering the priority of the loading thread, by one level only. I lose no more frames in the main thread, but the loading process takes much too long (about 2-3 seconds for a 360x288 jpg file to a standard ARGB texture). As I have to read, like, a few hundred of those, I guess I have to speed it up a little...

Any ideas?

The trouble with InterlockedExchange() is that you can't do synchronization with it unless you do explicit yielding, sleeping, or spin-locking. You don't want to spinlock. CRITICAL_SECTION is pretty good; it performs very well for the general-purpose locking primitive that it is. It will use the equivalent of InterlockedExchange() in the case of no contention, and will revert to using a kernel primitive when there is contention.

However, for a producer/consumer situation, you don't actually want (or need) a critical section. Instead, you want to use something like a HEVENT, and something like a non-blocking FIFO for queuing requests (and results). The loading thread would wait on an event, and then it would read all currently available items in the FIFO, process them, and put the response in another FIFO. It then loops back and waits on the event again. The real-time thread puts something into the non-blocking FIFO and signals the event to indicate that something's available; each time through the loop, it also polls the non-blocking output FIFO to see whether there is any data.

When you know you have exactly one producer and one consumer, a non-blocking FIFO doesn't even need interlocked exchange, although you may be in trouble if you use a dual-CPU system and don't pay attention to out-of-order memory commits. Executing a serialization instruction like x86 SFENCE or CPUID, or PPC EIEIO will typically take care of this problem -- I'd prefer SFENCE because that's what it's for :-)
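[Editor's sketch] A single-producer/single-consumer ring buffer in that spirit might look like the following. This is a sketch of the general technique rather than anything from hplus0603's code, and MemoryBarrier() is used as a stand-in for the SFENCE he mentions:

#include <windows.h>

#define QUEUE_SIZE 256   /* must be a power of two */

typedef struct
{
    void *slots[QUEUE_SIZE];
    volatile LONG head;   /* read index, written only by the consumer */
    volatile LONG tail;   /* write index, written only by the producer */
} SpscQueue;

/* Producer side -- must be called from one thread only. Returns FALSE when full. */
BOOL SpscPush(SpscQueue *q, void *item)
{
    LONG tail = q->tail;
    LONG next = (tail + 1) & (QUEUE_SIZE - 1);
    if(next == q->head)               /* full (one slot is deliberately kept empty) */
        return FALSE;
    q->slots[tail] = item;
    MemoryBarrier();                  /* publish the slot before publishing the index */
    q->tail = next;
    return TRUE;
}

/* Consumer side -- must be called from one (other) thread only. Returns NULL when empty. */
void *SpscPop(SpscQueue *q)
{
    LONG head = q->head;
    void *item;
    if(head == q->tail)               /* empty */
        return NULL;
    item = q->slots[head];
    MemoryBarrier();                  /* finish reading the slot before releasing it */
    q->head = (head + 1) & (QUEUE_SIZE - 1);
    return item;
}

The producer would SetEvent() after a successful push so the loading thread can sleep on WaitForSingleObject instead of polling.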

Second to last, you should beware that if you have a single CPU, and the loading thread needs to use some significant amount of CPU for doing its job (say, spin-locking on the graphics card to allocate vertex buffers), then the real-time thread will likely end up stalling anyway -- there's no way you can GUARANTEE 50 Hz when you have user-directed input involved (unless you can pre-load all resources that will ever be necessary).

Last, InitializeCriticalSectionAndSpinCount() is available only if you define _WIN32_WINNT to 0x0403 or higher (IIRC), and you may also have to download and install an updated version of the Platform SDK to get it (at least if you're still using Visual Studio 6).

Quote:
Original post by hplus0603
However, for a producer/consumer situation, you don't actually want (or need) a critical section. Instead, you want to use something like a HEVENT, and something like a non-blocking FIFO for queuing requests (and results). The loading thread would wait on an event, and then it would read all currently available items in the FIFO, process them, and put the response in another FIFO. It then loops back and waits on the event again. The real-time thread puts something into the non-blocking FIFO and signals the event to indicate that something's available; each time through the loop, it also polls the non-blocking output FIFO to see whether there is any data.


What do you mean by a non-blocking FIFO? Access to any data structure that can be directly modified by more than one thread has to be restricted somehow.

Quote:
Original post by hplus0603
When you know you have exactly one producer and one consumer, a non-blocking FIFO doesn't even need interlocked exchange, although you may be in trouble if you use a dual-CPU system and don't pay attention to out-of-order memory commits. Executing a serialization instruction like x86 SFENCE or CPUID, or PPC EIEIO will typically take care of this problem -- I'd prefer SFENCE because that's what it's for :-)


Whee? I thought volatile was for this. Was I mistaken... ?

Quote:
Original post by hplus0603
Second to last, you should beware that if you have a single CPU, and the loading thread needs to use some significant amount of CPU for doing its job (say, spin-locking on the graphics card to allocate vertex buffers), then the real-time thread will likely end up stalling anyway -- there's no way you can GUARANTEE 50 Hz when you have user-directed input involved (unless you can pre-load all resources that will ever be necessary).


True, but I guess there's nothing I could do about it, is there?
I cannot read all resources ahead of time; that's where the original problem came from in the first place. [smile]

@hplus0603, from reading about lock-free and wait-free operations yesterday, it appears that InterlockedCompareExchange sits at the core of many of the implementations (Win32). Can you address that in light of your comments about InterlockedExchange?

Quote:
@hplus0603, from reading about lock-free and wait-free operations yesterday, it appears that InterlockedCompareExchange sits at the core of many of the implementations (Win32). Can you address that in light of your comments about InterlockedExchange?

There are two different ways to go about things: 1) implement some sort of mutual-exclusion mechanism to protect critical sections; 2) make all operations in the (formerly) critical section implicitly thread-safe and get rid of all locks.

Quote:
The trouble with InterlockedExchange() is that you can't do synchronization with it unless you do explicit yielding, sleeping, or spin-locking. You don't want to spinlock.

This is referring to 1), a mutex implementation. In short, you would atomically test and set a flag via InterlockedExchange or InterlockedCompareExchange (the latter being the classic CAS primitive), and then do one of the above in case of contention.

Quote:
something like a non-blocking FIFO for queuing requests (and results).

This is #2, the lock-free data structure. It happens to be implemented by means of CAS (one of the universal primitives that can be used for the purpose).

So, CAS can be used to implement synchronization, or to make possible techniques that obviate synchronization. The abovementioned problems only apply to the former.
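[Editor's sketch] To make the two uses concrete, here is a minimal Win32 C version of each. All names are made up, and the lock-free part only shows a push (a correct lock-free pop also has to deal with the ABA problem):

#include <windows.h>

/* 1) CAS as a lock: spin/yield until we flip the flag from 0 to 1. */
static volatile LONG g_lock = 0;

void SpinLock(void)
{
    while(InterlockedCompareExchange(&g_lock, 1, 0) != 0)
        Sleep(0);   /* yield the time slice instead of burning CPU */
}

void SpinUnlock(void)
{
    InterlockedExchange(&g_lock, 0);
}

/* 2) CAS as the basis of a lock-free structure: push a node with no lock at all. */
typedef struct Node
{
    struct Node *next;
    /* payload ... */
} Node;

static Node * volatile g_top = NULL;

void LockFreePush(Node *n)
{
    Node *old;
    do
    {
        old = g_top;
        n->next = old;
    } while(InterlockedCompareExchangePointer((PVOID volatile *)&g_top, n, old) != old);
}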

