atomic float add?

14 comments, last by King Mir 10 years, 2 months ago

The problem gets more complicated every time. :D

"a lot of computation using their own memory space"

No problem.

Nobody else cares what a thread does with data that isn't shared (and even if it's shared, it only matters if at least one thread writes to the area).

"the result of the computation needs to be stored back into a very large array in memory."

So basically, 4 threads do some lengthy operation and output 4 result values once. When they are available (presumably you block on an event?) you read them. No problem.

Actually, you don't even need atomicity for that to work reliably. Since you must synchronize with a semaphore or an event or a cond var anyway, that's already more than good enough.


"random locations and can partially overlap (it depends on what the user does)"

Problem. This isn't so much a threading or atomicity thing, however. It's a correctness thing.

If you add those results to some "random ranges" and these ranges overlap, you will add two results instead of one, atomic or not. If that is not allowable (or even desirable) then your code will not work correctly.


My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) from this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.


Why do you think it has to work this way, or that it's the right way? I've yet to see an algorithm needing this. Generally your threads would post their independent result buffers to a queue and then another thread (or threads!) would perform the reduce/merge operations (if using multiple threads, ensuring that regions with overlaps are either split up or all handled by the same thread).
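To make that concrete, here is a minimal sketch of the queue-and-merge shape in plain C++ (ResultBlock, postResult and mergeLoop are made-up names for illustration, and a real version would need a shutdown signal):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct ResultBlock {
        std::size_t offset;         // start index in the large array
        std::vector<float> values;  // the 4096 computed floats
    };

    std::mutex              queueMutex;
    std::condition_variable queueCv;
    std::queue<ResultBlock> resultQueue;

    // Worker threads call this when their computation is done.
    void postResult(ResultBlock block) {
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            resultQueue.push(std::move(block));
        }
        queueCv.notify_one();
    }

    // A single merge thread drains the queue and performs every add,
    // so overlapping fragments can never race with each other.
    void mergeLoop(float* largeArray) {
        for (;;) {  // a real version needs a way to stop
            std::unique_lock<std::mutex> lock(queueMutex);
            queueCv.wait(lock, [] { return !resultQueue.empty(); });
            ResultBlock block = std::move(resultQueue.front());
            resultQueue.pop();
            lock.unlock();  // do the adds outside the lock
            for (std::size_t i = 0; i < block.values.size(); ++i)
                largeArray[block.offset + i] += block.values[i];
        }
    }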

The previous comment about scaling is apt, btw. Atomics are expensive. Less expensive than locks, sure, but not at all free. Once you get to the point where you're potentially dealing with hundreds or thousands of atomic operations, you're quite possibly better off just using a coarse lock on each buffer or region. For instance, you could have a lock per 4096-float contiguous chunk of your large array and your threads take the necessary locks (in consistent order!) before writing out their results. There'd be no lock contention except in your (supposedly rare) overlapping cases. Profile and test to be sure performance actually does improve, of course.
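A rough sketch of that per-region locking, assuming the large array is carved into 4096-float regions with one std::mutex each (kRegionSize and addFragment are hypothetical names):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    static const std::size_t kRegionSize = 4096;  // floats per region
    // Construct once, sized to arraySize / kRegionSize; std::mutex
    // cannot be moved, so the vector must not be resized afterwards.
    std::vector<std::mutex> regionLocks;

    // A 4096-float fragment touches at most two regions. Locking them
    // in ascending index order keeps the lock order consistent across
    // all threads, so no deadlock is possible.
    void addFragment(float* largeArray, std::size_t offset,
                     const float* values, std::size_t count) {
        std::size_t first = offset / kRegionSize;
        std::size_t last  = (offset + count - 1) / kRegionSize;
        for (std::size_t r = first; r <= last; ++r) regionLocks[r].lock();
        for (std::size_t i = 0; i < count; ++i)
            largeArray[offset + i] += values[i];
        for (std::size_t r = first; r <= last; ++r) regionLocks[r].unlock();
    }

Only the lock *acquisition* order matters for deadlock avoidance; the unlock order is irrelevant.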

This smells like the X/Y problem. You asked about your supposed solution (atomics) to an (at the time) unspecified problem and now you're posting some broad details of an algorithm being applied for an unstated purpose. What are you actually trying to do that you need these float buffers and threads in the first place? There may be an entirely different and superior way to achieve your goal.

Sean Middleditch – Game Systems Engineer – Join my team!

OK, so far I've written this (destination and addvalue are float*):


            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong(destination,&oldval,newval));

But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong' : none of the 32 overloads could convert all the argument types). Any idea what I should do?

Compare and exchange is a bit-wise compare, so it will also only work on integers here... but you can cast it (as the compare when storing can still be bitwise). You could write it prettier and safer, perhaps, but this should work:


            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong((uint32_t*)destination,(uint32_t*)&oldval,*(uint32_t*)(&newval)));

But actually, to be honest, I would not do it exactly like that even if it happens to work...

Do all reads and writes on the memory that needs to be atomic as integers; use memcpy to copy the 4 bytes into a float, do the addition, then memcpy the result back.
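For illustration, a sketch of that memcpy approach, assuming the shared array is declared as std::atomic<uint32_t> elements holding float bit patterns (atomicFloatAdd is a made-up name):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // Assumes float is 32 bits on the target platform and the shared
    // array stores each float's bit pattern in a std::atomic<uint32_t>.
    void atomicFloatAdd(std::atomic<std::uint32_t>* destination,
                        float addvalue) {
        std::uint32_t oldbits = destination->load();
        for (;;) {
            float oldval;
            std::memcpy(&oldval, &oldbits, sizeof oldval);   // bits -> float
            float newval = oldval + addvalue;
            std::uint32_t newbits;
            std::memcpy(&newbits, &newval, sizeof newbits);  // float -> bits
            // On failure, compare_exchange_weak reloads oldbits with
            // the current value, so the loop retries with fresh data.
            if (destination->compare_exchange_weak(oldbits, newbits))
                break;
        }
    }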

@samoth: I actually have something like 1000 threads (only 4 of those are running at the same time, matching the real CPU cores; I'm chaining them using QThreadPool).

So I end up with a thousand blocks of results that can be added to the final (large) array at any time.

I could wait for all the threads to finish computation and store the results from these threads in a single thread, but then I would lose time (1 thread storing results being slower than 4 threads storing results).

Having 1, 2 or 3 results added at the same place is wanted behaviour (the results add together in the final array even if they overlap). I'm dealing with audio data (PCM); think of the threads as special audio effects being added to the long audio array.

@SeanMiddleditch: see the answer above (audio processing). However, you're right about the time taken by atomic operations; I'll have to benchmark that...

"I actually have something like 1000 threads"

If 4 out of 1,000 threads are running, why create 1,000 in the first place? Create a work queue and have 4 threads pull tasks from it (each task having one big block of private data and outputting one value to another queue). Pull one value at a time from the results queue (in the main thread) and apply them, one at a time.

Creating 1,000 threads takes very non-negligible CPU time, address space, and memory (think stack reserve/commit), even when the threads are not running (and if they are all running, you get a lot of context switches on top). These CPU cycles and this memory are better spent elsewhere.
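A bare-bones sketch of that shape in plain C++ (QThreadPool does essentially this for you; workerLoop and runPool are illustrative names only):

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::mutex taskMutex;
    std::queue<std::function<void()>> tasks;  // ~1,000 tasks pushed up front

    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(taskMutex);
                if (tasks.empty()) return;     // queue drained, thread exits
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // heavy computation, then post the result somewhere
        }
    }

    void runPool() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)            // 4 threads, not 1,000
            pool.emplace_back(workerLoop);
        for (std::thread& t : pool) t.join();
    }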

"1 thread storing results being slower than 4 threads storing results"

Yes, no, maybe. It depends. One thread adding some float value to memory will almost certainly be limited by memory bandwidth, not by arithmetic.

4 threads will be limited by memory bandwidth, too.

However, 4 threads modifying the same location (or a close-by location in the same cache line, or a multiple of cacheline-size times associativity) will incur false sharing, which greatly impacts performance. Also, atomic operations, which are necessary to make this work reliably, are orders of magnitude slower than normal writes.

I didn't express myself clearly; it's actually 4 threads with 1000 tasks queued (using QThreadPool).
Good point about the memory bandwidth bottleneck, I'll have to benchmark that.

Indeed. The way I would solve this (assuming 4 threads for the sake of example) is to pre-allocate a buffer, i.e. float results[4]; (actually larger, for the sake of padding to avoid false sharing: declare float results[3*16+1]; and use results[0], results[16], results[32] and results[48]), and then have a final thread sum all those results together when all four threads release the semaphore.
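In C++ the padding can also be expressed directly by giving each slot its own cache line, e.g. (PaddedResult is a made-up name; a 64-byte line is assumed):

    // Each slot is forced onto its own 64-byte cache line, so the four
    // writers never touch the same line (the moral equivalent of the
    // results[0]/results[16]/results[32]/results[48] spacing above).
    struct alignas(64) PaddedResult {
        float value;
    };

    PaddedResult results[4];  // thread i writes only results[i].value

    // Once all four threads have signalled completion, one thread sums:
    float sumResults() {
        float total = 0.0f;
        for (const PaddedResult& r : results)
            total += r.value;
        return total;
    }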

Still, my question remains unanswered. I asked "Why do you need to add floats to random locations from multiple threads?"; you answered "how".

I was expecting something like "I'm implementing my own software rasterizer including shader support" or "I'm doing DNA sequencing analysis", etc.

It should be faster, instead of making the whole array full of atomic floats, to partition the result array into regions protected by locks. That way, only the lock is expensive; the actual writes can happen out of order and without further synchronization.

Otherwise, due to how compare-exchange is implemented, you're effectively making all 4 cores line up in one queue to do their writes to this buffer.

