atomic float add?

14 comments, last by King Mir 10 years, 2 months ago

The problem gets more complicated every time. :D

"a lot of computation using their own memory space"

No problem.

Nobody else cares what a thread does with data that isn't shared (and even if it's shared, it only matters if at least one thread writes to the area).

"the result of the computation needs to be stored back into a very large array in memory."

So basically, 4 threads do some lengthy operation and output 4 result values once. When they are available (presumably you block on an event?) you read them. No problem.

Actually, you don't even need atomicity for that to work reliably. Since you must synchronize with a semaphore or an event or a cond var anyway, that's already more than good enough.


"random locations and can partially overlap (it depends on what the user does)"

Problem. This isn't so much a threading or atomicity thing, however. It's a correctness thing.

If you add those results to some "random ranges" and these ranges overlap, you will add two results instead of one, atomic or not. If that is not allowable (or even desirable) then your code will not work correctly.


My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) from this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.


Why do you think it has to work this way, or that it's the right way? I've yet to see an algorithm needing this. Generally your threads would post their independent result buffers to a queue and then another thread (or threads!) would perform the reduce/merge operations (if using multiple threads, ensuring that regions with overlaps are either split up or all handled by the same thread).
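To make that concrete, here is a minimal sketch of the queue-and-merge shape in plain C++ (ResultBlock, postResult and mergeLoop are made-up names for illustration, and a real version would need a shutdown signal):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct ResultBlock {
        std::size_t offset;         // start index in the large array
        std::vector<float> values;  // the 4096 computed floats
    };

    std::mutex              queueMutex;
    std::condition_variable queueCv;
    std::queue<ResultBlock> resultQueue;

    // Worker threads call this when their computation is done.
    void postResult(ResultBlock block) {
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            resultQueue.push(std::move(block));
        }
        queueCv.notify_one();
    }

    // A single merge thread drains the queue and performs every add,
    // so overlapping fragments can never race with each other.
    void mergeLoop(float* largeArray) {
        for (;;) {  // a real version needs a way to stop
            std::unique_lock<std::mutex> lock(queueMutex);
            queueCv.wait(lock, [] { return !resultQueue.empty(); });
            ResultBlock block = std::move(resultQueue.front());
            resultQueue.pop();
            lock.unlock();  // do the adds outside the lock
            for (std::size_t i = 0; i < block.values.size(); ++i)
                largeArray[block.offset + i] += block.values[i];
        }
    }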

The previous comment about scaling is apt, btw. Atomics are expensive. Less expensive than locks, sure, but not at all free. Once you get to the point where you're potentially dealing with hundreds or thousands of atomic operations, you're quite possibly better off just using a coarse lock on each buffer or region. For instance, you could have a lock per 4096-float contiguous chunk of your large array and your threads take the necessary locks (in consistent order!) before writing out their results. There'd be no lock contention except in your (supposedly rare) overlapping cases. Profile and test to be sure performance actually does improve, of course.
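A rough sketch of that per-region locking, assuming the large array is carved into 4096-float regions with one std::mutex each (kRegionSize and addFragment are hypothetical names):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    static const std::size_t kRegionSize = 4096;  // floats per region
    // Construct once, sized to arraySize / kRegionSize; std::mutex
    // cannot be moved, so the vector must not be resized afterwards.
    std::vector<std::mutex> regionLocks;

    // A 4096-float fragment touches at most two regions. Locking them
    // in ascending index order keeps the lock order consistent across
    // all threads, so no deadlock is possible.
    void addFragment(float* largeArray, std::size_t offset,
                     const float* values, std::size_t count) {
        std::size_t first = offset / kRegionSize;
        std::size_t last  = (offset + count - 1) / kRegionSize;
        for (std::size_t r = first; r <= last; ++r) regionLocks[r].lock();
        for (std::size_t i = 0; i < count; ++i)
            largeArray[offset + i] += values[i];
        for (std::size_t r = first; r <= last; ++r) regionLocks[r].unlock();
    }

Only the lock *acquisition* order matters for deadlock avoidance; the unlock order is irrelevant.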

This smells like the X/Y problem. You asked about your supposed solution (atomics) to an (at the time) unspecified problem and now you're posting some broad details of an algorithm being applied for an unstated purpose. What are you actually trying to do that you need these float buffers and threads in the first place? There may be an entirely different and superior way to achieve your goal.

Sean Middleditch – Game Systems Engineer – Join my team!

OK, so far I've written this (destination and addvalue are float*):


            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong(destination,&oldval,newval));

But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong' : none of the 32 overloads could convert all the argument types). Any idea what I should do?

Compare and exchange is a bit-wise compare, so it will also only work on integers here... but you can cast it (as the compare when storing can still be bitwise). You could write it prettier and safer, perhaps, but this should work:


            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong((uint32_t*)destination,(uint32_t*)&oldval,*(uint32_t*)(&newval)));

But actually, to be honest, I would not do it exactly like that even if it happens to work...

Do all reads and writes on the memory that needs to be atomic as integers; use memcpy to copy the 4 bytes into a float, do the addition, then memcpy the result back.
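For illustration, a sketch of that memcpy approach, assuming the shared array is declared as std::atomic<uint32_t> elements holding float bit patterns (atomicFloatAdd is a made-up name):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // Assumes float is 32 bits on the target platform and the shared
    // array stores each float's bit pattern in a std::atomic<uint32_t>.
    void atomicFloatAdd(std::atomic<std::uint32_t>* destination,
                        float addvalue) {
        std::uint32_t oldbits = destination->load();
        for (;;) {
            float oldval;
            std::memcpy(&oldval, &oldbits, sizeof oldval);   // bits -> float
            float newval = oldval + addvalue;
            std::uint32_t newbits;
            std::memcpy(&newbits, &newval, sizeof newbits);  // float -> bits
            // On failure, compare_exchange_weak reloads oldbits with
            // the current value, so the loop retries with fresh data.
            if (destination->compare_exchange_weak(oldbits, newbits))
                break;
        }
    }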

@samoth: I actually have something like 1000 threads (only 4 of those are running at the same time, matching the real CPU cores; I'm chaining them using QThreadPool).

So I end up with a thousand blocks of results that can be added to the final (large) array at any time.

I could wait for all the threads to finish computation and store the results from these threads in a single thread, but then I would lose time (1 thread storing results being slower than 4 threads storing results).

Having 1, 2 or 3 results added at the same place is wanted behaviour (the results add together in the final array even if they overlap). I'm dealing with audio data (PCM); think of the threads as special audio effects being added to the long audio array.

@SeanMiddleditch: see the answer above (audio processing). However, you're right about the time taken by atomic operations; I'll have to benchmark that...

"I actually have something like 1000 threads"

If 4 out of 1,000 threads are running, why create 1,000 in the first place? Create a work queue and have 4 threads pull tasks from it (each task having one big block of private data and outputting one value to another queue). Pull one value at a time from the results queue (in the main thread) and apply them, one at a time.

Creating 1,000 threads takes very non-negligible CPU time, address space, and memory (think stack reserve/commit), even when the threads are not running (and if they are all running, you get a lot of context switches on top). These CPU cycles and this memory are better spent elsewhere.
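A bare-bones sketch of that shape in plain C++ (QThreadPool does essentially this for you; workerLoop and runPool are illustrative names only):

    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::mutex taskMutex;
    std::queue<std::function<void()>> tasks;  // ~1,000 tasks pushed up front

    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(taskMutex);
                if (tasks.empty()) return;     // queue drained, thread exits
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // heavy computation, then post the result somewhere
        }
    }

    void runPool() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i)            // 4 threads, not 1,000
            pool.emplace_back(workerLoop);
        for (std::thread& t : pool) t.join();
    }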

"1 thread storing results being slower than 4 threads storing results"

Yes, no, maybe. It depends. One thread adding some float value to memory will almost certainly be limited by memory bandwidth, not by arithmetic.

4 threads will be limited by memory bandwidth, too.

However, 4 threads modifying the same location (or a close-by location in the same cache line, or a multiple of cacheline-size times associativity) will incur false sharing, which greatly impacts performance. Also, atomic operations, which are necessary to make this work reliably, are orders of magnitude slower than normal writes.

I didn't express myself clearly; it's actually 4 threads with 1000 tasks queued (using QThreadPool).
Good point about the memory bandwidth bottleneck, I'll have to benchmark that.

Indeed. The way I would solve this (assuming 4 threads for the sake of example) is to pre-allocate a buffer, i.e. float results[4]; (actually larger, for the sake of padding to avoid false sharing: declare float results[3*16+1]; and use results[0], results[16], results[32] and results[48]), and then have a final thread sum all those results together when all four threads release the semaphore.
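In C++ the padding can also be expressed directly by giving each slot its own cache line, e.g. (PaddedResult is a made-up name; a 64-byte line is assumed):

    // Each slot is forced onto its own 64-byte cache line, so the four
    // writers never touch the same line (the moral equivalent of the
    // results[0]/results[16]/results[32]/results[48] spacing above).
    struct alignas(64) PaddedResult {
        float value;
    };

    PaddedResult results[4];  // thread i writes only results[i].value

    // Once all four threads have signalled completion, one thread sums:
    float sumResults() {
        float total = 0.0f;
        for (const PaddedResult& r : results)
            total += r.value;
        return total;
    }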

Still, my question remains unanswered. I asked "Why do you need to add floats to random locations from multiple threads?"; you answered "how".

I was expecting something like "I'm implementing my own software rasterizer including shader support" or "I'm doing DNA sequencing analysis", etc.

It should be faster, instead of making the whole array full of atomic floats, to partition the result array into regions protected by locks. That way, only the lock is expensive; the actual writes can happen out of order and without further synchronization.

Otherwise, due to how compare-exchange is implemented, you're effectively making all 4 cores line up in one queue to do their writes to this buffer.

