
atomic float add?



#1 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 08:27 AM

Hi,

I've been desperately trying for a week to add one float (at some place in memory) onto another float (at some other place), in an atomic way: multiple threads should be able to add values to this memory location without interfering.
I don't want to use a mutex, since that would slow down the process a lot.
I couldn't find any answer to this! Any idea?
(I just need to do an add, nothing else; no need for a full atomic float class if the problem is simplified this way.)

Thanks!




#2 Sponji   Members   -  Reputation: 1012


Posted 09 February 2014 - 08:59 AM

How about std::atomic<float>?


Derp

#3 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 10:29 AM

std::atomic<float> itself exists, but it has no add operator; fetch_add and operator+= are only provided for the integral specializations, AFAIK.



#4 samoth   Crossbones+   -  Reputation: 4109


Posted 09 February 2014 - 10:31 AM

It depends on what exactly you want. This

 

add one float (at some place in memory) onto another float (at some other place), in an atomic way.

 

and this:

multiple threads should be able to add values to this memory location without interfering.

 

are very different things.

 

The first thing (adding a value from one [volatile] memory location to a given float value) is impossible to do in a threadsafe manner without a mutex. The second thing (adding some value that is not concurrently changing to a given float value) is trivially achieved using std::atomic, or manually using a compare-exchange function (casting the float bits to an integer of the same size).

 

Generally, it is trivial to do this (a code sketch follows the list):

  • read bit pattern from some address atomically
  • perform any kind of complex operation (including casting to float, adding a value, and casting back)
  • atomically write the new bit pattern using compare-exchange
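As a minimal sketch of those three steps, assuming the shared float's bits are kept in a std::atomic<uint32_t> and using std::memcpy for the bit casts (the function name is made up for the example):

#include <atomic>
#include <cstdint>
#include <cstring>

// Atomically add 'value' to the float whose bit pattern lives in 'target'.
void atomicFloatAdd(std::atomic<std::uint32_t>& target, float value)
{
    std::uint32_t oldBits = target.load();          // 1. read bit pattern atomically
    for (;;) {
        float f;
        std::memcpy(&f, &oldBits, sizeof f);        // 2. cast to float...
        f += value;                                 //    ...do the operation...
        std::uint32_t newBits;
        std::memcpy(&newBits, &f, sizeof newBits);  //    ...and cast back
        if (target.compare_exchange_weak(oldBits, newBits))
            break;                                  // 3. atomic write via compare-exchange
        // On failure, oldBits was refreshed with the current value; retry.
    }
}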

 

It only gets problematic (requiring a mutex) when another memory location that you read from is involved, and that location might be modified concurrently too.


Edited by samoth, 09 February 2014 - 10:37 AM.


#5 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 10:37 AM

Maybe I wasn't clear in the second quote; both quotes refer to the same operation: adding the value of one float to another float. Unfortunately, std::atomic doesn't handle float add operations.



#6 Erik Rufelt   Crossbones+   -  Reputation: 2836


Posted 09 February 2014 - 11:08 AM

You have to loop in some way: either use a spin-lock mutex that waits to perform the add until no other thread is currently doing it, or do the add first, store the result with an atomic compare-and-swap, and retry on failure.

 

First way:

while( true ) {
  if(atomicCompareAndSetFlag())
    break;
}

memory[...] += value;

clearTheAtomicFlag();

Second:

bool done = false;
while( !done ) {
  float oldVal = memory[...];
  float newVal = oldVal + value;

  // This will only return true if the old value hasn't been changed since it was read
  done = atomicCompareAndStore(oldVal, newVal);
}

Both of these reduce to the same problem: atomically changing a value in memory only if it still holds a known value. This is called compare-and-exchange (or compare-and-swap).

Such a method returns true if the value was equal to the expected value, and in that case the value has atomically been swapped to the new value.

Processors implement these atomically; if two threads try to do it at exactly the same time, one of them will always go first and succeed, while the other will return false and not change the value, as it has already been changed.

 

std::atomic has compare and exchange, which you can use for that part.
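For the first (spin-lock) approach, a minimal C++11 sketch using std::atomic_flag (the names are illustrative, not from the post above):

#include <atomic>

std::atomic_flag lockFlag = ATOMIC_FLAG_INIT;

void lockedAdd(float* destination, float value)
{
    // Spin until we own the flag; whoever sets it first proceeds.
    while (lockFlag.test_and_set(std::memory_order_acquire))
        ;
    *destination += value;   // plain, non-atomic add under the lock
    lockFlag.clear(std::memory_order_release);
}

Note that a single flag serializes every add, even to unrelated locations; the compare-and-swap variant avoids that.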



#7 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 11:20 AM

Wow, your 2nd suggestion actually answers my problem very well!

I'm adding float values from 4 threads to a lot of different random locations in memory, so the chance that one thread accesses the same location as another thread is almost nil, but I cannot act as if it would never happen. So atomicCompareAndStore would succeed most of the time, and would only have to retry in very rare cases.

That sounds like a perfect solution for me :)



#8 Matias Goldberg   Crossbones+   -  Reputation: 2765


Posted 09 February 2014 - 11:59 AM

The fact that you need to do this hints at a bad design pattern. I suspect that even if you had hardware atomic float operations, your algorithm wouldn't scale with cores and would fail spectacularly.

 

Why do you need to add floats to random locations from multiple threads?


Edited by Matias Goldberg, 09 February 2014 - 12:00 PM.


#9 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 12:13 PM

The fact that you need to do this hints at a bad design pattern. I suspect that even if you had hardware atomic float operations, your algorithm wouldn't scale with cores and would fail spectacularly.

 

Why do you need to add floats to random locations from multiple threads?

 

My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) of this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.


Edited by rldivide, 09 February 2014 - 12:48 PM.


#10 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 12:47 PM

OK, so far I've written this (destination and addvalue are float*):

            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong(destination,&oldval,newval));

But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong': none of the 32 overloads could convert all the argument types). Any idea what I should do?


Edited by rldivide, 09 February 2014 - 12:50 PM.


#11 samoth   Crossbones+   -  Reputation: 4109


Posted 09 February 2014 - 12:51 PM

The problem gets more complicated every time. :D

 

 

a lot of computation using their own memory space

No problem.

 

Nobody else cares what a thread does with data that isn't shared (and even if it's shared, it only matters if at least one thread writes to the area).
 

the result of the computation needs to be stored back into a very large array in memory.

So basically, 4 threads do some lengthy operation and each outputs a result once. When the results are available (presumably you block on an event?) you read them. No problem.

 

Actually, you don't even need atomicity for that to work reliably. Since you must synchronize with a semaphore, an event, or a condition variable anyway, that's already more than good enough.

 


random locations and can partially overlap (it depends on what the user does)

Problem. This isn't so much a threading or atomicity thing, however. It's a correctness thing.

If you add those results to some "random ranges" and the ranges overlap, the overlapping region receives two results instead of one, atomic or not. If that is not the intended behaviour, then your code will not work correctly.


Edited by samoth, 09 February 2014 - 12:53 PM.


#12 SeanMiddleditch   Members   -  Reputation: 1913


Posted 09 February 2014 - 12:55 PM

My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) of this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.


Why do you think it has to work this way, or that it's the right way? I've yet to see an algorithm that needs this. Generally your threads would post their independent result buffers to a queue, and then another thread (or threads!) would perform the reduce/merge operations (if using multiple threads, ensuring that regions with overlaps are either split up or all handled by the same thread).

The previous comment about scaling is apt, btw. Atomics are expensive. Less expensive than locks, sure, but not at all free. Once you get to the point where you're potentially dealing with hundreds or thousands of atomic operations, you're quite possibly better off just using a coarse lock on each buffer or region. For instance, you could have a lock per 4096-float contiguous chunk of your large array, and your threads take the necessary locks (in consistent order!) before writing out their results, as sketched below. There'd be no lock contention except in your (supposedly rare) overlapping cases. Profile and test to be sure performance actually does improve, of course.
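As a rough illustration of that coarse-lock idea (a sketch only; the chunk size, names, and range arithmetic are assumptions, not from the posts above):

#include <cstddef>
#include <mutex>
#include <vector>

const std::size_t kChunk = 4096;

// Add src[0..count) into bigArray starting at 'start', holding the lock of
// every chunk the range touches, taken in ascending order to avoid deadlock.
void addRange(std::vector<float>& bigArray, std::vector<std::mutex>& chunkLocks,
              std::size_t start, const float* src, std::size_t count)
{
    std::size_t firstChunk = start / kChunk;
    std::size_t lastChunk  = (start + count - 1) / kChunk;

    std::vector<std::unique_lock<std::mutex>> held;
    for (std::size_t c = firstChunk; c <= lastChunk; ++c)  // consistent order
        held.emplace_back(chunkLocks[c]);

    for (std::size_t i = 0; i < count; ++i)
        bigArray[start + i] += src[i];   // plain adds, no atomics needed
}   // all locks release when 'held' goes out of scope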

This smells like the X/Y problem. You asked about your supposed solution (atomics) to an (at the time) unspecified problem and now you're posting some broad details of an algorithm being applied for an unstated purpose. What are you actually trying to do that you need these float buffers and threads in the first place? There may be an entirely different and superior way to achieve your goal.

#13 Erik Rufelt   Crossbones+   -  Reputation: 2836


Posted 09 February 2014 - 12:59 PM

OK, so far I've written this (destination and addvalue are float*):

            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong(destination,&oldval,newval));

But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong': none of the 32 overloads could convert all the argument types). Any idea what I should do?

 

 

Compare-and-exchange compares bit-wise, so it will also only work on integers here... but you can cast, since the compare when storing can still be bitwise. You could write it prettier and safer, but this should work:

            float oldval, newval;
            do {
                oldval=*destination;
                newval=oldval+*addvalue;
            } while (!std::atomic_compare_exchange_strong((std::atomic<uint32_t>*)destination,(uint32_t*)&oldval,*(uint32_t*)&newval));


#14 Erik Rufelt   Crossbones+   -  Reputation: 2836


Posted 09 February 2014 - 01:09 PM

But to be honest, I would not do it exactly like that even if it happens to work...

Do all reads and writes on the memory that needs to be atomic as integers, and use memcpy to copy the 4 bytes into a float for the add, then memcpy the result back.
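In other words, the round-trip could look like this (a sketch using std::memcpy from <cstring>; bits is the integer word that was loaded atomically and value is the float to add):

float f;
std::memcpy(&f, &bits, sizeof f);   // integer bits -> float
f += value;                         // do the add in float
std::memcpy(&bits, &f, sizeof f);   // back to integer bits, ready for the compare-exchange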


Edited by Erik Rufelt, 09 February 2014 - 01:34 PM.


#15 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 01:38 PM

@samoth: I actually have something like 1000 threads (only 4 of those run at the same time, matching the real CPU cores; I'm chaining them using QThreadPool).

So I end up with a thousand blocks of results that can be added to the final (large) array at any time.

I could wait for all the threads to finish computation and store their results from a single thread, but then I would lose time (1 thread storing results being slower than 4 threads storing results).

1, 2 or 3 results being added at the same place is a wanted behaviour (the results add together in the final array even if they overlap). I'm dealing with audio data (PCM); think of the threads as special audio effects being added to the long audio array.

@SeanMiddleditch: see the answer above (audio processing). However, you're right about the time taken by atomic functions; I'll have to benchmark that...


Edited by rldivide, 09 February 2014 - 01:39 PM.


#16 samoth   Crossbones+   -  Reputation: 4109


Posted 09 February 2014 - 02:22 PM

I actually have something like 1000 threads

If only 4 out of 1,000 threads are running, why create 1,000 in the first place? Create a work queue and have 4 threads pull tasks from it (each task having one big block of private data, and outputting one value to another queue). Pull one value at a time from the results queue (in the main thread) and apply them, one at a time, as sketched below.
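A bare-bones sketch of such a queue using standard C++11 primitives (QThreadPool provides the same service; the class and all names here are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// A tiny thread-safe queue: worker threads pop tasks from one instance,
// the main thread pops finished results from another.
template <typename T>
class BlockingQueue {
public:
    void push(T item)
    {
        {
            std::lock_guard<std::mutex> guard(m);
            q.push(std::move(item));
        }
        cv.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> guard(m);
        cv.wait(guard, [this] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop();
        return item;
    }

private:
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> q;
};

Four worker threads would loop on pop() of the task queue, run the task, and push the finished block to a results queue; the main thread pops results one at a time and adds each block into the big array, so the adds themselves need no atomics.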

 

Creating 1,000 threads takes very non-negligible CPU time, address space, and memory (think stack reserve/commit), even when the threads are not running (and if they are all running, you get a lot of context switches on top). These CPU cycles and this memory are better spent elsewhere.
 

 

1 thread storing results being slower than 4 threads storing results

Yes, no, maybe. It depends. One thread adding some float value to memory will almost certainly be limited by memory bandwidth, not by arithmetic.

 

4 threads will be limited by memory bandwidth, too.

 

However, 4 threads modifying the same location (or a nearby location in the same cache line, or one a multiple of cache-line size times associativity away) will incur false sharing, which greatly impacts performance. Also, the atomic operations necessary to make this work reliably are orders of magnitude slower than normal writes.


Edited by samoth, 09 February 2014 - 02:23 PM.


#17 rldivide   Members   -  Reputation: 130


Posted 09 February 2014 - 05:13 PM

I didn't express it clearly; it's actually 4 threads with 1000 tasks queued (using QThreadPool).
Good point about the memory bandwidth bottleneck, I'll have to benchmark that.

#18 Matias Goldberg   Crossbones+   -  Reputation: 2765


Posted 10 February 2014 - 07:40 PM

Indeed. The way I would solve this (assuming 4 threads for the sake of example) is to pre-allocate a results buffer, padded to avoid false sharing: instead of float results[4], use float results[3*16+1] and have the threads write to results[0], results[16], results[32] and results[48]. A final thread then sums all those results[i] together when all four threads release the semaphore.
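For example, assuming 64-byte cache lines, the same padding could be expressed with alignas (a sketch; the struct is illustrative):

// One slot per thread, each padded to its own cache line so concurrent
// writes never touch the same line (no false sharing).
struct alignas(64) PaddedFloat {
    float value;
};

PaddedFloat results[4];   // thread i writes only results[i].value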

 

Still, my question remains unanswered. I asked "Why do you need to add floats to random locations from multiple threads?"; you answered "how".

I was expecting something like "I'm implementing my own software rasterizer, including shader support" or "I'm doing DNA sequencing analysis", etc.



#19 King Mir   Members   -  Reputation: 1816


Posted 11 February 2014 - 01:57 PM

It should be faster, instead of making the whole array effectively atomic floats, to partition the result array into regions protected by locks. That way, only the lock is expensive, and the actual writes can happen out of order and without other synchronization.

Otherwise, due to how compare-exchange is implemented, you're effectively making all 4 cores line up in a single queue to do their writes to this buffer.



