rldivide

atomic float add?

15 posts in this topic

Hi,

I've been desperately trying for a week to add one float (at some place in memory) onto another float (at some other place) in an atomic way, meaning multiple threads should be able to add values to this memory location without interfering.
I don't want to use a mutex since it would slow down the process a lot.
I couldn't find any answer to this! Any ideas?
(I just need to do an add, nothing else; no need for a full atomic float class if the problem can be simplified this way.)

Thanks!


It depends on what exactly you want. This:

> add one float (at some place in memory) onto another float (at some other place), in an atomic way.

and this:

> multiple threads should be able to add values to this memory location without interfering.

are very different things.

The first thing (adding a value from one [volatile] memory location to a given float value) is impossible to do in a threadsafe manner without a mutex. The second thing (adding some not-concurrently-changing value to a given float value) is trivially achieved using std::atomic, or manually using a compare-exchange function (casting the float bits to an integer of the same size).

Generally, it is trivial to do this:

  • read bit pattern from some address atomically
  • perform any kind of complex operation (including casting to float, adding a value, and casting back)
  • atomically write the new bit pattern using compare-exchange

 

It only gets problematic (requiring a mutex) when another memory location that you read from is involved, and that one might be modified concurrently too.
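
For illustration, a minimal sketch of that loop, assuming the shared value can be declared as a C++11 std::atomic<float> (the function name is made up):

    #include <atomic>

    // Atomically adds 'value' to 'target' using the read/modify/compare-exchange
    // loop described above. compare_exchange compares bit patterns, so no integer
    // cast is needed when the value is declared atomic to begin with.
    void atomic_float_add(std::atomic<float>& target, float value)
    {
        float expected = target.load();
        // On failure, compare_exchange_weak reloads 'expected' with the current
        // value, so the loop simply retries with fresh data.
        while (!target.compare_exchange_weak(expected, expected + value))
            ;
    }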

Edited by samoth

Maybe I wasn't clear in the second quote; both quotes are supposed to refer to the same operation: adding the value of one float to another float. Unfortunately, std::atomic doesn't handle float add operations.


Wow, your second suggestion actually answers my problem very well!

I'm adding float values from 4 threads to a lot of different random locations in memory, so the chance that one thread accesses the same location as another thread is almost nil, but I cannot act as if it would never happen. So atomicCompareAndStore would succeed most of the time and would only have to retry in very rare cases.

So that sounds like a perfect solution for me :)


The fact that you need to do this hints at a bad-practice pattern. I suspect that even if you had hardware atomic float operations, your algorithm wouldn't scale with cores and would fail spectacularly.

 

Why do you need to add floats to random locations from multiple threads?

Edited by Matias Goldberg

> The fact that you need to do this hints at a bad-practice pattern. I suspect that even if you had hardware atomic float operations, your algorithm wouldn't scale with cores and would fail spectacularly.
>
> Why do you need to add floats to random locations from multiple threads?

 

My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) of this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.

Edited by rldivide

Ok, so far I've written this (destination and addvalue are float*):

    float oldval, newval;
    do {
        oldval = *destination;
        newval = oldval + *addvalue;
    } while (!std::atomic_compare_exchange_strong(destination, &oldval, newval));

But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong': none of the 32 overloads could convert all the argument types). Any idea what I should do?

Edited by rldivide

The problem gets more complicated every time :D

> a lot of computation using their own memory space

No problem. Nobody else cares what a thread does with data that isn't shared (and even if it is shared, it only matters if at least one thread writes to the area).

> the result of the computation needs to be stored back into a very large array in memory.

So basically, 4 threads do some lengthy operation and output 4 result values once. When they are available (presumably you block on an event?) you read them. No problem.

Actually, you don't even need atomicity for that to work reliably. Since you must synchronize with a semaphore, an event, or a condition variable anyway, that's already more than good enough.

> random locations and can partially overlap (it depends on what the user does)

Problem. This isn't so much a threading or atomicity thing, however; it's a correctness thing.

If you add those results to some "random ranges" and these ranges overlap, you will add two results instead of one, atomic or not. If that is not allowable (or even desired), then your code will not work correctly.

Edited by samoth

> My threads do a lot of computation using their own memory space before this float add happens. It's actually the last step, where the result of the computation needs to be stored back into a very large array in memory. Each thread is assigned a fragment (4096 contiguous floats) of this very large array, and these fragments are at random locations and can partially overlap (it depends on what the user does). Because they can overlap (it rarely happens, though), I need to handle these cases as well when the results are stored back.


Why do you think it has to work this way, or that it's the right way? I've yet to see an algorithm that needs this. Generally your threads would post their independent result buffers to a queue, and another thread (or threads!) would perform the reduce/merge operations (if using multiple threads, ensuring that regions with overlaps are either split up or all handled by the same thread).

The previous comment about scaling is apt, btw. Atomics are expensive. Less expensive than locks, sure, but not at all free. Once you're potentially dealing with hundreds or thousands of atomic operations, you're quite possibly better off just using a coarse lock on each buffer or region. For instance, you could have a lock per 4096-float contiguous chunk of your large array and have your threads take the necessary locks (in consistent order!) before writing out their results. There'd be no lock contention except in your (supposedly rare) overlapping cases. Profile and test to be sure performance actually does improve, of course. A sketch of the idea follows below.
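
For what it's worth, a rough sketch of that per-region locking scheme (kRegionSize, SharedArray, and add_range are made-up names; assumes C++11):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    const std::size_t kRegionSize = 4096;   // floats per locked region

    struct SharedArray {
        std::vector<float> data;
        std::vector<std::mutex> locks;      // one mutex per region

        explicit SharedArray(std::size_t n)
            : data(n), locks((n + kRegionSize - 1) / kRegionSize) {}

        // Adds src[0..count) into data starting at offset. Locks every region
        // the fragment touches, always in ascending order to avoid deadlock.
        void add_range(std::size_t offset, const float* src, std::size_t count) {
            if (count == 0) return;
            std::size_t first = offset / kRegionSize;
            std::size_t last  = (offset + count - 1) / kRegionSize;
            for (std::size_t r = first; r <= last; ++r) locks[r].lock();
            for (std::size_t i = 0; i < count; ++i) data[offset + i] += src[i];
            for (std::size_t r = first; r <= last; ++r) locks[r].unlock();
        }
    };

A 4096-float fragment spans at most two regions, so in the common non-overlapping case each writer takes two uncontended locks for 4096 plain (non-atomic) writes.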

This smells like the X/Y problem. You asked about your supposed solution (atomics) to an (at the time) unspecified problem, and now you're posting some broad details of an algorithm being applied for an unstated purpose. What are you actually trying to do that you need these float buffers and threads in the first place? There may be an entirely different and superior way to achieve your goal.

> Ok, so far I've written this (destination and addvalue are float*):
>
>     float oldval, newval;
>     do {
>         oldval = *destination;
>         newval = oldval + *addvalue;
>     } while (!std::atomic_compare_exchange_strong(destination, &oldval, newval));
>
> But it doesn't compile (C2665: 'std::atomic_compare_exchange_strong': none of the 32 overloads could convert all the argument types). Any idea what I should do?

Compare-and-exchange is a bit-wise compare, so it will also only work on integers here... but you can cast it (as the compare when storing can still be bitwise). You could write it prettier and safer perhaps, but this should work:

    float oldval, newval;
    do {
        oldval = *destination;
        newval = oldval + *addvalue;
    } while (!std::atomic_compare_exchange_strong(
                 (std::atomic<uint32_t>*)destination,  // treat the float's storage as an atomic integer
                 (uint32_t*)&oldval,
                 *(uint32_t*)&newval));

But actually, to be honest, I would not do it exactly like that even if it happens to work...

Do all reads/writes on the memory that needs to be atomic as integers, and use memcpy to copy the 4 bytes into a float for the add, then memcpy the result back. Something like the sketch below.
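
That is, assuming the destination array is declared as std::atomic<std::uint32_t> storage (illustrative sketch, names made up):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // Adds 'value' to the float whose bits live in 'target'. All shared reads
    // and writes go through the atomic integer; memcpy does the float <-> bits
    // conversion without any pointer casting (so no aliasing trouble).
    void atomic_float_add(std::atomic<std::uint32_t>& target, float value)
    {
        std::uint32_t oldbits = target.load();
        for (;;) {
            float f;
            std::memcpy(&f, &oldbits, sizeof f);    // bits -> float
            f += value;
            std::uint32_t newbits;
            std::memcpy(&newbits, &f, sizeof f);    // float -> bits
            // On failure, oldbits is refreshed with the current value; retry.
            if (target.compare_exchange_weak(oldbits, newbits))
                break;
        }
    }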

Edited by Erik Rufelt

@samoth: I actually have something like 1000 threads (only 4 of those are running at the same time, matching the real CPU cores; I'm chaining them using QThreadPool).

So I end up with a thousand blocks of results that can be added at any time to the final (large) array.

I could wait for all the threads to finish computing and store the results from these threads in a single thread, but then I would lose time (1 thread storing results being slower than 4 threads storing results).

1, 2, or 3 results being added at the same place is wanted behaviour (the results add together in the final array even if they overlap). I'm dealing with audio data (PCM); think of the threads as special audio effects being added to the long audio array.

 

@SeanMiddleditch: see the answer above (audio processing). However, you're right about the time taken by atomic functions; I'll have to benchmark that...

Edited by rldivide
> I actually have something like 1000 threads

If only 4 out of 1,000 threads are running, why create 1,000 in the first place? Create a work queue and have 4 threads pull tasks from it (each task having one big block of private data, and outputting one value to another queue). Pull one value at a time from the results queue (in the main thread) and apply them, one at a time. A rough sketch of that shape follows.
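
(Nothing Qt-specific here; a QThreadPool gives you this out of the box. A bare-bones hand-rolled version, with all names made up:)

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>

    // A minimal task queue: workers block until a task is available, run it,
    // and go back to waiting. Four worker threads would each call run().
    class WorkQueue {
        std::queue<std::function<void()>> tasks;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;

    public:
        void push(std::function<void()> task) {
            { std::lock_guard<std::mutex> g(m); tasks.push(std::move(task)); }
            cv.notify_one();
        }
        void shutdown() {
            { std::lock_guard<std::mutex> g(m); done = true; }
            cv.notify_all();
        }
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> g(m);
                    cv.wait(g, [this] { return done || !tasks.empty(); });
                    if (tasks.empty()) return;      // shut down and drained
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();                             // run outside the lock
            }
        }
    };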

 

Creating 1,000 threads takes decidedly non-negligible CPU time, address space, and memory (think stack reserve/commit), even when the threads are not running (and if they were all running, you would have a lot of context switches on top). These CPU cycles and this memory are better spent elsewhere.
 

 

> 1 thread storing results being slower than 4 threads storing results

Yes, no, maybe. It depends. One thread adding some float value to memory will almost certainly be limited by memory bandwidth, not by arithmetic.

 

4 threads will be limited by memory bandwidth, too.

 

However, 4 threads modifying the same location (or a nearby location in the same cache line, or at a multiple of cache-line size times associativity) will incur false sharing, which greatly impacts performance. Also, atomic operations, which are necessary to make this work reliably, are orders of magnitude slower than normal writes.

Edited by samoth
I didn't express myself clearly; it's actually 4 threads with 1000 tasks queued (using QThreadPool).
Good point about the memory bandwidth bottleneck, I'll have to benchmark that.

Indeed. The way I would solve this (assuming 4 threads for the sake of example) is to pre-allocate a buffer, i.e. float results[4]; (actually larger, padded to avoid false sharing: i.e. results[3*16+1], using results[0], results[16], results[32], and results[48]), and then have a final thread sum all those results[i] together once all four threads have released the semaphore. A small sketch of the padding trick follows.
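
(The same layout expressed with alignas, assuming 64-byte cache lines, i.e. 16 floats per line; PaddedResult is a made-up name:)

    // Each thread gets its own 64-byte-aligned slot, so the four result
    // writes never touch the same cache line.
    struct alignas(64) PaddedResult { float value; };

    PaddedResult results[4];    // results[i].value is written only by thread i

    // After all four threads have released the semaphore, one thread sums:
    // float total = results[0].value + results[1].value
    //             + results[2].value + results[3].value;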

 

Still, my question remains unanswered. I asked "Why do you need to add floats to random locations from multiple threads?"; you answered "how".

I was expecting something like "I'm implementing my own software rasterizer including shader support" or "I'm doing DNA sequencing analysis", etc.

It should be faster, instead of making the whole array full of atomic floats, to partition the result array into regions protected by locks. That way, only the lock is expensive, but the actual writes can happen out of order and without other synchronization.

Otherwise, due to how compare-exchange is implemented, you're effectively making all 4 cores line up in one queue to do their writes to this buffer.