Tsus

DX11
[Dx11] InterlockedAdd on floats in Pixel Shader - Workaround?

10 posts in this topic

Hi!

I'd like to use an InterlockedAdd operation on floats. Unfortunately, the documentation says it’s allowed on ints and uints only. So, I’m looking for some sort of workaround.

Here is my scenario: I have pixel shaders that need to write multiple times into a resource (at different positions). The problem is, the number of writes differs from pixel to pixel and is somewhere between zero (not that unlikely, actually) and about five. The write operation is a simple addition of a float (so a simple scattering via rasterization would do the trick if I didn't have to write more than once...)

Btw, what is the globallycoherent modifier doing in pixel shaders? Will it ensure that different primitives see each other's writes?

Any thoughts or ideas on that would be greatly appreciated!

Thanks in advance!
Cheers,
Tsus
Maybe "scale" the floats to a certain range and store/add as normal ints? For example you know your floats will be from 0.0f to 100.0f, therefore you store them as uints from 0x00000000 to 0xFFFFFFFF with precision (step) (100.0f-0.0f)/(2^-32), which is over 9 decimal digits, if I count correctly :-) Then you can decimate them to these uints, run the atomic operations on them as uints and just convert them back to floats after reading back (if needed).

This, of course, will not work, if you require a huge range.
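In code, a minimal sketch of that idea might look like the following (the buffer name, the 100.0f bound, and the 8 bits of headroom are assumptions for illustration, not from the thread):

[code]RWByteAddressBuffer Accum : register( u0 );

static const float MAX_VALUE = 100.0f;                  // assumed upper bound per summand
static const float SCALE = float(1 << 24) / MAX_VALUE;  // leave 8 high bits of headroom,
                                                        // so ~256 adds can't overflow 32 bits

void interlockedAddFixedPoint(uint addr, float value)
{
    // Quantize to fixed point; the native uint atomic then sums exactly.
    Accum.InterlockedAdd(addr, (uint)(value * SCALE + 0.5f));
}

float loadFixedPoint(uint addr)
{
    // Convert the accumulated fixed-point sum back to float on read-back.
    return (float)Accum.Load(addr) / SCALE;
}[/code]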
Thanks for the quick response!

[quote name='pcmaster' timestamp='1319545249' post='4876715']
Maybe "scale" the floats to a certain range, almost a fixed point, round them and store/add them as normal ints? For example, if you know your floats will be from 0.0f to 100.0f, you can store them as uints from 0x00000000 to 0xFFFFFFFF with a precision of (100.0f-0.0f)/(2^32), which is over 9 decimal digits, if I count correctly :-) Then you quantize them to these uints, run the atomic operations on them as uints, and just convert them back to floats after reading back (if needed).

This, of course, will not work if you require a huge range.
[/quote]
I hesitate to go in this direction, since the computation should be unbiased. The range I'm expecting increases over time, and unfortunately I'll need the most precision at the end.
It's nice to have a fallback solution, but accuracy is crucial in my case…
Could you use an integer as an array index to store each of the floats in a different location? With a maximum of 5 floats, that shouldn't require too much extra storage.
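One way to read this suggestion, as a minimal sketch (the resource names, the per-target counter layout, and maxSlots are hypothetical): reserve a unique slot with a uint atomic, write the float plainly, and sum the slots in a later pass.

[code]RWStructuredBuffer<float> Slots   : register( u1 );
RWByteAddressBuffer       Counter : register( u2 ); // one uint per target position

void appendFloat(uint targetIndex, uint maxSlots, float value)
{
    uint slot;
    // Atomically reserve the next free slot for this target position.
    Counter.InterlockedAdd(targetIndex * 4, 1, slot);
    if (slot < maxSlots)
        Slots[targetIndex * maxSlots + slot] = value; // plain, non-atomic float write
}[/code]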
[quote name='Adam_42' timestamp='1319560284' post='4876806']
Could you use an integer as an array index to store each of the floats in a different location? With a maximum of 5 floats, that shouldn't require too much extra storage.
[/quote]
Hm, in general I can't make any assumptions about how many write operations the pixel shader will have to do at most. It could be 5 (as I said before, just to give you a rough feeling); it could just as well be 100. It always depends on the scene and the view. (Sorry, I should have made that clearer.)
Even if I knew that at most 5 write operations would happen, they would actually happen concurrently: different pixel shader invocations will probably want to add a value at the same position. So storing a fixed number of floats at dedicated pre-allocated positions wouldn't help me sum them up. I'd rather avoid a second splatting pass...
Maybe you could do with some kind of manual locking and busy waiting (boo boo boo :D). A kind of manual mutex. So you have a texture representing the mutex, one texel per fragment, initialised to 0. Now a thread (fragment) wants to operate on some memory location [x,y]:

[code]// critical section enter (like mutex::lock())
[allow_uav_condition] do
{
    uint orig;
    InterlockedCompareExchange(mutex[uint2(x, y)], 0, 1, orig);
    if (orig == 0) // the exchange succeeded: you own the "mutex"
        break;     // mutex[uint2(x, y)] now equals 1
} while (true);[/code]
Then touch the float4 texture at [x,y]: read it, modify the value, write it back. Nobody else will touch it in the meantime. After you're done, call
[code]InterlockedCompareExchange(mutex[uint2(x, y)], 1, 0, dummy); // critical section leave (like mutex::unlock())[/code]
Since we made sure that mutex[x,y] == 1, this will exchange its value to 0. This signals to the other threads waiting in the loop for this location that the mutex is "free" and one of them can enter the critical section. I claim this is actually the same serialisation that the GPU thread scheduler (or whatever it's called) would do anyway -- if many threads want to access the same critical location, they have to queue up.

I have not done this before, at least not with DX11 (I did something similar in OpenCL). I have mixed experience with such "complex" shaders and DX11 (fxc.exe), so I have no idea whether this will actually work, but to me it seems legit :-) I'm NOT sure whether this will work in a pixel shader, but in a compute shader (or OpenCL or CUDA) it really should. The main problem might be the endless loop, which is something the optimiser doesn't seem to like at all :D
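Putting those pieces together, a self-contained sketch might look like this (untested, as said; the resource names are made up, and the data texture is declared as single-channel float because D3D11 only guarantees typed UAV loads for R32 formats):

[code]globallycoherent RWTexture2D<uint>  Mutex : register( u1 ); // 0 = free, 1 = taken
globallycoherent RWTexture2D<float> Data  : register( u2 ); // R32_FLOAT so UAV loads work

void lockedAdd(uint2 pos, float value)
{
    bool done = false;
    [allow_uav_condition] while (!done)
    {
        uint orig;
        // Try to take the mutex: 0 -> 1.
        InterlockedCompareExchange(Mutex[pos], 0, 1, orig);
        if (orig == 0)
        {
            // Critical section: we own [pos], so a plain read-modify-write is safe.
            Data[pos] += value;
            // Release the mutex: 1 -> 0.
            InterlockedExchange(Mutex[pos], 0, orig);
            done = true;
        }
    }
}[/code]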
Oh, well! The fact that we need to pull out such guns tells me we're running out of options, doesn't it? :)
But seriously, I'll give that a try and come back to share the results with you. It looks odd and promising at the same time. :)
Unfortunately my colleague is having a hard time with a deadline, so I'm going to help out there for a while. Thus, this project will have to wait a little… Nevertheless, I'll try it.
Thank you very much pcmaster, you've given me two things to try out. Two thumbs up! :)

In the meantime, I hope someone will come up with another brilliant idea.
I still have a little hope that the globallycoherent modifier could help me out, but I don't know yet what it actually does. Isn't it the kind of thing I would put on a resource in a compute shader to make operations like float additions visible to all threads? If so, what is it doing in pixel shaders?
Let us know what you came up with in the end :-)

I don't know much about globallycoherent, and as you probably did too, I tried looking for some info but only ran into a few bits. That's the case with all this new GPGPU HLSL stuff; discussions and info are extremely scarce, and not many people seem to use it, although it's been out there for over a year now :-(
The MSDN documentation suggests that globallycoherent should somehow synchronise between all groups, device-wide. However, I don't see any barrier HLSL intrinsic that would block a whole dispatch in compute shaders, or all pixels in a PS. Perhaps AllMemoryBarrier*() or DeviceMemoryBarrier() changes behaviour depending on the globallycoherent modifier? Who can tell?

There's only one way to figure this HLSL synchronisation thing out - just try it out :D Bad luck I don't have time for this right now :-(
Okay, I just looked over the docs once more, and here is what I found out (or rather, my interpretation).

DeviceMemoryBarrier is the only barrier available in pixel shaders. (Makes sense, since we don't have group shared memory there.)
It synchronizes all device memory accesses of the pixel shader threads inside a group. (Internally, pixel shader threads are divided into groups as well; the rasterizer decides on their dimensions.) I guess globallycoherent makes writes visible across different groups, too - at least across the groups that are currently running. Not all pixel shader threads are necessarily executed at the same time, since the number of shader units is limited, and I find it rather hard to believe that Direct3D would push the states of all running groups onto a stack and then start the next groups up to the barrier. That would consume far too much memory. In my experience GPGPU languages don't have that feature, for exactly that reason. (At least I'm not aware of any that do.)

Am I wrong with my interpretation of the globallycoherent modifier?

However, synchronizing every thread won't help if two threads want to write to the same position in a buffer...
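For reference, the declaration syntax at least is unambiguous; whether the coherence semantics match the interpretation above is exactly the open question. A minimal pixel-shader sketch (buffer name and UAV slot are assumptions):

[code]globallycoherent RWByteAddressBuffer Accum : register( u1 );

float4 PSMain(float4 pos : SV_Position) : SV_Target
{
    Accum.InterlockedAdd(0, 1); // UAV access declared device-wide coherent
    DeviceMemoryBarrier();      // the only barrier callable from a pixel shader
    return float4(0, 0, 0, 0);
}[/code]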
Alright… it has been a while. :)

After scribbling on a sheet of paper for quite some time, I eventually found another approach that works around the missing float InterlockedAdd. I had adapted a workaround I found in the Nvidia forums months ago, but since I hadn't given it fair testing and a comparison against some ground truth, I decided to wait before putting my solution out here. It was good that I waited, because it turned out to be buggy. :)

The funny thing is… it seems to be a compiler bug.
When I use a while loop it works, and when I use a do-while loop it doesn't. Very strange, but perhaps it will be of help to someone in the future. :)

Here is the code that worked for me:
[CODE]
RWByteAddressBuffer Accum : register( u0 );

void interlockedAddFloat(uint addr, float value) // Works perfectly!
{
    uint i_val = asuint(value); // speculative new content, assuming the cell holds tmp0
    uint tmp0  = 0;             // expected current content (start by assuming zero)
    uint tmp1;                  // content actually observed by the compare-exchange

    [allow_uav_condition] while (true)
    {
        // Atomically: if Accum[addr] == tmp0, replace it with i_val; tmp1 receives the old value.
        Accum.InterlockedCompareExchange(addr, tmp0, i_val, tmp1);
        if (tmp1 == tmp0) // our assumption held, so the add is committed
            break;
        tmp0  = tmp1;     // otherwise retry against the value we actually saw
        i_val = asuint(value + asfloat(tmp1));
    }
}
[/CODE]
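For readers who want to drop this in: a hypothetical call site, splatting into a width x height float buffer bound as a raw UAV (the 2D-to-byte-address mapping is an assumption for illustration):

[CODE]
// Hypothetical helper: the accumulation buffer holds width*height floats, 4 bytes each.
void splat(uint2 target, uint width, float contribution)
{
    uint byteAddr = (target.y * width + target.x) * 4;
    interlockedAddFloat(byteAddr, contribution);
}
[/CODE]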

And here (just for the curious reader :)) is the one that didn't work (the only difference is the loop). By "not working" I mean that values were added too often (the image got too bright).
[CODE]
void interlockedAddFloat_Test(uint addr, float value) // Does not work and is slower.
{
    uint i_val = asuint(value);
    uint tmp0  = 0;
    uint tmp1;

    [allow_uav_condition] do
    {
        Accum.InterlockedCompareExchange(addr, tmp0, i_val, tmp1);
        if (tmp1 == tmp0)
            break;
        tmp0  = tmp1;
        i_val = asuint(value + asfloat(tmp1));
    } while (true);
}
[/CODE]

I have to admit that the code above is nested in another loop with an unpredictable end, involving three conditional continues. :)

Cheers!

Hi! Thanks a lot! This, along with some NVIDIA slides, led me to what I believe you were attempting as well.

Modifying your solution, this is the most elegant version for what I believe most people stumbling upon this post will be looking for:

RWByteAddressBuffer Accum : register( u0 );

void interlockedAddFloat(uint addr, float value)
{
    uint comp, orig = Accum.Load(addr); // start from the value actually stored
    [allow_uav_condition] do
    {
        // Try to swap in (stored + value); on failure, orig receives the newer value and we retry.
        Accum.InterlockedCompareExchange(addr, comp = orig, asuint(asfloat(orig) + value), orig);
    } while (orig != comp);
}


