Sign in to follow this  
forsandifs

Compute Shader read write access [problem restated]

Recommended Posts

forsandifs    154
----
RESTATEMENT OF THE PROBLEM
----

I need to atomically perform the following two lines of code.

[code]
ObjectsInfoMap[ Objects[ObjectIndex].FirstInfoSlotIndex + ObjectsInfoMapCounter[ObjectIndex] ] = Info;
ObjectsInfoMapCounter[ObjectIndex]++;
[/code]

Where the resources being written to are UAVs. If the first line is performed by two threads before either one has had time to perform the second line then my results will be incorrect. Essentially any time a thread performs the first line it has to perform the second line before another thread perfroms the first line.

Any help would be very much appreciated.



----
OLD STATEMENT OF THE PROBLEM, FEEL FREE TO IGNORE
----

I have the following compute shader HLSL (pseudo)code:

[code]
RWStructuredBuffer<InfoStruct> ObjectsInfoMap : register(u0);
RWBuffer<uint> ObjectsInfoMapCounter : register(u1);

void StoreInfo(uint ObjectIndex, InfoStruct Info)
{
ObjectsInfoMap[ Objects[ObjectIndex].FirstInfoSlotIndex + ObjectsInfoMapCounter[ObjectIndex] ] = Info;
ObjectsInfoMapCounter[ObjectIndex]++;
}
[/code]

Where StoreInfo() is called many times for every possible value of ObjectIndex during the execution of the shader. (I suspect the above is not valid due to a race condition).

The are a couple of problems with this which I will enumerate as follows.

1. The above does not do what I intend it to do. Testing reveals that ObjectsInfoMapCounter[ObjectIndex] (where the array is filled with zeroes before the shader is executed) is always equal to 1 for any value of ObjectIndex, where I want it to be many times higher than 1. This is perhaps to be expected due to a race condition. But strangely the compiler does not give a race condition warning as I would expect it to do.

2. Using InterlockedAdd(ObjectsInfoMapCounter[ObjectIndex], 1) instead of ObjectsInfoMapCounter[ObjectIndex]++ doesn't work as intended either. The values in the UAVs are no longer constant for every shader call as they should be. I think this is due to sometimes many threads writing to the resource before the InterlockedAdd is called and thus writing to the wrong slots in ObjectsInfoMap.

I think this could be solved if there was a way to lock access across all thread groups to the ObjectsInfoMap and ObjectsInfoMapCounter UAVs when StoreInfo is called and unlocking access to those UAVs just before the function exits but I couldn't find a way to do that.

Is there a way to get this function to work as intended?

Share this post


Link to post
Share on other sites
MJP    19788
If you want a write to a resource to be visible to other threads, you need to use a DeviceMemoryBarrier or DeviceMemoryBarrierWithGroupSync. Note that this only flushes across a thread group. If you need those writes to be synchronized/flushed across all thread groups, then you need to prefix the resource with "globallycoherent". Of course this cansignificantly affect your performance. I would suggest that, if possible, you rework your implementation so that you only need to share writes among thread groups (preferably in shared memory rather than in device memory resources). Another possibility is splitting your algorithm into several dispatches, using each dispatch as an implicit synchronization.

Share this post


Link to post
Share on other sites
forsandifs    154
Thank you very much for your reply.

Normally I launch shaders with 32 * 32 * 1 threads per group and then X/32 * Y/32 * 1 groups, where X * Y represents the total number of threads I want.

Would you recommend that instead I dispatch serially in groups of size 32 * 32 * 1 and 1 * 1 * 1 groups at a time? I can see the sense in this when X and Y are not large compared to 32, but what about when X * Y is something like 512 x 512? That would require dispatching over a for loop of size 512/32 * 512/32 = 256?

Share this post


Link to post
Share on other sites
MJP    19788
No you'll want to dispatch multiple groups simultaneously, otherwise most of the GPU's multiprocessors will Idle. I was just suggesting that you try to make sure that threads only read data written to by other threads in the same thread group, but that may not be possible in your case depending on what you're doing.

Share this post


Link to post
Share on other sites
forsandifs    154
Thanks for your reply.

However I've been trying to implement this without success. The results don't seem to be any different.

Here's what I've done:

[code]
globallycoherent RWStructuredBuffer<InfoStruct> ObjectsInfoMap : register(u0);
globallycoherent RWBuffer<uint> ObjectsInfoMapCounters : register(u1);

void StoreInfo(InfoStruct Info, uint ObjectIndex)
{
DeviceMemoryBarrier();

ObjectsInfoMap[ Objects[ObjectIndex].FirstInfoSlotIndex + ObjectsInfoMapCounters[ObjectIndex] ] = Info;
InterlockedAdd(ObjectsInfoMapCounters[ObjectIndex],1); //Though given that DeviceMemoryBarrier is meant to stop all threads until all memory accesses has been completed I guess a ObjectsInfoMapCounters[ObjectIndex]++ would do for this line?
}
[/code]

Am I doing something wrong? BTW, could someone link the documentation for "globallycoherent" please? I couldn't find it.

Share this post


Link to post
Share on other sites
forsandifs    154
I'll try to rephrase the problem.

I need to atomically perform the following two lines of code.

[code]
ObjectsInfoMap[ Objects[ObjectIndex].FirstInfoSlotIndex + ObjectsInfoMapCounter[ObjectIndex] ] = Info;
ObjectsInfoMapCounter[ObjectIndex]++;
[/code]

If the first line is performed by two threads before either one has had time to perform the second line then my results will be incorrect. Essentially any time a thread performs the first line it has to perform the second line before another thread perfroms the first line.

Any help would be very much appreciated.

Share this post


Link to post
Share on other sites
_the_phantom_    11250
The problem is that you are doing the order of operations in the wrong order.

You are doing;

- calculate address
- write value to it
- increment offset

and at any point in that sequence things can Go Wrong(tm).

Instead what you want to do is move the first operation last but with a twist the key is the optional last parameter for InterlockedAdd() which returns the old value.

So, your code becomes;


[source]
float offset = 0.0;
InterlockedAdd(ObjectsInfoMapCounter[ObjectIndex], 1, offset); // offset will contain the last value, the objectinfomapcounter for that objectindex will hold the incremented value
ObjectsInfoMap[ Objects[ObjectIndex].FirstInfoSlotIndex + offset ] = Info;
[/source]

This should solve the problem.

Share this post


Link to post
Share on other sites
forsandifs    154
That's brilliant! Thank you very much! :) :)

EDIT: though I think the offset would have to be an int or uint, since IIRC InterlockedAdd only works for uint or int types.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this