DirectX shader calculations


Hi everyone.

I am currently trying to re-implement a project I have created, moving it from the CPU side to the GPU side.
I am trying to use a DirectX 11 compute shader.

The problem:
I am trying to calculate the center of mass of a model using the vertex list.
Currently I have an input buffer that holds all the vertices, and an output buffer that should hold the center-of-mass position.

Currently I am using a test buffer that has 4 vertices (just for testing).

I dispatch one group in each direction: Dispatch(1, 1, 1).
In each group I have numthreads(4, 1, 1).

Basically I am trying to calculate the sum of all vertex positions and store it in a single float3 variable.

I am running into a problem: a race condition. As I understand it, this is because the code runs in parallel and multiple threads try to write into the same variable at the same time, which can cause miscalculations.

Thank you all!



Having numthreads be [numverts, 1, 1] is not going to be viable going forward, since the number of verts is unbounded. What you'll probably want to do is have a set number of threads (a multiple of 64, for example, so you fully utilize your wavefronts) and then split your verts into 64*n chunks, one for each thread.

Each thread can add up its vert positions, then write the result to a unique location in the group shared mem. Since each thread is writing to a separate location, there will be no race condition.

Then at the very end, you can wait on a synchronization barrier (GroupMemoryBarrierWithGroupSync) to make sure each thread has written its result. Then one of the threads can add up all the results that are stored in the shared mem and write the final output.
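Something along these lines, maybe; a rough, untested sketch of the idea. The buffer names, the vertexCount constant, and THREADS = 64 are just placeholders I've picked, and the final combining of the partial sums is left as the next step.

```hlsl
// Sketch: a fixed number of threads, each summing its own slice of the
// vertex list into its own groupshared slot. Dispatched as Dispatch(1, 1, 1).
#define THREADS 64

StructuredBuffer<float3>   VertexBuffer : register(t0); // all vertex positions
RWStructuredBuffer<float3> OutSum       : register(u0); // single float3 result

cbuffer Constants : register(b0)
{
    uint vertexCount;
};

groupshared float3 partialSum[THREADS];

[numthreads(THREADS, 1, 1)]
void CSMain(uint3 tid : SV_GroupThreadID)
{
    // Each thread walks every THREADS-th vertex, so nothing is counted twice.
    float3 sum = float3(0, 0, 0);
    for (uint i = tid.x; i < vertexCount; i += THREADS)
        sum += VertexBuffer[i];

    // One slot per thread, so there is no race when writing.
    partialSum[tid.x] = sum;

    // Make sure every thread's slot is written before any thread reads them.
    GroupMemoryBarrierWithGroupSync();

    // Next step: combine the THREADS partial sums into OutSum[0]
    // (e.g. have one thread add them up, as described above).
}
```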

Alternatively: you could use InterlockedAdd in your current compute shader instead of += (not for floats, though).

How many vertices do you expect in a real use case? Will there be multiple independent meshes for which you want to compute the centers?

What you tried to do only works with atomic operations, and even that should be avoided. What you need is a proper "reduction" algorithm. In my experience, butterfly networks work great on the GPU for these kinds of things. But the specifics depend heavily on how many vertices per model and how many models in parallel you want to process.

OK, inside a project I might have a few models, but on the compute shader the idea currently is to process one at a time. I do understand that I am limited in the number of threads per group and that I should split the work into multiple groups. Currently the project has only 1 model; the combination of groups and the number of threads in them is not the current issue, and I have some ideas about how that should be implemented.

I took a look at InterlockedAdd, and it does not support floats =/.

Can you please elaborate on "then write the result to a unique location in the group shared mem"? Sorry, I have never worked with shaders =/

Also, is there any difference (performance-wise) between using numthreads(1000, 1, 1) and numthreads(10, 10, 10), since they are all running in parallel anyway (AFAIK)?

Sorry if I am asking simple questions, just trying to get my head around this.


I took a look at InterlockedAdd, and it does not support floats =/.

Derp, my bad.


Can you please elaborate on "then write the result to a unique location in the group shared mem"? Sorry, I have never worked with shaders =/

In your code screenshot you have a line that says "groupshared unsigned int shared_data". That's shared memory for your group. What I was proposing is having something like "groupshared float accumulation[NUM_THREADS];" and then each thread would write its result into accumulation[threadID], which means no two threads would be writing to the same location in memory, thus no race conditions.

OK, I think I understand what is happening, but I'm still running into the main problem: how would I sum them all up? =/
My head is not thinking today... I tried "for looping" it with a condition, but it didn't work out.

Simplest way?
Check the thread ID; if the thread ID is 0, then loop over the results, add them up, and write them out.
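Picking up from the sketch earlier in the thread (same placeholder names: partialSum, THREADS, OutSum, vertexCount), the simple version could look something like this after the barrier:

```hlsl
    // Simplest combine: thread 0 walks the groupshared array, totals it,
    // and writes out the single result.
    if (tid.x == 0)
    {
        float3 total = float3(0, 0, 0);
        for (uint t = 0; t < THREADS; ++t)
            total += partialSum[t];

        OutSum[0] = total; // divide by vertexCount if you want the centre of mass
    }
```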

Probably the better-performing way: do a reduction in stages.
If you have, for example, 64 outputs from the first pass, then during the summing phase:
- Thread 0 adds result[0] + result[1]
- Thread 1 adds result[2] + result[3]
- Thread 2 adds result[4] + result[5]
and so on up to thread 31 (the rest do nothing).

Then repeat, each time halving the number of threads doing the summing; the last value left is your final result. A sketch of this is below.
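One common way to write that is sketched below, again using the placeholder names from the earlier sketch. Note that each thread adds the slot half a stride away rather than the adjacent pair; it halves the active threads per stage just the same, but avoids reading a slot another thread is writing in the same stage. THREADS needs to be a power of two here.

```hlsl
    // Staged reduction over the groupshared partial sums: each pass halves
    // the number of active threads until a single value remains.
    for (uint stride = THREADS / 2; stride > 0; stride /= 2)
    {
        if (tid.x < stride)
            partialSum[tid.x] += partialSum[tid.x + stride];

        // Every thread hits this barrier each pass, so the next stage only
        // starts once all the adds from this one are visible.
        GroupMemoryBarrierWithGroupSync();
    }

    if (tid.x == 0)
        OutSum[0] = partialSum[0]; // divide by vertexCount for the centre of mass
```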

