#Actualbackstep

Posted 14 October 2013 - 04:17 PM

If you've called Dispatch before mapping the constant buffer, then I believe it won't get overwritten. Depending on the type of constant buffer (usage and CPU access flags) and how you're mapping it, the runtime will either stall the pipeline or use an alias of the buffer for the map/unmap. That is, it'll either wait to update the buffer until it's no longer in use, or, if you're mapping with discard, it'll just write to an alias of it without making your context's CPU thread wait. That's my understanding at least.

 

So I don't think you need to worry about serial vs parallel execution of your dispatches causing errors. I think the main problem is that you'll be constantly underutilizing the GPU, since only 64 threads will be active at once (out of hundreds available). Another possible issue is constant waterfalling, which occurs when each thread accesses a unique value from the constant buffer: each thread then blocks the others while it reads its value. From your example of Dispatch(10, 1, 1) with numthreads(64, 1, 1) requiring a 100KB constant buffer, it sounds like each thread accesses a different part of the constant buffer. That would mean that at the point in your shader where you access the thread-unique constant buffer values, your utilization would drop to a single thread out of hundreds.
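To put rough numbers on that, here is a small CPU-side C++ sketch (not GPU code). It is only a toy model: it assumes a hypothetical 64-wide wave and treats a waterfalled constant read as costing one step per distinct index the wave touches. Real hardware differs by vendor, and all the function names are made up for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Total threads launched by Dispatch(groups, 1, 1) with numthreads(threadsPerGroup, 1, 1).
constexpr std::size_t totalThreads(std::size_t groups, std::size_t threadsPerGroup) {
    return groups * threadsPerGroup;
}

// Average slice of the buffer each thread would own if the data is split evenly.
constexpr std::size_t bytesPerThread(std::size_t bufferBytes, std::size_t threads) {
    return bufferBytes / threads;
}

// Hypothetical helper: the constant-buffer index each thread of one wave reads.
// uniquePerThread == true  -> every thread reads its own element
// uniquePerThread == false -> every thread reads element 0
std::vector<std::size_t> waveIndices(std::size_t waveSize, bool uniquePerThread) {
    std::vector<std::size_t> indices(waveSize);
    for (std::size_t i = 0; i < waveSize; ++i) {
        indices[i] = uniquePerThread ? i : 0;
    }
    return indices;
}

// Toy cost model (an assumption, not a hardware spec): the read is
// serialized once per distinct index requested within the wave.
std::size_t serializedSteps(const std::vector<std::size_t>& indicesInWave) {
    std::set<std::size_t> distinct(indicesInWave.begin(), indicesInWave.end());
    return distinct.size();
}
```

With the numbers from the question, totalThreads(10, 64) gives 640 threads and bytesPerThread(100 * 1024, 640) gives 160 bytes each. Under this model a wave where every thread reads its own index takes 64 serialized steps, while a wave where all threads read the same index takes 1.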

 

I am still a beginner like you, but a better approach might be to use an SRV of a structured buffer instead of a constant buffer. That would allow you to update with a single map (write_discard to prevent a stall), then use a single dispatch with multiple thread groups when required, and also avoid the constant waterfalling performance problem.
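As a rough illustration of the single-dispatch idea, here is a CPU-side C++ sketch of the indexing involved. It assumes a 1D dispatch with numthreads(64, 1, 1) and a made-up 16-byte element type; PerThreadData, dispatchThreadId and groupsFor are hypothetical names, not part of any API. The index formula mirrors how SV_DispatchThreadID is derived from SV_GroupID and SV_GroupThreadID.

```cpp
#include <cstddef>

// Hypothetical element layout for the structured buffer (16 bytes).
struct PerThreadData {
    float values[4];
};

constexpr std::size_t kThreadsPerGroup = 64; // matches numthreads(64, 1, 1)

// Element a given thread would read: group index times group size plus
// the thread's index within the group (what SV_DispatchThreadID.x holds
// for a 1D dispatch).
constexpr std::size_t dispatchThreadId(std::size_t groupId, std::size_t groupThreadId) {
    return groupId * kThreadsPerGroup + groupThreadId;
}

// Number of thread groups needed to cover elementCount elements, rounded up.
constexpr std::size_t groupsFor(std::size_t elementCount) {
    return (elementCount + kThreadsPerGroup - 1) / kThreadsPerGroup;
}

// Bytes to write in the single map/unmap for elementCount elements.
constexpr std::size_t uploadBytes(std::size_t elementCount) {
    return elementCount * sizeof(PerThreadData);
}
```

So for 640 elements you would fill the buffer once and issue one dispatch of groupsFor(640) groups, with each thread reading the element at its own dispatch thread ID. If the element count isn't a multiple of 64, the shader would need a bounds check for the extra threads in the last group.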

 

Oh also if you don't need to update the whole 100KB every frame, you could simply make two structured buffers - a smaller one for typical use cases, and a 100KB one for your worst case scenario.  Then simply map/update/unmap/bind whichever one's SRV is required for that frame.  It's not quite dynamic allocation but it'd allow you the benefit of the smaller buffer 99% of the time.
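The per-frame choice between the two buffers could look something like this CPU-side C++ sketch. The 16KB size for the small buffer is purely an assumption for illustration (only the 100KB worst case comes from the question), and BufferChoice and chooseBuffer are hypothetical names.

```cpp
#include <cstddef>

enum class BufferChoice { Small, Large, TooBig };

constexpr std::size_t kSmallCapacityBytes = 16 * 1024;  // assumed typical-case size
constexpr std::size_t kLargeCapacityBytes = 100 * 1024; // worst case from the question

// Pick the smallest buffer that fits this frame's data.
inline BufferChoice chooseBuffer(std::size_t bytesNeeded) {
    if (bytesNeeded <= kSmallCapacityBytes) {
        return BufferChoice::Small;
    }
    if (bytesNeeded <= kLargeCapacityBytes) {
        return BufferChoice::Large;
    }
    // Data exceeds even the worst-case buffer; the caller has to handle this.
    return BufferChoice::TooBig;
}
```

Each frame you'd call chooseBuffer with the amount of data you have, then map/update/unmap the matching buffer and bind its SRV.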

 

Also, from what I can tell from a quick Google search, DirectCompute only supports a single dispatch being processed on the GPU at once (slide 43 in http://www.slideshare.net/NVIDIA/1015-gtc09-5013444 ). CUDA appears to have some kernel concurrency support, but that's something else entirely.


#2backstep

Posted 14 October 2013 - 03:39 PM

If you've called Dispatch before mapping the constant buffer, then I believe it won't get overwritten. Depending on the type of constant buffer (usage and CPU access flags) and how you're mapping it, the runtime will either stall the pipeline or use an alias of the buffer for the map/unmap. That is, it'll either wait to update the buffer until it's no longer in use, or, if you're mapping with discard, it'll just write to an alias of it without making your context's CPU thread wait. That's my understanding at least.

 

So I don't think you need to worry about serial vs parallel execution of your dispatches causing errors. I think the main problem is that you'll be constantly underutilizing the GPU, since only 64 threads will be active at once (out of hundreds available). Another possible issue is constant waterfalling, which occurs when each thread accesses a unique value from the constant buffer: each thread then blocks the others while it reads its value. From your example of Dispatch(10, 1, 1) with numthreads(64, 1, 1) requiring a 100KB constant buffer, it sounds like each thread accesses a different part of the constant buffer. That would mean that at the point in your shader where you access the thread-unique constant buffer values, your utilization would drop to a single thread out of hundreds.

 

I am still a beginner like you, but a better approach might be to use a UAV of a structured buffer instead of a constant buffer.  That would allow you to update with a single map (write_discard to prevent a stall), then use a single dispatch with multiple threadgroups when required, and also avoid constant waterfalling performance problems.

 

Oh also if you don't need to update the whole 100KB every frame, you could simply make two structured buffers - a smaller one for typical use cases, and a 100KB one for your worst case scenario.  Then simply map/update/unmap/bind whichever UAV is required for that frame.  It's not quite dynamic allocation but it'd allow you the benefit of the smaller buffer 99% of the time.

 

Also, from what I can tell from a quick Google search, DirectCompute only supports a single dispatch being processed on the GPU at once (slide 43 in http://www.slideshare.net/NVIDIA/1015-gtc09-5013444 ). CUDA appears to have some kernel concurrency support, but that's something else entirely.


#1backstep

Posted 14 October 2013 - 03:31 PM

If you've called Dispatch before mapping the constant buffer, then I believe it won't get overwritten. Depending on the type of constant buffer (usage and CPU access flags) and how you're mapping it, the runtime will either stall the pipeline or use an alias of the buffer for the map/unmap. That is, it'll either wait to update the buffer until it's no longer in use, or, if you're mapping with discard, it'll just write to an alias of it without making your context's CPU thread wait. That's my understanding at least.

 

So I don't think you need to worry about serial vs parallel execution of your dispatches causing errors. I think the main problem is that you'll be constantly underutilizing the GPU, since only 64 threads will be active at once (out of hundreds available). Another possible issue is constant waterfalling, which occurs when each thread accesses a unique value from the constant buffer: each thread then blocks the others while it reads its value. From your example of Dispatch(10, 1, 1) with numthreads(64, 1, 1) requiring a 100KB constant buffer, it sounds like each thread accesses a different part of the constant buffer. That would mean that at the point in your shader where you access the thread-unique constant buffer values, your utilization would drop to a single thread out of hundreds.

 

I am still a beginner like you, but a better approach might be to use a UAV of a structured buffer instead of a constant buffer.  That would allow you to update with a single map (write_discard to prevent a stall), then use a single dispatch with multiple threadgroups when required, and also avoid constant waterfalling performance problems.  

