ClearUnorderedAccessViewUint is slower than a handcrafted shader, and does it work for all formats?

3 comments, last by MJP 7 years, 5 months ago

Hey Guys,

I have looked through the MSDN page for this function and replaced my reset compute shader with it. It did its job and simplified my code by a few lines. However, clearing a 192^3 R8 volume takes 0.08 ms with it, while it only takes 0.05 ms with my own compute shader.
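For reference, the two versions I'm comparing look roughly like this (just a sketch; the variable names, the R8_UNORM assumption and the 8x8x8 thread-group size are for illustration only, and the UAV/shader are assumed to be created elsewhere):

#include <d3d11.h>

// Assumed to exist already: device context and a UAV over the 192^3 R8 volume.
ID3D11DeviceContext*       ctx       = /* ... */;
ID3D11UnorderedAccessView* volumeUAV = /* ... */;

// 1) The built-in clear: one call, the driver decides how to implement it.
const UINT clearValues[4] = { 0, 0, 0, 0 };
ctx->ClearUnorderedAccessViewUint(volumeUAV, clearValues);

// 2) The handcrafted clear, roughly what my "reset" compute shader does.
//    HLSL (compiled separately into clearCS):
//
//    RWTexture3D<float> gVolume : register(u0);
//    [numthreads(8, 8, 8)]
//    void CSMain(uint3 id : SV_DispatchThreadID)
//    {
//        gVolume[id] = 0.0f;
//    }
//
ID3D11ComputeShader* clearCS = /* ... */;
ctx->CSSetShader(clearCS, nullptr, 0);
ctx->CSSetUnorderedAccessViews(0, 1, &volumeUAV, nullptr);
ctx->Dispatch(192 / 8, 192 / 8, 192 / 8);   // 24 x 24 x 24 groups cover the volume

// Unbind the UAV afterwards so the resource can be read elsewhere.
ID3D11UnorderedAccessView* nullUAV = nullptr;
ctx->CSSetUnorderedAccessViews(0, 1, &nullUAV, nullptr);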

So here are my questions: what's the advantage of using this function over your own compute shader? Also, it seems this function can handle every possible format with at most four 32-bit uint clear values. What will happen if I use four 65535s as the clear values to clear an R8 INT volume?

Thanks


The main benefit is (as you mentioned) simplification of the code.

This function wraps all the possible options (does the UAV reference a texture or a buffer?), all the possible formats, and some dark corners (for example, feature level 10.1).

About a clear value of 65535: I'm assuming the values will be clamped to 255, as clamping is the default GPU behaviour.
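If I remember the MSDN description correctly, ClearUnorderedAccessViewUint copies the lower bits of each value into each channel, so for an 8-bit channel 65535 (0xFFFF) should come out as 0xFF = 255 either way -- treat that as an assumption and check it against the docs and your hardware. A minimal sketch with placeholder names:

#include <d3d11.h>

ID3D11DeviceContext*       ctx         = /* ... */;
ID3D11UnorderedAccessView* r8VolumeUAV = /* ... */;  // hypothetical UAV over an R8 volume

// Four values larger than 8 bits passed to an 8-bit-per-channel format.
// If only the lower 8 bits per channel are used, each texel ends up as 0xFF (255).
const UINT clearValues[4] = { 65535, 65535, 65535, 65535 };
ctx->ClearUnorderedAccessViewUint(r8VolumeUAV, clearValues);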

Thanks

So I guess a handcrafted compute shader for copying resources around will also be faster than the CopyResource APIs. But it doesn't make sense to me why API calls would be slower than a handcrafted compute shader, since API calls should be crazily optimized.
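Just to be clear about what I'd be comparing (a sketch only, assuming matching 3D textures with UAV-compatible formats; all the resources and the compiled shader are placeholders):

#include <d3d11.h>

ID3D11DeviceContext*       ctx        = /* ... */;
ID3D11Resource*            srcTexture = /* ... */;
ID3D11Resource*            dstTexture = /* ... */;
ID3D11ShaderResourceView*  srcSRV     = /* ... */;
ID3D11UnorderedAccessView* dstUAV     = /* ... */;
ID3D11ComputeShader*       copyCS     = /* ... */;
UINT width = /* ... */, height = /* ... */, depth = /* ... */;

// a) The API copy: source and destination must have identical dimensions,
//    mip counts, and compatible formats.
ctx->CopyResource(dstTexture, srcTexture);

// b) A handcrafted copy via compute shader (HLSL, compiled into copyCS):
//
//    Texture3D<float>   gSrc : register(t0);
//    RWTexture3D<float> gDst : register(u0);
//    [numthreads(8, 8, 8)]
//    void CSMain(uint3 id : SV_DispatchThreadID)
//    {
//        gDst[id] = gSrc[id];
//    }
//
ctx->CSSetShader(copyCS, nullptr, 0);
ctx->CSSetShaderResources(0, 1, &srcSRV);
ctx->CSSetUnorderedAccessViews(0, 1, &dstUAV, nullptr);
ctx->Dispatch(width / 8, height / 8, depth / 8);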

There are a few spots where APIs have built-in features that don't need to be built in, as they're possible by using other API features. One example is mipmap generation -- many APIs offer a single function call for this, but you can also do it yourself by rendering to each mipmap level. Usually this stuff should belong in a utility library instead of the core API.
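Roughly, the two options look like this in D3D11 (a sketch only; rtvForMip/srvForMip and the fullscreen downsample draw are hypothetical helpers you'd implement yourself, and the built-in path assumes the texture was created with D3D11_RESOURCE_MISC_GENERATE_MIPS plus render-target and shader-resource binding):

#include <d3d11.h>

ID3D11DeviceContext*      ctx        = /* ... */;
ID3D11ShaderResourceView* textureSRV = /* ... */;

// a) The built-in helper: one call, driver decides how.
ctx->GenerateMips(textureSRV);

// b) Doing it yourself: render into each mip in turn, sampling the previous level.
UINT mipCount = /* ... */;
ID3D11RenderTargetView*   rtvForMip[16] = { /* RTV over each mip level */ };
ID3D11ShaderResourceView* srvForMip[16] = { /* SRV restricted to each mip level */ };
for (UINT mip = 1; mip < mipCount; ++mip)
{
    ID3D11RenderTargetView*   dst = rtvForMip[mip];      // write level 'mip'
    ID3D11ShaderResourceView* src = srvForMip[mip - 1];  // read level 'mip - 1'
    ctx->OMSetRenderTargets(1, &dst, nullptr);
    ctx->PSSetShaderResources(0, 1, &src);
    DrawFullscreenDownsample(ctx);  // hypothetical: fullscreen triangle + downsampling pixel shader
}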

As for this particular case, it's pretty much just a utility function. I guess it's in there in case one of the GPU vendors is able to clear memory in a way other than writing to it with a CS... In my experience so far though, the generic way to clear a block of memory is to write to it using a CS, so this is probably what the driver is doing under the hood.

As for your timing difference -- 50µs vs 80µs -- it's pretty much the same. I don't really trust GPU timing measurements that are less than around a dozen microseconds :wink:

If that difference remains when scaling up -- e.g. when clearing a larger block of memory, one method takes 5ms whereas the other takes 8ms -- then I would certainly believe there is a performance difference of 8/5 = 1.6x... but it's also possible that there's simply a fixed 20µs of overhead, in which case the large-scale test would come out as 5ms vs 5.02ms (a 1.004x difference).

Benchmarking this stuff is also hard because the actual commands that the GPU has to execute include the dispatch/compute-shader execution, but also cache flushing, cache invalidation, and pipeline stalling. The cost of these extra operations can depend heavily on what kind of dispatch/draw command follows your shader.
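For what it's worth, these kinds of GPU timings are usually taken with timestamp queries bracketing the work -- something like this sketch (error handling kept minimal; the variables and the bracketed clear call are placeholders):

#include <d3d11.h>

ID3D11Device*              device    = /* ... */;
ID3D11DeviceContext*       ctx       = /* ... */;
ID3D11UnorderedAccessView* volumeUAV = /* ... */;
const UINT clearValues[4] = { 0, 0, 0, 0 };

// A disjoint query brackets two timestamp queries around the work being measured.
D3D11_QUERY_DESC disjointDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
D3D11_QUERY_DESC tsDesc       = { D3D11_QUERY_TIMESTAMP, 0 };
ID3D11Query *disjoint, *tsBegin, *tsEnd;
device->CreateQuery(&disjointDesc, &disjoint);
device->CreateQuery(&tsDesc, &tsBegin);
device->CreateQuery(&tsDesc, &tsEnd);

ctx->Begin(disjoint);
ctx->End(tsBegin);                                      // timestamp before the work
ctx->ClearUnorderedAccessViewUint(volumeUAV, clearValues);
ctx->End(tsEnd);                                        // timestamp after the work
ctx->End(disjoint);

// Later, once the GPU has caught up, read the results back:
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
UINT64 t0 = 0, t1 = 0;
while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) { /* spin or wait a frame */ }
ctx->GetData(tsBegin, &t0, sizeof(t0), 0);
ctx->GetData(tsEnd,   &t1, sizeof(t1), 0);
if (!dj.Disjoint)
{
    double ms = double(t1 - t0) / double(dj.Frequency) * 1000.0;
    // Note: this only measures the bracketed commands; the surrounding cache
    // flushes and stalls described above can still shift the numbers.
}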

Back to mipmap generation -- in theory that should be part of the API so that each vendor can implement it in the fastest way possible for their hardware... but in my experience it's also just there as a helper/utility function, and it's possible to implement your own version of it that is faster than the driver's.

It's possible that the driver is using a DMA unit to fill the buffer with the clear value. This is probably slower than using a compute shader, but it has the advantage of leaving the shader cores free to do other work in parallel.
