Texture3D strange behavior in volume raycasting

6 comments, last by Mr_Fox 7 years, 11 months ago

Hey Guys,

We all know that texture resources are normally swizzled in GPU memory for a better cache hit rate, since texture reads usually touch neighboring data. To see how much this affects performance, I wrote a volume rendering program to profile it.
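To make "swizzled" concrete, here is a minimal Python sketch that compares a plain row-major index with a 3D Morton (Z-order) index. Real GPU texture layouts are vendor-specific and not publicly documented, so Morton order is only a stand-in here; the function names and the 256-wide volume are assumptions made for illustration.

```python
def linear_index(x, y, z, width, height):
    """Row-major index, as an X-fastest buffer layout would use."""
    return x + width * (y + height * z)


def morton_index(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y and z (x in the lowest position)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


if __name__ == "__main__":
    # Hypothetical 256^3 volume: index distance of one step in +Z from the origin.
    w = h = 256
    print(linear_index(0, 0, 1, w, h) - linear_index(0, 0, 0, w, h))
    print(morton_index(0, 0, 1) - morton_index(0, 0, 0))
```

In this sketch, one step in +Z from the origin moves 65536 elements in the linear layout but only 4 in the Morton layout. Morton steps are not always that small (they grow when a step crosses a power-of-two boundary), but neighbors in all three axes stay close on average.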

The program is in DX12. I built the demo on top of the Microsoft MiniEngine and added a GPU profiler and ImGui. If you run it in a Profile or Debug build, you should see a GPU perf graph at the top of the window that times each GPU task. The scene is simple: a dynamic volumetric cube is updated and rendered while it rotates constantly. Every frame, the cube's content is updated by a compute shader and then rendered with a raycasting algorithm running in a pixel shader. The GUI lets you switch between a TypedBuffer volume, a StructuredBuffer volume and a Texture3D volume, each of which effectively stores four 32-bit values per voxel. For details, you can look through the code:

https://github.com/pengliu916/Texture3D_StructuredBuffer.git

The result I got is strange:

When the virtual camera is far enough away, everything works as expected: the StructuredBuffer/TypedBuffer volume render time varies hugely as the cube rotates (since the volume data is not swizzled, the cache hit rate may be very bad from some viewpoints), while the Texture3D volume render time is almost constant and always lower than that of StructuredBuffer/TypedBuffer.

The weird thing is that when the virtual camera gets close to the volume, the StructuredBuffer/TypedBuffer volume render time still varies and the Texture3D volume render time is still very stable, but it gets much slower as the camera gets closer. Please see this video for the result:

I can't understand why the Texture3D volume render time becomes much slower than StructuredBuffer/TypedBuffer when the camera gets close. I would greatly appreciate it if someone could run this test to confirm that this is not related to my hardware, and offer some insight into why it happens.

Thanks

Peng


Anyone, any ideas, guesses?

I ran your demo and Texture3D was fastest at all zoom levels on NVIDIA and AMD hardware. On Intel, Texture3D was faster when zoomed out, StructuredBuffer was faster when zoomed in.

You didn't mention what hardware you are running on.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Thanks Adam for running the test; it's good to know this may be related to specific hardware. FYI, my test machine has two NVIDIA 680M GPUs (SLI is disabled) with the latest driver (I can't check the driver version right now, but I updated the driver last night right before my test run). I believe I also have the latest released Windows version.

So any idea why on some hardware the Texture3D performance is so strangely tied to the zoom level?

One other question I would like to ask: in this volume update/render case, what are the pros and cons of using Texture3D instead of a linear-layout buffer if a built-in typed format is all we need? On my machine, I noticed the update compute shader runs slightly slower on Texture3D (though it may not be noticeable in your case).

Thanks

It certainly looks a lot like a hardware-specific bottleneck you're hitting, probably one specific to the particular model of NVIDIA GPU you have. 128-bit textures are not a particularly common thing to use, and sampling them often comes with multiplying penalties relative to smaller formats (e.g. if 32-bit is full speed, then 64-bit is half speed and 128-bit is quarter speed). I don't know a lot about Kepler, but it may well have these sorts of penalties for higher bit-depth textures that don't apply to buffer loads.
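As a rough illustration of that rule of thumb, the Python sketch below models the relative fetch rate as dropping in proportion to texel size once the texel is wider than a "full rate" width. The 32-bit threshold comes from the example figures above; the function name and the model itself are assumptions, not documented behaviour of any particular GPU.

```python
def relative_fetch_rate(bits_per_texel, full_rate_bits=32):
    """Fraction of peak fetch rate under the rule of thumb quoted above.

    Texels up to `full_rate_bits` wide run at full rate; wider texels
    are assumed to slow down in proportion to their size.
    """
    if bits_per_texel <= full_rate_bits:
        return 1.0
    return full_rate_bits / bits_per_texel


if __name__ == "__main__":
    for bits in (32, 64, 128):
        print(bits, relative_fetch_rate(bits))
```

Under this toy model, a four-channel 32-bit-per-channel texel (128 bits) fetches at a quarter of the peak rate, and a 16-bit-per-channel version (64 bits) at half.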

You may well be in a situation where cache coherency is the dominant factor when the camera is zoomed out, and thus swizzled/tiled texture data wins out there. But as you zoom in to only a portion of the cube and it covers more pixels on screen, that's less of an issue and "fetch cost" begins to dominate.

Have you tried a 64-bit format instead to see what difference that makes (if any)?

For what you're doing, I would always favour working with a Texture3D, even in light of the performance results on Intel hardware. As you pointed out, marching across a StructuredBuffer is only going to be cache coherent along the primary axis (X, probably), along Y and Z you're jumping one row or slice of data with each step. You can see the cost see-saw up and down as the cube rotates with the 'Buffer' methods in a way that it doesn't with Texture3D.
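The stride argument can be sketched in a few lines of Python. This assumes a hypothetical 256^3 volume, 16 bytes per voxel (four 32-bit values, as in the demo) and a 64-byte cache line; the real volume size and cache line size may differ, and all names here are made up for illustration.

```python
VOXEL_BYTES = 16       # four 32-bit values per voxel
CACHE_LINE_BYTES = 64  # assumed cache line size


def linear_byte_offset(x, y, z, width, height):
    """Byte offset of a voxel in an X-fastest (row-major) buffer."""
    return (x + width * (y + height * z)) * VOXEL_BYTES


def lines_touched(axis, steps, width, height):
    """Count distinct cache lines hit when marching `steps` voxels along one axis."""
    lines = set()
    for i in range(steps):
        x, y, z = (i if axis == a else 0 for a in "xyz")
        lines.add(linear_byte_offset(x, y, z, width, height) // CACHE_LINE_BYTES)
    return len(lines)


if __name__ == "__main__":
    w = h = 256  # hypothetical volume dimensions
    for axis in "xyz":
        print(axis, lines_touched(axis, 64, w, h))
```

With these assumptions, a 64-voxel march along X touches 16 cache lines, while the same march along Y or Z touches 64, because each step jumps 4096 bytes (one row) or 1 MiB (one slice) respectively.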

As for populating the Volume texture, I see zero difference on NVIDIA. On AMD and Intel, Texture3D population is marginally faster than the two Buffer methods.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Thanks Adam, I will try 64-bit format later today.

Just one question (correct me if I am wrong): for a StructuredBuffer/TypedBuffer load, from the PS point of view the GPU loads only one voxel per PS thread, while a Texture3D load fetches 4 (or more) voxels even if I specifically use the Load function (not a sampler), right? Is that the 'fetch cost' you mentioned when the camera is zoomed in, which makes Texture3D slower since it fetches more data than StructuredBuffer?

Also, it seems that if your data fits one of the built-in typed formats, a texture should be preferred over a buffer, right?

Thanks

No, not really.

The Load or [] operator in HLSL is asking only for a single texel from the 3D Texture. GPUs still have cache lines like the CPU does, so it's unlikely to be able to only fetch 16 bytes from memory, it'll have to pull in the whole cache line regardless, even if the other 48 bytes aren't necessary (assuming a 64 byte cache line). That's fine if you go on to use the other 48 bytes in future fetches, but a complete waste if you don't. But that applies to Buffers as well as Textures, so it's not really what I meant.
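The arithmetic behind that example, as a small Python sketch. The 64-byte line is the same assumption as in the paragraph above, and the function name is hypothetical.

```python
def cache_line_waste(texel_bytes, texels_used, line_bytes=64):
    """Bytes of a fetched cache line that are never read.

    `texels_used` is how many texels from that line the shader
    actually consumes before the line is evicted.
    """
    texels_per_line = line_bytes // texel_bytes
    used_bytes = min(texels_used, texels_per_line) * texel_bytes
    return line_bytes - used_bytes


if __name__ == "__main__":
    print(cache_line_waste(16, 1))  # one 16-byte texel used from the line
    print(cache_line_waste(16, 4))  # all four texels in the line used
```

With 16-byte texels, using a single texel from a line leaves 48 bytes unused, and using all four leaves none. The same calculation applies whether the line came from a buffer or a texture.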

The mechanics vary from vendor to vendor and architecture to architecture, but generally a given GPU will have a fixed number of memory operations (aka fetches) it can perform per second, and often this is reduced by 1/2, 1/4 or even more if the format is larger than some given size. Even if your Texture were 1x1x1 and the texel immediately ended up in the fastest cache the GPU has (eg L1), you're still requesting that texel be fetched potentially millions of times. Generally that penalty is more often associated with using Texture Sampling (.Sample) than Load operations, but there's really not enough vendor-specific information out there to be sure one way or the other.
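To put a number on "millions of times", here is a back-of-the-envelope Python sketch. The resolution, step count and fetch rates are all made-up inputs, not measurements from the demo or from any GPU; only the multiplication is the point.

```python
def fetches_per_frame(covered_pixels, steps_per_ray):
    """Total texel fetches if every covered pixel marches the same number of steps."""
    return covered_pixels * steps_per_ray


def min_fetch_time_ms(fetches, fetches_per_second):
    """Lower bound on frame time if fetch rate is the only limit."""
    return 1000.0 * fetches / fetches_per_second


if __name__ == "__main__":
    # Hypothetical: volume fills a 1280x720 window, 100 steps per ray.
    n = fetches_per_frame(1280 * 720, 100)
    print(n)
    # Hypothetical peak rate of 10 billion fetches/s, then a quarter of that.
    print(min_fetch_time_ms(n, 10e9))
    print(min_fetch_time_ms(n, 2.5e9))
```

The fetch count scales with the number of pixels the volume covers, which is why this cost grows as the camera zooms in, and any per-fetch rate penalty multiplies the whole total regardless of how well the data caches.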

If your data structure is logically 2D or 3D, I'd use 2D or 3D textures, yes. Texture1D I don't use very often because it's very similar to a Buffer (except that it supports filtering). That's not to say there aren't particular scenarios where that advice might not be true, but as a rule of thumb it works.

Adam Miles - Principal Software Development Engineer - Microsoft Xbox Advanced Technology Group

Hi Adam,

I have tried the 16-bit-per-channel version. The Texture3D version runs almost twice as fast as the TypedBuffer version at the default distance, but as I move all the way in, they run at almost the same speed. So, as you said, I am definitely hitting some hardware-specific bottleneck (when zoomed in, switching between Load and the more expensive anisotropic sampler makes no difference for Texture3D). But what still confuses me is why Texture3D is more sensitive to zooming in than TypedBuffer: as you said, the number of fetch ops should be the same for both, so why does Texture3D perf drop more than TypedBuffer perf?

Sorry for being annoying; I am just a grad student interested in DirectX who wants to know what happens under the hood.

Thanks

Peng

