
DX12 Texture3D strange behavior in volume raycasting


Recommended Posts

Hey Guys,

 

We all know that texture resources are normally swizzled in GPU memory for a better cache hit rate, since texture reads usually also touch neighboring data. To see how much this impacts performance, I wrote a volume rendering program to profile it.
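For intuition, here is a small sketch of the idea (illustration only: real GPU tiling/swizzle patterns are vendor-specific and undocumented; Morton/Z-order interleaving is just a common way to picture how neighboring texels in x, y and z can end up close together in memory):

uint Part1By2(uint v)
{
    // Spread the low 10 bits of v so there are two zero bits between each bit.
    v &= 0x000003FF;
    v = (v | (v << 16)) & 0xFF0000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

// Texels at (x,y,z), (x+1,y,z), (x,y+1,z) and (x,y,z+1) get nearby Morton addresses,
// whereas in a linear buffer a +1 step in z jumps a whole slice away.
uint MortonEncode3(uint3 p)
{
    return Part1By2(p.x) | (Part1By2(p.y) << 1) | (Part1By2(p.z) << 2);
}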

 

The program is in DX12; I built the demo on top of the Microsoft MiniEngine and added a GPU profiler and ImGui. If you run it in a Profile or Debug build you should see a GPU perf graph at the top of the window that times each GPU task. The scene is simple: a dynamic volumetric cube is updated and rendered while it rotates constantly. Every frame the cube's contents are updated by a compute shader and then rendered with a raycasting algorithm in the pixel shader. The GUI lets you switch between a TypedBuffer volume, a StructuredBuffer volume and a Texture3D volume, each effectively holding four 32-bit values per voxel. For the details you can look through the code:

 

 https://github.com/pengliu916/Texture3D_StructuredBuffer.git
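Roughly speaking, the three access paths being compared look like this in HLSL (a simplified sketch with made-up names and registers, not the actual code from the repo):

// Hypothetical declarations; the real demo uses its own names and bindings.
Texture3D<float4>        texVolume    : register(t0); // swizzled/tiled layout
StructuredBuffer<float4> structVolume : register(t1); // linear layout
Buffer<float4>           typedVolume  : register(t2); // linear layout, typed loads

cbuffer VolumeCB : register(b0)
{
    uint3 volumeDim; // e.g. 256 x 256 x 256
};

float4 LoadVoxel(uint3 p)
{
#if defined(USE_TEXTURE3D)
    return texVolume.Load(int4(p, 0));  // 3D addressing, mip 0
#elif defined(USE_STRUCTURED)
    return structVolume[p.x + volumeDim.x * (p.y + volumeDim.y * p.z)];
#else
    return typedVolume[p.x + volumeDim.x * (p.y + volumeDim.y * p.z)];
#endif
}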

 

The results I got are strange:

When the virtual camera is far enough away, everything works as expected: the StructuredBuffer/TypedBuffer volumes show a large variance in render time as the cube rotates (since the volume data is not swizzled, the cache hit rate can be very bad from some viewpoints), while the Texture3D volume's render time is almost constant and always faster than the StructuredBuffer/TypedBuffer versions (as expected).

 

The weird thing is that when the virtual camera gets close to the volume, the StructuredBuffer/TypedBuffer render times still vary and the Texture3D render time is still very stable, but it gets much slower as the camera approaches the volume. Please see this video for the result:

 

   

 

I can't understand why the Texture3D volume's render time becomes so much slower than the StructuredBuffer/TypedBuffer versions when the camera gets close. It would be greatly appreciated if someone could run this test to confirm that it isn't related to my hardware, and provide some insight into why it behaves this way.

 

Thanks

 

Peng


I ran your demo and Texture3D was fastest at all zoom levels on NVIDIA and AMD hardware. On Intel, Texture3D was faster when zoomed out, StructuredBuffer was faster when zoomed in.

You didn't mention what hardware you are running on.


Thanks Adam for running the test; it's good to know this may be related to specific hardware. FYI, my test machine has two NVIDIA 680M GPUs (SLI is disabled) with the latest driver (I can't check the driver version right now, but I updated the driver last night, right before my test run). And I believe I have the latest released Windows version.

 

So, any idea why on some hardware the Texture3D performance is strangely related to the zoom level?

 

One other interesting question I would like to ask: in this volume update/render case, what are the pros and cons of using a Texture3D instead of a linear-layout buffer, if a built-in typed format is all we need? On my machine I noticed the updating compute shader runs slightly slower on the Texture3D (but that may not be noticeable in your case).

 

Thanks

Edited by Mr_Fox


It certainly looks a lot like a hardware-specific bottleneck you're hitting, probably one specific to the particular model of NVIDIA GPU you have. 128-bit textures are not a particularly common thing to use, and sampling them often comes with multiplying penalties relative to smaller formats (e.g. if 32-bit is full speed, then 64-bit is half speed and 128-bit is quarter speed). I don't know a lot about Kepler, but it may well have these sorts of penalties for higher bit-depth textures that don't apply to buffer loads.

 

You may well be in a situation where cache coherency is the dominant factor when the camera is zoomed out, and thus the swizzled/tiled texture data wins. But as you zoom in, only a portion of the cube is visible and it covers more pixels on screen, so that's less of an issue and "fetch cost" begins to dominate.

 

Have you tried a 64-bit format instead to see what difference that makes (if any)?
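If it helps, here is a rough sketch of what the shader side of that change could look like (hypothetical names, not the demo's code; for the Texture3D and typed Buffer paths the HLSL stays the same and only the resource would be created with a 16-bit-per-channel DXGI format, while a StructuredBuffer would need manual packing):

// Texture3D / typed Buffer: still float4 in HLSL; create the resource as
// DXGI_FORMAT_R16G16B16A16_FLOAT instead of R32G32B32A32_FLOAT to halve the
// bytes moved per fetch.
Texture3D<float4> texVolume64 : register(t0);

// A StructuredBuffer has no DXGI format, so pack two halves per uint yourself
// and unpack with f16tof32 in the shader.
StructuredBuffer<uint2> structVolume64 : register(t1);

float4 LoadPackedVoxel(uint index)
{
    uint2 p = structVolume64[index];
    return float4(f16tof32(p.x & 0xFFFF), f16tof32(p.x >> 16),
                  f16tof32(p.y & 0xFFFF), f16tof32(p.y >> 16));
}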

 

For what you're doing, I would always favour working with a Texture3D, even in light of the performance results on Intel hardware. As you pointed out, marching across a StructuredBuffer is only going to be cache coherent along the primary axis (X, probably); along Y and Z you're jumping a whole row or slice of data with each step. You can see the cost see-saw up and down as the cube rotates with the 'Buffer' methods in a way that it doesn't with Texture3D.
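To put some numbers on the stride point (these assume a hypothetical 256^3 volume of float4 voxels, not necessarily the demo's dimensions):

// Linear layout: a float4 voxel is 16 bytes, so one step along each axis moves
//   +x:                    16 bytes      (usually still within the same cache line)
//   +y:  256 * 16      =  4,096 bytes    (a whole row away)
//   +z:  256 * 256 * 16 = 1,048,576 bytes (a whole slice away)
// A ray marching mostly along y or z therefore touches a new cache line almost
// every step, which is exactly what a swizzled/tiled Texture3D layout mitigates.
uint LinearIndex(uint3 p, uint3 dim)
{
    return p.x + dim.x * (p.y + dim.y * p.z);
}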

 

As for populating the Volume texture, I see zero difference on NVIDIA. On AMD and Intel, Texture3D population is marginally faster than the two Buffer methods.


Thanks Adam, I will try a 64-bit format later today.

 

Just one question, from what I know (correct me if I am wrong): for a StructuredBuffer/TypedBuffer load, the GPU will only load 'one voxel' per PS thread (from the pixel shader's point of view), while a Texture3D will load 4 (or more) voxels even if I specifically use the Load function (not a sampler), right? Is that the 'fetch cost' you mentioned for when the camera is zoomed in, which makes the Texture3D slower since it fetches more data than the StructuredBuffer?

 

Also, it seems that if your data structure matches any of the built-in typed formats, a Texture should be preferred over a Buffer, right?

 

Thanks


No, not really.

 

The Load or [] operator in HLSL asks for only a single texel from the 3D texture. GPUs still have cache lines like the CPU does, so it's unlikely to be able to fetch only 16 bytes from memory; it'll have to pull in the whole cache line regardless, even if the other 48 bytes aren't needed (assuming a 64-byte cache line). That's fine if you go on to use the other 48 bytes in future fetches, but a complete waste if you don't. But that applies to Buffers as well as Textures, so it's not really what I meant.
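As a back-of-the-envelope example (the 64-byte cache line is an assumption; line sizes vary by GPU):

// One R32G32B32A32_FLOAT voxel is 4 * 4 = 16 bytes, so a 64-byte line holds
// 64 / 16 = 4 voxels. Touching only one of them wastes 75% of the bandwidth the
// fetch consumed; touching the other three soon afterwards gets them for free.
static const uint CACHE_LINE_BYTES = 64;
static const uint VOXEL_BYTES      = 4 * 4;                           // float4
static const uint VOXELS_PER_LINE  = CACHE_LINE_BYTES / VOXEL_BYTES;  // == 4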

 

The mechanics vary from vendor to vendor and architecture to architecture, but generally a given GPU will have a fixed number of memory operations (aka fetches) it can perform per second, and often this is reduced by 1/2, 1/4 or even more if the format is larger than some given size. Even if your Texture were 1x1x1 and the texel immediately ended up in the fastest cache the GPU has (eg L1), you're still requesting that texel be fetched potentially millions of times. Generally that penalty is more often associated with using Texture Sampling (.Sample) than Load operations, but there's really not enough vendor-specific information out there to be sure one way or the other.

 

If your data structure is logically 2D or 3D, I'd use 2D or 3D textures, yes. Texture1D I don't use very often because it's very similar to a Buffer (except that it supports filtering). That's not to say there aren't particular scenarios where that advice might not be true, but as a rule of thumb it works.

Edited by Adam Miles


Hi Adam,

 

I have tried the 16-bit-per-channel version. The Texture3D version runs almost twice as fast as the TypedBuffer version at the default distance, but as I move all the way in they run at almost the same speed. So, as you said, I'm definitely hitting some hardware-specific bottleneck (there is no difference between switching between Load and the more expensive anisotropic sampler for the Texture3D when zoomed in). But what still confuses me is why the Texture3D is more sensitive to zooming in than the TypedBuffer (as you said, the number of fetch ops should be the same for both, so why does the Texture3D's performance drop more than the TypedBuffer's)?

 

Sorry for being annoying; I am just a grad student interested in DirectX who wants to know what happens under the hood...

 

Thanks

 

Peng 
