
Member Since 08 Aug 2012
Offline Last Active Mar 15 2013 02:30 PM

Posts I've Made

In Topic: Texture memory access patterns

01 March 2013 - 01:24 AM

Thanks for the responses. Regarding the CUDA and OpenCL documentation, I've read most of it, and while it gives lots of details and guidelines on global memory access, it says very little about texture memory. The only guideline it provides is to maintain 2D spatial coherency when using textures (although it never explicitly defines what it means by spatial coherency). The CUDA documentation is extremely detailed about how to get coalesced global memory access within a warp, how to avoid shared memory bank conflicts, and many other optimizations, so it's surprising there is next to nothing about how to minimize texture cache misses. I would think there would be at least a guideline for which texels each thread in a warp should access to achieve the greatest memory throughput. Wouldn't the performance be different if each warp read its texels in a 32x1 pattern compared to a 16x2 or an 8x4?


The article that phil_t linked to was very helpful and provided lots of insight into how texture memory works in the graphics pipeline. One section mentioned how the L1 texture cache can take advantage of compressed texture formats: these formats are compressed in blocks of 4x4 pixels, and when a texel is requested, its block is decompressed and stored in the L1 texture cache. If the threads in the same warp use some of those 16 pixels, you get multiple pixels' worth of data from a single memory fetch and decompression (well, if I understood the article correctly). So I suppose I'll stick to reading texels in a 4x4 pattern within a warp, unless someone tells me otherwise.

In Topic: HLSL fast trig functions

06 February 2013 - 10:39 PM

Thanks for the insightful and detailed responses, everybody. Do you think future versions of HLSL might support this, though? Even with the differences between AMD and NVIDIA architectures, I would think it wouldn't be too hard to define an assembly instruction that maps to the fast trig functions on NVIDIA hardware and falls back to the normal trig functions on AMD hardware. Doesn't the JIT compiler know what hardware is being used? I don't think the compiler should use the fast trig functions without being explicitly told to, because accuracy may be important for some applications; I just don't understand why there isn't an assembly instruction for this. The fact that a feature isn't supported by both vendors shouldn't mean HLSL can't exploit it at all — the instruction would just use the fast trig operations whenever the right hardware is detected. Seems simple to me... but then, I'm no expert.

In Topic: HLSL fast trig functions

06 February 2013 - 02:53 AM

So I'm assuming no one knows of such HLSL functions (I thought it might've been some [attribute] modifier). The strange thing is that CUDA has a bunch of intrinsics that sacrifice accuracy for speed, including square roots, exponentials, and trigonometric functions. This is detailed in the CUDA C Best Practices Guide under instruction optimization: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

I suppose CUDA might just have a larger math library than HLSL.

In Topic: Advice for workflow and organization using Effects11

18 September 2012 - 05:12 PM

I installed the Windows 8 SDK and tried using the fxc compiler included in the kit, but now my shader won't compile. Apparently the compiler doesn't like my use of group syncs inside loops that depend on UAV conditions (even though I've specified the [allow_uav_condition] attribute). The weird thing is that the compiler in the June 2010 SDK doesn't have any problems, and my shader runs exactly how I want it to. Should I stick with the older compiler, or should I be concerned that the new compiler rejects my code? Is the new compiler stricter about thread syncs? In my shader, all the threads in a group read from the same UAV address, which determines the control flow in the loop, so all the warps in the group should follow the same path... I don't know why it generates an error in the new compiler.

Another possibility is that I'm not setting up the project correctly to use the new compiler. I don't want to switch entirely to the Windows 8 SDK (I'm using some D3DX functionality), so the only thing I changed was the executable directory in the project properties to the Windows 8 SDK bin directory. Does the compiler need the new libraries and headers, or can it just use the ones in the June 2010 SDK?

In Topic: Advice for workflow and organization using Effects11

18 September 2012 - 12:12 AM

Thanks for your great reply, as always, MJP. I am using the June 2010 SDK, so I'll definitely take a look at the Windows 8 SDK. I suppose it's time for me to abandon the Effects11 framework, since Microsoft doesn't really support it anymore; I just thought it might be common practice, since Frank Luna's book Introduction to 3D Game Programming with DirectX 11 uses it. Would you happen to have a good source on working with shaders and buffers directly (or through a self-developed system), as well as on compiling shaders offline properly? (I've stumbled across Microsoft documentation about binding resources to the correct slots across multiple shader files.) The Microsoft documentation can be frustratingly sparse, so I would definitely prefer a good book or website.