About NotTakenSN

  1. I would like to be able to right click on a file, such as a jpg, and open it with my application under the "open with" menu in Windows. How do I get the filename and directory passed along to the arguments in main (or winmain, since my app has a graphics interface)? Right now, my application reads from a predefined directory. It executes fine when I double click the executable, but when I try opening my application from the "open with" drop down menu on a random file, I get an invalid allocation size error. My current code doesn't actually use any of the command line arguments to winmain, so why am I getting this error? Shouldn't it launch the same way as when I double click my executable?
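Windows fills those arguments in for you: when a file is opened through "open with", the shell launches the executable with the file's full path (quoted if it contains spaces) appended to the command line, which arrives in winmain's lpCmdLine parameter. As a rough illustration of the parsing involved, here is a minimal, portable sketch of a quoted-argument splitter; in a real Windows build you would more likely call GetCommandLineW together with CommandLineToArgvW, which implement the shell's full quoting rules.

```cpp
#include <string>
#include <vector>

// Simplified sketch: split a raw command line (the kind of string WinMain's
// lpCmdLine provides) into arguments, honoring double quotes around paths
// that contain spaces. Not a full CommandLineToArgvW clone.
std::vector<std::string> SplitCommandLine(const std::string& cmdLine)
{
    std::vector<std::string> args;
    std::string current;
    bool inQuotes = false;
    for (char c : cmdLine)
    {
        if (c == '"')
        {
            inQuotes = !inQuotes;          // toggle quoted mode, drop the quote
        }
        else if (c == ' ' && !inQuotes)
        {
            if (!current.empty()) { args.push_back(current); current.clear(); }
        }
        else
        {
            current += c;
        }
    }
    if (!current.empty()) args.push_back(current);
    return args;
}
```

As for the crash itself: a likely culprit when an app only fails under "open with" is a relative path, since the working directory may differ from the executable's directory and a predefined data directory no longer resolves. That is a common cause, not a certain diagnosis of the invalid-allocation error.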
  2. NotTakenSN

    Texture memory access patterns

    Thanks for the responses. In regards to the CUDA and OpenCL documentation, I've read most of it, and while it gives lots of details and guidelines on global memory access, it doesn't say much about texture memory. The only guideline provided is to have 2D spatial coherency when using textures (although they never explicitly define what they mean by spatial coherency). The CUDA documentation is extremely detailed about how to get coalesced global memory access within a warp, how to avoid memory bank conflicts, and many other optimizations, so it's surprising there is next to nothing about how to minimize texture cache misses. I would think there would at least be a guideline for which texels each thread in a warp should access to achieve the greatest memory throughput. Wouldn't the performance differ if each warp read texels in a 32x1 pattern compared to a 16x2 or an 8x4? The article that phil_t linked to was very helpful and provided lots of insight into how texture memory works in the graphics pipeline. One section describes how the L1 texture cache can take advantage of compressed texture formats. These formats are compressed in blocks of 4x4 pixels, which are uncompressed on request and stored in the L1 texture cache. If the threads in the same warp use several of those 16 pixels, you get multiple pixels' worth of data from one memory fetch and decompression (if I understood the article correctly). So I suppose I'll stick to reading texels in a 4x4 pattern within a warp, unless someone tells me otherwise.
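To make the 4x4 idea concrete, here is a small sketch of one possible lane-to-texel mapping in which each half-warp covers exactly one 4x4 block, so a 32-thread warp touches two adjacent compressed blocks. The layout is an illustrative assumption, not a vendor-documented optimum.

```cpp
#include <utility>

// Sketch: map a warp lane index (0..31) to a 2D texel offset so that lanes
// 0..15 cover one 4x4 block and lanes 16..31 cover the adjacent 4x4 block.
// The intent is that each group of 16 lanes lands in exactly one
// 4x4 compressed-texture block. Assumed layout, for illustration only.
std::pair<int, int> LaneToTexelOffset(int lane)
{
    int block = lane / 16;            // which 4x4 block (0 or 1)
    int local = lane % 16;            // index within that block
    int x = block * 4 + (local % 4);  // blocks sit side by side in x
    int y = local / 4;                // row-major order inside the block
    return { x, y };
}
```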
  3. How is the texture cache constructed? (I suppose different hardware has different implementations, but wouldn't there be some similarities?) From what I've read, texture memory is just global memory with a dedicated texture cache designed for better memory access when threads in the same warp read data that is near each other in 2D space. What counts as "near" in 2D space? If a thread requests data from (5,5), what data ultimately gets sent to the cache along with it? Does it depend on the data type as well? If your warp size is 32, what grid pattern would most efficiently read/write each texel (2x16, 4x8, 8x4, etc.)? The documentation on global memory access is quite detailed, but I can't seem to find much about texture memory access (maybe because the implementation varies too much from hardware to hardware).
  4. What's the best way to create a g-buffer? Most of the documentation I've read suggests rendering the scene geometry normally and writing to multiple render targets. A drawback of this method is that render targets are limited to formats of at most four 32-bit values, which can waste memory space and bandwidth if you're not writing a multiple of four 32-bit values. It also restricts the geometric data layout to single rows (where each row starts at xmin and increases to xmax at the same y value), when small tiles might be better for coalesced memory access later on in a compute shader. Is it a better idea to write to a structured buffer instead? How would you go about the depth testing to ensure that the final value written into the structured buffer is actually on top? One method would be to explicitly read the depth buffer, compare the values, then write to both the depth buffer and the g-buffer if the test passes. The other option would be to use the earlydepthstencil attribute so that only fragments that pass the depth test invoke the pixel shader (which writes the value to the g-buffer). Does this actually work? Are there major drawbacks to this method?
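The manual depth-test idea can be sketched on the CPU side like this; GBufferEntry and the smaller-is-closer depth convention are assumptions for illustration. Note that in an actual pixel shader this read-compare-write sequence is not atomic across overlapping fragments, which is exactly the hazard that the earlydepthstencil attribute sidesteps by letting the hardware reject occluded fragments before the shader runs.

```cpp
#include <vector>

// Assumed g-buffer payload for illustration: a normal plus one albedo channel.
struct GBufferEntry { float normalX, normalY, normalZ, albedo; };

// Sketch of "read depth, compare, write both": the fragment's data only
// lands in the g-buffer if it is closer than what is already stored.
// Convention assumed here: smaller depth means closer to the camera.
bool DepthTestedWrite(std::vector<float>& depthBuffer,
                      std::vector<GBufferEntry>& gbuffer,
                      int pixelIndex, float fragDepth,
                      const GBufferEntry& fragData)
{
    if (fragDepth >= depthBuffer[pixelIndex])
        return false;                  // fragment is behind the stored surface
    depthBuffer[pixelIndex] = fragDepth;
    gbuffer[pixelIndex] = fragData;    // fragment is on top: keep its data
    return true;
}
```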
  5. What's the best way to go about displaying an image calculated in a compute shader to the screen? Is it possible to write directly to a render target from the compute shader? Or would you have to write the results to a 2D UAV texture, then somehow swap that into the back buffer? I suppose writing to a RWTexture2D<float4> is the way to go, but how exactly would you set up the swap chain for this?   The only way I can get it working right now is to write into a 2D UAV texture, then render a rectangle to activate the pixel shader, which then reads from the texture and writes those values to the render target. Obviously, I would like to avoid this method because it requires unnecessary switching between the compute shader and pixel shader, which impacts performance.
  6. NotTakenSN

    HLSL fast trig functions

    Thanks for the insightful and detailed responses, everybody. Do you think future versions of hlsl will support this, though? Even with the differences between amd and nvidia architectures, I wouldn't think it would be too hard to create an assembly instruction that maps to the fast trig functions on nvidia hardware and the normal trig functions on amd hardware. Doesn't the JIT compiler know what hardware is being used? I don't think the compiler should use the fast trig functions without being explicitly told to, because accuracy may be important for some applications. I just don't understand why there isn't an assembly instruction for this. Just because a function isn't supported by both vendors shouldn't mean it can't be exposed by hlsl at all. Seems simple to me... but then, I'm no expert.
  7. NotTakenSN

    HLSL fast trig functions

    So I'm assuming no one knows of such hlsl functions (I thought it might have been some [attribute] modifier). The strange thing is that CUDA has a bunch of functions that sacrifice accuracy for speed, including square roots, exponentials, and trigonometric functions; this is detailed in the CUDA best practices guide under instruction optimization: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html I suppose CUDA might just have a larger math library than hlsl.
  8. Does hlsl have access to fast trig functions that sacrifice accuracy for speed? I know CUDA has __sinf(x), __cosf(x), etc., which can be an order of magnitude faster than their sinf(x) and cosf(x) counterparts. I could have sworn I read about it somewhere, but I can't find it on google or msdn anymore.
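For comparison, the kind of accuracy-for-speed trade being asked about can be illustrated in plain C++ with Bhaskara I's sine approximation (maximum absolute error roughly 0.0016 on [0, pi]). This is only an illustrative stand-in, not what __sinf actually does on NVIDIA hardware, where it maps to a special-function-unit instruction rather than a polynomial in code.

```cpp
#include <cmath>

// Bhaskara I's approximation: sin(x) ~= 16x(pi - x) / (5*pi^2 - 4x(pi - x)),
// valid for x in [0, pi]. One multiply-heavy rational expression instead of
// a range-reduced polynomial, trading accuracy for speed.
float FastSin(float x) // x must be in [0, pi]
{
    const float pi = 3.14159265358979f;
    float t = x * (pi - x);
    return 16.0f * t / (5.0f * pi * pi - 4.0f * t);
}
```

The approximation happens to be exact at 0, pi/6, pi/2, 5pi/6, and pi, which makes it easy to sanity-check.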
  9. I'm trying to remove the effects pipeline from my program and have realized that I have no idea how to load a precompiled shader without the effects functions. I've found example code that uses D3DReadFileToBlob, but that is only available in the Windows 8 SDK (I'm using the June 2010 SDK). I've even tried linking to the d3dcompiler.lib in the Windows 8 SDK, but still can't use the function. Right now, I'm attempting to read the file through std::ifstream into a std::vector<char>, then use CreateVertexShader(). Is this the right way to go? Also, how do I set up the input layout afterwards? Most of the examples I've seen store the byte code in an ID3DBlob, so they can use ID3DBlob::GetBufferPointer() to retrieve the void *pShaderBytecodeWithInputSignature required by the CreateInputLayout() function. My byte code is stored in a vector<char>, so how do I do this? Can I create an ID3DBlob from a vector<char>?
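Reading the bytecode into a std::vector<char> works, and no ID3DBlob is needed: CreateVertexShader and CreateInputLayout both take a raw pointer plus a size, so data.data() and data.size() can be passed in directly wherever the examples use GetBufferPointer()/GetBufferSize(). A minimal sketch of the file read (the D3D calls themselves are omitted here):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Load a compiled shader (.cso) file into memory. The returned vector's
// data() and size() can go straight into CreateVertexShader() and
// CreateInputLayout() in place of an ID3DBlob's buffer pointer and size.
std::vector<char> ReadFileBytes(const std::string& path)
{
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return {};                      // empty vector signals failure
    std::streamsize size = file.tellg();       // opened at end: tellg == size
    file.seekg(0, std::ios::beg);
    std::vector<char> data(static_cast<size_t>(size));
    file.read(data.data(), size);
    return data;
}
```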
  10. I installed the Windows 8 SDK and tried using the fxc compiler included in the kit, but now my shader won't compile. Apparently it doesn't like my use of group syncs inside loops that depend on UAV conditions (even though I've specified the allow_uav_condition flag). The weird thing is that the compiler in the June 2010 SDK doesn't have any problems, and my shader runs exactly how I want it to. Should I stick with the older compiler, or should I be concerned that the new compiler doesn't like my code? Is the new compiler stricter about thread syncs? In my shader, all the threads in a group read from the same UAV address, which determines the flow in the loop, so all the warps in the group should follow the same flow... I don't know why it generates an error in the new compiler. Another possibility is that I'm not setting up the project correctly to use the new compiler. I don't want to switch entirely to the Windows 8 SDK (I'm using some D3DX functionality), so the only thing I changed was the executable directory in the project properties to point at the Windows 8 SDK bin directory. Does the compiler need the new libraries and headers, or can it use the ones from the June 2010 SDK?
  11. Thanks for your great reply, as always, MJP. I am using the June 2010 SDK, so I'll definitely take a look at the Windows 8 SDK. I suppose it's time for me to abandon the Effects11 framework, since Microsoft doesn't really even support it anymore. I just thought it might be common practice, since Frank Luna's book Introduction to 3D Game Programming with DirectX11 used it. Would you happen to have a good source for working with shaders and buffers directly (or through a self-developed system), as well as compiling shaders offline properly (I've stumbled across certain Microsoft documentation talking about aligning resources to the correct slots across multiple shader files)? The Microsoft documentation can be frustratingly sparse, so I would definitely prefer a good book or website.
  12. I'm currently using the Effects11 framework, and I find it very convenient for organizing and compiling my shaders, setting resources, making draw calls, etc. But now I'm running into a problem with compile times. Inside my effects file, I have a vertex, geometry, pixel, and two compute shaders, all of which are necessary for a technique I'm designing. One of the compute shaders is very long (about 900 instructions without any unrolling) and takes over 3 minutes to compile. I have finished working on that compute shader and don't need to change it anymore, but when I change the other shaders in the effects file, I have to recompile the entire file, which means waiting 3 minutes for the big compute shader to recompile. This is quite inconvenient when I'm trying to debug the shaders. Is there a way to exclude a specific shader from recompilation, or do I need to create a new effects file? What is your preferred workflow when using the Effects11 framework, or do you even use it? And do you lump all of your shaders into one big effects file, or do you separate them into smaller ones? I appreciate your replies.
  13. When you're doing a final build for the shaders and you enable the highest optimizations, how does the compiler know how to balance register usage with occupancy? Individual threads may execute faster when more registers are used, but you will have lower occupancy, since each block will require more resources. Does it balance out? Which one do you prefer?
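The register/occupancy tension in the question comes down to simple arithmetic: each multiprocessor has a fixed register file, so more registers per thread means fewer resident warps. A back-of-envelope sketch, using assumed Kepler-class limits (65536 registers and at most 64 warps per multiprocessor); real limits vary by architecture.

```cpp
#include <algorithm>

// How many warps can be resident on one multiprocessor, limited only by
// register pressure? Hardware numbers below are assumptions modeled on a
// Kepler-class GPU, purely for illustration.
int ActiveWarps(int registersPerThread)
{
    const int regsPerSM      = 65536; // registers in the SM's register file
    const int maxWarps       = 64;    // hardware cap on resident warps
    const int threadsPerWarp = 32;
    int warpsByRegs = regsPerSM / (registersPerThread * threadsPerWarp);
    return std::min(warpsByRegs, maxWarps);
}
```

At 32 registers per thread the register file still supports the full 64 warps, but doubling to 64 registers halves occupancy to 32 warps. Whether the per-thread speedup outweighs the lost latency hiding is workload-dependent, which is presumably why the compiler resolves the trade-off with heuristics rather than a fixed rule.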
  14. Thanks for the reply. I am looking at the D3D shader assembly that is generated by the fxc.exe compiler. I'm aware that the assembly isn't 100% what the driver would generate at runtime, but how much would it actually differ? If the assembly code isn't similar to the actual code, what would be the purpose of even looking at assembly code then? How would you go about manually controlling the register usage then, or is it just something people don't bother with?
  15. This is extremely frustrating. What kind of compiler has no setting to compile exactly what the programmer writes? Controlling the number of registers is vital for getting optimal performance, yet there is no way to force the compiler to stick to a certain register count. Does anyone know if you get this sort of control in OpenGL or OpenCL? I'm beginning to regret ever learning DirectX. Absolutely worthless documentation and support.