

Member Since 29 Mar 2007

#5086296 Buffer<uint> reading wrong values

Posted on 15 August 2013 - 04:21 PM

Yeah the debug layer output can be a lifesaver sometimes! If you want, you can actually set it up to break into the debugger when a warning or error occurs. I usually do this so that I know right away that something is wrong, and also so that I know exactly where it's coming from.  It's really easy to do:


// QueryInterface for ID3D11InfoQueue only succeeds if the device was
// created with the debug layer enabled
ID3D11InfoQueue* infoQueue = NULL;
if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11InfoQueue), reinterpret_cast<void**>(&infoQueue))))
{
    infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_WARNING, TRUE);
    infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_ERROR, TRUE);
    infoQueue->Release();
}


As for the BRDF, it's basically an energy-conserving Blinn-Phong BRDF paired with a Fresnel term. The energy-conserving Blinn-Phong model does indeed come from the "Physically Based Shading" side of things. If you're not familiar with the basics, I'd recommend reading through Real-Time Rendering 3rd Edition. It's a fantastic book all-around, but it has a great chapter that describes the physics of lighting and reflectance, as well as another chapter that gives a comprehensive overview of BRDFs and the basics of physically based shading models. I'd also recommend reading through Naty Hoffman's presentations and course notes from SIGGRAPH. The slides from this year's course can be found here, but his course notes aren't up yet. You can also find last year's material here.
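For concreteness, the specular term I'm describing usually takes this form (this is the approximate normalization popularized by the RTR3/Hoffman course material; α is the specular exponent, and F is a Schlick Fresnel approximation):

```latex
f_{spec}(\mathbf{l}, \mathbf{v}) = \frac{\alpha + 8}{8\pi}\,(\mathbf{n} \cdot \mathbf{h})^{\alpha}\,F(\mathbf{f}_0,\ \mathbf{l} \cdot \mathbf{h})
\qquad
F(\mathbf{f}_0,\ \mathbf{l} \cdot \mathbf{h}) = \mathbf{f}_0 + (1 - \mathbf{f}_0)(1 - \mathbf{l} \cdot \mathbf{h})^5
```

The (α + 8)/(8π) factor is what makes it "energy-conserving": it scales the specular lobe down as the exponent decreases, so a broader highlight doesn't reflect more energy than arrived.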

#5086237 Device Context Question

Posted on 15 August 2013 - 02:42 PM

You can just set the hull and domain shaders to NULL if you're not using them anymore.
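In code, that's just binding NULL for each stage on the context (a minimal sketch, assuming `context` is your ID3D11DeviceContext):

```cpp
// Unbind the tessellation stages so subsequent draws skip them
context->HSSetShader(NULL, NULL, 0);
context->DSSetShader(NULL, NULL, 0);
```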

#5086235 Buffer<uint> reading wrong values

Posted on 15 August 2013 - 02:40 PM

It sounds like you're doing everything correctly, so I'm not sure what the problem is. You shouldn't need to do anything to the buffer after the compute shader is done writing to it; you should be able to just read from it in your pixel shader, assuming that the UAV and SRV were set up correctly. If you didn't set up the UAV or SRV correctly, the debug layer will usually output a warning to let you know that you did something wrong, assuming that you enabled the debug layer when creating the device.

#5085881 Percentage Closer Filtering

Posted on 14 August 2013 - 12:37 PM


Are you creating your device with the D3D11_CREATE_DEVICE_DEBUG flag and checking for warnings/errors in your debugger output? It's possible that you messed something up when creating the sampler state, and that's screwing up the shader.


I do not have the D3D11_CREATE_DEVICE_DEBUG flag.



You should always, always, *always* be using it for debug builds. Always.
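Enabling it is a one-line change at device creation time (a minimal sketch; error handling omitted, and the `device`/`context` variables are assumed to be declared elsewhere):

```cpp
UINT createFlags = 0;
#ifdef _DEBUG
    // Only enable the debug layer in debug builds, since it adds CPU overhead
    createFlags |= D3D11_CREATE_DEVICE_DEBUG;
#endif

D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, createFlags,
                  NULL, 0, D3D11_SDK_VERSION,
                  &device, NULL, &context);
```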

#5085641 Uber (Compute) Shader construction

Posted on 13 August 2013 - 03:05 PM

When you hard code a constant into HLSL or Cg code the end result is the same as having that constant specified in a constant buffer and bound at runtime. 

Well that of course isn't true. Any constants that are hard-coded into the shader can be folded into the instructions and used to optimize the resulting assembly/bytecode. In materials a lot of parameters end up having a value of 0 or 1. If those parameters are hard-coded then any multiplication with those values can be optimized away completely. Or in the case of 0, all operations involved in producing the value to be multiplied with that parameter can be stripped away. With values in a constant buffer the compiler can't make these assumptions, and must issue wasted instructions and possibly consume additional registers. There are also cases where the hardware supports modifiers that allow multiplications by certain values to be folded into a previous instruction. For instance AMD's GCN ISA supports modifiers that allow for a free multiplication by 0.25, 0.5, 2, or 4.

#5084552 Easy-to-use Version Control on Windows? Needs to be able to easily ignore cer...

Posted on 09 August 2013 - 06:55 PM

Yeah, Perforce very much likes to be in control of files in the repo, and generally doesn't behave well when you go behind its back. So if you want to delete a file you have to tell P4 to delete it, and it will then delete it from your hard drive. If you happen to delete a file locally, P4 won't even be aware that it's gone, and if you try to sync on it (Get Latest in P4V) it won't replace the file. The only way it will replace it is if you use -f to force a sync.

#5084049 Far Buildings Illusion

Posted on 07 August 2013 - 10:35 PM

Those sorts of things are almost always modeled at a much lower level of detail relative to the geometry that the camera will get close to, which means fewer polygons, smaller textures, and fewer draw calls. It's also common to use forced perspective to make the models appear much further away than they actually are.

#5082877 InputLayout Permutations

Posted on 03 August 2013 - 07:55 PM

On all hardware that I'm familiar with, there's no dedicated functionality for fetching vertices. Instead there will be a small bit of auto-generated vertex shader code that's run before your actual vertex shader, and that code will be responsible for reading data from your vertex buffer(s) using normal shader buffer loads and storing the loaded vertex data into registers. AMD refers to this as a "fetch shader" in their documentation, and I think you can find a bit about how it works in their architecture guides. An input layout essentially contains all of the data required to generate this bit of shader code, which is probably why the D3D10/D3D11 API requires you to create input layouts ahead of time instead of just binding a vertex declaration like you did in D3D9. In the old D3D9 model the driver couldn't fully generate a fetch shader until you bound the vertex declaration and vertex shader for rendering, which meant it might have to do expensive driver work the first time that you do it. In D3D10/D3D11, on the other hand, the association is known ahead of time, so the driver can do this work during initialization/loading instead of mid-frame.

As far as performance goes, the fetch shader is only going to read the vertex elements required by the vertex shader and will ignore everything else. So if you have extra elements in the vertex buffer that go unused, the main performance deficit will be that you will have some extra unused data polluting the cache while the shader is running. If you were to have a more optimal vertex buffer that only contained the elements required by the vertex shader, then it's more likely that you'll get a cache hit. In practice though, vertex shader memory access isn't something that I find myself having to optimize very often.

#5082676 InputLayout Permutations

Posted on 02 August 2013 - 11:52 PM

Actually if that's from one of the deferred rendering samples, then I'm pretty sure that I wrote that code. :P


The input elements array that you passed is meant to represent all of the possible elements that you will have in your vertex buffer when you render with that vertex shader. The input layout then represents a unique mapping of those elements to the input parameters of the vertex shader. There doesn't need to be a 1:1 relation between vertex elements and VS input parameters; the input elements array just needs to contain at least the parameters expected by the vertex shader. All other elements will be ignored. So yes, you can certainly re-use a vertex element array if you're using the same vertex buffer with different vertex shaders. A good example would be if you had a mesh, and you had two sets of shaders: one for normal rendering, and one for shadow map rendering. The main vertex shader will probably require all of the vertex elements, but the shadow map vertex shader will probably require only the positions. Since the two shaders have different input parameters you'll need 2 different input layouts, but if you use the same vertex buffer for both then you can re-use the same vertex element array.

You can also use the same input layout for different vertex shaders as long as the vertex shaders have the same input signature.
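As a sketch of the shadow map example above (the bytecode and layout variable names here are just placeholders, and the element offsets assume a tightly packed position/normal/texcoord vertex):

```cpp
// One element array describing everything in the vertex buffer
D3D11_INPUT_ELEMENT_DESC elements[] =
{
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    { "TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT,    0, 24, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

// Main VS consumes all three elements
device->CreateInputLayout(elements, 3, mainVSBytecode, mainVSSize, &mainLayout);

// Shadow VS only declares POSITION in its input signature,
// so NORMAL and TEXCOORD are simply ignored by this layout
device->CreateInputLayout(elements, 3, shadowVSBytecode, shadowVSSize, &shadowLayout);
```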

#5082308 GPU Compute threading model and its relationship to hardware

Posted on 01 August 2013 - 03:10 PM

Each CU in a 7970 actually executes 4 wavefronts concurrently. Each wavefront is mapped to a 16-wide SIMD unit, and instructions are executed 16 threads at a time. Not that this changes much, but who doesn't like being pedantic. :P


Anyway, these would be my TL;DR guidelines for thread groups:


1. Always prefer thread group sizes that are a multiple of the warp/wavefront size of your hardware. If you're targeting both AMD and Nvidia, a multiple of 64 is safe for both.

2. You usually want more than one wavefront/warp per thread group. Having more lets the hardware swap out warps/wavefronts to hide latency. 128-256 threads is usually a good place to aim.

3. Don't use too much shared memory, since it kills occupancy.
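As a sketch of how #1 plays out when you actually issue the work, this is the usual round-up arithmetic for picking the number of thread groups for a given workload (host-side C++; the function name is mine):

```cpp
#include <cstdint>

// Number of thread groups needed to cover numElements items with
// groupSize threads per group, rounding up so no item is missed.
uint32_t DispatchSize(uint32_t groupSize, uint32_t numElements)
{
    return (numElements + groupSize - 1) / groupSize;
}

// Example: a 256-thread group (4 wavefronts of 64) covering 1000 elements
// needs 4 groups, with 24 threads idle in the last group. Your shader
// should bounds-check the thread ID against numElements for that reason.
```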

#5082279 Bounding Box with Compute Shader

Posted on 01 August 2013 - 01:35 PM

To read back results on the CPU you have to create two buffers of the same size. The first you create with D3D11_USAGE_DEFAULT, and you use that as the output of your compute shader. The other buffer you create with D3D11_USAGE_STAGING and CPU read access. Then after you run your compute shader, you use CopyResource to copy the data from the GPU buffer to the staging buffer. You can then call Map on the staging buffer to read its contents. Just be aware that doing this will cause the CPU to flush its command buffers and wait around while the GPU finishes executing commands, which will kill parallelism and hurt performance. You can alleviate this by waiting as long as possible after calling CopyResource before calling Map.
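A rough sketch of those steps (buffer creation omitted; `gpuBuffer`, `stagingBuffer`, `results`, and `resultSize` are placeholder names):

```cpp
// gpuBuffer: D3D11_USAGE_DEFAULT, written by the compute shader via a UAV.
// stagingBuffer: same size, D3D11_USAGE_STAGING, D3D11_CPU_ACCESS_READ, no bind flags.
context->CopyResource(stagingBuffer, gpuBuffer);

// ...ideally do other CPU work here before mapping, so the GPU can catch up...

D3D11_MAPPED_SUBRESOURCE mapped;
if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
{
    // mapped.pData points at the compute shader's output
    memcpy(results, mapped.pData, resultSize);
    context->Unmap(stagingBuffer, 0);
}
```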

Also just so you're aware, while global atomics are the most straightforward way to do this they're definitely not the fastest. Running a multi-pass parallel reduction is likely to be much faster.
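To illustrate what I mean by multi-pass: each pass acts like one Dispatch, where each "thread group" reduces a chunk of the input down to one partial result, and the output of one pass becomes the input of the next, until a single value remains. Here's a CPU-side C++ sketch for a min reduction (assumes a non-empty input; on the GPU each inner chunk would be reduced in parallel using shared memory):

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// CPU sketch of a multi-pass parallel min reduction. Each iteration of the
// outer loop models one Dispatch; each GroupSize-sized chunk models one
// thread group emitting a single partial minimum.
float ReduceMin(std::vector<float> values)
{
    const size_t GroupSize = 64;   // one partial result per "thread group"
    while (values.size() > 1)
    {
        std::vector<float> partial;
        for (size_t i = 0; i < values.size(); i += GroupSize)
        {
            size_t end = std::min(i + GroupSize, values.size());
            partial.push_back(*std::min_element(values.begin() + i,
                                                values.begin() + end));
        }
        values = partial;          // this pass's output feeds the next pass
    }
    return values[0];
}
```

For a bounding box you'd run the same scheme over the vertex positions, tracking a min and a max per component instead of a single float.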

#5082253 Compute Shader execution time

Posted on 01 August 2013 - 12:49 PM

The problem with timestamp queries is that they just tell you the amount of time it takes for the GPU's command processor to reach a certain point in the command buffer. Actually measuring the amount of time it takes for a Draw or Dispatch call is more complicated than that, because the GPU can be executing multiple Draw/Dispatch calls simultaneously. Since there can be lots of things in flight, the command processor generally won't wait for a Draw/Dispatch to finish before moving on and executing the next command. So if you just wrap a single Dispatch call, all you'll get is the amount of time for the CP to start the Dispatch and then move on. To get any sort of accurate timing info, you need to wrap your Begin/End around something that will cause the driver to insert a sync point, or try to force a sync point yourself. Typically any Dispatch or Draw that reads from the output of another Dispatch or Draw will cause the GPU to sync. But of course inserting artificial sync points will hurt your overall efficiency by preventing multiple Draw/Dispatch calls from overlapping, so you have to be careful.

On a related note, this is why Nsight will give you both "sync" and "async" timings for a Draw or Dispatch call. The "sync" value gives you the time it takes to execute the call if it's the only call being executed on the GPU, while the "async" value gives you the time it took to execute with all of the other Draw/Dispatch calls being executed during the frame.

#5081340 SV_VertexID and ID3D11DeviceContext::Draw()/StartVertexLocation

Posted on 28 July 2013 - 08:58 PM

Cache performance of UAVs is dependent on the locality of your writes. For the case of rasterization, pixel shader executions are grouped by locality in terms of pixel position, so writing to a texture at the pixel location should be fine.

As for synchronization, you only need to worry about that if different shader executions read from or write to the same memory locations during the same Draw or Dispatch call. If you don't do that, there's nothing that you need to worry about.

#5081257 CPU GPU (compute shader) parallelism

Posted on 28 July 2013 - 01:16 PM

In the case you're describing the driver will automatically detect the dependency between your two Dispatch calls, and it will insert a sync point before executing the second Dispatch. This will cause the GPU to wait until all threads from the first Dispatch complete, so the second Dispatch won't begin until all of the data is present in the state buffer.

#5081141 CPU GPU (compute shader) parallelism

Posted on 28 July 2013 - 12:28 AM

D3D and the driver will queue up as many commands as you give it, and the GPU will eventually execute them. Typically in applications that use the device for rendering, the device will block the CPU during Present if the CPU starts getting too far ahead of the GPU. I'm not sure how it works exactly if you're only using the device for compute, but I would assume that something similar happens if the driver has too many commands queued up.

If you wanted a system that dynamically changes what commands it issues based on the GPU load, there's no direct support for doing it. If I were to try implementing such a thing, I would probably start by trying to use timestamp queries to track when Dispatch calls actually get executed. Then based on that feedback you could decide whether to issue new Dispatch calls.