
#5082279 Bounding Box with Compute Shader

Posted by on 01 August 2013 - 01:35 PM

To read back results on the CPU you have to create two buffers of the same size. The first you create with D3D11_USAGE_DEFAULT, and you use that as the output of your compute shader. The other buffer you create with D3D11_USAGE_STAGING and CPU read access. Then after you run your compute shader, you use CopyResource to copy the data from the GPU buffer to the staging buffer. You can then call Map on the staging buffer to read its contents. Just be aware that doing this will cause the CPU to flush its command buffer and wait around while the GPU finishes executing commands, which will kill parallelism and hurt performance. You can alleviate this by waiting as long as possible after calling CopyResource before calling Map.
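As a rough sketch of that call sequence (names are illustrative, error handling omitted, and this assumes you already have a device/context and a DEFAULT-usage buffer that the compute shader writes through a UAV):

```cpp
// Sketch of the CPU-readback flow described above (not a complete program).
D3D11_BUFFER_DESC desc = {};
gpuBuffer->GetDesc(&desc);
desc.Usage = D3D11_USAGE_STAGING;          // CPU-readable copy target
desc.BindFlags = 0;                        // staging resources can't be bound
desc.MiscFlags = 0;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

ID3D11Buffer* stagingBuffer = nullptr;
device->CreateBuffer(&desc, nullptr, &stagingBuffer);

// ... Dispatch the compute shader that writes gpuBuffer ...

context->CopyResource(stagingBuffer, gpuBuffer);

// Do as much other CPU work as possible here before mapping,
// since Map will stall until the GPU has finished the copy.
D3D11_MAPPED_SUBRESOURCE mapped = {};
context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped);
// mapped.pData now points at the results
context->Unmap(stagingBuffer, 0);
```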

Also just so you're aware, while global atomics are the most straightforward way to do this they're definitely not the fastest. Running a multi-pass parallel reduction is likely to be much faster.
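To illustrate the reduction idea, here's a CPU-side sketch of the pattern (this is not shader code; each call to ReducePass stands in for one Dispatch, and each GROUP_SIZE chunk stands in for one thread group computing a partial result in groupshared memory):

```cpp
#include <vector>
#include <algorithm>

// Each "pass" reduces every GROUP_SIZE-element chunk of the input to a
// single partial minimum, mimicking one Dispatch where each thread group
// writes one value. Repeating until one element remains is the multi-pass
// reduction; on the GPU you'd do the same for min and max simultaneously.
static const size_t GROUP_SIZE = 4;

std::vector<float> ReducePass(const std::vector<float>& input)
{
    std::vector<float> output((input.size() + GROUP_SIZE - 1) / GROUP_SIZE);
    for(size_t group = 0; group < output.size(); ++group)
    {
        float groupMin = input[group * GROUP_SIZE];
        for(size_t i = 1; i < GROUP_SIZE && group * GROUP_SIZE + i < input.size(); ++i)
            groupMin = std::min(groupMin, input[group * GROUP_SIZE + i]);
        output[group] = groupMin;
    }
    return output;
}

float ReduceMin(std::vector<float> values)
{
    while(values.size() > 1)   // one "Dispatch" per iteration
        values = ReducePass(values);
    return values[0];
}
```

The key property is that no pass needs a global atomic: every group only touches its own chunk, and the cross-group combine happens in the next pass.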

#5082253 Compute Shader execution time

Posted by on 01 August 2013 - 12:49 PM

The problem with timestamp queries is that they just tell you the amount of time it takes for the GPU's command processor to reach a certain point in the command buffer. Actually measuring the time taken by a Draw or Dispatch call is more complicated than that, because the GPU can be executing multiple Draw/Dispatch calls simultaneously. Since there can be lots of things in flight, the command processor generally won't wait for a Draw/Dispatch to finish before moving on and executing the next command. So if you just wrap a single Dispatch call, all you'll get is the amount of time for the CP to start the Dispatch and then move on. To get any sort of accurate timing info you need to wrap your Begin/End around something that will cause the driver to insert a sync point, or try to force a sync point yourself. Typically any Dispatch or Draw that reads from the output of another Dispatch or Draw will cause the GPU to sync. But of course inserting artificial sync points will hurt your overall efficiency by preventing multiple Draw/Dispatch calls from overlapping, so you have to be careful.
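For reference, the basic D3D11 timestamp-query pattern looks like this (a sketch with illustrative names; real code would buffer the queries over several frames instead of reading them back immediately). Keep the caveat above in mind: the two timestamps bracket command-processor progress, not shader execution.

```cpp
// Sketch: timestamps around a Dispatch, plus the disjoint query that
// supplies the tick frequency and validity flag.
D3D11_QUERY_DESC desc = { D3D11_QUERY_TIMESTAMP, 0 };
ID3D11Query *start = nullptr, *end = nullptr;
device->CreateQuery(&desc, &start);
device->CreateQuery(&desc, &end);

desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
ID3D11Query* disjoint = nullptr;
device->CreateQuery(&desc, &disjoint);

context->Begin(disjoint);
context->End(start);                    // timestamp before
context->Dispatch(groupsX, groupsY, 1);
context->End(end);                      // timestamp after
context->End(disjoint);

// ...later, once GetData stops returning S_FALSE...
UINT64 t0 = 0, t1 = 0;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
context->GetData(start, &t0, sizeof(t0), 0);
context->GetData(end, &t1, sizeof(t1), 0);
context->GetData(disjoint, &dj, sizeof(dj), 0);
if(!dj.Disjoint)
{
    double ms = double(t1 - t0) / double(dj.Frequency) * 1000.0;
}
```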

Incidentally, this is why Nsight will give you both "sync" and "async" timings for a Draw or Dispatch call. The "sync" value gives you the time it takes to execute the call if it's the only call being executed on the GPU, while the "async" value gives you the time it took to execute alongside all of the other Draw/Dispatch calls being executed during the frame.

#5081340 SV_VertexID and ID3D11DeviceContext::Draw()/StartVertexLocation

Posted by on 28 July 2013 - 08:58 PM

Cache performance of UAVs is dependent on the locality of your writes. In the case of rasterization, pixel shader executions are grouped by locality in terms of pixel position, so writing to a texture at the pixel location should be fine.

As for synchronization, you only need to worry about that if different shader executions read from or write to the same memory locations during the same Draw or Dispatch call. If you don't do that, there's nothing that you need to worry about.

#5081257 CPU GPU (compute shader) parallelism

Posted by on 28 July 2013 - 01:16 PM

In the case you're describing the driver will automatically detect the dependency between your two Dispatch calls, and it will insert a sync point before executing the second Dispatch. This will cause the GPU to wait until all threads from the first Dispatch complete, so the second Dispatch won't begin until all of the data is present in the state buffer.

#5081141 CPU GPU (compute shader) parallelism

Posted by on 28 July 2013 - 12:28 AM

D3D and the driver will queue up as many commands as you give them, and the GPU will eventually execute them. Typically in applications that use the device for rendering, the device will block the CPU during Present if the CPU starts getting too far ahead of the GPU. I'm not sure exactly how it works if you're only using the device for compute, but I would assume that something similar happens if the driver has too many commands queued up.

If you wanted a system that dynamically changes what commands it issues based on the GPU load, there's no direct support for doing it. If I were to try implementing such a thing, I would probably start by trying to use timestamp queries to track when Dispatch calls actually get executed. Then, based on that feedback, you could decide whether to issue new Dispatch calls.

#5080872 SIGGRAPH 2013 Master Thread

Posted by on 26 July 2013 - 06:17 PM

Yeah, it's pretty nice.

Like on most modern GPUs, there's plenty of ALU capacity available while you're waiting on reads from memory, and complex BRDFs are a pretty straightforward way to take advantage of that.

#5080802 SIGGRAPH 2013 Master Thread

Posted by on 26 July 2013 - 12:41 PM

If you guys have any questions about my talk (Crafting a Next-Gen Material Pipeline for The Order: 1886), feel free to ask. I can also see if I can get the other course presenters on here if you have questions for them.

#5079968 DX11 - Tessellation - Something is wrong

Posted by on 23 July 2013 - 03:47 PM

I know this problem because it happened to me when I first implemented tessellation! You can't use SV_Position to pass values between the VS/HS and HS/DS stages; you can only use it for the output of your domain shader. For the other shader stages, you should use a non-SV semantic for passing position.
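For example (struct names are illustrative), position travels under a user-defined semantic through the VS/HS/DS chain, and only the domain shader output uses SV_Position:

```hlsl
// Sketch: position passed between tessellation stages under a
// user-defined semantic; SV_Position appears only on the DS output.
struct VSOutput
{
    float3 Position : POSITION;     // NOT SV_Position
};

struct DSOutput
{
    float4 Position : SV_Position;  // only here, for the rasterizer
};
```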

#5079747 API Wars horror - Will it matter?

Posted by on 22 July 2013 - 10:22 PM


My bad. That's what I remembered reading in the past, but like I said, I never programmed for the PS3.

At the start of the PS3's lifetime, the fact that an OGL|ES implementation existed was jumped on by the 'opengl everywhere!' gang, and it has since been reported as fact that the PS3 uses OpenGL... alas, to this day the misinformation persists and this common mistake crops up.


Yeah I still see that all of the time (especially on general gaming forums) and it drives me nuts. Hopefully the same thing doesn't happen to PS4.

#5079622 [Instancing] Flickering when updating one instance.

Posted by on 22 July 2013 - 11:52 AM


If your common use case is to only change a small part of the buffer at a time, then you can try using a two-buffer approach. Create your primary instancing buffer as D3D11_USAGE_IMMUTABLE, and then create a secondary buffer with D3D11_USAGE_STAGING. Then when you want to change the buffer contents, map your staging buffer and then use CopySubresourceRegion to copy some or all of the staging buffer to your primary buffer.

Wait. Copying to an immutable buffer? I've just tried that with D3D 11.0 and I get this:

ID3D11DeviceContext::CopySubresourceRegion: Cannot invoke CopySubresourceRegion when the destination Resource was created with the D3D11_USAGE_IMMUTABLE Usage.

What am I missing? Is this some D3D 11.1 or higher feature?


You're not missing anything, I messed up. That should have been D3D11_USAGE_DEFAULT instead of D3D11_USAGE_IMMUTABLE.

#5079195 [Instancing] Flickering when updating one instance.

Posted by on 20 July 2013 - 01:06 PM

When you map with D3D11_MAP_WRITE_DISCARD you lose all previous contents of that resource. So if you use it, you need to fill up the entire buffer again.

If your common use case is to only change a small part of the buffer at a time, then you can try using a two-buffer approach. Create your primary instancing buffer as D3D11_USAGE_DEFAULT, and then create a secondary buffer with D3D11_USAGE_STAGING. Then when you want to change the buffer contents, map your staging buffer and then use CopySubresourceRegion to copy some or all of the staging buffer to your primary buffer.
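A sketch of that two-buffer update for a single instance (names are illustrative, error handling omitted): map the staging buffer, write the instance that changed, then copy just that byte range into the DEFAULT-usage buffer used for rendering.

```cpp
// Sketch: partial update of instanceBuffer (DEFAULT) via stagingBuffer (STAGING).
D3D11_MAPPED_SUBRESOURCE mapped = {};
context->Map(stagingBuffer, 0, D3D11_MAP_WRITE, 0, &mapped);
memcpy((uint8_t*)mapped.pData + instanceIndex * sizeof(InstanceData),
       &newInstanceData, sizeof(InstanceData));
context->Unmap(stagingBuffer, 0);

// For a buffer, the D3D11_BOX addresses a byte range along the x axis.
D3D11_BOX box = {};
box.left = UINT(instanceIndex * sizeof(InstanceData));
box.right = box.left + sizeof(InstanceData);
box.top = 0;   box.bottom = 1;
box.front = 0; box.back = 1;
context->CopySubresourceRegion(instanceBuffer, 0, box.left, 0, 0,
                               stagingBuffer, 0, &box);
```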

#5079090 f16tof32 f32tof16 doesn't work correctly

Posted by on 19 July 2013 - 11:12 PM

Are you using DXGI_FORMAT_R16G16B16A16_FLOAT for the position element in your vertex buffer? You said the format is R16G16B16A16, but there are several varieties of that format.

For debugging compute shaders the only tools you can use are Nsight (Nvidia) and GPU PerfStudio (AMD). However, you don't necessarily need any tools to debug this... you can copy your buffer to a staging buffer that you can Map on the CPU, so that you can print out the values or inspect them in a debugger.

#5078810 Shader Unlimited Lights

Posted by on 18 July 2013 - 06:48 PM

One can break the constant limit - and force to use a proper loop - in SM 3.0 by encoding stuff in textures

I think there's an echo in here.

#5078634 Shader Unlimited Lights

Posted by on 17 July 2013 - 11:27 PM


D3D9 pixel shaders have a major limitation, which is that they can't dynamically index into shader constants. This means the compiler can't use an actual loop construct in assembly to implement your for loop; instead it has to unroll it and do something like this:



It's possible to use a loop with D3D9, take a look at this sample:



Have you looked at the generated assembly? It looks like this:

def c46, -4, -5, -6, -7
def c47, 0, 1, 2, 3
dcl_texcoord v0.xyz
dcl_texcoord1 v1.xy
dcl_texcoord2 v2.xyz
dcl_texcoord3 v3.xyz
dcl_2d s0
nrm r0.xyz, v3
dp3 r0.w, v2, v2
rsq r0.w, r0.w
mov r1, c47.x
mov r2.x, c47.x
rep i0
  add r3, r2.x, -c47
  add r4, r2.x, c46
  mov r5.x, c47.x
  cmp r2.yzw, -r3_abs.x, c0.xxyz, r5.x
  cmp r2.yzw, -r3_abs.y, c5.xxyz, r2
  cmp r2.yzw, -r3_abs.z, c10.xxyz, r2
  cmp r2.yzw, -r3_abs.w, c15.xxyz, r2
  cmp r2.yzw, -r4_abs.x, c20.xxyz, r2
  cmp r2.yzw, -r4_abs.y, c25.xxyz, r2
  cmp r2.yzw, -r4_abs.z, c30.xxyz, r2
  cmp r2.yzw, -r4_abs.w, c35.xxyz, r2
  add r2.yzw, r2, -v0.xxyz
  cmp r5.y, -r3_abs.x, c4.x, r5.x
  cmp r5.y, -r3_abs.y, c9.x, r5.y
  cmp r5.y, -r3_abs.z, c14.x, r5.y
  cmp r5.y, -r3_abs.w, c19.x, r5.y
  cmp r5.y, -r4_abs.x, c24.x, r5.y
  cmp r5.y, -r4_abs.y, c29.x, r5.y
  cmp r5.y, -r4_abs.z, c34.x, r5.y
  cmp r5.y, -r4_abs.w, c39.x, r5.y
  rcp r5.y, r5.y
  mul r2.yzw, r2, r5.y
  dp3 r5.y, r2.yzww, r2.yzww
  add r5.z, -r5.y, c47.y
  max r6.x, r5.z, c47.x
  rsq r5.y, r5.y
  mul r2.yzw, r2, r5.y
  mad r5.yzw, v2.xxyz, r0.w, r2
  nrm r7.xyz, r5.yzww
  dp3_sat r2.y, r0, r2.yzww
  dp3_sat r2.z, r0, r7
  pow r5.y, r2.z, c44.x
  cmp r7, -r3_abs.x, c1, r5.x
  cmp r7, -r3_abs.y, c6, r7
  cmp r7, -r3_abs.z, c11, r7
  cmp r7, -r3_abs.w, c16, r7
  cmp r7, -r4_abs.x, c21, r7
  cmp r7, -r4_abs.y, c26, r7
  cmp r7, -r4_abs.z, c31, r7
  cmp r7, -r4_abs.w, c36, r7
  mad r7, r6.x, r7, c45
  cmp r8, -r3_abs.x, c2, r5.x
  cmp r8, -r3_abs.y, c7, r8
  cmp r8, -r3_abs.z, c12, r8
  cmp r8, -r3_abs.w, c17, r8
  cmp r8, -r4_abs.x, c22, r8
  cmp r8, -r4_abs.y, c27, r8
  cmp r8, -r4_abs.z, c32, r8
  cmp r8, -r4_abs.w, c37, r8
  mul r8, r8, c41
  mul r8, r2.y, r8
  mul r8, r6.x, r8
  mad r7, c40, r7, r8
  cmp r8, -r3_abs.x, c3, r5.x
  cmp r8, -r3_abs.y, c8, r8
  cmp r8, -r3_abs.z, c13, r8
  cmp r3, -r3_abs.w, c18, r8
  cmp r3, -r4_abs.x, c23, r3
  cmp r3, -r4_abs.y, c28, r3
  cmp r3, -r4_abs.z, c33, r3
  cmp r3, -r4_abs.w, c38, r3
  mul r3, r3, c43
  mul r3, r5.y, r3
  cmp r3, -r2.y, c47.x, r3
  mad r3, r3, r6.x, r7
  add r1, r1, r3
  add r2.x, r2.x, c47.y
texld r0, v1, s0
mul oC0, r0, r1

Because of the constant indexing limitation it has to do a compare and select for every single constant register. It's just a different variant of what I mentioned. Basically it's like doing this:

for(uint i = 0; i < NumLights; ++i)
{
    float3 LightPos = Lights[0].Position;
    if(i == 1)
        LightPos = Lights[1].Position;
    else if(i == 2)
        LightPos = Lights[2].Position;
    else if(i == 3)
        LightPos = Lights[3].Position;
    // ...
    else if(i == 7)
        LightPos = Lights[7].Position;

    float3 LightColor = Lights[0].Color;
    if(i == 1)
        LightColor = Lights[1].Color;
    else if(i == 2)
        LightColor = Lights[2].Color;
    else if(i == 3)
        LightColor = Lights[3].Color;
    // ...
    else if(i == 7)
        LightColor = Lights[7].Color;

    // and so on
}

#5078547 "Rough" material that is compatible with color-only lightmaps?

Posted by on 17 July 2013 - 02:49 PM

When you bake a diffuse lightmap you're pre-integrating your lighting environment at each texel with your BRDF. You can do this and end up with a single value because you assume that the surface normal never changes, and that the reflected light doesn't depend on the viewing angle (which is true for a Lambertian BRDF). For most other BRDFs that last assumption doesn't hold, so you can't really use the same approach and get correct results. You would need to either...

  • Store the lighting environment without pre-integrating your BRDF using some sort of basis (such as spherical harmonics), and integrate your BRDF at runtime taking the current viewing angle into account (with this approach you can also vary the surface normal at runtime, which allows for normal mapping)
  • Pre-integrate with your BRDF and store multiple values corresponding to multiple viewing angles, then interpolate between them at runtime based on the current viewing angle
  • Don't change the way you bake lightmaps, and instead attempt to apply some approximating curve to the value that takes the viewing angle into account.
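To make the view-independence point concrete, in standard rendering-equation notation (not from the original post): a Lambertian bake can fold the whole integral into one stored color because the BRDF is the constant $\rho/\pi$,

$$L_o(x) = \frac{\rho}{\pi} \int_{\Omega} L_i(x, \omega)\,(n \cdot \omega)\, d\omega$$

whereas a general BRDF $f$ leaves a dependence on the view direction $v$ that a single baked texel can't capture:

$$L_o(x, v) = \int_{\Omega} f(\omega, v)\, L_i(x, \omega)\,(n \cdot \omega)\, d\omega$$

The three options above are different ways of dealing with that extra $v$ parameter: defer the integral to runtime, tabulate it over $v$, or approximate it with a fitted curve.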