Member Since 29 Mar 2007

#5279535 Setting constant buffer 'asynchronously'

Posted by on 04 March 2016 - 02:05 PM

The runtime/driver will synchronize UpdateSubresource for you. This means that if you use a constant buffer for draw call A, update the constant buffer, and then issue draw call B, it will appear as though the buffer were updated in between A and B.

#5279533 Manually copying texture data from a loaded image

Posted by on 04 March 2016 - 01:54 PM

Thank you! No plans for a D3D12 book at the moment.

#5279349 Manually copying texture data from a loaded image

Posted by on 03 March 2016 - 04:08 PM

You're not supposed to fill out D3D12_SUBRESOURCE_FOOTPRINT and D3D12_PLACED_SUBRESOURCE_FOOTPRINT. You need to fill out your D3D12_RESOURCE_DESC for the 2D texture that you want to create, and then pass that to ID3D12Device::GetCopyableFootprints in order to get the array of footprints for each subresource. You can then use the footprints to fill in an UPLOAD buffer, and then use CopyTextureRegion to copy the subresources from your upload buffer to the actual 2D texture resource in a DEFAULT heap.

You may want to look at the HelloTexture sample on GitHub to use as a reference.
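
One detail that tends to trip people up when filling the UPLOAD buffer: the row pitch reported in a footprint is usually larger than the tightly packed row size of the loaded image, because texture copy rows must be aligned to D3D12_TEXTURE_DATA_PITCH_ALIGNMENT (256 bytes). Below is a minimal CPU-only sketch of that row-by-row repacking. PackRowsForUpload and AlignUp are hypothetical helpers, and a std::vector stands in for the mapped upload buffer; in a real app you would write to the mapped pointer at the footprint's Offset and use its RowPitch instead of computing one yourself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same value as D3D12_TEXTURE_DATA_PITCH_ALIGNMENT (256 bytes).
constexpr std::size_t kRowPitchAlignment = 256;

// Round `value` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t AlignUp(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical helper: copy tightly packed image rows into a staging layout
// whose row pitch satisfies the copy alignment requirement.
std::vector<std::uint8_t> PackRowsForUpload(const std::uint8_t* pixels,
                                            std::size_t width,
                                            std::size_t height,
                                            std::size_t bytesPerPixel,
                                            std::size_t& outRowPitch)
{
    const std::size_t srcRowSize = width * bytesPerPixel;
    outRowPitch = AlignUp(srcRowSize, kRowPitchAlignment);

    // Padding bytes at the end of each row are left as zero.
    std::vector<std::uint8_t> staging(outRowPitch * height, 0);
    for (std::size_t y = 0; y < height; ++y)
    {
        std::memcpy(staging.data() + y * outRowPitch,
                    pixels + y * srcRowSize,
                    srcRowSize);
    }
    return staging;
}
```

A single memcpy of the whole image only works when the source row size already happens to be a multiple of 256 bytes.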

EDIT: actually you don't strictly need to use GetCopyableFootprints, see Brian's post below

#5279328 [D3D12] Copying between upload buffers

Posted by on 03 March 2016 - 02:02 PM

You don't want to copy between upload buffers on the CPU. It can be very slow, since the memory will probably be write combined (uncached reads). There's really no way for you to figure out how big of a buffer you need ahead of time? Personally I would try to avoid having to create new buffers in the middle of rendering setup.

#5278921 Multiple small shaders or larger shader with conditionals?

Posted by on 01 March 2016 - 05:21 PM

Somebody correct me if I'm wrong, but if every path in a shader takes the same branch it is nearly (not entirely) the same cost as if the changes were compiled in with defines. If you are rendering an object with a specific set of branches that every thread will take it may not be a big deal. If threads take different branches you will eat the cost of all branches.

Branching on constants is nice, because there's no chance of divergence. Divergence is when some threads within a warp/wavefront take one side of the branch, and some take another. It's bad because you end up paying the cost of executing both sides of the branch. When branching on a constant everybody takes the same path, so you only pay some (typically small) cost for checking the constant and executing the branch instructions.

As for whether it's the same cost as compiling an entirely different shader permutation, it may depend on what's in the branch as well as the particulars of the hardware. One potential issue with branching on constants is register pressure. Many GPUs have static register allocation, which means that they compute the maximum number of registers needed by a particular shader program at compile time and then make sure that the maximum is always available when the shader is actually executed. Typically the register file has a fixed size and is shared among multiple warps/wavefronts, which means that if a shader needs lots of registers then fewer warps/wavefronts can be in flight simultaneously. GPUs like to hide latency by swapping out warps/wavefronts, so having fewer in flight limits their ability to hide latency from memory access. So let's say that you have something like this:

// EnableLights comes from a constant buffer
if(EnableLights)
{
    // ...lighting code that uses lots of registers...
}

By branching you can avoid the direct cost of computing the lighting, but you won't be able to avoid the indirect cost that may occur from increased register pressure. However, if you were to use a preprocessor macro instead of a branch and compile 2 permutations, then the permutation with lighting disabled can potentially use fewer registers and have greater occupancy. But again, this depends quite a bit on the specifics of the hardware as well as your shader code, so you don't want to generalize about this too much. In many cases branching on a constant might have the exact same performance as creating a second shader permutation, or might even be faster due to CPU overhead from switching shaders.
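
To put rough numbers on the occupancy argument, here is a small sketch of how static register allocation limits the number of waves in flight. The limits are illustrative GCN-like values that I'm assuming for the example (256 vector registers per lane, at most 10 waves per SIMD), and the model ignores allocation granularity, so treat it as a back-of-the-envelope estimate and not a description of any particular GPU.

```cpp
#include <algorithm>

// Illustrative GCN-like limits; both numbers are assumptions for this sketch.
constexpr int kRegistersPerLane = 256; // vector registers available to each lane
constexpr int kMaxWavesPerSimd  = 10;  // scheduler limit regardless of register use

// Number of waves that fit on one SIMD when every thread statically reserves
// `registersPerThread` registers. Allocation granularity is ignored.
int WavesInFlight(int registersPerThread)
{
    if (registersPerThread <= 0)
        return kMaxWavesPerSimd;
    return std::min(kMaxWavesPerSimd, kRegistersPerLane / registersPerThread);
}
```

With these made-up limits, a shader that reserves 84 registers for the lighting path gets 3 waves in flight even when EnableLights is false, while a permutation without lighting that only needs 24 registers would reach the full 10.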

#5276220 Shadow map depth range seems off

Posted by on 17 February 2016 - 04:46 PM

^^^ what mhagain said. I would suggest reading through this article for some good background information on the subject.

Also, a common trick for visualizing a perspective depth buffer is to just ignore the range close to the near clip plane and then treat the remaining part as linear. Like this:

float zw = DepthBuffer[pixelPos];
float visualizedDepth = saturate(zw - 0.9f) * 10.0f;

You can also compute the original view-space Z value from a depth buffer if you have the original projection matrix used for generating that depth buffer. If you take this and normalize to a [0,1] range using the near and far clip planes, then you get a nice linear value:

float zw = DepthBuffer[pixelPos];
float z = Projection._43 / (zw - Projection._33);
float visualizedDepth = saturate((z - NearClip) / (FarClip - NearClip));
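
If you want to convince yourself that the reconstruction is right, the math is easy to check on the CPU. The sketch below assumes a standard left-handed, row-vector D3D perspective projection, where _33 = far / (far - near) and _43 = -near * far / (far - near). The struct and function names are made up for the example.

```cpp
#include <algorithm>

// The two projection terms used by the reconstruction, for a row-vector
// (D3D-style) left-handed perspective projection.
struct ProjectionTerms
{
    float m33; // Projection._33 = far / (far - near)
    float m43; // Projection._43 = -near * far / (far - near)
};

ProjectionTerms MakeProjectionTerms(float nearClip, float farClip)
{
    const float range = farClip - nearClip;
    return { farClip / range, -nearClip * farClip / range };
}

// Post-projection depth (z / w) stored in the depth buffer for view-space z.
float ViewZToDepth(float z, const ProjectionTerms& p)
{
    return (z * p.m33 + p.m43) / z;
}

// Inverse mapping, the same expression as in the shader snippet above.
float DepthToViewZ(float zw, const ProjectionTerms& p)
{
    return p.m43 / (zw - p.m33);
}

// Linear [0, 1] value suitable for visualization.
float VisualizeLinearDepth(float zw, const ProjectionTerms& p,
                           float nearClip, float farClip)
{
    const float z = DepthToViewZ(zw, p);
    const float t = (z - nearClip) / (farClip - nearClip);
    return std::min(1.0f, std::max(0.0f, t));
}
```

Keep in mind that a perspective depth value close to 1.0 has very little floating-point precision left, so the recovered Z for distant surfaces is only approximate.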

#5275840 D3D12: Copy Queue and ResourceBarrier

Posted by on 15 February 2016 - 08:35 PM

Honestly I use a direct queue for everything including moving data to and from the GPU. The Present API only presents at multiples of the screen refresh rate (I haven't had luck getting unlocked fps yet), and I get 120 FPS whether I use a direct queue or a copy queue. Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue just makes things more complex than they really need to be for how much performance gain you might get with it.

It definitely depends on how much data you're moving around, and how long it might take the GPU to copy that data. The problem with using the direct queue for everything is that it's probably going to serialize with your "real" graphics work. So if you submit 15ms worth of graphics work for a frame and you also submit 1ms worth of resource copying on the direct queue, then your entire frame will probably take 16ms on the GPU. Using a copy queue could potentially allow the GPU to execute the copy while also concurrently executing graphics work, reducing your total frame time.

#5275061 Diffuse IBL - Importance Sampling vs Spherical Harmonics

Posted by on 09 February 2016 - 05:48 PM

Spherical harmonics are pretty popular for representing environment diffuse, because they're fairly compact and they can be evaluated with just a bit of shader code. For L2 SH you need 9 coefficients (so 27 floats for RGB), which is about the size of a 2x2 cubemap. However it will always introduce some amount of approximation error compared to ground truth, and in some cases that error can be very noticeable. In particular it has the problem where very intense lighting will cause the SH representation to over-darken on the opposite side of the sphere, which can lead to totally black (or even negative!) values.

The other nice thing about SH is that it's really simple to integrate. With specular pre-integration you usually have to integrate for each texel of your output cubemap, which is why importance sampling is used as an optimization. If you're integrating to SH you don't need to do this, you're effectively integrating for a single point. This means you can just loop over all of the texels in your source cubemap, which means you won't "miss" any details. You can look at my code for an example, if you want.
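
To give an idea of what "loop over all of the texels" looks like, here is a rough single-channel sketch of projecting a cubemap onto L2 SH. The Cubemap struct, the face ordering (+X, -X, +Y, -Y, +Z, -Z), and the per-face direction mapping are assumptions made for this example, so adjust them to match your own data. Each texel is weighted by an approximation of its solid angle, and the result is renormalized so that the weights sum to 4*pi.

```cpp
#include <array>
#include <cmath>
#include <vector>

// Single-channel cubemap: 6 faces of size x size texels, face-major, row-major.
struct Cubemap
{
    int size = 0;
    std::vector<float> texels; // 6 * size * size values
};

// Unnormalized direction through (u, v) in [-1, 1] on a face.
// Assumes the face order +X, -X, +Y, -Y, +Z, -Z.
std::array<float, 3> FaceDirection(int face, float u, float v)
{
    switch (face)
    {
    case 0:  return {  1.0f, -v, -u };
    case 1:  return { -1.0f, -v,  u };
    case 2:  return {  u,  1.0f,  v };
    case 3:  return {  u, -1.0f, -v };
    case 4:  return {  u, -v,  1.0f };
    default: return { -u, -v, -1.0f };
    }
}

// Project radiance onto the 9 L2 SH basis functions by visiting every texel.
std::array<float, 9> ProjectCubemapToSH(const Cubemap& cube)
{
    std::array<float, 9> sh{};
    float weightSum = 0.0f;
    for (int face = 0; face < 6; ++face)
    for (int y = 0; y < cube.size; ++y)
    for (int x = 0; x < cube.size; ++x)
    {
        const float u = (x + 0.5f) / cube.size * 2.0f - 1.0f;
        const float v = (y + 0.5f) / cube.size * 2.0f - 1.0f;

        // Texels near the face corners subtend less solid angle than those
        // at the face center, so they get a smaller weight.
        const float lenSq = 1.0f + u * u + v * v;
        const float weight = 1.0f / (lenSq * std::sqrt(lenSq));

        const std::array<float, 3> d = FaceDirection(face, u, v);
        const float invLen = 1.0f / std::sqrt(lenSq);
        const float dx = d[0] * invLen, dy = d[1] * invLen, dz = d[2] * invLen;

        const float basis[9] = {
            0.282095f,
            0.488603f * dy,
            0.488603f * dz,
            0.488603f * dx,
            1.092548f * dx * dy,
            1.092548f * dy * dz,
            0.315392f * (3.0f * dz * dz - 1.0f),
            1.092548f * dx * dz,
            0.546274f * (dx * dx - dy * dy),
        };

        const float radiance = cube.texels[(face * cube.size + y) * cube.size + x];
        for (int i = 0; i < 9; ++i)
            sh[i] += radiance * basis[i] * weight;
        weightSum += weight;
    }

    if (weightSum <= 0.0f)
        return sh;

    // Normalize so that the weights cover the full sphere (4*pi steradians).
    const float scale = 12.566371f / weightSum;
    for (float& c : sh)
        c *= scale;
    return sh;
}
```

For RGB you would run the same loop with three accumulators per coefficient, which gives the 27 floats mentioned above.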

#5274941 Limiting light calculations

Posted by on 09 February 2016 - 02:49 AM

I was always advised to avoid dynamic branching in pixel shader.

You should follow this advice if you're working on a GPU from 2005. If you're working on one from the last 10 years...not so much. On a modern GPU I would say that there are two main things you should be aware of with dynamic branches and loops:

1. Shaders will follow the worst case within a warp or wavefront. For a pixel shader, this means groups of 32-64 pixels that are (usually) close together in screen space. What this means is that if you have an if statement where the condition evaluates to false for 31 pixels but true for one pixel in the 32-thread warp, then they're all going to execute what's inside the if statement. This can be especially bad if you have an else clause, since you can end up with your shader executing both the "if" as well as the "else" part of your branch! For loops it's similar: the shader will keep executing the loop until all threads have hit the termination condition. Note that if you're branching or looping on something from a constant buffer, then you don't need to worry about any of this. For that case every single pixel will take the same path, so there's no coherency issue.

2. Watch out for lots of nested flow control. Doing this can start to add overhead from the actual flow control instructions (comparisons, jumps, etc.), and can cause the compiler to use a lot of general purpose registers.
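
The "worst case within a warp" rule from point 1 can be sketched with a toy cost model. This is only an illustration: the function names and the idea of assigning a fixed cost to each side of a branch are assumptions of the sketch, not how any real GPU accounts for time.

```cpp
#include <algorithm>
#include <vector>

// Cost (arbitrary units) that one warp/wavefront pays for an if/else when it
// executes in lockstep: each side runs if at least one thread needs it.
int WarpBranchCost(const std::vector<bool>& takesIf, int ifCost, int elseCost)
{
    bool anyIf = false;
    bool anyElse = false;
    for (bool taken : takesIf)
    {
        if (taken)
            anyIf = true;
        else
            anyElse = true;
    }
    return (anyIf ? ifCost : 0) + (anyElse ? elseCost : 0);
}

// A loop keeps the whole warp busy until its slowest thread has finished.
int WarpLoopIterations(const std::vector<int>& iterationsPerThread)
{
    int worst = 0;
    for (int n : iterationsPerThread)
        worst = std::max(worst, n);
    return worst;
}
```

With 32 threads where only one takes the "if" side, the warp pays for both sides. When the condition comes from a constant buffer, every thread agrees and only one side is ever paid for.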

For the case you're talking about, a dynamic branch is totally appropriate and is likely to give you a performance increase. The branch should be fairly coherent in screen space, so you should get lots of warps/wavefronts that can skip what's inside of the branch. For an even more optimal approach, look into deferred or clustered techniques.

#5274702 Questions on Baked GI Spherical Harmonics

Posted by on 06 February 2016 - 05:34 PM

For The Order we kept track of "dead" probes that were buried under geometry. These were detected by counting the percentage of rays that hit backfaces when baking the probes, and marking as "dead" if over a threshold. Early in the project the probe sampling was done on the CPU, and was done once per object. When doing this, we would detect dead probes during filtering (they were marked with a special value), and give them a filter weight of 0. Later on we moved to per-pixel sampling on the GPU, and we decided that manual filtering would be too expensive. This led us to preprocess the probes by using a flood-fill algorithm to assign dead probes a value from their closest neighbor. We also ended up allowing the lighting artists to author volumes, where any probes inside of the volume would be marked as "dead". This was useful for preventing leaking through walls or floors.
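
For anyone who wants a starting point, one way to do that kind of preprocess is a breadth-first flood fill seeded from all of the live probes at once. This is a generic sketch and not the code we shipped: the ProbeGrid layout, the single float per probe, and the 6-connected neighborhood are all assumptions made to keep the example short.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical probe grid: one scalar value per probe plus a dead flag.
struct ProbeGrid
{
    int sizeX = 0, sizeY = 0, sizeZ = 0;
    std::vector<float> values; // sizeX * sizeY * sizeZ entries
    std::vector<bool> dead;    // true for probes buried in geometry

    int Index(int x, int y, int z) const { return (z * sizeY + y) * sizeX + x; }
};

// Breadth-first flood fill seeded from every live probe, so each dead probe
// ends up with the value of a nearest live probe (measured in grid steps).
// The dead flags themselves are left untouched.
void FillDeadProbes(ProbeGrid& grid)
{
    std::queue<int> frontier;
    std::vector<bool> filled(grid.values.size(), false);
    for (std::size_t i = 0; i < grid.values.size(); ++i)
    {
        if (!grid.dead[i])
        {
            filled[i] = true;
            frontier.push(static_cast<int>(i));
        }
    }

    const int offsets[6][3] = {
        { 1, 0, 0 }, { -1, 0, 0 }, { 0, 1, 0 }, { 0, -1, 0 }, { 0, 0, 1 }, { 0, 0, -1 }
    };
    while (!frontier.empty())
    {
        const int current = frontier.front();
        frontier.pop();
        const int x = current % grid.sizeX;
        const int y = (current / grid.sizeX) % grid.sizeY;
        const int z = current / (grid.sizeX * grid.sizeY);
        for (const auto& o : offsets)
        {
            const int nx = x + o[0], ny = y + o[1], nz = z + o[2];
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= grid.sizeX || ny >= grid.sizeY || nz >= grid.sizeZ)
                continue;
            const int neighbor = grid.Index(nx, ny, nz);
            if (filled[neighbor])
                continue;
            grid.values[neighbor] = grid.values[current];
            filled[neighbor] = true;
            frontier.push(neighbor);
        }
    }
}
```

Since the fill runs offline, the runtime sampling can use plain hardware filtering without any special-casing of dead probes.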

#5274158 A problem about implementing stochastic rasterization for rendering motion blur

Posted by on 03 February 2016 - 08:33 PM

So they're using iFragCoordBase to look up a value in the random time texture. This will essentially "tile" the random texture over the screen, taking MSAA subsamples into account. So if there's no MSAA the random texture will be tiled over 128x128 squares on the screen, while for the 4xMSAA case it will be tiled over 64x64 squares. This ensures that each of the 4 subsamples gets a different random time value inside of the loop.

#5274149 Normalized Blinn Phong

Posted by on 03 February 2016 - 07:50 PM

You should read through the section called "BRDF Characteristics" in chapter 7, specifically the part where they cover directional-hemispherical reflectance. This value is the "area under the function" that Hodgman is referring to, and must be <= 1 in order for a BRDF to obey energy conservation. As Hodgman mentioned, a BRDF can still return a value > 1 for a particular view direction, as long as the result is still <= 1 after integrating over the hemisphere of possible view directions.

#5273767 Shadow Map gradients in Forward+ lighting loop

Posted by on 01 February 2016 - 07:24 PM

In our engine I implemented it the way that you've described. It definitely works, but it consumes extra registers which isn't great. I don't know of any cheaper alternatives that would work with anisotropic filtering.

#5272923 directional shadow map problem

Posted by on 27 January 2016 - 07:48 PM

You can use a bias value that depends on the angle between the surface normal and the direction to the light:

float bias = clamp(0.005 * tan(acos(NoL)), 0, 0.01);
where: NoL = dot(surfaceNormal, lightDirection);

tan(acos(x)) == sqrt(1 - x * x) / x

You really do not want to use the inverse trig functions on a GPU. They are not natively supported by their ALUs, and will cause the compiler to generate a big pile of expensive code.
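
Here is the identity applied to the bias formula, written as plain C++ so that it can be checked against the tan(acos(x)) version on the CPU. The constants 0.005 and 0.01 come from the quoted post. The function names and the clamp that keeps N dot L away from zero are my own additions for the sketch.

```cpp
#include <algorithm>
#include <cmath>

// Keep N dot L in (0, 1] so that neither form divides by zero.
float ClampNdotL(float nDotL)
{
    return std::min(std::max(nDotL, 1e-4f), 1.0f);
}

// Slope-scaled bias using tan(acos(x)) == sqrt(1 - x * x) / x, so no inverse
// trig function is evaluated.
float SlopeScaledBias(float nDotL)
{
    const float c = ClampNdotL(nDotL);
    const float tanTheta = std::sqrt(1.0f - c * c) / c;
    return std::min(std::max(0.005f * tanTheta, 0.0f), 0.01f);
}

// The tan(acos(x)) form from the quoted post, kept only for comparison.
float SlopeScaledBiasReference(float nDotL)
{
    const float c = ClampNdotL(nDotL);
    return std::min(std::max(0.005f * std::tan(std::acos(c)), 0.0f), 0.01f);
}
```

The same rewrite carries over directly to HLSL, where it costs a sqrt and a divide in place of the expanded acos and tan code.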

#5272897 D3d12 : d24_x8 format to rgba8?

Posted by on 27 January 2016 - 05:01 PM

Yes, they mentioned it on some Twitter account, but then does GCN store a 24-bit depth value as 32 bits if a 24-bit depth texture is requested?
Since there is no bandwidth advantage (24 bits need to be stored in a 32-bit location and 8 bits are wasted), might the driver as well promote d24x8 to d32 + r8?

No, they store it as 24-bit fixed point with 8 bits unused. It only uses 32 bits if you request a floating point depth buffer, and they can't promote from fixed point -> floating point since the distribution of precision is different.

[EDIT] Is it possible to copy the depth component to an RGBA8 (possibly typeless) texture, or do I have to use a shader to manually convert the float depth to int, do some bit-shift operations, and store each component separately?

You can only copy between textures that have the same format family.