Member Since 29 Mar 2007

#5285175 When would you want to use Forward+ or Deferred Rendering?

Posted by MJP on 04 April 2016 - 10:28 PM

In The Order we output depth and vertex normals in our prepass so that we could compute AO from our capsule-based proxy occluders. Unfortunately this means having a pixel shader, even if that shader is extremely simple. Rasterizing the scene twice is a real bummer, especially if you want to push higher geometric complexity. But at the same time, achieving decent Z sorting is also pretty hard, unless you're doing GPU-driven rendering and/or you have really good occlusion culling.

#5285174 Soft Particles and Linear Depth Buffers

Posted by MJP on 04 April 2016 - 10:22 PM

Yes, z/w is very non-linear. If you're using a hardware depth buffer, you can compute the original view-space Z value by using the original projection matrix used for transforming the vertices: 


float linearDepth = Projection._43 / (zw - Projection._33);


If you'd like, you can then normalize this value to [0, 1], either by dividing by the far clip plane or by computing z = (z - nearClip) / (farClip - nearClip).


Using a linear depth value for soft particles should give you much more consistent results across your depth range, so I would recommend doing that.
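
As a rough CPU-side check of the reconstruction above, here is a minimal C++ sketch. It assumes a D3D-style left-handed perspective matrix in row-vector convention, where _33 = far / (far - near) and _43 = -near * far / (far - near). The struct and function names are made up for illustration and are not from the original post.

```cpp
#include <cassert>
#include <cmath>

// The two entries of a D3D-style left-handed perspective matrix (row-vector
// convention) that affect depth. The names mirror the HLSL _33/_43 accessors.
struct DepthProjection {
    float m33;  // farClip / (farClip - nearClip)
    float m43;  // -nearClip * farClip / (farClip - nearClip)
};

DepthProjection MakeDepthProjection(float nearClip, float farClip) {
    assert(nearClip > 0.0f && farClip > nearClip);
    const float range = farClip - nearClip;
    return { farClip / range, -nearClip * farClip / range };
}

// What the depth buffer stores after the perspective divide: z/w.
float ProjectDepth(const DepthProjection& p, float viewZ) {
    return (p.m33 * viewZ + p.m43) / viewZ;
}

// The reconstruction from the post: view-space Z from the stored z/w.
float LinearizeDepth(const DepthProjection& p, float zw) {
    return p.m43 / (zw - p.m33);
}

// Optional remap of view-space Z to [0, 1] between the clip planes.
float NormalizeDepth(float viewZ, float nearClip, float farClip) {
    return (viewZ - nearClip) / (farClip - nearClip);
}
```

With nearClip = 0.1 and farClip = 100, a point at view-space Z = 10 already lands above 0.99 in the depth buffer, which shows how little of the z/w range is left for the rest of the scene. Feeding that value back through LinearizeDepth recovers Z = 10 up to floating-point error.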

#5284941 Texture sample as uniform array index.

Posted by MJP on 03 April 2016 - 07:09 PM

All modern hardware that I know of can dynamically index into constant (uniform) buffers. For AMD hardware it's basically the same as using a structured buffer: for an index that can vary per-thread in a wavefront, the shader unit will issue a vector memory load through a V# that contains the descriptor (base address, num elements, etc.). On Nvidia, there are 2 different paths for constant buffers and structured buffers. They recommend using constant buffers if the data is very coherent between threads, since this will be a lower-latency path compared to structured buffers. I have no idea what the situation is for Intel, or for any mobile GPUs.

#5283795 In Game Console window using DirectX 11

Posted by MJP on 27 March 2016 - 11:11 PM

You can use an orthographic projection matrix to map from a standard 2D coordinate system (where (0, 0) is the top left, and (DisplayWidth, DisplayHeight) is the bottom right) to D3D normalized device coordinates (where (-1, -1) is the bottom left and (1, 1) is the top right). DirectXMath has the XMMatrixOrthographicOffCenterLH function which you can use to generate such a matrix. Just fill out the parameters such that Top = 0, Left = 0, Bottom = DisplayHeight, and Right = DisplayWidth. If you look at the documentation for the old D3DX function for doing the same thing, you can see how it generates a matrix that has an appropriate scale and translation.
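
To make the scale and translation concrete, here is a small standalone C++ sketch that applies only the X/Y part of an off-center orthographic projection to a pixel coordinate. It does not call DirectXMath; the terms follow the formula given in the D3DXMatrixOrthoOffCenterLH documentation, and the type and function names are invented for illustration.

```cpp
#include <cassert>
#include <cmath>

struct Float2 { float x, y; };

// Scale and translation that an off-center orthographic matrix applies to X and Y.
struct Ortho2D {
    float scaleX, scaleY;
    float offsetX, offsetY;
};

// Same parameter order as XMMatrixOrthographicOffCenterLH, minus the Z range.
Ortho2D MakeOrthoOffCenter(float left, float right, float bottom, float top) {
    assert(right != left && top != bottom);
    Ortho2D o;
    o.scaleX  = 2.0f / (right - left);
    o.scaleY  = 2.0f / (top - bottom);
    o.offsetX = (left + right) / (left - right);
    o.offsetY = (top + bottom) / (bottom - top);
    return o;
}

// Maps a pixel position (origin at the top left) to normalized device coordinates.
Float2 PixelToNDC(const Ortho2D& o, Float2 pixel) {
    return { pixel.x * o.scaleX + o.offsetX, pixel.y * o.scaleY + o.offsetY };
}
```

For a 1280x720 display you would call MakeOrthoOffCenter(0, 1280, 720, 0). Pixel (0, 0) then maps to (-1, 1), the top-left corner in NDC, and pixel (1280, 720) maps to (1, -1). The negative Y scale is what flips the Y axis.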

#5283446 Per Triangle Culling (GDC Frostbite)

Posted by MJP on 25 March 2016 - 03:10 PM

Nvidia has OpenGL and D3D extensions for a "passthrough" GS that's meant to be fast as long as you can live with the restrictions (no arbitrary amplification, only triangles, no stream out, etc.). So if you could use that to do per-triangle culling, it could potentially be much easier to get it working. If anybody actually tries it, I'd love to hear about the results. :)

#5283071 Math behind anisotropic filtering?

Posted by MJP on 23 March 2016 - 11:42 PM

Is there an article/explanation, and is it standardized somehow or vendor-dependent? (In GL it's not core AFAIK, even if all vendors support it.)

It's an extension in GL because it's patented. :(

#5282547 Iso-/Dimetric tile texture has jagged edges

Posted by MJP on 22 March 2016 - 01:22 AM

I had a similar issue once, and it turned out I was doing windowed mode wrong in terms of calculating the window size to fit the backbuffer size etc., resulting in a vaguely stretched display that was hard to notice for a long while. Maybe you could check your window + DirectX initialisation code?

I was going to say the same thing. You want to make sure that the client area of your window is the same size as your D3D backbuffer, otherwise you'll get really crappy scaling when the backbuffer is blitted onto the window. You can use something like this:

RECT windowRect;
SetRect(&windowRect, 0, 0, backBufferWidth, backBufferHeight);

// Grow the rect so that the client area, not the outer window, matches the backbuffer.
// The throws are placeholder error handling (they need <stdexcept>); use whatever your app does.
BOOL isMenu = (GetMenu(hwnd) != nullptr);
if(AdjustWindowRectEx(&windowRect, style, isMenu, exStyle) == 0)
    throw std::runtime_error("AdjustWindowRectEx failed");

if(SetWindowPos(hwnd, HWND_NOTOPMOST, 0, 0, windowRect.right - windowRect.left, windowRect.bottom - windowRect.top, SWP_NOMOVE) == 0)
    throw std::runtime_error("SetWindowPos failed");

See the docs for AdjustWindowRectEx for more details.

#5282527 EVSM, 2 component vs 4 component

Posted by MJP on 21 March 2016 - 10:54 PM

I went with 16-bit because there are too many artifacts when using the 2-component version of EVSM. Specifically, you run into issues in areas with high geometrical complexity where the receiver surface is non-planar relative to the filter kernel. See the original paper (go to section 7) for some more details.
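
For readers who haven't looked at the two variants side by side, here is a hedged C++ sketch of how they differ at lookup time. It follows the commonly described EVSM formulation: depth is warped with a positive and a negative exponential, the 2-component version keeps moments only for the positive warp, and the 4-component version keeps both and takes the minimum of the two Chebyshev bounds. The names, the constant minimum variance, and the exponents are illustrative assumptions, not code from The Order or from my sample app.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Warped depth pair used by EVSM: a positive and a negative exponential warp.
struct Warped { float pos, neg; };

// Moments stored per shadow-map texel. A 2-component map keeps only the first two.
struct Moments {
    float posM1, posM2;  // E[pos], E[pos^2]
    float negM1, negM2;  // E[neg], E[neg^2] (4-component only)
};

Warped WarpDepth(float depth, float posExponent, float negExponent) {
    assert(depth >= 0.0f && depth <= 1.0f);
    const float d = 2.0f * depth - 1.0f;  // remap [0, 1] to [-1, 1]
    return { std::exp(posExponent * d), -std::exp(-negExponent * d) };
}

// One-tailed Chebyshev upper bound on how much of the filter region is unoccluded.
float ChebyshevUpperBound(float m1, float m2, float t, float minVariance) {
    if (t <= m1) return 1.0f;  // receiver is in front of the mean occluder depth
    const float variance = std::max(m2 - m1 * m1, minVariance);
    const float d = t - m1;
    return variance / (variance + d * d);
}

// 2-component EVSM: only the positive warp constrains the result.
float ShadowEVSM2(const Moments& m, const Warped& receiver, float minVariance) {
    return ChebyshevUpperBound(m.posM1, m.posM2, receiver.pos, minVariance);
}

// 4-component EVSM: both warps constrain the result, so take the tighter bound.
float ShadowEVSM4(const Moments& m, const Warped& receiver, float minVariance) {
    const float posBound = ChebyshevUpperBound(m.posM1, m.posM2, receiver.pos, minVariance);
    const float negBound = ChebyshevUpperBound(m.negM1, m.negM2, receiver.neg, minVariance);
    return std::min(posBound, negBound);
}
```

Because the 4-component result is the minimum of two upper bounds, it can never be brighter than the 2-component result for the same texel. That extra constraint is what helps in the non-planar receiver cases.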


I really noticed it on our characters and faces, due to their dense, curved geometry. I took some screenshots from my sample app in an attempt to replicate the issues that I saw in The Order:


This is 4-component EVSM with 32-bit textures:




This is 2-component EVSM with 32-bit textures (look at the shadow cast by the nose):




And this is 4-component EVSM with 16-bit textures, with the bias and leak reduction turned up:



#5282243 Object Space Lighting

Posted by MJP on 20 March 2016 - 07:04 PM

That said, I don't get the comparisons to REYES and overall it seems like a very special purpose, application-specific approach to rendering


Yeah, I agree that the frequent mentioning of REYES is misleading. The only real commonality with REYES is the idea of not shading per-pixel, and even in that regard REYES has a very different approach (dicing into micropolygons followed by stochastic rasterization).


I also agree that it's pretty well-tailored to their specific style of game, and the general requirements of that genre (big terrain, small meshes, almost no overdraw). I would imagine that to adopt something similar for more general scenes you would need to do a much, much better job of allocating appropriately-sized tiles, and you would need to account for occlusion. I could see maybe going down the megatexture approach of rasterizing out tile IDs, and then analyzing that on the CPU or GPU to allocate memory for tiles. However this implies latency, unless you do it all on the GPU and rasterize your scene twice. Doing it all on the GPU would rule out any form of tiled resources/sparse textures, since you can't update page tables from the GPU.

ptex would be nice for avoiding UV issues (it would also be nice for any kind of arbitrary-rate surface calculations, such as pre-computed lightmaps or real-time GI), but it's currently a PITA to use on the GPU (you need borders for HW filtering, and you need quad->page mappings and lookups).

#5282237 Gamma correction. Is it lossless?

Posted by MJP on 20 March 2016 - 06:49 PM

The sRGB->Linear transformation for color textures will typically happen in the fixed-function texture units, before filtering. You can think of the process as going something like this:


result = 0.0
foreach(texel in filterFootPrint):
    encoded = ReadMemory(texel)
    decoded = DecodeTexel(encoded)  // Decompress from block compression and/or convert from 8-bit to intermediate precision (probably 16-bit)
    linear = sRGBToLinear(decoded)
    result += linear * FilterWeight(texel) // Apply bilinear/trilinear/anisotropic filtering
return ConvertToFloat(result)


It's important that the filtering and sRGB->Linear conversion happen at some precision that's higher than 8-bit, otherwise you will get banding. For sRGB conversion 16-bit fixed point or floating point is generally good enough for this purpose. The same goes for writing to a render target: the blending and linear->sRGB conversion need to happen at higher precision than the storage format, or you will get poor results. You will also get poor results if you write linear data to an 8-bit render target, since there will be insufficient precision in the darker range.
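
To see why the order of decoding and filtering matters, here is a small C++ sketch that uses the standard piecewise sRGB transfer function. It is a software illustration only, not a description of what any particular texture unit does internally, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Standard sRGB decode for an encoded value in [0, 1].
float SRGBToLinear(float c) {
    assert(c >= 0.0f && c <= 1.0f);
    return (c <= 0.04045f) ? c / 12.92f
                           : std::pow((c + 0.055f) / 1.055f, 2.4f);
}

// Decode each 8-bit texel to linear first, then apply the filter weights.
float FilterAfterDecode(const unsigned char* texels, const float* weights, int count) {
    float result = 0.0f;
    for (int i = 0; i < count; ++i)
        result += SRGBToLinear(texels[i] / 255.0f) * weights[i];
    return result;
}

// The wrong order, for comparison: filter the encoded values, then decode once.
float DecodeAfterFilter(const unsigned char* texels, const float* weights, int count) {
    float encoded = 0.0f;
    for (int i = 0; i < count; ++i)
        encoded += (texels[i] / 255.0f) * weights[i];
    return SRGBToLinear(encoded);
}
```

Blending a black texel (0) and a white texel (255) with equal weights gives 0.5 in linear space when you decode first, but only about 0.214 if you filter the encoded values and decode afterwards.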


Probably the vast majority of modern games are gamma-correct. It's a bona fide requirement for PBR, which almost everyone is adopting in some form or another. I seem to recall someone mentioning that Unity maintains a non-gamma-correct rendering path for legacy compatibility, but don't quote me on that.

#5279535 Setting constant buffer 'asynchronously'

Posted by MJP on 04 March 2016 - 02:05 PM

The runtime/driver will synchronize UpdateSubresource for you. This means that if you use a constant buffer for draw call A, update the constant buffer, and then issue draw call B, it will appear as though the buffer were updated in between A and B.

#5279533 Manually copying texture data from a loaded image

Posted by MJP on 04 March 2016 - 01:54 PM

Thank you! No plans for a D3D12 book at the moment.

#5279349 Manually copying texture data from a loaded image

Posted by MJP on 03 March 2016 - 04:08 PM

You're not supposed to fill out D3D12_SUBRESOURCE_FOOTPRINT and D3D12_PLACED_SUBRESOURCE_FOOTPRINT. You need to fill out your D3D12_RESOURCE_DESC for the 2D texture that you want to create, and then pass that to ID3D12Device::GetCopyableFootprints in order to get the array of footprints for each subresource. You can then use the footprints to fill in an UPLOAD buffer, and then use CopyTextureRegion to copy the subresources from your upload buffer to the actual 2D texture resource in a DEFAULT heap.
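
The part that usually trips people up when filling the upload buffer is that rows in the buffer are padded out to the footprint's row pitch, while the source image is typically tightly packed. Below is a hedged, D3D12-free C++ sketch of just that row-by-row copy. In real code the row pitch should come from the footprint returned by GetCopyableFootprints; here it is computed by aligning to 256 bytes, which is assumed to match D3D12_TEXTURE_DATA_PITCH_ALIGNMENT. The struct and function names are hypothetical, and a std::vector stands in for the mapped UPLOAD buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: matches D3D12_TEXTURE_DATA_PITCH_ALIGNMENT.
constexpr std::uint32_t kRowPitchAlignment = 256;

std::uint32_t AlignUp(std::uint32_t value, std::uint32_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
    return (value + alignment - 1) & ~(alignment - 1);
}

struct StagedTexture {
    std::uint32_t rowPitch;           // bytes between the starts of consecutive rows
    std::vector<std::uint8_t> bytes;  // stand-in for the mapped UPLOAD buffer
};

// Copies a tightly packed image into a staging buffer with padded rows.
StagedTexture StageForUpload(const std::uint8_t* pixels, std::uint32_t width,
                             std::uint32_t height, std::uint32_t bytesPerPixel) {
    assert(pixels != nullptr && width > 0 && height > 0 && bytesPerPixel > 0);
    const std::uint32_t tightRowSize = width * bytesPerPixel;
    StagedTexture staged;
    staged.rowPitch = AlignUp(tightRowSize, kRowPitchAlignment);
    staged.bytes.assign(static_cast<std::size_t>(staged.rowPitch) * height, 0);
    for (std::uint32_t y = 0; y < height; ++y) {
        std::memcpy(staged.bytes.data() + static_cast<std::size_t>(y) * staged.rowPitch,
                    pixels + static_cast<std::size_t>(y) * tightRowSize,
                    tightRowSize);
    }
    return staged;
}
```

For a 100x2 RGBA8 image, each source row is 400 bytes but each staged row starts on a 512-byte boundary, so a single memcpy of the whole image would shear every row after the first.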

You may want to look at the HelloTexture sample on GitHub to use as a reference.

EDIT: actually you don't strictly need to use GetCopyableFootprints, see Brian's post below

#5279328 [D3D12] Copying between upload buffers

Posted by MJP on 03 March 2016 - 02:02 PM

You don't want to copy between upload buffers on the CPU. It can be very slow, since the memory will probably be write-combined, which means reads are uncached. Is there really no way for you to figure out how big of a buffer you need ahead of time? Personally I would try to avoid having to create new buffers in the middle of rendering setup.

#5278921 Multiple small shaders or larger shader with conditionals?

Posted by MJP on 01 March 2016 - 05:21 PM

Somebody correct me if I'm wrong, but if every path in a shader takes the same branch it is nearly (not entirely) the same cost as if the changes were compiled in with defines. If you are rendering an object with a specific set of branches that every thread will take it may not be a big deal. If threads take different branches you will eat the cost of all branches.

Branching on constants is nice, because there's no chance of divergence. Divergence is when some threads within a warp/wavefront take one side of the branch, and some take another. It's bad because you end up paying the cost of executing both sides of the branch. When branching on a constant everybody takes the same path, so you only pay some (typically small) cost for checking the constant and executing the branch instructions.
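
As a toy illustration of that cost argument, here is a C++ sketch that models a single warp/wavefront hitting an if/else. It is a deliberately simplified cost model with made-up cost units, not a simulation of any real GPU: a side of the branch costs its full price if at least one lane needs it.

```cpp
#include <cassert>
#include <vector>

// Returns the modeled cost of an if/else for one warp/wavefront.
// laneTakesBranch[i] is true when lane i takes the "if" side.
int BranchCost(const std::vector<bool>& laneTakesBranch,
               int costIfTaken, int costIfNotTaken) {
    assert(!laneTakesBranch.empty());
    bool anyTaken = false;
    bool anyNotTaken = false;
    for (bool taken : laneTakesBranch) {
        if (taken) anyTaken = true;
        else       anyNotTaken = true;
    }
    int cost = 0;
    if (anyTaken)    cost += costIfTaken;     // executed once for the whole group
    if (anyNotTaken) cost += costIfNotTaken;  // lanes on the other side sit idle
    return cost;
}
```

If the branch condition comes from a constant buffer, every lane holds the same value, so only one of the two terms is ever paid. With a per-thread condition, a single lane going the other way is enough to pay for both sides.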

As for whether it's the same cost as compiling an entirely different shader permutation, it may depend on what's in the branch as well as the particulars of the hardware. One potential issue with branching on constants is register pressure. Many GPUs have static register allocation, which means that they compute the maximum number of registers needed by a particular shader program at compile time and then make sure that the maximum is always available when the shader is actually executed. Typically the register file has a fixed size and is shared among multiple warps/wavefronts, which means that if a shader needs lots of registers then fewer warps/wavefronts can be in flight simultaneously. GPUs like to hide latency by swapping out warps/wavefronts, so having fewer in flight limits their ability to hide latency from memory access. So let's say that you have something like this:

// EnableLights comes from a constant buffer
if(EnableLights)
{
    // ... compute lighting here (needs lots of registers) ...
}

By branching you can avoid the direct cost of computing the lighting, but you won't be able to avoid the indirect cost that may occur from increased register pressure. However, if you were to use a preprocessor macro instead of a branch and compile 2 permutations, then the permutation with lighting disabled can potentially use fewer registers and have greater occupancy. But again, this depends quite a bit on the specifics of the hardware as well as your shader code, so you don't want to generalize about this too much. In many cases branching on a constant might have the exact same performance as creating a second shader permutation, or might even be faster due to CPU overhead from switching shaders.
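
To put rough numbers on the occupancy argument, here is a back-of-the-envelope C++ sketch. The hardware limits and the register counts in the usage note are illustrative assumptions (loosely in the range of a GCN-style SIMD) and do not come from the post. Real hardware also allocates registers in coarser granules, which this ignores.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative limits only; check your target hardware's documentation.
constexpr int kRegistersPerLane = 256;  // register budget shared by resident wavefronts
constexpr int kMaxWavesPerSimd  = 10;   // scheduler cap on resident wavefronts

// With static allocation, every resident wavefront reserves the shader's
// worst-case register count, whether or not the expensive branch is taken.
int WavesInFlight(int registersPerThread) {
    assert(registersPerThread > 0 && registersPerThread <= kRegistersPerLane);
    return std::min(kMaxWavesPerSimd, kRegistersPerLane / registersPerThread);
}
```

Suppose the shader with the lighting branch compiled in needs 84 registers and the lighting-free permutation needs 24. Under these assumptions the first gets 3 wavefronts in flight and the second gets the full 10, even if EnableLights is false at runtime and the lighting code never executes.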