
Member Since 29 Mar 2007
Offline Last Active Today, 12:44 AM

#5290200 Irrandiance Volume v.s. 4-Basis PRT in Farcry

Posted by MJP on 05 May 2016 - 12:40 AM

It seems 2-band SH also has ringing. When I implemented my SH2 irradiance volume I used a Lanczos window to reduce it. Since the windowing only involves a few trivial multiplies of the SH coefficients, I don't think it's a big deal from a performance perspective. Maybe the storage and filtering cost isn't really the point of the question.
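
For reference, the windowing described above amounts to one scale factor per SH band. Here is a minimal C++ sketch, assuming the commonly used sinc-based Lanczos factor sinc(pi * l / w) for band l and window size w. The function names and the coefficient layout (one band-0 coefficient followed by three band-1 coefficients) are hypothetical and not taken from the post.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Lanczos factor for SH band `band` with window size `windowSize`:
// sinc(pi * band / windowSize). Band 0 is left untouched (factor 1).
inline float LanczosFactor(int band, float windowSize)
{
    assert(windowSize > 0.0f);
    if (band == 0)
        return 1.0f;
    const float kPi = 3.14159265358979f;
    const float x = kPi * static_cast<float>(band) / windowSize;
    return std::sin(x) / x;
}

// Windows 2-band SH in place. Assumed layout: coeffs[0] is band 0,
// coeffs[1..3] are the three band-1 coefficients.
inline void WindowSH2(std::array<float, 4>& coeffs, float windowSize)
{
    const float band1 = LanczosFactor(1, windowSize);
    for (std::size_t i = 1; i < coeffs.size(); ++i)
        coeffs[i] *= band1;
}
```

A smaller window size filters more aggressively, which reduces ringing at the cost of directionality.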


FarCry's motivation really confused me for a while, until by chance I found that The Order: 1886 (Sig’15 course) also used a multi-basis SG baking solution. One of the most interesting things about the course is that they shared some experience with using SH3 irradiance cubes to represent HDR lighting: namely, HDR lighting can cause some SH lobes to take very large negative values to cancel out the high positive coefficients, which is really bad for baking quality and compression.


So I finally found my own answer: don't ever use an SH irradiance cube under HDR lighting. A low-band SH irradiance-cube representation may be far from accurate under HDR lighting, and it's not suitable as baking output. Use a multi-basis PRT method instead.


Indeed, that was the conclusion we eventually came to while working on The Order. SH has some really great properties, but ultimately it doesn't do well for storing arbitrary lighting environments. It's not so bad if you're storing very low-frequency data from indirect lighting, but if you ever try to bake in direct lighting from an area light source the result is unusable without filtering. But then once you filter, you completely lose the directionality, which also doesn't look right. SG's are much better in this regard, and also have the capability of storing higher-frequency signals.

#5289015 [D3D12] Binding multiple shader resources

Posted by MJP on 27 April 2016 - 06:44 PM

The CopyDescriptors approach is mostly for convenience and rapid iteration, since it doesn't require you to have descriptors in a contiguous table until you're ready to draw. For a real engine where you care about performance, you'll probably want to pursue something along the lines of what Jesse describes: put your descriptors in contiguous tables from the start, so that you're not constantly copying things around while you're building up your command buffers.


I also want to point out that the sample demonstrates another alternative to both approaches in its use of indexing into descriptor tables. In that sample it works by grabbing all of the textures needed to render the entire scene, putting them in one contiguous descriptor table, and then looking up the descriptor indices from a structured buffer using the material ID. Using indices can effectively give you an indirection, which means that your descriptors don't necessarily have to be contiguous inside the descriptor heap.
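
To illustrate the indirection, here is a CPU-side C++ analogy rather than actual D3D12 or HLSL code. The vector of descriptors stands in for the contiguous descriptor table, and the vector of per-material records stands in for the structured buffer. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-material record, standing in for one structured buffer element.
struct MaterialTextureIndices
{
    std::uint32_t albedo;  // index into the descriptor table
    std::uint32_t normal;  // index into the descriptor table
};

// Stand-in for a descriptor: here it is just an ID of the texture it refers to.
using Descriptor = std::uint32_t;

// Fetches a material's albedo descriptor through the index indirection.
// The material's textures can live anywhere in the table, in any order.
inline Descriptor FetchAlbedo(const std::vector<Descriptor>& table,
                              const std::vector<MaterialTextureIndices>& materials,
                              std::uint32_t materialID)
{
    assert(materialID < materials.size());
    const std::uint32_t index = materials[materialID].albedo;
    assert(index < table.size());
    return table[index];
}
```

Because each material only stores indices, two materials can share a descriptor, and nothing requires one material's descriptors to sit next to each other.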

#5288662 How does material layering work ?

Posted by MJP on 25 April 2016 - 03:21 PM

This seems to be a really nice workflow for artists, as they have some kind of material library which they can customize and blend to obtain advanced materials on complicated objects. This seems to be the best with regard to performance.

Yes, I would say that it has worked out very well for us. It helps divide the responsibility appropriately among the content team: a lot of environment artists can just pull from common material libraries and composite them together in order to create unique level assets. At the same time our texture/shader artists can author the most low-level material templates, and whenever they make changes they are automatically propagated to the final runtime materials.

So you are using some kind of uber shader that accepts multiple albedos, normals, etc., each with its associated tiling and offsets, with a masking texture for each layer?

Yup. We have an ubershader that has a for loop over all of the material layers, but we generate a unique shader for every material with certain constants and additional code compiled in. The number of layers ends up being a hard-coded constant at compile time, and so we unroll the loop that samples the textures for each layer and blends the resulting parameters.
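
As a rough illustration of the compile-time layer count, here is a C++ sketch in which a template parameter plays the role of the hard-coded constant. This is not the actual shader: the LayerSample fields, the simple lerp blend, and all names are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical per-layer parameters, as if already sampled from that layer's textures.
struct LayerSample
{
    float albedo;     // single channel to keep the sketch short
    float roughness;
    float mask;       // blend weight in [0, 1] from the layer's mask texture
};

inline float Lerp(float a, float b, float t) { return a + (b - a) * t; }

// NumLayers is known at compile time, so the compiler is free to unroll the loop.
// Layer 0 is the base; each later layer is blended over the running result.
template <std::size_t NumLayers>
LayerSample BlendLayers(const std::array<LayerSample, NumLayers>& layers)
{
    static_assert(NumLayers > 0, "need at least a base layer");
    LayerSample result = layers[0];
    for (std::size_t i = 1; i < NumLayers; ++i)
    {
        assert(layers[i].mask >= 0.0f && layers[i].mask <= 1.0f);
        result.albedo = Lerp(result.albedo, layers[i].albedo, layers[i].mask);
        result.roughness = Lerp(result.roughness, layers[i].roughness, layers[i].mask);
    }
    return result;
}
```

Each distinct NumLayers instantiates its own function, which loosely mirrors generating a unique shader per material.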

You might also have multiple draw calls from those layers, which are not present in the above techniques, right? This has some performance cost; can it be neglected?

I don't think you would ever want to have multiple draw calls for runtime layer blending. It would likely be quite a bit more expensive than doing it all in a loop in the pixel shader.

#5288474 How does material layering work ?

Posted by MJP on 24 April 2016 - 12:29 PM

I'm not really familiar with the Allegorithmic tools, but I can certainly explain how the material compositing works for The Order. Our compositing process is primarily offline: we have a custom asset processing system that produces runtime assets, and one of the processors is responsible for generating the final composite material textures. The tools expose a compositing stack that's similar to layers in Photoshop: the artists pick a material for each layer in the stack, and each layer is blended with the layer below it. Each layer specifies a material asset ID, a blend mask, and several other parameters that can be used to customize how exactly the layers are composited (for instance, using multiply blending for albedo maps). The compositing itself is done in a pixel shader, but again this is all an offline process. At runtime we just end up with a set of maps containing the result of blending together all of the materials in the stack, so it's ready to be sampled and used for shading. This is nice for runtime performance, since you already did all of the heavy lifting during the build process.


The downside of offline compositing is that you're ultimately limited by the final output resolution of your composite texture, so that has to be chosen carefully. To help mitigate that problem we also support up to 4 levels of runtime layer blending, which is mostly used by static geometry to add some variation to tiled textures. So for instance you might have a wall with a brick texture tiled over it 10 times horizontally, which would obviously look tiled if you only had that layer. With runtime blending you can add some moss or some exposed mortar to break up the pattern without having to offline composite a texture that's 10x the size. 


With UE4 all of the layers are composited at runtime. So the pixel shader iterates through all layers, determines the blend amount, and if necessary samples textures from that layer so that it can blend the parameters with the previous layer. If you do it this way you avoid needing complex build processes to generate your maps, and you also can decouple the texture resolution of your layers. But on the other hand, it may get expensive to blend lots of layers.

#5288056 [D3D12] About CommandList, CommandQueue and CommandAllocator

Posted by MJP on 21 April 2016 - 04:59 PM

GPU's can also pre-fetch command buffer memory in order to hide any latency. Pre-fetching is easy in this case because the front-end will typically just march forward, since jumps are not common (unless you're talking about the PS3 :D)

#5287886 How to blend World Space Normals

Posted by MJP on 20 April 2016 - 08:48 PM

You'll want to use the vertex normal vector, since this is what determines the Z basis in your tangent frame.

#5287884 [D3D12] About CommandList, CommandQueue and CommandAllocator

Posted by MJP on 20 April 2016 - 08:44 PM

There is no implied copy from CPU->GPU memory when you submit a command list. GPU's are perfectly capable of reading from CPU memory across PCI-e, and on some systems the CPU and GPU may even share the memory.

#5287707 How to blend World Space Normals

Posted by MJP on 19 April 2016 - 07:34 PM

The sample implementation of RNM on that blog post assumes that the "s" vector is a unit z vector, which is the case for tangent-space normal maps. This is represented in equations 5/6/7. If you want to work in world-space, then you need to implement equation 4 as a function that takes s as an additional parameter:


float3 ReorientNormal(in float3 u, in float3 t, in float3 s)
{
    // Build the shortest-arc quaternion that rotates s onto t
    float4 q = float4(cross(s, t), dot(s, t) + 1) / sqrt(2 * (dot(s, t) + 1));
    // Rotate the normal
    return u * (q.w * q.w - dot(q.xyz, q.xyz)) + 2 * q.xyz * dot(q.xyz, u) + 2 * q.w * cross(q.xyz, u);
}


If you pass float3(0, 0, 1) as the "s" parameter, then you will get the same result as the pre-optimized version. However the compiler may not be able to optimize it as well as the hand-optimized code provided in the blog.
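
If you want to sanity-check the function outside of a shader, here is a direct C++ port of it. The Vec3 type and helper functions are hypothetical stand-ins for the HLSL intrinsics, and the assert encodes an assumption that s and t are unit length and not opposite, since the quaternion is undefined in that case.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, float k) { return {a.x * k, a.y * k, a.z * k}; }
inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 Cross(Vec3 a, Vec3 b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Rotates u by the shortest-arc rotation that takes s onto t.
// Assumes s and t are unit length and do not point in opposite directions.
inline Vec3 ReorientNormal(Vec3 u, Vec3 t, Vec3 s)
{
    const float d = Dot(s, t) + 1.0f;
    assert(d > 1e-6f);
    const float invLen = 1.0f / std::sqrt(2.0f * d);
    const Vec3 q = Cross(s, t) * invLen;  // quaternion xyz
    const float w = d * invLen;           // quaternion w
    return u * (w * w - Dot(q, q)) + q * (2.0f * Dot(q, u)) + Cross(q, u) * (2.0f * w);
}
```

A quick check is that passing u = s returns t, and that a unit-length u stays unit length after the rotation.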

#5286063 D3D alternative for OpenGL gl_BaseInstanceARB

Posted by MJP on 09 April 2016 - 03:09 PM

ExecuteIndirect supports setting arbitrary 32-bit constants through the D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT argument type. You can use this to specify transform/materialID data per-draw without having to abuse the instance offset. You can also set a root CBV or SRV via a GPU virtual address, which means you can use that to directly specify a pointer to the draw's transform data or material data.

#5285926 PIXBeginEvent and PIXEndEvent member functions on CommandList object

Posted by MJP on 08 April 2016 - 04:55 PM

The documentation you linked to is the old pre-release documentation. The final documentation doesn't list those methods. Instead it has BeginEvent and EndEvent, which are called by the helper functions in pix.h.

#5285343 [D3D12] Synchronization on resources creation. Need a fence?

Posted by MJP on 05 April 2016 - 02:55 PM

Yeah, there's no need to wait for commands to finish executing because they don't actually issue any commands. If you look at some of the other samples, they all have a wait at the end of LoadAssets. They do this so that they can ensure that any GPU copies finish before they destroy upload resources. So for instance if you look at the HelloTexture sample, it goes like this:


  • Create upload resource
  • Map upload resource, and fill it with data
  • Issue GPU copy commands on a direct command list
  • Submit the direct command list
  • Wait for the GPU to finish executing the command list
  • ComPtr destructor calls Release on the upload resource, destroying it

#5285175 When would you want to use Forward+ or Differed Rendering?

Posted by MJP on 04 April 2016 - 10:28 PM

In The Order we output depth and vertex normals in our prepass so that we could compute AO from our capsule-based proxy occluders. Unfortunately this means having a pixel shader, even if that shader is extremely simple. Rasterizing the scene twice is a real bummer, especially if you want to push higher geometric complexity. But at the same time achieving decent Z sorting is also pretty hard, unless you're doing GPU-driven rendering and/or you have really good occlusion culling.

#5285174 Soft Particles and Linear Depth Buffers

Posted by MJP on 04 April 2016 - 10:22 PM

Yes, z/w is very non-linear. If you're using a hardware depth buffer, you can compute the original view-space Z value by using the original projection matrix used for transforming the vertices: 


float linearDepth = Projection._43 / (zw - Projection._33);


If you'd like, you can then normalize this value to [0, 1], either by dividing by the far clip plane or by doing z = (z - nearClip) / (farClip - nearClip).


Using a linear depth value for soft particles should give you much more consistent results across your depth range, so I would recommend doing that.
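
To see why the formula works, here is a small C++ round trip. It assumes a standard left-handed D3D perspective projection in row-vector convention, where _33 = far / (far - near) and _43 = -near * far / (far - near). The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// The two projection terms the reconstruction needs.
struct DepthTerms
{
    float m33;  // Projection._33 = far / (far - near)
    float m43;  // Projection._43 = -near * far / (far - near)
};

inline DepthTerms MakeDepthTerms(float nearClip, float farClip)
{
    assert(nearClip > 0.0f && farClip > nearClip);
    const float range = farClip - nearClip;
    return {farClip / range, -nearClip * farClip / range};
}

// What the depth buffer stores: post-projection z divided by w,
// where w is the view-space z.
inline float ProjectDepth(const DepthTerms& p, float viewZ)
{
    return (viewZ * p.m33 + p.m43) / viewZ;
}

// Recovers view-space z from a depth buffer value.
inline float LinearizeDepth(const DepthTerms& p, float zw)
{
    return p.m43 / (zw - p.m33);
}

// Optional remap of view-space z to [0, 1] between the clip planes.
inline float NormalizeDepth(float viewZ, float nearClip, float farClip)
{
    return (viewZ - nearClip) / (farClip - nearClip);
}
```

Since zw = _33 + _43 / z, solving for z gives z = _43 / (zw - _33), which is the expression above.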

#5284941 Texture sample as uniform array index.

Posted by MJP on 03 April 2016 - 07:09 PM

All modern hardware that I know of can dynamically index into constant (uniform) buffers. For AMD hardware it's basically the same as using a structured buffer: for an index that can vary per-thread in a wavefront, the shader unit will issue a vector memory load through a V# that contains the descriptor (base address, num elements, etc.). On Nvidia, there are 2 different paths for constant buffers and structured buffers. They recommend using constant buffers if the data is very coherent between threads, since this will be a lower-latency path compared to structured buffers. I have no idea what the situation is for Intel, or any mobile GPU's.

#5283795 In Game Console window using DirectX 11

Posted by MJP on 27 March 2016 - 11:11 PM

You can use an orthographic projection matrix to map from a standard 2D coordinate system (where (0,0) is the top left, and (DisplayWidth, DisplayHeight) is the bottom right) to D3D normalized device coordinates (where (-1, -1) is the bottom left and (1, 1) is the top right). DirectXMath has the XMMatrixOrthographicOffCenterLH function which you can use to generate such a matrix. Just fill out the parameters such that Top = 0, Left = 0, Bottom = DisplayHeight, and Right = DisplayWidth. If you look at the documentation for the old D3DX function for doing the same thing, you can see how it generates a matrix with an appropriate scale and translation.
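
To make the scale and translation concrete, here is a portable C++ sketch that computes only the x/y terms of an off-center orthographic projection, so it does not depend on DirectXMath. The Ortho2D and Float2 types and the function names are hypothetical, and the z terms are left out because they don't affect the 2D mapping.

```cpp
#include <cassert>

struct Float2 { float x, y; };

// Scale and translation that an off-center orthographic projection applies to x and y.
struct Ortho2D
{
    float scaleX, scaleY;
    float offsetX, offsetY;
};

inline Ortho2D MakeOrtho2D(float left, float right, float bottom, float top)
{
    assert(right != left && top != bottom);
    return {2.0f / (right - left), 2.0f / (top - bottom),
            (left + right) / (left - right), (top + bottom) / (bottom - top)};
}

// Maps a point from the 2D coordinate system to normalized device coordinates.
inline Float2 ToNDC(const Ortho2D& m, Float2 p)
{
    return {p.x * m.scaleX + m.offsetX, p.y * m.scaleY + m.offsetY};
}
```

With Left = 0, Right = DisplayWidth, Bottom = DisplayHeight and Top = 0, the y scale comes out negative, so pixel (0, 0) lands at (-1, 1) and (DisplayWidth, DisplayHeight) lands at (1, -1).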