
#5253542 Deconstructing Post process of the Order 1886

Posted by MJP on 22 September 2015 - 04:36 PM

The post-processing chain went like this:

* Motion blur - full resolution, based on Morgan McGuire's work with optimizations inspired by Jorge Jimenez's presentation on COD: AW

* Depth of field - half resolution, with 7x7 bokeh-shaped gather patterns similar to what Tiago Sousa proposed in his CryEngine 3 presentation from SIGGRAPH 2013.

* Bloom - half resolution, separable 21x21 Gaussian filter in a compute shader (a rough sketch of one separable blur pass follows after this list)

* Lens flares - 1/4 resolution, with several custom screen-space filter kernels for different shapes (aperture, streaks, etc.). Originally implemented using FFT to convolve arbitrary artist-specified kernels (painted in textures) in the frequency domain, but at the very end of the project we switched to fixed kernels since we only ever used one fixed set of shape textures

* Tone Mapping - combined chromatic aberration, lens distortion, film grain, exposure, tone mapping, bloom + lens flare composite, and color correction
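
As an illustration of the bloom step, here's a minimal sketch of one horizontal pass of a separable Gaussian blur in a compute shader. The resource names, thread group size, and sigma are all hypothetical, and this is just the general pattern rather than the shipping implementation (a matching vertical pass would run afterwards):

// One horizontal pass of a separable Gaussian blur; run a vertical pass afterwards.
Texture2D<float4>   BloomInput  : register(t0);
RWTexture2D<float4> BloomOutput : register(u0);

static const int Radius = 10;   // 21 taps per pass, matching a 21x21 separable kernel

[numthreads(8, 8, 1)]
void BlurHorizontalCS(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    uint2 texSize;
    BloomInput.GetDimensions(texSize.x, texSize.y);

    const float sigma = 3.5f;   // hypothetical; controls the width of the blur

    float4 sum = 0.0f;
    float totalWeight = 0.0f;
    for(int i = -Radius; i <= Radius; ++i)
    {
        float weight = exp(-(i * i) / (2.0f * sigma * sigma));
        uint2 coord = uint2(clamp(int(dispatchThreadID.x) + i, 0, int(texSize.x) - 1), dispatchThreadID.y);
        sum += BloomInput[coord] * weight;
        totalWeight += weight;
    }

    // Normalize so the weights sum to 1, then write the blurred result
    BloomOutput[dispatchThreadID.xy] = sum / totalWeight;
}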

#5253538 Dealing with D3DX____ functions being deprecated.

Posted by MJP on 22 September 2015 - 04:20 PM

I really doubt that you're ever going to see an Effects library for DX12. Huge parts of it would have to be rewritten in order to support things like the new resource binding model, PSO's, and manual memory management/synchronization.

#5253225 Alpha to Coverage w/o MSAA

Posted by MJP on 20 September 2015 - 04:58 PM

The original implementations of alpha to coverage would take your alpha value and convert it into a dither pattern, where that pattern would span either a 2x2 or 4x4 pixel grid, as well as the MSAA subsamples within those pixels. At various alpha values those pixels and/or subsamples would "turn on", giving you a stochastic approximation of transparency. If you don't use MSAA this means you have only 1 subsample per pixel, and so the dither pattern will only apply to entire pixels. Later on, some drivers and GPU's switched to only dithering within subsamples rather than across pixels. Subsample dithering can look better than pixel dithering, since you actually get a few levels of transparency based on the number of subsamples. Without MSAA, however, it degenerates to a simple alpha test.

If you're using newer GPU's and API's, then you can manually output the coverage mask from your pixel/fragment shader and do whatever you want, rather than relying on the driver to do it. So you can do dithering across pixels, subsample-only dithering, or something else entirely.
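
For example, here's a minimal sketch of the no-MSAA case, where the shader writes the coverage mask itself and dithers across pixels; the dither pattern and the names are made up, it's just meant to show the SV_Coverage mechanism:

// With no MSAA there's only 1 sample per pixel, so the coverage mask is effectively on/off.
// Comparing alpha against a per-pixel dither threshold gives dithering across pixels.
struct PSOutput
{
    float4 color    : SV_Target0;
    uint   coverage : SV_Coverage;   // 1 bit per sample; just 1 sample without MSAA
};

PSOutput PSMain(float4 svPosition : SV_Position)
{
    PSOutput output;

    float alpha = 0.5f;   // would normally come from a texture or material parameter
    output.color = float4(1.0f, 1.0f, 1.0f, alpha);

    // Hypothetical ordered-dither threshold in [0, 1), repeating over a 2x2 pixel block
    float2 pixel = floor(svPosition.xy);
    float threshold = frac(dot(pixel, float2(0.25f, 0.5f))) + 0.125f;

    output.coverage = (alpha > threshold) ? 1u : 0u;
    return output;
}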

#5253223 Direct3D11 Multithreading

Posted by MJP on 20 September 2015 - 04:52 PM

Just so that you know ahead of time, it's unlikely that you'll get any significant performance gains out of D3D11 deferred contexts. Unfortunately the way that the API is structured prevents most drivers from actually building command buffers ahead of time, and so you still end up serializing a lot of the work. That said, it could still definitely be a good learning experience, and it can also help you get your engine ready for D3D12 and/or Vulkan, where multithreading is actually useful.

By the way, there is a multithreaded rendering sample available on MSDN.

#5252766 Efficient rendering of a dynamic grid

Posted by MJP on 17 September 2015 - 03:50 PM

Instead of a structured buffer, can you use a "regular" buffer with DXGI_FORMAT_R8_UINT? If you use this, then it will work in the shader just as if it were a byte[] array, except that during the load the 8-bit value will be zero-extended to a 32-bit uint (which is fine). If you can't do that with Unity, then you can still do it with a structured buffer. You'll just need to load 4 bytes at a time, and mask and shift the uint result in order to get the byte that you're looking for. Like this:

uint elemIdx = byteIdx / 4;                                    // which 32-bit element holds the byte
uint bufferValue = GridBuffer[elemIdx];
uint gridValue = (bufferValue >> ((byteIdx % 4) * 8)) & 0xFF;  // shift by whole bytes, then mask off the byte
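
If the typed-buffer route is available to you, the DXGI_FORMAT_R8_UINT version described above would just look like this (GridByteBuffer is a hypothetical name for the SRV created over that buffer):

Buffer<uint> GridByteBuffer;                // SRV over a buffer viewed as DXGI_FORMAT_R8_UINT
uint gridValue = GridByteBuffer[byteIdx];   // each 8-bit element is zero-extended to a 32-bit uint
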
As for updating the buffer, how big is the buffer that you're updating? The drivers are definitely optimized for the case of updating GPU resources from the CPU, and it shouldn't be an issue unless your buffer is really big. Depending on how many tile states you have, you could also possibly reduce your memory requirements by packing your tile state into less than 8 bits.

#5252754 Passing Computations from VS to PS

Posted by MJP on 17 September 2015 - 03:12 PM

Are these D3D9 shaders, or D3D11?

#5252584 Passing Computations from VS to PS

Posted by MJP on 16 September 2015 - 07:19 PM

The only way to pass data from the VS to the PS is through interpolants. Is this what you mean by "slots of my PS buffer"?

On certain GPU's having too many interpolants can be bad for performance, so I wouldn't recommend trying to use a lot of them just to save a few computations in the pixel shader. How much data are you trying to pass from the VS to the PS?
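
To make the mechanism concrete, here's a minimal sketch of passing a per-vertex computation down to the pixel shader through an interpolant; the constant buffer, semantics, and the computation itself are all hypothetical:

cbuffer PerObject : register(b0)
{
    float4x4 WorldViewProjection;   // hypothetical per-object transform
};

struct VSOutput
{
    float4 position    : SV_Position;
    float3 precomputed : TEXCOORD0;   // extra interpolant carrying per-vertex work to the PS
};

VSOutput VSMain(float3 positionOS : POSITION, float3 normalOS : NORMAL)
{
    VSOutput output;
    output.position = mul(float4(positionOS, 1.0f), WorldViewProjection);
    output.precomputed = normalOS * 0.5f + 0.5f;   // any per-vertex computation you want to reuse
    return output;
}

float4 PSMain(VSOutput input) : SV_Target0
{
    return float4(input.precomputed, 1.0f);   // the interpolated value arrives here
}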

#5252418 [D3D12] Scissor Rectangle Culling

Posted by MJP on 15 September 2015 - 02:38 PM

Yeah, that's very weird. I did a grep through the headers, and I couldn't find anything for enabling or disabling the scissor test. I guess that you need to always set it to the viewport size if you don't want anything to be culled.

#5252395 Tile-based deferred shading questions

Posted by MJP on 15 September 2015 - 01:01 PM

1. You can certainly handle shadow casting lights, you just need all of the shadow maps to be in memory at the same time. For the last game I worked on we kept spotlight shadow maps in a texture array with 16 elements, and then had a separate 4 element array for the directional light cascades.

2. You can handle other light types, like spotlights. Lights are culled per-tile by computing the planes of a sub-frustum that surrounds the tile, and then testing the light's bounding volume for intersection with that frustum. So for a point light you do a sphere/frustum test, and for a spotlight you can do a cone/frustum test. Just be aware that both sphere/frustum and cone/frustum can have false positives when you're doing the typical "test the volume against each plane" approach.

In case you didn't get this from the paper, the reason they do this is so that each thread group can cull a bunch of lights in parallel. So basically during the culling phase you assign a different light to each of your N threads in your thread group, and then append each intersecting light to a list in thread group shared memory. Then in the second phase each thread loops over the entire list of intersecting lights, and computes the light contribution for a single pixel. (A rough compute shader sketch of this two-phase pattern is included at the end of this reply.)

3. Sure, that works. I'm pretty sure that's how the sample does it as well.

4. One of the main advantages of this approach is that you avoid blending. Basically you combine the light contributions for all lights (or at least, many lights) inside of your compute shader, and then write out the combined result to your texture. This saves a lot of bandwidth, since you don't have to do read/modify/write for every single light source. If you need to do multiple tiled passes, you can still do that with a compute shader approach. Just be aware that in D3D11 you can't read from a RWTexture2D unless it has R32_FLOAT/R32_UINT/R32_INT format*. This means that you can't do a manual read/modify/write for a R16G16B16A16_FLOAT texture. If you want to use an fp16 format, you'll need to ping pong between two textures so that you can read from one and write to the other.

*This restriction was relaxed for FEATURE_LEVEL_12_0 hardware, which now supports typed UAV loads for additional formats.
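
For reference, here's a rough sketch of the two-phase cull-then-shade pattern from points 2 and 4. The light structure, resource names, and tile size are hypothetical, the frustum test is left as a placeholder, and the "shading" is just a stand-in, so treat it as an outline rather than a working implementation:

#define TILE_SIZE  16
#define MAX_LIGHTS 1024

struct PointLight
{
    float3 positionVS;   // view-space position
    float  radius;
    float3 color;
    float  padding;
};

StructuredBuffer<PointLight> Lights        : register(t0);
RWTexture2D<float4>          OutputTexture : register(u0);

cbuffer Constants : register(b0)
{
    uint NumLights;
    // projection parameters for building the tile's sub-frustum planes would go here
};

groupshared uint TileLightIndices[MAX_LIGHTS];
groupshared uint TileNumLights;

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void TiledLightingCS(uint3 groupID : SV_GroupID,
                     uint3 dispatchThreadID : SV_DispatchThreadID,
                     uint  groupIndex : SV_GroupIndex)
{
    if(groupIndex == 0)
        TileNumLights = 0;
    GroupMemoryBarrierWithGroupSync();

    // Phase 1: each thread tests a different subset of the lights against this tile
    for(uint lightIdx = groupIndex; lightIdx < NumLights; lightIdx += TILE_SIZE * TILE_SIZE)
    {
        // Placeholder: a real shader would test the light's bounding volume against the
        // 4 side planes of the tile's sub-frustum (plus the tile's min/max depth)
        bool intersectsTile = true;

        if(intersectsTile)
        {
            uint listIdx;
            InterlockedAdd(TileNumLights, 1, listIdx);
            TileLightIndices[listIdx] = lightIdx;
        }
    }
    GroupMemoryBarrierWithGroupSync();

    // Phase 2: each thread shades one pixel using only the lights that touch this tile,
    // accumulating everything before doing a single write (no blending required)
    float3 lighting = 0.0f;
    for(uint i = 0; i < TileNumLights; ++i)
    {
        PointLight light = Lights[TileLightIndices[i]];
        // Placeholder shading: a real shader would reconstruct position from the depth buffer,
        // read the G-Buffer, and evaluate the BRDF for this light
        lighting += light.color;
    }

    OutputTexture[dispatchThreadID.xy] = float4(lighting, 1.0f);
}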

#5252084 Reversed Shadow Map for omni-directional light

Posted by MJP on 13 September 2015 - 03:39 PM

It's either getting reversed in your view/projection matrix that you're using for rasterizing the shadow map, or it's happening when you sample the shadow map. You can check the first case pretty easily by capturing a frame in RenderDoc or in the VS Graphics Debugger, and inspecting the resulting depth texture.

#5252083 Texture arrays vs. multiple (single) textures?

Posted by MJP on 13 September 2015 - 03:36 PM

[edit] Yeah, on AMD at least, the texture descriptor that's being used for the sample/load instruction has to be stored in SGPRs (aka constant/uniform memory), so it's not possible for those GPUs, which normally shade 64 pixels in parallel, to fetch from a texture descriptor that varies per pixel. If it does work, it will be at 1/64th throughput due to the serialization. See 8.2.1 Image Instructions.

They can do it. They basically have to construct a while loop that handles each divergent case by putting the descriptor into SGPR's, and terminates once all possible cases have been handled. So if you have a wavefront that uses 4 different descriptors, then the loop will execute 4 times and you'll get at most 1/4 of the normal throughput.
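
For reference, here's a minimal Shader Model 5.1-style sketch of the divergent case being discussed, with hypothetical resource names. The NonUniformResourceIndex intrinsic is what tells the compiler/driver that the index can vary across the wavefront, so that it generates the kind of waterfall loop described above:

// Hypothetical unbounded texture array indexed per-pixel (D3D12 / SM 5.1 style)
Texture2D    MaterialTextures[] : register(t0, space1);
SamplerState LinearSampler      : register(s0);

float4 PSMain(float2 uv : TEXCOORD0,
              nointerpolation uint materialIndex : MATERIALID) : SV_Target0
{
    // Without NonUniformResourceIndex the compiler is allowed to assume the index is
    // uniform across the wave, which can give incorrect results when it diverges
    return MaterialTextures[NonUniformResourceIndex(materialIndex)].Sample(LinearSampler, uv);
}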

#5252082 when will we see DXGI_FORMAT_R64G64B64A64_FLOAT?

Posted by MJP on 13 September 2015 - 03:32 PM

GPU's support 64-bit about as well as CPU's do: they support 64-bit pointers, can perform 64-bit arithmetic on those pointers, can store 64-bit values in registers (usually by using 2 adjacent 32-bit registers), can work with double-precision floating point, etc. HLSL is a little behind in this regard, but it's hardly a big deal since almost all graphics workloads are going to work with 8-bit, 16-bit, or 32-bit values.

As for a 64-bit per component texture format, I don't see it happening anytime soon. You can already write 64-bit values to buffers and textures by using 2 32-bit values, and those cases are probably too rare to justify making the ROP's and texture units more complicated. It's probably a better tradeoff to spend that silicon on more ALU and cache, since that would benefit all workloads.
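
As an example of the "2 x 32-bit" approach, here's a minimal sketch of storing and reloading a double through a pair of 32-bit values using the asuint/asdouble intrinsics (requires double-precision support in the hardware; the buffer name is hypothetical):

RWBuffer<uint> PackedValues : register(u0);   // hypothetical R32_UINT buffer holding pairs of 32-bit halves

void StoreDouble(uint index, double value)
{
    uint lowBits, highBits;
    asuint(value, lowBits, highBits);          // split the 64-bit value into two 32-bit halves
    PackedValues[index * 2 + 0] = lowBits;
    PackedValues[index * 2 + 1] = highBits;
}

double LoadDouble(uint index)
{
    return asdouble(PackedValues[index * 2 + 0], PackedValues[index * 2 + 1]);
}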

#5251421 The Order 1886: Spherical Gaussian Lightmaps

Posted by MJP on 09 September 2015 - 03:43 PM

There's no information about which segmentation, parametrization, and box packing algorithms were used in The Order 1886. For me that's the most complicated part of a pre-baked GI implementation.

Indeed, and that's a pretty complex topic on its own. In most cases we just re-use the base parameterization that the artists created for the mesh, instead of automatically computing a new one. We essentially extract each separate UV chart from Maya, and then run a packing algorithm to pack all of the charts from all of the meshes into a set of atlased textures. For The Order this algorithm was complicated and slow, since it tried many different positions and orientations to try to find the tightest fit. However we've since started moving to a simpler algorithm that respects 4x4 BC6H tile boundaries, which has more unused space but is much much faster, and also avoids compression artifacts. I unfortunately didn't do any of the implementation for this part of the pipeline, so I'm not sure if I could give a thorough overview. But perhaps I can convince my coworker to write up a blog post, or something similar.

#5251419 The Order 1886: Spherical Gaussian Lightmaps

Posted by MJP on 09 September 2015 - 03:35 PM

About the Maya integration - what are your thoughts on that in hindsight? I only dabbled in Mel/Py to do small tasks. I'm assuming most of the heavy lifting was written with the Maya Cpp SDK?

Sorry, I forgot to reply to this!

There were a lot of ups and downs with having such tight Maya integration. In general the artists were big fans, since they were already familiar with Maya and did a lot of work in there anyway. Being able to render with our engine inside of the Maya viewport was a big win for them, since they could see exactly what the game would look like as they were modeling. We also made it so that our material editor could run inside of Maya, which was another big win for them: most of the time they actually authored materials right inside of Maya, which let them do the authoring while viewing the material in the environment of their choosing. For gameplay/level authoring it's a little less clear cut. In some ways using Maya is natural since it already supports a lot of things that you need for a level editor (3D viewer, orthographic views, translation/scaling/rotation widgets, user-defined attributes and UI, etc.), and that kept us programmers from having to re-implement all of those things. But in some ways it's also rather clunky and heavyweight, especially if you just want to move a few locators around.

We actually have a good mix of C++ plugins as well as Python tools. We also still have some MEL tools, but we've been deprecating that stuff in favor of Python. The C++ plugins do most of the things that need tight integration with the engine, most notably our Viewport 2.0 override plugin that renders the Maya scene with our renderer. There are also plugins for registering a bunch of custom node types as well as their custom attributes, loading asset data, and kicking off GI bakes to our bake farm. We then mostly use Python to create our own UI, and also for helper scripts that automate repetitive art and gameplay tasks. Most of our programmers hate working on the C++ plugins, since you have to start up Maya to run them and the Maya API isn't always easy to work with. The Viewport plugin in particular was a *lot* of work, and has a pretty high maintenance cost. We basically have to treat it like an additional platform for our automated graphics tests, since a lot of things go through custom code paths in order to extract the right data from Maya on-the-fly, instead of being processed in our content build pipeline. The artists love it, so we can't get rid of it now. :P

Also: you two should feel free to join the gd.net chat from time to time if you find yourself yearning for an extremely long and one-sided conversation about this. Milk and cookies are ready...at dawn.

Haha, I'll definitely come by!

#5251103 Precalculate lightmap and specular reflection?

Posted by MJP on 07 September 2015 - 11:55 PM

I'm more interested in the IBL approach. Call of Duty uses the merge method (http://blog.selfshadow.com/publications/s2013-shading-course/), but that is used for indirect light. I don't understand why the lightmap luminance could affect the specular highlight, which is view dependent.

The reason they do this is that IBL probes are typically very sparse, due to their large memory footprint. When your sample points are very sparse, you end up with a lot of error when that probe is used by a surface that's relatively far away from the sample position. This typically manifests as occlusion problems: the probe has visibility to a bright surface that shouldn't be visible from the surface being shaded, and so you see the reflections even though they should have been occluded. On the other hand, 2D lightmap samples tend to be much higher density since they store less data per texel. So the idea behind normalization is that you try to make use of the higher-frequency visibility data baked into the lightmap by combining it with the IBL probes. It can introduce lots of new errors, but it can still be an overall improvement since lack of occlusion can be very noticeable.
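
To illustrate the normalization idea (this is just a rough sketch of the concept, not the exact Call of Duty formulation, and all of the names are hypothetical): the probe's specular result gets rescaled by the ratio of the lightmap's diffuse luminance to the diffuse luminance that the probe itself predicts at that point, so that the higher-frequency occlusion baked into the lightmap darkens the specular where the probe can't see the occluder.

float3 NormalizedSpecularIBL(float3 probeSpecular,      // prefiltered specular from the IBL probe
                             float3 probeDiffuse,       // diffuse irradiance predicted by the same probe
                             float3 lightmapDiffuse)    // diffuse irradiance from the 2D lightmap
{
    float probeLuminance    = dot(probeDiffuse,    float3(0.299f, 0.587f, 0.114f));
    float lightmapLuminance = dot(lightmapDiffuse, float3(0.299f, 0.587f, 0.114f));

    // If the lightmap says this texel receives less light than the probe predicts, the probe
    // is probably seeing past an occluder, so darken the specular by the same ratio
    float normalization = lightmapLuminance / max(probeLuminance, 0.0001f);
    return probeSpecular * saturate(normalization);
}

Clamping the ratio to 1 means the lightmap can only ever darken the probe's specular, which avoids amplifying probe error in the other direction.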

On a related note, this was a major motivation for using Spherical Gaussians in The Order, since the SG lightmaps have high spatial density but still allow for low-to-mid frequency specular.