

#5236626 Best technique for lighting

Posted by MJP on 24 June 2015 - 03:29 PM

Shadow-casting lights aren't incompatible with a tile-based deferred renderer, you just need to have all of your shadow maps available at once in your tiled lighting compute shader. The simplest way to do this is to make a texture array for all of your shadow maps, and render all of your shadow maps before running your tiled lighting compute shader. Then when you run the compute shader, you just use the light's shadow map index to look up the shadow map texture from your texture array. The downside of this approach is that you now need to allocate enough memory for all of your shadow maps simultaneously, as opposed to re-using the same shadow map texture for every light. For the last game I worked on we just capped our renderer to 16 shadow-casting spotlights per frame, and made a fixed-size texture array with 16 elements.

If you need to save memory or you really need lots of shadow maps, you can do a hybrid approach. For example you could make your texture array hold 8 shadow maps, and then you could render 8 shadows at a time. So it would go like this:

- Render shadow maps for lights 0 through 7
- Run tiled lighting compute shader for lights 0 through 7
- Render shadow maps for lights 8 through 15
- Run tiled lighting compute shader for lights 8 through 15
- etc.
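The batching above can be sketched as a simple loop. Everything here is a hypothetical stand-in (`RenderShadowMap` and `RunTiledLightingCS` are placeholders that just record call order, not real rendering calls):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Records the sequence of operations so the batching pattern is visible.
std::vector<std::string> gCalls;

void RenderShadowMap(int lightIndex) {
    gCalls.push_back("shadow:" + std::to_string(lightIndex));
}

void RunTiledLightingCS(int firstLight, int lightCount) {
    gCalls.push_back("lighting:" + std::to_string(firstLight) + "-" +
                     std::to_string(firstLight + lightCount - 1));
}

// Process numLights shadow-casting lights using a shadow map texture
// array that only holds arraySize elements, alternating between
// shadow rendering and the tiled lighting pass for each batch.
void RenderLightsInBatches(int numLights, int arraySize) {
    for (int first = 0; first < numLights; first += arraySize) {
        int count = std::min(arraySize, numLights - first);
        for (int i = 0; i < count; ++i)
            RenderShadowMap(first + i);
        RunTiledLightingCS(first, count);
    }
}
```

With `numLights = 16` and `arraySize = 8` this produces exactly the sequence listed above: shadows 0-7, lighting 0-7, shadows 8-15, lighting 8-15.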

#5235983 What performance will AMD's HBM bring for graphics programmers?

Posted by MJP on 21 June 2015 - 01:32 AM

Unfortunately, the Windows kernel does have to patch your command buffers, depending on the hardware being used under the hood, for stability/virtualization/security purposes. Consoles don't have this problem (they can do user-mode submission), but only because a lot more trust is placed on the developers to not undermine the operating system. Newer hardware with better memory and virtualization systems might be able to fix this issue.

With D3D12/WDDM 2.0 the address patching actually isn't necessary anymore, since GPUs can use a per-process virtual address space that's managed by the OS. However that still doesn't mean that user-mode code is allowed to poke at the low-level GPU registers for submitting command buffers. With D3D12/Mantle/Vulkan user-mode code can initiate the submission of a command buffer, but it's still ultimately going to be kernel-mode code that mucks with the registers. I don't see that changing anytime soon, for a lot of reasons. If you can't submit command buffers quickly, then I suppose you're left with trying to do something like JTS patching on PS3, where you'd initially submit a command buffer with stalls in it, and then overwrite the stalls with new commands. This isn't really ideal, and would need API support to overcome all of the low-level hardware issues.

I have more thoughts on the idea of low-latency rendering, but I think that they will have to wait until tomorrow.

#5235982 Uses of curves in graphics?

Posted by MJP on 21 June 2015 - 01:17 AM

Our curves had arbitrary numbers of control points, and so it was simpler for the particle simulation shader to just have the curves baked down into textures. The shader didn't have to care how many control points there were or even what kind of spline was used, it could just fetch and that was it.

As for performance, it's not necessarily so straightforward. When the curve is baked into textures you just end up with one texture fetch per particle attribute. With arbitrary control points you would need to perform N memory accesses inside of a dynamic loop, which can be slow for per-thread latency since you can't pipeline memory access as well in a dynamic loop. You also have the issue that every warp/wavefront will need to iterate for the worst case of all threads in that group, and so you may end up with some wasted work if you have particles from different systems packed into the same warp/wavefront. For us though, the performance didn't even matter in the end: we used async compute for the particle simulation and it just got totally absorbed by the normal rendering workload.
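A minimal CPU-side sketch of the baking step, assuming simple linear interpolation between control points (the actual engine may well have used a different spline basis, and the names here are illustrative):

```cpp
#include <algorithm>
#include <vector>

// Bake a curve with an arbitrary number of control points into a
// fixed-size lookup table -- the CPU-side analog of writing one row of
// the curve texture. After baking, the particle shader needs just one
// fetch per attribute, regardless of how many control points existed.
std::vector<float> BakeCurve(const std::vector<float>& controlPoints,
                             int tableSize) {
    std::vector<float> table(tableSize);
    const int numSegments = static_cast<int>(controlPoints.size()) - 1;
    for (int i = 0; i < tableSize; ++i) {
        float t = (tableSize > 1) ? float(i) / float(tableSize - 1) : 0.0f;
        float x = t * numSegments;            // position in control-point space
        int seg = std::min(int(x), numSegments - 1);
        float frac = x - float(seg);
        table[i] = controlPoints[seg] * (1.0f - frac) +
                   controlPoints[seg + 1] * frac;
    }
    return table;
}
```

The runtime lookup is then just `table[index]`, which maps directly to a single texture fetch on the GPU.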

#5235791 What performance will AMD's HBM bring for graphics programmers?

Posted by MJP on 19 June 2015 - 06:09 PM

Yes, that's exactly what I'm talking about.  The problem isn't memory transfers, whether or not they happen, and how fast they are if they do.  The problem is that the CPU and GPU are two separate processors that operate asynchronously.  If you have one frame of latency and you need to do a readback every frame, the fastest memory transfer in the world (or no memory transfer) won't help you; you'll still halve your framerate.

With current APIs, sure. With ones designed around shared memory not necessarily. On a shared memory system, there's no reason why after issuing a command it couldn't be running on the GPU nanoseconds (or at the very least microseconds) later (provided it wasn't busy of course). With that level of fine-grain control you could switch back and forth between CPU and GPU easily multiple times per frame.

Even with a shared memory architecture it doesn't mean that you can suddenly run the CPU and GPU in lockstep with no consequences. Or at least, certainly not in the general case of issuing arbitrary rendering commands. What happens when the CPU issues a draw command that takes 3 milliseconds on the GPU? Does the CPU now sit around for 3ms waiting for the GPU to finish? It also totally breaks the concurrency model exposed by D3D12/Mantle/Vulkan, which are all based around the idea of different threads writing commands to separate command buffers that are later submitted in batches. On top of that, the GPU hardware that I'm familiar with is very much built around this submission model, and requires kernel-level access to privileged registers in order to submit command buffers. So it's certainly not something you'd want to do after every draw or dispatch call.

Obviously these problems aren't insurmountable, but I think at the very least you would need a much tighter level of integration between CPU and GPU for the kind of generalized low-latency submission that you're talking about. Currently the only way to get anywhere close to that is to use async compute on AMD GPUs, which is specifically designed to let you submit small command buffers with compute jobs with minimal latency. With shared memory and careful cache management it is definitely possible to get your async compute results back pretty quickly, but that's only going to work well for a certain category of short-running tasks that don't need a lot of GPU resources to execute.

#5235779 [Solved]IBL(ggx) problem

Posted by MJP on 19 June 2015 - 04:29 PM

The split-sum approximation doesn't fix this issue. The error is actually caused by the assumption that N = V = R for pre-integrating the environment map, which causes incorrect weighting at glancing angles. If you look at Figure 4 in the course notes, you'll see that the results when using a cubemap are pretty similar to what you're getting.

#5235771 How does games manage different resolutions?

Posted by MJP on 19 June 2015 - 04:07 PM

1. Call IDXGIOutput::GetDisplayModeList. It will give you the list of possible resolutions and refresh rates that you can use for creating your fullscreen swapchain on that particular monitor.

2. Many games actually don't handle multi-GPU and multi-monitor very well: they'll just default to the first adapter and use the first output on that adapter. What I would probably do is have either a configuration dialog or command line options that let advanced users pick which GPU to use, and then pick which monitor to use. If they just use default options, then you can just do what other games do and use the first adapter and the first output from that adapter. This will always be the "primary" display for Windows, so it's usually not a bad choice. If the user picks an adapter that has no outputs, then you can either try to gracefully fall back to a different adapter that does have outputs, or you can just output an error message.

3. Like I said above, I would probably just default to the first output and then provide an advanced settings menu for letting users choose a different display. You could try and be smart by looking at all displays and picking the biggest, but I think that defaulting to the primary display is still a sensible choice.

4. For windowed mode you just find out the size of the window's client area, and then use that for your backbuffer size when creating your swap chain. Or alternatively, just specify 0 as the width and height when creating your swap chain and it will automatically use the client area size. To handle resizing, you just need to check for WM_SIZE messages from your window and then call IDXGISwapChain::ResizeBuffers to resize the back buffer. Once again you can either ask the window for its client area size, or just pass 0 to let DXGI do that for you. Also if it's not convenient to handle window messages, you can instead just ask the window for its client area during every tick of your update loop, and then call ResizeBuffers if the size is different from last frame.
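The poll-based variant from point 4 can be sketched like this, with hypothetical stand-ins for the window and swap chain (in real code the resize call would be `IDXGISwapChain::ResizeBuffers`):

```cpp
// Hypothetical stand-ins: ResizeBuffers just counts its invocations so
// the "only resize when the size actually changed" logic is testable.
struct Window { int width, height; };
struct SwapChain {
    int resizeCalls = 0;
    void ResizeBuffers(int /*width*/, int /*height*/) { ++resizeCalls; }
};

// Called once per tick of the update loop, as an alternative to
// handling WM_SIZE. Resizes the back buffer only when the client
// area size differs from last frame.
void HandleResize(const Window& window, SwapChain& swapChain,
                  int& lastWidth, int& lastHeight) {
    if (window.width != lastWidth || window.height != lastHeight) {
        swapChain.ResizeBuffers(window.width, window.height);
        lastWidth = window.width;
        lastHeight = window.height;
    }
}
```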

#5235768 Backbuffer resolution scale filter

Posted by MJP on 19 June 2015 - 03:48 PM

Anything that happens after rendering to the backbuffer is out of your control. In fact you don't even know if it's the GPU that's upscaling to native resolution or the monitor itself (typically this is an option in the driver control panel).

If you want to maintain the "blocky" look, then I would suggest that you just always go to fullscreen at the monitor's native resolution. Typically this is the highest resolution mode given by IDXGIOutput::GetDisplayModeList, but you can also ask the output for the current desktop resolution using IDXGIOutput::GetDesc.

#5235170 Uses of curves in graphics?

Posted by MJP on 16 June 2015 - 12:47 PM

We used tons of curves for driving the behavior of our GPU-simulated particles: how they spawned, how they moved, what color they were, etc. For the most part we didn't really evaluate curves directly on the GPU, we would instead quantize the curves into a big lookup texture that contained the quantized values for all curves being used by the active particle systems. We did similar things for having the artists specify distance and height-based falloffs for fog effects.

I'm not sure if this is exactly what you're looking for, but you should take a look into tessellation and subdivision surfaces. These algorithms generally work by treating the mesh positions as control points on a spline, and then evaluating the curve to generate intermediate vertices. The OpenSubdiv project has tons of info about doing this on a GPU using compute shaders and tessellation hardware.

#5235164 Can i resolve a multisampled depth buffer into a texture?

Posted by MJP on 16 June 2015 - 12:34 PM

The part you're missing here is that your shadow map isn't lined up 1:1 with your screen. Instead it's projected onto a frustum that's oriented with your light's position, with the resulting shadows being projected onto your screen. As a result, you can end up with projective aliasing artifacts where the projection of the shadow-casting light causes the shadow map resolution to be less than the sampling rate of your back buffer. You'll see this as very jagged edges in your shadows, where the jagged steps are actually bigger than a pixel on your screen. Increasing the size of your shadow map will increase the shadow map resolution, which will in turn increase the relative sampling rate of your shadow map depth vs. your screen pixel sampling rate. The end result will be that the shadows will look less jagged.

Since a shadow map projection isn't related to your screen projection, it's common to pick a resolution that's not at all tied to your screen resolution. Typically you'll just pick a size like 512x512 or 1024x1024.

#5234984 HLSL questions

Posted by MJP on 15 June 2015 - 06:30 PM

In case it's not clear from the assembly, what's happening in your version of the shader (the one with gOutput) is that the compiler is generating an "immediate" constant buffer for the shader to use. In HLSL assembly there's no stack and you don't have the ability to dynamically index into registers, so the compiler has to place your array into an automatically-generated constant buffer so that it can be indexed by a dynamic value (SV_VertexID). For cases where the index is known at compile time (say, when unrolling a loop with a fixed iteration count), the compiler can instead just directly embed the array values into the assembly instructions which avoids the need to load the values from memory.

#5234983 Can i resolve a multisampled depth buffer into a texture?

Posted by MJP on 15 June 2015 - 06:15 PM

If all you want to do is increase the resolution of the shadow map, then you really should just increase the resolution of the shadow map. Multisampling isn't really useful for your particular situation.

#5234685 how much i can trust the shader compiler?

Posted by MJP on 13 June 2015 - 08:34 PM

The HLSL compiler is fine if you're talking about basic optimizations like dead-code stripping. It will strip out any code that doesn't have any effect on computing output values, and will also strip out code when doing math or branches with constant values. So for instance if you have a lot of code to compute a value but then you end up multiplying the value by 0.0, the compiler will cull out all of that code. It will also aggressively unroll loops when it can, which can often allow it to strip out things like branches on the loop control variable. For our engine most of our shaders contain code and variables that are automatically generated based on a material asset, and we heavily rely on these sorts of optimizations to strip out unnecessary code when features are disabled.

#5234620 Async asset loading

Posted by MJP on 13 June 2015 - 11:40 AM

By default, ID3D11Device is thread-safe. So you can just create your textures and buffers on your worker thread, without having to use the device context at all. For best performance the driver should support concurrent resource creation, which is an optional cap bit that you can check for using ID3D11Device::CheckFeatureSupport.

For more info about multithreading, you should read this section of the documentation.
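A rough sketch of the worker-thread pattern, with a hypothetical `Device` stand-in for `ID3D11Device` (whose `Create*` methods are thread-safe by default, as noted above):

```cpp
#include <atomic>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for ID3D11Device: resource creation is
// thread-safe, so a worker thread can call it directly without touching
// the (not-thread-safe) device context.
struct Device {
    std::atomic<int> texturesCreated{0};
    void CreateTexture(const std::string& /*path*/) { ++texturesCreated; }
};

// Create a batch of textures on a worker thread while the main thread
// keeps rendering with the device context.
void LoadTexturesAsync(Device& device,
                       const std::vector<std::string>& paths) {
    std::thread worker([&device, paths] {
        for (const auto& p : paths)
            device.CreateTexture(p);  // safe: the device is thread-safe
    });
    worker.join();  // a real engine would signal completion instead of joining
}
```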

#5233908 Roughness in a Reflection

Posted by MJP on 09 June 2015 - 03:05 PM

Like swiftcoder mentioned, the common way to do this is pick a roughness per mip level and pre-convolve with the corresponding specular lobe for each mip level. As roughness increases you end up with lower-frequency information (AKA blurrier data), and so it makes sense to use the lower-resolution mip levels for higher roughnesses.
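One common convention for the lookup (an assumption on my part, not something the post above specifies) is a straight linear mapping from roughness to mip level of the pre-convolved cubemap:

```cpp
// Map a [0, 1] roughness value onto the mip chain of a pre-convolved
// environment cubemap: roughness 0 samples the sharpest mip, roughness 1
// the blurriest. Some engines instead use a nonlinear remapping tuned
// to their BRDF; this linear version is just the simplest choice.
float RoughnessToMip(float roughness, int mipCount) {
    return roughness * static_cast<float>(mipCount - 1);
}
```

The returned value would be fed to something like `SampleLevel` so the hardware blends between the two nearest pre-convolved mips.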

If you want to try to approximate more physically-based BRDFs, then you have to go further and try to incorporate a Fresnel term as well as a geometry term. Unfortunately you can't pre-integrate those terms into a cubemap, and so you have to use approximations in order to try to get the correct reflection intensity. There are two courses from SIGGRAPH 2013 that go into the details of doing this: the Black Ops II presentation, and the Unreal Engine 4 presentation.

#5233660 order of resource creation

Posted by MJP on 08 June 2015 - 04:40 PM

I don't want to get into specifics since it's under NDA, so I'll just say that it does a great job of fully exposing all of the hardware's functionality in a sane manner. I really like what's happening with D3D12/Vulkan, particularly in that it shifts a lot of the memory and synchronization management over to us. That, combined with bindless resources and more direct control over command buffer generation/submission, should allow for much greater efficiency compared to D3D11. I don't want to compare PS4 and D3D12 too much though, since it's not really a fair comparison. The PS4 API only needs to expose the functionality of a single hardware configuration on a platform that's primarily designed for running one game at a time. D3D12 is in the much tougher position of exposing a wide range of GPUs and drivers in the context of a multitasking OS running many applications, which is a considerably more complex situation. And so you end up with concepts like a root signature, which abstracts away many different low-level binding models so that it can present them through a single coherent interface. In that context, I think D3D12 has done a great job of exposing things as efficiently as possible while still working within the limitations imposed by PC development.