MJP

Member Since 29 Mar 2007

#5283446 Per Triangle Culling (GDC Frostbite)

Posted by MJP on 25 March 2016 - 03:10 PM

Nvidia has OpenGL and D3D extensions for a "passthrough" GS that's meant to be fast as long as you can live with the restrictions (no arbitrary amplification, only triangles, no stream out, etc.). So if you could use that to do per-triangle culling, it could potentially be much easier to get it working. If anybody actually tries it, I'd love to hear about the results. :)




#5283071 Math behind anisotropic filtering?

Posted by MJP on 23 March 2016 - 11:42 PM

Is there an article/explanation, and is it standardized somehow or vendor-dependent? (In GL it's not core AFAIK, even though all vendors support it.)


It's an extension in GL because it's patented. :(


#5282547 Iso-/Dimetric tile texture has jagged edges

Posted by MJP on 22 March 2016 - 01:22 AM

I had a similar issue once, and it turned out I was doing windowed mode wrong in terms of calculating the window size to fit the backbuffer size etc., resulting in a vaguely stretched display that was hard to notice for a long while. Maybe you could check your window + DirectX initialisation code?


I was going to say the same thing. You want to make sure that the client area of your window is the same size as your D3D backbuffer, otherwise you'll get really crappy scaling when the backbuffer is blitted onto the window. You can use something like this:
 
RECT windowRect;
SetRect(&windowRect, 0, 0, backBufferWidth, backBufferHeight);

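// style and exStyle should be the same values that were used when creating the window (e.g. passed to CreateWindowEx)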
BOOL isMenu = (GetMenu(hwnd) != nullptr);
if(AdjustWindowRectEx(&windowRect, style, isMenu, exStyle) == 0)
    DoErrorHandling();

if(SetWindowPos(hwnd, HWND_NOTOPMOST, 0, 0, windowRect.right - windowRect.left, windowRect.bottom - windowRect.top, SWP_NOMOVE) == 0)
    DoErrorHandling();
See the docs for AdjustWindowRectEx for more details.


#5282527 EVSM, 2 component vs 4 component

Posted by MJP on 21 March 2016 - 10:54 PM

I went with 16-bit because there were too many artifacts when using the 2-component version of EVSM. Specifically, you run into issues in areas with high geometrical complexity, where the receiver surface is non-planar relative to the filter kernel. See the original paper (go to section 7) for some more details.
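For reference, here's a rough C++ sketch of what the two variants store per texel (the exponential warp constants are arbitrary example values, and real implementations do this in the shader that converts the shadow map):

#include <cmath>

// EVSM warps the [0, 1] shadow-map depth with a positive and a negative exponential.
// The 2-component variant keeps only the moments of the positive warp; the
// 4-component variant keeps both, which lets it bound the depth distribution from
// both sides and behave better on non-planar receivers.
const float PositiveExponent = 40.0f;  // Example values; these get tuned per title and per format
const float NegativeExponent = 5.0f;

struct EVSM2 { float Pos, PosSq; };
struct EVSM4 { float Pos, PosSq, Neg, NegSq; };

EVSM2 ComputeEVSM2Moments(float depth)
{
    const float pos = std::exp(PositiveExponent * depth);
    return { pos, pos * pos };
}

EVSM4 ComputeEVSM4Moments(float depth)
{
    const float pos = std::exp(PositiveExponent * depth);
    const float neg = -std::exp(-NegativeExponent * depth);
    return { pos, pos * pos, neg, neg * neg };
}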

 

I really noticed it on our characters and faces, due to their dense, curved geometry. I took some screenshots from my sample app in an attempt to replicate the issues that I saw in The Order:

 

This is 4-component EVSM with 32-bit textures:

 

[attachment: EVSM4.PNG]

 

This is 2-component EVSM with 32-bit textures (look at the shadow cast by the nose):

 

[attachment: EVSM2.PNG]

 

And this is 4-component EVSM with 16-bit textures, with the bias and leak reduction turned up:

 

[attachment: EVSM4_16bit.PNG]




#5282243 Object Space Lighting

Posted by MJP on 20 March 2016 - 07:04 PM


That said, I don't get the comparisons to REYES, and overall it seems like a very special-purpose, application-specific approach to rendering

 

Yeah, I agree that the frequent mentioning of REYES is misleading. The only real commonality with REYES is the idea of not shading per-pixel, and even in that regard REYES has a very different approach (dicing into micropolygons followed by stochastic rasterization).

 

I also agree that it's pretty well-tailored to their specific style of game, and the general requirements of that genre (big terrain, small meshes, almost no overdraw). I would imagine that to adopt something similar for more general scenes you would need to do a much better job of allocating appropriately-sized tiles, and you would need to account for occlusion. I could see maybe going down the megatexture approach of rasterizing out tile IDs, and then analyzing that on the CPU or GPU to allocate memory for tiles. However, this implies latency, unless you do it all on the GPU and rasterize your scene twice. Doing it all on the GPU would rule out any form of tiled resources/sparse textures, since you can't update page tables from the GPU.

ptex would be nice for avoiding UV issues (it would also be nice for any kind of arbitrary-rate surface calculations, such as pre-computed lightmaps or real-time GI), but it's currently a PITA to use on the GPU (you need borders for HW filtering, and you need quad->page mappings and lookups).




#5282237 Gamma correction. Is it lossless?

Posted by MJP on 20 March 2016 - 06:49 PM

The sRGB->Linear transformation for color textures will typically happen in the fixed-function texture units, before filtering. You can think of the process as going something like this:

 

result = 0.0
foreach(texel in filterFootPrint):
    encoded = ReadMemory(texel)
    decoded = DecodeTexel(encoded)  // Decompress from block compression and/or convert from 8-bit to intermediate precision (probably 16-bit)
    linear = sRGBToLinear(decoded)
    result += linear * FilterWeight(texel) // Apply bilinear/trilinear/anisotropic filtering
 
return ConvertToFloat(result)

 

It's important that the filtering and sRGB->Linear conversion happen at some precision that's higher than 8-bit, otherwise you will get banding. For sRGB conversion 16-bit fixed point or floating point is generally good enough for this purpose. The same goes for writing to a render target: the blending and linear->sRGB conversion need to happen at higher precision than the storage format, or you will get poor results. You will also get poor results if you write linear data to an 8-bit render target, since there will be insufficient precision in the darker range.
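For reference, the sRGB transfer functions being approximated look like this (scalar C++ operating on a single normalized channel; this is the standard piecewise curve from the sRGB spec, not any particular piece of hardware):

#include <cmath>

// Decode one sRGB-encoded channel in [0, 1] to linear: a small linear segment near
// black, and a 2.4 power curve for the rest.
float SRGBToLinear(float srgb)
{
    if(srgb <= 0.04045f)
        return srgb / 12.92f;
    return std::pow((srgb + 0.055f) / 1.055f, 2.4f);
}

// Encode one linear channel in [0, 1] back to sRGB.
float LinearToSRGB(float linear)
{
    if(linear <= 0.0031308f)
        return linear * 12.92f;
    return 1.055f * std::pow(linear, 1.0f / 2.4f) - 0.055f;
}

Note how steep the encoding curve is near zero: that's exactly where a linear 8-bit format runs out of precision and bands.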

 

Probably the vast majority of modern games are gamma-correct. It's a bona fide requirement for PBR, which almost everyone is adopting in some form or another. I seem to recall someone mentioning that Unity maintains a non-gamma-correct rendering path for legacy compatibility, but don't quote me on that.




#5279535 Setting constant buffer 'asynchronously'

Posted by MJP on 04 March 2016 - 02:05 PM

The runtime/driver will synchronize UpdateSubresource for you. This means that if you use a constant buffer for draw call A, update the constant buffer, and then issue draw call B, it will appear as though the buffer was updated in between A and B.
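In other words, something like this is safe (just a D3D11 sketch; PerDrawConstants and the draw parameters are placeholders):

// Hypothetical struct matching the constant buffer layout.
struct PerDrawConstants
{
    float WorldViewProjection[16];
};

void DrawTwoObjects(ID3D11DeviceContext* context, ID3D11Buffer* constantBuffer,
                    const PerDrawConstants& dataA, const PerDrawConstants& dataB)
{
    context->VSSetConstantBuffers(0, 1, &constantBuffer);

    // Draw call A sees dataA...
    context->UpdateSubresource(constantBuffer, 0, nullptr, &dataA, 0, 0);
    context->Draw(36, 0);

    // ...and draw call B sees dataB, even though the GPU may not have executed A yet
    // when UpdateSubresource returns. The runtime/driver handles the synchronization.
    context->UpdateSubresource(constantBuffer, 0, nullptr, &dataB, 0, 0);
    context->Draw(36, 0);
}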


#5279533 Manually copying texture data from a loaded image

Posted by MJP on 04 March 2016 - 01:54 PM

Thank you! No plans for a D3D12 book at the moment.


#5279349 Manually copying texture data from a loaded image

Posted by MJP on 03 March 2016 - 04:08 PM

You're not supposed to fill out D3D12_SUBRESOURCE_FOOTPRINT and D3D12_PLACED_SUBRESOURCE_FOOTPRINT yourself. You need to fill out your D3D12_RESOURCE_DESC for the 2D texture that you want to create, and then pass that to ID3D12Device::GetCopyableFootprints in order to get the array of footprints for each subresource. You can then use the footprints to fill an UPLOAD buffer, and then use CopyTextureRegion to copy the subresources from your upload buffer to the actual 2D texture resource in a DEFAULT heap.

You may want to look at the HelloTexture sample on GitHub to use as a reference.
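A rough sketch of that flow for a single subresource (error handling and resource creation omitted; device, cmdList, textureDesc, uploadBuffer, texture, srcPixels, and srcRowPitch are assumed to already exist):

// Query the layout that the copy engine expects for subresource 0.
D3D12_PLACED_SUBRESOURCE_FOOTPRINT footprint = {};
UINT numRows = 0;
UINT64 rowSizeInBytes = 0;
UINT64 totalBytes = 0;
device->GetCopyableFootprints(&textureDesc, 0, 1, 0, &footprint, &numRows, &rowSizeInBytes, &totalBytes);

// Copy the source image into the mapped UPLOAD buffer row by row, honoring the
// (potentially padded) row pitch that the footprint reports.
uint8_t* mapped = nullptr;
uploadBuffer->Map(0, nullptr, reinterpret_cast<void**>(&mapped));
for(UINT row = 0; row < numRows; ++row)
    memcpy(mapped + footprint.Offset + row * footprint.Footprint.RowPitch,
           srcPixels + row * srcRowPitch, rowSizeInBytes);
uploadBuffer->Unmap(0, nullptr);

// Record a copy from the upload buffer into the DEFAULT-heap texture.
D3D12_TEXTURE_COPY_LOCATION dst = {};
dst.pResource = texture;
dst.Type = D3D12_TEXTURE_COPY_TYPE_SUBRESOURCE_INDEX;
dst.SubresourceIndex = 0;

D3D12_TEXTURE_COPY_LOCATION src = {};
src.pResource = uploadBuffer;
src.Type = D3D12_TEXTURE_COPY_TYPE_PLACED_FOOTPRINT;
src.PlacedFootprint = footprint;

cmdList->CopyTextureRegion(&dst, 0, 0, 0, &src, nullptr);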

EDIT: actually you don't strictly need to use GetCopyableFootprints, see Brian's post below


#5279328 [D3D12] Copying between upload buffers

Posted by MJP on 03 March 2016 - 02:02 PM

You don't want to copy between upload buffers on the CPU. It can be very slow, since the memory will probably be write-combined (uncached reads). Is there really no way for you to figure out how big of a buffer you need ahead of time? Personally, I would try to avoid having to create new buffers in the middle of rendering setup.
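If an exact upper bound is hard to pin down, one common alternative (just a sketch, not tied to any particular engine) is to create one big upload buffer up front and linearly sub-allocate from it, resetting the offset once the GPU has finished with the previous frame:

struct UploadArena
{
    ID3D12Resource* Buffer = nullptr;  // One large UPLOAD-heap buffer created at startup
    uint8_t* MappedBase = nullptr;     // Persistently mapped CPU pointer for the buffer
    uint64_t Capacity = 0;
    uint64_t Offset = 0;

    // Returns a CPU write pointer and the offset within the buffer, or nullptr if the
    // arena is out of space (the caller decides whether to grow, stall, or fail).
    uint8_t* Allocate(uint64_t size, uint64_t alignment, uint64_t& outOffset)
    {
        const uint64_t aligned = (Offset + alignment - 1) & ~(alignment - 1);
        if(aligned + size > Capacity)
            return nullptr;
        Offset = aligned + size;
        outOffset = aligned;
        return MappedBase + aligned;
    }

    // Call once the GPU is known to be done with everything allocated last frame.
    void Reset() { Offset = 0; }
};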


#5278921 Multiple small shaders or larger shader with conditionals?

Posted by MJP on 01 March 2016 - 05:21 PM

Somebody correct me if I'm wrong, but if every thread in a shader takes the same branch, it is nearly (though not entirely) the same cost as if the changes were compiled in with defines. If you are rendering an object with a specific set of branches that every thread will take, it may not be a big deal. If threads take different branches, you will eat the cost of all branches.


Branching on constants is nice, because there's no chance of divergence. Divergence is when some threads within a warp/wavefront take one side of the branch, and some take another. It's bad because you end up paying the cost of executing both sides of the branch. When branching on a constant everybody takes the same path, so you only pay some (typically small) cost for checking the constant and executing the branch instructions.

As for whether it's the same cost as compiling an entirely different shader permutation, it may depend on what's in the branch as well as the particulars of the hardware. One potential issue with branching on constants is register pressure. Many GPUs have static register allocation, which means that they compute the maximum number of registers needed by a particular shader program at compile time and then make sure that the maximum is always available when the shader is actually executed. Typically the register file has a fixed size and is shared among multiple warps/wavefronts, which means that if a shader needs lots of registers then fewer warps/wavefronts can be in flight simultaneously. GPUs like to hide latency by swapping out warps/wavefronts, so having fewer in flight limits their ability to hide latency from memory access. So let's say that you have something like this:

// EnableLights comes from a constant buffer
if(EnableLights)
{
   DoFancyLightingCodeThatUsesLotsOfRegisters();
}
By branching you can avoid the direct cost of computing the lighting, but you won't be able to avoid the indirect cost that may come from increased register pressure. However, if you were to use a preprocessor macro instead of a branch and compile 2 permutations, then the permutation with lighting disabled can potentially use fewer registers and achieve greater occupancy. But again, this depends quite a bit on the specifics of the hardware as well as your shader code, so you don't want to generalize about this too much. In many cases branching on a constant might have the exact same performance as creating a second shader permutation, or might even be faster since it avoids the CPU overhead of switching shaders.


#5276220 Shadow map depth range seems off

Posted by MJP on 17 February 2016 - 04:46 PM

^^^ what mhagain said. I would suggest reading through this article for some good background information on the subject.

Also, a common trick for visualizing a perspective depth buffer is to just ignore the range close to the near clip plane and then treat the remaining part as linear. Like this:

float zw = DepthBuffer[pixelPos];  
float visualizedDepth = saturate(zw - 0.9f) * 10.0f;  
You can also compute the original view-space Z value from a depth buffer if you have the original projection matrix used for generating that depth buffer. If you take this and normalize it to a [0,1] range using the near and far clip planes, then you get a nice linear value:

float zw = DepthBuffer[pixelPos];
float z = Projection._43 / (zw - Projection._33);
float visualizedDepth = saturate((z - NearClip) / (FarClip - NearClip));



#5275840 D3D12: Copy Queue and ResourceBarrier

Posted by MJP on 15 February 2016 - 08:35 PM

Honestly I use a direct queue for everything, including moving data to and from the GPU. The Present API only presents at multiples of the screen refresh rate (I haven't had luck getting unlocked fps yet), and I get 120 FPS whether I use a direct queue or a copy queue. Unless you are moving a lot of data to and from the GPU, I personally feel the copy queue just makes things more complex than they really need to be for how much performance gain you might get with it.


It definitely depends on how much data you're moving around, and how long it might take the GPU to copy that data. The problem with using the direct queue for everything is that it's probably going to serialize with your "real" graphics work. So if you submit 15ms worth of graphics work for a frame and you also submit 1ms worth of resource copying on the direct queue, then your entire frame will probably take 16ms on the GPU. Using a copy queue could potentially allow the GPU to execute the copy while also concurrently executing graphics work, reducing your total frame time.
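A rough sketch of what using a dedicated copy queue looks like (device, copyCommandList, copyFence, copyFenceValue, and directQueue are assumed to be created and managed elsewhere):

// Create a dedicated copy queue (once at startup).
D3D12_COMMAND_QUEUE_DESC copyQueueDesc = {};
copyQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
ID3D12CommandQueue* copyQueue = nullptr;
device->CreateCommandQueue(&copyQueueDesc, IID_PPV_ARGS(&copyQueue));

// Each frame: record the uploads on a COPY command list, submit, and signal a fence.
ID3D12CommandList* copyLists[] = { copyCommandList };
copyQueue->ExecuteCommandLists(1, copyLists);
copyQueue->Signal(copyFence, ++copyFenceValue);

// Make the direct queue wait on the fence before it executes work that reads the
// copied data. Graphics work submitted before this wait can overlap with the copy.
directQueue->Wait(copyFence, copyFenceValue);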


#5275061 Diffuse IBL - Importance Sampling vs Spherical Harmonics

Posted by MJP on 09 February 2016 - 05:48 PM

Spherical harmonics are pretty popular for representing environment diffuse, because they're fairly compact and they can be evaluated with just a bit of shader code. For L2 SH you need 9 coefficients (so 27 floats for RGB), which is about the size of a 2x2 cubemap. However it will always introduce some amount of approximation error compared to ground truth, and in some cases that error can be very noticeable. In particular it has the problem where very intense lighting will cause the SH representation to over-darken on the opposite side of the sphere, which can lead to totally black (or even negative!) values.

The other nice thing about SH is that it's really simple to integrate. With specular pre-integration you usually have to integrate for each texel of your output cubemap, which is why importance sampling is used as an optimization. If you're integrating to SH you don't need to do this: you're effectively integrating for a single point, so you can just loop over all of the texels in your source cubemap, which means you won't "miss" any details. You can look at my code for an example, if you want.
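For reference, a sketch of that kind of single-point projection (plain C++; CubemapTexel is a made-up struct assumed to provide each texel's radiance, unit direction, and solid angle, and only one color channel is shown):

#include <array>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical source data: one entry per cubemap texel.
struct CubemapTexel
{
    Float3 Direction;   // Unit direction through the center of the texel
    float Radiance;     // Single color channel, for brevity
    float SolidAngle;   // Solid angle subtended by the texel on the sphere
};

// Evaluate the 9 L2 spherical harmonics basis functions for a unit direction.
std::array<float, 9> SHBasisL2(const Float3& d)
{
    return {
        0.282095f,                              // Y(0, 0)
        0.488603f * d.y,                        // Y(1,-1)
        0.488603f * d.z,                        // Y(1, 0)
        0.488603f * d.x,                        // Y(1, 1)
        1.092548f * d.x * d.y,                  // Y(2,-2)
        1.092548f * d.y * d.z,                  // Y(2,-1)
        0.315392f * (3.0f * d.z * d.z - 1.0f),  // Y(2, 0)
        1.092548f * d.x * d.z,                  // Y(2, 1)
        0.546274f * (d.x * d.x - d.y * d.y),    // Y(2, 2)
    };
}

// Project the whole cubemap onto L2 SH: one weighted sum over every source texel,
// so no importance sampling is needed and no texels are skipped.
std::array<float, 9> ProjectOntoSH(const std::vector<CubemapTexel>& texels)
{
    std::array<float, 9> coefficients = {};
    for(const CubemapTexel& texel : texels)
    {
        const std::array<float, 9> basis = SHBasisL2(texel.Direction);
        for(size_t i = 0; i < 9; ++i)
            coefficients[i] += texel.Radiance * basis[i] * texel.SolidAngle;
    }
    return coefficients;
}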


#5274941 Limiting light calculations

Posted by MJP on 09 February 2016 - 02:49 AM

I was always advised to avoid dynamic branching in pixel shaders.


You should follow this advice if you're working on a GPU from 2005. If you're working on one from the last 10 years... not so much. On a modern GPU I would say that there are two main things you should be aware of with dynamic branches and loops:

1. Shaders will follow the worst case within a warp or wavefront. For a pixel shader, this means groups of 32-64 pixels that are (usually) close together in screen space. What this means is that if you have an if statement where the condition evaluates false for 31 pixels but true for one pixel in the 32-thread warp, then they all have to execute what's inside the if statement. This can be especially bad if you have an else clause, since you can end up with your shader executing both the "if" as well as the "else" part of your branch! For loops it's similar: the shader will keep executing the loop until all threads have hit the termination condition. Note that if you're branching or looping on something from a constant buffer, then you don't need to worry about any of this. For that case every single pixel will take the same path, so there's no coherency issue.

2. Watch out for lots of nested flow control. Doing this can start to add overhead from the actual flow control instructions (comparisons, jumps, etc.), and can cause the compiler to use a lot of general purpose registers.

For the case you're talking about, a dynamic branch is totally appropriate and is likely to give you a performance increase. The branch should be fairly coherent in screen space, so you should get lots of warps/wavefronts that can skip what's inside of the branch. For an even more optimal approach, look into deferred or clustered techniques.



