

#5268650 VertexBuffers and InputAssmbler unnecessary?

Posted by MJP on 31 December 2015 - 05:45 PM

I personally shipped a project that 100% used manual vertex fetch in the shader, although this was for a particular console whose GPU is an AMD GCN variant. AMD GPUs have no fixed-function vertex fetch, so they implement "input assembler" functionality by generating a small preamble for the VS that they call a "fetch shader". They can generate this fetch shader for any given input layout, and the vertex shader invokes it through something analogous to a function call in their particular ISA. When the fetch shader runs, it pulls the data out of your vertex buffer(s) using SV_VertexID and SV_InstanceID, and deposits it all in registers. The vertex shader then knows by convention which registers contain the vertex data, and it can proceed accordingly.

Because of this setup, the fetch shader can sometimes have suboptimal code generation compared to a vertex shader that performs manual vertex fetch. The fetch shader must ensure that all vertex data is deposited into registers up-front, and must ensure that the loads are completed before passing control back to the VS. However, if the VS itself is fetching vertex data, then the vertex fetches can potentially be interleaved with other VS operations, and can potentially re-use registers whose contents are no longer needed.
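To make the idea concrete, here's a rough sketch in plain C++ (not HLSL) of what manual vertex fetch amounts to: the "shader" indexes a raw byte buffer with the vertex ID and decodes the attributes itself, instead of relying on the input assembler. The struct layout and names are illustrative assumptions, not a real API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical vertex layout; in HLSL this decode would be a
// ByteAddressBuffer.Load at (vertexID * stride) inside the VS.
struct Vertex { float position[3]; float uv[2]; };

Vertex FetchVertex(const std::vector<uint8_t>& vertexBuffer, uint32_t vertexID)
{
    // Manual fetch: compute the byte offset from the vertex ID and
    // reinterpret the bytes, exactly what the fetch shader does for you.
    Vertex v;
    std::memcpy(&v, vertexBuffer.data() + vertexID * sizeof(Vertex), sizeof(Vertex));
    return v;
}
```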

Unfortunately I'm not sure if it's the same when going through DX11, since there are additional API layers in the way that might prevent optimal code-gen. I'm also not sure which hardware still has fixed-function vertex fetch, and what kind of performance delta you can expect.

#5268647 GS Output not being rasterized (Billboards)

Posted by MJP on 31 December 2015 - 04:57 PM

If I understand your code correctly, it looks like you're setting the output vertex position to have z = 0.0 and w = 0.0, which is invalid. Try setting w to 1.0 instead.

#5268406 [D3D12] Driver level check to avoid duplicate function call?

Posted by MJP on 29 December 2015 - 05:23 PM

As far as I know there's no API-level guarantee that the implementation will filter out redundant calls for you. It's possible that the drivers will do it, but there's no way of knowing without asking them or profiling. Filtering yourself should be pretty easy and cheap, you can just cache the pointer to the PSO that's currently set for that command list and compare with it before setting a new one.

#5267077 OpenGL Projection Matrix Clarifications

Posted by MJP on 19 December 2015 - 05:40 PM

There's a diagram in the presentation Projection Matrix Tricks by Eric Lengyel that shows how normalized device coordinates work using OpenGL conventions.

As it shows, in OpenGL the entire visible depth range between the near clip plane and the far clip plane is mapped to [-1, 1] in normalized device coordinates. So if a position has a Z value of 0 then it's not actually located at the camera position: it's somewhere between the near clip plane and the far clip plane (but not exactly halfway between, since the mapping is non-linear).
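You can see both properties (near maps to -1, far maps to +1, and the mapping is non-linear) directly from the standard GL perspective projection. Here's a small sketch that maps a positive eye-space distance d (with near <= d <= far) to NDC depth:

```cpp
#include <cassert>
#include <cmath>

// OpenGL-convention NDC depth for a point at distance d in front of the
// camera, with near plane n and far plane f. Derived from the standard GL
// perspective matrix: z_ndc = ((f + n) - 2fn/d) / (f - n).
double DepthToNDC(double d, double n, double f)
{
    return ((f + n) - 2.0 * f * n / d) / (f - n);
}
```

Note that the geometric midpoint between the planes lands very close to +1, not at 0, which is exactly the non-linearity described above.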

#5266801 Questions on Baked GI Spherical Harmonics

Posted by MJP on 17 December 2015 - 01:27 PM

Yes, you'll either need to use multiple textures or atlas them inside of one large 3D texture (separate textures is easier). It would be a lot easier if GPUs supported 3D texture arrays, but unfortunately they don't.

#5266102 D3D11 texture image data from memory

Posted by MJP on 13 December 2015 - 02:22 AM

If I read the PNG images with the winapi and not with stbi_load and then use D3DX11CreateShaderResourceViewFromMemory, it should work?

Yes. You can use CreateFile and ReadFile to load the contents of a file into memory, and then pass that buffer to D3DX11CreateShaderResourceViewFromMemory.

I should point out that many games do not store their textures using image file formats such as JPEG and PNG. While these formats are good for reducing the size of the image on disk, they can be somewhat expensive to decode. They also don't let you pre-generate mipmaps or compress to GPU-readable block compression formats, which many games do in order to save performance and memory. As a result games will often use their own custom file format, or will use the DDS format. DDS can store compressed data with mipmaps, and it can also store texture arrays, cubemaps, and 3D textures.

#5266065 D3D11 texture image data from memory

Posted by MJP on 12 December 2015 - 03:23 PM

D3DX11CreateShaderResourceViewFromMemory expects the data you give it to be an image file, such as a JPEG, DDS, or PNG file. stbi_load parses an image file and gives you back the raw pixel data that was decoded from it. To use that raw data to initialize a texture, call ID3D11Device::CreateTexture2D and pass the raw image data through a D3D11_SUBRESOURCE_DATA structure as the "pInitialData" parameter. For a 2D texture, set pSysMem to the image data pointer that you get back from stbi_load, and set SysMemPitch to the size of a pixel times the width of your texture. In your case it looks like you're loading 8-bit RGBA data, which is 4 bytes per pixel, so you should set it to "object.width * 4".
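Here's a sketch of filling that structure. The struct below just mirrors the fields of D3D11_SUBRESOURCE_DATA for illustration (the real one lives in d3d11.h); the point is how SysMemPitch is computed for tightly packed RGBA8 data from stbi_load.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative mirror of D3D11_SUBRESOURCE_DATA; use the real struct in
// actual D3D11 code.
struct SubresourceData {
    const void* pSysMem;          // pointer to the raw pixel data
    uint32_t    SysMemPitch;      // bytes per row of pixels
    uint32_t    SysMemSlicePitch; // only used for 3D textures
};

SubresourceData MakeInitData(const uint8_t* pixels, uint32_t width)
{
    const uint32_t bytesPerPixel = 4;  // 8-bit RGBA
    return SubresourceData{ pixels, width * bytesPerPixel, 0 };
}
```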

#5265393 MSAA and CheckFeatureSupport

Posted by MJP on 08 December 2015 - 12:23 AM

Perhaps back buffers don't support MSAA with D3D12? I wouldn't be surprised if this were the case, since D3D12 is much more explicit about swap chains. MSAA swap chains have to have a "hidden resolve" performed on them, where the driver resolves the subsamples of your MSAA back buffer to create a non-MSAA back buffer that can be displayed on the screen. If I were you, I would just do this yourself: create an MSAA render target, then resolve it to your non-MSAA back buffer using ResolveSubresource.
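For clarity, here's what a resolve does conceptually: each output pixel is the average of that pixel's MSAA subsamples. ResolveSubresource does this on the GPU; this plain-C++ version (single channel, illustrative layout assumptions) is just to show the operation.

```cpp
#include <cassert>
#include <vector>

// 'msaa' holds sampleCount values per pixel, stored contiguously per pixel.
// The resolved output has one averaged value per pixel.
std::vector<float> Resolve(const std::vector<float>& msaa, int pixelCount, int sampleCount)
{
    std::vector<float> resolved(pixelCount);
    for (int p = 0; p < pixelCount; ++p) {
        float sum = 0.0f;
        for (int s = 0; s < sampleCount; ++s)
            sum += msaa[p * sampleCount + s];  // gather this pixel's subsamples
        resolved[p] = sum / sampleCount;       // box-filter average
    }
    return resolved;
}
```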

#5265391 [D3D12] Command Queue Fence Synchronization

Posted by MJP on 08 December 2015 - 12:16 AM

Conceptually, SetEventOnCompletion works like this:

HRESULT SetEventOnCompletion(UINT64 Value, HANDLE hEvent)
{
    // Start a background thread to check the fence value and trigger the event
    CreateThread(FenceThread, Value, hEvent);
}

void FenceThread(UINT64 Value, HANDLE hEvent)
{
    while (fenceValue < Value);  // Wait for the fence to be signaled
    SetEvent(hEvent);            // Trigger the event
}
So there's no danger of "missing" the event, as you're fearing.

EDIT: changed the code to be more clear about how the checking is done in the background
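The same "can't miss the signal" guarantee can be sketched in portable C++ with a condition variable: if the fence value is already high enough when the wait starts, the wait returns immediately. This is an analogy, not how D3D12 implements it internally.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Standard-C++ analogue of a D3D12 fence: Signal() publishes a new value,
// WaitForValue() blocks until the fence reaches it. A signal that happens
// before the wait starts is still observed, so nothing is "missed".
class Fence {
public:
    void Signal(uint64_t value) {
        { std::lock_guard<std::mutex> lock(m); fenceValue = value; }
        cv.notify_all();
    }
    void WaitForValue(uint64_t value) {
        std::unique_lock<std::mutex> lock(m);
        // The predicate is re-checked under the lock, so a prior Signal
        // makes this return immediately.
        cv.wait(lock, [&] { return fenceValue >= value; });
    }
    uint64_t GetCompletedValue() {
        std::lock_guard<std::mutex> lock(m);
        return fenceValue;
    }
private:
    std::mutex m;
    std::condition_variable cv;
    uint64_t fenceValue = 0;
};
```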

#5265228 Swapchain fails to create when SampleDesc.Quality is greater than 0

Posted by MJP on 07 December 2015 - 12:09 AM

The levels start at 0, so if it returns "1" then "0" is the only valid quality level.

Regarding your artifact, are you sure that the client rectangle of your window is the same size as your backbuffer? Try calling GetClientRect and making sure that it's the size that you expect it to be.

#5265202 Swapchain fails to create when SampleDesc.Quality is greater than 0

Posted by MJP on 06 December 2015 - 06:02 PM

You have to ask the driver which quality levels it supports, which is done using ID3D11Device::CheckMultisampleQualityLevels. In your case the device probably only supports a quality level of 0.

Quality levels are typically used to expose vendor-specific MSAA extensions like CSAA and EQAA, so it's generally not something you want to use blindly.

#5265084 Questions on Baked GI Spherical Harmonics

Posted by MJP on 05 December 2015 - 06:57 PM

Thanks for the suggestion. I actually may try going this route to start off with. How do you handle blending between two 3D grids?

We didn't. Every object sampled from only one grid at a time, which was chosen by figuring out the closest grid with the highest sample density. There were never any popping issues as long as there was some overlap between the two grids, and an object could be located inside of both simultaneously.
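A sketch of that per-object grid selection, with made-up structures standing in for whatever the engine actually used: among the grids whose bounds contain the object, pick the one with the highest sample density.

```cpp
#include <cassert>
#include <vector>

// Hypothetical probe-grid description: an axis-aligned bound plus the
// artist-specified sample density.
struct Grid {
    float min[3], max[3];
    float samplesPerMeter;
};

bool Contains(const Grid& g, const float p[3])
{
    for (int i = 0; i < 3; ++i)
        if (p[i] < g.min[i] || p[i] > g.max[i])
            return false;
    return true;
}

// Returns the index of the densest grid containing the object, or -1 if
// no grid contains it.
int PickGrid(const std::vector<Grid>& grids, const float objectPos[3])
{
    int best = -1;
    for (int i = 0; i < (int)grids.size(); ++i) {
        if (!Contains(grids[i], objectPos))
            continue;
        if (best < 0 || grids[i].samplesPerMeter > grids[best].samplesPerMeter)
            best = i;
    }
    return best;
}
```

As long as adjacent grids overlap, an object near a boundary stays inside its chosen grid for a while before switching, which is why there was no popping.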

#5264980 Can we use openCL on consoles?

Posted by MJP on 05 December 2015 - 12:39 AM

We'd like to support consoles too though and do not want to do extensive porting.

If you're using OpenCL then "extensive porting" will definitely be a requirement. As others have already mentioned, the consoles have their own custom APIs and toolchains that you have to use in order to execute compute shaders on the GPU.

Is it possible to precompile our code for those specific gpu's?

Consoles often require that you pre-compile shaders in advance, but it varies depending on the console and SDK. You'll need to become a registered developer and read the SDK documentation if you want any specifics, since they're covered by NDA.

#5264835 Questions on Baked GI Spherical Harmonics

Posted by MJP on 04 December 2015 - 12:06 AM

The approach that you described is definitely workable, and has been successfully used in many shipping games. For the conversion from cubemap->SH coefficients, there's some pseudocode in this paper that you might find useful. You can also take a look at the SH class from my sample framework for a working example of projecting a cubemap onto SH, as well as the code from the DirectXSH library (which you can also just use directly in your project, if you'd like).

To interpolate SH, you just interpolate all of the coefficients individually. The same goes for summing two sets of SH coefficients, or multiplying a set of SH coefficients by a scalar.

The presentation that REF_Cracker linked to should help you get started with tetrahedral interpolation. Another possible approach is to just store samples in a 3D grid, or in multiple 3D grids. Grids are super simple to look up and interpolate, since they're uniform. This is what we did on The Order: the lighting artists would place OBB's throughout the level, and for each OBB we would fill the grid with samples according to their specified sample density. Each grid would then be stored in a set of 3D textures, which would be bound to each mesh based on whichever OBB the mesh intersected with. Then in the pixel shader, the pixel position would be transformed into the local space of the OBB, and the coordinates would be used to fetch SG coefficients using hardware trilinear interpolation (we used spherical gaussians instead of spherical harmonics, but the approach is basically the same). This let us have per-pixel interpolation without having to look up a complex data structure inside the pixel shader. However this has a few downsides compared to a tetrahedral approach:

1. You generally want a lot of samples in your grid for good results, which means that you have to be okay with dedicating that much memory to your samples. It also means that you want your baking to be pretty quick. Cubemap rendering isn't always that great for this, since the GPU can only render to one cubemap at a time. We used a GPU-based ray tracer to bake our GI, and that let us bake many samples in parallel.

2. Since you don't have direct control over where each sample is placed, you're bound to end up with samples that are buried inside of geometry. This can give you bad results, since the black samples will "bleed" into their neighbors during interpolation. To counteract this, we would detect "dead" samples and then flood-fill them with a color taken from their neighbors.
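For reference, here's the trilinear interpolation that the texture hardware performs for free in the scheme above: given the 2x2x2 block of samples at the corners of a cell and local coordinates in [0, 1] inside the cell, blend along each axis in turn. One of these runs per coefficient.

```cpp
#include <cassert>
#include <cmath>

// grid[x][y][z] holds the sample at each corner of one grid cell;
// (x, y, z) in [0,1] are the fractional coordinates inside the cell.
float Trilerp(const float grid[2][2][2], float x, float y, float z)
{
    // Lerp along x at each of the four (y, z) corner pairs...
    float c00 = grid[0][0][0] * (1 - x) + grid[1][0][0] * x;
    float c10 = grid[0][1][0] * (1 - x) + grid[1][1][0] * x;
    float c01 = grid[0][0][1] * (1 - x) + grid[1][0][1] * x;
    float c11 = grid[0][1][1] * (1 - x) + grid[1][1][1] * x;
    // ...then along y...
    float c0 = c00 * (1 - y) + c10 * y;
    float c1 = c01 * (1 - y) + c11 * y;
    // ...then along z.
    return c0 * (1 - z) + c1 * z;
}
```

This also makes the dead-sample bleeding in point 2 easy to see: a black corner contributes to every lookup inside its eight neighboring cells, which is why we flood-filled buried samples before interpolating.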

#5264044 Is it OK to give each render thread their own cmdQueue?

Posted by MJP on 28 November 2015 - 05:23 PM

Typically your different command lists will have dependencies, such that you require one command list to execute completely before another can start. For instance, you need to finish rendering your G-Buffer before the GPU can start the deferred lighting pass. Pretty much all existing GPUs will execute your direct/graphics command lists sequentially in the order that you submit them to a queue, so the easiest and most reliable way to ensure correct execution order is to have one thread wait for all other rendering threads to finish and then submit all of the command lists in the correct order. Having each render thread submit its own command list(s) in the correct order would require more complex synchronization, and may also require multiple threads to sit around waiting for other threads to finish (as opposed to just having one "master" thread wait).
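A minimal sketch of that master-thread pattern, with a stand-in CommandList type (the real objects and the ExecuteCommandLists call come from the graphics API): workers record in parallel, then one thread joins them all and submits in a fixed pass order.

```cpp
#include <cassert>
#include <thread>
#include <vector>

struct CommandList { int passIndex = -1; };  // stand-in for the real API object

// Returns the order in which command lists were "submitted", which is the
// fixed pass order regardless of which worker finished recording first.
std::vector<int> RecordAndSubmit(int threadCount)
{
    std::vector<CommandList> lists(threadCount);

    // Each worker records its own command list in parallel.
    std::vector<std::thread> workers;
    for (int i = 0; i < threadCount; ++i)
        workers.emplace_back([&lists, i] { lists[i].passIndex = i; });

    // The master thread waits for all recording to finish...
    for (auto& w : workers)
        w.join();

    // ...then submits every list in dependency order (stands in for a
    // single ExecuteCommandLists call on the queue).
    std::vector<int> submissionOrder;
    for (const auto& cl : lists)
        submissionOrder.push_back(cl.passIndex);
    return submissionOrder;
}
```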