
MJP

Moderator
  • Content count: 8594
  • Joined
  • Last visited
  • Days Won: 1

MJP last won the day on March 17

MJP had the most liked content!

Community Reputation: 19921 Excellent

1 Follower

About MJP
  • Rank: XNA/DirectX Moderator & MVP

Personal Information

Social
  • Twitter: @MyNameIsMJP
  • Github: TheRealMJP
  1. GPU Ray Trace SDKs

    OptiX is built on top of CUDA, so it only works on Nvidia hardware. We've used it for years as part of our lightmap baking pipeline, and I would say that it's pretty stable and robust at this point. Early on there were plenty of issues, but they are now on their fifth major version. Performance is good, and CUDA is really nice to work in for the most part.

    Unfortunately I've never used Radeon Rays, so I can't directly compare it to OptiX for you. My understanding is that Radeon Rays works by having your code give their API a list of rays, and then the API gives you back the intersections. This makes sense given that they have to abstract over their various CPU and GPU implementations, but it's very different from working with OptiX. OptiX actually has a whole high-level programming model where you write separate programs for generating rays, evaluating ray hits, and evaluating ray misses. The OptiX runtime then does all kinds of stuff behind the scenes to make it all work efficiently, and also to make it appear to your ray generation program as if everything is happening synchronously. It turns out there are a lot of details in getting good performance between ray generation and hit evaluation, since GPUs need to fill large SIMD units and don't natively have support for fork/join. In other words, there's definitely value in using Nvidia's black-box implementation if you want the best possible performance on their hardware. With Radeon Rays (or your own triangle/ray intersection shaders) it will be up to you to figure out how to efficiently process your hits and spawn more rays. Or at least, that's my understanding from looking at their docs, samples, and APIs.

    DXR is also an appealing option, since it has a programming model that's very similar to OptiX. However, you currently need a pre-release version of Windows, and a very expensive Nvidia Volta-based GPU if you don't want to use their software fallback path.
  2. Yup! Post-processing is usually done by drawing a triangle that covers the entire screen, and using a pixel shader to sample textures and perform image processing operations on them.
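    For example, here's a rough sketch of the usual "full-screen triangle" approach in HLSL, where the vertex shader generates the triangle's positions and UVs from SV_VertexID so that you can just call Draw(3, 0) with no vertex buffer bound (the struct and function names here are made up for illustration):

    struct VSOutput
    {
        float4 Position : SV_Position;
        float2 TexCoord : TEXCOORD0;
    };

    // Generates a single triangle that covers the entire viewport, with UVs of
    // (0,0), (2,0), and (0,2), so no vertex or index buffer is needed
    VSOutput FullScreenTriangleVS(in uint vertexID : SV_VertexID)
    {
        VSOutput output;
        output.TexCoord = float2((vertexID << 1) & 2, vertexID & 2);
        output.Position = float4(output.TexCoord * float2(2.0f, -2.0f) + float2(-1.0f, 1.0f), 0.0f, 1.0f);
        return output;
    }

    The pixel shader then just samples whatever source textures it needs at the interpolated TexCoord and writes out the processed result.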
  3. In my experience RenderDoc is a much better debugger than Nsight, but I haven't used the latest version of Nsight yet. I generally only use Nsight when I need to profile performance on Nvidia hardware.
  4. I have no idea how to make the VS graphics debugger work without crashing, so perhaps you should consider trying out RenderDoc. It supports debugging pixel shaders.
  5. Logically, the depth/stencil test is in the output merger (OM) stage, which runs after the pixel shader. But that pipeline is really more of an abstracted virtual pipeline: it doesn't dictate how the hardware actually works, only how results should appear to the application. This means that GPUs are free to put in optimizations that don't match the virtual pipeline, as long as the results match what the specification says should be produced. So in practice, all GPUs have some sort of early depth/stencil test that they can use to keep the pixel shader from running whenever they can determine that it's safe to do so.

    The two big exception cases are using discard, and using depth/stencil export from the PS. In both of those cases the pixel shader has to be run to know how depth/stencil testing + writing should happen, so the hardware is limited in how it can optimize this. Some hardware will still use early Z for discard: they basically have to do early Z and then re-update the Z value again after the pixel shader if the pixel was discarded. Some will just give up in this case. Full depth output from the PS will always disable early Z, but there are "conservative" depth output modes that will still allow the hardware to do a subset of early Z operations.

    One trick that's used in a lot of games to speed up alpha testing is to run a depth pre-pass for alpha-tested geo, and then re-render that geometry with a full shader but with depth writes off and the depth test set to EQUAL. Basically you pay the discard cost for your simpler pre-pass shader, but then your full shader will only run for the non-discarded pixels (and won't need to have a 'discard' operation, so the hardware can still use early Z). There's a sketch of what that pre-pass can look like below.
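    Here's a rough HLSL sketch of that alpha-test pre-pass (the texture, sampler, and threshold names are hypothetical). The second pass would then draw the same geometry with depth writes disabled and the depth comparison function set to EQUAL:

    // Depth-only pre-pass pixel shader for alpha-tested geometry.
    // No render target is bound; this only exists to run the alpha test and write depth.
    Texture2D<float4> AlbedoMap : register(t0);
    SamplerState AnisoSampler : register(s0);

    void AlphaTestPrePassPS(in float4 position : SV_Position, in float2 uv : TEXCOORD0)
    {
        float alpha = AlbedoMap.Sample(AnisoSampler, uv).a;

        // Kill pixels that fail the alpha test so they never get written to the depth buffer
        clip(alpha - 0.5f);
    }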
  6. DX11 DXGI swapchain call to action

    Yeah, we support both FLIP and non-FLIP paths in the DX11 version of our engine and it's only a few lines of code.
  7. Perhaps you know this already (and if so, sorry for rambling on), but upload buffers are not copied into GPU memory unless you manually do it yourself. On desktop GPUs that have a separate physical pool of memory on the video card, UPLOAD resources will live in CPU memory that's mapped into both the CPU's and GPU's address spaces. When the GPU accesses the resource, it will pull the data over the PCI-e bus, which may be significantly slower compared to DEFAULT resources that live in GPU memory. So generally you only want to use UPLOAD for things where the GPU will only access the data once, like a transient constant buffer or a staging buffer whose contents are then copied into DEFAULT memory.

    An UPLOAD resource will be safe to overwrite when the last ExecuteCommandLists batch has finished reading from it. Like Hodgman says, the way to know when the GPU has finished processing a batch of command lists is to have the queue signal a fence, which the CPU can either poll or wait on. Usually you're doing this anyway to avoid overwriting your command buffers, so the simplest approach is to tie your temporary UPLOAD memory to your command buffers.
  8. There's no way to do this without knowing what's going on in the game's shader programs. Shaders don't just add "fancy eye candy": they're completely responsible for computing vertex positions in screen space, and the final value of a pixel written to a render target. It's totally possible that the game's shaders are only doing straightforward things that could be mostly replicated in the old fixed-function pipeline, or they could be doing things in a complex, unorthodox way. Even common things like skeletal animation will screw you up if you don't know what's going on in the shader, since you would have to know that the vertex shader is doing joint skinning (something like the sketch below) and set up the equivalent functionality with fixed-function states (if it's even possible to do so).
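    To give a concrete idea of what I mean, here's a minimal sketch of the sort of joint skinning a vertex shader might be doing (all of the names and the joint count here are hypothetical, not taken from any particular game):

    cbuffer SkinningConstants : register(b0)
    {
        float4x4 WorldViewProjection;
        float4x4 JointMatrices[64];
    };

    struct VSInput
    {
        float3 Position     : POSITION;
        uint4  JointIndices : BLENDINDICES;
        float4 JointWeights : BLENDWEIGHT;
    };

    float4 SkinnedVS(in VSInput input) : SV_Position
    {
        // Blend the joint transforms by their weights, then apply them before the camera projection
        float4x4 skinTransform = input.JointWeights.x * JointMatrices[input.JointIndices.x] +
                                 input.JointWeights.y * JointMatrices[input.JointIndices.y] +
                                 input.JointWeights.z * JointMatrices[input.JointIndices.z] +
                                 input.JointWeights.w * JointMatrices[input.JointIndices.w];
        float4 skinnedPosition = mul(float4(input.Position, 1.0f), skinTransform);
        return mul(skinnedPosition, WorldViewProjection);
    }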
  9. 3D HLSL Minimap!

    So assuming your rectangle isn't rotated, your minimap bounds essentially form an axis-aligned bounding box (AABB). Now if you have an item on your minimap, you can compute a vector (ray) starting at the minimap center that points towards the item. You then basically need to intersect that ray with the AABB to determine where to put the item if it goes past the bounds of the map. If you google for "ray AABB intersection", you should be able to find some articles that explain how to compute the intersection point. Here's the code that I've used in the past:

    float IntersectRayBox2D(float2 rayOrg, float2 dir, float2 bbmin, float2 bbmax)
    {
        float2 invDir = rcp(dir);
        float2 d0 = (bbmin - rayOrg) * invDir;
        float2 d1 = (bbmax - rayOrg) * invDir;

        float2 v0 = min(d0, d1);
        float2 v1 = max(d0, d1);

        float tmin = max(v0.x, v0.y);
        float tmax = min(v1.x, v1.y);

        return tmin >= 0.0f ? tmin : tmax;
    }

    It will return a "t" value such that "rayOrg + dir * t" will give you the point of intersection.
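    And just to illustrate how you might use it, here's a hypothetical helper for pinning an item's position to the minimap bounds (the parameter names are made up, and everything is assumed to be in the same 2D minimap space):

    float2 ClampToMinimapBounds(float2 minimapCenter, float2 bbmin, float2 bbmax, float2 itemPos)
    {
        // Ray from the minimap center towards the item
        float2 toItem = itemPos - minimapCenter;
        float distToItem = max(length(toItem), 0.0001f);
        float2 dir = toItem / distToItem;

        // Distance at which that ray exits the minimap AABB
        float t = IntersectRayBox2D(minimapCenter, dir, bbmin, bbmax);

        // If the item lies outside the bounds, pin it to the intersection point on the edge
        return minimapCenter + dir * min(t, distToItem);
    }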
  10. There are pre-Vega AMD GPUs that support SM6.0/DXIL, they just don't have non-experimental driver support yet (you need to enable developer mode and specify that you want to enable the experimental shader models feature in D3D12). It certainly works on the RX 460 that I have as my secondary GPU in my home PC.
  11. For VCT, I have written code very similar to what Vilem Otte just described. For general voxel visualization where I care much more about exactness than I do about performance, I've used DDA-like algorithms where I step from each voxel into its neighbor, potentially checking many empty voxels along the way. For debugging purposes this can be fast enough on a beefy desktop GPU. You can accelerate that march by generating a distance field from the voxels, which lets you skip empty space and potentially reduce your iteration counts by quite a bit (the sketch below shows the basic idea). I've done this and it definitely helps, but you would probably have to go deeper and try to reduce intra-warp divergence if you really wanted to make things fast.
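    As a very rough sketch of the distance-field-accelerated march (the resource names and grid parameters here are hypothetical, and a real implementation would need more care around voxel boundaries):

    // DistanceField stores, per voxel, the distance (in voxel units) to the nearest occupied voxel,
    // so each iteration can safely jump over that many empty voxels at once
    Texture3D<float> DistanceField : register(t0);
    Texture3D<float4> VoxelRadiance : register(t1);
    SamplerState PointSampler : register(s0);

    float4 MarchVoxels(float3 rayOrgInVoxels, float3 rayDir, float3 gridSizeInVoxels, uint maxSteps)
    {
        float3 voxelPos = rayOrgInVoxels;
        for (uint i = 0; i < maxSteps; ++i)
        {
            float3 uvw = voxelPos / gridSizeInVoxels;
            if (any(uvw < 0.0f) || any(uvw > 1.0f))
                break;  // marched out of the grid without hitting anything

            float distToOccupied = DistanceField.SampleLevel(PointSampler, uvw, 0.0f);
            if (distToOccupied < 1.0f)
                return VoxelRadiance.SampleLevel(PointSampler, uvw, 0.0f);  // reached an occupied voxel

            // Skip ahead by the empty-space distance instead of stepping a single voxel at a time
            voxelPos += rayDir * max(distToOccupied, 1.0f);
        }

        return float4(0.0f, 0.0f, 0.0f, 0.0f);  // miss
    }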
  12. This is often missing from older materials on Phong shading, but you want to multiply any specular term (actually any BRDF, really) by saturate(dot(N, L)). That term is part of computing the irradiance incident on the surface being shaded, and a BRDF is always defined in terms of irradiance. A lot of people associate N dot L with diffuse lighting, but really it's a core part of computing reflectance (see the snippet below).
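    In code, the idea looks something like this (a Blinn-Phong-style sketch, where the material and light parameters are hypothetical):

    float3 ShadeLight(float3 N, float3 L, float3 V, float3 diffuseAlbedo,
                      float3 specularColor, float specularPower, float3 lightColor)
    {
        float3 H = normalize(L + V);
        float3 diffuse = diffuseAlbedo;
        float3 specular = specularColor * pow(saturate(dot(N, H)), specularPower);

        // The N.L term applies to the entire BRDF (diffuse *and* specular), not just the diffuse part
        return (diffuse + specular) * lightColor * saturate(dot(N, L));
    }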
  13. "DEVICE_REMOVED" means that the driver either crashed or hit some other non-recoverable error, and your D3D device is now toast. If can happen when you manage to hit a big in the driver, but it can also happen if the GPU spends too long processing a single Draw or Dispatch call. The mechanism that checks if the GPU is taking too long is called Timeout Detection and Recovery, or TDR for short. It's there to make sure that your GPU remains responsive to the many processes on your machine that might be sharing it (including OS components like the desktop compositor), since GPU's have varying support for preemption. Either way, the first thing that you should do (if you're not doing it already) is enable the debug layer for your program. Check the debugger output for error messages (or better yet, set the debug layer to break when it encounters an error), since it will give you helpful information whenever you do something that's invalid. My guess is that you're leaving your tessellation shaders bound when you draw your non-tessellated entities (or perhaps leaving the wrong primitive type set), and that's causing your issue.
  14. Hey there! For the use cases you're talking about, I don't think you'll be able to get away with the stencil-based approach that you've used for portals and mirrors. Something like a CCTV monitor is going to have its own camera projection, and then *that* projection will be mapped onto a 2D surface that's rasterized with a totally different camera projection. As far as I know this isn't something you can do in a single pass, and so you would need to render to a texture first and then map that onto the mesh in your scene. This is especially true if you want your monitor to be curved, or if you want to map your sub-view onto any other kind of non-planar surface. So in my opinion, I think that you should bite the bullet and implement a secondary viewport system that can render its results into a texture. It's a lot of work for sure, but it's also got a few other use cases besides the one that you mentioned (like split-screen multiplayer, or 360 degree video recording).

    The good news is that you don't necessarily have to create additional G-Buffer textures. That's certainly the easiest way to do it, but if you want to save memory then you can re-use your render targets across multiple sub-views. In our engine we do this by iterating over all active sub-views to compute the maximum width and height, and creating our render targets with those dimensions. We then use partial viewports to only render each sub-view with the resolution that it requires.

    The downside is that some of your shaders (particularly things like blur passes) may need to be aware of the fact that you're using a partial viewport, and will need to know the viewport fraction in order to compute appropriate UV coordinates and/or clamp the sampling coordinates to keep the sub-views from "bleeding" onto one another (there's a sketch of this below). The other catch is that if you have any render targets that need to persist across multiple frames (for example, a "previous frame" render target that's used for temporal reprojection) then you will need to duplicate those for each sub-view.
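    To illustrate the partial-viewport sampling issue, here's a hypothetical HLSL helper (the constant buffer layout and names are made up): ViewportFraction is the sub-view's resolution divided by the full render target resolution, and the UVs are scaled and clamped so that bilinear filtering doesn't pull in data from a neighboring sub-view.

    Texture2D<float4> SourceTexture : register(t0);
    SamplerState LinearClampSampler : register(s0);

    cbuffer SubViewConstants : register(b0)
    {
        float2 ViewportFraction;   // sub-view size / render target size
        float2 RTTexelSize;        // 1.0 / render target size
    };

    float4 SampleSubView(float2 subViewUV)
    {
        // Remap the sub-view's [0, 1] UV range onto the portion of the shared render target it occupies
        float2 uv = subViewUV * ViewportFraction;

        // Clamp half a texel inside the sub-view region so filtering doesn't bleed across sub-views
        uv = clamp(uv, 0.5f * RTTexelSize, ViewportFraction - 0.5f * RTTexelSize);

        return SourceTexture.SampleLevel(LinearClampSampler, uv, 0.0f);
    }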
  15. DX12 Implicit State Promotion

    I have no actual hands-on experience with Vulkan, so I'm not really the best person to answer this (hopefully someone else will chime in). But after asking around a bit and reading some of the docs/examples, my understanding is that you do need barriers to do the equivalent operation with a transfer queue in Vulkan. It looks like you will always need to make sure that you issue a barrier to give the resource VK_ACCESS_TRANSFER_WRITE_BIT access before the transfer queue can copy to it, or create the resource with that access bit specified. For textures/images it looks like you also need to put the resource into VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL layout, which can't be done when creating the resource and can only be done with a barrier issued on the transfer queue.

    After issuing the copy on the transfer queue, it looks like you also need to issue a barrier on the transfer queue that removes VK_ACCESS_TRANSFER_WRITE_BIT and also indicates that ownership is shifting from the transfer queue to the graphics queue (and also transitions the layout to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL if you're dealing with a texture/image). Finally, you need to issue one last barrier on the graphics queue that adds the appropriate READ_BIT to the access mask, and performs the "acquire" part of the cross-queue transfer. This seems to be confirmed by the synchronization examples given here. If you're trying to do a read_on_graphics_queue -> update_on_transfer_queue -> read_on_graphics_queue operation like we were originally discussing, then I believe you'll also need to do the queue ownership release/acquire steps on both queues, and also specify the appropriate access masks as well as layout bits.

    So to make a long story short, you can't just ignore the barriers like you can on D3D12. The only thing that seems to be nicer in Vulkan is that you don't need to track the "before" layout when transitioning a texture into a TRANSFER_DST layout, at least if you're going to update the whole subresource. You're allowed to use VK_IMAGE_LAYOUT_UNDEFINED as the source layout, as long as you don't care about losing the previous contents of the texture before the transition.