MJP

Moderator
  • Content count

    8620
  • Joined

  • Last visited

  • Days Won

    2

MJP last won the day on June 22

MJP had the most liked content!

Community Reputation

19951 Excellent

1 Follower

About MJP

  • Rank
    XNA/DirectX Moderator & MVP

Personal Information

Social

  • Twitter
    @MyNameIsMJP
  • Github
    TheRealMJP


  1. This is not accurate. GPUs can absolutely use true flow control operations, with the caveat that the flow control needs to be coherent across a group of threads that execute in lockstep. Modern GPUs generally use SIMD hardware that's anywhere from 8-wide to 64-wide, and require the branch condition to be uniform across the whole SIMD to be able to actually take the branch. GPUs only have to resort to lane masking and predication when the result of the branch condition is different across a group of threads on the same SIMD unit. In summary, whether or not a branch/loop actually skips instructions depends on your condition and your grouping of threads. For instance, if you're branching in a pixel shader, you'll want to make sure that the branch condition will be the same across neighboring pixels in the same area of the screen. Or if you branch on a value from a constant buffer that's not dynamically indexed, you can know for sure that all of your threads will take the same path.
  2. Are you actually trying to support Windows XP? If you only need to run on Windows Vista and up, you can create an "Ex" device instead. An Ex device won't go into "device lost" state unless the driver crashes, so you won't have to deal with that. There's some more info here and here.
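
    For reference, here's a rough sketch of what creating an Ex device looks like. This is a minimal windowed-mode setup with the present parameters and error handling simplified, so adapt it to your own swap chain:

    // Minimal sketch of creating a D3D9Ex device (Vista and up).
    // The present parameters are stripped down; fill them in for your app.
    #include <d3d9.h>

    IDirect3DDevice9Ex* CreateD3D9ExDevice(HWND hwnd, UINT width, UINT height)
    {
        IDirect3D9Ex* d3d9 = nullptr;
        if(FAILED(Direct3DCreate9Ex(D3D_SDK_VERSION, &d3d9)))
            return nullptr;

        D3DPRESENT_PARAMETERS pp = { };
        pp.BackBufferWidth = width;
        pp.BackBufferHeight = height;
        pp.BackBufferFormat = D3DFMT_A8R8G8B8;
        pp.BackBufferCount = 1;
        pp.SwapEffect = D3DSWAPEFFECT_DISCARD;
        pp.hDeviceWindow = hwnd;
        pp.Windowed = TRUE;

        IDirect3DDevice9Ex* device = nullptr;
        HRESULT hr = d3d9->CreateDeviceEx(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, hwnd,
                                          D3DCREATE_HARDWARE_VERTEXPROCESSING,
                                          &pp, nullptr, &device); // display mode is only needed for fullscreen
        d3d9->Release();
        return SUCCEEDED(hr) ? device : nullptr;
    }
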
  3. What exactly do you mean by "faking" behaviors? SM6.0 is tied to DXIL and the new shader compiler (DXC), so the ecosystem isn't nearly as mature as it was for DXBC and FXC. But it's been worked on steadily for the past few years, and I've gotten both my home and work codebases running on it. However, I've definitely hit some bugs that only occur on the DXIL path, so it's very possible that you will as well. Fortunately, with the new compiler it's very easy to open an issue on GitHub if you need help with something. As for the wave-level operations in Shader Model 6.0, they're totally optional. The driver has to report that it supports the operations, and also report the wave width, in D3D12_FEATURE_DATA_D3D12_OPTIONS1, so there's no reason for the driver to say that it supports them if it can't actually do a wave-level operation. Either way, Intel, AMD, and Nvidia hardware all work with SIMD units at a low level, and they certainly support cross-lane operations on those SIMD units. I have no reason to think that the vendor extensions would work any better than the wave-level intrinsics in SM6.0. Vendor extensions can expose things that are unique to that hardware, but in the case of the wave-level ops the HLSL intrinsics are pretty much identical to what's in AMD and Nvidia's extensions. Going forward, DXIL and SM6.0 are going to be the official path used by games and apps, so there's no reason to think that the IHVs won't put the bulk of their effort into making sure that path is optimal. Either way, the shader "extensions" provided by Nvidia and AMD are a giant PITA to use, since they require binding a dummy UAV and performing operations on it. If you accidentally do that on hardware that doesn't support those "extensions" you could get weird results, and it's also going to mess up the debug validation layer and debugging tools like PIX and RenderDoc. We don't use them in our codebase for these reasons, and I don't plan on starting.
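
    If it helps, here's a minimal sketch of how you'd query that support at startup before relying on the SM6.0 wave intrinsics ("device" is assumed to be an ID3D12Device you've already created):

    // Sketch: ask the driver whether wave-level ops are supported, and what
    // wave widths to expect. Assumes "device" is an existing ID3D12Device*.
    D3D12_FEATURE_DATA_D3D12_OPTIONS1 options1 = { };
    if(SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS1,
                                             &options1, sizeof(options1))))
    {
        if(options1.WaveOps)
        {
            // Wave intrinsics are usable; the reported lane counts give you the
            // SIMD width range you can expect (e.g. 32 on Nvidia, 64 on GCN).
            UINT minLanes = options1.WaveLaneCountMin;
            UINT maxLanes = options1.WaveLaneCountMax;
        }
    }
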
  4. Converting light directions to tangent space in the VS isn't practical for handling arbitrary numbers of lights, since you're very limited in the number of interpolants that you can pass between the VS and PS (this is especially true from a performance POV). There's also any number of other things that might happen in the PS that need the final surface normal, such as cubemap reflections or SH ambient lighting. I would just convert the normalmap normal to world space and work from there. It will make your code cleaner, and you'll be in a better spot if you start adding more advanced techniques. To compute per-vertex tangents you really need more global info about the triangle and its neighbors. This is why it's very common to compute tangents in an offline pass during mesh import/processing. It is possible to use pixel shader derivatives to compute a tangent frame on-the-fly, but you may run into issues if the derivatives are inaccurate.
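
    In case it's useful, here's a rough sketch of the usual offline tangent generation: accumulate a per-triangle tangent from the UV deltas, then Gram-Schmidt orthonormalize against the vertex normal. The Float3/Vertex types are just placeholders here, and a real importer would also handle degenerate UVs and the bitangent sign for mirrored UVs:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Float3 { float x, y, z; };

    static Float3 Sub(Float3 a, Float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
    static Float3 Add(Float3 a, Float3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
    static Float3 Mul(Float3 a, float s)  { return { a.x * s, a.y * s, a.z * s }; }
    static float  Dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
    static Float3 Normalize(Float3 a)     { return Mul(a, 1.0f / std::sqrt(Dot(a, a))); }

    struct Vertex { Float3 Position; Float3 Normal; float U, V; };

    // Accumulate per-triangle tangents, then orthonormalize against each vertex normal
    void ComputeTangents(const std::vector<Vertex>& verts, const std::vector<uint32_t>& indices,
                         std::vector<Float3>& tangents)
    {
        tangents.assign(verts.size(), Float3{ 0.0f, 0.0f, 0.0f });

        for(size_t i = 0; i < indices.size(); i += 3)
        {
            const Vertex& v0 = verts[indices[i + 0]];
            const Vertex& v1 = verts[indices[i + 1]];
            const Vertex& v2 = verts[indices[i + 2]];

            Float3 e1 = Sub(v1.Position, v0.Position);
            Float3 e2 = Sub(v2.Position, v0.Position);
            float du1 = v1.U - v0.U, dv1 = v1.V - v0.V;
            float du2 = v2.U - v0.U, dv2 = v2.V - v0.V;

            // Solve for the object-space direction of increasing U
            float r = 1.0f / (du1 * dv2 - du2 * dv1);
            Float3 t = Mul(Sub(Mul(e1, dv2), Mul(e2, dv1)), r);

            // Shared vertices accumulate the tangents of all their triangles
            for(size_t j = 0; j < 3; ++j)
                tangents[indices[i + j]] = Add(tangents[indices[i + j]], t);
        }

        // Gram-Schmidt orthonormalize against the vertex normal
        for(size_t i = 0; i < verts.size(); ++i)
        {
            Float3 n = verts[i].Normal;
            Float3 t = tangents[i];
            tangents[i] = Normalize(Sub(t, Mul(n, Dot(n, t))));
        }
    }
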
  5. MJP

    Dynamic Vertex buffer

    Is mPlanetMesh.vertices a std::vector? If so, one thing that you need to be careful about is that "ByteWidth" expects the vertex buffer size in bytes. size() returns the number of elements in the vector, not the size in bytes. For the size in bytes, you want something like "vertices.size() * sizeof(T)", where "T" is the type that the vector is templated on. The other thing to watch out for is the data that you're using to initialize the buffer, which is passed via vertexData.pSysMem. The pointer that you pass here has to point to a block of data that's at least as large as ByteWidth. So if you create the buffer for 1024 vertices but pass a pointer to an array of 512 vertices, then CreateBuffer will access memory past the bounds of that array and potentially cause an access violation. Keep in mind that initializing a buffer with data is optional: you can skip that if you want, and partially update the buffer later on by calling Map() to update its contents. If you want to skip initializing the buffer data, just pass nullptr as the second parameter to CreateBuffer.
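
    Putting that together, here's a sketch of what the buffer creation might look like (assuming the vector holds some Vertex struct, and "device" is your ID3D11Device):

    // Sketch, assuming mPlanetMesh.vertices is a std::vector<Vertex>
    D3D11_BUFFER_DESC desc = { };
    desc.ByteWidth = UINT(mPlanetMesh.vertices.size() * sizeof(Vertex)); // size in bytes, not element count
    desc.Usage = D3D11_USAGE_DYNAMIC;
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;

    D3D11_SUBRESOURCE_DATA initData = { };
    initData.pSysMem = mPlanetMesh.vertices.data(); // must point to at least ByteWidth bytes

    ID3D11Buffer* vertexBuffer = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, &initData, &vertexBuffer);

    // ...or pass nullptr instead of &initData and fill the buffer later with Map()
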
  6. I just wanted to drop a link to a great presentation that I read the other day, which I think might be relevant: https://research.activision.com/t5/Publications/HDR-in-Call-of-Duty/ba-p/10744846 It's about implementing support for HDR displays, but it starts out with a good intro to colorimetry/photometry and how it applies to displays. So it might help you understand the concepts behind sRGB a bit better, and also understand how the new HDR standards differ.
  7. I was referring to a combination of 2 and 3. Basically, if you know you have a projection matrix and you want to invert it, then you use the appropriate scale/translation inverse instead of the generalized matrix inverse. We don't try to automatically detect when it would be more appropriate to use a different method of inverting; it has to be done manually by the programmer.
  8. In our math library we will assert if the matrix isn't invertible when the Invert() function is called. We'll generally work around it by either working with the transform in a different representation (Float3 Translation + Quat Orientation + Float3 scale), or by using special-case inverse functions that only work for certain matrix configurations. For instance we have one that only works for rotation + translation (camera/view matrix) and one that only works for scale + translation (projection matrix).
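
    As an illustration, here's what one of those special-case inverses can look like for a rotation + translation matrix (row-vector convention with the translation in the fourth row; the Float4x4 type is just a placeholder for whatever your math library uses):

    struct Float4x4 { float m[4][4]; };

    // Inverse of a matrix that's known to be rotation + translation only
    Float4x4 InvertRotationTranslation(const Float4x4& mat)
    {
        Float4x4 inv = { };

        // The inverse of a pure rotation is its transpose
        for(int r = 0; r < 3; ++r)
            for(int c = 0; c < 3; ++c)
                inv.m[r][c] = mat.m[c][r];

        // New translation = -(old translation) * transpose(rotation)
        for(int c = 0; c < 3; ++c)
            inv.m[3][c] = -(mat.m[3][0] * inv.m[0][c] +
                            mat.m[3][1] * inv.m[1][c] +
                            mat.m[3][2] * inv.m[2][c]);

        inv.m[3][3] = 1.0f;
        return inv;
    }
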
  9. That's correct: for a microfacet specular BRDF, the Fresnel term isn't computed in terms of the macro surface normal, it's computed in terms of the active microfacet normal (H). However, this doesn't mean that it's completely decoupled from the surface normal, because microfacet specular BRDFs are only defined on the upper hemisphere such that N dot L > 0 and N dot V > 0 (this should make sense intuitively: if the view direction is "below" the surface then it can't actually see the surface, and vice versa for the light). With that constraint in place, it means that if L dot H or V dot H is ~0, then N dot L must also be close to 0. Here's a diagram showing what I mean: that red cone shows the possible range of values for N where L and V would still be on the upper hemisphere with regards to N, and its angle is equivalent to the angle between L/V and the vector perpendicular to H. Therefore, as L dot H and V dot H get smaller, N has to be close to H in order for you to get any reflections at all. In other words, if L/V are set up such that the light is grazing the microfacet normal, it will also be grazing the surface normal for valid configurations of the BRDF. So really there's no point in looking at the Fresnel behavior in the "shadowed area" like you were doing, since the BRDF is always 0 there anyway.
  10. Each CU has 4 SIMDs, and each SIMD has its own vector register file that can support up to 10 waves running on that SIMD (each SIMD has 256 VGPRs, so the whole CU has 1024 VGPRs total). This means that the max occupancy per CU is 40 waves, not 10. However, the shared memory is shared among the 4 SIMDs on a CU, as is the scalar unit and scalar register file. Without shared memory the occupancy calculation is pretty simple: you compute occupancy per SIMD based on VGPRs, which you do by calculating 256 / ShaderVGPRUsage and rounding down to the closest integer. However, with thread groups and shared memory things get more complex, since thread groups with shared memory introduce the requirement that all waves in the thread group have to live on the same CU, which means you have to take both VGPR usage *and* total shared memory usage into account. So if you have a thread group with the max number of threads and the max D3D shared memory allocation (32KB), the max thread groups per CU that you can have is 2 (64KB of LDS per CU / 32KB per group), which means your max per-SIMD occupancy is ((1024 * 2) / 64) / 4 == 8. But that's assuming no VGPRs are used, which is never the case. If you were to use something like 64 VGPRs, then each thread group would collectively use (1024 / 64) * 64 = 1024 registers, meaning that you would be capped at 1 thread group per CU. Or alternatively, if you used 32KB of shared memory and only had 64 threads in your thread group, that would mean you could only run 2 wavefronts on your CU! Basically you have to calculate occupancy in terms of both shared memory and VGPRs and go with the minimum. In reality you also have to compute the occupancy in terms of scalar registers (SGPRs) as well and take that into account, but in my experience it's rare for SGPRs to be the limiting factor. There are more details and guidance here: https://gpuopen.com/optimizing-gpu-occupancy-resource-usage-large-thread-groups/
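
    If it helps, here's a small helper that mirrors the arithmetic above for a GCN-style CU (4 SIMDs with 256 VGPRs each, 64-wide waves, 64KB of LDS per CU, 10 waves per SIMD max). The constants come from the numbers in this post, so treat it as illustrative rather than a universal formula:

    #include <algorithm>
    #include <cstdint>

    // Waves per SIMD, limited by VGPRs and shared memory (SGPRs and other limits ignored)
    uint32_t WavesPerSIMD(uint32_t vgprsPerThread, uint32_t threadGroupSize, uint32_t ldsBytesPerGroup)
    {
        const uint32_t simdsPerCU = 4;
        const uint32_t vgprsPerSIMD = 256;
        const uint32_t waveSize = 64;
        const uint32_t ldsBytesPerCU = 64 * 1024;
        const uint32_t maxWavesPerSIMD = 10;

        // Limit from the vector register file, per SIMD
        uint32_t vgprLimitedWaves = vgprsPerSIMD / std::max<uint32_t>(vgprsPerThread, 1);

        // Limit from shared memory: whole thread groups have to fit on one CU
        uint32_t wavesPerGroup = (threadGroupSize + waveSize - 1) / waveSize;
        uint32_t groupsPerCU = (ldsBytesPerGroup > 0) ? (ldsBytesPerCU / ldsBytesPerGroup)
                                                      : (maxWavesPerSIMD * simdsPerCU);
        uint32_t ldsLimitedWaves = (groupsPerCU * wavesPerGroup) / simdsPerCU;

        return std::min({ vgprLimitedWaves, ldsLimitedWaves, maxWavesPerSIMD });
    }

    // Examples from above: WavesPerSIMD(1, 1024, 32 * 1024) == 8,
    // while WavesPerSIMD(64, 1024, 32 * 1024) == 4 (i.e. one thread group per CU).
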
  11. As promised, here are some images. This first set is taken from BRDF Explorer, which is a very useful tool for these sorts of things. The first image shows the "head-on" angle, where the lighting and viewing direction are nearly lined up with the surface normal. I included the polar plot in there so that you can clearly see what I'm talking about: the blue line is the light direction, the green line is the surface normal, and the pink line is the viewing direction. The red blobby shape is the BRDF lobe, which shows the intensity of the reflected lighting in a particular direction. The second image shows a "grazing" angle with no Fresnel, where the lighting and viewing directions are nearly perpendicular to the surface normal. The third image shows the grazing angle with Fresnel enabled. You can see that the difference is very noticeable, both in the BRDF plot as well as in the actual rendered image. For reference, these were all rendered with F0 == 0.03, using a Cook-Torrance specular BRDF with a GGX distribution and a roughness of about 0.12. For completeness, here are images with Fresnel disabled and enabled using a path tracer to render a scene with full indirect specular. Notice how you lose a lot of the specular in the scene without Fresnel!
  12. Yeah, dithering is the only way I know of to improve quality in dark scenes. Some games apply a film grain effect for aesthetic purposes, and that effectively ends up giving you a dither pattern. What's your F0 value? Typical non-metal materials have an F0 in the range of 0.02-0.04, and the Fresnel effect can be very noticeable on them (especially for lower roughness). It will be most noticeable at a "grazing" angle, where the eye/camera direction is nearly parallel to the surface and the light is on the opposite side of the surface from the eye. It also depends on the rest of your specular BRDF, since the geometry/visibility terms in proper microfacet BRDFs will also give much stronger reflections at grazing angles. I'm not on my main PC right now, but I'll get some screenshots for you later when I'm on my desktop.
  13. With graphics programming it's always important to break down your performance into CPU performance and GPU performance. The CPU and GPU run concurrently with each other, so it's typical that one will take longer than the other to complete a frame. If the GPU is taking longer than the CPU we call it being "GPU-bound", and if it's the other way around we call it being "CPU-bound". It's good practice to build your own in-engine tools for measuring CPU performance with a high-resolution timer, and for measuring GPU performance with timestamp queries. External CPU and GPU profiling tools like PIX, Nvidia Nsight, and VTune can also help you to gather the necessary information. Alternatively, you can often quickly determine whether you're CPU- or GPU-bound through simple experimentation. So let's now look at your situation specifically. The GS technique that you've used can drastically reduce draw calls (by up to 6x for the cubemap case). Draw calls tend to add to your CPU frame time, but often won't have much effect on your GPU frame time. In other words, it will probably improve your overall frame time if you're CPU-bound, but isn't likely to help if you're GPU-bound. The bad part about the GS technique (and the main reason why it's infrequently used) is that some aspects of the GS can be difficult to implement on a GPU in an efficient way. In particular, having any kind of geometry amplification (which is what you're doing in your GS) can be really slow, since it doesn't play nicely with processing lots of triangles in parallel (it's especially tricky for GPUs since D3D requires that triangles output by the GS get rasterized in-order). In particular AMD has historically had problems with GS performance, which is mostly due to their implementation having to spill all of the triangles to memory before rasterizing. You may have more luck with GS instancing, but I've never used that myself and I don't know if it's actually more optimal for existing GPUs. AMD and Nvidia have some "extensions" for their recent GPUs that let you do some neat tricks. Nvidia has their "Fast GS" available through NVAPI, which supports using a viewport mask to "broadcast" a triangle to multiple viewports or RT slices. They actually have a sample that uses this for cascaded shadow maps, but unfortunately the code uses OpenGL and not D3D. Meanwhile, AMD has APIs that let you specify a viewport/RT broadcast mask from the CPU, and let you get the index in the shader to transform the vertices differently (which suggests that it's more of an instancing API).
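
    For the timestamp query part, here's a bare-bones D3D11 sketch of timing a stretch of GPU work. In real code you'd buffer several sets of queries and read back the results a few frames later instead of spinning on GetData:

    #include <d3d11.h>

    // Returns the GPU time in milliseconds for the work issued between the two timestamps
    double MeasureGPUTime(ID3D11Device* device, ID3D11DeviceContext* context)
    {
        D3D11_QUERY_DESC desc = { };
        desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
        ID3D11Query* disjoint = nullptr;
        device->CreateQuery(&desc, &disjoint);

        desc.Query = D3D11_QUERY_TIMESTAMP;
        ID3D11Query* start = nullptr;
        ID3D11Query* end = nullptr;
        device->CreateQuery(&desc, &start);
        device->CreateQuery(&desc, &end);

        context->Begin(disjoint);
        context->End(start);

        // ... issue the draw/dispatch calls that you want to measure here ...

        context->End(end);
        context->End(disjoint);

        // Spin until the results are ready (don't do this in shipping code)
        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT disjointData = { };
        while(context->GetData(disjoint, &disjointData, sizeof(disjointData), 0) != S_OK) { }

        UINT64 startTime = 0, endTime = 0;
        context->GetData(start, &startTime, sizeof(startTime), 0);
        context->GetData(end, &endTime, sizeof(endTime), 0);

        disjoint->Release();
        start->Release();
        end->Release();

        if(disjointData.Disjoint)
            return 0.0;
        return 1000.0 * double(endTime - startTime) / double(disjointData.Frequency);
    }
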
  14. I don't think that the Fresnel calculation is the source of your banding. I looked at your screen capture in Photoshop, and the different bands have an intensity difference of exactly 1/255. This implies that you're at the limit of what can be represented in 8-bit sRGB. You may want to look into applying some dithering to effectively hide the banding; Playdead had some presentations about this that you should check out. Regarding Fresnel, the "correct" version depends on the context. The basic Fresnel equations deal with reflection and refraction of a ray of light. In plain English, I would explain it like this: "When the light ray is pointed directly into the surface, less light is reflected off the surface and more light is refracted into the surface. When the light ray is grazing the surface, more light is reflected off and less light is refracted into the surface." Based on that explanation you can see why Schlick's approximation makes use of a dot product, since it's a simple way of determining whether a vector is grazing a surface or pointing into a surface, given a normal vector for that surface. So your basic Schlick's approximation for determining the amount of reflected light would look something like this: return F0 + (1.0f - F0) * pow(1.0f - saturate(dot(N, L)), 5.0f); If you're dealing with perfectly mirrored surfaces (roughness of 0), this equation applies. However, typically we're using a specular BRDF that models surfaces that are "rough" at a microscopic level. These BRDFs assume that the surfaces are made up of tiny "microfacets" that each exhibit perfect Fresnel behavior, but might be oriented in all kinds of directions (the roughness parameter controls the degree to which those microfacets are all aligned, or un-aligned). With a microfacet BRDF, instead of directly computing the reflected light off of a surface, you typically determine the portion of the microfacets that are aligned with the half vector. The half vector is exactly between the light direction and view direction, so if a microfacet is aligned with the half vector then it will produce a perfect reflection towards the eye. Because of this, the half vector is sometimes referred to as the "active microfacet direction". With the BRDF being formulated this way, you instead want to compute your Fresnel reflectance using that active microfacet direction instead of the surface normal: return F0 + (1.0f - F0) * pow(1.0f - saturate(dot(H, L)), 5.0f); This is also equivalent to using dot(H, V), since the half vector is exactly in-between L and V. So to make a long story short: the Fresnel equation that you're using is correct. The one you're looking at in that Google image appears to be using dot(V, N), which is something else. This is essentially giving you the reflectance amount assuming you started at the eye and shot a ray towards the surface, which due to reciprocity is the same as doing dot(reflect(V, N), N). This is probably what you would use if you were computing Fresnel for a mirror BRDF and sampling a lighting environment, for instance an IBL cubemap. This is *not* what you would want to use for a local light source, since Fresnel always depends on the incoming lighting direction!
  15. What do you mean by "lines"? I assume you're talking about the total number of SIMD lanes, which are the execution units that process a single thread? Most GPUs require that all threads in a thread group using shared memory "live" on the same GPU core (AMD calls them Compute Units, or CUs for short; Nvidia calls them Streaming Multiprocessors, or SMs for short). Most GPUs can over-commit threads to the functional units on those cores. For instance, an AMD CU has 4 SIMD units, and each one of those SIMD units can have up to 10 wavefronts active at once. Those 10 wavefronts don't actually execute at the same time; instead the hardware will cycle through them (usually it will try to do so when a shader program encounters a long stall due to memory access). However, the max number of waves that it can keep in flight simultaneously (called the "occupancy") is limited by both the number of registers that the shader uses, as well as the shared memory allocation. The hardware will try to fill up the cores with as many wavefronts as possible until it either runs out of registers or it runs out of on-chip memory used as the backing store for shared memory, which is why you usually want to minimize those two things.