
Matias Goldberg

Member Since 02 Jul 2006

#5297646 glsl represent 1 big texture as 4 smaller ones (tearing)

Posted by Matias Goldberg on 22 June 2016 - 05:14 PM

You're gonna have trouble with bilinear (gets worse with trilinear) filtering at the edges because the GPU should be interpolating between the two textures, but obviously this won't happen, so you need to do it yourself.


Potentially you may have to sample all four textures and interpolate it yourself:

// Assuming layout of textures:
// |0|1|
// |2|3|
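// c0..c3: the taps from textures 0..3; uv is in [0, 1] over the whole 1024x1024 image.
// The bilinear weight is fract( texelCoord - 0.5 ), with the half-texel offset in texel units.
result = mix(
    mix( c0, c1, fract( uv.x * 1024.0 - 0.5 ) ),
    mix( c2, c3, fract( uv.x * 1024.0 - 0.5 ) ),
    fract( uv.y * 1024.0 - 0.5 ) );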

If you're near the vertical seam (between the left and right textures), you only need c0 & c1 or c2 & c3; if you're near the horizontal seam (between the top and bottom textures), you only need c0 & c2 or c1 & c3. But if you're close to where the two seams cross, you're going to need to sample and mix all 4 textures.


Also, the mipmaps need to be generated offline from the original 1024x1024 image rather than on the GPU, since the GPU will generate them from each 512x512 block individually.


I can't think of a quick way to fix the trilinear filtering problem, though.

#5297226 How to get patch id in domain shader.

Posted by Matias Goldberg on 19 June 2016 - 11:43 AM


Also, drawing each patch in its own draw call sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

I thought it was 64 vertices to fully utilize the vertex shader and 256 to not become command processor limited.
Edit: for AMD.


AMD's wavefront size is 64, that's true, but there are some inefficiencies and overhead details, such as needing 3 vertices to make a triangle (e.g. 64 triangles x 3 = 192 vertices, assuming no triangle shares any vertex). Real-world testing shows that, on average, you get near-optimum throughput at >= 256 vertices per draw.
Edit: see http://www.g-truc.net/post-0666.html

@Matias is it still true if I have a pass-through vertex shader?


#5297150 How to get patch id in domain shader.

Posted by Matias Goldberg on 18 June 2016 - 05:01 PM

Also, drawing each patch in its own draw call sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

#5294988 SampleLevel not honouring integer texel offset

Posted by Matias Goldberg on 04 June 2016 - 12:22 PM

Based on personal experience, do not rely on the offset parameters: broken drivers, broken hardware, mismatching results across vendors. It's better to just apply the offset to the UVs yourself.

#5293812 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by Matias Goldberg on 27 May 2016 - 10:01 AM

I just realized: are you clearing the colour, depth and stencil buffers every frame? (at least the ones linked to the swap chain)
If you're not, you're creating inter-frame dependencies that could also explain this behaviour.

#5293684 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by Matias Goldberg on 26 May 2016 - 04:35 PM

By the way, if you're reading from the framebuffer, it would totally explain it (e.g. postprocessing, or worse... reading it back on the CPU).
Treat the backbuffer as write-only.

#5292902 Hybrid Frustum Traced Shadows

Posted by Matias Goldberg on 22 May 2016 - 12:40 PM

Also, how does the irregular z-buffer fit into this?

They don't use an irregular z-buffer. They don't even need a Z-buffer. Pay attention again: instead of storing depth at each pixel, they store the triangle's plane equation coefficients. A Z-buffer is used to store depth. If they don't store depth, they are not using a Z-buffer.

So where does this https://developer.nvidia.com/sites/default/files/akamai/gameworks/Frustum_Trace.jpg fit into what you just described?

The picture is a visual description of "depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );"

#5292790 Hybrid Frustum Traced Shadows

Posted by Matias Goldberg on 21 May 2016 - 04:49 PM

During the caster pass, instead of storing depth at each pixel, they store the triangle's plane equation coefficients.


During the receiver pass, instead of doing the depthAtReceiver >= depthAtShadowmap test like in regular shadow mapping, they perform a depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y ) test.

This effectively becomes a form of raytracing, since it's a ray vs. triangle intersection test.

#5292789 Terrain Rendering

Posted by Matias Goldberg on 21 May 2016 - 04:40 PM

Now, why one single VBO?

Well, I see no reason to use multiple VBOs since I can scale down my patch.

For instance, a level 0 patch of 33x33 vertices splits into 4 33x33 patches having 0.25 the size of the parent patch.

(33 x 33 vertices means a width and height of 32; I love numbers that are a power of 2, probably an OCD or something.)

The question is why do you need a VBO at all?

With modern GPUs, you can compute the XZ position from gl_VertexID (X = gl_VertexID % verticesPerRow; Z = gl_VertexID / verticesPerRow) and grab the Y component from the heightmap texture.

#5292373 Pixel Shader 3 weirdness

Posted by Matias Goldberg on 18 May 2016 - 05:16 PM

The others are right. It is a requirement to match VS 3.0 shaders with PS 3.0.

The only exception is VS_SW 3.0 which can be matched with PS 2.0 (very old Intel cards).


If you had turned on the debug runtime, you would have spotted this issue. The Debug Runtimes are your friend.

#5291984 Material, Shaders, Shader variants and Parameters

Posted by Matias Goldberg on 16 May 2016 - 08:40 PM

Don't follow Unity's and UE4's exact approach: they're overengineered designs born out of DX9-style rendering that had to evolve and adapt over time.


If you design your material system that way, you're going to inherit the same slowness that plagues those engines.


There's no need for so many classes.

All you have is:

  1. Shaders. Make a representation that simply encapsulates the file and compiles it according to input parameters.
  2. Materials. A collection of shaders with per-material parameters that affect how the shader will be compiled, what parameters will be passed during draw instead of compile time, and what textures will be bound.
  3. MaterialManager. Aside from creating materials, it's responsible for keeping shared per-pass parameters (such as view matrices and fog parameters) in a different place (i.e. a different const buffer). It is also aware of Materials and Renderable objects so that it can match inputs that are per-object during rendering (such as the world matrix, or bone matrices in the case of skinning).

That's all you need. Also, stop thinking in terms of parameters; that's a DX9-style thing that nowadays only works well for postprocessing effects and some compute shaders. Start thinking in terms of memory layouts (buffers) and frequency of updates (there are generally going to be 3 buffers: one updated per pass; one per material, updated when a material stored in that buffer changes; and one updated per object).

#5291982 [D3D12] Ping Pong Rendering

Posted by Matias Goldberg on 16 May 2016 - 08:17 PM

Adam Miles' answer is correct. I'll just expand on it:

Huh.  So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier?  The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

An 8x8 threadgroup works on a block of 8x8 pixels. To process a 1024x1024 texture you'll need 16384 threadgroups (128 x 128).


A DeviceMemoryBarrier will sync all transfers to global memory (such as RWStructuredBuffers, RWTexture2Ds) within the threadgroup (within that 8x8 block).

A GroupMemoryBarrier will sync all transfers to shared memory (everything declared as groupshared, which is usually stored in on-chip memory; in GCN this is called the LDS, Local Data Share), also within the threadgroup.


The difference between these two barriers is which kind of memory they sync. But neither of them can sync across the whole dispatch; there is no intrinsic function to do such a thing. For that, you need separate dispatches.

#5291003 Clarification about shaders and Directx11

Posted by Matias Goldberg on 10 May 2016 - 12:50 PM

Excluding the possibility that you set a shader earlier and didn't unset it (or some 3rd-party DLL such as Direct2D did), 3D APIs are like web browsers: if you do something the docs specifically tell you not to do and it still works on your machine, that doesn't mean it will work on other machines.

#5290579 RenderDoc (0.28) not properly capturing output

Posted by Matias Goldberg on 07 May 2016 - 03:19 PM

I don't see you issuing a clear (which is a huge red flag unless you're doing it on purpose and know what you're doing).

Perhaps you need to enable RenderDoc's "Save All Initials" capture option.


RenderDoc also allows you to check the entire pipeline, see the outputs of the VS, and even debug the VS and PS shaders. Have you tried that?

There's also a pixel history log that will tell you why a pixel is of that colour (e.g. it was cleared, then set to red by a pixel shader, then another pixel shader's output was rejected by the depth test, etc.).

#5289634 Why does GLSL use integers for texture fetches?

Posted by Matias Goldberg on 01 May 2016 - 05:00 PM

TBH I thought it was a bad call. And I still think it is.
However, I found one instance where the fetch taking an int instead of a uint was useful: clamp-to-edge emulation.
I needed my fetches to clamp to edge; so typical code would look like this:

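ivec2 xy = some_value - another_value;
// textureResolution: ivec2, e.g. from textureSize( myTex, 0 ); the last valid texel is resolution - 1
xy = clamp( xy, ivec2( 0 ), textureResolution.xy - 1 );
float val = texelFetch( myTex, xy, 0 ).x;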

This code would not work as intended if "xy" were a uvec2, because values below 0 would wrap around to huge positive values and hence get clamped to the upper bound (the other edge!) instead of to 0. It would behave the same as applying only the upper bound of the clamp.

However, I'm like Hodgman: I prefer unsigned integers because we're addressing memory here, and negative memory makes no sense, and I prefer assert( x < elemSize ) over assert( x >= 0 && x < elemSize );
This case I talk about (clamp to edge) can simply be solved through explicit casts. IMO ints here bring more trouble than benefits.