
Matias Goldberg

Member Since 02 Jul 2006

#5293812 NV Optimus notebook spend too much time in copy hardware queue?

Posted by Matias Goldberg on 27 May 2016 - 10:01 AM

I just realized: are you clearing the colour, depth and stencil buffers every frame? (at least the ones linked to the swap chain)
If you're not, you're creating inter-frame dependencies that could also explain this behaviour.
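
For illustration, the kind of per-frame clear I mean would look roughly like this, assuming a D3D11 swap chain (a hedged sketch; variable and function names are made up, not from the thread):

#include <d3d11.h>

// Hedged sketch: clear the swap chain's colour and depth/stencil views at the
// start of every frame so no frame depends on the previous frame's contents.
void clearSwapChainViews( ID3D11DeviceContext *ctx,
                          ID3D11RenderTargetView *backbufferRtv,
                          ID3D11DepthStencilView *dsv )
{
    const float clearColour[4] = { 0.0f, 0.0f, 0.0f, 1.0f };
    ctx->ClearRenderTargetView( backbufferRtv, clearColour );
    ctx->ClearDepthStencilView( dsv, D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL, 1.0f, 0 );
}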


#5293684 NV Optimus notebook spend too much time in copy hardware queue?

Posted by Matias Goldberg on 26 May 2016 - 04:35 PM

By the way, if you're reading from the framebuffer (e.g. postprocessing, or worse... reading it back on the CPU), that would totally explain it.
Treat the backbuffer as write-only.


#5292902 Hybrid Frustum Traced Shadows

Posted by Matias Goldberg on 22 May 2016 - 12:40 PM

Also, how does the irregular z-buffer fit into this?

They don't use an irregular z-buffer. They don't even need a Z-buffer. Pay attention again: instead of storing depth at each pixel, they store the triangle's plane equation coefficients. A Z-buffer is used to store depth. If they don't store depth, they are not using a Z-buffer.

So where does this https://developer.nvidia.com/sites/default/files/akamai/gameworks/Frustum_Trace.jpg fit into what you just described?

The picture is a visual description of "depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );"


#5292790 Hybrid Frustum Traced Shadows

Posted by Matias Goldberg on 21 May 2016 - 04:49 PM

During the caster pass, instead of storing depth at each pixel, they store the triangle's plane equation coefficients.

 

During the receiver pass, instead of doing a depthAtReceiver >= depthAtShadowmap test like in regular shadow mapping, they perform depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );

This effectively becomes a form of raytracing, since it's a ray vs triangle intersection test.
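
For illustration, a rough C++ sketch of that receiver-side test (my own, not NVIDIA's code; all names are made up). The caster pass stored the plane coefficients (a, b, c, d) of the triangle covering each shadow-map texel, so the receiver pass reconstructs the occluder depth analytically instead of reading a stored depth value:

struct PlaneEq { float a, b, c, d; }; // plane: a*x + b*y + c*z + d = 0

// Solve the plane equation for z at the receiver's projected (x, y).
float calculateDepthAt( const PlaneEq &p, float x, float y )
{
    return -( p.a * x + p.b * y + p.d ) / p.c; // assumes the plane isn't edge-on (c != 0)
}

bool isInShadow( const PlaneEq &p, float x, float y, float depthAtReceiver )
{
    // Same comparison as regular shadow mapping, but against an analytically
    // reconstructed depth; effectively a ray vs triangle intersection test.
    return depthAtReceiver >= calculateDepthAt( p, x, y );
}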




#5292789 Terrain Rendering

Posted by Matias Goldberg on 21 May 2016 - 04:40 PM

Now, why one single VBO?

Well, I see no reason to use multiple VBOs since I can scale down my patch.

For instance, a level 0 patch of 33x33 vertices splits into 4 33x33 patches, each 0.25 the size of the parent patch.

(33x33 vertices means a width and height of 32; I love numbers that are a power of 2, probably an OCD thing or something.)

The question is why do you need a VBO at all?

With modern GPUs, you can compute the XZ position via gl_VertexID (gl_VertexID / verticesPerRow, gl_VertexID % verticesPerRow) and grab the Y component from the heightmap texture.
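
For illustration, here's the same arithmetic written out on the CPU (a hedged C++ sketch of what the gl_VertexID-based vertex shader would compute; verticesPerRow, gridSpacing and the row/column-to-XZ mapping are assumptions of mine):

#include <vector>

struct Vec3 { float x, y, z; };

Vec3 terrainVertexPosition( int vertexId, int verticesPerRow,
                            const std::vector<float> &heightmap, float gridSpacing )
{
    const int row = vertexId / verticesPerRow;
    const int col = vertexId % verticesPerRow;
    Vec3 pos;
    pos.x = col * gridSpacing;
    pos.z = row * gridSpacing;
    pos.y = heightmap[row * verticesPerRow + col]; // in the shader: a heightmap texture fetch
    return pos;
}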




#5292373 Pixel Shader 3 weirdness

Posted by Matias Goldberg on 18 May 2016 - 05:16 PM

The others are right. It is a requirement to match VS 3.0 shaders with PS 3.0 shaders.

The only exception is VS_SW 3.0 which can be matched with PS 2.0 (very old Intel cards).

 

If you had turned on the Debug Layer, you would have spotted this issue. The Debug Runtimes are your friend.




#5291984 Material, Shaders, Shader variants and Parameters

Posted by Matias Goldberg on 16 May 2016 - 08:40 PM

Don't follow Unity's and UE4's exact approach, because they're overengineered tech born out of DX9-style rendering that had to evolve and adapt over time.

 

If you design your material system that way, you're going to inherit the same slowness that plagues those engines.

 

There's no need for so many classes.

All you have is:

  1. Shaders. Make a representation that simply encapsulates the file and compiles it according to input parameters.
  2. Materials. A collection of shaders with per-material parameters that affect how the shader will be compiled, what parameters will be passed during draw instead of compile time, and what textures will be bound.
  3. MaterialManager. Aside from creating materials, it's responsible for keeping shared per-pass parameters (such as view matrices or fog parameters) in a different place (i.e. a different const buffer). It is also aware of Materials and Renderable objects so that it can match inputs that are per-object during rendering (such as the world matrix, or bone matrices in the case of skinning).

That's all you need. Also, stop thinking in terms of individual parameters; that's a DX9-style thing that nowadays only works well for postprocessing effects and some compute shaders. Start thinking in terms of memory layouts (buffers) and frequency of updates: there are generally going to be 3 buffers. One is updated per pass; one is per material, updated only when a material stored in that buffer changes; and one is updated per object.
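
For illustration, a rough C++ sketch of those three pieces (my own, every name here is made up; it is not any engine's actual API):

#include <string>
#include <vector>

struct Shader
{
    std::string file; // shader source file
    // Compiles 'file' according to compile-time parameters (e.g. a list of #defines).
    void compile( const std::vector<std::string> &defines );
};

struct Material
{
    std::vector<Shader*>     shaders;           // the shaders this material uses
    std::vector<std::string> compileTimeParams; // affect how the shaders get compiled
    std::vector<float>       drawTimeParams;    // uploaded to the per-material buffer
    std::vector<unsigned>    textures;          // texture handles to bind
};

class MaterialManager
{
public:
    Material* createMaterial( const std::string &name );

    // Shared per-pass parameters (view matrices, fog, ...) live in their own
    // const buffer, separate from per-material and per-object data.
    void updatePerPassBuffer();

    // Matches per-object inputs (world matrix, bone matrices for skinning)
    // between Materials and Renderable objects at render time.
    void fillPerObjectBuffer();
};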




#5291982 [D3D12] Ping Pong Rendering

Posted by Matias Goldberg on 16 May 2016 - 08:17 PM

Adam Miles' answer is correct. I'll just expand on it:

Huh.  So then what is the difference between a GroupMemoryBarrier and a DeviceMemoryBarrier?  The latter talks about blocking for "device memory accesses", which I took to mean things like RWStructuredBuffers, RWTexture2Ds, etc.

An 8x8 ThreadGroup works on a group of 8x8 pixels. To process a 1024x1024 texture you'll need 16384 thread groups.

 

A DeviceMemoryBarrier will sync all transfers to global memory (such as RWStructuredBuffers, RWTexture2Ds) within the threadgroup (within that 8x8 block).

A GroupMemoryBarrier will sync all transfers to shared memory (everything declared as groupshared, which is usually stored in on-chip memory; on GCN this is called LDS, Local Data Share), also within the threadgroup.

 

The difference between these two barriers is which kind of memory they sync. But neither of them can sync across the whole dispatch. There is no intrinsic function to do such a thing.




#5291003 Clarification about shaders and Directx11

Posted by Matias Goldberg on 10 May 2016 - 12:50 PM

Excluding the possibility that you set a shader earlier and didn't unset it, or some 3rd party DLL did (such as Direct2D), 3D APIs are like web browsers: when you do something the docs specifically tell you not to do and it still works on your machine, it doesn't mean it will work on other machines.


#5290579 RenderDoc (0.28) not properly capturing output

Posted by Matias Goldberg on 07 May 2016 - 03:19 PM

I don't see you issuing a clear (which is a huge red flag unless you're doing it on purpose and know what you're doing).

Perhaps you need to enable RenderDoc's save initials setting.

 

RenderDoc also allows you to check the entire pipeline, see the outputs of the VS, and even debug the VS and PS shaders. Have you tried that?

There's also a pixel history log that will tell you why a pixel has that colour (e.g. it was cleared, then set to red by a pixel shader, then a later pixel shader write was rejected by the depth test, etc.).




#5289634 Why does GLSL use integers for texture fetches?

Posted by Matias Goldberg on 01 May 2016 - 05:00 PM

TBH I thought it was a bad call. And I still think it is.
 
However I found one instance where the fetch being an int was useful instead of a uint: clamp-to-edge emulation.
I needed my fetches to clamp to edge; so typical code would look like this:

ivec2 xy = some_value - another_value;
xy = clamp( xy, ivec2( 0 ), textureResolution.xy - 1 ); // clamp to both edges; last valid texel is resolution - 1
float val = texelFetch( myTex, xy, 0 ).x;               // texelFetch needs an explicit mip level

This code would not work as intended if "xy" were a uvec2, because values below 0 would wrap to huge values and hence get clamped to the other edge (textureResolution) instead of clamping to 0. It would be the same as doing xy = min( xy, textureResolution.xy );

However, I'm with Hodgman: I prefer unsigned integers because we're addressing memory here and negative memory makes no sense, and I prefer assert( x < elemSize ) over assert( x >= 0 && x < elemSize );
The case I talk about (clamp to edge) can simply be solved with explicit casts. IMO ints here bring more trouble than benefits.
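
For what it's worth, the wrap problem is trivial to reproduce outside GLSL too; a plain C++ sketch of my own (not from the original post):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main()
{
    const int32_t  signedXy   = 2 - 5;                          // -3
    const uint32_t unsignedXy = uint32_t( 2 ) - uint32_t( 5 );  // wraps to 4294967293

    // Signed: clamps to the lower edge (0), as intended.
    std::printf( "%d\n", std::clamp( signedXy, 0, 511 ) );      // prints 0
    // Unsigned: the wrapped value gets clamped to the upper edge instead.
    std::printf( "%u\n", std::clamp( unsignedXy, 0u, 511u ) );  // prints 511
    return 0;
}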




#5289488 Use Buffer or Texture, PS or CS for GPU Image Processing?

Posted by Matias Goldberg on 30 April 2016 - 05:44 PM

I have researched into this very recently, so it's fresh in my memory:

 

For large kernels (kernel_radius > 4, which means > 9 taps per pixel), Compute Shaders outperform Pixel Shaders; as the kernel grows, the difference reaches up to 100% on my AMD Radeon HD 7770.

 

However, you need to be careful about the CS method because maximizing throughput isn't easy.

"Efficient Compute Shader Programming" from Bill Bilodeau describes several ways on maximizing throughput, and GPUOpen has a free SeparableFilter11 implementation of the techniques described there with full source code and a demo with lots of knots to tweak and play with.

 

As for Buffer vs Texture: like you said, a linear layout is great for the horizontal pass but terrible for the vertical pass, thus Textures perform better. Also, if you end up sampling this data later on in non-linear patterns (or need something other than point filtering), a Texture is usually a win.

 

You may want to look into addressing the texture from the Compute Shader in Morton order to counteract the texture's Morton-like swizzling and hence improve speed when possible, but I haven't looked into that.

 

And of course, on D3D12/Vulkan, a Compute Shader based solution means an opportunity for Async Shaders which can increase speed on AMD's GCN, or decrease it on NVIDIA.




#5289098 C++ CLI or native for game engine

Posted by Matias Goldberg on 28 April 2016 - 09:50 AM

Clarifying so what Josh Petrie said makes sense:

You're confusing the CRT (C Run-Time) with CLI (Common Language Infrastructure).

 

C++/CLI is some weird hybrid between C++ & C# for interoperating between those two languages.

CRT is native and provides common basic C functionality such as malloc, free, strcpy, etc. When you see projects linked to the VC Runtime, they're linking to the CRT.




#5289027 D3D12 / Vulkan Synchronization Primitives

Posted by Matias Goldberg on 27 April 2016 - 08:20 PM

I fail to see how:

waitOnFence( fence[i] );

is any different from:

waitOnFence( fence, i );

Yes, the first one might require more "malloc" (I'm not speaking in the C malloc sense, but rather in "we'll need more memory somewhere") assuming the second version doesn't have hidden overhead.

 

However, since you shouldn't have much more than ~10 fences (3 for triple buffering + 6 for overall synchronization across those 3 frames + 1 for streaming), memory usage becomes irrelevant. If you are calling "waitOnFence(...)" (which has a high overhead) more than 1-3 times per frame, you're probably doing something wrong and it will likely begin to show up in GPUView (unless you have carefully calculated why you are fencing more than the norm and it makes sense for what you're doing).

 

Btw, you can emulate DX12's style in Vulkan (assuming you have a maximum limit on what the waited-on value will be) with:

class MyFence
{
#if VULKAN
    VkFence m_fence[N];   // one binary fence per possible value
#else
    ID3D12Fence *m_fence; // a D3D12 fence already carries a 64-bit value
#endif

public:
    MyFence( uint maxN );

    void wait( uint value );
};

due to creating 1 fence per ExecuteCommandLists

 

Ewww. Why would you do that?

Fence once per frame like Hodgman said. The only exceptions are syncing with the compute & copy queues (but keep the waits to a minimum).
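
For illustration, a hedged D3D12 sketch of the "fence once per frame" pattern (function and variable names are made up; error checking omitted):

#include <windows.h>
#include <d3d12.h>

// Signal one monotonically increasing value at the end of each frame...
void signalEndOfFrame( ID3D12CommandQueue *queue, ID3D12Fence *fence, UINT64 frameValue )
{
    queue->Signal( fence, frameValue );
}

// ...and before reusing that frame's resources, wait until the GPU has reached it.
void waitForFrame( ID3D12Fence *fence, UINT64 frameValue, HANDLE fenceEvent )
{
    if( fence->GetCompletedValue() < frameValue )
    {
        fence->SetEventOnCompletion( frameValue, fenceEvent );
        WaitForSingleObject( fenceEvent, INFINITE );
    }
}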




#5288637 GPL wtf?

Posted by Matias Goldberg on 25 April 2016 - 01:10 PM

A quick read of the SFC post shows quite a different view.

 

From their perspective, it's not the GPL, but rather that the CDDL license forbids distributing their software linked with software that can't be covered by the CDDL (such as the GPL).

 

I guess "GPL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the GPL"" calls more the attention than "CDDL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the CDDL".

 

So, to your question "So, since it is against the GPL to combine non-GPL stuff with the Linux kernel, is Valve in violation of the GPL?": No, because Valve isn't saying the Linux kernel shouldn't be GPL when distributing SteamOS with their own software, while ZFS's license says the Linux kernel can't be GPL if ZFS is included in binary form.

At least, that's the SFC's rationale being discussed here.





