Matias Goldberg

Member Since 02 Jul 2006
Online Last Active Today, 03:52 PM

#5298933 Porting OpenGL to Direct3D 11 : How to handle Input Layouts?

Posted by on 03 July 2016 - 04:52 PM

I suggest you follow a Vao + PSO (PipelineStateObject) approach (PSOs are a D3D12, Vulkan and Metal concept).


Eventually you find that, across all APIs, you need the input layout, shader bytecode, rasterizer state, depth state, etc. A PSO is a single condensed block with all this information combined. The only catch is that PSOs don't normally require vertex & index buffers, whereas Vaos do.


Therefore, the Vao + PSO approach: in both your GL and D3D11 pipelines, create an emulated PSO (it should contain the input layout, blend state, rasterizer state, MSAA count, shaders to use, depth state, etc.) and a Vao together, and assign them to your renderables.

For D3D11 assign a dummy Vao that only contains vertex & index buffers and a valid PSO, while for GL assign a valid Vao and a valid PSO. Then make your abstracted code set the PSO and then the Vao while iterating through them to render.

In GL, both setVao() and setPso() will do relevant work; in D3D11, setVao() will only set the vertex & index buffers, and setPso() will do everything else.


So in D3D11:

  • Vao: Contains Index & Vertex Buffers
  • PSO: Contains everything else

In GL:

  • Vao: Contains Index & Vertex Buffers + Vertex Layout definition
  • PSO: Contains everything else


This is very easy to write, simplifies everything (you have all the information you need!), and just works™. That's what we do in Ogre 2.1.

Plus, you make your engine friendly with D3D12, Vulkan & Metal.
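A minimal sketch of the split described above, assuming a made-up RenderSystem interface. All type and member names here (Pso, Vao, Renderable, RenderSystem, renderAll) are hypothetical and are not Ogre 2.1's actual classes; the backends only record what they would bind instead of calling a real graphics API.

```cpp
#include <string>
#include <vector>

// Hypothetical emulated PSO: everything except the vertex & index buffers.
struct Pso {
    std::string inputLayout;      // consumed by the D3D11 backend
    std::string shaders;
    std::string blendState;
    std::string rasterizerState;
    std::string depthState;
    int msaaCount = 1;
};

// Hypothetical Vao: a dummy holding only buffers in D3D11, a real VAO in GL.
struct Vao {
    int vertexBuffer = 0;
    int indexBuffer = 0;
    std::string vertexLayout;     // only meaningful for the GL backend
};

struct Renderable {
    const Pso* pso;
    const Vao* vao;
};

// Backend interface. The log stands in for real API calls.
class RenderSystem {
public:
    virtual ~RenderSystem() = default;
    virtual void setPso(const Pso& pso) = 0;
    virtual void setVao(const Vao& vao) = 0;
    void draw() { log.push_back("draw"); }
    std::vector<std::string> log;
};

class D3D11RenderSystem : public RenderSystem {
public:
    void setPso(const Pso& pso) override {
        // Real code would bind the input layout, shaders and state objects here.
        log.push_back("D3D11 setPso: layout=" + pso.inputLayout + " + shaders + states");
    }
    void setVao(const Vao&) override {
        // Real code would only bind the vertex & index buffers here.
        log.push_back("D3D11 setVao: vertex + index buffers only");
    }
};

class GLRenderSystem : public RenderSystem {
public:
    void setPso(const Pso&) override {
        log.push_back("GL setPso: shaders + states");
    }
    void setVao(const Vao& vao) override {
        // In GL the vertex layout definition travels with the VAO.
        log.push_back("GL setVao: buffers + layout=" + vao.vertexLayout);
    }
};

// API-agnostic loop: set the PSO, then the Vao, then draw.
void renderAll(RenderSystem& rs, const std::vector<Renderable>& renderables) {
    for (const Renderable& r : renderables) {
        rs.setPso(*r.pso);
        rs.setVao(*r.vao);
        rs.draw();
    }
}
```

The calling code never needs to know which backend it is talking to; only the two set functions differ in how much work they do.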

#5298807 Tangent Space computation for dummies?...

Posted by on 02 July 2016 - 08:43 AM

If you're looking for a working implementation, you can have a look at mine. It's very basic, nothing fancy. It's based on Lengyel's method.
Several more modern, superior ones have appeared since then.
Should be enough to get you started.
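For reference, a rough sketch of the per-triangle step of Lengyel's method. This is not the implementation referred to above; the names (TangentFrame, computeTriangleTangent) are made up for illustration. A full implementation would also accumulate these vectors per vertex and orthogonalize them against the vertex normal, which is left out here.

```cpp
#include <array>

using Vec3 = std::array<float, 3>;
using Vec2 = std::array<float, 2>;

struct TangentFrame {
    Vec3 tangent;
    Vec3 bitangent;
};

// Solves  e1 = du1*T + dv1*B,  e2 = du2*T + dv2*B  for T and B,
// where e1, e2 are triangle edges and (du, dv) the matching UV deltas.
TangentFrame computeTriangleTangent(const Vec3& p0, const Vec3& p1, const Vec3& p2,
                                    const Vec2& uv0, const Vec2& uv1, const Vec2& uv2) {
    Vec3 e1{p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2]};
    Vec3 e2{p2[0] - p0[0], p2[1] - p0[1], p2[2] - p0[2]};

    const float du1 = uv1[0] - uv0[0];
    const float dv1 = uv1[1] - uv0[1];
    const float du2 = uv2[0] - uv0[0];
    const float dv2 = uv2[1] - uv0[1];

    // Degenerate UVs give a zero determinant; return zero vectors in that case.
    const float det = du1 * dv2 - du2 * dv1;
    const float r = (det != 0.0f) ? 1.0f / det : 0.0f;

    TangentFrame frame{};
    for (int i = 0; i < 3; ++i) {
        frame.tangent[i] = (e1[i] * dv2 - e2[i] * dv1) * r;
        frame.bitangent[i] = (e2[i] * du1 - e1[i] * du2) * r;
    }
    return frame;
}
```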



I do not get how a vertex can have a normal. Triangles have a normal. At least in Blender. Then Blender also assigns a mean normal to subdivision patches ..

See Polycount vs Vertex count

#5298649 Water and Fresnel

Posted by on 30 June 2016 - 11:04 AM

Everything has fresnel.



At a sufficiently grazing angle, every surface will look like a mirror. The problem is that some surfaces are very rough, or the angle has to be so extreme that we can barely notice a discernible reflection because it becomes very thin.
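To put numbers on that, here is a small sketch using Schlick's approximation of the Fresnel term. Schlick's formula is my choice for illustration and is not mentioned in the post above; the function name is made up, and the F0 of roughly 0.02 used for water in the usage note is a commonly quoted ballpark, so treat it as an assumption.

```cpp
// Schlick's approximation of Fresnel reflectance.
// cosTheta: cosine of the angle between the view direction and the surface normal
//           (1 = looking straight at the surface, 0 = fully grazing).
// f0:       reflectance at normal incidence (assumed input, e.g. ~0.02 for water).
float fresnelSchlick(float cosTheta, float f0) {
    const float m = 1.0f - cosTheta;
    return f0 + (1.0f - f0) * (m * m * m * m * m);
}
```

With f0 = 0.02 the result stays near 0.02 when looking straight down and climbs towards 1.0 as cosTheta approaches 0, which matches the mirror-like look at grazing angles.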

#5298418 [MSVC] Why does SDL initialize member variables?

Posted by on 28 June 2016 - 11:39 AM

For hunting bugs in production, yeah that sucks. You often want a non-null invalid address.

But for deployment you want to avoid crashes and potential memory corruption if the random address happens to be valid.

Note that while 0x00000000 is always considered a bad address in the x86 ABIs, 0xcdcdcdcd could be a valid address if e.g. running with Large Address Aware.

#5297994 [D3D12] Multiple command queues

Posted by on 25 June 2016 - 09:12 AM


This, but remember that copy queues should have lower bandwidth compared to the graphics queue (at least on actual hardware). They are great for concurrency and background work, but to get a single job done as quickly as possible it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where it is better to use compute queues instead of graphics queues for immediate copy operations only.

Do you have a reference for that? Maybe that's true for CPU-side to CPU-side, or GPU-side to GPU-side transfers... but I wouldn't think it would be for transfers between CPU-side and a dedicated GPU (across PCI-e).
The whole point of the copy queue is that it's designed to fully saturate the PCI-e bus while consuming zero shading/graphics/compute resources (it's just a DMA controller being fed an "async memcpy" job). Intel say that their DMA controller has fairly low throughput, but their "GPU-side RAM" is actually also "CPU-side RAM", so in some cases you'd just be able to use a regular background thread and have it perform the memcpy :lol:


For references:

  • DX12PerfTweet 25: Copy queue consumes no shader resources but has less bandwidth than graphics and compute queues.
  • DX12PerfTweet 34: Use the copy queue for background tasks. Spinning for copy to finish is likely inefficient.
  • DX12PerfTweet 56: Use the COPY queue to move memory over PCI-Express: this is more efficient than using COMPUTE or DIRECT queue.
  • GPUOpen blog - Performance Tweets Series: Streaming & Memory Management: (...) The copy queue exposes the copy engine, which is a dedicated DMA engine designed around efficient transfers across the PCIe bus. (...) Before you run off and move all copies to the copy queue, keep in mind the copy queue is not designed for all copies. In fact, the copy engine is only optimized for transferring data over PCIe. It’s the only way to saturate PCIe bandwidth (...).

- if you're copying CPU->CPU, don't use the GPU, call memcpy :lol:
- if you're copying CPU->GPU or GPU->CPU, use the copy queue, except maybe if you're optimizing for Intel or a mobile platform.
- if you're copying GPU->GPU, probably use a compute queue, except maybe for SLI/crossfire (multi-adapter) cases.

That is pretty much it. Integrated GPUs will perform better if you write directly to the GPU memory from the CPU. It's a mystery to me whether this applies to AMD APUs as well.
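The three rules of thumb quoted above can be written down as a tiny decision function. This is only a sketch: the enum and function names are made up, and the exceptions mentioned in the rules (Intel, mobile, multi-adapter) are deliberately ignored.

```cpp
// Where a buffer lives, from the point of view of a dedicated GPU.
enum class Memory { Cpu, Gpu };

// Hypothetical result type. In D3D12 terms, CopyQueue would be a queue created
// with D3D12_COMMAND_LIST_TYPE_COPY and ComputeQueue one created with
// D3D12_COMMAND_LIST_TYPE_COMPUTE.
enum class CopyPath { Memcpy, CopyQueue, ComputeQueue };

CopyPath chooseCopyPath(Memory src, Memory dst) {
    if (src == Memory::Cpu && dst == Memory::Cpu)
        return CopyPath::Memcpy;        // CPU->CPU: don't involve the GPU at all
    if (src == Memory::Gpu && dst == Memory::Gpu)
        return CopyPath::ComputeQueue;  // GPU->GPU: no PCI-e transfer involved
    return CopyPath::CopyQueue;         // CPU<->GPU: crosses PCI-e, use the DMA engine
}
```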

#5297646 glsl represent 1 big texture as 4 smaller ones (tearing)

Posted by on 22 June 2016 - 05:14 PM

You're gonna have trouble with bilinear (gets worse with trilinear) filtering at the edges because the GPU should be interpolating between the two textures, but obviously this won't happen, so you need to do it yourself.


Potentially you may have to sample all four textures and interpolate it yourself:

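// Assuming layout of textures:
// |0|1|
// |2|3|
// c0..c3 are the colours sampled from textures 0..3.
// The weights are the fractional position in texel units of the full
// 1024x1024 image, shifted by half a texel so they are relative to texel centres.
vec2 w = fract( uv * 1024.0 - 0.5 );
result = mix(
    mix( c0, c1, w.x ),
    mix( c2, c3, w.x ),
    w.y );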

If you're at the left/right edge, you only need c0 & c1 or c2 & c3; if you're at the top/bottom edge you only need c0 & c2 or c1 & c3. But if you're close to the cross intersection, you're going to need to sample and mix all 4 textures.


Also the mipmaps need to be generated offline based on the original 1024x1024 rather than generating them on the GPU since it will generate them based on the 512x512 blocks individually.


I can't think quickly of a way to fix the trilinear filtering problem though.
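If it helps to check the bilinear math on the CPU, here is a rough sketch that does the interpolation manually across the four 512x512 tiles. Everything in it is an assumption for illustration: the TexelFetch callback, the function names, single-channel float texels, clamp-to-edge addressing, and v increasing downwards so that tiles 0 and 1 form the top row.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>

// Hypothetical texel fetch: value stored in tile `tile` (0..3) at local
// integer coordinates (x, y), each in [0, 511].
using TexelFetch = std::function<float(int tile, int x, int y)>;

constexpr int kTileSize = 512;
constexpr int kFullSize = 1024;

// Maps a texel coordinate of the full 1024x1024 image to the tile that owns it.
float fetchGlobal(const TexelFetch& fetch, int gx, int gy) {
    gx = std::min(std::max(gx, 0), kFullSize - 1);  // clamp to edge
    gy = std::min(std::max(gy, 0), kFullSize - 1);
    const int tile = (gy / kTileSize) * 2 + (gx / kTileSize);
    return fetch(tile, gx % kTileSize, gy % kTileSize);
}

float lerp(float a, float b, float t) { return a + (b - a) * t; }

// Bilinear sample of the virtual 1024x1024 image at normalized (u, v).
float sampleBilinear(const TexelFetch& fetch, float u, float v) {
    // Position in texel units, relative to texel centres.
    const float px = u * kFullSize - 0.5f;
    const float py = v * kFullSize - 0.5f;
    const int x0 = static_cast<int>(std::floor(px));
    const int y0 = static_cast<int>(std::floor(py));
    const float fx = px - static_cast<float>(x0);
    const float fy = py - static_cast<float>(y0);

    const float top = lerp(fetchGlobal(fetch, x0, y0),
                           fetchGlobal(fetch, x0 + 1, y0), fx);
    const float bottom = lerp(fetchGlobal(fetch, x0, y0 + 1),
                              fetchGlobal(fetch, x0 + 1, y0 + 1), fx);
    return lerp(top, bottom, fy);
}
```

Sampling exactly at the cross intersection (u = v = 0.5) touches one texel from each of the four tiles, which is the case where all four textures have to be read.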

#5297226 How to get patch id in domain shader.

Posted by on 19 June 2016 - 11:43 AM


Also, drawing each patch in its own DrawCall sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

I thought it was 64 vertices to fully utilize the vertex shader and 256 to not become command processor limited.
edit - for amd.


AMD's wavefront size is 64, that's true, but there are some inefficiencies and overhead details, such as needing 3 vertices to make a triangle (e.g. 64 triangles x 3 = 192 vertices, assuming no triangles share any vertex). Real-world testing shows that on average you get near-optimum throughput at >= 256 vertices per draw.
Edit. See http://www.g-truc.net/post-0666.html

@Matias is it still true if I have a pass-through vertex shader?


#5297150 How to get patch id in domain shader.

Posted by on 18 June 2016 - 05:01 PM

Also, drawing each patch in its own DrawCall sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

#5294988 SampleLevel not honouring integer texel offset

Posted by on 04 June 2016 - 12:22 PM

Based on personal experience, do not rely on the offset parameters. Broken drivers, broken hardware, mismatching results across vendors. It's better to just apply the offset yourself to the UVs.
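Applying the offset yourself boils down to converting the integer texel offset into UV units. A minimal sketch of that arithmetic follows; Float2 and applyTexelOffset are made-up names, and the width and height are assumed to be the dimensions of the mip level being sampled.

```cpp
struct Float2 {
    float x;
    float y;
};

// Converts an integer texel offset into a UV offset and adds it to uv.
// texWidth / texHeight: assumed dimensions of the sampled mip level.
Float2 applyTexelOffset(Float2 uv, int offsetX, int offsetY,
                        int texWidth, int texHeight) {
    return Float2{
        uv.x + static_cast<float>(offsetX) / static_cast<float>(texWidth),
        uv.y + static_cast<float>(offsetY) / static_cast<float>(texHeight)};
}
```

In a shader the same expression would be fed as the UV argument to the sample call, leaving the offset parameter unused.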

#5293812 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by on 27 May 2016 - 10:01 AM

I just realized: are you clearing the colour, depth and stencil buffers every frame? (at least the ones linked to the swap chain)
If you're not, you're creating inter-frame dependencies that could also explain this behaviour.

#5293684 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by on 26 May 2016 - 04:35 PM

By the way if you're reading from the framebuffer, it would totally explain it (i.e. postprocessing, or worse... reading from CPU).
Treat the backbuffer as write-only.

#5292902 Hybrid Frustum Traced Shadows

Posted by on 22 May 2016 - 12:40 PM

Also, how does the irregular z-buffer fit into this?

They don't use an irregular z-buffer. They don't even need a Z-buffer. Pay attention again: instead of storing depth at each pixel, they store the triangle's plane equation coefficients. A Z-buffer is used to store depth. If they don't store depth, they are not using a Z-buffer.

So where does this https://developer.nvidia.com/sites/default/files/akamai/gameworks/Frustum_Trace.jpg fit into what you just described?

The picture is a visual description of "depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );"

#5292790 Hybrid Frustum Traced Shadows

Posted by on 21 May 2016 - 04:49 PM

During the caster pass, instead of storing depth at each pixel, they store the triangle's plane equation coefficients.


During the receiver pass, instead of doing depthAtReceiver >= depthAtShadowmap test like in regular shadow mapping, they perform a depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );

This effectively becomes a form of raytracing, since it's a ray vs triangle intersection test.
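A rough sketch of what storing and evaluating the plane equation could look like, assuming the plane is stored as depth = a*x + b*y + c in the light's screen space. The struct layout and makePlane are my own illustration, not taken from the technique's actual implementation, and a full ray vs triangle test would also need to check that (x, y) falls inside the triangle, which is skipped here.

```cpp
struct Vec3 {
    float x;
    float y;
    float z;  // depth as seen from the light
};

// Coefficients such that depth(x, y) = a*x + b*y + c.
struct PlaneCoefficients {
    float a;
    float b;
    float c;
};

// Caster pass: derive the coefficients from the triangle's three vertices.
// Assumes the triangle is not edge-on to the light (normal.z != 0).
PlaneCoefficients makePlane(const Vec3& p0, const Vec3& p1, const Vec3& p2) {
    const Vec3 e1{p1.x - p0.x, p1.y - p0.y, p1.z - p0.z};
    const Vec3 e2{p2.x - p0.x, p2.y - p0.y, p2.z - p0.z};
    // Plane normal = e1 x e2
    const float nx = e1.y * e2.z - e1.z * e2.y;
    const float ny = e1.z * e2.x - e1.x * e2.z;
    const float nz = e1.x * e2.y - e1.y * e2.x;

    PlaneCoefficients plane{};
    plane.a = -nx / nz;
    plane.b = -ny / nz;
    plane.c = p0.z - plane.a * p0.x - plane.b * p0.y;
    return plane;
}

// Receiver pass: evaluate the caster's exact depth at the receiver's position.
float calculateDepthAt(const PlaneCoefficients& plane, float x, float y) {
    return plane.a * x + plane.b * y + plane.c;
}

bool isInShadow(float depthAtReceiver, const PlaneCoefficients& plane, float x, float y) {
    return depthAtReceiver >= calculateDepthAt(plane, x, y);
}
```

Because the depth is recomputed at the receiver's exact position rather than read from a texel, the result does not depend on the shadow map's resolution the way a stored depth value does.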

#5292789 Terrain Rendering

Posted by on 21 May 2016 - 04:40 PM

Now, why one single VBO?

Well, i see no reason to use multiple VBO since i can scale down my patch.

For instance, a level 0 patch of 33x33 vertices splits into 4 33x33 patches having 0.25 the size of the parent patch.

(33 x 33 vertices means a width and height of 32, i love numbers that are a power of 2, probably an OCD or something.)

The question is why do you need a VBO at all?

With modern GPUs, you can compute the XZ position via gl_VertexID (gl_VertexID / verticesPerRow; gl_VertexID % verticesPerRow); and grab the Y component from the heightmap texture.
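The index math is simple enough to check on the CPU. A small sketch, assuming vertices are numbered row by row (which of the two results you call X and which Z is just a convention); GridVertex and gridPositionFromVertexId are made-up names, and in the actual vertex shader vertexId would be gl_VertexID.

```cpp
struct GridVertex {
    int x;  // column within the patch
    int z;  // row within the patch
};

// Recovers the grid position of a vertex from its index alone,
// so no vertex buffer with positions is needed.
GridVertex gridPositionFromVertexId(int vertexId, int verticesPerRow) {
    return GridVertex{vertexId % verticesPerRow, vertexId / verticesPerRow};
}
```

For a 33x33 patch, verticesPerRow is 33; the resulting (x, z) can then be scaled and translated per patch, with Y fetched from the heightmap.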

#5292373 Pixel Shader 3 weirdness

Posted by on 18 May 2016 - 05:16 PM

The others are right. It is a requirement to match VS 3.0 shaders with PS 3.0.

The only exception is VS_SW 3.0 which can be matched with PS 2.0 (very old Intel cards).


If you had turned on the Debug Layer, you would have spotted this issue. The Debug Runtimes are your friend.