Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Yesterday, 06:06 PM

#5300046 Multiple animations in one .dae file(blender doesn't support this)

Posted by on 10 July 2016 - 04:56 PM

Have you tried the OpenGEX exporter? The exporter may or may not support what you need and OGEX is infinitely superior to Collada.

#5299608 Frame buffer speed, when does it matter?

Posted by on 07 July 2016 - 09:29 AM

I presumed that interface width of the memory enables us to transfer more data in a shorter time

It lets you transfer more data in the same time, not the same data in a shorter time. It's a very important distinction.
Think of the problem as a truck that takes 5 hours to travel 500km. The truck can only hold 1tn of cargo. If you use two trucks, you can send twice the amount of cargo, but the trip will still take 5 hours.

Why am I asking this is (higher level view) because I'm interested in why HBM is beneficial and when does it stop being such.

It depends on what we call the "bottleneck". A game that performs a lot of reads and writes may be bandwidth limited, so it will run faster on memory with higher bandwidth.
But if another game executes a lot of math (which uses the ALUs Hodgman describes) and that's most of what it does, then higher bandwidth won't do jack squat because that's not the bottleneck.
Going back to the truck example:

You have to transfer 2tn of cargo. You have one truck. This is your bottleneck. You need 5hs to travel 500km and deliver 1tn, then another 5hs to get back and load the rest, then 5hs more to travel the 500km again. In total, all the travelling took 15hs using one truck.
If you use two trucks, you'll be done in 5hs. Memory bandwidth and bus bandwidth behave more or less the same way: you can send more data in the same amount of time, so doubling the amount of data you can transfer lets you finish sooner, but only if you had a lot of data to send and the transfer is the bottleneck. And you can never go below 5hs for one trip. (Why, you ask? Because GPUs can't send data faster than the speed of light.)
Now let's add the "ALU" to the example: let's suppose all you have to send in the truck is a machine that weighs only 70kg (that's 0.07tn). However, disassembling the machine for transportation and loading it into the truck takes you 8 hours. The truck then begins its journey and takes 5hs. Total time = 13hs.
You could use two trucks... but it will still take you 13hs because having an extra truck doesn't help you at all in disassembling the machine. What you need is an extra hand, not another truck. The bottleneck here is in disassembling the machine, not in transportation.
In this example people = ALU; trucks = bandwidth.
More people = you can disassemble and load the machine into the truck faster.
More trucks = you can send more cargo per trip.
More ALU = you can do more math operations in the same amount of time.
More bandwidth = you can do more loads and stores from/to memory in the same amount of time.
So, to answer your question: does an increase of bandwidth make a game run faster? It depends.
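
To put numbers on the truck example, here is a minimal sketch of the arithmetic. The function and parameter names are made up for illustration, and it assumes the disassembly work splits evenly between people, which real work rarely does.

```python
import math

TRIP_HOURS = 5.0          # one-way travel time for 500km, from the example
TRUCK_CAPACITY_TN = 1.0   # cargo per truck per trip, from the example


def delivery_hours(cargo_tn, trucks, prep_hours=0.0, people=1):
    """Total hours to prepare and deliver the cargo.

    prep_hours is the disassembly/loading work for one person; it is split
    evenly among `people` (an idealisation, labelled as an assumption).
    """
    trips = math.ceil(cargo_tn / (trucks * TRUCK_CAPACITY_TN))
    # Every trip except the last one also needs a return leg to reload.
    travel = (2 * trips - 1) * TRIP_HOURS
    return prep_hours / people + travel


print(delivery_hours(2.0, trucks=1))                             # 15.0
print(delivery_hours(2.0, trucks=2))                             # 5.0
print(delivery_hours(0.07, trucks=1, prep_hours=8.0))            # 13.0
print(delivery_hours(0.07, trucks=2, prep_hours=8.0))            # 13.0
print(delivery_hours(0.07, trucks=1, prep_hours=8.0, people=2))  # 9.0
```

The second truck only helps in the first case, where transport is the bottleneck. In the machine case only the extra person shortens the total.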

#5298933 Porting OpenGL to Direct3D 11 : How to handle Input Layouts?

Posted by on 03 July 2016 - 04:52 PM

I suggest you follow a Vao + PSO (PipelineStateObject) approach (PSOs are a D3D12, Vulkan and Metal concept).


Eventually you find that, across all APIs, you need an input layout, shader bytecode, rasterizer state, depth state, etc. A PSO is a single condensed block with all this information combined. The only catch is that PSOs don't normally include vertex & index buffers, whereas Vaos do.


Hence the Vao + PSO approach: in both your GL and D3D11 pipelines, create an emulated PSO (it should contain the input layout, blend state, rasterizer state, MSAA count, shaders to use, depth state, etc.) together with a Vao, and assign them to your renderables.

For D3D11 assign a dummy Vao that only contains vertex & index buffers and a valid PSO, while for GL assign a valid Vao and a valid PSO. Then make your abstracted code set the PSO and then the Vao while iterating through them to render.

In GL, both setVao() and setPso() do real work; in D3D11, setVao() only sets the vertex & index buffers and setPso() does all the rest.


So in D3D11:

  • Vao: Contains Index & Vertex Buffers
  • PSO: Contains everything else

In GL:

  • Vao: Contains Index & Vertex Buffers + Vertex Layout definition
  • PSO: Contains everything else


This is very easy to write, simplifies everything (you have all the information you need!), and just works™. That's what we do in Ogre 2.1.

Plus, you make your engine friendly with D3D12, Vulkan & Metal.
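
A rough sketch of how that split could look, written in Python only to keep it short and runnable. Every class, field and string name here is hypothetical, and the backends just record what they would bind instead of calling a real graphics API. This is not Ogre 2.1's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pso:
    """Emulated pipeline state object: everything except the buffers."""
    input_layout: tuple          # e.g. (("POSITION", "float3"),)
    shaders: str
    rasterizer_state: str = "cull_back"
    depth_state: str = "less_equal"
    blend_state: str = "opaque"


@dataclass(frozen=True)
class Vao:
    vertex_buffer: str
    index_buffer: str
    vertex_layout: tuple = ()    # left empty for the D3D11 "dummy" Vao


class D3D11Backend:
    def __init__(self):
        self.calls = []          # stand-in for real device-context calls

    def set_pso(self, pso):
        # In D3D11 the PSO does all the work, input layout included.
        self.calls.append(("input_layout", pso.input_layout))
        self.calls.append(("state", pso.shaders, pso.rasterizer_state,
                           pso.depth_state, pso.blend_state))

    def set_vao(self, vao):
        # The dummy Vao only binds the vertex & index buffers.
        self.calls.append(("buffers", vao.vertex_buffer, vao.index_buffer))


class GLBackend:
    def __init__(self):
        self.calls = []

    def set_pso(self, pso):
        # The vertex layout lives in the Vao, so it is skipped here.
        self.calls.append(("state", pso.shaders, pso.rasterizer_state,
                           pso.depth_state, pso.blend_state))

    def set_vao(self, vao):
        # A real GL Vao carries the buffers and the vertex layout together.
        self.calls.append(("vao", vao.vertex_buffer, vao.index_buffer,
                           vao.vertex_layout))


def render(backend, renderables):
    # Same loop for every API: set the PSO, then the Vao, then draw.
    for pso, vao in renderables:
        backend.set_pso(pso)
        backend.set_vao(vao)
        backend.calls.append(("draw",))


layout = (("POSITION", "float3"),)
pso = Pso(input_layout=layout, shaders="basic_vs+basic_ps")
d3d, gl = D3D11Backend(), GLBackend()
render(d3d, [(pso, Vao("vb0", "ib0"))])
render(gl, [(pso, Vao("vb0", "ib0", vertex_layout=layout))])
```

The point of the design is that render() never has to know which API it is talking to; only the two set functions differ.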

#5298807 Tangent Space computation for dummies?...

Posted by on 02 July 2016 - 08:43 AM

If you're looking for a working implementation, you can have a look at mine. It's very basic, nothing fancy. It's based on Lengyel's method.
Several more modern, superior ones have appeared since then.
Should be enough to get you started.



I do not get how a vertex can have a normal. Triangles have a normal. At least in Blender. Then Blender also assigns a mean normal to subdivision patches ..

See Polycount vs Vertex count

#5298649 Water and Fresnel

Posted by on 30 June 2016 - 11:04 AM

Everything has fresnel.



At a shallow enough grazing angle, every surface will look like a mirror. The problem is that some surfaces are so rough, or the required angle is so extreme, that we can barely notice a discernible reflection because it becomes very thin.

#5298418 [MSVC] Why does SDL initialize member variables?

Posted by on 28 June 2016 - 11:39 AM

For hunting bugs in production, yeah that sucks. You often want a non-null invalid address.

But for deployment you want to avoid crashes and potential memory corruption if the random address happens to be valid.

Note that while 0x00000000 is always considered a bad address in the x86 ABIs, 0xcdcdcdcd could be a valid address if e.g. running with Large Address Aware.
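
A quick back-of-the-envelope check of that claim. It assumes the usual 32-bit Windows layout: 2GiB of user address space by default, and up to 4GiB for a Large Address Aware process on 64-bit Windows. It is simplified, ignoring the reserved regions at the very bottom and top of the range, and the function name is made up.

```python
GIB = 1 << 30
FILL_PATTERN = 0xCDCDCDCD   # the fill value mentioned in the post


def could_be_user_address(addr, large_address_aware):
    # Simplified: only checks the size of the user-mode range.
    limit = 4 * GIB if large_address_aware else 2 * GIB
    return 0 < addr < limit


print(hex(2 * GIB))                                   # 0x80000000
print(could_be_user_address(FILL_PATTERN, False))     # False
print(could_be_user_address(FILL_PATTERN, True))      # True
print(could_be_user_address(0x00000000, True))        # False
```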

#5297994 [D3D12] Multiple command queues

Posted by on 25 June 2016 - 09:12 AM


This, but remember that copy queues should have lower bandwidth compared to the graphics queue (at least on actual hardware). They are great for concurrency and background work, but to get the shortest job done it is better to use the graphics queue. I am not sure how they compare against compute queues, but I cannot imagine a scenario where it is better to use compute queues instead of graphics queues for immediate copy operations only.

Do you have a reference for that? Maybe for CPU-side to CPU-side, or GPU-side to GPU-side transfers that's true... but I wouldn't think it would be for transfers between CPU-side and a dedicated GPU (across PCI-e).
The whole point of the copy queue is that it's designed to fully saturate the PCI-e bus while consuming zero shading/graphics/compute resources (it's just a DMA controller being fed an "async memcpy" job). Intel say that their DMA controller has fairly low throughput, but their "GPU-side RAM" is actually also "CPU-side RAM", so in some cases you'd just be able to use a regular background thread and have it perform the memcpy :lol:


For references:

  • DX12PerfTweet 25: Copy queue consumes no shader resources but has less bandwidth than graphics and compute queues.
  • DX12PerfTweet 34: Use the copy queue for background tasks. Spinning for copy to finish is likely inefficient.
  • DX12PerfTweet 56: Use the COPY queue to move memory over PCI-Express: this is more efficient than using COMPUTE or DIRECT queue.
  • GPUOpen blog - Performance Tweets Series: Streaming & Memory Management: (...) The copy queue exposes the copy engine, which is a dedicated DMA engine designed around efficient transfers across the PCIe bus. (...) Before you run off and move all copies to the copy queue, keep in mind the copy queue is not designed for all copies. In fact, the copy engine is only optimized for transferring data over PCIe. It’s the only way to saturate PCIe bandwidth (...).

- If you're copying CPU->CPU, don't use the GPU, call memcpy :lol:
- If you're copying CPU->GPU or GPU->CPU, use the copy queue, except maybe if you're optimizing for Intel or a mobile platform.
- If you're copying GPU->GPU, probably use a compute queue, except maybe for SLI/crossfire (multi-adapter) cases.

That is pretty much it. Integrated GPUs will perform better if you write directly to the GPU memory from the CPU. It's a mystery to me whether this applies to AMD APUs as well.
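
The rules of thumb above boil down to a small decision function. This is only a sketch: the function name and the returned strings are made up, and falling back to the copy queue for the multi-adapter case is my assumption, since the list only says "except maybe".

```python
def pick_copy_path(src, dst, integrated_gpu=False, multi_adapter=False):
    """Suggest how to move data; src and dst are "cpu" or "gpu"."""
    if src == "cpu" and dst == "cpu":
        return "memcpy"                     # no reason to involve the GPU
    if src == "gpu" and dst == "gpu":
        # Assumption: a transfer between adapters crosses PCIe, so treat it
        # like any other bus transfer.
        return "copy_queue" if multi_adapter else "compute_queue"
    # CPU <-> GPU: the copy queue is built to saturate PCIe, but an
    # integrated GPU shares memory with the CPU, so a direct write may win.
    return "direct_cpu_write" if integrated_gpu else "copy_queue"


print(pick_copy_path("cpu", "gpu"))                       # copy_queue
print(pick_copy_path("gpu", "gpu"))                       # compute_queue
print(pick_copy_path("cpu", "gpu", integrated_gpu=True))  # direct_cpu_write
```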

#5297646 glsl represent 1 big texture as 4 smaller ones (tearing)

Posted by on 22 June 2016 - 05:14 PM

You're gonna have trouble with bilinear (gets worse with trilinear) filtering at the edges because the GPU should be interpolating between the two textures, but obviously this won't happen, so you need to do it yourself.


Potentially you may have to sample all four textures and interpolate them yourself:

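// Assuming this layout of the four 512x512 textures:
// |0|1|
// |2|3|
// c0..c3 are the colours sampled from textures 0..3; uv addresses the full 1024x1024 image.
// Bilinear weights are the fractional texel position, measured from texel centres
// (half a texel is 0.5 once uv has been scaled to texel units).
vec2 w = fract( uv * 1024.0 - 0.5 );
result = mix(
    mix( c0, c1, w.x ),
    mix( c2, c3, w.x ),
    w.y );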

If you're at the left/right edge, you only need c0 & c1 or c2 & c3; if you're at the top/bottom edge you only need c0 & c2 or c1 & c3. But if you're close to the cross intersection, you're going to need to sample and mix all 4 textures.


Also the mipmaps need to be generated offline based on the original 1024x1024 rather than generating them on the GPU since it will generate them based on the 512x512 blocks individually.


I can't think quickly of a way to fix the trilinear filtering problem though.
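
To see the seam problem in isolation, here is a small CPU-side reference (not shader code). It uses a 4x4 stand-in for the 1024x1024 texture, and all the names are made up for the example. It compares per-tile filtering, which clamps at each tile's own edge, against filtering done by hand across tiles.

```python
import math

FULL = 4            # full texture is FULL x FULL texels (1024 in the post)
HALF = FULL // 2    # each of the four tiles is HALF x HALF

# A simple gradient so that errors at the seam are easy to see.
full = [[float(x + y * FULL) for x in range(FULL)] for y in range(FULL)]

# Tiles laid out as |0|1| over |2|3|, matching the post.
tiles = [[[full[ty * HALF + y][tx * HALF + x] for x in range(HALF)]
          for y in range(HALF)]
         for ty in range(2) for tx in range(2)]


def bilinear(fetch, size, u, v):
    """Bilinear filter over a size x size grid with clamp-to-edge addressing."""
    x, y = u * size - 0.5, v * size - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def at(ix, iy):
        return fetch(min(max(ix, 0), size - 1), min(max(iy, 0), size - 1))

    top = at(x0, y0) * (1 - fx) + at(x0 + 1, y0) * fx
    bottom = at(x0, y0 + 1) * (1 - fx) + at(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


def fetch_full(x, y):
    return full[y][x]


def fetch_tiled(x, y):
    # Route a full-texture texel coordinate to the tile that owns it.
    tile = tiles[(y // HALF) * 2 + (x // HALF)]
    return tile[y % HALF][x % HALF]


def sample_naive(u, v):
    # Per-tile filtering: neighbours are clamped at the tile's own edge.
    tx, ty = min(int(u * 2), 1), min(int(v * 2), 1)
    tile = tiles[ty * 2 + tx]
    return bilinear(lambda x, y: tile[y][x], HALF, u * 2 - tx, v * 2 - ty)


def sample_manual(u, v):
    # Filtering done by hand, so neighbours can come from another tile.
    return bilinear(fetch_tiled, FULL, u, v)


u, v = 0.5, 0.375   # exactly on the vertical seam, centre of texel row 1
print(bilinear(fetch_full, FULL, u, v))   # 5.5 (reference)
print(sample_manual(u, v))                # 5.5
print(sample_naive(u, v))                 # 6.0 (seam artifact)
```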

#5297226 How to get patch id in domain shader.

Posted by on 19 June 2016 - 11:43 AM


Also, drawing each patch in its own DrawCall sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

I thought it was 64 vertices to fully utilize the vertex shader and 256 to not become command processor limited.
edit - for amd.


AMD's wavefront size is 64, that's true, but there are some inefficiencies and overhead details, such as needing 3 vertices to make a triangle (e.g. 64 triangles x 3 = 192 vertices, assuming no triangles share a vertex). Real-world testing shows that on average you get near-optimum throughput at >= 256 vertices per draw.
Edit. See http://www.g-truc.net/post-0666.html

@Matias is it still true if I have a pass-through vertex shader?


#5297150 How to get patch id in domain shader.

Posted by on 18 June 2016 - 05:01 PM

Also, drawing each patch in its own DrawCall sounds incredibly inefficient. You need to provide at least 256 vertices per draw call to fully utilize the vertex shader.

#5294988 SampleLevel not honouring integer texel offset

Posted by on 04 June 2016 - 12:22 PM

Based on personal experience, do not rely on the offset parameters: broken drivers, broken hardware, mismatching results across vendors. It's better to just apply the offset yourself to the UVs.
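
The math for applying the offset yourself is just a division by the texture size. Here is a sketch of it, shown in Python to keep it runnable. The function name is made up, and it assumes the integer offset is meant in texels of the mip level being sampled, so the size is halved per LOD.

```python
def apply_texel_offset(uv, offset, base_size, lod=0):
    """Return uv shifted by an integer texel offset at the given mip level."""
    # Size of the selected mip level, never smaller than one texel.
    width = max(base_size[0] >> lod, 1)
    height = max(base_size[1] >> lod, 1)
    return (uv[0] + offset[0] / width, uv[1] + offset[1] / height)


print(apply_texel_offset((0.5, 0.5), (1, -2), (256, 128)))          # (0.50390625, 0.484375)
print(apply_texel_offset((0.5, 0.5), (1, -2), (256, 128), lod=1))   # (0.5078125, 0.46875)
```

In a shader the same thing is one multiply-add on the UVs before sampling.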

#5293812 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by on 27 May 2016 - 10:01 AM

I just realized: are you clearing the colour, depth and stencil buffers every frame? (at least the ones linked to the swap chain)
If you're not, you're creating inter-frame dependencies that could also explain this behaviour.

#5293684 [Solved]NV Optimus notebook spend too much time in copy hardware queue?

Posted by on 26 May 2016 - 04:35 PM

By the way if you're reading from the framebuffer, it would totally explain it (i.e. postprocessing, or worse... reading from CPU).
Treat the backbuffer as write-only.

#5292902 Hybrid Frustum Traced Shadows

Posted by on 22 May 2016 - 12:40 PM

Also, how does the irregular z-buffer fit into this?

They don't use an irregular z-buffer. They don't even need a Z-buffer. Pay attention again: instead of storing depth at each pixel, they store the triangle's plane equation coefficients. A Z-buffer is used to store depth. If they don't store depth, they are not using a Z-buffer.

So where does this https://developer.nvidia.com/sites/default/files/akamai/gameworks/Frustum_Trace.jpg fit into what you just described.

The picture is a visual description of "depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );"

#5292790 Hybrid Frustum Traced Shadows

Posted by on 21 May 2016 - 04:49 PM

During the caster pass, instead of storing depth at each pixel, they store the triangle's plane equation coefficients.


During the receiver pass, instead of doing a depthAtReceiver >= depthAtShadowmap test like in regular shadow mapping, they perform depthAtReceiver >= calculateDepthAt( planeEquationCoefficients, x, y );

This effectively becomes a form of raytracing, since it's a ray vs triangle intersection test.
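
A minimal sketch of what the plane-equation test could look like, assuming the triangle's plane is stored as z = a*x + b*y + c in the light's screen space. The function names are made up; this ignores depth bias and the check that (x, y) actually falls inside the triangle, and it is not NVIDIA's actual implementation.

```python
def plane_coefficients(p0, p1, p2):
    """Return (a, b, c) so that z = a*x + b*y + c passes through the triangle."""
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("triangle is degenerate (edge-on) in x/y")
    # Cramer's rule on the two edge vectors.
    a = ((z1 - z0) * (y2 - y0) - (z2 - z0) * (y1 - y0)) / det
    b = ((x1 - x0) * (z2 - z0) - (x2 - x0) * (z1 - z0)) / det
    c = z0 - a * x0 - b * y0
    return a, b, c


def calculate_depth_at(coeffs, x, y):
    a, b, c = coeffs
    return a * x + b * y + c


def is_shadowed(depth_at_receiver, coeffs, x, y):
    # Same comparison as in the post, with the exact occluder depth at (x, y).
    return depth_at_receiver >= calculate_depth_at(coeffs, x, y)


# Caster pass: store the coefficients instead of a depth value.
coeffs = plane_coefficients((0, 0, 1), (4, 0, 3), (0, 4, 5))
print(coeffs)                               # (0.5, 1.0, 1.0)
print(calculate_depth_at(coeffs, 2, 1))     # 3.0
# Receiver pass: compare against the occluder depth at the receiver's x, y.
print(is_shadowed(3.5, coeffs, 2, 1))       # True
print(is_shadowed(2.5, coeffs, 2, 1))       # False
```

Because the occluder depth is evaluated at the receiver's own x, y rather than at a texel centre, the result doesn't suffer from the usual shadow map resolution error.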