
Matias Goldberg

Member Since 02 Jul 2006

#5271097 Vertex to cube using geometry shader

Posted by Matias Goldberg on 14 January 2016 - 12:43 PM

PS: I'm still curious how you pulled that off with only 14 vertices ;)

14-vertex tristrip cube in the vertex shader, using only the vertex ID (no UVs or normals):

uint b = 1u << vertexId;     // vertexId = SV_VertexID, in [0; 13]
float x = (0x287a & b) != 0; // one bitmask per axis encodes the
float y = (0x02af & b) != 0; // corner sequence of the strip
float z = (0x31e3 & b) != 0;
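For the curious, the masks can be sanity-checked on the CPU with a small sketch that mirrors what the shader does per vertex ID:

```python
# Decode all 14 strip vertices exactly as the vertex shader would.
def cube_strip_vertex(i):
    b = 1 << i
    x = 1 if 0x287A & b else 0
    y = 1 if 0x02AF & b else 0
    z = 1 if 0x31E3 & b else 0
    return (x, y, z)

verts = [cube_strip_vertex(i) for i in range(14)]
assert len(set(verts)) == 8                     # every cube corner is reached
tris = [verts[i:i + 3] for i in range(12)]      # a 14-vertex strip = 12 tris
for a, b, c in tris:
    assert len({a, b, c}) == 3                  # no degenerate triangles
    assert any(a[k] == b[k] == c[k] for k in range(3))  # each lies on a face
```

All 12 triangles of the cube come out of the strip with no degenerates.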

#5270487 Injection LPV Geometry Shader

Posted by Matias Goldberg on 10 January 2016 - 11:20 PM

Use RenderDoc. It will provide a lot of help in debugging your problem.


Normally this error means the structure definitions between the VS & GS, and the GS & PS, do not match. That doesn't seem to be the case here, though I've never worked with a GS that takes points instead of tris.

This error can also happen if you later render something else and changed the VS & PS but forgot to unbind the GS (a very common mistake).


You're also getting errors that your pixel shader uses MRT and outputs to 3 RTs, but you never call OMSetRenderTargets with three different RTs. It looks like you're confusing PSSetShaderResources with OMSetRenderTargets.


And last but not least. Use RenderDoc. It's an invaluable tool which will help you understand what's going wrong.

#5270209 none issue when near plane = 0.01, is it just matter of time when I see artif...

Posted by Matias Goldberg on 08 January 2016 - 09:31 PM

The value of near plane is irrelevant on its own - what matters is the relative magnitude of the near and far plane to each other.

While true, a slight change to the near plane yields a much bigger precision improvement than a big change to the far plane. I wish I had the numbers at hand, but it was something like: you gain much more precision by raising the near plane from 0.1 to 1 than by lowering the far plane from 10,000 to 1,000.
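A rough sketch backs this up, assuming the standard D3D-style mapping of z in [n; f] to depth d in [0; 1], d(z) = f(z - n) / (z(f - n)):

```python
# Projected depth for the standard D3D convention (0 at near, 1 at far).
def proj_depth(z, n, f):
    return f * (z - n) / (z * (f - n))

# How much of the [0; 1] depth range remains for everything beyond 10 units:
baseline = 1.0 - proj_depth(10.0, 0.1, 10000.0)   # n = 0.1, f = 10000
near_up  = 1.0 - proj_depth(10.0, 1.0, 10000.0)   # raise near 10x, to 1
far_down = 1.0 - proj_depth(10.0, 0.1, 1000.0)    # lower far 10x, to 1000

# Raising the near plane 10x frees roughly 10x more of the depth range;
# lowering the far plane 10x changes almost nothing.
assert near_up > 5 * baseline
assert abs(far_down - baseline) < 0.001
```

The numbers here (10 units mid-scene, 0.1/1 near, 1,000/10,000 far) are illustrative, but the asymmetry holds in general: almost the entire depth range is burned between the near plane and a few multiples of it.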

As for depth precision improvements, see

#5270102 What happens with 32bit overflow?

Posted by Matias Goldberg on 08 January 2016 - 11:38 AM

It's not a simple question actually.

Virtual address is not the same as virtual memory.


32-bit processes cannot address more than 2^32 bytes of memory. However, they can use more than that: they work on a section at a time, remapping different chunks of virtual memory into their virtual address space as needed.

With this trick, on 32-bit OSes, up to 64GB of memory can be accessed (assuming PAE is enabled). On 64-bit OSes I don't know if the 64GB limit still applies, or if they can use more.

Here's an example on how to do that: https://blogs.msdn.microsoft.com/oldnewthing/20040810-00/?p=38203


You will notice that you need OS specific calls to do this. Regular malloc won't do.
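As a portable sketch of the same windowing idea, here is the trick with Python's mmap module instead of the Win32 CreateFileMapping/MapViewOfFile calls from the blog post (file sizes shrunk down for illustration):

```python
# Map a small window over a file instead of mapping the whole thing.
# A 32-bit process could walk a file far larger than its address space
# this way, remapping the window as it goes.
import mmap
import tempfile

GRAN = mmap.ALLOCATIONGRANULARITY      # window offsets must be multiples of this

with tempfile.TemporaryFile() as f:
    f.truncate(4 * GRAN)               # pretend this is too big to map at once
    f.seek(2 * GRAN)
    f.write(b"MARKER")
    f.flush()

    # Map only one granularity-sized window, positioned over the data we need.
    with mmap.mmap(f.fileno(), GRAN, offset=2 * GRAN) as window:
        data = window[0:6]

assert data == b"MARKER"
```

Note the OS imposes the offset alignment (ALLOCATIONGRANULARITY here, the allocation granularity in the Win32 version); that's part of why plain malloc can't express this.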


Going back to the original question: what will happen? Well, it depends on how the program is coded:

  1. Apps using malloc that run out of virtual address ranges will get a null pointer. If the app checks for this condition, it can try to recover (e.g. free some memory and try again, show an error and quit, silently quit, etc.). If the app doesn't check, it will continue to execute normally until the null pointer is used, which leads to a crash. Usually malloc'ed memory is used immediately afterwards, so the crash happens close to the malloc call; but who knows how long the crash could be delayed if the null pointer isn't accessed soon. It's also likely that many subsequent malloc calls will fail, which means a crash is imminent unless a lot of memory suddenly becomes available again.
  2. Apps using C++'s new operator get an exception thrown when they run out of memory, so they crash immediately. However, the app can catch this exception and try to recover, as in the malloc case. Poorly written C++ code may catch all exceptions and not notice it's swallowing an out-of-memory one; in that case it continues to execute, except now it has an uninitialized pointer (which is worse than a null pointer). Uninitialized pointers normally crash immediately when used, or may corrupt the process' memory and keep running until the program is so broken it can't work anymore or crashes (think of the Gameboy Pokemon glitches).
  3. Apps in other languages (e.g. Python, Ruby, Java) depend on how their runtime handles out-of-memory conditions, but most handle it by crashing or raising an exception (it also depends on whether the runtime couldn't honour the amount of memory the executed code requested, or whether the runtime itself ran out of memory inside its internal routines).
  4. Apps manually managing memory (the CreateFileMapping method from the blog) are advanced enough that we can assume they handle this condition. When out of memory, CreateFileMapping and/or MapViewOfFile will fail. The rest plays out as in the malloc case.
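Cases 1-3 can be illustrated with a tiny sketch: ask for an allocation far beyond any realistic address space, let the underlying allocation fail, and recover from the resulting exception instead of crashing:

```python
# Request 2**62 bytes: the underlying malloc/mmap fails, and Python
# surfaces it as MemoryError (case 3). Catching it and reporting failure
# upward mirrors the well-behaved malloc/new recovery paths (cases 1 & 2).
def try_allocate(n_bytes):
    try:
        buf = bytearray(n_bytes)
        return len(buf)
    except (MemoryError, OverflowError):   # 32-bit Pythons reject the size outright
        return None                        # recovery path: report failure upward

assert try_allocate(2 ** 62) is None       # out of memory: handled, no crash
assert try_allocate(16) == 16              # small allocations still succeed
```

The dangerous variants described above are exactly the apps that skip the except branch (or swallow it blindly) and march on with a bad pointer.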

Edit: What I wrote is Windows/Linux/OS X specific. As described above by swiftcoder, iOS and Android are much more strict about memory (they just immediately terminate a process that starts consuming too much memory)

#5269844 DirectX evolution

Posted by Matias Goldberg on 07 January 2016 - 09:26 AM

I am trying to support on my framework DX9, DX11 and DX12 for that I really need to know many things of the way things are done for each API

It's worth pointing out that before DX10 came (i.e. pre 2005), the advice of "how things should be done" in DX9 was different.
This is because:

  1. New techniques were discovered that ended up faster than recommended approaches
  2. Hardware was actually based around constant registers (current hw is based around constant buffers)
  3. Drivers change and they optimize for different patterns in their heuristics.
  4. Rendering the old "D3D9 way" and then porting straight to D3D11 resulted in horrible performance (9 performing considerably better than 11).
  5. Rendering the "D3D11 way" and then mimicking the same in D3D9 resulted in good performance (11 beating 9; and 9 having similar or even better performance than the 'other way' using 9).
  6. Having two separate paths for rendering "the D3D9 way" and "the D3D11 way" is a ton of work, and sometimes too contradictory.

Same could be said about 2008-era DX10 recommendations and 2015-era DX11 recommendations (which still apply to 10). We do things differently now (not completely differently, but incrementally so), basically because we found new ways that turned out to be better.


Because it's an incremental evolution, the general advice from 2003 still holds true (e.g. the Batch, Batch, Batch presentation from NVIDIA for the GeForce 6).

However, I personally consider batch submission a solved problem since D3D11 added BaseVertexLocation and StartInstanceLocation, which allow multi-mesh rendering without having to change the vertex buffer; and since the work on display lists allowed filling constant buffers and preparing draw commands from multiple threads (which, btw, got enhanced in D3D12). This is basically impossible to do in D3D9; though you can create a layer that emulates it, and surprisingly it sometimes works even faster, thanks to better cache coherency during playback and being able to distribute work across cores during recording.


But if for some odd reason you're batch limited even in D3D11/12, the general thinking from the batch, batch, batch presentation still holds (i.e. batch limited = increasing vertex count becomes free). For example to combat batching issues, Just Cause 2 introduced the term merge-instancing.


If you need a historical archive of the API's evolution: go to the GDC Vault and download the API slides in chronological order; google the SIGGRAPH 2003-2015 slides in chronological order; google the Microsoft "Game Fest" slides (I think they ran from 2006 through 2009); and visit the NV (2007, 2008, 2009, 2010; change the address bar for the rest of the years) and AMD (archive) sites, go to the SDK/Resources section, and look for the older entries.

#5269429 HDR Rendering (Average Luminance)

Posted by Matias Goldberg on 05 January 2016 - 09:54 AM

do a bilinear sample in the middle of a 2x2 quad of pixels and the hardware will average them for you as long as you get the texture coordinate right

Emphasis is mine. I shall note it gets really tricky to get this perfectly right. Took me days of RenderDoc debugging; it was always off by some small amount.

#5269232 Criticism of C++

Posted by Matias Goldberg on 04 January 2016 - 11:38 AM

If that is something that makes the language unviable for you, use a different one.

Can we not have a civil discourse on the pros/cons of a language without resorting to this?

He actually explained why in that same post. As he explained, even in 2015 he's writing code where char is 32 bits.
And suggesting a different language is not uncivilized.
It's simply a brutal truth. Some people want C++ to do stuff it's not intended to do, to solve problems it's not supposed to solve; and changing C++ to please these people would anger those who need C++ the way it is now.
You can't satisfy everyone at the same time. Those unsatisfied can move to another language, because what they want from C++ is not what they need.

The evil's of premature optimization are always taught in software design, get it working correctly then worry about shaving off a few bytes or cycles here or there.

Yet as L. Spiro points out, a lot of people get it wrong.

What we have is the opposite, where the default is fast and then we have to over-ride it with the 'correct' typedefs.

That is simply not true.

#5269090 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 03 January 2016 - 06:40 PM

I'm not even sure GCN has an instruction for what he wants to do. The best I can figure out it would be 4 v_cvt_f32_ubyte[0|1|2|3] and then 4 v_mul_f32 by 1/255.0f.

Maybe, maybe not; but my point is that it's still very far from doing 4 loads, 4 bitshifts, 4 'and' masks, 4 conversions to float, and then the 1/255 multiply.

Edit: Checked, you're right about the instructions. "fragCol = unpackUnorm4x8(val);" outputs (irrelevant ISA code stripped):

  v_cvt_f32_ubyte0  v0, s4                                  // 00000000: 7E002204
  v_cvt_f32_ubyte1  v1, s4                                  // 00000004: 7E022404
  v_cvt_f32_ubyte2  v2, s4                                  // 00000008: 7E042604
  v_cvt_f32_ubyte3  v3, s4                                  // 0000000C: 7E062804
  v_mov_b32     v4, 0x3b808081                              // 00000010: 7E0802FF 3B808081
  v_mul_f32     v0, v4, v0                                  // 00000018: 10000104
  v_mul_f32     v1, v1, v4                                  // 0000001C: 10020901
  v_mul_f32     v2, v2, v4                                  // 00000020: 10040902
  v_mul_f32     v3, v3, v4                                  // 00000024: 10060903

Edit 2: Well, that was disappointing. I checked the manual and GCN does have a single instruction for this conversion; if I'm not mistaken it should be:

tbuffer_load_format_xyzw v[0:3], v0, s[4:7], 0 idxen format:[BUF_DATA_FORMAT_8_8_8_8,BUF_NUM_FORMAT_UNORM]

#5268945 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 02 January 2016 - 11:55 PM

This is one of the places where OpenGL is ahead of D3D.

OpenGL has unpackUnorm for this. It's cumbersome but gets the job done. On most modern hardware, this function maps directly to a native instruction. Unfortunately, as far as I know HLSL has no equivalent.

However you do have f16tof32 which is the next best thing.


Edit: Someone already wrote some util functions. With extreme luck the compiler recognizes the pattern and issues the native instruction instead of lots of bitshifting, masking and multiplication / division. You can at least check the results on GCN hardware using GPUPerfStudio's ShaderAnalyzer to see if the driver does indeed recognize what you're doing (I don't think it will though...).
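Such util functions boil down to the bitshift/mask/multiply pattern described earlier in the thread. A CPU-side sketch of that pattern (Python standing in for the HLSL, so it can be checked directly):

```python
# Emulated unpackUnorm4x8: 4 shifts, 4 masks, 4 int->float conversions and
# a 1/255 multiply -- the very sequence one hopes a driver collapses into
# a single native unpack instruction.
def unpack_unorm4x8(val):
    return tuple(((val >> (8 * i)) & 0xFF) / 255.0 for i in range(4))

# Byte 0 is component x, byte 3 is component w (little-endian packing).
assert unpack_unorm4x8(0x000000FF) == (1.0, 0.0, 0.0, 0.0)
assert unpack_unorm4x8(0xFF000000) == (0.0, 0.0, 0.0, 1.0)
```

The HLSL equivalent replaces the generator with four explicit shift-and-mask expressions and a mul by 1.0/255.0.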

#5268843 Vector4 W Component

Posted by Matias Goldberg on 02 January 2016 - 11:25 AM

So, what vector operations does W take part in exactly?
I assume not length.... It would be odd if W took part in the Length operation as the vector (2, 2, 2, 0) and the point (2, 2, 2, 1) would have different results.

If I wanted the length of just the XYZ components, I would use a Vector3. If I use a Vector4, I expect the length to account for all 4 components, because a Vector4 represents 4 dimensions, not 3.

On that same note, it does not make sense (to me) to include W in the dot product calculation either.

Same here again. A dot product that includes W is useful, for example, when dealing with plane equations and quaternions.

So, should i just ignore W for these operations: Addition, Subtraction, Scalar Multiplication, Dot Product, Cross Product, Length and Projection?

Nope, you shouldn't ignore it.
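To make the plane-equation case above concrete, here's a small sketch (dot4 is an illustrative stand-in for your math library's 4D dot):

```python
# A plane stored as (nx, ny, nz, d) dotted against a point (px, py, pz, 1)
# yields the signed distance from the point to the plane in one operation --
# a true 4D dot product, W included.
def dot4(a, b):
    return sum(x * y for x, y in zip(a, b))

plane = (0.0, 1.0, 0.0, -5.0)                      # the plane y = 5, normal +Y
assert dot4(plane, (3.0, 7.0, 2.0, 1.0)) == 2.0    # 2 units above the plane
assert dot4(plane, (3.0, 5.0, 2.0, 1.0)) == 0.0    # exactly on the plane
```

Zeroing W here would silently drop the plane's d term, which is exactly the kind of bug a "W-ignoring" Vector4 invites.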

#5268841 Vector4 W Component

Posted by Matias Goldberg on 02 January 2016 - 11:19 AM

As imoogiBG said, you're overthinking it.


Personally, I only use Vector4s when it makes sense (4x4 matrices involving projection; dealing with clip space / projection space).

Using 4x4 * Vector4 involves a lot of operations, and contributes to numerical instability.


Otherwise I use Vector3. If I have a matrix with rotation, scale, skew and translation; I use a 4x3 matrix (or an affine 4x4 with an affineTransform function that asserts the matrix is affine and then ignores the last row).

If I have a matrix and only want to apply rotation, scale and skew (no translation), I extract the 3x3 matrix and apply it to the Vector3.
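A minimal sketch of the 4x3 idea (affine_transform and the row layout are illustrative, not any particular library's API): the 3x3 block carries rotation/scale/skew, the 4th column carries translation, and the result matches a full 4x4 * (x, y, z, 1) without the wasted last-row math.

```python
# Apply a 4x3 affine matrix (3 rows of 4 floats) to a Vector3.
# Equivalent to 4x4 * (x, y, z, 1) when the 4x4's last row is (0, 0, 0, 1).
def affine_transform(m, v):
    return tuple(
        m[r][0] * v[0] + m[r][1] * v[1] + m[r][2] * v[2] + m[r][3]
        for r in range(3)
    )

m = [[1, 0, 0, 10],                # identity rotation, translate +10 in x
     [0, 1, 0,  0],
     [0, 0, 1,  0]]
assert affine_transform(m, (1, 2, 3)) == (11, 2, 3)
```

The assert-the-matrix-is-affine variant mentioned above just checks the implied last row is (0, 0, 0, 1) before doing the same math.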

And honestly, I try to avoid matrices and use quaternions instead (Position / Quaternion / Scale), since I don't need skewing, it's numerically the most stable method, and it's the most compact in memory.


Since I only use Vector4 in special cases (i.e. projection stuff), W almost always starts as 1 for me.

#5268792 Visual studio cannot compile 32bit or Release Mode

Posted by Matias Goldberg on 01 January 2016 - 11:56 PM

Looks like you've got a 64-bit DLL in the same folder as your EXE, causing a cascade of x64 DLLs to also be loaded.

I would start by checking that there is no msvcp140d.dll in your EXE folder.

#5268719 GLSL iOS values beyond 2048 modules operation fails

Posted by Matias Goldberg on 01 January 2016 - 11:39 AM

Ok, from what I can see, this is clearly a precision problem. gl_FragCoord must be stored as a 16-bit float, which would make perfect sense: 16-bit floats can represent integers up to 2048 exactly, but only multiples of 2 in the range (2048; 4096].
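The claim is easy to check with IEEE half floats (Python's struct module supports them via the 'e' format):

```python
# Round-trip values through a 16-bit IEEE half float. Integers up to 2048
# survive exactly; odd values in (2048, 4096] snap to an even neighbour.
import struct

def to_half_and_back(x):
    return struct.unpack('<e', struct.pack('<e', x))[0]

assert to_half_and_back(2047.0) == 2047.0   # exact below 2048
assert to_half_and_back(2048.0) == 2048.0   # still exact
assert to_half_and_back(2049.0) != 2049.0   # snaps to 2048 or 2050
```

That matches the observed behaviour: everything works up to 2048, then the modulo math starts failing on odd coordinates.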

By spec gl_FragCoord is defined to be mediump; but obviously that's going to break multiple apps on the iPad Pro and should be considered an iOS bug.

I suggest you file a bug ticket with Apple. They like having everything handed to them on a silver platter (can't blame them), so make a simple Xcode project that reproduces the problem, so they can quickly open it, build and run.

#5268640 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 03:37 PM

That's mainly what I was wondering about.
Do you have any references for these performance claims?

Personal experience, I'm afraid.

I'd be very interested to know what hardware has a penalty and how large it is.

GCN definitely is in theory "the same"; PowerVR (mobile) definitely prefers IA as there are gigantic gains (Metal, not D3D12), I can't recall which Intel and NVIDIA cards used FF but at least some NVIDIAs did (if not all of them?).

As for the performance difference, it's not big, but "it depends". First, the vertex bottleneck has to be big enough (which it usually isn't). Second, it depends on what you're doing in the shader and how complex it is.
For example, even when testing GCN (which in theory should be the same), in complex shaders the driver sometimes generates relatively long ISA to decode the formats (e.g. you stored them as 4 normalized shorts -> float4) when it should generate just one or two instructions. Granted, future driver versions could fix this.
If, for example, you use a UAV inside the vertex shader, the penalty becomes much bigger, as there is no restrict equivalent, the loads get delayed, and the shader suddenly blows up with waitcnt instructions.

You will always have to have the knowledge to match appropriate vertex buffer data with vertex shaders.
The main difference is your PSO creation code doesn't need to know or care about the vertex format if you don't use the IA.
This brings a significant reduction in complexity and coupling IMO.
I work on AAA projects as well as smaller projects.
I don't see why it wouldn't scale to bigger projects.

Reduces coupling? Agreed. But coupling became irrelevant with PSOs, because PSOs couple almost everything together. In the D3D11 view, input layouts made my life hell because they mixed the shader, the vertex buffer and the vertex layout; but this relation wasn't obvious, so I tried to abstract the three separately and ended up with an entangled mess. If you weren't careful, each vertex buffer would need one IA layout for each shader it was associated with (should this IA layout live with the vertex buffer? or with the shader?).
A PSO approach made my life much easier (even outside D3D12), since now vertex buffers just need to be accompanied by a vertex description, and to generate a PSO you need absolutely everything; the result lives inside the PSO.

I don't see why it wouldn't scale to bigger projects.

Because as a project grows, shader A works well on meshes M, N & O, but should not be used with mesh P. To make it work on mesh P, you need shader A'.

To detect this situation you need some form of vertex description, either to log a warning or to automatically modify the shader (if emulating) so that shader A becomes shader A'; or you let it glitch and lose a lot of time wondering what's going on (if anyone even notices it's glitching).

Maybe the artist exported P incorrectly. But without a vertex description, you can't tell why.

And if you're manually fetching vertex data via SV_VertexID, you need to grab the correct shader for P; or autogenerate it correctly (if it's tool assisted).

FWIW I believe this is the method Nitrous Engine uses.

Yes, Mantle had no vertex descriptions because GCN doesn't need them at all, so it just relied on StructuredBuffers. Though I always wondered if quirks like these were the reason D3D11 beat Mantle in GPU-bottlenecked benchmarks. After all, they were relying on the HLSL compiler to generate the shader, rather than using a shader language that better matches GCN.
D3D12 & Vulkan added them back because of the other vendors.

#5268618 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 12:55 PM

These feel like legacy fixed-function constructs.
If you're building something new is there a good reason to use VertexBuffers and the InputAssembler at all?
Why not just use a StructuredBuffer<Vertex> SRV or unbounded ConstantBuffer<Vertex> array CBV (T3 hardware) and index them with SV_VertexID?

Because these fixed-function constructs aren't legacy. On some hardware they are very current. On other GPUs, though, there is practically no difference (aside from compiler optimizations that cannot be performed due to the guarantees StructuredBuffer/ConstantBuffer give regarding caching, alignment, ordering and aliasing).

You can ignore the IA and emulate it yourself with SV_VertexID, but doing so may result in sub-optimal performance on certain hardware.

Specifying D3D12_INPUT_ELEMENT_DESC requires detailed knowledge of both the vertex shader and the vertex data.

Yes. But emulating it with SV_VertexID requires detailed knowledge of both the vertex shader and the vertex data too, since you have to make sure the right vertex shader is used with the right vertex data. Perhaps what you mean is that by emulating it you can avoid caring about this and just force it.
That works for small projects, where you can mentally track which shader goes with which mesh, and it feels faster for development (no time wasted specifying vertex formats). But it doesn't scale to bigger projects.