
Matias Goldberg

Member Since 02 Jul 2006

#5273015 D3d12 : d24_x8 format to rgba8?

Posted by Matias Goldberg on 28 January 2016 - 10:57 AM

Yes, they mentioned it on some Twitter account, but then does GCN store the depth value as 32 bits if a 24-bit depth texture is requested?
Since there is no bandwidth advantage when 24 bits need to be stored in a 32-bit location with 8 bits wasted, the driver might as well promote d24x8 to d32 + r8?

No, they store it as 24-bit fixed point with 8 bits unused. It only uses 32 bits if you request a floating point depth buffer, and they can't promote from fixed point -> floating point since the distribution of precision is different.

Pretty much this. They cannot promote it for you since the behavior is very different. They must honour 24-bit integer precision.

As for the bandwidth, this is why AMD recommends that if you never use the stencil, don't ask for a depth buffer with stencil capabilities.

#5272800 D3d12 : d24_x8 format to rgba8?

Posted by Matias Goldberg on 26 January 2016 - 11:53 PM

IIRC AMD GCN always stores the stencil and depth separately, so this hack won't work there.

#5272515 transition barrier strictness

Posted by Matias Goldberg on 24 January 2016 - 04:25 PM

D3D12 is explicitly targeted at "expert" graphics programmers who already have a background in GPU hardware and modern APIs.
This is why D3D11 is not going away and is still being updated (i.e. D3D11.3).
Note I'm not calling you a rookie. I'm just saying this is what to expect from D3D12.

If the debug runtime doesn't complain, does it just not matter?

No, it just means the debug layer didn't catch it. Hopefully it will improve with time.

#5271406 Understanding cross product without delving too much on Linear algebra

Posted by Matias Goldberg on 16 January 2016 - 09:13 AM

If you have Firefox (it won't work in Chrome, which dropped NPAPI support) and the Java plugin (and most security settings disabled), you can run the interactive demo:



IMHO that's the best tutorial I ever found on cross products.

#5271338 VSSM

Posted by Matias Goldberg on 15 January 2016 - 03:05 PM

I have no idea either. VSM is the popular technique, VSSM is not popular. You can try guessing from the rest of the steps.

The link you gave is just an abstract preview. It appears the actual paper is here which may provide better insight.

#5271097 Vertex to cube using geometry shader

Posted by Matias Goldberg on 14 January 2016 - 12:43 PM

PS: I'm still curious how you pulled that off with only 14 vertices

14-vertex tristrip cube in the vertex shader, using only the vertex ID (no UVs or normals):
uint b = 1u << i;  // i = SV_VertexID, 0..13
float x = (0x287au & b) != 0u;
float y = (0x02afu & b) != 0u;
float z = (0x31e3u & b) != 0u;

#5270487 Injection LPV Geometry Shader

Posted by Matias Goldberg on 10 January 2016 - 11:20 PM

Use RenderDoc. It will provide a lot of help in debugging your problem.


Normally those errors mean the structure definitions between the VS & GS and the GS & PS don't match. That doesn't seem to be the case here, although I've never worked with a GS that takes points instead of tris.

This error can also happen if you later render something else and changed the VS & PS but forgot to unbind the GS (a very common mistake).


You're also getting errors because your pixel shader uses MRT and outputs to 3 RTs, but you never call OMSetRenderTargets with three different RTs. It looks like you're confusing PSSetShaderResources with OMSetRenderTargets.


And last but not least. Use RenderDoc. It's an invaluable tool which will help you understand what's going wrong.

#5270209 none issue when near plane = 0.01, is it just matter of time when I see artif...

Posted by Matias Goldberg on 08 January 2016 - 09:31 PM

The value of near plane is irrelevant on its own - what matters is the relative magnitude of the near and far plane to each other.

While true, a slight change to the near plane yields a much bigger precision improvement than a big change to the far plane. I wish I had the numbers at hand, but it was something like: you gain much more precision by raising the near plane from 0.1 to 1 than by lowering the far plane from 10,000 to 1,000.

As for depth precision improvements, see

#5270102 What happens with 32bit overflow?

Posted by Matias Goldberg on 08 January 2016 - 11:38 AM

It's not a simple question actually.

Virtual address is not the same as virtual memory.


32-bit processes cannot address more than 2^32 bytes of memory. However, they can access more than that, since they only work on sections of it at a time and remap different chunks of physical memory into their virtual address space as needed.

With this trick, on 32-bit OSes, up to 64GB of memory can be accessed (assuming PAE is enabled). On 64-bit OSes I don't know if the 64GB limit still applies, or if they can use more.

Here's an example on how to do that: https://blogs.msdn.microsoft.com/oldnewthing/20040810-00/?p=38203


You will notice that you need OS specific calls to do this. Regular malloc won't do.


Going back to the original question: what will happen? Well, it depends on how the program is coded:

  1. Apps using malloc that run out of virtual address ranges will get a null pointer. If the app checks for this condition, it can try to recover (i.e. start freeing memory and try again, show an error and quit, silently quit, etc.). If the app doesn't check for this condition, it will continue to execute normally until the null pointer is used, which will lead to a crash. Usually malloc'ed memory is used immediately afterwards, so the crash will happen close to the malloc call; but who knows how long the crash could be delayed if the null pointer isn't accessed soon. It's also likely many subsequent malloc calls will fail, which means a crash is imminent unless a lot of memory suddenly becomes available again.
  2. Apps using C++'s new operator will get a thrown exception when memory runs out; left uncaught, it terminates the app immediately. However the app can catch this exception and try to recover, as in the malloc case. Poorly written C++ code may catch all exceptions and not notice it's swallowing an out-of-memory exception; in that case it will continue to execute, except now it has an uninitialized pointer (which is worse than a null pointer). Uninitialized pointers normally crash immediately when used, or may corrupt the process' memory and keep running until it's so broken it can't work anymore or crashes (think of the Game Boy Pokemon glitches).
  3. Apps in other languages (i.e. Python, Ruby, Java) depend on how their interpreter manages out-of-memory conditions, but most handle it by just crashing or raising an exception (it also depends on whether the interpreter couldn't honour the amount of memory the executed code requested, or whether the interpreter itself ran out of memory inside its internal routines).
  4. Apps manually managing memory (the CreateFileMapping method in the blog) are advanced enough that we can assume they handle this condition. When out of memory, CreateFileMapping and/or MapViewOfFile will fail. The rest is as in the malloc case.

Edit: What I wrote is Windows/Linux/OS X specific. As described above by swiftcoder, iOS and Android are much stricter about memory (they immediately terminate a process that starts consuming too much).

#5269844 DirectX evolution

Posted by Matias Goldberg on 07 January 2016 - 09:26 AM

I am trying to support DX9, DX11 and DX12 in my framework, and for that I really need to know how things are done in each API

It's worth pointing out that before DX10 arrived (i.e. pre-2005), the advice on "how things should be done" in DX9 was different.
This is because:

  1. New techniques were discovered that ended up faster than recommended approaches
  2. Hardware was actually based around constant registers (current hw is based around constant buffers)
  3. Drivers change and they optimize for different patterns in their heuristics.
  4. Rendering the old "D3D9 way" and then porting straight to D3D11 resulted in horrible performance (9 performing considerably better than 11).
  5. Rendering the "D3D11 way" and then mimicking the same in D3D9 resulted in good performance (11 beating 9; and 9 having similar or even better performance than the 'other way' using 9).
  6. Having two separate paths for rendering "the D3D9 way" and "the D3D11 way" is a ton of work, and sometimes too contradictory.

The same could be said about 2008-era DX10 recommendations and 2015-era DX11 recommendations (which still apply to 10). We do things differently now (not completely differently, but incrementally differently), basically because we found new ways that turned out to be better.


Because it's an incremental evolution, the general advice from 2003 still holds true (i.e. the Batch Batch Batch presentation from NVIDIA for the GeForce 6).

However, I personally consider batch submission a solved problem since D3D11 added BaseVertexLocation and StartInstanceLocation, which allow multi-mesh rendering without changing the vertex buffer; plus the work on command lists, which lets you fill constant buffers and prepare draw commands from multiple threads (and which got enhanced in D3D12). This is basically impossible to do in D3D9; though you can build a layer that emulates it, and surprisingly it sometimes runs even faster thanks to better cache coherency during playback and being able to distribute work across cores during recording.


But if for some odd reason you're batch limited even in D3D11/12, the general thinking from the batch, batch, batch presentation still holds (i.e. batch limited = increasing vertex count becomes free). For example to combat batching issues, Just Cause 2 introduced the term merge-instancing.


If you need a historical archive of the API's evolution: go to the GDC Vault and download the API slides in chronological order; then google the SIGGRAPH 2003-2015 slides in chronological order; google the "GameFest" slides from Microsoft (I think they ran from 2006 through 2009); and visit the NVIDIA (2007, 2008, 2009, 2010; change the address bar for the rest of the years) and AMD sites (archive), go to the SDK/Resources section, and look for the older entries.

#5269429 HDR Rendering (Average Luminance)

Posted by Matias Goldberg on 05 January 2016 - 09:54 AM

do a bilinear sample in the middle of a 2x2 quad of pixels and the hardware will average them for you as long as you get the texture coordinate right

Emphasis is mine. I should note it gets really tricky to get perfectly right. It took me days of RenderDoc debugging; it was always off by some small amount.

#5269232 Criticism of C++

Posted by Matias Goldberg on 04 January 2016 - 11:38 AM

If that is something that makes the language unviable for you, use a different one.

Can we not have a civil discourse on the pros/cons of a language without resorting to this?

He actually explained why in that same post. As he said, even in 2015 he's writing code where char is 32 bits.
And suggesting a different language is not uncivil.
It's simply a brutal truth. Some people want C++ to do things it's not intended to do, to solve problems it's not supposed to solve; and changing C++ to please these people would anger those who need C++ the way it is now.
You can't satisfy everyone at the same time. Those left unsatisfied can move to another language, because what they want from C++ is not what they need.

The evils of premature optimization are always taught in software design: get it working correctly, then worry about shaving off a few bytes or cycles here and there.

Yet as L. Spiro points out, a lot of people get it wrong.

What we have is the opposite, where the default is fast and then we have to over-ride it with the 'correct' typedefs.

That is simply not true.

#5269090 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 03 January 2016 - 06:40 PM

I'm not even sure GCN has an instruction for what he wants to do. The best I can figure, it would be 4 v_cvt_f32_ubyte[0|1|2|3] and then 4 v_mul_f32 by 1/255.0f.

Maybe yes, maybe not; but what I mean is that it's still very far from doing 4 loads, 4 bitshifts, 4 'and' masks, 4 conversions to float, and then the 1/255 multiply.

Edit: Checked, you're right about the instructions. "fragCol = unpackUnorm4x8(val);" outputs: (irrelevant ISA code stripped):

  v_cvt_f32_ubyte0  v0, s4                                  // 00000000: 7E002204
  v_cvt_f32_ubyte1  v1, s4                                  // 00000004: 7E022404
  v_cvt_f32_ubyte2  v2, s4                                  // 00000008: 7E042604
  v_cvt_f32_ubyte3  v3, s4                                  // 0000000C: 7E062804
  v_mov_b32     v4, 0x3b808081                              // 00000010: 7E0802FF 3B808081
  v_mul_f32     v0, v4, v0                                  // 00000018: 10000104
  v_mul_f32     v1, v1, v4                                  // 0000001C: 10020901
  v_mul_f32     v2, v2, v4                                  // 00000020: 10040902
  v_mul_f32     v3, v3, v4                                  // 00000024: 10060903

Edit 2: Well, that was disappointing. I checked the manual and GCN does have a single instruction for this conversion, if I'm not mistaken it should be:

tbuffer_load_format_xyzw v[0:3], v0, s[4:7], 0 idxen format:[BUF_DATA_FORMAT_8_8_8_8,BUF_NUM_FORMAT_UNORM]

#5268945 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 02 January 2016 - 11:55 PM

This is one of the places where OpenGL is ahead of D3D.

OpenGL has unpackUnorm for this. It's cumbersome but gets the job done. On most modern hardware, this function maps directly to a native instruction. Unfortunately, as far as I know HLSL has no equivalent.

However you do have f16tof32 which is the next best thing.


Edit: Someone already wrote some util functions. With luck the compiler recognizes the pattern and issues the native instruction instead of lots of bitshifting, masking and multiplying/dividing. You can at least check the results on GCN hardware using GPUPerfStudio's ShaderAnalyzer to see whether the driver recognizes what you're doing (I don't think it will, though...).

#5268843 Vector4 W Component

Posted by Matias Goldberg on 02 January 2016 - 11:25 AM

So, what vector operations does W take part in exactly?
I assume not length.... It would be odd if W took part in the Length operation, as the vector (2, 2, 2, 0) and the point (2, 2, 2, 1) would give different results.

If I wanted the length of just the XYZ components, I would use a Vector3. If I use a Vector4, I expect the length to account for all 4 components, because a Vector4 represents 4 dimensions, not 3.

On that same note, it does not make sense (to me) to include W in the dot product calculation either.

Same here again. Dot including W is useful for example when dealing with plane equations and quaternions.

So, should i just ignore W for these operations: Addition, Subtraction, Scalar Multiplication, Dot Product, Cross Product, Length and Projection?

Nope, you shouldn't ignore it.