
Matias Goldberg

Member Since 02 Jul 2006

#5269429 HDR Rendering (Average Luminance)

Posted by Matias Goldberg on 05 January 2016 - 09:54 AM

do a bilinear sample in the middle of a 2x2 quad of pixels and the hardware will average them for you as long as you get the texture coordinate right

Emphasis is mine. I should note it gets really tricky to get perfectly right; it took me days of RenderDoc debugging, and it was always off by some small amount.

#5269232 Criticism of C++

Posted by Matias Goldberg on 04 January 2016 - 11:38 AM

If that is something that makes the language unviable for you, use a different one.

Can we not have a civil discourse on the pros/cons of a language without resorting to this?

He actually explained in that same post why. As he explained, even in 2015 he's writing code where char is 32 bits.
And suggesting to use a different language is not uncivilized.
It's simply a brutal truth. Some people want C++ to do stuff it's not intended to do and solve issues it is not supposed to solve; and changing C++ to please these people will anger those who need C++ the way it is now.
You can't satisfy everyone at the same time. Those who are unsatisfied can move to another language, because what they want from C++ is not what they need.

The evil's of premature optimization are always taught in software design, get it working correctly then worry about shaving off a few bytes or cycles here or there.

Yet as L. Spiro points out, a lot of people get it wrong.

What we have is the opposite, where the default is fast and then we have to over-ride it with the 'correct' typedefs.

That is simply not true.

#5269090 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 03 January 2016 - 06:40 PM

I'm not even sure GCN has an instruction for what he wants to do. The best I can figure out it would be 4 v_cvt_f32_ubyte[0|1|2|3] and then 4 v_mul_f32 by 1/255.0f.

Maybe yes, maybe not, but what I mean is that it's still very far from doing 4 loads, 4 bitshifts, 4 'and' masks, 4 conversions to float, then the 1/255 mul.

Edit: Checked, you're right about the instructions. "fragCol = unpackUnorm4x8(val);" outputs: (irrelevant ISA code stripped):

  v_cvt_f32_ubyte0  v0, s4                                  // 00000000: 7E002204
  v_cvt_f32_ubyte1  v1, s4                                  // 00000004: 7E022404
  v_cvt_f32_ubyte2  v2, s4                                  // 00000008: 7E042604
  v_cvt_f32_ubyte3  v3, s4                                  // 0000000C: 7E062804
  v_mov_b32     v4, 0x3b808081                              // 00000010: 7E0802FF 3B808081
  v_mul_f32     v0, v4, v0                                  // 00000018: 10000104
  v_mul_f32     v1, v1, v4                                  // 0000001C: 10020901
  v_mul_f32     v2, v2, v4                                  // 00000020: 10040902
  v_mul_f32     v3, v3, v4                                  // 00000024: 10060903

Edit 2: Well, that was disappointing. I checked the manual and GCN does have a single instruction for this conversion, if I'm not mistaken it should be:

tbuffer_load_format_xyzw v[0:3], v0, s[4:7], 0 idxen format:[BUF_DATA_FORMAT_8_8_8_8,BUF_NUM_FORMAT_UNORM]

#5268945 Reducing byte transfer between C++ and HLSL.

Posted by Matias Goldberg on 02 January 2016 - 11:55 PM

This is one of the places where OpenGL is ahead of D3D.

OpenGL has unpackUnorm for this. It's cumbersome but gets the job done. On most modern hardware, this function maps directly to a native instruction. Unfortunately, as far as I know HLSL has no equivalent.

However you do have f16tof32 which is the next best thing.


Edit: Someone already wrote some util functions. With extreme luck the compiler recognizes the pattern and issues the native instruction instead of lots of bitshifting, masking and multiplication / division. You can at least check the results on GCN hardware using GPUPerfStudio's ShaderAnalyzer to see if the driver does indeed recognize what you're doing (I don't think it will though...).

#5268843 Vector4 W Component

Posted by Matias Goldberg on 02 January 2016 - 11:25 AM

So, what vector operations does W take part in exactly?
I assume not length.... It would be odd if W took part in the Length operation as the vector (2, 2, 2, 0) and the point (2, 2, 2, 1) would have different results.

If I wanted to take the length of the XYZ components, I would use a Vector3. If I use a Vector4, I expect the length to account for all 4 components, because a Vector4 represents 4 dimensions, not 3.

On that same note, it does not make sense (to me) to include W in the dot product calculation either.

Same here again. Dot including W is useful for example when dealing with plane equations and quaternions.

So, should i just ignore W for these operations: Addition, Subtraction, Scalar Multiplication, Dot Product, Cross Product, Length and Projection?

Nope, you shouldn't ignore it.

#5268841 Vector4 W Component

Posted by Matias Goldberg on 02 January 2016 - 11:19 AM

As imoogiBG said, you're overthinking it.


Personally, I only use Vector4s when it makes sense (4x4 matrices involving projection; dealing with clip space / projection space).

Using 4x4 * Vector4 involves a lot of operations, and contributes to numerical instability.


Otherwise I use Vector3. If I have a matrix with rotation, scale, skew and translation; I use a 4x3 matrix (or an affine 4x4 with an affineTransform function that asserts the matrix is affine and then ignores the last row).

If I have a matrix and only want to apply rotation scale and skew (no translation) I extract the 3x3 matrix and apply it to the Vector3.

And honestly, I try to avoid matrices and use quaternions instead (Position / Quaternion / Scale), since I don't need skewing and it is numerically the most stable method (and the most memory-compact).


Since I only use Vector4 on special cases (ie. projection stuff) that means W almost always starts as 1 for me.

#5268792 Visual studio cannot compile 32bit or Release Mode

Posted by Matias Goldberg on 01 January 2016 - 11:56 PM

Looks like you've got a 64-bit DLL in the same folder as your EXE, causing a cascade of x64 DLLs to also be loaded.

I would start by checking that there is no msvcp140d.dll in your EXE folder.

#5268719 GLSL iOS values beyond 2048 modules operation fails

Posted by Matias Goldberg on 01 January 2016 - 11:39 AM

Ok, from what I can see, this is clearly a precision problem. gl_FragCoord must be stored as a 16-bit floating point value, which would make perfect sense: 16-bit floats can represent integers up to 2048 exactly, but can only represent multiples of 2 in the range [2048; 4096].

By spec gl_FragCoord is defined to be mediump; but obviously that's going to break multiple apps on the iPad Pro and should be considered an iOS bug.

I suggest you raise a bug ticket with Apple. They like having everything handed to them on a silver platter (can't blame them), so make a simple Xcode project that reproduces the problem, so they can quickly open it in Xcode, build and run.

#5268640 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 03:37 PM

That's mainly what I was wondering about.
Do you have any references for these performance claims?

Personal experience, I'm afraid.

I'd be very interested to know what hardware has a penalty and how large it is.

GCN definitely is, in theory, "the same"; PowerVR (mobile) definitely prefers the IA, as there are gigantic gains (Metal, not D3D12); I can't recall which Intel and NVIDIA cards used fixed-function vertex fetch, but at least some NVIDIA ones did (if not all of them?).

As for the performance difference, it's not big, but "it depends". First, the vertex bottleneck has to be big enough (which it usually isn't). Second, it depends on what you're doing in the shader and how complex it is.
For example, even when testing GCN (which, in theory, should be the same), in complex shaders the driver sometimes generates relatively long ISA sequences to decode the formats (e.g. you stored them as 4 normalized shorts -> float4) when it should generate just one or two instructions. Granted, future driver versions would fix this.
If, for example, you use a UAV inside the vertex shader, the penalty becomes much bigger, as there is no restrict equivalent, the loads become delayed, and the shader suddenly fills with waitcnt instructions.

You will always have to have the knowledge to match appropriate vertex buffer data with vertex shaders.
The main difference is your PSO creation code doesn't need to know or care about the vertex format if you don't use the IA.
This brings a significant reduction in complexity and coupling IMO.
I work on AAA projects as well as smaller projects.
I don't see why it wouldn't scale to bigger projects.

Reduces coupling? Agreed. But coupling became irrelevant with PSOs, because PSOs couple almost everything together. In a D3D11 view, input layouts made my life hell because they mixed the shader, the vertex buffer, and the vertex layout; but this relation wasn't obvious, so I tried to abstract the three separately and ended up with an entangled mess. If you weren't careful enough, each vertex buffer would need one IA layout for each shader it was associated with (should this IA layout live with the vertex buffer? or with the shader?).
A PSO approach made my life much easier (even outside D3D12), since now vertex buffers just need to be accompanied by a vertex description, and to generate a PSO you need absolutely everything. And the result lives inside the PSO.

I don't see why it wouldn't scale to bigger projects.

Because as a project becomes bigger, shader A works well on meshes M, N & O, but it should not be used with mesh P. To make it work on mesh P, you need shader A'.

To detect this situation you need some form of vertex description to log a warning or automatically modify the shader (if emulating) so that Shader A becomes shader A'; or you let it glitch and lose a lot of time wondering what's going on (if someone notices it's glitching).

Maybe the artist exported P incorrectly. But without a vertex description, you can't tell why.

And if you're manually fetching vertex data via SV_VertexID, you need to grab the correct shader for P; or autogenerate it correctly (if it's tool assisted).

FWIW I believe this is the method Nitrous Engine uses.

Yes, Mantle had no vertex descriptions because GCN didn't need them at all, so it just relied on StructuredBuffers. Though I always wondered if quirks like these were the reason D3D11 beat Mantle on GPU-bottlenecked benchmarks. After all, they were relying on the HLSL compiler to generate the shader, rather than using a shader language that better matched GCN.
D3D12 & Vulkan added them back because of the other vendors.

#5268618 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 12:55 PM

These feel like legacy fixed-function constructs.
If you're building something new is there a good reason to use VertexBuffers and the InputAssembler at all?
Why not just use a StructuedBuffer<Vertex> SRV or unbounded ConstantBuffer<Vertex> array CBV (T3 hardware) and index them with the SV_VertexID?

Because these fixed-function constructs aren't legacy. On some hardware they are very current. On other GPUs, though, there is practically no difference (aside from compiler optimizations that cannot be performed due to the guarantees StructuredBuffer/ConstantBuffer give regarding caching, alignment, ordering and aliasing).

You can ignore the IA and emulate it yourself with SV_VertexID, but doing so may result in sub-optimal performance on certain hardware.

Specifying D3D12_INPUT_ELEMENT_DESC requires detailed knowledge of both the vertex shader and the vertex data.

Yes. But emulating it with SV_VertexID requires detailed knowledge of both the vertex shader and the vertex data too, as you have to make sure the right vertex shader is used with the right vertex data. Perhaps what you mean is that by emulating it you can avoid caring about this and just force it.
It works for small projects where you can mentally track which shader goes with which mesh, and it feels faster for development (no time wasted specifying the vertex format). But it doesn't scale to bigger projects.

#5268536 G-Buffer and Render Target format for Normals

Posted by Matias Goldberg on 30 December 2015 - 06:51 PM

Thank you, Matias!
DXGI_FORMAT_R10G10B10A2_UNORM removed all the artifacts.

I'm glad it worked for you. Just remember that UNORM stores values in the [0; 1] range, so you need to convert your [-1; 1] range to [0; 1] by hand, doing rtt = normal * 0.5f + 0.5f (and then the opposite when reading).

#5268468 GLSL iOS values beyond 2048 modules operation fails

Posted by Matias Goldberg on 30 December 2015 - 08:37 AM

First, you may want to start printing gl_FragCoord.x / 2732.0f to see if you actually get a black to white gradient; the resolution may be different from what you expect. And be sure you've declared gl_FragCoord as highp.


Second, when floating point precision begins to have problems, it will start by eliminating odd numbers and preserving even numbers. This is extremely suspicious.

I wouldn't be surprised if by oversight gl_FragCoord.x doesn't have enough precision to represent the entire iPad Pro's resolution. See Precision limitations on integer values.

#5268410 [D3D12] Driver level check to avoid duplicate function call?

Posted by Matias Goldberg on 29 December 2015 - 06:09 PM

Does the driver do some basic check to prevent duplicate commands?

No. That was D3D11's motto. D3D12 is exactly the opposite: you get what you ask for.

However, because PSOs are a huge block of state (designed to fit all hardware efficiently) while not all hardware actually requires all that data as a fixed block, a particular driver may go through the PSO, check what actually changed, and skip the parts that didn't.
But this isn't a guarantee and you shouldn't rely on it. It's vendor, model and driver specific.

Or we as developer have to do this kind of check ourself?(like using hash value to identify duplicated func call with same params and avoid it?) or the perf delta is negligible?


  pso = getPsoFromCache( draw_parameters );
  if( pso != lastPso )
  {
      SetPipelineState( pso );
      lastPso = pso;
  }

See Valve's slides on fast multithreaded PSO caching (slides 13-23 PPT version may be animated).

#5268370 G-Buffer and Render Target format for Normals

Posted by Matias Goldberg on 29 December 2015 - 01:40 PM


Three partial-precision floating-point numbers encoded into a single 32-bit value (a variant of s10e5, which is sign bit, 10-bit mantissa, and 5-bit biased (15) exponent). There are no sign bits, and there is a 5-bit biased (15) exponent for each channel, 6-bit mantissa for R and G, and a 5-bit mantissa for B, as shown in the following illustration.

First, there is no sign bit, so I suppose negative values become positive or get clamped to 0. You definitely don't want that.
Second, normals are in the [-1; 1] range. You will get much better precision by using DXGI_FORMAT_R10G10B10A2_UNORM, which gets you 9 bits for the value and 1 bit for the sign; versus this float format, which spends 6 bits (5 for blue) on the mantissa and 5 on the exponent.

Looks like you made a poor choice of format.

3) Use some math to calculate Z-value on first X and Y. But I want to avoid this approach.

Why? GPUs have plenty of ALU to spare but bandwidth is precious.

Btw, there's Crytek's best fit normals technique, which gets impressive quality results on just RGB888 RTs.

#5268245 IMGUI

Posted by Matias Goldberg on 28 December 2015 - 07:25 PM

I've been looking into Dear ImGui for a pet project, since it's quite popular, in active development, very stable, fast, lightweight and easy to use. I didn't like its default look at first, but cmftStudio is proof that it can look good. I couldn't have asked for anything better.
But there were two issues that were blocking for me:


1. Mouse/cursor centric. I'm trying to make a UI that can also be traversed with a gamepad (or keyboard) and no mouse, like console games.

Dear ImGui doesn't seem to offer this, and it suggests using Synergy instead. I don't know how hard it would be to add support for it, but ImGui's codebase is large (and a single file!), which doesn't make it easy to evaluate whether adding this kind of support by hand would be easy.


2. "Threading". Not in the usual way. TBH this is more of a problem with IMGUI in general. Due to how my system works, the user interface is set up from scripts which run in the logic thread, while the actual UI runs in the graphics thread and then passes messages to the logic thread for the script to process important events (like "user clicked button X"), but not the trivial ones unless specifically requested during setup (like "user pressed down arrow to go to the next widget").

Obviously this is an RMGUI way of doing things and doesn't translate well to IMGUIs. I could try to refactor my engine to allow the graphics thread to run simple scripts and work around the issue. But this is kind of a big time investment; it isn't a deal breaker on its own, but when you add the previous point into consideration, I get grumpy.


So in the end I'll probably end up writing my own RMGUI system to suit my needs. It's not that hard for me anyway (for games). I may or may not reuse code from Dear ImGui (after all, it's damn good) or borrow ideas from it.