Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Today, 02:13 PM

#5268792 Visual studio cannot compile 32bit or Release Mode

Posted by Matias Goldberg on 01 January 2016 - 11:56 PM

Looks like you've got a 64-bit DLL in the same folder as your EXE; causing a cascade of x64 dlls to also be included.

I would start by checking that there is no msvcp140d.dll in your EXE folder.

#5268719 GLSL iOS values beyond 2048 modules operation fails

Posted by Matias Goldberg on 01 January 2016 - 11:39 AM

Ok, from what I can see, this is clearly a precision problem. gl_FragCoord must be a 16-bit floating point; which would make perfect sense because 16-bit floats can represent up to 2048 perfectly, but can only represent multiples of 2 between the range [2048; 4096].
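The 2048 limit follows from a 16-bit float having an 11-bit significand (10 stored bits plus the implicit leading 1). Below is a minimal C++ sketch that checks which integers survive such a format; the function names are made up for illustration, and it models IEEE 754 binary16, which is only assumed to match what the GPU does with mediump.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the non-negative integer n is exactly representable as an
// IEEE 754 binary16 ("half") value. After dropping trailing zero bits, what
// is left must fit in the 11-bit significand.
bool isExactInHalf(uint32_t n)
{
    if (n == 0u)
        return true;
    if (n > 65504u) // largest finite half value
        return false;
    while ((n & 1u) == 0u)
        n >>= 1u;
    return n < (1u << 11);
}

// Smallest step between representable integers near n (1 <= n <= 65504).
uint32_t halfIntegerStepAt(uint32_t n)
{
    assert(n >= 1u && n <= 65504u);
    uint32_t exponent = 0u; // floor(log2(n))
    while ((n >> (exponent + 1u)) != 0u)
        ++exponent;
    return exponent >= 10u ? (1u << (exponent - 10u)) : 1u;
}
```

With this model 2049 is not representable but 2050 is, and the step doubles again at 4096.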

By spec gl_FragCoord is defined to be mediump; but obviously that's going to break multiple apps on the iPad Pro and should be considered an iOS bug.

I suggest you file a bug report with Apple. They like having everything handed to them on a silver platter (can't blame them), so make a simple Xcode project that reproduces the problem so they can quickly open it in Xcode, build and run.

#5268640 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 03:37 PM

That's mainly what I was wondering about.
Do you have any references for these performance claims?

Personal experience, I'm afraid.

I'd be very interested to know what hardware has a penalty and how large it is.

GCN definitely is in theory "the same"; PowerVR (mobile) definitely prefers IA as there are gigantic gains (Metal, not D3D12), I can't recall which Intel and NVIDIA cards used FF but at least some NVIDIAs did (if not all of them?).

As for the performance difference, it's not big, but "it depends". First, the vertex bottleneck has to be big enough (which it usually isn't). Second, it depends on what you're doing in the shader and how complex it is.
For example, even when testing GCN (where in theory it should be the same), in complex shaders the driver sometimes generates relatively long ISA sequences to decode the formats (e.g. you stored them as 4 normalized shorts -> float4) when it should generate just one or two instructions. Granted, future driver versions would fix this.
If, for example, you use a UAV inside the vertex shader, the penalty becomes much bigger: there is no restrict equivalent, so the loads get delayed and the shader suddenly blows up with waitcnt instructions.

You will always have to have the knowledge to match appropriate vertex buffer data with vertex shaders.
The main difference is your PSO creation code doesn't need to know or care about the vertex format if you don't use the IA.
This brings a significant reduction in complexity and coupling IMO.
I work on AAA projects as well as smaller projects.
I don't see why it wouldn't scale to bigger projects.

Reduces coupling? Agreed. But coupling became irrelevant with PSOs, because PSOs coupled almost everything together. From a D3D11 point of view, input layouts made my life hell because they mixed the shader, the vertex buffer, and the vertex layout; but this relation wasn't obvious, so I tried to abstract the three separately and ended up with an entangled mess. If you weren't careful enough, each vertex buffer would need one IA layout for each shader it was associated with (should this IA live with the vertex buffer? or with the shader?)
A PSO approach made my life much easier (even outside D3D12) since now vertex buffers just need to be accompanied by a vertex description, and to generate a PSO you need absolutely everything. And the result lives inside the PSO.

I don't see why it wouldn't scale to bigger projects.

Because as a project becomes bigger, shader A works well on mesh M, N & O, but it should not be used with mesh P. To make it work on mesh P, you need shader A'

To detect this situation you need some form of vertex description to log a warning or automatically modify the shader (if emulating) so that Shader A becomes shader A'; or you let it glitch and lose a lot of time wondering what's going on (if someone notices it's glitching).

Maybe the artist exported P incorrectly. But without a vertex description, you can't tell why.

And if you're manually fetching vertex data via SV_VertexID, you need to grab the correct shader for P; or autogenerate it correctly (if it's tool assisted).
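To illustrate the kind of check a vertex description makes possible, here is a minimal C++ sketch. Everything in it (ElementFormat, VertexElement, findMismatches) is hypothetical and API-agnostic; a real engine would derive the shader side from reflection or from its own metadata.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical vertex description; names and formats are illustrative only.
enum class ElementFormat { Float2, Float3, Short4Norm, UByte4Norm };

struct VertexElement
{
    std::string   semantic; // e.g. "POSITION", "NORMAL"
    ElementFormat format;
};

using VertexDescription = std::vector<VertexElement>;

// Returns the semantics the shader expects but the mesh either lacks or
// stores in a different format. An empty result means the pair is compatible.
std::vector<std::string> findMismatches(const VertexDescription &shaderInputs,
                                        const VertexDescription &meshLayout)
{
    std::vector<std::string> problems;
    for (const VertexElement &wanted : shaderInputs)
    {
        assert(!wanted.semantic.empty()); // an unnamed input is a tooling bug
        auto it = std::find_if(meshLayout.begin(), meshLayout.end(),
            [&](const VertexElement &e) { return e.semantic == wanted.semantic; });
        if (it == meshLayout.end() || it->format != wanted.format)
            problems.push_back(wanted.semantic);
    }
    return problems;
}
```

A non-empty result is the point where you would log a warning or pick (or generate) shader A' instead of shader A.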

FWIW I believe this is the method Nitrous Engine uses.

Yes, Mantle had no vertex description because GCN didn't need one at all; so it just relied on StructuredBuffers. Though I always wondered if quirks like these were the reason D3D11 beats Mantle in GPU-bottlenecked benchmarks. After all, they were relying on the HLSL compiler to generate the shader, rather than using a shader language that better matches GCN.
D3D12 & Vulkan added them back because of the other vendors.

#5268618 VertexBuffers and InputAssmbler unnecessary?

Posted by Matias Goldberg on 31 December 2015 - 12:55 PM

These feel like legacy fixed-function constructs.
If you're building something new is there a good reason to use VertexBuffers and the InputAssembler at all?
Why not just use a StructuredBuffer<Vertex> SRV or unbounded ConstantBuffer<Vertex> array CBV (T3 hardware) and index them with the SV_VertexID?

Because these fixed-function constructs aren't legacy. On some hardware they are very much current. On other GPUs there is practically no difference (aside from compiler optimizations that cannot be performed due to the guarantees StructuredBuffer/ConstantBuffer give regarding caching, alignment, ordering and aliasing).

You can ignore the IA and emulate it yourself with SV_VertexID, but doing so may result in sub-optimal performance on certain hardware.

Specifying D3D12_INPUT_ELEMENT_DESC requires detailed knowledge of both the vertex shader and the vertex data.

Yes. But emulating it with SV_VertexID requires detailed knowledge of both the vertex shader and the vertex data too; as you have to make sure the right vertex shader is used with the right vertex data. Perhaps what you mean is that by emulating it you can avoid caring about this and you just force it.
It works for small projects where you can mentally track what shader goes with which mesh, and it feels faster for developing (no time wasted specifying the vertex format). But it doesn't scale for bigger projects.

#5268536 G-Buffer and Render Target format for Normals

Posted by Matias Goldberg on 30 December 2015 - 06:51 PM

Thank you, Matias!
DXGI_FORMAT_R10G10B10A2_UNORM removed all the artifacts.

I'm glad it worked for you. Just remember that UNORM stores values in the [0; 1] range, so you need to convert by hand your [-1; 1] range to [0; 1] by doing rtt = normal * 0.5f + 0.5f (and then the opposite when reading)
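As a rough illustration of that conversion, here is a C++ sketch of the encode/decode pair, plus a simulated 10-bit channel to show how much error the round trip introduces. The function names are made up, and round-to-nearest in quantize10 is an assumption about how the hardware converts to UNORM.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps one normal component from [-1, 1] to the [0, 1] range a UNORM target stores.
float encodeToUnorm(float n) { return n * 0.5f + 0.5f; }

// Inverse mapping, applied when the G-Buffer is read back.
float decodeFromUnorm(float u) { return u * 2.0f - 1.0f; }

// Simulates a 10-bit UNORM channel: 1024 evenly spaced levels in [0, 1].
uint32_t quantize10(float u)
{
    assert(u >= 0.0f && u <= 1.0f);
    return static_cast<uint32_t>(std::lround(u * 1023.0f));
}

float dequantize10(uint32_t q) { return static_cast<float>(q) / 1023.0f; }
```

In this model the worst-case error after decoding is about 1/1023 per component.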

#5268468 GLSL iOS values beyond 2048 modules operation fails

Posted by Matias Goldberg on 30 December 2015 - 08:37 AM

First, you may want to start printing gl_FragCoord.x / 2732.0f to see if you actually get a black to white gradient; the resolution may be different from what you expect. And be sure you've declared gl_FragCoord as highp.


Second, when floating point precision begins to have problems, it will start by eliminating odd numbers and preserving even numbers. This is extremely suspicious.

I wouldn't be surprised if by oversight gl_FragCoord.x doesn't have enough precision to represent the entire iPad Pro's resolution. See Precision limitations on integer values.

#5268410 [D3D12] Driver level check to avoid duplicate function call?

Posted by Matias Goldberg on 29 December 2015 - 06:09 PM

Does the driver do some basic check to prevent duplicate commands?

No. That was D3D11's motto. D3D12 is exactly the opposite. You get what you ask for.

However, a PSO is a huge block of state designed to fit all hardware efficiently, and not all hardware needs all of that data as one fixed block; so a particular driver may go through the PSO, check whether anything is different, and skip the work if nothing changed.
But this isn't guaranteed and you shouldn't rely on it. It's vendor, model and driver specific.

Or we as developer have to do this kind of check ourself?(like using hash value to identify duplicated func call with same params and avoid it?) or the perf delta is negligible?


// Look up (or create) the PSO for this draw; bind it only if it changed.
pso = getPsoFromCache( draw_parameters );
if( pso != lastPso )
{
    SetPipelineState( pso );
    lastPso = pso;
}

See Valve's slides on fast multithreaded PSO caching (slides 13-23 PPT version may be animated).
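For a self-contained picture of the snippet above, here is a C++ sketch with stand-in types so it runs without D3D12. DrawParameters, PsoHandle, CommandListStub and PsoCache are all hypothetical; a real renderer would store ID3D12PipelineState pointers and call ID3D12GraphicsCommandList::SetPipelineState. This is not the scheme from Valve's slides, just a single-threaded illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical subset of the state that selects a PSO.
struct DrawParameters
{
    uint32_t shaderId;
    uint32_t vertexFormatId;
    uint32_t blendMode;
};

using PsoHandle = uint32_t; // 0 means "nothing bound yet"

// Stand-in for a command list; it only counts the binds.
struct CommandListStub
{
    int       setPipelineStateCalls = 0;
    PsoHandle bound = 0;
    void SetPipelineState(PsoHandle pso) { bound = pso; ++setPipelineStateCalls; }
};

class PsoCache
{
public:
    // Returns the cached PSO for these parameters, "creating" it on first use.
    PsoHandle get(const DrawParameters &p)
    {
        // The key packs three 20-bit fields, so the ids must stay below 2^20.
        assert(p.shaderId < (1u << 20) && p.vertexFormatId < (1u << 20) &&
               p.blendMode < (1u << 20));
        const uint64_t key = (uint64_t(p.shaderId) << 40) |
                             (uint64_t(p.vertexFormatId) << 20) | p.blendMode;
        auto it = m_cache.find(key);
        if (it == m_cache.end())
            it = m_cache.emplace(key, m_nextHandle++).first;
        return it->second;
    }

private:
    std::unordered_map<uint64_t, PsoHandle> m_cache;
    PsoHandle m_nextHandle = 1;
};

// Binds the PSO only when it differs from the one used by the previous draw.
void bindForDraw(CommandListStub &cmd, PsoCache &cache, PsoHandle &lastPso,
                 const DrawParameters &p)
{
    const PsoHandle pso = cache.get(p);
    if (pso != lastPso)
    {
        cmd.SetPipelineState(pso);
        lastPso = pso;
    }
}
```

lastPso should be tracked per command list, since bound state does not carry over from one command list to another.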

#5268370 G-Buffer and Render Target format for Normals

Posted by Matias Goldberg on 29 December 2015 - 01:40 PM


Three partial-precision floating-point numbers encoded into a single 32-bit value (a variant of s10e5, which is sign bit, 10-bit mantissa, and 5-bit biased (15) exponent). There are no sign bits, and there is a 5-bit biased (15) exponent for each channel, 6-bit mantissa for R and G, and a 5-bit mantissa for B, as shown in the following illustration.

First, there is no sign bit. So I suppose negative values become positive or get clamped to 0. You definitely don't want that.
Second, normals are in the [-1; 1] range. You will get much better precision by using DXGI_FORMAT_R10G10B10A2_UNORM, which effectively gets you 9 bits for the value and 1 bit for the sign; vs this float format, which uses 5 or 6 bits for the mantissa and 5 for the exponent.

Looks like you made a poor choice of format.

3) Use some math to calculate Z-value on first X and Y. But I want to avoid this approach.

Why? GPUs have plenty of ALU to spare but bandwidth is precious.
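The math in question is small. A minimal C++ sketch of rebuilding Z from X and Y of a unit-length normal follows; the function name is made up, and the sketch assumes z >= 0 because the sign of Z is lost when only X and Y are stored (whether that holds depends on the space and encoding you pick).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Rebuilds z from x and y of a unit-length normal, assuming z >= 0.
float reconstructZ(float x, float y)
{
    assert(std::fabs(x) <= 1.0f && std::fabs(y) <= 1.0f);
    // max() guards against slightly negative values caused by quantization.
    return std::sqrt(std::max(0.0f, 1.0f - x * x - y * y));
}
```

That is one multiply-add chain and a square root per pixel, in exchange for one less channel to write and read.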

Btw there's Crytek's best fit normals that get impressive quality results on just RGB888 RTs

#5268245 IMGUI

Posted by Matias Goldberg on 28 December 2015 - 07:25 PM

I've been looking into Dear ImGui for a pet project; since it's quite popular, in active development, very stable, fast, lightweight and easy to use. I didn't like its default look at first but cmftStudio is proof that it can look good. I couldn't have asked for anything better.
But there were two issues that were blocking for me:


1. Mouse/cursor centric. I'm trying to make a UI that can also be traversed with a gamepad (or keyboard) and no mouse. Like console games.

Dear ImGui doesn't seem to offer this, and it suggests using Synergy instead. I don't know how hard it would be to add support for it, but ImGui's codebase is large (and a single file!), which makes it hard to evaluate how much work adding this kind of support by hand would be.


2. "Threading". Not in the usual way. TBH this is more of a problem with IMGUI in general. Due to how my system works, the user interface is set up from scripts which run in the logic thread; while the actual UI runs in the Graphics thread and then passes messages to the logic thread for the script to process important events (like "user clicked button X") but not the trivial ones unless specifically requested during setup (like "user pressed down arrow to go to the next widget").

Obviously this is an RMGUI way of doing things and doesn't translate well to IMGUIs. I could try to refactor my engine to allow the Graphics thread to run simple scripts and work around the issue. That is a big time investment; on its own it isn't a deal breaker, but add the previous point and I get grumpy.


So in the end I'll probably end up writing my own RMGUI system to suit my needs. It's not that hard for me anyway (for games). I may or may not reuse code from Dear ImGui (after all... it's damn good) or borrow ideas from it.

#5267952 Where is the cosine factor in extended LTE?

Posted by Matias Goldberg on 25 December 2015 - 08:54 PM

As JoeJ said, a lot of context is missing.

If you post an equation without mentioning what phi, theta, We, Li, dA and Pfilm stand for, or what happens inside those functions, we're pretty much clueless.

I might remember something from my PBR readings; but that requires a lot of effort, which we're expecting from your side, not ours. We need some refreshers.

What I'm wondering is that where is the cosine factor at the camera side of this equation? I never recall any renderer taking it into consideration, at least for real time rendering engine.

Honestly, I do not understand whether you're asking why the LTE is considering the camera, or why the LTE is not considering the camera.

Anyway, rendering does take into account the camera (the "eye") simply because of the specular component of the light and the fresnel term.
We basically need to check whether the eye is being directly hit by polarized light.

Since a movie projection is basically mostly diffuse light though, the eye position is pretty much irrelevant in your example.

#5267815 Screen-Space Reflection, enough or mix needed ?

Posted by Matias Goldberg on 24 December 2015 - 01:08 PM

Imagine your main character in front of a mirror and the camera is behind the character. That's it. There's a massive amount of information unavailable to SS reflections.


They need to be complemented by a more powerful technique unless you're planning on using reflections just to spice up background stuff.

#5267100 What will change with HDR monitor ?

Posted by Matias Goldberg on 19 December 2015 - 09:49 PM

Since the original question was "What will change with HDR monitor?"; if the monitors truly are HDR (like Hodgman said, HDR is quite overused by marketing... what "HDR monitor" really means "depends") then what will change for sure is your electricity bill. A monitor that can show a picture with a dynamic range around 4-10 times higher is bound to consume much more electricity when the "shiny stuff" lights up the pixels.


Funny how "staying green" slogan is also important.

#5266913 Why don't modern GPUs support palettized textures

Posted by Matias Goldberg on 18 December 2015 - 11:08 AM

I suppose texture size could be a factor too in the death of texture palettes. As texture sizes get larger and larger, the quality of block compression remains unchanged, but the ability of a 256 colour palette to do a reasonable job diminishes. Especially in the face of texture atlases where unconnected textures with very different colours are munged together.


A 64x64 texture would probably gain more by using palettes over BC1 compression. But as the resolution goes higher, the compressed version will almost always win. There's no way a 2048x2048 paletted texture would be better than a compressed version.
Let's remember that paletted textures were popular for their small size. A full 256x256x32bpp texture is 0.25MB. Considering today's GPUs with 1GB+ of VRAM and >100GB/s of bandwidth, the extra transistor space dedicated to decoding paletted textures is totally not worth it.

Not to mention mipmapping is an issue (the only downfilter to produce good results is a point filter; as any other filter will generate new colours)
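To put numbers on the sizes discussed above, here is a small C++ sketch comparing the top-mip footprint of the three layouts. The function names are made up, and the paletted case assumes 8-bit indices plus a 256-entry palette of 32-bit colours. It compares size only; quality is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for an uncompressed 32bpp texture (top mip only).
uint64_t rgba8Bytes(uint32_t w, uint32_t h) { return uint64_t(w) * h * 4u; }

// Bytes for an 8-bit paletted texture: one index per texel plus the palette.
uint64_t palette8Bytes(uint32_t w, uint32_t h) { return uint64_t(w) * h + 256u * 4u; }

// Bytes for BC1: every 4x4 texel block takes 8 bytes.
uint64_t bc1Bytes(uint32_t w, uint32_t h)
{
    assert(w > 0u && h > 0u);
    const uint64_t blocksX = (w + 3u) / 4u;
    const uint64_t blocksY = (h + 3u) / 4u;
    return blocksX * blocksY * 8u;
}
```

For example, rgba8Bytes(256, 256) gives 262144 bytes (the 0.25MB figure), while at 2048x2048 the paletted layout needs roughly 4MB against 2MB for BC1.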

#5266581 _aligned_malloc with big alignment waste of space is madness

Posted by Matias Goldberg on 15 December 2015 - 07:39 PM

Why do you need 32MB alignment?


Aligned alloc works by allocating alignment + requested size; then offsetting the actual pointer. So a 32MB alignment requires a minimum size of 32MB per allocation.

For example, for 16-byte alignment, if the pointer returned by malloc is 4-byte aligned, you need to offset by 12 bytes. If the memory was 8-byte aligned, you need to offset by 8 bytes. If the memory was 1-byte aligned, you need to offset by 15 bytes. Hence the minimum allocation must be 16 bytes + requested size.


In the 32MB alignment case, if malloc returns the pointer 0x04000004, you need to offset it by 33554428 bytes to make it a multiple of 33.554.432 (32MB).

Now you know where the wasted space comes from. It's unavoidable.
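The offset arithmetic above can be sketched in C++ as follows. paddingFor, AlignedBlock and alignedAlloc are made-up names; this is a simplified illustration of the over-allocate-and-offset idea, not how any particular _aligned_malloc implementation is written (real ones also stash the original pointer next to the aligned one).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Bytes to skip so that (address + padding) is a multiple of alignment.
// alignment must be a power of two.
uint64_t paddingFor(uint64_t address, uint64_t alignment)
{
    assert(alignment != 0u && (alignment & (alignment - 1u)) == 0u);
    return (alignment - (address & (alignment - 1u))) & (alignment - 1u);
}

struct AlignedBlock
{
    void *raw;     // pointer to hand back to std::free
    void *aligned; // pointer the caller actually uses
};

// Over-allocates by `alignment` bytes, then offsets the returned pointer.
AlignedBlock alignedAlloc(std::size_t size, std::size_t alignment)
{
    void *raw = std::malloc(size + alignment);
    if (!raw)
        return { nullptr, nullptr };
    const uint64_t address = reinterpret_cast<std::uintptr_t>(raw);
    void *aligned = static_cast<unsigned char *>(raw) + paddingFor(address, alignment);
    return { raw, aligned };
}
```

paddingFor(0x04000004, 32 * 1024 * 1024) returns 33554428, matching the example above; the bytes between raw and aligned are the waste.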


For stuff like meeting D3D12 requirements (like 4MB alignment) you seriously should use VirtualAlloc. But for meeting the alignment, you should do the alignment yourself. If you think the wasted storage is too big, manage the memory manually so that you can reuse the memory space that comes after the allocation start and before the aligned offset.

#5266166 Question about creating a replay system

Posted by Matias Goldberg on 13 December 2015 - 04:51 PM


First, the size. A single screen capture image may be 1920x1080x24 = 58 MB uncompressed.

Wait, is my math off with this?
1920*1080 = 2.073.600 pixels
2.073.600 * 3 (per channel, one channel is one byte right?) = 6.220.800 byte
6.220.800 / 1024 = 6075 KB
6075 / 1024 = 5.93 MB
Thats still something but 58MB seemed a little much for a single screen picture, even uncompressed.


You're correct. The numbers were wrong. However, the actual problem is that at 60 fps that amounts to 60 x 5.93MB = about 356MB/s.
You would need an SSD to record it, since no current consumer-grade HDD can keep up with that write bandwidth.
You would have to compress it. H264 live encoding is possible, but you would have to use the zerolatency quality profile for fastest speed, and deal with H264 licensing and patents (and also high CPU usage).
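The bandwidth arithmetic above is easy to check with a small C++ sketch. The function names are made up, and "MB" is taken to mean 1024 x 1024 bytes, as in the quoted calculation.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second needed to store uncompressed frames.
uint64_t rawCaptureBytesPerSecond(uint32_t width, uint32_t height,
                                  uint32_t bytesPerPixel, uint32_t framesPerSecond)
{
    assert(bytesPerPixel > 0u);
    return uint64_t(width) * height * bytesPerPixel * framesPerSecond;
}

// Converts bytes to megabytes of 1024 * 1024 bytes.
double toMiB(uint64_t bytes) { return double(bytes) / (1024.0 * 1024.0); }
```

For 1920x1080 at 3 bytes per pixel, one frame is 6220800 bytes (about 5.93MB) and 60 frames per second come to 373248000 bytes, which is just under 356MB per second.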

A more reasonable approach is to store YCoCg at 30fps and later re-encode with your favourite compression scheme; this lowers CPU usage and works around the H264 licensing/patent issues.
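For reference, here is a C++ sketch of one way to do the colour conversion. The post only says YCoCg; the lossless integer variant YCoCg-R used below is my choice for illustration, and the struct and function names are made up. The transform needs only additions and shifts, which is what keeps the CPU cost low.

```cpp
#include <cassert>

struct Rgb   { int r, g, b; };
struct YCoCg { int y, co, cg; };

// Forward YCoCg-R transform (integer only, exactly reversible).
// Relies on >> being an arithmetic shift for negative values, which is
// guaranteed since C++20 and is what mainstream compilers do before that.
YCoCg rgbToYCoCgR(Rgb c)
{
    assert(c.r >= 0 && c.r <= 255 && c.g >= 0 && c.g <= 255 &&
           c.b >= 0 && c.b <= 255);
    const int co = c.r - c.b;
    const int t  = c.b + (co >> 1);
    const int cg = c.g - t;
    const int y  = t + (cg >> 1);
    return { y, co, cg };
}

// Inverse transform; undoes the steps above in reverse order.
Rgb yCoCgRToRgb(YCoCg c)
{
    const int t = c.y - (c.cg >> 1);
    const int g = c.cg + t;
    const int b = t - (c.co >> 1);
    const int r = b + c.co;
    return { r, g, b };
}
```

Note that Co and Cg need one more bit than the 8-bit inputs, so the stored format has to account for that.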