Matias Goldberg

Member Since 02 Jul 2006

#5279772 [D3D12] What is the benefit of descriptor table ranges?

Posted by Matias Goldberg on 05 March 2016 - 10:04 PM

In addition to everything Hodgman said, remember that DX12 had to support three architectures from NVIDIA, one architecture from AMD, and something like three from Intel (and they may have considered some mobile GPUs in their Nokia phones as well).

These cards differed vastly in their texture support. What doesn't make sense in one hardware architecture makes sense in another (Intel & Fermi, I'm looking at youuuuu)

#5279547 Design question: Anti-aliasing and deferred rendering.

Posted by Matias Goldberg on 04 March 2016 - 03:47 PM

You can't go from non-MSAA GBuffer & lighting passes to MSAA tonemapping. It just doesn't work like that and makes no sense.


Overview of MSAA is a good introduction to how MSAA works. MSAA resolve filters is also a good read.


You have to start with MSAA GBuffers, avoid resolving them, and resolve only the tonemapped result. This is what makes Deferred Renderers + MSAA so damn difficult. SV_Coverage + stencil tricks can help save bandwidth.

#5279366 Fastest way to draw Quads?

Posted by Matias Goldberg on 03 March 2016 - 05:53 PM


A wavefront can't work on several instances? InstanceID is stored in a scalar register?

The instanceID is in a vector register on AMD GPUs, so you can have multiple instances within the same wavefront.


InstanceID is in a VGPR, yes. Multiple instances could be in the same wavefront? Probably. Multiple instances are in the same wavefront? Not in practice.

I don't know if there are limitations within the rasterizer (e.g. vertices pulled from a wavefront are assumed to be from the same instance), but simple things like:

Texture2D tex[8];
tex[instanceId].Load( ... );

would cause colossal divergence issues. Analyzing a shader for these hazardous details, to decide whether instances should share wavefronts or not, is more expensive than just sending each instance to its own group of wavefronts. Being vertex bound is pretty rare nowadays unless you hit a pathological case (such as having 4 vertices per draw)

#5279127 Fastest way to draw Quads?

Posted by Matias Goldberg on 02 March 2016 - 07:36 PM

Is there a rationale for the performance penalty encountered for small meshes vs the drawIndexed method, by the way?

GCN's wavefront size is 64.
That means GCN works on 64 vertices at a time.
If you make two DrawPrimitive calls of 32 vertices each, GCN will process them using 2 wavefronts, wasting half of its processing power.

It's actually a bit more complex: GCN has compute units, and each CU has 4 SIMD units. Each SIMD unit can have between 1 and 10 wavefronts in flight. There are also fixed-function parts, like the rasterizer, which may add overhead when involving small meshes.

Long story short, it's all about load balancing, and small meshes leave a lot of idle space; hence the sweet spot is around 128-256 vertices for NVIDIA, and around 500-600 vertices for AMD (based on benchmarks).

#5278753 Fastest way to draw Quads?

Posted by Matias Goldberg on 29 February 2016 - 03:38 PM

See Vertex Shader Tricks by Bill Bilodeau regarding point sprites. Hint: it's none of the ways you mentioned.

#5278460 How to find the cause of framedrops

Posted by Matias Goldberg on 27 February 2016 - 12:50 PM

Hi Oogst!


Learn to use GPUView (also check this out). It's scary at first, but it's an invaluable tool for discovering stutters like the one you're facing.

#5278336 "Modern C++" auto and lambda

Posted by Matias Goldberg on 26 February 2016 - 12:06 PM

auto x = 7.0; //This is an insane strawman example
auto x = 7.0f; //This is an insane strawman example
auto x = 7.; //This is an insane strawman example
auto x = 7; //This is an insane strawman example
There's absolutely no reason to use auto where


Just to be clear, I had to pull this example because Scott Meyers is literally recommending auto everywhere, including literals. And his books are widely read by freshmen trying to learn C++.


Besides, you misinterpreted the example. It doesn't have to be literals. I could do the same with:

auto x = time; //This is double
auto x = acceleration; //This is a float
auto x = time * acceleration; //This is...? (it will be a double; a float is probably a much better fit)
auto x = sizeBytes; //This is an unsigned integer. 64-bits
auto x = lengthBytes; //This is a signed integer

Except now it's not obvious at all. The last one (lengthBytes) can end up inducing undefined behavior, while the first three could cause precision or performance issues, because I have no idea whether I'm working with doubles or floats unless I spend effort checking each variable's type; which obviously the one who wrote it didn't care about, since they decided to use auto.



for( Foo x : m_foos )

...the exact same flaw would occur.
Further, I would suggest that by default one should use const reference, and only go to non-const reference or value (or r-value) if the code needs it (for example, if it's just integers).
If you use good coding practices, then "auto x : m_foos" stands out just as badly as "Foo x : m_foos".

Yes, but it is far more obvious that it is a hard copy. It's self-evident. People who write "for( auto x : m_foos )" most of the time actually expect "auto &x", or didn't care. I mean, isn't the compiler smart enough to understand I meant auto &x and not auto x? Isn't it supposed to automatically deduce that? This is the kind of BS I had to fix in an actual, real C++ project for a client who was having performance problems. Their programmers didn't realize auto x was making deep copies.


Like frob said, that's the fault of people who try to pretend C++ is not a strongly typed language; since in (almost?) every weakly typed language out there, "auto x : m_foos" means a reference, not a hard copy.

Obviously, making auto a reference by default instead of a hard copy is not the solution. That would create another storm of problems. But perhaps people should stop recommending auto everywhere, or stop pretending C++ is not a strongly typed language.

#5278202 AMD - Anisotropic filtering with nearest point filtering not working properly?

Posted by Matias Goldberg on 25 February 2016 - 05:45 PM

As far as I know there is no such thing as "point anisotropic" filtering. Anisotropic filtering, by definition, requires performing many taps, which is at odds with the notion of point filtering. Have you compared regular point filtering with what you see as "point anisotropic"? As Hodgman said, it is an error not to set both mag and min to anisotropic.


Edit: I misread your post. You set Min to Aniso and Mag to Point filtering. I thought you had set Mip to aniso (which is undefined behavior). I don't know whether the spec allows what you're doing or not.

Edit 2: Check that the driver isn't configured to override filtering, which is a common option (i.e. high quality vs high performance vs balanced, etc.).

#5278175 "Modern C++" auto and lambda

Posted by Matias Goldberg on 25 February 2016 - 03:47 PM



I totally agree with you.

The auto keyword introduced by C++11 is one of the most overused new features.

Auto looks awesome for saving eight keystrokes, but it's horrible when you have to analyze code written by someone else and can't quickly see what a variable's type is, or when you don't have an IDE to help you.

Being easy to write != being easy to read & debug 6 months later.

Auto is useful inside templates, since the variable type may not be known until instantiation or specialization, or when the syntax inside the template gets too complex. However, using auto outside templates (let alone everywhere!) is an abuse. It can even introduce subtle bugs, e.g.:

auto x = 7.0; //This is a double
auto x = 7.0f; //This is a float
auto x = 7.; //This is a double!!!
auto x = 7; //Is this an int or a uint?

The first one may have been intended to be a float, but the 'f' was missing.

The third one may have been meant to be an integer, but the extra '.' made it a double. Easy to miss.

The fourth one is the worst kind. Signed integer overflow is undefined behavior. Unsigned integer overflow is well defined (wraps around).

This can have unforeseen consequences:

int myInteger = 262144;
auto x = myInteger; //Deduced as int
//x * x should be 68719476736, which doesn't fit in 32 bits: signed overflow, undefined behavior
if( x * x < x )
    ; //The compiler may optimize this check away. That wouldn't happen if 'x' were unsigned.

Of course you wouldn't use auto on a literal (note: Scott Meyers actually recommends using auto on literals!!!), but even then, using it on more complex code is asking for subtle trouble.

If the code is too complex, then it is even harder to read what's going on.

Another subtle problem is one like the following:

for( auto x : m_foos )

Looks OK, right? Well, it turns out m_foos is declared as std::vector<Foo> m_foos; so the loop will perform a hard copy of Foo on every iteration. If the declaration of m_foos is std::vector<std::shared_ptr<Foo>> m_foos; then every iteration will increment and almost immediately afterwards decrement the internal reference count (an atomic operation). The proper fix would be:

for( auto &x : m_foos )

And now x is a reference, rather than a hard copy.

Auto has far too many issues, even if you're aware of all of them. It's best to just forbid it.

On things like templates it makes sense, because auto myVariable is far better to work with than typename<T>::iterator::something<B, C>::type_value myVariable, where the type is not actually explicit and is an incomprehensible string. In those cases auto is justified, as the alternative is worse in every single aspect.

So, in my opinion, Microsoft recommending to use auto everywhere like a mantra only perpetuates bad practices.

#5277990 How does everyone test their hlsl code?

Posted by Matias Goldberg on 24 February 2016 - 08:36 PM

Depends on what you mean by "testing hlsl":

If by testing you mean prototyping an idea or some math: RenderMonkey works, since most of the relevant syntax hasn't changed. Unity can also work. Even ShaderToy can work; although it uses GLSL, math is still math, and porting to HLSL is mostly a bit of find & replace (e.g. vec4 with float4).


If by testing you mean that the shader works well in the actual environment it's going to run in (i.e. the game) and has proper SM 4.0/5.0 syntax, then nothing beats your own program as the testing environment, provided you've implemented a hot-reloading feature that detects changes on disk and reloads shaders automatically. And RenderDoc, Visual Studio Graphics Debugger and GPUPerfStudio for debugging. In that order.

#5277824 Black screen when trying to implement ibo

Posted by Matias Goldberg on 23 February 2016 - 11:30 PM

Your index buffer is declared as GLfloat IndexBufferData when it should be uint32_t IndexBufferData. For best efficiency it should be uint16_t IndexBufferData and calling glDrawElements with GL_UNSIGNED_SHORT instead of GL_UNSIGNED_INT.
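A sketch of the corrected declaration (the quad indices and buffer contents are hypothetical examples; the GL calls are shown as comments since they need a live context):

```cpp
#include <cstdint>

// 16-bit indices: half the bandwidth of 32-bit ones, and enough to address
// up to 65535 vertices, which covers the vast majority of meshes.
const uint16_t IndexBufferData[] = { 0, 1, 2,   2, 1, 3 }; // two triangles = one quad

// Upload and draw (requires a bound GL_ELEMENT_ARRAY_BUFFER and a live context):
// glBufferData( GL_ELEMENT_ARRAY_BUFFER, sizeof(IndexBufferData),
//               IndexBufferData, GL_STATIC_DRAW );
// glDrawElements( GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0 ); // not GL_UNSIGNED_INT
```

The key point is that the element type of the array, the byte size passed to glBufferData, and the type enum passed to glDrawElements must all agree.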

#5277306 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 21 February 2016 - 11:55 AM

How can I check for stalling when writing to the descriptor heap?

Well, if your code contains an explicit wait then I would first check there.
Otherwise, use a CPU profiler to see significant portions spent inside the driver with the callstack leading to your code. Or learn to use GPUView (also check this out).

Is there a way to map and unmap the descriptor heap? I always use ID3D12Device::CreateShaderResourceView

Could you provide some code? We're shooting in the dark.
Also name your GPU just in case.

#5277219 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 20 February 2016 - 10:06 PM

After some measurement it looks like writing to the descriptor heap is not cheap (it takes up to 2ms here), so I'm trying to modify my app to only write to the descriptor heap when necessary and let the shader access the descriptor heap using ids provided via root constants.

  1. Make sure you're not stalling before writing to the desc. heap.
  2. Make sure the assembly does not read from the desc. heap memory by any chance (write-combined memory will hit you hard!).
  3. I prefer baking packs of descriptors (i.e. 256 is the limit), and then swapping them out depending on which pack I need.
  4. I assume you're not mapping and unmapping memory to write to the heap?

#5277217 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 20 February 2016 - 09:56 PM

I kept thinking about this, and there are probably hardware limitations.
There are several types of divergence:

  • All pixels in wavefront A access texture[0]; all pixels in wavefront B access texture[1]. This is very likely, because the wavefronts belong to different draws.
  • Pixel A accesses texture[0]; Pixel B accesses texture[1]. Both are Texture2D.

The first case is fine.
But the second one is not: you must specify NonUniformResourceIndex if you expect to be in scenario 2.
On top of that, to cover what you want, we would have to add several more scenarios, where the type of the texture could be divergent, not just the texture index.
Some hardware out there definitely cannot do that.
Note however, nothing's preventing you from doing this:

Texture2D texture_array2d[5000] : register(t0); //We assume the first 5000 textures are 2D
Texture3D texture_array3d[5000] : register(t5000); //We assume textures [5000; 10000) are 3D

Also note that Tier 1 supports up to 256 textures, so arrays totalling 10,000 will lock you out of a lot of cards (Haswell & Broadwell Intel GPUs, NVIDIA Fermi)


Edit: Just to clarify why it's not possible (or very difficult, or would add a lot of pointless overhead). There are three parts:

  1. Information about textures like format (i.e. RGBX8888 vs Float_R16, etc) and resolution. In some hardware it lives in a structure in GPU memory (GCN), in other hardware it lives in a physical register (Intel).
  2. Information about how to sample the texture (bilinear vs point vs trilinear, mip LOD bias, anisotropy, border/clamp/wrap, etc.). In GCN most of this information lives in SGPRs that point to a cached region of memory. The border colour (for the border colour mode) lives in a register table. In Haswell this information lives in a physical register, IIRC.
  3. Information about the type of the texture, which affects how it is sampled (1D vs 2D vs 2D Array vs 3D vs Cube vs Cube Array). In GCN, sampling a cubemap requires issuing more instructions (V_CUBE*_F32 family if I recall); sampling 3D textures requires providing more VGPRs (since more data is needed) than for sampling 2D textures.

Your assumption is that the type of texture lives in GPU memory alongside the format and resolution (point 1). But this is not the case. It lives on the ISA instructions (point 3).

In fact D3D12 provides some level of abstraction: you think the format and resolution live in GPU memory, when in fact on Intel GPUs they live in physical registers (that's where Tier 1's limit of 256 comes from, btw; the D3D11 spec allowed up to 128 textures, and it happens that both Fermi and Intel supported up to 256)

Therefore, it becomes too cumbersome to support this sort of generic-type texture you want.

#5276213 Vulkan is Next-Gen OpenGL

Posted by Matias Goldberg on 17 February 2016 - 04:30 PM

Anybody know of any open source graphics engines going to be built around these new APIs? The most popular open source engine I know is Ogre, and that is built around DX9. I would love to see a graphics engine built specifically around these new APIs.

Ogre 2.1 is built around AZDO OpenGL, D3D11, and mostly prepared for D3D12 & Vulkan.

Our biggest obstacle is some old Texture code that needs a big refactor; it's currently the main blocker for implementing the D3D12 & Vulkan RenderSystems.