
Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Yesterday, 10:51 PM

#5280171 The GPU comes too late for the party...

Posted by Matias Goldberg on 08 March 2016 - 08:54 AM

> On Optimus systems, the Intel card is always the one hooked to the monitor.
> If you call the glGet* query family for getting timestamp information, you're forcing the NV driver to ask the Intel driver for information (i.e. have we presented yet?).

Even if it asks the Intel driver, it doesn't seem to slow it down, considering that I submit hundreds of queries per frame. And why should it delay the rendering?

Because the NV driver needs to wait on the Intel driver. In a perfect world the NV driver would be asynchronous if you're asynchronous as well. But in the real world, where Optimus devices can't get VSync straight, I wouldn't count on that. At all. There are several layers of interface communication (NV stack, DirectX, WDDM, and Intel stack), and a roundtrip is going to be a deep hell where a synchronous path is likely the only one implemented.


> Edit: Also if it's Windows 7, try disabling Aero. Also check with GPUView (also check this out).

It is not a matter of performance, but a matter of delay and synchronisation. All rendering on the GPU falls more or less exactly within the SwapBuffers call, as if the driver waits too long before submitting the command queue and eventually gets forced to by the SwapBuffers call.

Aero adds an extra layer of latency. Anyway, not the problem, since you're on Windows 10.

I have Windows 10 btw. I looked into GPUView, but it seems to track only DirectX events.

GPUView tracks GPU events such as DMA transfers, page commits, memory eviction and screen presentation. All of these are API agnostic and thus work with both DirectX and OpenGL (I've successfully used GPUView with OpenGL apps). IIRC it also supports some DX-only events, but that's not really relevant for an OGL app.

#5280090 The GPU comes too late for the party...

Posted by Matias Goldberg on 07 March 2016 - 07:33 PM

On Optimus systems, the Intel card is always the one hooked to the monitor.

If you call the glGet* query family for getting timestamp information, you're forcing the NV driver to ask the Intel driver for information (i.e. have we presented yet?).


Does this problem go away if you never call glQueryCounter, glGetQueryObjectiv & glGetInteger64v(GL_TIMESTAMP)?
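For reference, the kind of timestamp-query pattern in question looks roughly like this (a minimal sketch; the query object and variable names are placeholders, not your code):

GLuint timestampQuery = 0;
glGenQueries( 1, &timestampQuery );

// Record a GPU timestamp at this point in the command stream.
glQueryCounter( timestampQuery, GL_TIMESTAMP );

// ...later, poll for the result instead of blocking:
GLint available = 0;
glGetQueryObjectiv( timestampQuery, GL_QUERY_RESULT_AVAILABLE, &available );
if( available )
{
    GLuint64 gpuTime = 0;
    glGetQueryObjectui64v( timestampQuery, GL_QUERY_RESULT, &gpuTime );
}

// Current GL timestamp as seen right now:
GLint64 now = 0;
glGetInteger64v( GL_TIMESTAMP, &now );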


Calling glFlush could help you force the NV drivers to start processing sooner, but it's likely the driver will just ignore that call (since it's often abused by dumb programmers).


I also recall the NV control panel having a Triple Buffering option that only affects OpenGL. Try disabling it.


Edit: Also if it's Windows 7, try disabling Aero. Also check with GPUView (also check this out).

#5279772 [D3D12] What is the benefit of descriptor table ranges?

Posted by Matias Goldberg on 05 March 2016 - 10:04 PM

In addition to everything Hodgman said, remember DX12 had to support three architectures from NVIDIA, one architecture from AMD, and like three from Intel (and they might have considered some mobile GPUs in their Nokia phones as well).

These cards differed vastly in their texture support. What doesn't make sense on one hw architecture makes sense on another (Intel & Fermi, I'm looking at youuuuu).

#5279547 Design question: Anti-aliasing and deferred rendering.

Posted by Matias Goldberg on 04 March 2016 - 03:47 PM

You can't go from non-MSAA GBuffer & lighting passes to MSAA tonemapping. It just doesn't work like that and makes no sense.


Overview of MSAA is a good introduction to how MSAA works. MSAA resolve filters is also a good read.


You have to start with MSAA GBuffers, avoid resolving them, and resolve the tonemapped result instead. This is what makes deferred renderers + MSAA so damn difficult. SV_Coverage + stencil tricks can help save bandwidth.
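To see why the order matters, here's a tiny standalone sketch (plain C++ rather than shader code, with a made-up Reinhard-style operator purely for illustration) showing that averaging MSAA samples before tonemapping does not give the same result as tonemapping each sample and then averaging:

#include <cstdio>

// Hypothetical tonemap operator (Reinhard-style), just to illustrate the math.
static float Tonemap( float hdr ) { return hdr / (1.0f + hdr); }

int main()
{
    // Two MSAA samples on a geometric edge: one very bright, one dark.
    const float sample0 = 16.0f, sample1 = 0.1f;

    // Resolve (average) first, then tonemap: the bright sample dominates the edge.
    const float resolveThenTonemap = Tonemap( (sample0 + sample1) * 0.5f );

    // Tonemap each sample, then resolve: this is what the edge should look like.
    const float tonemapThenResolve = (Tonemap( sample0 ) + Tonemap( sample1 )) * 0.5f;

    printf( "resolve-then-tonemap: %f\n", resolveThenTonemap );  // ~0.89, edge nearly white
    printf( "tonemap-then-resolve: %f\n", tonemapThenResolve );  // ~0.52, properly blended
    return 0;
}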

#5279366 Fastest way to draw Quads?

Posted by Matias Goldberg on 03 March 2016 - 05:53 PM


A wavefront can't work on several instances? Is the InstanceID stored in a scalar register?

The instanceID is in a vector register on AMD GPUs, so you can have multiple instances within the same wavefront.


InstanceID is in a VGPR, yes. Multiple instances could be in the same wavefront? Probably. Multiple instances are in the same wavefront? Not in practice.

I don't know if there are limitations within the rasterizer (e.g. vertices pulled from a wavefront are assumed to be from the same instance), but simple things like:

Texture2D tex[8];
// Each instance indexes a different texture: threads from different instances in the same wavefront would diverge here.
tex[instanceId].Load( ... );

would cause colossal divergence issues. Analyzing a shader for these hazardous details to decide whether instances should share wavefronts or not is more expensive than just sending each instance to its own group of wavefronts. Being vertex bound is pretty rare nowadays unless you hit a pathological case (such as having 4 vertices per draw).

#5279127 Fastest way to draw Quads?

Posted by Matias Goldberg on 02 March 2016 - 07:36 PM

Is there a rationale for the performance penalty encountered for small meshes vs the drawIndexed method, by the way?

GCN's wavefront size is 64.
That means GCN works on 64 vertices at a time.
If you make two DrawPrimitive calls of 32 vertices each, GCN will process them using 2 wavefronts, wasting half of its processing power.

It's actually a bit more complex, as GCN has compute units, and each CU has 4 SIMD units. Each SIMD unit can execute between 1 and 10 wavefronts. There are also some fixed-function parts, like the rasterizer, which may have some overhead when dealing with small meshes.

Long story short, it's all about load balancing, and small meshes leave a lot of idle space; hence the sweet spot is around 128-256 vertices for NVIDIA, and around 500-600 vertices for AMD (based on benchmarks).

#5278753 Fastest way to draw Quads?

Posted by Matias Goldberg on 29 February 2016 - 03:38 PM

See Vertex Shader Tricks by Bill Bilodeau regarding point sprites. Hint: it's none of the ways you mentioned.

#5278460 How to find the cause of framedrops

Posted by Matias Goldberg on 27 February 2016 - 12:50 PM

Hi Oogst!


Learn to use GPUView (also check this out). It's scary at first, but it's an invaluable tool for discovering stutters like the one you're facing.

#5278336 "Modern C++" auto and lambda

Posted by Matias Goldberg on 26 February 2016 - 12:06 PM

auto x = 7.0; //This is an insane strawman example
auto x = 7.0f; //This is an insane strawman example
auto x = 7.; //This is an insane strawman example
auto x = 7; //This is an insane strawman example
There's absolutely no reason to use auto where


Just to be clear, I had to pull this example because Scott Meyers literally recommends using auto everywhere, including on literals. And his books are widely read by freshmen trying to learn C++.


Besides, you misinterpreted the example. It doesn't have to be literals. I could do the same with:

auto x = time; //This is double
auto x = acceleration; //This is a float
auto x = time * acceleration; //This is...? (it will be a double; a float is probably a much better fit)
auto x = sizeBytes; //This is an unsigned integer. 64-bits
auto x = lengthBytes; //This is a signed integer

Except now it's not obvious at all. The last one (lengthBytes) can end up inducing undefined behavior. The first three could cause precision or performance issues, because I have no idea whether I'm working with doubles or floats unless I spend effort checking each variable's type; which the person who wrote it obviously didn't care about, since they decided to use auto.
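To make that concrete, here's a minimal sketch (the variables are the hypothetical ones from the example above, with made-up values) of how the deduced type stays hidden unless you either spell it out or assert it:

#include <type_traits>

void Example()
{
    double time         = 1.0 / 60.0;  // hypothetical values
    float  acceleration = 9.8f;

    float x = static_cast<float>( time ) * acceleration; // explicit: the narrowing is visible right here
    auto  y = time * acceleration;                        // deduced: double, whether intended or not

    // If auto is used anyway, the expectation can at least be pinned down:
    static_assert( std::is_same<decltype( y ), double>::value,
                   "y deduced to something other than double" );

    (void)x; (void)y;
}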



for( Foo x : m_foos )

...the exact same flaw would occur.
Further, I would suggest by default one should use const-reference, and only go to non-const reference or value (or r-value) if the code needs it (for example, if it's just integers).
If you use good coding practices, then "auto x : m_foos" stands out just as badly as "Foo x : m_foos".

Yes, but it is far more obvious that it is a hard copy. It's self-evident. People who write "for( auto x : m_foos )" most of the time actually expect "auto &x" or didn't care. I mean, isn't the compiler smart enough to understand I meant auto &x and not auto x? Isn't it supposed to automatically deduce that? This is the kind of BS I had to fix in an actual, real C++ project for a client who was having performance problems. Their programmers didn't realize auto x was making deep copies.


Like frob said, that's the fault of people who try to pretend C++ is not a strongly typed language; since in (almost?) every weakly typed language out there, "auto x : m_foos" means a reference and not a hard copy.

Obviously making auto a reference by default instead of a hard copy is not the solution. That would create another storm of problems. But perhaps people should stop recommending auto everywhere, or stop pretending C++ is not a strongly typed language.

#5278202 AMD - Anisotropic filtering with nearest point filtering not working properly?

Posted by Matias Goldberg on 25 February 2016 - 05:45 PM

As far as I know there is no such thing as "point anisotropic" filtering. Anisotropic filtering, by definition, requires performing many taps; which is orthogonal to the notion of point filtering. Have you compared the difference between regular point filtering and what you see as "point anisotropic"? As Hodgman said, it is an error not to set both mag and min to anisotropic.


Edit: I misread your post. You set Min to aniso and Mag to point filtering. I thought you had set Mip to aniso (which is undefined behavior). I don't know whether the specs allow what you're doing or not.

Edit 2: Check that the driver isn't configured to override filtering, which is a common option (e.g. high quality vs high performance vs balanced, etc).

#5278175 "Modern C++" auto and lambda

Posted by Matias Goldberg on 25 February 2016 - 03:47 PM



I totally agree with you.

The auto keyword introduced by C++11 is one of the most overused new features.

Auto looks awesome for getting rid of typing 8 extra letters, but it's horrible when you have to analyze code written by someone else and can't quickly see what a variable's type is, or you don't have an IDE to help you.

Being easy to write != being easy to read & debug 6 months later.

Auto is useful inside templates, since the variable type may not be known until compilation or specialization, or when the syntax inside the template gets too complex. However, using auto outside templates (or everywhere!) is an abuse. It can even introduce subtle bugs, e.g.:

auto x = 7.0; //This is a double
auto x = 7.0f; //This is a float
auto x = 7.; //This is a double!!!
auto x = 7; //Is this an int or a uint?

The first one may be intended to be a float, but the 'f' was missing.

The third one may be supposed to be an integer, but the extra '.' made it a double. Easy to miss.

The fourth one is the worst kind. Signed integer overflow is undefined behavior. Unsigned integer overflow is well defined (wraps around).

This can have unforeseen consequences:

int myInteger = 262144;
auto x = myInteger; //Deduced as int
if( x * x < x )
//262144 * 262144 = 68719476736, which doesn't fit in 32 bits; signed overflow is undefined behavior,
//so the compiler may optimize this check away. Which wouldn't happen if 'x' were unsigned.

Of course you wouldn't use auto on a literal (note: Scott Meyers actually recommends using auto on literals!!!), but even then, using it on more complex code is asking for subtle trouble.

If the code is too complex, then it is even harder to read what's going on.

Another subtle problem is one like the following:

for( auto x : m_foos )

Looks ok, right? Well, it turns out m_foos is std::vector<Foo> m_foos; and the loop will perform a hard copy of Foo in every iteration. If the declaration of m_foos is std::vector<std::shared_ptr<Foo>> m_foos; then every loop iteration will bump the internal reference count, increasing it and almost immediately afterwards decreasing it. The proper fix would be:

for( auto &x : m_foos )

And now x is a reference, rather than a hard copy.

Auto has far too many issues and problems, even if you're aware of all of them. It's best to just forbid it.

On things like templates it makes sense, because auto myVariable is far better to work with than typename<T>::iterator::something<B, C>::type_value myVariable, where the type is not actually explicit and it's an incomprehensible string. In those cases, auto is justified, as the alternative is worse in every single aspect.

So, in my opinion, Microsoft recommending auto everywhere like a mantra only perpetuates bad practices.

#5277990 How does everyone test their hlsl code?

Posted by Matias Goldberg on 24 February 2016 - 08:36 PM

Depends on what you mean by "testing hlsl":

If by testing you mean prototyping an idea or some math: RenderMonkey works, since most of the relevant syntax hasn't changed. Unity can also work. Even ShaderToy can work: even though it uses GLSL, math is still math, and porting to HLSL is mostly a bit of find & replace (e.g. vec4 with float4).


If by testing you mean checking that the shader works well in the actual environment it's going to run in (i.e. the game) and has proper SM 4.0/5.0 syntax, then nothing beats your own program as the testbed, assuming you've implemented a hot-reloading feature that detects changes on disk and reloads shaders automatically. And RenderDoc, Visual Studio Graphics Debugger and GPUPerfStudio for debugging. In that order.
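A minimal sketch of that hot-reload idea, assuming D3D11 and D3DCompileFromFile (the shader path, entry point and target profile are all placeholders):

#include <windows.h>
#include <d3d11.h>
#include <d3dcompiler.h>
#include <filesystem>

// Sketch only: recompile the pixel shader whenever the file on disk changes.
// 'device', the path, "main" and "ps_5_0" are assumptions, not anyone's real project.
void ReloadPixelShaderIfChanged( ID3D11Device *device, ID3D11PixelShader **inOutPs,
                                 std::filesystem::file_time_type &lastWriteTime )
{
    const wchar_t *path = L"shaders/MyPixelShader.hlsl"; // hypothetical path

    const auto currentWrite = std::filesystem::last_write_time( path );
    if( currentWrite == lastWriteTime )
        return; // file unchanged; keep the current shader
    lastWriteTime = currentWrite;

    ID3DBlob *byteCode = nullptr, *errors = nullptr;
    const HRESULT hr = D3DCompileFromFile( path, nullptr, D3D_COMPILE_STANDARD_FILE_INCLUDE,
                                           "main", "ps_5_0", 0, 0, &byteCode, &errors );
    if( FAILED( hr ) )
    {
        if( errors )
        {
            OutputDebugStringA( (const char *)errors->GetBufferPointer() ); // show compile errors
            errors->Release();
        }
        return; // keep the old shader so the app keeps running
    }

    ID3D11PixelShader *newPs = nullptr;
    if( SUCCEEDED( device->CreatePixelShader( byteCode->GetBufferPointer(),
                                              byteCode->GetBufferSize(), nullptr, &newPs ) ) )
    {
        if( *inOutPs )
            (*inOutPs)->Release();
        *inOutPs = newPs; // swap in the freshly compiled shader
    }
    byteCode->Release();
    if( errors )
        errors->Release();
}

Call it once per frame (or from a file-watcher thread) and the shader updates without restarting the app.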

#5277824 Black screen when trying to implement ibo

Posted by Matias Goldberg on 23 February 2016 - 11:30 PM

Your index buffer is declared as GLfloat IndexBufferData when it should be uint32_t IndexBufferData. For best efficiency it should be uint16_t IndexBufferData, with glDrawElements called using GL_UNSIGNED_SHORT instead of GL_UNSIGNED_INT.
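A minimal sketch of that fix (the index values and buffer handle are placeholders, not your original code):

#include <cstdint>
#include <GL/glew.h> // or whatever GL loader you already use

// 16-bit indices for two triangles forming a quad.
static const uint16_t IndexBufferData[] = { 0, 1, 2,  2, 1, 3 };

void SetupAndDrawIndices()
{
    GLuint indexBuffer = 0;
    glGenBuffers( 1, &indexBuffer );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, indexBuffer );
    glBufferData( GL_ELEMENT_ARRAY_BUFFER, sizeof( IndexBufferData ),
                  IndexBufferData, GL_STATIC_DRAW );

    // The index type must match the array's element type: GL_UNSIGNED_SHORT for uint16_t.
    glDrawElements( GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, nullptr );
}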

#5277306 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 21 February 2016 - 11:55 AM

How can I check for stalling when writing to descriptor heap ?

Well, if your code contains an explicit wait then I would check there first.
Otherwise, use a CPU profiler to look for significant portions of time spent inside the driver, with the call stack leading back to your code. Or learn to use GPUView (also check this out).

Is there a way to map and unmap descriptor heap ? I always use ID3D12Device::CreateShaderResourceView

Could you provide some code? We're shooting in the dark.
Also name your GPU just in case.

#5277219 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 20 February 2016 - 10:06 PM

After some measurement it looks like writing to the descriptor heap is not cheap (it takes up to 2ms here), so I'm trying to modify my app to only write to the descriptor heap when necessary, and let the shader access the descriptor heap using ids provided via root constants.

  1. Make sure you're not stalling before writing to the desc. heap
  2. Make sure the assembly does not read from the desc heap memory by any chance (write-combined memory will hit you hard!)
  3. I prefer baking a pack of descriptors (i.e. 256 is the limit), and then swapping them out depending on which pack I need (see the sketch after this list).
  4. I assume you're not mapping and unmapping memory to write to the heap?
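A minimal sketch of that descriptor-pack idea (D3D12; the pack size, heap layout and every name here are assumptions, not the code being discussed):

#include <windows.h>
#include <d3d12.h>

static const UINT kDescriptorsPerPack = 256;

// Bake packs once into a CPU-only (non-shader-visible) staging heap.
// Descriptors are written here up front with CreateShaderResourceView(), once per resource.
ID3D12DescriptorHeap *CreateStagingHeap( ID3D12Device *device, UINT numPacks )
{
    D3D12_DESCRIPTOR_HEAP_DESC desc = {};
    desc.Type           = D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV;
    desc.NumDescriptors = numPacks * kDescriptorsPerPack;
    desc.Flags          = D3D12_DESCRIPTOR_HEAP_FLAG_NONE; // CPU-only: cheap to write to
    ID3D12DescriptorHeap *heap = nullptr;
    device->CreateDescriptorHeap( &desc, IID_PPV_ARGS( &heap ) );
    return heap;
}

// At bind time: swap in a whole pre-baked pack with a single copy call.
// For simplicity this copies to the start of the shader-visible heap.
void BindPack( ID3D12Device *device, ID3D12DescriptorHeap *stagingHeap,
               ID3D12DescriptorHeap *shaderVisibleHeap, UINT packIndex )
{
    const UINT increment =
        device->GetDescriptorHandleIncrementSize( D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV );

    D3D12_CPU_DESCRIPTOR_HANDLE src = stagingHeap->GetCPUDescriptorHandleForHeapStart();
    src.ptr += SIZE_T( packIndex ) * kDescriptorsPerPack * increment;

    const D3D12_CPU_DESCRIPTOR_HANDLE dst = shaderVisibleHeap->GetCPUDescriptorHandleForHeapStart();

    device->CopyDescriptorsSimple( kDescriptorsPerPack, dst, src,
                                   D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV );
}

The point is that per-frame work becomes one CopyDescriptorsSimple call per pack instead of re-creating individual views, and the shader-visible heap (which may live in write-combined memory, per point 2) is only ever written, never read back on the CPU.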