Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Yesterday, 06:57 PM

#5278753 Fastest way to draw Quads?

Posted by Matias Goldberg on 29 February 2016 - 03:38 PM

See Vertex Shader Tricks by Bill Bilodeau regarding point sprites. Hint: it's none of the ways you mentioned.

#5278460 How to find the cause of framedrops

Posted by Matias Goldberg on 27 February 2016 - 12:50 PM

Hi Oogst!


Learn to use GPUView (also check this out). It's scary at first, but it's an invaluable tool for discovering stutters like the one you're facing.

#5278336 "Modern C++" auto and lambda

Posted by Matias Goldberg on 26 February 2016 - 12:06 PM

auto x = 7.0; //This is an insane strawman example
auto x = 7.0f; //This is an insane strawman example
auto x = 7.; //This is an insane strawman example
auto x = 7; //This is an insane strawman example
There's absolutely no reason to use auto where


Just to be clear, I had to pull this example because Scott Meyers is literally recommending to use auto everywhere, including literals. And his books are widely read by newcomers trying to learn C++.


Besides, you misinterpreted the example. It doesn't have to be literals. I could do the same with:

auto x = time; //This is double
auto x = acceleration; //This is a float
auto x = time * acceleration; //This is...? (will be a double, probably a float is much better fit)
auto x = sizeBytes; //This is an unsigned integer. 64-bits
auto x = lengthBytes; //This is a signed integer

Except now it's not obvious at all. The last one (lengthBytes) can end up inducing undefined behavior, while the first three could cause precision or performance issues, because I have no idea whether I'm working with doubles or floats unless I spend effort checking each variable's type, which the one who wrote it obviously didn't care about since he decided to use auto.
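A minimal sketch of the float/double case, assuming hypothetical variables `time_s` and `acceleration` standing in for the ones above (the names and values are illustrative, not from any real codebase). It only shows that the mixed expression is deduced as double, and that spelling the type out makes the narrowing a visible decision:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for the 'time' and 'acceleration' variables in the post.
static const double time_s = 0.016;
static const float acceleration = 9.8f;

// float * double is promoted to double, so 'auto' silently picks double.
static_assert(std::is_same<decltype(time_s * acceleration), double>::value,
              "mixed float/double arithmetic yields double");

// Returns true when 'auto' deduces double for the mixed expression.
inline bool autoDeducesDouble()
{
    auto x = time_s * acceleration;
    return std::is_same<decltype(x), double>::value;
}

// Writing the type explicitly makes the narrowing to float an intentional choice.
inline float velocityDelta()
{
    const float dv = static_cast<float>(time_s) * acceleration;
    return dv;
}
```

With the explicit version, a reader can tell at a glance that the math is meant to run in single precision.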



for( Foo x : m_foos )

...the exact same flaw would occur.
Further, I would suggest by default one should use const-reference, and only go to non-const reference or value (or r-value) if the code needs it (for example, if it's just integers).
If you use good coding practices, then "auto x : m_foos" stands out just as badly as "Foo x : m_foos".

Yes, but it is far more obvious that it is a hard copy. It's self evident. People who write "for( auto x : m_foos )" most of the time actually expect "auto &x" or didn't care. I mean, isn't the compiler smart enough to understand I meant auto& x and not auto x? Isn't it supposed to automatically deduce that? This is the kind of BS I had to fix in an actual, real C++ project for a client who was having performance problems. Their programmers didn't realize auto x was making deep copies.


Like frob said, that's the fault of people who try to pretend C++ is not a strongly typed language, since in (almost?) every weakly typed language out there, "auto x : m_foos" means a reference and not a hard copy.

Obviously making auto a reference by default instead of a hard copy is not the solution. That would create another storm of problems. But perhaps people should stop recommending to use auto everywhere, or stop pretending C++ is not a strongly typed language.

#5278202 AMD - Anisotropic filtering with nearest point filtering not working properly?

Posted by Matias Goldberg on 25 February 2016 - 05:45 PM

As far as I know there is no such thing as "point anisotropic" filtering. Anisotropic, by definition, requires performing many taps, which is at odds with the notion of point filtering. Have you compared the difference between regular point filtering and what you see as "point anisotropic"? As Hodgman said, it is an error not to set both mag and min to anisotropic.


Edit: I misread your post. You set Min to Aniso, and Mag to Point filtering. I thought you had set Mip to aniso (which is undefined behavior). I don't know if the specs allow what you're doing or not.

Edit 2: Check the driver isn't configured to override filtering, which is a common option (i.e. high quality vs high performance vs balanced, etc).

#5278175 "Modern C++" auto and lambda

Posted by Matias Goldberg on 25 February 2016 - 03:47 PM



I totally agree with you.

The keyword auto introduced by C++11 is one of the most overused new features.

Auto looks awesome for saving 8 extra letters of typing, but it's horrible when you have to analyze code written by someone else and you can't quickly see what the variable's type is, or you don't have an IDE to help you.

Being easy to write != being easy to read & debug 6 months later.

Auto is useful inside templates, since the variable type may not be known until compilation or specialization, or when the syntax inside the template gets too complex. However using auto outside templates (or everywhere!) is an abuse. It can even introduce subtle bugs: e.g.

auto x = 7.0; //This is a double
auto x = 7.0f; //This is a float
auto x = 7.; //This is a double!!!
auto x = 7; //This is a signed int, even if a uint was intended

The first one may be intended to be a float, but the 'f' was missing.

The third one may be supposed to be an integer, but the extra '.' made it a double. Easy to miss.

The fourth one is the worst kind. Signed integer overflow is undefined behavior. Unsigned integer overflow is well defined (wraps around).

This can have unforeseen consequences:

auto x = myInteger; //int myInteger = 262144; causes undefined behavior as result should be 68719476736, but doesn't fit in 32-bit
if( x * x < x )
//The compiler may optimize this away. Which wouldn't happen if 'x' were unsigned.
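As a rough sketch of how to sidestep this (the helper names are hypothetical, and a 32-bit int is assumed): widen before multiplying when the value is signed, or use an unsigned type when wrap-around is actually what you want.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Signed case: widen to 64 bits first so the multiplication itself is well defined,
// then check whether the result would have fit in 32 bits.
inline bool squareFitsInInt32(std::int32_t x)
{
    const std::int64_t wide = static_cast<std::int64_t>(x) * x;
    return wide <= std::numeric_limits<std::int32_t>::max();
}

// Unsigned case: overflow wraps modulo 2^32 (assuming a 32-bit int platform),
// so the result is well defined even when it does not fit.
inline std::uint32_t squareWrapped(std::uint32_t x)
{
    return x * x;
}
```

For 262144 the signed check reports that the square does not fit, while the unsigned version wraps to 0 because 262144 * 262144 is exactly 2^36.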

Of course you wouldn't use auto on a literal (note: Scott Meyers actually recommends using auto on literals!!!), but even then, using it on more complex code is asking for subtle trouble.

If the code is too complex, then it is even harder to read what's going on.

Another subtle problem is one like the following:

for( auto x : m_foos )

Looks ok, right? Well, it turns out m_foos is std::vector<Foo> m_foos; The loop will perform a hard copy of Foo in every iteration. If the declaration of m_foos is std::vector<std::shared_ptr<Foo>> m_foos; then every loop iteration will copy the pointer, atomically incrementing and almost immediately afterwards decrementing its internal reference count. The proper fix would be:

for( auto &x : m_foos )

And now x is a reference, rather than a hardcopy.
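To make the difference measurable, here is a small sketch with a hypothetical Foo that counts its own copy-constructions (the counter and helper functions exist only for illustration):

```cpp
#include <cassert>
#include <vector>

// Hypothetical Foo that counts how many times it gets copy-constructed.
struct Foo
{
    static int copies;
    int value;
    explicit Foo(int v) : value(v) {}
    Foo(const Foo &other) : value(other.value) { ++copies; }
    Foo &operator=(const Foo &) = default;
};
int Foo::copies = 0;

// Builds a vector of n elements holding the values 0..n-1.
inline std::vector<Foo> makeFoos(int n)
{
    std::vector<Foo> foos;
    foos.reserve(static_cast<std::vector<Foo>::size_type>(n));
    for (int i = 0; i < n; ++i)
        foos.emplace_back(i);
    return foos;
}

// 'auto x' copies every element; returns the number of copies made by the loop.
inline int copiesWhenIteratingByValue(const std::vector<Foo> &foos)
{
    Foo::copies = 0;
    int sum = 0;
    for (auto x : foos)
        sum += x.value;
    (void)sum;
    return Foo::copies;
}

// 'const auto &x' binds a reference; the loop itself makes no copies.
inline int copiesWhenIteratingByRef(const std::vector<Foo> &foos)
{
    Foo::copies = 0;
    int sum = 0;
    for (const auto &x : foos)
        sum += x.value;
    (void)sum;
    return Foo::copies;
}
```

With a three-element vector the by-value loop reports three copies and the by-reference loop reports none.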

Auto has far too many issues and problems, even if you're aware of all of them. It's best to just forbid it.

On things like templates it makes sense, because auto myVariable is far better to work with than typename<T>::iterator::something<B, C>::type_value myVariable where the type is not actually explicit, and it's an incomprehensible string. In those cases, auto is justified as the alternative is worse on every single aspect.

So, in my opinion, Microsoft recommending to use auto everywhere like a mantra only perpetuates bad practices.

#5277990 How does everyone test their hlsl code?

Posted by Matias Goldberg on 24 February 2016 - 08:36 PM

Depends on what you mean by "testing hlsl":

If by testing you mean prototyping an idea or some math, RenderMonkey works since most of the relevant syntax hasn't changed. Unity can also work. Even ShaderToy can work: although it uses GLSL, math is still math, and porting to HLSL is mostly find & replace (e.g. vec4 becomes float4).


If by testing you mean checking that the shader works well in the actual environment it's going to run in (i.e. the game) and has proper SM 4.0/5.0 syntax, then nothing beats testing in your own program, provided you implemented a hot reloading feature that detects changes on disk and reloads them automatically. For debugging, use RenderDoc, Visual Studio Graphics Debugger and GPUPerfStudio, in that order.

#5277824 Black screen when trying to implement ibo

Posted by Matias Goldberg on 23 February 2016 - 11:30 PM

Your index buffer is declared as GLfloat IndexBufferData when it should be uint32_t IndexBufferData. For best efficiency it should be uint16_t IndexBufferData (as long as every index fits in 16 bits), with glDrawElements called with GL_UNSIGNED_SHORT instead of GL_UNSIGNED_INT.
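A rough sketch of the 16-bit idea, kept free of GL calls so it stands alone. The helper name tryNarrowIndices is hypothetical; it only decides whether the indices can be stored as uint16_t:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: fills 'out' and returns true when every index fits in 16 bits.
// On failure 'out' is left empty and the caller should keep the 32-bit indices.
inline bool tryNarrowIndices(const std::vector<std::uint32_t> &in,
                             std::vector<std::uint16_t> &out)
{
    out.clear();
    out.reserve(in.size());
    for (std::uint32_t idx : in)
    {
        if (idx > 0xFFFFu)
        {
            out.clear();
            return false;
        }
        out.push_back(static_cast<std::uint16_t>(idx));
    }
    return true;
}
```

If narrowing succeeds, upload the uint16_t data and draw with GL_UNSIGNED_SHORT; otherwise keep the uint32_t data and draw with GL_UNSIGNED_INT.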

#5277306 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 21 February 2016 - 11:55 AM

How can I check for stalling when writing to descriptor heap ?

Well, if your code contains an explicit wait then I would first check there.
Otherwise, use a CPU profiler to see whether significant portions of time are spent inside the driver with the callstack leading to your code. Or learn to use GPUView (also check this out).

Is there a way to map and unmap descriptor heap ? I always use ID3D12Device::CreateShaderResourceView

Could you provide some code? We're shooting in the dark.
Also name your GPU just in case.

#5277219 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 20 February 2016 - 10:06 PM

After some measurement it looks like writing to descriptor heap is not cheap (it takes up to 2ms here) so I'm trying to modify my app to only write to the descriptor heap if necessary and let the shader access the descriptor heap using ids provided with constant roots.

  1. Make sure you're not stalling before writing to the desc. heap
  2. Make sure the assembly does not read from the desc heap memory by any random chance (write combined memory will hit you hard!)
  3. I prefer baking a pack of descriptors (i.e. 256 is the limit), and then swap them out depending on which pack I need.
  4. I assume you're not mapping and unmapping memory to write to the heap?

#5277217 D3D12: Texture2D/3D/Cube accessed inside a single SRV descriptor heap ?

Posted by Matias Goldberg on 20 February 2016 - 09:56 PM

I kept thinking about this and there are probably hardware limitations.
There are several types of divergence:

  • All pixels in a wavefront A access texture[0]; All pixels in a wavefront B access texture[1]; This is very likely because the wavefronts belong to different draws.
  • Pixel A accesses texture[0]; Pixel B accesses texture[1]. Both are Texture2D.

The first case is fine.
But the second one is not. You must specify NonUniformResourceIndex if you expect to be in scenario 2.
On top of that, to cover what you want, we would have to add several more scenarios, where the type of texture could be divergent, not just the texture index.
Some hardware out there definitely cannot do that.
Note however, nothing's preventing you from doing this:

Texture2D texture_array2d[5000] : register(t0); //We assume the first 5000 textures are 2D
Texture3D texture_array3d[5000] : register(t5000); //We assume textures [5000; 10000) are 3D

Also note that Tier 1 supports up to 256 textures, so arrays totalling 10,000 textures will lock you out of a lot of cards (Haswell & Broadwell Intel cards, Fermi NVIDIA).


Edit: Just to clarify why it's not possible (or very difficult, or would add a lot of pointless overhead). There are three parts:

  1. Information about textures like format (i.e. RGBX8888 vs Float_R16, etc) and resolution. In some hardware it lives in a structure in GPU memory (GCN), in other hardware it lives in a physical register (Intel).
  2. Information about how to sample the texture (bilinear vs point vs trilinear, mip lod bias, anisotropy, border/clamp/wrap, etc). In GCN most of this information lives in a SGPR register that points to a cached region of memory. The border colour (for the border colour mode) lives in a register table. In Haswell this information lives in physical register IIRC.
  3. Information about the type of the texture, which affects how it is sampled (1D vs 2D vs 2D Array vs 3D vs Cube vs Cube Array). In GCN, sampling a cubemap requires issuing more instructions (V_CUBE*_F32 family if I recall); sampling 3D textures requires providing more VGPRs (since more data is needed) than for sampling 2D textures.

Your assumption is that the type of texture lives in GPU memory alongside the format and resolution (point 1). But this is not the case. It lives in the ISA instructions (point 3).

In fact D3D12 provides some level of abstraction: You think the format and resolution live in GPU memory, when in fact on Intel GPUs they live in physical registers (that's where the 256 limit of Tier 1 comes from btw. D3D11 by spec allowed up to 128 textures, and it so happens that both Fermi & Intel supported up to 256).

Therefore, it becomes too cumbersome to support this sort of generic-type texture you want.

#5276213 Vulkan is Next-Gen OpenGL

Posted by Matias Goldberg on 17 February 2016 - 04:30 PM

Anybody know of any open source graphics engines going to be built around these new apis? The most popular open source engine I know is Ogre, and that is built around DX9. I would love to see a graphics engine that is built specifically around these new apis.

Ogre 2.1 is built around AZDO OpenGL, D3D11, and mostly prepared for D3D12 & Vulkan.

Our biggest issue is some old code for Textures that needs a big refactor, which is currently the main obstacle to implementing the D3D12 & Vulkan RenderSystems.

#5275755 Do game developers still have any reason to support Direct3D 10 cards?

Posted by Matias Goldberg on 15 February 2016 - 09:47 AM


The D3D9 to D3D10 shift was a very peculiar one. It wasn't just performance, D3D10 introduced a few improvements that were very handy and easy to support for adding "extras" (note: these extras could've been easily backported to D3D9, but MS just wasn't interested).
For example:
1) Z Buffer access, 2) Z Out semantic, 3) sending to multiple RTTs via the geometry shader, 4) access to individual MSAA samples, 5) separate alpha blending, 6) dynamic indexing in the pixel shader, 7) real dynamic branching in the pixel shader.

You're misremembering how groundbreaking D3D10 was :)
Uptake of D3D10 was excruciatingly slow, as it didn't really introduce any killer new feature (geometry shaders did not live up to the hype), and came with the lack of XP-compatibility, which was a big deal at the time. In my experience, a lot of people seem to have stuck with D3D9 until D3D11 came out. Most of the stuff you mention is accessible from the D3D9 API:
1) D3D9-era GPUs had multiple different vendor-specific extensions for this, which was painful. D3D10-era GPUs all support the "INTZ" extension (side note: you see a LOT of games that use the D3D9 API, but list the GeForce 8800 as their min-spec, which is the first NV D3D10 GPU -- my guess is because being able to reliably read depth values is kinda important)
2/5/7) Are in the D3D9 core API.
3) is a D3D10 API feature, but performance wasn't that great...
4) is a D3D10.1 API feature (and compatible GPU), but wasn't put to great use until proper compute shaders appeared in D3D11 :)
6) is emulatable in D3D9 but requires you to use a tbuffer instead of a cbuffer (as cbuffers weren't buffers in D3D9).
I've actually been toying with the idea of doing a D3D9/WinXP build of my current game, as a 10-years-too-late argument against all the D3D10 hype of the time :D
You can actually do a damn lot with that API, albeit a lot less efficiently than the D3D11 API does it! I'd be able to implement a lot of the D3D11-era techniques... but with quite significant RAM and shader efficiency overheads. Still, would be fun to see all those modern techniques running badly on Windows XP (or an XP app, emulated on Wine!!).


Actually I did. I was using those screenshots of the Saints Row & Bioshock DX10-only features he posted:

  • Reflections. I guess they used GS to write to multiple RTTs at once. Otherwise it doesn't make sense to be DX10-only (from a technical point of view). While GS didn't live anywhere near up to their hype, that doesn't mean people didn't try. Probably they didn't gain performance. But porting it to DX9 would mean creating two codepaths (one for single pass using a GS, another for multi pass w/out GS). Note however, hybrids did actually improve performance. A hybrid would use instancing to multiply the geometry and a Geometry Shader to output to multiple RTTs, while still being multipass. Instead of writing to all 6 faces in one pass, write to 3 faces in 2 passes, or 2 faces in 3 passes. This kind of leverage allowed finding a sweet spot in performance improvement.
  • Ambient occlusion: Clearly they're doing a dynamic loop in the pixel shader which would explain why they'd need DX10. Or maybe they wanted Z Buffer access and didn't bother with the INTZ hack.
  • DirectX10 detail surfaces: I suspect they mean multiple diffuse/normal textures overlaid on top of each other, taking advantage of array textures. Or maybe they enabled some Geometry Shaders somewhere for extra effects, like in a wall or something.

All of these features can definitely be done on DX9. But on the lazy side, you have to admit they're much easier to implement on DX10 (or like you said, doing it in DX9 would require more RAM or some other kind of overhead).

Like you said, DX10 wasn't that groundbreaking; but the features (that could've easily been backported to DX9, but weren't; save for vendor hacks like the INTZ one) that were added allowed games to include "turn on / turn off" kind of effects when running in DX10 mode.

#5275703 Do game developers still have any reason to support Direct3D 10 cards?

Posted by Matias Goldberg on 15 February 2016 - 12:12 AM

The D3D9 vs D3D10/11 and D3D10 vs D3D11 situations are not exactly the same.
Supporting multiple DX versions means we need to aim for the lowest common denominator. This rules out performance optimizations that are not possible on the oldest path (unless we spend a disproportionate amount of resources maintaining two completely different code paths).
This means a game well-designed to run D3D11 will be significantly more efficient than a game that aims to run on D3D11, 10 & 9.
The D3D9 to D3D10 shift was a very peculiar one. It wasn't just performance, D3D10 introduced a few improvements that were very handy and easy to support for adding "extras" (note: these extras could've been easily backported to D3D9, but MS just wasn't interested).
For example: Z Buffer access, Z Out semantic, sending to multiple RTTs via the geometry shader, access to individual MSAA samples, separate alpha blending, dynamic indexing in the pixel shader, real dynamic branching in the pixel shader.
All stuff that made certain postprocessing FXs much easier. Therefore it was possible to offer DX10-only effects like you see in Bioshock that can be turned on and off, just to "spice up" the experience when you had a recent GPU running on Vista.
But moving from D3D10->D3D11... there weren't many features introduced, but those few features... oh boy they were critical. Let's take Assassin's Creed Unity for example: its frustum culling and dispatch of draws lives in a compute shader! We're not talking about an effect you can turn on and off. We're talking about the bare bones of its rendering infrastructure depending on a feature unavailable to D3D10 cards. Supporting D3D10 cards may well mean rewriting 70% or more of its entire rendering engine, which will also likely affect the asset pipeline and the map layout.


There are only a few D3D11-only things that can be used to spice up the graphics while still turning them off for D3D10; tessellation comes to mind.

#5275583 Do game developers still have any reason to support Direct3D 10 cards?

Posted by Matias Goldberg on 13 February 2016 - 04:44 PM

Since in Ogre 2.1 we aim to support both DX10 & DX11 hardware, I've gotta say DX10 hardware's DirectCompute limitations are currently giving me a big PITA.

AMD's Radeon HD 2000-4000 hardware didn't even get a driver upgrade to support DirectCompute. So even if you limit yourself to the structured buffers available to DX10 hardware, these cards won't even run these compute shaders (despite the hardware being completely capable of doing so). I don't know about Intel DX10 GPUs, but I suspect it's the same deal.


AFAIK only NVIDIA DX10 GPUs got the upgrade.

#5275464 MSVC generating much slower code compared to GCC

Posted by Matias Goldberg on 12 February 2016 - 04:23 PM


The first version calls std::vector<>::size() every iteration. The second does so only once and stores the value in a local variable.

I would have thought that something as trivial as size() would have gotten inlined out? Though at least the implementation I'm looking at computes the size by creating the beginning and end iterators and subtracting them, so maybe that isn't much of a savings anyway.


It's not trivial at all. Consider the following:

m_sum = 0;
for( size_t i=0; i != m_vec.size(); ++i )
{
  someFunc();
  m_sum += m_vec[i];
}

Unless the full body of someFunc is available at the compilation stage (and even if it is, someFunc must not do something that is not visible to the compiler), the compiler literally can't know if someFunc() will alter m_sum or if it will push or remove elements from m_vec; hence m_vec.size() must be fetched from memory on every single iteration, and so must m_sum.

However if it were changed to:

size_t tmpSum = 0;
const size_t vecSize = m_vec.size();
for( size_t i=0; i != vecSize; ++i )
{
  someFunc();
  tmpSum += m_vec[i];
}

m_sum = tmpSum;

Now the compiler knows for sure neither tmpSum nor vecSize will change regardless of what happens inside someFunc (unless someFunc corrupts memory, of course) and can keep their values in a register instead of refetching on every single loop iteration.
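For reference, a compilable sketch of both variants side by side. The Accumulator struct is a hypothetical wrapper, and someFunc() is assumed to live in another translation unit in real code; it is defined here as an empty function only so the example links on its own:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the post's someFunc(). Assume the optimizer cannot see its body.
void someFunc();

// Hypothetical class holding the members used in the post.
struct Accumulator
{
    std::vector<std::size_t> m_vec;
    std::size_t m_sum;

    // Members are re-read every iteration because someFunc() might change them.
    void sumThroughMembers()
    {
        m_sum = 0;
        for (std::size_t i = 0; i != m_vec.size(); ++i)
        {
            someFunc();
            m_sum += m_vec[i];
        }
    }

    // Locals cannot be touched by someFunc(), so they can stay in registers.
    void sumThroughLocals()
    {
        std::size_t tmpSum = 0;
        const std::size_t vecSize = m_vec.size();
        for (std::size_t i = 0; i != vecSize; ++i)
        {
            someFunc();
            tmpSum += m_vec[i];
        }
        m_sum = tmpSum;
    }
};

// Empty definition, present only to keep the sketch self-contained.
void someFunc() {}
```

Both versions produce the same sum as long as someFunc() leaves the members alone. The second one simply states that assumption in code, so the compiler doesn't have to prove it.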


It's far from trivial; in fact most of the time it is "impossible" to optimize. This is where the argument of "let the compiler do its job. Don't worry about optimization, they're really good" falls short. Yes, compilers have improved tremendously in the last 15 years, but they aren't fortune tellers. They can't optimize away stuff that might change the expressed behavior (I say expressed because what's intended may have more relaxed requirements than what's being expressed in code).