
Matias Goldberg

Member Since 02 Jul 2006

#5275351 request HLSL support for sqrt() for integers

Posted by Matias Goldberg on 11 February 2016 - 05:05 PM


IEEE compliance doesn't guarantee determinism across different GPUs

No, IEEE compliance does in fact guarantee "determinism" across different GPUs.


No. It does not.

You need a guarantee the generated ISA performs all calculations in the exact same order, which you don't get with IEEE conformance, not even if you use the same HLSL compiler. IEEE also doesn't guarantee the intermediate precision in which calculations are made.

As you can see, I was talking about graphics, not simulation, where e.g. one ulp of color difference "most likely" will not matter.

One ulp of difference in Dolphin "most likely" broke the emulation of Pokémon Snap. "Most likely" it was also causing bizarre glitches.

#5275224 Clinical studies on overlooking stupid bugs

Posted by Matias Goldberg on 10 February 2016 - 11:11 PM

QtCreator will highlight in purple everything that is covered by an unbalanced bracket or parenthesis like that, when the cursor touches the offending bracket.
It's very useful. See how "mMaxTexUnitReached( 0 ))" has no matching '(':

Edit: I saw your other post. With practice you'll quickly learn to recognize that when your IDE's autoformatting is doing something you don't expect (like over- or under-indenting your lines), it probably means you've just introduced a syntax error.

#5274910 Cache misses and VTune

Posted by Matias Goldberg on 08 February 2016 - 02:55 PM

I'm sorry, but the data you've shown is exactly what's supposed to happen.


You're randomly accessing and looping through 253MB of data, which obviously does not fit in the cache, and VTune is telling you that you're DRAM bound. This is exactly what will happen if the first iteration indexes float[5] and float[26600000], and the next iteration indexes float[99990] and float[7898]. The cache is effectively useless, and all the bottlenecks will be in the DRAM.


What do you expect it to tell you?
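For reference, a sketch of the two access patterns (sizes and indices are made up): the random walk defeats both the hardware prefetcher and the cache, so nearly every load goes out to DRAM.

```cpp
#include <cstddef>
#include <vector>

// Sequential traversal: the prefetcher streams cache lines ahead of the
// loop, so most loads are cache hits.
float sumSequential( const std::vector<float> &data )
{
    float sum = 0.0f;
    for( size_t i = 0; i < data.size(); ++i )
        sum += data[i];
    return sum;
}

// Random traversal over the same data: each iteration may touch a cache
// line (and DRAM page) far from the previous one, e.g. data[5],
// data[26600000], data[99990], data[7898]... On a 253MB array the cache
// is effectively useless.
float sumRandom( const std::vector<float> &data,
                 const std::vector<size_t> &indices )
{
    float sum = 0.0f;
    for( size_t i = 0; i < indices.size(); ++i )
        sum += data[indices[i]];
    return sum;
}
```

Both functions compute the same sum when `indices` is a permutation of the array; the difference only shows up in the hardware counters VTune is reading.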

#5274800 Query Timestamp inconsistency

Posted by Matias Goldberg on 07 February 2016 - 06:04 PM

If you are on Windows 7, make sure to disable Aero.

Also for best results perform these queries in exclusive fullscreen.


The compositor's presentation can seriously skew your measurements.

#5274589 Any alternatives to automatic class instantiation via macro?

Posted by Matias Goldberg on 05 February 2016 - 09:04 PM

I agree with everyone... on desktop.


Unfortunately, Android and iOS came to crash the party: there is no main. The former enters native code via a Java loader that loads an .so library with a set of arbitrarily-named JNI function calls, and the latter enters the system by overriding AppDelegate.


Considering these two bastards, if they need to be supported, the macro idea suddenly looks more appealing; although I personally still prefer letting the user write these JNI loaders or iOS AppDelegates himself, instead of trying to do it for him (especially when the user needs to release resources or be notified of low memory conditions).

If a macro tries to do it for the user, then when something goes wrong there's always that nagging feeling that it's the fault of the macro's overridden method (i.e. "I bet the main system isn't informing me of low memory conditions even though the app is receiving them").

#5274581 Why are there no AAA games targeted towards the young adult audience?

Posted by Matias Goldberg on 05 February 2016 - 08:01 PM

This theory also seems to apply towards why there aren't many games that cover political or sociological themes.

The first Assassin's Creed games were strongly loaded with political and sociological themes.
I still remember fondly the long discussions about politics, religion, morality and ethics between Altair and Al Mualim (even though I met a lot of people who disliked those moments... "boring" they said).

The second game is about a teenager seeking revenge for the unjust sentencing to death of half of his family (quite common in that era), involving real-world events like the attempted murder of Lorenzo de' Medici, the Pazzi conspiracy, the suspected poisoning of the Doge of Venice Giovanni Mocenigo, and the Borgia family drama... well, someone summarized it for me. It also covers topics like thievery, extreme poverty, and prostitution.

Some people may have played AC II as just a dude that kills people with cutscenes in between; but it's actually charged with a lot of content if you pay attention to the story.

#5274389 Why are there no AAA games targeted towards the young adult audience?

Posted by Matias Goldberg on 04 February 2016 - 09:36 PM

According to Wikipedia, young adult is between 14-20 years old.


I was under the impression most games target that audience already.

Also according to Wikipedia, YA literature often treats topics such as depression, drug & alcohol abuse, identity, sexuality, familial struggles and bullying.


Perhaps you meant to ask why there aren't more games covering these topics, which is a very different question. If that's the case, beware that the target market is mostly the same as for current games, so they would be up against a lot of strong, established competition.

#5274064 [Debate] Using namespace should be avoided ?

Posted by Matias Goldberg on 03 February 2016 - 10:19 AM

using namespace has little to no way of being disabled once it is declared, which is why at header level it can be a PITA.

At .cpp file level it sounds more sane. But if you use "unity builds" to speed up compilation, using namespace at .cpp file level comes back to bite you. This makes "using namespace" friendlier in an enclosed scope, e.g.:

void myFunc( int a )
{
     using namespace std; //Only available at myFunc level.
}


Typing std:: is not a big deal, so I try to avoid using namespace as much as possible. Furthermore, "using" pollutes autocomplete.


There are legitimate cases where it's appropriate, but use it with discretion and care.
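A sketch of why the function-scope form survives a unity build, with a hypothetical mymath::pair standing in for any name that collides with std:

```cpp
#include <utility>

namespace mymath { struct pair { int a, b; }; }

// If a.cpp had "using namespace std;" and b.cpp had "using namespace
// mymath;" at file scope, a unity build that concatenates the two files
// would make the unqualified name 'pair' ambiguous. Scoping each
// directive to the function that needs it avoids the collision:
int sumStd()
{
    using namespace std;                    // only visible inside sumStd
    pair<int, int> p = make_pair( 2, 3 );   // std::pair
    return p.first + p.second;
}

int sumMath()
{
    using namespace mymath;                 // only visible inside sumMath
    pair p = { 4, 5 };                      // mymath::pair
    return p.a + p.b;
}
```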

#5273997 Multi-threaded deferred setup

Posted by Matias Goldberg on 02 February 2016 - 09:54 PM

For "read once" (i.e. not read again on the next frame) dynamic data such as constants, it's not worth copying it over to the GPU. Just leave the data in the UPLOAD heap and read it from there.

Actually, on GCN performing a copy via a Copy queue allows the GPU to start transferring the data over the bus using its DMA engines while it does other work (like rendering the current frame), which might result in higher overall performance (particularly if you're bound by the bus, or latency is an issue).


However, it hurts all other GPUs which don't have a DMA engine (particularly Intel integrated GPUs and AMD APUs, which don't need this transfer at all, so it just takes away precious bandwidth).

#5273783 Shader Permutations

Posted by Matias Goldberg on 01 February 2016 - 09:37 PM

You may be interested in how we tackled it in Ogre 2.1 with the Hlms (see section 8 HLMS).

Basically, 64 bits will soon turn out to be not enough flags to handle all the permutations. But as Hodgman said, many of these options are mutually exclusive, or most of the combinations aren't used.


The solution we went for was, at creation time, to create a 32-bit hash of the shader based on all the options (which are stored in an array), and store this hash in the Renderable.

Then at render time we pull the right shader from the cache using the "final hash". The final hash is produced by merging the Renderable's one with the Pass hash. A pass hash contains all settings that are common to all Renderables and may change per pass (i.e. during the shadow map pass vs another receiver pass vs extra pass that doesn't use shadow mapping for performance reasons).

You only need to access the cache when the hash between the previous and next Renderable changes, which is why it is a good idea to sort your Renderables first.
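A rough sketch of the scheme (names and the mixing function are invented for illustration; Ogre's actual Hlms implementation differs):

```cpp
#include <cstdint>
#include <unordered_map>

struct HlmsCache { uint32_t finalHash; /* compiled shaders, PSO, ... */ };

// Merge the Renderable's creation-time hash with the current pass hash.
// Any reasonable mixing function works; this one is arbitrary.
static uint32_t mergeHashes( uint32_t renderableHash, uint32_t passHash )
{
    return renderableHash ^ (passHash * 0x9E3779B9u);
}

static std::unordered_map<uint32_t, HlmsCache> shaderCache;

// Only touch the cache when the final hash actually changes; with
// Renderables sorted by hash, most consecutive draws skip the lookup.
static const HlmsCache* getShader( uint32_t renderableHash, uint32_t passHash,
                                   uint32_t &lastHash,
                                   const HlmsCache *&lastCache )
{
    const uint32_t finalHash = mergeHashes( renderableHash, passHash );
    if( !lastCache || finalHash != lastHash )
    {
        HlmsCache &entry = shaderCache[finalHash]; // filled on first miss
        entry.finalHash = finalHash;
        lastHash  = finalHash;
        lastCache = &entry;
    }
    return lastCache;
}
```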


The Source 2 slides suggest a similar approach to map their PSOs (see slides 13-23; PPT for the animated version).


While a 64-bit permutation mask works well for simple to moderately complex scenes, it will eventually fall short; especially if you need to adapt to very dynamic scenarios or have lots of content. However, implementing a 64-bit permutation mask is a good exercise to get a feel for the pros and cons of managing shaders.

#5273776 Cost of Switching Shaders

Posted by Matias Goldberg on 01 February 2016 - 09:19 PM

On the CPU side, the "root signature" is changed, which means that (on pre-D3D12 APIs), all the resource bindings must be re-sent to the GPU. The driver/runtime also might have to resubmit a bunch of pipeline state, and even validate that the PS / VS are compatible, etc (and possibly patch them if they mis-match, or patch the VS if it mis-matches with the IA config).... The driver might also have to do things like patch the PS if it doesn't match the current render-target format...

Since you're describing pre-DX12 problems, I shall add that most state changes (particularly shader changes) meant the driver would delay all validation and updates (basically any actual work) until the next DrawPrimitive call, since it's only then that the driver has all the information it needs: it needs the IA layout & vertex buffer bindings to patch vertex shaders, it needs the RTT format and multisample settings to patch the pixel shader, etc.

Then it would have to internally create a cache of all the IA Layouts / RTT / shader combinations and pull the ISA assembly code from the cache the next time it is needed.

Mantle said "screw it" and came up with Pipeline State Objects to condense all the information any GPU could possibly need to generate the ISA from shaders into one huge blob, moving the overhead from DrawPrimitive time (which happens every frame) to PSO creation time (which happens once).
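As a hypothetical illustration (not the real Mantle/D3D12 API; the struct and names here are invented), a PSO declares up front everything the driver previously had to gather at draw time:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Everything the driver used to collect lazily at DrawPrimitive time is
// supplied at creation time instead.
struct PipelineStateDesc
{
    std::string vsBytecode;       // vertex shader blob
    std::string psBytecode;       // pixel shader blob
    uint32_t    inputLayoutId;    // IA layout the VS used to be patched against
    uint32_t    renderTargetFmt;  // RTT format the PS used to be patched against
    uint32_t    sampleCount;      // multisample settings
    // ...blend, raster, depth-stencil state, etc.
};

struct PipelineState
{
    uint64_t compiledIsa;  // stand-in for the final, fully-patched GPU code
};

// The expensive work (validation, patching, ISA generation) happens once
// here, not on every draw. The "compilation" below is obviously fake.
PipelineState createPipelineState( const PipelineStateDesc &desc )
{
    PipelineState pso;
    pso.compiledIsa =
        std::hash<std::string>{}( desc.vsBytecode + desc.psBytecode ) ^
        desc.inputLayoutId ^ desc.renderTargetFmt ^ desc.sampleCount;
    return pso;
}
```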

#5273711 request HLSL support for sqrt() for integers

Posted by Matias Goldberg on 01 February 2016 - 12:50 PM

Since you want the same results for sqrt on different GPUs, I assume that what you need this for is not directly related to graphics, since if it was, a small error would most likely not matter.


If that is the case, you should consider OpenCL or CUDA where you can enforce IEEE 754 floating point compliance on recent GPUs.

IEEE compliance doesn't guarantee determinism across different GPUs, and there is no "small error that doesn't matter" in simulations where determinism is required. One ulp of difference is enough for two simulations to diverge greatly over time.

#5273561 request HLSL support for sqrt() for integers

Posted by Matias Goldberg on 31 January 2016 - 07:21 PM

If we're using the usual definition of "determinism" to mean a system that doesn't produce random results (ie, same input + same set of operations = same output. Every time.) then I fail to see how any of the normal operations on a GPU can be classified as non-deterministic.
Now, if you're talking about things that are sensitive to timing (like Atomic operations, UAV writes) then you can get some non-determinism, but only by virtue of having started operating on a shared resource with many threads. This is the same non-determinism you'd get on any architecture, CPUs included.

For two different machines to produce the same output (GPU speaking), they must follow these rules:

  1. Exact same GPU chip (not even different revisions).
  2. Same drivers (to generate the same ISA).
  3. Same version of HLSL compiler (if compiling from source).

Otherwise the result will not be deterministic across machines. This is very different from x86/x64 and ARM CPUs, where the same assembly with the same input will produce the same output even across different Intel & AMD chips, as long as you stay away from some transcendental FPU functions (like acos), some non-deterministic instructions (RCPPS & RSQRTPS), and ignore certain models with HW bugs (e.g. the FDIV bug).

#5273541 request HLSL support for sqrt() for integers

Posted by Matias Goldberg on 31 January 2016 - 05:48 PM

Floating point operations surely are not deterministic on GPUs, but I'm pretty sure that casting an int to a float, then sqrt(), then casting back to int (truncate, floor, ceil) will produce deterministic results.
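A sketch of that idea, with an explicit fix-up step in case the correctly-rounded float result lands one above or below the integer floor. It is exact for inputs up to 2^24, where the int-to-float conversion is lossless:

```cpp
#include <cmath>
#include <cstdint>

// Integer square root via float sqrt. IEEE 754 requires sqrt to be
// correctly rounded, so for exactly-representable inputs (v <= 2^24)
// the float result is the same on every conforming implementation; the
// fix-up loops make the integer floor exact even when the rounded float
// straddles an integer boundary.
int32_t isqrt( int32_t v )
{
    int32_t r = static_cast<int32_t>( std::sqrt( static_cast<float>( v ) ) );
    while( r > 0 && r * r > v )
        --r;  // rounded result was one too high
    while( (r + 1) * (r + 1) <= v )
        ++r;  // rounded result was one too low
    return r;
}
```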

#5273533 Omnidirectional shadow mapping

Posted by Matias Goldberg on 31 January 2016 - 04:32 PM

It's good because it doesn't use much memory; the disadvantage is it eats fillrate for breakfast, even if you do some kind of clever stencil + light bounding volume based optimization.

Another disadvantage is that it involves a lot of SetRenderTarget calls, which are relatively expensive CPU-side. (Normally for N lights you would need N+1 SetRenderTarget calls; but with this method you need N*2 calls. Though you can amortize this if you work on 2 cubemaps at once.)