
Matias Goldberg

Member Since 02 Jul 2006
Online Last Active Today, 06:02 PM

#5238829 Avoid problem when roughness is 0

Posted by Matias Goldberg on Today, 12:49 PM

Clamp the roughness before sending it to the GPU. There's no need to do it in the shader (other than for educational purposes).


A roughness of 0 makes as much sense as a product being sold at a price of -$10. Just because there's a numeric value doesn't mean the entire range is valid.
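The clamp is trivial to do on the CPU before the value is written to the GPU; a minimal sketch (the exact floor value is an assumption — pick whatever your BRDF needs):

```cpp
#include <algorithm>

// Hypothetical minimum roughness. The exact floor depends on your BRDF and
// float precision; this particular value is an assumption, not a standard.
constexpr float kMinRoughness = 0.001f;

// Clamp on the CPU, before the value is uploaded to the GPU, so the
// shader never has to special-case roughness == 0.
inline float clampRoughness( float roughness )
{
    return std::min( std::max( roughness, kMinRoughness ), 1.0f );
}
```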

#5238387 Why is controller support for pc game so poor?

Posted by Matias Goldberg on 04 July 2015 - 03:47 PM

There are a lot of reasons, some of which I bump into whenever I try to implement it myself.


First, historically gamepad support was pretty bad. Ignoring DOS games, back in Windows 9x the default setup wouldn't let you use more than 4 buttons.

You had to install special drivers to support pads with 6 buttons or more. I never understood why, though at some point I began to suspect it was a limitation of the Game Port, and that the drivers worked around it via special encodings (hacks) or by using signals reserved for other purposes (extra axes, etc.).


Because of this, the interfaces coming from the OS and DirectInput were very basic (let's not even joke about rumble support or anything like that).


This problem more or less went away when USB gamepads arrived. However, we are now left with hardware problems: lots of non-standard input layouts. Even if all gamepads now look like either a PS3 or an Xbox controller, you still have the problem that one assigns button 0 to square and the other assigns button 0 to circle. There's no standardization at all.

Therefore, you have to dedicate time to building an easy-to-use interface for mapping controller buttons to actions. Also, the user can no longer just plug in any controller and start playing.


Then there are the analog sticks. Even though the basics are the same, and fortunately all vendors seem to agree on which axes map to the left stick and which map to the right stick, they do not agree on the scale/space they use.

On one controller you move your thumb halfway across the stick's range and the gamepad reports 50%; another gamepad will report 80%, another 99%, and another will report 10% until you move your thumb one more millimeter, at which point it suddenly jumps to 100%.

This problem can be alleviated via deadzones and a user-defined exponent factor in an attempt to linearize the signal coming from the stick. But it has to be configured manually by the user and is not as easy to set up as the button mapping. And the user may not be able to set it up "right", the way the designer meant the game to be played.
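A minimal sketch of the deadzone-plus-exponent remapping described above (the deadzone and exponent values are user-tuned assumptions, not standards):

```cpp
#include <cmath>

// Remaps a raw stick axis in [-1, 1]: values inside the deadzone become 0,
// the remainder is rescaled to the full range, and a user-configurable
// exponent reshapes the curve (1.0 = linear) to approximate linear response.
float remapAxis( float raw, float deadzone, float exponent )
{
    const float mag = std::fabs( raw );
    if( mag <= deadzone )
        return 0.0f; // inside the deadzone: report no movement

    // Rescale so the output ramps from 0 at the deadzone edge up to 1.
    float t = ( mag - deadzone ) / ( 1.0f - deadzone );
    // Apply the exponent to reshape the curve.
    t = std::pow( t, exponent );
    return raw < 0.0f ? -t : t;
}
```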


Carefully designed stick-gesture experiences such as Zelda: Ocarina of Time on the original Nintendo 64 (walk, run, quick spin attack, all depending on how you moved the stick) are completely ruined unless everyone uses the same gamepad. How can you implement walk/run gestures when a gamepad goes from 10 to 100 in a heartbeat?


The Xbox 360 controller simplified some things. Microsoft made an API specifically designed to work only with their Xbox controllers (XInput), and all their gamepads are standardized, meaning you get the same experience and good support with all of them. Therefore some games chose to support only this interface and screw everything else.

Ironically, I find games running on XInput with x360ce (an XInput-to-DirectInput wrapper) work much better than games that directly support all sorts of gamepads. Thanks to this standardization layer, I can configure/calibrate my gamepad until it closely resembles a real Xbox 360 controller, and after a long time of tweaking, that same tool works across all XInput games.


Then there are gamepads/joysticks that don't resemble the PS3/Xbox controller at all. Their button layout doesn't map well to in-game icons or layouts. Reading "tap button 5" isn't too helpful either, unless the manufacturer pasted a '5' sticker on button #5 (which doesn't happen very often, because they prefer to resemble a PS3 controller and use square, triangle, circle, etc., as if that would somehow fool me into thinking it's not a cheap imitation).

Sadly, the USB drivers don't expect the controller to send a small bitmap for each button number, which we could display to show a circle, a smiley or "L1" instead of saying "button 5".


Finally, a game designer targeting the PC has to make peace with the fact that there's a chance the user doesn't own a gamepad, and design the game for mouse/keyboard, which has no analog inputs and different button combinations that are easier or harder to execute (try a 360° barrel roll with the left, down, right, up keys! Easy to do with a stick).

#5237953 Vulkan is Next-Gen OpenGL

Posted by Matias Goldberg on 01 July 2015 - 10:50 PM


GDC, why can't you release it all for free? Hopefully I can get my uni to pay

They often do with these kinds of slides. Keep an eye on the GDC Vault (free section) and the speakers' websites (usually their employers').


This talk hasn't even been given yet. It's scheduled for August. That link is for buying a pass to attend the session in person.

#5237583 Why no Constant Buffer offsets like for Vertex Buffers?

Posted by Matias Goldberg on 29 June 2015 - 06:00 PM

So on the draw call I could have extra params for the cbuffers.
Well, I see you can have lots of different cbuffers on a shader, but it's nothing a description structure couldn't solve. (Actually, I think an array with offsets for each cbuffer would be enough.)

There are two issues with that:

1. cbuffers often live in the L1 cache or in a special register file (and on older hardware, in actual physical read-only registers). Allowing an offset means they can change per draw. Changing a cbuffer per draw can result in a flush, and flushes cause minor GPU pipeline stalls; each one is tiny, but they can accumulate quickly across many draws.

2. There can be many cbuffers, many of them completely static for the entire frame (e.g. view and projection matrices, fog parameters). So the API would either have to make all of these cbuffers dynamic (which goes against the very purpose they're optimized for) or let you specify which cbuffers will be offset, which implies parsing a structure and validating it (e.g. what if you offset cbuffer4 and there is no cbuffer4?) PER DRAW. This would be expensive in CPU terms.

What you want to achieve can easily be done via the baseInstance parameter and indexing into a huge const or tex buffer array in the shader. Note that baseInstance = 10 means SV_InstanceID is still 0-based, but attaching an extra instanced vertex buffer filled with 0 to 4095 (assuming the limit is 4096 instances per draw) will work around that.
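A CPU-side sketch of the idea (names are illustrative, not from any real API): all per-draw constants live in one big buffer, and each draw only carries a base offset into it instead of rebinding a cbuffer:

```cpp
#include <cstdint>
#include <vector>

// One big buffer of per-draw constants, indexed rather than rebound.
struct DrawConstants
{
    float worldMatrix[12]; // 4x3 affine matrix, three float4 rows
};

// What the shader would conceptually do: fetch its constants using
// baseInstance plus the per-instance id coming from the extra instanced
// vertex buffer (0..4095), since SV_InstanceID itself stays 0-based.
const DrawConstants& fetchConstants( const std::vector<DrawConstants> &allConstants,
                                     uint32_t baseInstance, uint32_t drawId )
{
    return allConstants[baseInstance + drawId];
}
```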

#5237413 What Happens At Render Time When This Occurs ?

Posted by Matias Goldberg on 28 June 2015 - 11:06 PM

JavaScript interpreters are highly dependent on their implementation (what really happens depends on the browser). Some browsers may parse the whole page in one pass, others may parse it in multiple passes, and some could be recursive.


However, without having looked at Chrome's or Firefox's code, it's a good guess that this separation is of little importance, because JavaScript is by nature JIT-compiled: the actual code cannot be compiled or interpreted until execution time due to the lack of type information in the source (e.g. 'var myVariable' could be holding a string, an integer, an array, or some weird object, and what it holds could change the next time the script is executed).

#5237394 Best technique for lighting

Posted by Matias Goldberg on 28 June 2015 - 08:37 PM

Modern tests on Geometry Shaders: yup, they still suck (though I'm hearing reports that GS performance on the GeForce 900 series is significantly better than on the 700 series).


To render sprites, particles, etc. efficiently, use the techniques described in "Vertex Shader Tricks" by Bill Bilodeau (hint: it's neither a Geometry Shader nor instancing).


To render to multiple slices without a Geometry Shader (which automatically makes rendering slower on a lot of hardware), DX12 will have a new feature, VPAndRTArrayIndexFromAnyShaderFeedingRasterizerSupportedWithoutGSEmulation, and in OpenGL there's already an AMD extension.

#5237004 Driver bug? shader fail

Posted by Matias Goldberg on 26 June 2015 - 05:41 PM

I think your sign*pow(eep, 7.0); should be pow(sign*eep, 7.0);

Though using the abs() function would be better.


You're likely getting a NaN somewhere, but could be negative values as well.


Things to watch for (in general):

  • Divisions. 0 / 0 generates a NaN. X / 0 generates an infinity, and infinities can quickly turn into NaNs (e.g. Inf * 0, Inf / Inf, Inf - Inf, etc.). Likewise, X / very_small_number can also generate an Inf.
  • Logarithms. log, ln, log2, log10, etc. The log of 0 or of a negative number returns a NaN.
  • Sqrt. Square roots of negative numbers return NaN.
  • Pow. Pow( 0, 0 ) will return NaN.
  • Pow and Exp. These can quickly turn into infinity with a not-so-large exponent.
  • Loops of large additions, multiplications or subtractions. If the numbers get really big, they can reach infinity, and Infs quickly turn into NaNs.
  • Tan trigonometric function. Tan(90°) returns NaN.
  • Acos, Asin. If you're using these, you need to go back to the books; 99.99% of the time you won't need them.
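Most of these cases can be demonstrated on the CPU with IEEE-754 floats; GPU behavior is analogous, though shader languages leave some of them formally undefined (and note that on the CPU, std::pow(0, 0) returns 1, unlike the shader case listed above):

```cpp
#include <cmath>
#include <limits>

// Each helper triggers one of the NaN/Inf sources from the checklist.
// 'volatile' prevents the compiler from folding the divisions at compile time.
bool makesNaN_DivZeroByZero()  { volatile float z = 0.0f; return std::isnan( z / z ); }
bool makesInf_DivByZero()      { volatile float z = 0.0f; return std::isinf( 1.0f / z ); }
bool makesNaN_InfMinusInf()
{
    const float inf = std::numeric_limits<float>::infinity();
    return std::isnan( inf - inf );
}
bool makesNaN_LogOfNegative()  { return std::isnan( std::log( -1.0f ) ); }
bool makesNaN_SqrtOfNegative() { return std::isnan( std::sqrt( -1.0f ) ); }
```

(These checks assume the default floating-point environment; flags like -ffast-math break isnan/isinf detection.)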

#5236798 SIMD and Compilation

Posted by Matias Goldberg on 25 June 2015 - 02:01 PM

"Limited auto-vectorization" often means pretty much none at all.
In C/C++ compilers you can use intrinsics, as already pointed out.

Alternatively, you can use ISPC, which has a C-like syntax (but it's not C) and compiles into SIMD code for performance-critical paths.
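As an illustration of the intrinsics route, a sketch using SSE2 (with a scalar fallback so it builds anywhere):

```cpp
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Adds two float arrays. With SSE2 available, processes four floats per
// iteration via intrinsics; otherwise falls back to a scalar loop that the
// compiler may or may not auto-vectorize (the "limited auto-vectorization"
// mentioned above).
void addArrays( const float *a, const float *b, float *out, size_t n )
{
    size_t i = 0;
#if defined(__SSE2__)
    for( ; i + 4 <= n; i += 4 )
    {
        const __m128 va = _mm_loadu_ps( a + i );
        const __m128 vb = _mm_loadu_ps( b + i );
        _mm_storeu_ps( out + i, _mm_add_ps( va, vb ) );
    }
#endif
    for( ; i < n; ++i ) // scalar tail (or the whole loop without SSE2)
        out[i] = a[i] + b[i];
}
```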

#5236638 Best technique for lighting

Posted by Matias Goldberg on 24 June 2015 - 05:34 PM

MJP pretty much covered it.

Just to make a few remarks:

My old DX9 approach:
- Shadowmapping was done in old-school way by rendering scene from view of light with some frustum culling.

This isn't old school. This is "current" school.

This all was done per light, so quite CPU heavy.

Shadow mapping per light is quite CPU- and GPU-heavy no matter how you look at it.
DX11 lets you reuse a lot of stuff and thus save CPU (e.g. upload the world matrices to a const or tex buffer once, then reuse that buffer during the shadow casting passes instead of uploading the data every time).
With indirect rendering, in theory you can also perform the frustum culling on the GPU, but you risk shifting the bottleneck too far toward the GPU (what will the CPU do in the meantime?).


But other than that, things haven't changed much. Tiled rendering is just a way to efficiently exploit how GPUs rasterize in order to improve performance when shading. But the fundamental theory behind shadow mapping hasn't changed.

#5236580 massive allocation perf difference between Win8, Win7 and linux.

Posted by Matias Goldberg on 24 June 2015 - 11:26 AM

Without seeing the test's code, it's hard to draw any conclusion.


Most benchmarks out there are completely useless.


Why? Because the implementations are very different (it's not just the compilers). For example, MS' vector implementation grows the vector by 1.5x every time it reaches capacity, while GCC's default implementation grows it by 2.0x. So in a test that performs lots of pushes into a vector without reserving the space beforehand, GCC will always be the clear winner, because it over-allocates compared to MS' implementation and therefore performs fewer reallocations. That doesn't mean it will perform as well in a real-world case.
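The reallocation effect is easy to observe; a small sketch that counts how often push_back triggers a reallocation, with and without an up-front reserve:

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the vector's storage moves while push_back'ing n
// elements. The growth factor (1.5x on MSVC vs 2.0x on libstdc++) determines
// this count, so a push-heavy benchmark partly measures the growth policy
// rather than raw allocator speed.
size_t countReallocations( size_t n, bool reserveUpFront )
{
    std::vector<int> v;
    if( reserveUpFront )
        v.reserve( n );
    size_t reallocations = 0;
    const int *lastData = v.data();
    for( size_t i = 0; i < n; ++i )
    {
        v.push_back( static_cast<int>( i ) );
        if( v.data() != lastData ) // storage moved: a reallocation happened
        {
            ++reallocations;
            lastData = v.data();
        }
    }
    return reallocations;
}
```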


Likewise, map implementations are optimized for different things (lookup, traversal, erasures, insertions, runtime error correction / safety checks). But even if we compare all that, implementations can be optimized for large structures or short structures.

An implementation optimized for small structures means std::map<int, int> will perform faster, while an implementation optimized for large structures means std::map<MyStruct, AnotherStruct>, where sizeof(MyStruct) and/or sizeof(AnotherStruct) is large (e.g. bigger than 64 bytes), will perform faster.


Last but not least: yes, it's not hard to write an implementation that beats the one provided by your compiler's tool suite for your particular need. These implementations are coded first for correctness and safety. Then, if there's time or if a lot of devs are demanding it, they get optimized.

Until recently, GCC's std::shared_ptr was using a big fat mutex; the latest version uses atomic/interlocked instructions. It's really hard to get the latter correct, so it needs a lot of planning and a lot of testing. Likewise, MS' std::chrono implementation leaves much to be desired. All of this happens because, like everyone else, standard library writers are hit by time constraints and deadlines; and their libraries are generic, having to work for everybody without crashing or exhibiting incorrect behavior.

This goes against the common wisdom that the STL is unbeatable and you can always trust it. Trust it? Yes. Fastest? Most game developers sooner or later learn the hard way that this is not always true. There's a reason EA created the EASTL library many years ago.

#5236084 Hardware skinning and running out of uniforms

Posted by Matias Goldberg on 21 June 2015 - 07:02 PM

Ugh, apparently you're right. It's 256 vec4s, aka 1024 floats.


Anyway, if you bind yourself to the minimum guarantees, you will be extremely limited in everything, because those minimums are for 10+ year-old hardware (the last desktop hardware NVIDIA shipped with a 256 constant register limit was the GeForce 7000 series; the last from AMD was the ATI Radeon X1000 series; these are very old. The exception is Intel, but Intel's GL drivers haven't been great for their >3-year-old hardware anyway).


So what I often do is go to http://opengl.delphigl.de/ and check the most common value reported by the minimum hardware I want to target, instead of relying on the spec's minimum. For example, you can look up the GL_MAX_VERTEX_UNIFORM_VECTORS value (at the top there's an option to list all entries at once).


Beware that the site shows what the driver reported (drivers aren't bug-free; they might report wrong values from time to time). A value of 0 probably means the driver doesn't support GLSL shaders. You can also see that very modern hardware, "AMD Radeon HD 7700 Series 4.2.12172 Compatibility Profile Context", returns a very low value (256), but if you look closer it's just using a very old driver; the same card with newer drivers reports bigger values. Also remember to check the OS the value was reported from (e.g. OS X is usually significantly behind the other OSes).


Looking at the reports, assuming a minimum of 1024, checking whether the reported value is below 1024, and showing a message saying "update your video card drivers" should be enough, unless OS X is critical to you.

#5236073 Hardware skinning and running out of uniforms

Posted by Matias Goldberg on 21 June 2015 - 04:44 PM

1. GLSL guarantees a minimum of 1024 vec4s, not 1024 floats. That gives you plenty of space.


2. World matrices for skinning can be represented as 4x3 matrices (three vec4s), chopping off a fourth of the memory requirements.


3. You can interpolate the matrices on the CPU. Doing it in the vertex shader means you interpolate them again for every vertex, potentially wasting more processing power than doing it once on the CPU. Ideally, use a compute shader to interpolate the matrices, then feed the already-interpolated matrices to the vertex shader.


4. UBOs have a guaranteed minimum of 16 KB, but the most common size is 64 KB (the minimum D3D10 mandates) or higher (except on Intel cards, where support is between 16 and 32 KB). Note that 1024 vec4s = 16 KB; that's not a coincidence.


5. If you need much more, TBOs have much larger storage capacity.
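Point 2 above can be sketched as follows: pack the affine 4x4 into three float4 rows and reconstruct by assuming the implicit (0, 0, 0, 1) bottom row (the struct layout here is illustrative):

```cpp
// A row-major 4x4 matrix stored as 16 contiguous floats.
struct Float4x4 { float m[16]; };

// Packs a row-major 4x4 affine world matrix into three float4 rows (4x3),
// dropping the constant (0, 0, 0, 1) bottom row.
void packAffine4x3( const Float4x4 &in, float out[12] )
{
    for( int row = 0; row < 3; ++row )
        for( int col = 0; col < 4; ++col )
            out[row * 4 + col] = in.m[row * 4 + col];
}

// The shader (or CPU) reconstructs the full transform knowing the last
// row is implicitly (0, 0, 0, 1).
Float4x4 unpackAffine4x3( const float in[12] )
{
    Float4x4 out = {};
    for( int i = 0; i < 12; ++i )
        out.m[i] = in[i];
    out.m[15] = 1.0f;
    return out;
}
```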

#5235726 Is OpenGL enough or should I also support DirectX?

Posted by Matias Goldberg on 19 June 2015 - 11:46 AM

I can tell you my experience:


I thought that supporting OpenGL would be enough. It wasn't. But it depends on what you're aiming for. Our goal was to keep compatibility with DX10 hardware while still taking advantage of DX11/12 hardware features.


There are several reasons a GL-only approach didn't work for us:

  1. Both NV & AMD stopped shipping drivers for anything below the GeForce 400 / Radeon HD 5000. Their OpenGL driver versions are outdated (especially AMD's, which cut off driver support 6 months sooner than NV); and if they've got a bug, or don't support a specific extension you need (even extensions that would normally work on DX10 hardware), don't hope it will get fixed or added in a later release. In contrast, D3D drivers have historically been more stable, more thoroughly tested and feature-complete. As a result, supporting these cards via D3D is easy.
  2. Intel GL drivers sucked for a very long time, and their definition of "old hardware" is anything shipped more than a year ago. Technically that's the same problem as NV & AMD, just aggravated. But Intel's GL drivers have only become decent for the HD 4400 and newer (in a driver release that was literally just a few months ago); some of these cards are DX11-level hardware (not just DX10), and Intel's market share is so large and their deprecation window so strict and narrow that it deserves its own point.
  3. There are so many variables that come into play, that sometimes D3D will get you higher performance, and sometimes OpenGL will get you higher performance. It depends on the driver, GPU involved, API calls you've made, and shader syntax you've used. Supporting both lets your users decide which one runs better for them.


Of course this is all about your particular case. If you don't have the resources (time) or don't feel you're experienced enough to support both APIs at the same time; or you only care about bleeding edge hardware and software; then you can focus your efforts into just one API.

#5235387 Animation takes up too much memory

Posted by Matias Goldberg on 17 June 2015 - 09:55 PM

In addition to L. Spiro's advice: exporters often have a setting for sampling rates. Higher sampling rates often result in higher fidelity during playback vs. how the animation looks in the modelling application (especially important when doing IK or other procedural animation), while lower sampling rates use far less memory.
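A back-of-the-envelope sketch of the memory trade-off (assuming one 4x3 float matrix per bone per sampled frame; engines storing quaternion + translation + scale would use less):

```cpp
#include <cstddef>

// Estimates the memory cost of a sampled animation clip:
// bones * sampled frames * bytes per keyframe.
size_t animationBytes( size_t numBones, double seconds, double samplesPerSecond )
{
    const size_t bytesPerKey = 12 * sizeof( float ); // 4x3 matrix = 48 bytes
    const size_t frames = static_cast<size_t>( seconds * samplesPerSecond );
    return numBones * frames * bytesPerKey;
}
```

Halving the sampling rate halves the memory, which is exactly the knob the exporter setting controls.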

#5235139 Vertex Buffer overrun

Posted by Matias Goldberg on 16 June 2015 - 10:10 AM

There's a lot of information missing.

sizeof(CUSTOMVERTEX) could be wrong.

The vertex buffer could be smaller than sizeof(CUSTOMVERTEX) * numVerts;

Also, whatever happens inside addVertex could be causing stack corruption. There are too many unknowns to say.
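As an illustration, a hypothetical bounds-checked addVertex (names invented, not from the original code) that would catch both the wrong-sizeof and too-small-buffer cases before the write happens:

```cpp
#include <cstddef>
#include <cstring>

// Illustrative vertex layout; the real CUSTOMVERTEX may differ.
struct CustomVertex { float x, y, z; unsigned color; };

// Refuses to write past the end of the locked buffer instead of overrunning.
bool addVertex( void *lockedPtr, size_t bufferBytes,
                size_t vertexIndex, const CustomVertex &v )
{
    const size_t offset = vertexIndex * sizeof( CustomVertex );
    if( offset + sizeof( CustomVertex ) > bufferBytes )
        return false; // would overrun: sizeof mismatch or buffer too small
    std::memcpy( static_cast<char*>( lockedPtr ) + offset, &v, sizeof( v ) );
    return true;
}
```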




I don't know if that's the only issue, but locking a buffer with D3DLOCK_DISCARD and then reading from it is not a valid operation.

Fun fact: It actually was valid in Windows XP and some games relied on the flag doing nothing (and then they broke on Vista/7/8 and I had to make a proxy DLL to play them).


It was invalid then, and it's still invalid now. The fact that it "just worked" doesn't mean it was valid. Many games violate the API rules.