
Matias Goldberg


#5290579 RenderDoc (0.28) not properly capturing output

Posted by Matias Goldberg on 07 May 2016 - 03:19 PM

I don't see you issuing a clear (which is a huge red flag unless you're skipping it on purpose and know what you're doing).

Perhaps you need to enable RenderDoc's save initials setting.


RenderDoc also allows you to check the entire pipeline, see the outputs of the VS, and even debug the VS and PS shaders. Have you tried that?

There's also a pixel history log that will tell you why a pixel is of that colour (e.g. it was cleared, then set to red by a pixel shader, then a later pixel shader write was rejected by the depth test, etc.).

#5289634 Why does GLSL use integers for texture fetches?

Posted by Matias Goldberg on 01 May 2016 - 05:00 PM

TBH I thought it was a bad call. And I still think it is.
However, I found one instance where the fetch coordinate being an int instead of a uint was useful: clamp-to-edge emulation.
I needed my fetches to clamp to the edge, so typical code would look like this:

ivec2 xy = some_value - another_value;
// Clamp to the valid texel range [0, resolution - 1]; texelFetch also needs an explicit mip level.
xy = clamp( xy, ivec2( 0 ), textureResolution.xy - 1 );
float val = texelFetch( myTex, xy, 0 ).x;

This code would not work as intended if "xy" were a uvec2, because values below 0 would wrap around and hence be clamped to the far edge (textureResolution - 1) instead of to 0. It would be the same as doing xy = min( xy, textureResolution.xy - 1 );

However, I'm like Hodgman: I prefer unsigned integers because we're addressing memory here, and negative memory makes no sense, and I prefer assert( x < elemSize ) over assert( x >= 0 && x < elemSize );
The case I'm talking about (clamp to edge) can simply be solved through explicit casts. IMO ints here bring more trouble than benefits.
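
For illustration, here is a minimal C++ sketch (not from the post; the values and names are made up) showing the same wrap-around pitfall and the explicit-cast workaround:

#include <algorithm>
#include <cstdio>

int main()
{
    const unsigned texWidth = 256u;

    // Unsigned underflow: 2u - 5u wraps around to a huge value...
    unsigned xu = 2u - 5u;
    // ...so clamping against the resolution snaps to the far edge, not to 0.
    unsigned clampedU = std::min( xu, texWidth - 1u );

    // Doing the subtraction in signed space and clamping before casting back
    // to unsigned gives the intended clamp-to-edge behaviour.
    int xi = 2 - 5;
    unsigned clampedI = (unsigned)std::clamp( xi, 0, (int)texWidth - 1 );

    printf( "unsigned: %u, signed then cast: %u\n", clampedU, clampedI ); // 255 vs 0
    return 0;
}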

#5289488 Use Buffer or Texture, PS or CS for GPU Image Processing?

Posted by Matias Goldberg on 30 April 2016 - 05:44 PM

I have researched this very recently, so it's fresh in my memory:


For large kernels (kernel_radius > 4, which means > 9 taps per pixel), Compute Shaders outperform Pixel Shaders, and as the kernel grows the difference reaches up to 100% on my AMD Radeon HD 7770.


However, you need to be careful about the CS method because maximizing throughput isn't easy.

"Efficient Compute Shader Programming" from Bill Bilodeau describes several ways on maximizing throughput, and GPUOpen has a free SeparableFilter11 implementation of the techniques described there with full source code and a demo with lots of knots to tweak and play with.


As for Buffer vs Texture: like you said, a linear layout is great for the horizontal pass but terrible for the vertical pass, so Textures perform better. Also, if you end up sampling this texture later on in non-linear patterns (or need something other than point filtering), a Texture is usually a win.
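
To make the memory-access argument concrete, here's a tiny C++ sketch (hypothetical names, no bounds checking) of where the taps of each pass land in a linear buffer:

// Horizontal pass: neighbouring taps sit at consecutive addresses.
float horizontalTap( const float *linearBuffer, int width, int x, int y, int offset )
{
    return linearBuffer[y * width + (x + offset)];
}

// Vertical pass: every tap jumps a full row ('width' floats) in memory.
float verticalTap( const float *linearBuffer, int width, int x, int y, int offset )
{
    return linearBuffer[(y + offset) * width + x];
}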


You may want to look into addressing the texture from the Compute Shader in Morton order to match the texture's Morton/tiled layout and hence improve speed when possible, but I haven't looked into that.


And of course, on D3D12/Vulkan, a Compute Shader based solution means an opportunity for Async Shaders which can increase speed on AMD's GCN, or decrease it on NVIDIA.

#5289098 C++ CLI or native for game engine

Posted by Matias Goldberg on 28 April 2016 - 09:50 AM

Clarifying so what Josh Petrie said makes sense:

You're confusing the CRT (C Run-Time) with CLI (Common Language Infrastructure).


C++/CLI is a weird hybrid between C++ & C# meant for interoperating between those two languages.

CRT is native and provides common basic C functionality such as malloc, free, strcpy, etc. When you see projects linked to the VC Runtime, they're linking to the CRT.

#5289027 D3D12 / Vulkan Synchronization Primitives

Posted by Matias Goldberg on 27 April 2016 - 08:20 PM

I fail to see how:

waitOnFence( fence[i] );

is any different from:

waitOnFence( fence, i );

Yes, the first one might require more "malloc" (I'm not speaking in the C malloc sense, but rather in "we'll need more memory somewhere") assuming the second version doesn't have hidden overhead.


However, since you shouldn't have much more than ~10 fences (3 for triple buffering + 6 for overall synchronization across those 3 frames + 1 for streaming), memory usage becomes irrelevant. If you are calling "waitOnFence(...)" (which has a high overhead) more than 1-3 times per frame, you're probably doing something wrong and it will likely begin to show up in GPUView (unless you have carefully calculated why you are fencing more than the norm and it makes sense for what you're doing).


Btw, you can emulate DX12's style in Vulkan (assuming you have a maximum limit on what the wait value can be) with something like:

class MyFence
{
    // Vulkan backend: one VkFence per possible wait value (hence the known max).
    VkFence      m_fences[N];
    // D3D12 backend: a single ID3D12Fence covering all values.
    ID3D12Fence *m_fence;

public:
    MyFence( uint32_t maxN );

    void wait( uint32_t value );
};

"due to creating 1 fence per ExecuteCommandLists"


Ewww. Why would you do that?

Fence once per frame like Hodgman said. The only exceptions are syncing with the compute & copy queues (but keep the waits to a minimum).
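
A rough C++/D3D12 sketch of the "fence once per frame" idea (the variable names are mine, and it assumes the queue, fence and event were created elsewhere):

#include <windows.h>
#include <d3d12.h>
#include <cstdint>

static const uint32_t kNumFrames = 3;
static UINT64 g_frameFenceValue[kNumFrames] = {};
static UINT64 g_nextFenceValue = 1;

// Call once at the end of each frame, after submitting that frame's command lists.
void signalFrameEnd( ID3D12CommandQueue *queue, ID3D12Fence *fence, uint32_t frameIdx )
{
    queue->Signal( fence, g_nextFenceValue );
    g_frameFenceValue[frameIdx] = g_nextFenceValue++;
}

// Call before reusing frameIdx's resources, kNumFrames frames later.
void waitForFrame( ID3D12Fence *fence, HANDLE fenceEvent, uint32_t frameIdx )
{
    // Only stall the CPU if the GPU hasn't reached that frame's value yet.
    if( fence->GetCompletedValue() < g_frameFenceValue[frameIdx] )
    {
        fence->SetEventOnCompletion( g_frameFenceValue[frameIdx], fenceEvent );
        WaitForSingleObject( fenceEvent, INFINITE );
    }
}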

#5288637 GPL wtf?

Posted by Matias Goldberg on 25 April 2016 - 01:10 PM

A quick read of the SFC post shows quite a different view.


From their perspective, it's not the GPL, but rather that the CDDL license forbids distributing their software linked with software that can't be covered by the CDDL (such as the GPL).


I guess "GPL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the GPL"" calls more the attention than "CDDL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the CDDL".


So to your question "So, since it is against the GPL to combine non-GPL stuff with the Linux kernel, is Valve in violation of the GPL?": No, because Valve isn't saying the Linux kernel shouldn't be GPL when distributing SteamOS with their own software, while ZFS's license says the Linux kernel can't be GPL if ZFS is included in binary form.

At least, from SFC's rationale being discussed here.

#5287708 Compiling HLSL to Vulkan

Posted by Matias Goldberg on 19 April 2016 - 07:37 PM

There is an hlsl-frontend for compiling HLSL to SPIR-V, but I don't know what state it's in.

#5287257 Trying to understand normalize() and tex2D() with normal maps

Posted by Matias Goldberg on 16 April 2016 - 10:45 PM

JTippetts explained the tex2D part, and as he said, normalize() converts a vector into a unit-length vector. Note that if a vector is already unit length, then the result after normalizing should be exactly the same vector (in practice, give or take a few bits due to floating point precision issues).


In a perfect world, the normalize wouldn't be needed. However it is needed because:

  • There's no guarantee the normal map contains unit-length data. For example, if the texture is white, the vector after decoding from tex2D will be (1, 1, 1). The length of such a vector is 1.7320508; hence it's invalid for our needs. After normalization it becomes (0.57735, 0.57735, 0.57735), which points in the same direction but has a length of 1.
  • If the fetch uses bilinear, trilinear or anisotropic filtering, the result will likely not be unit length. For example, fetching right in the middle between ( -0.7071, 0.7071, 0 ) and ( 0.7071, 0.7071, 0 ), which are both unit-length vectors, will result in the interpolated vector ( 0, 0.7071, 0 ), which is not unit length. After normalization it becomes (0, 1, 0), which is the correct vector.
  • 8-bit precision issues. The vector ( 0.7071, 0.7071, 0 ) will translate to the colour (218, 218, 128), since 218 is the closest match to 217.8. When converted back to floating point, it's ( 0.70866, 0.70866, 0 ), which is slightly off. That may not sound like much, but it can create very annoying artifacts. Normalization helps in this case (see the sketch below).
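
A small C++ sketch of that third point, reproducing the post's numbers with one common signed 8-bit convention (the exact encoding varies; this is just for illustration):

#include <cmath>
#include <cstdio>

int main()
{
    // Encode a unit-length component into 8 bits (value * 127 + 128 convention).
    const float original = 0.7071f;
    const int stored = (int)std::round( original * 127.0f + 128.0f );  // 217.8 -> 218

    // Decode it back: the value is slightly off, so the vector is no longer
    // exactly unit length and needs to be renormalized in the shader.
    const float decoded = ( stored - 128 ) / 127.0f;                   // -> ~0.70866

    printf( "stored = %d, decoded = %f\n", stored, decoded );
    return 0;
}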

#5286207 Nothing renders in windowed mode on Windows 10 with dedicated Nvidia card

Posted by Matias Goldberg on 10 April 2016 - 05:00 PM

The "screeching noise" sounds like coil whine / squeaking. Does it sound like this or like this?

If so, this usually (but not always) means your card is drawing near maximum power.


Coil whine is considered harmless to your hardware, though some people believe that if you hear coil noise there are strong vibrations, and strong vibrations mean gradual wear and tear over time (i.e. a shortened lifespan); thus it's often advised to reduce the amount of time your GPU spends whining, just in case this turns out to be more than a myth.

#5286035 D3D alternative for OpenGL gl_BaseInstanceARB

Posted by Matias Goldberg on 09 April 2016 - 10:59 AM

In the article mentioned above (slides 35, 36), they explain how MultiDrawIndirect could be used to render multiple meshes with a single CPU call.
In particular, they utilize baseInstance from the DrawElementsIndirect struct to encode transform and material indices for the mesh.
Then, this data is exposed in the vertex shader as the gl_BaseInstanceARB variable and used to fetch the transform for the specific mesh.

Actually... MDI (MultiDrawIndirect) didn't expose gl_BaseInstanceARB. It was added later. So you may find old drivers having MDI but without gl_BaseInstanceARB.

The best solution/workaround, which works very well everywhere, is to create a vertex buffer with 4096 entries and bind it as instanced data with a frequency of 1.
Thus for each instance you get 0, 1, 2, 3, 4, 5, ..., 4095 (same as SV_InstanceID/gl_InstanceID), but with the added bonus that it starts from baseInstance instead of starting from 0.


Obviously you create this buffer once, since it can be reused anywhere.

The only caveat is that you can't use more than 4096 instances per draw. Why 4096? It's a fairly arbitrary choice, but it's small enough to fit in the cache (the whole buffer is just 16kb) and big enough that it doesn't matter whether you issue two DrawPrimitive calls of 4096 instances each instead of one DP call of 8192 instances.


This is what we do in Ogre 2.1 and it works wonders for us. We do this for OpenGL as well, to avoid having to deal with drivers not having gl_BaseInstanceARB (and also to keep D3D11 & GL more consistent).
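
A rough D3D11/C++ sketch of that setup (the names are mine and error handling is omitted; this is not Ogre's code):

#include <d3d11.h>
#include <cstdint>
#include <vector>

// Creates a 4096-entry "drawId" vertex buffer containing 0..4095, to be bound
// as per-instance data with a step rate of 1 so the shader effectively sees
// baseInstance + instance index.
ID3D11Buffer* createDrawIdBuffer( ID3D11Device *device )
{
    std::vector<uint32_t> drawIds( 4096 );
    for( uint32_t i = 0; i < 4096u; ++i )
        drawIds[i] = i;

    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = 4096 * sizeof( uint32_t );   // 16kb total
    desc.Usage     = D3D11_USAGE_IMMUTABLE;
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA initData = {};
    initData.pSysMem = drawIds.data();

    ID3D11Buffer *buffer = 0;
    device->CreateBuffer( &desc, &initData, &buffer );
    return buffer;
}

// Matching input layout entry ("DRAWID" is a made-up semantic, slot 1 assumed):
// { "DRAWID", 0, DXGI_FORMAT_R32_UINT, 1, 0,
//   D3D11_INPUT_PER_INSTANCE_DATA, 1 /*InstanceDataStepRate*/ }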

#5286032 C++ best way to break out of nested for loops

Posted by Matias Goldberg on 09 April 2016 - 10:50 AM

Put it in the loop statement rather than use break.

for( a = a_init; a != a_end && !condition; a = a_next )
  for( b = b_init; b != b_end && !condition; b = b_next )

Or break the algorithm out into its own function and use return.

A billion times this. Putting the extra condition in the loop is clean and easy to understand.

If you've got a deeply nested loop, you may have to rethink what you're doing instead of blaming the language. If it's impossible to refactor, breaking the algorithm into several functions is a perfectly valid strategy. It gets easier to grok, and may even reveal more about the dependencies thanks to the function arguments being passed explicitly.
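
As a concrete (made-up) example of the "extract into a function and use return" approach:

#include <cstddef>

// Hypothetical helper: searching a 2D grid, extracted into its own function so
// that 'return' replaces a multi-level break.
bool findValue( const int *grid, size_t width, size_t height, int target,
                size_t &outX, size_t &outY )
{
    for( size_t y = 0; y < height; ++y )
    {
        for( size_t x = 0; x < width; ++x )
        {
            if( grid[y * width + x] == target )
            {
                outX = x;
                outY = y;
                return true;    // exits both loops at once
            }
        }
    }
    return false;
}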

#5285465 In terms of engine technology, what ground is left to break?

Posted by Matias Goldberg on 06 April 2016 - 12:30 PM


Even my small-time engine completely hands-down destroys UE4 and Unity when it comes to rendering efficiency, but they win on tools and ease of use. We had a situation recently where an artist accidentally exported a model as 2000 sub-meshes, in a way where the tools weren't able to re-merge them into a single object. That would cause a serious performance issue in unity, but in our engine we didn't notice because it was chewing through 2000 sub-meshes per millisecond...

Why do you think UE does that so much slower?


Unity is really slow too. Not just UE4.

Looping through a bunch of meshes, binding buffers and shaders, shouldn't leave room for extreme overhead?


And within your phrase lies the answer. A modern D3D11 / GL4 engine only needs to map buffers once (unsynchronized access FTW) and bind buffers once. Updating the buffers can be done in parallel. Texture arrays also allow binding textures only a couple of times per frame.
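
A rough C++ sketch of the "map once" idea using GL4 persistent mapping (assumes a GL 4.4+ context; the names are mine and the per-frame fencing is left out):

#include <GL/glcorearb.h>   // plus your GL loader of choice for the GL 4.4 entry points

GLuint g_buffer = 0;
void  *g_persistentPtr = 0;

void createPersistentBuffer( GLsizeiptr size )
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT;
    glGenBuffers( 1, &g_buffer );
    glBindBuffer( GL_ARRAY_BUFFER, g_buffer );
    // Immutable storage, mapped once for the lifetime of the buffer.
    glBufferStorage( GL_ARRAY_BUFFER, size, 0, flags );
    g_persistentPtr = glMapBufferRange( GL_ARRAY_BUFFER, 0, size, flags );
    // Worker threads can now write per-draw data straight into g_persistentPtr
    // (with your own fencing so you never overwrite data the GPU is still reading).
}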


A GLES2 engine (necessary for mainstream Android support) requires changing shader parameters per draw and per material, binding buffers per sub-mesh, and binding textures per material. Oh, and all of this must be done from the main thread.


Unless they put a huge amount of resources into maintaining two completely different back ends, the least common denominator dictates that GLES2-like performance will limit all the other platforms. And even if you do write completely different back ends, they're so different that it will affect the design of your front end one way or another, still limiting the potential.


Like I said earlier, supporting so many platforms comes at a cost.


And that's without considering that an engine which is cache friendly, SIMD-ready, and built on Data Oriented Design principles will beat the heck out of an engine that didn't put extra care into data contiguity or SIMD-friendliness. Branches can kill your performance.

#5284868 In terms of engine technology, what ground is left to break?

Posted by Matias Goldberg on 03 April 2016 - 09:37 AM

Can I say performance?

Unity and UE4 are very flexible, very generic, friendly, very powerful & very portable. But this has come at the price of performance, where they pale against a custom-tailored engine (by a large margin; I'm talking between a 4x & 10x difference).

#5284297 Fragmentation in my own memory allocator for GPU memory

Posted by Matias Goldberg on 30 March 2016 - 11:09 AM

1. You can make your own defragmenter. An allocation is just an offset and a size. The defragmenter only needs to update those two for every allocation and memmove the contents from the old region to the new region. Just mark the offset and size private so this data isn't saved somewhere else where it could go out of sync; it must be queried from the allocation every time you need it.

2. Millions of allocations don't sound realistic to me.

3. At the higher level, batch allocations of similar size (e.g. don't alternate a 2048x2048 texture with a 128x128 texture followed by another 2048x2048 texture).

4. Make sure when you've freed two contiguous regions, you properly merge them into one. This is very easy to get wrong (see the sketch after this list).

5. Allocate smaller pools, and when you've run out of space in the pool, create another one. You don't have to have just ONE pool; you can have a few.
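
A minimal C++ sketch of point 4 (a hypothetical structure, not the allocator from this thread): free blocks keyed by offset, so that freeing a region merges it with its left and right neighbours when they touch.

#include <cstdint>
#include <map>

typedef std::map<uint64_t, uint64_t> FreeBlocks;   // offset -> size

void freeRegion( FreeBlocks &freeBlocks, uint64_t offset, uint64_t size )
{
    // Merge with the free block that ends exactly where this one starts.
    FreeBlocks::iterator next = freeBlocks.lower_bound( offset );
    if( next != freeBlocks.begin() )
    {
        FreeBlocks::iterator prev = next;
        --prev;
        if( prev->first + prev->second == offset )
        {
            offset = prev->first;
            size += prev->second;
            freeBlocks.erase( prev );
        }
    }

    // Merge with the free block that starts exactly where this one ends.
    if( next != freeBlocks.end() && offset + size == next->first )
    {
        size += next->second;
        freeBlocks.erase( next );
    }

    freeBlocks[offset] = size;
}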

#5284145 N64, 3DO, Atari Jaguar, and PS1 Game Engines

Posted by Matias Goldberg on 29 March 2016 - 06:15 PM

The first "commercial game engine" that comes to my mind that supported multiple platforms and used by several AAA titles that remotely gets close to the modern concept of game engines was RenderWare. It wasn't even a game engine, it was a rendering engine. And it wasn't for PS1 / N64 gen.


Licensing it cost several tens of thousands of dollars AFAIK. Wikipedia has a list of its competitors (Unreal Engine & Frostbite). It doesn't matter who came first; none of them were around for the era you're looking for because, like others already explained, everything was handcrafted and kept in house, with studios occasionally licensing their tech to other studios.