Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Today, 05:17 PM

#5289027 D3D12 / Vulkan Synchronization Primitives

Posted by Matias Goldberg on 27 April 2016 - 08:20 PM

I fail to see how:

waitOnFence( fence[i] );

is any different from:

waitOnFence( fence, i );

Yes, the first one might require more "malloc" (I'm not speaking in the C malloc sense, but rather in "we'll need more memory somewhere") assuming the second version doesn't have hidden overhead.


However since you shouldn't have much more than ~10 fences (3 for triple buffer + 6 for overall synchronization across those 3 frames + 1 for streaming) memory usage becomes irrelevant. If you are calling "waitOnFence(...)" (which has a high overhead) more than 1-3 times per frame you're probably doing something wrong and it will likely begin to show up in GPUView (unless you have carefully calculated why you are fencing more than the norm and it makes sense for what you're doing).


Btw you can emulate DX12's style in Vulkan with (assuming you have a max limit of what the waiting value will be):

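class MyFence
{
       VkFence m_fence[N];      // Vulkan: one fence per value you may wait on
       // D3D12Fence m_fence;   // D3D12 counterpart: a single fence with a counter
       MyFence( uint maxN );

       void wait( uint value );
};

due to creating 1 fence per ExecuteCommandLists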


Ewww. Why would you do that?

Fence once per frame like Hodgman said. Only exceptions are sync'ing with compute & copy queues (but keep the waits() to a minimum).

#5288637 GPL wtf?

Posted by Matias Goldberg on 25 April 2016 - 01:10 PM

A quick read of the SFC post shows quite a different view.


From their perspective, it's not the GPL, but rather that the CDDL license forbids distributing their software linked with software that can't be covered by the CDDL (such as the GPL).


I guess "GPL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the GPL" draws more attention than "CDDL Violations Related to Combining ZFS and Linux" or "Canonical accused of violating the CDDL".


So to your question "So, since it is against the GPL to combine non-GPL stuff with the Linux kernel, is Valve in violation with the GPL?". No, because Valve isn't saying the Linux kernel shouldn't be GPL when distributing their SteamOS with their own software, while ZFS's license says the Linux kernel can't be GPL if ZFS is included in binary form.

At least, from SFC's rationale being discussed here.

#5287708 Compiling HLSL to Vulkan

Posted by Matias Goldberg on 19 April 2016 - 07:37 PM

There is an hlsl-frontend to compile HLSL to SPIR-V, but I don't know what state it is in.

#5287257 Trying to understand normalize() and tex2D() with normal maps

Posted by Matias Goldberg on 16 April 2016 - 10:45 PM

JTippetts explained the tex2D part and, as he said, normalize() converts a vector into a unit length vector. Note that if a vector is already unit length, then the result after normalizing should be exactly the same vector (in practice, give or take a few bits due to floating point precision issues).


In a perfect world, the normalize wouldn't be needed. However it is needed because:

  • There's no guarantee the normal map contains unit-length data. For example, if the texture is white, the vector after decoding from tex2D will be (1, 1, 1). The length of such a vector is 1.7320508; hence it's invalid for our needs. After normalization it will result in (0.57735, 0.57735, 0.57735), which points in the same direction but has a length of 1.
  • If the fetch uses bilinear, trilinear or anisotropic filtering, the result will likely not be unit length. For example fetching right in the middle between ( -0.7071, 0.7071, 0 ) and ( 0.7071, 0.7071, 0 ) which are both unit length vectors will result in the interpolated vector ( 0, 0.7071, 0 ); which is not unit length. After normalization it will result in (0, 1, 0) which is the correct vector.
  • 8-bit precision issues. The vector ( 0.7071, 0.7071, 0 ) will translate to colours: (218, 218, 128) since 218 is the closest match to 217.8. When converted back to floating point, it's (0.70866, 0.70866, 0 ) which is slightly off. May not sound much, but it can create very annoying artifacts. Normalization helps in this case.
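
The three cases above can be checked numerically. This is a minimal C++ sketch, assuming the byte encoding is v * 127 + 128 (which is what the 217.8 figure implies). Vec3, length, normalize, midpoint, encode8 and decode8 are illustrative names, not taken from any particular library.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float length(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// Rescale to unit length; the direction is preserved.
inline Vec3 normalize(const Vec3& v) {
    const float len = length(v);
    assert(len > 0.0f && "cannot normalize a zero-length vector");
    return { v.x / len, v.y / len, v.z / len };
}

// Halfway point between two vectors, which is what a filtered fetch
// exactly between two texels would return.
inline Vec3 midpoint(const Vec3& a, const Vec3& b) {
    return { (a.x + b.x) * 0.5f, (a.y + b.y) * 0.5f, (a.z + b.z) * 0.5f };
}

// Assumed 8-bit encoding: [-1, 1] maps to [1, 255], with 128 meaning zero.
inline int encode8(float c) {
    return static_cast<int>(std::lround(c * 127.0f + 128.0f));
}
inline float decode8(int b) {
    return (static_cast<float>(b) - 128.0f) / 127.0f;
}
```

With these helpers, normalize(midpoint(a, b)) for the two diagonal vectors gives (0, 1, 0), and decode8(encode8(0.7071f)) gives roughly 0.70866, which normalize() then pulls back to unit length.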

#5286207 Nothing renders in windowed mode on Windows 10 with dedicated Nvidia card

Posted by Matias Goldberg on 10 April 2016 - 05:00 PM

The "screeching noise" sounds like coil whine / squeaking. Does it sound like this or like this?

If so, this usually (but not always) means your card is drawing near maximum power.


Coil whine is considered harmless to your hardware, though some people believe that if you hear coil noise there are strong vibrations, and strong vibrations mean gradual wear and tear over time (i.e. a shortened lifespan). Thus it's often advised to reduce the amount of time your GPU spends whining, just in case this turns out to be more than a myth.

#5286035 D3D alternative for OpenGL gl_BaseInstanceARB

Posted by Matias Goldberg on 09 April 2016 - 10:59 AM

In the article mentioned above (slides 35, 36), they explain how MultiDrawIndirect could be used to render multiple meshes with a single CPU call.
In particular, they utilize baseInstance from DrawElementsIndirect struct to encode transform and material indices for the mesh.
Then, this data is exposed in the vertex shader as gl_BaseInstanceARB variable and used to fetch transform for the specific mesh.

Actually... MDI (MultiDrawIndirect) didn't expose gl_BaseInstanceARB. It was added later. So you may find old drivers having MDI but without gl_BaseInstanceARB.

The best solution/workaround which works everywhere very well is to create a vertex buffer with 4096 entries, and bind it as instanced data with a frequency of 1.
Thus for each instance you get 0, 1, 2, 3, 4, 5, ..., 4095 (same as SV_InstanceID/gl_InstanceID) but with the added bonus that it starts from baseInstance instead of starting from 0.


Obviously you create this buffer once, since it can be reused anywhere.

The only caveat is that you can't use more than 4096 instances. Why 4096? It's quite random, but it's small enough to fit in a cache (the whole buffer is just 16 KB, i.e. 4096 entries of 4 bytes) and big enough to not matter whether you have two DrawPrimitive calls of 4096 instances each instead of one DP call of 8192 instances.
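
The CPU side of this workaround can be sketched as follows. This is only an illustration of the idea, not Ogre's actual code: makeInstanceIdBuffer, drawIdSeenByShader and splitIntoDraws are hypothetical names, and the graphics API calls that upload and bind the buffer are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

constexpr std::uint32_t kMaxInstances = 4096; // 4096 * 4 bytes = 16 KB

// Contents of the per-instance vertex buffer: 0, 1, 2, ..., 4095.
// Built once, then bound as instanced data with a step rate of 1.
std::vector<std::uint32_t> makeInstanceIdBuffer() {
    std::vector<std::uint32_t> ids(kMaxInstances);
    std::iota(ids.begin(), ids.end(), 0u);
    return ids;
}

// Models what the vertex shader reads from the instanced attribute:
// the fetch index is baseInstance + instanceId.
std::uint32_t drawIdSeenByShader(const std::vector<std::uint32_t>& ids,
                                 std::uint32_t baseInstance,
                                 std::uint32_t instanceId) {
    const std::size_t index = static_cast<std::size_t>(baseInstance) + instanceId;
    assert(index < ids.size() && "baseInstance + instanceId must stay inside the buffer");
    return ids[index];
}

// Splits a large instance count into draws of at most kMaxInstances each.
std::vector<std::uint32_t> splitIntoDraws(std::uint32_t totalInstances) {
    std::vector<std::uint32_t> counts;
    while (totalInstances > 0) {
        const std::uint32_t n =
            totalInstances < kMaxInstances ? totalInstances : kMaxInstances;
        counts.push_back(n);
        totalInstances -= n;
    }
    return counts;
}
```

For example, splitIntoDraws(8192) yields two draws of 4096 instances, matching the two-DrawPrimitive case mentioned above.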


This is what we do in Ogre 2.1 and it works wonders for us. We do this for OpenGL as well, to avoid having to deal with drivers not having gl_BaseInstanceARB (and also have D3D11 & GL more consistent)

#5286032 C++ best way to break out of nested for loops

Posted by Matias Goldberg on 09 April 2016 - 10:50 AM

Put it in the loop statement rather than use break.

for( a = a_init; a != a_end && !condition; a = a_next )
  for( b = b_init; b != b_end && !condition; b = b_next )

Or break the algorithm out into its own function and use return.

A billion times this. Putting the extra condition in the loop is clean and easy to understand.

If you've got a deeply nested loop, you may have to rethink what you're doing instead of blaming the language. If it's impossible to refactor, breaking the algo into several functions is a perfectly valid strategy. It gets easier to grok, and the isolation may even reveal more about the algorithm, since its dependencies become explicit as function arguments.
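
Both suggestions can be put side by side on a small search problem. This is a minimal sketch: Grid, Position, findWithFlag and findWithReturn are made-up names for illustration.

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<int>>;

struct Position {
    bool found;
    std::size_t row;
    std::size_t col;
};

// Option 1: the exit condition is part of both loop statements.
Position findWithFlag(const Grid& grid, int target) {
    Position result = { false, 0, 0 };
    for (std::size_t r = 0; r != grid.size() && !result.found; ++r) {
        for (std::size_t c = 0; c != grid[r].size() && !result.found; ++c) {
            if (grid[r][c] == target) {
                result = { true, r, c };
            }
        }
    }
    return result;
}

// Option 2: the search lives in its own function, so a plain return
// leaves both loops at once.
Position findWithReturn(const Grid& grid, int target) {
    for (std::size_t r = 0; r != grid.size(); ++r) {
        for (std::size_t c = 0; c != grid[r].size(); ++c) {
            if (grid[r][c] == target) {
                return { true, r, c };
            }
        }
    }
    return { false, 0, 0 };
}
```

Both functions return the same position for the same input. The second one also makes its dependencies (the grid and the target) explicit in the signature.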

#5285465 In terms of engine technology, what ground is left to break?

Posted by Matias Goldberg on 06 April 2016 - 12:30 PM


Even my small-time engine completely hands-down destroys UE4 and Unity when it comes to rendering efficiency, but they win on tools and ease of use. We had a situation recently where an artist accidentally exported a model as 2000 sub-meshes, in a way where the tools weren't able to re-merge them into a single object. That would cause a serious performance issue in unity, but in our engine we didn't notice because it was chewing through 2000 sub-meshes per millisecond...

Why do you think UE does that so much slower?


Unity is really slow too. Not just UE4.

Looping through a bunch of meshes, binding buffers and shaders, shouldn't leave room for extreme overhead?


And within your phrase lies the answer. A modern D3D11 / GL4 engine only needs to map buffers once (unsynchronized access FTW) and bind buffers once. Updating the buffers can be done in parallel. Texture arrays also allow binding textures only a couple of times per frame.


A GLES2 engine (necessary for mainstream Android support) requires changing shader parameters per draw and per material, and binding buffers per sub-mesh. It also binds textures per material. Oh! All of this must be done from the main thread.


Unless they put a huge amount of resources into maintaining two completely different back ends, the least common denominator dictates that GLES2-like performance will limit all other platforms. And even if you write completely different backends, they're so different it will affect the design of your front end one way or another, still limiting the potential.


Like I said earlier, supporting so many platforms comes at a cost.


And that's without even getting into the fact that an engine that is cache friendly, SIMD-ready and built on Data Oriented Design principles will beat the heck out of an engine that didn't put extra care into data contiguity or SIMD-friendliness. Branches can kill your performance.

#5284868 In terms of engine technology, what ground is left to break?

Posted by Matias Goldberg on 03 April 2016 - 09:37 AM

Can I say performance?

Unity and UE4 are very flexible, very generic, friendly, very powerful & very portable. But this has come at the price of performance, where they pale against a custom tailored engine (by a wide margin; I'm talking between a 4x & 10x difference)

#5284297 Fragmentation in my own memory allocator for GPU memory

Posted by Matias Goldberg on 30 March 2016 - 11:09 AM

1. You can make your own defragmenter. An allocation is just an offset and a size. The defragmenter only needs to update those two for every allocation and memmove the data from the old region to the new region. Just keep the offset and size private so this data doesn't get saved somewhere else where it could go out of sync. It must be polled from the allocated pointer every time you need it.

2. Millions of allocations doesn't sound realistic to me.

3. At the higher level, batch allocations of similar size (e.g. don't alternate a 2048x2048 with a 128x128 texture followed by another 2048x2048 texture).

4. Make sure that when you've freed two contiguous regions, you properly merge them into one. This is very commonly gotten wrong.

5. Allocate smaller pools, and when you've run out of space in a pool, create another one. You don't need to have just ONE pool; you can have a few.
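
Point 4 is the one that most often goes wrong, so here is a minimal sketch of a free list that merges contiguous regions on release. It assumes a first-fit policy and tracks offsets only (no real GPU memory is touched). FreeList, kNoSpace and the method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

constexpr std::size_t kNoSpace = static_cast<std::size_t>(-1);

class FreeList {
public:
    explicit FreeList(std::size_t poolSize) { m_free[0] = poolSize; }

    // First-fit: returns the offset of the allocation, or kNoSpace.
    std::size_t allocate(std::size_t size) {
        assert(size > 0);
        for (auto it = m_free.begin(); it != m_free.end(); ++it) {
            if (it->second >= size) {
                const std::size_t offset = it->first;
                const std::size_t remaining = it->second - size;
                m_free.erase(it);
                if (remaining > 0) m_free[offset + size] = remaining;
                return offset;
            }
        }
        return kNoSpace;
    }

    // Returns a region to the pool, merging with the free neighbours
    // directly before and after it.
    void release(std::size_t offset, std::size_t size) {
        auto next = m_free.lower_bound(offset);
        if (next != m_free.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) { // touches previous block
                offset = prev->first;
                size += prev->second;
                m_free.erase(prev);
            }
        }
        if (next != m_free.end() && offset + size == next->first) { // touches next block
            size += next->second;
            m_free.erase(next);
        }
        m_free[offset] = size;
    }

    std::size_t freeBlockCount() const { return m_free.size(); }

private:
    std::map<std::size_t, std::size_t> m_free; // offset -> size, sorted by offset
};
```

After allocating three 256-byte blocks from a 1024-byte pool and releasing all of them, freeBlockCount() is back to 1 instead of 4, which is the property the merge step is there to guarantee.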

#5284145 N64, 3DO, Atari Jaguar, and PS1 Game Engines

Posted by Matias Goldberg on 29 March 2016 - 06:15 PM

The first "commercial game engine" that comes to my mind that supported multiple platforms, was used by several AAA titles and remotely gets close to the modern concept of game engines was RenderWare. It wasn't even a game engine, it was a rendering engine. And it wasn't for PS1 / N64 gen.


Licensing it cost several tens of thousands of dollars AFAIK. Wikipedia has a list of its competitors (Unreal Engine & Frostbite). Doesn't matter who came first, none of them were for the era you're looking for because, like others already explained, it was all handcrafted and kept in house; occasionally having their stuff licensed to other studios.

#5283756 Per Triangle Culling (GDC Frostbite)

Posted by Matias Goldberg on 27 March 2016 - 03:02 PM


I don't know why you insist that much on bandwidth.

Alright I ran your numbers, you've convinced me it isn't as big an issue as I thought it to be... but I'm hazy on one figure of yours.  
edit - also didn't you forget to take into account the Hi-Z buffer bandwidth for per triangle depth culling?


Yes I did. I don't know the exact memory footprint, but 33.33% overhead (like in mipmapping) sounds like a reasonable estimate.

How did you get the 309MB per frame figure?  When I did it I'm getting completely different numbers.
edit - specifically the 305MB number.
Thanks for pointing it out.

1.000.000 * 32 bytes = 30.51MB... dammit I added a 0 and considered 10 million vertices.
The 305MB came from 10 million vertices, not 1 million.
Well... crap.

For 10 million vertices it's 35MB of index data, not 3.5MB. But for 1 million vertices, it's 30.51 MB, not 305.5MB

It only makes it easier to prove. Like I said, at 1920x1080 there shouldn't be much more than 2 million vertices (since there would be one vertex per pixel). Maybe 3 million? Profiling would be needed
So if you provide a massive amount of input vertices (such as 10 million vertices), the culler will end up discarding a lot of vertices.
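
The corrected figures can be re-derived with a few lines of arithmetic. This is a minimal sketch, assuming MB means 1024 * 1024 bytes (which is what the 30.51MB figure implies) and 16-bit indices (an assumption that roughly matches the ~3.5MB index figure for 600.000 triangles). The function names are illustrative.

```cpp
#include <cstdint>

constexpr double kBytesPerMB = 1024.0 * 1024.0;

constexpr double toMB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / kBytesPerMB;
}

// Size of the vertex data read in one pass.
constexpr double vertexDataMB(std::uint64_t vertexCount, std::uint64_t bytesPerVertex) {
    return toMB(vertexCount * bytesPerVertex);
}

// Size of the index data: 3 indices per triangle.
constexpr double indexDataMB(std::uint64_t triangleCount, std::uint64_t bytesPerIndex) {
    return toMB(triangleCount * 3 * bytesPerIndex);
}

// vertexDataMB(1000000, 32)  is about 30.52  (1 million vertices)
// vertexDataMB(10000000, 32) is about 305.18 (10 million vertices, the slip)
// indexDataMB(600000, 2)     is about 3.43   (assuming 16-bit indices)
```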

#5283754 OpenGL Check if VBO Upload is Complete

Posted by Matias Goldberg on 27 March 2016 - 02:39 PM

So can you use a fence to test for the actual status of a BufferSubData call uploading to server? And that works (...) without issuing a draw call against that buffer?

Yes, but no:
1. On one hand, you can reliably test that the copy has ended by calling glClientWaitSync with GL_SYNC_FLUSH_COMMANDS_BIT set. No need to issue a draw call. An implementation that doesn't behave this way could cause hazards or deadlocks and would therefore be considered broken. However...

2. On the other hand, flushing is something you should avoid unless you're prepared to stall. So we normally query this kind of thing without the flush bit set. Some platforms may have already begun, and will eventually finish, the copy by the time we query for the 2nd or 3rd time (since the 1st time we decided to do something else, like audio processing), while other platforms/drivers may wait forever to even start the upload because they're waiting for you to issue that glDraw call before deciding that uploading the buffer will be worth it. Thus the query will always return 'not done yet' until something relevant happens.

So the answer is yes, you can make it work without having to call draw. But no, you should avoid this and hope drivers don't try to get too smart (or profile overly smart drivers).

...and that works consistently across platforms...

If you're using fences and unsynchronized access you're targeting pretty much modern desktop drivers (likely GL 4.x; but works on GL 3.3 drivers too), whether Linux or Windows. It works fine there (unless you're using 3-year-old drivers which had a couple fence bugs)
Few Android devices support GL_ARB_sync. It's not available on iOS afaik either. It's available on OSX but OSX lives in a different world of instability.

Does it work reliably across platforms? Yes (except on OSX where I don't know). Is it available widespread in many platforms? No.

If you're using fences and thus targeting modern GL, this brings me to my next point: Just don't use BufferSubData. BufferSubData can stall if the driver ran out of its internal memory to perform the copy.

Instead, map an unsynchronized/persistent mapped region to use as a stash between CPU<->GPU (i.e. what D3D11 knows as Staging Buffers); and then perform a glCopyBufferSubData to copy from GPU Stash to final GPU data. Just as fast, fewer stall surprises (you **know** when you've run out of stash space; and fences tell you when older stash regions can be reused again), and gives you tighter control. You can even perform the copy from CPU -> GPU stash in a worker thread, and perform the glCopyBufferSubData call in the main thread to do the GPU Stash->GPU copy.
This is essentially what you would do in D3D11 and D3D12 (except the GPU->GPU copy doesn't have to be routed to the main thread).
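
The bookkeeping behind "you know when you've run out of stash space" can be sketched without any GL calls. This is a simplified model, not Ogre's implementation: the stash is split into fixed-size slots, each guarded by a fence id, and the ids are assumed to signal in submission order. StashTracker, kNoFence and kStashFull are hypothetical names. In a real renderer each id would correspond to a GLsync created with glFenceSync and polled with glClientWaitSync.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kNoFence = 0;      // slot is free
constexpr std::ptrdiff_t kStashFull = -1;  // no slot available right now

class StashTracker {
public:
    StashTracker(std::size_t slotCount, std::size_t slotSize)
        : m_slotSize(slotSize), m_fenceOfSlot(slotCount, kNoFence) {}

    // Returns the byte offset of a free slot and marks it as guarded by
    // fenceId, or kStashFull when every slot is still in flight.
    std::ptrdiff_t acquire(std::uint64_t fenceId) {
        assert(fenceId != kNoFence);
        for (std::size_t i = 0; i < m_fenceOfSlot.size(); ++i) {
            if (m_fenceOfSlot[i] == kNoFence) {
                m_fenceOfSlot[i] = fenceId;
                return static_cast<std::ptrdiff_t>(i * m_slotSize);
            }
        }
        return kStashFull;
    }

    // Call once the fence with this id is known to have signaled.
    // Older ids are treated as signaled too (in-order assumption).
    void onFenceSignaled(std::uint64_t fenceId) {
        for (std::uint64_t& guard : m_fenceOfSlot) {
            if (guard != kNoFence && guard <= fenceId) guard = kNoFence;
        }
    }

private:
    std::size_t m_slotSize;
    std::vector<std::uint64_t> m_fenceOfSlot; // fence id guarding each slot
};
```

When acquire() returns kStashFull the caller can decide what to do (wait on the oldest fence, grow the stash, or defer the upload) instead of being stalled by the driver without warning.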

#5283672 OpenGL Check if VBO Upload is Complete

Posted by Matias Goldberg on 27 March 2016 - 12:15 AM

Hold on guys. There is a way to check if the copy has been performed the way OP asked.


apitest shows how to issue a fence and wait for it. The first time it checks if the fence has been signaled. The second time it tries again but flushing the queue since the driver may not have processed the copy yet (thus the GPU hasn't even started the copy, or whatever you're waiting for. If we don't flush, we'll be waiting forever. aka deadlock)


Of course if you want to just check if the copy has finished, and if not finished then do something else: you just need to do the 'wait' like the first time (i.e. without flushing), but using a waiting period of zero (so that you don't wait, and get a boolean-like response like OP wants). We do this in Ogre to check for async transfer's status.


As with all APIs that offer fences (D3D12, Vulkan, OpenGL), the more fences you add, the worse it becomes for performance (due to the added hardware and driver overhead of communicating results, synchronizing, and keeping atomicity). Use them wisely. Don't add fences "just in case" you think you'll want to query for the transfer status. If you have multiple copies to perform, batch them together and then issue the fence.



I'd like to do this in order not to try to draw it before the upload is complete, as this halts the program (and a lag spike is felt).

Calling glDraw* family of functions won't stall because it's also asynchronous. I can't think of a scenario where the API will stall because an upload isn't complete yet. You usually need to check if a download (i.e. GPU->CPU) is completed before you map the buffer to avoid stalling (unless you use unsynchronized or persistent mapping flags; in such case it won't stall but you still need to check if the copy is complete to avoid a hazard)

#5283614 Per Triangle Culling (GDC Frostbite)

Posted by Matias Goldberg on 26 March 2016 - 03:34 PM

That's true, but again I wonder at what cost in terms of bandwidth mainly... although the combination of cluster culling and per triangle culling might reduce the cost of the per triangle culling to my liking.

I don't know why you insist that much on bandwidth.
Assuming 1.000.000 vertices per frame with 32 bytes per vertex & 600.000 triangles, that would require 309MB per frame of Read bandwidth (305MB in vertices, 3.5MB in index data). Actual cost can be reduced because only position is needed for the first pass (in which case only 8-16 bytes per vertex are needed). But let's ignore that.
309MB to read.
Now to write, worst case scenario no triangle gets culled and we would need to write 1.800.000 indices (3 indices per triangle). That's 3.5MB of write bandwidth.

Now to read again, in the second pass, we'd need between 309MB and 553MB depending on caches (i.e. accessing an array of 1.000.000 vertices 1.800.000 times).
So let's assume utter worst case scenario (nothing gets culled, cache miss ratio is high):

  • 309MB Read (1st pass)
  • 3.5MB Write (1st pass)
  • 553MB Read (2nd pass)

Total = 865.5MB per frame.
A typical AMD Radeon HD 7770 has 72GB/s of peak memory bw. At 60fps, that's 1228.8MB per frame available. At 30fps that's 2457.6MB per frame.
Let's keep it at 60 fps. You still have left 363.3MB of BW per frame for texture and RenderTargets. RenderTarget at 1080p needs 7.9MB for the colour buffer (RGBA8888) and another 7.9MB for the depth buffer (assuming you hit the worst cases where Hi-Z and Z-compression get disabled or end up useless; which btw you shouldn't run into those because this culling step removes the troublesome triangles that are hostile to Hi-Z and Z-compression. But let's assume you enabled alpha testing on everything).
You still have left 347.47MB per frame for texture sampling and compositing. Textures are the elephant in the room, but note that since non-visible triangles were culled away, texture sampling shouldn't be that inefficient since each pixel should only end up running once (or twice top).
And this is assuming:

  • Nothing could be culled (essentially rendering this technique useless in this case).
  • You draw 1.000.000 vertices per frame. At 1920x1080 that's one vertex every two pixels and looks extremely smooth (if you managed to concentrate many vertices into the same pixel, leaving other pixels with less tessellation, then it contradicts the assertion that nothing could've been culled).
  • Those 1.000.000 vertices are unique and thus can't be cached during reads (e.g. a scene that renders 1.000.000 vertices per frame is typically the same 60.000 vertices or so repeated over and over again, i.e. instancing. We're assuming here no caching could've been done)
  • You don't provide a position-only buffer for the first pass (which would greatly reduce BW costs at the expense of more VRAM)
  • You hit horrible cache miss ratios in the 2nd pass
  • Early Z and Hi-Z get disabled
  • You're in the mid-end HD 7770 which has only 72GB/s of BW (vs PS4's 176GB/s)
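
The per-frame budget and render target figures used above can be reproduced with a short calculation. This is a minimal sketch, assuming 1 GB = 1024 MB and 1 MB = 1024 * 1024 bytes, as the 1228.8MB and 7.9MB figures imply. The function names are illustrative.

```cpp
// Peak bandwidth available to a single frame, in MB.
constexpr double budgetPerFrameMB(double peakGBPerSecond, double framesPerSecond) {
    return peakGBPerSecond * 1024.0 / framesPerSecond;
}

// Size of one full-screen buffer, in MB.
constexpr double renderTargetMB(unsigned width, unsigned height, unsigned bytesPerPixel) {
    return static_cast<double>(width) * height * bytesPerPixel / (1024.0 * 1024.0);
}

// budgetPerFrameMB(72.0, 60.0)   is about 1228.8 (HD 7770 at 60 fps)
// budgetPerFrameMB(72.0, 30.0)   is about 2457.6 (HD 7770 at 30 fps)
// renderTargetMB(1920, 1080, 4)  is about 7.91   (RGBA8888 or a 32-bit depth buffer)
```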

Vertex bandwidth could be an issue for a game that heavily relies on vertices but not your average game.