Matias Goldberg

Member Since 02 Jul 2006
Offline Last Active Yesterday, 03:47 PM

#5187471 dynamic branching in GPU

Posted by Matias Goldberg on 16 October 2014 - 01:59 PM

It is possible that newer hardware has some smart workarounds for these issues.

Nope. That has hardly changed.

Branches are commonly analyzed not in cycles but in terms of something called "divergence" or "input coherence".

GPUs work in parallel, in lockstep as has been said. Threads are grouped, launched together, and must execute the same instructions.

When one thread inside that group needs to take a different branch than the rest of the threads in its group, we say that we have a "divergence".
The more divergence you have, the bigger the negative impact of branches, because every thread in a diverging group must execute all the branches, and only afterwards are the wrong results discarded (masked out).

When all threads in one group follow one branch while another group follows a different branch, all is well. No divergence happens, and we say that the input is coherent, homogeneous, or that it follows a nice pattern.

Of course, even if the data is coherent, some divergence may still happen in some groups. The key is whether the data is coherent enough that the performance gained by skipping work in most of the groups outweighs the performance lost in the groups that ended up diverging.
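To get a feel for how coherent your input is, you can count on the CPU how many groups would diverge. This is only a rough sketch: the group size, the `DivergenceStats` struct and the `countDivergentGroups` function are made-up names for illustration, and real hardware scheduling is more involved than this model.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model: threads are packed into fixed-size groups and each
// thread's branch decision is a bool. A group containing both true and false
// decisions is counted as divergent (it pays for both sides of the branch).
struct DivergenceStats {
    std::size_t groups = 0;     // total number of groups launched
    std::size_t divergent = 0;  // groups whose threads disagreed on the branch
};

DivergenceStats countDivergentGroups(const std::vector<bool>& takesBranch,
                                     std::size_t groupSize) {
    DivergenceStats stats;
    if (groupSize == 0) return stats;
    for (std::size_t start = 0; start < takesBranch.size(); start += groupSize) {
        const std::size_t end = std::min(start + groupSize, takesBranch.size());
        bool anyTrue = false;
        bool anyFalse = false;
        for (std::size_t i = start; i < end; ++i) {
            if (takesBranch[i]) anyTrue = true;
            else anyFalse = true;
        }
        ++stats.groups;
        if (anyTrue && anyFalse) ++stats.divergent;
    }
    return stats;
}
```

With a group size of 4, the input `T T T T F F F F` yields no divergent groups, while `T F T F T F T F` makes both groups diverge, even though both inputs take the branch exactly half of the time.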

#5186837 Texture coordinates and shared vertices

Posted by Matias Goldberg on 13 October 2014 - 11:47 PM

A cube may be drawn using 8 vertices, 24 vertices, or 36.


I hope the post explains your doubts.

#5186771 Do you use UML or other diagrams when working without a team?

Posted by Matias Goldberg on 13 October 2014 - 03:50 PM

+1 to all that has been said against UML.


When documenting a stable API, I personally prefer to make a basic diagram (not a formal one like UML) explaining the flow of the data between the interfaces, a few key relationships and key processes. Just something that gives my users the general picture. Much friendlier.

#5186767 Vertex buffer efficiency

Posted by Matias Goldberg on 13 October 2014 - 03:34 PM

Thanks for the explanation.  While I wasn't focusing on that part, the transient buffers make more sense to me now from a synchronization standpoint.  When you talk about creating a default buffer, do you mean I should try to have as much as possible of my non-dynamic streaming data stored within a single buffer, and the pool refers to the staging buffers?

Yes. Immutable if possible.

I would think with level streaming, it would be risky to implement a single fixed capacity on the live buffer(s), so pool-type management for the live buffers would be useful too.  For example, when streaming in a new package, the loader knows exactly how much capacity it will require, and can grab how ever many buffers it needs from the pool of unused buffers. Likewise as a level is streamed out, the buffers are no longer needed and are added back into the pool of unused buffers.  Maybe I'm over complicating things.  I definitely see the benefit of staging buffers for updating dynamic data, but I guess it's not as clear for the case of loading in a large amount of streaming data (not behind a loading screen).

The problem is that you're trying to build a car that runs over roads, can submerge into the ocean, fly across the sky, is also capable of travelling into outer space; and even intergalactic travel (and God only knows what you'll find!).

You will want to keep everything together into one single buffer (or a couple of them) to reduce switching buffers at runtime while rendering.
From a streaming perspective, it depends on how you organize your data. E.g. some games divide a level into "sections" and force the gameplay to go through corridors, and while you run through these corridors, they start streaming the data to the GPU (gameplay like Tomb Raider or Castlevania: Lords of Shadow fits this use case). In this scenario, each "section" could be granted its own buffer. You already know the size required for each buffer. And if you page the buffer out, you know whether that can be permanent (i.e. can the player go back?) or you can use some heuristic (e.g. after a certain distance from that section, schedule the buffer for deletion, but don't delete it if you don't need to, e.g. you still have a lot of spare GPU RAM). You may even get away with immutable buffers in this case.

Second, you can keep an adjustable pool of immutable/default buffers based on size and regions. Remember you're not going into the unknown depths of the ocean or into the unknowns of a distant galaxy. You know the level that is going to be streamed. You know its size in megabytes, in kilometers, its number of vertices, how it's going to be used, how many materials it needs etc. You know how each section gets connected with each section (e.g. if F can only be reached from A, put it in its own buffer, and the player is likely to not return to F very often once it has been visited).

You have a lot of data at your disposal.

Open world games are trickier, but it's the same concept (divide the region into chunks that have some logic behind them, e.g. spatial subdivision, and start from there). Open world games usually have a very low poly model of the whole scene to use until the higher quality data has been streamed.


My advice: algorithms are supposed to solve a problem. An engine solves problems. The answer to how to design your engine will be clearer if you approach an actual problem instead of trying to solve a problem you know nothing about. Try to make a simple game. Even a walking cube moving across cardboard city (open world) or pipe-land (corridor-based loading) should be enough.

Stop thinking about how to write the solution and start thinking about how to solve the problem. After that, how to write the solution will appear obvious.

#5186739 Vertex buffer efficiency

Posted by Matias Goldberg on 13 October 2014 - 12:35 PM

That presentation is basically l33t speech for "how to fool the driver and hit no stalls until DX12 arrives".
What they do in "Transient buffers" is an effective hack that allows you to get immediate unsynchronized write access to a buffer and use D3D11 queries as a replacement for real fences.

Specifically, I'm working on implementing his "long-lived buffers" that are reused to hold streaming (static) geometry data.  I've been unable to find much information on how best to implement it, however.

Create a default buffer. Whenever you need to update it, upload the data to a staging buffer (you should have a preallocated pool to avoid stalling if you create the staging buffer), then copy the subresource from staging to default. You're done.

You won't find much because there's not much more to it. Long-lived buffers assume you will rarely modify them, and as such they shouldn't be a performance bottleneck or a concern.

Usually you also have a lot of knowledge about the size you will need for the buffer. Even if you need to calculate it, you do so rarely enough that you can afford to calculate it, or at least cache the result.


The problem is when it comes to buffers that you need to update very often (e.g. every frame).

#5186419 Neutral planets

Posted by Matias Goldberg on 11 October 2014 - 04:48 PM

Similar to what wodinoneeye said.

In real life neutral countries exist because either:

  1. The conflict hasn't expanded yet enough to affect them.
  2. They're strong enough to repel any invasion if they get involved (they could even seriously imbalance the war if they take sides).
  3. Most of the involved parties don't want anything from that country (e.g. why would Israel or Palestine want to take umm... Mexico?) or are emotionally attached to it (emotion != logic).
  4. It's more beneficial to have them as an independent country than to have them take your orders. Maybe because their know-how is too advanced and can't be used appropriately if you invade them, or because their citizens could start small acts of terror during the occupation, or guerrilla-style fighting.

For a game, points 2 and 4 are the most interesting. Point 4 can actually be very fun and make the player go through a living hell.


Point 2 is easy. If you attack, you will be obliterated.


Point 4 is fun. You can attack, and you may win. But you pay the consequences until you release that land back: random sabotage, slower resource gathering or slower building of units, critical unit-making buildings randomly exploding, inability to develop certain technologies. Allow the development of technologies or the gathering of the goods they offered when they were neutral, but at a higher price (or at a slower rate), etc.


Point 3 is possible if the game has a story. Get the player to love a civilization enough that most players will feel bad about invading it and prefer working alongside it. But this is really hard to execute well.

#5185903 Will OpenGL 4.3 work on a 4.0 machine?

Posted by Matias Goldberg on 08 October 2014 - 09:30 PM

I see many misleading answers.
Since OpenGL 3.x, Khronos has adopted a version numbering system of MAJOR.MINOR.
A change in the major number means the hardware needs to be significantly upgraded (e.g. going from a GeForce 280 to a GeForce 480, or from a Radeon HD 4850 to a Radeon HD 7850, which is going from DX10.1 hardware to DX11, or from GL3 to GL4).
A change in the minor number means that 99% of the time a driver upgrade is all that you will need.
If your hardware supports OpenGL 4.0, then it's almost certain that just updating the drivers will be enough to get 4.3, or even 4.5 for that matter (though there's always the risk that the vendor never releases a driver that supports 4.3 and goes straight to 5.x whenever it comes out).
As for the Intel HD 4000, Intel is usually behind when it comes to OpenGL drivers. Their current version is at 4.0; however, they expose the most important 4.3 functionality through extensions (GL_ARB_multi_draw_indirect, GL_ARB_sync, GL_ARB_shading_language_420pack, GL_ARB_conservative_depth).
They're missing compute shaders (GL_ARB_compute_shader) and Shader Storage Buffer Objects (GL_ARB_shader_storage_buffer_object). The latter is the only one where I have my doubts about whether the HW can truly support it; however, that's not a reason to not buy the book.
My recommendation is to go buy the book. The differences will be slim (if SSBOs are even in the book) because most of what applies to 4.3 is provided by Intel's 4.0 drivers (+ extensions).

Will the code examples from the book (OpenGL superbible) work on my machine?

Most of them, yes. You may have to edit the initialization routine so that it asks for a 4.0 context instead of a 4.3 one (which will obviously fail as soon as you launch the program and initialize OGL). Samples that use features that are not provided through extensions (like SSBOs and compute shaders) will obviously fail, but the rest of the samples will work.

#5184805 Directional light... position or direction?

Posted by Matias Goldberg on 03 October 2014 - 12:11 PM

To answer the OP's question, it's a direction.

What I think is confusing you is that we typically refer to the diffuse N * L formula (also known as N dot L, dot( N, L )) where N is the surface normal and L is the light's direction; when it is actually N * -L (notice the negative sign).


It's not that the direction becomes the position or something like that. Strictly speaking the formula is N * -L, but we often refer to it as just N * L (because we tend to look at it from the perspective of "the ray that goes from the surface towards the light"; in other words, the opposite of the direction the light actually travels).


This is a very common source of confusion among people just starting with lighting equations.
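In code, the sign convention looks roughly like this. It's a minimal sketch: `Vec3`, `dot`, `negate` and `lambert` are hypothetical helpers, both vectors are assumed to be normalized, and the clamp to zero is the usual addition so that surfaces facing away from the light stay unlit.

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 negate(const Vec3& v) {
    return {-v.x, -v.y, -v.z};
}

// normal:   unit surface normal N.
// lightDir: unit direction L the light travels in (from the light into the scene).
// Returns the diffuse term dot(N, -L), clamped so back-facing surfaces get 0.
float lambert(const Vec3& normal, const Vec3& lightDir) {
    return std::max(0.0f, dot(normal, negate(lightDir)));
}
```

For a floor with normal (0, 1, 0) and a light shining straight down, L = (0, -1, 0), this gives 1. If you forget the negation, the same setup gives -1 before clamping, which is why the floor ends up black.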

#5184803 Very advanced particle effects

Posted by Matias Goldberg on 03 October 2014 - 11:57 AM

Most tutorials dont go father than how to emit a basic billboard particle.

Because that's all there is to it. Just smoke and mirrors.


The key is a good system that can emit lots of controlled billboards. By "controlled" I mean how many particles are emitted per second, of which type (e.g. size, material), their rate of growth per second, colour randomization, where they get emitted, whether they follow a predetermined path or are attracted by some force (like gravity), etc.

The rest is just really good artists knowing how to take advantage of it.

Google "particle system emitter affector" for ideas on how to implement your own (i.e. a quick google returns these interesting links)
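As a starting point, the emitter/affector split can be sketched like this. All the names here (`Particle`, `Emitter`, `Affector`, `updateParticles`) are made up for illustration, and a real system would have many more parameters (materials, colour randomization, emission shapes, paths).

```cpp
#include <algorithm>
#include <vector>

struct Particle {
    float pos[3];
    float vel[3];
    float size;
    float age;       // seconds since emission
    float lifetime;  // seconds before the particle is removed
};

// Emitter: controls how many particles appear and what they start with.
struct Emitter {
    float particlesPerSecond;
    float startSize;
    float lifetime;
    float pending;   // fractional particles carried over between updates
};

// Affector: modifies live particles on each update (here: gravity and growth).
struct Affector {
    float gravityY;         // acceleration along Y
    float growthPerSecond;  // size increase per second
};

void updateParticles(std::vector<Particle>& particles, Emitter& emitter,
                     const Affector& affector, float dt) {
    // 1. Emit according to the configured rate.
    emitter.pending += emitter.particlesPerSecond * dt;
    while (emitter.pending >= 1.0f) {
        particles.push_back(Particle{{0.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f},
                                     emitter.startSize, 0.0f, emitter.lifetime});
        emitter.pending -= 1.0f;
    }
    // 2. Apply the affector and integrate every live particle.
    for (Particle& p : particles) {
        p.vel[1] += affector.gravityY * dt;
        for (int i = 0; i < 3; ++i) p.pos[i] += p.vel[i] * dt;
        p.size += affector.growthPerSecond * dt;
        p.age += dt;
    }
    // 3. Remove particles that have outlived their lifetime.
    particles.erase(std::remove_if(particles.begin(), particles.end(),
                        [](const Particle& p) { return p.age >= p.lifetime; }),
                    particles.end());
}
```

Call `updateParticles` once per frame with the frame's delta time, then draw one billboard per entry in the vector. Everything the artists tweak lives in the `Emitter` and `Affector` values, not in the code.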


There are a few exceptions though, e.g. for thunder/lightning effects you're better off writing code that creates a chain/path of connected billboards (each billboard slightly reoriented) that randomly splits into 2 or 3 paths at certain points. Then repeat until the desired length is reached. (like this, but in 3D)


When we say "advanced particle effects" we actually mean voxels, fluids, and other very compute-intensive stuff, which isn't what you're asking for.

#5184070 Generate mipmaps for DDS cubemaps in DirectX 11

Posted by Matias Goldberg on 30 September 2014 - 09:28 AM

From what I can see, it is failing because what DX9 did was decompress, generate mips, and compress again, which is a lossy operation (theoretically it can be done as a lossless conversion by replacing the binary data of mip 0 of the recompressed stream with the one from the original bc1; however, mips generated from bc1 sources should be of lower quality than mips generated from the original sources).

Most likely D3D11 forces you towards greater quality by first generating the mips from the source material and then compressing. If the dds is already compressed and you want to pay the price, decompress it first.


Posted by Matias Goldberg on 29 September 2014 - 09:18 AM

That implies that the bit pattern for integer formats created from the floating clear color may not produce the same color.
Just FYI: clearing the backbuffer with clear color = (0.0f, 0.125f, 0.3f, 1.0f), doing a screen print, opening Paint, ctrl-V and sampling the color yields:
with clear color 0.0f, 0.125f, 0.3f, 1.0f
Hue 133
Sat 240
Lum 70

Red 0
Green 99
Blue 149

same clear color

Hue 143
Sat 240
Lum 36

Red 0
Green 32
Blue 76 (blue 77 with DXGI_FORMAT_R8G8B8A8_UNORM)

Green => 255 * 0.125 ^ (1/2.2) = 99.09
Blue => 255 * 0.3 ^ (1/2.2) = 147.52

Which is really close to the 99 and 149 values you got. I'm simplifying with gamma = 2.2, though sRGB is actually a piecewise function (a short linear segment near black followed by a power curve with exponent 2.4).
This is a linear vs sRGB problem.
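If you want to check the numbers yourself, here is a small sketch of both curves. The function names are mine; the constants are the standard sRGB encoding ones, and `toByte` assumes plain round-to-nearest when converting to 8 bits, which the driver may or may not do exactly this way.

```cpp
#include <cmath>

// Simplified gamma-2.2 approximation (the one used in the estimate above).
double encodeGamma22(double linear) {
    return std::pow(linear, 1.0 / 2.2);
}

// Piecewise sRGB encoding: linear segment near black, power curve elsewhere.
double encodeSrgb(double linear) {
    if (linear <= 0.0031308) return 12.92 * linear;
    return 1.055 * std::pow(linear, 1.0 / 2.4) - 0.055;
}

// Converts a normalized [0, 1] value to 8 bits, rounding to nearest.
int toByte(double v) {
    return static_cast<int>(std::lround(v * 255.0));
}
```

With the exact piecewise curve, `toByte(encodeSrgb(0.125))` is 99 and `toByte(encodeSrgb(0.3))` is 149, matching the first set of sampled values, while `toByte(0.125)` with no encoding at all is 32, matching the second set.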

#5182689 Handling Uniform Locations?

Posted by Matias Goldberg on 24 September 2014 - 11:30 AM

From what you're describing you're using anything between the Mobility Radeon HD 2000 and 4000 series.

I've checked and it seems the driver team forgot about this extension.


May not be exactly the same, but these cards do expose the GL_ARB_shading_language_420pack which allows you to use explicit binding points for uniform buffers.

#5182496 Temporal coherence and render queue sorting

Posted by Matias Goldberg on 23 September 2014 - 03:11 PM

I loved L. Spiro's whole post, but I have something to correct.

Sorting 2 smaller queues is faster than 1 big one.

This is a half truth.
Sorting can take:

  • Best Case: O(N)
  • Avg. Case: O( N log( N ) )
  • Worst Case: O( N^2 )

1. In the best case, N/2 + N/2 = N; so in theory it doesn't matter whether it's split or not. But there is the advantage that the two containers can be sorted in separate threads. So it's a win.

2. In the average case, 2 * (N/2 log(N/2)) > N log(N); having one large container should be faster than sorting two smaller ones (though there remains to be seen whether threading can negate the effect up to certain N)

3. In the worst case, 2 * (N/2)^2 < N^2; which means it's much better to sort two smaller containers than a large one.


In the end you'll have to profile as it is not a golden rule.

Spiro's suggestion of using temporal coherence assumes that most of the time you can get O(N) sorting using insertion sort; thus most likely having two smaller containers should be better (if you perform threading).
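To see why temporal coherence helps, here is a plain insertion sort that also counts how many elements it had to shift. This is just an illustrative sketch (`insertionSortCountShifts` is a made-up name, and a real render queue would sort keys plus payload, not bare ints).

```cpp
#include <cstddef>
#include <vector>

// Sorts keys in ascending order with insertion sort and returns the number
// of element shifts performed. On input that is already (nearly) sorted the
// inner loop barely runs, so the total work stays close to O(N).
std::size_t insertionSortCountShifts(std::vector<int>& keys) {
    std::size_t shifts = 0;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        const int key = keys[i];
        std::size_t j = i;
        while (j > 0 && keys[j - 1] > key) {
            keys[j] = keys[j - 1];  // move the larger element one slot right
            --j;
            ++shifts;
        }
        keys[j] = key;
    }
    return shifts;
}
```

If last frame's order is reused and only one pair of neighbours swapped places, the sort performs a single shift. A fully reversed queue of 5 elements needs 10 shifts, which is the O(N^2) worst case.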


Update: Stupid algebra mistake. See lunkhound's post. Avg case is better when dividing and conquering.
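For reference, the three comparisons written out with the corrected average case. This assumes base-2 logarithms, ignores constant factors, and assumes the two halves stay as separate queues (no merge step afterwards):

```latex
\begin{align*}
\text{best case:}    \quad & 2 \cdot \frac{N}{2} = N \\
\text{average case:} \quad & 2 \cdot \frac{N}{2} \log_2 \frac{N}{2}
                             = N (\log_2 N - 1)
                             = N \log_2 N - N < N \log_2 N \\
\text{worst case:}   \quad & 2 \cdot \left( \frac{N}{2} \right)^2
                             = \frac{N^2}{2} < N^2
\end{align*}
```

So splitting never loses on operation count alone; whether it wins in practice still comes down to profiling.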

#5182132 Problem with Constant Buffer Size.

Posted by Matias Goldberg on 22 September 2014 - 10:07 AM

The 48 bytes comes from a mistake in your C++ code, not in your HLSL code.

#5182015 Cache coherence and containers

Posted by Matias Goldberg on 21 September 2014 - 10:14 PM

std::map< int* > mymap;

That is not valid code. std::map requires a key and a value.
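What the valid version looks like depends on what you meant, which I can only guess at. A sketch of the two most likely intentions (the alias names and `findById` are made up for illustration):

```cpp
#include <map>
#include <set>

// std::map needs both a key type and a mapped (value) type.
// Hypothetical intent: look up a pointer by an integer id.
using IdToPointer = std::map<int, int*>;

// If the goal was only to store the pointers themselves, std::set is the
// container that takes a single element type.
using PointerSet = std::set<int*>;

// Returns the pointer stored under id, or nullptr if the id is not present.
int* findById(const IdToPointer& table, int id) {
    const auto it = table.find(id);
    return it != table.end() ? it->second : nullptr;
}
```

Either way, both containers are node-based, so every element is its own heap allocation, which matters when the topic is cache coherence.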