

Member Since 14 Feb 2007
Offline Last Active Today, 08:43 AM

#5302255 Is This A 'thread Deadlock'?

Posted by Hodgman on 23 July 2016 - 08:40 PM

It's simply just not valid code  :wink:  :P
If you're sharing a mutable variable between threads, then you need to use some form of synchronization, such as wrapping it in a mutex.
Assuming C/C++: The old-school advice would be: this will work fine if 'done' is volatile, but don't do that (it's not what volatile is for, and will still be buggy). You can, however, make done a std::atomic and it will work in this particular scenario. An atomic variable is basically one that's wrapped in a super-lightweight, hardware-accelerated mutex, and by default is set to provide "sequentially consistent ordering" of memory operations.
Without some form of synchronization being present, there's no guarantee that changes to memory made by one thread will be visible to another thread at all, or in the correct order.
Ordering matters a lot in most cases, e.g. let's say we have:
result = 0
done = false

Worker:
  result = 42; // do some work
  done = true; // publish our results to main thread

Main:
  Launch Worker
  while(!done) {;} // busy wait
  print( result );
If implemented correctly, this code should print '42'.
But, if memory ordering isn't enforced by the programmer (by using a synchronization primitive), then it's possible that Main sees a version of memory where done==true, but result==0 -- i.e. the memory writes from Worker have arrived out of order.
Synchronization primitives solve this. e.g. the common solution would be:
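result = 0
done = false
mutex

Worker:
  Lock(mutex)
    result = 42;
    done = true;
  Unlock(mutex)

Main:
  Launch Worker
  loop:
    Lock(mutex)
      if done then break; // (unlocking the mutex on the way out)
    Unlock(mutex)
  print( result );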
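As a rough C++11 sketch of the mutex approach (the function name `run_with_mutex` and the use of `std::lock_guard` are my own choices for illustration, not taken from the original thread), every access to `result` and `done` happens while holding the same mutex:

```cpp
#include <mutex>
#include <thread>

// Hypothetical helper: runs the worker/main hand-off once and
// returns the value that the main thread observed.
int run_with_mutex() {
    int result = 0;
    bool done = false;
    std::mutex m; // guards both result and done

    std::thread worker([&] {
        std::lock_guard<std::mutex> lock(m);
        result = 42;  // do some work
        done = true;  // publish our results to the main thread
    });

    // Poll the flag, but only ever read it while holding the mutex.
    bool seen = false;
    while (!seen) {
        {
            std::lock_guard<std::mutex> lock(m);
            seen = done;
        } // lock released here
        if (!seen) std::this_thread::yield();
    }

    int observed;
    {
        std::lock_guard<std::mutex> lock(m);
        observed = result;
    }
    worker.join();
    return observed;
}
```

Because the main thread only sees `done == true` after acquiring the mutex that the worker released, the write to `result` is guaranteed to be visible as well.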
Or the simplest atomic version... which honestly you should only try to use after doing a lot of study on the C++11 memory model and the x86 memory model (and the memory models of other processors...) because it's easy to misunderstand and have incorrect code :(
result = 0
done = false

Worker:
  result.AtomicWrite(42, sequential_consistency)
  done.AtomicWrite(true, sequential_consistency)

Main:
  Launch Worker
  while(!done.AtomicRead(sequential_consistency)) {;} // busy wait
  print( result.AtomicRead(sequential_consistency) )
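Translated to C++11, the atomic pseudocode might look roughly like the sketch below. The function name `run_with_atomics` is hypothetical; `std::memory_order_seq_cst` is spelled out only to mirror the pseudocode, since it is already the default for `store` and `load`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs the worker/main hand-off once and
// returns the value that the main thread observed.
int run_with_atomics() {
    std::atomic<int> result{0};
    std::atomic<bool> done{false};

    std::thread worker([&] {
        result.store(42, std::memory_order_seq_cst);  // do some work
        done.store(true, std::memory_order_seq_cst);  // publish
    });

    // Busy wait until the worker's store to 'done' becomes visible.
    while (!done.load(std::memory_order_seq_cst)) { }

    int observed = result.load(std::memory_order_seq_cst);
    worker.join();
    return observed;
}
```

Under the C++11 memory model, the load that observes `done == true` is ordered after the worker's store to `result`, so the value read back is 42.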

#5302158 Deferred context rendering

Posted by Hodgman on 23 July 2016 - 08:23 AM

Hodgeman: would I be able to run a parallel thread on the GPU? If I did the procedural generation on the GPU, I would have to write a very complex shader to do that.

No... well, this is what "async compute" does in Dx12/Vulkan, but you still shouldn't use it for extremely long-running shaders.

Any task that can be completed by preemptive multi-threading can always be completed by co-operative multi-threading, it's often just harder to write certain problems with one model or the other... In other words, you can break up a very expensive task into a large number of very cheap tasks and run one per frame (it just might make your code uglier).

e.g. I did a dynamic GPU lightmap baker on a PS3 game, which took about 10 seconds of GPU time to run... so instead, I broke it up into 10000 chunks of work that were about 1ms each, and executed one per frame, producing a new lightmap every ~2 minutes :wink:
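A minimal CPU-side sketch of that "one cheap chunk per frame" idea follows. The `ChunkedJob` type, its placeholder work, and `frames_to_finish` are all hypothetical and only stand in for whatever the real job does; none of this is the actual lightmap baker.

```cpp
#include <cstddef>

// Hypothetical long-running job that has been split into equal chunks.
struct ChunkedJob {
    std::size_t totalChunks;
    std::size_t nextChunk = 0;
    long long accumulated = 0; // stand-in for the job's real output

    explicit ChunkedJob(std::size_t chunks) : totalChunks(chunks) {}

    bool finished() const { return nextChunk == totalChunks; }

    // Call once per frame: performs one cheap chunk of work.
    // Returns true once the whole job has completed.
    bool step() {
        if (!finished()) {
            accumulated += static_cast<long long>(nextChunk); // placeholder work
            ++nextChunk;
        }
        return finished();
    }
};

// Simulates a frame loop and counts the frames needed to finish the job.
std::size_t frames_to_finish(ChunkedJob& job) {
    std::size_t frames = 0;
    while (!job.finished()) {
        job.step(); // the rest of the frame's work would happen around this
        ++frames;
    }
    return frames;
}
```

The job keeps its own progress state between calls, which is the price of the co-operative model: the task has to be written so it can stop and resume at chunk boundaries.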

#5302147 My First Videogame Failed Conquering The Market

Posted by Hodgman on 23 July 2016 - 05:56 AM

Sorry this isn't more constructive, because I don't have the art education to be specific, but -- the visual aspect of your game and website is unappealing.

Most people will make an instant judgement based on the first visual that's presented to them, and the colours, the composition, the style here just don't come together to make a beautiful piece of art.


It would be a very good investment to hire an experienced concept artist to visualize what the game could/should look like early on in production, and use that to produce a style guide / art bible for the rest of the production phase. You can also get a concept artist to "paint over" screenshots at any point in time to show you what they should look like, and then use that to focus on improving the art.

Same goes for the website -- an experienced graphic designer and UI/UX person would be invaluable.

#5301959 Matrix Calculation Efficiency

Posted by Hodgman on 22 July 2016 - 08:22 AM

Right now I can measure time in NSight's "Events" window with nanosecond precision and can’t see performance gain between the shaders.
Is there a way to measure the difference in a finer way?

Well, there are two explanations:
1) NSight can't measure the difference.
2) There is no performance difference...

It could be that when the driver translates from D3D bytecode to native asm, it's unrolling the loops, meaning you get the same shader in both cases.
It could be that branching in a GPU these days is free as long as (a) the branch isn't divergent and (b) is surrounded by enough other operations that it can be scheduled into free space.

e.g. on that latter point, this branch won't be divergent because the path taken is a compile time constant. I'm not up to date with NV's HW specifics (and they're secretive...) but on AMD HW, branch set-up is done using scalar (aka per-wavefront) instructions, which are dual-issued with vector (aka per-thread/pixel/vertex/etc) instructions, which means they're often free as the scalar instruction stream is usually not saturated.

#5301878 Matrix Calculation Efficiency

Posted by Hodgman on 21 July 2016 - 11:48 PM

Simple answer: yes - doing multiplication once ahead of time, in order to avoid doing it hundreds of thousands of times (once per vertex) is obviously a good idea.


However, there may be cases where uploading a single WVP matrix introduces its own problems too!

For example, let's say we have a scene with 1000 static objects in it and a moving camera.

Each frame, we have to calculate VP = V*P, and then perform 1000 WVP = W * VP calculations, and upload the 1000 resulting WVP matrices to the GPU.

If instead, we sent W and VP to the GPU separately, then we could pre-upload 1000 W matrices one time in advance, and then upload a single VP matrix per frame... which means that the CPU will be doing 1000x less matrix/upload work in the second situation... but the GPU will be doing Nx more matrix multiplications, where N is the number of vertices drawn.


The right choice there would depend on the exact size of the CPU/GPU costs incurred/saved, and how close to your GPU/CPU processing budgets you are.
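Both options produce the same transformed vertex, because matrix multiplication is associative; only where the work happens changes. The sketch below checks that on the CPU, assuming a row-vector convention (v * W * VP). The type aliases and the helpers `translation`, `scale`, `mul` and `transform` are hypothetical names for illustration.

```cpp
#include <array>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Row-vector convention: the translation lives in the last row.
Mat4 translation(double x, double y, double z) {
    Mat4 m{};
    for (int i = 0; i < 4; ++i) m[i][i] = 1.0;
    m[3][0] = x; m[3][1] = y; m[3][2] = z;
    return m;
}

Mat4 scale(double x, double y, double z) {
    Mat4 m{};
    m[0][0] = x; m[1][1] = y; m[2][2] = z; m[3][3] = 1.0;
    return m;
}

// Matrix * matrix: the once-per-object work in the "upload WVP" option.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Row vector * matrix: the per-vertex work a vertex shader would do.
Vec4 transform(const Vec4& v, const Mat4& m) {
    Vec4 r{};
    for (int j = 0; j < 4; ++j)
        for (int k = 0; k < 4; ++k)
            r[j] += v[k] * m[k][j];
    return r;
}
```

Option one does `transform(v, mul(W, VP))` with the `mul` on the CPU once per object; option two does `transform(transform(v, W), VP)`, which costs one extra vector-matrix product per vertex on the GPU.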

#5301817 Vulkan is Next-Gen OpenGL

Posted by Hodgman on 21 July 2016 - 02:11 PM

Yeah it's both.

AMD's drivers have traditionally been (CPU) slower than NV's. Especially GL, which is an inconceivable amount of driver code (for comparison, NV's GL driver likely dwarfs Unreal engine's code base).

A lot of driver work is now engine work, letting AMD catch up on CPU performance by handing half their responsibilities to smart engine devs who can use design instead of heuristics now :)

Resource barriers also give engine devs some opportunity to micro-optimize a bit of GPU time, which was the job of magic driver heuristics previously -- and NV's heuristic magic was likely smarter than AMD's.

AMD were the original Vulkan architects (and probably had disproportionate input into D3D12 as well - the benefits of winning the console war), so both APIs fit their HW architecture perfectly (closer API fit than NV).

AMD actually can do async compute right (again: perfect API/HW fit) allowing modest gains in certain situations (5-30%). Which could mean as much as 5ms of extra GPU time per frame :o

#5301771 D3Dlock_Nooverwrite Erratic Flashing

Posted by Hodgman on 21 July 2016 - 09:16 AM

Yep, that's another common way to use no-overwrite without the hassle of managing the queries like in my example!

#5301739 [Glsl] Spherical Area Light For Ggx Specular

Posted by Hodgman on 21 July 2016 - 06:52 AM

Well, what is the math for ray intersection test? I have never ever made raytracers or anything like that.

It's not that complex in this case.


Use this to compare the line (x1 = surface, x2 = surface + reflection) with the center of the sphere:



The resulting point will be inside the sphere in the case of an intersection.

e.g. bool is_inside = distance(sphere_center, closest_point_on_line) < radius

If that's true, use the reflection direction as the lighting direction.


Otherwise, find the closest point on the surface of the sphere to the above point:
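The formulas referenced in this post did not survive on this page. As a hedged reconstruction, the standard point-to-line projection gives the following, which may not be the exact form originally linked. The symbols c (sphere center), r (radius), p (closest point on the line) and s (closest point on the sphere's surface) are my own naming.

```latex
\begin{align}
t &= \frac{(\mathbf{c} - \mathbf{x}_1)\cdot(\mathbf{x}_2 - \mathbf{x}_1)}
          {\lVert \mathbf{x}_2 - \mathbf{x}_1 \rVert^{2}},
  &
\mathbf{p} &= \mathbf{x}_1 + t\,(\mathbf{x}_2 - \mathbf{x}_1) \\
\text{intersection} &\iff \lVert \mathbf{p} - \mathbf{c} \rVert < r \\
\mathbf{s} &= \mathbf{c} + r\,\frac{\mathbf{p} - \mathbf{c}}{\lVert \mathbf{p} - \mathbf{c} \rVert}
  & &\text{(used only when there is no intersection)}
\end{align}
```

The lighting direction in the non-intersecting case is then the normalized vector from the surface point x1 to s.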


#5301716 D3Dlock_Nooverwrite Erratic Flashing

Posted by Hodgman on 21 July 2016 - 04:46 AM

Yeah, as above, you shouldn't use both discard and no-overwrite; use one or the other.


When you're writing graphics code, you're actually writing multi-threaded code, where one "thread" is your CPU, and one "thread" is your GPU.


For the lock flags:

  • Passing 0 or D3DLOCK_READONLY is equivalent to acquiring a mutex. Both threads will synchronize (one will stall) if they're both using the resource.
  • Passing D3DLOCK_NOOVERWRITE is equivalent to doing nothing... If both threads are using the resource, you've got yourself a race condition! Using this flag requires that you implement your own synchronization, e.g. by using IDirect3DQuery9::Issue to keep track of the GPU's progress.
  • Passing D3DLOCK_DISCARD is similar to no-overwrite, but relying on the driver to implement a clever scheme for you. This is actually equivalent to Releasing your buffer and Creating a new one every time you lock it (except much more optimal than that!). Internally, D3D releases the memory allocation for the buffer (which doesn't cause it to get deleted/free'd immediately -- if it's in use by the GPU, then it will have incremented the reference counter, so the delete/free will only occur after the GPU has finished with that data), and makes a new memory allocation to return to you as the result of the Lock call.


If you want something similar to Discard, but want to do it yourself, the typical solution is to allocate a buffer that is N*(M+1)*Size, where Size = the number of bytes you need the buffer to store, N = the number of times that you update the data per frame, and M = the number of frames that you wish to have in flight (typically 1 or 2).

Every time the user wishes to "lock" your buffer, increment the offset by Size (or wrap back around to zero when you reach the end). Lastly, you need to ensure that the GPU is only ever at most M frames behind the CPU. To do this, you can issue a query at the end of every frame as a kind of 'fence', and use IDirect3DQuery9::GetData to periodically see which fences the GPU has passed. If the GPU is too far behind (M frames), then you need to busy wait on GetData (using the D3DGETDATA_FLUSH flag while busy waiting to ensure the GPU is making progress) until it has caught up.

This is a typical "ring buffer" used to stream data to the GPU :)

If done right, it should be slightly faster than using the Discard flag while actually being safe and not giving you flickering bugs from a race condition :lol:
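The bookkeeping side of such a ring buffer can be sketched without touching D3D at all. In the sketch below, `RingOffsets` and `cpu_must_wait` are hypothetical names; the frame counters are assumed to be maintained by the caller (for example from the end-of-frame queries described above), and no real Direct3D calls are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hands out byte offsets into a buffer of N*(M+1)*Size bytes.
class RingOffsets {
public:
    RingOffsets(std::size_t size, std::size_t updatesPerFrame,
                std::size_t framesInFlight)
        : size_(size),
          capacity_(updatesPerFrame * (framesInFlight + 1) * size) {
        assert(size > 0 && updatesPerFrame > 0);
    }

    std::size_t capacity() const { return capacity_; }

    // Returns the offset to write at for this "lock", then advances,
    // wrapping back around to zero at the end of the buffer.
    std::size_t next() {
        std::size_t current = offset_;
        offset_ += size_;
        if (offset_ >= capacity_) offset_ = 0;
        return current;
    }

private:
    std::size_t size_;
    std::size_t capacity_;
    std::size_t offset_ = 0;
};

// framesSubmitted: frames the CPU has finished recording so far.
// framesCompleted: frames the GPU is known to have finished.
// True when more than M frames are outstanding, i.e. the CPU should
// stall before it starts overwriting data the GPU may still be reading.
bool cpu_must_wait(std::uint64_t framesSubmitted,
                   std::uint64_t framesCompleted,
                   std::uint64_t maxFramesInFlight) {
    return framesSubmitted - framesCompleted > maxFramesInFlight;
}
```

With Size = 256, N = 2 and M = 2, the buffer is 1536 bytes and the seventh call to `next()` wraps back to offset 0, which is only safe if the frame that last wrote there has been completed by the GPU.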

#5301684 [Glsl] Spherical Area Light For Ggx Specular

Posted by Hodgman on 20 July 2016 - 11:30 PM

The easiest solution is the "representative direction" method, though it's not accurate when it comes to energy conservation (overall, it will be too bright, especially for large spheres), etc... but is still a good place to get started :)


* First calculate the reflection direction like you would for Phong specular (R = reflect(V,N)).

* Form a ray originating from the surface and traveling along this reflection direction.

* Find the closest point on the sphere to this ray (or: find closest point on the ray from the center of the sphere, then the closest point on the sphere to that first point).

*^ If the ray intersects the sphere, the answer is either of the intersection points between the ray and the sphere!

* Create a new "representative reflection direction", which is the direction from the surface to this point on the sphere.

* Run your specular code as usual (for a point light), but use this "representative reflection direction" as the lighting direction.


With a zero-sized sphere, the representative direction will always be equal to the original lighting direction, so you get the same results as a point light.

With larger spheres, the lighting direction gets shifted a little bit so that every surface is getting lit by a slightly different point light -- one that's placed somewhere on the surface of your sphere, to give the impression that there's actually a sphere-light in your scene :D
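The steps above can be sketched on the CPU in C++ as follows; a shader version would follow the same shape. `Vec3`, its helpers and `representative_direction` are hypothetical names, and clamping to the forward half of the ray (t >= 0) is my own assumption rather than something stated in the post.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 add(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
double length(Vec3 a) { return std::sqrt(dot(a, a)); }
Vec3 normalize(Vec3 a) { return scale(a, 1.0 / length(a)); }

// surface: the shaded point. R: reflection direction (non-zero, need
// not be normalized). center/radius: the sphere light.
// Returns the normalized "representative" lighting direction.
Vec3 representative_direction(Vec3 surface, Vec3 R, Vec3 center, double radius) {
    // Closest point on the reflection ray to the sphere's center.
    double t = dot(sub(center, surface), R) / dot(R, R);
    if (t < 0.0) t = 0.0; // assumption: stay on the forward half of the ray
    Vec3 onRay = add(surface, scale(R, t));

    Vec3 offset = sub(onRay, center);
    double dist = length(offset);

    // The ray passes through (or touches) the sphere: light along R itself.
    if (dist <= radius) return normalize(R);

    // Otherwise use the closest point on the sphere's surface to onRay.
    Vec3 onSphere = add(center, scale(offset, radius / dist));
    return normalize(sub(onSphere, surface));
}
```

With `radius` set to zero, the returned direction points straight at the sphere's center, matching the point-light behaviour described above.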

#5301682 Visual Effects, Shaders, And Uber-Shaders

Posted by Hodgman on 20 July 2016 - 11:09 PM

Uber shaders have many different permutations... but you're not limited to one uber shader per game. You can use many different shaders (with each of them being 'uber', or not).

#5301633 Why didn't somebody tell me?

Posted by Hodgman on 20 July 2016 - 04:11 PM

The bbq in omgwtfbbq stands for "be back quick", not for "barbecue". Apparently I ruined the day for several people already by revealing this.

Citation needed.

#5301631 When you realize how dumb a bug is...

Posted by Hodgman on 20 July 2016 - 04:07 PM

^ isn't that of questionable style but perfectly legal? C++ allows overriding your parent's privates.

[edit] Wait... How does Callee not generate an error when it calls a protected function??

[edit2] how are you passing a Derived& to a function that takes a Base*? :P

#5301628 Copying A D3D11 Buffer To 2D Texture Of Floats

Posted by Hodgman on 20 July 2016 - 03:57 PM

Sorry, you should be passing vinitDataA to CreateTexture, not CreateBuffer.

#5301491 Overall Strategy For Move-Semantics? [C++11]

Posted by Hodgman on 20 July 2016 - 03:50 AM

It used to be that many game platforms either didn't support exceptions at all, or not without a massive global performance impact. Now, there's one major game platform that doesn't allow you to disable them anymore!

IMHO C++ exceptions are still a bad idea from a code maintenance point of view though, which is why most projects still ban them.

C++ RTTI is almost useless though - it's basically a code smell...


But to pretend to be on topic, move semantics are actually a useful feature so will likely see introduction into game codebases over time :D