
Hodgman

Member Since 14 Feb 2007
Offline Last Active Today, 05:10 PM

#5200254 Compute shader: varying the number of threads

Posted by Hodgman on Today, 09:02 AM

The hardware is pretty important -
If you dispatch groups of 1 thread, then on AMD you'll be running at 1/64th speed, and 1/32nd speed on NVidia, which is a huge penalty.
These chips are SIMD processors, where one register holds 32/64 floats, so one instruction operates on 32/64 floats. These specific architectures are designed to run lots of threads within each group - specifically, multiples of 32(NVidia) or 64(AMD) threads.

If you know you'll always be working on 100 items, there's no harm in declaring that in your shader... But while declaring one thread (and then dispatching the true number) is easy to develop/maintain, it will kill performance. You have to balance maintainability with hardware-specific choices... ;(
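
For illustration, a minimal HLSL sketch of the two options above (the buffer and entry-point names are hypothetical):

RWStructuredBuffer<float> g_items : register(u0); // hypothetical data being processed

// Option A: the workload is always 100 items, so declare that as the group size
// and call Dispatch(1,1,1). The hardware rounds the group up to whole waves internally.
[numthreads(100, 1, 1)]
void CS_FixedWorkload(uint3 tid : SV_DispatchThreadID)
{
    g_items[tid.x] *= 2;
}

// Option B: one thread per group, Dispatch(100,1,1). Easy to maintain, but each
// group still occupies a whole 32/64-wide wave, so most of the ALU lanes sit idle.
[numthreads(1, 1, 1)]
void CS_OneThreadPerGroup(uint3 gid : SV_GroupID)
{
    g_items[gid.x] *= 2;
}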


#5200237 Compute shader: varying the number of threads

Posted by Hodgman on Today, 05:45 AM

AFAIK, modern AMD GPUs run 64 threads at once and nVidia GPUs run 32 threads at once (per wave, per processor, per shader unit, per GPU).

 

If there are 64 threads in a wave, and only 10 of them take the true branch of an if statement, then you still pay for the full wave. You're wasting (64-10=)54 threads' time.

If every thread takes the same branch, then there's only the cost of the if instruction itself to worry about.

On modern GPUs, the cost of the if instruction itself is basically free. On older GPUs, it would cost about the same as a dozen basic arithmetic instructions.

 

If you set your shader's thread group size to 10, then on nVidia hardware you're always wasting (32-10=)22 threads' time per thread group. If you dispatch 100 thread groups, you're wasting 2200 threads' time.

The same example on AMD hardware wastes (64-10=)54 threads per group / 5400 threads in total.

 

Say you need to process 1000 items (continuing with modern AMD hardware in this example):

You can set the thread group size to 10, then dispatch 100 thread groups as above, but this is immensely wasteful (5400 HW threads wasted).

You can set the thread group size to 64, then dispatch 16 thread groups, which equals 1024 total threads. You can then add an if statement to your code, as suggested by the posters above. This means that for 15 of the thread groups, this branch does nothing (and is pretty much free, performance-wise), and for the last thread group you waste (1024-1000=)24 threads' time.

24 wasted threads is much nicer than 5400 :D
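
As HLSL, a rough sketch of that second option (the buffer and constant names are hypothetical):

cbuffer DispatchParams : register(b0) // hypothetical constants
{
    uint g_itemCount; // 1000 in the example above
};
RWStructuredBuffer<float> g_items : register(u0);

// Group size of 64; the CPU side dispatches ceil(1000/64) = 16 groups = 1024 threads.
[numthreads(64, 1, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID)
{
    // For the first 15 groups every thread takes this branch, so it's basically free.
    // Only the last group has (1024-1000=)24 threads that skip the work.
    if (tid.x < g_itemCount)
    {
        g_items[tid.x] *= 2;
    }
}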




#5199839 Problem with tangents (fbx)

Posted by Hodgman on 24 December 2014 - 06:21 AM

If you're using a normal map that was generated by Maya, then you have to use Maya's tangent/binormal data (or tangent/handedness data) -- you can't generate your own tangents and still correctly decode that normal map (unless you use the exact same algorithm that Maya does internally).

 

To deal with the fact that handedness may differ (i.e. either binormal = cross(normal,tangent) or binormal = -cross(normal,tangent)), you either need to store Maya's binormal per vertex, or store an extra bit per vertex indicating which binormal to use.

Often I've seen this implemented by changing the tex-coord attribute from a float2 to a float3, and storing either +1 or -1 in the z value. You then generate the binormal with cross(normal,tangent) * texcoord.z.

When you're building your model file, for each vertex, you can compare Maya's binormal against those two possible values to determine if you should be storing +1 or -1.
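
A minimal HLSL sketch of the shader side of that float3 tex-coord trick (the vertex attribute layout here is just an assumption):

struct VertexIn
{
    float3 position : POSITION;
    float3 normal   : NORMAL;
    float3 tangent  : TANGENT;
    float3 texcoord : TEXCOORD0; // z holds the handedness sign: +1 or -1
};

// Rebuild the binormal in the shader, flipping it for mirrored UVs
// via the sign that was stored in texcoord.z when the model was built.
float3 ComputeBinormal(VertexIn v)
{
    return cross(v.normal, v.tangent) * v.texcoord.z;
}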




#5199828 Is this correct sRGB conversion?

Posted by Hodgman on 24 December 2014 - 04:02 AM

Yep, 0.25882 (sRGB) is one quarter as bright as 0.5 (sRGB), so you're doing everything correctly.
 
Have a look (squint) at these images and see approximately what the gamma response of your monitor is.
http://www.lagom.nl/lcd-test/gamma_calibration.php
If it's sRGB calibrated, you should see the equal intensity band at about 2.2 gamma.
 
If you perceive the dashed and solid bars to be of equal intensity at some other value, then your monitor isn't correctly calibrated, so 'correct' sRGB outputs will look wrong to you.
To support such crummy monitors, give the user an option to replace the "convert final image to sRGB" step with a simple pow function, where the exponent is the value you got from squinting at that chart.
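A minimal HLSL sketch of that fallback, assuming the chosen exponent is passed in as a shader constant:

cbuffer GammaParams : register(b0)
{
    float g_userGamma; // value the user picked from the chart, ~2.2 on a well-calibrated monitor
};

// Final output step: instead of the hardware linear->sRGB conversion,
// apply a simple power-law encode using the user's measured gamma.
float3 EncodeForDisplay(float3 linearColour)
{
    return pow(saturate(linearColour), 1.0 / g_userGamma);
}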
 

I tried recreating the scene in Unity and applied linear lighting calculations and I got [0.125, 0, 0, 0]

Sounds like Unity isn't implementing gamma-correct lighting then... Shame on them.

 

 

Personally, I use a ColorMunki to ensure all our development monitors are close to being calibrated correctly (along with the squint-test image above), which means that all of our input colour textures will contain sRGB data (as they were authored on an sRGB monitor).

Then, most end-users unfortunately will have perverse settings applied to their own monitors (completely destroying the point of colour standards...) so I default to sRGB output, but also give the user a slider if they want to increase/decrease a gamma value (which results in a simple pow-based final gamma encoding, instead of hardware sRGB encoding).

 

 

p.s. trying to tweak lighting in an 8-bit pipeline (whether gamma correct or not) is incredibly hard. IMHO everyone should be doing their lighting in 16bit linear these days, and then tone-mapping to 8-bit sRGB/gamma. Without a tonemapper, your ambient light is just 0.25 units(... of light..? 0.25 lights?), but with a tone-mapper you can declare what those units are, and how those units are exposed / converted to colours, as in photography.




#5199697 Problem with tangents (fbx)

Posted by Hodgman on 23 December 2014 - 06:59 AM

What kind of problem are you actually having?

There is no one right way to generate tangents - any method that produces a vector that's perpendicular to the normal is valid.
What is vitally important is that the art tool that's generating your normal map and your game are both using the exact same normals/tangents/bitangents. Otherwise, if the normal-map tool and the game have different vectors, there's absolutely no way to correctly decode the normal map.


#5199616 Thoughts on Rust?

Posted by Hodgman on 22 December 2014 - 06:13 PM

I think it's 5-10 years out before Rust has even a shot at becoming mainstream. A large part of this is inertia. In games, for instance, we need not only ports/wrappers of open source libraries, but also for things like... This is the same problem faced by Go, D, and all the other C++ killers...

For mainstream games use, we also need compilers to exist for esoteric CPUs under closed platforms, where only other game-devs are allowed to tread - locking out typical open source contributors, and requiring a decent amount of demand to exist before a commercial provider will step in. Chicken and egg ensues, where we can't use it because there's no compiler, and there's no compiler because no one is using it.
You need PC devs to invest heavily and then transition to mainstream games, forcing them to develop/port the compilers themselves ;D


#5199361 What is faster cbuffer or data textures for MVP matrices?

Posted by Hodgman on 20 December 2014 - 10:09 PM

In D3D11, you don't have to use VTF; you can use a regular Buffer object (like for vertices) and bind it to a texture slot (tbuffer type in the HLSL code).

If using instancing, you can't use multiple cbuffers (as all resources have to be the same for every instance), so tbuffers are the obvious choice.
If not using instancing, the overhead of updating a cbuffer probably dwarfs the cost of copying 2 to 48 bytes, so I don't imagine that updating cbuffers with indices instead of matrices will be a useful optimization.

With instancing, you could also put the matrix array in a cbuffer rather than a tbuffer, but I would guess this will be non-optimal.
Cbuffers are optimised assuming that every pixel/vertex will require every bit of data in the cbuffer, and older cards may not support array indexing (= may implement it as a huge nested if/elseif chain...).
Tbuffers and textures are optimised assuming that different vertices/pixels will need different subsets of the data, but that there may be some spatial locality that a cache would help with. They're implemented using the fetch hardware, so you know that array indexing will work, but also that it will be performing an actual (cached) memory request (whereas perhaps cbuffer data may have been pre-fetched into registers - wasting a huge number of them).

Lastly, you can put the matrix data into a vertex buffer and bind it to an Input Assembler slot, where the vertex layout / vertex declaration is responsible for defining how it is fetched for the vertex shader.
In D3D9, this is probably the best approach, as VTF was either slow and limited, or entirely unsupported back then. In D3D11, it's probably faster to define your own tbuffer and do the fetch yourself using SV_InstanceID.
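
A rough HLSL sketch of the buffer-in-a-texture-slot approach, assuming each instance stores a 4x3 world matrix as three float4 rows (all names here are hypothetical):

cbuffer CameraParams : register(b0)
{
    float4x4 g_viewProj;
};
Buffer<float4> g_instanceTransforms : register(t0); // 3 float4 rows per instance

float4 VSMain(float3 position : POSITION, uint instance : SV_InstanceID) : SV_Position
{
    // Fetch this instance's matrix rows manually from the buffer.
    uint base = instance * 3;
    float4 row0 = g_instanceTransforms.Load(base + 0);
    float4 row1 = g_instanceTransforms.Load(base + 1);
    float4 row2 = g_instanceTransforms.Load(base + 2);

    float4 p = float4(position, 1.0);
    float3 worldPos = float3(dot(p, row0), dot(p, row1), dot(p, row2));
    return mul(float4(worldPos, 1.0), g_viewProj);
}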


#5198870 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 17 December 2014 - 07:11 PM

What would be some good alternatives, preferably with being able to keep the OctTree, for dynamic scenes?

That's a good question!  I just wanted to point out the inevitable GPU sync points involved in CPU-read-back techniques.

 

Obviously the best performance will be with precomputed techniques and static scenes...

 

For dynamic scenes, many engines are using depth occlusion queries, but entirely on the CPU-side to avoid the GPU-sync issues.

This generally requires your artists to generate very low-poly LODs of your environment / occluders, and to ensure that these LODs are entirely contained within the high-poly visual versions. You then rasterize them at low resolution on the CPU into a floating point depth buffer. To test objects for visibility, you find their 2D bounding box (or bounding ellipse) and test all the floats in that region to see whether the bounding volume is covered or not.

 

At the moment, I'm using a 1bpp software coverage rasterizer, as mentioned here.

 

Another technique that I've seen used in AAA games is simple "occlusion planes". The level designers manually place large rectangles around the levels (usually inside walls of buildings / etc), which represent occluders. The game then frustum culls these rectangles and then sorts them by screen-space size and selects the biggest N number of them. Those N rectangles are then extruded into occluder frustums, and every visible object is tested to see if it's entirely inside any of those occluder frustums.

e.g.

visibleObjects   = allObjects.Where( x => !EntirelyOutsideFrustum(camera, x) );
visibleOccluders = allOccluders.Where( x => !EntirelyOutsideFrustum(camera, x) );
bestOccluders    = visibleOccluders.Sort( x => x.ScreenArea() ).Trim(0, 10);
occluderFrusta   = bestOccluders.Select( x => x.ExtrudeQuadToFrustum(cameraPos) );
reallyVisibleObjects = visibleObjects.Where( x => !occluderFrusta.Any( f => EntirelyInsideFrustum(f, x) ) );

You'd be surprised at how fast modern CPUs can burn through these kinds of repeated, simple, brute-force checks... Sometimes simple flat lists will even out-perform complex structures like trees, due to how ridiculously slow random memory access patterns are compared to predictable linear patterns.

 

Other games use a combination of portals and occlusion planes.




#5198695 How much do I pay someone for coming up with some ideas for my game?

Posted by Hodgman on 16 December 2014 - 09:48 PM

Add up how many hours both you and he have put into it, and show those numbers to him.

Either agree to a royalty share based on that, or a fixed price, e.g. $20 x his hours.
In either case you need a lawyer to draft the agreement.

If negotiations fail, you're not legally obliged to pay him anything.... Game ideas aren't really subject to copyright - implementations of ideas are.


#5198691 declare custom types of ints?

Posted by Hodgman on 16 December 2014 - 09:06 PM

yes, but if both parameters are of type eiTYPEDEF_INT, the compiler won't catch it if they are accidentally reversed will it? IE if i accidentally passed ani # as model #, and model # as ani #.

The whole point of those macros is to solve that problem for you - otherwise you'd just use the regular typedef keyword.
If you use:
eiTYPEDEF_INT(AnimID);
eiTYPEDEF_INT(ModelID);
Then passing an AnimID as a ModelID will result in a compiler error, saying can't convert AnimID to ModelID.
 
[edit]
The trick is that you end up with two completely different instantiations of the PrimitiveWrap template -- one using 'tag_ModelID' as an argument and one using 'tag_AnimID' as an argument. Those 'tag' types are just dummy structures with no use at all, except to trick C++ into cloning the PrimitiveWrap template into a new unique type.
struct tag_ModelID;//useless structs, just to create a unique type
struct tag_AnimID;

typedef PrimitiveWrap<int, tag_ModelID> ModelID;//the useless structs are used as a template argument
typedef PrimitiveWrap<int, tag_AnimID>  AnimID; // so that the two resulting types are different

void PlayAnimation(ModelID some_model, AnimID some_ani);
...
AnimID a = AnimID(2);
ModelID m = ModelID(1);
PlayAnimation(m, a);//OK!
PlayAnimation(a, m);//error - can't convert arg#1 from PrimitiveWrap<int,tag_AnimID> to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(1, 2);//error - can't implicitly convert arg#1 from int to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(AnimID(1), ModelID(2));//error - can't convert arg#1 from PrimitiveWrap<int,tag_AnimID> to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(ModelID(1), AnimID(2));//OK!

int id = m;//OK -- n.b. you CAN implicitly convert from IDs back into integers
ModelID m2 = id;//ERROR -- but you can't implicitly convert integers into IDs!
ModelID m3 = ModelID(id);//OK -- you have to explicitly convert them like this



#5198653 declare custom types of ints?

Posted by Hodgman on 16 December 2014 - 05:54 PM

I use this code (and helper macros) to declare custom type-safe integer and pointer types.
 
template<class T,class Name>
struct PrimitiveWrap
{
	PrimitiveWrap() {}
	explicit PrimitiveWrap( T v ) : value(v) {}
	operator const T&() const { return value; }
	operator       T&()       { return value; }
private:
	T value;
};

# define eiTYPEDEF_INT( name )					\
	struct tag_##name;					\
	typedef PrimitiveWrap<int,tag_##name> name;		//

# define eiTYPEDEF_PTR( name )					\
	struct tag_##name;					\
	typedef tag_##name* name;				//
e.g.
eiTYPEDEF_INT( ModelId ); // ModelId acts like an int, but is a distinct type
eiTYPEDEF_PTR( AnimationId ); // AnimationId acts like an opaque pointer, but is a distinct type

void Play( ModelId m, AnimationId a )
{
  int modelId = (int)m;
  void* animPtr = (void*)a;
  ...
}

Animation anim;
AnimationId a = AnimationId(&anim);
ModelId m = ModelId(42);
Play(m, a);
The final asm should be the same as if you were using ints/void*'s, but you get to use C++'s compile-time type-safety system.


#5198508 Current-Gen Lighting

Posted by Hodgman on 16 December 2014 - 05:46 AM

but I wasn't sure if even current consoles were capable of that yet.

Even PS3/360/mobile games do PBR  these days... just with more approximations.

 

Use a nice BRDF (start with Cook-Torrance / normalized Blinn-Phong), use IBL for ambient (pre-convolved with an approximation of your BRDF, so that you end up with ambient-diffuse and ambient-specular), use gamma-decoding on inputs (sRGB->linear when reading colour textures), render to a high-precision target (Float16, etc.) and tone-map it to gamma-encoded 8-bit (do linear->sRGB / linear->gamma as the last step in tone-mapping).
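
A minimal HLSL sketch of that last tone-map-and-encode step (the Reinhard operator and the pow-based sRGB approximation are just stand-ins; resource names are hypothetical):

Texture2D<float3> g_hdrScene : register(t0); // the Float16 scene target
SamplerState g_linearClamp : register(s0);

cbuffer ToneMapParams : register(b0)
{
    float g_exposure;
};

float3 ToneMap(float3 hdr) // placeholder Reinhard operator
{
    float3 exposed = hdr * g_exposure;
    return exposed / (1.0 + exposed);
}

float3 LinearToGamma(float3 c) // cheap approximation; real sRGB encoding is piecewise
{
    return pow(saturate(c), 1.0 / 2.2);
}

float4 PSMain(float4 pos : SV_Position, float2 uv : TEXCOORD0) : SV_Target
{
    float3 hdr = g_hdrScene.Sample(g_linearClamp, uv);
    return float4(LinearToGamma(ToneMap(hdr)), 1.0);
}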

 

Ideally you'll do Bloom/DOF/motion-blur before tone-mapping, but on older hardware you might do it after (to get better performance, but with worse quality).




#5198292 GPU skinned object picking

Posted by Hodgman on 15 December 2014 - 05:57 AM

You almost never see triangle-based picking on skinned meshes (for pixel-perfect accuracy), except maybe in editor apps.
Generally you'll make a bounding box for each bone, then transform the boxes on the CPU and trace against them.


#5197762 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 12 December 2014 - 07:28 AM

The CHC++ algorithm (Which builds on CHC) takes care of both of these problems.
There’s no synchronization between the CPU and GPU, unfinished queries are simply queued and used in the next frame.
There’s no “popping” either, can’t quite recall how that’s solved in the algorithm, but they have that covered.

From the paper, notice they say it reduces sync points, not that it eliminates them. In the one chart where they compare against the 'OPT' dataset, their average case is very close, as they claim, but their worst case is 2x the frame time / half the framerate, probably due to a sync.

If an object has been invisible but is now visible this frame, you either have to conservatively draw it anyway, or sync (or pop artefacts). So in worst-case situations, if you don't pop and don't sync, you can't cull either...


#5197742 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 12 December 2014 - 02:59 AM

Before you go too far down that path, you might want to evaluate alternatives to CHC++...
Using GPU occlusion queries for CPU occlusion culling means that either you suffer a one-frame delay in culling results, leading to horrible popping artefacts, or you halve your framerate by introducing CPU/GPU sync points. IMHO, it's simply not a feasible approach.



