

#5199361 What is faster cbuffer or data textures for MVP matrices?

Posted by Hodgman on Yesterday, 10:09 PM

In D3D11, you don't have to use VTF; you can use a regular Buffer object (like for vertices) and bind it to a texture slot (tbuffer type in the HLSL code).

If using instancing, you can't use multiple cbuffers (as all resources have to be the same for every instance), so tbuffers are the obvious choice.
If not using instancing, the overhead of updating a cbuffer probably dwarfs the cost of copying 2 to 48 bytes, so I don't imagine that updating cbuffers with indices instead of matrices will be a useful optimization.

With instancing, you could also put the matrix array in a cbuffer rather than a tbuffer, but I would guess this will be non-optimal.
Cbuffers are optimised assuming that every pixel/vertex will require every bit of data in the cbuffer, and older cards may not support array indexing (= may implement it as a huge nested if/elseif chain...).
Tbuffers and textures are optimised assuming that different vertices/pixels will need different subsets of the data, but that there may be some spatial locality that a cache would help with. They're implemented using the fetch hardware, so you know that array indexing will work, but also that it will be performing an actual (cached) memory request (whereas perhaps cbuffer data may have been pre-fetched into registers - wasting a huge number of them).

Lastly, you can put the matrix data into a vertex buffer and bind it to an Input Assembler slot, where the vertex layout / vertex declaration is responsible for defining how it is fetched for the vertex shader.
In D3D9, this is probably the best approach, as VTF was either slow and limited, or entirely unsupported back then. In D3D11, it's probably faster to define your own tbuffer and do the fetch yourself using SV_InstanceID.
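For illustration, a minimal HLSL sketch of the tbuffer + SV_InstanceID approach (register slots, array size, and names here are hypothetical):

```hlsl
// Per-instance world matrices, bound to a texture slot rather than a cbuffer slot.
tbuffer InstanceData : register(t0)
{
    float4x4 g_worldMatrices[1024]; // hypothetical capacity
};

cbuffer PerFrame : register(b0)
{
    float4x4 g_viewProj;
};

float4 VsMain( float4 position : POSITION,
               uint instanceId : SV_InstanceID ) : SV_Position
{
    // Fetched via the texture hardware - a cached memory read per instance.
    float4x4 world = g_worldMatrices[instanceId];
    return mul( mul( position, world ), g_viewProj );
}
```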

#5198870 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 17 December 2014 - 07:11 PM

What would be some good alternatives, preferably with being able to keep the OctTree, for dynamic scenes?

That's a good question!  I just wanted to point out the inevitable GPU sync points involved in CPU-read-back techniques.


Obviously the best performance will be with precomputed techniques and static scenes...


For dynamic scenes, many engines are using depth occlusion queries, but entirely on the CPU-side to avoid the GPU-sync issues.

This generally requires your artists to generate very low-poly LOD's of your environment / occluders, and ensure that these LODs are entirely contained within the high-poly visual versions. You then rasterize them at low resolution on the CPU to a floating point depth buffer. To test objects for visibility, you find their 2D bounding box (or bounding ellipse) and test all the floats in that region to see if the bounding volume is covered or not.
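As a rough sketch of that final test (all names hypothetical): given the low-res float depth buffer, an object is culled only if every texel under its screen-space bounding rect is already covered by a nearer occluder.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical low-resolution software depth buffer (0 = near plane, 1 = far).
struct DepthBuffer
{
    int width, height;
    std::vector<float> depth; // width*height texels: nearest occluder depth per texel

    float At(int x, int y) const { return depth[y * width + x]; }
};

// Conservative visibility test: the object is occluded only if EVERY texel
// covered by its 2D bounding rect holds a depth nearer than the object's
// nearest possible depth. Any farther texel means "potentially visible".
bool IsOccluded(const DepthBuffer& db,
                int minX, int minY, int maxX, int maxY, // bounding rect, in texels
                float objectMinDepth)                   // nearest depth of the bounding volume
{
    minX = std::max(minX, 0);            minY = std::max(minY, 0);
    maxX = std::min(maxX, db.width - 1); maxY = std::min(maxY, db.height - 1);
    for (int y = minY; y <= maxY; ++y)
        for (int x = minX; x <= maxX; ++x)
            if (db.At(x, y) >= objectMinDepth)
                return false; // this texel's occluder is not in front of the object
    return true;
}
```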


At the moment, I'm using a 1bpp software coverage rasterizer, as mentioned here.


Another technique that I've seen used in AAA games is simple "occlusion planes". The level designers manually place large rectangles around the levels (usually inside walls of buildings / etc), which represent occluders. The game then frustum culls these rectangles and then sorts them by screen-space size and selects the biggest N number of them. Those N rectangles are then extruded into occluder frustums, and every visible object is tested to see if it's entirely inside any of those occluder frustums.


visibleObjects   = allObjects.Where( x => !EntirelyOutsideFrustum(camera, x) );
visibleOccluders = allOccluders.Where( x => !EntirelyOutsideFrustum(camera, x) );
bestOccluders    = visibleOccluders.OrderByDescending( x => x.ScreenArea() ).Take(10);
occluderFrusta   = bestOccluders.Select( x => x.ExtrudeQuadToFrustum(cameraPos) );
reallyVisibleObjects = visibleObjects.Where( x => !occluderFrusta.Any( f => EntirelyInsideFrustum(f, x) ) );

You'd be surprised at how fast modern CPUs can burn through these kinds of repeated, simple, brute-force checks... Sometimes simple flat lists will even out-perform complex structures like trees, due to how ridiculously slow random memory access patterns are compared to predictable linear patterns.


Other games use a combination of portals and occlusion planes.

#5198695 How much do I pay someone for coming up with some ideas for my game?

Posted by Hodgman on 16 December 2014 - 09:48 PM

Add up how many hours both you and him have put into it and show those numbers to him.

Either agree to a royalty share based on that, or a fixed price, e.g. $20 x his hours.
In either case you need a lawyer to draft the agreement.

If negotiations fail, you're not legally obliged to pay him anything.... Game ideas aren't really subject to copyright - implementations of ideas are.

#5198691 declare custom types of ints?

Posted by Hodgman on 16 December 2014 - 09:06 PM

yes, but if both parameters are of type eiTYPEDEF_INT, the compiler won't catch it if they are accidentally reversed will it? IE if i accidentally passed ani # as model #, and model # as ani #.

The whole point of those macros is to solve that problem for you - otherwise you'd just use the regular typedef keyword.
If you use those macros to define ModelID and AnimID, then passing an AnimID as a ModelID will result in a compiler error, saying it can't convert AnimID to ModelID.
The trick is that you end up with two completely different instantiations of the PrimitiveWrap template -- one using 'tag_ModelID' as an argument and one using 'tag_AnimID' as an argument. Those 'tag' types are just dummy structures with no use at all, except to trick C++ into cloning the PrimitiveWrap template into a new unique type.
struct tag_ModelID;//useless structs, just to create a unique type
struct tag_AnimID;

typedef PrimitiveWrap<int, tag_ModelID> ModelID;//the useless structs are used as a template argument
typedef PrimitiveWrap<int, tag_AnimID>  AnimID; // so that the two resulting types are different

void PlayAnimation(ModelID some_model, AnimID some_ani);
AnimID a = AnimID(2);
ModelID m = ModelID(1);
PlayAnimation(m, a);//OK!
PlayAnimation(a, m);//error - can't convert arg#1 from PrimitiveWrap<int,tag_AnimID> to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(1, 2);//error - can't implicitly convert arg#1 from int to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(AnimID(1), ModelID(2));//error - can't convert arg#1 from PrimitiveWrap<int,tag_AnimID> to PrimitiveWrap<int,tag_ModelID>
PlayAnimation(ModelID(1), AnimID(2));//OK!

int id = m;      //OK -- n.b. you CAN implicitly convert from IDs back into integers
ModelID m2 = id; //ERROR -- but you can't implicitly convert integers into IDs!
ModelID m3 = ModelID(id);//OK -- you have to explicitly convert them like this

#5198653 declare custom types of ints?

Posted by Hodgman on 16 December 2014 - 05:54 PM

I use this code (and helper macros) to declare custom type-safe integer and pointer types.
template<class T, class Name>
struct PrimitiveWrap
{
	PrimitiveWrap() {}
	explicit PrimitiveWrap( T v ) : value(v) {}
	operator const T&() const { return value; }
	operator       T&()       { return value; }
	T value;
};

# define eiTYPEDEF_INT( name )					\
	struct tag_##name;					\
	typedef PrimitiveWrap<int,tag_##name> name;		//

# define eiTYPEDEF_PTR( name )					\
	struct tag_##name;					\
	typedef tag_##name* name;				//
eiTYPEDEF_INT( ModelId ); // ModelId is a typedef for int
eiTYPEDEF_PTR( AnimationId ); // AnimationId is a typedef for void*

void Play( ModelId m, AnimationId a )
{
  int modelId = (int)m;     // converts back to a plain int
  void* animPtr = (void*)a; // converts back to a plain pointer
}

Animation anim;
AnimationId a = AnimationId(&anim);
ModelId m = ModelId(42);
Play(m, a);
The final asm should be the same as if you were using ints/void*'s, but you get to use C++'s compile-time type-safety system.

#5198508 Current-Gen Lighting

Posted by Hodgman on 16 December 2014 - 05:46 AM

but I wasn't sure if even current consoles were capable of that yet.

Even PS3/360/mobile games do PBR these days... just with more approximations.


Use a nice BRDF (start with Cook-Torrance / normalized Blinn-Phong), use IBL for ambient (pre-convolved with an approximation of your BRDF, so that you end up with ambient-diffuse and ambient-specular), use gamma-decoding on inputs (sRGB->linear when reading colour textures), render to a high-precision target (Float16, etc) and tone-map it to gamma-encoded 8bit (do linear->sRGB / linear->Gamma as the last step in tone-mapping).
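The gamma-decode/encode steps above are the standard piecewise sRGB curve; a minimal single-channel sketch (pow(x, 2.2) and pow(x, 1/2.2) are the common cheap approximations):

```cpp
#include <cmath>

// sRGB <-> linear conversion for one colour channel in [0,1], using the
// standard piecewise sRGB transfer function.
float SrgbToLinear(float s)
{
    return (s <= 0.04045f) ? s / 12.92f
                           : std::pow((s + 0.055f) / 1.055f, 2.4f);
}

float LinearToSrgb(float l)
{
    return (l <= 0.0031308f) ? l * 12.92f
                             : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}
```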


Ideally you'll do Bloom/DOF/motion-blur before tone-mapping, but on older hardware you might do it after (to get better performance, but with worse quality).

#5198292 GPU skinned object picking

Posted by Hodgman on 15 December 2014 - 05:57 AM

You almost never see triangle based picking on skinned meshes (for pixel perfect accuracy), except maybe in editor apps.
Generally you'll make a bounding box for each bone, then transform the boxes on the CPU and trace against them.
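The per-bone trace boils down to a ray-vs-box test; a minimal slab-method sketch (in practice you'd transform the ray into each bone's local space first - names here are illustrative):

```cpp
#include <algorithm>

struct Vec3 { float x, y, z; };

// Slab-method ray vs axis-aligned box intersection test. For skinned picking,
// run this once per bone after bringing the ray into that bone's local space.
bool RayHitsAabb(Vec3 origin, Vec3 dir, Vec3 boxMin, Vec3 boxMax)
{
    float tMin = 0.0f, tMax = 1e30f;
    const float o[3]  = { origin.x, origin.y, origin.z };
    const float d[3]  = { dir.x,    dir.y,    dir.z };
    const float lo[3] = { boxMin.x, boxMin.y, boxMin.z };
    const float hi[3] = { boxMax.x, boxMax.y, boxMax.z };
    for (int i = 0; i < 3; ++i)
    {
        float inv = 1.0f / d[i]; // IEEE infinity handles axis-parallel rays
        float t0 = (lo[i] - o[i]) * inv;
        float t1 = (hi[i] - o[i]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);
        tMin = std::max(tMin, t0);
        tMax = std::min(tMax, t1);
        if (tMax < tMin) return false; // the slabs' intervals don't overlap
    }
    return true;
}
```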

#5197762 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 12 December 2014 - 07:28 AM

The CHC++ algorithm (Which builds on CHC) takes care of both of these problems.
There’s no synchronization between the CPU and GPU, unfinished queries are simply queued and used in the next frame.
There’s no “popping” either, can’t quite recall how that’s solved in the algorithm, but they have that covered.

From the paper, notice they say it reduces sync points, not eliminates sync points. In the one chart where they compare with the 'OPT' dataset, their average case is very close as they claim, but their worst case is 2x the frametime / half the framerate, probably due to a sync.

If an object has been invisible but is now visible this frame, you either have to conservatively draw it anyway, or sync (or pop artefacts). So in worst-case situations, if you don't pop and don't sync, you can't cull either...

#5197742 Occlusion Culling - Combine OctTree with CHC algorithm?

Posted by Hodgman on 12 December 2014 - 02:59 AM

Before you go too far down that path, you might want to evaluate alternatives to CHC++...
Using GPU occlusion queries for CPU occlusion culling means that either you suffer a one-frame delay in culling results, leading to horrible popping artefacts, or you halve your framerate by introducing CPU/GPU sync points. IMHO, it's simply not a feasible approach.

#5197481 Physical Based Models

Posted by Hodgman on 10 December 2014 - 06:24 PM

Yeah, I'm with Promit - even if two different engines/art tools both use "PBR", they probably store their specular/roughness/etc maps in completely different ways...

The best solution would be if the art files used only mask textures and material names - e.g. a greyscale map specifying which bits are clean copper, another map for "worn blue paint", etc, etc... You could then write/use a tool to convert those masks into textures that are correct for your engine.

It's always been an issue that games made up of "asset packs" will have no consistency unless retouched by an art team... PBR has just exaggerated this existing issue.

#5197142 is there a name for this type of data structure?

Posted by Hodgman on 09 December 2014 - 06:40 AM

The "interface with Update/Render methods" pattern is often called a "game object" or a "game entity".

These days I consider it an anti-pattern...

Sure, you could implement any bit of software using it - though non-"realtime"/interactive software would be a bad fit...
However, by choosing to put everything into a single "entity" list you're making some serious trade-offs! Sure, your main loops becomes incredibly simple - for each entity, Update; for each entity Render... But in exchange you're giving up the ability to have any high level knowledge of your application's data (and data-flows), and you're giving up all knowledge of your program's flow-of-control. The order that things happen is left up to the chance of your entity list ordering...

You end up with silly bugs, like sometimes your homing missiles are targeting a position that lags by one frame, because you're unable to ensure that all movement logic has completed before executing targeting logic... You become cursed to watch, helplessly, as your co-workers expand the interface to include PreUpdate, PostUpdate... PostPostUpdate... to work around such bugs :lol:

When considering the potential 4x boosts of a quad-core CPU, you're left scratching your head, because any of those Update calls could be touching any bit of the game-state, leading you to consider abhorrent ideas like per-object locks and even more non-deterministic update orders...

This pattern is single-handedly responsible for the oft-repeated myth that "games are not a good fit for multithreading" or "games are hard to multithread"... And on that note - last-gen consoles made triple-core CPUs and NUMA co-processing standard requirements for all games. Current-gen consoles are making hex-core CPUs standard. Games are now leading the way in making consumer software take advantage of parallel processors.

IMHO, having a larger number of smaller, more homogeneous lists of self contained objects (or 'systems' of objects), combined with a more complex main loop in which the flow of data and control is explicit and easy to follow, is a far, far superior approach.
It's easier to keep everything sensibly deterministic, easier to understand and maintain, easier to optimize, easier to spread over multiple CPU cores...
Hence all the rage about 'component systems', 'data oriented design', and multithreaded game engines in recent times.
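A tiny sketch of that "homogeneous lists + explicit main loop" shape (all names made up for illustration): each system owns flat arrays, and the main loop fixes the ordering so that all movement finishes before any targeting reads positions - the one-frame-lag missile bug above becomes impossible by construction.

```cpp
#include <cstddef>
#include <vector>

// One system per homogeneous list, rather than one Update() per heterogeneous entity.
struct MovementSystem
{
    std::vector<float> posX, velX; // parallel arrays, one entry per entity (1D for brevity)
    void Update(float dt)
    {
        for (std::size_t i = 0; i < posX.size(); ++i)
            posX[i] += velX[i] * dt; // linear, cache-friendly, trivially parallel
    }
};

struct TargetingSystem
{
    std::vector<std::size_t> targetIndex; // one entry per missile: which entity to chase
    std::vector<float> aimX;              // where each missile is currently aiming
    void Update(const MovementSystem& movement)
    {
        for (std::size_t i = 0; i < targetIndex.size(); ++i)
            aimX[i] = movement.posX[targetIndex[i]]; // always reads post-move positions
    }
};

void Tick(MovementSystem& movement, TargetingSystem& targeting, float dt)
{
    movement.Update(dt);        // flow of control and data is explicit, right here
    targeting.Update(movement); // ...so ordering is deterministic, not list-order luck
}
```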

#5197055 is there a better way top refer to assets in a game?

Posted by Hodgman on 08 December 2014 - 05:38 PM

and the next step after that would be stepping up to skinned meshes vs rigid body animation. but i don't think even dx11 and the latest card could do it: 125 characters onscreen at once without slowdowns at 15fps. that would be 62 characters at once at 30 fps, or 31 skinned mesh characters onscreen at once at 60fps. games can't really do this yet can they? total war draws a lots of characters, but they're not high resolution, like a character in a typical shooter.

In the sports games I've worked on we've usually got ~30 players and referees on the field, plus ~32 low-detail spectators (which are then instanced/impostered to fill up to 100000 stadium seats). That's at 30hz on DX9/2006-era consoles (with about half the frame time being spent on post-processing), and 60Hz on the new DX11/2014-era ones.
Bigger rival companies were doing it at 60Hz in the DX9 era too...

Play any newish Assassin's creed or Hitman game and you'll see crowds of easily 100 animated NPCs, which the player can interact with (interrupt/push/etc).

Going back a ways, any Quake 3 derived shooter (e.g. every Call of Duty game) supports 32 player multiplayer on DX9.

Quake 3 was on the cusp between the CPU doing the skinning, and the GPU's vertex shader taking over that role. These days almost everyone uses GPU-skinning. GPU's can crunch *millions* of pixels per frame with highly complex pixel shaders, so 30 characters * 10k verts is a breeze.

#5196896 Does glMapBuffer() Allocate Client-Side Memory?

Posted by Hodgman on 07 December 2014 - 10:20 PM

I thought operating systems typically provide ALL memory, regardless of where in the system it's located, its own unique range of memory addresses. For example, memory address 0x00000000 to 0x2000000 point to main system memory while 0x20000001 to 0x2800000 all point to the GPU's memory.

Regarding PHYSICAL RAM, maybe... But we work with VIRTUAL RAM at all times these days.
If your process needs to access some physical RAM, the OS has to give you a range of virtual addresses, and then 'map' those virtual addresses to the physical resources you've allocated.
By default, there will be no virtual addresses corresponding to any VRAM. Also, a quirk of modern desktop OS's means only a small bit of VRAM can be mapped to CPU-side virtual addresses at one time (hence all the unmapping).

In practice, if you're using the no-overwrite/unsynchronized map flags/hints, you've got the best chance at being given an actual pointer to VRAM! If so, this means that when writing to those addresses, you'll skip the CPU's caches and go via a write-combining buffer for maximum throughput (another reason for the mandatory unmap - in this case, the driver needs to flush the CPU's write-combine cache), but if you read from that pointer, well, it's going to be dog slow (no cache, non-local resource = bad).

With any other map flags (except perhaps in write-discard/orphaning situations), the driver will almost certainly internally allocate some extra CPU-side RAM, and copy through to the GPU itself.

#5196387 Blending without changing the alpha source

Posted by Hodgman on 04 December 2014 - 10:59 PM

Yeah, the D3D11 equivalent is a blend state with separate alpha blend factors - in D3D11_RENDER_TARGET_BLEND_DESC, set SrcBlendAlpha to D3D11_BLEND_ZERO and DestBlendAlpha to D3D11_BLEND_ONE (with BlendOpAlpha as D3D11_BLEND_OP_ADD), so the destination alpha channel is left unchanged by the blend.


#5196243 Accurately estimating programming cost?

Posted by Hodgman on 04 December 2014 - 07:16 AM

You need an experienced lead programmer -who is familiar with their team of programmers- to make the estimates ;)

You can't just make estimates in a vacuum.
Said lead will break down the design into a list of technical features, identify the dependencies on other features, make a rough list of tasks, then refine it into more precise tasks. They'll estimate all those tasks with regards to the capabilities of their team.
If they've been given 5 veterans who they've worked with in the past, you're going to get a much lower total estimate than if they've been given 15 university graduates.
Ideally the actual staff will have input on generating these estimates (and then the lead might multiply the staff's numbers by Pi just to be safe).

Sometimes you might have to commit some experienced programmers to the project first, in a "pre-production" phase, so they can experiment on different approaches to solving the design requirements before estimates can even be guessed at...
e.g. If you haven't yet chosen an engine, you'll probably want your core team to evaluate your options and make that decision before going on to create the detailed task list.