
Member Since 02 Feb 2004
Offline Last Active Sep 04 2013 10:12 PM

#5089889 questions about singletons

Posted by nonoptimalrobot on 28 August 2013 - 02:21 PM


 Sometimes you truly only want one instancing of something running around (assert tracker, GPU wrapper etc)


This is misleading -- You may "want" only one, but it's very, very seldom that one cannot imagine a scenario where, in fact, you actually need more than one, and even more seldom when having more than one would actually be incorrect. Usually when a programmer says "I'll use a singleton because I only want one of these." what he's really saying is "I'll use a singleton because I don't want to bother making my code robust enough to handle more than one of these."


If you can justify choosing convenience over correctness you're free to do so, but you've made your bed once you've chosen a Singleton and you alone will have to lay in it.



...yeah, 'correctness' in software design; something that's largely considered undefinable.  We are not scientists doing good science or bad science; we are craftsmen, and what one paradigm holds over the other is simply the kinds of coding decisions that are discouraged vs encouraged.  The idea is to use designs that encourage use cases that pay productivity dividends down the road and discourage use cases that become time sinks.  In this sense Singletons are ALL BAD, but making your life easier down the road shouldn't be a directive that is pursued so pathologically that the productivity of the moment slows to a crawl.  I guess the point I'm trying to make is that a tipping point exists between writing infinitely sustainable code and getting the job done.  Here are some examples:


Singletons make sense for something like a GPU; it is a piece of hardware, there is only one of them and it has an internal state.  Of course you can write code such that the GPU appears virtualized and your application can instantiate any number of GPU objects and use them at will.  Indeed I would consider this good practice as it will force the code that relies on the GPU to be written better and should extend nicely to multi GPU platforms.  The flip side is that implementing all this properly comes at a cost in productivity at the moment and that cost needs to be considered over the probability of seeing the benefit.


Another example is collecting telemetry or profiling data.  It's nice to have something in place that tells everyone on the team: "Hey, telemetry data and performance measurements should be coalesced through these systems for the purposes of generating comprehensive reports."  A Singleton does this while class declarations called Profiler and Telemetry do not.  Again, you can put the burden of managing and using instances of Profiler and Telemetry onto the various subsystems of your application, and once again this may lead to better code, but if the project never lives long enough to see that 'better code' pay productivity gains then what was the point?


I don't implement Singletons either personally or professionally unless explicitly directed to do so (for the reasons outlined by Ravyne and SiCrane), but I have worked on projects that did use them and overall I was glad they existed as they made me more productive on the whole.  In these instances the dangers of using other people's singletons in already singleton dependent systems never came to fruition, and the time I sank into writing beautiful, self contained, stateless and singleton free code never paid off.  Academic excellence vs. pragmatism:  it's a tradeoff worth considering.  Mostly I'm playing devil's advocate here as I find blanket statements about a design paradigm being all good or all bad misleading.  Anyway, this is likely to get flamey...it already is and people aren't even disagreeing yet. :)  I'm out.

#5089723 a map<T,T> as a default parameter...

Posted by nonoptimalrobot on 28 August 2013 - 12:18 AM

You would need to do this:

bool createWindowImp(DemString		winName,	// window name
		     DemString		winTitle,	// window title
		     DemUInt		winWidth,	// window width.
		     DemUInt		winHeight,
		     ExtraParameters	params = ExtraParameters());

Which will push an instance of ExtraParameters (which will be empty) onto the stack and pass it on to the function.  The instance will get popped off the stack after the call returns.  A more efficient approach would look like this:

bool createWindowImp(DemString		winName,	// window name
		     DemString		winTitle,	// window title
		     DemUInt		winWidth,	// window width.
		     DemUInt		winHeight,
		     ExtraParameters*	params = 0);

Inside createWindowImp you will need to check whether params is non-null and, if so, dereference it and extract values.




I see ExtraParameters is an std::map; in that case IF you want to go with the first option you will definitely want to tweak it a bit:

bool createWindowImp(DemString			winName,	// window name
		     DemString			winTitle,	// window title
		     DemUInt			winWidth,	// window width.
		     DemUInt			winHeight,
		     const ExtraParameters&	params = ExtraParameters());

If you don't pass by reference then each time the function is called a temporary std::map will be created for the params variable and a deep copy will be performed between it and whatever you happen to be passing in; that equates to a lot of memory that gets newed only to be promptly deleted.  It's good practice to toss the 'const' in there too unless you want to pass info out of createWindowImp via params (usually frowned upon).
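For what it's worth, here is a minimal sketch of the const-reference default in isolation.  The function name lookupOrDefault and the string-to-string map are assumptions made up for illustration; the real ExtraParameters typedef in the original question may differ.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the ExtraParameters type from the question.
typedef std::map<std::string, std::string> ExtraParameters;

// The map is taken by const reference, so a caller's map is never copied.
// When the argument is omitted, an empty temporary is bound to the
// reference for the duration of the call.
std::string lookupOrDefault(const std::string& key,
                            const std::string& fallback,
                            const ExtraParameters& params = ExtraParameters())
{
    assert(!key.empty());
    ExtraParameters::const_iterator it = params.find(key);
    return it != params.end() ? it->second : fallback;
}
```

Calling lookupOrDefault("vsync", "off") uses the empty default, while passing a populated map as the third argument reads from the caller's map without copying it.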

#5089654 D3DXMatrixLookAtLH internally

Posted by nonoptimalrobot on 27 August 2013 - 06:32 PM

This:  http://msdn.microsoft.com/en-us/library/windows/desktop/bb205342(v=vs.85).aspx contains a description of the algorithm.  Note that the matrix is identical to an object-to-world space transform that has been inverted.


The gist of the algorithm is to compute the local x, y and z axes of the camera in world space and then drop them into a matrix such that the camera is placed at the origin.   Transforms containing rotations and translations only are constructed as follows:

| ux  uy  uz  0 |
| vx  vy  vz  0 |
| nx  ny  nz  0 |
| tx  ty  tz  1 |

// x-axis of basis (ux, uy, uz) 
// y-axis of basis (vx, vy, vz)
// z-axis of basis (nx, ny, nz)
// origin of basis (tx, ty, tz)

That's a simple world transform; you are building a view matrix so you actually want the inverse which is computed as follows:

|        ux          vx         nx  0 |
|        uy          vy         ny  0 |
|        uz          vz         nz  0 |
| -(t dot u)  -(t dot v) -(t dot n) 1 |

// u = (ux, uy, uz)
// v = (vx, vy, vz)
// n = (nx, ny, nz)
// t = (tx, ty, tz)
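Putting the two matrices together, a rough sketch of a look-at function might look like the following.  Vec3, Mat4, lookAtLH and transformPoint are made-up names for illustration (this is not the D3DX source); the axis construction follows the description on the linked MSDN page, and the matrix uses the row-vector layout shown above.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };

static Vec3  sub(Vec3 a, Vec3 b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  cross(Vec3 a, Vec3 b)
{
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalize(Vec3 a)
{
    float len = std::sqrt(dot(a, a));
    assert(len > 0.0f);
    return Vec3{a.x / len, a.y / len, a.z / len};
}

// Builds a left-handed view matrix: the inverse of the camera's world transform.
Mat4 lookAtLH(Vec3 eye, Vec3 at, Vec3 up)
{
    Vec3 n = normalize(sub(at, eye));   // camera z-axis (view direction)
    Vec3 u = normalize(cross(up, n));   // camera x-axis
    Vec3 v = cross(n, u);               // camera y-axis
    Mat4 r = {{
        { u.x,          v.x,          n.x,          0.0f },
        { u.y,          v.y,          n.y,          0.0f },
        { u.z,          v.z,          n.z,          0.0f },
        { -dot(eye, u), -dot(eye, v), -dot(eye, n), 1.0f }
    }};
    return r;
}

// Transforms a point (w = 1) as a row vector: p' = p * M.
Vec3 transformPoint(Vec3 p, const Mat4& M)
{
    return Vec3{
        p.x * M.m[0][0] + p.y * M.m[1][0] + p.z * M.m[2][0] + M.m[3][0],
        p.x * M.m[0][1] + p.y * M.m[1][1] + p.z * M.m[2][1] + M.m[3][1],
        p.x * M.m[0][2] + p.y * M.m[1][2] + p.z * M.m[2][2] + M.m[3][2]
    };
}
```

A quick sanity check is that transforming the eye position itself should land on the origin, since the view matrix places the camera there.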

#5089038 Storing grass patch data (terrain system)

Posted by nonoptimalrobot on 25 August 2013 - 07:33 PM

This is how I've done it in the past:


Grass "clumps" are placed in the world individually or via a spray brush in the world editor.  The brush and placement tool have various options to make this easy including an 'align to terrain' behavior and various controls over how sizes, orientations and texture variations are handled.  This process generates a massive list of positions, scales and orientations (8 floats per clump).  There are millions of grass clumps so storing all this in the raw won't do...


At export time the global list of grass clumps is partitioned into a regular 3d grid.  Each partition of the grid has a list of the clumps it owns and quantizes the positions, scales and orientations into two 32 bit values.  The first value contains four 8 bit integers: the first 3 ints represent a normalized position w.r.t. the partition's extents and the 4th is just a uniform scale factor.  The second 32 bit value is a quaternion with its components quantized to 8 bit integers (0 = -1 and 255 = 1).
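A minimal sketch of that packing, for illustration only: the names (PackedClump, packClump, unpackClump), the byte order inside each 32 bit value and the maxScale range for the scale factor are all assumptions, not the original exporter's layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Maps a value in [lo, hi] to an 8 bit integer (0 = lo, 255 = hi).
static uint32_t quantize8(float v, float lo, float hi)
{
    assert(hi > lo);
    float t = (v - lo) / (hi - lo);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return (uint32_t)std::floor(t * 255.0f + 0.5f);
}

static float dequantize8(uint32_t q, float lo, float hi)
{
    return lo + (hi - lo) * (float)(q & 0xFFu) / 255.0f;
}

struct PackedClump
{
    uint32_t posScale;  // x, y, z (normalized to the partition) + uniform scale
    uint32_t rotation;  // quaternion x, y, z, w
};

// pos01: position normalized to the partition's extents, each component in [0, 1].
// scale: uniform scale in [0, maxScale] (maxScale is an assumed per-partition constant).
PackedClump packClump(const float pos01[3], float scale, float maxScale, const float quat[4])
{
    PackedClump c;
    c.posScale = quantize8(pos01[0], 0.0f, 1.0f)
               | (quantize8(pos01[1], 0.0f, 1.0f) << 8)
               | (quantize8(pos01[2], 0.0f, 1.0f) << 16)
               | (quantize8(scale, 0.0f, maxScale) << 24);
    c.rotation = quantize8(quat[0], -1.0f, 1.0f)
               | (quantize8(quat[1], -1.0f, 1.0f) << 8)
               | (quantize8(quat[2], -1.0f, 1.0f) << 16)
               | (quantize8(quat[3], -1.0f, 1.0f) << 24);
    return c;
}

void unpackClump(const PackedClump& c, float maxScale, float pos01[3], float& scale, float quat[4])
{
    for (int i = 0; i < 3; ++i)
        pos01[i] = dequantize8(c.posScale >> (8 * i), 0.0f, 1.0f);
    scale = dequantize8(c.posScale >> 24, 0.0f, maxScale);
    for (int i = 0; i < 4; ++i)
        quat[i] = dequantize8(c.rotation >> (8 * i), -1.0f, 1.0f);
}
```

With 4 byte floats this takes each clump from 32 bytes down to 8.  The round trip is lossy, so the unpacked quaternion should be renormalized before it is used.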


At runtime the contents of nearby partitions are decompressed on the fly into data that's amenable to hardware instancing.  This was a long time ago so it was important the vertex shader didn't spend too much time unpacking data.  These days, with bit operations available to the GPU, you might be able to use the actual compressed data and not need to manage compressing and uncompressing chunks in real-time; if you do decompress at runtime, use a background thread.


It worked pretty well and was extended to support arbitrary meshes so the initial concept of a grass clump evolved to include pebbles, flowers, sticks, debris etc.  Any mesh that was static and replicated many times throughout the world was a good candidate for this system as long as its vertex count was low enough to justify the loss of the post transform cache caused by the type of HW instancing being used.

#5088762 Distortion/Heat Haze FX

Posted by nonoptimalrobot on 24 August 2013 - 05:25 PM

I'm still not really clear on what you do with water.  Water would be some plane, is the cascade you're talking about aligned to this plane?


The 'cascades' are generated by slicing up the view frustum with cutting planes that are perpendicular to the view direction.  Objects that use distortion are rendered in the furthest distortion cascade first and the nearest distortion cascade last.  The frame buffer is resolved into a texture between rendering each cascade.  Folding water into this is definitely ad hoc.  The cascades get split into portions that are above the water plane and below the water plane.  Render the cascades below normally then resolve your frame buffer and move on to render the cascades that are above the water plane.  If the camera is below the water reverse the order.  Ugly huh?  Obviously none of this solves the problem; it just mitigates artifacts.


This may be slightly off topic but is this basically saying that I don't need to worry about sorting when i'm doing premultiplied alpha?!?  How are more people not just using this then?


Yeah, those posts (especially the second one) are misinformation or at least poorly presented information.  Alpha blending, whether using pre-multiplied color data or not, is still multiplying the contents of your frame buffer by 1-alpha so the result is order dependent.  Consider rendering to the same pixel 3 different times using an alpha pre-multiplied texture in back to front order.  The passes use color and alpha values of (c0, a0), (c1, a1) and (c2, a2) respectively and 'color' is the initial value of the frame buffer.  Note I'm writing aX instead of 1-aX here because it requires fewer parentheses and is therefore easier to visually analyze; this doesn't invalidate the assertion.

Pass 1:              c0 + color * a0
Pass 2:        c1 + (c0 + color * a0) * a1
Pass 3:  c2 + (c1 + (c0 + color * a0) * a1) * a2 = result of in order rendering

Now let's reverse the order:

Pass 1:              c2 + color * a2
Pass 2:        c1 + (c2 + color * a2) * a1
Pass 3:  c0 + (c1 + (c2 + color * a2) * a1) * a0 = result of out of order rendering


c2 + (c1 + (c0 + color * a0) * a1) * a2 != c0 + (c1 + (c2 + color * a2) * a1) * a0 

 You still need to depth sort transparent objects regardless of how you choose to blend them...unless of course you are just doing additive blending, then it doesn't matter.
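To put numbers on it, here is a small sketch that blends two layers over a single-channel background in both orders.  The Layer struct and the function names are invented for the example; the pre-multiplied blend is the usual src + dst * (1 - alpha).

```cpp
#include <cassert>

// One color channel is enough to show the effect.
// 'color' is assumed to already be multiplied by 'alpha'.
struct Layer { float color; float alpha; };

// Pre-multiplied alpha blend: result = src + dst * (1 - srcAlpha).
float blendPremultiplied(float dst, Layer src)
{
    assert(src.alpha >= 0.0f && src.alpha <= 1.0f);
    return src.color + dst * (1.0f - src.alpha);
}

// Additive blend: result = src + dst, which is commutative.
float blendAdditive(float dst, Layer src)
{
    return src.color + dst;
}
```

With a background of 1.0 and layers (0.5, 0.5) and (0.2, 0.8), drawing the first layer and then the second gives 0.4 while the reverse order gives 0.7.  The additive version gives the same sum either way.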

#5088731 Managing Game Time ( Ticks ) and Speed Issues

Posted by nonoptimalrobot on 24 August 2013 - 01:38 PM

Your code is a little strange but it seems to work, nothing jumps out as wrong.


Off topic note:  It is a little strange to render as frequently as possible but lock updating at 60Hz.  If nothing moved why render again, the generated image will be the same.  Don't get me wrong games do this all the time but usually in a slightly more complicated way.  For example:  if updating happens at 30Hz and rendering happens at 60Hz then the renderer can linearly interpolate between (properly buffered) frame data generated by the updater to give the illusion of the entire game running at 60 fps. 


Back to the problem at hand...


My second issues is that once I bump up the instances to 10k( 40k total vertices and a shared texture ) my gamespeed and render speed drop significantly. Gamespeed drops from 57-58 ticks per second to 42-43 ticks per second and my render speed drops even worse from 6k-7k FPS to 125-130FPS. Here is my render code:


This is a valid observation (something is amiss) but using fps to measure relative changes in performance can be misleading.  A change from 57-58 fps to 42-43 fps means each frame is taking approximately 6.1 milliseconds longer to finish.  A change from 6k-7k fps to 125-130 fps means each frame is taking approximately 7.6 milliseconds longer to finish.  Not as drastic as it first seems.  Think of fps measurements as velocities: they are inversely proportional to time.  If you have a race between two cars it doesn't make a lot of sense to say car A finished the race at 60 mph and car B finished the race at 75 mph.  What you really want to know is the time it took each car to finish the race.
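The conversion is just 1000 / fps.  A tiny sketch (the function names are made up) that reproduces the comparison using the midpoints of the quoted ranges:

```cpp
#include <cassert>

// Time spent on a single frame, in milliseconds, at a given frame rate.
double frameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// How much longer each frame takes after a drop from fpsBefore to fpsAfter.
double addedFrameTimeMs(double fpsBefore, double fpsAfter)
{
    return frameTimeMs(fpsAfter) - frameTimeMs(fpsBefore);
}
```

addedFrameTimeMs(57.5, 42.5) and addedFrameTimeMs(6500.0, 127.5) come out within a couple of milliseconds of each other, which is the point: the two drops cost a similar amount of time per frame even though the fps numbers look wildly different.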



I'd imagine that on my computer which is considerably powerful, I should be able to get 200k+ vertices minimum at 60fps.... especially considering I am only handling the rendering of instances and nothing else at all.


Agreed, but I'm not sure why your performance is tanking.  Hopefully someone who knows a bit more about OpenGL can point to something that's a blatant performance problem.

#5088726 Bullet Impact on Mesh Edge

Posted by nonoptimalrobot on 24 August 2013 - 01:03 PM

I know that he uses forward renderer, but i am interseted in other techniques for deferred one, and i see you have something interesting to write about that, so please share. :D


I'm curious as well.


At one point I was dealing with a game that had highly detailed meshes used for rendering and exceptionally simplified meshes used for collision.  This meant the normal solution of projecting and clipping your decal against the collision geometry resulted in a decal that was rarely consistent with the topology of the rendered object.  Obviously you can solve this by giving your decal system access to the render geometry for the purposes of projecting and clipping but this wasn't practical; the computational geometry code wanted triangles in a specific format that didn't mix well with rendering so there was a huge memory overhead.  Our renderer was deferred, which meant easy access to the depth buffer and a normal buffer; for this reason the decision was made to render decals as boxes and use various tricks in the pixel shader to create the illusion of decals being projected onto the scene.  Essentially UV coordinates would be tweaked based on the result of a ray cast.  It worked fairly well but had some view dependent artifacts (sliding) that were distracting if you zoomed in on and rotated around a decal that had been projected onto a particularly complicated depth buffer topology.  There was also the issue of two dynamic objects overlapping: if one of the objects had a decal stuck to it in the region of the overlap the decal would appear on the other object; this only happened when the collision response was faulty so it wasn't that big a deal.


Anyway, waiting for Hodgman's insight...

#5088690 Better method than smart-pointers for dealing with circular dependency?

Posted by nonoptimalrobot on 24 August 2013 - 10:03 AM

frob, on 24 Aug 2013 - 01:34 AM, said:

In our current engine at work there is actually a 4-step game object initialization process. The game object sees a constructor which is nearly always empty. Then an OnCreate() method is called where you know all the base member variables are created and initialized; here you set up your own values but do not touch other components, the function is skipped if the object is being loaded through serialization. Next an OnStartup() method where you register with other components and recreate any non-persisted values, this is run both after deserialization and on new object creation. Finally, when the object is placed in the world, OnPlacedInWorld(), only a few objects need this additional step, but sometimes they need to hook up with nearby objects and this provides that opportunity.

Holy crap, man!

Couldn't you just create separate ctors for normal construction and deserialization construction? I mean, I get the OnPlacedInWorld() part, but it sounds like OnCreate() and OnStartup() could just be normal ctors that call a common member function after the serializable values are set. Why do you guys avoid using the ctor? Or, rather, if it needs to be plugged in this way, why not a pair of factory functions?


This is pretty standard.  It's not that you can't achieve the same with multiple constructors but breaking it up into multiple function calls or passes allows a few different things:


1) External systems can run code in-between the initialization passes.  This is almost a necessity, especially for games. 


2) Code reuse.  Often times the default constructor does stuff that all the other initialization functions want done.  If everything was broken out into multiple, overloaded constructors then a lot of code would get copied across them.


3) Clear intent.  OnPlacedInWorld(), OnCreate() and OnStartup() tell would-be modifiers of the class where to put various bits of initialization code while the systems using the class know when to call what.


4) Performance.  As mentioned, the minimally filled out default constructor allows faster serialization as you can avoid doing things that will simply be undone by the serializer.
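A bare-bones sketch of the pattern, going only off frob's description.  The GameObject class, its log member and the InitializeObject driver are invented for illustration; a real engine would spread these calls across several systems rather than one function.

```cpp
#include <cassert>
#include <string>
#include <vector>

class GameObject
{
public:
    GameObject() {}            // nearly empty: cheap to construct before deserializing
    virtual ~GameObject() {}

    // Set up your own values; skipped when the object is loaded from serialized data.
    virtual void OnCreate()        { log.push_back("OnCreate"); }
    // Register with other components; runs for new and deserialized objects alike.
    virtual void OnStartup()       { log.push_back("OnStartup"); }
    // Hook up with nearby objects once the object is in the world.
    virtual void OnPlacedInWorld() { log.push_back("OnPlacedInWorld"); }

    std::vector<std::string> log;  // records call order, only for this example
};

// Hypothetical driver showing where external systems get to run between passes.
void InitializeObject(GameObject& obj, bool loadedFromSerialization, bool placeInWorld)
{
    assert(obj.log.empty());
    if (!loadedFromSerialization)
        obj.OnCreate();
    // ...a deserializer or other external system could run here...
    obj.OnStartup();
    if (placeInWorld)
        obj.OnPlacedInWorld();
}
```

A freshly created object sees OnCreate, OnStartup and OnPlacedInWorld in that order, while a deserialized one skips straight to OnStartup.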

#5088596 Distortion/Heat Haze FX

Posted by nonoptimalrobot on 23 August 2013 - 10:42 PM

For a deferred renderer things usually get arranged like this:


1) Render solid objects into the depth and g-buffers

2) Generate shadow maps

3) Render lights

4) Resolve the frame buffer to a texture that can be sampled called S

5) Render distortion effects by sampling S

6) Render transparent objects (usually via forward lighting)


As you mentioned this isn't really a generic solution, it just works okay most of the time.  A lot of games only take it this far and artifacts that result from having distortion effects in front of transparent objects that are in front of other distortion effects are either accepted as inevitable or dealt with by a graphics programmer conning a designer into tweaking the layout of the problematic scene.  Add depth-of-field effects to the mix and things can get ugly fast.


A generic solution is to lump distortion and transparency into the same object class and resolve the current frame buffer into a texture just before drawing each object.  This is extremely slow as you are resolving the scene to a texture many times per frame.


More sophisticated solutions group distorting and transparent objects into slabs or cascades and then only resolve the scene to a texture between rendering cascades.  When it comes to water...well, water is just given its own cascade.  Admittedly this doesn't make much sense geometrically (water is usually a giant plane that crosses all the cascades) but it works well most of the time.


Various adaptations of depth peeling along with the compute shader analog that relies on MRTs and sorting data by depth can be leveraged as well to solve this problem.

#5088530 Calculating area left uncovered by overlapping projected circles on a plane

Posted by nonoptimalrobot on 23 August 2013 - 04:44 PM

Ah, yes but the terms "colouring algorithm" and "simulated annealing" in the first post are not jiving well with the mental model I have of the problem you are trying to solve.  In any case, occlusion queries are pretty simple:


1) Render some things to populate the depth buffer

2) Issue a begin occlusion query command

3) Render some more stuff

4) Issue an end occlusion query command

5) Get the result 'N' from the query which tells you the number of pixels rendered in step 3 that passed the depth buffer test.


To start out you will have to render your mesh between a begin and end query command in such a way that you know every pixel will pass (for example against a cleared depth buffer) so you know its total pixel count.  With the disks rendered into the depth buffer in front of the mesh, 'N' is then the number of mesh pixels left uncovered by your disks, and subtracting 'N' from the total gives the covered count.  You will want to use an orthographic projection so each pixel, regardless of its depth, has the same projected area.
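If it helps to see the counting idea without any graphics API, here is a CPU stand-in: it "rasterizes" a unit square on a regular grid (the orthographic projection) and counts the samples not covered by any disk, which is the role the occlusion query result plays on the GPU.  The Disk struct, the function name and the unit-square plane are assumptions for the sketch; this is not how you would issue an actual query.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Disk { float cx, cy, radius; };

// Returns the fraction of the unit square [0,1] x [0,1] left uncovered by the disks.
// Every sample has the same area, mirroring what an orthographic projection gives you.
float uncoveredFraction(const std::vector<Disk>& disks, int resolution)
{
    assert(resolution > 0);
    int uncovered = 0;
    for (int y = 0; y < resolution; ++y)
    {
        for (int x = 0; x < resolution; ++x)
        {
            // Sample at the pixel center.
            float px = ((float)x + 0.5f) / (float)resolution;
            float py = ((float)y + 0.5f) / (float)resolution;

            bool covered = false;
            for (std::size_t i = 0; i < disks.size() && !covered; ++i)
            {
                float dx = px - disks[i].cx;
                float dy = py - disks[i].cy;
                covered = dx * dx + dy * dy <= disks[i].radius * disks[i].radius;
            }
            if (!covered)
                ++uncovered;   // this sample would pass the depth test
        }
    }
    return (float)uncovered / (float)(resolution * resolution);
}
```

Overlapping disks are handled for free since a sample is either covered or not, which is exactly why the rendering approach sidesteps the messy analytic geometry.  Accuracy is bounded by the resolution.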


In my experience using compute shaders to solve a generic problem is never a performance win unless you aggressively restructure the algorithm to accommodate the various quirks of the GPU architecture.   The amount of work it takes to optimize a compute shader solution is often similar to the amount of work it takes to recast your algorithm into a rasterization task.  For reasons that are not entirely clear to me GPUs nudge closer to their peak theoretical performance when you are blasting triangles instead of running compute shaders, making a rasterization method a clear winner in a lot of cases.  This may not be true with workstation cards, and OpenCL does a good job at parallelizing your entire machine (even the integrated graphics) so there might be performance wins in that department.

#5088487 Calculating area left uncovered by overlapping projected circles on a plane

Posted by nonoptimalrobot on 23 August 2013 - 02:38 PM

Assuming I'm following you (and I might not be), it sounds like you can turn this into a rendering problem and use occlusion queries.

#5088003 Help with visibility test for AI in 2D

Posted by nonoptimalrobot on 21 August 2013 - 11:19 PM

Can't sleep and kept coming back to this.  Here's some code to get the "hacky" version that I described working.  This is something of a neat problem; it's very similar to extruding a shadow volume but you get to make a bunch of convenient assumptions.  The crux of the algorithm is to construct 4 rays, each starting at the location of your AI and traveling to one of the 4 corners of the platform that the AI is standing on.  Each ray is then intersected with the plane that the player is standing on; these four intersection points will form an AABB which describes the region of space that is occluded by the platform from the AI's point of view.  If the player is inside this AABB they are not visible; if they are outside of it then they are potentially visible if the AI is looking in that direction.

struct AABB
{
	Vector2 vMin;
	Vector2 vMax;

	// Grows the bounds so that they contain p.
	void IncludePoint(const Vector2& p)
	{
		vMin.x = min(vMin.x, p.x);
		vMin.y = min(vMin.y, p.y);
		vMax.x = max(vMax.x, p.x);
		vMax.y = max(vMax.y, p.y);
	}

	Vector2 GetCorner(int32 a_nIndex) const
	{
		switch(a_nIndex)
		{
		case 0: return Vector2(vMin.x, vMin.y);
		case 1: return Vector2(vMax.x, vMin.y);
		case 2: return Vector2(vMin.x, vMax.y);
		case 3: return Vector2(vMax.x, vMax.y);

		default: assert(false);
		}

		return Vector2(0, 0);
	}

	bool ContainsPoint(const Vector2& p) const
	{
		if(p.x < vMin.x) return false;
		if(p.x > vMax.x) return false;
		if(p.y < vMin.y) return false;
		if(p.y > vMax.y) return false;
		return true;
	}
};

void GetOcclusionBounds(const AABB&     a_kPlatform,        // Rectangular bounds of elevated platform
                        float32         a_fPlatformHeight,  // Vertical position of the elevated platform
                        const Vector2&  a_vViewPoint,       // Horizontal position of the AI, must be on the platform!
                        float32         a_fViewHeight,      // Height of the AI (tall AIs can see more, short ones can see less)
                        float32         a_fPlayerHeight,    // Vertical position of the player's feet + their height
                        AABB&           a_kOcclusion)       // Resulting bounds of visibility occlusion.
{
	assert(a_fViewHeight > 0);
	assert(a_fPlayerHeight < a_fPlatformHeight);

	a_kOcclusion = a_kPlatform;

	Vector3 vView = Vector3(a_vViewPoint, a_fPlatformHeight + a_fViewHeight);

	for(int32 i = 0; i < 4; i++)
	{
		// Extend the ray from the AI's eye through each platform corner
		// until it hits the horizontal plane at the player's height.
		Vector3 vDelta  = vView - Vector3(a_kPlatform.GetCorner(i), a_fPlatformHeight);
		float32 fDist   = (a_fPlayerHeight-vView.z) / vDelta.z;
		Vector3 vCorner = vView + vDelta * fDist;

		a_kOcclusion.IncludePoint(Vector2(vCorner.x, vCorner.y));
	}
}

The math can be made leaner but I didn't want to obscure the geometric intuitiveness.  Add a field-of-view check and you are good to go.

#5087912 "Each vs. any" collision shape implementation

Posted by nonoptimalrobot on 21 August 2013 - 03:42 PM

I have misgivings about the double dispatch method.  I feel like it's a form of spaghetti code that forms over and over again in response to a particular type of problem so we gave it a name and sanctioned it as a design paradigm.  My biggest problem with it is that it obliterates the open/closed principle and is hard to maintain.  Admittedly this is an opinion many will take issue with.


HappyCoder is onto something about giving your IVolume interface the necessary methods so that they can be used by a separating axis detector for collision.  I've done this before and it's workable but still awkward.  Also: performance!  An AABB to AABB detector can be made very fast but when you generalize to a separating axis test you'll pay a price.  It will be slightly faster than if you had stored your AABBs as convex hulls and used a vanilla separating axis detector but only slightly.  Bounding volumes with curved boundaries (capsules, cylinders etc) are problematic as well.  In any case it is a fun academic experiment and may suit your needs.

#5087891 "Each vs. any" collision shape implementation

Posted by nonoptimalrobot on 21 August 2013 - 02:42 PM

This is a good question.  I've never seen it solved in a particularly satisfactory way.  This is what I do:

enum ShapeType
{
    Shape_Null   = 0,
    Shape_AABB   = 1,
    Shape_OOBB   = 2,
    Shape_Hull   = 4,
    Shape_Sphere = 8,
};

// Packs an ordered pair of shape types into a single value usable as a case label.
#define ShapeTypeMash(a, b) (((uint32)(a) << 16) | (uint32)(b))

class IShape
{
public:
    virtual ~IShape() {}
    virtual ShapeType GetType(void) const = 0;
};

bool CollideShapes(const IShape* a_pShapeA, const IShape* a_pShapeB)
{
    ShapeType eTypeA = a_pShapeA->GetType();
    ShapeType eTypeB = a_pShapeB->GetType();

    // Order the pair so only one case per combination is needed.
    if(eTypeA > eTypeB)
    {
        swap(eTypeA, eTypeB);
        swap(a_pShapeA, a_pShapeB);
    }

    bool bResult = false;

    switch(ShapeTypeMash(eTypeA, eTypeB))
    {
    case ShapeTypeMash(Shape_AABB, Shape_AABB):
        bResult = CollideShapes_AABB_AABB((const AABB*)a_pShapeA, (const AABB*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_AABB, Shape_OOBB):
        bResult = CollideShapes_AABB_OOBB((const AABB*)a_pShapeA, (const OOBB*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_AABB, Shape_Hull):
        bResult = CollideShapes_AABB_Hull((const AABB*)a_pShapeA, (const Hull*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_AABB, Shape_Sphere):
        bResult = CollideShapes_AABB_Sphere((const AABB*)a_pShapeA, (const Sphere*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_OOBB, Shape_OOBB):
        bResult = CollideShapes_OOBB_OOBB((const OOBB*)a_pShapeA, (const OOBB*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_OOBB, Shape_Hull):
        bResult = CollideShapes_OOBB_Hull((const OOBB*)a_pShapeA, (const Hull*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_OOBB, Shape_Sphere):
        bResult = CollideShapes_OOBB_Sphere((const OOBB*)a_pShapeA, (const Sphere*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_Hull, Shape_Hull):
        bResult = CollideShapes_Hull_Hull((const Hull*)a_pShapeA, (const Hull*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_Hull, Shape_Sphere):
        bResult = CollideShapes_Hull_Sphere((const Hull*)a_pShapeA, (const Sphere*)a_pShapeB);
        break;

    case ShapeTypeMash(Shape_Sphere, Shape_Sphere):
        bResult = CollideShapes_Sphere_Sphere((const Sphere*)a_pShapeA, (const Sphere*)a_pShapeB);
        break;

    default:
        bResult = false;
        break;
    }

    return bResult;
}

It works okay and keeps the shape vs shape collision code out of the actual shape class which I have found to be beneficial.  It keeps you from modifying all of your shape classes each time you add a new type.  You can leverage template specialization to clean up the syntax a bit.  I'm hoping someone replies with a better solution though.

#5087680 Help me understand how DirectX treats the W component

Posted by nonoptimalrobot on 20 August 2013 - 04:07 PM

When transforming vectors by matrices the w component has the effect of scaling the translational component of your matrix.  Consider the vanilla transformation matrix encountered while doing graphics work:

| ux vx nx tx|  |x|   |(x y z) dot (ux vx nx) + tx*w|
| uy vy ny ty|* |y| = |(x y z) dot (uy vy ny) + ty*w|
| uz vz nz tz|  |z|   |(x y z) dot (uz vz nz) + tz*w|
|  0  0  0  1|  |w|   |                            w|

The rotation and scale of the matrix are stored in the u, v, and n vectors (the first three columns) and the translational component is stored in the t vector.  As you can see on the right-hand side, the (x y z) components of the original vector were translated by w*t.  For directions w is set to 0 (or implicitly interpreted as 0) since translations are irrelevant to directions.  For points w is set to 1 (or implicitly interpreted as 1), applying the translation as expected.


Now for the good stuff.  Consider multiplying a point by a projection matrix:

| sx  0  0  0| |x|   |x*sx     |
|  0 sy  0  0|*|y| = |y*sy     |
|  0  0 sz tz| |z|   |z*sz + tz|
|  0  0  1  0| |1|   |z        |

The right-hand side is the data spit out by your vertex shader.  For various mathematical reasons that I won't get into, the GPU will clip data against the view frustum in this space and it is thus called clip space.  To apply perspective some magic happens: the right-hand side gets homogenized, or divided by its w component:

H( |x*sx   | ) = | (sx*x)    / z |
   |y*sy   |     | (sy*y)    / z |
   |z*sz+tz|     | (sz*z+tz) / z |
   |z      |     | 1             |

From here the right-hand-side is said to be in NDC or normalized device coordinates.  The z component gets directly written to the depth buffer, the x and y component are scaled and biased to generate pixel coordinates and the w component is simply discarded (it will always be 1).


If you are hand setting the w value of your vertex in the vertex shader you must take into account that the GPU will divide (x, y, z) by w to generate NDC coordinates.
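Here is a small sketch of the projection followed by the divide, using the same sx, sy, sz and tz terms as the matrices above.  Vec4, project and homogenize are made-up names; the test values assume the D3D-style choice sz = f / (f - n) and tz = -n * f / (f - n) for near and far planes n and f.

```cpp
#include <cassert>

struct Vec4 { float x, y, z, w; };

// Multiplies a view-space position by the projection matrix,
// producing the clip-space vector a vertex shader would output.
Vec4 project(Vec4 p, float sx, float sy, float sz, float tz)
{
    return Vec4{ p.x * sx, p.y * sy, p.z * sz + tz * p.w, p.z };
}

// The divide the GPU performs after clipping: clip space -> NDC.
Vec4 homogenize(Vec4 c)
{
    assert(c.w != 0.0f);
    return Vec4{ c.x / c.w, c.y / c.w, c.z / c.w, 1.0f };
}
```

With n = 1 and f = 10, a point on the near plane ends up with an NDC z of 0 and a point on the far plane ends up with 1, and x and y shrink in proportion to depth.  That shrink is the perspective effect, and it is why a hand-set w changes where the vertex lands on screen.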


[edit] I don't know what all that whitespace is...I can't get rid of it :(