View matrix - explanation of its elements

Started by
8 comments, last by Paradigm Shifter 10 years, 3 months ago

So a common view matrix looks like this:

| u_x   v_x   n_x  0 |
| u_y   v_y   n_y  0 |
| u_z   v_z   n_z  0 |
| -u*c  -v*c  -n*c 1 |

where n is a vector acting as the "z" axis of the camera, u as its "x" axis and v as its "y" axis. u_x, v_x, n_x etc. are the coordinates of each vector. c is a vector representing the distance of the camera from (0, 0, 0). u*c, v*c and n*c are dot products.

Does somebody know of a detailed article which explains precisely how to create such a matrix from scratch? Or maybe someone could explain it to me here?

First my assumptions:

When applying this matrix to every object in the scene, the matrix is the first factor of the multiplication and a vertex is the second one?

So for example:

| u_x   v_x   n_x  0 |   | x |
| u_y   v_y   n_y  0 | * | y |
| u_z   v_z   n_z  0 |   | z |
| -u*c  -v*c  -n*c 1 |   | 1 |

If my assumptions are correct then I don't understand a few things. Why must the 4th row of the matrix contain a vector representing how much every vertex should be moved? The 4th coordinate of any vertex is "w", right? So what meaning does it have here? I thought it was actually useless, defined only so that translation (an addition) could be expressed as a multiplication. Now, after a multiplication like the one above, I would get the following vertex transformation:


|x * u_x + y * v_x + z * n_x + 0      |
|x * u_y + y * v_y + z * n_y + 0      |
|x * u_z + y * v_z + z * n_z + 0      |
|-x * (u*c) -y * (v*c) -z * (n*c) + 1 |

And it seems as if the "w" component of the vertex was moved, which doesn't make any sense to me :(.

My second issue is the rotation part of the view matrix, i.e. its first three rows. I completely don't understand why we can use the coordinates of the camera axis vectors as rotation factors.

So if anyone could lend me a hand here I would be really grateful!


The problem is that you need to understand half a dozen concepts to have all of your questions answered. That is too much to be explained in detail here. I'll try to give the basics, so you have some hints to search the forum and the internet in general.

1.) A camera can be seen like any other object in the world, and as such it can be placed in the world. Placement is done by using a transformation that explains how to convert a point in the camera's local space into the global space. Notice that a point in local space and its converted pendant are actually the same point; the difference is the reference frame in which the co-ordinates of the point are given. Notice that the transformation is used to convert from local to global space. Clearly, there should be the other way, too, namely converting from global space into local space. Those two transformations are the mathematical inverse of each other, because going from the local space into the global space and back into the same local space should yield the original point co-ordinates.

With the above in mind, a camera matrix is normally the transformation from camera local into global space; the same as a model matrix is for a model. But the view matrix is normally called its inverse, i.e. the matrix to transform from global space into camera local (a.k.a. view) space. So you must not confuse camera and view matrix (as you did in the OP).

2.) A vector may denote a position, a difference (between 2 positions), a direction, but not a distance. A distance is a scalar value. A distance can be computed as the length of a difference vector.

3.) You must be aware that vectors can be written in a row or in a column. This makes a difference if you think of the matrix product. E.g. if you have a matrix and a vector, you are able to compute

matrix * column_vector or row_vector * matrix

but you are not able to compute

column_vector * matrix or matrix * row_vector

The operator to convert between column and row vectors is named "transpose".

What you wrote down as the "view matrix" looks like a row-vector matrix. But the product you've done later has the structure suitable for column vectors. Using them together is wrong!
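A quick illustration of the difference (my own numpy sketch, not from the thread):

```python
import numpy as np

M = np.arange(1, 10).reshape(3, 3)   # an arbitrary 3x3 matrix
col = np.array([[1], [2], [3]])      # 3x1 column vector
row = col.T                          # 1x3 row vector (its transpose)

a = M @ col    # valid: (3x3) * (3x1) -> a 3x1 column vector
b = row @ M    # valid: (1x3) * (3x3) -> a 1x3 row vector
# M @ row and col @ M would both raise a shape error.

# "Transpose" converts one convention into the other:
#   (M @ col)^T == col^T @ M^T
assert np.allclose(a.T, row @ M.T)
```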

4.) The w component, better named the homogeneous component, is a trick (well, not really, but it seems to be). Notice that both scaling and rotation are multiplicative operations that can be applied to positions as well as direction vectors, but translation is an additive operation that can be applied to position vectors only (yes: a direction vector has no concept of position!). Without it, concatenating several such operations into a single matrix would require knowing in advance which kind of vector the matrix will be applied to.

Now, using a homogeneous co-ordinate makes it possible to unify translation with rotation/scaling, because the homogeneous co-ordinate makes an explicit distinction between positions and directions.
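To illustrate the distinction, here is a small numpy sketch (my own example, column-vector convention): a transform combining a rotation and a translation moves a position (w = 1), but only rotates a direction (w = 0).

```python
import numpy as np

# A 4x4 transform combining a rotation (90 degrees about z) and a
# translation, laid out for column vectors: p' = T @ p.
cos_a, sin_a = 0.0, 1.0  # cos(90 deg), sin(90 deg)
T = np.array([[cos_a, -sin_a, 0, 5],   # upper-left 3x3: the rotation
              [sin_a,  cos_a, 0, 0],   # last column: translation (5, 0, 0)
              [    0,      0, 1, 0],
              [    0,      0, 0, 1]])

position  = np.array([1, 0, 0, 1])  # w = 1: translation applies
direction = np.array([1, 0, 0, 0])  # w = 0: translation is ignored

p = T @ position    # rotated to (0, 1, 0), then moved to (5, 1, 0)
d = T @ direction   # only rotated: (0, 1, 0)
```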

As far as I know, the view matrix is the "camera" matrix inverted. I hope someone can confirm this.

The camera matrix looks a bit more understandable. I.e., the position elements contain the actual camera position in world space.

Since the camera matrix has just rotation and translation, the general matrix inverse is not really necessary. You can transpose the rotation part and then use dot products for the position components.
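As a sketch of that shortcut (my own numpy illustration, column-vector convention with the translation in the last column), checked against the general inverse:

```python
import numpy as np

def fast_rigid_inverse(camera):
    """Invert a 4x4 matrix that is only rotation + translation
    (column-vector layout: the last column holds the translation)."""
    R = camera[:3, :3]          # rotation block
    t = camera[:3, 3]           # translation (the camera position)
    view = np.identity(4)
    view[:3, :3] = R.T          # transposing replaces the 3x3 inverse
    view[:3, 3]  = -R.T @ t     # dot products for the position part
    return view

# Example: camera rotated 90 degrees about y, positioned at (3, 0, 0)
camera = np.array([[ 0., 0., 1., 3.],
                   [ 0., 1., 0., 0.],
                   [-1., 0., 0., 0.],
                   [ 0., 0., 0., 1.]])
view = fast_rigid_inverse(camera)
assert np.allclose(view, np.linalg.inv(camera))
```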

Cheers!

A camera can be seen like any other object in the world, and as such it can be placed in the world.

That is a common misconception, which makes it hard to understand. Actually, the camera is stationary! You move the whole world (using the view matrix), and you turn the whole world, to get what you want in front of the camera. Looking at it that way, the transforms are obvious.

As far as I know, view matrix is the "camera" matrix inverted. I hope someone can confirm this.

Kind of, if you see the camera as something you place in the world.

Current project: Ephenation.
Sharing OpenGL experiences: http://ephenationopengl.blogspot.com/

A camera can be seen like any other object in the world, and as such it can be placed in the world.

That is a common misconception, which makes it hard to understand. Actually, the camera is stationary! You move the whole world (using the view matrix), and you turn the whole world, to get what you want in front of the camera. Looking at it that way, the transforms are obvious.

Well, no, it isn't a "misconception". It is a legitimate view of things. It may seem that view space is something special, but it isn't.

* Can a camera be placed in the world like any other object? Yes, it can.

* Relative to what is a camera stationary? It is stationary relative to ... itself!! But that is true for any object in the scene. If you choose a reference system where you sit, then it becomes stationary for you.

* What if you have 2 cameras in the world: how could saying that both are stationary be more intuitive?

* Is a light also special / stationary because one transforms into its local space when computing a shadow volume?

* Is an ellipsoidal shape special because, when raytracing it, one transforms into its local space where it becomes a sphere?

* Is the world special because one does collision detection within?

Mathematically there is nothing like a natural break point in the chain of transformations from any local space into screen space. One chooses the space where a given task is best done. For sure, the best space for screen rendering is not the world ;)


The problem with placing the camera in the world is that you don't use the same transformation matrix as when you place any other type of object in the world. So in that regard, it is not just like any other kind of object. If you want to move an object to the right in your view, you add a translation to that specific object. If you want to move the camera to the right, you subtract a corresponding translation from every object in the world.

I agree that it is valid to use any reference system (view space, world space, screen space, etc). But you can't use more than one camera at a time. That is, if you have more than one camera, you render them one at a time, e.g. when creating stereoscopic views, shadow maps or cube maps.


First off: I agree that both views are valid (although IMHO my arguments show that "it's a camera in the world" is even the more suitable one). But I disagree with your argument for declaring one of them a misconception.


The problem with placing the camera in the world is that you don't use the same transformation matrix as when you place any other type of object in the world. So on that regard, it is not just like any kind of object.

When one places a camera in the world, one uses translations and rotations to describe where the camera is positioned and how it is oriented w.r.t. the global co-ordinate system. This already follows from the phrase "placing a camera in the world". How you build those transforms is convention: whether your camera controller pitches the camera down or up when moving the mouse forward is a convention. Whether you compose the view matrix directly instead of the camera matrix plus a subsequent inversion is an implementation detail.

An example: I create a path onto which a Frenet frame is animated. I may attach an arrow to the frame to let an RPG character shoot with bow and arrow. I may alternatively attach the camera to it because I want to render a cut scene. Now, is the camera distinct from the arrow from the path's point of view? No! The animation system will write exactly the same transform to the target. I would not come to the conclusion of letting the animation system make a distinction between a camera and something else. There is not only no necessity for that, it would even be more complex. Later on, the rendering system, as the one which knows what "camera" means, uses one of them, computes the view matrix, and does its thing.

Another example is a CES (component entity system). A camera can be implemented as a component which brings the field of view and the far and near clipping planes into an ordinary entity. The entity has, like every entity that is placed in the world, a Placement component as well.

Another example is a third person camera. It is attached in a forward kinematic way to the PC (a.k.a. "parenting").

All these examples work fine with "camera in the world", and IMHO they unify the view of things without trouble.


I agree that you can use any reference system (view space, world space, screen space, etc). But you can't use more than one camera. That is, if you have more than one camera, you render them one at a time. E.g. creating stereoscopic views, shadow maps or cube maps.

Yes, of course. But it means that you create a situation where one camera is stationary and all others are not, and declare that as special, although it is as special as any of the situations using another camera, too.

Every engine and tool I've ever worked with handles its camera by treating it like any other object in the scene. It's completely valid to do this as far as I'm concerned, and it's an intuitive way of dealing with camera transforms. The concept of "view space" is something that's specific to rendering, while the concept of a camera in general is something that usually applies to other subsystems like gameplay, scripting, and animation. When you do need to work with view space, it can be intuitively reasoned about by thinking of it as the coordinate space achieved by transforming everything in the world by the inverse of the camera's transformation matrix. The inverse is used because it transforms everything into the camera's frame of reference, where everything is relative to the camera's current position and orientation.

As for the original view matrix "look at" routine posted by the OP, kauna is exactly right: since the parameters of a lookAt function are an orientation and a position (you typically only deal with rotations and translations for a camera, not scaling), you can perform an optimized 4x4 matrix inverse by transposing the 3x3 portion and using dot products to get the last row.
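A sketch of such a lookAt routine in numpy (my own illustration; `look_at` and its parameter names are hypothetical, and the row-vector layout matches the matrix in the OP):

```python
import numpy as np

def look_at(eye, target, up):
    """Build the camera axes u, v, n from an eye position, a target and
    an up hint, then write the inverted (view) transform directly.
    Row-vector layout: use it as p' = p @ M with p = [x, y, z, 1]."""
    eye, target, up = (np.asarray(a, float) for a in (eye, target, up))
    n = eye - target                 # camera "z" axis (points backwards)
    n /= np.linalg.norm(n)
    u = np.cross(up, n)              # camera "x" axis
    u /= np.linalg.norm(u)
    v = np.cross(n, u)               # camera "y" axis
    M = np.identity(4)
    M[:3, 0] = u                     # top 3x3 columns hold u, v, n,
    M[:3, 1] = v                     # i.e. the transposed rotation
    M[:3, 2] = n
    M[3, :3] = [-u @ eye, -v @ eye, -n @ eye]  # last row: dot products
    return M

M = look_at(eye=[0, 0, 5], target=[0, 0, 0], up=[0, 1, 0])
# The eye position itself lands at the view-space origin:
# [0, 0, 5, 1] @ M -> [0, 0, 0, 1]
```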

you can perform an optimized 4x4 matrix inverse by transposing the 3x3 portion and doing the dot products to get the last row.

To expand,

If you have a transformation that is a rotation and a translation, e.g. the 3D camera's position is t and its rotation is R, then its world transform is

x' = Rx + t, where R is a 3x3 matrix, x is a 3x1 matrix (column vector) and t is a 3x1 matrix (column vector). The output is x', your transformed vector.

To get a view matrix you want to bring all vertices into the frame of reference of the camera, so the camera sits at the origin looking down one of the Cartesian axes. You simply need to invert the above equation and solve for x.

x' = Rx + t

x' - t = Rx

R^-1(x' - t) = (R^-1)Rx

R^T(x' - t) = x // because a rotation matrix is orthogonal, its transpose equals its inverse

x = R^T x' - R^T t

So to invert a 4x4 camera transformation matrix efficiently, you will want to transpose the upper 3x3 block of the matrix, and for the translation part you will want to negate your translation vector and transform it by the transposed rotation matrix.
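The last line of the derivation, as a small numpy sketch (my own illustration, column-vector convention; `world_to_view` is a hypothetical name):

```python
import numpy as np

def world_to_view(R, t, x_prime):
    """Invert x' = R x + t for x, using R^T as the inverse rotation."""
    R = np.asarray(R, float)
    t = np.asarray(t, float)
    return R.T @ x_prime - R.T @ t   # equivalently: R.T @ (x_prime - t)

# Round trip: transform a point into the world, then back.
R = np.array([[0., -1., 0.],   # 90 degree rotation about z
              [1.,  0., 0.],
              [0.,  0., 1.]])
t = np.array([2., 0., 0.])     # the camera position
x = np.array([1., 1., 1.])
x_prime = R @ x + t
assert np.allclose(world_to_view(R, t, x_prime), x)
```

Note that the camera position t itself maps to the view-space origin, as expected.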

-= Dave

Graphics Programmer - Ready At Dawn Studios

Surely

x = R^T(x' - t)

is a better representation, since it involves only one matrix multiply and a vector subtraction.

"Most people think, great God will come from the sky, take away everything, and make everybody feel high" - Bob Marley

This topic is closed to new replies.
