After some problems understanding the underlying code in a book I am currently working my way through (OpenGL SuperBible, 5th edition), I decided to take a few steps back and see if I could actually explain the process to myself. Partly due to the "Writing Articles is Good for YOU!" announcement on top here, partly to get feedback on anything I've gotten wrong.
I also tried checking briefly in another book (Real-Time Rendering, Third Edition) for some additional detail.
So, does all this seem correct, or do I have any incorrect assumptions/misunderstandings in the following?
A vertex position is a point in space, with coordinates that uniquely describe where the vertex is.
A vertex position is defined with 1 coordinate per dimension (so a 2D position requires a set of 2 numbers, while a 3D position requires a set of 3 numbers). In addition, it carries an extra element (commonly called w), which is set to 1. This is required for various matrix math operations, and allows vertex positions and vectors to use the same math.
A vector represents a direction together with a length (its magnitude). A vector is defined in the same way as a vertex position (with 1 element per dimension), but the extra element (w) for vectors is 0, instead of 1.
The direction of a vector is found by looking from the origin (0, 0, 0) and towards the specified coordinates.
The length is defined as the distance travelled from the origin to the specified coordinates.
If the vector's length is 1, the vector is said to be a unit vector, or of unit length.
If the vector's length is not 1, the vector can be normalized, which changes the length to 1, while maintaining its original direction. Normalization is done by dividing each component (e.g. the x, y and z) by the vector's length.
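As a quick sketch of the length/normalization math above (plain Python for brevity; the book's actual code is C++, and a real project would use a math library):

```python
import math

def length(v):
    # Euclidean length: sqrt(x^2 + y^2 + z^2)
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    # Divide each component by the vector's length;
    # the result points the same way but has length 1.
    l = length(v)
    return tuple(c / l for c in v)

v = (3.0, 0.0, 4.0)
print(length(v))     # 5.0
print(normalize(v))  # (0.6, 0.0, 0.8) -- a unit vector
```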
A texture is a set of data, which can contain e.g. color information, which can be used either directly (applying the color to an object), or indirectly (applying various effects in e.g. a shader). Textures can be multi-dimensional, the most common being a 2D texture.
A model is a collection of vertices (optionally along with texture coordinates and/or normals), using certain geometric primitives -- generally points, lines and triangles.
The collection of vertices, together with the knowledge of which geometric primitive is used, defines an object, e.g. a teapot.
A model's vertices are defined in model space; the model has no knowledge of the world or where it should be drawn/displayed, it just knows what the model should look like (the spout goes on the front, the handle goes on the back, etc.).
We can adjust vertices and vectors (or in general, any point) by transforming them. Transformations include e.g. translation (moving), rotation and scaling.
Transforms are stored in 4x4 matrices, where a single 4x4 matrix can combine translation, rotation and scale.
A point can be transformed by multiplying it with a transformation matrix.
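To make that concrete (and to show why the extra element from earlier is 1 for positions but 0 for vectors), here is a minimal sketch in plain Python -- not the book's C++, just the bare arithmetic:

```python
def mat_vec(m, v):
    # Multiply a 4x4 matrix (list of rows) by a 4-component vector.
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def translation(tx, ty, tz):
    # A 4x4 translation matrix; the offset sits in the last column.
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

T = translation(10, 0, 0)
point  = (1, 2, 3, 1)  # w = 1: positions are moved by a translation
vector = (1, 2, 3, 0)  # w = 0: the translation column is multiplied by 0

print(mat_vec(T, point))   # (11, 2, 3, 1)
print(mat_vec(T, vector))  # (1, 2, 3, 0) -- direction unchanged
```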
Model Transform & World Space
When we want to place, orient and/or scale a model in a specific place in our world, we transform it from model space to world space by applying a model transform to it.
We can apply multiple model transforms to a single model. If a teapot is residing on a desk (which has its own model transform), the teapot will be affected by both the desk's model transform and its own (which can be considered a model transform containing the offset from the desk).
A single model can be used in different places in the world, by reusing the model data (instancing it) and applying different model transforms to each instance of the model.
Once every model has been transformed by their model transforms, they all exist in world space -- they exist in correct spatial relation to each other.
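The desk/teapot example can be sketched like this (plain Python again; the helper functions and the specific positions are made up for illustration). Multiplying the two matrices composes the transforms, so the teapot's world position falls out of the desk's transform plus its own offset:

```python
def translation(tx, ty, tz):
    # A 4x4 translation matrix (list of rows).
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def mat_mul(a, b):
    # 4x4 matrix product: applying (a * b) to a point applies b first, then a.
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

desk_transform   = translation(5, 1, 0)  # the desk stands at (5, 1, 0) in the world
teapot_offset    = translation(0, 1, 0)  # the teapot sits 1 unit above the desk's origin
teapot_transform = mat_mul(desk_transform, teapot_offset)

# The teapot's model-space origin ends up at (5, 2, 0) in world space.
print(mat_vec(teapot_transform, (0, 0, 0, 1)))  # (5, 2, 0, 1)
```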
View Transform & Camera Space
To determine which part of the world is to be displayed on screen, we use the concept of a camera object. The camera object also exists in the world in relation to everything else, with its own translation, rotation, etc. In addition, it has some settings that are specific to it.
For ease of computing, we apply the inverse of the camera's transform to every model in our world. This places and orients the camera at the origin (the orientation depending on the underlying API), and shifts all other objects in relation to the camera. This maintains the camera's spatial relationship to every model in the world, and is done for computational gains (the math is easier when the camera is at the origin).
This transform is called the view transform. The combined transform of the model transform and camera's view transform is called the modelview transform.
Models that have the modelview transform applied to them are said to be in camera space or eye space.
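A tiny sketch of that idea, assuming the simplest possible camera (a pure translation with no rotation, so its inverse is just the negated offset; the positions are invented for illustration):

```python
def translation(tx, ty, tz):
    # A 4x4 translation matrix (list of rows).
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def mat_vec(m, v):
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

# The camera sits at (0, 0, 5) in world space.
# Its view transform is the inverse: translate everything by (0, 0, -5).
view = translation(0, 0, -5)

model_at_origin = (0, 0, 0, 1)
print(mat_vec(view, model_at_origin))  # (0, 0, -5, 1)
# The model is now 5 units down the -z axis, i.e. in front of a camera
# sitting at the origin (OpenGL's camera looks down -z by convention).
```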
Projection Transform & Unit Cubes
Regardless of camera type, we define a volume which tells us which objects the camera can "see" (e.g. setting the camera's field of view, how far we can see, etc.). This volume is called the view frustum.
The camera can be orthographic, in which case the final display uses a parallel projection (no perspective effects). This is mainly used for 2D applications, or for architectural applications.
Alternatively, the camera can be perspective, which does not use a parallel projection; instead, perspective effects are applied (mainly foreshortening, i.e. objects closer to the camera appear larger than objects farther away). This is generally used for applications which simulate how we perceive the world with our own eyes.
The graphics hardware only understands coordinates ranging from -1 to 1 (in all 3 dimensions), a so-called unit cube. Thus we apply yet another transform, the projection transform, which transforms the view frustum from whatever size/shape box we had, into a unit cube.
Objects which are within the unit cube are eligible for being displayed on-screen. The other objects are culled away. A special case exists for objects which are partially inside the unit cube; these are clipped to the extents of the unit cube.
Coordinates which have had this transform applied to them are said to be clip coordinates. The hardware automatically performs perspective division on the clip coordinates, leaving you with normalized device coordinates.
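Perspective division itself is just a divide-by-w; the clip-space numbers below are made up, but they show the mechanics:

```python
def perspective_divide(clip):
    # x, y, z are divided by w, yielding normalized device coordinates.
    x, y, z, w = clip
    return (x / w, y / w, z / w)

clip = (2.0, -1.0, 4.0, 4.0)
print(perspective_divide(clip))  # (0.5, -0.25, 1.0) -- inside the unit cube
```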
Finally, the objects are mapped to the screen/application window, where they are rasterized (converted into pixels on the screen).
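That final mapping (the viewport transform) can be sketched as follows, assuming an OpenGL-style window where NDC x and y run from -1 to 1:

```python
def to_window(ndc, width, height):
    # Map NDC x/y in [-1, 1] to pixel coordinates in the window.
    x, y, _z = ndc
    return ((x + 1) / 2 * width, (y + 1) / 2 * height)

print(to_window((0.0, 0.0, 0.0), 800, 600))  # (400.0, 300.0) -- the window centre
```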
Model space combined with model transform --> World space
World space combined with view transform --> Camera space or eye space
Camera space combined with projection transform --> Normalized device coordinates (after automatic perspective division)
Normalized device coordinates combined with rasterization --> Pixels on screen.