The OpenGL projection matrix defines the viewing frustum and whether the projection is perspective or orthographic... so... are the clip coordinates the coordinates that lie inside the viewing frustum, which is defined by the projection matrix? That would explain why, to get the clip coordinates, one has to multiply the projection matrix by the eye coordinates.
The modelview matrix is a combination of the model and the view matrix. It is used to transform a given vertex (scale, rotate and position). OpenGL doesn't use a "camera", or rather it can be thought of as the camera always being at (0,0,0) looking down the -Z axis. To change the view, you need to transform the positions of all the objects, so it's as if the world is moving around the "camera", not the "camera" moving around the world.
You have a viewing volume in general. Before normalization it is a "frustum" in the case of perspective projection but a cuboid in the case of orthographic projection. The projection matrix is responsible for squeezing the viewing volume into the normalized viewing volume, which is in fact a cube ranging from -1 to +1 in each dimension. However, you need to describe more than how to squeeze the volume: you need to describe where in space the volume is located and how it is oriented. Both standard projection matrices are defined so that the viewing volume has its origin at 0 and OpenGL's standard viewing orientation. In other words, the projection expects to be applied in a local space that is called the viewing space.
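To make the "squeezing" a bit more concrete, here is a minimal sketch in plain Python (lists only, no OpenGL calls). The perspective() helper follows the matrix documented for gluPerspective; all function names and the sample values (90 degree field of view, near 1, far 10) are assumptions chosen for illustration. It also touches on the question about clip coordinates: a point lies inside the viewing volume when its clip coordinates satisfy -w <= x, y, z <= w, and dividing by w then lands it inside the -1..+1 cube.

```python
import math

def perspective(fovy_deg, aspect, near, far):
    """Row-major 4x4 perspective matrix, same formula as gluPerspective."""
    f = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def inside_clip_volume(clip):
    """A clip-space point is inside the viewing volume if -w <= x, y, z <= w."""
    x, y, z, w = clip
    return all(-w <= c <= w for c in (x, y, z))

def to_ndc(clip):
    """Perspective divide: clip coordinates -> normalized device coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

# Hypothetical sample values, chosen only for illustration.
P = perspective(90.0, 1.0, 1.0, 10.0)

in_front = transform(P, [0.0, 0.0, -5.0, 1.0])   # eye-space point ahead of the camera
behind = transform(P, [0.0, 0.0, 1.0, 1.0])      # eye-space point behind the camera
off_side = transform(P, [20.0, 0.0, -5.0, 1.0])  # far outside the field of view

near_ndc = to_ndc(transform(P, [0.0, 0.0, -1.0, 1.0]))   # point on the near plane
far_ndc = to_ndc(transform(P, [0.0, 0.0, -10.0, 1.0]))   # point on the far plane

print(inside_clip_volume(in_front), inside_clip_volume(behind), inside_clip_volume(off_side))
print(near_ndc[2], far_ndc[2])  # roughly -1 and +1, up to rounding
```

Note that the eye-space input is expressed in the viewing space: the camera sits at the origin and looks down -Z, which is why visible points have negative z.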
You can relate one space to another by giving a transformation matrix. (In this sense the projection also transforms from one space into another.) There isn't really a dominant space; each space is in principle as good as any other, but some tasks are more convenient in one space than in another. The viewing space, for example, is convenient for doing projection. However, it isn't convenient for placing models, because whenever the viewer wandered around we would need to alter each model's co-ordinates. Instead we define a common space to which we relate both the models and the viewing volume. We usually call this space the global space or "world".
When we relate a model to the world, we'll see that defining a model's shape (by its vertices, of course) in global space isn't convenient, because we would have to adapt each and every vertex whenever we want to shift or rotate the model. It is more convenient to have a space where the vertices have fixed co-ordinates regardless of how the model is placed in the world. Such a space is called a local space or "model space", and the transformation that relates the model space to the global space is called the MODEL matrix in OpenGL's nomenclature. It converts vertex co-ordinates given in model space into global space, and hence is a local-to-global transformation.
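A small sketch of that idea, again in plain Python with hand-written 4x4 matrices: the vertex keeps its model-space co-ordinates, and only the MODEL matrix changes when the model is placed somewhere else. The helper names and the placement values are assumptions made up for this example.

```python
import math

def translation(tx, ty, tz):
    """4x4 matrix that moves a point by (tx, ty, tz)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def rotation_z(angle_deg):
    """4x4 matrix that rotates counter-clockwise about the Z axis."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices (b is applied first, then a)."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# The vertex is defined once, in model space, and never edited afterwards.
local_vertex = [1.0, 0.0, 0.0, 1.0]

# MODEL matrix: rotate the model by 90 degrees, then place it at (10, 0, 0).
model = matmul(translation(10.0, 0.0, 0.0), rotation_z(90.0))
world_vertex = transform(model, local_vertex)   # approximately (10, 1, 0)

# Moving the model means replacing the matrix, not touching the vertex.
moved_model = matmul(translation(-4.0, 2.0, 0.0), rotation_z(90.0))
moved_vertex = transform(moved_model, local_vertex)   # approximately (-4, 3, 0)

print(world_vertex, moved_vertex)
```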
We've said that the viewing volume is related to the global space as well. Hence the corresponding matrix, as is, also describes a local-to-global transformation. By analogy with "model", we call the corresponding object in the world the "camera", although it is even more "virtual" than a model, because we never see the camera itself but only the subspace it picks out of the world.
Now we can relate the models to the world and the viewing volume (or camera) to the world as well. But to apply the projection (and further steps) we need the models in the view space. Transforming from model space to global space to view space hence requires concatenating the model's local-to-global transformation and the camera's global-to-local transformation, where the latter is simply the inverse of the camera's local-to-global transformation. This inverse is called the VIEW transformation in OpenGL's nomenclature, and the concatenation is, logically, the MODELVIEW transformation.
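The following sketch walks through that concatenation in plain Python. It assumes the camera matrix is rigid (rotation plus translation, no scaling), so it can be inverted by transposing the rotation part; rigid_inverse() and the other names, as well as the placement values, are assumptions for illustration and not part of any OpenGL API.

```python
def translation(tx, ty, tz):
    """4x4 matrix that moves a point by (tx, ty, tz)."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices (b is applied first, then a)."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def rigid_inverse(m):
    """Invert a rotation+translation matrix (assumption: no scale or shear).

    The inverse rotation is the transpose of the 3x3 part, and the inverse
    translation is the original translation rotated back and negated.
    """
    r_t = [[m[c][r] for c in range(3)] for r in range(3)]
    t = [m[r][3] for r in range(3)]
    inv_t = [-sum(r_t[r][c] * t[c] for c in range(3)) for r in range(3)]
    return [r_t[0] + [inv_t[0]],
            r_t[1] + [inv_t[1]],
            r_t[2] + [inv_t[2]],
            [0.0, 0.0, 0.0, 1.0]]

camera = translation(0.0, 0.0, 5.0)   # C: camera placed at z = +5 in the world
view = rigid_inverse(camera)          # V = C^-1, the global-to-local transformation
model = translation(0.0, 0.0, -3.0)   # M: model placed at z = -3 in the world

modelview = matmul(view, model)       # MODELVIEW = V * M

# The model's origin as seen from the camera: 8 units ahead, along -Z.
eye_vertex = transform(modelview, [0.0, 0.0, 0.0, 1.0])
print(eye_vertex)
```

Moving the camera to z = +5 and moving every model by -5 along z give the same eye-space result, which is the "world moves around the camera" view of the same matrix.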
Mathematically we speak of
P * V * M = P * C^-1 * M
where P means the projection matrix, V the VIEW matrix (global-to-local), M the MODEL matrix (local-to-global), and C the camera (local-to-global). Due to the way OpenGL expects the matrices (at least legacy OpenGL), it computes
P * (V * M)
but this is mathematically equivalent to
P * (V * M) = (P * V) * M
That said, the concept of a camera does actually exist in OpenGL, but mathematically it makes no difference whether you move the models or a camera around. It is just a question of efficiency: it is more efficient to let the GPU transform the vertices and have the CPU deal only with the dynamic models and the camera, instead of letting the CPU transform all vertices.
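As a quick check of the equivalence, and of why combining the matrices up front is cheap, here is a sketch in plain Python. The entries of P, V and M are arbitrary small numbers picked so the arithmetic is exact; they are placeholders, not values taken from any real scene. The combined matrix is built once (the CPU's job in the description above) and then applied to each vertex (the work the GPU would do).

```python
def matmul(a, b):
    """Product of two 4x4 matrices (b is applied first, then a)."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Placeholder matrices with the usual shapes: a perspective-style P,
# a VIEW matrix that shifts the world by -5 along z, and a MODEL matrix
# that scales by 2 and moves the model to x = 2.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, -2.0, -3.0],
     [0.0, 0.0, -1.0, 0.0]]
V = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, -5.0],
     [0.0, 0.0, 0.0, 1.0]]
M = [[2.0, 0.0, 0.0, 2.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

grouped_a = matmul(P, matmul(V, M))   # P * (V * M)
grouped_b = matmul(matmul(P, V), M)   # (P * V) * M

vertices = [[0.0, 0.0, 0.0, 1.0],
            [1.0, 0.0, 0.0, 1.0],
            [0.0, 1.0, 0.0, 1.0]]

# One combined matrix, one multiplication per vertex.
combined = [transform(grouped_a, v) for v in vertices]

# Same result as applying M, then V, then P to every vertex separately.
step_by_step = [transform(P, transform(V, transform(M, v))) for v in vertices]

print(grouped_a == grouped_b, combined == step_by_step)
```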