I'm going slightly mad on this.

I've been trying to learn OpenGL. I've never learnt 3D programming / math (in a very solid way, anyway). I do know some basics about vectors, matrices, matrix multiplication, the identity matrix, how matrices store rotations, scaling and other transformations. I've read a couple hundred pages from 3D Math Primer for Graphics and Game Development, but that was at least a year ago. Furthermore, I guess there probably are some gaps in my 3D knowledge.

Lately, I tried to learn OpenGL through the Red Book. However, the book isn't very beginner-friendly. So after googling for a while I tried to learn from the arcsynthesis.org tutorials. I was doing very well until the perspective projection tutorial (http://www.arcsynthesis.org/gltut/Positioning/Tut04%20Perspective%20Projection.html). I don't get all the math in the "Camera perspective" section; tbh I think it's poorly explained (the rest of the tutorial is top-quality, though, gotta be fair).

From arcsynthesis I jumped to scratchapixel.com. The perspective projection matrix lesson (http://scratchapixel.com/lessons/3d-advanced-lessons/perspective-and-orthographic-projection-matrix/perspective-projection-matrix/) helped me understand things a bit better, but I'm starting to get really confused, as scratchapixel.com and arcsynthesis.org seem to take slightly different approaches.

From what I understand, the homogeneous coordinates (as **scratchapixel.com** calls them) are obtained by dividing x, y and z by w. Those will then be the NDC (normalized device coordinates, as **arcsynthesis** calls them), right? **Arcsynthesis** seems to say that we "prepare" the perspective projection by setting the correct w coordinate for each vertex and letting the hardware do the rest:

> The basic perspective projection function is simple. Really simple. Indeed, it is so simple that it has been built into graphics hardware since the days of the earliest 3Dfx card and even prior graphics hardware.
>
> You might notice that the scaling can be expressed as a division operation (multiplying by the reciprocal). And you may recall that the difference between clip space and normalized device coordinate space is a division by the W coordinate. So instead of doing the divide in the shader, we can simply set the W coordinate of each vertex correctly and let the hardware handle it.
>
> This step, the conversion from clip-space to normalized device coordinate space, has a particular name: the perspective divide. So named because it is usually used for perspective projections; orthographic projections tend to have the W coordinates be 1.0, thus making the perspective divide a no-op.
>
> (...)
>
> Suffice it to say that there are very good reasons to put the perspective term in the W coordinate of clip space vertices.
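
To check that I'm reading the quote correctly, here is a minimal Python sketch of the perspective divide as I understand it. The function name `perspective_divide` and the sample vertices are made up for illustration; in real OpenGL this step is done by the hardware after the vertex shader, not by user code.

```python
def perspective_divide(clip):
    """Convert a clip-space position (x, y, z, w) to NDC (x/w, y/w, z/w)."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


# Orthographic-style vertex: w is 1.0, so the divide changes nothing.
ortho_ndc = perspective_divide((0.5, -0.25, 0.1, 1.0))

# Perspective-style vertex: w carries the depth term, so x, y and z shrink.
persp_ndc = perspective_divide((0.5, -0.25, 0.1, 2.0))

print(ortho_ndc)
print(persp_ndc)
```

If that is right, the same x and y end up closer to the center of the screen purely because w is larger, which seems to be the "let the hardware handle it" part.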

However, the same **Arcsynthesis** states:

> Recall that the divide-by-W is part of the OpenGL-defined transform from clip space positions to NDC positions. Perspective projection defines a process for transforming positions into clip space, such that these clip space positions will appear to be a perspective projection of a 3D world. This transformation has well-defined outputs: clip space positions. But what exactly are its input values?
>
> We therefore define a new space for positions; let us call this space camera space.

...then it proceeds to explain how to feed the clip space with already-processed vertex coordinates (in a way that gives the sense of perspective). I'm confused... Isn't that effect achieved by setting the correct w values for the perspective divide, which happens during the conversion from **clip space to NDC**???
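
Here is how I currently picture the two steps fitting together, as a rough Python sketch. It assumes the conventional OpenGL setup where the camera looks down the negative Z axis and the matrix is applied to column vectors; the names `perspective_matrix`, `frustum_scale`, `z_near` and `z_far` are my own choices for the sketch, not necessarily what either tutorial uses.

```python
def perspective_matrix(frustum_scale, z_near, z_far):
    """4x4 matrix (list of rows) taking camera space to clip space."""
    a = (z_far + z_near) / (z_near - z_far)
    b = (2.0 * z_far * z_near) / (z_near - z_far)
    return [
        [frustum_scale, 0.0, 0.0, 0.0],
        [0.0, frustum_scale, 0.0, 0.0],
        [0.0, 0.0, a, b],
        [0.0, 0.0, -1.0, 0.0],  # last row copies -z_camera into w_clip
    ]


def transform(m, v):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


def to_ndc(clip):
    """The perspective divide: the step the hardware performs."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


m = perspective_matrix(frustum_scale=1.0, z_near=1.0, z_far=3.0)

# Same x and y, one point on the near plane and one on the far plane.
near_clip = transform(m, (0.5, 0.5, -1.0, 1.0))
far_clip = transform(m, (0.5, 0.5, -3.0, 1.0))

near_ndc = to_ndc(near_clip)
far_ndc = to_ndc(far_clip)

print(near_clip, near_ndc)
print(far_clip, far_ndc)
```

In this sketch the matrix only prepares the clip-space values (including w), and nothing looks like a perspective image until the divide runs. The far point ends up with smaller x and y than the near point, and the near and far planes land at z = -1 and z = +1 in NDC.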

Scratchapixel.com, on the other hand, defines the perspective divide as some sort of process of making w equal to z, so that dividing the coordinates by w will simultaneously make them homogeneous and fit into the z = 1 plane (which is considered the image plane in their example, or projection plane as **arcsynthesis** calls it).

> Also remember from the beginning of this lesson, that point Ps, i.e. the projection of P onto the image plane, can be computed by dividing the x- and y-coordinates of P by its z-coordinate. So how do we compute Ps using point-matrix multiplication? First, we set x', y' and z' (the coordinates of Ps) to x, y and z (the coordinates of P). Then we need to divide x', y' and z' by z. Transforming x', y' and z' into x, y, and z is easy enough. First set the matrix to the identity matrix (The identity matrix is where the pivot coefficients, or the coefficients along the diagonal of the matrix, equal 1. All others coefficients equal 0). But why do we divide x', y' and z' by z? We explained in the previous section that a point expressed the homogeneous coordinate system (instead of in the Cartesian coordinate system) has a w-coordinate that equals 1. When the value of w is different than 1, we must divide the x-,y-,z-,w-coordinates of the point by w to reset it back to 1. The trick of the perspective projection matrix thus consists of making the w'-coordinate in Ps be different than 1, so that we have to divide x', y', and z' by w'. When we set w' to z (z cannot equal 1), we divide x', y' and z' by w' (which is equal to z). Through division by z, the resulting x'- and y'-coordinates form the projection of P onto the image plane. This operation is usually known in the literature as the z or perspective divide.
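
As far as I can tell, the trick described there boils down to something like the following Python sketch. It assumes column vectors and a matrix stored as a list of rows, which may not match scratchapixel's own convention; `mat_vec`, `projection` and the sample point are hypothetical names and values I picked.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


# Identity matrix, except the last row is changed so that w' = z.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

p = (2.0, 4.0, 8.0, 1.0)          # point P with w = 1
x, y, z, w = mat_vec(projection, p)

# Dividing by w' (which now equals z) projects P onto the z = 1 plane.
ps = (x / w, y / w, z / w)

print(w)
print(ps)
```

Here x', y' and z' come out of the matrix unchanged, w' comes out equal to z, and only the division afterwards moves the point onto the image plane.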

So technically, the perspective projection occurs **during the conversion from clip space into NDC, NOT before feeding the clip space coordinates** (as arcsynthesis stated)! Am I right?

Things are starting to get really messy in my mind, and I'm wondering if I should just pick a textbook on this subject and learn it all the hard way. Subsequent concepts mentioned by scratchapixel.com like far view and near view are also unclear to me, though there are previous tutorials on scratchapixel.com that should enlighten me about those. The question is... should I keep hopping from place to place to figure stuff out one thing at a time, or would I be better off grabbing a textbook? If so, which one?

Any help would be much appreciated. This is starting to harm my motivation to learn OpenGL. It really gives a sense of overwhelmingness (just made up a new word).