
# The Total Beginner's Guide to 3D Graphics Theory

By Tim Bright | Published Nov 22 2013 02:23 AM in Graphics Programming and Theory
Peer Reviewed by (jbadams, jjd, apatriarca)


# Introduction

When I was a kid, I thought computer graphics was the coolest thing ever. When I tried to learn about it, though, I realized it was harder than I'd thought to create those super slick programs I'd seen growing up. I tried to hack my way through by reading the OpenGL pipeline specs, blogs, websites, and so on, and by doing numerous tutorials, and I got nowhere. Tutorials like NeHe's helped me see how to set things up, but I would misplace one glXXX() call, and my program would either not work at all or run exactly as before, without my new additions. I didn't know enough basic theory to debug the program properly, so I did what any teenager does when they're frustrated because they aren't instantly good at something...I gave up.

However, a few years later I got the opportunity to take some computer graphics classes at university (from one of Ivan Sutherland's doctoral students, no less), and I finally learned how things were supposed to work. If I had known this material earlier, I would have had a lot more success. So, in the interest of helping others in a similar plight, I'll try to share what I learned.

# The Idea Behind Graphics

## Overview

Let's start by thinking about the real world. In the real 3D world, light is emitted from many different sources, bounces off a lot of objects, and some of those photons enter your eye through the lens and stimulate your retina. In a real sense, the 3D world is projected onto a 2D surface. Sure, your brain takes visual cues from your environment and combines your stereoscopic vision to perceive the whole 3D space, but it all comes from 2D information. The 2D image on your retina is constantly changing: things move in the scene, you move relative to the scene, the lighting changes, and so on. Our visual system processes these images at a fairly fast rate, and the brain constructs a 3D model from them.

Horse movie image sequence courtesy of the US Library of Congress.

If we could take images and show them at a similar or higher rate, we could artificially generate a scene that would seem like a real space. Movies basically work on this same principle. They flash images from a 3D scene fast enough that everything looks continuous, like in the horse example above. If we could draw and redraw a scene on the computer that changed depending on motion through the scene, it would seem like a 3D world. Graphics works exactly the same way: it takes a 3D virtual world and converts the whole thing into an accurate 2D representation at a fast enough rate to make the brain think it's a 3D scene.

## Constraints

The human visual system's threshold for perceiving a series of images as continuous is about 16 Hz. For computer graphics, that means we have at most 62.5 milliseconds per frame to do the following:

• Determine where the eye is looking in a virtual scene.
• Figure out how the scene would look from this angle.
• Compute the colors of the pixels on the display to draw this scene.
• Fill the frame buffer with those colors.
• Send the buffer to the display.
• Display the image.
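The arithmetic behind that budget is simple: at a given refresh rate, all the steps above must fit inside one frame period. A quick sketch (the function name is just illustrative):

```python
# Frame-time budget: at a given refresh rate (in Hz), every step
# listed above must finish within 1/rate seconds.
def frame_budget_ms(rate_hz: float) -> float:
    """Milliseconds available per frame at the given refresh rate."""
    return 1000.0 / rate_hz

print(frame_budget_ms(16))   # 62.5 ms, the threshold discussed above
print(frame_budget_ms(60))   # roughly 16.7 ms, a common modern target
```

Modern displays run at 60 Hz or more, which shrinks the budget considerably.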

This is a complex problem. The time constraint means we can't just use a brute-force method: take the 3D scene, throw a bunch of photons into it from all our light sources, calculate trajectories and intensities, figure out which ones hit the eye, map that to a 2D image, and draw it. (Note: that's a bit of a lie, because this is roughly what happens in raytracing, but raytracing's techniques are sophisticated enough, and different enough, that the statement above is essentially true.) Fortunately, there are some cool tricks we can take advantage of to cut down on the amount of computation.

# Basic Graphics Theory

## All the World's a Stage

Painting by the infamous Bob Ross courtesy of deshow.net.

Let's begin with an example. Say you're in a valley with mountains around you and a meadow in front of a river, similar to the Bob Ross painting above. You want to represent this 3D scene graphically. How do you do it? Well, we can try to paint an image that captures all the elements of the scene. That means we have to pick an angle from which to view the scene, paint only the things we can see, and ignore the rest. We then have to determine which parts of which objects are behind others: we can see the meadow, but it obscures part of the river. The mountains are way in the distance, but they obscure everything behind them, so we can ignore the objects behind the mountains. Since the real physical size of the scene is much bigger than our canvas, we have to figure out how to scale what we see down to the canvas. Then we can paint the objects, taking into account the lighting and shadows, the haze of the mountains in the distance, and so on. This is a good analogue for how computer graphics processes a scene. The main steps are:

1. Determine what the objects in the world look like.
2. Determine where the objects are in the world.
3. Determine the position of the camera and a portion of the scene to render.
4. Determine the relative position of the objects with respect to the camera.
5. Draw the objects in the scene.
6. Scale the scene to the viewport of the image.

These steps are basically trying to map points from our objects in the 3D world to the 2D image on the screen. That seems like a lot of work, but there are some really cool math tricks we can use to make it quick and easy. Remember going through algebra and thinking, "What will I ever use this for?" One answer is: graphics!

We can use matrices to map coordinates from the world into our image. Why matrices? For one, a lot of operations can be represented in matrix form. Most importantly, though, we can concatenate operations by multiplying their matrices together, yielding a single matrix that performs all of them at once. So even if we have 50 transformations, we can multiply the 50 matrices together once and get a single matrix that applies all 50 transformations.
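Here is a minimal sketch of that concatenation idea in plain Python (a real renderer would use a linear-algebra library or the GPU; the row-major list-of-lists representation is just for illustration):

```python
# Concatenating 4x4 transformations with plain Python lists.

def mat_mul(a, b):
    """Multiply two 4x4 matrices (row-major lists of lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

translate = [[1, 0, 0, 5],
             [0, 1, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1]]   # move +5 along x
scale = [[2, 0, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 2, 0],
         [0, 0, 0, 1]]       # scale everything by 2

# Multiply once, then apply the single combined matrix to any number
# of points. This combination scales first, then translates.
combined = mat_mul(translate, scale)
print(mat_vec(combined, [1, 1, 1, 1]))  # [7, 2, 2, 1]
```

With 50 transformations instead of 2, the per-point cost is identical: one matrix-vector multiply.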

We can define matrices to do the operations that we talked about during our painting example (defining scene, defining view, etc.). These matrices will convert the scene from one coordinate system to another. These conversions between coordinate systems are called transformations. We will talk about each coordinate system and what transformation will move us from one to the other.

## Object Coordinates - Breaking up objects

How do we draw objects on the screen quickly? Computers are great at executing relatively simple operations many times in rapid succession. To take advantage of this, if we can represent the whole world with simple shapes, we can optimize graphics algorithms to process a lot of simple shapes very fast. This way, we don't have to make the computer recognize what a mountain or a meadow is in order to draw it.

We'll have to create some algorithms to break our shapes down into simple polygons. This is called tessellation. Although we could use squares, we'll usually use triangles. Triangles have lots of advantages: a triangle's three points are always co-planar, and you can approximate just about anything with triangles. The only problem is that round objects will look polygonal. However, if we make the triangles small enough, around 1 pixel in size, we won't notice them. There are lots of opinions on the "best way" to tessellate, and it may depend on the shape you're tessellating.

Let's say we have a sphere that we want to tessellate. We can define the sphere's local origin to be its center. If we do that, we can use an equation to pick points on the surface and then connect those points with polygons that we can draw. A common surface parameterization for a sphere is $$S(u,v) = [r\sin{u}\cos{v}, r\sin{u}\sin{v}, r\cos{u}]$$, where u and v are parameters with domain $$u\in[0,\pi],v\in[0,2\pi]$$ and r is the radius of the sphere. Plotting the sampled surface points and connecting them gives a grid of rectangles; we could just as easily connect them with triangles.
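A small sketch of that sampling step in plain Python (the grid resolution and function names are illustrative; connecting neighboring samples into triangles is left out):

```python
import math

# Sample the sphere parameterization
#   S(u, v) = [r sin(u) cos(v), r sin(u) sin(v), r cos(u)]
# on a regular grid. Adjacent grid points would then be connected
# into quads or triangles for rendering.

def sphere_point(r, u, v):
    return (r * math.sin(u) * math.cos(v),
            r * math.sin(u) * math.sin(v),
            r * math.cos(u))

def tessellate_sphere(r, n_u, n_v):
    """Grid of surface points, u in [0, pi], v in [0, 2*pi]."""
    return [[sphere_point(r, math.pi * i / n_u, 2 * math.pi * j / n_v)
             for j in range(n_v + 1)]
            for i in range(n_u + 1)]

grid = tessellate_sphere(2.0, 8, 16)
x, y, z = grid[4][0]   # u = pi/2, v = 0: a point on the "equator"
print(x, y, z)
# Every sample lies on the sphere: x^2 + y^2 + z^2 equals r^2.
```

Making `n_u` and `n_v` larger gives smaller polygons and a rounder-looking sphere, at the cost of more geometry to process.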

The points on the surface are in what we can call object coordinates. They are defined with respect to a local origin, in this case, the center of the sphere. If we want to place them in a scene, we can define a vector from the origin of the scene to the point we want to place the sphere's origin, and then add that vector to every point on the sphere's surface. This will put the sphere in world coordinates.

## World Coordinates - Putting our objects in the world

We really start our graphics journey here. We define an origin somewhere and every point in the scene is defined by a vector from the origin to that point. Although it's a 3D scene, we'll define each point as a 4-dimensional point $$[x,y,z,w]$$, which will map to a 3D point at coordinates $$[\frac{x}{w},\frac{y}{w},\frac{z}{w}]$$. This kind of mapping is called homogeneous coordinates. There are advantages to using homogeneous coordinates, but I won't discuss them here. Just know we want to use them.

A problem presents itself if we want to move around in our scene. If we want to move our view, we can either move the camera to another location, or just move the world around the camera. In the computer, it's actually easier to move the world around, so we do that and let the camera be fixed at the origin. The modelview matrix is a 4x4 matrix that we can use to move every point in the world around and keep our camera fixed at its location. This matrix is basically a concatenation of all the rotations, translations and scaling that we want to do to the scene. We multiply our points in world coordinates by the modelview matrix to move us into what we call viewing coordinates:

$\left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{view} = [MV] \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{world}$
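A tiny sketch of the "move the world, not the camera" idea (plain Python, helper names illustrative): to place the camera at (0, 0, 5), the modelview matrix instead translates every world point by (0, 0, -5).

```python
# Keeping the camera fixed at the origin: instead of moving the
# camera to (0, 0, 5), translate the entire world by (0, 0, -5).

def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

modelview = translation(0, 0, -5)        # inverse of the camera's placement
print(mat_vec(modelview, [0, 0, 5, 1]))  # [0, 0, 0, 1]: the camera's spot
                                         # in the world lands at the origin
```

A full modelview matrix would also fold in the camera's rotation and any per-object transforms, all concatenated into this one matrix.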

## Viewing Coordinates - Pick what we can see

After we've rotated, translated, and scaled the world, we can select just a portion of it to consider. We do this by defining a viewing frustum, a truncated pyramid. The frustum is formed by defining 6 clipping planes in viewing coordinates. The idea is that everything outside the frustum will be clipped, or discarded, when drawing the final image. The frustum is encoded in a 4x4 matrix. The OpenGL glFrustum() function defines this matrix as follows:

$P = \left [ \begin{matrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \\ \end{matrix} \right ]$

Picture courtesy of Silicon Graphics, Inc.

We can adjust this matrix for perspective or orthographic viewing. Perspective views have a vanishing point; orthographic views don't. Perspective views are what you usually see in paintings, while orthographic views appear in technical drawings. Because this matrix controls how objects are projected onto the screen, it is called the projection matrix. Here, t, b, l, r, n, f are the coordinates of the top, bottom, left, right, near, and far clipping planes. Multiplying by the projection matrix moves a point from viewing coordinates to what we call clip coordinates:

$\left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{clip} = [P][MV] \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{world}$
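As a sanity check, here is the projection matrix built directly from the formula above, in plain Python (a sketch; a real program would call glFrustum() or a math library):

```python
# The glFrustum projection matrix, built from the formula above.
def frustum(l, r, b, t, n, f):
    return [[2 * n / (r - l), 0,               (r + l) / (r - l),  0],
            [0,               2 * n / (t - b), (t + b) / (t - b),  0],
            [0,               0,              -(f + n) / (f - n), -2 * f * n / (f - n)],
            [0,               0,              -1,                  0]]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

P = frustum(-1, 1, -1, 1, 1, 100)   # symmetric frustum, near=1, far=100

# A point on the near plane (z = -n, with the camera looking down -z)
# ends up with z/w = -1 after the later perspective divide; a point on
# the far plane ends up with z/w = +1.
print(mat_vec(P, [0, 0, -1, 1]))
print(mat_vec(P, [0, 0, -100, 1]))
```

Note how the last row copies -z into w; that stored depth is what makes the perspective divide in the next stages produce foreshortening.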

## Clip Coordinates - Only draw what we see

This coordinate system is a bit different. These coordinates are left-handed (we've been dealing with right-handed systems up to now), and they are set up so that the viewing frustum we defined earlier maps, after the perspective divide, to a cube that ranges from -1 to 1 in X, Y, and Z.

Up to now, we've been keeping track of all the points in our scene. However, once we have them in clip coordinates, we can start clipping them. Remember our 4D-to-3D point conversion? If not: we said that $$[x,y,z,w]_{4D} = [\frac{x}{w},\frac{y}{w},\frac{z}{w}]_{3D}$$. Because we only want points in our viewing frustum, we only process points further if $$-1 \le \frac{x}{w} \le 1$$, or equivalently $$-w \le x \le w$$. The same goes for the Y and Z coordinates. This gives a simple way to tell whether points lie inside or outside our view.

For the points inside our viewing frustum, we do something called the perspective divide, where we divide by w to move from 4D to 3D coordinates. These points are still in the left-handed clip coordinate system, but at this stage we call them normalized device coordinates.
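Both the clip test and the perspective divide are one-liners; a minimal sketch (function names are illustrative, and real pipelines clip whole triangles, not isolated points):

```python
# The clip test and perspective divide on a single homogeneous point.

def inside_frustum(p):
    """[x, y, z, w] survives clipping iff -w <= c <= w for c in x, y, z."""
    x, y, z, w = p
    return all(-w <= c <= w for c in (x, y, z))

def perspective_divide(p):
    """Divide by w to get normalized device coordinates."""
    x, y, z, w = p
    return [x / w, y / w, z / w]

p = [2.0, -1.0, 0.5, 2.0]
print(inside_frustum(p))                      # True: each component within [-w, w]
print(perspective_divide(p))                  # [1.0, -0.5, 0.25]
print(inside_frustum([3.0, 0.0, 0.0, 2.0]))   # False: x > w
```

Testing against w before dividing avoids a division for every discarded point, which is part of why the clip test is done in this form.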

## Normalized Device Coordinates - Figure out what obscures what

You can think of this as an intermediate step before mapping to an image. Images come in many possible sizes, and we don't want to render for one size and then have to stretch or re-render the image if the size changes. Normalized device coordinates (NDC) are nice because, no matter what the image size is, you can scale points in NDC to fit it. In NDC you can already see how the image will be constructed: the rendered image is the projection of the objects inside the frustum onto the near clipping plane. Thus, the smaller a point's Z coordinate, the closer that point is.

At this point, we don't usually do matrix calculations anymore, but apply a viewport transformation. This is usually just to stretch the coordinates to fit the viewport, or the final image size. The last step is to draw the image by converting things to window coordinates.
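The viewport transformation is just a linear stretch; a sketch in plain Python (window size and names are illustrative):

```python
# Viewport transformation: stretch NDC x, y in [-1, 1] to pixel
# coordinates in a width x height window.
def viewport(ndc_x, ndc_y, width, height):
    px = (ndc_x + 1.0) / 2.0 * width
    py = (ndc_y + 1.0) / 2.0 * height
    return px, py

print(viewport(0.0, 0.0, 800, 600))    # (400.0, 300.0): center of the window
print(viewport(-1.0, -1.0, 800, 600))  # (0.0, 0.0): one corner
print(viewport(1.0, 1.0, 800, 600))    # (800.0, 600.0): the opposite corner
```

If the window size changes, only this final stretch changes; nothing upstream has to be recomputed differently.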

## Window Coordinates - Scale objects to canvas

The window is where the image is being drawn. At this point, our 3D world is a 2D image on the near clipping plane. We can use a series of line and polygon algorithms to draw the final image. 2D effects, such as anti-aliasing and polygon clipping, are done at this point before the image is drawn.

Our window might also use a different coordinate convention. For example, images are sometimes drawn with positive X to the right and positive Y downward. A transformation might be needed to draw things correctly in window coordinates.

## There and Back Again - The Graphics Pipeline

You won't be handling all of the above steps yourself. At some point you'll use a graphics library to define things like the modelview and projection matrices and your polygons in world coordinates, and the library will do just about everything else we've talked about. If you're designing a game, you don't care how the polygons get drawn, only that they get drawn correctly and quickly, right?

Libraries like OpenGL and DirectX are very fast and they can use dedicated graphics hardware to do these computations quickly and easily. They are already widely available and there are a large number of developers that use them, so get comfortable with them. They still leave you with a lot of control over how things are done and you'd be amazed at some of the things people can do with them.

# Conclusion

This is a very simple overview of how things are done. There are many more things that happen at the later stages of the rendering process, but this should be enough to get you oriented so that you can read and understand all the brilliant techniques presented in other articles and in the forums.

If you're interested in reading up on some of the interesting things in this article, I suggest the following sites:

http://www.songho.ca/opengl/index.html

# Article Update Log

21 Nov 2013: Initial release

I'm an engineer that designs computer-aided design tools. I'm new to game architectures and game programming, but I'm fairly well versed in mathematics, computational geometry, and graphics theory.

Nice article, thanks

Very good, thanks a lot

Nice article, but I would like to say that I object to calling Bob Ross "infamous"

Good article where can I learn why a 4D model is preferred over a 3D?


I don't know of a good one to address the topic. You can look up "projective geometry" or "homogeneous coordinates" if you want the full mathematical reasons. I'm not a mathematician, but I suspect it has to do with being able to combine the 3x3 rotation matrix and a translation into a single 4x4 matrix. Normally these are separate operations that aren't able to be combined without elevating to a higher dimensional space. Projective geometry is used a lot working with rational Bezier curves and B-splines as well, so it's got a lot of applications besides just graphics.

" In a real sense, the 3D world is projected on to a 2D surface. "

Well I'm unsure if this is phrasing, but I don't need my eyes open to know that the sofa I'm sat on is still a 3D object.. There's no projection, we live in 3D with X, Y and Z..

If you mean representing a 3D world onto a flat 2D surface like a monitor then yes, I get what you are saying.

"Sure, your brain takes visual cues from your environment and composites your stereoscopic vision to perceive the whole 3D space."

No, what we have is depth perception via stereoscopic vision; the trick is how we then reproduce visual cues and depth perception on a device like a monitor. Even on a 2D plane there are:

Interposition, linear perspective, relative and known size, texture gradients etc.

Then we have motion-based cues like:

Parallax, kinetic depth and dynamic occlusion.

There's been a lot of research into how humans perceive depth in 3D games, by the likes of Microsoft and state universities. It's a very interesting subject.

All in all well done, good stuff.

Thanks for your feedback. I'm trying to describe in as few words as possible that the main idea behind 3D graphics is to create a series of images that mimics what would happen if photons from a real scene would enter the eye. There's a lot more to how the brain processes visual information than the very topical and simplistic explanation I gave, but I thought it was sufficient for beginners trying to understand how graphics can represent 3D objects pretty accurately.

No problem and apologies for nit picking, It's a hard subject to understand and even harder to teach.

Looking forward to some more articles.

