Today (2014) the topic of 3d graphics with textured polygons is all set and done. So we can look back relaxed without worrying about an instruction set of a specific CPU or worrying how to impress a fab to produce silicon. There are two kinds of computer graphic techniques which are easy to understand (for me at least). One are hardware accelerated bitmap images with color keying, so called sprites. Every sprite is stored as a bitmap in memory. We have a limited choice of sizes on most real and common hardware, but often the bitmaps have 32x32 texels. We then tells the video card where to lay these bitmaps over the background. Typically we would write the coordinates of the top left pixel of the background where your sprite is to be overlayed into some register. Even in the 80ies basic scaling was possible. You could scale by a factor of two for example. It was meant to save memory. If one wants an arbitrary integer-pixel texel-size, the hardware would have to have a counter. After every pixel the counter is counted down. If it drops below zero the texel coordinate is counted one down and the counter is reset to the start value we would write into a register before the frame is drawn. These are typical one cycle operations. It just needs some silicon, but helps tremendously to save memory for in-screen racing games. The next step would be to set the width of whole sprite with pixel precision. Some school math and Bresenham tell us, we have to set up a second counter “sprite pixel width rest”. Whenever we count the texel coordinate down, we also would count down the sprite pixel width counter. When it drops below zero we add a the “sprite pixel width rest” and omit one count of the integer part count. For example if we add the fraction 16 (16 is a fraction of the sprite width of 32), then every second texel is one pixel wider allowing us to scale the sprite to 48 px for example. The extra logic runs in parallel, everything is single cycle. Sega has done this for “OutRun”. We have secretly introduced another degree of freedom: the pixel count can be initialized to different values. This will make the sprite scroll with sub-pixel precision. You will see nothing at integer scales, get wavy motion at nearly integer values, but get a nice effect otherwise. So already with this quite simple hardware we leave the save ground of integer pixel coordinates.

So lets switch to ray-tracing. There we take the equations without modifications and throw it at your double precise floating point coprocessor. But even then you sometimes we get pixel snow artifacts. And if you program for some time you may want to prove that your program is always working right. Sometimes you get random crashes in a graphics engine and you want to be sure that it is not a fundamental flaw in your math. Or you get the history bug and wonder why things have not been done earlier. I am specifically thinking of the golden era, the transition of the SNES, Amiga to 3dfx cards, Sega Saturn, Sony Playstation 1.

There are two reason to draw polygons like sprites. One is simple: Billboards -- think of “After Burner” .The other is a 3 sentence long story. While spheres and tubes can be ray-traced very well, general geometry like human characters of landscapes are better represented by a polygon mesh. Lets say we are in a cave. Then the sky should not shine through gaps between polygons. If we have a transparent mesh representing glass for example, then in a ray-tracer we do not want to enter the material two times as this may crash the refraction calculation. In a ray tracer you test where you pass edges. In a mesh multiple polygons meet at vertices, but opposing polygons usually do not share edges. Thus rounding errors can lead to gaps and overlaps at vertices. Apparently there is no problem at edges. But as said above, you do not want any gaps and you can get that perfection so why settle with less. The trick is to project the vertices to the screen and then round them to such a low precision that no further rounding occurs in the edge calculation. The edge calculation is – you guessed it – done using the Bresenham Algorithm which traces along the edges on the screen. With 8 bits this is tough, but with 16 bit we have much reserve to keep the rounding error far below the screen resolution and so to hide them from the viewer (user). We can even build on this to get our sprite bitmap on the screen. In the dark ages before Wolfenstein3d people would either do (professional) 3d simulations with polygons but no bitmaps whatsoever. Or they would do parallel projection and use bitmaps. As wikipedia tells us academics had already united these concepts and named it “texture mapping”, but knowledge transfer was slow.

Lets for a moment concentrate on 2d game evergreen genres like beat'em'up, shoot'em'up, jump'n'run and RPG. These have very narrow angles of view. As shown by numerous PlayStation 1 games for these linear interpolation of texture coordinates across polygons gives satisfying results. For these, both texture coordinates (U and V) have to be interpolated along a scan-line, roughly doubling the cost compared to scaling. Bresenhams algorithm would need twice the number of bits whereas for satisfying results we do expect to need roughly the same number of bits as for the edges. The solution is to do linear interpolation simply using fixed point numbers and rounding. Now the interpolated U and V may lie a little bit outside the texture map, but saturation at the bitmap boundaries does not lead to noticeable artifacts here and is cheap in silicon. With ray tracers perspective texture mapping is quite straight forward, but seems to be far to costly to do it in real-time or in a handful of CMOS gates. So after looking at the equations you have to take a step back and think of an infinite checkerboard, which, by the way, is a popular background for ray-traced images. For these we take the screen distance from the horizon (called “W”) and take the reciprocal to get the distance from the camera (“Z”). Every textured polygon can be represented as a cut out from such an infinite checkerboard. A third linear interpolation unit is need on our chip to linearly interpolate W across the polygon. Going back to the equations one may prove that by first multiplying each U, V with W at the vertices, then interpolating and the dividing by W at the pixels gives the same U and V a straight forward ray-tracer would calculate. Multiplication always has a lower latency than division and needs less silicon. And there are further reasons why here the multiplication is not as expensive as one might think when looking at an unoptimized microcode implementation in a 8086. When drawing a scan-line of the polygon, the values of U and V (after division) at two pixels can be used to extrapolate linearly to the next pixel (“forward differencing”). Also U/W, V/W and W only change by a small increment. Writing each as a sum of the old value and a small delta and applying the law of distribution leads to multiplications with mostly a small number of involved bits. An optimized multiplication could skips zeros. In our use-case such a multiplication would execute in 2 cycles most of the time. The Goldschmidt algorithm allows us to approximate the division by W by a multiplication. We multiply the Z calculated by extrapolation with the current W. This should result in the number one with some error. We multiply both sides with one minus that error and the law of distribution says us, that we get a new Z with an even smaller error.

Our sprite is 32x32 px needing 5 bits. Lets allow bi-linear interpolation to scale up the bitmap to 128x128px and allow 8 repeats for material textures and thus accept 10 bit texture coordinates. If our linear extrapolation predicts the significant 5 bits correctly, a single multiplication gets us the other 5 less significant bits. Since we accept no interpolation error, we can switch from horizontal to vertical or diagonal or two-left-one-up “scan-lines” without visible jumps in the texture. So at every frame we can chose the direction with the lowest Z variation. Note that the horizontal and the vertical direction are still the most important directions because the longer jumps in the other directions increase any error. In software this would lead to many conditional branches slowing down the code, but in hardware this can be implemented in a handful of gates. The scan-lines are drawn front to back. So if the error gets so large that our multiplication cannot push it back only pixels on the far side in the moire pattern are affected. In contrast to sprites we have to render to a frame buffer if we want to render front to back and low variation z. But we need that for painters algorithm anyways. And a 3d scene typically has a large variation in polygon size and we could not saturate our render unit(s).

For a walk-through, polygons, may be (partly) outside the viewing pyramid. Vertices on the left and top surface of this pyramid are considered outside of the screen in order to remove vertices at the origin to prevent division by zero. Often people employ a back plane to avoid overflow in fixed point calculations. For the calculation described above define that after the vertex transformation but before the interpolation the exponent of corresponding values at different vertices are made to share the largest value (among them) to allow fixed point interpolation. There is no constraint across different polygons making any far plane unnecessary.

The perspective projection involves a division and one could fear overflow errors. These are like division by zero, but we have the feeling that we should not simply drop that vertex, but need the result. On our application overflow just means the vertex is outside the viewing pyramid. In moste floating points formats the sign is processed independently from the mantissa and the exponent. We always know if a vertex is too far left or too far right. It would even be wise to integrate a saturation in the division in order to not waste unnecessary energy on calculating large numbers. If two points are off to the left, an edge between them is invisible. Otherwise we need to calculated the y and z coordinate of the intersection with the left pyramid face, and similarly for the other sides of the pyramid. A polygon with all edges passing over the pyramid is discarded otherwise the intersection define a new polygon. If an edge of the pyramid pierces through the polygon, we may employ the good old ray-tracing equations to calculated the texture coordinates in that corner of the screen. These calculations are best done with floating point variables with at least single precision using the plain and well known equations. If the viewing point is close to a polygon in the middle between the vertices, the limited precision may lead to jumps of the geometry. For double precision this effect needs very extreme scenes. Low poly flight simulator scenes of the past come to mind, but anything with a height-field on the ground should be safe.

Sprites have four sides. For this reason this text does not say triangle, but polygon. Furthermore triangles become polygons after clipping. At least they would be perfectly convex in the above scenario. The rounding of the vertex position on screen removes this property from polygons already present in the unclipped mesh. For a software renderer this introduces a lot of conditional branches. In hardware this are again only some gates which determine for each scan-line if we see the front or the back of that polygon. The filling algorithm for arbitrary polygons has a manageable complexity but probably needs a register for each vertex. Also slmost flat quads appear naturally in subdivision surfaces. For convenience of the programmer it may make sense to optionally let them turn on a flatness check on polygons.

The playstation 1 showed that if all models are smooth the painters algorithm produces no artifacts. Doom introduced BSP for architectural scenes on consumer hardware. Descent and Duke Nukem 3d used portals. No one needs a z-buffer. In order to stay in the golden age we define that a z-buffer is too expensive and no screen space z-buffer illumination calculations are performed. Triangles are supposed to cover 32x32 pixels like a sprite and thus a polygon based visible surface determination is more efficient. Furthermore I like the idea of visible surface coherence across frames, which only allows to speed up polygon based sorting, not pixel based sorting.

On a SNES the pixel bit-depth was equal to the bus width to the memory (more or less, closest it ever came). Pixel clock on standard definition PAL or NTSC was the same as the clock rate of the memory. Thus no caching needs to be considered and in turn texture switches are cheap. Assuming 2 times overdraw and taking into account that clock speeds on chip were about two times the pixel clock, the hardware needs to write one pixel per cycle. But the scan-line loop described above contains to multiplications each taking 2 cycles. Additionally, the first two pixels at each scan-line need a full division. There may be some delay when the hardware determines the next diagonal scan-line through non-convex polygon. Thus we estimate that such a chip needs 6 render units drawing 6 consecutive scan-lines to saturate memory bandwidth. If we organize dynamic memory such as that pages cover 2:1 rectangles on screen, write access should benefit from that (coherent access to fast page memory).