Why is math transformation taxing to most CPUs?

12 comments, last by tanzanite7 9 years, 4 months ago

I'm reading the 3-D Graphics section of the "Video" chapter in my I.T. book, and it states:

1) The computer must track all of the vertices of all of the objects in the 3-D world, including the ones you cannot currently see.

Question 1: When the book says "the computer," does it really mean the program itself, the CPU that does the processing of the "addresses," or the RAM where the addresses are stored?

2) This calculation process, called transformation, is extremely taxing to most CPUs.

Question 2: Is it because the CPU cannot process the address of the data quickly enough per frame? Does that lead to a slow frame rate in the game?

Anybody who knows a lot about computer architecture and 3D game programming, please share your experiences with me. =] I have only done 2D game programming, where I worked with X and Y coordinates; I have never explored X, Y, and Z.



1) The computer must track all of the vertices of all of the objects in the 3-D world, including the ones you cannot currently see.

Question 1: When the book says "the computer," does it really mean the program itself, the CPU that does the processing of the "addresses," or the RAM where the addresses are stored?

The vertices have to be stored in RAM, of course. The program is responsible for managing that data.
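For a concrete picture, here is a minimal sketch (the layout is illustrative, not from the book) of what that vertex data might look like sitting in RAM:

```cpp
#include <vector>

// Illustrative only: one 3-D position per vertex.
struct Vertex { float x, y, z; };

// A million-vertex world: roughly 12 MB of RAM that the
// program allocates, updates, and hands to the renderer.
std::vector<Vertex> vertices(1000000);
```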


2) This calculation process, called transformation, is extremely taxing to most CPUs.

Question 2: Is it because the CPU cannot process the address of the data quickly enough per frame? Does that lead to a slow frame rate in the game?

GPUs were invented because CPUs can't keep up. Let's assume you have a million vertices (quite low nowadays) to deal with. Each one needs its position multiplied by a matrix to transform it to 2D space, which means a dot product against each of four rows of four elements, sixteen multiply-adds in all. Let's crudely call it 50 instructions to process each vertex, giving us 50M total per frame. Now we want to run at 60 frames per second, which puts us at 50M * 60 = 3 billion calculations per second. If we're optimistic and assume that we're getting one calculation per cycle out of a 3 GHz CPU (this is very unrealistically optimistic), we've pretty much consumed the entire amount of CPU that we had in doing just vertices. Let's not forget pixels need to be calculated, physics needs to be calculated, oh and the actual game needs to fit in there somewhere.
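To make that per-vertex work concrete, here is a minimal sketch of the loop being described; the Vec4/Mat4 types and function names are made up for illustration, not taken from any particular engine:

```cpp
#include <cstddef>

// Illustrative types; a real engine would have its own.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; }; // row-major

// One 4x4 matrix * vector transform: four dot products of four
// elements each, i.e. sixteen multiplies and twelve adds.
Vec4 transform(const Mat4& a, const Vec4& v) {
    return {
        a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.z + a.m[0][3]*v.w,
        a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.z + a.m[1][3]*v.w,
        a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.z + a.m[2][3]*v.w,
        a.m[3][0]*v.x + a.m[3][1]*v.y + a.m[3][2]*v.z + a.m[3][3]*v.w,
    };
}

// The loop the CPU would have to grind through every frame
// for a million vertices.
void transform_all(const Mat4& mvp, const Vec4* in, Vec4* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        out[i] = transform(mvp, in[i]);
}
```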

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

a 3 GHz CPU (this is very unrealistically optimistic), we've pretty much consumed the entire amount of CPU that we had in doing just vertices.


And just to carry through: even if you toss in SIMD (4-8x as many operations per clock, if being unrealistically ideal) and threads (2-8x as many, again if being unrealistically ideal), you're still stuck with at best 64x as much processing power, which is still nowhere near as efficient as a budget GPU with ~200-400 shader cores or an enthusiast GPU with ~1000-2000 or more shader cores, at least when it comes to the kind of operations that work well on a GPU.

Sean Middleditch – Game Systems Engineer – Join my team!

How old is that book?

In the 90's the first bottleneck was rasterizing a triangle. Once GPUs became better at it, the next most expensive operation was transform and lighting, which at that time was done on the CPU and sent to the GPU every frame.

That's why HW TnL (Hardware Transform and Lighting) was invented, which kept the vertices in the GPU at all times and did the math entirely on the GPU. Later this would evolve into what we now know as vertex shaders.

I have a hunch that book could be really, really old.

How old is that book?

In the 90's the first bottleneck was rasterizing a triangle. Once GPUs became better at it, the next most expensive operation was transform and lighting, which at that time was done on the CPU and sent to the GPU every frame.

That's why HW TnL (Hardware Transform and Lighting) was invented, which kept the vertices in the GPU at all times and did the math entirely on the GPU. Later this would evolve into what we now know as vertex shaders.

I have a hunch that book could be really, really old.

This book is an information technology book, 2 years old. Its focus is troubleshooting. The author decided to shed light on some 3D graphics just for fun.

Well, the author may need to replenish the oil in his lantern, if he wants to shed light on anything.

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

2) This calculation process, called transformation, is extremely taxing to most CPUs.

Question 2: Is it because the CPU cannot process the address of the data quickly enough per frame? Does that lead to a slow frame rate in the game?

This one is actually still true.

Matrix-matrix multiplies and matrix-vector multiplies are a big cost. A few really smart math geeks have greatly reduced the costs, and some really smart hardware geeks have moved a portion of the cost over to the graphics card rather than the CPU. However, the operations are not free, and they are the most common basic operations used in graphics, physics, and other systems.

Done poorly, a game can still overload the CPU with badly-written math operations. It is a known concern. Faster processors and good libraries can help reduce and mitigate the concern, but it is still something you will see quite visibly in profiling numbers.

A simple naive matrix multiplication, a 4x4 multiplied by another 4x4, is rather costly. You multiply each row by each column (a dot product) to compute each of the 16 results, which works out to 64 floating-point multiplications and 48 floating-point additions. While an individual matrix multiply isn't overly taxing, doing many of them quickly adds up to an unacceptable cost.
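A minimal sketch of that naive approach, assuming a simple row-major Mat4 type (illustrative, not from any particular library; aligned so the SSE version below can share it):

```cpp
// 16-byte aligned so the SSE variant in the next sketch can reuse the type.
struct alignas(16) Mat4 { float m[4][4]; };

Mat4 mul_naive(const Mat4& a, const Mat4& b) {
    Mat4 r;
    for (int i = 0; i < 4; ++i) {       // each row of a
        for (int j = 0; j < 4; ++j) {   // each column of b
            // One dot product: 4 multiplies, 3 adds.
            r.m[i][j] = a.m[i][0] * b.m[0][j]
                      + a.m[i][1] * b.m[1][j]
                      + a.m[i][2] * b.m[2][j]
                      + a.m[i][3] * b.m[3][j];
        }
    }
    return r; // 16 dot products: 64 multiplies, 48 adds
}
```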

With a little bit of math magic and some SIMD instructions you can reduce it to the oft-cited code snippet of 16 multiplications, 12 additions, and 16 "shuffles" that let you reuse some of the intermediate results. It ends up about 5x to 6x faster, depending on the details of the naive implementation it is compared against. I'm not sure where it came from or what the proper name for it is, but it has been floating around the web for about a decade now. There are many similar speedy specialized algorithms for various vector-matrix operations, for both column-vector and row-vector conventions.
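Here is a sketch of that widely circulated SSE version, using the same illustrative Mat4 as above; per output row it does 4 shuffles, 4 multiplies, and 3 adds, matching the counts quoted:

```cpp
#include <xmmintrin.h> // SSE intrinsics

Mat4 mul_sse(const Mat4& a, const Mat4& b) {
    Mat4 r;
    const __m128 b0 = _mm_load_ps(b.m[0]);
    const __m128 b1 = _mm_load_ps(b.m[1]);
    const __m128 b2 = _mm_load_ps(b.m[2]);
    const __m128 b3 = _mm_load_ps(b.m[3]);
    for (int i = 0; i < 4; ++i) {
        __m128 ai = _mm_load_ps(a.m[i]);
        // Broadcast each element of row i (one shuffle each), scale an
        // entire row of b with it, and sum the four partial rows.
        __m128 row =          _mm_mul_ps(_mm_shuffle_ps(ai, ai, 0x00), b0);
        row = _mm_add_ps(row, _mm_mul_ps(_mm_shuffle_ps(ai, ai, 0x55), b1));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_shuffle_ps(ai, ai, 0xAA), b2));
        row = _mm_add_ps(row, _mm_mul_ps(_mm_shuffle_ps(ai, ai, 0xFF), b3));
        _mm_store_ps(r.m[i], row);
    }
    return r; // totals: 16 shuffles, 16 multiplies, 12 adds
}
```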

While that is a reduction in the number of steps, a matrix multiply is still one of the more costly low-level operations you can do, and graphics relies heavily on it. Every time you move or position something in 3D space you need to run a series of matrix multiplies all the way through that portion of your scene. You've got the Model (or World), View, and Projection matrices, which ultimately need to be combined and applied to every vertex that gets rendered. You'll need to do quite a few of those matrix multiplies on the CPU, but fortunately you can pass the pre-multiplied result out to the GPU and let the card's specialized hardware do the rendering and heavy lifting.
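A sketch of that hand-off, assuming the GLM math library and an OpenGL loader such as glad (neither is named in the original post):

```cpp
#include <glad/glad.h>            // assumed OpenGL function loader
#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>

// Combine Model, View, and Projection once per object on the CPU
// (two 4x4 multiplies), then hand the result to the GPU, which
// applies it to every vertex in its specialized hardware.
void uploadMvp(GLint mvpLocation, const glm::mat4& model,
               const glm::mat4& view, const glm::mat4& projection) {
    glm::mat4 mvp = projection * view * model;
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, glm::value_ptr(mvp));
}
```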

Physics relies heavily on it too; every time you move a physics object, you rely on this same math. Much like the graphics APIs, there are physics libraries (e.g. PhysX) that take advantage of hardware to do the more costly parts. Most physics simulations work on bigger primitives rather than point clouds, so they often require fewer total matrix operations, but it can still eat a hefty portion of the CPU budget.

I think in the context of point 1), point 2) reveals a certain lack of understanding of the role of GPUs on the part of the author.

Matrix transforms done on the CPU to track object positions, for example for culling or physics, would not run into the millions per frame unless your world contained a very large number of objects.


Done poorly, a game can still overload the CPU with badly-written math operations.

"Badly-written math operations" as in "poorly optimized math code"? Could you write an example showcasing a badly-written math operation and a well-written one? I'd definitely like to learn more.

What frob is referring to is naïve 'C style' math vs SSE math.

If you search for matrix-matrix multiplication or matrix-vector multiplication and time them against their SSE equivalents, you'll see what he means. The SSE versions run roughly 3.5x faster.
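If you want to measure it yourself, here is a rough, hypothetical benchmark; it assumes the mul_naive and mul_sse sketches from frob's post above are pasted into the same file:

```cpp
#include <chrono>
#include <cstdio>

template <typename MulFn>
double time_muls(MulFn mul, Mat4 a, const Mat4& b, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        a = mul(a, b);                  // chain results so the work can't be skipped
    auto stop = std::chrono::steady_clock::now();
    volatile float keep = a.m[0][0];    // keep the final result live
    (void)keep;
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    Mat4 a = {}, b = {};
    for (int i = 0; i < 4; ++i)
        a.m[i][i] = b.m[i][i] = 1.0f;   // identity matrices: values stay finite
    const int n = 10000000;
    std::printf("naive: %.3f s\n", time_muls(mul_naive, a, b, n));
    std::printf("sse:   %.3f s\n", time_muls(mul_sse, a, b, n));
}
```

Ratios on the order of the 3.5x quoted are plausible, but the exact number depends on the compiler, optimization flags, and CPU.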

This topic is closed to new replies.
