
Why is math transformation taxing to most CPUs?

This topic is 1104 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I'm reading the 3-D Graphics section of the "Video" chapter in my IT book, and it states:

 

1) The computer must track all of the vertices of all of the objects in the 3-D world, including the ones you cannot currently see.

 

Question 1: When the book says "the computer", does it really mean the program itself, the CPU that processes the "addresses", or the RAM where the addresses are stored?

 

2) This calculation process, called transformation, is extremely taxing to most CPUs.

 

Question 2: Is it because the CPU cannot process the address of the data quickly per frame? Does it lead to a slow frame-rate in the game?

 

Could anybody who knows a lot about computer architecture and 3D game programming share their experience with me? =] I've only done 2D game programming, working with X and Y coordinates, and have never explored X, Y, and Z coordinates.


How old is that book?

 

In the '90s the first bottleneck was rasterizing a triangle. Once GPUs became better at it, the next most expensive operation was transform and lighting, which at that time was done on the CPU, with the results sent to the GPU every frame.

That's why HW TnL (Hardware Transform and Lighting) was invented: it kept the vertices on the GPU at all times, and the math was done entirely on the GPU. Later this would evolve into what we now know as vertex shaders.

 

I have a hunch that book could be really, really old.

This book is an information technology book, and it's 2 years old. Its focus is troubleshooting. The author decided to shed some light on 3D graphics just for fun.


2) This calculation process, called transformation, is extremely taxing to most CPUs.
 
Question 2: Is it because the CPU cannot process the address of the data quickly per frame? Does it lead to a slow frame-rate in the game?

This one is actually still true.

 

Matrix-matrix multiply and matrix-vector multiply are a big cost. A few really smart math geeks have greatly reduced the costs, and some really smart hardware geeks have moved a portion of the cost over to the graphics card rather than the CPU.  However, the operations are not free and they are the most common basic functions used in graphics and physics and other systems. 

 

Done poorly a game can still overload the CPU with badly-written math operations. It is a known concern. Faster processors and good libraries can help reduce and mitigate the concern, but it is still something you will see quite visibly on profiling numbers.

 

A simple naive matrix multiplication, a 4x4 multiplied by another 4x4, is rather costly.  Multiply each row by each column (using a dot product) to compute each one of the 16 necessary results.  That is 64 floating point multiplications and 48 floating point additions. While an individual matrix multiply isn't overly taxing, doing many of them quickly reaches an unacceptable cost.
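To make the arithmetic above concrete, here is a minimal sketch of the naive row-times-column multiply with an optional operation counter. The names `Mat4`, `OpCount`, and `multiplyNaive` are made up for illustration, and the row-major layout is an assumption rather than anything the post specifies.

```cpp
#include <array>
#include <cstddef>

// Hypothetical type for illustration: a row-major 4x4 matrix of floats.
using Mat4 = std::array<std::array<float, 4>, 4>;

// Tally of scalar floating-point operations performed by multiplyNaive.
struct OpCount {
    int multiplies = 0;
    int additions = 0;
};

// Naive product: each of the 16 results is the dot product of one row of `a`
// with one column of `b`. Pass a non-null `ops` to count the operations.
Mat4 multiplyNaive(const Mat4& a, const Mat4& b, OpCount* ops = nullptr) {
    Mat4 result{};
    for (std::size_t row = 0; row < 4; ++row) {
        for (std::size_t col = 0; col < 4; ++col) {
            float sum = a[row][0] * b[0][col];  // first term: 1 multiply
            if (ops) ops->multiplies += 1;
            for (std::size_t k = 1; k < 4; ++k) {
                sum += a[row][k] * b[k][col];   // each further term: 1 multiply + 1 add
                if (ops) {
                    ops->multiplies += 1;
                    ops->additions += 1;
                }
            }
            result[row][col] = sum;
        }
    }
    return result;
}
```

Each output element costs 4 multiplies and 3 additions, so the 16 elements account for the 64 multiplications and 48 additions quoted above.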

 

With a little bit of math magic and some SIMD instructions you can reduce it to the oft-cited code snippet of 16 multiplications, 12 additions, and 16 "shuffles", where each SIMD instruction works on four floats at once and the shuffles let you reuse values that are already loaded. It ends up about 5x to 6x faster, depending on the details of the naive implementation you compare against. I'm not sure where it came from or what the proper name for it is, but it has been floating around the web for about a decade now. There are many similar speedy specialized algorithms for various vector-matrix operations, for both column-based and row-based vectors.
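One common way to arrive at those counts with SSE is sketched below. This is not necessarily the exact snippet the post has in mind. It assumes an x86 target, a row-major layout, and 16-byte aligned storage, and the names `Mat4Simd` and `multiplySse` are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Assumed layout: row-major 4x4 float matrix, 16 contiguous, 16-byte aligned floats.
struct alignas(16) Mat4Simd {
    float m[16];
};

// out = a * b. For each row of `a`: 4 shuffles broadcast its elements,
// 4 multiplies scale the rows of `b`, and 3 adds sum the scaled rows.
// Over 4 rows that is 16 shuffles, 16 multiplies, and 12 additions,
// each one operating on four floats at a time.
inline void multiplySse(const Mat4Simd& a, const Mat4Simd& b, Mat4Simd& out) {
    const __m128 b0 = _mm_load_ps(b.m + 0);
    const __m128 b1 = _mm_load_ps(b.m + 4);
    const __m128 b2 = _mm_load_ps(b.m + 8);
    const __m128 b3 = _mm_load_ps(b.m + 12);

    for (int row = 0; row < 4; ++row) {
        const __m128 r = _mm_load_ps(a.m + 4 * row);
        // Broadcast each element of the current row of `a` across all lanes.
        const __m128 x = _mm_shuffle_ps(r, r, _MM_SHUFFLE(0, 0, 0, 0));
        const __m128 y = _mm_shuffle_ps(r, r, _MM_SHUFFLE(1, 1, 1, 1));
        const __m128 z = _mm_shuffle_ps(r, r, _MM_SHUFFLE(2, 2, 2, 2));
        const __m128 w = _mm_shuffle_ps(r, r, _MM_SHUFFLE(3, 3, 3, 3));
        // Row of the result = x*b0 + y*b1 + z*b2 + w*b3.
        const __m128 sum = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, b0), _mm_mul_ps(y, b1)),
            _mm_add_ps(_mm_mul_ps(z, b2), _mm_mul_ps(w, b3)));
        _mm_store_ps(out.m + 4 * row, sum);
    }
}
```

The total amount of arithmetic is the same as in the naive version. The gain comes from doing four lanes per instruction and from avoiding strided column reads, so the actual speedup depends on the compiler and the CPU.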

 

While that is a reduction in the number of steps, matrix multiply is still one of the more costly low-level operations you can do. Graphics operations rely heavily on it. Every time you move or position something in 3D space you need to run a series of matrix multiplies all the way through that portion of your scene. You've got the Model (or World), the View, and the Projection matrices, which ultimately need to be pushed out and applied to every vertex that gets rendered. You'll need to do quite a few of those matrix multiplies on the CPU, but fortunately you can pass the pre-multiplied values out to the GPU and let the card, with its specialized hardware, do the rendering and heavy lifting.
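The split described above can be sketched roughly as follows: combine Model, View, and Projection once per object on the CPU, then apply the single combined matrix to every vertex. The per-vertex loop here is only a CPU stand-in for what the vertex stage on the GPU would do. The column-vector convention (clip = P * V * M * v) and all the names are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical types: row-major 4x4 matrix and a homogeneous vertex.
using Mat4 = std::array<std::array<float, 4>, 4>;
using Vec4 = std::array<float, 4>;

// Plain 4x4 matrix product, a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Matrix times column vector, m * v.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t k = 0; k < 4; ++k)
            r[i] += m[i][k] * v[k];
    return r;
}

// Done once per object per frame on the CPU: two matrix-matrix multiplies.
Mat4 makeModelViewProjection(const Mat4& model, const Mat4& view, const Mat4& projection) {
    return multiply(projection, multiply(view, model));
}

// Done once per vertex: a single matrix-vector multiply with the combined matrix.
std::vector<Vec4> transformVertices(const Mat4& mvp, const std::vector<Vec4>& vertices) {
    std::vector<Vec4> result;
    result.reserve(vertices.size());
    for (const Vec4& v : vertices) result.push_back(transform(mvp, v));
    return result;
}
```

Pre-multiplying means each vertex pays for one matrix-vector multiply instead of three, which is why the combined matrix is typically what gets handed to the GPU.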

 

Physics relies heavily on it too: every time you move a physics object you lean on the same math. Much like the graphics APIs, there are physics libraries (e.g. PhysX) that take advantage of hardware to do the more costly parts. Most physics simulations work on bigger primitives rather than point clouds, so they often require fewer total matrix operations, but they can still take a hefty portion of the CPU budget.

I think in the context of point 1), point 2) reveals a certain lack of understanding of the role of GPUs on the part of the author.

Matrix transforms for tracking object positions on the CPU (for culling or physics, for example) would not run into the millions per frame unless you had a world with a very large number of objects in it.



Done poorly a game can still overload the CPU with badly-written math operations.

 

Badly-written math operations as in "poorly optimized math code"? Could you write an example showing a badly-written math operation next to a well-written one? I definitely want to learn more.


What frob is referring to is naïve "C style" math vs. SSE math.

 

If you search for matrix matrix multiplication, or matrix vector multiplication, and time them against matrix matrix SSE, or matrix vector SSE, you'll see what he means. The SSE runs roughly 3.5x faster.
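As a rough illustration of the two styles for the matrix-vector case, here is a sketch with a scalar "C style" version next to an SSE version. It assumes an x86 target and a column-major matrix (which is what makes the SSE form straightforward). `Vec4f`, `Mat4Cols`, `transformScalar`, and `transformSse` are hypothetical names, and no timing figures are implied by the code itself.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

struct alignas(16) Vec4f {
    float v[4];
};

// Assumed layout: column-major, c[i] holds column i of the matrix.
struct alignas(16) Mat4Cols {
    float c[4][4];
};

// "C style": one scalar multiply or add at a time.
inline Vec4f transformScalar(const Mat4Cols& m, const Vec4f& in) {
    Vec4f out;
    for (int row = 0; row < 4; ++row) {
        out.v[row] = m.c[0][row] * in.v[0] + m.c[1][row] * in.v[1] +
                     m.c[2][row] * in.v[2] + m.c[3][row] * in.v[3];
    }
    return out;
}

// SSE: scale each column by one component of the vector and sum the columns,
// producing all four output components at once.
inline Vec4f transformSse(const Mat4Cols& m, const Vec4f& in) {
    __m128 sum = _mm_mul_ps(_mm_load_ps(m.c[0]), _mm_set1_ps(in.v[0]));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_load_ps(m.c[1]), _mm_set1_ps(in.v[1])));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_load_ps(m.c[2]), _mm_set1_ps(in.v[2])));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_load_ps(m.c[3]), _mm_set1_ps(in.v[3])));
    Vec4f out;
    _mm_store_ps(out.v, sum);
    return out;
}
```

If you time the two yourself, keep in mind that an optimizing compiler may vectorize the scalar loop on its own, so the gap you measure depends heavily on compiler flags and on how the calls are batched.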


 


Done poorly a game can still overload the CPU with badly-written math operations.

 

Badly-written math operations as in "poorly optimized math code"? Could you write an example showing a badly-written math operation next to a well-written one? I definitely want to learn more.

 

I think what he is getting at is that if you don't take advantage of parallel operations, and if you take the naïve approach (i.e. math ops without any refactoring), then you get terrible performance.

 

However, I would like to take a different stance than the others on this topic. I think the terms used by the author are probably appropriate: "taxing to a CPU" can mean a lot of different things, and, like everything in computer graphics, it depends on the scene you are processing. If you have a simple scene, a CPU rasterizer can easily keep a high frame rate on modern CPUs. If you want to push the limits of current technology, then CPUs are not the processor of choice for graphics, and you would obviously go for GPUs.

 

So where the author goes wrong is the blanket statement (that the CPU is always heavily taxed by transformations), because it depends on the scene being rendered and the ops being executed. If you take a look at the latest WARP devices in D3D11, you can find some really screaming software-based rasterizers that work just fine for many situations.


[...] and threads (2-8 as many operations per clock, if being unrealistically ideal) [...]

 

 

I nearly blew a fuse reading that. Please tell me that's a typo, and not how you think threading improves performance.


 

[...] and threads (2-8 as many operations per clock, if being unrealistically ideal) [...]

 

 

I nearly blew a fuse reading that. Please tell me that's a typo, and not how you think threading improves performance.

 

In general no, but if all we're talking about is vertex transforms or another "embarrassingly parallel" problem, then yes. You could very well write a software T&L engine, simply replicate it across any and all cores not already consumed with other duties, and achieve essentially linear speedup on vertex transformations, limited only by available memory bandwidth. The same properties that make this problem suitable for the massive parallelism of GPUs make this equally possible on CPUs. This is more or less what GPUs do, except they're massively scaled up (and of course they have other optimizations appropriate for their problem domain).
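A minimal sketch of that idea, assuming plain `std::thread` workers and a made-up `transformParallel` helper: the vertex array is cut into contiguous chunks and each worker transforms its own chunk. No locking is needed because no two workers touch the same element. A real engine would more likely reuse a thread pool than spawn threads on every call.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical types: row-major 4x4 matrix and a homogeneous vertex.
using Mat4 = std::array<std::array<float, 4>, 4>;
using Vec4 = std::array<float, 4>;

// Matrix times column vector for a single vertex.
Vec4 transformOne(const Mat4& m, const Vec4& v) {
    Vec4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t k = 0; k < 4; ++k)
            r[i] += m[i][k] * v[k];
    return r;
}

// Splits `in` into up to `workerCount` contiguous chunks and transforms each
// chunk on its own thread. Every output element is written by exactly one thread.
void transformParallel(const Mat4& m, const std::vector<Vec4>& in,
                       std::vector<Vec4>& out, unsigned workerCount) {
    out.resize(in.size());
    workerCount = std::max(1u, workerCount);
    const std::size_t chunk = (in.size() + workerCount - 1) / workerCount;

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(in.size(), w * chunk);
        const std::size_t end = std::min(in.size(), begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        workers.emplace_back([&m, &in, &out, begin, end] {
            for (std::size_t i = begin; i < end; ++i) out[i] = transformOne(m, in[i]);
        });
    }
    for (std::thread& t : workers) t.join();
}
```

A reasonable starting value for `workerCount` is `std::thread::hardware_concurrency()`, which may return 0 when the count is unknown (the sketch treats 0 as 1). As the post notes, memory bandwidth rather than core count tends to become the limit for large vertex sets.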


[...] and threads (2-8 as many operations per clock, if being unrealistically ideal) [...]

 
I nearly blew a fuse reading that. Please tell me that's a typo, and not how you think threading improves performance.

"if being unrealistically ideal".

His example was describing the upper bound, and it is correct as such. Whether that bound is reachable in practice is irrelevant in that context.

edit: or did you get the impression he is not talking about hardware threads (cpu cores and HT if available ... typically 2-8)? Edited by tanzanite7

