A Note on Fermi

posted in Not dead...
Published October 01, 2009

Fermi, for those who haven't noticed, is the code name for NV's new GPU, and recently some information about it has come out from NV themselves. This is good because, up until this point, there had been a definite lack of information in that regard; something I'd noted myself previously.

So, what is Fermi?

Well, Fermi is a 3 billion transistor GPU which is mainly aimed at the Tesla/GPGPU market. This doesn't preclude it running games, of course, but it does hint at the focus the company have taken, which is to try and grow the GPGPU side of the market.

Processing power wise it has 512 cores, which is twice the GT200. Bandwidth wise it's a bit less clear: the bus has shrunk from 512bit to 384bit, but it is on a GDDR5 interface, which AMD have had good bandwidth results from (the HD5870 is on a 256bit bus, after all).

The GPU itself is a 40nm part, like the RV870 of the HD5xxx series, however it's 40% bigger, which could have an impact on yields and thus directly on the cost of the final cards.

The card will of course support DX11, and NV of course believe it will be faster than the HD5870, however there isn't much else being said about games right now.

As I mentioned up top, the main focus seems to be GPGPU; this is evident in a few of the choices made in the design and is part of the reason the chip is 'late'.

The maths support and 'power' point very firmly to this. The chip does floating point ops to the IEEE-754 spec and has full support for 32bit integers (apparently 32bit integer multiplies were previously emulated, as the hardware could only do 24bit), and it also suffers a MUCH lower cost for double (64bit FP) maths; whereas the GT200 processed 64bit at 1/8th the speed, and the RV770 at 1/5th the speed, of normal 32bit maths, the Fermi GPU only takes a 50% hit to its speed.
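To put that in context: a trivial CUDA kernel like the one below (my own sketch, not from the article) would have run at 1/8th rate on the GT200 purely because it works in doubles, but should only take the 50% hit on Fermi:

__global__ void daxpy(int n, double a, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i]; // one 64bit multiply and add per element
}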

The question in my mind is how they do that. I have seen some people commenting that they might well leave silicon idle during 32bit maths/game maths, but this would strike me as a waste; I suspect instead some clever fusing of cores is occurring in the pipeline somehow to get this performance.

The Fermi core also introduces a proper cache setup, something which wasn't in the previous core. That had, per 'shader block' (a 'shader block' being a section of 'cores', control circuits etc., made up of 32 cores), a section of user controlled memory and an L1 texture cache which was mostly useless when you are in 'compute mode'. In Fermi there is a 64KB block of memory for each shader block; this can be split 16k/48k or 48k/16k, with one section being shared memory and the other being L1 cache (the minimum shared memory size is to ensure that older CUDA apps which required the maximum 16k still work). There is also a shared L2 cache of 768KB which helps with atomic memory operations, making them between 5 and 20 times faster than before.
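From what I understand the CUDA runtime exposes this split on a per kernel basis; a minimal sketch, assuming a kernel called myKernel (the name is mine):

__global__ void myKernel(float* data)
{
    // ... kernel body ...
}

// Ask for the 48k shared / 16k L1 split for this kernel...
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

// ...or the other way around, if the kernel leans on the cache instead.
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);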

The other area which has been tweaked, and which does have an impact on gaming performance as well, is the number of threads in flight; the GT200 could have 30720 threads active at any given time, while Fermi has reduced this to 24576. This is because it was noted that performance was mostly tied to the amount of shared memory, not the number of threads; so shared memory went up while thread count went down. Part of me can't help wondering if they might have been better off keeping the thread count up, although I suspect they have run the numbers and the extra memory space might well have offset the extra transistors required to control the extra threads on the die.
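For what it's worth, the figures line up with the block counts if you assume (my arithmetic, not from the article) 30 shader blocks at 1024 threads each on the GT200 and 16 shader blocks at 1536 threads each on Fermi:

30 x 1024 = 30720 threads (GT200)
16 x 1536 = 24576 threads (Fermi)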

There have been various improvements in shader instruction dispatch as well, mostly to remove dead time on the shader blocks, which was a particular "problem" with the previous generation when it came to certain operations. This was mostly down to the 'special function unit' which, when used, would stall the rest of the threads in the thread group until it completed. This has now been decoupled, which solves the problem (the SFU is separate from the main 'core' sections of the shader blocks).

Per clock, each of these shader blocks can dispatch:
32 floating point operations
16 64bit floating point operations
32 integer operations
4 'special function unit' (SFU) operations (sin, cos, sqrt)
16 load/store operations

It is however important to note that if you are performing 64bit operations then the whole SM (shader block) is reduced to 16 ops per clock, and that you can't dual issue 64bit FP and SFU ops at the same time. This gives some weight to my idea of cores being fused together to perform the operation, thus not wasting any silicon, and probably involving the SFU in some way, hence the lack of dual issue.

Given the way the core blocks seem to be tied, I would suspect this is a case of being able to send an FP and an int instruction at the same time, as well as an SFU and a load/store op at the same time, the latter two being separate from the 'cores' and part of the shader block.
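On the SFU front, CUDA already exposes intrinsics which hit that unit rather than the normal cores, which gives an idea of where those 4 ops per clock come from; a quick sketch (kernel name is mine):

__global__ void fastSin(int n, const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]); // SFU approximation; faster but less precise than sinf()
}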

Another big GPGPU improvement is that of 'parallel kernel support'.

In the GT200/G80 GPUs the whole chip had to work on the same kernel at once. For graphics this is no problem, as often there is a lot of screen to render; however, with GPGPU you might not have that kind of data set, so parts of the core sit idle, which is a waste of power.

Fermi allows multiple kernels to execute at once, rather than starting one, waiting for it to complete, then starting another and so on. This was an important move as, with twice as many cores, the chance of idle cores under the original setup would only have increased on Fermi; now the GPU can schedule as it needs to in order to keep the core fed and doing the most work. It'll be interesting to see how this feeds back into graphics performance, if it does at all.
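In CUDA terms I'd expect this to surface via streams: launch two independent kernels into separate streams and, on Fermi, the hardware is free to run them side by side. A sketch, with made up kernel names:

__global__ void kernelA(float* data) { /* some work */ }
__global__ void kernelB(float* data) { /* some other, independent, work */ }

void launchBoth(float* dataA, float* dataB)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // On GT200 these run back to back; on Fermi they can overlap.
    kernelA<<<64, 256, 0, s0>>>(dataA);
    kernelB<<<64, 256, 0, s1>>>(dataB);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}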

Moving between graphics and CUDA modes has also been improved; NV are claiming a 10x speed up, meaning you can switch a number of times per frame, which of course comes with a note about hardware physics support. There is also parallel GPU<->CPU data transfer, meaning you can send and receive data at the same time instead of one then the other.
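That transfer overlap should look something like this in CUDA, using pinned host memory and two streams (buffer names and sizes are illustrative):

const int N = 1 << 20;
float *h_in, *h_out, *d_in, *d_out;

// Pinned host memory is needed for async copies to actually overlap.
cudaHostAlloc((void**)&h_in,  N * sizeof(float), cudaHostAllocDefault);
cudaHostAlloc((void**)&h_out, N * sizeof(float), cudaHostAllocDefault);
cudaMalloc((void**)&d_in,  N * sizeof(float));
cudaMalloc((void**)&d_out, N * sizeof(float));

cudaStream_t up, down;
cudaStreamCreate(&up);
cudaStreamCreate(&down);

// In separate streams the upload and the download can run at the same time.
cudaMemcpyAsync(d_in,  h_in,  N * sizeof(float), cudaMemcpyHostToDevice, up);
cudaMemcpyAsync(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost, down);

cudaDeviceSynchronize();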

The HD5870 can detect memory errors on the main bus, however all it can do is resend; it can't correct the error. NV have gone one step further and added ECC support to the register file, L1 and L2 caches and the DRAM (it's noted that this is Tesla specific, so I don't expect to see it in a consumer GPU); the main reason for this is that many people won't even talk about Tesla without these features, as they are perceived, at least, to be a requirement for them.

One area where NV have typically stood head and shoulders above ATI (and now AMD) has been tools. PerfHUD is often cited as a 'must have' app for debugging things in the PC space. With Fermi they have yet another thing in store: Nexus.

Put simply, Nexus is a plugin for Visual Studio which allows you to debug CUDA code as you would C++ or C# code. You can look at the GPU state and step into functions, which to me seems like a huge boost to workflow and debugging ability. As much as I like the idea of OpenCL, without something like this I can see a lot of GPGPU work heading NV's way (provided they have a market, see later).

They have also changed the ISA for the chip, part of which involved unified memory addressing, which apparently was required to enable C++ on the GPU. That's right: with Fermi you have virtual functions, new/delete and try/catch on the GPU. Couple this with the improved debugger and there are interesting times ahead; although it will be interesting to see how C++ makes the transition to the GPU, even in a sub-set type setup like they had with Cg and C for CUDA beforehand.
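A sketch of what that could allow, assuming it surfaces in CUDA C++ much as it does on the CPU (the classes are mine; I've left try/catch out as it's unclear how far that support goes):

struct Shape
{
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Circle : public Shape
{
    float r;
    __device__ Circle(float radius) : r(radius) {}
    __device__ virtual float area() const { return 3.14159f * r * r; }
};

__global__ void areas(float* out)
{
    // Device side new/delete plus a virtual call; Fermi class hardware only.
    Shape* s = new Circle(2.0f);
    out[threadIdx.x] = s->area();
    delete s;
}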

I mentioned the GPGPU market earlier, and this is an interesting part of the story.

It can't be denied that with this GPU NV are firmly taking aim at this market, yet last quarter the Tesla business only made $10M, which for a company which grossed $1B at its peak isn't a whole lot. However, when asked about this, the "blame" is placed on a lack of focused marketing rather than a lack of business as such; maybe with the improvements above (ECC being a key one) and better marketing NV can grow this sector. Given what they have done with this GPU, and the added focus on Tegra, their mobile ARM based System on Chip, these do appear to be the areas NV are betting on for growth.

However, if Fermi doesn't improve things (and NV are banking on exponential sales growth) then even with Tegra they are going to be in trouble.

So, those are the facts, along with a few guesses of my own about how things are done, but what is my opinion?

Well, the chip itself is interesting and, from a GPGPU point of view, a great step forward; however I have this feeling NV might have shot themselves in the foot a bit while taking that step.

You see, a lot of these added GPGPU features have mostly delayed the chip; it is apparently only recently that they have got working chips back, and that means we are still maybe 2 months away from a possible launch, and even then there might be a lack of volume until Q1 of next year (with murmurs of Q2).

This means NV might well miss the Thanksgiving, Xmas and New Year sales, periods I suspect are key for people buying new machines. And with Windows 7, arguably the most popular Windows release EVER, around the corner, I suspect a number of people are looking at the market right now and thinking 'what should I upgrade to?', and in answer to that they are going to find HD5870 and, more importantly, HD5850 cards waiting to be purchased.

It seems that NV have decided to take the pain and, to a lesser or greater degree, are betting on the Tesla GPGPU space to make the pain of being late to market against AMD worthwhile. It remains to be seen if that bet pays off; however, I'm sure AMD will be happy with a free pass for the next few months.

Finally, by the time they do get to market we could be only 2 months away from an AMD refresh, so if you hold out for an NV card (and we have no real numbers here) then you start thinking 'maybe I should hold out for the AMD refresh...', and that way madness lies.

From a consumer point of view, this delay isn't that great, as it allows AMD to sell their cards at a higher price point; good for the company, not so great for everyone's pocket.

I think that if NV want to dent AMD's success here then they need to start releasing numbers and possible prices soon, because right now the only game in town is AMD; but if they can get some numbers out and make their GPU start to look worthwhile, then people might hold off.

Right now however, with no numbers, no price and no release date, AMD are a no brainer of a buy.

Comments

Jason Z
Thanks for the great overview. I always find it interesting to read the details of the hardware implementation of the GPUs.

Where did you get all of the information from?
October 03, 2009 09:19 AM
_the_phantom_
Pretty much cribbed and condensed from the AnandTech article.

I probably should have sourced that in the main text tbh.
October 03, 2009 12:59 PM