Jump to content

  • Log In with Google      Sign In   
  • Create Account

Not dead...

Letters From The Readers

Posted by , 18 October 2009 - - - - - - · 157 views

Its time for one of those entries where I quickly answer a post or two from the comments [smile]

Original post by Jason Z
Do we get to see some screen shots of the system in action?

Right now it exists in a stripped down purely mathematical test environment.

Drawing is on my 'todo' list however before that I want to finish up and refactor the test cases a bit to centralise a few things, check over the maths and add a few 'effectors' for the systems to get an idea of the 'true' performance. I also need to test emitter spawning speed a bit.

At that point I'll probably throw together a simple test render for screen shots/video reasons; maybe multithreaded, maybe not.. its tempted to do the former simply because D3D11 will make it a bit easier.

I'm currently looking into the Thread Building Blocks and some details on existing game system design which uses them to try and come up with a truely scaleable game system based on it. While 'parrellel_for' et al are fine for this test the real game is going to have to manage the task system itself.

Yeah, alot of thought for something which will be a 2D game rendered primarily with pixels [grin]

Original post by Black Knight
About tessellation how do you think it will effect collision detection?Can the detailed tesselletad mesh be used for collision detection? Or they will use some kind of simpler mesh.

This is an intresting one; I suspect you could abuse 'stream out' to get the post-tesselation data back from the card in real time however I suspect in many cases tesselation will be used to add detail but collision will still occur with a lower resolution set of mesh data.

Particles 2

Posted by , 13 October 2009 - - - - - - · 234 views

One unplanned day off due to illness later and, when I felt up to it later in the afternoon, I reattacked my SSE based code.

I decided that, instead of trying to do sillyness with ensuring that the blocks are always multiples of 8, I would just work out how many mulitples of 8 I can process and then deal with the remaining 4 particles after that.
Initally I only did this to one pass, just to check it would work, before moving onto a two pass method, however upon changing the 2nd pass to work the same way I noticed a (very small I admit) speed drop. Some pondering a quick Bing later I realised that in 32bit mode we only have 8 SSE registers and the 2nd pass, doing 2 groups of 4 at a time, would require 9 registers active at one point (5x src, 4x dest) so I'm currently attributing that slow down to register pressure problems.

I also decided that a single pass SSE method was needed, just to try and hit all the bases.

Finally, I adjusted the app to take total frame readings and pull out min, max and average times.

The times, for the biggest test for the SSE routines came out as follows;

Two pass SSE parallel (100/10000) 3.08299 4.6787 3.47186
Single pass SSE parallel (100/10000) 3.12475 4.60054 3.47186
Two pass Adaptive SSE parallel (100/10000) 3.06958 4.64958 3.46114
Two pass Full Adaptive SSE parallel (100/10000) 3.09947 4.9921 3.46305

(times in ms, numbers in name are (emitters/particles per emitter))

On average the Two pass adaptive is a bit faster, however at this point the difference is marginal at best.

I also have some numbers from an Intel Atom 330 @ 1.6Ghz with 4 threads, as there are a few numbers I'll just give the average final test results in each case;

Simple : 113.297ms
Emitter & Particle full parallel : 67.7486
Emitter parallel : 66.8808
Two pass parallel : 67.9637
Two pass SSE : 28.2201
Single pass SSE : 28.246
Two pass adaptive : 28.2491
Two pass full adaptive : 28.2966

So, top tip from that is that if you are going to do a fair amount of maths on an Atom make sure you use SSE and threads.
That said, going to 4 threads only gained around a 50% speed up, and going to 4 particles at a time only gained around a 50% speed up again.

Other tests to add before I release unto the masses to get more data;
- thread grain size
- limiting the number of threads active on the test system

Even with all this, it'll only act as a 'best case' guide for a real game system as that will have other things going on (rendering, audio, maybe network processing) all of which will cut into the threading time as well (in my plans at least). This is really more a feature fesibility study to try and get some idea of processing power.

- i7 920 has lots
- Atom 330 has little



Posted by , 11 October 2009 - - - - - - · 262 views

Around the middle of the week I decided to get my arse in gear and get working on some code related to a game idea I had.

I also decided to grab a trial copy of Intel's Parallel Studio to have a play with it, see what its auto vectorisation is like and also try things like the Thread Building Blocks.

One of the things I need in this game are particles.
Lots of them.

Now, the obvious way to deal with it is the give it to my shiney HD5870 and let that sim things, however I've never been one to take the obvious route and I've got a core i7 sitting here which hardly gets a work out; time for particles on the CPU.

Having installed the Intel stuff I had a look over the TBB code and decided that the task based system would make a useful platform to build a multi-threaded game on; it's basically a C++ way of playing with a threadpool, the thread pool being hidden behind the scenes and shared out as required to tasks. You don't even have to be explicate with tasks they provide parallel_for loops which can subdivide work and share it out between threads, including task stealing. All in all pretty sweet.

After putting it off for a number of days I decided to finally knuckle down and get on with some performance testing code. I've decided to start simply for now, get a vague framework up and running, however it'll require some work to make it useful with the final goal being to throw it at some people and see what performance numbers I get back for emitters and particle count.

I've so far tested 5 different methods of updating the particle system;

The first is a simple serial processing on a single thread; so each emitter is processed in turn and each particle is processed one step at a time. Best I can hope for is some single floating point value SSE processing to help out. The particles themselves are structs stored in an std::vector (floats and arrays of floats in the struct).

The second is the first parallel system I went for; emitters and particles in parallel. This uses the parallel_for from the TBB to split the emitters over threads and then to break each emitter's particles up between the threads. Same data storage as the first test.

The third was me wondering if splitting the particles over threads was a good idea so I removed the parallel_for from the particle code and left it in the emitter processing. Same storage setup as before.

Having had a look at my basic processing of the particles and a bit of a read of the Intel TBB docs I realised that I had a bit of a dependancy going on;

momentum += velocity
position += momentum * time;

So, I decided to split it into two passes; the first works out the momentum and the second the position. Storage stayed the same as before.

At this point I had some timing data;
With 100 emitters, each with 10,000 particles and the particle system assuming an update rate of 16ms a frame and a particle lifetime of 5 seconds:

- serial processing took 5.06 seconds to complete, with the final frame taking 23ms to process
- Emitter and particle parallel took 2.44 seconds to complete, with the final frame time taking 8.6ms to complete.
- Emitter only parallel took 2.4 seconds to complete, with the final frame taking 8.53ms
- Two pass parallel processing took 2.44 seconds to complete, with the final frame taking 8ms

Clearly, single threaded processing wasn't going to win and speed prizes, however splitting the work over multiple threads had the desired effect, with the processing now taking ~8ms, or half a frame.

The final step; SSE.

I decided to build on the two pass parallel processing as it seemed a natural way to go.

The major change to SSE was dropping the std::vector of structures. There were two reasons for this;
- memory alignment is a big one; SSE likes things to be 16byte aligned
- SIMD. Structures of data don't lend themselves well to this as at any given point you are only looking at once chunk of data.

SIMD was a major one, as the various formulas being applied to the components could be applied to x and y seperately, so if we closely pack the data this means we can hold say 4 X positions and work on them at once.

So, the new storage was a bunch of aligned chunks of memory aligned to 16bytes. The particle counts were forced to be multiple of 4 so that we can sanely work with 4 components at a time.

The dispatch is the same as the previous versions; we split threads off into emitters and then into groups of particles and these particles are processed.

Final result;
- Two pass SSE parallel processing took 1.09 seconds, with the final frame taking 4.9ms.

Thats a nicer number indeed, as it still leaves me 11ms to play with to hit 60fps.

There is still some tuning to do (min. particles per thread for example, currently 128) and it might even be better to do my own task dispatch rather than leave it to the parallel_for loop to setup. I've also not done any profiling with regards to stalling etc or how particles and emitter count effect the various methods of processing. Oh, and my own timing system needs a bit of work as well; taking all frame times and processing them for min/max/average might be a good idea.

However, for now at least I'm happy with todays work.

ToDo: Improve test bed.

A Note on Fermi

Posted by , 01 October 2009 - - - - - - · 300 views

Fermi, for those who haven't noticed, is the code name being given for NV's new GPU and recently some information has come out about it from NV themselves. This is good because, up until this point, there has been a definite lack of information it that regard; something I'd noted myself previously.

So, what is Fermi?

Well, Fermi is a 3 Billion Transistor GPU which is mainly aimed at the Tesla/GPGPU market, however this doesn't preclude it running games of course but it does however hint at the focus the company have taken which is to try and grow the GPGPU area of the market.

Processing power wise it has 512cores, which is twice the GT200, but bandwidth wise its a bit unclear, the bus size having shrunk from 512bit to 384bit but it is on a GDDR5 interface which AMD have had good results from bandwidth wise (the HD5870 is on a 256bit bus after all).

The GPU itself is a 40nm gpu, like the RV870 of the HD5xxx series, however its 40% bigger which could have an impact on yields and thus directly on the cost of the final cards.

The card will of course support DX11 and NV of course believe that it will be faster than the HD5870, however there isn't much else with regards to games being talked about right now.

As I mentioned up top the main focus seems to be GPGPU, this is evident in a few of the choice made in the design and is part of the reason the chip is 'late'.

The maths support and 'power' points very firmly to this; The chip does floating point ops to the IEEE-754 spec and has full support for 32bit integers (apprently before 32bit integer multiples were emulated as the hardware could only do 24bit), and it also suffers a MUCH lower cost for double (64bit FP) maths; where as tehe GT200 processed 64bit at 1/8th the speed and the RV770 processes at 1/5 speed of normal 32bit maths the Fermi GPU only takes a 50% hit to its speed.

The question in my mind is how they do that, I have seen some people commenting that they might well leave silicon idle during 32bit maths/game maths but this would strike me as a waste; I suspect instead some clever fusing of cores is occuring the pipeline somehow to get this performance. (Given that each of these

The Fermi core also introduces a proper cache setup, something which wasn't in the previous core which had a per 'shader block' (a 'shader block' being a section of 'cores', control circuits etc and are made up of 32 cores) section of user controlled memory and a L1 texture cache which was mostly useless when you are in 'compute mode'. In Fermi there is a 64KB block of memory for each shader block, this can be split 16k/48k or 48k/16k with one section being shared memory and the other being L1 cache (The min shared memory size is to ensure that older CUDA apps which required the max 16k still worked). The is also a block of shared L2 cache of 768KB which helps with atomic memory operations making them between 5 and 20 times faster than before.

The other area which has been tweaked, and does have an impact on gaming performance as well, is the number of threads in flight; the GT200 could have 30720 threads active at any given time while Fermi has reduced this to 24576. This is because it was noted the performance was mostly tied to the amount of shared memory not the number of threads; so shared memory went up while thread count went down. Part of me can't help wonder if they might have been better off keeping the thread count up, although I suspect they have run the numbers and the extra memory space might well have offset the extra transistors required to control the extra threads on the die.

There have been various improvements in the shader instruction dispatch as well, mostly to remove deadtime on the shader blocks which was a perticular "problem" with the previous generation when it came to certain operations. This was mostly down to the 'special function unit' which, when used, would stall the rest of a threads in the thread group until they were completed. This has been decoupled now which solves the problem (the SFU is seperate from the main 'core' sections of the shader blocks).

Per clock, each of these shader blocks can dispatch;
32 floating point operations
16 64bit floating point operations
32 int operations
4 'special function unit' (SFU) operations (sin, cos, sqrt)
16 load/store operations

It is however important to note that if you are performing 64bit operations then the whole SM is reduced to 16 ops per clock and that you can't dual issue 64bit FP and SFU ops at the same time. This gives some weight to my idea of fusing cores together to perform the operations, thus not wasting any silicon, and probably involving the SFU in some way thus the lack of dual issue.

Given the way the core blocks seem to be tied I would suspect this is a case of being able to send a FP and int instruction at the same time, as well as a SFU and Load/Store op at the same time, the latter two being seperation from the 'cores' and part of the shader block.

Another big GPGPU improvement is that of 'parallel kernel support'.

In the GT200/G80 GPUs the whole chip had to work on the same kernel at once; for graphics this is no problem as often there is alot of screen to render, however with GPGPU you might not have that kind of data set so parts of the core are idle which is a waste of power.

Fermi allows multiple kernels to execute at once, rather than startng one, waiting for it to complete, then starting another and so on. This was an important move as, with twice as many cores, the chances of idle cores in the original setup was increased under Fermi, now the GPU can scheduel as it needs to to keep the core fed and doing the most amount of work. It'll be intresting to see how this feeds back into graphic performance, if it does at all.

Moving between GPU and CUDA mode has also been improved, NV are claiming a 10x speed up meaning you can switch a number of times a frame, which of course comes with a note about hardware physics support. There is also parallel GPU<->CPU data transfer meaning you can send and recieve data at the same time instead of one than the other.

The HD5870 can detect memory errors on the main bus, however all it can do is resend and until they succeed they can't correct. NV have gone one step futher and added ECC support to the register file, L1 and L2 cache and the DRAM (its noted that this is Tesla specific so I don't expect to see if in a commerical GPU); the main reason for this is that many people won't even talk about Tesla without these features as they are, percieved at least, to be a requirement for them.

One area where NV have typically stood head and shoulders about ATI (and now AMD) has been tools. PerfHUD is often stated as a 'must have' app for debugging things in the PC space. With Fermi they have yet another thing in store; Nexus.

Put simply Nexus is a plugin for Visual Studio which allows you to debug CUDA code as you would C++ or C# code. You can look at the GPU state and step into functions, giving what to me seems like a huge boost in work flow and debugging abiliy. As much as I like the idea of OpenCL, without something like this I can see alot of GPGPU work heading NV's way (provided they have a market, see later).

They have also changed the ISA for the chip, part of which involved unified memory addressing which apprently required to enable C++ on the GPU. Thats right, with Fermi you have virtual functions, new/delete and try/catch on the GPU. Couple this with the improved debugger and intresting times ahead; although it will be intresting to see how C++ makes a transition to the GPU, even in a sub-set type setup they had with Cg and C for CUDA before hand.

I mentioned the GPGPU market earlier and this is an intresting part of the story.

It can't be denied that with this GPU NV are firmly taking aim at this market, yet last quater the Tesla business has only made $10M, which to a company which grossed $1B at its peak isn't a whole lot. However when asked about this the "blame" is placed on a lack of focused marketing rather than a lack of business as such; maybe with the improvements above (ECC being a key) and better marketing NV can grow this sector. Given what they have done with this GPU and the added focus on Tegra, its mobile ARM based System on Chip, these do appear to be the areas NV are betting on growth to expand on.

However, if Fermi doesn't improve things (and NV are looking at an exponential growth in sales) then even with Tegra they are going to be in trouble.

So, those are the facts, and a few guesses by myself about how things are done, but what is my opinion?

Well, the chip itself is intresting and from a GPGPU point of view a great step forward, however I have this feeling NV might have shot themselves in the foot a bit while taking this step forward.

You see alot of these GPGPU features which have been added have more mostly delayed the chip, it is apprently only recently they have got working chips back and that means we are still maybe 2 months away from a possible launch and even then there might be a lack of volume until Q1 of next year (with murmers of Q2).

This means NV might well miss the Thanks giving, XMas and New Year sales, areas I suspect are key for people buying new machines and with Windows 7, argueably the most populare windows release EVER, around the corner I suspect a number of people are looking at the market right now and thinking 'what should I upgrade to?' and in answer to that they are going to find HD5870 and, more importantly, HD5850 cards waiting to be purchased.

It seems that NV have decided to take the pain and, to a lesser or greater degree, are betting on the Tesla GPGPU space to make the pain they are going to take from being late to market against AMD worth while. It remains to be seen if this is the case, however I'm sure AMD will be happy with a free pass for the next few months.

Finally, by the time they do get to market we could be only 2months away from an AMD refresh, so if you hold out for an NV card (and we have no real numbers here) then you start thinking 'maybe I should hold out for the AMD refresh...' and that way madness lies.

From a consumer point of view however, this delay isn't that great as it allows AMD to sell their cards at a higher price point, which will good for the company isn't so great for everyone's pocket.

I think that if NV want to dent AMD's success here then they need to start releasing numbers and possible prices soon, because right now the only game in town is AMD, but if you can get some numbers out and make your GPU start to look worth while then people might hold off.

Right now however, with no numbers, no price and no release date, AMD are a no brainer of a buy.

HD5870 - Impressions.

Posted by , 26 September 2009 - - - - - - · 254 views

What can I say? The card is basically everything I was expecting - a fast, quiet, low power using DX11 card.

The card installed with no problem, the drivers (currently special drivers as the current general Cat. release doesn't support the card right now) installed fine and everything is as fast and responsive as you'd expect.

Gaming wise, well the review sites pretty much cover it.
My currently most played game is TF2 so, after a windows reinstall due to a slight SSD firmware issue, I ran the game up, turned everything to max with a screen size of 1920*1200 and was pleased to see a solid 60fps regardless of what was going on (with Vsync off it maxes out around 120fps, with a low point of around 90fps).

I managed to get onto Aion for a while, which is a CryEngine powered game although not quite as hard on the hardware as Crysis, maxed that out and also got between 120 and 60fps depending on what was being displayed.

Based on the hardware reviews I think my setup, with 2 1920*1200 screens, is probably around the sweet spot. Above 1920*1200 the GPU seems to suffer due to its 256bit bus, however given there aren't many people gaming above that level AMD probably took the right path.

Software wise, I've just noticed AMD's Hydravision which gives you;
- the ability to send windows to different monitors
- the ability to fullscreen a window across multple monitors
- multiple desktops (like X11 and OSX)
- a grid setup which allows you to edit a grid on your monitors, attach a window to this grid and have them snap into predefined locations at predefined sizes.

So far I've enabled the first 2 and the last option, at some point I might try the multiple desktops just to see if it really is the big deal some linux and OSX users make it out to be.

Of course, none of the above mentioned D3D11 which is the big thing from this card.

I think, of all the new features, the most striking is the new tesselation stage.
This is highlighted nicely by the sample in the DX11 SDK.

First up we have some stones, here is the bump mapped version;

This can be improved with some Parallax Occusion Mapping

Which is better and what we are used to seeing these days, however they still still don't have a definate silhouette to the image which can be seen around the edges.

So, lets introduce the tesselation;

And suddenly the stones look that much more real.
A few tweaks to the parameters reduces the processing costs at the cost of very little performance.

The finally setting looks better than the Parallax Occlsion Mapping at a lower cost.

Finally some things you can't do with either of the other two methods;

(Note the difference around the bottom of the objects between the 2nd and 3rd images here where the parallax mapping gets it slightly wrong).

This, for me, is the most striking example of what D3D11 can do for games and the visual quality going forward.

My only disappointment was the order independant transparancy example which, for some reason, only gets single digit fps right now. I plan to look into this to see why it is the case as it could be a driver issue or even a case of accidently ending up on a software path.

So, would I recommend the HD5870 to people?
Hell yes!

The only reasons you might not want one are;
- you have a dual GPU card from the last generation, in which case unless you really want D3D11 support you card will be slightly faster
- you are waiting and hoping that NV can come up with something better at a decent price
- you need NV's OpenGL and/or PerfHUD debugging tools.

AMD do have a shader/graphics debugging tool however I'm currently not sure on its DX11 state and I haven't had a chance to try it. However it's not a PerfHUD type tool, it looks more like a Pix type setup, although one of the more intresting things about it is that client-server setup which means I could do my devwork on my desktop but capture and debug on my laptop.

With my plan to do some DX11 stuff in the near future (well, not just DX11, I want to do something to really make my i7 and HD5 cry) I might well be looking into this tool and when I do I'll give you some details about it.

Recent Entries

Recent Comments