Jason Z

ATI x1k Cards and threading on GPU???


I just got done reading an article over at hardocp.com on the new ATI architecture. They didn't have all the details, but it had quite a lengthy section on how ATI has implemented a threading system in the X1K series of cards. Does anyone know if this is real, or is it just some form of marketing hype? If it is real, then it would seem that raycasting, dynamic ambient occlusion, and a whole bunch of GPGPU applications could benefit greatly (could GI be right around the corner???).

I found it ironic that the X1K series has a higher transistor count than the NV 7800 series even though it has fewer pixel pipelines. I may be being pessimistic, but I just can't see how managing threads with extra transistors would be faster than using the extra transistors for processing pixels - it is a GPU after all, not a CPU :P

What do you guys think? The article actually mentions that ATI is going to push using GPUs as secondary processors to offload work to. Do you think there is any merit to these methods/ideas?

Windows Vista will move towards threaded (and generally virtualized) hardware capabilities, including for graphics. Thus, it might make sense to split your rendering pipe in preparation for that.

Quote:
Original post by Jason Z
I found it ironic that the X1K series has a higher transistor count than the NV 7800 series even though it has fewer pixel pipelines. I may be being pessimistic, but I just can't see how managing threads with extra transistors would be faster than using the extra transistors for processing pixels - it is a GPU after all, not a CPU :P


It really depends on the usage pattern and where the bottlenecks are. In theory, if a shader is waiting for a long texture read cycle to complete, it can be swapped out and another set of pixels pushed into the buffer to begin their processing; when they are done, it swaps back and picks up where it left off.
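That swap-on-a-stall idea can be sketched as a toy simulation. Everything below is invented for illustration - the cycle costs, the six-instruction shader, and the batch count are made-up numbers, not real R520 figures - but it shows why keeping other batches ready to run hides the texture latency:

```python
# Toy model of latency hiding on one shader pipe. A "batch" of pixels runs
# a tiny shader: one texture fetch, then six ALU ops. The fetch takes
# TEX_LATENCY cycles to come back; with threading, the scheduler swaps to
# another ready batch instead of idling. All numbers are invented.

TEX_LATENCY = 8                    # assumed texture fetch latency (cycles)
N_BATCHES = 4
PROGRAM = ["TEX"] + ["ALU"] * 6    # toy shader program

def simulate(max_in_flight):
    """Return (total_cycles, busy_cycles) for one pipe.

    max_in_flight=1 models no threading (one batch at a time); larger
    values let the scheduler swap in another ready batch on a stall.
    """
    batches = [{"pc": 0, "ready_at": 0} for _ in range(N_BATCHES)]
    cycle = busy = done = 0
    while done < N_BATCHES:
        in_flight = [b for b in batches if b["pc"] < len(PROGRAM)][:max_in_flight]
        runnable = [b for b in in_flight if b["ready_at"] <= cycle]
        if runnable:
            b = runnable[0]
            op = PROGRAM[b["pc"]]
            b["pc"] += 1
            if op == "TEX":
                b["ready_at"] = cycle + TEX_LATENCY  # this batch stalls...
            busy += 1                                # ...but the pipe did work
            if b["pc"] == len(PROGRAM):
                done += 1
        cycle += 1
    return cycle, busy

print("serial:  ", simulate(max_in_flight=1))   # (56, 28): pipe idles on fetches
print("threaded:", simulate(max_in_flight=4))   # (32, 28): same work, fewer cycles
```

The useful work (28 busy cycles) is identical in both runs; threading just packs it into fewer wall-clock cycles by filling the stall windows.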

Now, how well this will work in a game is another matter completely, and at this point any discussion about benchmarks etc. is just going to end badly, so I suggest we don't go there.

It's one of those ideas where only time will tell (that and their new ring bus memory system), but as hplus points out, it's the direction things are going in.

Edit:
Oh, and you might want to have a look at the new ATI SDK released in the last couple of days. It has a fair amount of information on programming for the X1K cards; I only skimmed it, so I can't say for sure how much it covers the threading system, but from what I read it's probably something people should get hold of.

Quote:
Original post by Jason Z
I found it ironic that the X1K series has a higher transistor count than the NV 7800 series even though it has fewer pixel pipelines. I may be being pessimistic, but I just can't see how managing threads with extra transistors would be faster than using the extra transistors for processing pixels - it is a GPU after all, not a CPU :P

What do you guys think? The article actually mentions that ati is going to push using GPUs as secondary processors to offload work to. Do you think there is any merit to these methods/ideas?


The transistor count comes in because ATi is managing up to 1024 (correct?) threads in flight with an extremely small batch size of 4x4 pixels, and you need to keep state information for all those threads. NVIDIA, on the other hand, is using a batch size of 32x32 (7800) and 64x64 (6800). Smaller batch sizes are much, much better for branching and flow control, but they cost transistors.
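The branching cost of big batches can be illustrated with a quick back-of-the-envelope script: every pixel in a batch has to execute both sides of a branch if even one pixel in that batch diverges, so bigger batches drag more pixels through both paths. The image size and the branch condition below are invented; the batch sizes are the ones quoted above:

```python
# Count how many pixels pay for both sides of a branch, per batch size.
# A circular region of a 256x256 image takes one side of a hypothetical
# branch; any batch straddling the circle's edge is "mixed" and all of
# its pixels execute both paths.

SIZE = 256

def in_region(x, y):
    # hypothetical branch condition: inside a circle centred in the image
    return (x - SIZE // 2) ** 2 + (y - SIZE // 2) ** 2 < (SIZE // 3) ** 2

def mixed_pixels(batch):
    mixed = 0
    for by in range(0, SIZE, batch):
        for bx in range(0, SIZE, batch):
            outcomes = {in_region(x, y)
                        for y in range(by, by + batch)
                        for x in range(bx, bx + batch)}
            if len(outcomes) == 2:        # batch straddles the branch
                mixed += batch * batch    # every pixel runs both paths
    return mixed

for batch in (4, 32, 64):                 # R520 vs 7800 vs 6800 batch widths
    print(f"{batch}x{batch}: {mixed_pixels(batch)} pixels pay for both paths")
```

With 4x4 batches only a thin band around the circle's edge runs both paths; at 64x64 most of the image does, which is the transistor-for-granularity trade-off in a nutshell.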

On ATi hardware the ALU:TEX op ratio is 3:1, so when a texture op is hit, ATi can schedule another thread to run ALU ops on the same pipe while the original thread waits for the texture operation to finish. NVIDIA (and pre-R520 ATi hardware) does this with some clever latency hiding and their compiler (and shader replacement).

As more applications start using SM3 features like branching, the performance delta between a 7800 and an X1800 is only going to increase IMO (just look at the FEAR demo numbers, and it's an NVIDIA TWIMTBP title too).

ATi has already stated (xbitlabs) and demoed (see the Beyond3D X1800 preview) that they can run physics on the graphics card: in a CrossFire setup, one card handles video while the other does physics.

Quote:

from here
During a demonstration of CrossFire, ATI displayed an interesting demo which put the CrossFire cards to an alternative use. As opposed to having the CrossFire boards increase performance by distributing graphics workloads across the two, ATI has a demo with wave simulations that could be fully calculated on the CPU, have the CPU calculate the physics while the graphics card renders the image, or move the rendering to one board and the physics of the wave simulation to the second graphics board, effectively turning it into a physics co-processor.


article on General Purpose computation on GPUs

Very interesting. So, under the assumption that during a texture lookup the processing unit can switch to another thread (batch?), if the new thread is using the same shader (and hence needs a similar texture lookup one pixel over), does it have to wait for the first texture op to finish before it can continue? In theory it seems like a good idea to use the threading for the very reasons that you mentioned (nts), but after thinking about it, it doesn't seem like it would make a huge difference to run ALU ops while TEX ops are waiting.

I can certainly appreciate the smaller batch size helping with dynamic branching, but I don't see the huge benefit of the threading (that's not to say that there isn't one, just that I don't see it!).

Eventual integration with Vista is a very valid suggestion, and I certainly agree that there is no point in discussing the benchmarks (or lack thereof). I suppose it will be interesting to see how the overall throughput changes with the changing shader paradigms (i.e. branching vs. non-branching).

Thank you for your input!

Quote:
Original post by Jason Z
Very interesting. So, under the assumption that during a texture lookup the processing unit can switch to another thread (batch?), if the new thread is using the same shader (and hence needs a similar texture lookup one pixel over), does it have to wait for the first texture op to finish before it can continue? In theory it seems like a good idea to use the threading for the very reasons that you mentioned (nts), but after thinking about it, it doesn't seem like it would make a huge difference to run ALU ops while TEX ops are waiting.

I can certainly appreciate the smaller batch size helping with dynamic branching, but I don't see the huge benefit of the threading (that's not to say that there isn't one, just that I don't see it!).

Eventual integration with Vista is a very valid suggestion, and I certainly agree that there is no point in discussing the benchmarks (or lack thereof). I suppose it will be interesting to see how the overall throughput changes with the changing shader paradigms (i.e. branching vs. non-branching).

Thank you for your input!



In the case of the R520, I believe texture units are still coupled to the pipes, but the pipes can have 128-256 threads queued up (shared between pipes?). If one thread is waiting for a texture op to finish, the scheduler will scan the other threads, and any that have ALU ops ready can be swapped in and allowed to execute until the next stall. I'm not sure how the scheduler prioritizes which threads to run at what point; I don't think that info has even been released. There is a cache for the texture units, so if you need the same texture one pixel over there is a good chance what you need is already sitting in cache due to spatial locality.
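That spatial-locality point can be sketched with a toy texel cache. The 4x4 block size and the cache organization here are invented for illustration - real texture caches are more elaborate - but the mechanism is the same: a miss loads a whole block, so the neighbouring pixel's fetch usually hits.

```python
# Toy texel cache showing why a neighbouring pixel's texture fetch often
# hits: texels are cached in small blocks, so sampling one pixel over
# usually lands in a block that is already resident. Block size and cache
# organization are assumptions, not real hardware details.

BLOCK = 4  # assumed: each cache line holds a 4x4 block of texels

class TexelCache:
    def __init__(self):
        self.lines = set()
        self.hits = self.misses = 0

    def fetch(self, u, v):
        line = (u // BLOCK, v // BLOCK)   # which block holds texel (u, v)
        if line in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            self.lines.add(line)          # a miss loads the whole block

cache = TexelCache()
cache.fetch(10, 10)   # first pixel's sample: cold miss, loads block (2, 2)
cache.fetch(11, 10)   # the pixel one over: same block, so it hits
print(cache.hits, cache.misses)   # 1 1
```

So a swapped-in thread sampling "one pixel over" usually doesn't restart the whole fetch; it just reads the block its neighbour already pulled in.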

The more interesting GPUs are the RV530 and R580, because there the texture units seem to be fully decoupled from the pipes: 4 texture units shared across 12 pipes on the RV530 (though 4 ROPs and 4 texture units is pretty limiting for it).

I believe ATi and NVIDIA have said that current GPUs are about 50-60 percent efficient, while the Xenos (Xbox 360 GPU) is said to be 95% efficient. Using the threading approach, ATi can make the R520 much more efficient, up to 70-80 percent. NVIDIA's card (7800), on the other hand, has stronger pipes: 2 full vector ops per clock (correct?), whereas ATi has one full vector op and one scalar op.

I'm not sure which will pay off better in the long run (it's all going unified anyway), but the threading approach ATi took, with the low batch sizes, seems to be very good for branching performance.



Up to now NVIDIA has been telling developers to avoid branching in shaders whenever possible.

When future games start doing more branching in shaders, ATi's card should definitely pull ahead IMO. I believe the FEAR benchmarks already show this, with the X1800 XT being almost 100% faster at 1600x1200 with high AA/AF (unless that's down to its 512MB and bandwidth advantage).
