[XNA] GPGPU library and questions

As I mentioned in my Water Cube topic, I'm working on a little XNA library for doing general purpose calculations on the GPU. The library is pretty basic, but I think that also makes it simple to use and to drop into existing XNA projects. If anyone wants to give it a go, you can download the current source with a small demo. Should anyone try the demo, please post back your FPS for both the GPU and CPU versions, along with your CPU and GPU models. I'm particularly interested in the performance, since that's obviously the main reason for switching to GPU computation. The GPU version of the demo seems to run about two to three times faster than the CPU version. That sounds nice, but it's not that great: the CPU code only uses a single core, so with today's quad cores the CPU version could easily be made to run faster than my GPGPU approach. So on to my questions:
  1. I'm only using textures as input/output buffers, as a tradeoff to keep things simple and generic. I wonder if this is typical for GPGPU, or if I'm shooting myself in the foot badly by not leveraging vertex data? I could only use vertex buffers as inputs anyway (this is DX9 and I'm not using R2VB), and it'd be hard to use them in a generic way.
  2. The GPU bottleneck seems to be in setting and retrieving the data. I'd like to use dynamic textures as inputs, written with SetData and SetDataOptions.Discard, but ResourceUsage.Dynamic vanished from XNA a while ago. For vertex buffers there's now DynamicVertexBuffer, but there's no DynamicTexture2D. I tried using a ResolveTexture2D instead, but that had no effect. Does anyone know what to use here?
  3. Retrieving output from a render target is another costly operation. I'm currently just using rt.GetTexture().GetData(), but I have no clue what it's doing under the hood. As I recall, GetRenderTargetData in MDX wasn't as slow as this XNA setup. Does anyone know of a way to speed this up, or an alternative? Maybe I'm doing something dumb with the RTs?
Copying the data from the CPU to the input texture and reading it back from the RT each have about the same impact on performance; simplifying the shader has no impact at all. I wondered if it's a bandwidth problem, but it's only a 256x256 texture and RT, so I guess something really is up with the locking/data retrieval. A rough sketch of the round trip I'm doing is below. Thanks in advance for any input, I'm pretty stumped [smile]
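For reference, the round trip per frame looks roughly like this (a simplified XNA 3.1-style sketch, not the actual library source; device, computeEffect, quadBatch and the Vector4 payload are just placeholders):

Texture2D input = new Texture2D(device, 256, 256, 1, TextureUsage.None, SurfaceFormat.Vector4);
RenderTarget2D output = new RenderTarget2D(device, 256, 256, 1, SurfaceFormat.Vector4);

Vector4[] data = new Vector4[256 * 256];
// ... fill data on the CPU ...
input.SetData(data);                              // CPU -> GPU upload

device.SetRenderTarget(0, output);
computeEffect.Parameters["InputBuffer"].SetValue(input);

// Draw a full-screen quad so the pixel shader runs once per output texel.
quadBatch.Begin(SpriteBlendMode.None, SpriteSortMode.Immediate, SaveStateMode.None);
computeEffect.Begin();
computeEffect.CurrentTechnique.Passes[0].Begin();
quadBatch.Draw(input, new Rectangle(0, 0, 256, 256), Color.White);
computeEffect.CurrentTechnique.Passes[0].End();
computeEffect.End();
quadBatch.End();

device.SetRenderTarget(0, null);                  // unset the RT before reading it

Vector4[] results = new Vector4[256 * 256];
output.GetTexture().GetData(results);             // GPU -> CPU readback (the slow part)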


Thanks for the suggestion, I've downloaded it to check it out. I had actually been thinking of coding up my demo to compute Pi like theirs, so we're off to a good start [smile] Other than that, some things look quite similar in structure, but they seem to have a much fancier way to write the actual GPGPU code. Mine is a bit bare-bones in that respect, since you just code up an .FX file in HLSL, but that works out reasonably well in XNA since shader compilation is integrated into Visual Studio.

I guess Accelerator stole the stage for general GPGPU in C#, and by some years at that! I still quite like my little setup though. Aside from being dead simple, it should also work on the Xbox 360 (I really have to get a Creators Club subscription sometime). Since my primary goal is to offload game stuff to the GPU, Xbox compatibility is nice to have. Practical applications for the lib which I have working at the moment are computing normals and tangent frames for arbitrary meshes in one go, and performing Verlet integration on particles. Down the line I'd like to offload my entire particle system to the GPU, but that requires some rewriting so I don't lose too much time sticking the constraint data into GPUBuffers.

Looking at the Accelerator toolkit, I guess they primarily employ surfaces too, so I'll consider question 1 answered for now. Regarding question 2, the DirectX debug runtimes told me that LockFlags.Discard cannot be used on normal Texture2Ds since they're not dynamic. Switching back to ResolveTexture2Ds silenced the debug runtimes, so I guess this is the only way to create dynamic textures in XNA, even though I never resolve the backbuffer to them. If GraphicsDevice.ResolveBackBuffer uses GetRenderTargetData under the hood though, it's a good guess that ResolveTexture2Ds are created in D3DPOOL_SYSTEMMEM (since GetRenderTargetData requires that), and that's not ideal for my situation.
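For reference, the workaround currently looks roughly like this (just a sketch, not the library code; I'm assuming the ResolveTexture2D constructor overload that takes a format, and InputBuffer is a placeholder parameter name):

// Use a ResolveTexture2D as the updatable input buffer; whether this actually
// ends up in D3DPOOL_DEFAULT is exactly the open question.
ResolveTexture2D input = new ResolveTexture2D(device, 256, 256, 1, SurfaceFormat.Vector4);

Vector4[] data = new Vector4[256 * 256];
// ... refill data each frame ...
input.SetData(data);   // no discard-style overload is exposed for textures, so this may still stall
computeEffect.Parameters["InputBuffer"].SetValue(input);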

So I'd like to know in which pool ResolveTexture2Ds are created. If they are in fact created in D3DPOOL_SYSTEMMEM, is there a way to create dynamic textures in D3DPOOL_DEFAULT with XNA? (I cross-posted this question to the creators forum; hopefully someone from the XNA team will pitch in there.)

Question 3 also remains. Am I doing anything wrong with the RenderTargets? If not, how does rt.GetTexture().GetData() work under the hood? I'll give it a go with Pix later today to see if I can find out, but if anyone happens to know this and what performance considerations to keep in mind, that'd be great. Again referring to our screencapture utility code from good old MDX (here), it all feels slower than it needs to be.


[Edited by - remigius on February 22, 2010 1:35:09 AM]

Quote:
Original post by remigius
Question 3 also remains. Am I doing anything wrong with the RenderTargets? If not, how does rt.GetTexture().GetData() work under the hood? I'll give it a go with Pix later today to see if I can find out, but if anyone happens to know this and what performance considerations to keep in mind, that'd be great. Again referring to our screencapture utility code from good old MDX (here), it all feels slower than it needs to be.


The main problem with retrieving the RT data is that it's a guaranteed stall. Since you're asking for the data, the driver will have to flush all pending commands and wait for the GPU to finish them. And of course the PCI-e bus isn't optimized for readback, so that will be slow too.

To avoid a stall you can add some latency by double or triple-buffering your render target. If that's not an option, do as much stuff as you can before calling GetData to give the GPU some time to finish.
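Something like this, for example (just a sketch with made-up names; the idea is that GetData always reads the target you rendered to a frame earlier):

// Double-buffered readback: render into one target while reading back the one
// the GPU finished last frame, so GetData doesn't force a full pipeline flush.
// (On the very first frame there's nothing to read back yet.)
RenderTarget2D[] targets = new RenderTarget2D[2];
int current = 0;

void Process(GraphicsDevice device, Vector4[] results)
{
    device.SetRenderTarget(0, targets[current]);
    // ... draw the GPGPU pass into targets[current] ...
    device.SetRenderTarget(0, null);

    // Read back the target written last frame; it has had a whole frame to finish.
    int previous = 1 - current;
    targets[previous].GetTexture().GetData(results);

    current = previous;   // swap for the next frame
}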

Also D3D9 has a special function called GetRenderTargetData that was meant for CPU readback of render targets. However I'm not sure if XNA will ever use that function.

Quote:
Original post by MJP
The main problem with retrieving the RT data is that it's a guaranteed stall. Since you're asking for the data, the driver will have to flush all pending commands and wait for the GPU to finish them. And of course the PCI-e bus isn't optimized for readback, so that will be slow too.

...

Also D3D9 has a special function called GetRenderTargetData that was meant for CPU readback of render targets. However I'm not sure if XNA will ever use that function.


I know, but it's orders of magnitude slower in XNA than when I was fetching screenshots off render targets and writing them out to bitmaps in MDX. Those render targets were much larger than my measly 256x256 one, to boot. I used GetRenderTargetData back then, so maybe that's the difference.

I assumed that XNA would use GetRenderTargetData under the hood to resolve the render target to a surface in system memory and lock that. That would probably be the fastest option, and it's almost implied by RenderTarget2D.GetTexture().GetData(). I'll have to look with PIX to see if it does, but it'd be a bit silly if it doesn't, and worth at least finding out why not.

Quote:
To avoid a stall you can add some latency by double or triple-buffering your render target. If that's not an option, do as much stuff as you can before calling GetData to give the GPU some time to finish.


I thought about doing that, but it's my last resort. It makes the whole setup a lot more complex than I'd like, and looking at the performance of my MDX screen-capture code on much older and worse hardware (a Centrino laptop with a Mobility Radeon 9000!), it seems weird that it should be necessary.

Anyway, I already split the SetData()/GetData() parts into GPUProcessor.Compute() and GPUProcessor.ResolveResults() to give the GPU time to do its thing. I guess I could try calling GPUProcessor.ResolveResults() the next frame to see if the stall is the main issue.

Edit - just tried this after rearranging some code, so I only call RenderTarget.GetTexture().GetData() the next frame, but the performance actually got worse. If I comment out the SetData() part though, performance shoots up (150 fps to 600 fps), so the RenderTarget.GetTexture().GetData() part is fine. It might still be stall-related, but I wonder if it has anything to do with the texture usage and pool. Any thoughts?


[Edited by - remigius on February 22, 2010 5:12:46 AM]


For future reference, I brushed up the code a bit and put it online over here. I'm not quite convinced I got the stall fixed satisfactorily, or that the performance is acceptable, but I have some hope that it might be much better with a newer GPU and on a desktop machine rather than my trusty old laptop.

As I spammed all over that page already: if you happen to try it, please let me know how it performs and what your hardware specs are. Thanks [smile]

