BetaASM

Difficult Question: input/output calcs gpu

Recommended Posts

Hi, I'm not a game programmer, but I figured you guys (knowing the GPU architecture better than I do) would be able to help me out. I'm trying to run some non-graphics-related calculations on the GPU and get the results back. There are programs like BrookGPU and whatnot, but I'd like to do this in ASM, so I need a lower-level solution.

OK, here's the question. Using the API in d3dx9_26.dll, how can I give the GPU, say, 3 floats, have the GPU add them together, then add a constant, and retrieve the result of the accumulation? I don't need any graphics or drawing on the screen. I assume I'd need a shader like this (I started learning GPU asm a few minutes ago):

    vs_1_1                 ; v0.xyz = 3 floats, c0.x = constant to add to the accumulation
    #define                ; don't really know
    #decl                  ; what these are for
    add r1.x, v0.x, v0.y   ; add input floats
    add r1.x, r1.x, v0.z   ; add the third
    add oPos.x, r1.x, c0.x ; add the constant to the accumulation

Then an API call to AssembleShader, which points to a buffer that holds the shader code above. I assume there's some API to set the constant register, but how do I set the input register? And after that, I don't know how to run the program on the GPU and retrieve the result, or how to input an array of floats and retrieve the output.

SORRY FOR THE TOTALLY OFF-TOPIC QUESTION, but running non-graphics code on the GPU is going to become very popular very soon, and you guys (being game programmers) seem to have the best grasp of how to implement it.
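To make the intended computation concrete, here is a CPU-side model (plain Python, not the D3D API) of what the shader above computes per vertex: the three input floats are summed and the constant register is added. Names like shader_kernel are illustrative only.

```python
# CPU-side model of the vertex-shader arithmetic described in the post:
# result = v0.x + v0.y + v0.z + c0.x
def shader_kernel(v0, c0):
    """Models: add r1.x,v0.x,v0.y / add r1.x,r1.x,v0.z / add oPos.x,r1.x,c0.x"""
    x, y, z = v0
    return x + y + z + c0

# On real hardware this kernel would run once per vertex in the input stream:
inputs = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
c0 = 10.0
results = [shader_kernel(v, c0) for v in inputs]
print(results)  # [16.0, 25.0]
```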

Quote:
Original post by BetaASM
Hi, I'm not a game programmer, but I figured you guys (knowing the GPU architecture better than me) would be able to help me out.

I'm trying to run some non graphics related calculations on the GPU and get the results of the calculations. There's programs like BrookGPU and what not but I'd like to do this in ASM so I need a lower level solution.

What's wrong with the level of abstraction offered by Brook? Remember that in GPU terms, shader asm is just as much an abstraction as HLSL, Cg, or Brook.
Quote:
OK, here's the question.
Using the API in d3dx9_26.dll, how can I give the GPU, say, 3 floats, have the GPU add them together, then add a constant, and retrieve the result of the accumulation?

If you're doing GPGPU stuff, I suggest you use OpenGL, not Direct3D. It has much better support for this sort of thing. In OpenGL, you'd render to a pbuffer and then read back from the pbuffer. For more information, you should check out GPGPU.org; it's sort of a clearinghouse for this stuff. If you really wanted to do just one addition, it'd be a 1x1 pbuffer. Of course, this would be incredibly inefficient. GPGPU stuff thrives on parallelization.
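The pattern described above can be modeled on the CPU (a sketch, not real OpenGL): pack data into a 2D "texture", run a fragment kernel once per texel, then "read back" the result. The function names here are illustrative.

```python
# CPU model of the basic GPGPU pattern: one data element per texel, a
# side-effect-free kernel run over every texel, then a full readback.
def run_gpgpu_pass(texture, kernel):
    # Each output texel depends only on its own inputs -- no side effects --
    # which is exactly what lets the GPU process every texel in parallel.
    return [[kernel(texel) for texel in row] for row in texture]

texture = [[1.0, 2.0], [3.0, 4.0]]               # a 2x2 "pbuffer"
framebuffer = run_gpgpu_pass(texture, lambda t: t * t)
# A 1x1 pbuffer for a single addition wastes all of that parallelism:
single = run_gpgpu_pass([[3.0]], lambda t: t + 1.0)
print(framebuffer, single)  # [[1.0, 4.0], [9.0, 16.0]] [[4.0]]
```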
Quote:
SORRY FOR THE TOTALLY OFF-TOPIC QUESTION
But running non-graphics code on the GPU is going to become very popular very soon, and you guys (being game programmers) seem to have the best grasp of how to implement it.

As a graduate researcher in graphics programming and in GPGPU, I disagree. Vectorized processing in consumer-level hardware? Sure, it'll be popular. But it won't be hosted in the GPU. The GPU is just the wrong place to do this stuff, for many reasons. The current popularity of GPGPU is a response to the current (transient) situation where everyone has a nice GPU and nobody has a general-purpose vectorized coprocessor.

-Level of abstraction
I was ambiguous: I meant that the programs I would be writing would be in x86 asm (x86-64 in the near future); I wasn't referring to the GPU assembly language. That's why I wanted to use the lowest available valid level of shader programming, which is the assembly-style syntax.

-SSE/SSE2/3DNow! instructions
They can't touch the single-precision vector FP speed of the GPU. If you don't believe me, check out http://sourceforge.net/projects/ffff/ and run the benchmark built into that program; you'll see around a 4x speed increase going from the optimized SSE instructions to the GPU processing.

-The simple addition
Was an EXAMPLE I needed to get a handle on the technique: K.I.S.S., a la HelloWorld. Obviously it's inefficient to run 3 instructions on the GPU; that's not the point at all.

Well, I guess I'm forced to look into GLUT and OpenGL.
I know it was unintentional, but you've managed to dodge helping me altogether, LOL. Thanks anyway.

Quote:
Original post by BetaASM
-SSE/SSE2/3DNow! instructions
They can't touch the single-precision vector FP speed of the GPU. If you don't believe me, check out http://sourceforge.net/projects/ffff/ and run the benchmark built into that program; you'll see around a 4x speed increase going from the optimized SSE instructions to the GPU processing.

That's true; currently high-end GPUs can easily beat CPUs in some non-graphics computational tasks. However, this will most likely not last for more than 2-3 more years. Once architectures similar to that of IBM's Cell processor become more commonplace, it will be more or less useless to do general computation on the GPU.

Quote:
Original post by BetaASM
But running nongraphics code on the gpu is going to become very popular very soon

I doubt it. Technical issues aside, it's never going to give you a significant gain for any practical app. Either you need a certain degree of precision (scientific apps), in which case GPUs just don't cut it, or you can live with the inaccuracy, in which case you're probably doing something realtime like a game, where your graphics card is already busy doing proper display work.

Slightly more practically, you can't just write a GPU shader and tell it to 'do stuff'. You need to render dummy geometry (*) in order for the shader to actually process the vertices and fragments generated. Your end result becomes available in the form of your framebuffer (or other render target). You'll have to read back the framebuffer (backwards over the AGP bus - slow!) to actually get the results on the CPU.

How you actually set constant input registers depends on your API. If you're looking at OpenGL, then the orange book (OpenGL Shading Language) is a good place to start.

(*) Or more likely, carefully massaged geometry. Another example of why this is impractical: you end up with lots of non-trivial setup and teardown just so your GPU can rattle through the instructions.
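The upload/draw/readback cycle described in this post can be sketched as a CPU model (all function names here are illustrative placeholders, not a real graphics API):

```python
# Sketch of the GPGPU round trip: you never "run" a shader directly -- you
# upload inputs, draw geometry that covers the pixels you want computed,
# and read the framebuffer back over the bus.
def upload(data):
    return list(data)                      # 1. copy inputs across the bus

def draw_quad(texture, kernel):
    return [kernel(t) for t in texture]    # 2. rasterize; kernel runs per fragment

def read_back(framebuffer):
    return framebuffer                     # 3. slow readback to the CPU

def gpgpu_compute(data, kernel):
    texture = upload(data)
    framebuffer = draw_quad(texture, kernel)
    return read_back(framebuffer)

print(gpgpu_compute([1.0, 2.0, 3.0], lambda x: x + 10.0))  # [11.0, 12.0, 13.0]
```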

GPUs are getting a lot more general purpose - in the Longhorn time frame with WGF 2.0 it will become a lot easier to use them for non-graphics computations as a lot of the current difficulties and restrictions are relaxed. The GPU in the Xbox 360 hints at the sort of things that we will probably see on the PC in the future - unified vertex and pixel shaders that can write to main memory as well as to the frame buffer for example. The Cell is an example of a CPU becoming more like a GPU (in some ways it is more restrictive) whilst GPUs are becoming more general purpose and more like a CPU. At the moment I think it's far from clear that CPUs will eventually displace GPUs rather than the other way around.

Quote:
Original post by mattnewport
GPUs are getting a lot more general purpose - in the Longhorn time frame with WGF 2.0 it will become a lot easier to use them for non-graphics computations as a lot of the current difficulties and restrictions are relaxed. The GPU in the Xbox 360 hints at the sort of things that we will probably see on the PC in the future - unified vertex and pixel shaders that can write to main memory as well as to the frame buffer for example. The Cell is an example of a CPU becoming more like a GPU (in some ways it is more restrictive) whilst GPUs are becoming more general purpose and more like a CPU. At the moment I think it's far from clear that CPUs will eventually displace GPUs rather than the other way around.

Oh, I don't think either will displace the other.

GPUs are indeed getting more general purpose; they need to, to provide the shading capabilities developers are demanding. But at the same time, their basic data transfer paradigm has not shifted much: They stream data at blistering speeds but do not reverse their flow direction easily.

When you say "The Cell is an example of a CPU becoming more like a GPU" what you really mean is "The Cell is an example of a CPU incorporating vectorized processing". It's important to recognize that this isn't a capability particular to GPUs. DSP hardware, in particular, has been doing this stuff for a long time. And a GPU is not JUST a vectorized processor. Despite the move towards programmable pipelines, a GPU is still optimized for performing very specific tasks, such as triangle rasterization and 4x4 matrix ops, to the detriment of other tasks. For instance, the GPU lacks a "scatter" operation.
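The gather/scatter distinction mentioned above can be illustrated with a small CPU model (a sketch, not GPU code):

```python
# A fragment shader can GATHER: each output element reads whatever inputs
# it likes, and no two outputs ever conflict.
def gather(src, indices):
    return [src[i] for i in indices]       # out[j] = src[idx[j]]

# SCATTER writes to computed addresses; two elements may target the same
# slot, which is exactly the write hazard the GPU's parallel pipelines
# cannot resolve, so the hardware simply doesn't offer the operation.
def scatter_add(dst, indices, values):
    for i, v in zip(indices, values):
        dst[i] += v                        # conflicting writes serialize here
    return dst

print(gather([10, 20, 30], [2, 0, 2]))               # [30, 10, 30]
print(scatter_add([0, 0, 0], [1, 1, 2], [5, 7, 9]))  # [0, 12, 9]
```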

Ultimately, as the Cell processor has shown, there's nothing magical about a GPU. Intel could decide tomorrow to stick a GeForce 6800 GPU on a P4 die and get the same sort of vectorized performance. But if Intel really wanted vectorization, they wouldn't use a GeForce; they'd use something with more general vectorization. I think what we'll see is the GPU becoming more general but remaining a GPU, while CPUs evolve vectorization support similar to that found in GPUs and other vectorized processors.

Quote:
Original post by Sneftel
In OpenGL, you'd render to a pbuffer, and then read back from the pbuffer.


How is that different from rendering to a texture in DirectX and reusing that texture as input in the next iteration?
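For what it's worth, the render-to-texture iteration being asked about is usually "ping-ponged" between two buffers, since a pass can't read and write the same render target. A CPU model of that pattern (names illustrative):

```python
# Ping-pong iteration: each pass reads one buffer and writes the other,
# then the roles swap for the next pass.
def iterate(src, kernel, passes):
    dst = [0.0] * len(src)
    for _ in range(passes):
        for i in range(len(src)):
            dst[i] = kernel(src, i)   # pass N reads src, writes dst
        src, dst = dst, src           # swap: this pass's output is next input
    return src

# e.g. a simple 3-tap smoothing kernel (clamped at the edges) run twice
blur = lambda buf, i: (buf[max(i-1, 0)] + buf[i] + buf[min(i+1, len(buf)-1)]) / 3
print(iterate([0.0, 3.0, 0.0], blur, 2))  # [1.0, 1.0, 1.0]
```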

I kind of agree with Sneftel, in that I don't think GPUs will be used that much for general computation. They're so stream-oriented, and much of their speed comes from their pipelined nature. What I think the current GPGPU interest shows is that some people want more general-purpose math acceleration, and the only place they're finding it right now is the GPU. That doesn't mean that GPUs are the solution.

On another note, there seems to be a lot of hyping of GPGPU stuff these days. I know there are many demos of GPU computation and I am very familiar with the power of the GPU's vector float processing, but does anyone actually use the GPU for something useful and practical besides graphics and GPGPU demos? Pardon my ignorance if they do, I'm just curious.

Quote:
Original post by Sneftel
When you say "The Cell is an example of a CPU becoming more like a GPU" what you really mean is "The Cell is an example of a CPU incorporating vectorized processing". It's important to recognize that this isn't a capability particular to GPUs. DSP hardware, in particular, has been doing this stuff for a long time. And a GPU is not JUST a vectorized processor. Despite the move towards programmable pipelines, a GPU is still optimized for performing very specific tasks, such as triangle rasterization and 4x4 matrix ops, to the detriment of other tasks. For instance, the GPU lacks a "scatter" operation.

True, but the point I'm trying to make is that the characteristics that make GPUs very fast for certain kinds of tasks and difficult to apply to others, and the programming model offered by GPUs as exposed through vertex and pixel shaders, are not so much limitations of GPUs as fundamental limitations of this kind of approach. The reason GPUs are so effective at exploiting parallelism, and can be made faster simply by increasing the number of pipelines / ALUs, is that the programming model enforces a strict separation between 'threads' of execution: each pixel or vertex is processed with limited output options (the lack of a scatter operation you mention) and has no side effects that can affect other pixels or vertices in the same batch. This kind of stream processing model is the only really effective way to get linear performance increases by throwing more execution units at a problem. Any kind of algorithm that can efficiently utilise multiple SPEs performing the same instructions on separate data elements will have to work under similar kinds of restrictions. The SPEs are more flexible and have a relatively large local store which can be used for working memory, but they also lack simple, efficient access to a large amount of read-only memory.

As soon as you introduce a scatter operation you require mechanisms for synchronization between multiple simultaneous threads of execution, which limits how efficiently you can scale performance by increasing the number of available execution units. It's possible to do this with the Cell, but if you rely on it heavily you'll lose most of the performance gains of having multiple processing units. At some point we may see GPUs that allow scatter operations, but you can expect it to come at a large performance cost. Really, it's far better, if possible, to recast your algorithm in a way that doesn't require expensive synchronization if you want it to scale with increased hardware resources.
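A toy CPU demonstration of the point above (illustrative names, deliberately simplified): once two "execution units" scatter into the same location, every conflicting write has to be serialized behind a lock, which is exactly the cost that breaks linear scaling.

```python
import threading

# Shared output that both workers scatter into.
counts = [0] * 4
lock = threading.Lock()

def scatter_worker(items):
    for idx in items:
        with lock:             # every conflicting write pays for synchronization
            counts[idx] += 1

threads = [threading.Thread(target=scatter_worker, args=([1, 1, 3],)),
           threading.Thread(target=scatter_worker, args=([1, 2, 3],))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counts)  # [0, 3, 1, 2]
```

Without the lock the increments can race; with it, the workers spend their time waiting on each other instead of computing, which is why a side-effect-free stream model scales so much better.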

Of course the Cell does give you the flexibility to take other approaches to exploiting parallelism. You can use multiple execution units in a pipeline for some problems but pipelines can only scale up in performance to the extent that the problem can be broken down into suitable sub-steps. You can use multiple execution units to work on unrelated problems that require little synchronisation (some working on physics, some on sound, some on animation, etc. in a game) but that will only scale to the extent that the problem can be broken down into largely independent tasks. Only problems that are suited to stream processing have the potential to scale up to massive numbers of execution units (on the order of the number of elements that need to be processed).

