Questions about general GPU architecture and graphics pipeline

I have been programming with Direct3D for a fair amount of time. From time to time I realize that I have very limited knowledge about the architecture of GPUs and the rendering pipeline in general. This became even more obvious to me when I began to learn and experiment with vertex and pixel shaders. At the moment I have at least a blurred picture of how the pipeline works, but I still have questions about some of the details.

1- I found the following diagram while googling the topic: http://www.ozone3d.net/tutorials/images/gpu_arch/pipeline_3d_w570.jpg
Let's say we issue a draw call in Direct3D for a mesh consisting of 1000 vertices. If we have 8 vertex shader units, as in the diagram, how does the GPU proceed? Will it take 8 vertices at a time, process and transform them in parallel until it has consumed all of them, and only after all 1000 vertices are finished send the list of transformed 2D triangles to the rasterizer? Or does it send a 2D triangle to the rasterizer as soon as its corner vertices are transformed? And where does the GPU keep the data for the 2D triangles in that case?

2- My second question is about the pixel-processing part (the rasterizer, the fragment processors and the raster operation units taken together). Does this part operate on a single triangle at a time? The tutorial the picture belongs to talks about pixel shader threads running on every single fragment processor. Does this mean a fragment processor can operate on pixels of different triangles at the same time?

3- How does the quad structure of the pixel shaders work on the pixels of a triangle? Let's say a quad shades the following pixels at time t0:

[8,10] [9,10]
[8,11] [9,11]

Does it move to

[10,10] [11,10]
[10,11] [11,11]

after the first 4 pixels are finished, at time t1? (That is, does it scan two rows of pixels from left to right?)

I know these are a lot of questions, but what I really need is a step-by-step explanation of a mesh's journey through the graphics pipeline, and I couldn't find anything like that on the internet so far. Thanks in advance.
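To make question 1 concrete, here is the kind of toy CPU-side model I have in mind. Every name and number in it is made up for illustration; it is not real hardware behaviour or Direct3D API usage, it just acts out the "send each triangle on as soon as its vertices are done" alternative:

```cpp
#include <cstdio>
#include <vector>

struct Vertex   { float x, y, z; };
struct Triangle { int i0, i1, i2; };

int main()
{
    const int vertexCount = 1000;
    const int vertexUnits = 8;                          // as in the diagram
    std::vector<Vertex> vertices(vertexCount);          // zero-initialized dummy positions
    std::vector<bool>   transformed(vertexCount, false);

    // A plain triangle list: vertices (i, i+1, i+2) for every group of three.
    std::vector<Triangle> triangles;
    for (int i = 0; i + 2 < vertexCount; i += 3)
        triangles.push_back({ i, i + 1, i + 2 });

    std::size_t sent = 0;                               // triangles already handed to the rasterizer
    for (int base = 0; base < vertexCount; base += vertexUnits)
    {
        // "Run" up to 8 vertex shader invocations in parallel.
        for (int lane = 0; lane < vertexUnits && base + lane < vertexCount; ++lane)
        {
            vertices[base + lane].z += 1.0f;            // stand-in for the real transform
            transformed[base + lane] = true;
        }

        // Any triangle whose three vertices are now ready goes straight to the
        // rasterizer; nothing waits for the whole mesh to finish.
        while (sent < triangles.size() &&
               transformed[triangles[sent].i0] &&
               transformed[triangles[sent].i1] &&
               transformed[triangles[sent].i2])
        {
            ++sent;
        }
        std::printf("after the batch starting at vertex %4d: %zu triangles sent on\n",
                    base, sent);
    }
    return 0;
}
```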
The questions you ask are beyond the scope of the Direct3D pipeline specification.
Every GPU works differently, but there are some guesses you can be fairly certain of.

Quote:
1- I found the following diagram while googling the topic: http://www.ozone3d.net/tutorials/images/gpu_arch/pipeline_3d_w570.jpg
Let's say we issue a draw call in Direct3D for a mesh consisting of 1000 vertices. If we have 8 vertex shader units, as in the diagram, how does the GPU proceed? Will it take 8 vertices at a time, process and transform them in parallel until it has consumed all of them, and only after all 1000 vertices are finished send the list of transformed 2D triangles to the rasterizer? Or does it send a 2D triangle to the rasterizer as soon as its corner vertices are transformed? And where does the GPU keep the data for the 2D triangles in that case?


Since the purpose of a pipeline is to maximize throughput, the best thing to do is to subdivide the work as finely as possible: down to individual vertices in the vertex shader stage, and individual fragments in the pixel shader stage.
There's no need for a GPU to buffer all vertices until every one of them has been transformed before passing anything to the pixel shader units; that would mean storing all transformed vertices and vertex attributes in a suitably large memory, and it would also mean that during the entire vertex-processing time the pixel shader units would be left with no work to do.
So that scheme would be very inefficient, and since it brings no benefit you can be quite sure that no GPU will ever do it.
Something important about pipeline architectures is that they have to feed the later stages as soon as possible; in this case that means as soon as all 3 vertices composing a triangle have been processed.
That's also the reason why you have triangle strips.
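To illustrate that last point with a hedged little sketch (only a CPU toy, not a claim about any actual GPU): with a strip, every vertex after the first two completes one more triangle, so the later stages can be fed after each newly transformed vertex rather than after every third one.

```cpp
#include <cstdio>

int main()
{
    const int stripVertexCount = 10;     // a small made-up strip

    for (int v = 0; v < stripVertexCount; ++v)
    {
        // ...the vertex shader would run here for vertex v...
        if (v >= 2)
        {
            // Triangle (v-2, v-1, v) is complete the moment vertex v is done,
            // so it can be sent to the rasterizer immediately.
            std::printf("vertex %d transformed -> triangle (%d, %d, %d) ready\n",
                        v, v - 2, v - 1, v);
        }
    }

    // For comparison, an indexed triangle list needs three indices per
    // triangle, while a strip needs only one new vertex per triangle after
    // the first one.
    std::printf("strip: %d vertices -> %d triangles\n",
                stripVertexCount, stripVertexCount - 2);
    return 0;
}
```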

Questions 2 and 3 are very specific to the architecture scheme you posted, but here are my guesses:
2 - Yes, since a fragment can only belong to one triangle.
3 - I guess this one depends on the texel caching technique used, since the goal is to reduce memory accesses.
Whoa! Your questions are way too low level.

Quote: Original post by 67rtyus
what I really need is a step-by-step explanation of a mesh's journey through the graphics pipeline, and I couldn't find anything like that on the internet so far.

That's because you're going so low level that you're asking how the hardware itself works.
Most of that is likely to be kept secret by NVIDIA, ATI and the rest, and each vendor has its own implementation.
It's fun to know this stuff, but if you want to know it in order to get the maximum performance out of your GPU, keep in mind that the knowledge will only apply to your specific vendor and model. Each model may change the way it does things, and some card revisions have even changed the order in which things are processed (if I recall correctly, ATI cards used to run the Z buffer test first and then the fragment shaders, and now it's the other way around).
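If that ordering remark sounds abstract, here is a toy sketch of the difference, nothing vendor-specific, just the idea of testing depth before versus after running the pixel shader:

```cpp
#include <cstdio>

struct Fragment { int x, y; float z; };

static float g_storedDepth = 0.5f;                 // one depth-buffer texel, for the example

static bool  depthTest(float z)                { return z < g_storedDepth; }
static float runPixelShader(const Fragment& f) { return f.z; }   // stand-in for real shading work

int main()
{
    Fragment f{ 3, 7, 0.8f };                      // this fragment is behind what is already stored

    // "Z first": the depth test rejects the fragment before the (possibly
    // expensive) pixel shader ever runs for it.
    if (depthTest(f.z))
        g_storedDepth = runPixelShader(f);
    else
        std::printf("Z first: fragment rejected, shader never ran\n");

    // "Shader first": the shader always runs and its result may then be thrown
    // away by the depth test (necessary, for example, if the shader writes depth).
    float shaded = runPixelShader(f);
    if (depthTest(f.z))
        g_storedDepth = shaded;
    else
        std::printf("shader first: shader ran, result discarded\n");

    return 0;
}
```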

Question 1 has already been answered. What gabe83 suggests is to use common sense.
Of all the alternatives you can imagine, pick the one that works as fast as possible, using the least memory and bandwidth, without of course causing visual corruption (i.e. race conditions due to parallel execution).
An army of electronics engineers has probably already thought about that idea and refined it over the ~10 years since accelerated 3D hardware hit the market.
This doesn't mean you can't come up with a better implementation. If you do, you'll be rich.

Quote: Original post by 67rtyus
Does this mean a fragment processor can operate on pixels of different triangles at the same time?

Probably yes.
My recommendation is to push the GPU to extremes to satisfy your curiosity.
Try writing to a texture and, in the same pass, reading it back from a fragment that shouldn't have been written yet, and from another fragment that should already have been written.
If the driver detects that you're doing this, you should see a real slowdown, because the massively parallel machinery in today's GPUs has to be partly switched off to produce a correct result; or you could see increased memory usage (because the driver needs to keep a fresh copy of the original texture).
If it doesn't detect it, you'll get some fun out of the visual artifacts.
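If you want a feel for why that goes wrong without touching real render-to-texture code, here is a hedged CPU analogue using plain std::thread: many "fragments" read and write the same buffer with no ordering guarantees, which is roughly the hazard you create by sampling the texture you are currently rendering to. (std::atomic is used only to keep the toy free of undefined behaviour; the ordering is still unpredictable.)

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    // Pretend this is a 64-texel texture that is both sampled and rendered to.
    std::atomic<int> texture[64] = {};

    std::vector<std::thread> fragments;
    for (int i = 0; i < 64; ++i)
    {
        fragments.emplace_back([&texture, i]()
        {
            // Each "fragment" reads its left neighbour, which may or may not
            // have been written yet -- exactly the undefined ordering above.
            int neighbour = texture[(i + 63) % 64].load();
            texture[i].store(neighbour + 1);
        });
    }
    for (std::thread& t : fragments)
        t.join();

    // The output differs from run to run; on a GPU the same uncertainty shows
    // up as visual artifacts instead of numbers.
    for (int i = 0; i < 64; ++i)
        std::printf("%d ", texture[i].load());
    std::printf("\n");
    return 0;
}
```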

Oh, and by the way, many of these ways of finding things out come very close to reverse engineering.

About question 3, I think not; each quad should move on to the next four pixels. But like gabe83 said, whether that's true could come down to efficient cache usage.
So in fact it may turn out that both statements are correct, because the GPU could detect whether it's more efficient to move on to the next 4 pixels or to the next 2.
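Just to illustrate what "moving to the next quad" could look like, here is the textbook description of a quad-based rasterizer walking a triangle's bounding box; whether any particular GPU actually steps like this, or uses some tiled/swizzled order, is exactly the vendor-specific part:

```cpp
#include <cstdio>

int main()
{
    // Bounding box of some triangle, in pixels (made-up numbers that start
    // at the [8,10] quad from question 3).
    const int minX = 8, minY = 10, maxX = 15, maxY = 13;

    for (int y = minY; y <= maxY; y += 2)              // one quad row at a time
    {
        for (int x = minX; x <= maxX; x += 2)          // one 2x2 quad to the right
        {
            // The four pixels of this quad would be coverage-tested together
            // and, if any of them lies inside the triangle, shaded together
            // (which is also what makes screen-space derivatives cheap).
            std::printf("quad covering [%d,%d][%d,%d] / [%d,%d][%d,%d]\n",
                        x, y, x + 1, y, x, y + 1, x + 1, y + 1);
        }
    }
    return 0;
}
```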

The best way to know all this would be to ask an NVIDIA/ATI/Intel (haha)/SiS/NEC/etc. engineer, but they probably won't be able to tell you because of the confidentiality agreements they've signed.

Good luck and have fun finding this stuff out.
Dark Sylinc
I would suggest asking this question over at the Beyond3D forums. The crowd there is generally more knowledgeable about these sorts of low-level hardware details (in fact a few employees from ATI/NVIDIA hang out there), while the discussions here are generally geared towards software topics.
I was about to suggest Beyond3D. :)

I'd like to comment (because it's an important point that many people don't seem to realise) that modern cards (GeForce 8000 family and up, Radeon HD 2000 family and up, i.e. all Direct3D 10 capable cards) don't have separate pixel and vertex pipelines. The same processing units are used for both.
