# Two Part Entry

This post covers two areas: 1) triangle clipping and 2) multi-threading.

##### Triangle Clipping
As shown in my last post, triangle clipping has now been added to the pipeline. It exists as its own processor between the back-face culling processor and the rasterizer. I had been dreading writing this particular piece of the puzzle because I thought it was going to be very difficult to work out. However, I was pleasantly surprised at how simple it was to get up and running correctly.

The overall concept is that you start with a triangle whose vertices are post-projection, but pre-divide-by-w. Then you clip the triangle against the plane equation of each of the six frustum planes. Since the vertices are in clip space, the planes are defined by a constant set of plane coefficients (this simplifies things greatly - no need to recalculate the plane equations every frame!).
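
To make the "constant plane coefficients" point concrete, here is a minimal sketch of what such a table could look like. The names `FRUSTUM_PLANES`, `signed_distance`, and `is_inside` are hypothetical, and the near/far planes assume a Direct3D-style depth range of 0 <= z <= w, which the post does not specify.

```python
# Clip-space frustum planes as constant coefficients (a, b, c, d).
# A homogeneous point p = (x, y, z, w) is in the positive halfspace when
#   a*x + b*y + c*z + d*w >= 0
# Assumption: depth range is 0 <= z <= w. For an OpenGL-style range of
# -w <= z <= w the near plane would be (0, 0, 1, 1) instead.
FRUSTUM_PLANES = {
    "left":   (1.0, 0.0, 0.0, 1.0),    # x >= -w
    "right":  (-1.0, 0.0, 0.0, 1.0),   # x <=  w
    "bottom": (0.0, 1.0, 0.0, 1.0),    # y >= -w
    "top":    (0.0, -1.0, 0.0, 1.0),   # y <=  w
    "near":   (0.0, 0.0, 1.0, 0.0),    # z >=  0
    "far":    (0.0, 0.0, -1.0, 1.0),   # z <=  w
}


def signed_distance(plane, p):
    """Dot product of the plane coefficients with the homogeneous point."""
    return sum(c * v for c, v in zip(plane, p))


def is_inside(plane, p):
    """True when p lies in the positive halfspace (or exactly on the plane)."""
    return signed_distance(plane, p) >= 0.0
```

Because the table never changes, it can be built once at start-up and shared by every triangle in every frame.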

The sequence for each triangle is more or less:

1. Using the first plane:
   1. Determine the first vertex outside of the plane (each plane has a positive halfspace and a negative halfspace).
   2. Where that edge of the triangle intersects the plane, create a new vertex at that point and add it to the output list of vertices.
   3. Since there is an outgoing intersection, there must be a returning edge intersection connecting the vertices that are inside the plane. Find it and create the new vertex at that point, and add it to the output list.
   4. Pass the vertices inside the plane to the output list as they are.
2. Repeat the process for each of the remaining frustum planes.
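
The steps above can be sketched roughly as follows. This uses the classic Sutherland-Hodgman formulation (walk every edge, emit an intersection whenever the edge crosses the plane), which produces the same output polygon but may not match the exact traversal order I use in the real clipper. All names here are hypothetical, and the plane table assumes a 0 <= z <= w depth range.

```python
# Constant clip-space planes (a, b, c, d); inside means a*x+b*y+c*z+d*w >= 0.
PLANES = [
    (1.0, 0.0, 0.0, 1.0), (-1.0, 0.0, 0.0, 1.0),   # left, right
    (0.0, 1.0, 0.0, 1.0), (0.0, -1.0, 0.0, 1.0),   # bottom, top
    (0.0, 0.0, 1.0, 0.0), (0.0, 0.0, -1.0, 1.0),   # near, far
]


def distance(plane, p):
    return sum(c * v for c, v in zip(plane, p))


def intersect(a, b, da, db):
    """Point where edge a->b crosses the plane, given both signed distances."""
    t = da / (da - db)  # da and db have opposite signs, so da - db != 0
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))


def clip_against_plane(polygon, plane):
    output = []
    for i, current in enumerate(polygon):
        previous = polygon[i - 1]  # index -1 wraps around to the last vertex
        d_prev = distance(plane, previous)
        d_curr = distance(plane, current)
        # Edge crosses the plane (outgoing or returning): emit a new vertex.
        if (d_prev >= 0.0) != (d_curr >= 0.0):
            output.append(intersect(previous, current, d_prev, d_curr))
        # Vertices inside the plane pass through unchanged.
        if d_curr >= 0.0:
            output.append(current)
    return output


def clip_triangle(triangle):
    """Clip a clip-space triangle (three (x, y, z, w) tuples) to the frustum."""
    polygon = list(triangle)
    for plane in PLANES:
        polygon = clip_against_plane(polygon, plane)
        if not polygon:
            break  # completely outside one plane: nothing left to clip
    return polygon
```

For example, a triangle with one vertex poking out past the right plane comes back as a four-vertex polygon, and a triangle entirely outside the frustum comes back as an empty list. Only positions are interpolated here; a real vertex would carry its other attributes through `intersect` the same way.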

This essentially creates a single convex polygon out of the clipped triangle with vertices at every point that intersected the frustum. It works wonderfully, and isn't all that processing heavy to perform. I am pretty happy with the results!
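
If the next stage only accepts triangles (an assumption; how the polygon is handed on is not covered here), a convex polygon can be split into a triangle fan around its first vertex. A minimal sketch, with the hypothetical name `triangulate_fan`:

```python
def triangulate_fan(polygon):
    """Split a convex polygon (list of vertices) into triangles sharing vertex 0."""
    return [
        (polygon[0], polygon[i], polygon[i + 1])
        for i in range(1, len(polygon) - 1)
    ]
```

A polygon with n vertices yields n - 2 triangles, and anything with fewer than three vertices yields none.
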

##### Multi-Threading
With the triangle clipper done, I am ready to get down to business on making the whole rendering pipeline multi-threaded. The overall pipeline consists of several processors linked together, including the following:

2. Back Face Culling Unit
3. Triangle Clipping Unit
4. Rasterizer
5. Depth Test Unit (Z-compare)

Up to this point, there was a memory buffer attached to the input and output of each processor. Each processor basically operates on its own, and doesn't care about what the other processors are doing. As long as there is an input to consume and space available in the output buffer, the processor keeps running.
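
A minimal sketch of that arrangement, assuming each processor turns exactly one input into one output. The class and function names (`Buffer`, `Processor`, `run_single_threaded`) are hypothetical and not taken from the actual renderer.

```python
from collections import deque


class Buffer:
    """Fixed-capacity FIFO sitting between two processors."""

    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity

    def has_item(self):
        return len(self.items) > 0

    def has_space(self):
        return len(self.items) < self.capacity


class Processor:
    """One pipeline stage: reads from `source`, writes to `sink`."""

    def __init__(self, transform, source, sink):
        self.transform = transform
        self.source = source
        self.sink = sink

    def step(self):
        # Only run when there is input to consume and room for the output.
        if not (self.source.has_item() and self.sink.has_space()):
            return False
        self.sink.items.append(self.transform(self.source.items.popleft()))
        return True


def run_single_threaded(processors):
    # Each stage runs until it stalls before the next stage starts, so every
    # buffer has to be big enough to hold a stage's entire output.
    for processor in processors:
        while processor.step():
            pass
```

If an intermediate buffer is too small, the upstream stage stalls with input still pending, which is why the single-threaded version needs large buffers.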

Since the whole thing was single threaded, I had to allocate buffers large enough to hold the entire set of input and output for each processor. This was no good for the rasterizer, depth tester, and pixel shaders because they each needed a buffer proportional to the number of pixels in the framebuffer. That takes a bunch of memory depending on the size of the elements that are being buffered!
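
To put a rough number on it, here is a back-of-the-envelope calculation. The 640x480 resolution and the 32-byte element size are made-up illustrative values, not measurements from the renderer, and `stage_buffer_bytes` is a hypothetical helper.

```python
def stage_buffer_bytes(width, height, bytes_per_element):
    """Worst case: one buffered element per framebuffer pixel."""
    return width * height * bytes_per_element


# Illustrative numbers only: 640x480 framebuffer, 32 bytes per element.
size = stage_buffer_bytes(640, 480, 32)
size_mib = size / (1024 * 1024)  # 9,830,400 bytes, about 9.4 MiB
```

That is per buffer, so several pixel-rate stages in a row multiply the cost.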

So the idea now is to convert each processor to run on its own thread. This keeps the model of consuming one input and writing out one output at a time. However, since every processor is now running all the time, there has to be a way of making sure that a buffer is not accessed by its two neighboring processors at the same time (one writing its output into it and the other reading its input from it). My first attempted solution here is to give each buffer a mutex and let the processors 'lock' the buffer while performing memory access operations.
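
A minimal sketch of the mutex idea using Python's `threading.Lock`. The names `LockedBuffer` and `run_demo` are hypothetical, and the spin-and-yield retry loops are just the simplest thing that works for a demonstration, not a tuned design.

```python
import threading
import time
from collections import deque


class LockedBuffer:
    """Bounded FIFO whose every access happens while holding a mutex."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()

    def try_push(self, item):
        with self._lock:
            if len(self._items) >= self._capacity:
                return False  # full: caller must retry later
            self._items.append(item)
            return True

    def try_pop(self):
        with self._lock:
            if not self._items:
                return False, None  # empty: caller must retry later
            return True, self._items.popleft()


def run_demo(count, capacity):
    """Push `count` integers through one buffer with two threads."""
    buffer = LockedBuffer(capacity)
    received = []

    def producer():
        for value in range(count):
            while not buffer.try_push(value):
                time.sleep(0)  # buffer full: yield and try again

    def consumer():
        while len(received) < count:
            ok, value = buffer.try_pop()
            if ok:
                received.append(value)
            else:
                time.sleep(0)  # buffer empty: yield and try again

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

Calling `run_demo(100, 4)` moves 100 values through a four-slot buffer and returns them in their original order.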

Since all of the processors are running all the time, the buffer sizes in between them can be sized much smaller, and can actually be specified at run time based on performance metrics of the given hardware that it is running on. Currently I am using fixed size buffers, but eventually I would like to auto-size the buffers and see how small they should really be. I wonder how the GPU manufacturers cache the inter-stage results in their massively parallel architectures? Surely they don't use too much memory, but it must depend on the number of pipelines they are using (or more recently, the number of processing streams they are using).
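
Here is a small sketch of stages running concurrently with a buffer size chosen at run time. It uses Python's standard `queue.Queue(maxsize=...)`, which blocks a producer when the buffer is full, in place of my own buffer class. `STOP`, `stage`, and `run_pipeline` are hypothetical names.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(transform, source, sink):
    """Run one processor: pull from source, push the result to sink."""
    while True:
        item = source.get()           # blocks while the input is empty
        if item is STOP:
            sink.put(STOP)            # pass the shutdown signal downstream
            return
        sink.put(transform(item))     # blocks while the output is full


def run_pipeline(values, transforms, buffer_size):
    # One bounded queue before each stage, plus one after the last stage.
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(transforms) + 1)]
    threads = [
        threading.Thread(target=stage, args=(transform, queues[i], queues[i + 1]))
        for i, transform in enumerate(transforms)
    ]

    def feed():
        for value in values:
            queues[0].put(value)
        queues[0].put(STOP)

    threads.append(threading.Thread(target=feed))
    for thread in threads:
        thread.start()

    results = []
    while True:                       # drain the final queue on this thread
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

With `buffer_size=2`, a thousand items still flow through two stages, because each buffer only has to hold what is in flight between neighbors rather than a stage's entire output.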

In any case, the multi-threaded pipeline is up and running. Once I get it optimized to the extent that I can, I am going to put together some benchmark tests and see if there are any multi-core CPU users out there who can test the performance advantage of running in parallel. I am very interested to see how it all turns out - my next post should be quite interesting!

Why not use a lockless queue to supply data to and from a processor? That way you wouldn't even have locks.

I'm not sure what you mean by a lockless queue system. I currently use queues as the buffer datastructure, but shouldn't I guarantee that when one processor is writing an element that the next processor can't access it before it is completely written?

This is my first adventure into multithreaded processing, so any help or suggestions are appreciated!

A Lock-Less queue is a queue that doesn't require locking. I'm sure there are all kinds of implementations, but the one I know allows a single thread to write, and a single (different) thread to read from it. The implementation of the queue makes sure that data can't be read before it is fully written.

I'm sure there are plenty of implementations online you could refer to. While locking isn't too horrible, if you could go without it, all the better.
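
A minimal sketch of the single-writer, single-reader idea, as a ring buffer where the producer only ever writes the tail index and the consumer only ever writes the head index. `SpscRingBuffer` is a hypothetical name. One big caveat: this relies on CPython executing the plain assignments in order; in C or C++ the two indices would need to be atomics with release/acquire ordering for the same guarantee.

```python
class SpscRingBuffer:
    """Single-producer, single-consumer queue with no lock."""

    def __init__(self, capacity):
        # One slot is always left empty so "full" and "empty" look different.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        tail = self._tail
        next_tail = (tail + 1) % len(self._slots)
        if next_tail == self._head:
            return False              # full
        self._slots[tail] = item      # write the element first...
        self._tail = next_tail        # ...then publish it to the reader
        return True

    def try_pop(self):
        head = self._head
        if head == self._tail:
            return False, None        # empty: nothing has been published
        item = self._slots[head]
        self._slots[head] = None      # drop the reference
        self._head = (head + 1) % len(self._slots)
        return True, item
```

The reader can never see a half-written element, because the tail index only moves after the slot has been filled.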

I'll have a look into it on the net. Thanks for the suggestion (rate++)!

EDIT: I rated you positively, but it didn't change your score :(
