Fast indexing of a 2x2 block Frame Buffer storage

Started by
6 comments, last by D.V.D 5 years, 11 months ago

Hey guys, I'm writing a software renderer for fun and I decided to try and optimize the renderer by storing the frame buffer and depth buffer as rows of 2x2 blocks of pixels. I figured it would be easier to SIMDify and be more cache local. My main function for the software renderer is traversing boxes on the screen to either check for occlusion or to color in pixels. So if my pixel format is rows of pixels, then the main thing I perform is this:


void draw(u32* Pixels, int minx, int miny, int maxx, int maxy)
{
    u32* PixelRow = Pixels + NumPixelsX*MinY + MinX;
    for (u32 Y = miny; Y < maxy; ++Y)
    {
        u32* CurrPixel = PixelRow;
        for (u32 X = minx; X < maxx; ++X)
        {
            *CurrPixel++ = some color/depth;
        }
        PixelRow += NumPixelsX;
    }
}

I converted this routine to SIMD and to process 2x2 pixels, but I found that by doing so, I added a big setup cost before the actual loop:


#define BLOCK_SIZE 2
#define NUM_BLOCK_X (ScreenX / BLOCK_SIZE)
global __m128i GlobalMinXMask[] =
{
    _mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
    _mm_setr_epi32(0, 0xFFFFFFFF, 0, 0xFFFFFFFF),
};
        
global __m128i GlobalMaxXMask[] =
{
    _mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
    _mm_setr_epi32(0xFFFFFFFF, 0, 0xFFFFFFFF, 0),
};
        
global __m128i GlobalMinYMask[] =
{
    _mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
    _mm_setr_epi32(0, 0, 0xFFFFFFFF, 0xFFFFFFFF),
};
        
global __m128i GlobalMaxYMask[] =
{
    _mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF),
    _mm_setr_epi32(0xFFFFFFFF, 0xFFFFFFFF, 0, 0),
};

inline void DrawNode(u32* Pixels, u32 MinX, u32 MaxX, u32 MinY, u32 MaxY)
{
    // NOTE: All of this is the setup cost right here
    u32 BlockMinX = MinX / BLOCK_SIZE;
    u32 BlockMaxX = MaxX / BLOCK_SIZE;
    u32 BlockMinY = MinY / BLOCK_SIZE;
    u32 BlockMaxY = MaxY / BLOCK_SIZE;

    u32 DiffMinX = MinX - BLOCK_SIZE*BlockMinX;
    u32 DiffMaxX = MaxX - BLOCK_SIZE*BlockMaxX;
    u32 DiffMinY = MinY - BLOCK_SIZE*BlockMinY;
    u32 DiffMaxY = MaxY - BLOCK_SIZE*BlockMaxY;

    if (DiffMaxX)
    {
        BlockMaxX += 1;
    }
    if (DiffMaxY)
    {
        BlockMaxY += 1;
    }

    f32* RowDepth = RenderState->DepthMap + Square(BLOCK_SIZE)*(BlockMinY*NUM_BLOCK_X + BlockMinX);
    __m128 NodeDepthVec = _mm_set1_ps(Z);
    for (u32 Y = BlockMinY; Y < BlockMaxY; ++Y)
    {
        f32* DepthBlock = RowDepth;
        for (u32 X = BlockMinX; X < BlockMaxX; ++X)
        {
            __m128i Mask = _mm_set1_epi32(0xFFFFFFFF);
            if (X == BlockMinX)
            {
                Mask = _mm_and_si128(Mask, GlobalMinXMask[DiffMinX]);
            }
            if (X == BlockMaxX - 1)
            {
                Mask = _mm_and_si128(Mask, GlobalMaxXMask[DiffMaxX]);
            }
            if (Y == BlockMinY)
            {
                Mask = _mm_and_si128(Mask, GlobalMinYMask[DiffMinY]);
            }
            if (Y == BlockMaxY - 1)
            {
                Mask = _mm_and_si128(Mask, GlobalMaxYMask[DiffMaxY]);
            }

            __m128 CurrDepthVec = _mm_maskload_ps(DepthBlock, Mask);
            __m128 Compared = _mm_cmp_ps(CurrDepthVec, NodeDepthVec, _CMP_GT_OQ);
            Mask = _mm_and_si128(Mask, _mm_castps_si128(Compared));
            _mm_maskstore_ps(DepthBlock, Mask, NodeDepthVec);

            DepthBlock += 4;
        }

        RowDepth += Square(BLOCK_SIZE)*NUM_BLOCK_X;
    }
}

(The min/max masks are if the min/max values fall inside a 2x2 block of pixels, we want to mask out writing/reading the pixels that aren't actually part of our range).

Calculating the block min/max as well as the diff min/max all seems to be a little much, I'm not sure if there is a much more efficient way to do that. I also wanted to take advantage of 8 wide SIMD using AVX so I figured I would have rows of 4x4 pixels, but where each block of 4x4 pixels itself stores 4 blocks of 2x2 pixels. Im worried that doing that will add a even larger setup cost to the loop which for my application would negate most of the benefits. 

My bottom line is, I want to optimize as much as I can the process of filling a box of pixels on the screen with a color because my software renderer does it a lot every frame (4 million times currently), and I figured storing pixels in 2x2 blocks would make it faster, but I'm not sure if I'm missing some trick to more quickly calculate which pixels I have to iterate over.

Advertisement

This is something I'm meaning to get around to, I read an excellent tutorial on software renderers a few weeks back and it isn't half as difficult as I expected (might even be viable in some cases because there are a lot of setup costs to doing anything on the GPU, versus no setup on the CPU, only the bandwidth for transfer).

Nothing springs out at me straight away with the small blocks as I've not done this yet, but have you also considered doing it as a tile renderer like mobiles do? That way you just have a 1 off cost of clipping your box (or triangle) to the larger tile, and render as normal just into a smaller tile, and hopefully get some cache benefits. SIMD is great in my experience but it works best when you have a whole lot of stuff to go do all at once, BLAT, instead of do a tiny bit, do some branching etc, do another bit...

5 hours ago, lawnjelly said:

This is something I'm meaning to get around to, I read an excellent tutorial on software renderers a few weeks back and it isn't half as difficult as I expected (might even be viable in some cases because there are a lot of setup costs to doing anything on the GPU, versus no setup on the CPU, only the bandwidth for transfer).

Nothing springs out at me straight away with the small blocks as I've not done this yet, but have you also considered doing it as a tile renderer like mobiles do? That way you just have a 1 off cost of clipping your box (or triangle) to the larger tile, and render as normal just into a smaller tile, and hopefully get some cache benefits. SIMD is great in my experience but it works best when you have a whole lot of stuff to go do all at once, BLAT, instead of do a tiny bit, do some branching etc, do another bit...

Yeah, writing software renderers is actually kind of fun :).

So when you say tile rendering, you mean binning? Like dividing the screen into 16x16 blocks, figuring out which objects lie in which block, and then later rendering everything in a block at a time? So I was going to do that (if thats what you are talking about) using multi threading but I guess I should say that my renderer is not a triangle rasterizer.

I'm writing a octree voxel splatter (again). I represent my geometry in the scene as an octree and all the leaf nodes are either marked full or empty, so the geometry is just a ton of cubes assembled with an octree. When I render, I traverse the octree in front to back order and render nodes that aren't occluded. When I render the nodes, rather than rasterizing a cube, I find the cubes bounding box on screen and fill those pixels as if that entire block was the cube. The idea is that once I have a around a single node per pixel, rasterizing a cube vs rendering its min/max box becomes the same. Below I attached some photos to show what I mean, I'm changing the max recursion depth so you can see what I'm rendering.

Pic2.thumb.png.c461481cfbb075b0bbc2c6acdc688f2c.pngPic3.thumb.png.91ad61972a0a5806b16bd7c38f773401.pngPic1.thumb.png.85fc80ff16d520426c0459b9364d2b77.png

So in my circumstance, its a little hard to do binning mostly because the main speed up for my algorithm is the occlusion culling, which I don't think can be made parallel. That's why I'm trying to optimize stuff like how fast I can render or check a box on the screen for some depth value instead. 

I've been thinking about the 4x4 blocks of pixels that I wanted to implement and I figured that instead of having 4x4 blocks with 2x2 blocks inside, I can have a 4x4 block and one AVX vector will be able to index 2 2x2 blocks at a time anyways, because its 8 wide. I attached a picture below showing what I mean.

20180511_123512.thumb.jpg.07aa2eeade4d78cafc13a9f289c3a184.jpg

The problem now becomes the masks. The setup costs are still the same so if someone has a way to reduce that, it would be very helpful but for this case, I also have to figure out how to reduce the setup cost for more masks. So lets say I'm writing to pixel 5, 6, 7, 9, 10, 11, 13, 14, 15. That's a 3x3 block inside of the 4x4 block. For masks, I have two options. One is to load all 16 pixels in the 4x4 block, and pick a mask for the configuration I'm writing. Unfortunately, this would mean a huge amount of masks because I'd need masks for every kind of rect that I want to draw (somewhere around 32 masks I think). If I instead point to pixel 5, then I would need to repeat a 4 wide mask of (0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0), across the pixels in this tile. So now I get a larger setup cost because I have to construct the masks for each block instead of just reading off a table, but I reduce the number of masks I'd have to store by a lot. 

I'm just thinking out loud right now, I have no idea if this will be any good. I was hoping there was some resource that explained how to better do stuff like this but from looking around online, I can't seem to find anything.

 

Another obvious thing if you are writing rects that are bigger than the blocks, is to predetermine which will be completely filled, and only do the complex check for edge cases, and use a simpler routine for fully filled blocks. But then it depends on the proportion of completely filled boxes, as to whether this is a win, and as you say you are going down finally to the level of pixels.

I'm presuming you done lots of profiling of this version versus more naive versions? What kind of results did you get? A comparison of a naive non SIMD line by line version versus the block version would be interesting. You could also try and make things branchless like you do in shaders and see if that helps. And I suspect you'd get more answers with a few more comments in the code, reading other people's intrinsics can be a little impenetrable lol! :)

19 hours ago, lawnjelly said:

Another obvious thing if you are writing rects that are bigger than the blocks, is to predetermine which will be completely filled, and only do the complex check for edge cases, and use a simpler routine for fully filled blocks. But then it depends on the proportion of completely filled boxes, as to whether this is a win, and as you say you are going down finally to the level of pixels.

I'm presuming you done lots of profiling of this version versus more naive versions? What kind of results did you get? A comparison of a naive non SIMD line by line version versus the block version would be interesting. You could also try and make things branchless like you do in shaders and see if that helps. And I suspect you'd get more answers with a few more comments in the code, reading other people's intrinsics can be a little impenetrable lol!

I actually had a routine I wrote for doing just that. It would only do masked reads for the edges/corners of the box. The only issue was that it failed for rendering 1 pixel sized boxes, it needed more if statement checks to do that correctly. I just left it alone since I figured I should at least get the easy case right. 

So I did profiling before and as I was writing code, but I think I made the most rookie mistake. When I was adding SIMD to my code, I didn't just apply it for pixel checks, I also did it for other parts of the program. So when I bench marked that, I was way faster than scalar code. I originally used AVX to check a line of 8 pixels at a time and later I did the 2x2 pixel approach. Using Very Sleepy profiler, I saw that those routines took around 8-12% of my rendering time so I figured oh they need to be optimized. What I should have checked was if the pixel checks only are faster in SIMD than they are in scalar code. 

So these are my profiling results:

Average Scalar: 611ms, Average 8x1 AVX: 652ms, Average 2x2 SSE: 644ms

With other parts of the rendering optimized for SIMD, the results are

Average Scalar: 103ms, Average 8x1 AVX: 112ms, Average 2x2 SSE: 129ms

So these are the average times it takes to render my frame, with 8x1 AVX checking a line of 8 pixels at a time, and 2x2 SSE checking a block of 2x2 pixels at a time. So yeah, this is really wild :P.

With Very Sleepy, the routine for pixel checking in scalar code takes like 2% of the total runtime, so its already incredibly fast. I guess I should have checked this earlier :P. It may be that in the future, once my scenes get more complex, Ill have a lot more nodes being occluded which require checking a block of pixels (currently, its somewhere between 10-20% of traversed nodes). So yeah, I lost a lot of time because of my ignorance XD.

I ended up looking only and I found that Ryg's blog has a post on traversing images which store data in 2x2 blocks. He provided the below code that shows how to address a particular pixel location if the texture is stored in tiles of pixels, with 4x4 pixels inside:


  // per-texture constants
  uint tileW = 4;
  uint tileH = 4;
  uint widthInTiles = (width + tileW-1) / tileW;

  // actual addressing
  uint tileX = x / tileW;
  uint tileY = y / tileH;
  uint inTileX = x % tileW;
  uint inTileY = y % tileH;

  pixel = image[(tileY * widthInTiles + tileX) * (tileW * tileH)
                + inTileY * tileW
                + inTileX];

The post can be found here: https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/. So the optimization would be to use the % operator instead of a multiplication and subtraction. I'll probably look more into this, mainly because its interesting to reorder 2d buffers into these kinds of formats, but it looks like for my app now, its a decent amount slower.

Lastly, yeah I should have commented my code more. I guess I got to into what I was doing and didn't realize if it was readable or not :P Thanks a lot for still sticking by though :)

Yes, that old chestnut, we've all been there, where our optimization made things slower! :) It is well worth doing these experiments though, but the lesson is let the profiler be your guide. I often find that SIMD version isn't faster than scalar code, very often it is something else like memory bandwidth / cache that is limiting, hence why packing your data well can be such a win. The more you do it the more you get a feel for where the bottlenecks are likely to be, and of course, it is fun!

Yeah it for sure is. Ill try to be more careful, but I'm happy I implemented the changes anyways, now I have an idea of how to do it :P. I think in my case I'm completely computationally bounded, almost all the time is taken projecting octree nodes on to the screen, and I wrote a SIMD version of that (its what got my times from 600ms to 100ms), but I got a couple more ideas for optimizations to make it faster! I'll try to be more careful from now XD.

This topic is closed to new replies.

Advertisement