SSE accelerated write with bit mask

5 comments, last by implicit 14 years, 10 months ago
I'm playing with a stencil buffer stored as 1 bit per pixel. I'd like to store 32-bit colors based upon the 1-bit mask. I see the intrinsic _mm_maskmoveu_si128 lets me write with a mask, but the mask is represented as the high bit in each byte for all 16 bytes. What I need in order to use this is a fast way to get from 4 bits to 16 bytes, so that I can write 4 colors at a time from my 4-bit mask. My current hack is to store dword[2] {0x0, 0xffffffff} and fill the 16-byte mask from these, but the number of instructions generated and the memory shuffling make it slower than not using SSE for this operation at all. Any thoughts, folks? Bit of an SSE noob here.
One simple option is a 16 entry array of 16 byte values, which you use as a lookup table (indexed with the 4 bit value). That should only require a couple of instructions.
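For the record, here's roughly what I mean, as an untested sketch (the names are mine): build the table once so that each 4-bit index expands to a full 16-byte mask, then a single load feeds _mm_maskmoveu_si128.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* One 16-byte entry per possible 4-bit mask: bit i of the index expands
   into four 0xFF bytes (one 32-bit pixel lane), everything else is 0x00,
   so every byte's high bit is exactly what _mm_maskmoveu_si128 tests. */
static uint8_t maskLUT[16][16];

static void initMaskLUT(void)
{
    for (int i = 0; i < 16; ++i)
        for (int bit = 0; bit < 4; ++bit)
            memset(&maskLUT[i][bit * 4], ((i >> bit) & 1) ? 0xFF : 0x00, 4);
}

/* Write 'color' to the pixels whose bits are set in the low nibble of maskBits. */
static void maskedStore4(uint32_t *colorPtr, __m128i color, unsigned maskBits)
{
    __m128i mask = _mm_loadu_si128((const __m128i *)maskLUT[maskBits & 0xF]);
    _mm_maskmoveu_si128(color, mask, (char *)colorPtr);
}
```

The table is only 256 bytes, so it should stay resident in L1 cache.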
Quote:Original post by Adam_42
One simple option is a 16 entry array of 16 byte values, which you use as a lookup table (indexed with the 4 bit value). That should only require a couple of instructions.

Thanks, that seems to generate simple code.

I was hoping to avoid tables, and I see there is a reverse function that creates a bit mask from a set of bytes. Unfortunately the result so far is slower than the non-SSE code. Perhaps I need to batch or unroll some more.
You're aware that the masked write instructions have some peculiar cache control semantics, right? That is, you might want to try manually masking and writing as well. More to the point, testing the stencil masking alone might not reflect the performance of the routine within the final system.
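By manual masking I mean an ordinary read-modify-write blend instead of the non-temporal maskmove, something like this untested sketch (the names are mine):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Read-modify-write blend: pixel = mask ? color : pixel.
   Unlike _mm_maskmoveu_si128 this is a normal cached store, which can
   be faster when the destination lines will be touched again soon. */
static void blendStore4(uint32_t *colorPtr, __m128i color, __m128i mask)
{
    __m128i dst = _mm_loadu_si128((const __m128i *)colorPtr);
    __m128i out = _mm_or_si128(_mm_and_si128(mask, color),
                               _mm_andnot_si128(mask, dst));
    _mm_storeu_si128((__m128i *)colorPtr, out);
}
```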

At any rate one option would be to swizzle the stencil bitmap a bit.
Instead of letting the bit-index within the word correspond to horizontal coordinates you could let it signify the least-significant 5 bits of the vertical coordinate. That way you get four horizontal pixels at the same bit-offset within four consecutive 32-bit words, much as with the normal pixel buffer.

Err.. That sounds *way* more complicated than it really is, what I mean is something along these lines:
uint32_t bitmap[480 >> 5][640];

void set_pixel(unsigned x, unsigned y)
{
    bitmap[y >> 5][x] |= 1 << (y & 31);
}

To actually extract the right bits and build the byte mask you might use a variable left shift followed by an arithmetic 31-bit right shift, or perhaps an AND instruction to isolate the four bits of interest and a compare against zero to build the masks.
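For instance, the AND-and-compare variant might look something like this (an untested sketch with my names, comparing against the isolated bit values rather than zero, which saves an inversion):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Expand the low four bits of maskBits into four 32-bit lane masks with
   no table: broadcast the nibble, isolate one bit per lane, and compare. */
static __m128i expandMask4(unsigned maskBits)
{
    const __m128i lanebits = _mm_set_epi32(8, 4, 2, 1);
    __m128i v = _mm_set1_epi32((int)(maskBits & 0xF));
    v = _mm_and_si128(v, lanebits);
    return _mm_cmpeq_epi32(v, lanebits); /* all-ones where the bit was set */
}
```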

Still, I wonder whether it's really worth the effort; I would have thought the LUT method would be plenty fast enough. Personally I'd look into more high-level optimizations instead, such as a separate test pass to check whether an entire tile is completely blank or filled.
Quote:Original post by implicit
You're aware that the masked write instructions have some peculiar cache control semantics, right? That is you might want to try manually masking and writing as well...

Thank you for your suggestions. No, I'm not familiar with cache control semantics of the SSE masked write or some of the other timing effects of SSE. I need to learn more. Feel free to share any links or resources to such topics.

I have tried a number of stencil formats, the current being 4x8 bits to a dword, representing 4 columns and 8 rows of pixels, which I think is similar to your description. The different formats show some benefit depending on the size, orientation and position of the primitives I'm testing. I have not yet tried multi-level grids or a pyramid. I can see how testing can benefit, but maintaining it may be costly.
I'm sure these SSE instructions improve performance when used appropriately.
Just for fun, I thought I'd show that this code:
int writeMask0 = 0 - ((maskBits >> 0) & 1);
int writeMask1 = 0 - ((maskBits >> 1) & 1);
int writeMask2 = 0 - ((maskBits >> 2) & 1);
int writeMask3 = 0 - ((maskBits >> 3) & 1);
colorPtr[0] += color & writeMask0;
colorPtr[1] += color & writeMask1;
colorPtr[2] += color & writeMask2;
colorPtr[3] += color & writeMask3;

(Yes, I'm adding to the previously zeroed output buffer, quite undesirable.)
Is faster than this code in my current usage:
_mm_maskmoveu_si128(color, (__m128i&)maskLUT[maskBits & 0xf], (char*)colorPtr);

Feel free to comment.
Quote:Original post by GregDude
I have tried a number of stencil formats, the current being 4x8 bits to a dword representing 4 columns and 8 rows of pixels, which I think is similar to your description. The different formats show some benefit depending on the size, orientation and position of primitives I'm testing. I have not yet tried multi level grids or pyramid. I can see how testing can benefit but maintaining it may be costly.
The point of the format was to minimize the amount of bit twiddling during blitting. Well, that, and to simplify addressing, since the stencil pointer of a horizontal run would match the normal horizontal pixel offset and avoid some nastiness with unaligned operations.

Something along these lines:
uint32_t stencil[480 >> 5][640];
uint32_t pixels[480][640];

void merge(uint32_t color)
{
    size_t x, y;
    for (y = 0; y < 480; ++y) {
        uint32_t const stencil_bit = ~y & 31;
        const uint32_t *const stencil_ptr = stencil[y >> 5];
        uint32_t *const pixel_ptr = pixels[y];
        for (x = 0; x < 640; ++x) {
            uint32_t mask = stencil_ptr[x];
            mask = (int32_t)(mask << stencil_bit) >> 31;
            pixel_ptr[x] = (pixel_ptr[x] & mask) | (color & ~mask);
            /* pixel_ptr[x] = ((pixel_ptr[x] ^ color) & mask) ^ color; */
        }
    }
}
Of course a layout which reduces the cost of other operations, such as testing whether or not a block is filled, might be preferable.

[Edited by - implicit on June 5, 2009 1:37:17 AM]

