
Software renderer: write 4 colors at once without reading old pixels, how?

This topic is 1991 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Hello!

I'm writing a simple software renderer and trying to use SSE to accelerate it.

How would one write 4 pixels into the colorbuffer without first reading the previous contents of the framebuffer?

Is there a magical instruction for writing words of an SSE register selectively, based on mask?

I only know of the _mm_store_si128() intrinsic, but it writes the whole register to memory,
so I need to fetch the old 4 colors, combine them with the computed colors using a bit mask, and write them back.
I'd like to avoid reading the old pixels.

Right now I'm using _mm_movemask_ps() to work out which pixels are inside the triangle, followed by costly 'if' branches:
[CODE]
if( mask & 1 ) {
    pixels[iX] = RGBA8_WHITE;
}
if( mask & 2 ) {
    pixels[iX+1] = RGBA8_WHITE;
}
if( mask & 4 ) {
    pixels[iX+2] = RGBA8_WHITE;
}
if( mask & 8 ) {
    pixels[iX+3] = RGBA8_WHITE;
}
[/CODE] Edited by Anfaenger

In AVX, there's the [url="http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/cpp/lin/intref_cls/common/intref_avx_maskstore_ps.htm"]_mm_maskstore_ps/vmaskmovps[/url] instruction. In SSE2, there's the [url="http://msdn.microsoft.com/en-us/library/yyhs9sh7.aspx"]_mm_maskmoveu_si128/maskmovdqu[/url] instruction, but note that it belongs to the class of byte-wide integer instructions, so it can cause a few cycles of pipeline stall when a transition from float mode to int mode occurs (profile to check).
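As a rough illustration of the SSE2 option, here is a minimal sketch of a masked store of four 32-bit pixels. The helper name masked_store4 is made up for this example. The instruction works per byte: a byte is written only when the top bit of the matching mask byte is set, so a compare result (all ones or all zeros per lane) can be used directly as the mask. Intel also documents maskmovdqu as carrying a non-temporal hint, which is one more reason to profile before relying on it.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: write 'color' only to the lanes selected by 'mask'.
   Each lane of 'mask' is expected to be all ones (write) or all zeros (skip).
   The destination does not have to be 16-byte aligned, and it is not read. */
static inline void masked_store4(uint32_t *dest, __m128i mask, uint32_t color)
{
    const __m128i src = _mm_set1_epi32((int)color);
    _mm_maskmoveu_si128(src, mask, (char *)dest);
}
```

Lanes whose mask is zero keep whatever was in memory before the call.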

If you are doing a manual load-blend-store, there are the [url="http://msdn.microsoft.com/en-us/library/bb514102.aspx"]_mm_blend_ps/blendps [/url] and [url="http://msdn.microsoft.com/en-us/library/bb514075.aspx"]_mm_blendv_ps/blendvps[/url] instructions in SSE4.1, which can help, although that kind of load followed by a store can have a large performance impact. On targets older than SSE4.1, the same blend between registers can be done with a sequence of and + andnot + or instructions.
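The and + andnot + or sequence could look roughly like the sketch below. The helper names select_si128 and blend_store4 are invented for the example, and the mask is assumed to be all ones or all zeros per 32-bit lane, as produced by a compare intrinsic.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Per-bit select: where 'mask' is 1 take 'newPix', elsewhere keep 'oldPix'.
   Note that _mm_andnot_si128(a, b) computes (~a) & b. */
static inline __m128i select_si128(__m128i mask, __m128i newPix, __m128i oldPix)
{
    return _mm_or_si128(_mm_and_si128(mask, newPix),
                        _mm_andnot_si128(mask, oldPix));
}

/* Load-blend-store of four 32-bit pixels; 'dest' must be 16-byte aligned. */
static inline void blend_store4(uint32_t *dest, __m128i mask, uint32_t color)
{
    assert(((uintptr_t)dest & 15) == 0);
    const __m128i oldPix = _mm_load_si128((const __m128i *)dest);
    const __m128i newPix = _mm_set1_epi32((int)color);
    _mm_store_si128((__m128i *)dest, select_si128(mask, newPix, oldPix));
}
```

Unlike OR-ing the masked color onto the old pixels, this select also gives the right result for colors other than all-ones white.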

I recommend the [url="http://software.intel.com/en-us/avx/"]Intel Intrinsics Guide[/url], which lists the instructions in an easily searchable format.

wow, thanks, that's exactly what i need!

by transition from float mode to int mode you mean load-hit-store penalties?

right now i'm doing this:
[CODE]
__m128i* dest = (__m128i*) (pixels + iX);
Assert(IS_16_BYTE_ALIGNED(dest));
const __m128i oldQuad = _mm_load_si128( dest );
__m128i result = _mm_or_si128( oldQuad, _mm_and_si128( _mm_set1_epi32(RGBA8_WHITE), mask ) );
_mm_store_si128( dest, result );
[/CODE]

but i'd like to avoid stalls due to loading 'oldQuad' when i'm not doing blending.
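One variation that is not proposed anywhere in this thread, shown only as a sketch: test the coverage mask first and skip the read whenever the quad is fully covered or fully empty. The helper name write_quad is made up, unaligned loads and stores are used only to keep the example self-contained, and whether the extra branches pay off depends on how predictable the coverage is, so it would need measuring.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: write 'color' to the lanes selected by 'mask'
   (all ones or all zeros per 32-bit lane). The old pixels are read only
   when the quad is partially covered. */
static inline void write_quad(uint32_t *dest, __m128i mask, uint32_t color)
{
    const __m128i c = _mm_set1_epi32((int)color);
    const int bits = _mm_movemask_epi8(mask);  /* one bit per mask byte */

    if (bits == 0)
        return;                                /* no pixel covered */
    if (bits == 0xFFFF) {
        _mm_storeu_si128((__m128i *)dest, c);  /* all covered: no read */
        return;
    }
    /* Partial coverage: fall back to load, select, store. */
    const __m128i old = _mm_loadu_si128((const __m128i *)dest);
    _mm_storeu_si128((__m128i *)dest,
                     _mm_or_si128(_mm_and_si128(mask, c),
                                  _mm_andnot_si128(mask, old)));
}
```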

UPDATE:

i've just tried using _mm_maskmoveu_si128() and it was actually slower than the above version (FPS went from ~218 to ~187 in the same scene).
here is my inner loop code:

[CODE]
for( UINT iX = iBlockX; iX < iBlockX + BLOCK_SIZE_X; iX += SSE_REG_WIDTH )
{
    const __m128i qiCX1mask = _mm_cmpgt_epi32( qiCX1, _mm_setzero_si128() );
    const __m128i qiCX2mask = _mm_cmpgt_epi32( qiCX2, _mm_setzero_si128() );
    const __m128i qiCX3mask = _mm_cmpgt_epi32( qiCX3, _mm_setzero_si128() );
    const __m128i qiCXmask = _mm_and_si128( qiCX1mask, _mm_and_si128( qiCX2mask, qiCX3mask ) );
    __m128i* dest = (__m128i*) (pixels + iX);
    Assert(IS_16_BYTE_ALIGNED(dest));
#if 0
    // load previous pixels
    const __m128i oldQuad = _mm_load_si128( dest );
    __m128i result = _mm_or_si128( oldQuad, _mm_and_si128( _mm_set1_epi32(RGBA8_WHITE), qiCXmask ) );
    _mm_store_si128( dest, result );
#else
    __m128i result = _mm_set1_epi32(RGBA8_WHITE);
    _mm_maskmoveu_si128( result, qiCXmask, (char*)dest );
#endif
    qiCX1 = _mm_sub_epi32( qiCX1, qiFDY12_4 );
    qiCX2 = _mm_sub_epi32( qiCX2, qiFDY23_4 );
    qiCX3 = _mm_sub_epi32( qiCX3, qiFDY31_4 );
}//for x
[/CODE] Edited by Anfaenger

You should not worry about the loading/storing and masking. In the end, the memory controller loads a whole cache line into L2 and L1. From there it doesn't really matter whether you load/store 1 byte or 32 bytes: modern CPUs (Sandy Bridge, Ivy Bridge) can load two 16-byte words per cycle, and most older ones can still load 16 bytes per cycle. Internally it is impossible to address just one byte in memory anyway; the load/store unit has to fetch the surrounding data, modify the bytes you changed, and store them back together with the whole bunch of data you did not modify.

To get the best performance, focus on using as few instructions as possible when you have data dependencies like those in your code, e.g.
[code]
const __m128i oldQuad = _mm_load_si128( dest );
__m128i result = _mm_or_si128( oldQuad, _mm_and_si128( _mm_set1_epi32(RGBA8_WHITE), qiCXmask ) ); // Stall
_mm_store_si128( dest, result ); //stall
[/code]
You have two potential stalls here: if the out-of-order units cannot find other independent instructions, the OR will wait until the LOAD and AND are done, and the store will in turn wait for the OR. This might end up costing a lot.

Your previous example:
[code]
if( mask & 1 )
    pixels[iX] = RGBA8_WHITE;
if( mask & 2 )
    pixels[iX+1] = RGBA8_WHITE;
if( mask & 4 )
    pixels[iX+2] = RGBA8_WHITE;
if( mask & 8 )
    pixels[iX+3] = RGBA8_WHITE;
[/code]
might suffer from branch misprediction.

So I would suggest that a plain load, blend, store is probably the nicest workload for your pipeline. If you build for 64-bit, unroll your loop to process 4 lines at the same time (just as you now process 4 pixels in a line), so you'd work on 16 pixels per iteration. The compiler will do a good job of utilizing all 16 SSE registers, and you'll probably end up with fewer cycles per pixel.
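A minimal sketch of what the 4-line unrolling might look like, under several assumptions: the helper name fill_block_4x4, the pitch parameter (row stride in pixels) and the masks array are all invented for the example, and the masks are taken as precomputed rather than derived from the edge functions. All four loads are issued before any store, so the four rows form independent dependency chains.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Per-bit select: where 'm' is 1 take 'c', elsewhere keep 'old'. */
static inline __m128i select4(__m128i m, __m128i c, __m128i old)
{
    return _mm_or_si128(_mm_and_si128(m, c), _mm_andnot_si128(m, old));
}

/* Hypothetical helper: fill one 4x4 pixel block, one quad per row.
   'topLeft' must be 16-byte aligned and 'pitch' a multiple of 4.
   masks[r] is all ones in each 32-bit lane covered in row r. */
static void fill_block_4x4(uint32_t *topLeft, size_t pitch,
                           const __m128i masks[4], uint32_t color)
{
    assert(((uintptr_t)topLeft & 15) == 0 && (pitch & 3) == 0);

    const __m128i c = _mm_set1_epi32((int)color);
    __m128i *row0 = (__m128i *)(topLeft + 0 * pitch);
    __m128i *row1 = (__m128i *)(topLeft + 1 * pitch);
    __m128i *row2 = (__m128i *)(topLeft + 2 * pitch);
    __m128i *row3 = (__m128i *)(topLeft + 3 * pitch);

    /* Four loads that do not depend on each other. */
    const __m128i old0 = _mm_load_si128(row0);
    const __m128i old1 = _mm_load_si128(row1);
    const __m128i old2 = _mm_load_si128(row2);
    const __m128i old3 = _mm_load_si128(row3);

    /* Blend and store each row. */
    _mm_store_si128(row0, select4(masks[0], c, old0));
    _mm_store_si128(row1, select4(masks[1], c, old1));
    _mm_store_si128(row2, select4(masks[2], c, old2));
    _mm_store_si128(row3, select4(masks[3], c, old3));
}
```

An out-of-order core may already overlap consecutive iterations of the simple loop, so the gain from unrolling by hand is something to measure rather than assume.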
