Efficiently obtaining Red Channel from BGRA Bitmap

Started by
21 comments, last by Adam_42 11 years, 3 months ago

byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
			
for (int i = 0; i < Height; i++)
{
	for (int j = 0; j < Width; j++)
	{
		int offset = i*Width + j;
		r[offset] = bgra[offset*4 + 2];
	}
}
delete[] r;

I'm using the above code to obtain red channel values from a byte array of a BGRA bitmap image. The image is formatted as:


B G R A B G R A... (Size of W*H*4)

I want to obtain a byte array of

R R R R... (Size of W*H)

Is there a more efficient way of doing this without using for loops?


You could probably eliminate that *4 but that's insignificant.

It's possible to achieve the same thing with shaders: output the red color into a single-channel render target. But the latency will kill any performance you gain.


byte* bgra	= byte array of a BGRA formatted bitmap image;
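byte* r		= new byte[Height*Width];
byte *rDest = r;          // walk a copy so r still points at the start of the output
byte *rSource = bgra+2;   // first red byte in the BGRA data
int iPixels = Height*Width;

for (int i=0;i<iPixels;i++,rDest++,rSource+=4)
{    
    *rDest = *rSource;    // copy one red byte, then step 4 bytes to the next pixel
}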

I'd probably do something like this. I doubt it'd make much difference in the grand scheme of things.

Would it be possible to somehow transfer the single-channel render target data to system memory with D3D10 CopyResource or D3D9 GetRenderTargetData? I need to be able to access the red channel on the CPU.

I'm afraid not. It copies all or part of a resource and doesn't pick out individual bytes. You'd have to render a quad with a pixel shader, then copy the red render target into RAM to get CPU access.

I'm not sure what you mean by copying the red render target to RAM. I thought render targets have to be 32-bit aligned. Is there an example of how to extract only the red channel from a render target texture using a pixel shader?


You seem to be under the [incorrect] assumption that for loops are somehow slow. Chances are that code is perfectly fast. You might be able to save a bit in the pointer arithmetic, since you don't really need to compute the offset from scratch each time: it's just one more than its value in the previous iteration of the loop, so you can do it with a counter. But even that probably won't matter much.

You should generally only worry about performance when you have evidence that this operation is taking too much time in your program.
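One way to gather that evidence is to time the loop directly. The following is only a rough sketch: the names extractRed and averageMicroseconds are made up for illustration, the image is assumed to be tightly packed BGRA as in the question, and an optimizing compiler may collapse repeated identical runs, so a real profiler will give more trustworthy numbers.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): copies the red byte of each
// tightly packed BGRA pixel into r.
void extractRed(const std::uint8_t* bgra, std::uint8_t* r, std::size_t pixels)
{
    for (std::size_t i = 0; i < pixels; ++i)
        r[i] = bgra[4 * i + 2];
}

// Hypothetical helper: average wall-clock time of one extractRed call,
// in microseconds, over `runs` repetitions on a dummy image.
double averageMicroseconds(std::size_t width, std::size_t height, int runs)
{
    std::vector<std::uint8_t> bgra(width * height * 4, 0x7F);
    std::vector<std::uint8_t> r(width * height);

    const auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < runs; ++n)
        extractRed(bgra.data(), r.data(), width * height);
    const auto stop = std::chrono::steady_clock::now();

    // Read the output once so the copies are not trivially dead code.
    volatile std::uint8_t sink = r[0];
    (void)sink;

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

Calling something like averageMicroseconds(1920, 1080, 100) and comparing the result with your frame budget tells you whether this copy is worth optimizing at all.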


You can go further than the pointer-stepping loop posted above; the counter 'i' is not needed:
byte *bgra    = <byte array of a BGRA formatted bitmap image>;
byte *r       = new byte[Height*Width];
byte *rbegin  = r;
byte *rend    = r + Height*Width;
byte *rSource = bgra+2;

while (rbegin < rend)
{    
    *rbegin++ = *rSource;
    rSource += 4;
}
I just tried all of the solutions given above, and they have the exact same performance. So just write whatever is easiest to read. I personally would write this:
  byte *bgra = <byte array of a BGRA formatted bitmap image>;
  byte *r = new byte[Height*Width];

  for (int i=0; i<Height*Width; ++i)
    r[i] = bgra[4*i+2];
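For anyone who wants to paste this into a project, here is the same loop as a self-contained function. This is a sketch, not what anyone in the thread measured: the name redChannel is hypothetical, std::vector stands in for the raw new[] so nothing has to be deleted by hand, and the rows are assumed to be tightly packed with no padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the thread): returns one red byte per pixel
// from a tightly packed BGRA buffer of size Width*Height*4.
std::vector<std::uint8_t> redChannel(const std::vector<std::uint8_t>& bgra)
{
    const std::size_t pixels = bgra.size() / 4;  // 4 bytes per pixel: B, G, R, A
    std::vector<std::uint8_t> r(pixels);

    for (std::size_t i = 0; i < pixels; ++i)
        r[i] = bgra[4 * i + 2];                  // offset 2 within a pixel is red

    return r;
}
```

The returned vector owns its memory, so the result stays valid for as long as the caller keeps it.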

All of the vanilla C++ that's been posted is about as efficient as you're going to get.

However, if you can prove that this is still a bottleneck for you, you could further try:

Pre-warm the cache by reading ahead (depends on cache-line size, but probably 8 or 16 source pixels)

Unroll loop x4 (read), coalesce writes (need to add some code to deal with non-multiple-of-4 source data).

Drop down to SSE or AVX assembly/intrinsics (coalesce more writes, using shuffle instructions)
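The unroll-and-coalesce suggestion above could look roughly like this. It is a sketch under assumptions: extractRedUnrolled is a made-up name, the input is tightly packed BGRA, and whether the 4-byte memcpy really becomes a single store depends on the compiler (it typically does with optimizations on).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the thread): handles four pixels per
// iteration and writes their four red bytes with one 4-byte copy.
void extractRedUnrolled(const std::uint8_t* bgra, std::uint8_t* r, std::size_t pixels)
{
    std::size_t i = 0;

    // Main loop: four source pixels (16 bytes) in, four red bytes out.
    for (; i + 4 <= pixels; i += 4)
    {
        const std::uint8_t* src = bgra + 4 * i;
        const std::uint8_t packed[4] = { src[2], src[6], src[10], src[14] };
        std::memcpy(r + i, packed, 4);  // coalesced write, endian-independent
    }

    // Tail: the 0 to 3 pixels left over when the count is not a multiple of 4.
    for (; i < pixels; ++i)
        r[i] = bgra[4 * i + 2];
}
```

Modern compilers often unroll the plain loop on their own, so measure before assuming this version is any faster.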

I would try those things in that order, but remember: fast for fast's sake is a silly goal unless it's an academic exercise. In "the real world" the best solution is usually the simplest one that is fast enough. Optimizing without profiling is the coding equivalent of shooting first and asking questions later.
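If profiling does point at this loop, the intrinsics route might be sketched as follows. Note the assumptions: this version uses SSE2 shift, mask and pack instructions rather than the shuffle instructions mentioned above (a byte shuffle needs SSSE3), the names extractRedSse2 and HAVE_SSE2 are made up for illustration, the data is tightly packed BGRA, and the code falls back to the scalar loop when SSE2 is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1     // hypothetical macro name used only in this sketch
#endif

// Hypothetical helper (not from the thread): 16 pixels per iteration with SSE2.
void extractRedSse2(const std::uint8_t* bgra, std::uint8_t* r, std::size_t pixels)
{
    std::size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i mask = _mm_set1_epi32(0xFF);
    for (; i + 16 <= pixels; i += 16)
    {
        const __m128i* src = reinterpret_cast<const __m128i*>(bgra + 4 * i);
        // A little-endian 32-bit load of B,G,R,A gives 0xAARRGGBB per lane.
        // Shift right by 16 and mask so each lane holds only the red value.
        const __m128i p0 = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(src + 0), 16), mask);
        const __m128i p1 = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(src + 1), 16), mask);
        const __m128i p2 = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(src + 2), 16), mask);
        const __m128i p3 = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(src + 3), 16), mask);
        // Narrow 32 -> 16 -> 8 bits. Values are 0..255, so saturation never clips.
        const __m128i lo = _mm_packs_epi32(p0, p1);
        const __m128i hi = _mm_packs_epi32(p2, p3);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(r + i), _mm_packus_epi16(lo, hi));
    }
#endif
    // Scalar tail (or the whole image when SSE2 is unavailable).
    for (; i < pixels; ++i)
        r[i] = bgra[4 * i + 2];
}
```

The unaligned load and store intrinsics are used so the buffers need no special alignment.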


This topic is closed to new replies.
