gpu_noob

Efficiently obtaining Red Channel from BGRA Bitmap

gpu_noob    114
byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
			
for (int i = 0; i < Height; i++)
{
	for (int j = 0; j < Width; j++)
	{
		int offset = i*Width + j;
		r[offset] = bgra[offset*4 + 2];
	}
}
delete[] r;

 

I'm using the above code to obtain red channel values from a byte array of a BGRA bitmap image. The image is formatted as:


B G R A B G R A... (Size of W*H*4)

 

I want to obtain a byte array of 

 

R R R R... (Size of W*H)

 

Is there a more efficient way of doing this without using for loops?

 

Zaoshi Kaba    8434

You could probably eliminate that *4 but that's insignificant.

 

It's possible to achieve the same using shaders: output the red color into a single-channel render target, but latency will kill any performance you gained.

C0lumbo    4411
byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
byte *rSource = bgra+2;
int iPixels = Height*Width;

for (int i=0;i<iPixels;i++,r++,rSource+=4)
{    
    *r = *rSource;
}

I'd probably do something like this. I doubt it'd make much difference in the grand scheme of things.

gpu_noob    114
You could probably eliminate that *4 but that's insignificant.

 

It's possible to achieve the same using shaders: output the red color into a single-channel render target, but latency will kill any performance you gained.

 

Would it be possible to somehow transfer the single-channel render target data to system memory with D3D10 CopyResource or D3D9 GetRenderTargetData? I need to be able to access the red channel on the CPU.

Zaoshi Kaba    8434
I'm afraid not. It copies all or part of a resource and doesn't pick individual bytes. You'd have to render a quad with a pixel shader, then copy the red render target into RAM to have CPU access.

gpu_noob    114
I'm afraid not. It copies all or part of a resource and doesn't pick individual bytes. You'd have to render a quad with a pixel shader, then copy the red render target into RAM to have CPU access.

 

I'm not sure what you mean by copying the red render target to RAM. I thought render targets had to be 32-bit aligned. Is there an example of how to extract only the red channel from a render-target texture using pixel shaders?

alvaro    21246
Is there a more efficient way of doing this without using for loops?

You seem to be under the [incorrect] assumption that for loops are somehow slow. Chances are that code is perfectly fast. You might be able to save a bit in the pointer arithmetic, since you don't really need to compute the offset from scratch each time: It's just one more than the value it was in the previous iteration of the loop, so you can do it with a counter. But even that probably won't matter much.

You should generally only worry about performance when you have evidence that this operation is taking too much time in your program.

iMalc    2466

byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
byte *rSource = bgra+2;
int iPixels = Height*Width;

for (int i=0;i<iPixels;i++,r++,rSource+=4)
{    
    *r = *rSource;
}

I'd probably do something like this. I doubt it'd make much difference in the grand scheme of things.
You can go further than that, 'i' is not needed:
byte *bgra    = <byte array of a BGRA formatted bitmap image>;
byte *r       = new byte[Height*Width];
byte *rbegin  = r;
byte *rend    = r + Height*Width;
byte *rSource = bgra+2;

while (rbegin < rend)
{    
    *rbegin++ = *rSource;
    rSource += 4;
}

alvaro    21246
I just tried all of the solutions given above, and they have the exact same performance. So just write whatever is easiest to read. I personally would write this:
  byte *bgra = <byte array of a BGRA formatted bitmap image>;
  byte *r = new byte[Height*Width];

  for (int i=0; i<Height*Width; ++i)
    r[i] = bgra[4*i+2];
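
A comparison like this is easy to repeat on your own machine. The following is only a rough sketch of how it could be set up, not the test that was actually run: the three function names and the `time_ms` helper are made up for illustration, and timing is done with `std::chrono` rather than anything Windows-specific.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Variant 1: nested loops, offset recomputed for every pixel (as in the first post).
void extract_nested(const std::uint8_t* bgra, std::uint8_t* r, int width, int height) {
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            int offset = i * width + j;
            r[offset] = bgra[offset * 4 + 2];
        }
    }
}

// Variant 2: walking pointers, no index variable.
void extract_pointers(const std::uint8_t* bgra, std::uint8_t* r, int width, int height) {
    std::uint8_t* rend = r + static_cast<std::size_t>(width) * height;
    const std::uint8_t* src = bgra + 2;   // first red byte
    while (r < rend) {
        *r++ = *src;
        src += 4;                          // next pixel
    }
}

// Variant 3: one flat loop over all pixels.
void extract_flat(const std::uint8_t* bgra, std::uint8_t* r, int width, int height) {
    const int n = width * height;
    for (int i = 0; i < n; ++i)
        r[i] = bgra[4 * i + 2];
}

// Hypothetical helper: average wall-clock milliseconds per call over `runs` calls.
template <class F>
double time_ms(F f, int runs = 20) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int k = 0; k < runs; ++k) f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / runs;
}
```

Calling `time_ms([&] { extract_flat(src, dst, Width, Height); })` once per variant on the same buffer, in an optimized build, should show whether the three differ at all on a given compiler and CPU.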

Ravyne    14300

All of the vanilla C++ that's been posted is about as efficient as you're going to get.

 

However, if you can prove that this is still a bottleneck for you, you could further try:

 

Pre-warm the cache by reading ahead (depends on cache-line size, but probably 8 or 16 source pixels)

Unroll loop x4 (read), coalesce writes (need to add some code to deal with non-multiple-of-4 source data).

Drop down to SSE or AVX assembly/intrinsics (coalesce more writes, using shuffle instructions)

 

I would try those things in that order, but remember -- fast for fast's sake is a silly goal unless it's an academic exercise. In "the real world" the best solution is usually the simplest one that is fast enough. Optimizing without profiling is the coding equivalent of shooting first and asking questions later.
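
For the "unroll x4, coalesce writes" step, a minimal sketch could look like the following. The function name is made up for illustration, and whether this beats the plain loop depends on the compiler and CPU, so it is something to measure rather than assume.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies the red byte (offset 2) of every BGRA pixel into `r`.
// Four source pixels are read per iteration and written out as one 4-byte block.
void extract_red_unrolled(const std::uint8_t* bgra, std::uint8_t* r, std::size_t pixels) {
    std::size_t i = 0;
    for (; i + 4 <= pixels; i += 4) {
        const std::uint8_t* p = bgra + i * 4;
        const std::uint8_t packed[4] = { p[2], p[6], p[10], p[14] };
        std::memcpy(r + i, packed, 4);   // compilers typically turn this into a single 32-bit store
    }
    // Tail: handles pixel counts that are not a multiple of 4.
    for (; i < pixels; ++i)
        r[i] = bgra[i * 4 + 2];
}
```

Going through `memcpy` keeps the write free of alignment and aliasing problems while still letting the optimizer merge the four bytes into one store.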

gpu_noob    114

This is indeed a bottleneck for me. I suspect this has to do with how many memory read/write operations happen at low level. The problem is that it's using up a lot of CPU power for larger images (1000x1000)

 

As an alternative, how can I use the GPU to obtain only the red channel? I'm currently rendering the BGRA bitmap image in Direct3D9, copying it to system memory using GetRenderTargetData() and LockRect(), and then copying the red channel using the above method.

 

Is there any Direct3D way of copying only Red channel to system memory?

C0lumbo    4411

Could you detail what it is that you're doing that requires you having the R channel accessible on the CPU?

 

Perhaps there's an alternative approach that can get you what you want. e.g. Perhaps whatever it is you're doing with your R channel on the CPU can actually be done on the GPU? Perhaps you can achieve your goal with a downsampled copy of your RGBA framebuffer so you'd only copy 1/4 of the pixels? Perhaps your rendering can be done on a single channel render buffer in the first place?

 

Finally, are you sure it's the copying around of the data that's actually taking the time, and it's not just the latency from the sync point between requesting a copy of the frame buffer and actually getting it on the CPU that's hurting your app's performance? If that's the case then perhaps you can insert double buffering so that you do your CPU work on the previous frame's buffer instead of the current frame's buffer.

gpu_noob    114

I'm not sure how to profile properly, but I used QueryPerformanceCounter to check the execution times: LockRect (0.003ms), GetRenderTargetData (0.5ms), extracting the red channel (5ms).

 

The BGRA image actually holds YUV-formatted data, with the luminance in the red channel. I need to record the red channel data because it contains Y, which is used in a video encoding algorithm.

tivolo    1367

The execution time of that piece of code is not limited by how you write the for-loops, or other micro-optimisations. The time spent in the loops is totally governed by memory accesses. Even if you only need the red channel, for a 1000x1000 BGRA image you're actually touching ~4MB of data in your read operations.

 

If your CPU has a cache-line size of e.g. 64 bytes, that means the code generates 62500 cache misses, assuming no data is in the cache - which it won't be because it's been copied from GPU to CPU memory. On what kind of CPU did you see the 5ms? Modern processors have automatic prefetching in order to deal with these issues, and I assume you're not working on console hardware, are you?
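
The arithmetic behind those figures, written out as a small compile-time check. The 64-byte line size is the assumption stated above, and the helper names are made up for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kBytesPerPixel = 4;   // B, G, R, A
constexpr std::size_t kCacheLine     = 64;  // assumed cache-line size in bytes

// Bytes read from the source image, even though only every fourth byte is used.
constexpr std::size_t source_bytes(std::size_t w, std::size_t h) {
    return w * h * kBytesPerPixel;
}

// Cache lines touched when reading the whole source once (rounded up).
constexpr std::size_t cache_lines(std::size_t w, std::size_t h) {
    return (source_bytes(w, h) + kCacheLine - 1) / kCacheLine;
}

static_assert(source_bytes(1000, 1000) == 4000000, "about 4MB read for 1000x1000");
static_assert(cache_lines(1000, 1000) == 62500, "one miss per line if nothing is cached");
static_assert(cache_lines(1920, 1080) == 129600, "roughly twice as many at 1920x1080");
```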

 

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?

wintertime    4108

Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with unneeded other data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out that red channel, save as a file with only 1 channel, load that simple file into your program, be happy!

gpu_noob    114
Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with unneeded other data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out that red channel, save as a file with only 1 channel, load that simple file into your program, be happy!

 

I need to do this in real-time up to framerate of 60FPS.

 

 

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?

 

 

How can I test if GetRenderTargetData has completed? I'm using the following lines of code

 

 

d3d->GetRenderTargetData(renderSurface, videoSurface);

videoSurface->LockRect(&lr, 0, D3DLOCK_READONLY);
byte* bgra = (byte*) lr.pBits;
byte* r = new byte[Height*Width];

for (int i = 0; i < Height; i++)
{
	for (int j = 0; j < Width; j++)
	{
		int offset = i*Width + j;
		r[offset] = bgra[offset*4 + 2];
	}
}
videoSurface->UnlockRect();
ProcessRedChannel(&r);
delete[] r;
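
One thing worth checking in a loop like this: LockRect hands back a row pitch together with the pointer, and the pitch is not guaranteed to equal Width*4. A pitch-aware version might look like the sketch below. `LockedRect` is a stand-in that mirrors the `Pitch` and `pBits` fields of `D3DLOCKED_RECT` so the example builds without the Direct3D headers, and the function name is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for D3DLOCKED_RECT (same two fields); an assumption made so the
// sketch compiles without d3d9.h.
struct LockedRect {
    int   Pitch;  // bytes from the start of one row to the start of the next
    void* pBits;  // pointer to the first row
};

// Copies the red channel row by row, stepping the source by Pitch instead of width*4.
void extract_red_pitched(const LockedRect& lr, std::uint8_t* r, int width, int height) {
    const std::uint8_t* row = static_cast<const std::uint8_t*>(lr.pBits);
    for (int y = 0; y < height; ++y, row += lr.Pitch) {
        std::uint8_t* dst = r + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x)
            dst[x] = row[x * 4 + 2];   // byte 2 of each BGRA pixel is red
    }
}
```

When the pitch happens to equal Width*4 this does the same work as the flat loop, so there is little to lose by writing it this way.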
C0lumbo    4411

I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

 

One more thing... You have a new and a delete, is there any chance that's taking up a lot of your time? If you are measuring in a debug build or have a poorly performing memory manager, then that could be eating a lot of your time. Especially if your memory manager is filling 1MB of memory on both the new and the delete.

gpu_noob    114
I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

 

One more thing... You have a new and a delete, is there any chance that's taking up a lot of your time? If you are measuring in a debug build or have a poorly performing memory manager, then that could be eating a lot of your time. Especially if your memory manager is filling 1MB of memory on both the new and the delete.

 

I just moved byte* r = new byte[Height*Width] and delete[] r outside the render loop and got about a 0.2ms reduction. I'm running on an AMD Phenom 945. Also, I'm getting about 5ms when the render target is about 1920x1080.

jwezorek    2663
I'm currently rendering the BGRA bitmap image in Direct3D9 and obtaining it to system memory using GetRenderTargetData() and LockRect() then copying the Red Channel using the above method.
Then what are you doing with the red channel?

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.

gpu_noob    114
Then what are you doing with the red channel?

 

I'm feeding it to a video encoding algorithm. I'm not sure 

 

 

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.

 

 

I don't really want to obtain a copy of the pixels but rather format the pixel data in system memory as follows:

 

B G R A B G R A...

 

R R R R R R R R... G G G G G G G G... B B B B B B B B... A A A A A A A A

 

so that I can feed the Red Channel to the video encoding algorithm which contains Luminance information about the image.

alvaro    21246
I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

I got similar timings on my laptop, which is about one year old.

Ravyne    14300

So, if you're not already, what you probably want to do in this case is modify your loop to compute all 4 R, G, B, and A arrays (I'll call these planes) -- I presume you'll need the Green and Blue channels at some point too, for YUV you may or may not need A (which I assume remains alpha).

 

It seems likely to me that the real bottleneck here is the copy from GPU to system memory -- by doing all 4 planes per loop iteration, you'll make efficient use of cache, and since the source array is already transferred, you aren't paying that penalty again. Whereas the red channel alone has a cost of around 6ms, I'd wager you can easily get the whole set for under 10.

 

Something like:

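int size        = Height * Width;

// bgra is assumed to be a 4-byte pixel type; RED/GRN/BLU/ALP stand for
// whatever accessors pull one channel out of the pixel src points to.
bgra* src	= <byte array of a BGRA formatted bitmap image>;
bgra* end       = src + size;

byte* r		= new byte[size];
byte* g		= new byte[size];
byte* b		= new byte[size];
byte* a		= new byte[size];

// Separate write cursors, so r/g/b/a still point at the start for delete[].
byte* r_dst     = r;
byte* g_dst     = g;
byte* b_dst     = b;
byte* a_dst     = a;

while (src < end)
{
  *r_dst++ = RED(src);
  *g_dst++ = GRN(src);
  *b_dst++ = BLU(src);
  *a_dst++ = ALP(src);

  src++;
}

delete[] a;
delete[] b;
delete[] g;
delete[] r;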

 

And then everything I said before applies -- unroll loop, coalesce writes, drop to SSE/AVX.
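
For the red plane on its own, the "drop to SSE" step might look roughly like this. It is a sketch under assumptions, not a tuned implementation: it uses SSE2 shift, mask and pack instructions instead of the byte shuffles mentioned earlier (`_mm_shuffle_epi8` needs SSSE3), it falls back to the scalar loop when SSE2 is not available, and the function and macro names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define RED_SKETCH_HAVE_SSE2 1
#endif

// Extracts the red byte of each BGRA pixel, 16 pixels per iteration where SSE2 exists.
void extract_red_sse2(const std::uint8_t* bgra, std::uint8_t* r, std::size_t pixels) {
    std::size_t i = 0;
#ifdef RED_SKETCH_HAVE_SSE2
    const __m128i mask = _mm_set1_epi32(0xFF);
    for (; i + 16 <= pixels; i += 16) {
        const __m128i* p = reinterpret_cast<const __m128i*>(bgra + i * 4);
        // Each 32-bit lane is one pixel; as a little-endian integer red sits in bits 16..23.
        __m128i a = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(p + 0), 16), mask);
        __m128i b = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(p + 1), 16), mask);
        __m128i c = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(p + 2), 16), mask);
        __m128i d = _mm_and_si128(_mm_srli_epi32(_mm_loadu_si128(p + 3), 16), mask);
        // Narrow 32 -> 16 -> 8 bits; values are 0..255, so saturation never triggers.
        __m128i ab = _mm_packs_epi32(a, b);
        __m128i cd = _mm_packs_epi32(c, d);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(r + i), _mm_packus_epi16(ab, cd));
    }
#endif
    // Scalar tail (and the whole job when SSE2 is unavailable).
    for (; i < pixels; ++i)
        r[i] = bgra[i * 4 + 2];
}
```

Each iteration reads 64 source bytes and issues a single 16-byte store, which is the "coalesce more writes" idea taken one step further than the x4 unroll.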

 

Another thought -- also look into the restrict keyword and make sure your pointers are const-correct. Without restrict/const correctness, it's possible (if not likely) that the compiler can't optimize this code, because it won't know whether your pointers alias each other or not.
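
As a rough illustration of that last point, here is what the four-plane split could look like with a const source and restrict-qualified pointers. Note the assumptions: `__restrict` is a compiler extension (GCC, Clang and MSVC accept it) rather than standard C++, the function name is made up, and the caller has to guarantee that the five buffers really do not overlap.

```cpp
#include <cstddef>
#include <cstdint>

// Splits interleaved BGRA pixels into four separate planes.
// __restrict promises the compiler that none of these buffers alias each other.
void split_planes(const std::uint8_t* __restrict src,
                  std::uint8_t* __restrict r,
                  std::uint8_t* __restrict g,
                  std::uint8_t* __restrict b,
                  std::uint8_t* __restrict a,
                  std::size_t pixels) {
    for (std::size_t i = 0; i < pixels; ++i) {
        b[i] = src[4 * i + 0];
        g[i] = src[4 * i + 1];
        r[i] = src[4 * i + 2];
        a[i] = src[4 * i + 3];
    }
}
```

Whether the qualifiers change the generated code is compiler-dependent, so compare the timings or the assembly before and after.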

Adam_42    3629

Your best bet performance-wise is probably to get the GPU to do as much of the work as possible. It should be fairly simple to write a shader that does the colour space conversion and outputs the data in the format you need. The only awkwardness is that there are no one-byte-per-pixel render target formats, so you'll have to use RGBA and process four source pixels for each destination one (and ensure the source image is a multiple of 4 pixels wide).

 

In addition to that, don't lock the texture on the same frame as you call GetRenderTargetData() - double buffer it and you'll be blocking waiting for the GPU less often.
