Efficiently obtaining Red Channel from BGRA Bitmap


Old topic!
Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

22 replies to this topic

#1 gpu_noob   Members   -  Reputation: 114

Posted 13 January 2013 - 04:00 AM

byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
			
for (int i = 0; i < Height; i++)
{
	for (int j = 0; j < Width; j++)
	{
		int offset = i*Width + j;
		r[offset] = bgra[offset*4 + 2];
	}
}
delete[] r;

 

I'm using the above code to obtain red channel values from a byte array of a BGRA bitmap image. The image is formatted as:


B G R A B G R A... (Size of W*H*4)

 

I want to obtain a byte array of 

 

R R R R... (Size of W*H)

 

Is there a more efficient way of doing this without using for loops?

 




#2 Zaoshi Kaba   Crossbones+   -  Reputation: 3608

Posted 13 January 2013 - 04:06 AM

You could probably eliminate that *4 but that's insignificant.

 

It's possible to achieve the same using shaders by outputting the red color into a single-channel render target, but the latency will kill any performance you gained.



#3 C0lumbo   Crossbones+   -  Reputation: 2118

Posted 13 January 2013 - 04:10 AM

byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
byte *rSource = bgra+2;
int iPixels = Height*Width;

for (int i=0;i<iPixels;i++,r++,rSource+=4)
{    
    *r = *rSource;
}

I'd probably do something like this. I doubt it'd make much difference in the grand scheme of things.

Edited by C0lumbo, 13 January 2013 - 04:11 AM.


#4 gpu_noob   Members   -  Reputation: 114

Posted 13 January 2013 - 09:50 AM

You could probably eliminate that *4 but that's insignificant.

 

It's possible to achieve the same using shaders by outputting the red color into a single-channel render target, but the latency will kill any performance you gained.

 

Would it be possible to somehow transfer the single-channel render target data to system memory with D3D10 CopyResource or D3D9 GetRenderTargetData? I need to be able to access the red channel on the CPU.



#5 Zaoshi Kaba   Crossbones+   -  Reputation: 3608

Posted 13 January 2013 - 10:29 AM

I'm afraid not. It copies the whole resource, or part of it, and doesn't pick out individual bytes. You'd have to render a quad with a pixel shader, then copy the red render target into RAM to have CPU access.

Edited by Zaoshi Kaba, 13 January 2013 - 10:30 AM.


#6 gpu_noob   Members   -  Reputation: 114

Posted 13 January 2013 - 11:58 AM

I'm afraid not. It copies the whole resource, or part of it, and doesn't pick out individual bytes. You'd have to render a quad with a pixel shader, then copy the red render target into RAM to have CPU access.

 

I'm not sure what you mean by copying the red render target to RAM. I thought render targets had to be 32-bit aligned. Is there an example of how to extract only the red channel from a render target texture using pixel shaders?


Edited by gpu_noob, 13 January 2013 - 12:22 PM.


#7 Álvaro   Crossbones+   -  Reputation: 11861

Posted 13 January 2013 - 12:23 PM

Is there a more efficient way of doing this without using for loops?

You seem to be under the [incorrect] assumption that for loops are somehow slow. Chances are that code is perfectly fast. You might be able to save a bit in the pointer arithmetic, since you don't really need to compute the offset from scratch each time: it's just one more than its value in the previous iteration of the loop, so you can keep it in a counter. But even that probably won't matter much.

You should generally only worry about performance when you have evidence that this operation is taking too much time in your program.

#8 iMalc   Crossbones+   -  Reputation: 2259

Posted 14 January 2013 - 12:28 AM


byte* bgra	= byte array of a BGRA formatted bitmap image;
byte* r		= new byte[Height*Width];
byte *rSource = bgra+2;
int iPixels = Height*Width;

for (int i=0;i<iPixels;i++,r++,rSource+=4)
{    
    *r = *rSource;
}

I'd probably do something like this. I doubt it'd make much difference in the grand scheme of things.
You can go further than that, 'i' is not needed:
byte *bgra    = <byte array of a BGRA formatted bitmap image>;
byte *r       = new byte[Height*Width];
byte *rbegin  = r;
byte *rend    = r + Height*Width;
byte *rSource = bgra+2;

while (rbegin < rend)
{    
    *rbegin++ = *rSource;
    rSource += 4;
}

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

#9 Álvaro   Crossbones+   -  Reputation: 11861

Posted 14 January 2013 - 09:10 AM

I just tried all of the solutions given above, and they have the exact same performance. So just write whatever is easiest to read. I personally would write this:
  byte *bgra = <byte array of a BGRA formatted bitmap image>;
  byte *r = new byte[Height*Width];

  for (int i=0; i<Height*Width; ++i)
    r[i] = bgra[4*i+2];


#10 Ravyne   Crossbones+   -  Reputation: 6765

Posted 14 January 2013 - 09:57 AM

All of the vanilla C++ that's been posted is about as efficient as you're going to get.

 

However, if you can prove that this is still a bottleneck for you, you could further try:

 

Pre-warm the cache by reading ahead (depends on cache-line size, but probably 8 or 16 source pixels)

Unroll loop x4 (read), coalesce writes (need to add some code to deal with non-multiple-of-4 source data).

Drop down to SSE or AVX assembly/intrinsics (coalesce more writes, using shuffle instructions)

 

I would try those things in that order, but remember -- fast for fast's sake is a silly goal unless it's an academic exercise. In "the real world" the best solution is usually the simplest one which is fast enough. Optimizing without profiling is the coding equivalent of shooting first and asking questions later.



#11 gpu_noob   Members   -  Reputation: 114

Posted 15 January 2013 - 02:26 AM

This is indeed a bottleneck for me. I suspect this has to do with how many memory read/write operations happen at a low level. The problem is that it uses a lot of CPU time for larger images (1000x1000).

 

As an alternative, how can I use the GPU to obtain only the red channel? I'm currently rendering the BGRA bitmap image in Direct3D9 and copying it to system memory using GetRenderTargetData() and LockRect(), then copying the red channel using the above method.

 

Is there any Direct3D way of copying only Red channel to system memory?



#12 C0lumbo   Crossbones+   -  Reputation: 2118

Posted 15 January 2013 - 03:03 AM

Could you detail what it is that you're doing that requires you having the R channel accessible on the CPU?

 

Perhaps there's an alternative approach that can get you what you want. e.g. Perhaps whatever it is you're doing with your R channel on the CPU can actually be done on the GPU? Perhaps you can achieve your goal with a downsampled copy of your RGBA framebuffer so you'd only copy 1/4 of the pixels? Perhaps your rendering can be done on a single channel render buffer in the first place?

 

Finally, are you sure it's the copying of the data that's actually taking the time, and not just the latency from the sync point between requesting a copy of the frame buffer and actually getting it on the CPU that's hurting your app's performance? If that's the case, then perhaps you can add double buffering so that you do your CPU work on the previous frame's buffer instead of the current frame's buffer.
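
To make the double-buffering idea a little more concrete, here is a minimal, API-agnostic sketch. The class `FramePingPong` and its method names are hypothetical, and plain byte vectors stand in for whatever the readback target really is (for example two system-memory surfaces in Direct3D9). The only point is the bookkeeping: one slot receives the frame being read back now, while the CPU works on the slot that was filled one frame earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: two CPU-side frame slots used in ping-pong fashion.
class FramePingPong
{
public:
    explicit FramePingPong(std::size_t bytesPerFrame)
        : current_(bytesPerFrame), previous_(bytesPerFrame) {}

    // Slot that receives the frame being read back this iteration.
    std::vector<std::uint8_t>& current() { return current_; }

    // Slot filled during the previous iteration, safe for CPU processing.
    const std::vector<std::uint8_t>& previous() const { return previous_; }

    // Call once per frame, after the readback into current() was issued.
    // Swapping vectors exchanges ownership only, no pixel data is copied.
    void advance() { current_.swap(previous_); }

private:
    std::vector<std::uint8_t> current_;
    std::vector<std::uint8_t> previous_;
};
```

The cost of this scheme is one frame of extra latency for whatever consumes the data.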



#13 gpu_noob   Members   -  Reputation: 114

Posted 15 January 2013 - 03:33 AM

I'm not sure how to profile properly, but I used QueryPerformanceCounter to measure the execution times: LockRect (0.003 ms), GetRenderTargetData (0.5 ms), and extracting the red channel (5 ms).
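
For reference, a timing measurement of this kind could be sketched as follows. This is not the code that produced the numbers above: `time_extract_ms` is a hypothetical helper, and it uses the portable `std::chrono::steady_clock` in place of the Windows-specific QueryPerformanceCounter. It repeats the copy several times and reports the fastest run, which filters out some scheduling noise. Numbers are only meaningful in an optimized build.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: runs the red-channel copy `runs` times and returns
// the fastest run in milliseconds. bgra must hold 4 * r.size() bytes.
double time_extract_ms(const std::vector<std::uint8_t>& bgra,
                       std::vector<std::uint8_t>& r,
                       int runs)
{
    using clock = std::chrono::steady_clock;
    double best = -1.0;

    for (int run = 0; run < runs; ++run)
    {
        const auto start = clock::now();
        for (std::size_t i = 0; i < r.size(); ++i)
            r[i] = bgra[4 * i + 2];
        const auto stop = clock::now();

        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

Note that repeated runs over the same buffer measure the warm-cache case, which can be noticeably faster than touching a frame that has just arrived from the GPU.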

 

The BGRA image actually holds YUV data, with the red channel carrying the luminance (Y). I need to record the red channel because it contains Y, which is used in a video encoding algorithm.



#14 tivolo   Members   -  Reputation: 882

Posted 15 January 2013 - 08:17 AM

The execution time of that piece of code is not limited by how you write the for-loops, or other micro-optimisations. The time spent in the loops is totally governed by memory accesses. Even if you only need the red channel, for a 1000x1000 BGRA image you're actually touching ~4MB of data in your read operations.

 

If your CPU has a cache-line size of e.g. 64 bytes, that means the code generates 62500 cache misses, assuming no data is in the cache - which it won't be because it's been copied from GPU to CPU memory. On what kind of CPU did you see the 5ms? Modern processors have automatic prefetching in order to deal with these issues, and I assume you're not working on console hardware, are you?
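
The arithmetic behind those figures can be checked at compile time. The sketch below assumes 4 bytes per pixel and the 64-byte cache line used as the example above; the function name is made up for illustration.

```cpp
#include <cstddef>

// Number of cache lines covered by a tightly packed BGRA image,
// assuming 4 bytes per pixel and a line size of lineBytes.
constexpr std::size_t cache_lines_touched(std::size_t width,
                                          std::size_t height,
                                          std::size_t lineBytes = 64)
{
    return (width * height * 4) / lineBytes;
}

// 1000 x 1000 x 4 bytes = 4,000,000 bytes read (~4MB).
static_assert(1000u * 1000u * 4u == 4000000u, "bytes read per frame");
// 4,000,000 / 64 = 62,500 cache lines, each one a potential miss.
static_assert(cache_lines_touched(1000, 1000) == 62500, "lines per frame");
// The same calculation for a 1920x1080 frame gives 129,600 lines.
static_assert(cache_lines_touched(1920, 1080) == 129600, "lines per HD frame");
```

Each 64-byte line holds 16 BGRA pixels, so only 16 of the 64 bytes fetched per line are the red values actually wanted.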

 

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?



#15 wintertime   Members   -  Reputation: 1601

Posted 15 January 2013 - 09:57 AM

Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with unneeded other data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out that red channel, save as a file with only 1 channel, load that simple file into your program, be happy!



#16 gpu_noob   Members   -  Reputation: 114

Posted 15 January 2013 - 11:09 AM

Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with unneeded other data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out that red channel, save as a file with only 1 channel, load that simple file into your program, be happy!

 

I need to do this in real time, at frame rates of up to 60 FPS.

 

 

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?

 

 

How can I test if GetRenderTargetData has completed? I'm using the following lines of code:

 

 

d3d->GetRenderTargetData(renderSurface, videoSurface);

videoSurface->LockRect(&lr, 0, D3DLOCK_READONLY);
byte* bgra = (byte*) lr.pBits;
byte* r = new byte[Height*Width];

for (int i = 0; i < Height; i++)
{
	for (int j = 0; j < Width; j++)
	{
		int offset = i*Width + j;
		r[offset] = bgra[offset*4 + 2];
	}
}
videoSurface->UnlockRect();
ProcessRedChannel(&r);
delete[] r;
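
One thing the snippet above takes for granted is that every row of the locked surface is exactly Width*4 bytes long. D3DLOCKED_RECT also reports a Pitch (the number of bytes per row), which may be larger than that, in which case indexing with i*Width + j reads from the wrong place. A minimal pitch-aware sketch of the same loop follows; the function name `extract_red_pitched` and its parameter list are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: copies the R byte of each BGRA pixel into the
// tightly packed array r (width * height bytes), stepping through the
// source one row at a time using the row pitch in bytes.
void extract_red_pitched(const std::uint8_t* bits, std::ptrdiff_t pitch,
                         int width, int height, std::uint8_t* r)
{
    for (int y = 0; y < height; ++y)
    {
        // Start of row y, offset by 2 to land on the R byte of pixel 0.
        const std::uint8_t* src = bits + y * pitch + 2;
        std::uint8_t* dst = r + static_cast<std::size_t>(y) * width;

        for (int x = 0; x < width; ++x)
            dst[x] = src[4 * x];
    }
}
```

With the variables from the snippet above, the call would be along the lines of `extract_red_pitched(static_cast<const std::uint8_t*>(lr.pBits), lr.Pitch, Width, Height, r);`, made before UnlockRect().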

Edited by gpu_noob, 15 January 2013 - 11:11 AM.


#17 C0lumbo   Crossbones+   -  Reputation: 2118

Posted 15 January 2013 - 11:42 AM

I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

 

One more thing... You have a new and a delete; is there any chance that's taking up a lot of your time? If you are measuring in a debug build or have a poorly performing memory manager, then that could be eating a lot of your time, especially if your memory manager is filling 1MB of memory on both the new and the delete.
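
A simple way to take the allocation out of the per-frame path is to keep the destination buffer alive between frames. The sketch below is only one possible shape for that; the class name `RedChannelBuffer` and its interface are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: owns the red-channel buffer across frames and
// reallocates only when the image dimensions change.
class RedChannelBuffer
{
public:
    // Returns storage for width * height bytes. Repeated calls with the
    // same dimensions reuse the existing allocation.
    std::uint8_t* acquire(std::size_t width, std::size_t height)
    {
        const std::size_t needed = width * height;
        if (buffer_.size() != needed)
            buffer_.resize(needed);
        return buffer_.data();
    }

    std::size_t size() const { return buffer_.size(); }

private:
    std::vector<std::uint8_t> buffer_;
};
```

Created once outside the render loop, an object like this replaces the per-frame new[] and delete[] pair, and the memory is released automatically when it goes out of scope.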



#18 gpu_noob   Members   -  Reputation: 114

Posted 15 January 2013 - 11:49 AM

I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

 

One more thing... You have a new and a delete; is there any chance that's taking up a lot of your time? If you are measuring in a debug build or have a poorly performing memory manager, then that could be eating a lot of your time, especially if your memory manager is filling 1MB of memory on both the new and the delete.

 

I just moved byte* r = new byte[Height*Width] and delete[] r outside the render loop, and I get about a 0.2 ms reduction. I'm running on an AMD Phenom 945. Also, I'm getting about 5 ms when the render target is about 1920x1080.



#19 jwezorek   Crossbones+   -  Reputation: 1606

Posted 15 January 2013 - 11:52 AM

I'm currently rendering the BGRA bitmap image in Direct3D9 and copying it to system memory using GetRenderTargetData() and LockRect(), then copying the red channel using the above method.
Then what are you doing with the red channel?

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.

#20 gpu_noob   Members   -  Reputation: 114

Posted 15 January 2013 - 12:17 PM

Then what are you doing with the red channel?

 

I'm feeding it to a video encoding algorithm. I'm not sure 

 

 

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.

 

 

I don't really want to obtain a copy of the pixels but rather format the pixel data in system memory as follows:

 

B G R A B G R A...

 

R R R R R R R R... G G G G G G G G... B B B B B B B B... A A A A A A A A

 

so that I can feed the red channel, which contains the luminance information about the image, to the video encoding algorithm.
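
For illustration, the interleaved-to-planar rearrangement described here could be sketched like this. It writes into a separate destination buffer, so it is still a copy and not an in-place reformat; the function name `bgra_to_planar_rgba` is made up, and the plane order (R, G, B, A) follows the layout shown above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: converts interleaved B G R A B G R A ... into four
// consecutive planes R..., G..., B..., A... of `pixels` bytes each.
// bgra and planar must both hold 4 * pixels bytes and must not overlap.
void bgra_to_planar_rgba(const std::uint8_t* bgra, std::size_t pixels,
                         std::uint8_t* planar)
{
    std::uint8_t* rPlane = planar;
    std::uint8_t* gPlane = planar + pixels;
    std::uint8_t* bPlane = planar + 2 * pixels;
    std::uint8_t* aPlane = planar + 3 * pixels;

    for (std::size_t i = 0; i < pixels; ++i)
    {
        const std::uint8_t* px = bgra + 4 * i;
        bPlane[i] = px[0];
        gPlane[i] = px[1];
        rPlane[i] = px[2];
        aPlane[i] = px[3];
    }
}
```

If only the red (Y) plane is consumed, filling the other three planes is wasted work, and the single-channel loops earlier in the thread do less writing.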


Edited by gpu_noob, 15 January 2013 - 12:19 PM.




