Efficiently obtaining Red Channel from BGRA Bitmap

21 comments, last by Adam_42 11 years, 3 months ago

This is indeed a bottleneck for me. I suspect it comes down to how many memory read/write operations happen at a low level. The problem is that it uses a lot of CPU time for larger images (1000x1000).

As an alternative, how could I use the GPU to obtain only the red channel? I'm currently rendering the BGRA bitmap image in Direct3D9 and reading it back to system memory using GetRenderTargetData() and LockRect(), then copying the red channel using the above method.

Is there any Direct3D way of copying only Red channel to system memory?


Could you detail what you're doing that requires having the R channel accessible on the CPU?

Perhaps there's an alternative approach that can get you what you want. For example, perhaps whatever you're doing with your R channel on the CPU can actually be done on the GPU? Perhaps you can achieve your goal with a downsampled copy of your RGBA framebuffer, so you'd only copy 1/4 of the pixels? Or perhaps your rendering can be done into a single-channel render target in the first place?
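For example, something along these lines might work for the single-channel idea (an untested sketch; D3DFMT_L8 render-target support isn't guaranteed on all hardware, and d3dObject / device / redTarget are placeholder names):

// Sketch: check whether the device can render to a single-channel format,
// and create such a target so only one byte per pixel ever needs copying back.
HRESULT hr = d3dObject->CheckDeviceFormat(
    D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL, D3DFMT_X8R8G8B8,
    D3DUSAGE_RENDERTARGET, D3DRTYPE_SURFACE, D3DFMT_L8);

if (SUCCEEDED(hr))
{
    IDirect3DSurface9* redTarget = 0;
    device->CreateRenderTarget(Width, Height, D3DFMT_L8,
                               D3DMULTISAMPLE_NONE, 0, FALSE, &redTarget, 0);

    // Render (or run a simple pass-through shader that outputs only .r)
    // into redTarget, then GetRenderTargetData into a D3DFMT_L8
    // offscreen plain surface: a quarter of the data to move and no
    // per-pixel extraction loop on the CPU afterwards.
}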

Finally, are you sure it's the copying around of the data that's actually taking the time, and not just the latency from the sync point between requesting a copy of the frame buffer and actually getting it on the CPU? If that's the case, perhaps you can double-buffer, so you do your CPU work on the previous frame's buffer instead of the current frame's buffer.
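If it is the sync point, a rough double-buffering sketch (untested; renderTarget[2], videoSurface and frameIndex are made-up names) could look like this:

// Sketch: alternate between two render targets so that GetRenderTargetData
// is called on the surface rendered *last* frame, which the GPU has long
// since finished with, instead of stalling on the current frame.
IDirect3DSurface9* renderTarget[2]; // created once with CreateRenderTarget
IDirect3DSurface9* videoSurface;    // system-memory surface from CreateOffscreenPlainSurface
int frameIndex = 0;

void EndOfFrame(IDirect3DDevice9* device)
{
    // This frame was rendered into renderTarget[frameIndex];
    // read back the *other* target, which holds the previous frame.
    int previous = 1 - frameIndex;
    device->GetRenderTargetData(renderTarget[previous], videoSurface);

    D3DLOCKED_RECT lr;
    if (SUCCEEDED(videoSurface->LockRect(&lr, 0, D3DLOCK_READONLY)))
    {
        // ... extract the red channel from lr.pBits here ...
        videoSurface->UnlockRect();
    }

    frameIndex = previous; // render into the other target next frame
}

The CPU then sees data that is one frame late, which is usually acceptable for video capture.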

I'm not sure how to profile properly, but I used QueryPerformanceCounter to check the execution times: LockRect (0.003 ms), GetRenderTargetData (0.5 ms), extracting the red channel (5 ms).

The BGRA image is actually YUV data packed into a BGRA layout, where the red channel holds the luminance (Y). I need to record the red channel data because it contains Y, which is used by a video encoding algorithm.

The execution time of that piece of code is not limited by how you write the for-loops or by other micro-optimisations. The time spent in the loops is governed almost entirely by memory accesses. Even if you only need the red channel, for a 1000x1000 BGRA image you're actually touching ~4MB of data in your read operations.

If your CPU has a cache-line size of e.g. 64 bytes, that means the code generates 62500 cache misses, assuming no data is in the cache - which it won't be because it's been copied from GPU to CPU memory. On what kind of CPU did you see the 5ms? Modern processors have automatic prefetching in order to deal with these issues, and I assume you're not working on console hardware, are you?

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?

Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with other, unneeded data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out the red channel, save it as a file with only one channel, load that simple file into your program, and be happy!

Somehow this whole thing feels silly to me. Why waste time loading data that's intermingled with other, unneeded data and then try to optimize that wrong usage?

Just load that file into your favorite image manipulation program, single out the red channel, save it as a file with only one channel, load that simple file into your program, and be happy!

I need to do this in real time, at frame rates up to 60 FPS.

5ms seems a lot to me. Are you certain the transfer from the GetRenderTargetData operation has completely finished before calling QueryPerformanceCounter?

How can I test whether GetRenderTargetData has completed? I'm using the following lines of code:

// Copy the render target into the system-memory surface.
d3d->GetRenderTargetData(renderSurface, videoSurface);

// Lock the system-memory surface for reading.
D3DLOCKED_RECT lr;
videoSurface->LockRect(&lr, 0, D3DLOCK_READONLY);
byte* bgra = (byte*)lr.pBits;
byte* r = new byte[Height * Width];

for (int i = 0; i < Height; i++)
{
    for (int j = 0; j < Width; j++)
    {
        // Red is the third byte of each BGRA pixel; lr.Pitch is the
        // row stride, which may be larger than Width * 4.
        r[i * Width + j] = bgra[i * lr.Pitch + j * 4 + 2];
    }
}

videoSurface->UnlockRect();
ProcessRedChannel(&r);
delete[] r;
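For reference, one way to make sure the GPU has genuinely finished before the timer starts is a D3DQUERYTYPE_EVENT query; this is just an untested sketch, assuming d3d is the same device pointer used above:

// Sketch: force the GPU to drain everything issued so far, so the timing
// around GetRenderTargetData doesn't silently include the GPU's frame time.
IDirect3DQuery9* query = 0;
d3d->CreateQuery(D3DQUERYTYPE_EVENT, &query);
query->Issue(D3DISSUE_END);

// Spin until the GPU reports the event as completed.
while (query->GetData(0, 0, D3DGETDATA_FLUSH) == S_FALSE)
{
    // busy-wait; only do this for measurement purposes
}
query->Release();

// Only now start the QueryPerformanceCounter timing around
// GetRenderTargetData / LockRect / the extraction loop.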

I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

One more thing... You have a new and a delete; is there any chance those are taking up a lot of your time? If you're measuring a debug build or have a poorly performing memory manager, that could be eating a lot of your time. Especially if your memory manager is filling 1MB of memory on both the new and the delete.

I'm pretty surprised too that you're measuring 5ms for copying that much data. Are you on some old hardware?

One more thing... You have a new and a delete; is there any chance those are taking up a lot of your time? If you're measuring a debug build or have a poorly performing memory manager, that could be eating a lot of your time. Especially if your memory manager is filling 1MB of memory on both the new and the delete.

I moved byte* r = new byte[Height*Width] and the delete[] r outside the render loop and got about a 0.2 ms reduction. I'm running on an AMD Phenom 945. Also, I'm getting about 5 ms when the render target is around 1920x1080.

I'm currently rendering the BGRA bitmap image in Direct3D9 and reading it back to system memory using GetRenderTargetData() and LockRect(), then copying the red channel using the above method.
Then what are you doing with the red channel?

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.
Then what are you doing with the red channel?

I'm feeding it to a video encoding algorithm. I'm not sure

Basically, if you're doing anything per-pixel on the CPU on over a million pixels (1000x1000 bitmap) at each iteration of your main game loop, which is what it sounds like you are doing, there is no way that you're going to optimize this to be fast enough by screwing around with for-loops and so forth. You need to re-design and/or do more on the GPU or whatever, but in order for posters here to help with that we need to know what you are doing.

I don't really want to obtain a copy of the pixels, but rather to reorder the pixel data in system memory from the interleaved layout:

B G R A B G R A ...

to a planar layout:

R R R R R R R R ... G G G G G G G G ... B B B B B B B B ... A A A A A A A A

so that I can feed the red plane, which holds the luminance information, to the video encoding algorithm.
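For what it's worth, the interleaved-to-planar step for the red channel could probably be sped up on the CPU with SSE; the sketch below is untested and hypothetical, and it assumes SSSE3 support, a tightly packed BGRA buffer, and a pixel count that is a multiple of 16:

#include <tmmintrin.h> // SSSE3 (_mm_shuffle_epi8)
#include <cstdint>

// Sketch: gather the red byte of 16 BGRA pixels per iteration.
void ExtractRedSSSE3(const uint8_t* bgra, uint8_t* red, size_t pixelCount)
{
    // Within each 16-byte group (4 BGRA pixels) the red bytes sit at 2, 6, 10, 14.
    const __m128i pick = _mm_setr_epi8(
        2, 6, 10, 14,
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); // -1 -> zero

    for (size_t i = 0; i < pixelCount; i += 16)
    {
        // Load 64 bytes = 16 pixels.
        __m128i p0 = _mm_loadu_si128((const __m128i*)(bgra + i * 4));
        __m128i p1 = _mm_loadu_si128((const __m128i*)(bgra + i * 4 + 16));
        __m128i p2 = _mm_loadu_si128((const __m128i*)(bgra + i * 4 + 32));
        __m128i p3 = _mm_loadu_si128((const __m128i*)(bgra + i * 4 + 48));

        // Four red bytes from each group, packed into the low dword.
        __m128i r0 = _mm_shuffle_epi8(p0, pick);
        __m128i r1 = _mm_shuffle_epi8(p1, pick);
        __m128i r2 = _mm_shuffle_epi8(p2, pick);
        __m128i r3 = _mm_shuffle_epi8(p3, pick);

        // Merge the four dwords into one 16-byte store of red values.
        __m128i r01 = _mm_unpacklo_epi32(r0, r1);
        __m128i r23 = _mm_unpacklo_epi32(r2, r3);
        _mm_storeu_si128((__m128i*)(red + i), _mm_unpacklo_epi64(r01, r23));
    }
}

It still reads all four bytes of every pixel, so it mainly removes per-byte loop overhead rather than memory traffic, which is the real limit pointed out earlier in the thread.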

