# Display screen remotely

## Recommended Posts

David Lake    117

I have a program I made for remotely controlling PCs that uses BitBlt, bitmaps, and a bit of processing and compression, and I would like to improve the way it captures and displays the desktop, using SlimDX if that can increase performance.

At the moment the capture side is very heavy on the CPU. I use the thread pool for threading and split the screen capture and the processing/sending of frames into different threads, but I would like to reduce CPU utilization without compromising performance if possible.

The main problem is that I can't control the login screen; I would like to know if there's a way around that.

Also, I need a faster way than GDI+ to scale the images before sending; I'm hoping SlimDX can speed this up using the GPU.

Is there a relatively easy way to use SlimDX to capture the screen and display it remotely that's at least as fast as using BitBlt, GDI+ and bitmaps?

Vortez    2714

Use DirectX to scale your image, not GDI; it will be much faster. I have a similar working project and that's what I do, except I use OpenGL, but it's almost the same thing.

For the CPU, I don't really see any way other than adding a sleep before or after you capture the screen. It might be a bit slower, but keep in mind that your image is going to be streamed over the network anyway, which isn't that fast. In fact, I multithreaded mine: after sending a screenshot, it starts a thread to capture the next one while the current one is being sent. Keep in mind, though, that multithreading adds some complexity like synchronization, but I think it's easier in C# than in C++, which is what I used.
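
The overlap described above (capture the next frame while the current one is being sent) can be sketched as a two-thread pipeline with a small queue. This is only an illustrative C++ sketch under made-up names (`FramePipeline`, `Frame`, and the fake capture are all hypothetical; the OP's C# version would more likely use `Task` and `BlockingCollection`):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One captured frame; in the real program this would hold the BitBlt'd pixels.
using Frame = std::vector<unsigned char>;

class FramePipeline {
public:
    // Capture thread: produce frames while the sender drains them.
    void Capture(int frameCount) {
        for (int i = 0; i < frameCount; ++i) {
            Frame f(16, static_cast<unsigned char>(i));   // stand-in screenshot
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(f));
            m_ready.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m_mutex); m_done = true; }
        m_ready.notify_one();
    }

    // Send thread: drain frames as they become available; returns frames sent.
    int Send() {
        int sent = 0;
        std::unique_lock<std::mutex> lock(m_mutex);
        for (;;) {
            m_ready.wait(lock, [this] { return !m_queue.empty() || m_done; });
            while (!m_queue.empty()) {
                Frame f = std::move(m_queue.front());
                m_queue.pop();
                lock.unlock();
                // ... compress and send f over the socket here ...
                ++sent;
                lock.lock();
            }
            if (m_done) return sent;
        }
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<Frame> m_queue;
    bool m_done = false;
};
```

The key point is that the capture thread never waits for the network: sending happens with the lock released, so a slow link only grows the queue rather than stalling capture.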

good luck

Edited by Vortez

David Lake    117

Thanks. I can easily slow it down, but ultimately I want it to go faster without using two whole cores.

I'm looking for code samples for speedy transfer of screen capture, possibly with SlimDX, over the network.

I have tried it before, but it was too slow because I had to convert it into a System.Drawing.Bitmap. If there's a way to avoid System.Drawing altogether and convert the sprite (or whatever it was) into a byte array to send over the network, that might make it faster.

Also, I'm a complete n00b when it comes to any sort of GPU programming, but if I could process and compress the frames on the GPU, that would be nerdgasmtastic!

Edited by David Lake

Vortez    2714

I don't know C# very well, but in C++ I have direct access to the bitmap buffer memory, so I can't say. I'm using Delphi for the interface part and a C++ DLL to do the hard work (networking, screenshots, hooks, compression).

Not sure if compression on the GPU is feasible; I'm using zlib to compress mine, but the real speed-up isn't the compression. I use a quadtree: basically, I split the image in four, recursively, four or five times, then check which blocks have changed and send only the parts that changed. The quadtree lets me send different-sized parts of the image to the other side, so most of the time I only need to update part of the texture, and if nothing changed, it just sends an empty-screenshot message. It's a bit complex, but it's the best optimization I've found yet.

It's much the same as optimizing a terrain mesh with a quadtree, but with an image instead. I also tried using MPEG-1 compression to send frames, like a movie, and it worked, but the image was blurry and it was slower than my other method, so I don't see any reason to use it.
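
The quadtree check described above can be sketched as follows: recursively split the frame in four a few times, prune any subtree whose pixels all match the previous frame, and keep the changed leaf blocks. This is a hypothetical single-channel sketch (`Rect`, `BlockChanged`, and `CollectChanged` are made-up names; a real version would compare whole pixels, not single bytes):

```cpp
#include <cstdint>
#include <vector>

struct Rect { int x, y, w, h; };

// Does any pixel inside `r` differ between the two frames?
static bool BlockChanged(const uint8_t* prev, const uint8_t* cur,
                         int stride, const Rect& r) {
    for (int y = r.y; y < r.y + r.h; ++y)
        for (int x = r.x; x < r.x + r.w; ++x)
            if (prev[y * stride + x] != cur[y * stride + x])
                return true;
    return false;
}

// Recursively split a changed region in four, `depth` times, and collect
// the changed leaf blocks; unchanged subtrees are pruned early.
static void CollectChanged(const uint8_t* prev, const uint8_t* cur, int stride,
                           const Rect& r, int depth, std::vector<Rect>& out) {
    if (!BlockChanged(prev, cur, stride, r))
        return;                       // nothing to send for this subtree
    if (depth == 0 || r.w < 2 || r.h < 2) {
        out.push_back(r);             // leaf: this block must be sent
        return;
    }
    int hw = r.w / 2, hh = r.h / 2;
    Rect q[4] = { { r.x,      r.y,      hw,       hh       },
                  { r.x + hw, r.y,      r.w - hw, hh       },
                  { r.x,      r.y + hh, hw,       r.h - hh },
                  { r.x + hw, r.y + hh, r.w - hw, r.h - hh } };
    for (const Rect& c : q)
        CollectChanged(prev, cur, stride, c, depth - 1, out);
}
```

If nothing changed, `out` stays empty, which corresponds to the "empty screenshot" message; otherwise only the changed leaf rectangles need to cross the network.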

Let DirectX do the scaling for you. And make sure to update, not recreate, the texture each time you receive a screenshot; it's much faster that way.

Edited by Vortez

Adam_42

The main problem with using the GPU for this is that transfers to and from the GPU are relatively slow; the CPU can read and write its own memory much faster. This means that unless the processing required is expensive, it probably won't help compared to optimized CPU code. However, if you want to go that way you may find http://forums.getpaint.net/index.php?/topic/18989-gpu-motion-blur-effect-using-directcompute/ useful; there's some source code available there.

For data compression I'd XOR pairs of adjacent frames: the result will be zeros wherever they are identical. You can then apply zlib to the result, which should compress all those zeros really well. Reconstruction at the other end is done with XOR too. Since XOR is really cheap, that should be reasonably quick even on a CPU.
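
The XOR-delta idea is only a few lines. A minimal sketch (the `XorDelta` name is made up, and the zlib step that would follow is omitted):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR-delta encode: out[i] = prev[i] ^ cur[i]. Bytes that didn't change
// become zero, which a general-purpose compressor like zlib squeezes well.
// Decoding is the exact same operation: prev ^ delta == cur.
std::vector<uint8_t> XorDelta(const std::vector<uint8_t>& prev,
                              const std::vector<uint8_t>& cur) {
    std::vector<uint8_t> out(cur.size());   // assumes prev.size() == cur.size()
    for (std::size_t i = 0; i < cur.size(); ++i)
        out[i] = prev[i] ^ cur[i];
    return out;
}
```

The receiver keeps its copy of the previous frame and applies the same function to the delta to reconstruct the current frame; for the very first frame, both sides treat a zero-filled buffer as "previous".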

To cut down CPU load, make sure you have a frame-rate limiter in there. There's no point processing more than 60 frames per second, and you can probably get away with far fewer than that.
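
A limiter like that is just a sleep until the next scheduled frame time. A minimal sketch (the `FrameLimiter` name is made up; it deliberately sleeps until an absolute time point rather than for a fixed duration, so capture time is absorbed instead of added):

```cpp
#include <chrono>
#include <thread>

// Keeps iterations of the capture loop at least 1000/fps milliseconds apart.
class FrameLimiter {
public:
    explicit FrameLimiter(int fps)
        : m_interval(std::chrono::milliseconds(1000 / fps)),
          m_next(std::chrono::steady_clock::now()) {}

    // Call once per frame; sleeps only if the loop is running ahead.
    void Wait() {
        std::this_thread::sleep_until(m_next);
        m_next = std::chrono::steady_clock::now() + m_interval;
    }

private:
    std::chrono::steady_clock::duration m_interval;
    std::chrono::steady_clock::time_point m_next;
};
```

Dropping the target to, say, 20 FPS cuts both CPU use and bandwidth proportionally, which for remote control is usually an easy win.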

David Lake    117

I store the previous frame and zero the ARGB of any unchanged pixels (the XOR approach), then compress twice with QuickLZ; this makes highly compressible frames much smaller than a single compression pass. All the processing is done by an optimized DLL built with the Intel C++ compiler.

Then, when displaying the frame (in a PictureBox with GDI+), I simply draw over the previous one, and since the alpha of the unchanged pixels is zero, the unchanged pixels of the previous frame show through (ingenious, I know!).

What's the fastest way to capture the screen, scale it, and get the frame into a byte array using the GPU, then display it without using GDI? I find GDI slows down when rendering at a high resolution such as 1920x1200, even on an i7 3820 at 4.3GHz.

Oh, and as for frame rate, I like it to be as fast as possible; that's why I don't use Remote Desktop. And if it did go over 60 FPS, I wouldn't know how to limit it to exactly that.

Edited by David Lake

Vortez    2714

> For data compression I'd go with using the xor operator on pairs of adjacent frames. The result will be zeros where they are identical. You can then apply zlib to the results, which should compress all those zeros really well. Reconstructing at the other end is done with xor too. As the xor operator is really cheap that should be reasonably quick to do even on a CPU.

That's actually a very good idea. Dunno why I didn't think of it before...

David Lake    117

If anything that uses the GPU is slower, then won't using it for scaling be slow?

Edited by David Lake

Vortez    2714

GPUs are not slower than CPUs; they are just more optimized for parallel tasks and for working with vectors and 3D/graphics/texture stuff, while CPUs are better at serial operations. Most compression algorithms are serial by nature, I think. I think what Adam_42 meant is that it takes time to transfer the data from normal memory to GPU memory and back, and that time could instead be spent compressing on the CPU, making GPU compression pointless.

I can't really tell you why scaling a texture on the GPU is faster, but it is; it's one of the things the GPU is good at. Also, think about it: isn't it better to send pictures at a fixed size and render them at whatever size you want on the other side, rather than scaling first and being stuck with that size? I prefer the first option. That way you can resize the window that draws the screenshot and DirectX will scale it for you effortlessly; all you have to do is draw a quad the size of the render window and the texture will stretch with it automatically. If you don't want the picture distorted, it's a little more work, since you have to leave black borders, but that isn't complicated to compute either. (In fact, you don't draw the borders themselves; you just shrink the quad so it leaves some area black, or whatever you set the background color to.)
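
The letterbox quad can be computed with a couple of divisions. A small sketch (the `Viewport` and `FitQuad` names are made up; it returns the quad position and size inside the window, with the rest left to the background color):

```cpp
struct Viewport { int x, y, w, h; };

// Compute the largest quad that fits the window while preserving the
// image's aspect ratio; the leftover area stays the background color.
Viewport FitQuad(int imgW, int imgH, int winW, int winH) {
    // Compare aspect ratios by cross-multiplying to stay in integer math.
    if (imgW * winH > imgH * winW) {
        int h = imgH * winW / imgW;          // width-limited: bars top/bottom
        return { 0, (winH - h) / 2, winW, h };
    } else {
        int w = imgW * winH / imgH;          // height-limited: bars left/right
        return { (winW - w) / 2, 0, w, winH };
    }
}
```

For example, an 800x600 screenshot drawn into a 400x400 window gets a 400x300 quad centered vertically; the texture stretches to the quad and the GPU does all the filtering.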

PS: Sorry if I'm not explaining very well, but English is not my native language.

Edited by Vortez

Bacterius    13165

> GPU are not slower than cpu, they are just more optimized to do parallel tasks and work with vectors and 3d/graphics stuffs, while cpus are more for serials operation.
>
> Most compression algorithm are serial i think.

I think he meant the data transfer. GPUs are good for games because the frame stays on the GPU. For this application you would need to constantly send the frame to the GPU, scale it, and then copy it back to the CPU for transfer, which may be too slow (PCI-E has very real bandwidth limits) and can have nontrivial latency.

There is always some overhead in doing GPU stuff, which is why in some cases it is best to let the CPU do it because the cost of getting the GPU involved exceeds the benefits.
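
To put rough numbers on that overhead: the computation is trivial, and every figure below is an assumed round number, not a measured one (an effective ~8 GB/s for the bus is a ballpark; real hardware varies a lot):

```cpp
// Back-of-the-envelope: milliseconds to move `bytes` at `bytesPerSecond`.
double TransferMs(double bytes, double bytesPerSecond) {
    return bytes / bytesPerSecond * 1000.0;
}
```

At an assumed 8 GB/s, a 1920x1200 32-bit frame (9,216,000 bytes) costs roughly 1.15 ms each way, before any GPU work and before the readback stall, so at high frame rates the round trip alone eats a meaningful slice of the frame budget.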

David Lake    117

> > GPU are not slower than cpu, they are just more optimized to do parallel tasks and work with vectors and 3d/graphics stuffs, while cpus are more for serials operation.
> >
> > Most compression algorithm are serial i think.
>
> I think he meant the data transfer. GPU's are good for games because the frame stays on the GPU. For his application he would need to constantly send the frame to the GPU, scale it, and then copy it back to the CPU for transfer, which may be too slow (PCI-E has very real bandwidth limits) and can have nontrivial latency.
>
> There is always some overhead in doing GPU stuff, which is why in some cases it is best to let the CPU do it because the cost of getting the GPU involved exceeds the benefits.

Yes, thank you.

I suppose I'd better look for performance optimizations elsewhere in the code; after doing a performance analysis, it seems array copying is quite slow.

Vortez    2714

> I suppose I'd better be looking for performance optimization elsewhere in the code, after doing a performance analysis it seems array copying is quite slow.

Of course it is. Just multiply the width and height of your screen, then multiply by 3, to get an idea of how many bytes you need to copy: my 1360x768 screen is 1360 * 768 * 3 = 3,133,440 bytes, or about 3 MB. Without some form of compression or optimization, that takes quite a while to transfer. Although, in my program at least, using 2, 8, 16 or 24 bits per pixel doesn't seem to help much; I might be bottlenecked somewhere else, but I still get a pretty decent result (a couple of frames per second).

Edited by Vortez

David Lake    117

I am aware; my test involved 1920x1200*4 (32-bit), but still, a 9,216,000-byte frame is nothing for quad-channel DDR3 at 2133MHz with an effective bandwidth of 45GB/s and 26.8 GIPS per core.

Edited by David Lake

Vortez    2714

Ouch!

Vortez    2714

Another optimization I do is to paint the desktop background black when the connection is made, then restore it afterward; a black background is a lot easier to compress. Although with the XOR trick Adam pointed out, that becomes rather useless.

David Lake    117

> Ouch!

Is that sarcasm? I can't tell.

Vortez    2714

I just meant that that's about three times more pixels than my example, so without compression it must take a while to transfer a single frame over the network.

With my 8 Mbit/s connection, a frame that size would take about 100 seconds to upload and 10 to download... hence the "ouch!", haha.

Edited by Vortez

David Lake    117

I'm curious how you got XOR to work. I check whole pixels and use the alpha channel to tell whether a pixel is unchanged rather than just black; if XOR is done on each sub-pixel with no alpha channel, there's no way to tell whether a zero means black or unchanged, is there?

Edited by David Lake

aqrit    119

Why would you need to draw a distinction between black and unchanged? (Assuming black is zero.)

```
previous[3] = { green, black, green };
current[3]  = { black, black, green };

encode (xor previous with current frame):
    green xor black = green
    black xor black = black
    green xor green = black
=> packed[3] = { green, black, black }

decode (xor previous with packed frame):
    green xor green = black
    black xor black = black
    green xor black = green
=> which is back to the current frame
```

If you're doing this for the first frame, just use zero-filled memory (black) as the previous frame.

Edited by aqrit

Vortez    2714

The point is not to check whether a pixel has changed, but to turn every unchanged pixel black. Then, when compressing, if two images are identical you get a buffer full of zeros, which is very compressible. I haven't tried it yet, but I know it works; it's like XOR encryption. All you need is a buffer with the previous image and one with the current image, and you XOR all those bytes before sending. A second XOR pass on the other side restores the original image.

I'm pretty sure it's faster than my quadtree algorithm.

Edited by Vortez

Vortez    2714

Btw, you can extract your bitmap at 24 bits per pixel if you wish, by setting the BITMAPINFO's bmiHeader.biBitCount member to 24:

```cpp
// ... some code removed

// De-select our hbmp
SelectObject(s_hdc, ex_hbmp);

// Allocate a BITMAPINFO buffer (BMISize covers the header plus color table)
LPBITMAPINFO lpbi = (LPBITMAPINFO)(new BYTE[BMISize]);
ZeroMemory(lpbi, sizeof(BITMAPINFO));

// Let GetDIBits fill in information about the screenshot image format
GetDIBits(s_hdc, hbmp, 0, h, NULL, lpbi, DIB_RGB_COLORS);

// Make sure it's going to be extracted in 24-bit format
lpbi->bmiHeader.biBitCount    = 24;
lpbi->bmiHeader.biCompression = BI_RGB;

// Extract the image in 24-bit format
GetDIBits(s_hdc, hbmp, 0, h, pSrc->GetBuffer(), lpbi, DIB_RGB_COLORS);

// ...
```

Edited by Vortez

David Lake    117

Oh yeah, I understand now; my brain doesn't work as well as it used to, and I'm only 24!

Yippee, that sped it up a bit!

Edited by David Lake

Vortez    2714

```
1 xor 1 = 0
1 xor 0 = 1
0 xor 1 = 1
0 xor 0 = 0
```

Do some exercises with pen and paper on two bytes: XOR them twice with identical values, then twice with different values.

You'll get it eventually.

Edit: Oh, now it was me who thought you were being sarcastic, haha.

Edited by Vortez

David Lake    117

Now I would like a faster way to display the image than a PictureBox, if possible.

I also need to remove the alpha channel, since BitBlt gives me no choice about that. What's the best way to do that, in the XOR loop in my DLL?

Edited by David Lake

Vortez    2714

You need DirectX or OpenGL for that; in C# I don't know exactly how that would work. All you have to do is create a texture once, then update it with the new image each frame, and draw it on a quad the size of the screen using that texture.

As for the alpha channel, I can post all my code, but it's in C++:

```cpp
//-----------------------------------------------------------------------------
// Draw the cursor
//-----------------------------------------------------------------------------
void CScreenShot::DrawCursor(HDC hDC)
{
    CURSORINFO CursorInfo;
    CursorInfo.cbSize = sizeof(CURSORINFO);
    GetCursorInfo(&CursorInfo);

    static DWORD Version = WinVer.DetectWindowsVersion();

    DWORD CursorWidth  = GetSystemMetrics(SM_CXCURSOR);
    DWORD CursorHeight = GetSystemMetrics(SM_CYCURSOR);

    POINT CursorPos;
    GetCursorPos(&CursorPos);

    // Needed for XP or older Windows versions
    if(Version < _WIN_VISTA_){
        CursorPos.x -= CursorWidth  >> 2;
        CursorPos.y -= CursorHeight >> 2;
    }

    DrawIconEx(hDC, CursorPos.x, CursorPos.y, CursorInfo.hCursor,
               CursorWidth, CursorHeight, 0, NULL, DI_NORMAL);
}

//-----------------------------------------------------------------------------
// Take a screenshot, extract it to a buffer in 24 bits, and compress it
//-----------------------------------------------------------------------------
int CScreenShot::GenMPEGScreenShot(CVideoEncoder *pVideoEncoder, BOOL ShowCursor)
{
    HWND hDesktopWnd = GetDesktopWindow();
    HDC  hdc = GetDC(hDesktopWnd);

    int x = 0;
    int y = 0;
    int w = GetSystemMetrics(SM_CXSCREEN);
    int h = GetSystemMetrics(SM_CYSCREEN);

    HDC     s_hdc   = CreateCompatibleDC(hdc);
    HBITMAP hbmp    = CreateCompatibleBitmap(hdc, w, h);
    HBITMAP ex_hbmp = (HBITMAP)SelectObject(s_hdc, hbmp);

    // Copy the screen image into our bitmap
    BitBlt(s_hdc, x, y, w, h, hdc, x, y, SRCCOPY);

    // Draw the cursor over the image
    if(ShowCursor)
        DrawCursor(s_hdc);

    ReleaseDC(hDesktopWnd, hdc);

    // Create pointers to our buffer objects
    CRawBuffer *pSrc = &Buffers.MPEG.ScreenShot;
    CRawBuffer *pDst = &Buffers.MPEG.Encoded;

    // Allocate the source buffer (3 bytes per pixel)
    DWORD NumPixels = w * h;
    if(pSrc->GetBufferSize() != NumPixels * 3)
        pSrc->Allocate(NumPixels * 3);

    // Allocate a BITMAPINFO buffer (BMISize covers the header plus color table)
    LPBITMAPINFO lpbi = (LPBITMAPINFO)(new BYTE[BMISize]);
    ZeroMemory(lpbi, sizeof(BITMAPINFO));

    // De-select our hbmp
    SelectObject(s_hdc, ex_hbmp);

    // Let GetDIBits fill in information about the screenshot image format
    GetDIBits(s_hdc, hbmp, 0, h, NULL, lpbi, DIB_RGB_COLORS);

    // Make sure it's going to be extracted in 24-bit format
    lpbi->bmiHeader.biBitCount    = 24;
    lpbi->bmiHeader.biCompression = BI_RGB;

    // Extract the image in 24-bit format
    GetDIBits(s_hdc, hbmp, 0, h, pSrc->GetBuffer(), lpbi, DIB_RGB_COLORS);

    // Delete the BITMAPINFO buffer
    SAFE_DELETE_ARRAY(lpbi);

    // Release the bitmap handles
    if(SelectObject(s_hdc, hbmp)){
        DeleteObject(hbmp);
        DeleteDC(s_hdc);
    }

    // Convert from BGR to RGB
    Convert24bitsBGRTORGB(pSrc->GetBuffer(), pSrc->GetBufferSize());

    // Compress the frame using ffmpeg; leave 6 bytes of room for the header
    int FrameSize = pVideoEncoder->EncodeFrame(pDst->GetBuffer(6),
                                               pSrc->GetBuffer(),
                                               pSrc->GetBufferSize());

    // Prepend the frame size and message id
    WORD MsgID = MSG_MP1_IMG_REQUEST;
    memcpy(pDst->GetBuffer(0), &FrameSize, sizeof(DWORD));
    memcpy(pDst->GetBuffer(4), &MsgID,     sizeof(WORD));

    // Free our source buffer
    pSrc->Free();

    // Return the compressed buffer size
    Size = FrameSize;
    return Size;
}
```
}