
Display screen remotely



#1 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 02:18 PM

I have a program I made for remotely controlling PCs that uses BitBlt, bitmaps, and a bit of processing and compression. I would like to improve the way it captures and displays the desktop using SlimDX, if that can be used to increase performance.

 

At the moment the capture side is very heavy on the CPU. I use the thread pool for threading and split the screen capture and the processing/sending of frames into different threads, but I would like to reduce CPU utilization without compromising performance if possible.

The main problem is that I can't control the login screen; I would like to know if there's a way around that.

 

Also, I need a faster way (than GDI+) to scale the images before sending; I'm hoping SlimDX can speed this up using the GPU.

 

Is there a relatively easy way to use SlimDX to capture the screen and display it remotely that's at least as fast as using BitBlt, GDI+, and bitmaps?
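For reference, a minimal sketch of the kind of GDI-based capture path being described here, using Graphics.CopyFromScreen (which uses BitBlt under the hood) and copying the pixels into a byte array for sending. This is illustrative only; the class and method names are not from the actual program.

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

static class ScreenGrabber
{
    // Capture the primary screen into a raw 32-bit ARGB byte array.
    public static byte[] CaptureFrame(int width, int height)
    {
        using (var bmp = new Bitmap(width, height, PixelFormat.Format32bppArgb))
        {
            using (var g = Graphics.FromImage(bmp))
            {
                // CopyFromScreen is a managed wrapper around BitBlt.
                g.CopyFromScreen(0, 0, 0, 0, new Size(width, height));
            }

            var data = bmp.LockBits(new Rectangle(0, 0, width, height),
                                    ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
            try
            {
                var buffer = new byte[data.Stride * height];
                Marshal.Copy(data.Scan0, buffer, 0, buffer.Length);
                return buffer;
            }
            finally
            {
                bmp.UnlockBits(data);
            }
        }
    }
}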




#2 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 03:02 PM

Use DirectX to scale your image, not GDI; it will be super fast. I have a similar working project and that's what I do, except I use OpenGL, but it's almost the same thing.

 

For the CPU, I don't really see any way other than adding a sleep before or after you capture the screen. It might be a bit slower, but keep in mind that your image is going to be streamed over the network anyway, which isn't that fast. In fact, I multithreaded mine so that, after sending a screenshot, it starts a thread to capture the next one while the current one is being sent. Keep in mind, though, that multithreading adds some complexity like synchronisation, but I think it's easier in C# than in C++, which is what I used.

 

Good luck.

 

EDIT: Oh, I didn't see that you'd already multithreaded that. Experiment by adding a Sleep somewhere; it might help.


Edited by Vortez, 27 June 2013 - 03:04 PM.


#3 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 03:20 PM

Thanks. I can easily slow it down, but ultimately I want it to go faster without using two whole cores.

I'm looking for code samples for fast capture and transfer of the screen over the network, possibly with SlimDX.

I have tried it before, but it was too slow because I had to turn it into a Drawing.Bitmap. If there's a way to avoid System.Drawing altogether and convert the sprite (or whatever it was) into a byte array to send over the network, that might make it faster.

Also, I'm a complete n00b when it comes to any sort of GPU programming, but if I could process and compress the frames on the GPU, that would be nerdgasmtastic!


Edited by David Lake, 27 June 2013 - 03:21 PM.


#4 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 03:58 PM

I don't know C# very well, but in C++ I have direct access to the bitmap buffer memory, so I can't say. I'm using Delphi for the interface part and a C++ DLL to do the hard work (networking, screenshots, hooks, compression).

 

I'm not sure if compression on the GPU is feasible. I'm using zlib to compress mine, but that isn't where the real speedup comes from. In fact, I use a quadtree: basically, I split the image in four, recursively, four or five times, then check which blocks have changed and send only the parts that changed. The quadtree is used to send differently sized parts of the image to the other side, so most of the time I only need to update part of the texture, and if nothing changed it just sends an empty screenshot message. It's a bit complex, but it's the best optimization I've found yet.

 

It's much the same thing you do when optimizing a terrain mesh with a quadtree, but with an image instead. I also tried using mp1 compression to send frames, like a movie, and it worked, but the image is blurry and it's slower than my other method, so I don't see any reason to use it.

 

Let DirectX do the scaling for you. And make sure to replace, not recreate, the texture each time you receive a screenshot; it's much faster that way.
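A rough sketch (not the poster's code) of what "replace, don't recreate" could look like with SlimDX's Direct3D 9 wrapper, assuming a dynamic A8R8G8B8 texture and a 32-bit BGRA frame buffer; the helper name and frame layout are assumptions.

using SlimDX;
using SlimDX.Direct3D9;

// Hypothetical helper: copy a received 32-bit frame into an existing dynamic texture.
static void UpdateFrameTexture(Texture texture, byte[] frame, int width, int height)
{
    // Lock the top-level surface and overwrite its contents in place.
    DataRectangle rect = texture.LockRectangle(0, LockFlags.Discard);
    try
    {
        if (rect.Pitch == width * 4)
        {
            // Pitch matches the row size, so one bulk write is enough.
            rect.Data.Write(frame, 0, frame.Length);
        }
        else
        {
            // Otherwise copy row by row to respect the surface pitch.
            for (int y = 0; y < height; y++)
            {
                rect.Data.Position = (long)y * rect.Pitch;
                rect.Data.Write(frame, y * width * 4, width * 4);
            }
        }
    }
    finally
    {
        texture.UnlockRectangle(0);
    }
}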


Edited by Vortez, 27 June 2013 - 04:03 PM.


#5 Adam_42   Crossbones+   -  Reputation: 2619


Posted 27 June 2013 - 07:26 PM

The main problem with using the GPU for this is that transfers to and from the GPU are relatively slow; the CPU can read and write memory faster. This means that unless the processing required is expensive, it probably won't help compared to some optimized CPU code. However, if you want to go that way you may find http://forums.getpaint.net/index.php?/topic/18989-gpu-motion-blur-effect-using-directcompute/ useful; there's some source code available there.

 

For data compression I'd go with using the xor operator on pairs of adjacent frames. The result will be zeros where they are identical. You can then apply zlib to the results, which should compress all those zeros really well. Reconstructing at the other end is done with xor too. As the xor operator is really cheap that should be reasonably quick to do even on a CPU.
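A minimal sketch of that XOR-delta idea in C#, using DeflateStream as a stand-in for zlib (the class and method names here are illustrative, not anyone's actual code):

using System.IO;
using System.IO.Compression;

static class FrameDelta
{
    // XOR the current frame against the previous one; unchanged bytes become zero.
    public static byte[] Delta(byte[] previous, byte[] current)
    {
        var delta = new byte[current.Length];
        for (int i = 0; i < current.Length; i++)
            delta[i] = (byte)(previous[i] ^ current[i]);
        return delta;
    }

    // Compress the (mostly zero) delta. DeflateStream stands in for zlib/QuickLZ here.
    public static byte[] Compress(byte[] data)
    {
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
                deflate.Write(data, 0, data.Length);
            return output.ToArray();
        }
    }

    // Applying the same XOR on the receiving side restores the current frame.
    public static byte[] Apply(byte[] previous, byte[] delta)
    {
        return Delta(previous, delta); // XOR is its own inverse
    }
}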

 

To cut down CPU load make sure you have a frame rate limiter in there. There's no point processing more than 60 frames per second, and you can probably get away with far less than that.



#6 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 07:37 PM

I store the previous frame and zero the ARGB of any unchanged pixels (xor), then compress twice with QuickLZ; this makes highly compressible frames much smaller than a single compression does. All the processing is done by an optimized DLL built using the Intel C++ compiler.

Then, when displaying the frame (in a PictureBox with GDI+), I simply draw over the previous one, and since the alpha of the unchanged pixels is zero, the unchanged pixels of the previous frame show through (ingenious, I know!).

 

What's the fastest way to capture the screen, scale it, and get the frame into a byte array using the GPU, then display it without using GDI? I find GDI slows down when rendering at a high resolution such as 1920x1200, even on an i7 3820 at 4.3GHz.

 

Oh, and as for frame rate, I like it to be as fast as possible; that's why I don't use Remote Desktop. And if it did go over 60 FPS, I don't know how to limit it to exactly that.
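One simple way to cap the capture loop at a target rate is to time each iteration with a Stopwatch and sleep off the remainder of the frame budget. A minimal sketch; the loop body, the delegates, and the 60 FPS target are placeholders, not the actual program:

using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical capture loop capped at roughly 60 frames per second.
static void CaptureLoop(Func<bool> keepRunning, Action captureProcessAndSend)
{
    const int targetFps = 60;
    const long frameBudgetMs = 1000 / targetFps; // ~16 ms per frame

    var timer = new Stopwatch();
    while (keepRunning())
    {
        timer.Restart();

        captureProcessAndSend(); // the existing capture/compress/send work goes here

        // Sleep off whatever remains of this frame's time budget.
        long elapsed = timer.ElapsedMilliseconds;
        if (elapsed < frameBudgetMs)
            Thread.Sleep((int)(frameBudgetMs - elapsed));
    }
}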


Edited by David Lake, 27 June 2013 - 09:49 PM.


#7 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 09:03 PM


For data compression I'd go with using the xor operator on pairs of adjacent frames. The result will be zeros where they are identical. You can then apply zlib to the results, which should compress all those zeros really well. Reconstructing at the other end is done with xor too. As the xor operator is really cheap that should be reasonably quick to do even on a CPU.

 

That's actually a very good idea. Dunno why I didn't think of it before...



#8 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 09:42 PM

If anything that uses the GPU is slower, then won't using it for scaling be slow too?


Edited by David Lake, 27 June 2013 - 09:51 PM.


#9 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 10:29 PM

GPUs are not slower than CPUs; they are just more optimized for parallel tasks and for working with vectors and 3D/graphics/texture stuff, while CPUs are better for serial operations. Most compression algorithms are serial by nature, I think. I think what Adam_42 meant is that it takes time to transfer the data to be compressed from normal memory to GPU memory and back, and that time could instead be spent compressing on the CPU, which can make GPU compression pointless.

 

I can't really tell you why scaling a texture on the GPU is faster, but it is; it's one of the things the GPU is good at. Also, think about it: isn't it better to send pictures of a fixed size and render them at whatever size you want on the other side, rather than scaling first and then being stuck with that size on the other side? I prefer the first option. This way you can resize the window that draws the screenshot and DirectX will scale it for you effortlessly: all you have to do is draw a quad the size of the render window and the texture will stretch with it automatically. If you don't want the picture to be distorted, it's a little more work since you have to draw black borders, but it's not complicated to compute either. (In fact, it's not the borders you must draw; rather, you adjust the quad size so it leaves some area black, or whatever you set the background colour to. See the sketch below.)
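A small sketch of that aspect-preserving fit: given the render window size and the source frame size, compute the destination rectangle for the quad and leave the rest as background. The helper name is illustrative, and Rectangle is just a convenient stand-in for whatever structure the renderer uses.

using System;
using System.Drawing;

// Hypothetical helper: fit a frame of (srcWidth x srcHeight) into the render
// window without distortion, centring it and leaving letterbox/pillarbox bars.
static Rectangle FitToWindow(int srcWidth, int srcHeight, int windowWidth, int windowHeight)
{
    float scale = Math.Min((float)windowWidth / srcWidth,
                           (float)windowHeight / srcHeight);

    int destWidth = (int)(srcWidth * scale);
    int destHeight = (int)(srcHeight * scale);

    // Centre the scaled quad; the remaining area stays the background colour.
    int x = (windowWidth - destWidth) / 2;
    int y = (windowHeight - destHeight) / 2;

    return new Rectangle(x, y, destWidth, destHeight);
}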

 

PS: Sorry if I'm not explaining it very well, but English is not my native language.


Edited by Vortez, 27 June 2013 - 10:46 PM.


#10 Bacterius   Crossbones+   -  Reputation: 9282


Posted 27 June 2013 - 10:34 PM

GPUs are not slower than CPUs; they are just more optimized for parallel tasks and for working with vectors and 3D/graphics stuff, while CPUs are better for serial operations.

Most compression algorithms are serial, I think.

 

I think he meant the data transfer. GPUs are good for games because the frame stays on the GPU. For his application he would need to constantly send the frame to the GPU, scale it, and then copy it back to the CPU for transfer, which may be too slow (PCI-E has very real bandwidth limits) and can have nontrivial latency.

 

There is always some overhead in doing GPU stuff, which is why in some cases it is best to let the CPU do it because the cost of getting the GPU involved exceeds the benefits.


The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#11 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 10:46 PM

 

GPUs are not slower than CPUs; they are just more optimized for parallel tasks and for working with vectors and 3D/graphics stuff, while CPUs are better for serial operations.

Most compression algorithms are serial, I think.

 

I think he meant the data transfer. GPUs are good for games because the frame stays on the GPU. For his application he would need to constantly send the frame to the GPU, scale it, and then copy it back to the CPU for transfer, which may be too slow (PCI-E has very real bandwidth limits) and can have nontrivial latency.

 

There is always some overhead in doing GPU stuff, which is why in some cases it is best to let the CPU do it because the cost of getting the GPU involved exceeds the benefits.

 

 

Yes, thank you.

 

I suppose I'd better look for performance optimizations elsewhere in the code; after doing a performance analysis, it seems array copying is quite slow.



#12 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 10:59 PM


I suppose I'd better look for performance optimizations elsewhere in the code; after doing a performance analysis, it seems array copying is quite slow.

 

Of course it is. Just multiply the width and height of your screen, then multiply that by 3, to get an idea of how many bytes you need to copy. My 1360x768 screen is 1360 * 768 * 3 bytes = 3,133,440 bytes, or about 3 MB. Without some form of compression or optimization, that takes quite a while to transfer. Although, in my program at least, using 2, 8, 16, or 24 bits per pixel doesn't seem to help much; I might be bottlenecked somewhere else, but I still get pretty decent results (a couple of frames per second).


Edited by Vortez, 27 June 2013 - 11:02 PM.


#13 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 11:23 PM

I am aware; my test involved 1920x1200 at 4 bytes per pixel (32-bit), but a 9,216,000-byte frame is still nothing for quad-channel DDR3 at 2133MHz with an effective bandwidth of 45GB/s and 26.8 GIPS per core.


Edited by David Lake, 27 June 2013 - 11:33 PM.


#14 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 11:26 PM

Ouch!



#15 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 11:32 PM

Another optimization I do is to paint the desktop background black when the connection is established, then restore it afterwards; it's a lot easier to compress, although with the xor trick pointed out by Adam that becomes rather useless.



#16 David Lake   Members   -  Reputation: 118


Posted 27 June 2013 - 11:44 PM

Ouch!

 

Is that sarcasm? I can't tell.



#17 Vortez   Crossbones+   -  Reputation: 2704


Posted 27 June 2013 - 11:48 PM

I just meant that this is about three times more data than my example, so without compression it must take a while to transfer a single frame over the network.

 

With my 8 Mbit/s connection, a frame that size would take about 10 seconds to download, and around 100 seconds to upload on the much slower upstream... hence the "ouch!" haha.


Edited by Vortez, 27 June 2013 - 11:53 PM.


#18 David Lake   Members   -  Reputation: 118


Posted 28 June 2013 - 02:47 PM

I'm curious how you lot got xor to work. I check whole pixels and use the alpha channel to tell whether a pixel is unchanged rather than just black; if xor is done on each sub-pixel with no alpha channel, there's no way to tell whether it's zero or unchanged, is there?


Edited by David Lake, 28 June 2013 - 03:06 PM.


#19 aqrit   Members   -  Reputation: 119


Posted 28 June 2013 - 03:34 PM

Why would you need to draw a distinction between black and unchanged?

 

( assuming black is zero )

 

previous[3] = { green, black, green };

current[3] = { black, black, green };

 

encode: ( xor previous with current frame )

green xor black = green

black xor black = black

green xor green = black

 

so pak[3] = { green, black, black }

 

decode: ( xor previous with "packed" frame )

green xor green = black

black xor black = black

green xor black = green

 

which is back to current frame

 

If you're doing this for the first frame, just use zero-filled memory (black) as the previous frame.


Edited by aqrit, 28 June 2013 - 03:42 PM.


#20 Vortez   Crossbones+   -  Reputation: 2704


Posted 28 June 2013 - 05:17 PM

The point is not to check whether a pixel has changed, but to make all pixels that haven't changed black. Then, when compressing, if two images are identical you get a buffer full of zeros, which is very compressible. I haven't tried it yet but I know it works; it's like xor encryption. All you need is a buffer with the previous image and one with the current image, and you perform the xor on all those bits before sending and again after receiving. The second pass restores the original image.

 

I'm pretty sure it's faster than my quadtree algorithm.


Edited by Vortez, 28 June 2013 - 05:23 PM.




