How would I create a graphical recognition program?

Hi, I was wondering how I would create a program that just looks at the screen and finds whether a certain picture is somewhere on it. I think I know how I could check if the picture was on the screen, but I would need to have both the screen and the picture as multi-dimensional arrays of RGB values. I have had some experience with SDL (and I know that bitmaps are stored pretty much as arrays of RGB values, along with header info and other stuff), so I would just like to ask:

1. What language should I use?
2. Is there an already-made library that loads bitmaps and lets me access specific pixel data (I don't think SDL can do this, but I may be wrong)?
3. How do I get a screenshot?
4. How do I get my program to run in the background (because it would be pretty useless getting a screenshot if half of the screen is my program)?

I'd appreciate any help,
Peter
Whether you think you can or think you can’t, you’re probably right – Henry Ford
What exactly are you trying to capture?

Is the image rectangular? Does it have any distinct features? What kind of context will it be placed in? Is this a full-screen accelerated display or a desktop image? Is the image even in system memory, or is it rendered via an overlay?

Then you need to determine how to match the pixels. Will the screen be normalized or color-corrected? Can you match on a per-pixel basis, or will you need to transform into a feature space first? How will you build your comparison database for fast lookups? How will you check for false positives?

What you want is far from trivial, not just from a machine-vision perspective; it can also be extremely hard from a technical perspective. Graphics card manufacturers offer special hardware-based solutions for this since there's so much processing involved.
Erm, well I can get the image by pressing "Print Screen", and the image I am searching for is not rectangular.

I was just going to find the first non-transparent pixel in the image I am looking for (it will be no bigger than 64x64) and look for a pixel on the screen with the same RGB value. Then I was going to see if the next non-transparent pixel from the picture was the same colour as the corresponding one on the screen, and carry on like that, returning false if any of them don't match. If they all match, return true (or the position where the picture was found on the screen).

I'm sorry if that doesn't help much; I don't know a lot (I don't know what an accelerated display or a feature space is!).

Thanks for the help though
Whether you think you can or think you can’t, you’re probably right – Henry Ford
What do you mean by a transparent pixel? Once stuff is rendered to the screen there is no such thing as transparency; everything just has a simple RGB value, that's it. What's displayed on the screen is just a 2D grid of RGB values.

Image recognition is very hard; hard as in it's something that some of the best programmers in the world are working on and still can't do well except under specific circumstances.

So what _exactly_ are you trying to find on the screen? If it's exactly the same image displayed at exactly the same size every time, this is relatively easy. If it's something like "every time my friend Bob is on the screen" it's very very hard.

[EDIT: and to understand what I mean by "very hard": http://fishbowl.pastiche.org/2007/07/17/understanding_engineers_feasibility [smile]]

-me

[Edited by - Palidine on August 4, 2007 3:48:19 PM]
Quote:So what _exactly_ are you trying to find on the screen?


This isn't just a question from pesky doubters.

This is the crucial thing that will affect everything: the way you capture the screen, which algorithm you use, how to transform between the captured image and the format you want, how to locate the 64x64 square, and everything else.

*All* computer vision works on an incredibly rigid set of rules. Step even slightly outside them, and it will fail miserably.

It's the very fact that everything is just an RGB value that makes this task so hard. If you had access to the original data, or some other form of context, then it would be easy.


A feature-space transform takes the RGB image and turns it into some other representation. For example, if you were looking for rectangles on screen, you'd apply a Hough transform to the image and obtain a feature space consisting of nothing but lines. Then you'd post-process those lines to detect intersections and figure out which of them make up a rectangle.

Almost without exception, machine vision is a set of transformations that turn RGB pixels into a feature space you can actually use.
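To make that concrete, here's a rough sketch of the rectangle example using OpenCV's C++ API (which post-dates this thread); the thresholds and the input file name are just placeholders:

#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    // Hypothetical input: a screenshot saved to disk.
    cv::Mat img = cv::imread("screen.png", cv::IMREAD_GRAYSCALE);

    // First transform: raw pixels -> edge map.
    cv::Mat edges;
    cv::Canny(img, edges, 50, 150);

    // Second transform: edge map -> a "feature space" of line segments.
    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 80, 30, 10);

    // From here you would post-process the lines: find intersections,
    // group them into candidate rectangles, and so on.
    return 0;
}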

This is why it's critical to specify, down to the last detail, exactly what you're trying to do, so that a suitable algorithm can be chosen.

These are some basic examples of how this looks in real life, along with all the side-effects you get even under controlled conditions.
Well, I know the whole different shapes, different rotations, different sizes thing would be extremely difficult - but I am just wanting to do what Palidine says - the same image, exactly the same size every time.

The only differences will be the position and the background - by transparency I mean that the image I am searching for has some transparent pixels in it - which won't be drawn to the screen, therefore I will just skip checking for those pixels.

At the moment I would just like to know if there is a way to get the screen image, and a good library for handling images on a pixel level (I don't mind what format).
Whether you think you can or think you can’t, you’re probably right – Henry Ford
The print-screen functionality is user-invokable; just search MSDN for the hook. Otherwise I believe SDL gives you that functionality, as do DirectX and OpenGL.
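If you're on Windows, one way to get the screen into a buffer of pixels you can scan is plain GDI. A rough sketch, assuming a 32-bit desktop, with error handling omitted (the function name is just for illustration):

#include <windows.h>
#include <vector>

// Grab the whole desktop into a top-down 32-bit BGRA pixel buffer.
std::vector<unsigned char> CaptureScreen(int& width, int& height)
{
    width  = GetSystemMetrics(SM_CXSCREEN);
    height = GetSystemMetrics(SM_CYSCREEN);

    HDC screenDC = GetDC(NULL);                          // DC for the whole screen
    HDC memDC    = CreateCompatibleDC(screenDC);         // off-screen DC to copy into
    HBITMAP bmp  = CreateCompatibleBitmap(screenDC, width, height);
    HGDIOBJ old  = SelectObject(memDC, bmp);

    // Copy the current screen contents into our bitmap.
    BitBlt(memDC, 0, 0, width, height, screenDC, 0, 0, SRCCOPY);

    // Describe the pixel format we want back.
    BITMAPINFO bi = {};
    bi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bi.bmiHeader.biWidth       = width;
    bi.bmiHeader.biHeight      = -height;                // negative height = top-down rows
    bi.bmiHeader.biPlanes      = 1;
    bi.bmiHeader.biBitCount    = 32;
    bi.bmiHeader.biCompression = BI_RGB;

    std::vector<unsigned char> pixels(width * height * 4);
    GetDIBits(memDC, bmp, 0, height, &pixels[0], &bi, DIB_RGB_COLORS);

    // Clean up GDI objects.
    SelectObject(memDC, old);
    DeleteObject(bmp);
    DeleteDC(memDC);
    ReleaseDC(NULL, screenDC);
    return pixels;
}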

Quote:Original post by Peter Conn
The only differences will be the position and the background - by transparency I mean that the image I am searching for has some transparent pixels in it - which won't be drawn to the screen, therefore I will just skip checking for those pixels.


Remember that "same size" means same size in pixels. It will only work if the displayed resolution of the image is exactly the same as the resolution of the sample image. i.e. if the sample image is 100x100 pixels, this algorithm only works if the displayed image takes up exactly 100x100 pixels on the screen. So if it's appearing in an application that scales the image relative to window size, or relative to screen resolution this doesn't work. [smile]

That implementation is something like:

grab the screen buffer
for ( each row of pixels )
{
    match = sequence on this row matches first row of sample image
    while ( match )
    {
        go to same width index on next row
        match = this row matches the corresponding row in sample image
        if ( match && examined all sample rows )
        {
            return true;
        }
    }
}
return false;
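Fleshed out in C++, assuming the screen capture and the sample are already sitting in memory as top-down RGBA arrays (all the names here are just illustrative), that might look something like:

#include <cstdint>
#include <vector>

// One 8-bit-per-channel pixel; alpha only matters in the sample image.
struct Pixel { uint8_t r, g, b, a; };

// Brute-force search: slide the sample over every position of the screen
// and compare pixel by pixel, skipping the sample's transparent pixels.
bool FindImage(const std::vector<Pixel>& screen, int screenW, int screenH,
               const std::vector<Pixel>& sample, int sampleW, int sampleH,
               int& foundX, int& foundY)
{
    for (int y = 0; y + sampleH <= screenH; ++y)
    {
        for (int x = 0; x + sampleW <= screenW; ++x)
        {
            bool match = true;
            for (int sy = 0; sy < sampleH && match; ++sy)
            {
                for (int sx = 0; sx < sampleW && match; ++sx)
                {
                    const Pixel& s = sample[sy * sampleW + sx];
                    if (s.a == 0)
                        continue;                  // transparent sample pixel: skip it
                    const Pixel& p = screen[(y + sy) * screenW + (x + sx)];
                    if (p.r != s.r || p.g != s.g || p.b != s.b)
                        match = false;             // mismatch: give up on this position
                }
            }
            if (match)
            {
                foundX = x;                        // top-left corner of the match
                foundY = y;
                return true;
            }
        }
    }
    return false;
}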


If resolutions change or don't match the sample image, or the sample image rotates or skews, or gets partially occluded by other windows, etc., prepare to write at least a few hundred or a few thousand more lines of code. =)

-me
If you're wanting to search for a 64x64 image on the desktop as a background task, then it's either going to use an extremely large amount of CPU, or it's going to be unlikely to see the image if it isn't visible for long.

Why do you want to do this? Is it for something like WinTask, which as it happens has this functionality built in...
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms
Well, given your description this shouldn't be too hard. You mentioned that the image is always exactly the same and is never rotated.
So basically just scan the entire screen image top to bottom and left to right. Once you hit a pixel that is equal to the top-left corner pixel of the image you are searching for, increment an index counter. Then check the next pixel in your scan against the pixel one to the right of that corner, and so on. If they don't match at any point, reset back to the top-left corner and continue scanning. Extending this to 2D isn't hard.

Oh and:
1.) Use any language, up to you.
2.) I think there is DevIL for C++, which I've never used; the Win32 API and DirectX could also help. Java has built-in support and C# probably does too.
3.) You can use DirectX for screenshots; there may be another way I don't know about.
4.) I'd suggest minimizing it; check MSDN.

Regards

-CProgrammer
http://opencvlibrary.sourceforge.net/

OpenCV is a library designed for similar problems. Colleagues of mine currently use it to achieve optical tracking functionality. It's meant to be fast.
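For the "find this exact picture" case, OpenCV's template matching does most of the work for you. A rough sketch using the C++ API (the file names are placeholders; no promise this matches your exact setup):

#include <cstdio>
#include <opencv2/opencv.hpp>

int main()
{
    // Hypothetical input files: a screenshot and the picture you're looking for.
    cv::Mat screen = cv::imread("screenshot.png");
    cv::Mat sample = cv::imread("sample.png");

    // Slide the sample over the screenshot and score every position.
    // (Newer OpenCV versions also accept a mask here, which would let you
    // skip the transparent pixels of the sample.)
    cv::Mat result;
    cv::matchTemplate(screen, sample, result, cv::TM_CCOEFF_NORMED);

    // The best-scoring location is the most likely match.
    double minVal, maxVal;
    cv::Point minLoc, maxLoc;
    cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);

    if (maxVal > 0.99)   // arbitrary "exact enough" threshold
        printf("Found at %d, %d\n", maxLoc.x, maxLoc.y);
    else
        printf("Not found\n");
    return 0;
}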

No warranty that it's suited to your personal situation.

gl

