This morning I went for a long drive and started thinking about image recognition. The question is: how can you train a computer to recognize a common image pattern, even if many parts of the image change? In other words, can we teach a computer to recognize and identify memes? The thing with meme images is that they usually reuse the same base image with different text overlays. Here's an example:
Here's another example:
So, if you took both of these images, hashed them with MD5, stored the results in a database somewhere, and then only looked at the hash values, you would have no way of knowing that these are practically the same image with minor variations. How could you devise a better way to know what you're looking at and whether it's similar to other things? And why would you want to?
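To see why, here's a tiny sketch in Python (with byte strings standing in for real image files) of the avalanche effect: flip a single byte and the MD5 digest changes completely.

```python
import hashlib

# Two "images" (raw byte strings) differing by a single byte --
# hypothetical stand-ins for the two meme variants above.
image_a = b"\x00" * 1024
image_b = b"\x00" * 1023 + b"\x01"

hash_a = hashlib.md5(image_a).hexdigest()
hash_b = hashlib.md5(image_b).hexdigest()

print(hash_a)
print(hash_b)
print(hash_a == hash_b)  # False: one changed byte gives a totally different digest
```

The digests share nothing in common, so comparing them tells you nothing about how similar the underlying images are.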
This is probably a solved problem and I'm probably reinventing the wheel -- and much more inefficiently, as first attempts tend to go. I imagine that many much smarter engineers working at Facebook and Google have already solved this problem and written extensive white papers on the subject, of which I am ignorant. But, I think the journey towards the discovery of a solution is half the fun and also a good exercise. It's like trying to solve a math problem in a textbook before checking the answer in the back of the book, right? The point isn't to get the answer, but to develop the method... So, onwards to what I came up with!
My initial thought was that you'd just feed an image into a machine learning neural network. If the network is big enough, the magic black box is complicated enough, and we give it enough repetitions in supervised learning, maybe it can learn to recognize images and the similarities between them? But that sounds really complicated, and I don't like things that "just work" by unexplainable "magic". It's great when it works, but what do you do when it doesn't work and you don't know why? What hope would you ever have of diagnosing the problem and fixing it? Maybe neural networks are enough and I'm unnecessarily overthinking things, but that's the fun of it.

What if... instead of taking a hash value of the complete image, we created several hash values of subdivisions of the image? We could create a quad tree out of an image, generate a hash value for each quadrant, and keep repeating the process until we get to some minimum-sized subdivision. Then, we store all of these hash values in a database as an index to the image data. To test if an image is similar to another, we just look at the list of subdivided hash values and see if we find any matches. The more matches you find, the more statistically likely you are to have a similar image.
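Here's a rough sketch of that idea in Python. The image is a toy 2D grid of pixel values, and the minimum subdivision size and the use of MD5 are arbitrary choices for illustration:

```python
import hashlib

def region_bytes(img, x, y, size):
    """Flatten a size x size region of a 2D pixel grid into bytes."""
    return bytes(img[y + j][x + i] for j in range(size) for i in range(size))

def quadrant_hashes(img, x=0, y=0, size=None, min_size=2):
    """Recursively hash square quadrants, quad-tree style, down to min_size."""
    if size is None:
        size = len(img)
    hashes = [hashlib.md5(region_bytes(img, x, y, size)).hexdigest()]
    if size > min_size:
        half = size // 2
        for dx, dy in ((0, 0), (half, 0), (0, half), (half, half)):
            hashes += quadrant_hashes(img, x + dx, y + dy, half, min_size)
    return hashes

# Toy 4x4 "image" with 8-bit pixels; real code would use decoded image data.
img = [[(x + y) % 256 for x in range(4)] for y in range(4)]
hashes = quadrant_hashes(img)
print(len(hashes))       # 5: the root plus its 4 quadrants
print(len(set(hashes)))  # 4: two quadrants are byte-identical and share a hash
```

Conveniently, the toy gradient already demonstrates the matching idea: two of its quadrants contain identical bytes, so they produce the same hash and would match in the database.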
Of course, this all assumes that the images are going to be exactly the same (at the binary level) with the exception of the text. If even one bit is off, then the generated hash value will be completely different. This could be particularly problematic with lossy image formats such as JPEG, so doing hashes of image quadrants may not necessarily be the best approach. But what if we looped back to our machine learning idea and had a neural network specialize in recognizing subsections of an image, despite any lossy or incomplete data? The tolerance for lossiness would make it easier to correctly identify an image subquadrant and match it to a set of high- and low-resolution image quadrants.
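For what it's worth, a lighter-weight alternative to a neural network here is a perceptual hash. A difference hash ("dHash"), for instance, downscales the image and compares neighboring pixels, so small lossy changes flip only a few bits instead of producing an entirely different value. A minimal sketch, with a plain 2D list standing in for real grayscale image data:

```python
def dhash(gray, hash_w=8, hash_h=8):
    """Difference hash: bit is 1 where a pixel is darker than its right
    neighbor in a downscaled image. Noise or small overlays flip only a
    few bits, unlike a cryptographic hash."""
    h, w = len(gray), len(gray[0])
    # Naive nearest-neighbor downscale to (hash_w + 1) columns x hash_h rows.
    scaled = [[gray[y * h // hash_h][x * w // (hash_w + 1)]
               for x in range(hash_w + 1)]
              for y in range(hash_h)]
    bits = 0
    for row in scaled:
        for a, b in zip(row, row[1:]):
            bits = (bits << 1) | (1 if a < b else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# A 16x16 grayscale gradient, and a copy with a "text overlay" blotch.
gray_a = [[x * 16 for x in range(16)] for y in range(16)]
gray_b = [row[:] for row in gray_a]
for y in range(2):
    for x in range(4):
        gray_b[y][x] = 255

print(hamming(dhash(gray_a), dhash(gray_b)))  # 3 -- only 3 of 64 bits differ
```

A similarity check then becomes "is the Hamming distance below some threshold?" rather than "are the hashes exactly equal?", which buys you the lossiness tolerance without a black box.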
But now, this is where it gets interesting.
What if... when someone uploads an image online, rather than storing the whole image, we subdivide the image into quadrants and compare each quadrant against a set of existing image quadrants? The only data we would upload would be whatever quadrants had no matches in the database; the rest of the image would be stored as references to the existing quadrants. Theoretically, you could get some pretty intense data compression. A 1024x1024 image at 32 bits per pixel (~4.2 MB) could be 90% represented by pre-existing hash values referencing image blocks already in the database, and you'd be looking at a massive reduction in disk space usage. Rather than storing the full image itself, you'd be storing the deltas on a quadrant-by-quadrant basis, and when people ask for an image, it gets reconstructed on the fly from the set of hash values that describe it.

And, if you lay out your data this way, you can create a heat map of which data quadrants are more popular than others, and then make those quadrants far more "hot" and accessible. In other words, if a meme or video goes viral and needs to serve out a billion views, you would want a distributed file system which makes thousands of copies of that data available on hundreds of mirrors, and then let a network load balancer find the fastest data block to send to a user. The more frequently a block of data is requested, the more distributed and readily accessible it gets (with an obvious time decay, of course, to handle dead memes). And in the best case, if someone uploads an exact copy of an image whose hash values already match something someone else uploaded, we just create a duplicate reference to that data instead of duplicating the data itself.
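A content-addressed block store along these lines can be sketched in a few lines of Python. The class and method names are made up for illustration, and blocks here are arbitrary byte strings rather than real image quadrants:

```python
import hashlib

class BlockStore:
    """Toy content-addressed store: blocks are keyed by their hash, and an
    image is just an ordered list of block hashes. Uploading a block that
    already exists stores nothing new."""

    def __init__(self):
        self.blocks = {}  # block hash -> raw block bytes
        self.images = {}  # image id -> ordered list of block hashes

    def put_image(self, image_id, blocks):
        hashes = []
        for block in blocks:
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # dedup: keep first copy only
            hashes.append(digest)
        self.images[image_id] = hashes

    def get_image(self, image_id):
        # Reconstruct the image on the fly from its block references.
        return b"".join(self.blocks[h] for h in self.images[image_id])

store = BlockStore()
store.put_image("meme_v1", [b"base-image", b"caption: one"])
store.put_image("meme_v2", [b"base-image", b"caption: two"])
print(len(store.blocks))  # 3 unique blocks stored, though 4 were uploaded
```

The shared `b"base-image"` block is stored once and referenced twice, which is exactly the delta-style savings described above.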
Then the question becomes: how and when do we delete data? If there's a dead meme, that's one thing. But what if someone uploads naked pictures of me to the internet and I want them deleted (just to personalize the stakes)? Both are use cases that need to be handled elegantly by the underlying design. In the case of dead memes, we would enforce a gradual decay timer based on the frequency of access to each data block. If a block of data hasn't had a read request for months and we have four copies, maybe we can automatically delete three of them to free up space without completely deleting the data? If a user uploads an image, they don't "own" the image; rather, they're given a hash value which represents the composition of that image, which may have data blocks shared by other users who uploaded similar images. A "delete" action by this user would only delete their hash link rather than the underlying image itself. Maybe we also maintain a reference counter, such that when a block's count reaches zero, it gets marked for garbage collection in a month or so?
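The delete-as-unlink idea with a reference counter might look something like this (names are hypothetical, and the month-long grace period is elided to a simple "pending GC" set):

```python
class BlockRefs:
    """Sketch of delete-as-unlink: users hold references to blocks, and block
    data only becomes eligible for garbage collection at refcount zero."""

    def __init__(self):
        self.refcount = {}     # block hash -> number of images referencing it
        self.pending_gc = set()

    def link(self, block_hash):
        self.refcount[block_hash] = self.refcount.get(block_hash, 0) + 1
        self.pending_gc.discard(block_hash)  # a new reference rescues the block

    def unlink(self, block_hash):
        self.refcount[block_hash] -= 1
        if self.refcount[block_hash] == 0:
            # Don't delete immediately: mark for garbage collection so the
            # block survives a grace period (say, a month) before removal.
            self.pending_gc.add(block_hash)

refs = BlockRefs()
refs.link("abc123")    # user A uploads an image using this block
refs.link("abc123")    # user B uploads a similar image sharing it
refs.unlink("abc123")  # user A "deletes" -- only their reference goes away
print("abc123" in refs.pending_gc)  # False: user B still references it
refs.unlink("abc123")
print("abc123" in refs.pending_gc)  # True: now eligible for eventual GC
```

A real system would also need the decay timer from the dead-meme case, but the core point is that a user's "delete" never touches shared data directly.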
In the case of someone uploading naked pictures of me, where removing the image is more important than just deleting references to it, there would need to be some sort of purge command which not only deletes all references to all data blocks, but also deletes the data blocks themselves, so any future read accesses will fail. This would be a dangerous section of code and would need to be written very carefully, with security and abuse in mind.

But that raises another question: who gets to decide who gets "delete" access and who gets "purge" access? Do you own every quadrant of an image you upload, even if it's shared among other related images? Let's say a vicious person took a naked picture of me and uploaded it to this cloud, and then I go through the process and get this image purged. What's going to stop this vicious person from uploading the image again, and causing me new headache and harm? Ideally, we'd like to block the vicious person from ever being able to upload the image again. But in order to do that, we'd have to know that the data someone is trying to upload has been deemed "forbidden" by the system before we can block it. And how are we to know a block of data is forbidden without comparing it against an existing dataset?

To be much more specific towards a real-world problem, let's pretend that someone uploads child porn to our image cloud. They use the network to distribute the CP to other pedophiles as quickly as they can before it gets found and shut down. The server owners find and immediately purge the content as quickly as they can and want to block any further attempts to upload the same material, but they run into a bit of a challenge: in order to know what content to block, you must store copies of the content to recognize it, and that ultimately would mean you're still storing child porn on your servers, which would then get you into legal trouble.
I think my quadrant-based hash value idea would still be an elegant way to resolve this. Instead of storing the data itself, you store a list of banned hash values. If, during the image decomposition, one or more of the hashed image quadrants match the list of banned hash values, then you reject the complete image. If your hashing function has a one-in-a-trillion collision rate, you don't really need to worry about false positives (provided your quadrants don't get down to the granular level of individual pixels).
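The banned-list check could be as simple as this sketch (the hash choice and names are placeholders; note that only hashes of the rejected content are stored, never the content itself):

```python
import hashlib

# Hypothetical banned list: quadrant hashes of previously purged content.
BANNED_HASHES = {hashlib.sha256(b"forbidden-quadrant").hexdigest()}

def screen_upload(quadrant_blocks):
    """Reject an upload if any quadrant's hash is on the banned list."""
    for block in quadrant_blocks:
        if hashlib.sha256(block).hexdigest() in BANNED_HASHES:
            return False  # one banned quadrant rejects the whole image
    return True

print(screen_upload([b"harmless", b"forbidden-quadrant"]))  # False
print(screen_upload([b"harmless", b"also harmless"]))       # True
```

This is essentially how real-world hash-matching systems work: the server keeps digests, not the forbidden material, so recognition doesn't require possession.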
The other danger is that someone is a bit too liberal with the "purge & ban" button. Imagine that an artificial neural network isn't used just for pattern analysis to identify images and image quadrants, but also to identify paragraphs, sentences, and words in blogs, forums, and message boards. If someone posts copypasta which matches a forbidden topic, such as "Falun Gong" or the "Tiananmen Square Massacre" in China, this sort of system could potentially be used by authoritarians to squash free speech and ideas they don't like. That could ultimately be more dangerous and damaging to the flourishing of mankind in the long run than a permissive policy. Imagine it gets really crazy, where the homeowners association for your neighborhood has legal authority to silence anyone, and someone petty with a spiteful bone to pick gets happy with the "purge & ban" button against a neighbor they don't like? Obviously, the room for abuse on both the admin and user sides will need to be very carefully planned for and designed around. For that, I don't have easy technical solutions to people-centered problems... Maybe in this case, a diversity of platforms and owners would be better than one large shared cloud? But maybe that's just trading one problem for another.
Anyways, this is the kind of stuff my mind tends to wander towards when I'm stuck in traffic.