Um, I can see where the video suggestion comes from. One of the methods mentioned requires multiple images, and a user who's holding a phone is not going to have a steady aim (unlike e.g. a tripod), especially not when pressing the button. So instead of taking a single photo, you could grab a few consecutive frames of video and use them for the algorithm. The user would probably still think it's just like taking a pic, since it takes so little time =P
Alternatively, you could take one pic when the user presses the button and another when the user releases it. The two pics would be from slightly different viewpoints and could achieve the same result.
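If you go the burst route, you'd still need to decide which frames to hand to the algorithm. Here's a rough sketch of one way to do it, not anything from the methods mentioned: out of a short burst, pick the two frames that differ the most, on the assumption that a bigger pixel difference roughly means a bigger change in viewpoint. The frames here are just plain grayscale images stored as lists of rows, all the same size and non-empty, and the function names `mean_abs_diff` and `pick_most_different_pair` are made up for this example.

```python
from itertools import combinations


def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def pick_most_different_pair(frames):
    """Return the indices (i, j), i < j, of the two frames that differ the most.

    Assumption: pixel difference is used as a crude stand-in for how much
    the phone moved between the two frames.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    return max(
        combinations(range(len(frames)), 2),
        key=lambda ij: mean_abs_diff(frames[ij[0]], frames[ij[1]]),
    )


# Tiny made-up burst of three 2x2 frames.
burst = [
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[5, 5], [5, 5]],
]
i, j = pick_most_different_pair(burst)
print(i, j)
```

With the press/release idea you'd skip the selection entirely, since you only ever have two pics to begin with.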
Don't pay much attention to "the hedgehog" in my nick, it's just because "Sik" was already taken =/ By the way, Sik is pronounced like seek, not like sick.