Not sure where this goes, but the solution is kind of graphicsy, so I put it here!
As part of my robotics research, I hacked together a simple way of tracking arbitrary modeled 3D objects with a Kinect or similar depth sensor, using only a fixed-function graphics pipeline. Maybe it will be useful to a game programmer somewhere? Imagine you have a bunch of household objects with known models: you could move these objects around in front of a Kinect sensor and use their tracked poses as input to the game.
Here's a video:
On the left we have a point cloud showing a table with a rock and a drill. I move around the rock and drill, and the system tracks them both. On the right, we see an offscreen buffer that is used to render synthetic point clouds. The synthetic point clouds are matched with the sensor cloud to track the objects.
I make the following assumptions:
- We have accurate models of the objects we wish to track.
- We have a good estimate of the initial position of the objects in the Kinect image frame.
- The objects either lie on tables, or are being held so that they are mostly visible.
The algorithm works like this:
- Initialize the object positions in the Kinect frame using user help, fiducials, or an offline template-matching algorithm.
- Each frame, get a Kinect point cloud.
- Cull any points near large planes, which we assume are tables (use RANSAC to find the planes).
- Now, render synthetic point clouds for each of the objects. The way we do this is extremely simple. We just color each object uniquely, and render the entire scene in an offscreen buffer (this is the image on the right in the video). Then, we sample from the depth buffer to find the z-coordinate of each pixel.
- For each point in each object's synthetic point cloud, find its nearest point in the Kinect sensor cloud, keeping the pair only if that point lies within a radius D (we set D to 10 cm). The nearest-neighbor search is done using an octree.
- Using the corresponding points, run one iteration of ICP to find a correction.
- Transform the object by the correction returned by ICP.
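To make the per-frame steps more concrete, here is a rough NumPy sketch of the plane culling, the correspondence search, and a single ICP correction. This is not my actual code, just an illustration under some assumptions: the function names, the 1 cm plane threshold, and the iteration count are made up for the example, the nearest-neighbor search is brute force rather than an octree, and the rigid fit uses the standard SVD (Kabsch) solution, which is one common way to do a point-to-point ICP step.

```python
import numpy as np

def remove_dominant_plane(cloud, threshold=0.01, iterations=200, seed=0):
    """RANSAC: find the plane with the most inliers, return the points off it."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(cloud), dtype=bool)
    for _ in range(iterations):
        # Three random points define a candidate plane.
        p0, p1, p2 = cloud[rng.choice(len(cloud), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        length = np.linalg.norm(normal)
        if length < 1e-12:  # collinear sample, no plane
            continue
        normal = normal / length
        # Inliers are points within `threshold` of the candidate plane.
        inliers = np.abs((cloud - p0) @ normal) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return cloud[~best_inliers]

def find_correspondences(synthetic, sensor, max_dist=0.10):
    """Pair each synthetic point with its nearest sensor point within max_dist.

    Brute force O(N*M) for clarity; an octree or k-d tree replaces this in practice.
    """
    d2 = ((synthetic[:, None, :] - sensor[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    keep = d2[np.arange(len(synthetic)), nearest] <= max_dist ** 2
    return synthetic[keep], sensor[nearest[keep]]

def rigid_correction(src, dst):
    """Least-squares R, t such that R @ src[i] + t ~= dst[i] (Kabsch / SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp_step(synthetic, sensor, max_dist=0.10):
    """One ICP iteration: match points, then solve for the rigid correction."""
    src, dst = find_correspondences(synthetic, sensor, max_dist)
    if len(src) < 3:  # too few matches, leave the pose unchanged
        return np.eye(3), np.zeros(3)
    return rigid_correction(src, dst)
```

Each frame you would call `remove_dominant_plane` once on the sensor cloud, then `icp_step` once per object, and compose the returned `R, t` onto that object's current pose before rendering the next frame.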
What we get is a system that can (sort of) track multiple 3D objects in (near) real time, and can handle occlusion so long as all of the objects in the scene are known. It could be made more useful if we also model the location of the human hands holding the objects. Work also needs to be done to figure out where the objects are to begin with...
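One detail of the rendering step that is easy to get wrong is turning depth-buffer values back into metric coordinates, since a perspective depth buffer is nonlinear in z. Below is a small sketch of that unprojection, again only as an illustration. It assumes a standard OpenGL-style symmetric perspective projection (as produced by `gluPerspective`), the default [0, 1] depth range, and that the depth and per-object ID images have already been read back into arrays (for example with `glReadPixels`). The function name and the convention that ID 0 means background are made up for the example.

```python
import numpy as np

def depth_to_clouds(depth, ids, fovy_deg, near, far):
    """Unproject a perspective depth buffer into per-object eye-space clouds.

    depth: (H, W) window-space depth in [0, 1], row 0 = bottom row (OpenGL order)
    ids:   (H, W) integer object id per pixel, 0 = background (assumed convention)
    Returns {object_id: (N, 3) array of eye-space points}, camera looking down -z.
    """
    h, w = depth.shape
    # Window depth [0, 1] -> normalized device z in [-1, 1].
    z_ndc = 2.0 * depth - 1.0
    # Invert the perspective depth mapping to get distance along the view axis.
    dist = 2.0 * near * far / (far + near - z_ndc * (far - near))
    tan_half = np.tan(np.radians(fovy_deg) / 2.0)
    aspect = w / h
    # Pixel centers -> normalized device x, y in [-1, 1].
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x_ndc = 2.0 * (u + 0.5) / w - 1.0
    y_ndc = 2.0 * (v + 0.5) / h - 1.0
    points = np.stack([x_ndc * dist * tan_half * aspect,
                       y_ndc * dist * tan_half,
                       -dist], axis=-1)
    # Split the image into one cloud per uniquely colored object.
    return {int(i): points[ids == i] for i in np.unique(ids) if i != 0}
```

The resulting clouds are in the virtual camera's frame, so the virtual camera needs to share the Kinect's intrinsics and pose (or the points need to be transformed accordingly) before matching against the sensor cloud.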