Started by Apr 12 2011 03:16 AM

,
18 replies to this topic

Posted 12 April 2011 - 03:16 AM

Hi, long time since I used these forums :-) I basically learned to code here, about 8 years ago! Even lost my old user uncutno.

Question:

I'm doing a nearest neighbor search with a couple of thousand 16x16 1-bit black and white images. I pack the 256 bits into 32 bytes, and use a superfast bit-comparator to count the number of matching bits, and that way i find the closest match.

Without any optimization I can do a search with 2000 instances within the performance requirements.

My first optimization was to create a distance map from every instance to each other, a NxN array. While searching trough the set, i keep track of the closest instance found until now, and then for the rest of the instances I can check in the distance map if i possibly can be closer to that instance, before actually checking the real distance. This makes the search run in about 66% of the time.

I have also tried a lot of other approaches, but with no success.

The problem is that any k-tree or any stuff like that does not work, since its basically 256 dimensions with only two position each; no way to "split" the dimension.

So, any idea on how to do this hierarchal? I have the fealing that i should be able to do this in about 10% of a linear search, and i have all the preprocessing time in the world :-)

Question:

I'm doing a nearest neighbor search with a couple of thousand 16x16 1-bit black and white images. I pack the 256 bits into 32 bytes, and use a superfast bit-comparator to count the number of matching bits, and that way i find the closest match.

Without any optimization I can do a search with 2000 instances within the performance requirements.

My first optimization was to create a distance map from every instance to each other, a NxN array. While searching trough the set, i keep track of the closest instance found until now, and then for the rest of the instances I can check in the distance map if i possibly can be closer to that instance, before actually checking the real distance. This makes the search run in about 66% of the time.

I have also tried a lot of other approaches, but with no success.

The problem is that any k-tree or any stuff like that does not work, since its basically 256 dimensions with only two position each; no way to "split" the dimension.

So, any idea on how to do this hierarchal? I have the fealing that i should be able to do this in about 10% of a linear search, and i have all the preprocessing time in the world :-)

Posted 12 April 2011 - 06:25 AM

How many images would you like to be able to compare against, and how much time do you have to do it? On roughly what hardware?

Posted 12 April 2011 - 06:40 AM

While I wait for your answers, here's an idea: If you coarsen the image to a 4x4 image, where each pixel is now a count of how many bits were set in its box, you know that the L1 distance between the coarse images is less or equal than the Hamming distance between the original 16x16 images. You can then use some partitioning structure on this lower-dimensional space and prune branches when you can prove that the distance will be larger than one you've already found.

Do you think this might work with your images?

Do you think this might work with your images?

Posted 12 April 2011 - 09:45 AM

It sounds like spatial partitioning trees can still work here. You'd just need to consider planes that aren't aligned with your axes. You could use a KD tree with another basis, like the Hadamard Basis . Or you could just use a general BSP tree.

Considering general BSP trees: A reasonable algorithm for building these would be to recursively split each data set with a plane through its mean with normal along its principal axis. A yet-better one would be to use k-means with k=2 to determine your splitting planes. If you look at the independent component analysis (ICA) literature, you might also find other ideas to consider.

Whatever splitting method you choose, for a general BSP-tree approach, you'd then at runtime be doing roughly O(log(N)) inner products (each of which is O(n^2)) (if you have N images, each of which is nxn), so you'd have O(n^2 log(N)) complexity, vs. the O(n^2 N) complexity you have now. That said, you'd be doing multiplications instead of bitwise comparisons, so the constants would be worse; there might be a point in the tree beyond which it's cheaper to brute-force.

Anyway I'm speculating at this point. Have you looked at e.g. this?

Ah hah! YES. Check out p. 52 of this pdf. I like this. Here, halfspaces are encoded by pairs of points (called "pivot pairs"): If you're closer to the first point, you're in the first halfspace; if you're closer to the second, you're in the second halfspace. What's nice about this is that (1) it allows you to stick with your fast bitwise stuff, without requiring more general arithmetic (no MULs!), and (2) there are only finitely-many planes to consider. Let's see... well, without even coming up with something cleverer, you can brute-force construct a near-optimal tree (by choosing the pair of pivots that most nearly divide your points in half at each splitting) in O(N^3) time...

There's other stuff out there; e.g. R-trees, M-trees, but I don't really know them well. They're all various kinds of bounding volume hierarchies.

Hope that's somewhat helpful.

Considering general BSP trees: A reasonable algorithm for building these would be to recursively split each data set with a plane through its mean with normal along its principal axis. A yet-better one would be to use k-means with k=2 to determine your splitting planes. If you look at the independent component analysis (ICA) literature, you might also find other ideas to consider.

Whatever splitting method you choose, for a general BSP-tree approach, you'd then at runtime be doing roughly O(log(N)) inner products (each of which is O(n^2)) (if you have N images, each of which is nxn), so you'd have O(n^2 log(N)) complexity, vs. the O(n^2 N) complexity you have now. That said, you'd be doing multiplications instead of bitwise comparisons, so the constants would be worse; there might be a point in the tree beyond which it's cheaper to brute-force.

Anyway I'm speculating at this point. Have you looked at e.g. this?

Ah hah! YES. Check out p. 52 of this pdf. I like this. Here, halfspaces are encoded by pairs of points (called "pivot pairs"): If you're closer to the first point, you're in the first halfspace; if you're closer to the second, you're in the second halfspace. What's nice about this is that (1) it allows you to stick with your fast bitwise stuff, without requiring more general arithmetic (no MULs!), and (2) there are only finitely-many planes to consider. Let's see... well, without even coming up with something cleverer, you can brute-force construct a near-optimal tree (by choosing the pair of pivots that most nearly divide your points in half at each splitting) in O(N^3) time...

There's other stuff out there; e.g. R-trees, M-trees, but I don't really know them well. They're all various kinds of bounding volume hierarchies.

Hope that's somewhat helpful.

Posted 13 April 2011 - 02:01 PM

WOW! Guys, thanks!

Still the same great responses I remember from my old days here :-) Have to look into these forums more often; even if my fixed-pipeline-graphics-knowledge is somehow outdated :-) I also know alot about AI, but more of the theoretical "not usable in realtime games"-type stuff, then the "how do I make my soldiers cooperate as a realistic squad"-stuff.

alvaro#1: Currently i do about 10 lookups in a set of 2600 images, 10 times per sec = 100 checks in a 2600 set. I do it on an iPhone 3GS arm 32-bit processor. This is brute force check all strategy, so my feeling is that I should be able to do about 100 check is a 15000 set with optimizations. If i could do 200-300 check against a set of about 5000, I would be more then happy.

alvaro#2: Good idea; rank them along number of pixels in square, and then divide on that. Should give me the possibility to exclude a lot of the set + do some space splitting. I have a little pixel-lab where I try out stuff and measure improvement. I promise to post results if I try this out; should be interesting.

Emergent: Ah, thats how you do it :-) you don't need to split along axes. I did a lot of 3D BSP stuff back in the days, so I totally get this now (I sort of fell of the space-stuff when going from 3 to 256 dimensions :-) ). As I read trough your answer, I totally follow you. The two point axis makes perfectly sense regarding my bitstuff. I could even do a sqrt table lookup for distance, since square distance is int 0 - 256. Also thanks for a hint on how to find those points, the perfect divider-algo should be pretty simple to create.

Just to let you know:

Since last time, i optimized my distance-map system, and I optimized my bitstuff. Even if I hate to do low level optimizations; I managed to run in 50% time.

Distance map optimization: For all instances in set, I know the distance to all other instances. If my test is X away from instance 1, and instance 1 is Y away from instance 2, I don't need to do a full distance-test for instance 2 if Y > X*2.

Bitflick optimization: My old version had each case packed into bytes, so my 256 bits was inside 32 bytes. For each of those I did a bit-compare-count. My new version packs the bits into 8 32-bit ints. That way I get away with much less instructions, since I only need to do most of them 8 times (instead of 32).

I will implement the BSP solution and measure efficiency. Since the penalty for increasing the set is lower if I do BSP, I would actually be happy if i could match the current speed. Hopefully I _should_ be able to do it in 5-10% time :-)

Still the same great responses I remember from my old days here :-) Have to look into these forums more often; even if my fixed-pipeline-graphics-knowledge is somehow outdated :-) I also know alot about AI, but more of the theoretical "not usable in realtime games"-type stuff, then the "how do I make my soldiers cooperate as a realistic squad"-stuff.

alvaro#1: Currently i do about 10 lookups in a set of 2600 images, 10 times per sec = 100 checks in a 2600 set. I do it on an iPhone 3GS arm 32-bit processor. This is brute force check all strategy, so my feeling is that I should be able to do about 100 check is a 15000 set with optimizations. If i could do 200-300 check against a set of about 5000, I would be more then happy.

alvaro#2: Good idea; rank them along number of pixels in square, and then divide on that. Should give me the possibility to exclude a lot of the set + do some space splitting. I have a little pixel-lab where I try out stuff and measure improvement. I promise to post results if I try this out; should be interesting.

Emergent: Ah, thats how you do it :-) you don't need to split along axes. I did a lot of 3D BSP stuff back in the days, so I totally get this now (I sort of fell of the space-stuff when going from 3 to 256 dimensions :-) ). As I read trough your answer, I totally follow you. The two point axis makes perfectly sense regarding my bitstuff. I could even do a sqrt table lookup for distance, since square distance is int 0 - 256. Also thanks for a hint on how to find those points, the perfect divider-algo should be pretty simple to create.

Just to let you know:

Since last time, i optimized my distance-map system, and I optimized my bitstuff. Even if I hate to do low level optimizations; I managed to run in 50% time.

Distance map optimization: For all instances in set, I know the distance to all other instances. If my test is X away from instance 1, and instance 1 is Y away from instance 2, I don't need to do a full distance-test for instance 2 if Y > X*2.

Bitflick optimization: My old version had each case packed into bytes, so my 256 bits was inside 32 bytes. For each of those I did a bit-compare-count. My new version packs the bits into 8 32-bit ints. That way I get away with much less instructions, since I only need to do most of them 8 times (instead of 32).

I will implement the BSP solution and measure efficiency. Since the penalty for increasing the set is lower if I do BSP, I would actually be happy if i could match the current speed. Hopefully I _should_ be able to do it in 5-10% time :-)

Posted 13 April 2011 - 02:09 PM

Here is my superfast distance2 code. Found it on the web (as the fastest possible ), and updated it from 8-bits to 32-bits.

The code tag is killing me today :-)

int distance2 = 0; for(int i=0;i<8;i++){ uint32_t value; value = A->data32[i] ^ B->data32[i]; while (value) { distance2++; value = (value - 1) & value; } } return distance2;

The code tag is killing me today :-)

Posted 13 April 2011 - 02:20 PM

I thought of giving you some advice about the low-level comparison of images, but I took your description ("superfast bit-comparator") at face value.

Is your code anything like this?

EDIT: I didn't see your previous post. This method should be faster, unless your images are very very similar.

Is your code anything like this?

int popcount(unsigned x) { x -= (x >> 1) & 0x55555555u; x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); x = (x + (x >> 4)) & 0x0f0f0f0fu; x += x >> 8; x += x >> 16; return x & 0x3f; } int hamming_distance(const char c1[32], const char c2[32]) { const unsigned *u1 = reinterpret_cast<unsigned const *>(c1); const unsigned *u2 = reinterpret_cast<unsigned const *>(c2); int result = 0; for (int i=0; i<8; ++i) result += popcount(u1[i] ^ u2[i]); return result; }

EDIT: I didn't see your previous post. This method should be faster, unless your images are very very similar.

Posted 13 April 2011 - 02:21 PM

Emergent: The pdf, page 52 is perfect! Exactly what I need :-) Thanks again.

Posted 13 April 2011 - 02:24 PM

alvaro: Yes, but I have no idea of why the popcount return matching bits. I notice that you don't have a loop, so ill check that out :-) Thanks!

Posted 13 April 2011 - 02:30 PM

This thread is so cool that Ill post back every experiment + every result that i do!

Could i get it down to 5%? ( given that most interesting cases actually have a pretty close neighbor ? )

Could i get it down to 5%? ( given that most interesting cases actually have a pretty close neighbor ? )

Posted 13 April 2011 - 02:43 PM

alvaro: Yes, but I have no idea of why the popcount return matching bits. I notice that you don't have a loop, so ill check that out :-) Thanks!

You can find similar code with comments here. It should be easy for you to try it out. I suspect it will be significantly faster.

Posted 13 April 2011 - 03:04 PM

Amazing.

Optimization #1: use alvaros popcount tok me from benchmark 2480ms to 1138ms. rolling out the 8 step loop took me down to 1086ms, thats 43.8% of the original time! I guess those while-branches really killed my performance :-) (Even if I'm still at O(N) ) BTW; I also checked that the results are exactly the same.

Now lets get on with the space partitioning!

Optimization #1: use alvaros popcount tok me from benchmark 2480ms to 1138ms. rolling out the 8 step loop took me down to 1086ms, thats 43.8% of the original time! I guess those while-branches really killed my performance :-) (Even if I'm still at O(N) ) BTW; I also checked that the results are exactly the same.

Now lets get on with the space partitioning!

Posted 13 April 2011 - 03:08 PM

...I suspect it will be significantly faster.

About twice the speed :-)

Posted 13 April 2011 - 03:32 PM

For nearest neighbor search, a k-d tree or BSP is not likely to be a good idea in 256 dimensions. The reason for this is that the there is a lower bound on the query time for any range searching data structure with linear storage (ie all BSP type data strucutres) of Omega(n^((d-1)/d)). In this case, as d grows larger, this quickly approaches linear cost, which basically means that you will be back to your ordinary brute force in terms of performance. A better option for intermediate/small dimensions is to use an r-tree, which requires more space but guarantees polylog range queries. However, the space storage here is exponentially large in d, and so it would again be a terrible idea.

As a result, I would be inclined to recommend that you forgo using a spatial indexing tree and instead default to using a locality sensitive hash instead. The behavior of locality sensitive hashes for hamming distance computations are now quite well understood and should get you pretty good results. Also they are much easier to implement, and tend to behave nicely with cache performance. Here is a link from wikipedia: (scroll to the bottom to see how to use it to compute nearest neighbors)

http://en.wikipedia....amming_Distance

As a result, I would be inclined to recommend that you forgo using a spatial indexing tree and instead default to using a locality sensitive hash instead. The behavior of locality sensitive hashes for hamming distance computations are now quite well understood and should get you pretty good results. Also they are much easier to implement, and tend to behave nicely with cache performance. Here is a link from wikipedia: (scroll to the bottom to see how to use it to compute nearest neighbors)

http://en.wikipedia....amming_Distance

Posted 13 April 2011 - 03:59 PM

robotichrist: Thanks, I'll check that out :-) I have already implemented the BSP stuff (running it right now :-) ), so we will know in a really short time what the result is :-)

Posted 14 April 2011 - 12:05 AM

Hmm... BSP does not work :-) I clock in on 134% used time compared to normal brute force run, and thats before I use my distance-map-stuff, so I guess it runs in about 150% compared to that. I know that my BSP isnt super-optimized, but I was hoping to get better results; not to much to optimize in the BSP code.

Posted 14 April 2011 - 02:15 PM

I don't know if you are still working on this problem, but you may want to take a look at this paper:

Gionis, Indyk and Motwani, "Similarity Search in Higher Dimensions via Hashing", Proc. of VLDB (1999) PDF

As far as I know, it is the current best solution to the hamming distance nearest neighbor problem.

Gionis, Indyk and Motwani, "Similarity Search in Higher Dimensions via Hashing", Proc. of VLDB (1999) PDF

As far as I know, it is the current best solution to the hamming distance nearest neighbor problem.

Posted 14 April 2011 - 03:21 PM

For nearest neighbor search, a k-d tree or BSP is not likely to be a good idea in 256 dimensions. The reason for this is that the there is a lower bound on the query time for any range searching data structure with linear storage (ie all BSP type data strucutres) of O(n^((d-1)/d)). In this case, as d grows larger, this quickly approaches linear cost, which basically means that you will be back to your ordinary brute force in terms of performance.

Wow, that's not obvious. You must be referring to this. I think my intuition is going to be having screaming matches with that paper for a little while, until I trust its assumptions.

Posted 16 April 2011 - 01:35 PM

robotichrist: Thanks! Ill check it out :-) I have been working on the problem for half a year, so no reason to stop now! (not just the nearest neighbor stuff, but the whole system)

Emergent: I agree; it _SHOULD_ work :-)

Emergent: I agree; it _SHOULD_ work :-)