Kylotan

Matching algorithms


I am interested in finding out about any matching algorithms that can pair up items in one list with items in a second list, given a function that returns a degree of similarity between two items, and given that the lists may be of different sizes and some items may not match at all. Trying every permutation would take factorial time and therefore be prohibitively slow on anything but short lists, so I'm hoping there is a good heuristic approach. Would ranking all possible matches and then feeding those rankings to a stable matching algorithm be one good approach?
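Something like this sketch is the kind of thing I have in mind (Python; similarity(), cutoff, and the names are placeholders I've made up):

def stable_match(xs, ys, similarity, cutoff):
    # Each x ranks the acceptable ys by decreasing similarity; pairs below
    # the cutoff are dropped so genuinely poor matches stay unmatched.
    prefs = {i: [j for j in sorted(range(len(ys)),
                                   key=lambda j: -similarity(xs[i], ys[j]))
                 if similarity(xs[i], ys[j]) >= cutoff]
             for i in range(len(xs))}
    engaged = {}                     # j -> index of the x currently holding y_j
    nxt = [0] * len(xs)              # next preference each x will propose to
    free = list(range(len(xs)))
    while free:
        i = free.pop()
        while nxt[i] < len(prefs[i]):
            j = prefs[i][nxt[i]]
            nxt[i] += 1
            if j not in engaged:
                engaged[j] = i
                break
            if similarity(xs[i], ys[j]) > similarity(xs[engaged[j]], ys[j]):
                free.append(engaged[j])   # bump the weaker suitor
                engaged[j] = i
                break
        # if an x exhausts its list, it simply stays unmatched
    return sorted((i, j) for j, i in engaged.items())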

Just an idea:
If the function for determining similarity, f(a,b), can be written as something like g(a) - g(b) + smallfactor(a,b), then you could sort the lists by g and look only at close neighbours in the sorted order.
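In Python the idea might look roughly like this (g and max_gap stand in for whatever suits the real data):

import bisect

def candidate_pairs(xs, ys, g, max_gap):
    # Sort one list by g; for each x, only ys with a nearby g-value are
    # worth passing to the full similarity function.
    ys_sorted = sorted(ys, key=g)
    keys = [g(y) for y in ys_sorted]
    for x in xs:
        gx = g(x)
        lo = bisect.bisect_left(keys, gx - max_gap)
        hi = bisect.bisect_right(keys, gx + max_gap)
        for y in ys_sorted[lo:hi]:
            yield x, y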
Depending on what you are using it for you could also take a look at sequence alignment algorithms.

Yeah, sequence alignment is a better name for what I'm looking for, although there won't be any exact matches, just a lot of very close ones. Yes, that would mean maximising the sum of the similarities, or minimising the sum of the differences, but I don't think that alone is sufficient, because some data in one or both lists may need to remain unmatched. The lists are already roughly sorted, but the similarity function could be quite complex.

Surely there is an existing matching/permutation algorithm that minimizes the total least squares (TLS) of the differences. Try searching for that keyword. Do you need the absolute best match, or will a good local minimum do?

It may not be possible to score the differences, only to rank them. So I'm interested in more algorithms than just ones that numerically minimise differences, and also in how they handle sequences of different lengths: what is the square of the difference between an item and a non-existent item, for example?

You can solve this problem exactly (with respect to some metric) in O(n^6) with a minimum-cost flow algorithm, where n = |a| + |b|. Maybe it's only O(n^5); I didn't try to find a tight upper bound.

Both lists can be of different length and arbitrary similarity relations are allowed.
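As a concrete starting point, the closely related assignment formulation can be solved exactly with SciPy's linear_sum_assignment (a sketch; difference() and no_match_cost are placeholders). Padding the cost matrix with dummy rows and columns lets an item stay unmatched whenever every real pairing would cost more than no_match_cost:

import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_pairs(a, b, difference, no_match_cost):
    n, m = len(a), len(b)
    cost = np.full((n + m, n + m), float(no_match_cost))
    for i in range(n):
        for j in range(m):
            cost[i, j] = difference(a[i], b[j])
    rows, cols = linear_sum_assignment(cost)
    # keep only the real-to-real pairings; everything else stayed unmatched
    return [(i, j) for i, j in zip(rows, cols) if i < n and j < m]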

Quote:
Original post by Kylotan
what is the square of the difference between an item and a non-existent item, for example?


Usually it is a normalization constant in the minimization algorithm.

This sounds like a classic nearest neighbours search problem, which can be solved efficiently using a k-d tree representation of the data. Create one tree per list (where data is distributed based on intra-list similarity). Then, if you want to find a match for each item in each list: for each item not presently matched, use it as the query to find the nearest neighbour in the other list. You can choose to find the n nearest neighbours with minimal additional runtime cost.
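A sketch of that idea with SciPy's k-d tree, assuming each item can be reduced to a feature vector (feature() and max_distance are placeholders, not my real code):

import numpy as np
from scipy.spatial import cKDTree

def nearest_matches(list_a, list_b, feature, max_distance):
    tree = cKDTree(np.array([feature(b) for b in list_b]))
    matches = {}
    for i, a in enumerate(list_a):
        dist, j = tree.query(feature(a))    # nearest neighbour in list B
        if dist <= max_distance:
            matches[i] = j                  # otherwise item i stays unmatched
    return matches

Note that this can pair several A items with the same B item; a greedy pass over the results (or the assignment formulation mentioned above) would enforce one-to-one pairing.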

Your problem sounds very much like a data set cross-correlation problem I solved a couple of years ago. It becomes more efficient to use nearest neighbour search than brute force when the number of items for comparison reaches something like 10^5 (if I recall the literature correctly).

If you get desperate for an implementation, drop me an email and I'll forward you some code you can play with.

Cheers,

Timkin

Would cross-correlation work for what you want?

I've used the zero-mean normalized cross-correlation (ZNCC) algorithm successfully to find similar chunks of data in two images.
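The score itself is simple; roughly this, for two equal-length chunks (a sketch in Python):

import numpy as np

def zncc(u, v):
    # Zero-mean normalized cross-correlation: 1.0 means identical shape,
    # 0.0 means uncorrelated, -1.0 means inverted.
    u = np.asarray(u, dtype=float) - np.mean(u)
    v = np.asarray(v, dtype=float) - np.mean(v)
    denom = np.sqrt(np.sum(u * u) * np.sum(v * v))
    return float(np.sum(u * v) / denom) if denom else 0.0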

Will

Hmm, one more thing, which I didn't stress enough before: the two lists are roughly in order, so if the data were perfect, x[1] would match y[1], x[2] would match y[2], and so on. Unfortunately, items are sometimes missing from one list or the other, and more generally, the values in one list are biased by an unknown amount relative to the other list.

A good analogy (in 2D rather than 1D) would be matching fingerprints, where the lines of a good input match and the model it's matched against will be largely superimposed, but with a degree of error on either side. Also, the input sample will not be in exactly the same position in the image data as the training model; perhaps one is off-centre, for example. (This corresponds to my bias factor.)

Does that make sense? :)

Guest Anonymous Poster
Could you approach it in a similar way to a diff algorithm? Sounds very similar to me.

Look up "dynamic time warping". Useful for roughly linear correlations between two sequences.
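A bare-bones version of the DTW recurrence, for reference (dist is whatever pairwise distance suits the data):

def dtw(xs, ys, dist):
    # D[i][j] = cost of the best monotonic alignment of xs[:i] with ys[:j].
    INF = float('inf')
    n, m = len(xs), len(ys)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(xs[i - 1], ys[j - 1])
            # match both items, skip an element of xs, or skip one of ys
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]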

Quote:
Original post by Kylotan
Hmm, one more thing, which I didn't stress enough before: the two lists are roughly in order, so if the data were perfect, x[1] would match y[1], x[2] would match y[2], and so on. Unfortunately, items are sometimes missing from one list or the other, and more generally, the values in one list are biased by an unknown amount relative to the other list.


Okay, that's a slightly different problem to the way I interpreted your initial post. Correlation algorithms don't typically care where in the list of data you find the information, only what information is contained in the data set (actually, that one data set contains information about the other... Mutual Information is a better way of understanding correlation, IMHO). So, my question is this: consider lists A and B, such that A = { x1,x2,x3,x4,... } and B = { x2,x3,x4,... }. What do you want to say about these lists? Do you want to say they are completely mismatched (because a1!=b1, a2!=b2, etc.), or that they are offset (B(i) = A(i+1)), or that they have a distribution of offsets, or something else entirely?

Cheers,

Timkin

I don't really want to say anything about the lists, just about the data items within them. In this case, I want to report that there are good matches for x2, x3, and x4, and no good match for x1. I'm not interested in the offset except insofar as it aids matching, and the offset I referred to was not necessarily an offset in the ordinal position of list members (although that can and does sometimes exist) but an offset between the actual data values being matched. E.g. I would want the list [1,2,100,101] to match [5,6,104,105]; here the offset looks like +4 to get the second set of values from the first. However, there are error values too, so I'd also want it to match [5,6,105,106], for example.
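To make that concrete, this very rough sketch is the sort of thing I have in mind: estimate a single global offset (here, the median pairwise difference of the tentatively aligned items) and then accept matches within some tolerance (tol is a placeholder):

import statistics

def match_with_offset(xs, ys, tol):
    # Estimate the bias between the lists, then match within a tolerance.
    offset = statistics.median(y - x for x, y in zip(sorted(xs), sorted(ys)))
    matches = []
    for x in xs:
        best = min(ys, key=lambda y: abs(y - (x + offset)))
        if abs(best - (x + offset)) <= tol:
            matches.append((x, best))
    return matches

With [1,2,100,101] and [5,6,105,106] this estimates an offset of 4.5 and, with tol = 1, finds a within-tolerance match for every x (a real version would also need to stop two x's claiming the same y).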

I'm not comparing strings. :) The items are largely judged on two floating point values, one of which dictates the ordering, the other of which just has to 'match'.

This may not be the best solution, but...

From what you have described, I think you could use a cross-correlation algorithm. Cross-correlation will return a degree of similarity for any offset you try.



want to find [1, 2, 3, 4, 5, 6]

in [1, 1, 1, 1, 4, 3, 3, 5, 6, 7, 1, 1, 1, 1]
                ----------------


Start at offset 0, then try offset 1, 2, 3, 4, 5, 6, and so on; if currScore > bestScore then bestOffset = currOffset.

Cross-correlation should tell you that the first list appears at offset 4 in the second list.

Noise in the data shouldn't be a major problem, as the correct match (even if not perfect) should have a higher similarity score than any other offset.

If list 1 is bigger than list 2, then search list 1 for the occurrence of list 2; otherwise search list 2 for list 1.

It might help to normalize the blocks of data (as opposed to normalizing the entire list).
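A rough sketch of that sliding search (sum-of-squared-differences is used as the score here purely for brevity; ZNCC or any other similarity measure could be swapped in):

import numpy as np

def best_offset(needle, haystack):
    # Slide the short list over the long one, keeping the best-scoring offset.
    needle = np.asarray(needle, dtype=float)
    haystack = np.asarray(haystack, dtype=float)
    best, best_err = 0, float('inf')
    for off in range(len(haystack) - len(needle) + 1):
        window = haystack[off:off + len(needle)]
        err = float(np.sum((window - needle) ** 2))
        if err < best_err:
            best, best_err = off, err
    return best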


Will


If I understand the problem correctly, a similar algorithm is needed to do feature matching in two photos to recover 3D information. A method I particularly like is described here:

HP Labs: Stereo via SVD

Basically, features are found (you could use corners or edges in an image... maybe spikes in your 1D data), and the correlation is performed on the features of each data set using singular value decomposition (SVD). So feature detection is used to reduce the data set, and then SVD is used to robustly correlate the data sets.
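As I understand the trick (after Scott and Longuet-Higgins, which the paper above builds on), it goes roughly like this; features_a and features_b are (n, d) and (m, d) arrays, and sigma is a placeholder scale:

import numpy as np

def svd_pairings(features_a, features_b, sigma):
    # Gaussian proximity matrix between the two feature sets.
    diff = features_a[:, None, :] - features_b[None, :, :]
    G = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))
    # Replace the singular values with ones; the resulting P amplifies
    # strong, mutually exclusive pairings.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt
    pairs = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if i == int(np.argmax(P[:, j])):   # accept only mutual maxima
            pairs.append((i, j))
    return pairs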

Quote:
Original post by Kylotan
I'm not comparing strings. :) The items are largely judged on two floating point values, one of which dictates the ordering, the other of which just has to 'match'.


Okay, given the clarification in your last two posts, you are looking at a cross-correlation problem. Since you have two floats, you can map your list members into a 2D continuous space and run the nearest-neighbour search I mentioned. The distributions of the distances to the nearest neighbours (intra- and inter-list) are needed for the cross-correlation computation. If it can wait until Saturday, I'll dig out the code I have, remove the relevant sections, make sure it has appropriate documentation... and email it to you.

Cheers,

Timkin

