# Finding Best Fit Lines Algorithm

This topic is 3497 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

So I have a basic system set up where I have N (x,y) points in an array. My goal is to find all linear regressions with r^2 greater than 0.95 of these points. For a naive implementation, I did something like this:
#points is our array of points.  n = points.size
1.upto(2**n-1) { |i|
s = i.to_s(2)
s = '0'*(n-s.size)+s

included_pairs = []
(s.length-1).downto(0) { |j|
included_pairs << points[j] unless s[j] != 1
}

#perform linear regression on included pairs
}

Maybe not the best way of doing it. If you can't tell what I am doing, I am basically being cute and turning a number into a binary string, then using that string to determine which members of the array I should test on. This allows me to tells all 2**n-1 combinations of points for linear regression. Now, this works great. Except for the whole 2^n thing. After thinking for a bit, I realized that I could be culling a whole lot with a little logic (and what seems like a lot of work). The thought is this: if I start with [1,1,0,...,0] for my inclusion array, and my line doesn't work, then I can just jump to testing [1,0,1,0,...,0]. If it does work, than I can try [1,1,1,0,...,0]. If that doesn't work, then I can stop there and go on to [1,1,0,1,0,...,0]. This allows me to cull lots and lots of costly matrix multiplications (ordinary least squares, whoo!), and should drastically cut down on my total loop time (nobody likes an exponential loop!). Except I am having a brain far and can't code up this logic. In my head it makes sense, but actually getting it to work seems to be rather tricky. Cookie for the first working implementation (or shove in the right direction!) My first thought is defining a 'major' and 'minor' bit mask for each loop. When the minor doesn't work, you shift right and retest, adding a 0 to the major bit mask. If it does work, you add it to the major bit mask and shift the minor right and start testing again... but the logic gets really complicated really quickly trying to remember where you were. perhaps some sort of stack with pushing and popping is in order? Corey

##### Share on other sites
def test_regression(major, minor)        if minor == 0            return []         end	s = (major | minor).to_s(2)	s = '0'*(n-s.size)+s		included_pairs = []	(s.length-1).downto(0) { |j|		included_pairs << points[j] unless s[j] != 1	}				### do linear regression here				if worked		### here we fork -- continuing on two paths		return [r_squared, etc] + test_regression(major | minor, minor>>2) + test_regression(major, minor>>1)	else		return test_regression(major, minor>>1) 	endendworking_regressions = []n.downto(1) { |i|	j = 2 ** (i-1)		major = j	minor = j/2		working_regressions = working_regressions + test_regression(j, j/2)}

maybe? does this look right (besides the scoping issues with n)? testing this is fucking impossible.

[Edited by - visage on July 30, 2008 10:02:50 AM]

##### Share on other sites
I don't understand your problem. What are you trying to do?

##### Share on other sites
Figured it out!

n = points.sizevalid_trends = []		def test_regression(points, major, minor, n, type, date)	delta = 0.025			if minor == 0		return []	end			s = (major | minor).to_s(2)	s = '0'*(n-s.size)+s		included_pairs = []	0.upto(s.length-1) { |j|		included_pairs << points[j] if s[j].to_i == 1	}				#### test linear regression here	if !invalid		### here we fork -- continuing on two paths		#puts "\t\tValid: #{est[0]}x + #{est[1]} @ #{start}"		return  [[r, est, start, included_pairs.map { |e| e[0] }]] + 			test_regression(points, major | minor, minor>>1, n, type, date) + 			test_regression(points, major, minor>>1, n, type, date)		else			return test_regression(points, major, minor>>1, n, type, date) 		end	endendif n > 1	j = 2**(n-1)	while j > 0				valid_trends = valid_trends + test_regression(points, j, j/2, n, type, date)		j = j >> 1	endend

I think that is the solution...it seems to work in empirical tests of the data...

Basically, I was trying to derive an equation whose expected runtime was less than 2^n (though worst case was still 2^n) by using logic to cull out possibilities that couldn't occur...