
Writing an Equation For Data ("regression"?)

This topic is 4592 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I have a bunch of data (x,y points) that I want to write an equation for (i.e. the points closely follow a certain behavior, and I want to write an equation that fits the data and its characteristics).

Pictorially, here are the points (look at the last 2 columns):

And here they are on the coordinate plane:

You can see that the curve that would pass through all the points closely resembles a fourth degree polynomial. Now, given the points, how can I write that best-fit equation for them? Can someone give a name for what I'm trying to do here? Is it "regression"? Results on Google return mostly linear regression, which is not what I'm interested in here. I think what I need is "quartic regression", but Google returns only how to do it on the calculator.

Here would be the ideal curve that I'd really like the equation for (turning points are denoted by blue arrows):

When I do quartic regression on my TI-83, it gives me this, which leaves a lot to be desired:

Is there a procedure I could follow to try quartic regression myself? I've never dealt with it before, as I'm only completing trigonometry right now (next year is precalc, so I'm a long way off, I guess).

I would be even more interested in studying fitting equations to data in general. I would like to know what to tweak in an equation to achieve a desired effect (i.e., if I want to move a turning point one way or another, or make a curve more shallow or deep). Any suggestions?

I'm not sure if this is what you are asking for, but search for least squares fitting (minimising the squared error). You can fit all kinds of cool functions that way (polynomials are the easiest).

here's a link

You need to be sure that your function can actually represent the data points (e.g. there is no point trying to fit a straight line through 5 points on a sine curve).

If your data was obtained by measurement, e.g. the position of a car as a function of time, you can use regression to fit a curve such that the sum of squared errors is minimal. For that you need to know what kind of function you are looking for. In the example, you expect the position of the car to be s(t) = a*t² + b*t + c, so fitting the curve means choosing the parameters a, b and c.
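As a rough illustration of the car example, here is a minimal sketch in plain Python that fits s(t) = a*t² + b*t + c by least squares. It builds the so-called normal equations and solves them by Gaussian elimination. The function names and the sample measurements are made up for the example; a real project would more likely lean on an existing numerical library.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(ts, ss):
    """Least-squares fit of s(t) = a*t**2 + b*t + c; returns (a, b, c)."""
    # Each row of the design matrix X is [t^2, t, 1].
    rows = [[t * t, t, 1.0] for t in ts]
    # Normal equations: (X^T X) p = X^T s, with p = [a, b, c].
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xts = [sum(r[i] * s for r, s in zip(rows, ss)) for i in range(3)]
    a, b, c = solve_linear_system(xtx, xts)
    return a, b, c


# Hypothetical noisy measurements of position against time.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
positions = [1.1, 3.9, 11.2, 21.8, 37.1, 55.9]
a, b, c = fit_quadratic(times, positions)
print("s(t) = %.3f*t^2 + %.3f*t + %.3f" % (a, b, c))
```

The same idea extends to a quartic by adding t³ and t⁴ columns to each row, although the normal equations become numerically touchier as the degree grows.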

On the other hand, if you just want to interpolate a function to go through some points, search for splines.

There is an easier way to attack this problem. You have 5 points right?
(x0,y0)
(x1,y1)
...
(x4,y4)

Obviously no two x's can be equal, otherwise you can't ever fit a function to them.

Create 5 functions, like so:

F0(x) = [(x-x1)*(x-x2)*(x-x3)*(x-x4)] / [(x0-x1)*(x0-x2)*(x0-x3)*(x0-x4)]
F1(x) = [(x-x0)*(x-x2)*(x-x3)*(x-x4)] / [(x1-x0)*(x1-x2)*(x1-x3)*(x1-x4)]
...
F4(x) = [(x-x0)*(x-x1)*(x-x2)*(x-x3)] / [(x4-x0)*(x4-x1)*(x4-x2)*(x4-x3)]

I forget what these are called. But if you look at them closely, you'll see that F0 for instance is zero at x1, x2, x3, and x4, (zero in the numerator) but 1 at x0 (because x = x0, so the numerator and denominator are the same). Similarly for all the other F's. Then you get your function:

F(x) = y0*F0(x) + y1*F1(x) + ... + y4*F4(x)

This is guaranteed to go through all the points in your set exactly. Since it is a polynomial it will be smooth, but how much it wiggles between the points may depend on your sample points.
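For reference, the F0..F4 functions above are usually called Lagrange basis polynomials, and F(x) is the Lagrange interpolating polynomial. A minimal sketch of the construction in Python follows; the function name and the five sample points are made up for illustration.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the polynomial through the points (xs[i], ys[i]) at x."""
    if len(set(xs)) != len(xs):
        raise ValueError("x values must be distinct")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial F_i: equals 1 at xi and 0 at every other sample x.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical sample points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]

# The curve reproduces every sample point.
for xi, yi in zip(xs, ys):
    assert abs(lagrange_interpolate(xs, ys, xi) - yi) < 1e-12

# Evaluate between the samples to see how the curve behaves there.
print(lagrange_interpolate(xs, ys, 2.5))
```

Evaluating at a handful of in-between x values is a quick way to check whether the curve wiggles more than you would like.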

Tom

If you just want to solve this problem, what people have already said is excellent. If you want to learn something general about how to do this, you could google for Minuit, a function minimisation library from CERN (originally written in Fortran, with a later C++ version), and RooFit, a C++ toolkit built on top of it with some nice histogram and function classes in addition. ROOT, the data analysis framework with a built-in C++ interpreter, also has some function classes that can do fits. For fitting to five points these are kind of overkill, but they're very useful for larger jobs.

I so want to answer this question for you (I'm a professional statistician, and I'm sure you'll find my fees are affordable[smile]), but you're really asking about a massive topic...

Quote:

I would even be more interested in studying fitting equations to data in general. I would like to be able to know what to tweak in an equation to achieve a desired effect (i.e., if I want to move a turning point one way or another, or make a curve more shallow or deep).


...'Cos this is what I make all my lovely money from.

The methodology for solving even simple linear regression involves using sums of squares (which means calculating minima, and therefore calculus), using matrix algebra (and still calculating minima, and therefore matrix calculus) or using geometrical solutions (which I can't stand).

Quote:

as I'm only completing trigonometry right now (next year is precalc, so I'm a long way off, I guess).


So this means that I won't be explaining in any great detail, and will be just presenting some basic methodology. Don't mean to sound offensive - just trying to explain that there's an awful lot going on here that I'm skimming over.

So - firstly, let's define linear regression - because it's not what I think you think it means.

Hopefully you've seen the equation of a straight line, the old faithful:

y = (m * x) + c

where y is the observed value, x the covariate and m and c the gradient and intercept of the fitted line. This is obviously linear regression - but so, surprisingly, is this:

y = (a * x) + (b * x * x) + c

Huh, I hear you say! But surely that's quadratic regression?

Nope. It's called linear regression because (to quote 'Kendall's Advanced Theory of Statistics') 'It should be noted that the adjective linear will always be understood to refer to the parameter structure, and not to the regressor variables'.

Look again at the 2 equations. We are adding (parameter / functions of covariate) pairs in a linear fashion. Even though the resultant curve is quadratic in form, it's still linear regression. Even this:

y = a * sin(x) + b * cos(2*x - 15.2)

is linear regression. However, if we used this:

y = a * sin(b * x)

or this, a sigmoid curve:

y = a + b * (x^c) / (x^c + d^c)

it's no longer linear regression - the function isn't a linear function of parameters and covariates.
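To make the distinction concrete, here is a small sketch showing that the sine/cosine model above can be fitted with exactly the same linear machinery as a straight line, because a and b only multiply fixed functions of x. The function name and the synthetic data (generated with a = 2 and b = -0.5) are assumptions for the example.

```python
import math


def fit_two_basis(xs, ys, f, g):
    """Least-squares fit of y = a*f(x) + b*g(x); returns (a, b)."""
    # Sums that make up the 2x2 normal equations.
    sff = sum(f(x) * f(x) for x in xs)
    sfg = sum(f(x) * g(x) for x in xs)
    sgg = sum(g(x) * g(x) for x in xs)
    sfy = sum(f(x) * y for x, y in zip(xs, ys))
    sgy = sum(g(x) * y for x, y in zip(xs, ys))
    det = sff * sgg - sfg * sfg
    if det == 0:
        raise ValueError("basis functions are not independent on these x values")
    # Cramer's rule on [[sff, sfg], [sfg, sgg]] [a, b]^T = [sfy, sgy]^T.
    a = (sfy * sgg - sgy * sfg) / det
    b = (sgy * sff - sfy * sfg) / det
    return a, b


def basis_f(x):
    return math.sin(x)


def basis_g(x):
    return math.cos(2 * x - 15.2)


# Synthetic, noise-free data generated from a = 2, b = -0.5.
xs = [0.5 * k for k in range(12)]
ys = [2.0 * basis_f(x) - 0.5 * basis_g(x) for x in xs]

a, b = fit_two_basis(xs, ys, basis_f, basis_g)
print(a, b)
```

No such closed-form solve exists for y = a * sin(b * x): b sits inside the sine, so the sums above would themselves depend on the unknown b, and fitting it generally needs an iterative search.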

Those whacky statisticians, eh?

The next step is how you want to fit the curve to those data. Do you want a curve that goes through exactly those points, or are you trying to summarise the majority of the information whilst minimising the number of parameters you're fitting (a prime example of Occam's razor, or the principle of parsimony)? Hmm, maybe that's not entirely clear, so let me try and clarify.

Let's say you have a bunch of height and weight data, and you want to fit a model describing how weight is related to height - that is, you want to say 'Bob is 1.9 meters tall - how much do we think he weighs?'. The data suggest a linear relationship between height and weight, so you fit a:

weight = m * height + c

model to it.
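For a straight line the least-squares answer has a simple closed form, sketched below in Python. The heights and weights are invented purely for illustration and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c


# Hypothetical heights (metres) and weights (kilograms).
heights = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85, 1.90]
weights = [55.0, 61.0, 64.0, 70.0, 72.0, 80.0, 83.0]

m, c = fit_line(heights, weights)
print("predicted weight at 1.9 m:", m * 1.9 + c)

# Residuals: how far each person sits from the fitted average line.
residuals = [w - (m * h + c) for h, w in zip(heights, weights)]
print(residuals)
```

The residuals are rarely all zero, which is the point of the next paragraph: the line describes the average trend, not each individual.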

Do you think that all the weight and height data are going to lie on top of the fitted line? Of course not; take 10 people with the same height and their weights are all going to vary. This model is an expression of the average result - we need a more descriptive model to describe how individuals vary around that mean.

The alternative would be an extremely wiggly line that ran through all the data. But how useful is that? Not at all - it tells us everything about the sample of data, but nothing about the population from which it has been collected.

Welcome to statistical inference. Have a nice day!

Let's ask another question. Let's stick to linear regression for the moment, and look at our data. We have 7 data points (each an (x,y) pair) - hence the following curve, with 7 parameters, would pass through exactly these data:

y = a * x + b * x2 + c * x3 + d * x4 + e * x5 + f * x6 + g

(where x2 = x*x etc, to save on typing). This curve would fit exactly through the data - but if we're interested in making a generalisation, that's not so interesting - all we've done is describe the sample.

We could just fit a linear term (y=mx + c), but just a visual scan of the data suggests that this wouldn't be a good fit - it doesn't capture the key features of the data. A quadratic might do the trick, but so might a quartic.

Now we're into the realms of Analysis of Variance. This toolset allows us to evaluate whether adding additional terms to a model gives us any extra inferential power (under a set of assumptions that I'm not going to go into here). I used the R package to run a quick ANOVA model, and found that a quadratic curve, as you've fitted, statistically 'best' fits the data. Please note, for more astute readers - I am making so many assumptions here it's not funny, but I'm trying to keep the technical level down.

But what if we want a curve that goes through exactly those points? Well, we could use the aforementioned sixth-power model above, which gives the solution:


l6 = 14.70
- 8.10 * l5
+ 0.93 * l5^2
- 0.034 * l5^3
+ 0.00057 * l5^4
- 0.0000044 * l5^5
+ 0.000000013 * l5^6



which is a pain in the arse, because the parameters get very small (because the covariate powers get very large - you can get around this, but not late on a Thursday night - this is a very crude analysis).
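One common way around the tiny coefficients (not necessarily the one meant above) is to rescale the covariate before fitting, so that its powers stay a manageable size. The sketch below simply rewrites the quoted polynomial in terms of u = l5 / SCALE and checks that both forms agree. SCALE = 100 is a hypothetical choice, since the range of the data isn't shown here, and the function names are made up.

```python
# Coefficients quoted above, lowest power first (constant term, l5, l5^2, ...).
coeffs = [14.70, -8.10, 0.93, -0.034, 0.00057, -0.0000044, 0.000000013]


def horner(cs, x):
    """Evaluate sum(cs[k] * x**k) using Horner's scheme."""
    result = 0.0
    for c in reversed(cs):
        result = result * x + c
    return result


def rescale(cs, scale):
    """Coefficients of the same polynomial written in u = x / scale."""
    # x**k == (scale * u)**k, so each coefficient picks up a factor scale**k.
    return [c * scale ** k for k, c in enumerate(cs)]


SCALE = 100.0  # hypothetical; pick something close to the largest |l5| in the data
scaled = rescale(coeffs, SCALE)
print(scaled)

# Both forms describe the same curve.
for x in [0.0, 10.0, 50.0, 90.0]:
    assert abs(horner(coeffs, x) - horner(scaled, x / SCALE)) < 1e-6
```

In practice you would rescale (and often centre) the covariate first and then fit, rather than converting coefficients afterwards; the conversion here is only to show that nothing about the curve changes.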

Alternatively, you could run some splines through the data, or use something called Generalised Additive Models. There are some other non-parametric approaches you could take. There are some other parameterised models you could use. A model, after all, is just a mathematical approximation to a set of data.

And so ends tonight's lecture on model fitting (/pomposity).

Let me know if you have any more questions,
Jim.

If you're only interested in the equations, here are some that I think fit quite nicely:

(1) y = 1E-08x^6 - 4E-06x^5 + 0.0006x^4 - 0.0342x^3 + 0.9329x^2 - 8.099x + 14.7
(2) y = -5E-07x^5 + 0.0001x^4 - 0.0113x^3 + 0.3677x^2 - 2.8096x + 9.9374

I did it using MS Excel.
