choffstein

Correlation of Two Polynomials


Here is a question for all you crazy math nerds out there: how do you determine the correlation between two polynomials?

First, let me outline my purpose. I am trying to analyze whether two sets of data are correlated. Seems simple, right? Just use Pearson, I hear you say. I know. But the issue is this: one data set has 350+ points, while the other has only 5. So I figured the best thing I could do was fit same-order polynomials, using a perfect fit for the 5-point plot. Then I could check the correlation of the derivatives, or even the second derivatives.

I have two fourth-order polynomials: one is a perfect fit (R² = 1; after all, it only has 5 points) while the other is a least-squares fit (about 350 data points). Plotting them on top of each other seems to show a similar 'shape', but the equation that is a perfect fit seems to have a steeper rate of change. The exact equations are:

f(x) = -3.2575E-08x^4 + 1.3577E-05x^3 - 1.5590E-03x^2 + 3.5979E-02x + 2.2496E+01
g(x) = -3.7034E-09x^4 + 1.7008E-06x^3 - 2.1462E-04x^2 + 4.0458E-03x + 7.1000E-01

(You can plot these in Excel if you want, using x from 0 to 253.)

Now, when I correlate the first derivatives -- the 'rate of change', as it were -- I get a correlation of -0.4. That doesn't give me enough confidence to say there is any correlation. But when I correlate the second derivatives, the correlation is +0.9 -- which is huge! This doesn't make sense to my puny mind: the first derivative is the rate of change, and the second derivative is the rate of change of the rate of change. Why would the second derivatives be more correlated than the first -- and does it make sense to use this value?

So how does one define correlation between two polynomials that are only 'best-fit' curves and not the actual vector values? Thanks.

EDIT: Changed equations. Sorry, wrote down the wrong ones!
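In case it helps, here is a rough sketch of what I'm doing (not my exact code -- coefficients are stored lowest-order first, and pearson() is just the textbook formula):

#include <cmath>
#include <vector>

// Evaluate a polynomial at x using Horner's rule.
// Coefficients are stored lowest-order first: {c0, c1, ..., cn}.
float evaluate(const std::vector<float>& c, float x) {
    float y = 0.0f;
    for (int i = (int)c.size() - 1; i >= 0; --i)
        y = y * x + c[i];
    return y;
}

// Differentiate: d/dx of sum(c[i] * x^i) is sum(i * c[i] * x^(i-1)).
std::vector<float> derivative(const std::vector<float>& c) {
    std::vector<float> d;
    for (size_t i = 1; i < c.size(); ++i)
        d.push_back((float)i * c[i]);
    return d;
}

// Standard Pearson correlation of two equal-length sample vectors.
float pearson(const std::vector<float>& a, const std::vector<float>& b) {
    const size_t n = a.size();
    float ma = 0.0f, mb = 0.0f;
    for (size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    float cov = 0.0f, va = 0.0f, vb = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        cov += (a[i] - ma) * (b[i] - mb);
        va  += (a[i] - ma) * (a[i] - ma);
        vb  += (b[i] - mb) * (b[i] - mb);
    }
    return cov / std::sqrt(va * vb);
}

I sample f' and g' (or f'' and g'') at x = 0, 1, ..., 253 and feed the two sample vectors to pearson().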

Unfortunately, checking the first derivatives against one another won't do the trick. A corollary of the Weierstrass theorem is that, given a continuous function g, we can create a continuous f that stays arbitrarily close to g, yet still specify f'(xi) to be anything we like at finitely many points xi. An extreme example would allow us to have f identical to g (to floating-point accuracy) with their derivatives differing as much as you like at the sample points. Of course, this only applies to polynomials of arbitrarily high order, but the principle stands.
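To make that concrete with a toy (non-polynomial) example of my own: take f(x) = g(x) + ε·sin(x/ε²) for some tiny ε > 0. Then |f(x) - g(x)| ≤ ε everywhere, yet f'(x) - g'(x) = (1/ε)·cos(x/ε²), which is enormous. Two functions can hug each other as tightly as you like while their derivatives disagree wildly.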

When I plot those two functions, they seem to 'do the same thing': they both look vaguely sinusoidal with the same frequency and phase, but they have very different amplitude and bias. This is intentional, right? I get the impression that you don't care about the bias, but would you like your metric to reflect the differing 'amplitudes'?

As I see it, your best option is to create a 'distance' polynomial and integrate its square (or maybe its modulus) over the range. This will give you a good measure of the L² difference of the two data sets. There's not really much point going for L∞, as the least-squares fit suggests you're happy with RMS accuracy. What's not clear is exactly how to form the distance polynomial (which I'll call d). I imagine one of these three will suit your needs:

1. The true L² distance between the functions: d(x) = f(x) - g(x).

2. The difference after optimal additive scaling. First, calculate the constants (the L² means)

f₀ = (1/253) ∫₀²⁵³ f(x) dx,
g₀ = (1/253) ∫₀²⁵³ g(x) dx,

then define F(x) = f(x) - f₀ and G(x) = g(x) - g₀. Now F and G will have the same form as f and g, but will be centred around zero. The distance function here is d(x) = F(x) - G(x).

3. The difference function after additive and multiplicative scaling. First scale the two polynomials to have the same range (maximum - minimum) over [0, 253], then remove their bias as before. I'll go into more detail if you ask.

Think carefully about which metric is most suitable, then calculate the final distance D (a non-negative scalar) using:

D² = ∫₀²⁵³ d²(x) dx

The result will be about as good as your two interpolants will allow.
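If you'd rather not expand d²(x) symbolically (it's only an octic, but still), numerical quadrature is perfectly adequate, since d is smooth. A sketch, assuming a coefficient-vector evaluate() like the one sketched above:

#include <cmath>
#include <vector>

// Sketch: L2 distance over [a, b] by trapezoidal quadrature of d(x)^2,
// where d(x) = f(x) - g(x). A 'steps' of ~1000 is plenty for quartics.
float l2_distance(const std::vector<float>& f, const std::vector<float>& g,
                  float a, float b, int steps) {
    float h = (b - a) / steps;
    float sum = 0.0f;
    for (int i = 0; i <= steps; ++i) {
        float x = a + i * h;
        float d = evaluate(f, x) - evaluate(g, x);
        float w = (i == 0 || i == steps) ? 0.5f : 1.0f;  // trapezoid end weights
        sum += w * d * d;
    }
    return std::sqrt(sum * h);
}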



I must ask: are you writing code for general, equally-important data sets, or is this a once-off with the data you have shown us? If the latter, then there is no problem. Otherwise I take issue with your method:

I understand that the data sets have vastly different numbers of samples, but it isn't quite fair, in the general case, to interpolate the small data set but approximate the large one; it gives the small set a stronger significance, and will lead to disproportionate anomalies if there is any noise in the small sample. Of course, interpolating both data sets would be a bad idea, but approximating them both seems far more suitable.

The natural question is then 'to what degree should I approximate each data set?'. Although it leaves some scope for false positives, the natural choice is to approximate them both with a polynomial of order less than or equal to the size of the smaller set. In the case of equality, approximation and interpolation become one and the same, and all is rosy. Was this your train of thought, or was the fourth-order polynomial decision based on something else? If you made the executive decision that five degrees of freedom was perfect for the situation, then I shall have to stop typing, but in any more general case, a Pearson or chi-squared correlation test would indeed have been the best call [wink].

Admiral

First of all, thank you for the nice write-up. Some of it is a bit over my head, but I am researching everything you wrote as we speak.

Here is my situation: I am analyzing the correlation of equity prices versus fundamental data -- for example, stock price versus earnings-per-share information, which is only released four times a year. The hypothesis is that price should somehow relate to the change in earnings per share, so my goal is to plot EPS and price and define what I mean by 'high correlation'. Unfortunately, I am having a tough time doing that.

To answer your question, the code will be used to analyze many similar data sets (all related to price) over several different time periods -- so I suppose the code is both general and problem-specific. I already have code that finds polynomial fits using least squares, as well as derivatives -- I don't believe adding integration methods would be too difficult.

So, to be specific, I have five EPS data points (call this set B), equally distributed over a year's time with two points overlapping (this creates four equal time spans). I then have 253 price data points (set C), which I believe to be reactionary to the five data points: as each point in B is 'released,' the points that follow in C take the new point in B into consideration. In other words, the function defining the points for set C could very well take set B as a parameter.

My thinking was this: because I believe set C (price) to be reactionary, I would create a perfect-fit curve for B (EPS). I considered this to be the 'definition' curve. Then, I would take set C and 'move it back' by ~62 days, so that the functions perfectly overlap one another (that is, set C is no longer 'reactionary' but 'concurrent'). If set C were a one-to-one correlation to B, it would have an identical function to B's polynomial. Hence, I used the same-order polynomial to define my curvature for C. This also has the added benefit of smoothing a lot of the noise of the 253 points. The thought was this: identify their curvature using either the first or second derivative and use the Pearson method to find the correlation. Considering that the second derivative generally defines curvature, I figured it would be a better fit. In the one example I chose, it seems to work rather well (0.9+ correlation coefficient).
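For concreteness, differentiating the f from my first post (unless I've slipped a digit):

f'(x) = -1.3030E-07x^3 + 4.0731E-05x^2 - 3.1180E-03x + 3.5979E-02
f''(x) = -3.9090E-07x^2 + 8.1462E-05x - 3.1180E-03

So the 'curvature comparison' ends up being a Pearson correlation between two quadratics sampled over the same range.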

Unfortunately, as has been pointed out to me: identical-looking equations can have very different derivatives. Does this apply to the second derivative, however? I assume it does, since the first derivative can serve as the new continuous function in the corollary of the Weierstrass theorem.

A few people in #math on freenode seemed to think I was going down the completely wrong path.

Maybe this clears up some stuff for you. Considering I am still trying to understand your answer, I don't know if it affects what you said. Thanks for the help so far.

Well, you're not necessarily going entirely down the wrong path (though I did suspect so at first). It really depends on the changeability of the ~62-day delay you speak of. I don't know much about the economics of the situation, but if this delay is fairly predictable, then your method should be reasonably sound. If it is very variable, then you'd need to introduce a new 'phase' parameter to be solved for along with the rest of the unknowns.

My main concern in the second part of that first reply regarded the relative sizes and importances of the two data sets. Since you say that they will always be of roughly the same orders, and that the larger set stochastically fits itself to the smaller one, I'm appeased to some degree. For your current data set, and probably for a large majority of others, the current system should work out very well. However, for data sets with large booms or crashes, which translate to sharp peaks in the higher derivatives, things could quickly go awry.
In mathematical terms, your model is safe for data sets with a reasonably low modulus of continuity (a measure of how sharply it can change).

While it is a little naive to measure the fit of the data sets by comparing their closeness under a phase translation (as you are), it is far more difficult to do much better without getting involved with some fairly high-powered stochastic mathematics. This is the main reason we don't have many good 'open-source' models to predict the stock market's micro-evolution.

Anyway, it sounds like that third distance function would be best-suited to the task:

Given the two quartic polynomials f and g, calculate their respective maxima and minima in the range [0, 253]. This could be done via differentiation and root-finding, but considering the origins of the polynomials (if you're not in a hurry) a for-loop would be just as good:

float f_min = f.evaluate(0.0f);
float f_max = f_min;
float g_min = g.evaluate(0.0f);
float g_max = g_min;
// Step through the domain one day at a time, tracking each polynomial's extrema.
for (float i = 1.0f; i <= 253.0f; i += 1.0f) {
    float f_i = f.evaluate(i);
    float g_i = g.evaluate(i);
    if (f_i > f_max) f_max = f_i;
    if (f_i < f_min) f_min = f_i;
    if (g_i > g_max) g_max = g_i;
    if (g_i < g_min) g_min = g_i;
}
Of course, this is a little rough, but it won't be far from the truth unless the source data is pretty wild, in which case the whole model is invalid anyway. I sure hope you'll have a human looking over the results before making any corresponding investments [razz].

Now scale the two polynomials to fit this range:

F(x) = (f(x) - f_min) / (f_max - f_min)
G(x) = (g(x) - g_min) / (g_max - g_min)

Calculate their difference:

d(x) = F(x) - G(x)

and integrate over the domain:

D = √( ∫₀²⁵³ d²(x) dx )

If it isn't clear, D (a constant) will be zero if and only if F and G are identical -- that is, if f and g agree up to the shifting and scaling we just applied. The more they differ, the larger D will be, with no upper bound. Since F and G are normalised, the scale of D depends only on the length of the domain, so determining a reasonable threshold for it will be a matter of trial and error. If your model is accurate, you should find that the quantity

Q = (f_max - f_min) / (g_max - g_min)

will be reasonably uniform as you put different data sets through the algorithm. If Q changes significantly (say, by an order of magnitude or more), then you've probably just been calculating nonsense all along.
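Pulling it all together (a sketch only -- f and g here are your Polynomial objects, and the extrema come from the loop above):

// Sketch: normalised distance D and amplitude ratio Q, reusing the
// f_min/f_max/g_min/g_max found earlier. std::sqrt is from <cmath>.
float D_squared = 0.0f;
const float h = 0.25f;                        // integration step (days)
for (float x = 0.0f; x <= 253.0f; x += h) {
    float F = (f.evaluate(x) - f_min) / (f_max - f_min);
    float G = (g.evaluate(x) - g_min) / (g_max - g_min);
    float d = F - G;
    D_squared += d * d * h;                   // simple rectangle rule
}
float D = std::sqrt(D_squared);
float Q = (f_max - f_min) / (g_max - g_min);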

Good luck
Admiral

Here's a non-mathematical view from a long-time stock market observer.

On average, you will find some correlation between stock price and EPS, because EPS is an important factor in many people's valuation of a company. However, the correlation can vary greatly from company to company.

If you are looking for a correlation between the change in EPS on a specific day and the behavior of the stock price for days following, I don't think you will find any (beyond the overall correlation).
  1. Stock prices are influenced by future EPS values more than past EPS values.
  2. EPS values are routinely pre-announced far in advance. With the pre-announcements and other information, the values can be predicted to reasonable accuracy before they are officially announced. The differences between the expected values and the actual values affect the stock price in the short term more than the EPS itself (because the current price already reflects the expected value).
  3. If your analysis doesn't consider these influences, they appear as large random factors.
Again, the results can vary greatly from company to company, but I predict that any correlation you might find will be no more significant than noise.

Finally, you have to be careful about fitting high-order polynomials to your data -- Occam's razor, you know.
