Sign in to follow this  
Dwiel

Finding maximum increases

Recommended Posts

Hey, I am working on a problem where the brute force method isnt very feasible. The problem is this. I have a large amount of data. a few hundred sets of 30,000+ data points. For each point in the time series, I would like to know the average of the top 10 highest slopes between other points. In otherwords, define si j to be the slope between the current point i and the point j points after it (the points are evenly spread out along the x axis). I would like to find, for each i (point in the set), the largest values of si j. Then, taken the largest 10 slopes, and find their average, preferably just an arbitrary function not restricted by the algorithm to be an average. A brute force search between 30,000 data points would be rediculous, even after some basic optomizations such as only testing the local maximums for slopes. It still is a very large problem. My data is also not going to be this small forever. Does anyone have any ideas on how I can solve this problem efficiently? I was thinking that maybe starting the search from the far right might alow some interesting optimizations. Thanks Dwiel

Share this post


Link to post
Share on other sites
"A brute force search between 30,000 data points would be rediculous"

What's so rediculus about that?

Your processor is capable of searching thru millions of data points like that in a matter of seconds. 30000 shouldn't be a problem.

Edited:

What is the nature of your data anyway?

Share this post


Link to post
Share on other sites
You could start from the back, and keep track of the ten highest points seen so far and build a new set of data points consisting of the average of those ten for each point i.

Makes si j an O(1) operation, after the O(N) initial calculation.

edit: that IS the problem you're trying to solve... right? your description seems kinda vague.

Share this post


Link to post
Share on other sites
how smooth is your data?
you might be able to take advantage of smoothness to find that... within some region the total slope is greater than this other region, and thus disregard a bunch of points all at once

if your data is totally jagged like a fractal tho
i dont see how actually checking each point is avoidable
its only one pass tho, and you only have to save the 10 largest values as you go thru
so it doenst sound too bad....

Share this post


Link to post
Share on other sites
are you saying that a single pass through the data is too much? or that sorting the data, then picking off the top ten is too much?

You can do this in a single pass. Just keep two arrays sized 10, index and value. keep track of the lowest value in the ten. Any time your current (sweep) value is higher than lowest_of_ten value, replace the lowest value with the new value and index, and search through your list of 10 again to find the new lowest. go through your whole data set once, and you have the top ten contenders. Then you can do this incrementally as new points are added.

[edit]
spelling and last sentence
[/edit]

[Edited by - lonesock on November 9, 2004 2:15:37 AM]

Share this post


Link to post
Share on other sites
Can point j be any point after i, or is it just simply the next point after i.. is s_ii+1? In latter case you can find the largest slope in one pass. If you mean the first case, then I understand your comment about 30000 data points being ridiculous.

Share this post


Link to post
Share on other sites
Sorry, I guess I wasnt very clear. The data very fractal like. I'm not trying to find the next 10 highest points, but rather the next 10 points whose slope with respect to the current one, is the largest. Example:

y|
|
4|
3| *
2| * * *
1|* *
+-----------
pt #
1 2 3 4 5 6

First, find the largest slopes with respect to point #1. The BEST slope is between point 1 and point 2. The slope between those two points is 1. The next best is between 1 and 3, the slope between those is 1/2. After that, 1 and 5 form a slope of 1/4 and finally point 6 forms a slope with point 1 of 1/5. What I want is, for each point in the set, the average value of the 10 best slopes between other points which come after it. Just because point 6 has the highest value, it is sufficiently far away from point 1, that it's slope is not as high as some other ones.

The reason why a few hundred sets, each of 30,000 data points is large, is because the brute force method, I would need to perform 450,015,000+ multiplies and sorted inserts per set. And then do that a few hundred times. Also, this data might change, in which I would need to do it all over again.

I was thinking that maybe I could find the maximum value in the set and then as soon as I find that no other points could possibly have a higher slope, break and go on to the next one. If you have more ideas on how to speed the algorithm up, I would greatly appreciate it.

In the future, my sets of 30,000 could easily become sets of well over a million, which I am sure would slow down a brute force algo pretty quickly (7*1011 mults and sorted inserts per set (probably closer to 2000 instead of 300))

Thanks

Dwiel

Share this post


Link to post
Share on other sites
You don't need to do sorted inserts constantly if you only need the top 10 slopes per point. Just keep a running list and which value is the lowest, and only insert into the list if the currently examined slope is higher than the lowest already in the list. That alone should reduce the time spent by an order of magnitude or two, since for most points you'll only be calculating and comparing the slope which only involves a very few operations.

You should try implementing this obvius algorithm and see how slow it is since nobody seems to have any alternatives. It might end up being fast enough anyway, or you might find an easy way to optimize it after usnig a profiler.

Share this post


Link to post
Share on other sites
I think you can do it in O(n). I'm asumming that the interval is 1 over the x axis just so I can explain this easier. I haven't checked the math so this is just an idea.

If you think about how slopes over shorter intervals combine into slopes of higher intervals, I'm pretty certain that the greatest slope has be over an interval of 1 (greater intervals has to contain either all peices of equal slope or at least one piece of greater slope). So you go through once finding the 10 greatest slopes of interval 1. Next check the slopes of interval 2 that include one of the 10 pieces of interval 1. If you find no new numbers then your done. else check interval 3 with the new pieces of interval 2 and so on, it should stop at interval 10 at most.

-Cocalus

Share this post


Link to post
Share on other sites
Sorry Found a mistake. I'm still certain that all the max slopes contain the max slopes of interval 1. But I found out that the 3rd highest could be the entire set. It's probally still doable, but the worst case could be bad. I'll think about it a little more and see if I can fix it.

Share this post


Link to post
Share on other sites
hmmm.... looks kinda like constructing map for self-shadowing, but i don't understand why you need averaging.

and ideas:
first, try to do something based on divide and conquer. Try to make fast enough algorithm for merging 2 results into one, then you just recursively do first half, second half, and merge 'em, like sorting.

edit: note that you need to consider only convex hull of right part when you merge two results together. And max slope points on convex hull can be found using something like binary search for maximum. So merge probably can be done in in O(N log N) time, and therefore entire problem too.

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Quote:
Original post by Cocalus
Sorry Found a mistake. I'm still certain that all the max slopes contain the max slopes of interval 1. But I found out that the 3rd highest could be the entire set. It's probally still doable, but the worst case could be bad. I'll think about it a little more and see if I can fix it.


Thanks for the idea. I like the idea where I don't have to compare every point with every other point. (although some simple optimizations could probably cut this down a lot)

However, I just found a case where I think that your algorithm will break. Maybe not, I'm not 100% sure how it works. Consider this:

| *
|
|
| *
|* *
| *
+---------
1 2 3 4 5


The slope between 1 and 3 is large, 1 and 4 smaller, then 1 and 5 largest. I'm not positive if this breaks your algo, but it might if I am understanding you correctly.

I'm not at home atm, but when I am, I will code the easy solution to the problem and see if it is sufficiently fast. It might be the case where it is easiest to simply start with the 'slow' algo and then find special cases where I can avoid computation in order to speed it up.

I'll post as soon as I get some results, but in the meantime, I wouldn't mind more ideas incase a hacked version of the brute force algo isnt fast enough.

Thanks

Dwiel

Share this post


Link to post
Share on other sites
Do you have integers on the y axis or floats?

I can tell you where a section of the greatest slopes are after one pass (explained above) but I'm trying to find a good way to figure out when to stop searching for them is. The exception I found is

| *
| ************.........******
|*
+-------------.........-------

The 3 biggest slopes are 1 to 2,n-1 to n, and 1 to n.

Share this post


Link to post
Share on other sites
Quote:
Original post by Cocalus
It doesn't 4 to 5 is the largest.
No. The function si j was given i = 1 in his example. 4 to 5 is the largest slope, but that's i = 4. edit: Doh! you realized this seconds before I posted.

Share this post


Link to post
Share on other sites
you can remove all points that come after a dip in the graph. maybe thats the same as dmytry saying youre only interested in the convex set though.

also i agree a binary search would be good. i wouldnt know what criteria to search it for though. maybe if you stored the highest point contained in that set, youd know what the maximum and minimum slope is as seen from a certain point. that should converge pretty quickly.

Share this post


Link to post
Share on other sites
i think convex set idea must work. Highest slope point of set of points always will be point of convex set(imagine a convex polygon that contains your set). Don't really sure what to do with lower 10, probably you must just remove that highest-slope point from the convex set, and repeat check again, don't sure how fast it can be but probably it's possible to find all points that might be in top 10 and check only 'em.

Share this post


Link to post
Share on other sites
I guess there is no reason why I should be hinding what the data is from you guys. Its stock data. And for each point in time, I want to know not only what the maximum amount to be made is, but how long of a period that money can be made. It is most likely going to be harder to trade if there is only a split second where money could be made. Its a long story on how this information about 'possible profitability' could be useful, seeing as in the real world, we wont have the future prices, but it doesnt really belong in this thread. I was hoping that by keeping it vague it might apply to game better, esp self-shadowing of terrains which I noticed is a similar problem.

Be back soon to tell how fast diff algos are and if I need any more improvments.

Thanks for all the help thus far!

Dwiel

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this