Finding a mean for 2D samples.

8 comments, last by etothex 18 years ago
I used a GPS receiver to log about 10k data points for a stationary position. The idea was to isolate the error in GPS, analyze it a bit, and get a better understanding of how to clean up actual tracks. One of the first problems I ran into is how to find the mean of those points. Taking the average of the longitude and latitude seems like it would only be valid if they were completely independent variables, and they aren't. Statistics is a weak subject for me.

It seems what I want is a point whose distance from each sample matches that sample's mean distance from all the other samples. I can't say such a point exists, but it does seem there would be one point that comes closest to satisfying that criterion. It also seems I could be totally off base. What I think I really need at this point is terminology. I don't need a solution so much as a pointer in the right direction so I can start looking things up.
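(For terminology: the standard notion closest to this is the geometric median, the point minimizing the mean distance to all samples. A minimal sketch of Weiszfeld's algorithm, assuming the points have already been converted to locally-flat x/y meters; the function name is illustrative:)

```python
import numpy as np

def geometric_median(points, tol=1e-6, max_iter=1000):
    """points: (n, 2) array of x/y meters. Returns the approximate geometric median."""
    guess = points.mean(axis=0)  # start from the ordinary mean
    for _ in range(max_iter):
        dists = np.linalg.norm(points - guess, axis=1)
        dists = np.maximum(dists, 1e-12)  # avoid divide-by-zero at a sample point
        weights = 1.0 / dists
        # Weiszfeld update: distance-weighted average of the samples
        new_guess = (points * weights[:, None]).sum(axis=0) / weights.sum()
        if np.linalg.norm(new_guess - guess) < tol:
            break
        guess = new_guess
    return guess
```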
Keys to success: Ability, ambition and opportunity.
Assume a random distribution with independent latitude and longitude variables, and fit each as a Gaussian distribution.

Unfortunately, without knowing anything about the properties of the distortion algorithm (which is artificial and may exhibit very specific behavior), you cannot do much more than treat the signal as regular noise.

If the algorithm is time-dependent, then the distortion depends on some third parameter. It could also be modeled with a chaotic attractor or some similar method, which would make its properties extremely complex and volatile.

While not 100% rigorous, another method for points that are close to one another is to simply find the average of the points in 3D space and then project that average point onto the sphere's surface (making the simplistic approximation that the Earth's surface is spherical).
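A minimal sketch of that method, under the same spherical-Earth assumption; the function name is illustrative:

```python
import numpy as np

def mean_on_sphere(lat_deg, lon_deg):
    """lat_deg, lon_deg: arrays of degrees. Returns the mean (lat, lon) in degrees."""
    lat = np.radians(np.asarray(lat_deg))
    lon = np.radians(np.asarray(lon_deg))
    # Convert each sample to a unit vector on the sphere
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    # Average in 3D, then project back to the surface by normalizing
    v = np.array([x.mean(), y.mean(), z.mean()])
    v /= np.linalg.norm(v)
    return np.degrees(np.arcsin(v[2])), np.degrees(np.arctan2(v[1], v[0]))
```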
A couple of things to note: you can convert latitude and longitude to rectangular coordinates, and since the spread of the samples is tiny compared to the curvature of the Earth, you can treat the Earth as locally flat. Those two steps should simplify the math a lot.
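A sketch of that flat-Earth conversion, using an equirectangular projection about a reference point (accurate to well under GPS error over a few hundred meters; the constant and function name are illustrative):

```python
import numpy as np

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, meters

def to_local_xy(lat_deg, lon_deg, ref_lat_deg, ref_lon_deg):
    """Convert lat/lon (degrees) to x/y meters east/north of a reference point."""
    lat0 = np.radians(ref_lat_deg)
    # Longitude degrees shrink by cos(latitude); latitude degrees do not
    x = np.radians(np.asarray(lon_deg) - ref_lon_deg) * EARTH_RADIUS_M * np.cos(lat0)
    y = np.radians(np.asarray(lat_deg) - ref_lat_deg) * EARTH_RADIUS_M
    return x, y
```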

Finally, it isn't clear exactly what you are trying to accomplish. Are you trying to find the point such that the sum of the distances from the point to the samples is minimized? Wouldn't simply finding the mean of the samples (add the components, divide by the number of samples) be sufficient?
John Bolton, Locomotive Games (THQ). Current Project: Destroy All Humans (Wii). IN STORES NOW!
My guess is what you're trying to do is leave the GPS unit in one area for a long time and "average" the data to get a measurement of latitude/longitude more accurate than the unit actually provides.

If so, I should mention that it doesn't work that well. Even after 24+ hours the mean will still do a random walk around an area of several meters (for a unit with 3-5 meter accuracy). To do better postprocessing, you need a unit that supports dumping the raw per-satellite data and not just the final lat/lon, and only really pricey units let you do that.
Overall, what I'm doing is logging GPS tracks when I walk, hike, jog, rollerblade, etc. The goal is basically to clean up those tracks so they are of more practical value. So initially I'm trying to understand how the error varies over time. The starting point for that is basic descriptive statistics, but I'm not really sure what to do with 2D points.

Just as an example, looking at a scatter plot, the points are not clustered in a circle but more of an ellipse. So that raises the question: just what is a standard deviation in 2D? Given a bunch of 2D points, where do you start in describing them statistically? At this point it doesn't really have much to do with GPS.
Keys to success: Ability, ambition and opportunity.
Well, I would just take the average in the x and y directions separately (with the data in rectangular x-y coordinates, in meters) to find the average x and average y, then calculate the standard deviation for x and y separately as well.

Then sigma^2 = sigma_x^2 + sigma_y^2, where sigma is the total standard deviation and sigma_x and sigma_y are the partial standard deviations in each direction. You can think of this as calculating the variance using the Euclidean distance between each (x, y) point and the mean instead of just (x_i - <x>)^2.
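A minimal sketch of that computation, assuming x and y are already in meters; the function name is illustrative:

```python
import numpy as np

def spread_2d(x, y):
    """Return (mean_x, mean_y, total sigma) for 2D samples in meters."""
    x, y = np.asarray(x), np.asarray(y)
    sigma_x = x.std()  # per-axis standard deviations
    sigma_y = y.std()
    sigma = np.hypot(sigma_x, sigma_y)  # sigma^2 = sigma_x^2 + sigma_y^2
    return x.mean(), y.mean(), sigma
```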
I was in a bit of a hurry earlier, so it might help to explain what I'm ultimately trying to do. I have many tracks of various routes, and what I want to do is scrub those tracks. I don't much care about absolute position; what I care about is time and distance between waypoints, i.e. split/lap times.

A GPS log isn't very accurate on distance. The problem is drift in the error orthogonal to the direction of travel. Those errors accumulate, and even a good track can be off by 10-20% on distance. That's an unacceptable level of error. I'm a broken-down old man, and at the rate I run that's a minute or more off on a mile. I can get a mile to within five seconds or so from the log if I just know where the mile is.

I found I can basically smooth the track to get a much more accurate measure of distance. I also found the tracks are generally a pretty good representation of the shape of a route, if not its position. I don't really care about position; shape and scale are what I care about. Looking at the tracks, it seems the GPS sometimes drifts off the path and stays there for a while. That's why I started logging a stationary position.

Ultimately I'm more interested in how the error drifts over time, but right now I'm basically trying to validate the sample, and I apparently have a problem with it. As near as I can tell, only 60% of my data points lie within a 10 m circle centered anywhere; it takes about a 25 m circle to contain 95% of the points. As far as I know the cluster shouldn't be elliptical, so I'm inclined to think I got a reflection off a hill.

I basically tried two things to select a "correct" position. One was averaging the longitude and latitude. The other was generating a histogram for the longitude and latitude and using the center of the highest-frequency bin for each. Since those turn out to be two significantly different points, the data is apparently skewed, and that got me thinking there may be better ways to select that "correct" position.
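(A sketch of comparing those two estimates, assuming coordinates already converted to meters; the bin count is an arbitrary choice and will move the mode estimate around:)

```python
import numpy as np

def mode_estimate(values, bins=50):
    """Center of the highest-frequency histogram bin for one axis."""
    counts, edges = np.histogram(values, bins=bins)
    i = counts.argmax()
    return 0.5 * (edges[i] + edges[i + 1])

# mean_x, mean_y = x.mean(), y.mean()
# mode_x, mode_y = mode_estimate(x), mode_estimate(y)
# A large gap between the mean point and the mode point suggests skewed data.
```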
Keys to success: Ability, ambition and opportunity.
Well, I found this site. Apparently the author has put a great deal of time and effort into analyzing GPS errors while logging a stationary point, so I figured I'd follow along with what he has done using my own data. That should keep me busy for a while.

What I'm thinking of doing at this point is using the data I collect for a stationary point to simulate tracks. That eliminates many of the problems of using real tracks of real paths, i.e. length, shape, orientation, etc. In particular, I don't have to deal with all the different cases that occur on a real path; I can start with a straight line, move along a rectangular path, and so on.
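(One way to set that up, sketched under the assumption that the stationary log has been reduced to x/y error offsets in meters relative to the estimated true position; the function name and parameters are illustrative:)

```python
import numpy as np

def simulate_track(error_xy, speed_mps=3.0, dt_s=1.0):
    """error_xy: (n, 2) array of stationary-log offsets. Returns a noisy track."""
    n = len(error_xy)
    t = np.arange(n) * dt_s
    # Ideal path: a straight line along the x axis at constant speed
    ideal = np.column_stack([speed_mps * t, np.zeros(n)])
    # Adding the offsets in logged order preserves their time correlation (drift)
    return ideal + error_xy
```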
Keys to success: Ability, ambition and opportunity.
What you're trying to do is tough, especially when you have multipath fading (off a hill, as you said), and my guess is you're using a handheld unit, which tend to be pretty crummy at correcting for multipath.

Quote:
Ultimately I'm more interested in how the error drifts over time, but right now I'm basically trying to validate the sample, and I apparently have a problem with it. As near as I can tell, only 60% of my data points lie within a 10 m circle centered anywhere; it takes about a 25 m circle to contain 95% of the points. As far as I know the cluster shouldn't be elliptical, so I'm inclined to think I got a reflection off a hill.


Realistically, that's not bad accuracy. WAAS will go in and out, and when it drops out you can expect 25 m spreads like that.

What I suggest, if you only care about overall absolute distance anyway, is looking at the velocity data instead. The velocity field on even cheap GPS units is fairly good; just take the velocity data and the time data and do some fancy numerical integration (Euler might work OK, but feel free to experiment).

That might give you loads better distance data than just computing Euclidean distances between raw position points.
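A minimal sketch of that idea using trapezoidal integration (one step up from Euler), assuming reported speed in m/s and timestamps in seconds; the function name is illustrative:

```python
import numpy as np

def distance_from_speed(speed_mps, t_s):
    """Integrate reported speed over timestamps. Returns distance in meters."""
    v = np.asarray(speed_mps, dtype=float)
    t = np.asarray(t_s, dtype=float)
    # Trapezoidal rule: average adjacent speeds times each time step
    return float(np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(t)))
```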

EDIT: And the error should be elliptical; that's not unusual.
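(One standard way to characterize that elliptical scatter, sketched assuming x/y in meters: the eigenvectors of the 2x2 covariance matrix give the ellipse axes, and the square roots of the eigenvalues give the standard deviation along each axis.)

```python
import numpy as np

def error_ellipse(x, y):
    """Return (sigma_major, sigma_minor, angle_rad) of the scatter ellipse."""
    cov = np.cov(x, y)                      # 2x2 sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    angle = np.arctan2(eigvecs[1, 1], eigvecs[0, 1])  # direction of major axis
    return np.sqrt(eigvals[1]), np.sqrt(eigvals[0]), angle
```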
