graveyard filla

doing searches based on address info


Hi, I have a huge database filled with address info (e.g. house / apartment addresses), along with some other data. The table has about 10 million records. I need to let the user find that "other data" by searching on a given address. This is a tricky problem, since you can't predict how people will spell, space, or abbreviate things. So the way we decided to handle it is to match on zip code and house number only: the user enters a house number and zip code, and we display any addresses that have that zip code AND contain that house number in the "address_1" field. We do the latter with a wildcard search.

The problem is that with 10 million records, a wildcard search means a full table scan, which takes forever. This search needs to be reasonably fast (a few seconds rather than 30). Anyway, I have come up with a solution and would be interested to hear input on it, as well as solutions of your own.

Basically, we change the schema from one table holding the address and other info into two tables: "address_and_other_info" and "house_number". The data comes in via a CSV file that we parse, so the import process would do something like this:
for each entry
  insert the address and other info into address_and_other_info table. grab the PK here
  grab the address_1 field and split it by whitespace, put into array s
  for each string in s
     if(string contains a number)
        insert this string into the "house_number" table. also assign its "parent" to be the PK of the row we inserted above
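
In runnable terms, the loop might look roughly like this. This is just a sketch: I'm using Python with SQLite for illustration and guessing that the CSV columns are named address_1 and zip; the real import would use whatever database and column names we actually have.

import csv
import re
import sqlite3  # assumption: SQLite used here only for illustration

conn = sqlite3.connect("addresses.db")
cur = conn.cursor()

with open("addresses.csv", newline="") as f:
    for row in csv.DictReader(f):
        # insert the address and other info, then grab the new primary key
        cur.execute(
            "INSERT INTO address_and_other_info (address_1, zip) VALUES (?, ?)",
            (row["address_1"], row["zip"]),
        )
        aaoi_id = cur.lastrowid
        # split address_1 on whitespace and keep only the strings that contain a number
        for token in row["address_1"].split():
            if re.search(r"\d", token):
                cur.execute(
                    "INSERT INTO house_number (aaoi_id, house_number) VALUES (?, ?)",
                    (aaoi_id, token),
                )

conn.commit()

With 10 million rows I'd probably also batch the inserts and wrap the whole load in a single transaction, but that's a detail.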
So, essentially we just split the address on whitespace and throw out any strings that don't contain a number. We COULD assume there is only one house number per address, but that's risky: for "343 4th street" we might keep "4th" and throw away the "343". Anyway, this means the new table would have, say, 30 million entries instead of 10 million. However, now we can slap an index on those bad boys, which should be much faster than a full table scan.

The query would then turn from this:

select * from address_info where zip = @zip and address_1 like '%@house_number%'

to this:

select address_and_other_info.*
from address_and_other_info, house_number
where address_and_other_info.aaoi_id = house_number.aaoi_id
  and address_and_other_info.zip = @zip
  and house_number.house_number = @house_number

Anyway, what do you think?
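
To spell out the index part, this is roughly what I'm picturing, again only a sketch: the index names are made up, I'm still going through Python/SQLite placeholders, and the lookup below is the same query as above, just written with explicit JOIN syntax.

# indexes so the lookup never has to scan the 30 million house_number rows
cur.execute("CREATE INDEX IF NOT EXISTS idx_house_number ON house_number (house_number)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_house_number_parent ON house_number (aaoi_id)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_aaoi_zip ON address_and_other_info (zip)")
conn.commit()

def find_other_data(cur, zip_code, house_number):
    # the rewritten lookup: index seek on house_number, then join back to the parent row
    cur.execute(
        """SELECT a.*
           FROM address_and_other_info a
           JOIN house_number h ON h.aaoi_id = a.aaoi_id
           WHERE a.zip = ? AND h.house_number = ?""",
        (zip_code, house_number),
    )
    return cur.fetchall()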
