Sign in to follow this  
AncientPC

Best data structure for this problem?

Recommended Posts

A quick example outline: data: -name -age -other info problem: -find info from name -find the # oldest/youngest people -add/remove/change people (1M+) often Runtime performance is extremely important, memory shouldn't be an issue. I only know C++ / Java (so whichever is faster). I approached this problem by creating a vector of objects for search/sort and a second associative array to link name to their corresponding vector indexes. Before every search I would run a quicksort on the vector (and thus change the associative array's corresponding index). To modify people by name I would look up the index from the associative array. Is there a faster way to doing this? A binary tree based on age? Linked lists? [Edited by - AncientPC on March 8, 2007 2:43:35 PM]

Share this post


Link to post
Share on other sites
If you have small amounts of data, then it doesn't really matter. One million might not qualify as "small", though.

For larger datasets, a database is designed for this kind of operations, it will be both fast and easy on your memory. postgreSQL interacts pretty well with C++, and I'm pretty sure Java has some kind of database access as well. Once you get the interaction running, the operations you mentioned are simple:


CREATE TABLE people (name varchar(32), age int(4), other varchar(32))
SELECT info FROM people WHERE name = '$the_name'
SELECT name FROM people ORDER BY age LIMIT $howmany
INSERT INTO people VALUES ($name,$age,$other)
DELETE FROM people WHERE name = '$name'
UPDATE people SET age = $age WHERE name = '$name'

Share this post


Link to post
Share on other sites
Sorry, I probably worded the problem wrong. I've worked with PHP/MySQL a little but enough to know that a DB solution would be much more practical.

I'm looking at a smaller dataset (say 5-50k) but must run ~1M lines of instructions on them. Mostly find # of oldest people and change info.

Share this post


Link to post
Share on other sites
Wait... if you know the instructions in advance, wouldn't it be easier to perform the operations right away in an existing statistical analysis program (or a spreadsheet), than to code a "faster solution" ?

Otherwise, how are the instructions provided?

EDIT: I'm asking this because you might be able to do a two-step analysis, changing the representation from one step to another.

Share this post


Link to post
Share on other sites
A database would be the most extensible solution as Toohrvyk said, but if you're sticking with code I would do this:

problem:
-find info from name

It seems you've got this figured out, a hash map is probably the best solution, search the net for a hash function that's optimal for name strings, its a common enough application that there must be some good information on it.

-find the # oldest/youngest people
Instead of sorting before each list you'll want to simply insert new items in the correct position, and remove items without effecting the order. A Binary tree structure will easilly give you the youngest/oldest n people using left and right depth-first iterators respectively.

-add/remove/change people (1M+) often
Vectors aren't generally good for inserting into the middle. A Binary tree with a pooled allocator would be good here.

Share this post


Link to post
Share on other sites
While we're at it, a "general" approach here would be to use a sorted data structure for the age, to find the minimum and maximum K in O(K + log N). To uniquely identify people by identifiers or strings, I'd use hash tables. Then, all indexers would point at the actual objects, and the objects would point back using one mechanism or another to alter the keys.

Share this post


Link to post
Share on other sites
Quote:
Original post by ToohrVyk
Wait... if you know the instructions in advance, wouldn't it be easier to perform the operations right away in an existing statistical analysis program (or a spreadsheet), than to code a "faster solution" ?

Otherwise, how are the instructions provided?

EDIT: I'm asking this because you might be able to do a two-step analysis, changing the representation from one step to another.


Well the program will need to run on Linux with the instructions piped in command line.

Thanks for all your quick responses! :)

Share this post


Link to post
Share on other sites
Assuming c++, you can store the objects themselves in a set. Since a set is sorted, as long as you either define operator < or pass in an appropriate function to sort by age, the set will remain ordered. Inserting into the set will give you an iterator which won't be invalidated by inserting/deleting from the set (well, deleting an object will invalidate an iterator pointing to *it*, of course).

Insert that iterator into a map of names to iterators.

Just make sure that you delete from both containers when you're done, probably by wrapping them in a class.

That should do what you need.

Inserting and removing often is usually a sign that you shouldn't use vectors, especially if you also want to sort the data often.

Share this post


Link to post
Share on other sites
You basically have two indices - name and age.

I'd check for Boost solutions first ;), and then:

- Create comparators for Person*'s that compare the name or age of the pointed-at person.
- Create a map<std::string, Person*> and set<Person*, person_age_comparator>. The map is because we have to *look up* by name.
- Wrap the whole thing up in a class.

The Person objects would be dynamically allocated, because we want them not to get moved around in memory if we add more (as may happen with a std::vector or std::map or actually, almost anything other than a std::list). We could use std::list, but it doesn't give us anything because we don't care about iterating over Persons in the order of insertion.

Not at all tested!


// The lines with backslashes at the end have a space after the backslash to
// avoid munging by the forums. You would need to remove those spaces.
#define IMPLEMENT_COMPARATOR(m) \
template <typename T> \
struct compare_member_##m : public std::binary_function(T, T, bool) { \
bool operator()(const T& x, const T& y) { \
return x->m < y->m; \
} \
}

IMPLEMENT_COMPARATOR(age);

class Person {
std::string name;
int age;
// etc.
}

template <typename T>
void deleteAt(T* t) { delete t; }

class Census {
typedef std::map<std::string, Person*> lookup_t;
lookup_t name_lookup;
typedef std::set<Person*, compare_member_age> ordering_t;
ordering_t age_ordering;

typedef std::pair<ordering_t::iterator, ordering_t::iterator> age_range;

public:
void add(args) {
Person* p = new Person(args);
name_lookup[p.name] = p;
age_ordering.insert(p);
}

void remove(const std::string& s) {
lookup_t::iterator it = name_lookup.find(s);
if (it == name_lookup.end()) { return; }
age_ordering.remove(it->second);
delete it->second;
name_lookup.remove(s);
}

Person* get(const std::string& s) {
lookup_t::iterator it = name_lookup.find(s);
return (it == name_lookup.end()) ? 0 : it->second;
}

age_range nYoungest(int n) {
ordering_t::iterator youngest = age_ordering.begin();
return std::make_pair(youngest, std::distance(youngest, n));
}

age_range nOldest(int n) {
ordering_t::iterator end = age_ordering.end();
return std::make_pair(std::distance(end, -n), end);
}

~Census() {
std::for_each(age_ordering.begin(), age_ordering.end(), deleteAt<Person>);
}
};

Share this post


Link to post
Share on other sites
Quote:
// The lines with backslashes at the end have a space after the backslash to
// avoid munging by the forums. You would need to remove those spaces.
#define IMPLEMENT_COMPARATOR(m) \
template <typename T> \
struct compare_member_##m : public std::binary_function(T, T, bool) { \
bool operator()(const T& x, const T& y) { \
return x->m < y->m; \
} \
}


Oh, macros, evil! [grin] I wonder if it wouldn't be possible to get around this using pointers-to-members; I'm thinking:

template <typename T, typename M>
class compare_member : public std::binary_function<T*,T*,bool>
{
typedef M (T::*memtype);
typedef compare_member<T,M> self;
memtype mem;

public:

compare_member(memtype mem) : mem(mem) {}

compare_member(const self & other) : mem(other.mem) {}

self & operator=(const self & other)
{
mem = other.mem;
return *this;
}

bool operator()(const T*& x, const T*& y) const
{
return x->(*mem) < y->(*mem);
}
};

typedef compare_member<Person,int> compare_int;
typedef compare_member<Person,std::string> compare_string;

std::set<Person*,compare_int> by_age(compare_int(Person::age);
std::set<Person*,compare_string> by_name(compare_string(Person::age));



The compiler would then possibly optimize member pointer usage. Still, I say hashtables in Java are better. Another detail: given that input is given on the command line, I'd say that reading the commands might be a bottleneck as well.

Share this post


Link to post
Share on other sites
Quote:
Original post by ToohrVyk
Oh, macros, evil! [grin] I wonder if it wouldn't be possible to get around this using pointers-to-members; I'm thinking:


The difficulty is in getting the member-pointer to the constructor of the comparator... wait, never mind, I looked it up, and std::set indeed offers a constructor accepting an instance of the comparator. Naturally. :) Well played, sir.

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this