Best data structure for this problem?

Started by
9 comments, last by Zahlman 17 years, 1 month ago
A quick example outline: data: -name -age -other info problem: -find info from name -find the # oldest/youngest people -add/remove/change people (1M+) often Runtime performance is extremely important, memory shouldn't be an issue. I only know C++ / Java (so whichever is faster). I approached this problem by creating a vector of objects for search/sort and a second associative array to link name to their corresponding vector indexes. Before every search I would run a quicksort on the vector (and thus change the associative array's corresponding index). To modify people by name I would look up the index from the associative array. Is there a faster way to doing this? A binary tree based on age? Linked lists? [Edited by - AncientPC on March 8, 2007 2:43:35 PM]
Advertisement
If you have small amounts of data, then it doesn't really matter. One million might not qualify as "small", though.

For larger datasets, a database is designed for this kind of operations, it will be both fast and easy on your memory. postgreSQL interacts pretty well with C++, and I'm pretty sure Java has some kind of database access as well. Once you get the interaction running, the operations you mentioned are simple:

CREATE TABLE people (name varchar(32), age int(4), other varchar(32))SELECT info FROM people WHERE name = '$the_name'SELECT name FROM people ORDER BY age LIMIT $howmanyINSERT INTO people VALUES ($name,$age,$other)DELETE FROM people WHERE name = '$name'UPDATE people SET age = $age WHERE name = '$name'
Sorry, I probably worded the problem wrong. I've worked with PHP/MySQL a little but enough to know that a DB solution would be much more practical.

I'm looking at a smaller dataset (say 5-50k) but must run ~1M lines of instructions on them. Mostly find # of oldest people and change info.
Wait... if you know the instructions in advance, wouldn't it be easier to perform the operations right away in an existing statistical analysis program (or a spreadsheet), than to code a "faster solution" ?

Otherwise, how are the instructions provided?

EDIT: I'm asking this because you might be able to do a two-step analysis, changing the representation from one step to another.
A database would be the most extensible solution as Toohrvyk said, but if you're sticking with code I would do this:

problem:
-find info from name

It seems you've got this figured out, a hash map is probably the best solution, search the net for a hash function that's optimal for name strings, its a common enough application that there must be some good information on it.

-find the # oldest/youngest people
Instead of sorting before each list you'll want to simply insert new items in the correct position, and remove items without effecting the order. A Binary tree structure will easilly give you the youngest/oldest n people using left and right depth-first iterators respectively.

-add/remove/change people (1M+) often
Vectors aren't generally good for inserting into the middle. A Binary tree with a pooled allocator would be good here.

throw table_exception("(? ???)? ? ???");

While we're at it, a "general" approach here would be to use a sorted data structure for the age, to find the minimum and maximum K in O(K + log N). To uniquely identify people by identifiers or strings, I'd use hash tables. Then, all indexers would point at the actual objects, and the objects would point back using one mechanism or another to alter the keys.
Quote:Original post by ToohrVyk
Wait... if you know the instructions in advance, wouldn't it be easier to perform the operations right away in an existing statistical analysis program (or a spreadsheet), than to code a "faster solution" ?

Otherwise, how are the instructions provided?

EDIT: I'm asking this because you might be able to do a two-step analysis, changing the representation from one step to another.


Well the program will need to run on Linux with the instructions piped in command line.

Thanks for all your quick responses! :)
Assuming c++, you can store the objects themselves in a set. Since a set is sorted, as long as you either define operator < or pass in an appropriate function to sort by age, the set will remain ordered. Inserting into the set will give you an iterator which won't be invalidated by inserting/deleting from the set (well, deleting an object will invalidate an iterator pointing to *it*, of course).

Insert that iterator into a map of names to iterators.

Just make sure that you delete from both containers when you're done, probably by wrapping them in a class.

That should do what you need.

Inserting and removing often is usually a sign that you shouldn't use vectors, especially if you also want to sort the data often.
You basically have two indices - name and age.

I'd check for Boost solutions first ;), and then:

- Create comparators for Person*'s that compare the name or age of the pointed-at person.
- Create a map<std::string, Person*> and set<Person*, person_age_comparator>. The map is because we have to *look up* by name.
- Wrap the whole thing up in a class.

The Person objects would be dynamically allocated, because we want them not to get moved around in memory if we add more (as may happen with a std::vector or std::map or actually, almost anything other than a std::list). We could use std::list, but it doesn't give us anything because we don't care about iterating over Persons in the order of insertion.

Not at all tested!

// The lines with backslashes at the end have a space after the backslash to// avoid munging by the forums. You would need to remove those spaces.#define IMPLEMENT_COMPARATOR(m) \ template <typename T> \ struct compare_member_##m : public std::binary_function(T, T, bool) { \   bool operator()(const T& x, const T& y) { \     return x->m < y->m; \   } \ }IMPLEMENT_COMPARATOR(age);class Person {  std::string name;  int age;  // etc.}template <typename T>void deleteAt(T* t) { delete t; }class Census {  typedef std::map<std::string, Person*> lookup_t;  lookup_t name_lookup;  typedef std::set<Person*, compare_member_age> ordering_t;  ordering_t age_ordering;  typedef std::pair<ordering_t::iterator, ordering_t::iterator> age_range;  public:  void add(args) {    Person* p = new Person(args);    name_lookup[p.name] = p;    age_ordering.insert(p);  }  void remove(const std::string& s) {    lookup_t::iterator it = name_lookup.find(s);    if (it == name_lookup.end()) { return; }    age_ordering.remove(it->second);    delete it->second;    name_lookup.remove(s);  }  Person* get(const std::string& s) {    lookup_t::iterator it = name_lookup.find(s);    return (it == name_lookup.end()) ? 0 : it->second;  }  age_range nYoungest(int n) {    ordering_t::iterator youngest = age_ordering.begin();    return std::make_pair(youngest, std::distance(youngest, n));  }  age_range nOldest(int n) {    ordering_t::iterator end = age_ordering.end();    return std::make_pair(std::distance(end, -n), end);  }  ~Census() {    std::for_each(age_ordering.begin(), age_ordering.end(), deleteAt<Person>);  }};
Quote:// The lines with backslashes at the end have a space after the backslash to
// avoid munging by the forums. You would need to remove those spaces.
#define IMPLEMENT_COMPARATOR(m) \
template <typename T> \
struct compare_member_##m : public std::binary_function(T, T, bool) { \
bool operator()(const T& x, const T& y) { \
return x->m < y->m; \
} \
}


Oh, macros, evil! [grin] I wonder if it wouldn't be possible to get around this using pointers-to-members; I'm thinking:

template <typename T, typename M>class compare_member : public std::binary_function<T*,T*,bool> {  typedef M (T::*memtype);  typedef compare_member<T,M> self;  memtype mem;public:  compare_member(memtype mem) : mem(mem) {}  compare_member(const self & other) : mem(other.mem) {}  self & operator=(const self & other)  {    mem = other.mem;    return *this;  }  bool operator()(const T*& x, const T*& y) const  {    return x->(*mem) < y->(*mem);  }};typedef compare_member<Person,int> compare_int;typedef compare_member<Person,std::string> compare_string;std::set<Person*,compare_int> by_age(compare_int(Person::age);std::set<Person*,compare_string> by_name(compare_string(Person::age));



The compiler would then possibly optimize member pointer usage. Still, I say hashtables in Java are better. Another detail: given that input is given on the command line, I'd say that reading the commands might be a bottleneck as well.

This topic is closed to new replies.

Advertisement