Algorithm to elimiate duplicates in linked list?

Started by
6 comments, last by MaulingMonkey 19 years, 6 months ago
Is it faster to take each entry and compare it against all the others(from its point onward) and removing them when a duplicate is found? ..or... Is it faster to sort the whole list, then go through it and just look at the next one. If that was a dup then remove it and dont advance in the list until its at the end or its not a dup? Or is there an even better way that I'm not thinking of?
Advertisement
If you can compare just for equality, you can sort them out one by one comparing with the sorted out elements.

If your elements support a < order for example, you can sort them into a tree and then sort them back by an inorder traversal. When inserting, duplicates should be removed.

When you dont care about the order afterwards, you can try to put the elements into a hash map.
i would say sort the list the first then when you iterate the list again all you have to do is check pairs for equality, here is an example with with std::list:

#include <iterator>#include <algorithm>#include <list>#include <iostream>template< typename C >inline void eliminate_duplicates(std::list< C >& c) {   if(!c.empty()) {     c.sort();     c.unique();   }}int main() {   std::list<int> lints;   lints.push_back(30);   lints.push_back(5);   lints.push_back(2);   lints.push_back(30);   lints.push_back(5);   std::copy(lints.begin(), lints.end(),             std::ostream_iterator<int>(std::cout, ", "));   std::cout << "\neliminating duplicates....\nresult is: ";   eliminate_duplicates(lints);   std::copy(lints.begin(), lints.end(),             std::ostream_iterator<int>(std::cout, ", "));   return 0;}


output:
30, 5, 2, 30, 5, eliminating duplicates....result is: 2, 5, 30, 


[Edited by - snk_kid on October 5, 2004 4:27:24 AM]
Quick analysis:
The first algorithm works like this.
for each item1 in list    for each item2 in list         if item2 equals item1 and item2 is not same reference as item1             remove item2 from list         fi     rofrof

It's clear that this is O(n2).
Assuming that you can sort the list in O(n log n), the second algorithm has a O(n log n) + O(n) complexity. Much better.

If you can, however, not sort the list in O(n log n) than it's basically O(n2) + O(n2) - doubled complexity.

Check if your sorting is O(n log n), otherwise you will get worse performance with the second approach.
Quote:
Is it faster to take each entry and compare it against all the others(from its point onward) and removing them when a duplicate is found?

This method has a complexity of the order O(n^2/2)

Quote:
Is it faster to sort the whole list, then go through it and just look at the next one. If that was a dup then remove it and dont advance in the list until its at the end or its not a dup?


The best way to do this idea, i believe, would be to put the elements in a vector, sort the vector and then run your "just look at the next one".
You have to put every element in a vector O(n), sort the vector O(nlog(n)) (with quicksort for example), and then making the search/elimination O(n). Notice that the elements are only sorted in the vector, not on the list (make a vector of pointers, for example). You can then compare on the vector and eliminate on the list (there's a small trick here because you want to compare an element and eliminate a cell). O(n) + O(nlog(n)) + O(n) -> given a complexity of the order O(nlog(n)). To do this, though, you need to be able to compare the elements using not only "==" but also with "<=" (for quick sort purposes).

One has lower complexity, but it also has some overhead. What is your estimated size for the list?

Quote:
Or is there an even better way that I'm not thinking of?

You can try what nmi said about the hashmap, but it also depends on what you are willing to do. For example, if you want to use a tree, think if it's worth the trouble of programming/testing the tree just to speed up a problem. Or perhaps you already have some data structures programmed.
darookie:

There is a little optomization on the double iteration methood though. When iterating through the second time only look at the ones after item1.

I think its less than o(n^2) if done this way, im not so good with figuring out o notation (why i posted this) so im not sure what it would be. Also, with the sorting, can't sometimes sorting be much worse than others just depending on the list (i think)? So an advantage to the first way would be that it would be constant (almost, if you had to remove something it would add a few things to do).

EDIT - got beaten to it


nmi:

whats a hashmap? i googled but couldnt seem to find a good explanation?
Quote:Original post by antistuff
whats a hashmap? i googled but couldnt seem to find a good explanation?


a map container implemented with a hash table data structure.
Quote:Original post by nmiWhen you dont care about the order afterwards, you can try to put the elements into a hash map.


A hash set would be more appropriate, and would be the method I'd suggest (without having actually profiled the different ways of doing things myself):

#include <list>#include <hash_set> //#include <set> for non hashed version#include <iterator>#include <iostream>int main ( int argc , char ** argv ){    std::list< int > somelist;    somelist.push_back( 20 );    somelist.push_back( 10 );    somelist.push_back( 5 );    somelist.push_back( 20 );    somelist.push_back( 5 );    std::hash_set< int > somehashset; //std::set< int > for non hashed version    somehashset.insert( somelist.begin() , somelist.end() );    std::cout << "somelist contents: ";    std::copy( somelist.begin() , somelist.end() , std::ostream_iterator<int>( std::cout , " " ) );    std::cout << std::endl << "somehashset contents: ";    std::copy( somehashset.begin() , somehashset.end() , std::ostream_iterator<int>( std::cout , " " ) );    std::cout << std::endl;    std::cin.get();}

Quote:Original post by snk_kid
Quote:Original post by antistuff
whats a hashmap? i googled but couldnt seem to find a good explanation?


a map container implemented with a hash table data structure.

Let me elaborate further:
A map is a container that has a unique key that is used to lookup a value. A multimap can have multiple keys of the same value, each with it's own value component.

"Hash" refers to a technique of generating a small(er) number from a given key, so you can reduce the list of items you have to check against - it is used for speeding things up.

"Hashmap" or more appropriately "A hashed map" refers to a map that uses hashing to speed the lookup of a key.

Now, as for why a (hashed) set is a more appropriate container:

In this case, I am taking advantage of the fact that a set is a set of unique keys (emphisis: unique!!! multiset is a set of unique keys, not necessairly unique, by contrast). They have no corresponding value as a map does. By inserting everything from the list into this set, I eliminate identical items (due to the unique requirement).

Both std::hash_set and std::hash_map are not part of the C++ standard, but rather are SGI extensions, meaning you need to download their version of the STL:

http://sgi.com/tech/stl/

Note: hashed sets (and multisets, maps, and multimaps) do not guarantee any ordering of their elements, nor that any order will be mantained the next time they iterated through. Sets (and multisets, maps, and multimaps) by contrast will be sorted - for details, see:
http://sgi.com/tech/stl/SortedAssociativeContainer.html
http://sgi.com/tech/stl/set.html
http://sgi.com/tech/stl/Map.html
http://sgi.com/tech/stl/Map.html
http://sgi.com/tech/stl/Multimap.html


Note: using a set in the first place instead of a list will be more efficient for this example, and may be more efficient for your case.

And because I'm bored, example using just the set:
#include <list>#include <hash_set> //#include <set>#include <iterator>#include <iostream>int main ( int argc , char ** argv ){    std::hash_set< int > someset; //std::set< int > someset;    someset.insert( 20 );    someset.insert( 10 );    someset.insert( 5 );    someset.insert( 20 );    someset.insert( 5 );    std::cout << "someset contents: ";    std::copy( someset.begin() , someset.end() , std::ostream_iterator<int>( std::cout , " " ) );    someset.erase( 20 );    std::cout << std::endl << "someset contents: ";    std::copy( someset.begin() , someset.end() , std::ostream_iterator<int>( std::cout , " " ) );    std::cout << std::endl;    std::cin.get();}


[Edited by - MaulingMonkey on October 5, 2004 5:00:01 AM]

This topic is closed to new replies.

Advertisement