# Duplicate Number Detection Algorithm

## Recommended Posts

This problem kept me up until 2:30am last night/this morning. I don't even know if there is a solution, but I figured that presenting it here might bring forth a clever algorithm that I have been unable to come up with. Say that I have an array of numbers (32-bit values). I traverse this list one item at a time, and keep track of some operation based on the current number, and the results of the previous operation. The question is: how can I detect if the same number shows up more than once in the list? (That is, what streaming operation can I perform that will allow me to detect if a number has ever been repeated, based solely on the current number in my traversal, and the result of the operation so far?) The only thing that you know about these values is that they are non-zero, if that makes a difference. My first thought was to use an xor-type operation, but that failed miserably, because, while the value will cancel itself out, you have no way of differentiating it from a unique value. I have this idea that I'm not explaining myself very clearly, so I'll put forth some pseudocode:
unsigned long array[ARRAY_SIZE];

// Some code that fills the array goes here...

// Detect duplicate entries
bool DetectDuplicateEntriesIn(unsigned long * array)
{
unsigned long currentSum; // <-- Or some other structure if the algorithm in question calls for it
for(int i=0; i<ARRAY_SIZE; i++)
{
if(CheckIfThisValueHasShownUpBeforeNow(currentSum, array[i]))
{
return true;
}
currentSum = PerformOperationOn(currentSum, array[i]);
}
return false;
}


So what I'm looking for is an algorithm that I can place in CheckIfThisValueHasShownUpBeforeNow(...) and PerformOperationOn(...). Any ideas? Thanks! - Adiss EDIT: Cleaned up comments in the code EDIT: Added information about non-zero values

##### Share on other sites
Nice Coder    366
Ok an, o(n^2) way, for each number, check each other number before it in the list. if you find a match, you've got a duplicate.

An o(n log2 n) way, keep the list sorted, and binary search for it. if you find one, you've got a match.

You hsould also look at hash tables (Which arefast).

From,
Nice coder

##### Share on other sites
Yes, that was my first impulse. Easy to implement, too.

Unfortunately, I can't keep the list sorted (impractical due to the nature of the data). I guess that in my gut I feel that there ought to be an O(N) solution to this problem.

If I added each item to an stl map or hash_map (as you suggest, Nice Coder), that would work too:

std::map myMap<unsigned long, int>;

And then "reference count" the values in the list.

myMap[currentValue]++;

And then check with:
if(myMap[currentValue] != 1)

(doing it this way would also necessitate reversing the order of the loop: the reference has to be incremented BEFORE it can be checked, not after)

That requires a map insertion for unique values, and a map lookup for all values. That's O(NlogN), which is better than O(N^2), but not at the level of O(N) that I'm searching for.

Thanks, though!

Any other ideas?

##### Share on other sites
Igni-

That's a really interesting link. Unfortunately, the problem that they present there is under different constraints than mine.

Quote:
 You have a read-only array A[1..n] which is populated by numbers from 1..n-1

I have a read-only array A[1..n] which is populated by numbers 0..MAX_INT, where n is significantly smaller than MAX_INT. As a result, I can't use their solution, because they are indexing the array by values in the array, and I'm not guaranteed of that being a safe operation.

Pity, because I ws really excited to see a link that seemed to be my exact problem!

Oh, well. Thank you, though! It's an interesting read.

##### Share on other sites
In case you must traverse the list array with spesific order
(
means for example
[for(int i=0; i<ARRAY_SIZE; i++)]
produce different results than
[for(int i=ARRAY_SIZE - 1; i=0; i--)]
)

one idea is:

Create an AVL tree with record key 32bit values.
CheckIfThisValueHasShownUpBeforeNow() function will try to insert the 32 bit value to the AVL tree.

If succeed (means the 32 bit key is has not been inserted before) continue
else
break.

I don't know how fast this could. AVL trees are fast in searching values but a bit slow when inserting values.
I don't know if that helps, but that is what I would do...

-------------------------

If you don't need to traverse in spesific order the list just sort the list and then if you find (while traversing the list) 2 (or more) continuous values to be same just ignore then.

Hope I've helped.

##### Share on other sites
civguy    308
Quote:
 Original post by adiss_mcbut not at the level of O(N) that I'm searching for.
hash map gives you O(N).

##### Share on other sites
Guest Anonymous Poster
Quote:
Original post by civguy
Quote:
 Original post by adiss_mcbut not at the level of O(N) that I'm searching for.
hash map gives you O(N).

When I interviewed at Microsoft I was asked this question. The guy didn't like my hash table answer because once a large enough group of numbers is seen the overhead is too high. Although, for a small group of numbers the overhead is far less than say making one giant array (numbers are bounded) on the stack. The interviewer didn't understand the concept of a dynamically sized hash table. He said I should've used the giant array on the stack and twiddled each bit to signify if a number had been seen before. The addressing mechanism would identify a word, then an offset within the word. Today is my last day at MS. Oh well. ;)

##### Share on other sites
Guest Anonymous Poster
Here's an implementation of what I was talking about in the above post. It assumes 32-bit integers.

#define page( i ) ( ( i ) >> 5 )#define offset( i ) ( 0x1 << ( ( i ) & 0x0000001f )int store[ INT_MAX / 32 ];bzero( store, sizeof( store ) );for (;;) {    unsigned int i = next();    if ( store[ page( i ) ] & offset( i ) ) {        // seen before    } else {        store[ page( i ) ] |= offset( i );    }}

Linear time, constant space, but it's a lot of constant space.

##### Share on other sites
Guest Anonymous Poster
Actually I made a mistake in that, but I'll let the readers figure it out. ;)