C++ O(n) Hashing? Algorithm

6 comments, last by iMalc 11 years, 2 months ago

Hi,

Pseudocode plus description of my problem.

I have a long array of structs, each containing three unsigned integers x, y, and z, and a variable that we'll use later.


struct A { unsigned int x,y,z; int index; };
//...
struct A* long_array; //has thousands of elements, most are repeats

I need to know how many unique triplets there are.

After this is known, a new array is allocated, containing a new struct corresponding to each unique triplet. Then, in the original array, set each structure's index variable to the location of the unique struct in the new array.


int num_unique = /*what goes here*/;
struct A* unique_array = new struct A[num_unique];
/*
Now execute some algorithm that solves the following:
for each element "elem" of long_array:
    unique_array[elem.index] has the same triplet as elem.
*/

C++11 is available, and everything else is fair game. I'm having trouble figuring out a good way.

I should note that the above isn't a prescriptive definition of an algorithm. I just need the end result--each struct in the original array points to its unique struct in the new array. For example, combining the two operations of counting and setting the index into one pass is perfectly fine.
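For concreteness, here's a sketch of the one-pass end result I have in mind. The key packing is purely illustrative (it assumes each value fits in 21 bits, which mine don't necessarily), and the unordered_map is just to show the shape; the complexity constraint below is the actual problem.


#include <unordered_map>
#include <vector>

// Illustrative one-pass dedup: counts uniques and assigns indices together.
// Assumes x, y, z each fit in 21 bits so the triplet packs into 64 bits.
void build_unique(std::vector<A> &in, std::vector<A> &unique_out) {
    std::unordered_map<unsigned long long, int> seen; // packed key -> index
    for (A &elem : in) {
        unsigned long long key = ((unsigned long long)elem.x << 42)
                               | ((unsigned long long)elem.y << 21)
                               |  (unsigned long long)elem.z;
        auto it = seen.find(key);
        if (it == seen.end()) {
            it = seen.emplace(key, (int)unique_out.size()).first;
            unique_out.push_back(elem); // first occurrence defines the unique entry
        }
        elem.index = it->second; // point back at the unique struct
    }
}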

As suggested in the title, there is an additional constraint: the algorithm needs to run in strictly O(n) time. This means that memory allocations of the std containers need to be taken into account. For example, one might cruftily try to solve the problem with a:


unordered_map<unsigned int,unordered_map<unsigned int,unordered_map<unsigned int,struct A*>>>

. . . that is constructed in a linear pass through the original array, but this doesn't work since the amortized allocations that must happen sum up to O(n^2) unless they are precomputed.

Bungling around with the std containers isn't something I want to leave to chance, so I'm asking for advice. As above, any suggestions that solve the problem cleanly and quickly would be appreciated!

Thanks,
-G

[size="1"]And a Unix user said rm -rf *.* and all was null and void...|There's no place like 127.0.0.1|The Application "Programmer" has unexpectedly quit. An error of type A.M. has occurred.
[size="2"]


I didn't quite understand what you need to do once you find unique triplets, but what's wrong with unordered_map? It seems to be linear enough.


#include <cstdint>
#include <unordered_map>
#include <vector>

size_t foo(std::vector<A> &in) {
    // Pack the three values into a single 32-bit key: 11 + 11 + 10 bits.
    // Note this is lossy if x or y >= 2048 or z >= 1024.
    auto hash = [](const A &a) -> uint32_t {
        return (a.x & 0x7FFu)
             | ((a.y & 0x7FFu) << 11)
             | ((a.z & 0x3FFu) << 22);
    };
    std::unordered_map<uint32_t, int> uniques;
    for (auto &val : in) {
        uniques[hash(val)] = val.index;
    }
    return uniques.size();
}

On my system I got these results:


num_elements    time_taken
10000           0 ms
100000          6 ms
1000000         47 ms
10000000        1670 ms
20000000        4000 ms

Seems it scales nicely up to a million, then it starts growing faster.

Yes, a single unordered map should be fine. If you want to be sure to avoid hash collisions (which may happen with the above example), you can define a hash for the struct, typically XORing the three numbers together, and use the struct itself as the key.
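For instance, something like this (the functor names are just illustrative):


#include <cstddef>
#include <functional>

// Illustrative hash/equality functors so the struct itself can be the key.
// Distinct triplets then never collapse into one key; hash collisions only
// affect bucketing, not correctness.
struct AHash {
    std::size_t operator()(const A &a) const {
        // XOR of the hashed components, as described above; the shifts keep
        // equal components from cancelling each other out.
        return std::hash<unsigned int>()(a.x)
             ^ (std::hash<unsigned int>()(a.y) << 1)
             ^ (std::hash<unsigned int>()(a.z) << 2);
    }
};
struct AEqual {
    bool operator()(const A &l, const A &r) const {
        return l.x == r.x && l.y == r.y && l.z == r.z;
    }
};
// Usage: std::unordered_map<A, int, AHash, AEqual> uniques;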

Non-linear performance, by the way, would probably indicate a poor hash. In that case, try std::hash on the integers first.

You could run a linear-time sorting algorithm over your data, then iterate the results while ignoring any items that were the same as the previous item (to remove duplicates / only count unique items).

I didn't quite understand what you need to do once you find unique triplets, but what's wrong with unordered_map? It seems to be linear enough.

Technically, that's not O(n), but in practice the array won't have more than about a million elements. Still, 47ms is longer than it takes the rest of the program to actually generate the data in the first place. The main constraint is from the user, and 47ms is nearly unnoticeable, but I'd still like to get it lower if possible, since several of these operations may have to run in sequence.

Non-linear performance, by the way, would probably indicate a poor hash. In that case, try std::hash on the integers first.

Not sure I follow. My conjecture is that unordered_map effectively uses a dynamic array for its hash table: add too many entries and you need to resize. I admit I don't know, but the above results would seem to suggest nonlinearity (though it could also be cache effects).

You could run a linear-time sorting algorithm over your data, then iterate the results while ignoring any items that were the same as the previous item (to remove duplicates / only count unique items).

I had thought of sorting the data in linearithmic time, but I suppose radix sort is linear time. Feels like overkill to sort something, but this does meet the requirements.
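For reference, here's roughly what the sort-then-scan version looks like (std::sort for brevity, which is O(n log n); swapping in a radix sort on a packed key would make the whole thing linear):


#include <algorithm>
#include <vector>

// Sketch of the sort-then-scan approach: sort a working copy, then count
// elements that differ from their predecessor.
size_t count_unique_sorted(std::vector<A> work) { // by value: order changes
    auto less = [](const A &l, const A &r) {
        if (l.x != r.x) return l.x < r.x;
        if (l.y != r.y) return l.y < r.y;
        return l.z < r.z;
    };
    std::sort(work.begin(), work.end(), less);
    size_t count = 0;
    for (size_t i = 0; i < work.size(); ++i) {
        if (i == 0 || less(work[i - 1], work[i]))
            ++count; // differs from predecessor, so it's a new triplet
    }
    return count;
}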

I came up with another idea--it's essentially a generalization of my crufty algorithm from before. You make an associative array from a byte onto another associative array, and recurse: for each of the bytes of the three unsigned ints (3*4=12 of them), you work your way down one level. Since each table has a constant size (256), there are no amortized resizes. Time and memory are both at worst O(n).
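A rough sketch of that trie (names illustrative; it assumes 4-byte unsigned ints, and note each node is 256 pointers, about 2 KB on 64-bit, so the constant factor is steep):


#include <array>
#include <cstring>
#include <memory>

// Byte-trie sketch: 12 key bytes, one fixed 256-way node per level, so no
// table ever resizes.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child; // null until visited
    int index = -1;                                   // set at the leaf level
};

// Returns the unique index for a's triplet, assigning a fresh one if new.
int trie_insert(TrieNode &root, const A &a, int &next_index) {
    unsigned char key[12];
    std::memcpy(key + 0, &a.x, 4);
    std::memcpy(key + 4, &a.y, 4);
    std::memcpy(key + 8, &a.z, 4);
    TrieNode *node = &root;
    for (int i = 0; i < 12; ++i) {
        std::unique_ptr<TrieNode> &slot = node->child[key[i]];
        if (!slot) slot.reset(new TrieNode());
        node = slot.get();
    }
    if (node->index < 0) node->index = next_index++; // first occurrence
    return node->index;
}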

[size="1"]And a Unix user said rm -rf *.* and all was null and void...|There's no place like 127.0.0.1|The Application "Programmer" has unexpectedly quit. An error of type A.M. has occurred.
[size="2"]

unordered_map<unsigned int,unordered_map<unsigned int,unordered_map<unsigned int,struct A*>>>


. . . that is constructed in a linear pass through the original array, but this doesn't work since the amortized allocations that must happen sum up to O(n^2) unless they are precomputed.

Actually, that's wrong. With the required geometric growth of a standard library container, and assuming the common growth factor of 2:
The total number of allocations is log(n),
The sum of the sizes of the allocations is 2n,
The overall running time is amortised O(n).
Nowhere does n*n come into it.

Have you tried an unordered_map, and what were the results?
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms


unordered_map<unsigned int,unordered_map<unsigned int,unordered_map<unsigned int,struct A*>>>
. . . that is constructed in a linear pass through the original array, but this doesn't work since the amortized allocations that must happen sum up to O(n^2) unless they are precomputed.

Actually, that's wrong. With the required geometric growth of a standard library container, and assuming the common growth factor of 2:
The total number of allocations is log(n),
The sum of the sizes of the allocations is 2n,
The overall running time is amortised O(n).
Nowhere does n*n come into it.

But each reallocation requires O(n) copying, I would think? So if the total number of allocations is log(n) instead of n, which is reasonable, that's still O(n*log(n)).

Have you tried an unordered_map, and what were the results?

Actually, yes. I'm trying out a set, since I really only care about the set of unique triplets. I found that there's a preallocation option on construction of these sets, so I set it to twice the number of elements possible. It's ghastly slow--taking 500ms in release mode for a measly 250,000 elements. I think that may have something to do with how I'm implementing it, so I'm starting a new thread specifically for that.
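Roughly what I'm doing, using hash/equality functors over the struct like the ones sketched above:


#include <unordered_set>
#include <vector>

// Sketch: bucket count fixed at construction (the "preallocation option"),
// so no rehashing should occur during the pass.
size_t count_unique(const std::vector<A> &in) {
    std::unordered_set<A, AHash, AEqual> uniques(in.size() * 2);
    for (const A &elem : in)
        uniques.insert(elem);
    return uniques.size();
}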

[size="1"]And a Unix user said rm -rf *.* and all was null and void...|There's no place like 127.0.0.1|The Application "Programmer" has unexpectedly quit. An error of type A.M. has occurred.
[size="2"]



unordered_map<unsigned int,unordered_map<unsigned int,unordered_map<unsigned int,struct A*>>>
. . . that is constructed in a linear pass through the original array, but this doesn't work since the amortized allocations that must happen sum up to O(n^2) unless they are precomputed.

Actually, that's wrong. With the required geometric growth of a standard library container, and assuming the common growth factor of 2:
The total number of allocations is log(n),
The sum of the sizes of the allocations is 2n,
The overall running time is amortised O(n).
Nowhere does n*n come into it.

But each reallocation requires O(n) copying, I would think? So if the total number of allocations is log(n) instead of n, which is reasonable, that's still O(n*log(n)).

Nope, that's not the case. This is how a vector grows, and an unordered_map is very similar. Starting from a typical initial size of 8:
Upon adding the 9th item, the existing 8 items are copied into the new block with room for 16 items.
Upon adding the 17th item, the existing 16 items are copied into the new block with room for 32 items.
Upon adding the 33rd item, the existing 32 items are copied into the new block with room for 64 items.
Upon adding the 65th item, the existing 64 items are copied into the new block with room for 128 items.
Upon adding the 129th item, the existing 128 items are copied into the new block with room for 256 items.
Total number of growth-related copies to expand to 256 items: 8+16+32+64+128 = 248, which is very close to 256. Continue adding more items up to the next growth point and it's the same: always proportional to n.
If you don't follow my logic there, or for some reason don't believe it, just know that it really is amortised O(n) and you can probably search plenty of other more technical explanations as to why.
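If you'd rather see it empirically, here's a quick, purely illustrative loop that counts the element copies caused by growth:


#include <cstddef>
#include <iostream>
#include <vector>

// Count how many elements get moved by reallocations while pushing n items.
int main() {
    std::vector<int> v;
    std::size_t copies = 0, n = 1 << 20;
    for (std::size_t i = 0; i < n; ++i) {
        if (v.size() == v.capacity()) copies += v.size(); // about to regrow
        v.push_back((int)i);
    }
    std::cout << "pushed " << n << ", growth copies " << copies
              << " (~" << (double)copies / n << " per element)\n";
}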

Stroustrup himself says:

People sometimes worry about the cost of std::vector growing incrementally. I used to worry about that and used reserve() to optimize the growth. After measuring my code and repeatedly having trouble finding the performance benefits of reserve() in real programs, I stopped using it except where it is needed to avoid iterator invalidation (a rare case in my code). Again: measure before you optimize.

I myself have found the same thing. It makes very little difference.

There is little sense, if any, in considering anything else until you have measured the performance using unordered_map. You would be very, very hard pressed to find anything better in practice.
"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

This topic is closed to new replies.
