Interpreting unordered_map key as blob of bytes for hashing?

Started by
8 comments, last by jpetrie 8 years, 10 months ago

I use some unordered_maps with non-primitive keys, and I'm a bit stuck on how to write proper hash functions for them.

I feel the cleanest way is to dump the bytes of the key (the bytes I want to include) into an array and hash that with std::hash (or some specific hash function). Is that how it's usually done?

Currently I use something more 'structure-aware', like hashing each member variable and summing the hashes, or such, but that's no good.

I don't think std::hash even works for arrays (it seems to have specializations for strings and vectors only?). How would I go about specializing it for arrays? Should I just grab something like murmurhash3 at this point?

o3o


What's wrong with hashing each member together and combining the resulting hashes in some appropriate fashion, exactly? That's the typical approach. Reinterpreting the key as an array of bytes is also doable, but in some cases is basically the same algorithm in a different shape (and you'll want to be aware that compiler-inserted padding bytes might change the hash if you do it this way, which is bad).
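As a sketch of that typical member-wise approach: the key type and its members below are hypothetical (not from this thread), and the combine step borrows boost::hash_combine's formula. A std::hash specialization like this is what lets a struct key plug straight into an unordered_map.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical key type, for illustration only.
struct TileKey {
    int x;
    int y;
    std::string layer;
};

bool operator==(const TileKey& a, const TileKey& b) {
    return a.x == b.x && a.y == b.y && a.layer == b.layer;
}

namespace std {
template <>
struct hash<TileKey> {
    std::size_t operator()(const TileKey& k) const {
        // Hash each member, then mix them with a boost::hash_combine-style step.
        std::size_t seed = std::hash<int>{}(k.x);
        auto combine = [&seed](std::size_t h) {
            seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        };
        combine(std::hash<int>{}(k.y));
        combine(std::hash<std::string>{}(k.layer));
        return seed;
    }
};
}
```

With the specialization in place, `std::unordered_map<TileKey, int>` works with no extra template arguments.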


I feel the cleanest way is to dump the bytes of the key (the bytes I want to include) into an array and hash that with std::hash (or some specific hash function). Is that how it's usually done?

Currently I use something more 'structure-aware', like hashing each member variable and summing the hashes, or such, but that's no good.

What is the real difference here?

With both solutions, you have to identify "the bytes I want to include", so you're choosing particular member variables to hash either way.

Meanwhile, the first solution involves copying things into an array. The second one doesn't. That's the only actual difference. The choice seems obvious.

If you don't want to come up with your own hash functions, you can use boost. It has boost::hash functions defined for a bunch of primitives and containers. It also has a hash_combine function for putting the individual hashes together. You don't have to think about what's going on internally.

The reason I don't want to do it per-member is that if I do this:

hash = hash(member1) ? hash(member2);

I don't know what operation to use to combine the hashes in place of '?', and I don't know if doing this will weaken the hash and introduce more collisions or such. And the hash functions themselves probably run faster if they have a contiguous string of bytes instead of doing it one member at a time.

(I could do hash = hash(hash(member1) ? hash(member2)) but then I won't get 6000 FPS due to the extra hash *sadface*)

I changed to basically the following:


std::array<std::uint8_t, sizeof(member1) + sizeof(member2)> bytes;
// copy the members into the bytes array (e.g. with std::memcpy, one member at a time)
return murmurhash3(bytes.data(), bytes.size());

And it seems to run faster than what I had before (return hash(member1) + hash(member2)), which is surprising since the implementation of std::hash seemed trivial (on MSVC) compared to murmurhash3. But I can't be sure why it's faster (fewer collisions?).

The copying adds overhead, but it doesn't seem to be noticeable for objects this small. If I need bigger objects, I'll probably just split the murmurhash3 function into multiple functions so I have more control over where the data comes from... (currently it just takes a void* and a length :/)
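A self-contained sketch of that copy-then-hash shape: the Key fields below are stand-ins for the poster's actual members, and a trivial FNV-1a loop stands in for the murmurhash3 call (which, per the thread, takes a void* and a length) so the example compiles on its own.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical key; member1/member2 stand in for the poster's fields.
struct Key {
    std::uint32_t member1;
    std::uint64_t member2;
};

std::uint64_t hashKey(const Key& k) {
    // Copy member by member so compiler-inserted padding never
    // reaches the hash input.
    std::array<std::uint8_t, sizeof k.member1 + sizeof k.member2> bytes;
    std::memcpy(bytes.data(), &k.member1, sizeof k.member1);
    std::memcpy(bytes.data() + sizeof k.member1, &k.member2, sizeof k.member2);

    // Stand-in for murmurhash3(bytes.data(), bytes.size(), ...):
    // a byte-at-a-time FNV-1a keeps the sketch self-contained.
    std::uint64_t h = 14695981039346656037ULL;
    for (std::uint8_t b : bytes) {
        h ^= b;
        h *= 1099511628211ULL;
    }
    return h;
}
```

Because the copy goes member by member, two keys with equal members always produce identical byte arrays regardless of what sits in the struct's padding.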

So from what I gather:

-I should use murmurhash3, because it happens to be marginally faster on MSVC for now and I won't be reliant on the standard library implementation to choose a good hash algorithm

-I should modify it to avoid the copying step currently required (it works on 128-bit blocks and has a 'finalizing' operation, so it's not as simple as just calling it multiple times); basically either make it into a class or write a combineHash method
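The 'make it into a class' idea amounts to an incremental hasher with update/finalize steps. A minimal sketch follows, with byte-at-a-time FNV-1a substituted for murmurhash3 (FNV-1a carries all of its state in one 64-bit value; a real streaming murmurhash3 would additionally need to buffer partial 128-bit blocks between update() calls):

```cpp
#include <cstddef>
#include <cstdint>

// Minimal incremental-hasher sketch (FNV-1a standing in for murmurhash3).
class IncrementalHasher {
public:
    // Feed an arbitrary run of bytes into the hash state.
    void update(const void* data, std::size_t len) {
        const auto* p = static_cast<const std::uint8_t*>(data);
        for (std::size_t i = 0; i < len; ++i) {
            state_ ^= p[i];
            state_ *= 1099511628211ULL; // FNV-1a 64-bit prime
        }
    }

    // Convenience overload: hash one member in place, no copying step.
    template <class T>
    void updateValue(const T& v) { update(&v, sizeof v); }

    std::uint64_t finalize() const { return state_; }

private:
    std::uint64_t state_ = 14695981039346656037ULL; // FNV offset basis
};
```

Usage would be one updateValue() call per member followed by finalize(), which sidesteps the temporary byte array entirely.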

o3o

If you were summing the hashes, then that would explain getting a lot of collisions. Simple sums have a tendency to introduce non-uniform distributions in the bit patterns. You could try using boost's hash_combine algorithm, which basically boils down to:

#include <cstddef>    // std::size_t
#include <functional> // std::hash

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
I don't know what operation to use to combine the hashes in place of '?', and I don't know if doing this will weaken the hash and introduce more collisions or such.

Okay, so turn it around, you have an array of bytes, which you compute the hash of by doing hash(byte[0]) ? hash(byte[1]). Same problem.

As Sicrane said, summing is probably unlikely to be a good choice for '?'.

And the hash functions themselves probably run faster if they have a contiguous string of bytes instead of doing it one member at a time.

Not likely. member1 is probably right next to member2 in memory, after all. And if you have members that can be efficiently hashed as a whole in a constant-time operation (integers larger than a byte, for example), member-wise hashing may actually reduce the constant factor of the hash's running time compared to byte-by-byte hashing.

In both cases it's unlikely to matter.


Okay, so turn it around, you have an array of bytes, which you compute the hash of by doing hash(byte[0]) ? hash(byte[1]). Same problem.

But now the task is delegated to the murmurhash3 function, so I don't need to think about it :)

Eventually I'll need to do some combining though, even if I modify murmurhash3 to break it into the "hash" and the "combination" parts... (to avoid the copy)


Not likely. member1 is probably right next to member2 in memory, after all. And if you have members that can be efficiently hashed as a whole in a constant-time operation (integers larger than a byte, for example), member-wise hashing may actually reduce the constant factor of the hash's running time compared to byte-by-byte hashing.

Murmurhash3 seems to operate on 128 bits of the input at a time, so theoretically there should be some speedup (at least for larger keys). Don't know if it's worth a copy.

o3o

Don't know if it's worth a copy

Probably not.

There's no reason to copy at all, really, if the class is POD/standard layout. Just reinterpret_cast<std::uint8_t *>(this) and pass that to the hash function.
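A hedged sketch of that no-copy approach (the Key type is hypothetical, and FNV-1a again stands in for the actual hash function): it requires the type to be trivially copyable, and, per the padding caveat raised earlier in the thread, equal keys only hash equally if the padding bytes are made deterministic, e.g. zeroed at construction.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical key type. The constructor zeroes the whole object first
// so padding bytes are deterministic and equal keys hash equally.
struct Key {
    std::uint32_t a;
    std::uint64_t b;
    Key(std::uint32_t a_, std::uint64_t b_) {
        std::memset(this, 0, sizeof *this); // zeroes the padding too
        a = a_;
        b = b_;
    }
};

// In-place byte hashing only makes sense for trivially copyable types.
static_assert(std::is_trivially_copyable<Key>::value,
              "bytewise hashing requires a trivially copyable key");

std::uint64_t hashKey(const Key& k) {
    // Hash the object's bytes in place; no temporary buffer.
    const auto* bytes = reinterpret_cast<const std::uint8_t*>(&k);
    std::uint64_t h = 14695981039346656037ULL; // FNV-1a stand-in
    for (std::size_t i = 0; i < sizeof k; ++i) {
        h ^= bytes[i];
        h *= 1099511628211ULL;
    }
    return h;
}
```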


There's no reason to copy at all, really, if the class is POD/standard layout. Just reinterpret_cast<std::uint8_t *>(this) and pass that to the hash function.

You mentioned the reason not to do this earlier: the pad bytes will get hashed.

You mentioned the reason not to do this earlier: the pad bytes will get hashed.

Yes, but the pad bytes will also end up in the input if you copy the entire structure into a temporary buffer, for certain ways of doing that copy. My goal in mentioning the pad bytes was just to make you aware that they might exist so you can act accordingly.
Regardless, any kind of copying into temporary storage purely for the purpose of hashing that storage instead of the source is unlikely to be a performance win in the majority of cases. You're always incurring overhead there; it just doesn't seem worth it.

