Looking for a good hash for 2 integers

Started by
3 comments, last by Hodgman 12 years, 7 months ago
I have two integers (incremented from 0, so not uniformly distributed in general) which I want to combine into a single integer in a small range (0 -> 0xffff) to best avoid hash collisions.

I've tried different things like: ((~a)-b)&0xffff, (a*106039+b)&0xffff, etc. but I can't seem to get any better than an 80% collision rate, even if both integers are well below 0xffff

I do want this hash to be as lightweight as possible also as it's in a performance critical area

--- edit:

seems i made a big error in determining the collision rate, it's actually more like 1.3%
Advertisement
I don't think it's practical.
Hash functions requires the interesting items count is rather smaller than all possible hash values.
So MD5 is still strong enough because it can hold 2^128 values.

What you are doing is to put 2^64 (or 2^32 values if A and B are in 0~0xffff) to 2^16 values, it's quite reasonable that you get a lot of conflicting.

You are 100% guaranteed to get conflicting to put N+1 elements to N values, no matter what hash algorithm you use.

So I suggest you change your underlying algorithm to avoid the need of the hash.

https://www.kbasm.com -- My personal website

https://github.com/wqking/eventpp  eventpp -- C++ library for event dispatcher and callback list

https://github.com/cpgf/cpgf  cpgf library -- free C++ open source library for reflection, serialization, script binding, callbacks, and meta data for OpenGL Box2D, SFML and Irrlicht.

Thanks to the Birthday Paradox, if I've done my maths right, on average with an ideal hash you'll get a collision once you have hashed about 300 values. That means your 1.3% isn't too bad.

If both values tend to be small you might do better with a function that looks something like: a ^ ((b << 8) | (b >> 8)). That will mean that if a and b are both under 255 you'll never get a collision. Reversing the order of the bits in b instead of just rotating by 8 might work even better, but it'll be slower too.
wqking, i can do without the hash, but using the hash has shown to be reliably faster.

I'm now using a hash: (a*106039+b)&0xffff

and i get about 0.3% to 0.5% of insertions resulting in a collision now. even with the id's growing towards 100K (id's above 100K would be very rare, and havn't tested beyond that) The test case has both id's growing with about 1000 objects in the hash at any time

Adam_42 your suggestion gives roughly an 8% collision rate.
You could try FNV:u16 a, b;
u32 data = a<<16 | b;

u8* in=(u8*)&data;
u32 h = 0x811c9dc5;
h = (h ^ in[0]) * 0x1000193;
h = (h ^ in[1]) * 0x1000193;
h = (h ^ in[2]) * 0x1000193;
h = (h ^ in[3]) * 0x1000193;
return h;

Or murmur:u32 h1 = seed;
u32 k1 = data;
k *= 0xcc9e2d51;
k = _rotl(k,15);
k *= 0x1b873593;
h ^= k;
h = _rotl(h,13);
h = h*5+0xe6546b64;
h ^= 4;
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
[hr]oops, i missed that you wanted a 16bit output; these are 32 bit hashes unsure.gif
You could still perform the above hashes and then fold them over themselves though:u32 h = ....;
u16 final = (u16)( (h>>16)^(h&0xffff) );
might do better with a function that looks something like: a ^ ((b << 8) | (b >> 8))

That's what I'd suggest too, though I would use different constants and three shifts (for example +11, -19, +8 or +6, -3, +17) . That's equivalent to Marsaglia's xorshift generator, which will pipeline to hash a and b in parallel on all modern CPUs, and will give a distribution as good as you can reasonably get with fewer than ten thousand instructions.

Of course you will inevitably still have collisions, but that's just something you can't help. Mapping 64 to 16 just doesn't work otherwise.

This topic is closed to new replies.

Advertisement