A question about CRC32

But you just said that a collision did happen in a population of 24,000 words. That means a non-zero probability, and that rules CRC out.
Had those two words been first and second in the array, you would have a big problem.
The only solution with zero probability of collision is the word itself.
Quote:Original post by jovani
But you just said that a collision did happen in a population of 24,000 words. That means a non-zero probability, and that rules CRC out.
Had those two words been first and second in the array, you would have a big problem.
The only solution with zero probability of collision is the word itself.

It's not like a collision means death. According to the OP it just means that the programmer will need to pick a different identifier, so a small probability (less than 2%) of having one collision might be acceptable. The number of collisions will roughly follow a Poisson distribution with lambda=0.0116, which means that the probability of more than one collision is very small.

From the description of the original problem, it looks like you can pick the hash function after you know all the words in the set. If this is the case, I might be able to write something CRC-ish with some parameters, so you can try a different setting of the parameters if you get a collision.
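
For instance, something along these lines (a minimal sketch, FNV-style rather than a true CRC; the multiplier is the parameter you would change after a collision):

    #include <cstdint>
    #include <string>

    // A parameterized hash: if the word set produces a collision,
    // rebuild with a different odd multiplier.
    uint32_t paramHash(const std::string& s, uint32_t multiplier)
    {
        uint32_t h = 2166136261u;      // arbitrary non-zero seed
        for (unsigned char c : s)
            h = (h ^ c) * multiplier;  // e.g. 16777619u to start with
        return h;
    }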

Thank you for all your suggestions and ideas. It seems my original estimate of the risk of collision was fairly accurate after all, though it also appears that short strings, like identifiers, actually behave fairly well. I'll now have to decide whether a 1% risk of collision is too much (in which case I guess I could just use a 64-bit CRC) or if it's acceptable for the scripters.

I hope it did not appear that I was lazy when describing the problem, but I actually made a conscious choice not to talk too much about the context of the problem, to keep the focus on what I originally asked for: the probability of clashes, using different hash functions on identifier strings. It's not that I'm unwilling to talk more about the context, but I believe that would belong in the scripting forum :)

But I guess I should have mentioned that this is not something I need for a completely new technology. It's an improvement to an existing scripting language in an existing game engine, so I'm working within a certain time frame and don't have the luxury of a complete redesign of the system. We have a system where each identifier needs a unique ID. Currently these are assigned incrementally, but that leads to certain problems I'm trying to fix. So I have to look at the problem like this: find a better way to generate a unique ID for a set of different strings. The CRC algorithm has the very nice property that the ID depends only on the name of the variable, not on the order in which the variables are detected by the compiler/engine (as is the case with incremental IDs).

This is why my question was focused on a very specific mathematical problem: the estimation of the risk of collisions, given that I use CRC32 as the hash function.

I also agree that, from an academic point of view, any solution with a non-zero risk of collision is unsatisfying. But sadly I have to take a more practical approach to the problem, and in this case I think it's acceptable to ask the script programmer to rename a variable or message, if this happens very, very rarely (like once or twice in an entire game production cycle).

Regarding the strings: the language is case-insensitive, but since I don't like to work with identifiers this way, I simply convert all identifiers to lower case at parse time. So an identifier consists of lower-case letters, underscores and digits. There is no enforced limit on the length of identifiers, but I guess I could introduce one, as long as it's quite high. Generally I don't like to restrict the script programmers in this way, though.
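
In code, that normalization is just something like this (a minimal sketch):

    #include <algorithm>
    #include <cctype>
    #include <string>

    // Fold an identifier to lower case at parse time, so the hash sees
    // the same bytes regardless of how the scripter typed it.
    std::string normalize(std::string id)
    {
        std::transform(id.begin(), id.end(), id.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return id;
    }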

I'll look into the Radix 26 suggestion and read up on the concept of perfect hashes.

Best regards,
Kasper

Quote:Original post by TheAdmiral
Personally, I think any non-zero probability of a name-clash is unacceptable. There are plenty of other, and better, ways to tackle this problem.
I disagree; a collision is perfectly acceptable if it only happens with very low probability. You can use chaining in that case, which is absolutely fine.

Quote:but it's a little hacky.

It sure is! :) However, given the right constraints (as stated above), it would nevertheless be the optimal solution.

Quote:I'd recommend using a std::map<name, long>

That's a clean, general solution that will of course work, but it will be O(lg n) rather than O(1). For 10,000 elements lg n is only about 13, so it probably doesn't matter. Still, a std::map may not be the easiest thing to store and retrieve on another machine, either (Google's sparsehash perhaps being an exception).

Quote:Original post by Zipster
The CRC algorithm is really meant for long streams of bits, where it's virtually impossible for random errors to generate the same checksum (and even more difficult for a hacker to make meaningful changes that result in the same checksum). For short streams of bits, such as character strings, it's a lot more likely that two strings will generate the same checksum. The longer the strings on average, the less likely, but the risk of collision is still there.
That's wrong in every respect, sorry.

1. Checksums were not specifically made for long messages or to prevent a hacker from making changes. Cryptographic hash functions (like MD5 or SHA) were designed with the latter in mind, but even these are not tamperproof. The idea behind CRC is to detect a random error (such as noise on a network cable or a defect on a storage medium).

2. Mathematically, you are guaranteed to have collisions if the number of possible messages is larger than the number of hashes. The larger the message/hash ratio, the more collisions you will necessarily get. This is evidently so because a hash function maps many messages to a small, constant number of hashes (in theory, you could also get collisions if the message is the same size or smaller, but a good hash should hopefully not do that).
Normally, this means that for any message length greater than 32 bits, there is guaranteed to be at least one pair of equal-length messages that collide.
However, in the OP's special case, there is only a very small subset (10,000) of possible distinct messages (they may be longer than 32 bits or not, but their number is around 10,000 anyway), so luckily, we have plenty of left-over hash values, and collisions should not be an issue.

Quote:Original post by Kasper Fauerby
Damon gave a probability figure that was radically different from the one I arrived at. I'm not sure I agree with his way of calculating it though
A 32 bit hash has 2^32 == 4294967296 possible values. You plan to have around 10,000 keys.
This means that 4294967296/10000 == 429496 hash values exist for every possible key.
Think of a group of people drawing lots from a hat. The number of lots is 429,000 times larger than the number of people drawing. Will this lead to a brawl? Probably not :)

Quote:Original post by jovani
I can guarantee you that with 10,000 words, any of the popular CRC32 routines will generate at least one collision.
Mother's love! If it were that bad, we could not use any IP-based service at all ;)
Quote:Original post by alvaro
There is something called "perfect hashing" that looks like what you need, but I know very little about it. http://en.wikipedia.org/wiki/Perfect_hashing
Ah, dang... skipped over that one :(

Perfect hashes only work if the list of keywords is known at compile time (so not applicable here). When applicable, however, they rock. See gperf. :)
Quote:Original post by Damon Shamkite (a while back)
Getting back to your question on CRC32:
You have 2^32 checksum values as opposed to only 10,000 unique keys. That's around 429,000 CRC values for each input string.
In other words, collisions can of course still happen (you can never be sure they won't), but the likelihood is on the order of 0.0002%. Using a hash with more bits makes that likelihood smaller, but the general picture stays the same.

That is not the probability he was asking for. If you already have 10,000 words that don't collide, what you computed is the probability of the 10,001st word colliding with one that was already there.

An easy and correct way to compute the probability p of having at least one collision with 10,000 words is to compute 1-p, i.e. the probability of having no collisions. Let's call the number of different keys N (N=2^32):
1-p = 1 * (N-1)/N * (N-2)/N * (N-3)/N * ... * (N-9999)/N ~= 0.988427110014, so p is approximately 0.011572889986.
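
If you want to check the arithmetic, a few lines of C++ reproduce it (a quick sketch):

    #include <cstdio>

    int main()
    {
        const double N = 4294967296.0;   // 2^32 possible hash values
        double noCollision = 1.0;
        for (int i = 1; i < 10000; ++i)  // 10,000 words
            noCollision *= (N - i) / N;  // i-th word must avoid the first i
        std::printf("p = %.12f\n", 1.0 - noCollision);
        return 0;                        // prints approximately 0.011573
    }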

Quote:Original post by Damon Shamkite
Quote:Original post by alvaro
There is something called "perfect hashing" that looks like what you need, but I know very little about it. http://en.wikipedia.org/wiki/Perfect_hashing
Ah, dang... skipped over that one :(

Perfect hashes only work if the list of keywords is known at compile time (so not applicable here). When applicable, however, they rock. See gperf. :)


That's not exactly true.

Not only does perfect hashing exist beyond compile-time knowledge...

...but also minimal perfect hashing.

And I'll even go one further: minimal perfect hashing while maintaining sort order.

These techniques aren't for the casual player, however, and they do incur some runtime overhead. In practice you can still get it down to approximately O(1), but with a fairly big constant compared to blindly hashing.

If you don't need minimal perfect hashing, then it's almost never the best solution.

For something like what the OP is considering, if there is time for an initializer to scan the input beforehand, I'd use a Zobrist hash (character × position) and simply validate that the Zobrist hash isn't producing any collisions. (The Zobrist table can be re-seeded until no collisions exist.)
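
A minimal sketch of what I mean (the 64-character length cap and the 64-bit hash width are just illustrative choices):

    #include <array>
    #include <cstdint>
    #include <random>
    #include <string>
    #include <unordered_set>
    #include <vector>

    const int MAX_LEN = 64;  // assumed cap on identifier length

    typedef std::vector<std::array<uint64_t, 256> > Table;

    Table makeTable(uint64_t seed)
    {
        std::mt19937_64 rng(seed);
        Table t(MAX_LEN);
        for (auto& row : t)
            for (auto& v : row)
                v = rng();               // one random value per (position, byte)
        return t;
    }

    uint64_t zobrist(const std::string& s, const Table& t)
    {
        uint64_t h = 0;
        for (size_t i = 0; i < s.size() && i < (size_t)MAX_LEN; ++i)
            h ^= t[i][(unsigned char)s[i]];  // XOR the (position, character) entry
        return h;
    }

    // Re-seed until the table is collision-free on the known word set.
    Table collisionFreeTable(const std::vector<std::string>& words)
    {
        for (uint64_t seed = 1; ; ++seed) {
            Table t = makeTable(seed);
            std::unordered_set<uint64_t> seen;
            bool ok = true;
            for (size_t i = 0; i < words.size() && ok; ++i)
                ok = seen.insert(zobrist(words[i], t)).second;
            if (ok) return t;
        }
    }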
If you are too lazy to write up CRC64, you could do:

CRC32 the following four strings (where "String" is the input):
"zzz0!String"
"zzz1!String"
"zzz2!String"
"zzz3!String"

You now have four different 32-bit numbers with a low chance of collision between them.
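
In C++ the trick looks roughly like this (a sketch; the CRC32 shown is the common reflected 0xEDB88320 variant, and one caveat: CRC32 is linear, so for equal-length inputs the four values are correlated and the estimate below is somewhat optimistic):

    #include <cstdint>
    #include <string>

    // Bitwise CRC32 (reflected polynomial 0xEDB88320).
    uint32_t crc32(const std::string& s)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (unsigned char c : s) {
            crc ^= c;
            for (int k = 0; k < 8; ++k)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    struct Id128 { uint32_t part[4]; };  // four CRCs = one 128-bit id

    Id128 wideId(const std::string& name)
    {
        const char* prefix[4] = { "zzz0!", "zzz1!", "zzz2!", "zzz3!" };
        Id128 id;
        for (int i = 0; i < 4; ++i)
            id.part[i] = crc32(prefix[i] + name);
        return id;
    }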

On N inputs, that gives about an N^2/2^128 chance of collision.

How big does N have to get to have a 0.1% (1 in 1000) chance of collision?

N^2/2^128 = 2^-10
N^2 = 2^118
N = 2^59

Or, about half a billion billion (2^59 is roughly 6 × 10^17) different strings in order to have a 1-in-1000 chance of a single collision.

The problem with this approach is that the code that tests for collision and asks for new input would never run, and you'd have rotting code.
Quote:Original post by NotAYakk

The problem with this approach is that the code that tests for collision and asks for new input would never run, and you'd have rotting code.


lol

