A question about CRC32

17 comments, last by Rockoon1 17 years, 6 months ago
Hi guys,

Here's a question for the math experts, since my own probability math skills are a bit rusty :)

My problem is that I need an algorithm to give me a unique number for each member of a set of strings (variable and function names in a scripting language). The set can be quite large for a full game - for example, 10,000 different names or something in that order. I need the number to be unique, so there is a one-to-one mapping from a variable name to a number. Furthermore, the algorithm must give the same number every time it is called on a specific string (obviously).

So, I've pretty much decided to use a CRC algorithm - but I need to figure out whether a CRC32 is good enough or whether I have to use a CRC64. If a clash occurs it would not be completely fatal, but annoying, since I would have my script compiler ask the user to rename the offending new variable/function. This should obviously happen only very rarely - and hopefully not at all.

So, assuming a set of 10,000 strings, I've tried to estimate the probability of a clash if I only use 32 bits for my CRC. Here are my thoughts on the problem: when the first string is inserted there is 0% risk of a clash. When the second string is inserted there is a 1/2^32 risk of a clash, next we have a 2/2^32 risk - and so forth. This gives me a total risk of:

Sum(1..10000) / 2^32 ≈ 5000*10000 / 2^32 ≈ 0.0116

So, assuming my math is correct, there is about a 1% risk of a clash when using CRC32 on a set of 10,000 strings? My question then is: is the calculation above correct? Or am I way off track here :)

Thanks for reading!

- Kasper
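For reference, a small sketch (not from the original post) that evaluates both the linear estimate above and the exact "birthday problem" probability for 10,000 strings hashed into 2^32 buckets; both come out to roughly 0.0116, i.e. a bit over 1%:

```cpp
// Sketch only: the poster's linear estimate vs. the exact
// birthday-problem probability of at least one CRC32 collision.
#include <cstdio>

int main()
{
    const double buckets = 4294967296.0; // 2^32 possible CRC32 values
    const int    n       = 10000;        // number of strings

    // Linear (union bound) estimate: sum_{k=0}^{n-1} k / 2^32
    double linear = (double(n) * (n - 1) / 2.0) / buckets;

    // Exact: 1 - prod_{k=0}^{n-1} (1 - k / 2^32)
    double noClash = 1.0;
    for (int k = 0; k < n; ++k)
        noClash *= 1.0 - k / buckets;

    std::printf("linear estimate:   %.6f\n", linear);
    std::printf("exact probability: %.6f\n", 1.0 - noClash);
    return 0;
}
```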
If I understand correctly, you need this checksum to be consistent only for the lifetime of your program (script variable names), not for permanent storage, and not for binary compatibility with another machine.

Also, since variable names normally don't change, we can boldly assume that no re-allocation/memory move of strings takes place.

So, why don't you just use a string's memory address as its key - it's guaranteed to be unique and requires no calculation at all.


Getting back to your question on CRC32:
You have 2^32 checksum values as opposed to only 10,000 unique keys. That's around 429,000 possible CRC values for each input string.
In other words, collisions can of course still happen (you can never be sure they won't), but the likelihood is on the order of 0.0002%. Using a hash with more bits will make that likelihood smaller, but the general picture stays the same.
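As a rough sketch of the address-as-key idea (assuming names are interned once into a container whose elements never move; all identifiers here are made up):

```cpp
// Sketch only: intern each name once and use its stable address as the key.
#include <set>
#include <string>

std::set<std::string> g_names;  // owns the strings; set nodes never move

const std::string* InternName(const std::string& name)
{
    // insert() returns an iterator to the (possibly pre-existing) element;
    // its address is stable for the set's lifetime and serves as the unique key
    return &*g_names.insert(name).first;
}
```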
Personally, I think any non-zero probability of a name-clash is unacceptable. There are plenty of other, and better, ways to tackle this problem.

The memory address method is quite good, assuming you never need to decorate the variable names, but it's a little hacky.
I'd recommend using a std::map<name, long> to keep track of name indices. This makes for easy insertion, easy removal and easy lookup, and has no possibility of error.
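Something along these lines, as a minimal sketch (function and variable names are just illustrative):

```cpp
// Sketch only: hand out sequential indices per name via std::map.
#include <map>
#include <string>

std::map<std::string, long> g_nameIndex;

long GetNameIndex(const std::string& name)
{
    std::map<std::string, long>::iterator it = g_nameIndex.find(name);
    if (it != g_nameIndex.end())
        return it->second;                       // already registered
    long index = static_cast<long>(g_nameIndex.size());
    g_nameIndex[name] = index;                   // next free index, no clashes possible
    return index;
}
```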

Regards
Admiral
Ring3 Circus - Diary of a programmer, journal of a hacker.
I second that. The CRC algorithm is really meant for long streams of bits, where it's virtually impossible for random errors to generate the same checksum (and even more difficult for a hacker to make meaningful changes that result in the same checksum). For short streams of bits, such as character strings, it's a lot more likely that two strings will generate the same checksum. The longer the strings are on average, the less likely this is, but the risk of collision is still there.
Thank you for your replies, but I *do* need to save the values and use them across many machines, so memory addresses etc. will not work. I really need a (persistent) unique key for each string :)

So, back to the original topic, Damon gave a probability figure that was radically different from the one I arrived at. I'm not sure I agree with his way of calculating it though, so if anyone else could verify his result it would be much appreciated.

Alternatively, if you think CRC in general is ill suited for the problem I'm trying to solve, I'm open to suggestions for other hash functions that will generate unique keys for different strings.

- Kasper
That's okay, we can do it your way. But could you give us some more information about the restrictions on variable names?

If names are case-insensitive and may only contain letters, for instance, then you could encode the variable names in radix 26 to generate unique, high-entropy identifiers. Add to the mix a limit on the length of the names and you can efficiently store the entire space of valid names uniquely in a variable of known size. Obviously, this extends to the case-sensitive, quasi-alphanumeric case, but the maximum name length will suffer.
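A minimal sketch of the radix idea, assuming case-insensitive, letters-only names. It uses digits 1..26 rather than 0..25 so that, for example, "a" and "aa" encode differently; with that scheme any name of up to 13 characters fits uniquely in 64 bits:

```cpp
// Sketch only: pack a letters-only name into a uint64_t in radix 26.
#include <stdint.h>
#include <cctype>
#include <string>

bool EncodeName(const std::string& name, uint64_t& out)
{
    if (name.empty() || name.size() > 13)
        return false;                          // 14+ characters no longer fit in 64 bits
    uint64_t value = 0;
    for (size_t i = 0; i < name.size(); ++i)
    {
        int c = std::tolower(static_cast<unsigned char>(name[i]));
        if (c < 'a' || c > 'z')
            return false;                      // this sketch only handles letters
        value = value * 26 + static_cast<uint64_t>(c - 'a' + 1);
    }
    out = value;
    return true;
}
```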

How concerned are you about the size of the identifiers? Clearly it is an issue, or you would store the variable names themselves. If variable-sized identifiers are okay (but you just want minimal waste), then any entropy coding will do the trick, such as Huffman coding.

Regards
Admiral
Ring3 Circus - Diary of a programmer, journal of a hacker.
The only hash that will guarantee uniqueness is the string itself. If you need to store the hash/unique id in less space than the string, then you need to look into manually assigning unique ids to each string and distributing those ids as needed as a lookup table.

Simply put, you cannot compress 1 to x bytes (whatever the maximum length of the string is) into 4 or 8 or 16 bytes. There is no way to uniquely express every possible string as a 32-bit or 64-bit or 128-bit value without some generated lookup table. If you find such a method, you have discovered a means of infinitely compressing any file into just 32/64/128 bits. Hashes give you an approximately even distribution of ids, but there can always be collisions. The bigger the hash, the less likely the collision.
I computed the probability differently (and your method is slightly wrong), but I got to the same 0.0116 result.

In my experience, CRC will have *fewer* collisions than expected for identifier strings. I have an intuition for why this is, but I would have a hard time explaining it. It has to do with the fact that two strings that only differ in a few bits are guaranteed to have different CRCs. Also, Donald Knuth recommends using CRC as a string hash.

Not having any way to deal with collisions is too bad. There is something called "perfect hashing" that looks like what you need, but I know very little about it. http://en.wikipedia.org/wiki/Perfect_hashing

Why make it so hard? Why not use a variable-size arithmetic or Huffman encoding?
I can guarantee you that with 10,000 words, any of the popular CRC32 routines will generate at least one collision.



Quote:Original post by jovani
Why make it so hard? Why not use a variable-size arithmetic or Huffman encoding?
I can guarantee you that with 10,000 words, any of the popular CRC32 routines will generate at least one collision.

I would take an even-odds bet on that.

I just tried using the first CRC32 routine I found on the web on a set of 24001 words (found here), and there was only one collision: "carnage" and "soggy". I used the code found here. The first 10000 words on their own don't collide, and neither do the last 10000. Another test with a different list of 43298 words gave me another single collision: "carmody" and "slide".

I don't know how it would work with the identifiers he is using, but I don't think it should be any worse than words.
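For anyone who wants to reproduce this kind of test, here is a rough sketch (assuming zlib's crc32() is available and linking with -lz; the word-list filename is made up):

```cpp
// Sketch only: hash every word in a list with CRC32 and report collisions.
#include <zlib.h>
#include <cstdio>
#include <fstream>
#include <map>
#include <string>

int main()
{
    std::ifstream in("words.txt");             // one word per line (hypothetical file)
    std::map<uLong, std::string> seen;         // crc -> first word with that crc
    std::string word;
    int collisions = 0;

    while (std::getline(in, word))
    {
        uLong crc = crc32(0L, reinterpret_cast<const Bytef*>(word.data()),
                          static_cast<uInt>(word.size()));
        std::map<uLong, std::string>::iterator it = seen.find(crc);
        if (it != seen.end() && it->second != word)
        {
            std::printf("collision: \"%s\" vs \"%s\"\n",
                        it->second.c_str(), word.c_str());
            ++collisions;
        }
        else
            seen[crc] = word;
    }
    std::printf("%d collision(s)\n", collisions);
    return 0;
}
```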

This topic is closed to new replies.
