by Clifford M. Roche
In this article I intend to address the basics, and the slightly less basic aspects, of the CRC32 algorithm. This is a relatively straightforward article, most of all the source is provided - demos for both Borland C++ and MSVC++ .NET (which should be easily ported to MSVC++6) are available with this article as well.
A slight comprehension of math (polynomial division) would be helpful, but I am writing this article so that it is not required. You will however need to understand C++ - this, of course, is a given.
Why use CRC32?
Ok, so as one of my smaller projects I have decided to write an online application updater that simply generates a unique signature for a set of files, compares these signature values with a master list on the server, and updates any files that do not match.
Now there are a couple of ways that I can check the integrity of these files and decide as to whether or not they require an update from the master server. File time/date stamps, file size, and CRC32 come to mind. There are some obviously more complex ways such as databasing the file entries; however there is no need to be overly elaborate. The same goes for most uses of CRC32 in gaming. Besides, most of the more complex algorithms take a lot more computation than CRC32.
CRC32, of all the choices present, is most feasible since you do not have to worry about files that are of the same size when modified (very common with bitmap images, and a few other common game file types), and we do not have to worry about inaccurate file time/date stamps due to modified files. Though CRC32 is not a perfect algorithm and there is always a chance that 2 files will have the same CRC32 value it is rare enough that CRC32 makes a perfect candidate for file signature generation.
A CRC32 value is a 32bit (4byte) number (also known as a signature) that uniquely represents a specified chunk of data. The advantage with CRC32 for signature checking is that the algorithm is designed so that a slight change in the file (be it a change, a delete, or an insert) should cause a large change in the CRC32 signature value. CRC32 has 2^32 possible values; a total of 4.294967E+09.
CRC32 is very useful in gaming for these reasons as it allows us to do many things. They can be used heck files to see if they have been modified to help prevent game hacking, or they can be used to check files or data that is sent over an internet connection during game play such as a map file.
How does CRC32 work?
CRC32 is a somewhat involved process, and relies on a lot of mathematical reasoning. But this is generally how it goes:
First we are going to take a polynomial value, and with this value we generate a CRC32 lookup. I will, towards the end of this article, demonstrate the proper way to generate a lookup table with a given polynomial. What we will do next is take the data we want to generate a CRC32 signature for and using a CRC32 lookup function generate an updated CRC32 value for each byte in the file, string, or structure we want to verify. This is done for every byte within the string, structure, or file. When complete the most updated value becomes the signature.
The heart of the CRC32 algorithm is simply a system that takes the present byte to compute from the data set, and the present CRC32 value that we have and performs a lookup into our lookup table. If there is no present CRC32 value available we will use 0xFFFFFFFF. With a little bit of math, more specifically a polynomial division, the function will return an updated CRC32 value.
The implementation is very simple.
This is what our class is going to look like.
There is also one static array, this is simply for us to store the CRC32 lookup table. You do not need to actually store this table permanently if you want. With a constructor and destructor you can generate and cleanup a table only when you plan on using it. For our purposes it is more sensible to use the table since this can greatly reduce the processing time.
Generating the lookup table
To generate a lookup table we need to specify a polynomial value to use in which will give us the values we use in our lookup table. The polynomial value you use will be a value that should theoretically reduce the likeliness of duplicate signatures in your data; however the math behind one of these values is overly complex and not within the scope of this article. This is ok though since we have a value already picked out for us. The value that we will use to generate our lookup table will be 0x04c11db7.
Consequently this polynomial is also used by many other applications and systems, such as WinZip, PkZip, and Ethernet.
Note: You will need to generate the lookup table at least once even if you are using a hard coded lookup table. Unless of course you use the lookup table provided with my demo.
Before we continue, CRC standard states that we have to reflect the values in our lookup table; this can be done by also reflecting the polynomial. In this case 0x04c11db7 becomes 0xedb88320. Though some equations will reflect the table values as they are generating I am going to reflect the polynomial. It all ends up being the same in the end anyway, this just means a few less lines of code.
Note: Reflecting is a simple process of reversing the binary pattern of a given binary segment. For example 10100111 would become 11100101 when reflected.
Depending on your situation you may or may not want to spend CPU cycles to keep regenerating the lookup table every time you want to generate a signature. If this is the case simply write this code into a separate application, dump the table to a file, to your screen, or where ever you find is most convenient, and copy and paste the values right into the table. Though you have the cost of 256*4 bytes of memory (which really is not a whole lot) you will drastically improve calculation time without the need to constantly recalculate the lookup table.
For those of you who are attempting to generate a lookup table using the polynomial listed above the first 14 values should look like this:
With the lookup table in place now we can look at the GetCRC32 function. This function takes a single byte and the previous calculated CRC32 value (if one exists).
Note: the use of inline should not be taken lightly. It can cause extremely large increases in code size and may in the end decrease the overall performance of your application from consuming too much memory (which leads to limited resources and page faults during application execution). In our situation the actual instances where GetCRC32 is called are very limited and thus overall application size is not a concern.
You may also want to look into the __fastcall specifier. __fastcall works by passing the arguments to a function through the CPU registers (when they will fit) as opposed to the standard way of passing them through the stack. This can create a notable speed increase just as inline does here.
With the present byte, and previous CRC32 value the function performs a polynomial division, with a reference to the lookup table, and updates the CRC32 signature.
Calculating for Structures
The two previous functions core of our CRC32 algorithm. All that is left is to read the data, and using those two functions, generate a CRC32 signature for our data.
Before filling a structure with data that is to be calculated it is very important that we initialize the entire structure to NULL, the reason for this being that structures are usually padded to 8 bytes. In other words if you have a structure that has 7 chars (7bytes) sizeof( struct ) will still return 8bytes. The last bit will be initialized to a garbage value and may not be the same if you recreate that structure later, this will change the CRC32 value, causing them not to match. There is one other option though, when passing in the structure size, if you know the exact unpadded size of the structure you may use that value instead which will cause our algorithm to ignore any extra bytes from padding. You should not mix the two methods though since this will yield different non-matching CRC32 values between two exactly the same structures.
Calculating for a file
Calculating the CRC signature for a file is very similar though the function to do so does seem somewhat more complex - it really is not. We do however need to add an additional function that will calculate the size of the file before processing.
The next little bit actually reads the file and passes the data to our CRC32 algorithm.
Other uses for CRC
CRC32, like it is used in gaming, is also used outside of gaming for the same reasons; those being for file integrity checking and data integrity checking when data is copied.
A great extension to this use that is also commonly employed in gaming is to add a CRC32 value into the packets that you are sending over the internet, this insures that data that arrives at a destination is the same as it was before it was sent. Though the TCP and UDP protocols have CRC values in their headers they will only verify each packet as they are transmitted. Structures and data that are sent over multiple packets may become corrupt and used without a CRC check.
CRC can even be used for security purposes to insure that text messages, and other types of message, are not intentionally modified. This is a system that is commonly used in PGP when signing a message so that the receiving party can verify that it was in fact you who signed it. PGP does not actually use CRC32 with the signatures since CRC32 is somewhat weak in these situations. Other common algorithms used for these purposes would be algorithms such as MD5, SHA8, SHA256, etc. These are called HASH algorithms when used for security purposes.
If you need to use a signature generator that produces a larger number of values than CRC32 and has to have a greater accuracy than you can look into the three algorithms I mentioned above.
And finally, if you have any comments, question, or so forth you can usually find me on the #gamedev channel as kurifu, seta, or something else.