Jump to content
  • Advertisement
  • 07/16/99 05:58 PM
    Sign in to follow this  

    Huffman Compression

    General and Gameplay Programming

    GameDev.net
    <%Topic="Huffman Compression"%>
                                Huffman Compression
                                -------------------
    
             Written for the PC-GPE and the World by Joe Koss (Shades)
                           Contact me on irc in #coders
    
    Introduction
    ------------
    
    Huffman Compression, also known as Huffman Encoding, is one of many
    compression techniques in use today. Others are LZW, Arithmetic Encoding, RLE
    and many many more. One of the main benefits of Huffman Compression is how
    easy it is to understand and impliment yet still gets a decent compression
    ratio on average files.
    
    Many thanks to Al Stevens for his Feb. 1991 article in Dr. Dobb's Programmers
    Technical Journal which helped me greatly in understanding Huffman Compression.
    
    (Al Stevens's C source code for Huffman Compression is available from DDJ and
     can also be found on various FTP sites with DDJ source collections.)
    
    Overview
    --------
    
    The Huffman compression algorythm assumes data files consist of some byte
    values that occur more frequently than other byte values in the same file.
    This is very true for text files and most raw gfx images, as well as EXE and
    COM file code segments.
    
    By analyzing, the algorythm builds a "Frequency Table" for each byte value
    within a file. With the frequency table the algorythm can then build the
    "Huffman Tree" from the frequency table. The purpose of the tree is to
    associate each byte value with a bit string of variable length. The more
    frequently used characters get shorter bit strings, while the less frequent
    characters get longer bit strings. Thusly the data file may be compressed.
    
    To compress the file, the Huffman algorythm reads the file a second time,
    converting each byte value into the bit string assigned to it by the Huffman
    Tree and then writing the bit string to a new file.
    
    The decompression routine reverses the process by reading in the stored
    frequency table (presumably stored in the compressed file as a header) that
    was used in compressing the file. With the frequency table the decompressor
    can then re-build the Huffman Tree, and from that, extrapolate all the bit
    strings stored in the compressed file to their original byte value form.
    
    The Frequency Table
    -------------------
    
    The Huffman algorythms first job is to convert a data file into a frequency
    table. As an example, our data file might contain the text (exluding the
    quotation marks): "this is a test"
    
    The frequency table will tell you the frequency of each character in the file,
    in this case the frequency table will look like this:
    
    
                              Character | Frequency
                              ---------------------
                                  t     |     3 times
                                  s     |     3  ..
                                 |     3  ..
                                  i     |     2  ..
                                  e     |     1  ..
                                  h     |     1  ..
                                  a     |     1  ..
    
    The Huffman Tree
    ----------------
    
    The huffman algorythm then builds the Huffman Tree using the frequency table.
    The tree structure containes nodes, each of which contains a character, its
    frequency, a pointer to a parent node, and pointers to the left and right child
    nodes. The tree can contain entries for all 256 possible characters and all 255
    possible parent nodes. At first there are no parent nodes. The tree grows by
    making successive passes through the existing nodes. Each pass searches for two
    nodes *that have not grown a parent node* and that have the two lowest
    frequency counts. When the algorythm finds those two nodes, it allocates a new
    node, assigns it as the parrent of the two nodes, and gives the new node a
    frequency count that is the sum of the two child nodes. The next iterations
    ignores those two child nodes but includes the new parent node. The passes
    continue until only one node with no parent remains. That node will be the root
    node of the tree. The tree for our example text will look like this:
    
       t ----[3]---------------------\
                                     |
       s ----[3]-\                   |
                  -[6]--------------- ------\
    --[3]-/                   |       -[14]--
                                     \      /
       i ----[2]---------------\      -[8]-/
                                -[5]-/
       e ----[1]--------\      /
                         -[3]-/
       h ----[1]-\      /
                  -[2]-/
       a ----[1]-/
    
    
    Compressing
    -----------
    
    Compression then involves traversing the tree beginning at the leaf node for
    the character to be compressed and navigating to the root. This navigation
    iteratively selects the parent of the current node and sees whether the
    current node is the "right" or "left" child of the parent, thus determining
    if the next bit is a one (1) or a zero (0). Because you are proceeding from
    leaf to root, you are collecting bits in the *reverse* order in which you will
    write them to the compressed file.
    
    The assignment of the 1 bit to the left branch and the 0 bit to the right is
    arbitrary. Also, the actual tree will almost never look quite like the one
    in my example. Here is the tree with 1's and 0's assigned to each branch:
    
       t ----------------------[0]-\
                                   |
       s ----[0]-\                 |
                  ----------------- -[0]---\
    --[1]-/                 |        --root
                                   \---[1]-/
       i ----------------[0]---\      /
                                -[1]-/
       e ----------[0]--\      /         <- writen this direction to compressed file.
                         -[1]-/
       h ----[0]-\      /
                  -[1]-/
       a ----[1]-/
    
    
    The tree in my example would compress "this is a test" into the bit stream:
    
       T     H      I     S         I     S          A           T     E     S    T
    ---------------------------------------------------------------------------------
    | 10 | 11110 | 110 | 00 | 01 | 110 | 00 | 01 | 11111 | 01 | 10 | 1110 | 00 | 10 |
    ---------------------------------------------------------------------------------
      |            ||               ||                ||              ||Partial|
      \    Byte1   /\     Byte2     /\      Byte3     /\     Byte4    /\ Byte5 /
       ------------  ---------------  ----------------  --------------  -------
         10111101       10000111          00001111         11011011     100010..
    
    
    Decompressing
    -------------
    
    Decompression involves re-building the Huffman tree from a stored frequency
    table (again, presumable in the header of the compressed file), and converting
    its bit streams into characters. You read the file a bit at a time. Beginning
    at the root node in the Huffman Tree and depending on the value of the bit,
    you take the right or left branch of the tree and then return to read another
    bit. When the node you select is a leaf (it has no right and left child nodes)
    you write its character value to the decompressed file and go back to the root
    node for the next bit.
    
    Final words
    -----------
    
    If you still dont understand Huffman Compression, no amount of source code
    is going to help you (one reason I didn't include any, not even psuedo code
    helps _understand_ Huffman). Re-read this file again a few times, it will
    all seem obvious soon enough (or else you shouldn't be attempting to write
    your own compression routines).
    
    Many thanks to Mark Feldman (Myndale) for the first real attempt at a
    -collection- of free programmers information that includes both sourcecode
    and/or DETAILED documention when necessary.
    
    (IMHO: detailed docs are volumes better than sourcecode .. if you don't
    understand it, your just bloody rippin' code!)
    


      Report Article
    Sign in to follow this  


    User Feedback


    There are no comments to display.



    Create an account or sign in to comment

    You need to be a member in order to leave a comment

    Create an account

    Sign up for a new account in our community. It's easy!

    Register a new account

    Sign in

    Already have an account? Sign in here.

    Sign In Now

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

Participate in the game development conversation and more when you create an account on GameDev.net!

Sign me up!