Lode

How to detect if binary tree codeword lengths are invalid?


First an example to show what I'm talking about:
      *
    0/ \1
   *     *
 0/ \1 1/ \0
 a   b c   d

a has code 00
b has code 01
c has code 11
d has code 10

a, b, c and d all have codeword length 2

Second example: again 4 symbols, but now they have different codeword lengths:

    *
  0/ \1
  a   *
    1/ \0
     b  *
      0/ \1
      c   d

a has code 0
b has code 11
c has code 100
d has code 101

Now they have different codeword lengths.
Now the actual question :) If you are given a list of codeword lengths, you can construct a binary tree from it. Say there are 19 symbols, and the given codeword lengths are 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7. I think there is no possible binary tree that has these exact codeword lengths for every symbol. Is there a simple way to detect whether a tree can be constructed from a given list of lengths or not?

I need this to decode Huffman compressed data without segmentation faults from accidentally trying to construct an invalid tree from the given codeword lengths: I want to do a simple test on the codeword lengths first, to see if a tree can actually be constructed.

I'm having trouble visualizing what exactly you're looking for (although I suspect using Dyck path properties would be a straightforward solution). Do you have any example smaller than 19 7-bit codewords that does not "work"?

Lode, I do not understand how you are getting your codes for the tree. Left and right should have the same value no matter which part of the tree they are in (i.e. 1 or 0, not both).
Your first example should read:
a has code 00
b has code 01
c has code 10
d has code 11

Second:
a has code 0
b has code 10
c has code 110
d has code 111

If I understand it correctly, a codeword always marks a leaf of the binary tree, so you couldn't have 11 and 110 in the same tree. If your longest codeword has length n, then there can be at most 2^n codewords of length n. However, a codeword of length n-1 removes two possibilities for codewords of length n; one of length n-2 removes two possibilities for codewords of length n-1, which in turn each remove two possibilities for codewords of length n. In general, there are 2^n endings of length n in total, and every codeword subtracts 2^(n-x) from that number, where x is the codeword's length.

So in your case with nineteen 7s, the maximum number of max-length codewords is 2^7 = 128. Here all codewords have length 7, so each subtracts 2^(7-7) = 1 from 128; there are 19 of them, so we still have 109 endings left, and it is therefore possible to create such a binary tree. Just use 0000000, 0000001, 0000010, 0000011, 0000100, 0000101, 0000110, 0000111, 0001000, 0001001, 0001010, 0001011, 0001100, 0001101, 0001110, 0001111, 0010000, 0010001, 0010010.

This is how I see it at least, please say so if I misunderstood something.
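
If it helps, here is roughly how that counting check could look in C++. This is only a sketch of the idea above: validateLengths is a made-up name, it allows incomplete trees (leftover endings), and it assumes every length is in the range 0..63 so a 64-bit counter cannot overflow.

#include <algorithm>
#include <cstdint>
#include <vector>

// Counts leaf "endings" at the deepest level, as described above: start with
// 2^maxLen endings and let each codeword of length len consume 2^(maxLen - len).
bool validateLengths(const std::vector<int>& lengths)
{
    if (lengths.empty()) return true;
    int maxLen = *std::max_element(lengths.begin(), lengths.end());

    uint64_t endings = uint64_t(1) << maxLen;            // 2^maxLen
    for (int len : lengths)
    {
        uint64_t cost = uint64_t(1) << (maxLen - len);   // 2^(maxLen - len)
        if (cost > endings) return false;                // over-subscribed: no such tree
        endings -= cost;
    }
    return true;                                         // endings never ran out, a tree exists
}

With the nineteen 7s it returns true, since 19 * 2^(7-7) = 19 <= 128, matching the count above.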

Quote:
I need this info to decode huffman compressed data,


I do think clarification is needed, because this is not a Huffman tree, and if it were, what would be the point of having 19 symbols all with code length 7? That is a saving of only one bit per character, as opposed to giving the most frequent symbols short code lengths.

The existence of a uniquely decodable code for any set of codeword lengths is governed by Kraft's inequality: lengths l_1, ..., l_n can be realized by a prefix code (a binary tree) if and only if 2^(-l_1) + 2^(-l_2) + ... + 2^(-l_n) <= 1. When all your codeword lengths are the same it's a pretty straightforward equation, but what makes it interesting is when your codeword lengths are potentially different.

There's no reason you need to check for any sort of validity though. The Huffman encoding process generates an optimal prefix code, so the existence of Huffman encoded data implies existence of such a code. All you need is the frequency table so you can regenerate the code.
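
As an illustration of that point, here is a sketch of one way to regenerate codeword lengths from a frequency table, using the usual merge-the-two-lightest-subtrees process. huffmanLengths is a made-up name, zero-frequency symbols simply get length 0, and a lone symbol is given a 1-bit code so it is still decodable.

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Rebuilds Huffman codeword lengths from a frequency table by repeatedly
// merging the two least frequent subtrees; every symbol in a merged subtree
// sinks one level deeper, so its code length grows by one.
std::vector<int> huffmanLengths(const std::vector<unsigned>& freq)
{
    struct Node { unsigned weight; std::vector<size_t> symbols; };
    auto cmp = [](const Node& a, const Node& b) { return a.weight > b.weight; };
    std::priority_queue<Node, std::vector<Node>, decltype(cmp)> heap(cmp);

    std::vector<int> length(freq.size(), 0);
    for (size_t s = 0; s < freq.size(); s++)
        if (freq[s] > 0) heap.push(Node{freq[s], {s}});

    if (heap.size() == 1) { length[heap.top().symbols[0]] = 1; return length; }

    while (heap.size() > 1)
    {
        Node a = heap.top(); heap.pop();
        Node b = heap.top(); heap.pop();
        for (size_t s : a.symbols) length[s]++;   // both subtrees go one level deeper
        for (size_t s : b.symbols) length[s]++;
        a.weight += b.weight;
        a.symbols.insert(a.symbols.end(), b.symbols.begin(), b.symbols.end());
        heap.push(std::move(a));
    }
    return length;
}

From those lengths the actual bit patterns can then be assigned canonically, which is exactly the lengths-only situation in the original question.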

There is exactly one codeword of length 0.
Two length X codewords use up one length X-1 codeword.

Equivalently:

One length X codeword takes up two length X+1 codewords.
There are 2^X codewords of length X.

Either of those two statements can be used to validate the existence of a Huffman tree with a certain set of codeword lengths.

The second one is probably easier, if you have arbitrary precision ints to work with, or you can bound your codewords to a certain max length.
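
For what it's worth, here is a sketch of how the first statement could be coded so that it never needs big integers: merge nodes bottom-up, two children into one parent per level. canBuildTree and countOfLength are made-up names, and like the counting post above it accepts incomplete trees with unused leaves.

#include <algorithm>
#include <vector>

// Bottom-up check: every pair of nodes at depth d needs one parent at depth d-1
// (an unpaired node still needs a parent, hence the rounding up). The lengths
// are valid if everything merges into at most one node at depth 0, the root.
bool canBuildTree(const std::vector<int>& lengths)
{
    if (lengths.empty()) return true;
    int maxLen = *std::max_element(lengths.begin(), lengths.end());

    std::vector<long long> countOfLength(maxLen + 1, 0);
    for (int len : lengths) countOfLength[len]++;

    long long nodes = 0;                    // nodes required at the current depth
    for (int depth = maxLen; depth > 0; depth--)
    {
        nodes += countOfLength[depth];      // leaves (codewords) at this depth
        nodes = (nodes + 1) / 2;            // parents needed one level up
    }
    nodes += countOfLength[0];              // a length-0 codeword occupies the root itself
    return nodes <= 1;                      // at most one node may claim the root
}

For the nineteen 7s this walks 19 -> 10 -> 5 -> 3 -> 2 -> 1 -> 1 -> 1, so everything merges into a single root and the tree exists.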

