Compression algorithms

Started by
6 comments, last by polly 20 years ago
Hello everyone, I was wondering if anyone had any suggestions when it comes to the use of compression algorithms. Here''s what I need to do: Given a list of usernames such as: bill1, bob3, franky4, jimbo89 I need to compress all of these usernames into some data type using a reasonably fast algorithm and store them in a column in a database table. I then need to be able to reteive the compressed usernames and turn them back into String usernames after retreiving the row from the database. I havent done any compression stuff before so this is new to me. Cheers Jon
Advertisement
I think a hash or hashmap will work well for what you want. You sound as though you want to pass small values around and then expand them only when you need to. With a hash, you can pass keys around and only dereference the keys when you need to. The keys can be quite small (a single number) and the values being referenced can be as big as you like.

Hash description; quite academic.
Perl Hash Howto.

Your database could then have each row as an array or list of hash keys. e.g.

@row1 = (1, 3, 99, 76);

To get a value out you would use: $column_hash{$row1[column]};

If you really really just want a quick compression algorithm for text, look up run-length encoding, z compression, or Huffman encoding. Also, this[1] book has many useful algorithms. It''s not cheap, but it is /very/ good.

[1] Introduction to Algorithms, Second Edition
by Thomas H. Cormen (Author), Charles E. Leiserson (Author), Ronald L. Rivest (Author), Clifford Stein (Author)
ISBN: 0262032937
My understanding of Hash algorithms may be a bit rusty, but.. doesnt Hashing work along the lines of providing a non-unique value (the Hash value) and associating it with some object (used to generate the value)?

So each of the 'buckets' in a hashtable corresponds to a NON-UNIQUE hash value, each of which can have >= 0 asscociated objects.

Im not sure this is appropriate to my problem, but maybe Ive missed something. Maybe I ought to explain my problem a bit better.

Say in my system I have a large number of documents and a large number of users. Users have the power (through a UI) to view the documents and "mark them of interest". Kind of like the way you can mark threads of interest in gamedev. For reasons I cant go into I cant store the "document interest" values with the User information (my preferred solution), instead I have to somehow mark each document as of interest to a given set of users by storing the information with the document.

So the schema for the document table would include a column which is simply a list of usernames.

Of course, taken literally this is a very inefficient way of doing things. The String which is a list of unique user ids could be very long indeed, and I wondered if there is a better way of doing things. I'm not sure how a Hash can be used in this situation, but I dont think it can.

Thanks for the book recommendation by the way.

Jon


EDIT: my ascii art didnt work!


[edited by - jonpolly99 on April 3, 2004 3:31:52 PM]
No, that''s not the schema you''ll want. Factor DocID + Username into another table. A hash-table, of course, is perfect here, if you want to look up by username.

"Sneftel is correct, if rather vulgar." --Flarelocke
Im sorry Sneftel, but I dont see how it can be used. If I have docid and a List of usernames in another table, where does the hashing come in?

[edited by - jonpolly99 on April 3, 2004 3:40:03 PM]
OK. After doing a bit of reading I can see where Im a bit unsure on all this.

Are you assuming that the hashing algorithm can produce a unique number for every unique username?

Jon
It doesn''t matter that much whether hashes are unique or not. Ideally they are, but hash tables can handle collisions just fine. As long as the hash function is "mostly" well distributed there will be no problems.

Think of a hash-table as a lookup table, from one key to zero or more values. If you have a hash-table keyed on a username and data-ed with the docIDs that they''re interested, that means you can hash the username, do a lookup, and retrieve all the docids that that user would be interested in. Then you can use those DocIDs in your other table to retrieve the actual document.

Depending on the size of your table, you actually might be better off with a B+Tree than a hashtable, but that''s a somewhat advanced concept if you want to do it right. All of this stuff is really Databases 101. I suggest you pick up a good book on database design; it''s important stuff to know, even if you aren''t going to be using a SQL database.

"Sneftel is correct, if rather vulgar." --Flarelocke
Just create a lookup table. Assign each new name an ID. To make it easy and convenient, make the ID equal to the index in the lookup table. Easy.
Brianmiserere nostri Domine miserere nostri

This topic is closed to new replies.

Advertisement