
Can I do this with ZLib?

This topic is 4419 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

I use ZLib in my engine, but only the compress and decompress functions, which grab a piece of data in memory and (de)compress it to another buffer. I'm now writing a micro database system, and each record in that database is pretty much a string.

Now, I don't want to compress on a per-string basis, because I don't think I would achieve any sort of decent compression that way. What I want to know is whether I can build a dictionary from the whole set of strings, then use that dictionary to compress each individual string, and save the compressed data per string.

Why? Because if I compress the whole lot of strings at once, zlib will produce a compressed buffer with all my strings jumbled together. That means I can't directly access string number 147 without decompressing the *whole* buffer. That's why I want to compress each string individually, but using the dictionary of the whole group to achieve higher compression... Did that make any sense?
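(For reference: zlib's deflate format does support exactly this kind of preset dictionary. The C API has deflateSetDictionary()/inflateSetDictionary(), and Python's zlib module exposes the same feature as the zdict argument. A minimal sketch of the scheme described above, with made-up sample records; building a good dictionary from real data takes more care than this:)

```python
import zlib

records = [b"the quick brown fox", b"the lazy dog", b"quick brown dogs"]

# Build a shared dictionary from all the strings. zlib matches against
# data near the *end* of the dictionary most cheaply, so in a real
# system the most common substrings should go last.
shared_dict = b" ".join(records)

def compress_record(data, zdict):
    # A fresh compressor per record, each primed with the same dictionary.
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_record(blob, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

packed = [compress_record(r, shared_dict) for r in records]
# Each record can now be decompressed on its own, without touching the others.
assert decompress_record(packed[1], shared_dict) == records[1]
```

Each record is an independent deflate stream, so record 147 can be pulled out directly; the shared dictionary is what recovers the cross-string redundancy.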

do you mean compressing each string separately and using the compressed buffer as a key to identify a single string in a huge compressed collection of strings?

If so, no. The compression algorithms don't care about string boundaries; instead they look at the frequency of each character in the collection and encode them accordingly

huffman compression, e.g.:

you sum up the frequency of each char

this gives you a sorted list of frequencies

now the first 2 elements, A and B, are combined into a tree node with frequency = freqA + freqB; the node is inserted into the sorted list and A and B are removed from it

repeat this till there's only one element in the list

now to encode each character you follow the branches of the tree top->down

left = 0; right = 1; this gives you a binary representation

very frequent characters will be encoded with fewer bits than infrequent ones, since there are fewer branches to follow until you reach the leaf where they are stored

-> so no, this shouldn't be possible
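(The construction described above can be sketched in a few lines; this uses a min-heap as the "sorted list", and the example input is made up:)

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Sum up the frequency of each char.
    freq = Counter(text)
    # Min-heap of (frequency, tiebreak, node); leaves are single chars.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    n = len(heap)
    # Repeatedly combine the two least frequent nodes until one remains.
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, n, (a, b)))
        n += 1
    # Walk the tree top->down: left = 0, right = 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("aaaabbc")
# 'a' is most frequent, so it gets the shortest code.
assert len(codes["a"]) <= len(codes["c"])
```

Note how the code for each character depends on the frequencies of all the others, which is exactly why a character's compressed form can't serve as a stable lookup key.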

but you could compress your strings in another way:
you have 26 characters in your alphabet, *2 for capital letters, which is 52 characters, so you only need 6 bits per character (2^6 is 64, leaving 12 slots for other characters)

this should reduce the amount of memory you need to 75%
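(A sketch of the 6-bit packing idea; the choice of what fills the 12 spare slots is an example, not a given:)

```python
# 52 letters plus 12 extras fill exactly 64 slots = 6 bits per character,
# i.e. 75% of the usual 8 bits per character.
ALPHABET = ("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            " .0123456789")  # 12 spare slots (example choice)
ENC = {ch: i for i, ch in enumerate(ALPHABET)}

def pack(s):
    # Concatenate 6-bit codes into one big integer, then emit bytes.
    bits = 0
    for ch in s:
        bits = (bits << 6) | ENC[ch]
    nbits = 6 * len(s)
    return len(s), bits.to_bytes((nbits + 7) // 8, "big")

def unpack(count, data):
    bits = int.from_bytes(data, "big")
    out = []
    for i in range(count):
        shift = 6 * (count - 1 - i)
        out.append(ALPHABET[(bits >> shift) & 0x3F])
    return "".join(out)

n, blob = pack("Hello World")
assert unpack(n, blob) == "Hello World"
assert len(blob) < len("Hello World")  # 11 chars fit in 9 bytes
```

The character count has to be stored alongside the bytes (or the string must be padded), since 6-bit codes don't align with byte boundaries.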

alternatively you can sort your strings from a->Z and compress all strings beginning with the same letter into a single buffer
this way each lookup only has to decompress roughly 1/n of the data, where n is the number of distinct first characters in use

yet another alternative:

Are your strings sentences? If so, you could represent each string as an indexed list of words: a 2-byte index gives 2^16 = 65536 possible words, which is quite a lot, so you just store 2 bytes per word plus one large table of words
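(A sketch of the word-table idea; a real table would hold up to 65536 words built once from the whole data set, and this tiny one is just for illustration:)

```python
import struct

WORDS = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
INDEX = {w: i for i, w in enumerate(WORDS)}

def encode(sentence):
    # Each word becomes a 2-byte little-endian index into the table.
    return b"".join(struct.pack("<H", INDEX[w]) for w in sentence.split())

def decode(blob):
    ids = struct.unpack("<%dH" % (len(blob) // 2), blob)
    return " ".join(WORDS[i] for i in ids)

blob = encode("the quick brown fox")
assert len(blob) == 8                 # 4 words * 2 bytes each
assert decode(blob) == "the quick brown fox"
```

This sketch assumes every word is in the table; handling unknown words needs an escape mechanism of some kind.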

Guest Anonymous Poster
If you want to do per-string compression, a better idea would probably be a huffman-based algorithm. If your strings are in a particular language and/or are of a particular type (ie titles, names, etc), you could even generate a dictionary from a large set of data a single time (during development) and hard-code that dictionary into the program. This way, there is no need to scan data at runtime to build a dictionary, but there should still generally be some space saving. The only problem is that most compression algorithms (including huffman) take advantage of bit-level access, and if your strings are too small it is likely that you'll end up saving less than a whole byte per string, which means you'll end up using the same space but more time.

An alternative might be to store the strings on a disk, and use a hash-like index that uses (for example) a CRC32 checksum modulo some value as a key, and that key is associated with the location (on disk) of the string.
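(A sketch of that hash-index idea, using zlib's crc32 modulo a bucket count; here the "disk" is just a flat byte buffer, and the bucket count is an arbitrary example value:)

```python
import zlib

NUM_BUCKETS = 64

def bucket_of(key):
    # CRC32 checksum modulo some value, as suggested above.
    return zlib.crc32(key) % NUM_BUCKETS

# buckets[k] holds (key, offset, length) triples; collisions share a bucket.
buckets = [[] for _ in range(NUM_BUCKETS)]
disk = bytearray()

def store(key, value):
    offset = len(disk)
    disk.extend(value)
    buckets[bucket_of(key)].append((key, offset, len(value)))

def lookup(key):
    # Only the (small) bucket is scanned, not the whole data set.
    for k, off, n in buckets[bucket_of(key)]:
        if k == key:
            return bytes(disk[off:off + n])
    return None

store(b"record147", b"some string data")
assert lookup(b"record147") == b"some string data"
assert lookup(b"missing") is None
```

The stored keys are needed to resolve collisions, since two different keys can hash to the same bucket.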

Do you need to be able to update strings or insert new ones in the middle of the database?
If not, then the standard approach is simply to compress the file in blocks (this is done internally by zlib anyway) and record the beginning of each block. That way you can access data compressed with any algorithm as a random access stream; some algorithms will suffer from having too-small blocks, however. It's almost transparent when wrapped up in a neat interface with a decent caching scheme behind it.
I've even seen hacks which use the MMU to make the whole compressed file look like a single continuous memory block, much like a memory mapped file really.

Of course this works best for sequential access patterns, so it's probably not such a good idea if you need to perform a binary search on the database, for instance.
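(The block scheme above can be sketched like this: compress fixed-size blocks independently and record where each compressed block starts, so a read only decompresses the block(s) it touches. The block size is an example value:)

```python
import zlib

BLOCK = 4096  # uncompressed bytes per block (example value)

def build(data):
    offsets, out = [], b""
    for i in range(0, len(data), BLOCK):
        offsets.append(len(out))          # record the beginning of each block
        out += zlib.compress(data[i:i + BLOCK])
    return out, offsets

def read(out, offsets, pos, n):
    # Find which block(s) [pos, pos+n) falls in, decompress only those.
    first = pos // BLOCK
    last = (pos + n - 1) // BLOCK
    chunk = b""
    for b in range(first, last + 1):
        end = offsets[b + 1] if b + 1 < len(offsets) else len(out)
        chunk += zlib.decompress(out[offsets[b]:end])
    start = pos - first * BLOCK
    return chunk[start:start + n]

data = bytes(range(256)) * 100            # 25600 bytes of sample data
out, offsets = build(data)
assert read(out, offsets, 10000, 50) == data[10000:10050]
```

The trade-off mentioned above is visible here: smaller blocks mean less wasted decompression per read but worse compression ratios, since each block is compressed with no context from its neighbours.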

I've thought about using zlib/bzlib2 to compress the keys in B+-tree nodes, but just haven't really wanted to write such a library again. I think it's a great idea and you should give it a shot.

Quote:
Original post by Anonymous Poster
[...]If you want to do per-string compression, a better idea would probably be a huffman-based algorithm.[...]if your strings are too small it is likely that you'll end up saving less than a whole byte per string, which means you'll end up using the same space but more time.[...]


you need to be careful with the huffman approach, since you might store more information for decoding than you save, which results in even more memory usage


I suggest you use the dictionary approach I mentioned above.

People like Goethe knew around 40000 words, so a 2-byte index should leave enough room for a lookup table; you just have to make sure you store real sentences and not concatenations of integers converted to strings

Quote:
Original post by Basiror
[...]you need to be careful with the huffman approach since you might store more information for decoding as you spare which results in even more memory usage[...]
Perhaps I didn't clearly state my suggestion: I am not suggesting that each string be huffman compressed independently. Instead, I'm saying compress them using the idea behind huffman with some modifications. Normally, with huffman you get the frequencies for each piece of data to be compressed and apply the algorithm to each piece using that information. I suggest using a single set of frequency information for all strings, which will work well if the strings are similar kinds of information. This way, you only have a single binary tree for any number of strings. If he had a small number of strings, he probably wouldn't need to worry about memory, so it is safe to assume he has many strings.

The problem with your approach is that many languages use more than 26 characters from the perspective of the computer. For example, there are (iirc) 8 kinds of accents in common use on each of the 5 common vowels, plus various unusual markings on several other characters. Since Prozak's location says Portugal, he might have to deal with more than 64 possible characters, or possibly even more than 128. Even if he restricted the strings to English letters, he'd probably need to include numbers and maybe a few punctuation marks, which easily require more than 6 bits per character. Even 'worse', he might be using Unicode to support internationalization (I find internationalization is a very common practice outside the US).

Quote:
Original post by Extrarius
[...]The problem with your approach is that many languages use more than 26 characters from the perspective of the computer.[...]Even 'worse', he might be using Unicode to support internationalization[...]


in other words, probability statistics; yes, that might be a solution.
however, you could still encode a string, no matter what language it comes from, with a dictionary of 2^16 entries, thus 2 bytes per token

you encode all tokens with len(token) >= 2 that match an entry in the 2^16-sized table; if a token doesn't match, you prefix it with a character combination that is unlikely to appear in the strings' language. that way you only need a table for the common tokens, and you skip small tokens of len(token) = 1, since there's no need to waste an extra byte on those
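(A sketch of that token scheme: tokens found in the table become a 2-byte index, and anything else is passed through behind an escape marker. The escape byte value and the tiny table are example choices:)

```python
import struct

WORDS = ["compression", "dictionary", "string", "buffer"]
INDEX = {w: i for i, w in enumerate(WORDS)}
ESC = 0xFF  # marker byte for a literal (out-of-table) token

def encode(tokens):
    out = b""
    for t in tokens:
        if len(t) >= 2 and t in INDEX:
            out += struct.pack("<BH", 0x00, INDEX[t])       # 0x00 + 2-byte index
        else:
            raw = t.encode()
            out += struct.pack("<BB", ESC, len(raw)) + raw  # escaped literal
    return out

def decode(blob):
    tokens, i = [], 0
    while i < len(blob):
        if blob[i] == ESC:                                  # literal token
            n = blob[i + 1]
            tokens.append(blob[i + 2:i + 2 + n].decode())
            i += 2 + n
        else:                                               # table index
            (idx,) = struct.unpack_from("<H", blob, i + 1)
            tokens.append(WORDS[idx])
            i += 3
    return tokens

enc = encode(["string", "x", "compression"])
assert decode(enc) == ["string", "x", "compression"]
```

Here the literal fallback also covers the 1-character tokens mentioned above; whether a dedicated escape *character sequence* (as suggested) beats a marker byte depends on how rare out-of-table tokens are.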


