Micro-compression algorithms


Here's some background: I'm coding a micro-database for Pocket PCs for my company. Yes, it has to be something done in-house. Basically, I chose a self-balancing AVL tree ([1] [2]), which stores the index fields of a particular table; each node in the tree then points to the rest of the fields of that record. All of this is stored in a file format designed so that, if I choose to, I don't need to load the AVL tree into memory at all just to search for a particular record. If I need to add or delete nodes I will have to load the tree, but I'm still working on that part.

So, one of the final things to take into consideration is the final size of the file. With the adoption of this new DB format the file size has grown a bit, so I'm looking for an easy-to-implement, fast, lossless compression algorithm. The compression algorithm would be applied per record, so that whenever I search for a certain item in the DB it goes something like this:

* Locate the Client record with Client ID = 11223344
* (Client record is found) Get the pointer from the node to the remaining fields (remember, the node only stores the index field, in this case the Client ID)
* Decompress the remaining fields
* We're ready to access the Client's data

So, I'm looking for micro-compression algorithms, something similar to the Tiny Encryption Algorithm (TEA), but in the compression world. Thanks for any tips on this... [wink]

I use ZLib in my 3D engine, and I guess there is a Pocket PC version available (I've seen one somewhere), but the main thing to bear in mind here is that we're compressing strings of 100 to 400 bytes at a time (depending on which database table we're working on).

This means that if I compress "The Brown fox", and later get another string "The Brown fox", the compressor will not remember that it has already seen that string, so we'll be far from an "optimal" compression scenario.

Also, I'm not storing a dictionary with the file, I'll just be storing the raw compressed data.

So I think ZLib's compression on a per-string basis would be very close to other off-the-shelf compression algorithms that work without a dictionary, and I'd rather not link against it unless I fail to find a smaller compression algorithm I can use (in other words, it would be trying to kill a fly with a shotgun).

As always, correct me if I'm wrong.
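
For what it's worth, if I did end up linking ZLib, the repeated-strings issue could be handled with a preset dictionary: both sides agree on a block of typical data up front, and that block is never written into the compressed output. A rough sketch of what that would look like (the function and its error handling are just illustrative):

#include <cstring>
#include <zlib.h> // assuming the Pocket PC port exposes the normal zlib API

// Compress one record using a preset dictionary shared by every record.
// The dictionary itself is not stored in the output; the decompressor must
// already have the same bytes.
int compressRecord(const unsigned char* record, unsigned recordLen,
                   const unsigned char* dict, unsigned dictLen,
                   unsigned char* out, unsigned outCap, unsigned* outLen)
{
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    if (deflateInit(&strm, Z_BEST_COMPRESSION) != Z_OK)
        return -1;
    deflateSetDictionary(&strm, dict, dictLen); // must be called before deflate()

    strm.next_in   = (Bytef*)record;
    strm.avail_in  = recordLen;
    strm.next_out  = out;
    strm.avail_out = outCap;
    int ret = deflate(&strm, Z_FINISH);         // whole record in one go
    *outLen = outCap - strm.avail_out;
    deflateEnd(&strm);
    return (ret == Z_STREAM_END) ? 0 : -1;
}
// On the read side, inflate() returns Z_NEED_DICT once; you then call
// inflateSetDictionary() with the same dictionary and continue inflating.

Still, each zlib stream carries a few bytes of header and checksum overhead, which matters at 100 to 400 bytes per record, so a smaller scheme is still attractive.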

Do you know anything about the data to compress in advance?
I'd probably go for a dictionary-based compression scheme combined with a precomputed Huffman tree.

Well, it's a database, so the user defines the various fields, and when they're defined each one gets a type (number, char string, etc.), so yes, I do know something about the data being entered...

You could implement a Huffman tree:

* Gather all the symbols you want to compress.
* Order them by frequency.
* Merge the two lowest-frequency entries into a new node with (first.count + second.count) as its frequency.
* Insert the new node back into the list (sort again, or insert it at the right position).
* Repeat until the list contains only one member: the root node.

Going down the tree, left branches are 0s and right branches are 1s. As you can see, symbols with a high frequency get shorter code words. Huffman coding is said to be the optimal compression method under the variable-word-length condition.

To emit a code you traverse the tree from the root down to the leaf; the branches taken on the way down are the code bits, first bit first. For example, 11110 means: right, right, right, right, left.

Since you're compressing strings of ~400 bytes you should get an average compression rate of roughly 20-30%. When I did this I counted the frequency of the hex digits (nibbles) of the bytes, so only 16 frequency entries had to be stored at the beginning of the compressed buffer.
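
To make the construction concrete, here is a rough standalone sketch (not the code I actually use, just an illustration) that derives the Huffman code lengths for the 16 nibble symbols from their frequencies; turning the lengths into actual codes and packing the bits is the easy part:

#include <queue>
#include <vector>
#include <utility>
#include <functional>

struct HuffNode { unsigned freq; int left, right, symbol; };

// codeLen[s] receives the Huffman code length of nibble s (0 if s never occurs).
void nibbleCodeLengths(const unsigned freq[16], unsigned codeLen[16])
{
    std::vector<HuffNode> nodes;
    typedef std::pair<unsigned, int> Entry; // (frequency, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > pq;

    for (int s = 0; s < 16; ++s) {
        codeLen[s] = 0;
        if (freq[s] == 0)
            continue;
        HuffNode leaf = { freq[s], -1, -1, s };
        nodes.push_back(leaf);
        pq.push(Entry(freq[s], (int)nodes.size() - 1));
    }
    // merge the two least frequent nodes until only the root is left
    while (pq.size() > 1) {
        Entry a = pq.top(); pq.pop();
        Entry b = pq.top(); pq.pop();
        HuffNode parent = { a.first + b.first, a.second, b.second, -1 };
        nodes.push_back(parent);
        pq.push(Entry(parent.freq, (int)nodes.size() - 1));
    }
    // the depth of a leaf is its code length; the root is the last node created
    std::vector<unsigned> depth(nodes.size(), 0);
    for (int i = (int)nodes.size() - 1; i >= 0; --i) {
        if (nodes[i].left >= 0)  depth[nodes[i].left]  = depth[i] + 1;
        if (nodes[i].right >= 0) depth[nodes[i].right] = depth[i] + 1;
        if (nodes[i].symbol >= 0)
            codeLen[nodes[i].symbol] = depth[i] > 0 ? depth[i] : 1; // one-symbol case
    }
}

If you then assign the codes canonically from those lengths, the decompressor can rebuild the exact same codes from just the 16 lengths, which is why the header stays so small.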

Now, in order to recognise strings you have already stored, you could also compute a CRC32 checksum, giving a layout like:

[crc32][huffman header][huffman-compressed buffer]


Here is some bit-packing code I wrote a while ago. It has worked fine for me so far; if you find a bug, please let me know, the email address is in the top comment of the file.

/*
* You may use this code for nonprofit projects.
* You have to include this header in any copy of that code.
*
* Copyright @ Eduard Aumüller 2005
* basiror@gmx.de
*/


/*
* bitpack just serves for fast and easy bitpacking
*/


#ifndef CROSS_BITPACK
#define CROSS_BITPACK

#include <algorithm> // std::fill
#include <iostream>  // std::cout in print()

// uint32 and byte8 come from elsewhere in my engine; minimal stand-ins so the
// header compiles on its own.  htonl() needs <winsock2.h> on Windows or
// <arpa/inet.h> on POSIX.
typedef unsigned int  uint32; // assumed to be 32 bit
typedef unsigned char byte8;

namespace cross
{
//set_masks[n] & value ensures only the intended low n bits can be set
const int set_masks[33] = {
0x00000000,
0x00000001, 0x00000003, 0x00000007, 0x0000000F,
0x0000001F, 0x0000003F, 0x0000007F, 0x000000FF,
0x000001FF, 0x000003FF, 0x000007FF, 0x00000FFF,
0x00001FFF, 0x00003FFF, 0x00007FFF, 0x0000FFFF,
0x0001FFFF, 0x0003FFFF, 0x0007FFFF, 0x000FFFFF,
0x001FFFFF, 0x003FFFFF, 0x007FFFFF, 0x00FFFFFF,
0x01FFFFFF, 0x03FFFFFF, 0x07FFFFFF, 0x0FFFFFFF,
0x1FFFFFFF, 0x3FFFFFFF, 0x7FFFFFFF, 0xFFFFFFFF
};
template<uint32 bits, bool networkbyteorder = false> class bitpack
{
public:
enum bitpack_m
{
MAX_ELEMENT_BITS = (sizeof(uint32)*8),
MAX_ARRAY_ELEMENTS = (((bits-1)/(sizeof(uint32)*8))+1),//(bits-1) avoids allocating an extra element when bits is an exact multiple of 32
BITPACK_ERROR = 0xFFFFFFFF
};
bitpack()
{
std::fill<uint32*,uint32>(&m_Array[0],&m_Array[MAX_ARRAY_ELEMENTS],0);
};
//maximum bit index
uint32 max_idx()
{
return static_cast<uint32>(bits -1);
};
//set all bits
void set()
{
std::fill<uint32*,uint32>(&m_Array[0],&m_Array[MAX_ARRAY_ELEMENTS],set_masks[32]);
};
//set bit at position idx to value
void set(uint32 idx, bool value=true)
{
set(idx,1,static_cast<uint32>(value));
};
//set n bits of value make sure no bit above (n-1)th is set
void set(uint32 idx,uint32 n, uint32 value)
{
uint32 i = idx / MAX_ELEMENT_BITS;
uint32 mod = idx % MAX_ELEMENT_BITS;
uint32 m = MAX_ELEMENT_BITS - mod;

if(i >= MAX_ARRAY_ELEMENTS)
return;

//only 32 bits available to set
if(n > MAX_ELEMENT_BITS)
return;

value&= set_masks[n];

//set bits at position idx
m_Array[i] &= ~(set_masks[n]<<mod);
m_Array[i] |= value<<mod;

if(networkbyteorder == true)
m_Arrayhtonl[i] = htonl(m_Array[i]);

//uint32 crossing m bits were written
if(m < n && (i+1 < MAX_ARRAY_ELEMENTS))
{
m_Array[i+1] &= ~(set_masks[n]>>m);
m_Array[i+1] |= value>>m;

if(networkbyteorder == true)
m_Arrayhtonl[i+1] = htonl(m_Array[i+1]);
}
};
//read n bits
uint32 read(uint32 idx, uint32 n)
{
uint32 i = idx / MAX_ELEMENT_BITS;
uint32 mod = idx% MAX_ELEMENT_BITS;
uint32 m = MAX_ELEMENT_BITS - mod;
uint32 value = 0;

if(i >= MAX_ARRAY_ELEMENTS)
return BITPACK_ERROR;

//we can only read 32 bits at once
if(n > MAX_ELEMENT_BITS)
return BITPACK_ERROR;

value = m_Array[i]>>mod;

//anything left to read? the next element holds the remaining bits, shifted left past the m bits we already have
if(m < n && (i+1 < MAX_ARRAY_ELEMENTS))
value |= m_Array[i+1]<<m;

//now mask all n bits
value&= set_masks[n];

return value;

};
//clear all bits
void clear()
{
std::fill<uint32*,uint32>(&m_Array[0],&m_Array[MAX_ARRAY_ELEMENTS],0);
};
//returns the data array, optionally in network byte order (only kept up to date when the networkbyteorder template flag is set)
const byte8* data(bool bNetByteOrder = false)
{
if(bNetByteOrder == false)
return reinterpret_cast<byte8*>(m_Array);

return reinterpret_cast<byte8*>(m_Arrayhtonl);
}
void print()
{
for(int i=0;i<MAX_ARRAY_ELEMENTS;i++)
{
std::cout<<i<<" "<<m_Array[i]<<std::endl;
}
};
private:
uint32 m_Array[MAX_ARRAY_ELEMENTS];
uint32 m_Arrayhtonl[MAX_ARRAY_ELEMENTS];
};
};

#endif /*CROSS_BITPACK*/
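
A quick usage example, assuming you save the header above as cross_bitpack.h:

#include "cross_bitpack.h"
#include <iostream>

int main()
{
    cross::bitpack<32> bits;            // 32 bits of storage
    bits.set(0, 5, 17);                 // a 5 bit value at bit position 0
    bits.set(5, 3, 6);                  // a 3 bit value right behind it
    std::cout << bits.read(0, 5) << " " // prints 17
              << bits.read(5, 3)        // prints 6
              << std::endl;
    return 0;
}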




I hope this helped

Quote:
You could implement a Huffman tree

Problem is that you need to store the tree, and it can become quite large, so for very small data this isn't feasible.
You could use a slower, dynamically updated Huffman tree (or a splay tree as a faster approximation).
The "problem" with Huffman, arithmetic coding, etc. is that you have to fiddle around at the bit level, which leads to poor performance (especially on small embedded devices).

Quote:
I do know something about the data being entered...

If you build a custom compressor based on that knowledge, it's probably going to beat a general-purpose compressor in both speed and compression ratio.
I've made MANY small compressors targeted at different data over the years.

I put together a small dictionary-based compressor working at byte level with NO memory footprint.
I haven't tried it that much, but it seems to work OK and does indeed compress data (well, most of the time ;).
Here are the results for a few small files in my Windows folder:

Source: C:\WINDOWS\ideinit.cfg
Compress: 146 to 115 bytes (78.77%)
Decompress: 115 to 146 bytes (126.96%)
Data verified, all ok!

Source: C:\WINDOWS\Blue Lace 16.bmp
Compress: 1272 to 436 bytes (34.28%)
Decompress: 436 to 1272 bytes (291.74%)
Data verified, all ok!

Source: C:\WINDOWS\win.ini
Compress: 1014 to 544 bytes (53.65%)
Decompress: 544 to 1014 bytes (186.40%)
Data verified, all ok!

Source: C:\WINDOWS\hh.exe
Compress: 10752 to 5052 bytes (46.99%)
Decompress: 5052 to 10752 bytes (212.83%)
Data verified, all ok!

Source: C:\WINDOWS\system.ini
Compress: 227 to 136 bytes (59.91%)
Decompress: 136 to 227 bytes (166.91%)
Data verified, all ok!

Source: C:\WINDOWS\desktop.ini
Compress: 2 to 3 bytes (150.00%)
Decompress: 3 to 2 bytes (66.67%)
Data verified, all ok!

Source: C:\WINDOWS\Runservice.exe
Compress: 2560 to 932 bytes (36.41%)
Decompress: 932 to 2560 bytes (274.68%)
Data verified, all ok!

Source: C:\WINDOWS\clock.avi
Compress: 82944 to 44303 bytes (53.41%)
Decompress: 44303 to 82944 bytes (187.22%)
Data verified, all ok!








The source code and test code are here:

#include <stdio.h>
#include <string>
#include <vector>

typedef unsigned int uint;
typedef unsigned char u8;

uint
compress(u8* const dest, const u8* const source, const uint sourceSize)
{
// Handle zero size compression
if (sourceSize == 0)
return 0;
// First byte is always a copy
dest[0] = source[0];
// Handle one byte compression
if (sourceSize == 1)
return 1;
// Setup control byte
uint controlOffset = 1;
uint controlMask = 1;
uint controlByte = 0;
// Init source and dest offsets
uint offset = 1;
uint destOffset = 2;
// Loop while there's source data
while (offset < sourceSize){
// Determine maximum search offset (depends on position in source data)
uint maxOffsetSearch = offset;
if (maxOffsetSearch > 32)
maxOffsetSearch = 32;
// Init best found so far
uint bestSearchOffset = 0;
uint bestSearchLen = 0;
// Search all possible combinations
uint searchOffset;
for (searchOffset = 1; searchOffset <= maxOffsetSearch; ++ searchOffset){
// Determine max search length (a match can't run past the end of the source data)
uint maxSearchLen = sourceSize - offset;
if (maxSearchLen > 9)
maxSearchLen = 9;
// Find out how long string we can match
uint searchLen;
for (searchLen = 0; searchLen < maxSearchLen; ++ searchLen){
// Break as soon as we find a mismatch
if (source[offset - searchOffset + searchLen] != source[offset + searchLen])
break;
}
// If this isn't the best match so far, continue searching
if (searchLen <= bestSearchLen)
continue;
// Is the best match so far, store this as the best found so far
bestSearchLen = searchLen;
bestSearchOffset = searchOffset;
// If this string is of the maximum length, there's no need to continue searching, break loop
if (searchLen == 9)
break;
}
// Test if it's more efficient to use a string copy
if (bestSearchLen > 1){
// Mark this byte in the control byte as a string copy
controlByte |= controlMask;
// Build string copy byte. 5 bits of offset [1, 32] and 3 bits of length [2, 9]
dest[destOffset] = u8(((bestSearchOffset - 1) << 3) | (bestSearchLen - 2));
offset += bestSearchLen;
}else {
// Copy the byte as is
dest[destOffset] = source[offset];
++ offset;
}
// Get next dest position
++ destOffset;
// Adjust control mask
controlMask += controlMask;
// If there's space in the control byte, continue compression
if (controlMask < 0x100)
continue;
// Store control byte
dest[controlOffset] = u8(controlByte);
// Reset control byte
controlByte = 0;
controlMask = 1;
controlOffset = destOffset;
// Allocate space for the mask
++ destOffset;
}
// Test if we need to write the control byte
if ((controlOffset + 1) == destOffset)
-- destOffset;
else
dest[controlOffset] = u8(controlByte);
return destOffset;
}

uint
decompress(u8* const dest, const u8* const source, const uint sourceSize)
{
// Handle zero size compression
if (sourceSize == 0)
return 0;
// First byte is always a copy
dest[0] = source[0];
// Handle one byte compression
if (sourceSize == 1)
return 1;
// Setup control byte
uint controlMask = 1;
uint controlByte = source[1];
// Init source and dest offsets
uint offset = 2;
uint destOffset = 1;
// Loop while there's source data
while (offset < sourceSize){
// Test if this is a byte or a string
if ((controlByte & controlMask) == 0){
// Copy byte
dest[destOffset] = source[offset];
// Step output offset
++ destOffset;
}else {
// Decode length and offset of string
uint copyOffset = source[offset];
uint copyLen = (copyOffset & 0x7) + 2;
copyOffset >>= 3;
copyOffset += 1;
// Copy string
uint o;
for (o = 0; o < copyLen; ++ o)
dest[destOffset + o] = dest[destOffset - copyOffset + o];
// Step output offset
destOffset += copyLen;
}
// Step source offset
++ offset;
// Adjust control mask
controlMask += controlMask;
// If there's space left in the control byte, continue decompression
if (controlMask < 0x100)
continue;
// Read control byte
controlByte= source[offset];
++ offset;
// Reset control mask
controlMask = 1;
}
return destOffset;
}


void compTest(const std::string& filename)
{
printf("\nSource: %s\n", filename.c_str());
// Read source file
FILE* f = fopen(filename.c_str(), "rb");
if (!f)
return;
fseek(f, 0, SEEK_END);
uint t = ftell(f);
if (t == 0){
fclose(f);
return;
}
fseek(f, 0, SEEK_SET);
std::vector<u8> source;
source.resize(t);
if (fread(&source[0], t, 1, f) != 1){
fclose(f);
return;
}
fclose(f);
// Allocate dest buffer
std::vector<u8> dest;
dest.resize(((t * 9 + 7) >> 3) + 1);
// Compress
uint c = compress(&dest[0], &source[0], t);
printf(" Compress: %d to %d bytes (%.2f%%)\n", t, c, float(c) / float(t) * 100.0f);
// Decompress
std::vector<u8> recomp;
recomp.resize(t + 8);
recomp[t + 0] = 0x12;
recomp[t + 1] = 0x34;
recomp[t + 2] = 0x56;
recomp[t + 3] = 0x78;
recomp[t + 4] = 0x9a;
recomp[t + 5] = 0xbc;
recomp[t + 6] = 0xde;
recomp[t + 7] = 0xf0;
uint k = decompress(&recomp[0], &dest[0], c);
printf(" Decompress: %d to %d bytes (%.2f%%)\n", c, k, float(k) / float(c) * 100.0f);
if (t != k){
printf("Error: Decompressed data size differs from source data size!\n");
return;
}
// Verify
uint i;
for (i = 0; i < k; ++ i)
if (recomp[i] != source[i])
break;
if (i != k){
printf("Error: Decompressed data is different from compressed data at offset: %d!\n", i);
return;
}
if ((recomp[t + 0] != 0x12) || (recomp[t + 1] != 0x34) || (recomp[t + 2] != 0x56)
|| (recomp[t + 3] != 0x78) || (recomp[t + 4] != 0x9a) || (recomp[t + 5] != 0xbc)
|| (recomp[t + 6] != 0xde) || (recomp[t + 7] != 0xf0)){
printf("Error: Decompression overwrites more than the %d source bytes!\n", k);
return;
}
//
printf(" Data verified, all ok!\n");
}

int
main()
{
compTest("C:\\WINDOWS\\ideinit.cfg");
compTest("C:\\WINDOWS\\Blue Lace 16.bmp");
compTest("C:\\WINDOWS\\win.ini");
compTest("C:\\WINDOWS\\hh.exe");
compTest("C:\\WINDOWS\\system.ini");
compTest("C:\\WINDOWS\\desktop.ini");
compTest("C:\\WINDOWS\\Runservice.exe");
compTest("C:\\WINDOWS\\clock.avi");
return 0;
}








I even managed to add some sparse comments to the source!
Keep in mind that the source is written for readability, not speed.

Edit: Forgot to mention that the worst case is a 9/8 compression ratio (plus one byte): every literal byte also costs a control bit, so eight literals take nine output bytes.

Quote:
Original post by eq
Quote:
You could implement a Huffman tree

Problem is that you need to store the tree, and it can become quite large, so for very small data this isn't feasible.
You could use a slower, dynamically updated Huffman tree (or a splay tree as a faster approximation).
The "problem" with Huffman, arithmetic coding, etc. is that you have to fiddle around at the bit level, which leads to poor performance (especially on small embedded devices).


The storage of the Huffman tree costs almost nothing if you encode hex digits (nibbles) as I described above. For 400 bytes you get 800 nibbles, and all you need to store are frequency values just small enough that they still produce the same Huffman tree. So you can shrink the nibble frequencies, e.g. divide them by a common factor, since all you ever do with them is add them together:

a*c + a*b = a*(c+b)

You can do this by examining the tree from top to bottom and checking which nodes you can scale down while the tree stays the same.

As for performance, that depends on what you want to do with the data. I guess this is just a lookup to print a decompressed string on the screen, and a decent decompression implementation will do that pretty quickly, even on a Pocket PC.

Quote:
The storage of the Huffman tree costs almost nothing if you encode hex digits (nibbles) as I described above

This of course all depends on your needs.
I'm not saying Huffman trees are impractical; I've used them many times.
If the OP is really serious about getting a good compression ratio, though, he/she needs to abandon byte-level compression and simplicity of code.

If that's the case, I'd go for a Burrows-Wheeler transform, followed by a move-to-front pass, and finally arithmetic coding on the result, but the code complexity of that is way beyond the simplicity of the referenced TEA implementation.
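
Just to show the middle step of that pipeline, the move-to-front pass on its own is tiny (the BWT and the arithmetic coder are the bulky parts); something along these lines:

#include <vector>
#include <cstddef>

// Replace each byte by its current rank in a table of the 256 byte values,
// then move that byte to the front of the table.  After a BWT this turns
// runs of similar bytes into lots of small values, which the entropy coder loves.
std::vector<unsigned char> moveToFront(const std::vector<unsigned char>& in)
{
    unsigned char table[256];
    for (int i = 0; i < 256; ++i)
        table[i] = (unsigned char)i;

    std::vector<unsigned char> out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = in[i];
        int rank = 0;
        while (table[rank] != c)       // find the current rank of c
            ++rank;
        out.push_back((unsigned char)rank);
        for (int j = rank; j > 0; --j) // move c to the front
            table[j] = table[j - 1];
        table[0] = c;
    }
    return out;
}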

Quote:
The compression algorithm would be applied per record

Unless your records are of a substantial size, you'll inflate your file by trying to compress each record separately. Compression works by removing repetition: if there's almost no repetition within a given sample of data, almost no gain can be made from compressing it, and if your records are small, the sample will be too small to compress very much. You'll also be working with a simple compression scheme which, unlike composite formats such as zip, won't fall back to storing the data uncompressed when there's nothing to be gained; instead it will inflate your record size. A worst case for LZ77-style compression, for example, inflates the data by about 12.5%.

In order to get substantial benefits from compression, you would need to compress sets of records together. That requires more processing power to compress and decompress the larger block, and more memory to hold the decompressed data. I think you'd probably find compression proves more of a hindrance than a help, but it depends on a lot of details we don't have:
-How many records do you expect to have?
-What is the nature of the data within them (e.g. mostly strings, lots of integers with small values)?
-How big do you expect them to be, in bytes?
-How large are the tables?
-What do you expect the overall filesize of the database to be?
-What are the specs of these units in terms of storage space, RAM, and processing power?
-Will users need to modify records (I'm guessing yes) or simply view them?

At any rate, if your records are going to be less than 80 bytes or so, I'd say you'll need to compress sets of records, perhaps entire tables, together in one block. That will significantly increase the processing power needed to access and modify those records. If storage space and CPU usage are both equally at a premium, I simply don't think your units are powerful enough.
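
To see the effect for yourself, a quick throwaway test along these lines (plain zlib compress2(), with made-up record contents) compresses a handful of tiny records one at a time and then as a single block; the per-record total will typically come out larger than the block, and often larger than the raw input:

#include <cstdio>
#include <cstring>
#include <vector>
#include <zlib.h> // assuming zlib is available just for the comparison

int main()
{
    const char* records[] = {
        "John Smith;11223344;Main Street 5",
        "Jane Smith;11223345;Main Street 5",
        "John Doe;11223346;Second Street 12",
    };
    const int n = sizeof(records) / sizeof(records[0]);

    uLong perRecordTotal = 0, inputTotal = 0;
    std::vector<Bytef> block;
    for (int i = 0; i < n; ++i) {
        uLong srcLen = (uLong)std::strlen(records[i]);
        inputTotal += srcLen;
        block.insert(block.end(), records[i], records[i] + srcLen);

        // compress this record on its own
        std::vector<Bytef> out(compressBound(srcLen));
        uLong outLen = (uLong)out.size();
        compress2(&out[0], &outLen, (const Bytef*)records[i], srcLen, Z_BEST_COMPRESSION);
        perRecordTotal += outLen;
    }

    // compress all records as one block
    std::vector<Bytef> out(compressBound((uLong)block.size()));
    uLong blockLen = (uLong)out.size();
    compress2(&out[0], &blockLen, &block[0], (uLong)block.size(), Z_BEST_COMPRESSION);

    std::printf("input: %lu, per record: %lu, one block: %lu\n",
                inputTotal, perRecordTotal, blockLen);
    return 0;
}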

I made a slightly more advanced implementation that predicts the number of bits to use for the length/offset split based on the previous byte.
It needs 256 bytes on the stack and is a bit slower.
The same files compress like this:


Source: C:\WINDOWS\ideinit.cfg
Compress: 146 to 115 bytes (78.77%)
Decompress: 115 to 146 bytes (126.96%)
Data verified, all ok!

Source: C:\WINDOWS\Blue Lace 16.bmp
Compress: 1272 to 428 bytes (33.65%)
Decompress: 428 to 1272 bytes (297.20%)
Data verified, all ok!

Source: C:\WINDOWS\win.ini
Compress: 1014 to 521 bytes (51.38%)
Decompress: 521 to 1014 bytes (194.63%)
Data verified, all ok!

Source: C:\WINDOWS\hh.exe
Compress: 10752 to 4634 bytes (43.10%)
Decompress: 4634 to 10752 bytes (232.02%)
Data verified, all ok!

Source: C:\WINDOWS\system.ini
Compress: 227 to 135 bytes (59.47%)
Decompress: 135 to 227 bytes (168.15%)
Data verified, all ok!

Source: C:\WINDOWS\desktop.ini
Compress: 2 to 3 bytes (150.00%)
Decompress: 3 to 2 bytes (66.67%)
Data verified, all ok!

Source: C:\WINDOWS\Runservice.exe
Compress: 2560 to 768 bytes (30.00%)
Decompress: 768 to 2560 bytes (333.33%)
Data verified, all ok!

Source: C:\WINDOWS\clock.avi
Compress: 82944 to 42413 bytes (51.13%)
Decompress: 42413 to 82944 bytes (195.56%)
Data verified, all ok!


Ping me if you're interested in the source.

