

Ordering Byte Pairs by Frequency In A File


11 replies to this topic

#1 Ectara   Crossbones+   -  Reputation: 2764


Posted 27 October 2012 - 03:00 AM

I have an arbitrary file, and for purposes of file compression, I have a table of 2-byte pairs that need to be ordered by frequency to determine which pairs the compression algorithm would benefit most from focusing on.

In layman's terms, I have up to 65536 distinct two-byte values, and with one pass through the file, I need to order the values by frequency of occurrence in the file with a minimal memory footprint. My previous method was an ugly one that went through the file multiple times and picked any pair that occurred more than three times. I would like a better method of choosing priority values, with a minimum amount of memory consumed.

The way I see it, I could consume six to ten bytes, and read through the file up to 65536 times, or I could consume 262144 to 524288 bytes and count every occurrence in one pass.

Unless there is another way. Can anyone think of a way to sort the numbers 0 - 65535 by order of frequency in an arbitrary file?


#2 Washu   Senior Moderators   -  Reputation: 4464


Posted 27 October 2012 - 03:15 AM

Having a simple array of counts (262144 bytes, assuming you use a 32-bit integer) and then running std::sort over it to find the largest values would probably be your best bet. It only requires a single pass to compute while using practically nothing in terms of memory. Seriously, 256 KB is nothing on today's computers; heck, it took more RAM than that just to start your process.

In time the project grows, the ignorance of its devs it shows, with many a convoluted function, it plunges into deep compunction, the price of failure is high, Washu's mirth is nigh.
ScapeCode - Blog | SlimDX
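
For reference, here's a minimal sketch of the count-then-sort approach Washu describes, assuming C++11, a binary std::istream, and non-overlapping pairs (the function name rank_pairs_by_frequency and the big-endian pairing convention are my own assumptions, not from the thread):

#include <algorithm>
#include <cstdint>
#include <istream>
#include <utility>
#include <vector>

// One pass to count every 16-bit pair, then sort (pair, count) records by
// descending count. The counter array is 65536 * 4 bytes = 256 KB.
std::vector<std::pair<std::uint16_t, std::uint32_t>>
rank_pairs_by_frequency(std::istream& in)
{
    std::vector<std::uint32_t> counts(65536, 0);
    char bytes[2];
    while (in.read(bytes, 2))
    {
        std::uint16_t pair = (static_cast<std::uint8_t>(bytes[0]) << 8)
                           | static_cast<std::uint8_t>(bytes[1]);
        ++counts[pair];
    }

    std::vector<std::pair<std::uint16_t, std::uint32_t>> ranked;
    ranked.reserve(65536);
    for (std::uint32_t v = 0; v < 65536; ++v)
        ranked.emplace_back(static_cast<std::uint16_t>(v), counts[v]);

    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::uint16_t, std::uint32_t>& a,
                 const std::pair<std::uint16_t, std::uint32_t>& b)
              { return a.second > b.second; });
    return ranked;
}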


#3 slayemin   Members   -  Reputation: 2042


Posted 27 October 2012 - 05:24 AM

Check out Huffman Coding.

Eric Nevala

Currently a self-employed indie game dev


#4 Ectara   Crossbones+   -  Reputation: 2764


Posted 27 October 2012 - 10:33 AM

Having a simple array of counts (262144 bytes, assuming you use a 32-bit integer) and then running std::sort over it to find the largest values would probably be your best bet. It only requires a single pass to compute while using practically nothing in terms of memory. Seriously, 256 KB is nothing on today's computers; heck, it took more RAM than that just to start your process.

On today's computers, perhaps, but on yesterday's phone, like mine with 23 MB of usable RAM (much of it taken up by other applications), it would take a noticeable amount of memory just to run a simple compression algorithm. Let alone my small embedded devices with 4 MB and 16 MB of RAM.

Check out Huffman Coding.

I already have a Huffman compression implementation, but counting single bytes is trivial compared to counting byte pairs.

#5 Ectara   Crossbones+   -  Reputation: 2764


Posted 29 October 2012 - 03:07 PM

Does anyone else have any ideas?

#6 Ryan_001   Prime Members   -  Reputation: 1218


Posted 30 October 2012 - 06:52 AM

If memory must be kept to an absolute minimum, create a series of vectors: the first vector contains every entry found once, the second every entry found twice, the third three times, etc. As you read the file, search the vectors for the entry and move it up one vector when it's found. That way each entry only uses 16 bits plus a little extra (depending on how much padding each vector has). Even in the worst case you're looking at 4 bytes per entry (which is what Washu's version would have used), but most likely you'll be closer to 3 bytes per entry, and rarely would you find every 16-bit combination in a file. It'd probably take around 32 KB-64 KB for an average file.

You could pack it even tighter by using one vector instead of N. Just keep pointers/iterators to the category divisions, so entries 0-20 would be pairs found once, 21-30 pairs found twice, etc. You'd have to adjust the pointers/iterators as you moved items around in the table. It'd take a bit of work, but it could work.

Of course, it'd be a whole lot slower...

Edited by Ryan_001, 30 October 2012 - 06:57 AM.
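
A rough sketch of the bucket idea above (the struct and its names are mine, and this is the deliberately naive linear-search version, trading CPU time for the smaller footprint): buckets[i] holds every pair value seen exactly i+1 times, and an incoming pair is promoted one bucket when found.

#include <cstddef>
#include <cstdint>
#include <vector>

struct FrequencyBuckets
{
    // buckets[i] holds every pair value seen exactly i+1 times so far.
    std::vector<std::vector<std::uint16_t>> buckets;

    void add(std::uint16_t pair)
    {
        // Search the existing buckets for the pair, highest count first.
        for (std::size_t i = buckets.size(); i-- > 0; )
        {
            std::vector<std::uint16_t>& bucket = buckets[i];
            for (std::size_t j = 0; j < bucket.size(); ++j)
            {
                if (bucket[j] == pair)
                {
                    // Remove it here and promote it to the next bucket up.
                    bucket[j] = bucket.back();
                    bucket.pop_back();
                    if (i + 1 == buckets.size())
                        buckets.emplace_back();
                    buckets[i + 1].push_back(pair);
                    return;
                }
            }
        }
        // First occurrence: it goes in the "seen once" bucket.
        if (buckets.empty())
            buckets.emplace_back();
        buckets[0].push_back(pair);
    }
};

Reading the most frequent pairs back out afterwards is just a walk over the buckets from the highest index down.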


#7 Erik Rufelt   Crossbones+   -  Reputation: 3031


Posted 30 October 2012 - 07:36 AM

Write the counts to file, only keeping a few in RAM at a time.

#8 Cagnazzo   Members   -  Reputation: 140


Posted 30 October 2012 - 06:46 PM

Yeah, if you can make a file in storage, things will be a lot easier.

You can make a 64 KB bytemap and count up to 255 occurrences of each pair there; when a counter is about to overflow, make a 64 KB temp file, increment the corresponding byte in it, and reset the one in memory to zero.

Unless a pair comes up more than 65536 times you won't need to worry (and if one does, you can create a second file in storage). After you're done reading your input, you just go through the files in storage in descending order of significance: the highest value there is the highest overall, with ties broken by the next less significant file if there's more than one, or by the in-memory counters as the base case.
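
If a scratch file is acceptable, the scheme above might look roughly like this (overflow_path and the function name are placeholders of mine, and error handling is mostly omitted): 64 KB of 8-bit counters in memory hold the low byte of each count, and the matching byte in the temp file holds the high byte.

#include <cstdint>
#include <cstdio>
#include <istream>
#include <vector>

void count_with_overflow_file(std::istream& in, const char* overflow_path)
{
    std::vector<std::uint8_t> low(65536, 0);       // low 8 bits of each count

    // Pre-fill the overflow file with 65536 zero bytes (high 8 bits).
    std::FILE* overflow = std::fopen(overflow_path, "w+b");
    if (!overflow)
        return;
    std::vector<std::uint8_t> zeros(65536, 0);
    std::fwrite(zeros.data(), 1, zeros.size(), overflow);

    char bytes[2];
    while (in.read(bytes, 2))
    {
        std::uint16_t pair = (static_cast<std::uint8_t>(bytes[0]) << 8)
                           | static_cast<std::uint8_t>(bytes[1]);
        if (++low[pair] == 0)                      // counter wrapped 255 -> 0
        {
            std::uint8_t high;
            std::fseek(overflow, pair, SEEK_SET);
            std::fread(&high, 1, 1, overflow);
            ++high;
            std::fseek(overflow, pair, SEEK_SET);
            std::fwrite(&high, 1, 1, overflow);
        }
    }
    std::fclose(overflow);
    // The full count of pair p is then (byte at offset p in the file) * 256 + low[p].
}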

#9 Yrjö P.   Crossbones+   -  Reputation: 1412


Posted 31 October 2012 - 10:02 AM

On today's computers, perhaps, but on yesterday's phone, like mine with 23 MB of usable RAM (much of it taken up by other applications), it would take a noticeable amount of memory just to run a simple compression algorithm. Let alone my small embedded devices with 4 MB and 16 MB of RAM.

What is the absolute minimum amount of memory you want this to run in?
Do you need the fully ordered list of all 65536 byte-pair frequencies, or just the K most frequent byte pairs? If the latter, how large can K get?
What is the absolute maximum size of the files being scanned?

#10 Ectara   Crossbones+   -  Reputation: 2764


Posted 31 October 2012 - 01:09 PM

If memory must be kept to an absolute minimum, create a series of vectors: the first vector contains every entry found once, the second every entry found twice, the third three times, etc. As you read the file, search the vectors for the entry and move it up one vector when it's found. That way each entry only uses 16 bits plus a little extra (depending on how much padding each vector has). Even in the worst case you're looking at 4 bytes per entry (which is what Washu's version would have used), but most likely you'll be closer to 3 bytes per entry, and rarely would you find every 16-bit combination in a file. It'd probably take around 32 KB-64 KB for an average file.

This one is interesting, as it doesn't keep track of pairs that are never found. However, there's always the chance that they might all exist. Also, the overhead incurred by the bookkeeping for an unknown number of vectors sounds like it will add up.

Write the counts to file, only keeping a few in RAM at a time.

I'd like to avoid using any other files; optimally this would run even on a read-only file system, just compressing/decompressing and writing to a pipe or sending data over the network.

What is the absolute minimum amount of memory you want this to run in?
Do you need the fully ordered list of all 65536 byte-pair frequencies, or just the K most frequent byte pairs? If the latter, how large can K get?
What is the absolute maximum size of the files being scanned?

All good questions.

1) Uh... I guess I'd have a minimum of 0 bytes for the algorithm, because I don't have a lower bound on memory usage; I'd like it to be as small as possible, but I accept that it will consume _some_ memory. The maximum memory taken by the algorithm would be 256 KB, the most I'd hope to consume. If compiling on a machine with a larger register size pushes it above that threshold, I'd like to avoid that method.

Thinking about it again, this might be a question of the minimum amount of memory the process has available to it. Enough to get the job done, really. The devices with very little memory only run one task at a time, and after the compression, that memory is reclaimed, so as long as one doesn't plan to exhaust all available memory (since this is in a library, not a standalone application), it should be fine.

2) A very good question. At most, I will need up to the top 255 most frequent pairs. If K is 256 or 0, the data cannot be compressed. K is likely to be lower for larger sets of data.

3) The absolute maximum is the largest file size the platform supports. The routine accepts my custom file stream type, so the source and destination files can be anything that can be read from or written to in a linear fashion (and buffered as needed, if it isn't seekable).

#11 Yrjö P.   Crossbones+   -  Reputation: 1412


Posted 31 October 2012 - 07:13 PM

When enough memory is not available to count the whole thing in one pass, I'd probably maintain the K highest-frequency values together with their frequencies, take as much extra memory as I can get (M bytes), and then run 65536*4/M + 1 passes over the file, with each pass counting the frequencies of M/4 different byte-pair values. Between passes, the K value-frequency pairs have to be updated if the last block of values produced any higher frequencies.
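
A sketch of how that might be wired up, assuming a seekable stream (the function name, the rewind between passes, and the simple sort-and-truncate merge are my own choices; a real version would merge the top-K list more cheaply):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <utility>
#include <vector>

std::vector<std::pair<std::uint16_t, std::uint32_t>>
top_k_pairs(std::istream& in, std::size_t k, std::size_t counter_budget_bytes)
{
    const std::uint32_t total_values = 65536;
    std::uint32_t block = static_cast<std::uint32_t>(counter_budget_bytes / 4);
    if (block == 0) block = 1;
    if (block > total_values) block = total_values;

    std::vector<std::uint32_t> counts(block);
    std::vector<std::pair<std::uint16_t, std::uint32_t>> best;   // running top K

    // One pass per block of byte-pair values; only values in [lo, hi) are counted.
    for (std::uint32_t lo = 0; lo < total_values; lo += block)
    {
        std::fill(counts.begin(), counts.end(), 0);
        const std::uint32_t hi = std::min(lo + block, total_values);

        in.clear();
        in.seekg(0);                                 // rewind for the next pass
        char bytes[2];
        while (in.read(bytes, 2))
        {
            std::uint16_t pair = (static_cast<std::uint8_t>(bytes[0]) << 8)
                               | static_cast<std::uint8_t>(bytes[1]);
            if (pair >= lo && pair < hi)
                ++counts[pair - lo];
        }

        // Merge this block into the running top-K list.
        for (std::uint32_t v = lo; v < hi; ++v)
            best.emplace_back(static_cast<std::uint16_t>(v), counts[v - lo]);
        std::sort(best.begin(), best.end(),
                  [](const std::pair<std::uint16_t, std::uint32_t>& a,
                     const std::pair<std::uint16_t, std::uint32_t>& b)
                  { return a.second > b.second; });
        if (best.size() > k)
            best.resize(k);
    }
    return best;
}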

#12 Ectara   Crossbones+   -  Reputation: 2764


Posted 03 November 2012 - 10:06 AM

When enough memory is not available to count the whole thing in one pass, I'd probably maintain the K highest-frequency values together with their frequencies, take as much extra memory as I can get (M bytes), and then run 65536*4/M + 1 passes over the file, with each pass counting the frequencies of M/4 different byte-pair values. Between passes, the K value-frequency pairs have to be updated if the last block of values produced any higher frequencies.

So, in layman's terms: for however much memory I can hold, count as many different pairs as I can per pass, and keep track of the highest ones between passes.

I think I could make that work, scaling how much reading it has to do based on how much memory it has available. This scalable approach to the obvious solution seems like my best bet so far, but I'm open to any other suggestions.



