Ordering Byte Pairs by Frequency In A File

Ectara · 2012-11-03T16:06:15

I have an arbitrary file, and for the purposes of file compression, I have a table of 2-byte pairs that needs to be ordered by frequency, to determine which pairs the compression algorithm would benefit most from focusing on. In layman's terms, there are up to 65536 distinct two-byte values, and I need to order them by how often they occur in the file, ideally in one pass and with a minimal memory footprint.

My previous method was an ugly one: it went through the file multiple times and picked any pair that occurred more than three times. I would like a better way of choosing priority values that consumes as little memory as possible.

The way I see it, I could use six to ten bytes and read through the file up to 65536 times, or I could use 262144 to 524288 bytes (a 4- or 8-byte counter for each of the 65536 values) and count every occurrence in one pass. Unless there is another way. Can anyone think of a way to sort the numbers 0 - 65535 by frequency of occurrence in an arbitrary file?
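For reference, the one-pass option described above can be sketched roughly as follows. This is only a minimal Python sketch, not code from the original project: the names `count_pairs` and `pairs_by_frequency` are made up for illustration, and it assumes that pairs overlap (every adjacent pair of bytes is counted) and that the data is already in memory as a bytes object.

```python
from array import array

def count_pairs(data: bytes) -> array:
    """Count every overlapping 2-byte pair in a single pass.

    Assumption: array('I') uses 4-byte items (typical, not guaranteed),
    so the table takes about 65536 * 4 = 262144 bytes.
    """
    counts = array("I", [0]) * 65536
    for i in range(len(data) - 1):
        # Combine two adjacent bytes into one value in 0..65535.
        counts[(data[i] << 8) | data[i + 1]] += 1
    return counts

def pairs_by_frequency(counts) -> list:
    """Return the pair values that occur at least once, most frequent first.

    Ties are broken by the smaller pair value (an arbitrary choice).
    """
    seen = [v for v in range(65536) if counts[v]]
    return sorted(seen, key=lambda v: (-counts[v], v))

counts = count_pairs(b"abababcd")
ranking = pairs_by_frequency(counts)
```

Here `ranking` starts with the value for "ab" (0x6162), then "ba" (0x6261). For a real file, the same loop would run over chunks read from disk, carrying the last byte of each chunk over to the next one.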

Started by Ectara October 27, 2012 09:00 AM

10 comments, last by Ectara 11 years, 5 months ago

Stroppy Katamari

1,416

November 01, 2012 01:13 AM

When there isn't enough memory to count everything in one pass, I'd probably try maintaining the K highest-frequency values together with their frequencies, take as much extra memory as I can get (M bytes), and then run about 65536*4/M passes over the file (rounded up, assuming 4-byte counters), with each pass counting the frequencies of M/4 different byte pair values. Between passes, the K value-frequency pairs have to be updated if the last block of values contained any higher frequencies.
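A rough sketch of that multi-pass idea, in Python for brevity. Everything named here (`top_pairs_multipass`, `k`, `budget_bytes`) is hypothetical, and the sketch assumes 4-byte counters, overlapping pairs, and data held in a bytes object; with a real file, each iteration of the outer loop would re-read the file from the start.

```python
import heapq
from array import array

def top_pairs_multipass(data: bytes, k: int, budget_bytes: int):
    """Find the k most frequent 2-byte pairs using about budget_bytes of counters.

    Returns (list of (pair_value, count), most frequent first; number of passes).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Number of pair values counted per pass, assuming 4 bytes per counter.
    block = max(1, min(65536, budget_bytes // 4))
    heap = []    # min-heap of (count, -value), never more than k entries
    passes = 0
    for base in range(0, 65536, block):
        counts = array("I", [0]) * block
        for i in range(len(data) - 1):        # one full pass over the data
            idx = ((data[i] << 8) | data[i + 1]) - base
            if 0 <= idx < block:              # only count this pass's block
                counts[idx] += 1
        passes += 1
        # Fold this block's results into the running top k.
        for idx, c in enumerate(counts):
            if c == 0:
                continue
            entry = (c, -(base + idx))        # ties favor the smaller value
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heappushpop(heap, entry)
    ranked = sorted(heap, reverse=True)
    return [(-neg, c) for c, neg in ranked], passes

top, passes = top_pairs_multipass(b"abababcd", k=2, budget_bytes=4096)
```

With a 4096-byte budget the sketch counts 1024 pair values per pass and makes 64 passes; with 262144 bytes or more it collapses to a single pass. The heap of K entries is the only state carried between passes.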

Ectara

3,097

Author

November 03, 2012 04:06 PM

So, in layman's terms, for however much memory I can hold, count as many different pairs as I can, and keep track of the highest ones between passes.

I think I could make that work, scaling how much reading it has to do based on how much memory is available to it. This scalable take on the obvious solution seems like my best bet so far, but I'm open to any other suggestions.
